Some latency-intensive workloads see an obvious performance drop when
running inside a VM. The main reason is that the overhead is amplified
when running inside a VM, and the largest cost I have seen is in the
idle path.

This patch series introduces a new mechanism to poll for a while before
entering the real idle state. If a reschedule is needed during the poll,
we avoid going through the heavy overhead path.
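In rough pseudo-C, the guest-side idea is the loop sketched below
(illustrative only; the real implementation is the kvm_idle_poll()
patch later in this series):

	/* poll for up to poll_duration_ns before entering real idle */
	stop = ktime_add_ns(ktime_get(), poll_duration_ns);
	do {
		if (need_resched())
			break;	/* work arrived, skip the heavy halt path */
	} while (ktime_before(ktime_get(), stop));
	/* only if nothing became runnable do we fall through to halt */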
Here is the data we get when running the contextswitch benchmark to
measure latency (lower is better):
1. w/o patch:
2493.14 ns/ctxsw -- 200.3 %CPU
2. w/ patch:
halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
3. kvm dynamic poll
halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
4. idle=poll
2050.1 ns/ctxsw -- 1003 %CPU
5. idle=mwait
2188.06 ns/ctxsw -- 206.3 %CPU
Here is the data we get when running the netperf benchmark:
1. w/o patch:
14556.8 bits/s -- 144.2 %CPU
2. w/ patch:
halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
3. kvm dynamic poll
halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
4. idle=poll in guest
18441.3 bits/s -- 1003 %CPU
5. idle=mwait in guest
15760.6 bits/s -- 157.6 %CPU
V1 -> V2:
- integrate the smart halt poll into the paravirt code
- use idle_stamp instead of check_poll
- since it is hard to tell whether the vCPU is the only task on its
  pCPU, we do not consider that case in this series (it may be improved
  in the future)
Yang Zhang (7):
x86/paravirt: Add pv_idle_ops to paravirt ops
KVM guest: register kvm_idle_poll for pv_idle_ops
sched/idle: Add poll before enter real idle path
x86/paravirt: Add update in x86/paravirt pv_idle_ops
Documentation: Add three sysctls for smart idle poll
KVM guest: introduce smart idle poll algorithm
sched/idle: update poll time when wakeup from idle
Documentation/sysctl/kernel.txt | 25 +++++++++++++
arch/x86/include/asm/paravirt.h | 10 ++++++
arch/x86/include/asm/paravirt_types.h | 7 ++++
arch/x86/kernel/kvm.c | 67 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/paravirt.c | 11 ++++++
arch/x86/kernel/process.c | 7 ++++
include/linux/kernel.h | 6 ++++
include/linux/sched/idle.h | 4 +++
kernel/sched/core.c | 4 +++
kernel/sched/idle.c | 9 +++++
kernel/sysctl.c | 23 ++++++++++++
11 files changed, 173 insertions(+)
--
1.8.3.1
On Tue, Aug 29, 2017 at 10:02:15PM +0800, Wanpeng Li wrote:
> > Here is the data we get when running benchmark netperf:
> >
> > 2. w/ patch:
> > halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
> > halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
> > halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
> > halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
> > halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
> >
> > 3. kvm dynamic poll
> > halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
> > halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
> > halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
> > halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
> > halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
> >
>
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and the duplicate adaptive halt-polling logic as what has
> already been done in kvm w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enable both the
"real workload like netperf"?
That is not a real workload. That is a synthetic one.
> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remembered that David from
> Google mentioned that Windows Event Objects can get 2x latency
> improvement in KVM FORUM, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if the windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enable the
> polling in the linux guest. Anyway, if the patchset is finally
More CPU power consumption sounds like a bad idea, does it not?
> acceptable by maintainer, I will introduce the generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.
>
> Regards,
> Wanpeng Li
On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
> Some latency-intensive workloads see an obvious performance
> drop when running inside a VM.
But are we trading a lot of CPU for a bit of lower latency?
> The main reason is that the overhead
> is amplified when running inside a VM, and the largest cost I have
> seen is in the idle path.
>
> This patch introduces a new mechanism to poll for a while before
> entering the idle state. If a reschedule is needed during the poll,
> we avoid going through the heavy overhead path.
Isn't it the job of an idle driver to find the best way to
halt the CPU?
It looks like just by adding a cstate we can make it
halt at higher latencies only. And at lower latencies,
if it's doing a good job we can hopefully use mwait to
stop the CPU.
In fact I have been experimenting with exactly that.
Some initial results are encouraging but I could use help
with testing and especially tuning. If you can help
pls let me know!
Patch below is not intended for upstream - it's just
the fastest way I found to test things.
So it just uses command line arguments to configure the guest; the
right thing would be a combination of ACPI and CPUID, but let's
decide whether we need this first.
RFC dontmerge PATCH intel_idle: add pv cstates when running on kvm
Usage:
kvm_pv_mwait - enables the feature. Note: you must have a recent
host that allows guests to execute mwait without an exit,
otherwise you will just get 100% CPU.
kvm_halt_target_residency - halt above this target residency.
Should probably be a function of the cost of
halt+wakeup.
kvm_halt_native - set to 0 if your VCPU does not match host.
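For example, a guest could be booted with something like the following
on its kernel command line (values are only illustrative; the parameter
names are the module_param()s added by the patch below):

	intel_idle.kvm_pv_mwait=1 intel_idle.kvm_halt_target_residency=400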
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index c2ae819..6fa58ad 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -65,8 +65,10 @@
#include <asm/intel-family.h>
#include <asm/mwait.h>
#include <asm/msr.h>
+#include <linux/kvm_para.h>
#define INTEL_IDLE_VERSION "0.4.1"
+#define PREFIX "intel_idle: "
static struct cpuidle_driver intel_idle_driver = {
.name = "intel_idle",
@@ -94,6 +96,7 @@ struct idle_cpu {
};
static const struct idle_cpu *icpu;
+static struct idle_cpu icpus;
static struct cpuidle_device __percpu *intel_idle_cpuidle_devices;
static int intel_idle(struct cpuidle_device *dev,
struct cpuidle_driver *drv, int index);
@@ -119,6 +122,49 @@ static struct cpuidle_state *cpuidle_state_table;
#define flg2MWAIT(flags) (((flags) >> 24) & 0xFF)
#define MWAIT2flg(eax) ((eax & 0xFF) << 24)
+static int intel_halt(struct cpuidle_device *dev,
+ struct cpuidle_driver *drv, int index)
+{
+ printk_once(KERN_ERR "safe_halt started\n");
+ safe_halt();
+ printk_once(KERN_ERR "safe_halt done\n");
+ return index;
+}
+
+static int kvm_halt_target_residency = 400; /* Halt above this target residency */
+module_param(kvm_halt_target_residency, int, 0444);
+static int kvm_halt_native = 1; /* Use native mwait substates */
+module_param(kvm_halt_native, int, 0444);
+static int kvm_pv_mwait = 0; /* Whether to do mwait within KVM */
+module_param(kvm_pv_mwait, int, 0444);
+
+static struct cpuidle_state kvm_halt_cstate = {
+ .name = "HALT-KVM",
+ .desc = "HALT",
+ .flags = MWAIT2flg(0x10),
+ .exit_latency = 0,
+ .target_residency = 0,
+ .enter = &intel_halt,
+};
+
+static struct cpuidle_state kvm_cstates[] = {
+ {
+ .name = "C1-NHM",
+ .desc = "MWAIT 0x00",
+ .flags = MWAIT2flg(0x00),
+ .exit_latency = 3,
+ .target_residency = 6,
+ .enter = &intel_idle,
+ .enter_freeze = intel_idle_freeze, },
+ {
+ .name = "HALT-KVM",
+ .desc = "HALT",
+ .flags = MWAIT2flg(0x10),
+ .exit_latency = 30,
+ .target_residency = 399,
+ .enter = &intel_halt, }
+};
+
/*
* States are indexed by the cstate number,
* which is also the index into the MWAIT hint array.
@@ -927,8 +973,11 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev,
if (!(lapic_timer_reliable_states & (1 << (cstate))))
tick_broadcast_enter();
+ printk_once(KERN_ERR "mwait_idle_with_hints started\n");
mwait_idle_with_hints(eax, ecx);
+ printk_once(KERN_ERR "mwait_idle_with_hints done\n");
+
if (!(lapic_timer_reliable_states & (1 << (cstate))))
tick_broadcast_exit();
@@ -989,6 +1038,10 @@ static const struct idle_cpu idle_cpu_tangier = {
.state_table = tangier_cstates,
};
+static const struct idle_cpu idle_cpu_kvm = {
+ .state_table = kvm_cstates,
+};
+
static const struct idle_cpu idle_cpu_lincroft = {
.state_table = atom_cstates,
.auto_demotion_disable_flags = ATM_LNC_C6_AUTO_DEMOTE,
@@ -1061,7 +1115,7 @@ static const struct idle_cpu idle_cpu_dnv = {
};
#define ICPU(model, cpu) \
- { X86_VENDOR_INTEL, 6, model, X86_FEATURE_MWAIT, (unsigned long)&cpu }
+ { X86_VENDOR_INTEL, 6, model, X86_FEATURE_ANY, (unsigned long)&cpu }
static const struct x86_cpu_id intel_idle_ids[] __initconst = {
ICPU(INTEL_FAM6_NEHALEM_EP, idle_cpu_nehalem),
@@ -1125,19 +1180,39 @@ static int __init intel_idle_probe(void)
return -ENODEV;
}
- if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
- return -ENODEV;
+ icpus = *(struct idle_cpu *)id->driver_data;
+
+ if (kvm_pv_mwait) {
+
+ if (!kvm_halt_native)
+ icpus = idle_cpu_kvm;
+
+ pr_debug(PREFIX "MWAIT enabled by KVM\n");
+ mwait_substates = 0x1;
+ /*
+ * these MSRs do not work on kvm maybe they should?
+ * more likely we need to poke at CPUID before using MSRs
+ */
+ icpus.auto_demotion_disable_flags = 0;
+ icpus.disable_promotion_to_c1e = 0;
+ } else {
+ if (!cpu_has(&boot_cpu_data, X86_FEATURE_MWAIT))
+ return -ENODEV;
+
+ if (boot_cpu_data.cpuid_level < CPUID_MWAIT_LEAF)
+ return -ENODEV;
- cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
+ cpuid(CPUID_MWAIT_LEAF, &eax, &ebx, &ecx, &mwait_substates);
- if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
- !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
- !mwait_substates)
+ if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||
+ !(ecx & CPUID5_ECX_INTERRUPT_BREAK) ||
+ !mwait_substates)
return -ENODEV;
- pr_debug("MWAIT substates: 0x%x\n", mwait_substates);
+ pr_debug(PREFIX "MWAIT substates: 0x%x\n", mwait_substates);
+ }
- icpu = (const struct idle_cpu *)id->driver_data;
+ icpu = &icpus;
cpuidle_state_table = icpu->state_table;
pr_debug("v" INTEL_IDLE_VERSION " model 0x%X\n",
@@ -1340,6 +1415,11 @@ static void __init intel_idle_cpuidle_driver_init(void)
(cpuidle_state_table[cstate].enter_freeze == NULL))
break;
+ if (kvm_pv_mwait &&
+ cpuidle_state_table[cstate].target_residency >=
+ kvm_halt_target_residency)
+ break;
+
if (cstate + 1 > max_cstate) {
pr_info("max_cstate %d reached\n", max_cstate);
break;
@@ -1353,7 +1433,7 @@ static void __init intel_idle_cpuidle_driver_init(void)
& MWAIT_SUBSTATE_MASK;
/* if NO sub-states for this state in CPUID, skip it */
- if (num_substates == 0)
+ if (num_substates == 0 && !kvm_pv_mwait)
continue;
/* if state marked as disabled, skip it */
@@ -1375,6 +1455,20 @@ static void __init intel_idle_cpuidle_driver_init(void)
drv->state_count += 1;
}
+ if (kvm_halt_native && kvm_pv_mwait) {
+ drv->states[drv->state_count] = /* structure copy */
+ kvm_halt_cstate;
+ drv->states[drv->state_count].exit_latency =
+ drv->state_count > 1 ?
+ drv->states[drv->state_count - 1].exit_latency + 1 : 1;
+ drv->states[drv->state_count].target_residency =
+ kvm_halt_target_residency;
+
+ drv->state_count += 1;
+ }
+
+ printk(KERN_ERR "detected states: %d\n\n", drv->state_count);
+
if (icpu->byt_auto_demotion_disable_flag) {
wrmsrl(MSR_CC6_DEMOTION_POLICY_CONFIG, 0);
wrmsrl(MSR_MC6_DEMOTION_POLICY_CONFIG, 0);
@@ -1452,7 +1546,8 @@ static int __init intel_idle_init(void)
goto init_driver_fail;
}
- if (boot_cpu_has(X86_FEATURE_ARAT)) /* Always Reliable APIC Timer */
+ if (boot_cpu_has(X86_FEATURE_ARAT) || /* Always Reliable APIC Timer */
+ kvm_pv_mwait)
lapic_timer_reliable_states = LAPIC_TIMER_ALWAYS_RELIABLE;
retval = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "idle/intel:online",
Yang Zhang <[email protected]> writes:
> Some latency-intensive workloads see an obvious performance
> drop when running inside a VM. The main reason is that the overhead
> is amplified when running inside a VM, and the largest cost I have
> seen is in the idle path.
You could test with https://lkml.org/lkml/2017/7/9/204
which optimizes these idle paths with a fast path.
If that works it would be a far better solution than
explicit tunables and basically giving up by polling.
-Andi
> Here is the data we get when running benchmark netperf:
>
> 2. w/ patch:
> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>
> 3. kvm dynamic poll
> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>
Actually I'm not sure how much sense it makes to introduce this pv
stuff and duplicate the adaptive halt-polling logic that has already
been done in kvm, without an obvious benefit for a real workload like
netperf. In addition, as you mentioned offline to me, enabling both
this patchset and the adaptive halt-polling logic in kvm simultaneously
can result in more cpu power consumption. I remember that David from
Google mentioned at KVM Forum that Windows Event Objects can get a 2x
latency improvement, which means that the adaptive halt-polling in kvm
should be enabled by default. So if Windows guests and Linux guests are
mixed on the same host, then this patchset will result in more cpu
power consumption if the customer enables the polling in the Linux
guest. Anyway, if the patchset is finally acceptable to the maintainer,
I will introduce a generic adaptive halt-polling framework in kvm to
avoid the duplicate logic.
Regards,
Wanpeng Li
On 08/29/2017 01:46 PM, Yang Zhang wrote:
> Some latency-intensive workloads see an obvious performance
> drop when running inside a VM. The main reason is that the overhead
> is amplified when running inside a VM, and the largest cost I have
> seen is in the idle path.
>
> This patch introduces a new mechanism to poll for a while before
> entering the idle state. If a reschedule is needed during the poll,
> we avoid going through the heavy overhead path.
>
> Here is the data we get when running benchmark contextswitch to measure
> the latency(lower is better):
>
> 1. w/o patch:
> 2493.14 ns/ctxsw -- 200.3 %CPU
>
> 2. w/ patch:
> halt_poll_threshold=10000 -- 1485.96ns/ctxsw -- 201.0 %CPU
> halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
> halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
> halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
>
> 3. kvm dynamic poll
> halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
> halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
> halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
> halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
>
> 4. idle=poll
> 2050.1 ns/ctxsw -- 1003 %CPU
>
> 5. idle=mwait
> 2188.06 ns/ctxsw -- 206.3 %CPU
Could you please try to create another metric for guest-initiated,
host-aborted mwait?
For a quick benchmark, reserve 4 registers for a magic value, set them
to the magic value before you enter MWAIT in the guest. Then allow
native MWAIT execution on the host. If you see the guest wants to enter
with the 4 registers containing the magic contents and no events are
pending, directly go into the vcpu block function on the host.
That way any time a guest gets naturally aborted while in mwait, it will
only reenter mwait when an event actually occurred. While the guest is
normally running (and nobody else wants to run on the host), we just
stay in guest context, but with a sleeping CPU.
Overall, that might give us even better performance, as it allows for
turbo boost and HT to work properly.
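A rough sketch of the guest side of this quick benchmark might look
like the following (names such as PV_MWAIT_MAGIC, pv_mwait_tagged and
regs_hold_magic, as well as the register choice, are hypothetical and
untested; monitor setup is elided):

	/* guest: tag four registers with a magic value right before MWAIT */
	#define PV_MWAIT_MAGIC	0x4d57414954ULL		/* arbitrary marker */

	static inline void pv_mwait_tagged(unsigned long eax, unsigned long ecx)
	{
		register unsigned long r12 asm("r12") = PV_MWAIT_MAGIC;
		register unsigned long r13 asm("r13") = PV_MWAIT_MAGIC;
		register unsigned long r14 asm("r14") = PV_MWAIT_MAGIC;
		register unsigned long r15 asm("r15") = PV_MWAIT_MAGIC;

		asm volatile("mwait"
			     : /* no outputs */
			     : "a" (eax), "c" (ecx),
			       "r" (r12), "r" (r13), "r" (r14), "r" (r15)
			     : "memory");
	}

	/*
	 * host (KVM): when a guest that was in MWAIT takes an exit, check
	 * whether the four registers still hold the magic value and no
	 * events are pending, and if so block the vCPU, e.g. roughly:
	 *
	 *	if (regs_hold_magic(vcpu) && !kvm_arch_vcpu_runnable(vcpu))
	 *		kvm_vcpu_block(vcpu);
	 */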
Alex
On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
> In ttwu_do_wakeup, it will update avg_idle when wakeup from idle. Here
> we just reuse this logic to update the poll time. It may be a little
> late to update the poll in ttwu_do_wakeup, but the test result shows no
> obvious performance gap compare with updating poll in irq handler.
>
> one problem is that idle_stamp only used when using CFS scheduler. But
> it is ok since it is the default policy for scheduler and only consider
> it should enough.
>
> Signed-off-by: Yang Zhang <[email protected]>
> Signed-off-by: Quan Xu <[email protected]>
Same broken SoB chain, and not a useful word on why you need to adjust
this crap to begin with. What you want that poll duration to be related
to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
to be idle.
So no.
On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> Add poll in do_idle. For UP VM, if there are running task, it will not
> goes into idle path, so we only enable poll in SMP VM.
>
> Signed-off-by: Yang Zhang <[email protected]>
> Signed-off-by: Quan Xu <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Peter Zijlstra <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Kyle Huey <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: [email protected]
> ---
> arch/x86/kernel/process.c | 7 +++++++
> kernel/sched/idle.c | 2 ++
> 2 files changed, 9 insertions(+)
>
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 3ca1980..def4113 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
> x86_idle();
> }
>
> +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
> +void arch_cpu_idle_poll(void)
> +{
> + paravirt_idle_poll();
> +}
> +#endif
So this will get called on any system which has CONFIG_PARAVIRT enabled
*even* if they're not running any guests.
Huh?
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> Add poll in do_idle. For UP VM, if there are running task, it will not
> goes into idle path, so we only enable poll in SMP VM.
>
> Signed-off-by: Yang Zhang <[email protected]>
> Signed-off-by: Quan Xu <[email protected]>
Broken SoB chain.
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 6c23e30..b374744 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
> }
>
> /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
> void __weak arch_cpu_idle_prepare(void) { }
> void __weak arch_cpu_idle_enter(void) { }
And not a word on why we need a new arch hook. What's wrong with
arch_cpu_idle_enter() for instance?
On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in
> idle path which will polling for a while before we enter the real idle
> state.
>
> In virtualization, idle path includes several heavy operations
> includes timer access(LAPIC timer or TSC deadline timer) which will hurt
> performance especially for latency intensive workload like message
> passing task. The cost is mainly come from the vmexit which is a
> hardware context switch between VM and hypervisor. Our solution is to
> poll for a while and do not enter real idle path if we can get the
> schedule event during polling.
>
> Poll may cause the CPU waste so we adopt a smart polling mechanism to
> reduce the useless poll.
>
> Signed-off-by: Yang Zhang <[email protected]>
> Signed-off-by: Quan Xu <[email protected]>
> Cc: Jeremy Fitzhardinge <[email protected]>
> Cc: Chris Wright <[email protected]>
> Cc: Alok Kataria <[email protected]>
> Cc: Rusty Russell <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Pan Xinhui <[email protected]>
> Cc: Kees Cook <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
Adding xen-devel.
Juergen, we really should replace Jeremy's name with xen-devel or
your name.. Wasn't there a patch by you that took over some of the
maintainership of it?
> ---
> arch/x86/include/asm/paravirt.h | 5 +++++
> arch/x86/include/asm/paravirt_types.h | 6 ++++++
> arch/x86/kernel/paravirt.c | 6 ++++++
> 3 files changed, 17 insertions(+)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index 9ccac19..6d46760 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
>
> #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
>
> +static inline void paravirt_idle_poll(void)
> +{
> + PVOP_VCALL0(pv_idle_ops.poll);
> +}
> +
> static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
> {
> PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 9ffc36b..cf45726 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -324,6 +324,10 @@ struct pv_lock_ops {
> struct paravirt_callee_save vcpu_is_preempted;
> } __no_randomize_layout;
>
> +struct pv_idle_ops {
> + void (*poll)(void);
> +} __no_randomize_layout;
> +
> /* This contains all the paravirt structures: we get a convenient
> * number for each function using the offset which we use to indicate
> * what to patch. */
> @@ -334,6 +338,7 @@ struct paravirt_patch_template {
> struct pv_irq_ops pv_irq_ops;
> struct pv_mmu_ops pv_mmu_ops;
> struct pv_lock_ops pv_lock_ops;
> + struct pv_idle_ops pv_idle_ops;
> } __no_randomize_layout;
>
> extern struct pv_info pv_info;
> @@ -343,6 +348,7 @@ struct paravirt_patch_template {
> extern struct pv_irq_ops pv_irq_ops;
> extern struct pv_mmu_ops pv_mmu_ops;
> extern struct pv_lock_ops pv_lock_ops;
> +extern struct pv_idle_ops pv_idle_ops;
>
> #define PARAVIRT_PATCH(x) \
> (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index bc0a849..1b5b247 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> .pv_lock_ops = pv_lock_ops,
> #endif
> + .pv_idle_ops = pv_idle_ops,
> };
> return *((void **)&tmpl + type);
> }
> @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
> .steal_clock = native_steal_clock,
> };
>
> +struct pv_idle_ops pv_idle_ops = {
> + .poll = paravirt_nop,
> +};
> +
> __visible struct pv_irq_ops pv_irq_ops = {
> .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
> .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
> EXPORT_SYMBOL (pv_mmu_ops);
> EXPORT_SYMBOL_GPL(pv_info);
> EXPORT_SYMBOL (pv_irq_ops);
> +EXPORT_SYMBOL (pv_idle_ops);
> --
> 1.8.3.1
>
On Tue, Aug 29, 2017 at 10:02:15PM +0800, Wanpeng Li wrote:
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and the duplicate adaptive halt-polling logic as what has
> already been done in kvm w/o obvious benefit for real workload like
> netperf.
In fact, I would really like to better understand why the polling
in kvm even helps. Switching to the idle task is supposed to be really
cheap as you are not losing context. In case of e.g. network polling
you gain the interrupt latency, but in case of kvm it's just an IPI
which is converted to a memory write when using mwait. Is mwait more
costly than commonly thought? Or is the idle driver too aggressive in
putting the CPU into deep sleep?
I think this analysis is something that would benefit
bare-metal/containers as well.
--
MST
So far, pv_idle_ops.poll is the only op in pv_idle. .poll is called in
the idle path and polls for a while before we enter the real idle
state.

In virtualization, the idle path includes several heavy operations,
including timer access (LAPIC timer or TSC deadline timer), which hurt
performance especially for latency-intensive workloads such as message
passing tasks. The cost mainly comes from the VM exit, which is a
hardware context switch between the VM and the hypervisor. Our solution
is to poll for a while and not enter the real idle path if we get a
schedule event during polling.

Polling may waste CPU, so we adopt a smart polling mechanism to reduce
useless polling.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: Alok Kataria <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Pan Xinhui <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/include/asm/paravirt.h | 5 +++++
arch/x86/include/asm/paravirt_types.h | 6 ++++++
arch/x86/kernel/paravirt.c | 6 ++++++
3 files changed, 17 insertions(+)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9ccac19..6d46760 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
#define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
+static inline void paravirt_idle_poll(void)
+{
+ PVOP_VCALL0(pv_idle_ops.poll);
+}
+
static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
{
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 9ffc36b..cf45726 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -324,6 +324,10 @@ struct pv_lock_ops {
struct paravirt_callee_save vcpu_is_preempted;
} __no_randomize_layout;
+struct pv_idle_ops {
+ void (*poll)(void);
+} __no_randomize_layout;
+
/* This contains all the paravirt structures: we get a convenient
* number for each function using the offset which we use to indicate
* what to patch. */
@@ -334,6 +338,7 @@ struct paravirt_patch_template {
struct pv_irq_ops pv_irq_ops;
struct pv_mmu_ops pv_mmu_ops;
struct pv_lock_ops pv_lock_ops;
+ struct pv_idle_ops pv_idle_ops;
} __no_randomize_layout;
extern struct pv_info pv_info;
@@ -343,6 +348,7 @@ struct paravirt_patch_template {
extern struct pv_irq_ops pv_irq_ops;
extern struct pv_mmu_ops pv_mmu_ops;
extern struct pv_lock_ops pv_lock_ops;
+extern struct pv_idle_ops pv_idle_ops;
#define PARAVIRT_PATCH(x) \
(offsetof(struct paravirt_patch_template, x) / sizeof(void *))
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index bc0a849..1b5b247 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
#ifdef CONFIG_PARAVIRT_SPINLOCKS
.pv_lock_ops = pv_lock_ops,
#endif
+ .pv_idle_ops = pv_idle_ops,
};
return *((void **)&tmpl + type);
}
@@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
.steal_clock = native_steal_clock,
};
+struct pv_idle_ops pv_idle_ops = {
+ .poll = paravirt_nop,
+};
+
__visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
@@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
EXPORT_SYMBOL (pv_mmu_ops);
EXPORT_SYMBOL_GPL(pv_info);
EXPORT_SYMBOL (pv_irq_ops);
+EXPORT_SYMBOL (pv_idle_ops);
--
1.8.3.1
ttwu_do_wakeup() updates avg_idle when waking up from idle. Here we
just reuse this logic to update the poll time. It may be a little late
to update the poll duration in ttwu_do_wakeup(), but the test results
show no obvious performance gap compared with updating it in the irq
handler.

One problem is that idle_stamp is only used by the CFS scheduler. But
that should be acceptable, since CFS is the default scheduling policy
and considering only it should be enough.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: [email protected]
---
include/linux/sched/idle.h | 4 ++++
kernel/sched/core.c | 4 ++++
kernel/sched/idle.c | 7 +++++++
3 files changed, 15 insertions(+)
diff --git a/include/linux/sched/idle.h b/include/linux/sched/idle.h
index 5ca63eb..6e0554d 100644
--- a/include/linux/sched/idle.h
+++ b/include/linux/sched/idle.h
@@ -12,6 +12,10 @@ enum cpu_idle_type {
extern void wake_up_if_idle(int cpu);
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+extern void update_poll_duration(unsigned long idle_duration);
+#endif
+
/*
* Idle thread specific functions to determine the need_resched
* polling state.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0869b20..25be9a3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1678,6 +1678,10 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
if (rq->avg_idle > max)
rq->avg_idle = max;
+#if defined(CONFIG_PARAVIRT)
+ update_poll_duration(rq->avg_idle);
+#endif
+
rq->idle_stamp = 0;
}
#endif
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index b374744..7eb8559 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -101,6 +101,13 @@ void __cpuidle default_idle_call(void)
}
}
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void update_poll_duration(unsigned long idle_duration)
+{
+ paravirt_idle_update_poll_duration(idle_duration);
+}
+#endif
+
static int call_cpuidle(struct cpuidle_driver *drv, struct cpuidle_device *dev,
int next_state)
{
--
1.8.3.1
Although the smart idle poll has nothing to do with paravirt per se, it
cannot bring any benefit on native hardware, so we only enable it when
Linux runs as a KVM guest (it can be extended to other hypervisors like
Xen, Hyper-V and VMware).

Introduce the per-CPU variable poll_duration_ns to control the max poll
time.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/kvm.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index d04e30e..7d84a02 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -75,6 +75,7 @@ static int parse_no_kvmclock_vsyscall(char *arg)
early_param("no-kvmclock-vsyscall", parse_no_kvmclock_vsyscall);
+static DEFINE_PER_CPU(unsigned long, poll_duration_ns);
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
static int has_steal_clock = 0;
@@ -357,6 +358,29 @@ static void kvm_guest_cpu_init(void)
kvm_register_steal_time();
}
+static void kvm_idle_poll(void)
+{
+ unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+ ktime_t start, cur, stop;
+
+ start = cur = ktime_get();
+ stop = ktime_add_ns(ktime_get(), poll_duration);
+
+ do {
+ if (need_resched())
+ break;
+ cur = ktime_get();
+ } while (ktime_before(cur, stop));
+}
+
+static void kvm_guest_idle_init(void)
+{
+ if (!kvm_para_available())
+ return;
+
+ pv_idle_ops.poll = kvm_idle_poll;
+}
+
static void kvm_pv_disable_apf(void)
{
if (!__this_cpu_read(apf_reason.enabled))
@@ -492,6 +516,8 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif
+ kvm_guest_idle_init();
+
/*
* Hard lockup detection is enabled by default. Disable it, as guests
* can get false positives too easily, for example if the host is
--
1.8.3.1
To reduce the cost of polling, we introduce three sysctls to control
the poll time.
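For example, polling can then be enabled at runtime in the guest via
the sysctl entries registered in kernel/sysctl.c below (the values here
are only illustrative):

	# poll for up to 200us before entering the real idle path
	echo 200000 > /proc/sys/kernel/halt_poll_threshold
	# keep the default grow/shrink factors
	echo 2 > /proc/sys/kernel/halt_poll_grow
	echo 2 > /proc/sys/kernel/halt_poll_shrink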
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: Alok Kataria <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: "Luis R. Rodriguez" <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Mauro Carvalho Chehab <[email protected]>
Cc: Krzysztof Kozlowski <[email protected]>
Cc: Josh Poimboeuf <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Jessica Yu <[email protected]>
Cc: Larry Finger <[email protected]>
Cc: zijun_hu <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Ian Abbott <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/sysctl/kernel.txt | 25 +++++++++++++++++++++++++
arch/x86/kernel/paravirt.c | 4 ++++
include/linux/kernel.h | 6 ++++++
kernel/sysctl.c | 23 +++++++++++++++++++++++
4 files changed, 58 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c1..67447b8 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
- perf_event_max_stack
- perf_event_max_contexts_per_stack
- pid_max
+- poll_grow [ X86 only ]
+- poll_shrink [ X86 only ]
+- poll_threshold_ns [ X86 only ]
- powersave-nap [ PPC only ]
- printk
- printk_delay
@@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
==============================================================
+poll_grow: (X86 only)
+
+This parameter is multiplied in the grow_poll_ns() to increase the poll time.
+By default, the values is 2.
+
+==============================================================
+poll_shrink: (X86 only)
+
+This parameter is divided in the shrink_poll_ns() to reduce the poll time.
+By default, the values is 2.
+
+==============================================================
+poll_threshold_ns: (X86 only)
+
+This parameter controls the max poll time before entering real idle path.
+This parameter is expected to take effect only when running inside a VM.
+It would make no sense to turn on it in bare mental.
+By default, the values is 0 means don't poll. It is recommended to change
+the value to non-zero if running latency-bound workloads inside VM.
+
+==============================================================
+
powersave-nap: (PPC only)
If set, Linux-PPC will use the 'nap' mode of powersaving,
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index a11b2c2..0b92f8f 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
.update = paravirt_nop,
};
+unsigned long poll_threshold_ns;
+unsigned int poll_shrink = 2;
+unsigned int poll_grow = 2;
+
__visible struct pv_irq_ops pv_irq_ops = {
.save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
.restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index bd6d96c..6cb2820 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -462,6 +462,12 @@ extern __scanf(2, 0)
extern bool crash_kexec_post_notifiers;
+#ifdef CONFIG_PARAVIRT
+extern unsigned long poll_threshold_ns;
+extern unsigned int poll_shrink;
+extern unsigned int poll_grow;
+#endif
+
/*
* panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
* holds a CPU number which is executing panic() currently. A value of
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..9b86a8f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.extra2 = &one,
},
#endif
+#ifdef CONFIG_PARAVIRT
+ {
+ .procname = "halt_poll_threshold",
+ .data = &poll_threshold_ns,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "halt_poll_grow",
+ .data = &poll_grow,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "halt_poll_shrink",
+ .data = &poll_shrink,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif
{ }
};
--
1.8.3.1
Add a poll before entering the real idle path in do_idle(). For a UP
VM, if there is a running task it will not go into the idle path, so we
only enable the poll in SMP VMs.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Kyle Huey <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Len Brown <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/process.c | 7 +++++++
kernel/sched/idle.c | 2 ++
2 files changed, 9 insertions(+)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 3ca1980..def4113 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -332,6 +332,13 @@ void arch_cpu_idle(void)
x86_idle();
}
+#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
+void arch_cpu_idle_poll(void)
+{
+ paravirt_idle_poll();
+}
+#endif
+
/*
* We use this if we don't have any better idle routine..
*/
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 6c23e30..b374744 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
}
/* Weak implementations for optional arch specific functions */
+void __weak arch_cpu_idle_poll(void) { }
void __weak arch_cpu_idle_prepare(void) { }
void __weak arch_cpu_idle_enter(void) { }
void __weak arch_cpu_idle_exit(void) { }
@@ -219,6 +220,7 @@ static void do_idle(void)
*/
__current_set_polling();
+ arch_cpu_idle_poll();
quiet_vmstat();
tick_nohz_idle_enter();
--
1.8.3.1
.update is used to adjust the next poll time.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
Cc: Chris Wright <[email protected]>
Cc: Alok Kataria <[email protected]>
Cc: Rusty Russell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Pan Xinhui <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/include/asm/paravirt.h | 5 +++++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/paravirt.c | 1 +
3 files changed, 7 insertions(+)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6d46760..32e1c06 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -207,6 +207,11 @@ static inline void paravirt_idle_poll(void)
PVOP_VCALL0(pv_idle_ops.poll);
}
+static inline void paravirt_idle_update_poll_duration(unsigned long duration)
+{
+ PVOP_VCALL1(pv_idle_ops.update, duration);
+}
+
static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
{
PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index cf45726..3b4f95a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -326,6 +326,7 @@ struct pv_lock_ops {
struct pv_idle_ops {
void (*poll)(void);
+ void (*update)(unsigned long);
} __no_randomize_layout;
/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 1b5b247..a11b2c2 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -315,6 +315,7 @@ struct pv_time_ops pv_time_ops = {
struct pv_idle_ops pv_idle_ops = {
.poll = paravirt_nop,
+ .update = paravirt_nop,
};
__visible struct pv_irq_ops pv_irq_ops = {
--
1.8.3.1
Use the smart idle poll to reduce useless polling when the system is idle.
Signed-off-by: Yang Zhang <[email protected]>
Signed-off-by: Quan Xu <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
arch/x86/kernel/kvm.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 7d84a02..ffc8307 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -373,12 +373,53 @@ static void kvm_idle_poll(void)
} while (ktime_before(cur, stop));
}
+static unsigned int grow_poll_ns(unsigned int old, unsigned int grow,
+ unsigned int max)
+{
+ unsigned int val;
+
+ /* set base poll time to 10000ns */
+ if (old == 0 && grow)
+ return 10000;
+
+ val = old * grow;
+ if (val > max)
+ val = max;
+
+ return val;
+}
+
+static unsigned int shrink_poll_ns(unsigned int old, unsigned int shrink)
+{
+ if (shrink == 0)
+ return 0;
+
+ return old / shrink;
+}
+
+static int kvm_idle_update_poll_duration(unsigned long idle_duration)
+{
+ unsigned long poll_duration = this_cpu_read(poll_duration_ns);
+
+ if (poll_duration && idle_duration > poll_threshold_ns)
+ poll_duration = shrink_poll_ns(poll_duration, poll_shrink);
+ else if (poll_duration < poll_threshold_ns &&
+ idle_duration < poll_threshold_ns)
+ poll_duration = grow_poll_ns(poll_duration, poll_grow,
+ poll_threshold_ns);
+
+ this_cpu_write(poll_duration_ns, poll_duration);
+
+ return 0;
+}
+
static void kvm_guest_idle_init(void)
{
if (!kvm_para_available())
return;
pv_idle_ops.poll = kvm_idle_poll;
+ pv_idle_ops.update = kvm_idle_update_poll_duration;
}
static void kvm_pv_disable_apf(void)
--
1.8.3.1
On Tue, Aug 29, 2017 at 11:46:39AM +0000, Yang Zhang wrote:
> To reduce the cost of poll, we introduce three sysctl to control the
> poll time.
This commit does not describe in any way the fact that these knobs are
all for and only for PARAVIRT.
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..67447b8 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,9 @@ show up in /proc/sys/kernel:
> - perf_event_max_stack
> - perf_event_max_contexts_per_stack
> - pid_max
> +- poll_grow [ X86 only ]
> +- poll_shrink [ X86 only ]
> +- poll_threshold_ns [ X86 only ]
How about paravirt_ prefix?
> - powersave-nap [ PPC only ]
> - printk
> - printk_delay
> @@ -702,6 +705,28 @@ kernel tries to allocate a number starting from this one.
>
> ==============================================================
>
> +poll_grow: (X86 only)
> +
> +This parameter is multiplied in the grow_poll_ns() to increase the poll time.
> +By default, the values is 2.
> +
> +==============================================================
> +poll_shrink: (X86 only)
> +
> +This parameter is divided in the shrink_poll_ns() to reduce the poll time.
> +By default, the values is 2.
These don't say anything about this being related to paravirt.
> +
> +==============================================================
> +poll_threshold_ns: (X86 only)
> +
> +This parameter controls the max poll time before entering real idle path.
> +This parameter is expected to take effect only when running inside a VM.
> +It would make no sense to turn on it in bare mental.
turn it on? Or modify, or use it?
> +By default, the values is 0 means don't poll. It is recommended to change
> +the value to non-zero if running latency-bound workloads inside VM.
> +
> +==============================================================
> +
> powersave-nap: (PPC only)
>
> If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
> index a11b2c2..0b92f8f 100644
> --- a/arch/x86/kernel/paravirt.c
> +++ b/arch/x86/kernel/paravirt.c
> @@ -318,6 +318,10 @@ struct pv_idle_ops pv_idle_ops = {
> .update = paravirt_nop,
> };
>
> +unsigned long poll_threshold_ns;
> +unsigned int poll_shrink = 2;
> +unsigned int poll_grow = 2;
> +
> __visible struct pv_irq_ops pv_irq_ops = {
> .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
> .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index bd6d96c..6cb2820 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -462,6 +462,12 @@ extern __scanf(2, 0)
>
> extern bool crash_kexec_post_notifiers;
>
> +#ifdef CONFIG_PARAVIRT
> +extern unsigned long poll_threshold_ns;
> +extern unsigned int poll_shrink;
> +extern unsigned int poll_grow;
> +#endif
What are these if we are on bare metal but support
paravirt? The docs are not clear in any way about it.
> +
> /*
> * panic_cpu is used for synchronizing panic() and crash_kexec() execution. It
> * holds a CPU number which is executing panic() currently. A value of
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 6648fbb..9b86a8f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1229,6 +1229,29 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
> .extra2 = &one,
> },
> #endif
> +#ifdef CONFIG_PARAVIRT
> + {
> + .procname = "halt_poll_threshold",
> + .data = &poll_threshold_ns,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
proc_doulongvec_minmax() may be more appropriate here? What is the range?
Also, which existing user did you see that uses proc_dointvec() but has
an unsigned long? If the tests below reveal this is invalid, we would
need to go fix them as well.
> + {
> + .procname = "halt_poll_grow",
> + .data = &poll_grow,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
> + },
> + {
> + .procname = "halt_poll_shrink",
> + .data = &poll_shrink,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec,
We now have the much more appropriate proc_douintvec() and
proc_douintvec_minmax(). It seems you want to support only an unsigned
int for two of these, so those would be more appropriate.
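A sketch of what the corrected entries might look like (untested;
ranges/bounds omitted, handler names as suggested above):

	{
		.procname	= "halt_poll_threshold",
		.data		= &poll_threshold_ns,
		.maxlen		= sizeof(unsigned long),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{
		.procname	= "halt_poll_grow",
		.data		= &poll_grow,
		.maxlen		= sizeof(unsigned int),
		.mode		= 0644,
		.proc_handler	= proc_douintvec,
	},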
To test things out you can replicate your values using or extending the
test driver lib/test_sysctl.c CONFIG_TEST_SYSCTL=m with your type of
values and then fuzz testing them with arbitrary values to ensure you get
only as input what you really think these values should get.
Once done please extend the script:
tools/testing/selftests/sysctl/sysctl.sh
to cover the different tests you have run, we want tests to be generic.
Luis
On 29/08/17 15:55, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in
>> idle path which will polling for a while before we enter the real idle
>> state.
>>
>> In virtualization, idle path includes several heavy operations
>> includes timer access(LAPIC timer or TSC deadline timer) which will hurt
>> performance especially for latency intensive workload like message
>> passing task. The cost is mainly come from the vmexit which is a
>> hardware context switch between VM and hypervisor. Our solution is to
>> poll for a while and do not enter real idle path if we can get the
>> schedule event during polling.
>>
>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>> reduce the useless poll.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
>> Cc: Jeremy Fitzhardinge <[email protected]>
>> Cc: Chris Wright <[email protected]>
>> Cc: Alok Kataria <[email protected]>
>> Cc: Rusty Russell <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: "Kirill A. Shutemov" <[email protected]>
>> Cc: Pan Xinhui <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>
> Adding xen-devel.
>
> Juergen, we really should replace Jeremy's name with xen-devel or
> your name..
I wouldn't mind being added. What does Jeremy think of being removed?
> Wasn't there an patch by you that took some of the
> mainternship over it?
I added include/linux/hypervisor.h to the PARAVIRT section and offered
to maintain it in case the PARAVIRT maintainers didn't want to.
Juergen
on 2017/8/29 20:45, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>> Add poll in do_idle. For UP VM, if there are running task, it will not
>> goes into idle path, so we only enable poll in SMP VM.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
> Broken SoB chain.
Peter, I can't follow 'Broken SoB chain'.. could you say more about it?
-Quan
>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>> index 6c23e30..b374744 100644
>> --- a/kernel/sched/idle.c
>> +++ b/kernel/sched/idle.c
>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>> }
>>
>> /* Weak implementations for optional arch specific functions */
>> +void __weak arch_cpu_idle_poll(void) { }
>> void __weak arch_cpu_idle_prepare(void) { }
>> void __weak arch_cpu_idle_enter(void) { }
> And not a word on why we need a new arch hook. What's wrong with
> arch_cpu_idle_enter() for instance?
On 2017/8/29 19:58, Alexander Graf wrote:
> On 08/29/2017 01:46 PM, Yang Zhang wrote:
>> Some latency-intensive workloads see an obvious performance
>> drop when running inside a VM. The main reason is that the overhead
>> is amplified when running inside a VM, and the largest cost I have
>> seen is in the idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering the idle state. If a reschedule is needed during the poll,
>> we avoid going through the heavy overhead path.
>>
>> Here is the data we get when running benchmark contextswitch to measure
>> the latency(lower is better):
>>
>>    1. w/o patch:
>>       2493.14 ns/ctxsw -- 200.3 %CPU
>>    2. w/ patch:
>>       halt_poll_threshold=10000 -- 1485.96 ns/ctxsw -- 201.0 %CPU
>>       halt_poll_threshold=20000 -- 1391.26 ns/ctxsw -- 200.7 %CPU
>>       halt_poll_threshold=30000 -- 1488.55 ns/ctxsw -- 200.1 %CPU
>>       halt_poll_threshold=500000 -- 1159.14 ns/ctxsw -- 201.5 %CPU
>>    3. kvm dynamic poll
>>       halt_poll_ns=10000 -- 2296.11 ns/ctxsw -- 201.2 %CPU
>>       halt_poll_ns=20000 -- 2599.7 ns/ctxsw -- 201.7 %CPU
>>       halt_poll_ns=30000 -- 2588.68 ns/ctxsw -- 211.6 %CPU
>>       halt_poll_ns=500000 -- 2423.20 ns/ctxsw -- 229.2 %CPU
>>    4. idle=poll
>>       2050.1 ns/ctxsw -- 1003 %CPU
>>    5. idle=mwait
>>       2188.06 ns/ctxsw -- 206.3 %CPU
>
> Could you please try to create another metric for guest initiated, host
> aborted mwait?
>
> For a quick benchmark, reserve 4 registers for a magic value, set them
> to the magic value before you enter MWAIT in the guest. Then allow
> native MWAIT execution on the host. If you see the guest wants to enter
I guess you mean allowing native MWAIT execution in the guest, not on the host?
> with the 4 registers containing the magic contents and no events are
> pending, directly go into the vcpu block function on the host.
Mmm.. it is not very clear to me. If the guest executes MWAIT without a
vmexit, how can the host check the registers?
>
> That way any time a guest gets naturally aborted while in mwait, it will
> only reenter mwait when an event actually occured. While the guest is
> normally running (and nobody else wants to run on the host), we just
> stay in guest context, but with a sleeping CPU.
>
> Overall, that might give us even better performance, as it allows for
> turbo boost and HT to work properly.
In our testing, we have enough cores (32 cores) but only 10 vCPUs, so
in the best case we may see the same performance as polling.
--
Yang
Alibaba Cloud Computing
On 2017/8/29 22:02, Wanpeng Li wrote:
>> Here is the data we get when running benchmark netperf:
>>
>> 2. w/ patch:
>> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>
>> 3. kvm dynamic poll
>> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>
>
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and the duplicate adaptive halt-polling logic as what has
> already been done in kvm w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enable both the
> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remembered that David from
If we poll on the KVM side, it consumes more CPU than polling on the
guest side. If we use both, we get the same performance as enabling
only the guest-side poll, but it costs more CPU because of the KVM-side
poll. That means we should disable the KVM-side poll, since it gives
little improvement over the guest side while consuming more CPU and
adding latency.
> Google mentioned that Windows Event Objects can get 2x latency
> improvement in KVM FORUM, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if the windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enable the
> polling in the linux guest. Anyway, if the patchset is finally
> acceptable by maintainer, I will introduce the generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.
We will add more conditions to the current algorithm in the future, but
it's OK to share one copy of the logic for now; we will do that in the
next version.
--
Yang
Alibaba Cloud Computing
On 2017/8/29 22:02, Wanpeng Li wrote:
>> Here is the data we get when running benchmark netperf:
>>
>> 2. w/ patch:
>> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>
>> 3. kvm dynamic poll
>> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>
>
> Actually I'm not sure how much sense it makes to introduce this pv
> stuff and the duplicate adaptive halt-polling logic as what has
> already been done in kvm w/o obvious benefit for real workload like
> netperf. In addition, as you mentioned offline to me, enable both the
> patchset and the adaptive halt-polling logic in kvm simultaneously can
> result in more cpu power consumption. I remembered that David from
No. Polling on the KVM side consumes more CPU than polling inside the
guest. If both are enabled, we get the same performance as enabling only
the guest-side poll, but at a higher CPU cost because of the KVM-side
polling. That means KVM-side polling should be disabled, since it gives
little improvement over guest-side polling while costing more CPU.
> Google mentioned that Windows Event Objects can get 2x latency
> improvement in KVM FORUM, which means that the adaptive halt-polling
> in kvm should be enabled by default. So if the windows guests and
> linux guests are mixed on the same host, then this patchset will
> result in more cpu power consumption if the customer enable the
> polling in the linux guest. Anyway, if the patchset is finally
As I said last time, there are already users running idle=poll in their
VMs, and you *cannot* prevent them from doing so. This patch provides a
better solution than unconditional polling; we did not introduce anything
worse.
> acceptable by maintainer, I will introduce the generic adaptive
> halt-polling framework in kvm to avoid the duplicate logic.
We will add more conditions to the current algorithm in the future, but
it is fine to share one copy of the code for now; we will do that in the
next version.
>
> Regards,
> Wanpeng Li
>
--
Yang
Alibaba Cloud Computing
on 2017/8/29 22:39, Borislav Petkov wrote:
> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>> Add poll in do_idle. For UP VM, if there are running task, it will not
>> goes into idle path, so we only enable poll in SMP VM.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Kyle Huey <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Len Brown <[email protected]>
>> Cc: [email protected]
>> ---
>> arch/x86/kernel/process.c | 7 +++++++
>> kernel/sched/idle.c | 2 ++
>> 2 files changed, 9 insertions(+)
>>
>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>> index 3ca1980..def4113 100644
>> --- a/arch/x86/kernel/process.c
>> +++ b/arch/x86/kernel/process.c
>> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
>> x86_idle();
>> }
>>
>> +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
>> +void arch_cpu_idle_poll(void)
>> +{
>> + paravirt_idle_poll();
>> +}
>> +#endif
> So this will get called on any system which has CONFIG_PARAVIRT enabled
> *even* if they're not running any guests.
>
> Huh?
Borislav,
Yes, this will get called on any system which has CONFIG_PARAVIRT enabled.
But if it is not running any guests, the callback is paravirt_nop(), which
IIUC is similar to the other paravirt_* hooks, such as paravirt_pgd_free().
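For illustration, a minimal sketch of that registration path (the helper
name kvm_register_idle_poll is made up for the example; kvm_idle_poll is
the poll routine the series adds in arch/x86/kernel/kvm.c):

#include <linux/init.h>
#include <linux/kvm_para.h>
#include <asm/paravirt.h>

extern void kvm_idle_poll(void);        /* poll routine added by the series */

static void __init kvm_register_idle_poll(void)
{
        /* only install a real poll op when we really run as a KVM guest */
        if (kvm_para_available())
                pv_idle_ops.poll = kvm_idle_poll;
        /* otherwise pv_idle_ops.poll stays paravirt_nop and does nothing */
}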
- Quan
2017-09-01 14:32 GMT+08:00 Yang Zhang <[email protected]>:
> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>
>>> Here is the data we get when running benchmark netperf:
>>>
>>> 2. w/ patch:
>>> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>
>>> 3. kvm dynamic poll
>>> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>
>>
>> Actually I'm not sure how much sense it makes to introduce this pv
>> stuff and the duplicate adaptive halt-polling logic as what has
>> already been done in kvm w/o obvious benefit for real workload like
>> netperf. In addition, as you mentioned offline to me, enable both the
>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>> result in more cpu power consumption. I remembered that David from
>
>
> If we use poll in KVM side, it will consume more cpu than in guest side. If
> use both two, then we can get the same performance as only enable guest side
> poll but it will cost more cpu because of poll KVM side. It means we should
> disable KVM side poll since it cannot give much improvement than guest side
> except consume more cpu and large latency.
How can message-passing workloads in a Windows guest survive?
Regards,
Wanpeng Li
>
>> Google mentioned that Windows Event Objects can get 2x latency
>> improvement in KVM FORUM, which means that the adaptive halt-polling
>> in kvm should be enabled by default. So if the windows guests and
>> linux guests are mixed on the same host, then this patchset will
>> result in more cpu power consumption if the customer enable the
>> polling in the linux guest. Anyway, if the patchset is finally
>> acceptable by maintainer, I will introduce the generic adaptive
>> halt-polling framework in kvm to avoid the duplicate logic.
>
>
> We will add more conditions than the current algorithm in future. But it's
> ok to use the one copy currently, we will do it in next version.
>
>
>
> --
> Yang
> Alibaba Cloud Computing
On 2017/8/29 21:55, Konrad Rzeszutek Wilk wrote:
> On Tue, Aug 29, 2017 at 11:46:35AM +0000, Yang Zhang wrote:
>> So far, pv_idle_ops.poll is the only ops for pv_idle. .poll is called in
>> idle path which will polling for a while before we enter the real idle
>> state.
>>
>> In virtualization, idle path includes several heavy operations
>> includes timer access(LAPIC timer or TSC deadline timer) which will hurt
>> performance especially for latency intensive workload like message
>> passing task. The cost is mainly come from the vmexit which is a
>> hardware context switch between VM and hypervisor. Our solution is to
>> poll for a while and do not enter real idle path if we can get the
>> schedule event during polling.
>>
>> Poll may cause the CPU waste so we adopt a smart polling mechanism to
>> reduce the useless poll.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
>> Cc: Jeremy Fitzhardinge <[email protected]>
>> Cc: Chris Wright <[email protected]>
>> Cc: Alok Kataria <[email protected]>
>> Cc: Rusty Russell <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: "Kirill A. Shutemov" <[email protected]>
>> Cc: Pan Xinhui <[email protected]>
>> Cc: Kees Cook <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>
> Adding xen-devel.
>
> Juergen, we really should replace Jeremy's name with xen-devel or
> your name.. Wasn't there a patch by you that took over some of the
> maintainership?
Hi Konrad, I didn't test it on the Xen side since I don't have the
environment, but I can add it for Xen in the next version if you think it
is useful for Xen as well.
>
>> ---
>> arch/x86/include/asm/paravirt.h | 5 +++++
>> arch/x86/include/asm/paravirt_types.h | 6 ++++++
>> arch/x86/kernel/paravirt.c | 6 ++++++
>> 3 files changed, 17 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index 9ccac19..6d46760 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -202,6 +202,11 @@ static inline unsigned long long paravirt_read_pmc(int counter)
>>
>> #define rdpmcl(counter, val) ((val) = paravirt_read_pmc(counter))
>>
>> +static inline void paravirt_idle_poll(void)
>> +{
>> + PVOP_VCALL0(pv_idle_ops.poll);
>> +}
>> +
>> static inline void paravirt_alloc_ldt(struct desc_struct *ldt, unsigned entries)
>> {
>> PVOP_VCALL2(pv_cpu_ops.alloc_ldt, ldt, entries);
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 9ffc36b..cf45726 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -324,6 +324,10 @@ struct pv_lock_ops {
>> struct paravirt_callee_save vcpu_is_preempted;
>> } __no_randomize_layout;
>>
>> +struct pv_idle_ops {
>> + void (*poll)(void);
>> +} __no_randomize_layout;
>> +
>> /* This contains all the paravirt structures: we get a convenient
>> * number for each function using the offset which we use to indicate
>> * what to patch. */
>> @@ -334,6 +338,7 @@ struct paravirt_patch_template {
>> struct pv_irq_ops pv_irq_ops;
>> struct pv_mmu_ops pv_mmu_ops;
>> struct pv_lock_ops pv_lock_ops;
>> + struct pv_idle_ops pv_idle_ops;
>> } __no_randomize_layout;
>>
>> extern struct pv_info pv_info;
>> @@ -343,6 +348,7 @@ struct paravirt_patch_template {
>> extern struct pv_irq_ops pv_irq_ops;
>> extern struct pv_mmu_ops pv_mmu_ops;
>> extern struct pv_lock_ops pv_lock_ops;
>> +extern struct pv_idle_ops pv_idle_ops;
>>
>> #define PARAVIRT_PATCH(x) \
>> (offsetof(struct paravirt_patch_template, x) / sizeof(void *))
>> diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
>> index bc0a849..1b5b247 100644
>> --- a/arch/x86/kernel/paravirt.c
>> +++ b/arch/x86/kernel/paravirt.c
>> @@ -128,6 +128,7 @@ static void *get_call_destination(u8 type)
>> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>> .pv_lock_ops = pv_lock_ops,
>> #endif
>> + .pv_idle_ops = pv_idle_ops,
>> };
>> return *((void **)&tmpl + type);
>> }
>> @@ -312,6 +313,10 @@ struct pv_time_ops pv_time_ops = {
>> .steal_clock = native_steal_clock,
>> };
>>
>> +struct pv_idle_ops pv_idle_ops = {
>> + .poll = paravirt_nop,
>> +};
>> +
>> __visible struct pv_irq_ops pv_irq_ops = {
>> .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl),
>> .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl),
>> @@ -471,3 +476,4 @@ struct pv_mmu_ops pv_mmu_ops __ro_after_init = {
>> EXPORT_SYMBOL (pv_mmu_ops);
>> EXPORT_SYMBOL_GPL(pv_info);
>> EXPORT_SYMBOL (pv_irq_ops);
>> +EXPORT_SYMBOL (pv_idle_ops);
>> --
>> 1.8.3.1
>>
--
Yang
Alibaba Cloud Computing
2017-09-01 14:44 GMT+08:00 Yang Zhang <[email protected]>:
> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>
>>> Here is the data we get when running benchmark netperf:
>>>
>>> 2. w/ patch:
>>> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>
>>> 3. kvm dynamic poll
>>> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>
>>
>> Actually I'm not sure how much sense it makes to introduce this pv
>> stuff and the duplicate adaptive halt-polling logic as what has
>> already been done in kvm w/o obvious benefit for real workload like
>> netperf. In addition, as you mentioned offline to me, enable both the
>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>> result in more cpu power consumption. I remembered that David from
>
>
> No.If we use poll in KVM side, it will consume more cpu than in guest side.
> If use both two, then we can get the performance as only enable guest side
> poll but it will cost more cpu because of poll in KVM side. It means we
> should disable KVM side poll since it cannot give much improvement than
> guest side except consume more cpu.
The customers should have enough knowledge about the meaning of the
tuning knobs which you exposed.
Regards,
Wanpeng Li
>
>> Google mentioned that Windows Event Objects can get 2x latency
>> improvement in KVM FORUM, which means that the adaptive halt-polling
>> in kvm should be enabled by default. So if the windows guests and
>> linux guests are mixed on the same host, then this patchset will
>> result in more cpu power consumption if the customer enable the
>> polling in the linux guest. Anyway, if the patchset is finally
>
>
> I have said in last time, there already users using idle=poll in there VM,
> you *cannot* prevent them doing it. This patch provide a better solution
> than unconditional poll, we didn't introduce any worse stuff.
>
>> acceptable by maintainer, I will introduce the generic adaptive
>> halt-polling framework in kvm to avoid the duplicate logic.
>
>
> We will add more conditions than the current algorithm in future. But it's
> ok to use one code currently, we will do it in next version.
>
>>
>> Regards,
>> Wanpeng Li
>>
>
>
> --
> Yang
> Alibaba Cloud Computing
On 2017/8/29 20:46, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
>> In ttwu_do_wakeup, it will update avg_idle when wakeup from idle. Here
>> we just reuse this logic to update the poll time. It may be a little
>> late to update the poll in ttwu_do_wakeup, but the test result shows no
>> obvious performance gap compare with updating poll in irq handler.
>>
>> one problem is that idle_stamp only used when using CFS scheduler. But
>> it is ok since it is the default policy for scheduler and only consider
>> it should enough.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
>
> Same broken SoB chain, and not a useful word on why you need to adjust
> this crap to begin with. What you want that poll duration to be related
> to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
> to be idle.
Actually, we should compare the cost of a VMEXIT/VMENTER with the real
duration spent in idle. We have a rough number for the cost of one
VMEXIT/VMENTER (about 2k~4k cycles, depending on the underlying CPU), and
the idle path introduces 4~5 VMEXIT/VMENTER transitions, which adds about
7us of latency on average. So we set the poll duration to 10us by default.
Another problem is that there is no good way to measure the duration spent
in idle; avg_idle is the only way I have found so far. Do you have any
suggestion on how to do it better? Thanks.
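To make the shape of the loop concrete, a minimal sketch assuming the
~10us default mentioned above (guest_idle_poll and poll_threshold_ns are
illustrative names, not the exact code in the series):

#include <linux/sched.h>
#include <linux/sched/clock.h>
#include <asm/processor.h>

static unsigned long poll_threshold_ns = 10000; /* ~10us default budget */

static void guest_idle_poll(void)
{
        u64 start = sched_clock();

        do {
                cpu_relax();
                if (need_resched())
                        return; /* work arrived: skip the heavy halt path */
        } while (sched_clock() - start < poll_threshold_ns);
        /* nothing became runnable within the budget: fall into the real idle path */
}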
--
Yang
Alibaba Cloud Computing
On 2017/9/1 14:58, Wanpeng Li wrote:
> 2017-09-01 14:44 GMT+08:00 Yang Zhang <[email protected]>:
>> On 2017/8/29 22:02, Wanpeng Li wrote:
>>>>
>>>> Here is the data we get when running benchmark netperf:
>>>>
>>>> 2. w/ patch:
>>>> halt_poll_threshold=10000 -- 15803.89 bits/s -- 159.5 %CPU
>>>> halt_poll_threshold=20000 -- 15899.04 bits/s -- 161.5 %CPU
>>>> halt_poll_threshold=30000 -- 15642.38 bits/s -- 161.8 %CPU
>>>> halt_poll_threshold=40000 -- 18040.76 bits/s -- 184.0 %CPU
>>>> halt_poll_threshold=50000 -- 18877.61 bits/s -- 197.3 %CPU
>>>>
>>>> 3. kvm dynamic poll
>>>> halt_poll_ns=10000 -- 15876.00 bits/s -- 172.2 %CPU
>>>> halt_poll_ns=20000 -- 15602.58 bits/s -- 185.4 %CPU
>>>> halt_poll_ns=30000 -- 15930.69 bits/s -- 194.4 %CPU
>>>> halt_poll_ns=40000 -- 16413.09 bits/s -- 195.3 %CPU
>>>> halt_poll_ns=50000 -- 16417.42 bits/s -- 196.3 %CPU
>>>>
>>>
>>> Actually I'm not sure how much sense it makes to introduce this pv
>>> stuff and the duplicate adaptive halt-polling logic as what has
>>> already been done in kvm w/o obvious benefit for real workload like
>>> netperf. In addition, as you mentioned offline to me, enable both the
>>> patchset and the adaptive halt-polling logic in kvm simultaneously can
>>> result in more cpu power consumption. I remembered that David from
>>
>>
>> No.If we use poll in KVM side, it will consume more cpu than in guest side.
>> If use both two, then we can get the performance as only enable guest side
>> poll but it will cost more cpu because of poll in KVM side. It means we
>> should disable KVM side poll since it cannot give much improvement than
>> guest side except consume more cpu.
>
> The customers should have enough knowledge about what's the meaning of
> the tunning which you exposed.
We have applied this patch to a customized kernel for some real customers
and we got positive feedback from them, since the CPU no longer runs at
100% even when there is no task running. This also lets them give more
CPUs to other tasks and reduces the power consumption in their racks.
Don't you think that is better?
>
> Regards,
> Wanpeng Li
>
>>
>>> Google mentioned that Windows Event Objects can get 2x latency
>>> improvement in KVM FORUM, which means that the adaptive halt-polling
>>> in kvm should be enabled by default. So if the windows guests and
>>> linux guests are mixed on the same host, then this patchset will
>>> result in more cpu power consumption if the customer enable the
>>> polling in the linux guest. Anyway, if the patchset is finally
>>
>>
>> I have said in last time, there already users using idle=poll in there VM,
>> you *cannot* prevent them doing it. This patch provide a better solution
>> than unconditional poll, we didn't introduce any worse stuff.
>>
>>> acceptable by maintainer, I will introduce the generic adaptive
>>> halt-polling framework in kvm to avoid the duplicate logic.
>>
>>
>> We will add more conditions than the current algorithm in future. But it's
>> ok to use one code currently, we will do it in next version.
>>
>>>
>>> Regards,
>>> Wanpeng Li
>>>
>>
>>
>> --
>> Yang
>> Alibaba Cloud Computing
--
Yang
Alibaba Cloud Computing
On 2017/8/29 22:56, Michael S. Tsirkin wrote:
> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>> Some latency-intensive workload will see obviously performance
>> drop when running inside VM.
>
> But are we trading a lot of CPU for a bit of lower latency?
>
>> The main reason is that the overhead
>> is amplified when running inside VM. The most cost i have seen is
>> inside idle path.
>>
>> This patch introduces a new mechanism to poll for a while before
>> entering idle state. If schedule is needed during poll, then we
>> don't need to goes through the heavy overhead path.
>
> Isn't it the job of an idle driver to find the best way to
> halt the CPU?
>
> It looks like just by adding a cstate we can make it
> halt at higher latencies only. And at lower latencies,
> if it's doing a good job we can hopefully use mwait to
> stop the CPU.
>
> In fact I have been experimenting with exactly that.
> Some initial results are encouraging but I could use help
> with testing and especially tuning. If you can help
> pls let me know!
Quan, can you help to test it and share the results? Thanks.
--
Yang
Alibaba Cloud Computing
on 2017/9/13 19:56, Yang Zhang wrote:
> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>> Some latency-intensive workload will see obviously performance
>>> drop when running inside VM.
>>
>> But are we trading a lot of CPU for a bit of lower latency?
>>
>>> The main reason is that the overhead
>>> is amplified when running inside VM. The most cost i have seen is
>>> inside idle path.
>>>
>>> This patch introduces a new mechanism to poll for a while before
>>> entering idle state. If schedule is needed during poll, then we
>>> don't need to goes through the heavy overhead path.
>>
>> Isn't it the job of an idle driver to find the best way to
>> halt the CPU?
>>
>> It looks like just by adding a cstate we can make it
>> halt at higher latencies only. And at lower latencies,
>> if it's doing a good job we can hopefully use mwait to
>> stop the CPU.
>>
>> In fact I have been experimenting with exactly that.
>> Some initial results are encouraging but I could use help
>> with testing and especially tuning. If you can help
>> pls let me know!
>
> Quan, Can you help to test it and give result? Thanks.
>
Hi MST,
I have tested the patch "intel_idle: add pv cstates when running on
kvm" on a recent host that allows guests to execute mwait without an
exit. I have also tested our patch "[RFC PATCH v2 0/7] x86/idle: add
halt poll support", upstream linux, and idle=poll.
The following are the results (which seem better than ever before, as I
ran the test cases on a more powerful machine):
For __netperf__, the first column is the trans. rate per sec, the second
column is CPU utilization.
1. upstream linux
28371.7 bits/s -- 76.6 %CPU
2. idle=poll
34372 bit/s -- 999.3 %CPU
3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
values of parameter 'halt_poll_threshold':
28362.7 bits/s -- 74.7 %CPU (halt_poll_threshold=10000)
32949.5 bits/s -- 82.5 %CPU (halt_poll_threshold=20000)
39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)
4. "intel_idle: add pv cstates when running on kvm"
33041.8 bits/s -- 999.4 %CPU
For __ctxsw__, the first column is the time per process context switch,
the second column is CPU utilization.
1. upstream linux
3624.19 ns/ctxsw -- 191.9 %CPU
2. idle=poll
3419.66 ns/ctxsw -- 999.2 %CPU
3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
values of parameter 'halt_poll_threshold':
1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)
4. "intel_idle: add pv cstates when running on kvm"
3427.59 ns/ctxsw -- 999.4 %CPU
-Quan
on 2017/9/1 13:57, Quan Xu wrote:
> on 2017/8/29 20:45, Peter Zijlstra wrote:
>
>> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>>> Add poll in do_idle. For UP VM, if there are running task, it will not
>>> goes into idle path, so we only enable poll in SMP VM.
>>>
>>> Signed-off-by: Yang Zhang <[email protected]>
>>> Signed-off-by: Quan Xu <[email protected]>
>> Broken SoB chain.
> Peter, I can't follow 'Broken SoB chain'.. could you explain more
> about it?
>
Peter, Ping..
Quan
> -Quan
>
>>> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
>>> index 6c23e30..b374744 100644
>>> --- a/kernel/sched/idle.c
>>> +++ b/kernel/sched/idle.c
>>> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>>> }
>>> /* Weak implementations for optional arch specific functions */
>>> +void __weak arch_cpu_idle_poll(void) { }
>>> void __weak arch_cpu_idle_prepare(void) { }
>>> void __weak arch_cpu_idle_enter(void) { }
>> And not a word on why we need a new arch hook. What's wrong with
>> arch_cpu_idle_enter() for instance?
>
On Thu, Sep 14, 2017 at 04:41:39PM +0800, Quan Xu wrote:
> > > On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
> > > > Add poll in do_idle. For UP VM, if there are running task, it will not
> > > > goes into idle path, so we only enable poll in SMP VM.
> > > >
> > > > Signed-off-by: Yang Zhang <[email protected]>
> > > > Signed-off-by: Quan Xu <[email protected]>
> > > Broken SoB chain.
> > Peter, I can't follow 'Broken SoB chain'.. could you explain more
> > about it?
> >
> Peter, Ping..
The SOB chain needs to show the path from the author to the upstream
maintainer. Yours has Yang before you, which doesn't say what his role
is.
Read Documentation/process/submitting-patches.rst, section 11.
--
Regards/Gruss,
Boris.
Good mailing practices for 400: avoid top-posting and trim the reply.
2017-09-14 16:36 GMT+08:00 Quan Xu <[email protected]>:
>
>
> on 2017/9/13 19:56, Yang Zhang wrote:
>>
>> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>>>
>>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>>>
>>>> Some latency-intensive workload will see obviously performance
>>>> drop when running inside VM.
>>>
>>>
>>> But are we trading a lot of CPU for a bit of lower latency?
>>>
>>>> The main reason is that the overhead
>>>> is amplified when running inside VM. The most cost i have seen is
>>>> inside idle path.
>>>>
>>>> This patch introduces a new mechanism to poll for a while before
>>>> entering idle state. If schedule is needed during poll, then we
>>>> don't need to goes through the heavy overhead path.
>>>
>>>
>>> Isn't it the job of an idle driver to find the best way to
>>> halt the CPU?
>>>
>>> It looks like just by adding a cstate we can make it
>>> halt at higher latencies only. And at lower latencies,
>>> if it's doing a good job we can hopefully use mwait to
>>> stop the CPU.
>>>
>>> In fact I have been experimenting with exactly that.
>>> Some initial results are encouraging but I could use help
>>> with testing and especially tuning. If you can help
>>> pls let me know!
>>
>>
>> Quan, Can you help to test it and give result? Thanks.
>>
>
> Hi, MST
>
> I have tested the patch "intel_idle: add pv cstates when running on kvm" on
> a recent host that allows guests
> to execute mwait without an exit. also I have tested our patch "[RFC PATCH
> v2 0/7] x86/idle: add halt poll support",
> upstream linux, and idle=poll.
>
> the following is the result (which seems better than ever berfore, as I ran
> test case on a more powerful machine):
>
> for __netperf__, the first column is trans. rate per sec, the second column
> is CPU utilzation.
>
> 1. upstream linux
This "upstream linux" means that disables the kvm adaptive
halt-polling after confirm with Xu Quan.
Regards,
Wanpeng Li
>
> 28371.7 bits/s -- 76.6 %CPU
>
> 2. idle=poll
>
> 34372 bit/s -- 999.3 %CPU
>
> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
> values of parameter 'halt_poll_threshold':
>
> 28362.7 bits/s -- 74.7 %CPU (halt_poll_threshold=10000)
> 32949.5 bits/s -- 82.5 %CPU (halt_poll_threshold=20000)
> 39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
> 40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
> 40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)
>
>
> 4. "intel_idle: add pv cstates when running on kvm"
>
> 33041.8 bits/s -- 999.4 %CPU
>
>
>
>
>
> for __ctxsw__, the first column is the time per process context switches,
> the second column is CPU utilzation..
>
> 1. upstream linux
>
> 3624.19 ns/ctxsw -- 191.9 %CPU
>
> 2. idle=poll
>
> 3419.66 ns/ctxsw -- 999.2 %CPU
>
> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
> values of parameter 'halt_poll_threshold':
>
> 1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
> 1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
> 1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
> 1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
> 1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)
>
> 4. "intel_idle: add pv cstates when running on kvm"
>
> 3427.59 ns/ctxsw -- 999.4 %CPU
>
> -Quan
On 2017/9/14 17:19, Wanpeng Li wrote:
> 2017-09-14 16:36 GMT+08:00 Quan Xu <[email protected]>:
>>
>> on 2017/9/13 19:56, Yang Zhang wrote:
>>> On 2017/8/29 22:56, Michael S. Tsirkin wrote:
>>>> On Tue, Aug 29, 2017 at 11:46:34AM +0000, Yang Zhang wrote:
>>>>> Some latency-intensive workload will see obviously performance
>>>>> drop when running inside VM.
>>>>
>>>> But are we trading a lot of CPU for a bit of lower latency?
>>>>
>>>>> The main reason is that the overhead
>>>>> is amplified when running inside VM. The most cost i have seen is
>>>>> inside idle path.
>>>>>
>>>>> This patch introduces a new mechanism to poll for a while before
>>>>> entering idle state. If schedule is needed during poll, then we
>>>>> don't need to goes through the heavy overhead path.
>>>>
>>>> Isn't it the job of an idle driver to find the best way to
>>>> halt the CPU?
>>>>
>>>> It looks like just by adding a cstate we can make it
>>>> halt at higher latencies only. And at lower latencies,
>>>> if it's doing a good job we can hopefully use mwait to
>>>> stop the CPU.
>>>>
>>>> In fact I have been experimenting with exactly that.
>>>> Some initial results are encouraging but I could use help
>>>> with testing and especially tuning. If you can help
>>>> pls let me know!
>>>
>>> Quan, Can you help to test it and give result? Thanks.
>>>
>> Hi, MST
>>
>> I have tested the patch "intel_idle: add pv cstates when running on kvm" on
>> a recent host that allows guests
>> to execute mwait without an exit. also I have tested our patch "[RFC PATCH
>> v2 0/7] x86/idle: add halt poll support",
>> upstream linux, and idle=poll.
>>
>> the following is the result (which seems better than ever berfore, as I ran
>> test case on a more powerful machine):
>>
>> for __netperf__, the first column is trans. rate per sec, the second column
>> is CPU utilzation.
>>
>> 1. upstream linux
> This "upstream linux" means that disables the kvm adaptive
> halt-polling after confirm with Xu Quan.
"upstream linux" -- the source code is just upstream Linux, without our
patch or MST's patch.
Yes, we disable kvm halt-polling (halt_poll_ns=0) for _all_ of the
following cases.
Quan
> Regards,
> Wanpeng Li
>
>> 28371.7 bits/s -- 76.6 %CPU
>>
>> 2. idle=poll
>>
>> 34372 bit/s -- 999.3 %CPU
>>
>> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
>> values of parameter 'halt_poll_threshold':
>>
>> 28362.7 bits/s -- 74.7 %CPU (halt_poll_threshold=10000)
>> 32949.5 bits/s -- 82.5 %CPU (halt_poll_threshold=20000)
>> 39717.9 bits/s -- 104.1 %CPU (halt_poll_threshold=30000)
>> 40137.9 bits/s -- 104.4 %CPU (halt_poll_threshold=40000)
>> 40079.8 bits/s -- 105.6 %CPU (halt_poll_threshold=50000)
>>
>>
>> 4. "intel_idle: add pv cstates when running on kvm"
>>
>> 33041.8 bits/s -- 999.4 %CPU
>>
>>
>>
>>
>>
>> for __ctxsw__, the first column is the time per process context switches,
>> the second column is CPU utilzation..
>>
>> 1. upstream linux
>>
>> 3624.19 ns/ctxsw -- 191.9 %CPU
>>
>> 2. idle=poll
>>
>> 3419.66 ns/ctxsw -- 999.2 %CPU
>>
>> 3. "[RFC PATCH v2 0/7] x86/idle: add halt poll support", with different
>> values of parameter 'halt_poll_threshold':
>>
>> 1123.40 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=10000)
>> 1127.38 ns/ctxsw -- 199.7 %CPU (halt_poll_threshold=20000)
>> 1113.58 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=30000)
>> 1117.12 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=40000)
>> 1121.62 ns/ctxsw -- 199.6 %CPU (halt_poll_threshold=50000)
>>
>> 4. "intel_idle: add pv cstates when running on kvm"
>>
>> 3427.59 ns/ctxsw -- 999.4 %CPU
>>
>> -Quan
On 2017/8/29 20:46, Peter Zijlstra wrote:
> On Tue, Aug 29, 2017 at 11:46:41AM +0000, Yang Zhang wrote:
>> In ttwu_do_wakeup, it will update avg_idle when wakeup from idle. Here
>> we just reuse this logic to update the poll time. It may be a little
>> late to update the poll in ttwu_do_wakeup, but the test result shows no
>> obvious performance gap compare with updating poll in irq handler.
>>
>> one problem is that idle_stamp only used when using CFS scheduler. But
>> it is ok since it is the default policy for scheduler and only consider
>> it should enough.
>>
>> Signed-off-by: Yang Zhang <[email protected]>
>> Signed-off-by: Quan Xu <[email protected]>
> Same broken SoB chain, and not a useful word on why you need to adjust
> this crap to begin with. What you want that poll duration to be related
> to is the cost of a VMEXIT/VMENTER cycle, not however long we happened
> to be idle.
>
> So no.
Peter,
I think you are right.
IIUC, the time we happened to be idle may contain a chain of
VMEXIT/VMENTER cycles, which (except for the last one) are mostly just
idle loops, right?
As you mentioned, what the poll duration should be related to is the cost
of __a__ VMEXIT/VMENTER cycle. However, it is very difficult to measure a
VMEXIT/VMENTER cycle accurately from inside a KVM guest, so we could use
an approximation instead -- dropping the idle-loop portion from the time
we happened to be idle. Does that make sense?
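As a rough illustration of that approximation (all names and the per-exit
figures below are purely illustrative, not measured values from the series):

#include <linux/types.h>

static u64 estimate_poll_ns(u64 idle_ns)
{
        const u64 exit_cost_ns = 2000;          /* ~4k cycles per VMEXIT/VMENTER at ~2GHz */
        const u64 exits_on_idle_path = 5;       /* exits taken by the idle path itself */
        u64 overhead_ns = exit_cost_ns * exits_on_idle_path;

        /* credit only the idle time not explained by the exit overhead */
        return idle_ns > overhead_ns ? idle_ns - overhead_ns : 0;
}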
Quan
On 2017/9/1 14:49, Quan Xu wrote:
> on 2017/8/29 22:39, Borislav Petkov wrote:
>> On Tue, Aug 29, 2017 at 11:46:37AM +0000, Yang Zhang wrote:
>>> Add poll in do_idle. For UP VM, if there are running task, it will not
>>> goes into idle path, so we only enable poll in SMP VM.
>>>
>>> Signed-off-by: Yang Zhang <[email protected]>
>>> Signed-off-by: Quan Xu <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Ingo Molnar <[email protected]>
>>> Cc: "H. Peter Anvin" <[email protected]>
>>> Cc: [email protected]
>>> Cc: Peter Zijlstra <[email protected]>
>>> Cc: Borislav Petkov <[email protected]>
>>> Cc: Kyle Huey <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Len Brown <[email protected]>
>>> Cc: [email protected]
>>> ---
>>> arch/x86/kernel/process.c | 7 +++++++
>>> kernel/sched/idle.c | 2 ++
>>> 2 files changed, 9 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
>>> index 3ca1980..def4113 100644
>>> --- a/arch/x86/kernel/process.c
>>> +++ b/arch/x86/kernel/process.c
>>> @@ -332,6 +332,13 @@ void arch_cpu_idle(void)
>>> x86_idle();
>>> }
>>> +#if defined(CONFIG_SMP) && defined(CONFIG_PARAVIRT)
>>> +void arch_cpu_idle_poll(void)
>>> +{
>>> + paravirt_idle_poll();
>>> +}
>>> +#endif
>> So this will get called on any system which has CONFIG_PARAVIRT enabled
>> *even* if they're not running any guests.
>>
>> Huh?
Borislav,
On second thought, it would be better to run this under an IF condition,
to make sure we are actually running as a guest.
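For example, a minimal sketch of such a guard (illustrative only; checking
X86_FEATURE_HYPERVISOR is one possible condition, the next version may use
a different one):

#include <asm/cpufeature.h>
#include <asm/paravirt.h>

void arch_cpu_idle_poll(void)
{
        /* bare-metal CONFIG_PARAVIRT kernels skip the pv hook entirely */
        if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                paravirt_idle_poll();
}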
Quan
> Borislav ,
> yes, this will get called on any system which has CONFIG_PARAVIRT
> enabled.
>
> but if they are not running any guests, the callback is
> paravirt_nop() ,
> IIUC which is as similar as the other paravirt_*, such as
> paravirt_pgd_free()..
>
> - Quan