Here's the same series as before, incorporating the resolutions we
agreed upon on the list. If you agree with what stayed in, I believe
this series is ready to be merged.
Feel free to keep the discussion going otherwise.
Glauber Costa (7):
KVM-HDR Add constant to represent KVM MSRs enabled bit
KVM-HDR: KVM Steal time implementation
KVM-HV: KVM Steal time implementation
KVM-GST: Add a pv_ops stub for steal time
KVM-GST: KVM Steal time accounting
KVM-GST: adjust scheduler cpu power
KVM-GST: KVM Steal time registration
Documentation/kernel-parameters.txt | 4 ++
Documentation/virtual/kvm/msr.txt | 33 ++++++++++++
arch/x86/Kconfig | 12 +++++
arch/x86/include/asm/kvm_host.h | 8 +++
arch/x86/include/asm/kvm_para.h | 15 ++++++
arch/x86/include/asm/paravirt.h | 9 +++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/kvm.c | 73 +++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
arch/x86/kernel/paravirt.c | 9 +++
arch/x86/kvm/x86.c | 60 +++++++++++++++++++++-
kernel/sched.c | 89 +++++++++++++++++++++++++++++---
kernel/sched_features.h | 4 +-
13 files changed, 305 insertions(+), 14 deletions(-)
--
1.7.3.4
This patch is simple; it is kept in a separate commit so it can be more
easily shared between guest and hypervisor. It just defines a named
constant to indicate the enable bit for KVM-specific MSRs.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
+#define KVM_MSR_ENABLED 1
/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
--
1.7.3.4
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM. This is per-vcpu, and using the kvmclock structure for
that is an abuse we decided not to make.
In this patchset, I am introducing a new MSR, MSR_KVM_STEAL_TIME, that
holds the address of a memory area containing information about steal
time.
This patch contains the headers for it. I am keeping it separate to
facilitate backports for people who want to backport the kernel part
but not the hypervisor, or the other way around.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 33 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_para.h | 9 +++++++++
2 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..79c12a7 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+ data: 64-byte alignment physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure:
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[6];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version: guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ flags: At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal: the amount of time in which this vCPU did not run, in
+ nanoseconds.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..ac306c4 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
+#define KVM_FEATURE_STEAL_TIME 5
/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME 0x4b564d03
+
+struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[6];
+};
#define KVM_MAX_MMU_OP_BATCH 32
--
1.7.3.4
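The documentation above describes a seqlock-like protocol: an odd
version means an update is in progress, and the guest must read the
same even version before and after fetching the data. A minimal,
purely illustrative sketch of a guest-side reader honouring that
protocol (the guest patch later in this series does essentially this
in kvm_steal_clock()) could look like this, where st is assumed to
point at the per-cpu area registered through MSR_KVM_STEAL_TIME:

static u64 read_steal_ns(struct kvm_steal_time *st)
{
        u32 version;
        u64 steal;

        do {
                version = st->version;
                rmb();          /* read version before the payload */
                steal = st->steal;
                rmb();          /* read the payload before re-checking version */
        } while ((version & 1) || (version != st->version));

        return steal;
}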
To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM. This is per-vcpu, and using the kvmclock structure for
that is an abuse we decided not to make.
In this patchset, I am introducing a new MSR, MSR_KVM_STEAL_TIME, that
holds the address of a memory area containing information about steal
time.
This patch contains the hypervisor part for it. I am keeping it
separate from the headers to facilitate backports for people who want
to backport the kernel part but not the hypervisor, or the other way
around.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++++
arch/x86/include/asm/kvm_para.h | 4 ++
arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
3 files changed, 69 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fc38eca..5dce014 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+ struct {
+ u64 msr_val;
+ gpa_t stime;
+ struct kvm_steal_time steal;
+ u64 this_time_out;
+ } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ac306c4..0341e61 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[6];
};
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
#define KVM_MAX_MMU_OP_BATCH 32
#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6645634..10fe028 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 8
+#define KVM_SAVE_MSRS_BEGIN 9
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
}
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+ u64 delta;
+
+ if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
+
+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+ vcpu->arch.st.stime = 0;
+ return;
+ }
+
+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+ vcpu->arch.st.steal.steal += delta;
+ vcpu->arch.st.steal.version += 2;
+
+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+ vcpu->arch.st.stime = 0;
+ return;
+ }
+ }
+
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
switch (msr) {
@@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+ case MSR_KVM_STEAL_TIME:
+ vcpu->arch.st.msr_val = data;
+
+ if (!(data & KVM_MSR_ENABLED)) {
+ vcpu->arch.st.stime = 0;
+ break;
+ }
+
+ if (data & KVM_STEAL_RESERVED_MASK)
+ return 1;
+
+ vcpu->arch.st.this_time_out = get_kernel_ns();
+ vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
+ record_steal_time(vcpu);
+
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_ASYNC_PF_EN:
data = vcpu->arch.apf.msr_val;
break;
+ case MSR_KVM_STEAL_TIME:
+ data = vcpu->arch.st.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
}
+
+ record_steal_time(vcpu);
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+ vcpu->arch.st.this_time_out = get_kernel_ns();
}
static int is_efer_nx(void)
@@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_STEAL_TIME);
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
@@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
kvmclock_reset(vcpu);
+ vcpu->arch.st.stime = 0;
+
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;
--
1.7.3.4
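One side note on the msrs_to_save change: listing MSR_KVM_STEAL_TIME
there lets userspace save and restore the registration across
migration with the usual KVM_GET_MSRS/KVM_SET_MSRS vcpu ioctls. A
rough, illustrative sketch of the save side in a VMM (not part of this
series; error handling trimmed) might be:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MSR_KVM_STEAL_TIME 0x4b564d03	/* value from the header added above */

static int save_steal_time_msr(int vcpu_fd, __u64 *val)
{
        struct {
                struct kvm_msrs hdr;
                struct kvm_msr_entry entry;
        } msrs;

        memset(&msrs, 0, sizeof(msrs));
        msrs.hdr.nmsrs = 1;
        msrs.entry.index = MSR_KVM_STEAL_TIME;

        if (ioctl(vcpu_fd, KVM_GET_MSRS, &msrs) != 1)
                return -1;

        /* guest physical address of the area, plus KVM_MSR_ENABLED in bit 0 */
        *val = msrs.entry.data;
        return 0;
}

Restoring on the destination side is the symmetric KVM_SET_MSRS call,
which lands in the MSR_KVM_STEAL_TIME case of kvm_set_msr_common()
above.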
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/paravirt.h | 9 +++++++++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/paravirt.c | 9 +++++++++
3 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
}
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+ return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
static inline unsigned long long paravirt_read_pmc(int counter)
{
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+ unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
};
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
}
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+ return 0;
+}
+
/* These are in entry.S */
extern void native_iret(void);
extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+ .steal_clock = native_steal_clock,
};
struct pv_irq_ops pv_irq_ops = {
--
1.7.3.4
This patch accounts steal time in kernel/sched.
I kept it from the last proposal, because I still see advantages in it:
doing it here gives us easier access to scheduler variables such as the
cpu rq. The next patch shows an example of its usage.
Since functions like account_idle_time() can be called from multiple
places, not only account_process_tick(), steal time grabbing is
repeated in each accounting function separately.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
kernel/sched.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 53 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..fa983c6 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include <asm/mutex.h>
+#include <asm/paravirt.h>
#include "sched_cpupri.h"
#include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
#endif
+ u64 prev_steal_time;
/* calc_load related fields */
unsigned long calc_load_update;
@@ -3705,6 +3707,49 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
}
/*
+ * We have to at flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static inline bool touch_steal_time(int is_idle)
+{
+ u64 steal, st = 0;
+
+ if (static_branch(&paravirt_steal_enabled)) {
+
+ steal = paravirt_steal_clock(smp_processor_id());
+
+ steal -= this_rq()->prev_steal_time;
+ if (is_idle) {
+ this_rq()->prev_steal_time += steal;
+ return false;
+ }
+
+ while (steal >= TICK_NSEC) {
+ /*
+ * Inline assembly required to prevent the compiler
+ * optimising this loop into a divmod call.
+ * See __iter_div_u64_rem() for another example of this.
+ */
+ asm("" : "+rm" (steal));
+
+ steal -= TICK_NSEC;
+ this_rq()->prev_steal_time += TICK_NSEC;
+ st++;
+ }
+
+ account_steal_time(st);
+ return !!st;
+ }
+ return false;
+}
+
+/*
* Account user cpu time to a process.
* @p: the process that the cpu time gets accounted to
* @cputime: the cpu time spent in user space since the last update
@@ -3716,6 +3761,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t tmp;
+ if (touch_steal_time(0))
+ return;
+
/* Add user time to process. */
p->utime = cputime_add(p->utime, cputime);
p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3850,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t *target_cputime64;
+ if (touch_steal_time(0))
+ return;
+
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
return;
@@ -3839,6 +3890,8 @@ void account_idle_time(cputime_t cputime)
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
+ touch_steal_time(1);
+
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
--
1.7.3.4
This is a first proposal for using steal time information to influence
the scheduler. There are a lot of optimizations and fine-grained
adjustments to be done, but it is working reasonably well for me so far
(mostly).
With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (say 80 % vs 1 %) will not get
an even distribution of processes. This is a situation that can
naturally arise, especially in overcommitted scenarios. Previously, the
guest scheduler would wrongly think that all cpus have the same ability
to run processes, lowering the overall throughput.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
kernel/sched.c | 36 +++++++++++++++++++++++++++---------
kernel/sched_features.h | 4 ++--
3 files changed, 41 insertions(+), 11 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
if PARAVIRT_GUEST
+config PARAVIRT_TIME_ACCOUNTING
+ bool "Paravirtual steal time accounting"
+ select PARAVIRT
+ default n
+ ---help---
+ Select this option to enable fine granularity task steal time
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
source "arch/x86/xen/Kconfig"
config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index fa983c6..8513898 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,9 @@ struct rq {
u64 prev_irq_time;
#endif
u64 prev_steal_time;
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ u64 prev_steal_time_acc;
+#endif
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1955,10 +1958,13 @@ void account_system_vtime(struct task_struct *curr)
}
EXPORT_SYMBOL_GPL(account_system_vtime);
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- s64 irq_delta;
+ s64 irq_delta = 0, steal = 0;
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
/*
@@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ if (static_branch((&paravirt_steal_rq_enabled))) {
+
+ steal = paravirt_steal_clock(cpu_of(rq));
+ steal -= rq->prev_steal_time_acc;
+
+ rq->prev_steal_time_acc += steal;
+
+ if (steal > delta)
+ steal = delta;
+
+ delta -= steal;
+ }
+#endif
+
rq->clock_task += delta;
- if (irq_delta && sched_feat(NONIRQ_POWER))
- sched_rt_avg_update(rq, irq_delta);
+ if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+ sched_rt_avg_update(rq, irq_delta + steal);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2021,12 +2044,7 @@ static int irqtime_account_si_update(void)
#define sched_clock_irqtime (0)
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
- rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(OWNER_SPIN, 1)
/*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
*/
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
/*
* Queue remote wakeups on the target CPU and process them
--
1.7.3.4
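To make the effect above concrete: the stolen part of each rq clock
delta is kept out of clock_task and fed into sched_rt_avg_update()
instead, exactly as irq time already is, and scale_rt_power() later
turns that running average into a reduced cpu_power. Roughly speaking,
a vcpu that saw about 30 % steal over the averaging window ends up
advertising about 70 % of its nominal cpu_power, so the load balancer
places correspondingly fewer tasks on it. That is what keeps the 80 %
vs 1 % case from the changelog from looking like two equally capable
cpus to the guest scheduler.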
Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that tracks the last value
read. We then account the difference.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/kernel-parameters.txt | 4 ++
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 73 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
4 files changed, 80 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
+ no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
+ steal time is computed, but won't influence scheduler
+ behaviour
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 0341e61..2a8f2a5 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
extern void kvmclock_init(void);
extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
/* This instruction is vmcall. On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
early_param("no-kvmapf", parse_no_kvmapf);
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+ steal_acc = 0;
+ return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
static DEFINE_PER_CPU(struct kvm_para_state, para_state);
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
static struct kvm_para_state *kvm_para_state(void)
{
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
#endif
}
+static void kvm_register_steal_time(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+ if (!has_steal_clock)
+ return;
+
+ memset(st, 0, sizeof(*st));
+
+ wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+ cpu, __pa(st));
+}
+
void __cpuinit kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
printk(KERN_INFO"KVM setup async PF for cpu %d\n",
smp_processor_id());
}
+
+ if (has_steal_clock)
+ kvm_register_steal_time();
}
static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};
+static u64 kvm_steal_clock(int cpu)
+{
+ u64 steal;
+ struct kvm_steal_time *src;
+ int version;
+
+ src = &per_cpu(steal_time, cpu);
+ do {
+ version = src->version;
+ rmb();
+ steal = src->steal;
+ rmb();
+ } while ((version & 1) || (version != src->version));
+
+ return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
static void kvm_guest_cpu_offline(void *dummy)
{
+ kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
}
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
+ if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ has_steal_clock = 1;
+ pv_time_ops.steal_clock = kvm_steal_clock;
+ }
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
@@ -555,3 +615,16 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif
}
+
+static __init int activate_jump_labels(void)
+{
+ if (has_steal_clock) {
+ jump_label_inc(&paravirt_steal_enabled);
+ if (steal_acc)
+ jump_label_inc(&paravirt_steal_rq_enabled);
+ }
+
+ return 0;
+}
+arch_initcall(activate_jump_labels);
+
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 6389a6b..c1a0188 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -160,6 +160,7 @@ static void __cpuinit kvm_setup_secondary_clock(void)
static void kvm_crash_shutdown(struct pt_regs *regs)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_crash_shutdown(regs);
}
#endif
@@ -167,6 +168,7 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
static void kvm_shutdown(void)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_shutdown();
}
--
1.7.3.4
On Thu, 16 Jun 2011, Glauber Costa wrote:
> This patch is simple, put in a different commit so it can be more easily
> shared between guest and hypervisor. It just defines a named constant
> to indicate the enable bit for KVM-specific MSRs.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who wants to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> This patch adds a function pointer in one of the many paravirt_ops
> structs, to allow guests to register a steal time function.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Thu, 16 Jun 2011, Glauber Costa wrote:
> Register steal time within KVM. Everytime we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On 06/17/2011 01:20 AM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who wants to backport the kernel part but not the
> hypervisor, or the other way around.
>
>
> index d079aed..79c12a7 100644
> --- a/Documentation/virtual/kvm/msr.txt
> +++ b/Documentation/virtual/kvm/msr.txt
> @@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
>
> Currently type 2 APF will be always delivered on the same vcpu as
> type 1 was, but guest should not rely on that.
> +
> +MSR_KVM_STEAL_TIME: 0x4b564d03
> +
> + data: 64-byte alignment physical address of a memory area which must be
> + in guest RAM, plus an enable bit in bit 0. This memory is expected to
> + hold a copy of the following structure:
> +
> + struct kvm_steal_time {
> + __u64 steal;
> + __u32 version;
> + __u32 flags;
> + __u32 pad[6];
Should be 12 to be a 64-byte structure, no?
--
error compiling committee.c: too many arguments to function
On 06/17/2011 01:20 AM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
>
>
> +#define KVM_STEAL_ALIGNMENT_BITS 5
> +#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
> +#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
Clumsy, but okay.
> +static void record_steal_time(struct kvm_vcpu *vcpu)
> +{
> + u64 delta;
> +
> + if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
0 is a valid value for stime.
> +
> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> +
> + vcpu->arch.st.stime = 0;
> + return;
> + }
> +
> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> +
> + vcpu->arch.st.steal.steal += delta;
> + vcpu->arch.st.steal.version += 2;
> +
> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> +
> + vcpu->arch.st.stime = 0;
> + return;
> + }
> + }
> +
> +}
> +
>
> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> kvm_migrate_timers(vcpu);
> vcpu->cpu = cpu;
> }
> +
> + record_steal_time(vcpu);
> }
This records time spent in userspace in the vcpu thread as steal time.
Is this what we want? Or just time preempted away?
--
error compiling committee.c: too many arguments to function
On 06/17/2011 01:20 AM, Glauber Costa wrote:
> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separatedely.
>
> /*
> + * We have to at flush steal time information every time something else
> + * is accounted. Since the accounting functions are all visible to the rest
> + * of the kernel, it gets tricky to do them in one place. This helper function
> + * helps us.
> + *
> + * When the system is idle, the concept of steal time does not apply. We just
> + * tell the underlying hypervisor that we grabbed the data, but skip steal time
> + * accounting
> + */
> +static inline bool touch_steal_time(int is_idle)
> +{
> + u64 steal, st = 0;
> +
> + if (static_branch(&paravirt_steal_enabled)) {
> +
> + steal = paravirt_steal_clock(smp_processor_id());
> +
> + steal -= this_rq()->prev_steal_time;
> + if (is_idle) {
> + this_rq()->prev_steal_time += steal;
> + return false;
> + }
> +
> + while (steal>= TICK_NSEC) {
> + /*
> + * Inline assembly required to prevent the compiler
> + * optimising this loop into a divmod call.
> + * See __iter_div_u64_rem() for another example of this.
> + */
Why not use said function?
> + asm("" : "+rm" (steal));
> +
> + steal -= TICK_NSEC;
> + this_rq()->prev_steal_time += TICK_NSEC;
> + st++;
Suppose a live migration or SIGSTOP causes lots of steal time. How long
will we spend here?
> + }
> +
> + account_steal_time(st);
> + return !!st;
!! !needed, you're returning a bool.
> + }
> + return false;
> +}
> +
I'll need Peter's (or another sched maintainer's) review to apply this.
--
error compiling committee.c: too many arguments to function
On 06/17/2011 01:20 AM, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
>
Looks fine, but sched maintainer review needed.
--
error compiling committee.c: too many arguments to function
On 06/19/2011 07:04 AM, Avi Kivity wrote:
> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>> This patch accounts steal time time in kernel/sched.
>> I kept it from last proposal, because I still see advantages
>> in it: Doing it here will give us easier access from scheduler
>> variables such as the cpu rq. The next patch shows an example of
>> usage for it.
>>
>> Since functions like account_idle_time() can be called from
>> multiple places, not only account_process_tick(), steal time
>> grabbing is repeated in each account function separatedely.
>>
>> /*
>> + * We have to at flush steal time information every time something else
>> + * is accounted. Since the accounting functions are all visible to
>> the rest
>> + * of the kernel, it gets tricky to do them in one place. This helper
>> function
>> + * helps us.
>> + *
>> + * When the system is idle, the concept of steal time does not apply.
>> We just
>> + * tell the underlying hypervisor that we grabbed the data, but skip
>> steal time
>> + * accounting
>> + */
>> +static inline bool touch_steal_time(int is_idle)
>> +{
>> + u64 steal, st = 0;
>> +
>> + if (static_branch(&paravirt_steal_enabled)) {
>> +
>> + steal = paravirt_steal_clock(smp_processor_id());
>> +
>> + steal -= this_rq()->prev_steal_time;
>> + if (is_idle) {
>> + this_rq()->prev_steal_time += steal;
>> + return false;
>> + }
>> +
>> + while (steal>= TICK_NSEC) {
>> + /*
>> + * Inline assembly required to prevent the compiler
>> + * optimising this loop into a divmod call.
>> + * See __iter_div_u64_rem() for another example of this.
>> + */
>
> Why not use said function?
because here we want to do work during each loop iteration. The said
function would have to be adapted for that, possibly using a macro, to
run arbitrary code during each iteration, in a way that I don't think
is worth it given the current number of callers (2, counting this new
one).
>> + asm("" : "+rm" (steal));
>> +
>> + steal -= TICK_NSEC;
>> + this_rq()->prev_steal_time += TICK_NSEC;
>> + st++;
>
> Suppose a live migration or SIGSTOP causes lots of steal time. How long
> will we spend here?
Silly me. I actually used this same argument with Peter to cap it with
"delta" in the next patch in this series. So I think you are 100 %
right. Here, however, we do want to account all that time, I believe.
How about we do a slow division if we're > 10 sec (unlikely), and
account everything as steal time in that scenario?
>> + }
>> +
>> + account_steal_time(st);
>> + return !!st;
>
> !! !needed, you're returning a bool.
ah, sure thing.
>> + }
>> + return false;
>> +}
>> +
>
> I'll need Peter's (or another sched maintainer's) review to apply this.
>
On 06/19/2011 06:57 AM, Avi Kivity wrote:
> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>> To implement steal time, we need the hypervisor to pass the guest
>> information
>> about how much time was spent running other processes outside the VM.
>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>> we decided not to make.
>>
>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>> holds the memory area address containing information about steal time
>>
>> This patch contains the hypervisor part for it. I am keeping it
>> separate from
>> the headers to facilitate backports to people who wants to backport
>> the kernel
>> part but not the hypervisor, or the other way around.
>>
>>
>>
>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>> +#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
>> +#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1
>> )<< 1)
>
> Clumsy, but okay.
>
>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>> +{
>> + u64 delta;
>> +
>> + if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
>
> 0 is a valid value for stime.
how exactly? stime is a guest physical address...
>> +
>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> +
>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>> +
>> + vcpu->arch.st.steal.steal += delta;
>> + vcpu->arch.st.steal.version += 2;
>> +
>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> + }
>> +
>> +}
>> +
>>
>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu,
>> int cpu)
>> kvm_migrate_timers(vcpu);
>> vcpu->cpu = cpu;
>> }
>> +
>> + record_steal_time(vcpu);
>> }
>
> This records time spent in userspace in the vcpu thread as steal time.
> Is this what we want? Or just time preempted away?
There are arguments either way.
Right now, the way it is, it does account our iothread as steal time,
which is not 100 % accurate if we think of steal time as "whatever
takes time away from our VM". I tend to think of it as "whatever takes
time away from this CPU", which includes other cpus in the same VM.
Thinking this way, in a 1-1 phys-to-virt cpu mapping, if the iothread
is taking 80 % cpu for whatever reason, we have 80 % steal time on the
cpu that is sharing the physical cpu with the iothread.
Maybe we could account that as iotime?
Questions like that are one of the reasons behind me leaving extra
fields in the steal time structure. We could do more fine-grained
accounting and differentiate between the multiple entities that can do
work (of various kinds) on our behalf.
On 06/19/2011 06:49 AM, Avi Kivity wrote:
> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>> To implement steal time, we need the hypervisor to pass the guest
>> information
>> about how much time was spent running other processes outside the VM.
>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>> we decided not to make.
>>
>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>> holds the memory area address containing information about steal time
>>
>> This patch contains the headers for it. I am keeping it separate to
>> facilitate
>> backports to people who wants to backport the kernel part but not the
>> hypervisor, or the other way around.
>>
>>
>> index d079aed..79c12a7 100644
>> --- a/Documentation/virtual/kvm/msr.txt
>> +++ b/Documentation/virtual/kvm/msr.txt
>> @@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
>>
>> Currently type 2 APF will be always delivered on the same vcpu as
>> type 1 was, but guest should not rely on that.
>> +
>> +MSR_KVM_STEAL_TIME: 0x4b564d03
>> +
>> + data: 64-byte alignment physical address of a memory area which must be
>> + in guest RAM, plus an enable bit in bit 0. This memory is expected to
>> + hold a copy of the following structure:
>> +
>> + struct kvm_steal_time {
>> + __u64 steal;
>> + __u32 version;
>> + __u32 flags;
>> + __u32 pad[6];
>
> Should be 12 to be a 64-byte structure, no?
Avi, you should understand this is a discussion between 3 people: One of
them can do math, the other can't.
On 06/20/2011 05:53 AM, Glauber Costa wrote:
>
>>
>>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>>> +{
>>> + u64 delta;
>>> +
>>> + if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
>>
>> 0 is a valid value for stime.
>
>
> how exactly? stime is a guest physical address...
0 is a valid physical address.
>>>
>>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu,
>>> int cpu)
>>> kvm_migrate_timers(vcpu);
>>> vcpu->cpu = cpu;
>>> }
>>> +
>>> + record_steal_time(vcpu);
>>> }
>>
>> This records time spent in userspace in the vcpu thread as steal time.
>> Is this what we want? Or just time preempted away?
>
> There are arguments either way.
>
> Right now, the way it is, it does account our iothread as steal time,
> which is not 100 % accurate if we think steal time as "whatever takes
> time away from our VM". I tend to think it as "whatever takes time
> away from this CPU", which includes other cpus in the same VM. So
> thinking this way, in a 1-1 phys-to-virt cpu mapping, if the iothread
> is taking 80 % cpu for whatever reason, we have 80 % steal time the
> cpu that is sharing the physical cpu with the iothread.
I'm not talking about the iothread, rather the vcpu thread while running
in userspace.
>
> Maybe we could account that as iotime ?
> Questions like that are one of the reasons behind me leaving extra
> fields in the steal time structure. We could do a more fine grained
> accounting and differentiate between the multiple entities that can do
> work (of various kinds) in our behalf.
>
What do other architectures do (xen, s390)?
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 06/20/2011 05:38 AM, Glauber Costa wrote:
> On 06/19/2011 07:04 AM, Avi Kivity wrote:
>> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>>> This patch accounts steal time time in kernel/sched.
>>> I kept it from last proposal, because I still see advantages
>>> in it: Doing it here will give us easier access from scheduler
>>> variables such as the cpu rq. The next patch shows an example of
>>> usage for it.
>>>
>>> Since functions like account_idle_time() can be called from
>>> multiple places, not only account_process_tick(), steal time
>>> grabbing is repeated in each account function separatedely.
>>>
>>> /*
>>> + * We have to at flush steal time information every time something
>>> else
>>> + * is accounted. Since the accounting functions are all visible to
>>> the rest
>>> + * of the kernel, it gets tricky to do them in one place. This helper
>>> function
>>> + * helps us.
>>> + *
>>> + * When the system is idle, the concept of steal time does not apply.
>>> We just
>>> + * tell the underlying hypervisor that we grabbed the data, but skip
>>> steal time
>>> + * accounting
>>> + */
>>> +static inline bool touch_steal_time(int is_idle)
>>> +{
>>> + u64 steal, st = 0;
>>> +
>>> + if (static_branch(&paravirt_steal_enabled)) {
>>> +
>>> + steal = paravirt_steal_clock(smp_processor_id());
>>> +
>>> + steal -= this_rq()->prev_steal_time;
>>> + if (is_idle) {
>>> + this_rq()->prev_steal_time += steal;
>>> + return false;
>>> + }
>>> +
>>> + while (steal>= TICK_NSEC) {
>>> + /*
>>> + * Inline assembly required to prevent the compiler
>>> + * optimising this loop into a divmod call.
>>> + * See __iter_div_u64_rem() for another example of this.
>>> + */
>>
>> Why not use said function?
>
> because here we want to do work during each loop. The said function
> would have to be adapted for that, possibly using a macro, to run
> arbitrary code during each loop iteration, in a way that I don't think
> it is worthy given the current number of callers (2 counting this new
> one)
You mean adding to prev_steal_time? That can be done outside the loop.
>
>>> + asm("" : "+rm" (steal));
>>> +
>>> + steal -= TICK_NSEC;
>>> + this_rq()->prev_steal_time += TICK_NSEC;
>>> + st++;
>>
>> Suppose a live migration or SIGSTOP causes lots of steal time. How long
>> will we spend here?
> Silly me. I actually used this same argument with Peter to cap it with
> "delta" in the next patch in this series. So I think you are 100 %
> right. Here, however, we do want to account all that time, I believe.
>
> How about we do a slow division if we're > 10 sec (unlikely), and
> account everything as steal time in this scenario ?
Okay. Division would be faster for a lot less than 10s though.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
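To make the alternative being discussed concrete, here is a rough
sketch with prev_steal_time advanced outside the loop and a division
fallback for very large backlogs (illustrative only; the
STEAL_DIV_THRESHOLD cutoff is made up for the example and is not part
of the patches):

#define STEAL_DIV_THRESHOLD	(10 * NSEC_PER_SEC)

static inline bool touch_steal_time(int is_idle)
{
        u64 steal, ticks = 0;

        if (!static_branch(&paravirt_steal_enabled))
                return false;

        steal = paravirt_steal_clock(smp_processor_id()) -
                this_rq()->prev_steal_time;

        if (is_idle) {
                this_rq()->prev_steal_time += steal;
                return false;
        }

        if (unlikely(steal >= STEAL_DIV_THRESHOLD)) {
                /* huge backlog (live migration, SIGSTOP): one division */
                ticks = div_u64(steal, TICK_NSEC);
        } else {
                while (steal >= TICK_NSEC) {
                        /* keep the compiler from turning this into divmod */
                        asm("" : "+rm" (steal));
                        steal -= TICK_NSEC;
                        ticks++;
                }
        }

        this_rq()->prev_steal_time += ticks * TICK_NSEC;
        account_steal_time(ticks);
        return ticks > 0;
}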
On Sun, Jun 19, 2011 at 12:57:53PM +0300, Avi Kivity wrote:
> On 06/17/2011 01:20 AM, Glauber Costa wrote:
> >To implement steal time, we need the hypervisor to pass the guest information
> >about how much time was spent running other processes outside the VM.
> >This is per-vcpu, and using the kvmclock structure for that is an abuse
> >we decided not to make.
> >
> >In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> >holds the memory area address containing information about steal time
> >
> >This patch contains the hypervisor part for it. I am keeping it separate from
> >the headers to facilitate backports to people who wants to backport the kernel
> >part but not the hypervisor, or the other way around.
> >
> >
> >
> >+#define KVM_STEAL_ALIGNMENT_BITS 5
> >+#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
> >+#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
>
> Clumsy, but okay.
>
> >+static void record_steal_time(struct kvm_vcpu *vcpu)
> >+{
> >+ u64 delta;
> >+
> >+ if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
>
> 0 is a valid value for stime.
>
> >+
> >+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> >+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >+
> >+ vcpu->arch.st.stime = 0;
> >+ return;
> >+ }
> >+
> >+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> >+
> >+ vcpu->arch.st.steal.steal += delta;
> >+ vcpu->arch.st.steal.version += 2;
> >+
> >+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> >+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >+
> >+ vcpu->arch.st.stime = 0;
> >+ return;
> >+ }
> >+ }
> >+
> >+}
> >+
> >
> >@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> > kvm_migrate_timers(vcpu);
> > vcpu->cpu = cpu;
> > }
> >+
> >+ record_steal_time(vcpu);
> > }
>
> This records time spent in userspace in the vcpu thread as steal
> time. Is this what we want? Or just time preempted away?
It also accounts halt time (kvm_vcpu_block) as steal time. Glauber, you
could instead use the "runnable-state-but-waiting-in-runqueue" field of
SCHEDSTATS, I forgot the exact name.
On 06/20/2011 05:56 PM, Marcelo Tosatti wrote:
> On Sun, Jun 19, 2011 at 12:57:53PM +0300, Avi Kivity wrote:
>> On 06/17/2011 01:20 AM, Glauber Costa wrote:
>>> To implement steal time, we need the hypervisor to pass the guest information
>>> about how much time was spent running other processes outside the VM.
>>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>>> we decided not to make.
>>>
>>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>>> holds the memory area address containing information about steal time
>>>
>>> This patch contains the hypervisor part for it. I am keeping it separate from
>>> the headers to facilitate backports to people who wants to backport the kernel
>>> part but not the hypervisor, or the other way around.
>>>
>>>
>>>
>>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>>> +#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
>>> +#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
>>
>> Clumsy, but okay.
>>
>>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>>> +{
>>> + u64 delta;
>>> +
>>> + if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
>>
>> 0 is a valid value for stime.
>>
>>> +
>>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>> +
>>> + vcpu->arch.st.stime = 0;
>>> + return;
>>> + }
>>> +
>>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>>> +
>>> + vcpu->arch.st.steal.steal += delta;
>>> + vcpu->arch.st.steal.version += 2;
>>> +
>>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>> +
>>> + vcpu->arch.st.stime = 0;
>>> + return;
>>> + }
>>> + }
>>> +
>>> +}
>>> +
>>>
>>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>> kvm_migrate_timers(vcpu);
>>> vcpu->cpu = cpu;
>>> }
>>> +
>>> + record_steal_time(vcpu);
>>> }
>>
>> This records time spent in userspace in the vcpu thread as steal
>> time. Is this what we want? Or just time preempted away?
>
> It also accounts halt time (kvm_vcpu_block) as steal time. Glauber, you
> could instead use the "runnable-state-but-waiting-in-runqueue" field of
> SCHEDSTATS, i forgot the exact name.
>
I thought about it in the past. I set the idea aside because I didn't
want to introduce a dependency on SCHEDSTATS. But thinking about it
again now (and after some days of experimentation with it), I think we
could have both:
use run_delay (the field you were thinking of) when schedstats are
available, and fall back to an estimation method like the one we're
doing when they are not.
Objections?
On Tue, Jun 28, 2011 at 09:30:29AM -0300, Glauber Costa wrote:
> On 06/20/2011 05:56 PM, Marcelo Tosatti wrote:
> >On Sun, Jun 19, 2011 at 12:57:53PM +0300, Avi Kivity wrote:
> >>On 06/17/2011 01:20 AM, Glauber Costa wrote:
> >>>To implement steal time, we need the hypervisor to pass the guest information
> >>>about how much time was spent running other processes outside the VM.
> >>>This is per-vcpu, and using the kvmclock structure for that is an abuse
> >>>we decided not to make.
> >>>
> >>>In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> >>>holds the memory area address containing information about steal time
> >>>
> >>>This patch contains the hypervisor part for it. I am keeping it separate from
> >>>the headers to facilitate backports to people who wants to backport the kernel
> >>>part but not the hypervisor, or the other way around.
> >>>
> >>>
> >>>
> >>>+#define KVM_STEAL_ALIGNMENT_BITS 5
> >>>+#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
> >>>+#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
> >>
> >>Clumsy, but okay.
> >>
> >>>+static void record_steal_time(struct kvm_vcpu *vcpu)
> >>>+{
> >>>+ u64 delta;
> >>>+
> >>>+ if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
> >>
> >>0 is a valid value for stime.
> >>
> >>>+
> >>>+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>+
> >>>+ vcpu->arch.st.stime = 0;
> >>>+ return;
> >>>+ }
> >>>+
> >>>+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> >>>+
> >>>+ vcpu->arch.st.steal.steal += delta;
> >>>+ vcpu->arch.st.steal.version += 2;
> >>>+
> >>>+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>+
> >>>+ vcpu->arch.st.stime = 0;
> >>>+ return;
> >>>+ }
> >>>+ }
> >>>+
> >>>+}
> >>>+
> >>>
> >>>@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>> kvm_migrate_timers(vcpu);
> >>> vcpu->cpu = cpu;
> >>> }
> >>>+
> >>>+ record_steal_time(vcpu);
> >>> }
> >>
> >>This records time spent in userspace in the vcpu thread as steal
> >>time. Is this what we want? Or just time preempted away?
> >
> >It also accounts halt time (kvm_vcpu_block) as steal time. Glauber, you
> >could instead use the "runnable-state-but-waiting-in-runqueue" field of
> >SCHEDSTATS, i forgot the exact name.
> >
> I thought about it in the past. I let the idea aside because I
> didn't want to introduce a dependency on SCHEDSTATS. But thinking
> about it again now (and after some days of experimentations with
> it), I think we could have both.
>
> use run_delay (the field you were thinking of) when schedstats are
> available, and fallback to an estimate method like the one we're
> doing when it is not.
>
> Objections ?
I'm okay with that.
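For reference, a rough sketch of the direction agreed on here, on the
host side: accumulate steal from the vcpu task's schedstats run_delay
(time spent runnable but waiting on a runqueue) instead of wall-clock
time between vcpu_put and vcpu_load, and gate on the enable bit rather
than on stime, since 0 is a valid guest physical address. This is only
an illustration of the idea, not a final patch; last_steal is a
hypothetical new field, and current->sched_info.run_delay is only
maintained when schedstats or delay accounting are enabled:

static void record_steal_time(struct kvm_vcpu *vcpu)
{
        u64 run_delay;

        if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
                return;

        /* run_delay is in nanoseconds, the same unit the guest expects */
        run_delay = current->sched_info.run_delay;
        vcpu->arch.st.steal.steal += run_delay - vcpu->arch.st.last_steal;
        vcpu->arch.st.last_steal = run_delay;
        vcpu->arch.st.steal.version += 2;

        if (kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
                            &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))
                vcpu->arch.st.msr_val &= ~KVM_MSR_ENABLED; /* give up on error */
}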