2011-06-13 23:38:51

by Glauber Costa

Subject: [PATCH 0/7] KVM steal time implementation

Hi,

This series is a repost of the previous series I posted on this topic.
It tries to address most of the concerns that were raised at the time,
and it makes use of the static_branch interface to disable the
steal time code when not in use.


Glauber Costa (7):
KVM-HDR Add constant to represent KVM MSRs enabled bit
KVM-HDR: KVM Steal time implementation
KVM-HV: KVM Steal time implementation
KVM-GST: Add a pv_ops stub for steal time
KVM-GST: KVM Steal time accounting
KVM-GST: adjust scheduler cpu power
KVM-GST: KVM Steal time registration

Documentation/kernel-parameters.txt | 4 ++
Documentation/virtual/kvm/msr.txt | 33 +++++++++++++
arch/x86/Kconfig | 12 +++++
arch/x86/include/asm/kvm_host.h | 8 +++
arch/x86/include/asm/kvm_para.h | 15 ++++++
arch/x86/include/asm/paravirt.h | 9 ++++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/kvm.c | 72 +++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
arch/x86/kernel/paravirt.c | 9 ++++
arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++-
kernel/sched.c | 81 +++++++++++++++++++++++++++++----
kernel/sched_features.h | 4 +-
13 files changed, 296 insertions(+), 14 deletions(-)

--
1.7.3.4


2011-06-13 23:39:03

by Glauber Costa

Subject: [PATCH 1/7] KVM-HDR Add constant to represent KVM MSRs enabled bit

This patch is simple, and kept in a separate commit so it can be more
easily shared between guest and hypervisor. It just defines a named
constant for the enable bit of KVM-specific MSRs.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12

+#define KVM_MSR_ENABLED 1
/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
--
1.7.3.4

2011-06-13 23:39:13

by Glauber Costa

Subject: [PATCH 2/7] KVM-HDR: KVM Steal time implementation

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new MSR, MSR_KVM_STEAL_TIME, that
holds the address of a memory area containing the steal time information.

This patch contains the headers for it. I am keeping it separate to
facilitate backports: some people may want to backport the kernel part
but not the hypervisor part, or the other way around.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 33 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_para.h | 9 +++++++++
2 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..79c12a7 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02

Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+ data: 64-byte aligned physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure:
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[6];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version: guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ flags: At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal: the amount of time in which this vCPU did not run, in
+ nanoseconds.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..ac306c4 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
+#define KVM_FEATURE_STEAL_TIME 5

/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME 0x4b564d03
+
+struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[6];
+};

#define KVM_MAX_MMU_OP_BATCH 32

--
1.7.3.4

2011-06-13 23:39:07

by Glauber Costa

Subject: [PATCH 3/7] KVM-HV: KVM Steal time implementation

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new MSR, MSR_KVM_STEAL_TIME, that
holds the address of a memory area containing the steal time information.

This patch contains the hypervisor part. I am keeping it separate from
the headers to facilitate backports: some people may want to backport the
kernel part but not the hypervisor part, or the other way around.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 +++++
arch/x86/include/asm/kvm_para.h | 4 ++
arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
3 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fc38eca..5dce014 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+ struct {
+ u64 msr_val;
+ gpa_t stime;
+ struct kvm_steal_time steal;
+ u64 this_time_out;
+ } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index ac306c4..0341e61 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[6];
};

+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
#define KVM_MAX_MMU_OP_BATCH 32

#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6645634..10fe028 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
* kvm-specific. Those are put in the beginning of the list.
*/

-#define KVM_SAVE_MSRS_BEGIN 8
+#define KVM_SAVE_MSRS_BEGIN 9
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
}

+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+ u64 delta;
+
+ if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
+
+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+ vcpu->arch.st.stime = 0;
+ return;
+ }
+
+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+ vcpu->arch.st.steal.steal += delta;
+ vcpu->arch.st.steal.version += 2;
+
+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
+
+ vcpu->arch.st.stime = 0;
+ return;
+ }
+ }
+
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
switch (msr) {
@@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+ case MSR_KVM_STEAL_TIME:
+ vcpu->arch.st.msr_val = data;
+
+ if (!(data & KVM_MSR_ENABLED)) {
+ vcpu->arch.st.stime = 0;
+ break;
+ }
+
+ if (data & KVM_STEAL_RESERVED_MASK)
+ return 1;
+
+ vcpu->arch.st.this_time_out = get_kernel_ns();
+ vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
+ record_steal_time(vcpu);
+
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_ASYNC_PF_EN:
data = vcpu->arch.apf.msr_val;
break;
+ case MSR_KVM_STEAL_TIME:
+ data = vcpu->arch.st.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
}
+
+ record_steal_time(vcpu);
}

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+ vcpu->arch.st.this_time_out = get_kernel_ns();
}

static int is_efer_nx(void)
@@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_STEAL_TIME);
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
@@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)

kvmclock_reset(vcpu);

+ vcpu->arch.st.stime = 0;
+
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;
--
1.7.3.4

2011-06-13 23:40:28

by Glauber Costa

Subject: [PATCH 4/7] KVM-GST: Add a pv_ops stub for steal time

This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/paravirt.h | 9 +++++++++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/paravirt.c | 9 +++++++++
3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
}

+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+ return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
static inline unsigned long long paravirt_read_pmc(int counter)
{
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {

struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+ unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
};

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
}

+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+ return 0;
+}
+
/* These are in entry.S */
extern void native_iret(void);
extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {

struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+ .steal_clock = native_steal_clock,
};

struct pv_irq_ops pv_irq_ops = {
--
1.7.3.4

2011-06-13 23:40:48

by Glauber Costa

Subject: [PATCH 5/7] KVM-GST: KVM Steal time accounting

This patch accounts steal time in kernel/sched.
I kept it from the last proposal because I still see advantages
in it: doing it here gives us easier access to scheduler
variables such as the cpu rq. The next patch shows an example
of its usage.

Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each accounting function separately.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
kernel/sched.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..154cb14 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include <asm/mutex.h>
+#include <asm/paravirt.h>

#include "sched_cpupri.h"
#include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
#endif
+ u64 prev_steal_time;

/* calc_load related fields */
unsigned long calc_load_update;
@@ -3705,6 +3707,41 @@ unsigned long long thread_group_sched_runtime(struct task_struct *p)
}

/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static inline int touch_steal_time(int is_idle)
+{
+ u64 steal, st = 0;
+
+ if (static_branch(&paravirt_steal_enabled)) {
+
+ steal = paravirt_steal_clock(smp_processor_id());
+
+ steal -= this_rq()->prev_steal_time;
+ this_rq()->prev_steal_time += steal;
+
+ if (is_idle || (steal < TICK_NSEC))
+ return 0;
+
+ while (steal > TICK_NSEC) {
+ steal -= TICK_NSEC;
+ st++;
+ }
+
+ account_steal_time(st);
+ return 1;
+ }
+ return 0;
+}
+
+/*
* Account user cpu time to a process.
* @p: the process that the cpu time gets accounted to
* @cputime: the cpu time spent in user space since the last update
@@ -3716,6 +3753,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t tmp;

+ if (touch_steal_time(0))
+ return;
+
/* Add user time to process. */
p->utime = cputime_add(p->utime, cputime);
p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3842,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t *target_cputime64;

+ if (touch_steal_time(0))
+ return;
+
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
return;
@@ -3839,6 +3882,8 @@ void account_idle_time(cputime_t cputime)
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();

+ touch_steal_time(1);
+
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
--
1.7.3.4

2011-06-13 23:39:21

by Glauber Costa

Subject: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine-grained adjustments still to be done, but it is working
reasonably well for me so far (mostly).

With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (say 80% vs 1%) will not get
an even distribution of processes. This is a situation that can naturally
arise, especially in overcommitted scenarios. Previously, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
kernel/sched.c | 36 +++++++++++++++++++++++++++---------
kernel/sched_features.h | 4 ++--
3 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST

if PARAVIRT_GUEST

+config PARAVIRT_TIME_ACCOUNTING
+ bool "Paravirtual steal time accounting"
+ select PARAVIRT
+ default n
+ ---help---
+ Select this option to enable fine granularity task steal time
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
source "arch/x86/xen/Kconfig"

config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index 154cb14..d6b2749 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -530,6 +530,9 @@ struct rq {
u64 prev_irq_time;
#endif
u64 prev_steal_time;
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ u64 prev_steal_time_acc;
+#endif

/* calc_load related fields */
unsigned long calc_load_update;
@@ -1955,10 +1958,13 @@ void account_system_vtime(struct task_struct *curr)
}
EXPORT_SYMBOL_GPL(account_system_vtime);

+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- s64 irq_delta;
+ s64 irq_delta = 0, steal = 0;

+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;

/*
@@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ if (static_branch((&paravirt_steal_rq_enabled))) {
+
+ steal = paravirt_steal_clock(cpu_of(rq));
+ steal -= rq->prev_steal_time_acc;
+
+ rq->prev_steal_time_acc += steal;
+
+ if (steal > delta)
+ steal = delta;
+
+ delta -= steal;
+ }
+#endif
+
rq->clock_task += delta;

- if (irq_delta && sched_feat(NONIRQ_POWER))
- sched_rt_avg_update(rq, irq_delta);
+ if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+ sched_rt_avg_update(rq, irq_delta + steal);
}

+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2021,12 +2044,7 @@ static int irqtime_account_si_update(void)

#define sched_clock_irqtime (0)

-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
- rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif

#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(OWNER_SPIN, 1)

/*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
*/
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)

/*
* Queue remote wakeups on the target CPU and process them
--
1.7.3.4

2011-06-13 23:39:56

by Glauber Costa

Subject: [PATCH 7/7] KVM-GST: KVM Steal time registration

Register steal time within KVM. Every time we sample the steal time
information, we update a local variable recording the last value read,
and account the difference.

Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/kernel-parameters.txt | 4 ++
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 72 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
4 files changed, 79 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.

+ no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
+ Steal time is still computed, but it won't influence
+ scheduler behaviour.
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.

nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 0341e61..2a8f2a5 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {

extern void kvmclock_init(void);
extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);


/* This instruction is vmcall. On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..5a5ac19 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)

early_param("no-kvmapf", parse_no_kvmapf);

+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+ steal_acc = 0;
+ return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {

static DEFINE_PER_CPU(struct kvm_para_state, para_state);
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;

static struct kvm_para_state *kvm_para_state(void)
{
@@ -483,23 +494,66 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};

+static void kvm_register_steal_time(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+ if (!has_steal_clock)
+ return;
+
+ memset(st, 0, sizeof(*st));
+
+ wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+ cpu, __pa(st));
+}
+
+static u64 kvm_steal_clock(int cpu)
+{
+ u64 steal;
+ struct kvm_steal_time *src;
+ int version;
+
+ src = &per_cpu(steal_time, cpu);
+ do {
+ version = src->version;
+ rmb();
+ steal = src->steal;
+ rmb();
+ } while ((version & 1) || (version != src->version));
+
+ return steal;
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
#ifdef CONFIG_KVM_CLOCK
WARN_ON(kvm_register_clock("primary cpu clock"));
#endif
+ kvm_register_steal_time();
kvm_guest_cpu_init();
native_smp_prepare_boot_cpu();
}

static void __cpuinit kvm_guest_cpu_online(void *dummy)
{
+ kvm_register_steal_time();
kvm_guest_cpu_init();
}

+void kvm_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
static void kvm_guest_cpu_offline(void *dummy)
{
+ kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
}
@@ -548,6 +602,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;

+ if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ has_steal_clock = 1;
+ pv_time_ops.steal_clock = kvm_steal_clock;
+ }
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
@@ -555,3 +614,16 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif
}
+
+static __init int activate_jump_labels(void)
+{
+ if (has_steal_clock) {
+ jump_label_inc(&paravirt_steal_enabled);
+ if (steal_acc)
+ jump_label_inc(&paravirt_steal_rq_enabled);
+ }
+
+ return 0;
+}
+arch_initcall(activate_jump_labels);
+
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 6389a6b..c1a0188 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -160,6 +160,7 @@ static void __cpuinit kvm_setup_secondary_clock(void)
static void kvm_crash_shutdown(struct pt_regs *regs)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_crash_shutdown(regs);
}
#endif
@@ -167,6 +168,7 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
static void kvm_shutdown(void)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_shutdown();
}

--
1.7.3.4

2011-06-14 00:59:47

by Rik van Riel

Subject: Re: [PATCH 1/7] KVM-HDR Add constant to represent KVM MSRs enabled bit

On 06/13/2011 07:31 PM, Glauber Costa wrote:
> This patch is simple, put in a different commit so it can be more easily
> shared between guest and hypervisor. It just defines a named constant
> to indicate the enable bit for KVM-specific MSRs.
>
> Signed-off-by: Glauber Costa<[email protected]>
> CC: Rik van Riel<[email protected]>
> CC: Jeremy Fitzhardinge<[email protected]>
> CC: Peter Zijlstra<[email protected]>
> CC: Avi Kivity<[email protected]>
> CC: Anthony Liguori<[email protected]>
> CC: Eric B Munson<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-14 01:19:06

by Eric B Munson

Subject: Re: [PATCH 1/7] KVM-HDR Add constant to represent KVM MSRs enabled bit

On Mon, 13 Jun 2011, Glauber Costa wrote:

> This patch is simple, put in a different commit so it can be more easily
> shared between guest and hypervisor. It just defines a named constant
> to indicate the enable bit for KVM-specific MSRs.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Acked-by: Eric B Munson <[email protected]>



2011-06-14 01:19:49

by Eric B Munson

Subject: Re: [PATCH 2/7] KVM-HDR: KVM Steal time implementation

On Mon, 13 Jun 2011, Glauber Costa wrote:

> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who wants to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:20:13

by Eric B Munson

Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Mon, 13 Jun 2011, Glauber Costa wrote:

> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who wants to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:21:04

by Eric B Munson

Subject: Re: [PATCH 4/7] KVM-GST: Add a pv_ops stub for steal time

On Mon, 13 Jun 2011, Glauber Costa wrote:

> This patch adds a function pointer in one of the many paravirt_ops
> structs, to allow guests to register a steal time function.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:21:14

by Eric B Munson

Subject: Re: [PATCH 5/7] KVM-GST: KVM Steal time accounting

On Mon, 13 Jun 2011, Glauber Costa wrote:

> This patch accounts steal time time in kernel/sched.
> I kept it from last proposal, because I still see advantages
> in it: Doing it here will give us easier access from scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separately.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:21:24

by Eric B Munson

[permalink] [raw]
Subject: Re: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

On Mon, 13 Jun 2011, Glauber Costa wrote:

> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:21:34

by Eric B Munson

[permalink] [raw]
Subject: Re: [PATCH 7/7] KVM-GST: KVM Steal time registration

On Mon, 13 Jun 2011, Glauber Costa wrote:

> Register steal time within KVM. Every time we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Tested-by: Eric B Munson <[email protected]>



2011-06-14 01:33:40

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 2/7] KVM-HDR: KVM Steal time implementation

On 06/13/2011 07:31 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports to people who want to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-14 02:49:35

by Asias He

[permalink] [raw]
Subject: Re: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

On 06/14/2011 07:31 AM, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, specially in overcommited scenarios. Previosly, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.

One typo in the commit log:

s/Previosly/Previously

--
Best Regards,
Asias He

2011-06-14 07:45:16

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who want to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
> ---
> arch/x86/include/asm/kvm_host.h | 8 +++++
> arch/x86/include/asm/kvm_para.h | 4 ++
> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
> 3 files changed, 69 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fc38eca..5dce014 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
> unsigned int hw_tsc_khz;
> unsigned int time_offset;
> struct page *time_page;
> +
> + struct {
> + u64 msr_val;
> + gpa_t stime;
> + struct kvm_steal_time steal;
> + u64 this_time_out;
> + } st;
> +
> u64 last_guest_tsc;
> u64 last_kernel_ns;
> u64 last_tsc_nsec;
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index ac306c4..0341e61 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -45,6 +45,10 @@ struct kvm_steal_time {
> __u32 pad[6];
> };
>
> +#define KVM_STEAL_ALIGNMENT_BITS 5
> +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
> +#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
> +
> #define KVM_MAX_MMU_OP_BATCH 32
>
> #define KVM_ASYNC_PF_ENABLED (1 << 0)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6645634..10fe028 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
> * kvm-specific. Those are put in the beginning of the list.
> */
>
> -#define KVM_SAVE_MSRS_BEGIN 8
> +#define KVM_SAVE_MSRS_BEGIN 9
> static u32 msrs_to_save[] = {
> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> MSR_STAR,
> #ifdef CONFIG_X86_64
> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
> }
> }
>
> +static void record_steal_time(struct kvm_vcpu *vcpu)
> +{
> + u64 delta;
> +
> + if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
> +
> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> +
> + vcpu->arch.st.stime = 0;
> + return;
> + }
> +
> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> +
> + vcpu->arch.st.steal.steal += delta;
> + vcpu->arch.st.steal.version += 2;
> +
> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
for the read above?

> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> +
> + vcpu->arch.st.stime = 0;
> + return;
> + }
> + }
> +
> +}
> +
> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> {
> switch (msr) {
> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> if (kvm_pv_enable_async_pf(vcpu, data))
> return 1;
> break;
> + case MSR_KVM_STEAL_TIME:
> + vcpu->arch.st.msr_val = data;
> +
> + if (!(data & KVM_MSR_ENABLED)) {
> + vcpu->arch.st.stime = 0;
> + break;
> + }
> +
> + if (data & KVM_STEAL_RESERVED_MASK)
> + return 1;
> +
> + vcpu->arch.st.this_time_out = get_kernel_ns();
> + vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
> + record_steal_time(vcpu);
> +
> + break;
> +
> case MSR_IA32_MCG_CTL:
> case MSR_IA32_MCG_STATUS:
> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> case MSR_KVM_ASYNC_PF_EN:
> data = vcpu->arch.apf.msr_val;
> break;
> + case MSR_KVM_STEAL_TIME:
> + data = vcpu->arch.st.msr_val;
> + break;
> case MSR_IA32_P5_MC_ADDR:
> case MSR_IA32_P5_MC_TYPE:
> case MSR_IA32_MCG_CAP:
> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> kvm_migrate_timers(vcpu);
> vcpu->cpu = cpu;
> }
> +
> + record_steal_time(vcpu);
> }
>
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> + vcpu->arch.st.this_time_out = get_kernel_ns();
> }
>
Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
just before/after entering/exiting a guest? vcpu_(put|get) are called
for each vcpu ioctl, not only VCPU_RUN.

> static int is_efer_nx(void)
> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> (1 << KVM_FEATURE_NOP_IO_DELAY) |
> (1 << KVM_FEATURE_CLOCKSOURCE2) |
> (1 << KVM_FEATURE_ASYNC_PF) |
> - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> + (1 << KVM_FEATURE_STEAL_TIME);
> entry->ebx = 0;
> entry->ecx = 0;
> entry->edx = 0;
> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>
> kvmclock_reset(vcpu);
>
> + vcpu->arch.st.stime = 0;
> +
> kvm_clear_async_pf_completion_queue(vcpu);
> kvm_async_pf_hash_reset(vcpu);
> vcpu->arch.apf.halted = false;
> --
> 1.7.3.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Gleb.

2011-06-14 08:06:56

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 7/7] KVM-GST: KVM Steal time registration

On Mon, Jun 13, 2011 at 07:31:37PM -0400, Glauber Costa wrote:
> Register steal time within KVM. Every time we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
> ---
> Documentation/kernel-parameters.txt | 4 ++
> arch/x86/include/asm/kvm_para.h | 1 +
> arch/x86/kernel/kvm.c | 72 +++++++++++++++++++++++++++++++++++
> arch/x86/kernel/kvmclock.c | 2 +
> 4 files changed, 79 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index fd248a31..a722574 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
> no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
> fault handling.
>
> + no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
> + steal time is computed, but won't influence scheduler
> + behaviour
> +
> nolapic [X86-32,APIC] Do not enable or use the local APIC.
>
> nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 0341e61..2a8f2a5 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
>
> extern void kvmclock_init(void);
> extern int kvm_register_clock(char *txt);
> +extern void kvm_disable_steal_time(void);
>
>
> /* This instruction is vmcall. On non-VT architectures, it will generate a
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 33c07b0..5a5ac19 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
>
> early_param("no-kvmapf", parse_no_kvmapf);
>
> +static int steal_acc = 1;
> +static int parse_no_stealacc(char *arg)
> +{
> + steal_acc = 0;
> + return 0;
> +}
> +
> +early_param("no-steal-acc", parse_no_stealacc);
> +
> struct kvm_para_state {
> u8 mmu_queue[MMU_QUEUE_SIZE];
> int mmu_queue_len;
> @@ -58,6 +67,8 @@ struct kvm_para_state {
>
> static DEFINE_PER_CPU(struct kvm_para_state, para_state);
> static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
> +static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
> +static int has_steal_clock = 0;
>
> static struct kvm_para_state *kvm_para_state(void)
> {
> @@ -483,23 +494,66 @@ static struct notifier_block kvm_pv_reboot_nb = {
> .notifier_call = kvm_pv_reboot_notify,
> };
>
> +static void kvm_register_steal_time(void)
> +{
> + int cpu = smp_processor_id();
> + struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
> +
> + if (!has_steal_clock)
> + return;
> +
> + memset(st, 0, sizeof(*st));
> +
> + wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
> + printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
> + cpu, __pa(st));
> +}
> +
> +static u64 kvm_steal_clock(int cpu)
> +{
> + u64 steal;
> + struct kvm_steal_time *src;
> + int version;
> +
> + src = &per_cpu(steal_time, cpu);
> + do {
> + version = src->version;
> + rmb();
> + steal = src->steal;
> + rmb();
> + } while ((version & 1) || (version != src->version));
> +
> + return steal;
> +}
> +
> #ifdef CONFIG_SMP
> static void __init kvm_smp_prepare_boot_cpu(void)
> {
> #ifdef CONFIG_KVM_CLOCK
> WARN_ON(kvm_register_clock("primary cpu clock"));
> #endif
> + kvm_register_steal_time();
> kvm_guest_cpu_init();
> native_smp_prepare_boot_cpu();
> }
>
> static void __cpuinit kvm_guest_cpu_online(void *dummy)
> {
> + kvm_register_steal_time();
> kvm_guest_cpu_init();
> }
>
Why not call kvm_register_steal_time() from kvm_guest_cpu_init()?
This way you save one line of code and steal time will be initialized
in !CONFIG_SMP kernels too.

> +void kvm_disable_steal_time(void)
> +{
> + if (!has_steal_clock)
> + return;
> +
> + wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> +}
> +
> static void kvm_guest_cpu_offline(void *dummy)
> {
> + kvm_disable_steal_time();
> kvm_pv_disable_apf(NULL);
> apf_task_wake_all();
> }
> @@ -548,6 +602,11 @@ void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
> x86_init.irqs.trap_init = kvm_apf_trap_init;
>
> + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> + has_steal_clock = 1;
> + pv_time_ops.steal_clock = kvm_steal_clock;
> + }
> +
> #ifdef CONFIG_SMP
> smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
> register_cpu_notifier(&kvm_cpu_notifier);
> @@ -555,3 +614,16 @@ void __init kvm_guest_init(void)
> kvm_guest_cpu_init();
> #endif
> }
> +
> +static __init int activate_jump_labels(void)
> +{
> + if (has_steal_clock) {
> + jump_label_inc(&paravirt_steal_enabled);
> + if (steal_acc)
> + jump_label_inc(&paravirt_steal_rq_enabled);
> + }
> +
> + return 0;
> +}
> +arch_initcall(activate_jump_labels);
> +
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 6389a6b..c1a0188 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -160,6 +160,7 @@ static void __cpuinit kvm_setup_secondary_clock(void)
> static void kvm_crash_shutdown(struct pt_regs *regs)
> {
> native_write_msr(msr_kvm_system_time, 0, 0);
> + kvm_disable_steal_time();
> native_machine_crash_shutdown(regs);
> }
> #endif
> @@ -167,6 +168,7 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
> static void kvm_shutdown(void)
> {
> native_write_msr(msr_kvm_system_time, 0, 0);
> + kvm_disable_steal_time();
> native_machine_shutdown();
> }
>
> --
> 1.7.3.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Gleb.

2011-06-14 10:10:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/7] KVM-GST: KVM Steal time accounting

On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> +static inline int touch_steal_time(int is_idle)
> +{
> + u64 steal, st = 0;
> +
> + if (static_branch(&paravirt_steal_enabled)) {
> +
> + steal = paravirt_steal_clock(smp_processor_id());
> +
> + steal -= this_rq()->prev_steal_time;
> + this_rq()->prev_steal_time += steal;

If you move this addition below this test:

> + if (is_idle || (steal < TICK_NSEC))
> + return 0;

that is, right here, then you don't lose tiny steal deltas and
subsequent ticks accumulate their steal time until you really
have a full steal tick to account.

I guess you want something different for the idle case though.

> + while (steal > TICK_NSEC) {

/* really, if we wanted a division we'd have written one */
asm("" : "+rm" (steal));

> + steal -= TICK_NSEC;
> + st++;
> + }
> +
> + account_steal_time(st);
> + return 1;
> + }
> + return 0;
> +}

2011-06-14 10:42:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
> *rq, s64 delta)
>
> rq->prev_irq_time += irq_delta;
> delta -= irq_delta;
> +#endif
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> + if (static_branch((&paravirt_steal_rq_enabled))) {

Why is that a different variable from the touch_steal_time() one?

> +
> + steal = paravirt_steal_clock(cpu_of(rq));
> + steal -= rq->prev_steal_time_acc;
> +
> + rq->prev_steal_time_acc += steal;

You have this addition in the wrong place, when you clip:

> + if (steal > delta)
> + steal = delta;

you just lost your steal delta, so the addition to prev_steal_time_acc
needs to be after the clip.

> + delta -= steal;
> + }
> +#endif
> +
> rq->clock_task += delta;
>
> - if (irq_delta && sched_feat(NONIRQ_POWER))
> - sched_rt_avg_update(rq, irq_delta);
> + if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
> + sched_rt_avg_update(rq, irq_delta + steal);
> }

2011-06-14 12:21:59

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/13/2011 07:31 PM, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports to people who want to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-14 13:23:48

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 4/7] KVM-GST: Add a pv_ops stub for steal time

On 06/13/2011 07:31 PM, Glauber Costa wrote:
> This patch adds a function pointer in one of the many paravirt_ops
> structs, to allow guests to register a steal time function.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-15 01:02:05

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/14/2011 04:45 AM, Gleb Natapov wrote:
> On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
>> To implement steal time, we need the hypervisor to pass the guest information
>> about how much time was spent running other processes outside the VM.
>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>> we decided not to make.
>>
>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>> holds the memory area address containing information about steal time
>>
>> This patch contains the hypervisor part for it. I am keeping it separate from
>> the headers to facilitate backports to people who want to backport the kernel
>> part but not the hypervisor, or the other way around.
>>
>> Signed-off-by: Glauber Costa <[email protected]>
>> CC: Rik van Riel <[email protected]>
>> CC: Jeremy Fitzhardinge <[email protected]>
>> CC: Peter Zijlstra <[email protected]>
>> CC: Avi Kivity <[email protected]>
>> CC: Anthony Liguori <[email protected]>
>> CC: Eric B Munson <[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 8 +++++
>> arch/x86/include/asm/kvm_para.h | 4 ++
>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 69 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index fc38eca..5dce014 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
>> unsigned int hw_tsc_khz;
>> unsigned int time_offset;
>> struct page *time_page;
>> +
>> + struct {
>> + u64 msr_val;
>> + gpa_t stime;
>> + struct kvm_steal_time steal;
>> + u64 this_time_out;
>> + } st;
>> +
>> u64 last_guest_tsc;
>> u64 last_kernel_ns;
>> u64 last_tsc_nsec;
>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>> index ac306c4..0341e61 100644
>> --- a/arch/x86/include/asm/kvm_para.h
>> +++ b/arch/x86/include/asm/kvm_para.h
>> @@ -45,6 +45,10 @@ struct kvm_steal_time {
>> __u32 pad[6];
>> };
>>
>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>> +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
>> +#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
>> +
>> #define KVM_MAX_MMU_OP_BATCH 32
>>
>> #define KVM_ASYNC_PF_ENABLED (1 << 0)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6645634..10fe028 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>> * kvm-specific. Those are put in the beginning of the list.
>> */
>>
>> -#define KVM_SAVE_MSRS_BEGIN 8
>> +#define KVM_SAVE_MSRS_BEGIN 9
>> static u32 msrs_to_save[] = {
>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>> MSR_STAR,
>> #ifdef CONFIG_X86_64
>> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
>> }
>> }
>>
>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>> +{
>> + u64 delta;
>> +
>> + if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
>> +
>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> +
>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>> +
>> + vcpu->arch.st.steal.steal += delta;
>> + vcpu->arch.st.steal.version += 2;
>> +
>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
> for the read above?

Good idea, I can do that.

>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> + }
>> +
>> +}
>> +
>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>> {
>> switch (msr) {
>> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>> if (kvm_pv_enable_async_pf(vcpu, data))
>> return 1;
>> break;
>> + case MSR_KVM_STEAL_TIME:
>> + vcpu->arch.st.msr_val = data;
>> +
>> + if (!(data & KVM_MSR_ENABLED)) {
>> + vcpu->arch.st.stime = 0;
>> + break;
>> + }
>> +
>> + if (data & KVM_STEAL_RESERVED_MASK)
>> + return 1;
>> +
>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>> + vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
>> + record_steal_time(vcpu);
>> +
>> + break;
>> +
>> case MSR_IA32_MCG_CTL:
>> case MSR_IA32_MCG_STATUS:
>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
>> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>> case MSR_KVM_ASYNC_PF_EN:
>> data = vcpu->arch.apf.msr_val;
>> break;
>> + case MSR_KVM_STEAL_TIME:
>> + data = vcpu->arch.st.msr_val;
>> + break;
>> case MSR_IA32_P5_MC_ADDR:
>> case MSR_IA32_P5_MC_TYPE:
>> case MSR_IA32_MCG_CAP:
>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> kvm_migrate_timers(vcpu);
>> vcpu->cpu = cpu;
>> }
>> +
>> + record_steal_time(vcpu);
>> }
>>
>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>> }
>>
> Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
> just before/after entering/exiting a guest? vcpu_(put|get) are called
> for each vcpu ioctl, not only VCPU_RUN.
>
>> static int is_efer_nx(void)
>> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>> (1 << KVM_FEATURE_NOP_IO_DELAY) |
>> (1 << KVM_FEATURE_CLOCKSOURCE2) |
>> (1 << KVM_FEATURE_ASYNC_PF) |
>> - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>> + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
>> + (1 << KVM_FEATURE_STEAL_TIME);
>> entry->ebx = 0;
>> entry->ecx = 0;
>> entry->edx = 0;
>> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>>
>> kvmclock_reset(vcpu);
>>
>> + vcpu->arch.st.stime = 0;
>> +
>> kvm_clear_async_pf_completion_queue(vcpu);
>> kvm_async_pf_hash_reset(vcpu);
>> vcpu->arch.apf.halted = false;
>> --
>> 1.7.3.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Gleb.

2011-06-15 01:08:15

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 5/7] KVM-GST: KVM Steal time accounting

On 06/14/2011 07:10 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
>> +static inline int touch_steal_time(int is_idle)
>> +{
>> + u64 steal, st = 0;
>> +
>> + if (static_branch(&paravirt_steal_enabled)) {
>> +
>> + steal = paravirt_steal_clock(smp_processor_id());
>> +
>> + steal -= this_rq()->prev_steal_time;
>> + this_rq()->prev_steal_time += steal;
>
> If you move this addition below this test:
>
>> + if (is_idle || (steal < TICK_NSEC))
>> + return 0;
>
> that is, right here, then you don't lose tiny steal deltas and
> subsequent ticks accumulate their steal time until you really
> have a full steal tick to account.

true
> I guess you want something different for the idle case though.

definitely.

>> + while (steal > TICK_NSEC) {
>
> /* really, if we wanted a division we'd have written one */
> asm("" : "+rm" (steal));

Out of curiosity, have we seen any compiler de-optimize it to a
division, or are you just being careful?

>> + steal -= TICK_NSEC;
>> + st++;
>> + }
>> +
>> + account_steal_time(st);
>> + return 1;
>> + }
>> + return 0;
>> +}
>
>

2011-06-15 01:26:42

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
>> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
>> *rq, s64 delta)
>>
>> rq->prev_irq_time += irq_delta;
>> delta -= irq_delta;
>> +#endif
>> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
>> + if (static_branch((&paravirt_steal_rq_enabled))) {
>
> Why is that a different variable from the touch_steal_time() one?

because they track different things, touch_steal_time() and
update_rq_clock() are
called from different places, in different situations.

If we advance prev_steal_time in touch_steal_time(), and later on call
update_rq_clock_task(), we won't discount the time already flushed from
the rq_clock. Conversely, if we call update_rq_clock_task(), and only
then arrive at touch_steal_time, we won't account steal time properly.

update_rq_clock_task() is called whenever update_rq_clock() is called.
touch_steal_time is called every tick. If there is a causal relation
between them that would allow us to track it in a single location, I
fail to see it.

>> +
>> + steal = paravirt_steal_clock(cpu_of(rq));
>> + steal -= rq->prev_steal_time_acc;
>> +
>> + rq->prev_steal_time_acc += steal;
>
> You have this addition in the wrong place, when you clip:

I begin by disagreeing.
>> + if (steal > delta)
>> + steal = delta;
>
> you just lost your steal delta, so the addition to prev_steal_time_acc
> needs to be after the clip.
Unlike irq time, steal time can be extremely huge. Just think of a
virtual machine that got interrupted for a very long time. We'd have
steal >> delta, leading to steal == delta for a big number of iterations.
That would affect cpu power for an extended period of time, not
reflecting the present situation, just the past. So I like to think of delta
as a hard cap for steal time.

Obviously, I am open to debate.
>
>> + delta -= steal;
>> + }
>> +#endif
>> +
>> rq->clock_task += delta;
>>
>> - if (irq_delta && sched_feat(NONIRQ_POWER))
>> - sched_rt_avg_update(rq, irq_delta);
>> + if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
>> + sched_rt_avg_update(rq, irq_delta + steal);
>> }

2011-06-15 03:09:59

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/14/2011 04:45 AM, Gleb Natapov wrote:
> On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
>> To implement steal time, we need the hypervisor to pass the guest information
>> about how much time was spent running other processes outside the VM.
>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>> we decided not to make.
>>
>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>> holds the memory area address containing information about steal time
>>
>> This patch contains the hypervisor part for it. I am keeping it separate from
>> the headers to facilitate backports to people who wants to backport the kernel
>> part but not the hypervisor, or the other way around.
>>
>> Signed-off-by: Glauber Costa<[email protected]>
>> CC: Rik van Riel<[email protected]>
>> CC: Jeremy Fitzhardinge<[email protected]>
>> CC: Peter Zijlstra<[email protected]>
>> CC: Avi Kivity<[email protected]>
>> CC: Anthony Liguori<[email protected]>
>> CC: Eric B Munson<[email protected]>
>> ---
>> arch/x86/include/asm/kvm_host.h | 8 +++++
>> arch/x86/include/asm/kvm_para.h | 4 ++
>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 69 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index fc38eca..5dce014 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
>> unsigned int hw_tsc_khz;
>> unsigned int time_offset;
>> struct page *time_page;
>> +
>> + struct {
>> + u64 msr_val;
>> + gpa_t stime;
>> + struct kvm_steal_time steal;
>> + u64 this_time_out;
>> + } st;
>> +
>> u64 last_guest_tsc;
>> u64 last_kernel_ns;
>> u64 last_tsc_nsec;
>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>> index ac306c4..0341e61 100644
>> --- a/arch/x86/include/asm/kvm_para.h
>> +++ b/arch/x86/include/asm/kvm_para.h
>> @@ -45,6 +45,10 @@ struct kvm_steal_time {
>> __u32 pad[6];
>> };
>>
>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>> +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
>> +#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)
>> +
>> #define KVM_MAX_MMU_OP_BATCH 32
>>
>> #define KVM_ASYNC_PF_ENABLED (1<< 0)
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 6645634..10fe028 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>> * kvm-specific. Those are put in the beginning of the list.
>> */
>>
>> -#define KVM_SAVE_MSRS_BEGIN 8
>> +#define KVM_SAVE_MSRS_BEGIN 9
>> static u32 msrs_to_save[] = {
>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>> MSR_STAR,
>> #ifdef CONFIG_X86_64
>> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
>> }
>> }
>>
>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>> +{
>> + u64 delta;
>> +
>> + if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
>> +
>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> +
>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>> +
>> + vcpu->arch.st.steal.steal += delta;
>> + vcpu->arch.st.steal.version += 2;
>> +
>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
> for the read above?

Actually, I'd expect most read/writes to benefit from caching, no?
So why don't we just rename kvm_write_guest_cached() to
kvm_write_guest(), and the few places - if any - that need to force
traversing of the gfn mappings get renamed to kvm_write_guest_uncached?

>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>> +
>> + vcpu->arch.st.stime = 0;
>> + return;
>> + }
>> + }
>> +
>> +}
>> +
>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>> {
>> switch (msr) {
>> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>> if (kvm_pv_enable_async_pf(vcpu, data))
>> return 1;
>> break;
>> + case MSR_KVM_STEAL_TIME:
>> + vcpu->arch.st.msr_val = data;
>> +
>> + if (!(data & KVM_MSR_ENABLED)) {
>> + vcpu->arch.st.stime = 0;
>> + break;
>> + }
>> +
>> + if (data & KVM_STEAL_RESERVED_MASK)
>> + return 1;
>> +
>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>> + vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
>> + record_steal_time(vcpu);
>> +
>> + break;
>> +
>> +
>> case MSR_IA32_MCG_CTL:
>> case MSR_IA32_MCG_STATUS:
>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
>> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>> case MSR_KVM_ASYNC_PF_EN:
>> data = vcpu->arch.apf.msr_val;
>> break;
>> + case MSR_KVM_STEAL_TIME:
>> + data = vcpu->arch.st.msr_val;
>> + break;
>> case MSR_IA32_P5_MC_ADDR:
>> case MSR_IA32_P5_MC_TYPE:
>> case MSR_IA32_MCG_CAP:
>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>> kvm_migrate_timers(vcpu);
>> vcpu->cpu = cpu;
>> }
>> +
>> + record_steal_time(vcpu);
>> }
>>
>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>> }
>>
> Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
> just before/after entering/exiting a guest? vcpu_(put|get) are called
> for each vcpu ioctl, not only VCPU_RUN.
Sorry, missed that the first time I read your e-mail.

If done like you said, time spent in the hypervisor is accounted as
steal time. I don't think it should be.

Steal time is time spent running someone else's job instead of yours.
The name for the time spent in the hypervisor doing something for *you*
is just overhead.

>
>> static int is_efer_nx(void)
>> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>> (1 << KVM_FEATURE_NOP_IO_DELAY) |
>> (1 << KVM_FEATURE_CLOCKSOURCE2) |
>> (1 << KVM_FEATURE_ASYNC_PF) |
>> - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>> + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
>> + (1 << KVM_FEATURE_STEAL_TIME);
>> entry->ebx = 0;
>> entry->ecx = 0;
>> entry->edx = 0;
>> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>>
>> kvmclock_reset(vcpu);
>>
>> + vcpu->arch.st.stime = 0;
>> +
>> kvm_clear_async_pf_completion_queue(vcpu);
>> kvm_async_pf_hash_reset(vcpu);
>> vcpu->arch.apf.halted = false;
>> --
>> 1.7.3.4
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Gleb.

2011-06-15 09:10:10

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
> On 06/14/2011 04:45 AM, Gleb Natapov wrote:
> >On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
> >>To implement steal time, we need the hypervisor to pass the guest information
> >>about how much time was spent running other processes outside the VM.
> >>This is per-vcpu, and using the kvmclock structure for that is an abuse
> >>we decided not to make.
> >>
> >>In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> >>holds the memory area address containing information about steal time
> >>
> >>This patch contains the hypervisor part for it. I am keeping it separate from
> >>the headers to facilitate backports to people who wants to backport the kernel
> >>part but not the hypervisor, or the other way around.
> >>
> >>Signed-off-by: Glauber Costa<[email protected]>
> >>CC: Rik van Riel<[email protected]>
> >>CC: Jeremy Fitzhardinge<[email protected]>
> >>CC: Peter Zijlstra<[email protected]>
> >>CC: Avi Kivity<[email protected]>
> >>CC: Anthony Liguori<[email protected]>
> >>CC: Eric B Munson<[email protected]>
> >>---
> >> arch/x86/include/asm/kvm_host.h | 8 +++++
> >> arch/x86/include/asm/kvm_para.h | 4 ++
> >> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
> >> 3 files changed, 69 insertions(+), 3 deletions(-)
> >>
> >>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>index fc38eca..5dce014 100644
> >>--- a/arch/x86/include/asm/kvm_host.h
> >>+++ b/arch/x86/include/asm/kvm_host.h
> >>@@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
> >> unsigned int hw_tsc_khz;
> >> unsigned int time_offset;
> >> struct page *time_page;
> >>+
> >>+ struct {
> >>+ u64 msr_val;
> >>+ gpa_t stime;
> >>+ struct kvm_steal_time steal;
> >>+ u64 this_time_out;
> >>+ } st;
> >>+
> >> u64 last_guest_tsc;
> >> u64 last_kernel_ns;
> >> u64 last_tsc_nsec;
> >>diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >>index ac306c4..0341e61 100644
> >>--- a/arch/x86/include/asm/kvm_para.h
> >>+++ b/arch/x86/include/asm/kvm_para.h
> >>@@ -45,6 +45,10 @@ struct kvm_steal_time {
> >> __u32 pad[6];
> >> };
> >>
> >>+#define KVM_STEAL_ALIGNMENT_BITS 5
> >>+#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
> >>+#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
> >>+
> >> #define KVM_MAX_MMU_OP_BATCH 32
> >>
> >> #define KVM_ASYNC_PF_ENABLED (1<< 0)
> >>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>index 6645634..10fe028 100644
> >>--- a/arch/x86/kvm/x86.c
> >>+++ b/arch/x86/kvm/x86.c
> >>@@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
> >> * kvm-specific. Those are put in the beginning of the list.
> >> */
> >>
> >>-#define KVM_SAVE_MSRS_BEGIN 8
> >>+#define KVM_SAVE_MSRS_BEGIN 9
> >> static u32 msrs_to_save[] = {
> >> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> >> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >>- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
> >>+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> >> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >> MSR_STAR,
> >> #ifdef CONFIG_X86_64
> >>@@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
> >> }
> >> }
> >>
> >>+static void record_steal_time(struct kvm_vcpu *vcpu)
> >>+{
> >>+ u64 delta;
> >>+
> >>+ if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
> >>+
> >>+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>+
> >>+ vcpu->arch.st.stime = 0;
> >>+ return;
> >>+ }
> >>+
> >>+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> >>+
> >>+ vcpu->arch.st.steal.steal += delta;
> >>+ vcpu->arch.st.steal.version += 2;
> >>+
> >>+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> >Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
> >for the read above?
>
> Actually, I'd expect most read/writes to benefit from caching, no?
> So why don't we just rename kvm_write_guest_cached() to
> kvm_write_guest(), and the few places - if any - that need to force
> transversing of the gfn mappings, get renamed to
> kvm_write_guest_uncached ?
>
Good idea. I do not see any places where kvm_write_guest_uncached is
needed from a brief look. Avi?

> >>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>+
> >>+ vcpu->arch.st.stime = 0;
> >>+ return;
> >>+ }
> >>+ }
> >>+
> >>+}
> >>+
> >> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >> {
> >> switch (msr) {
> >>@@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >> if (kvm_pv_enable_async_pf(vcpu, data))
> >> return 1;
> >> break;
> >>+ case MSR_KVM_STEAL_TIME:
> >>+ vcpu->arch.st.msr_val = data;
> >>+
> >>+ if (!(data& KVM_MSR_ENABLED)) {
> >>+ vcpu->arch.st.stime = 0;
> >>+ break;
> >>+ }
> >>+
> >>+ if (data& KVM_STEAL_RESERVED_MASK)
> >>+ return 1;
> >>+
> >>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >>+ vcpu->arch.st.stime = data& KVM_STEAL_VALID_BITS;
> >>+ record_steal_time(vcpu);
> >>+
> >>+ break;
> >>+
> >> case MSR_IA32_MCG_CTL:
> >> case MSR_IA32_MCG_STATUS:
> >> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
> >>@@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> >> case MSR_KVM_ASYNC_PF_EN:
> >> data = vcpu->arch.apf.msr_val;
> >> break;
> >>+ case MSR_KVM_STEAL_TIME:
> >>+ data = vcpu->arch.st.msr_val;
> >>+ break;
> >> case MSR_IA32_P5_MC_ADDR:
> >> case MSR_IA32_P5_MC_TYPE:
> >> case MSR_IA32_MCG_CAP:
> >>@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >> kvm_migrate_timers(vcpu);
> >> vcpu->cpu = cpu;
> >> }
> >>+
> >>+ record_steal_time(vcpu);
> >> }
> >>
> >> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>@@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >> kvm_x86_ops->vcpu_put(vcpu);
> >> kvm_put_guest_fpu(vcpu);
> >> kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
> >>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >> }
> >>
> >Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
> >just before/after entering/exiting a guest? vcpu_(put|get) are called
> >for each vcpu ioctl, not only VCPU_RUN.
> Sorry, missed that the first time I've read your e-mail.
>
> If done like you said, time spent on the hypervisor is accounted as
> steal time. I don't think it is.
I thought that this is the point of steal time. Running other
tasks/guests is a hypervisor overhead too after all :) Also what about
time spent serving host interrupts between put/get? It will not be
accounted as steal time, correct?

>
> Steal time is time spent running someone else's job instead of
> yours. The name for the time spent in the hypervisor doing something
> for *you* is just overhead.
OK. That is the question of a definition I guess. If you define it like
that the code is correct.

>
> >
> >> static int is_efer_nx(void)
> >>@@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >> (1<< KVM_FEATURE_NOP_IO_DELAY) |
> >> (1<< KVM_FEATURE_CLOCKSOURCE2) |
> >> (1<< KVM_FEATURE_ASYNC_PF) |
> >>- (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >>+ (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> >>+ (1<< KVM_FEATURE_STEAL_TIME);
> >> entry->ebx = 0;
> >> entry->ecx = 0;
> >> entry->edx = 0;
> >>@@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
> >>
> >> kvmclock_reset(vcpu);
> >>
> >>+ vcpu->arch.st.stime = 0;
> >>+
> >> kvm_clear_async_pf_completion_queue(vcpu);
> >> kvm_async_pf_hash_reset(vcpu);
> >> vcpu->arch.apf.halted = false;
> >>--
> >>1.7.3.4
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>the body of a message to [email protected]
> >>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >--
> > Gleb.

--
Gleb.

2011-06-15 09:28:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/7] KVM-GST: KVM Steal time accounting

On Tue, 2011-06-14 at 22:08 -0300, Glauber Costa wrote:
> >> + while (steal > TICK_NSEC) {
> >
> > /* really, if we wanted a division we'd have
> > written one */
> > asm("" : "+rm" (steal));
>
> Out of curiosity, have we seen any compiler de-optimize it to a
> division, or are you just being careful?
>
> >> + steal -= TICK_NSEC;
> >> + st++;
> >> + }

No, that really happened a number of times; there's one in
sched_avg_period() that actually triggered, and __iter_div_u64_rem()
that started it all, iirc.

2011-06-16 03:31:31

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/15/2011 06:09 AM, Gleb Natapov wrote:
> On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
>> On 06/14/2011 04:45 AM, Gleb Natapov wrote:
>>> On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
>>>> To implement steal time, we need the hypervisor to pass the guest information
>>>> about how much time was spent running other processes outside the VM.
>>>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>>>> we decided not to make.
>>>>
>>>> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
>>>> holds the memory area address containing information about steal time
>>>>
>>>> This patch contains the hypervisor part for it. I am keeping it separate from
>>>> the headers to facilitate backports to people who wants to backport the kernel
>>>> part but not the hypervisor, or the other way around.
>>>>
>>>> Signed-off-by: Glauber Costa<[email protected]>
>>>> CC: Rik van Riel<[email protected]>
>>>> CC: Jeremy Fitzhardinge<[email protected]>
>>>> CC: Peter Zijlstra<[email protected]>
>>>> CC: Avi Kivity<[email protected]>
>>>> CC: Anthony Liguori<[email protected]>
>>>> CC: Eric B Munson<[email protected]>
>>>> ---
>>>> arch/x86/include/asm/kvm_host.h | 8 +++++
>>>> arch/x86/include/asm/kvm_para.h | 4 ++
>>>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
>>>> 3 files changed, 69 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>> index fc38eca..5dce014 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
>>>> unsigned int hw_tsc_khz;
>>>> unsigned int time_offset;
>>>> struct page *time_page;
>>>> +
>>>> + struct {
>>>> + u64 msr_val;
>>>> + gpa_t stime;
>>>> + struct kvm_steal_time steal;
>>>> + u64 this_time_out;
>>>> + } st;
>>>> +
>>>> u64 last_guest_tsc;
>>>> u64 last_kernel_ns;
>>>> u64 last_tsc_nsec;
>>>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>>>> index ac306c4..0341e61 100644
>>>> --- a/arch/x86/include/asm/kvm_para.h
>>>> +++ b/arch/x86/include/asm/kvm_para.h
>>>> @@ -45,6 +45,10 @@ struct kvm_steal_time {
>>>> __u32 pad[6];
>>>> };
>>>>
>>>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>>>> +#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
>>>> +#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
>>>> +
>>>> #define KVM_MAX_MMU_OP_BATCH 32
>>>>
>>>> #define KVM_ASYNC_PF_ENABLED (1<< 0)
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 6645634..10fe028 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>>>> * kvm-specific. Those are put in the beginning of the list.
>>>> */
>>>>
>>>> -#define KVM_SAVE_MSRS_BEGIN 8
>>>> +#define KVM_SAVE_MSRS_BEGIN 9
>>>> static u32 msrs_to_save[] = {
>>>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>>>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>>>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>>>> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>>>> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>>>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>>>> MSR_STAR,
>>>> #ifdef CONFIG_X86_64
>>>> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
>>>> }
>>>> }
>>>>
>>>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>>>> +{
>>>> + u64 delta;
>>>> +
>>>> + if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
>>>> +
>>>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>> +
>>>> + vcpu->arch.st.stime = 0;
>>>> + return;
>>>> + }
>>>> +
>>>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>>>> +
>>>> + vcpu->arch.st.steal.steal += delta;
>>>> + vcpu->arch.st.steal.version += 2;
>>>> +
>>>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
>>> Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
>>> for the read above?
>>
>> Actually, I'd expect most read/writes to benefit from caching, no?
>> So why don't we just rename kvm_write_guest_cached() to
>> kvm_write_guest(), and the few places - if any - that need to force
>> transversing of the gfn mappings, get renamed to
>> kvm_write_guest_uncached ?
>>
> Good idea. I do not see any places where kvm_write_guest_uncached is
> needed from a brief look. Avi?
>
>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>> +
>>>> + vcpu->arch.st.stime = 0;
>>>> + return;
>>>> + }
>>>> + }
>>>> +
>>>> +}
>>>> +
>>>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>> {
>>>> switch (msr) {
>>>> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>> if (kvm_pv_enable_async_pf(vcpu, data))
>>>> return 1;
>>>> break;
>>>> + case MSR_KVM_STEAL_TIME:
>>>> + vcpu->arch.st.msr_val = data;
>>>> +
>>>> + if (!(data& KVM_MSR_ENABLED)) {
>>>> + vcpu->arch.st.stime = 0;
>>>> + break;
>>>> + }
>>>> +
>>>> + if (data& KVM_STEAL_RESERVED_MASK)
>>>> + return 1;
>>>> +
>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>> + vcpu->arch.st.stime = data& KVM_STEAL_VALID_BITS;
>>>> + record_steal_time(vcpu);
>>>> +
>>>> + break;
>>>> +
>>>> case MSR_IA32_MCG_CTL:
>>>> case MSR_IA32_MCG_STATUS:
>>>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
>>>> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>>> case MSR_KVM_ASYNC_PF_EN:
>>>> data = vcpu->arch.apf.msr_val;
>>>> break;
>>>> + case MSR_KVM_STEAL_TIME:
>>>> + data = vcpu->arch.st.msr_val;
>>>> + break;
>>>> case MSR_IA32_P5_MC_ADDR:
>>>> case MSR_IA32_P5_MC_TYPE:
>>>> case MSR_IA32_MCG_CAP:
>>>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>> kvm_migrate_timers(vcpu);
>>>> vcpu->cpu = cpu;
>>>> }
>>>> +
>>>> + record_steal_time(vcpu);
>>>> }
>>>>
>>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>> kvm_x86_ops->vcpu_put(vcpu);
>>>> kvm_put_guest_fpu(vcpu);
>>>> kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>> }
>>>>
>>> Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
>>> just before/after entering/exiting a guest? vcpu_(put|get) are called
>>> for each vcpu ioctl, not only VCPU_RUN.
>> Sorry, missed that the first time I've read your e-mail.
>>
>> If done like you said, time spent on the hypervisor is accounted as
>> steal time. I don't think it is.
> I thought that this is the point of a steal time. Running other
> tasks/guests is a hypervisor overhead too after all :) Also what about
> time spend serving host interrupts between put/get? It will not be
> accounted as steal time, correct?

This is mostly semantics. I like to compare this to a normal process:
There is a difference between time the OS spent on your behalf, doing
your system calls (sys), and time spent by other processes. Similar
thing here.

Which put/get are you referring to specifically? You mean vcpu_put() vs
vcpu_load()?

If they are after vcpu_put(), they will, because at this time your
process is officially out of the cpu.


>>
>> Steal time is time spent running someone else's job instead of
>> yours. The name for the time spent in the hypervisor doing something
>> for *you* is just overhead.
> OK. That is the question of a definition I guess. If you define it like
> that the code is correct.
>
>>
>>>
>>>> static int is_efer_nx(void)
>>>> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>>> (1<< KVM_FEATURE_NOP_IO_DELAY) |
>>>> (1<< KVM_FEATURE_CLOCKSOURCE2) |
>>>> (1<< KVM_FEATURE_ASYNC_PF) |
>>>> - (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>>>> + (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
>>>> + (1<< KVM_FEATURE_STEAL_TIME);
>>>> entry->ebx = 0;
>>>> entry->ecx = 0;
>>>> entry->edx = 0;
>>>> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>>>>
>>>> kvmclock_reset(vcpu);
>>>>
>>>> + vcpu->arch.st.stime = 0;
>>>> +
>>>> kvm_clear_async_pf_completion_queue(vcpu);
>>>> kvm_async_pf_hash_reset(vcpu);
>>>> vcpu->arch.apf.halted = false;
>>>> --
>>>> 1.7.3.4
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> --
>>> Gleb.
>
> --
> Gleb.

2011-06-16 11:27:39

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Thu, Jun 16, 2011 at 12:31:02AM -0300, Glauber Costa wrote:
> On 06/15/2011 06:09 AM, Gleb Natapov wrote:
> >On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
> >>On 06/14/2011 04:45 AM, Gleb Natapov wrote:
> >>>On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
> >>>>To implement steal time, we need the hypervisor to pass the guest information
> >>>>about how much time was spent running other processes outside the VM.
> >>>>This is per-vcpu, and using the kvmclock structure for that is an abuse
> >>>>we decided not to make.
> >>>>
> >>>>In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> >>>>holds the memory area address containing information about steal time
> >>>>
> >>>>This patch contains the hypervisor part for it. I am keeping it separate from
> >>>>the headers to facilitate backports to people who wants to backport the kernel
> >>>>part but not the hypervisor, or the other way around.
> >>>>
> >>>>Signed-off-by: Glauber Costa<[email protected]>
> >>>>CC: Rik van Riel<[email protected]>
> >>>>CC: Jeremy Fitzhardinge<[email protected]>
> >>>>CC: Peter Zijlstra<[email protected]>
> >>>>CC: Avi Kivity<[email protected]>
> >>>>CC: Anthony Liguori<[email protected]>
> >>>>CC: Eric B Munson<[email protected]>
> >>>>---
> >>>> arch/x86/include/asm/kvm_host.h | 8 +++++
> >>>> arch/x86/include/asm/kvm_para.h | 4 ++
> >>>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
> >>>> 3 files changed, 69 insertions(+), 3 deletions(-)
> >>>>
> >>>>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>>>index fc38eca..5dce014 100644
> >>>>--- a/arch/x86/include/asm/kvm_host.h
> >>>>+++ b/arch/x86/include/asm/kvm_host.h
> >>>>@@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
> >>>> unsigned int hw_tsc_khz;
> >>>> unsigned int time_offset;
> >>>> struct page *time_page;
> >>>>+
> >>>>+ struct {
> >>>>+ u64 msr_val;
> >>>>+ gpa_t stime;
> >>>>+ struct kvm_steal_time steal;
> >>>>+ u64 this_time_out;
> >>>>+ } st;
> >>>>+
> >>>> u64 last_guest_tsc;
> >>>> u64 last_kernel_ns;
> >>>> u64 last_tsc_nsec;
> >>>>diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >>>>index ac306c4..0341e61 100644
> >>>>--- a/arch/x86/include/asm/kvm_para.h
> >>>>+++ b/arch/x86/include/asm/kvm_para.h
> >>>>@@ -45,6 +45,10 @@ struct kvm_steal_time {
> >>>> __u32 pad[6];
> >>>> };
> >>>>
> >>>>+#define KVM_STEAL_ALIGNMENT_BITS 5
> >>>>+#define KVM_STEAL_VALID_BITS ((-1ULL<< (KVM_STEAL_ALIGNMENT_BITS + 1)))
> >>>>+#define KVM_STEAL_RESERVED_MASK (((1<< KVM_STEAL_ALIGNMENT_BITS) - 1 )<< 1)
> >>>>+
> >>>> #define KVM_MAX_MMU_OP_BATCH 32
> >>>>
> >>>> #define KVM_ASYNC_PF_ENABLED (1<< 0)
> >>>>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>>index 6645634..10fe028 100644
> >>>>--- a/arch/x86/kvm/x86.c
> >>>>+++ b/arch/x86/kvm/x86.c
> >>>>@@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
> >>>> * kvm-specific. Those are put in the beginning of the list.
> >>>> */
> >>>>
> >>>>-#define KVM_SAVE_MSRS_BEGIN 8
> >>>>+#define KVM_SAVE_MSRS_BEGIN 9
> >>>> static u32 msrs_to_save[] = {
> >>>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> >>>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >>>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >>>>- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
> >>>>+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> >>>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >>>> MSR_STAR,
> >>>> #ifdef CONFIG_X86_64
> >>>>@@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
> >>>> }
> >>>> }
> >>>>
> >>>>+static void record_steal_time(struct kvm_vcpu *vcpu)
> >>>>+{
> >>>>+ u64 delta;
> >>>>+
> >>>>+ if (vcpu->arch.st.stime&& vcpu->arch.st.this_time_out) {
> >>>>+
> >>>>+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>>+
> >>>>+ vcpu->arch.st.stime = 0;
> >>>>+ return;
> >>>>+ }
> >>>>+
> >>>>+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> >>>>+
> >>>>+ vcpu->arch.st.steal.steal += delta;
> >>>>+ vcpu->arch.st.steal.version += 2;
> >>>>+
> >>>>+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
> >>>for the read above?
> >>
> >>Actually, I'd expect most read/writes to benefit from caching, no?
> >>So why don't we just rename kvm_write_guest_cached() to
> >>kvm_write_guest(), and the few places - if any - that need to force
> >>transversing of the gfn mappings, get renamed to
> >>kvm_write_guest_uncached ?
> >>
> >Good idea. I do not see any places where kvm_write_guest_uncached is
> >needed from a brief look. Avi?
> >
> >>>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>>+
> >>>>+ vcpu->arch.st.stime = 0;
> >>>>+ return;
> >>>>+ }
> >>>>+ }
> >>>>+
> >>>>+}
> >>>>+
> >>>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >>>> {
> >>>> switch (msr) {
> >>>>@@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >>>> if (kvm_pv_enable_async_pf(vcpu, data))
> >>>> return 1;
> >>>> break;
> >>>>+ case MSR_KVM_STEAL_TIME:
> >>>>+ vcpu->arch.st.msr_val = data;
> >>>>+
> >>>>+ if (!(data& KVM_MSR_ENABLED)) {
> >>>>+ vcpu->arch.st.stime = 0;
> >>>>+ break;
> >>>>+ }
> >>>>+
> >>>>+ if (data& KVM_STEAL_RESERVED_MASK)
> >>>>+ return 1;
> >>>>+
> >>>>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>>+ vcpu->arch.st.stime = data& KVM_STEAL_VALID_BITS;
> >>>>+ record_steal_time(vcpu);
> >>>>+
> >>>>+ break;
> >>>>+
> >>>> case MSR_IA32_MCG_CTL:
> >>>> case MSR_IA32_MCG_STATUS:
> >>>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
> >>>>@@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> >>>> case MSR_KVM_ASYNC_PF_EN:
> >>>> data = vcpu->arch.apf.msr_val;
> >>>> break;
> >>>>+ case MSR_KVM_STEAL_TIME:
> >>>>+ data = vcpu->arch.st.msr_val;
> >>>>+ break;
> >>>> case MSR_IA32_P5_MC_ADDR:
> >>>> case MSR_IA32_P5_MC_TYPE:
> >>>> case MSR_IA32_MCG_CAP:
> >>>>@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>>> kvm_migrate_timers(vcpu);
> >>>> vcpu->cpu = cpu;
> >>>> }
> >>>>+
> >>>>+ record_steal_time(vcpu);
> >>>> }
> >>>>
> >>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>@@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>>> kvm_x86_ops->vcpu_put(vcpu);
> >>>> kvm_put_guest_fpu(vcpu);
> >>>> kvm_get_msr(vcpu, MSR_IA32_TSC,&vcpu->arch.last_guest_tsc);
> >>>>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>> }
> >>>>
> >>>Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>just before/after entering/exiting a guest? vcpu_(put|get) are called
> >>>for each vcpu ioctl, not only VCPU_RUN.
> >>Sorry, missed that the first time I've read your e-mail.
> >>
> >>If done like you said, time spent on the hypervisor is accounted as
> >>steal time. I don't think it is.
> >I thought that this is the point of steal time. Running other
> >tasks/guests is hypervisor overhead too, after all :) Also, what about
> >time spent serving host interrupts between put/get? It will not be
> >accounted as steal time, correct?
>
> This is mostly semantics. I like to compare this to a normal
> process: There is a difference between time the OS spent on your
> behalf, doing your system calls (sys), and time spent by other
> processes. Similar thing here.
>
The problem with this approach is that things like doing "info cpus"
in qemu monitor will change guest scheduling behaviour. Do we want it
to be like that?

> Which put/get are you referring to specifically ? You mean
> vcpu_put() vs vcpu_load() ?
>
Yes.

> If they are after vcpu_put(), they will, because at this time your
> process is officially out of the cpu.
>
And if they are between vcpu_load() and vcpu_put() they will be accounted as
work done on behalf of the guest, although they are likely unrelated.

>
> >>
> >>Steal time is time spent running someone else's job instead of
> >>yours. The name for the time spent in the hypervisor doing something
> >>for *you* is just overhead.
> >OK. That is the question of a definition I guess. If you define it like
> >that the code is correct.
> >
> >>
> >>>
> >>>> static int is_efer_nx(void)
> >>>>@@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >>>> (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >>>> (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >>>> (1 << KVM_FEATURE_ASYNC_PF) |
> >>>>- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >>>>+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> >>>>+ (1 << KVM_FEATURE_STEAL_TIME);
> >>>> entry->ebx = 0;
> >>>> entry->ecx = 0;
> >>>> entry->edx = 0;
> >>>>@@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
> >>>>
> >>>> kvmclock_reset(vcpu);
> >>>>
> >>>>+ vcpu->arch.st.stime = 0;
> >>>>+
> >>>> kvm_clear_async_pf_completion_queue(vcpu);
> >>>> kvm_async_pf_hash_reset(vcpu);
> >>>> vcpu->arch.apf.halted = false;
> >>>>--
> >>>>1.7.3.4
> >>>>
> >>>>--
> >>>>To unsubscribe from this list: send the line "unsubscribe kvm" in
> >>>>the body of a message to [email protected]
> >>>>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>
> >>>--
> >>> Gleb.
> >
> >--
> > Gleb.

--
Gleb.

2011-06-16 12:12:05

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/16/2011 08:27 AM, Gleb Natapov wrote:
> On Thu, Jun 16, 2011 at 12:31:02AM -0300, Glauber Costa wrote:
>> On 06/15/2011 06:09 AM, Gleb Natapov wrote:
>>> On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
>>>> On 06/14/2011 04:45 AM, Gleb Natapov wrote:
>>>>> On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
>>>>>> To implement steal time, we need the hypervisor to pass the guest information
>>>>>> about how much time was spent running other processes outside the VM.
>>>>>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>>>>>> we decided not to make.
>>>>>>
> >>>>>>In this patchset, I am introducing a new MSR, KVM_MSR_STEAL_TIME, which
> >>>>>>holds the address of the memory area containing steal time information.
>>>>>>
>>>>>> This patch contains the hypervisor part for it. I am keeping it separate from
> >>>>>>the headers to facilitate backports to people who want to backport the kernel
>>>>>> part but not the hypervisor, or the other way around.
>>>>>>
>>>>>> Signed-off-by: Glauber Costa<[email protected]>
>>>>>> CC: Rik van Riel<[email protected]>
>>>>>> CC: Jeremy Fitzhardinge<[email protected]>
>>>>>> CC: Peter Zijlstra<[email protected]>
>>>>>> CC: Avi Kivity<[email protected]>
>>>>>> CC: Anthony Liguori<[email protected]>
>>>>>> CC: Eric B Munson<[email protected]>
>>>>>> ---
>>>>>> arch/x86/include/asm/kvm_host.h | 8 +++++
>>>>>> arch/x86/include/asm/kvm_para.h | 4 ++
>>>>>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
>>>>>> 3 files changed, 69 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>>>> index fc38eca..5dce014 100644
>>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>>> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
>>>>>> unsigned int hw_tsc_khz;
>>>>>> unsigned int time_offset;
>>>>>> struct page *time_page;
>>>>>> +
>>>>>> + struct {
>>>>>> + u64 msr_val;
>>>>>> + gpa_t stime;
>>>>>> + struct kvm_steal_time steal;
>>>>>> + u64 this_time_out;
>>>>>> + } st;
>>>>>> +
>>>>>> u64 last_guest_tsc;
>>>>>> u64 last_kernel_ns;
>>>>>> u64 last_tsc_nsec;
>>>>>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>>>>>> index ac306c4..0341e61 100644
>>>>>> --- a/arch/x86/include/asm/kvm_para.h
>>>>>> +++ b/arch/x86/include/asm/kvm_para.h
>>>>>> @@ -45,6 +45,10 @@ struct kvm_steal_time {
>>>>>> __u32 pad[6];
>>>>>> };
>>>>>>
>>>>>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>>>>>> +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
>>>>>> +#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)
>>>>>> +
>>>>>> #define KVM_MAX_MMU_OP_BATCH 32
>>>>>>
>>>>>> #define KVM_ASYNC_PF_ENABLED (1 << 0)
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index 6645634..10fe028 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>>>>>> * kvm-specific. Those are put in the beginning of the list.
>>>>>> */
>>>>>>
>>>>>> -#define KVM_SAVE_MSRS_BEGIN 8
>>>>>> +#define KVM_SAVE_MSRS_BEGIN 9
>>>>>> static u32 msrs_to_save[] = {
>>>>>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>>>>>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>>>>>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>>>>>> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>>>>>> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>>>>>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>>>>>> MSR_STAR,
>>>>>> #ifdef CONFIG_X86_64
>>>>>> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>>>>>> +{
>>>>>> + u64 delta;
>>>>>> +
>>>>>> + if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
>>>>>> +
>>>>>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>>>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>>>> +
>>>>>> + vcpu->arch.st.stime = 0;
>>>>>> + return;
>>>>>> + }
>>>>>> +
>>>>>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>>>>>> +
>>>>>> + vcpu->arch.st.steal.steal += delta;
>>>>>> + vcpu->arch.st.steal.version += 2;
>>>>>> +
>>>>>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
>>>>> Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
>>>>> for the read above?
>>>>
>>>> Actually, I'd expect most read/writes to benefit from caching, no?
>>>> So why don't we just rename kvm_write_guest_cached() to
>>>> kvm_write_guest(), and the few places - if any - that need to force
> >>>>traversing of the gfn mappings, get renamed to
>>>> kvm_write_guest_uncached ?
>>>>
>>> Good idea. I do not see any places where kvm_write_guest_uncached is
>>> needed from a brief look. Avi?
>>>
>>>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>>>> +
>>>>>> + vcpu->arch.st.stime = 0;
>>>>>> + return;
>>>>>> + }
>>>>>> + }
>>>>>> +
>>>>>> +}
>>>>>> +
>>>>>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>>>> {
>>>>>> switch (msr) {
>>>>>> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>>>> if (kvm_pv_enable_async_pf(vcpu, data))
>>>>>> return 1;
>>>>>> break;
>>>>>> + case MSR_KVM_STEAL_TIME:
>>>>>> + vcpu->arch.st.msr_val = data;
>>>>>> +
>>>>>> + if (!(data & KVM_MSR_ENABLED)) {
>>>>>> + vcpu->arch.st.stime = 0;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + if (data & KVM_STEAL_RESERVED_MASK)
>>>>>> + return 1;
>>>>>> +
>>>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>>> + vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
>>>>>> + record_steal_time(vcpu);
>>>>>> +
>>>>>> + break;
>>>>>> +
>>>>>> case MSR_IA32_MCG_CTL:
>>>>>> case MSR_IA32_MCG_STATUS:
>>>>>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
>>>>>> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>>>>> case MSR_KVM_ASYNC_PF_EN:
>>>>>> data = vcpu->arch.apf.msr_val;
>>>>>> break;
>>>>>> + case MSR_KVM_STEAL_TIME:
>>>>>> + data = vcpu->arch.st.msr_val;
>>>>>> + break;
>>>>>> case MSR_IA32_P5_MC_ADDR:
>>>>>> case MSR_IA32_P5_MC_TYPE:
>>>>>> case MSR_IA32_MCG_CAP:
>>>>>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>>> kvm_migrate_timers(vcpu);
>>>>>> vcpu->cpu = cpu;
>>>>>> }
>>>>>> +
>>>>>> + record_steal_time(vcpu);
>>>>>> }
>>>>>>
>>>>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>>> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>>> kvm_x86_ops->vcpu_put(vcpu);
>>>>>> kvm_put_guest_fpu(vcpu);
>>>>>> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>>> }
>>>>>>
>>>>> Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>> just before/after entering/exiting a guest? vcpu_(put|get) are called
>>>>> for each vcpu ioctl, not only VCPU_RUN.
>>>> Sorry, I missed that the first time I read your e-mail.
>>>>
>>>> If done as you suggest, time spent in the hypervisor is accounted as
>>>> steal time. I don't think it should be.
>>> I thought that this is the point of steal time. Running other
>>> tasks/guests is hypervisor overhead too, after all :) Also, what about
>>> time spent serving host interrupts between put/get? It will not be
>>> accounted as steal time, correct?
>>
>> This is mostly semantics. I like to compare this to a normal
>> process: There is a difference between time the OS spent on your
>> behalf, doing your system calls (sys), and time spent by other
>> processes. Similar thing here.
>>
> The problem with this approach is that things like doing "info cpus"
> in qemu monitor will change guest scheduling behaviour. Do we want it
> to be like that?

You mean because it is running in a different thread, and will compete
for resources with the cpu thread ?

>> Which put/get are you referring to specifically ? You mean
>> vcpu_put() vs vcpu_load() ?
>>
> Yes.
>
>> If they are after vcpu_put(), they will, because at this time your
>> process is officially out of the cpu.
>>
> And if they are between vcpu_load() and vcpu_put() they will be accounted as
> work done on behalf of the guest, although they are likely unrelated.
I think the best we can do here is record steal time / measure
time as late as we can. We could in theory subtract irq time, but it
sounds too complicated for little gain.

>>
>>>>
>>>> Steal time is time spent running someone else's job instead of
>>>> yours. The name for the time spent in the hypervisor doing something
>>>> for *you* is just overhead.
>>> OK. That is the question of a definition I guess. If you define it like
>>> that the code is correct.
>>>
>>>>
>>>>>
>>>>>> static int is_efer_nx(void)
>>>>>> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>>>>> (1 << KVM_FEATURE_NOP_IO_DELAY) |
>>>>>> (1 << KVM_FEATURE_CLOCKSOURCE2) |
>>>>>> (1 << KVM_FEATURE_ASYNC_PF) |
>>>>>> - (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>>>>>> + (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
>>>>>> + (1 << KVM_FEATURE_STEAL_TIME);
>>>>>> entry->ebx = 0;
>>>>>> entry->ecx = 0;
>>>>>> entry->edx = 0;
>>>>>> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>> kvmclock_reset(vcpu);
>>>>>>
>>>>>> + vcpu->arch.st.stime = 0;
>>>>>> +
>>>>>> kvm_clear_async_pf_completion_queue(vcpu);
>>>>>> kvm_async_pf_hash_reset(vcpu);
>>>>>> vcpu->arch.apf.halted = false;
>>>>>> --
>>>>>> 1.7.3.4
>>>>>>
>>>>>
>>>>> --
>>>>> Gleb.
>>>
>>> --
>>> Gleb.
>
> --
> Gleb.

2011-06-16 12:21:57

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Thu, Jun 16, 2011 at 09:11:42AM -0300, Glauber Costa wrote:
> On 06/16/2011 08:27 AM, Gleb Natapov wrote:
> >On Thu, Jun 16, 2011 at 12:31:02AM -0300, Glauber Costa wrote:
> >>On 06/15/2011 06:09 AM, Gleb Natapov wrote:
> >>>On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
> >>>>On 06/14/2011 04:45 AM, Gleb Natapov wrote:
> >>>>>On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
> >>>>>>To implement steal time, we need the hypervisor to pass the guest information
> >>>>>>about how much time was spent running other processes outside the VM.
> >>>>>>This is per-vcpu, and using the kvmclock structure for that is an abuse
> >>>>>>we decided not to make.
> >>>>>>
> >>>>>>In this patchset, I am introducing a new MSR, KVM_MSR_STEAL_TIME, which
> >>>>>>holds the address of the memory area containing steal time information.
> >>>>>>
> >>>>>>This patch contains the hypervisor part for it. I am keeping it separate from
> >>>>>>the headers to facilitate backports to people who want to backport the kernel
> >>>>>>part but not the hypervisor, or the other way around.
> >>>>>>
> >>>>>>Signed-off-by: Glauber Costa<[email protected]>
> >>>>>>CC: Rik van Riel<[email protected]>
> >>>>>>CC: Jeremy Fitzhardinge<[email protected]>
> >>>>>>CC: Peter Zijlstra<[email protected]>
> >>>>>>CC: Avi Kivity<[email protected]>
> >>>>>>CC: Anthony Liguori<[email protected]>
> >>>>>>CC: Eric B Munson<[email protected]>
> >>>>>>---
> >>>>>> arch/x86/include/asm/kvm_host.h | 8 +++++
> >>>>>> arch/x86/include/asm/kvm_para.h | 4 ++
> >>>>>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
> >>>>>> 3 files changed, 69 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>>>>>index fc38eca..5dce014 100644
> >>>>>>--- a/arch/x86/include/asm/kvm_host.h
> >>>>>>+++ b/arch/x86/include/asm/kvm_host.h
> >>>>>>@@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
> >>>>>> unsigned int hw_tsc_khz;
> >>>>>> unsigned int time_offset;
> >>>>>> struct page *time_page;
> >>>>>>+
> >>>>>>+ struct {
> >>>>>>+ u64 msr_val;
> >>>>>>+ gpa_t stime;
> >>>>>>+ struct kvm_steal_time steal;
> >>>>>>+ u64 this_time_out;
> >>>>>>+ } st;
> >>>>>>+
> >>>>>> u64 last_guest_tsc;
> >>>>>> u64 last_kernel_ns;
> >>>>>> u64 last_tsc_nsec;
> >>>>>>diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> >>>>>>index ac306c4..0341e61 100644
> >>>>>>--- a/arch/x86/include/asm/kvm_para.h
> >>>>>>+++ b/arch/x86/include/asm/kvm_para.h
> >>>>>>@@ -45,6 +45,10 @@ struct kvm_steal_time {
> >>>>>> __u32 pad[6];
> >>>>>> };
> >>>>>>
> >>>>>>+#define KVM_STEAL_ALIGNMENT_BITS 5
> >>>>>>+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
> >>>>>>+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)
> >>>>>>+
> >>>>>> #define KVM_MAX_MMU_OP_BATCH 32
> >>>>>>
> >>>>>> #define KVM_ASYNC_PF_ENABLED (1 << 0)
> >>>>>>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>>>>index 6645634..10fe028 100644
> >>>>>>--- a/arch/x86/kvm/x86.c
> >>>>>>+++ b/arch/x86/kvm/x86.c
> >>>>>>@@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
> >>>>>> * kvm-specific. Those are put in the beginning of the list.
> >>>>>> */
> >>>>>>
> >>>>>>-#define KVM_SAVE_MSRS_BEGIN 8
> >>>>>>+#define KVM_SAVE_MSRS_BEGIN 9
> >>>>>> static u32 msrs_to_save[] = {
> >>>>>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
> >>>>>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
> >>>>>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
> >>>>>>- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
> >>>>>>+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> >>>>>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
> >>>>>> MSR_STAR,
> >>>>>> #ifdef CONFIG_X86_64
> >>>>>>@@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
> >>>>>> }
> >>>>>> }
> >>>>>>
> >>>>>>+static void record_steal_time(struct kvm_vcpu *vcpu)
> >>>>>>+{
> >>>>>>+ u64 delta;
> >>>>>>+
> >>>>>>+ if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
> >>>>>>+
> >>>>>>+ if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>>>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>>>>+
> >>>>>>+ vcpu->arch.st.stime = 0;
> >>>>>>+ return;
> >>>>>>+ }
> >>>>>>+
> >>>>>>+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
> >>>>>>+
> >>>>>>+ vcpu->arch.st.steal.steal += delta;
> >>>>>>+ vcpu->arch.st.steal.version += 2;
> >>>>>>+
> >>>>>>+ if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
> >>>>>Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
> >>>>>for the read above?
> >>>>
> >>>>Actually, I'd expect most read/writes to benefit from caching, no?
> >>>>So why don't we just rename kvm_write_guest_cached() to
> >>>>kvm_write_guest(), and the few places - if any - that need to force
> >>>>traversing of the gfn mappings, get renamed to
> >>>>kvm_write_guest_uncached ?
> >>>>
> >>>Good idea. I do not see any places where kvm_write_guest_uncached is
> >>>needed from a brief look. Avi?
> >>>
> >>>>>>+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
> >>>>>>+
> >>>>>>+ vcpu->arch.st.stime = 0;
> >>>>>>+ return;
> >>>>>>+ }
> >>>>>>+ }
> >>>>>>+
> >>>>>>+}
> >>>>>>+
> >>>>>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >>>>>> {
> >>>>>> switch (msr) {
> >>>>>>@@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >>>>>> if (kvm_pv_enable_async_pf(vcpu, data))
> >>>>>> return 1;
> >>>>>> break;
> >>>>>>+ case MSR_KVM_STEAL_TIME:
> >>>>>>+ vcpu->arch.st.msr_val = data;
> >>>>>>+
> >>>>>>+ if (!(data & KVM_MSR_ENABLED)) {
> >>>>>>+ vcpu->arch.st.stime = 0;
> >>>>>>+ break;
> >>>>>>+ }
> >>>>>>+
> >>>>>>+ if (data & KVM_STEAL_RESERVED_MASK)
> >>>>>>+ return 1;
> >>>>>>+
> >>>>>>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>>>>+ vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
> >>>>>>+ record_steal_time(vcpu);
> >>>>>>+
> >>>>>>+ break;
> >>>>>>+
> >>>>>> case MSR_IA32_MCG_CTL:
> >>>>>> case MSR_IA32_MCG_STATUS:
> >>>>>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
> >>>>>>@@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> >>>>>> case MSR_KVM_ASYNC_PF_EN:
> >>>>>> data = vcpu->arch.apf.msr_val;
> >>>>>> break;
> >>>>>>+ case MSR_KVM_STEAL_TIME:
> >>>>>>+ data = vcpu->arch.st.msr_val;
> >>>>>>+ break;
> >>>>>> case MSR_IA32_P5_MC_ADDR:
> >>>>>> case MSR_IA32_P5_MC_TYPE:
> >>>>>> case MSR_IA32_MCG_CAP:
> >>>>>>@@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> >>>>>> kvm_migrate_timers(vcpu);
> >>>>>> vcpu->cpu = cpu;
> >>>>>> }
> >>>>>>+
> >>>>>>+ record_steal_time(vcpu);
> >>>>>> }
> >>>>>>
> >>>>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>>>@@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >>>>>> kvm_x86_ops->vcpu_put(vcpu);
> >>>>>> kvm_put_guest_fpu(vcpu);
> >>>>>> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
> >>>>>>+ vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>>>> }
> >>>>>>
> >>>>>Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
> >>>>>just before/after entering/exiting a guest? vcpu_(put|get) are called
> >>>>>for each vcpu ioctl, not only VCPU_RUN.
> >>>>Sorry, I missed that the first time I read your e-mail.
> >>>>
> >>>>If done as you suggest, time spent in the hypervisor is accounted as
> >>>>steal time. I don't think it should be.
> >>>I thought that this is the point of steal time. Running other
> >>>tasks/guests is hypervisor overhead too, after all :) Also, what about
> >>>time spent serving host interrupts between put/get? It will not be
> >>>accounted as steal time, correct?
> >>
> >>This is mostly semantics. I like to compare this to a normal
> >>process: There is a difference between time the OS spent on your
> >>behalf, doing your system calls (sys), and time spent by other
> >>processes. Similar thing here.
> >>
> >The problem with this approach is that things like doing "info cpus"
> >in qemu monitor will change guest scheduling behaviour. Do we want it
> >to be like that?
>
> You mean because it is running in a different thread, and will
> compete for resources with the cpu thread ?
>
No, because it executes the GET_REGS ioctl (in the vcpu thread). The time
it takes to execute it is accounted as time the hypervisor spent on behalf
of a guest. Not a big deal if it is executed rarely. I tend to use "info
cpus"/"info registers" quite a lot when debugging and would prefer it
not to affect guest behaviour.


> >>Which put/get are you referring to specifically ? You mean
> >>vcpu_put() vs vcpu_load() ?
> >>
> >Yes.
> >
> >>If they are after vcpu_put(), they will, because at this time your
> >>process is officially out of the cpu.
> >>
> >And if they are between vcpu_load() and vcpu_put() they will be accounted as
> >work done on behalf of the guest, although they are likely unrelated.
> I think the best we can do here is record steal time /
> measure time as late as we can. We could in theory subtract irq
> time, but it sounds too complicated for little gain.
>
Recording steal time/measure time as close as possible to vmentry/vmexit
is what I propose :) I agree about little gain part.

> >>
> >>>>
> >>>>Steal time is time spent running someone else's job instead of
> >>>>yours. The name for the time spent in the hypervisor doing something
> >>>>for *you* is just overhead.
> >>>OK. That is the question of a definition I guess. If you define it like
> >>>that the code is correct.
> >>>
> >>>>
> >>>>>
> >>>>>> static int is_efer_nx(void)
> >>>>>>@@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
> >>>>>> (1 << KVM_FEATURE_NOP_IO_DELAY) |
> >>>>>> (1 << KVM_FEATURE_CLOCKSOURCE2) |
> >>>>>> (1 << KVM_FEATURE_ASYNC_PF) |
> >>>>>>- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> >>>>>>+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> >>>>>>+ (1 << KVM_FEATURE_STEAL_TIME);
> >>>>>> entry->ebx = 0;
> >>>>>> entry->ecx = 0;
> >>>>>> entry->edx = 0;
> >>>>>>@@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
> >>>>>>
> >>>>>> kvmclock_reset(vcpu);
> >>>>>>
> >>>>>>+ vcpu->arch.st.stime = 0;
> >>>>>>+
> >>>>>> kvm_clear_async_pf_completion_queue(vcpu);
> >>>>>> kvm_async_pf_hash_reset(vcpu);
> >>>>>> vcpu->arch.apf.halted = false;
> >>>>>>--
> >>>>>>1.7.3.4
> >>>>>>
> >>>>>
> >>>>>--
> >>>>> Gleb.
> >>>
> >>>--
> >>> Gleb.
> >
> >--
> > Gleb.

--
Gleb.

2011-06-16 12:24:43

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/16/2011 09:21 AM, Gleb Natapov wrote:
> On Thu, Jun 16, 2011 at 09:11:42AM -0300, Glauber Costa wrote:
>> On 06/16/2011 08:27 AM, Gleb Natapov wrote:
>>> On Thu, Jun 16, 2011 at 12:31:02AM -0300, Glauber Costa wrote:
>>>> On 06/15/2011 06:09 AM, Gleb Natapov wrote:
>>>>> On Wed, Jun 15, 2011 at 12:09:31AM -0300, Glauber Costa wrote:
>>>>>> On 06/14/2011 04:45 AM, Gleb Natapov wrote:
>>>>>>> On Mon, Jun 13, 2011 at 07:31:33PM -0400, Glauber Costa wrote:
>>>>>>>> To implement steal time, we need the hypervisor to pass the guest information
>>>>>>>> about how much time was spent running other processes outside the VM.
>>>>>>>> This is per-vcpu, and using the kvmclock structure for that is an abuse
>>>>>>>> we decided not to make.
>>>>>>>>
>>>>>>>> In this patchset, I am introducing a new MSR, KVM_MSR_STEAL_TIME, which
>>>>>>>> holds the address of the memory area containing steal time information.
>>>>>>>>
>>>>>>>> This patch contains the hypervisor part for it. I am keeping it separate from
>>>>>>>> the headers to facilitate backports to people who want to backport the kernel
>>>>>>>> part but not the hypervisor, or the other way around.
>>>>>>>>
>>>>>>>> Signed-off-by: Glauber Costa<[email protected]>
>>>>>>>> CC: Rik van Riel<[email protected]>
>>>>>>>> CC: Jeremy Fitzhardinge<[email protected]>
>>>>>>>> CC: Peter Zijlstra<[email protected]>
>>>>>>>> CC: Avi Kivity<[email protected]>
>>>>>>>> CC: Anthony Liguori<[email protected]>
>>>>>>>> CC: Eric B Munson<[email protected]>
>>>>>>>> ---
>>>>>>>> arch/x86/include/asm/kvm_host.h | 8 +++++
>>>>>>>> arch/x86/include/asm/kvm_para.h | 4 ++
>>>>>>>> arch/x86/kvm/x86.c | 60 +++++++++++++++++++++++++++++++++++++--
>>>>>>>> 3 files changed, 69 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>>>>>>> index fc38eca..5dce014 100644
>>>>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>>>>> @@ -388,6 +388,14 @@ struct kvm_vcpu_arch {
>>>>>>>> unsigned int hw_tsc_khz;
>>>>>>>> unsigned int time_offset;
>>>>>>>> struct page *time_page;
>>>>>>>> +
>>>>>>>> + struct {
>>>>>>>> + u64 msr_val;
>>>>>>>> + gpa_t stime;
>>>>>>>> + struct kvm_steal_time steal;
>>>>>>>> + u64 this_time_out;
>>>>>>>> + } st;
>>>>>>>> +
>>>>>>>> u64 last_guest_tsc;
>>>>>>>> u64 last_kernel_ns;
>>>>>>>> u64 last_tsc_nsec;
>>>>>>>> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
>>>>>>>> index ac306c4..0341e61 100644
>>>>>>>> --- a/arch/x86/include/asm/kvm_para.h
>>>>>>>> +++ b/arch/x86/include/asm/kvm_para.h
>>>>>>>> @@ -45,6 +45,10 @@ struct kvm_steal_time {
>>>>>>>> __u32 pad[6];
>>>>>>>> };
>>>>>>>>
>>>>>>>> +#define KVM_STEAL_ALIGNMENT_BITS 5
>>>>>>>> +#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
>>>>>>>> +#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1) << 1)
>>>>>>>> +
>>>>>>>> #define KVM_MAX_MMU_OP_BATCH 32
>>>>>>>>
>>>>>>>> #define KVM_ASYNC_PF_ENABLED (1 << 0)
>>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>>> index 6645634..10fe028 100644
>>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>>> @@ -797,12 +797,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
>>>>>>>> * kvm-specific. Those are put in the beginning of the list.
>>>>>>>> */
>>>>>>>>
>>>>>>>> -#define KVM_SAVE_MSRS_BEGIN 8
>>>>>>>> +#define KVM_SAVE_MSRS_BEGIN 9
>>>>>>>> static u32 msrs_to_save[] = {
>>>>>>>> MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
>>>>>>>> MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
>>>>>>>> HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
>>>>>>>> - HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
>>>>>>>> + HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
>>>>>>>> MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
>>>>>>>> MSR_STAR,
>>>>>>>> #ifdef CONFIG_X86_64
>>>>>>>> @@ -1480,6 +1480,34 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static void record_steal_time(struct kvm_vcpu *vcpu)
>>>>>>>> +{
>>>>>>>> + u64 delta;
>>>>>>>> +
>>>>>>>> + if (vcpu->arch.st.stime && vcpu->arch.st.this_time_out) {
>>>>>>>> +
>>>>>>>> + if (unlikely(kvm_read_guest(vcpu->kvm, vcpu->arch.st.stime,
>>>>>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>>>>>> +
>>>>>>>> + vcpu->arch.st.stime = 0;
>>>>>>>> + return;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
>>>>>>>> +
>>>>>>>> + vcpu->arch.st.steal.steal += delta;
>>>>>>>> + vcpu->arch.st.steal.version += 2;
>>>>>>>> +
>>>>>>>> + if (unlikely(kvm_write_guest(vcpu->kvm, vcpu->arch.st.stime,
>>>>>>> Why not use kvm_write_guest_cached() here and introduce kvm_read_guest_cached()
>>>>>>> for the read above?
>>>>>>
>>>>>> Actually, I'd expect most read/writes to benefit from caching, no?
>>>>>> So why don't we just rename kvm_write_guest_cached() to
>>>>>> kvm_write_guest(), and the few places - if any - that need to force
>>>>>> traversing of the gfn mappings, get renamed to
>>>>>> kvm_write_guest_uncached ?
>>>>>>
>>>>> Good idea. I do not see any places where kvm_write_guest_uncached is
>>>>> needed from a brief look. Avi?
>>>>>
>>>>>>>> + &vcpu->arch.st.steal, sizeof(struct kvm_steal_time)))) {
>>>>>>>> +
>>>>>>>> + vcpu->arch.st.stime = 0;
>>>>>>>> + return;
>>>>>>>> + }
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>>>>>> {
>>>>>>>> switch (msr) {
>>>>>>>> @@ -1562,6 +1590,23 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>>>>>>>> if (kvm_pv_enable_async_pf(vcpu, data))
>>>>>>>> return 1;
>>>>>>>> break;
>>>>>>>> + case MSR_KVM_STEAL_TIME:
>>>>>>>> + vcpu->arch.st.msr_val = data;
>>>>>>>> +
>>>>>>>> + if (!(data & KVM_MSR_ENABLED)) {
>>>>>>>> + vcpu->arch.st.stime = 0;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (data & KVM_STEAL_RESERVED_MASK)
>>>>>>>> + return 1;
>>>>>>>> +
>>>>>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>>>>> + vcpu->arch.st.stime = data & KVM_STEAL_VALID_BITS;
>>>>>>>> + record_steal_time(vcpu);
>>>>>>>> +
>>>>>>>> + break;
>>>>>>>> +
>>>>>>>> case MSR_IA32_MCG_CTL:
>>>>>>>> case MSR_IA32_MCG_STATUS:
>>>>>>>> case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
>>>>>>>> @@ -1847,6 +1892,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>>>>>>>> case MSR_KVM_ASYNC_PF_EN:
>>>>>>>> data = vcpu->arch.apf.msr_val;
>>>>>>>> break;
>>>>>>>> + case MSR_KVM_STEAL_TIME:
>>>>>>>> + data = vcpu->arch.st.msr_val;
>>>>>>>> + break;
>>>>>>>> case MSR_IA32_P5_MC_ADDR:
>>>>>>>> case MSR_IA32_P5_MC_TYPE:
>>>>>>>> case MSR_IA32_MCG_CAP:
>>>>>>>> @@ -2158,6 +2206,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>>>>> kvm_migrate_timers(vcpu);
>>>>>>>> vcpu->cpu = cpu;
>>>>>>>> }
>>>>>>>> +
>>>>>>>> + record_steal_time(vcpu);
>>>>>>>> }
>>>>>>>>
>>>>>>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>>>>> @@ -2165,6 +2215,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>>>>>>> kvm_x86_ops->vcpu_put(vcpu);
>>>>>>>> kvm_put_guest_fpu(vcpu);
>>>>>>>> kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
>>>>>>>> + vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>>>>> }
>>>>>>>>
>>>>>>> Shouldn't we call record_steal_time(vcpu)/vcpu->arch.st.this_time_out = get_kernel_ns();
>>>>>>> just before/after entering/exiting a guest? vcpu_(put|get) are called
>>>>>>> for each vcpu ioctl, not only VCPU_RUN.
>>>>>> Sorry, I missed that the first time I read your e-mail.
>>>>>>
>>>>>> If done as you suggest, time spent in the hypervisor is accounted as
>>>>>> steal time. I don't think it should be.
>>>>> I thought that this is the point of steal time. Running other
>>>>> tasks/guests is hypervisor overhead too, after all :) Also, what about
>>>>> time spent serving host interrupts between put/get? It will not be
>>>>> accounted as steal time, correct?
>>>>
>>>> This is mostly semantics. I like to compare this to a normal
>>>> process: There is a difference between time the OS spent on your
>>>> behalf, doing your system calls (sys), and time spent by other
>>>> processes. Similar thing here.
>>>>
>>> The problem with this approach is that things like doing "info cpus"
>>> in qemu monitor will change guest scheduling behaviour. Do we want it
>>> to be like that?
>>
>> You mean because it is running in a different thread, and will
>> compete for resources with the cpu thread ?
>>
> No, because it executes the GET_REGS ioctl (in the vcpu thread). The time
> it takes to execute it is accounted as time the hypervisor spent on behalf
> of a guest. Not a big deal if it is executed rarely. I tend to use "info
> cpus"/"info registers" quite a lot when debugging and would prefer it
> not to affect guest behaviour.
>
>
>>>> Which put/get are you referring to specifically ? You mean
>>>> vcpu_put() vs vcpu_load() ?
>>>>
>>> Yes.
>>>
>>>> If they are after vcpu_put(), they will, because at this time your
>>>> process is officially out of the cpu.
>>>>
>>> And if they are between vcpu_load() and vcpu_put() they will be accounted as
>>> a work done on behalf of a guest although they are likely unrelated.
>> I think the best we can do here is record steal time /
>> measure time as late as we can. We could in theory subtract irq
>> time, but it sounds too complicated for little gain.
>>
> Recording steal time/measure time as close as possible to vmentry/vmexit
> is what I propose :) I agree about little gain part.

Well, I don't like it a priori, due to the reasons I've already stated.
If there were a large noticeable gain, there could be a trade-off.
But since you agree the gain is small, I'd prefer to keep it this way.
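The semantics being argued here can be sketched with a toy userspace model: steal accrues between vcpu_put() and the next vcpu_load(), while time spent inside the load/put window counts as work done on the guest's behalf. The names below are illustrative stand-ins, not the real KVM structures or functions:

```c
#include <assert.h>

typedef unsigned long long u64;

/* Toy model: this_time_out mirrors the timestamp taken at vcpu_put();
 * everything until the next vcpu_load() is counted as steal. */
struct vcpu_model {
	u64 this_time_out;	/* set at "put" */
	u64 steal;		/* accumulated steal time */
	int ever_put;		/* no steal before the first put */
};

static void model_vcpu_put(struct vcpu_model *v, u64 now)
{
	v->this_time_out = now;
	v->ever_put = 1;
}

static void model_vcpu_load(struct vcpu_model *v, u64 now)
{
	/* time between put and load: the vcpu was off the cpu */
	if (v->ever_put)
		v->steal += now - v->this_time_out;
}
```

Under this model, an ioctl such as GET_REGS that runs between load and put is not counted as steal, which is exactly the behaviour being defended above.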

>>>>
>>>>>>
>>>>>> Steal time is time spent running someone else's job instead of
>>>>>> yours. The name for the time spent in the hypervisor doing something
>>>>>> for *you* is just overhead.
>>>>> OK. That is the question of a definition I guess. If you define it like
>>>>> that the code is correct.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> static int is_efer_nx(void)
>>>>>>>> @@ -2477,7 +2528,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
>>>>>>>> (1<< KVM_FEATURE_NOP_IO_DELAY) |
>>>>>>>> (1<< KVM_FEATURE_CLOCKSOURCE2) |
>>>>>>>> (1<< KVM_FEATURE_ASYNC_PF) |
>>>>>>>> - (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
>>>>>>>> + (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
>>>>>>>> + (1<< KVM_FEATURE_STEAL_TIME);
>>>>>>>> entry->ebx = 0;
>>>>>>>> entry->ecx = 0;
>>>>>>>> entry->edx = 0;
>>>>>>>> @@ -6200,6 +6252,8 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
>>>>>>>>
>>>>>>>> kvmclock_reset(vcpu);
>>>>>>>>
>>>>>>>> + vcpu->arch.st.stime = 0;
>>>>>>>> +
>>>>>>>> kvm_clear_async_pf_completion_queue(vcpu);
>>>>>>>> kvm_async_pf_hash_reset(vcpu);
>>>>>>>> vcpu->arch.apf.halted = false;
>>>>>>>> --
>>>>>>>> 1.7.3.4
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>>>>>> the body of a message to [email protected]
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>> --
>>>>>>> Gleb.
>>>>>
>>>>> --
>>>>> Gleb.
>>>
>>> --
>>> Gleb.
>
> --
> Gleb.

2011-06-19 12:36:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> >
> > Actually, I'd expect most read/writes to benefit from caching, no?
> > So why don't we just rename kvm_write_guest_cached() to
> > kvm_write_guest(), and the few places - if any - that need to force
> > traversing of the gfn mappings, get renamed to
> > kvm_write_guest_uncached ?
> >
> Good idea. I do not see any places where kvm_write_guest_uncached is
> needed from a brief look. Avi?
>

kvm_write_guest_cached() needs something to supply the cache, and needs
recurring writes to the same location. Neither of these are common (for
example, instruction emulation doesn't have either).
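The caching pattern under discussion can be sketched in a self-contained userspace model: the cache stores a translated address plus the generation of the memslot layout it came from, so repeated writes to the same guest location skip the expensive translation. The struct and function names below are illustrative, not the real kvm_write_guest_cached()/gfn_to_hva_cache API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for gfn_to_hva_cache. */
struct hva_cache {
	unsigned long gpa;	/* guest address the cache is for */
	char *hva;		/* cached translation, NULL if invalid */
	unsigned generation;	/* memslot generation of the translation */
};

/* Illustrative stand-in for the VM's memslot state. */
struct fake_vm {
	char memory[64];
	unsigned generation;	/* bumped whenever memslots change */
	int translations;	/* counts how often we re-translated */
};

static char *translate(struct fake_vm *vm, unsigned long gpa)
{
	vm->translations++;	/* the "expensive" path */
	return gpa < sizeof(vm->memory) ? vm->memory + gpa : NULL;
}

static int write_cached(struct fake_vm *vm, struct hva_cache *c,
			unsigned long gpa, const void *data, size_t len)
{
	/* Revalidate only when the cache is cold, stale, or for
	 * a different location. */
	if (!c->hva || c->generation != vm->generation || c->gpa != gpa) {
		c->hva = translate(vm, gpa);
		c->gpa = gpa;
		c->generation = vm->generation;
	}
	if (!c->hva)
		return -1;	/* models the -EFAULT case */
	memcpy(c->hva, data, len);
	return 0;
}
```

This also shows why the cached variant only pays off for recurring writes to one location: a caller like instruction emulation, which writes all over guest memory, would miss on nearly every call.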

> >
> > If done like you said, time spent on the hypervisor is accounted as
> > steal time. I don't think it is.
> I thought that this is the point of a steal time. Running other
> tasks/guests is a hypervisor overhead too after all :) Also what about
> time spent serving host interrupts between put/get? It will not be
> accounted as steal time, correct?

With accurate interrupt time accounting, it should be. Otherwise
general hypervisor overhead is not steal time.

(i.e. if the host is not overcommitted, steal time should be close to zero).

--
error compiling committee.c: too many arguments to function

2011-06-19 13:00:40

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
> On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> >>
> >> Actually, I'd expect most read/writes to benefit from caching, no?
> >> So why don't we just rename kvm_write_guest_cached() to
> >> kvm_write_guest(), and the few places - if any - that need to force
> >> traversing of the gfn mappings, get renamed to
> >> kvm_write_guest_uncached ?
> >>
> >Good idea. I do not see any places where kvm_write_guest_uncached is
> >needed from a brief look. Avi?
> >
>
> kvm_write_guest_cached() needs something to supply the cache, and
> needs recurring writes to the same location. Neither of these are
> common (for example, instruction emulation doesn't have either).
>
Correct. Missed that. So what about changing steal time to use
kvm_write_guest_cached()?

> >>
> >> If done like you said, time spent on the hypervisor is accounted as
> >> steal time. I don't think it is.
> >I thought that this is the point of a steal time. Running other
> >tasks/guests is a hypervisor overhead too after all :) Also what about
> >time spent serving host interrupts between put/get? It will not be
> >accounted as steal time, correct?
>
> With accurate interrupt time accounting, it should be. Otherwise
> general hypervisor overhead is not steal time.
>
> (i.e. if the host is not overcommitted, steal time should be close to zero).
>

--
Gleb.

2011-06-19 13:02:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/19/2011 03:59 PM, Gleb Natapov wrote:
> On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
> > On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> > >>
> > >> Actually, I'd expect most read/writes to benefit from caching, no?
> > >> So why don't we just rename kvm_write_guest_cached() to
> > >> kvm_write_guest(), and the few places - if any - that need to force
> > >> traversing of the gfn mappings, get renamed to
> > >> kvm_write_guest_uncached ?
> > >>
> > >Good idea. I do not see any places where kvm_write_guest_uncached is
> > >needed from a brief look. Avi?
> > >
> >
> > kvm_write_guest_cached() needs something to supply the cache, and
> > needs recurring writes to the same location. Neither of these are
> > common (for example, instruction emulation doesn't have either).
> >
> Correct. Missed that. So what about changing steal time to use
> kvm_write_guest_cached()?

Makes sense, definitely. Want to post read_guest_cached() as well?

--
error compiling committee.c: too many arguments to function

2011-06-20 07:21:54

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
> On 06/19/2011 03:59 PM, Gleb Natapov wrote:
> >On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
> >> On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> >> >>
> >> >> Actually, I'd expect most read/writes to benefit from caching, no?
> >> >> So why don't we just rename kvm_write_guest_cached() to
> >> >> kvm_write_guest(), and the few places - if any - that need to force
> >> >> traversing of the gfn mappings, get renamed to
> >> >> kvm_write_guest_uncached ?
> >> >>
> >> >Good idea. I do not see any places where kvm_write_guest_uncached is
> >> >needed from a brief look. Avi?
> >> >
> >>
> >> kvm_write_guest_cached() needs something to supply the cache, and
> >> needs recurring writes to the same location. Neither of these are
> >> common (for example, instruction emulation doesn't have either).
> >>
> >Correct. Missed that. So what about changing steal time to use
> >kvm_write_guest_cached()?
>
> Makes sense, definitely. Want to post read_guest_cached() as well?
>
Glauber can you write read_guest_cached() as part of your series (should
be trivial), or do you want me to do it? I do not have code to test it
with though :)

--
Gleb.

2011-06-20 08:35:27

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/20/2011 10:21 AM, Gleb Natapov wrote:
> On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
> > On 06/19/2011 03:59 PM, Gleb Natapov wrote:
> > >On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
> > >> On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> > >> >>
> > >> >> Actually, I'd expect most read/writes to benefit from caching, no?
> > >> >> So why don't we just rename kvm_write_guest_cached() to
> > >> >> kvm_write_guest(), and the few places - if any - that need to force
> > >> >> traversing of the gfn mappings, get renamed to
> > >> >> kvm_write_guest_uncached ?
> > >> >>
> > >> >Good idea. I do not see any places where kvm_write_guest_uncached is
> > >> >needed from a brief look. Avi?
> > >> >
> > >>
> > >> kvm_write_guest_cached() needs something to supply the cache, and
> > >> needs recurring writes to the same location. Neither of these are
> > >> common (for example, instruction emulation doesn't have either).
> > >>
> > >Correct. Missed that. So what about changing steal time to use
> > >kvm_write_guest_cached()?
> >
> > Makes sense, definitely. Want to post read_guest_cached() as well?
> >
> Glauber can you write read_guest_cached() as part of your series (should
> be trivial), or do you want me to do it? I do not have code to test it
> with though :)

Yes.

(you can write it, and Glauber can include it in the series)

--
error compiling committee.c: too many arguments to function

2011-06-20 12:43:01

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On 06/20/2011 05:02 AM, Avi Kivity wrote:
> On 06/20/2011 10:21 AM, Gleb Natapov wrote:
>> On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
>> > On 06/19/2011 03:59 PM, Gleb Natapov wrote:
>> > >On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
>> > >> On 06/15/2011 12:09 PM, Gleb Natapov wrote:
>> > >> >>
>> > >> >> Actually, I'd expect most read/writes to benefit from caching,
>> no?
>> > >> >> So why don't we just rename kvm_write_guest_cached() to
>> > >> >> kvm_write_guest(), and the few places - if any - that need to
>> force
>> > >> >> traversing of the gfn mappings, get renamed to
>> > >> >> kvm_write_guest_uncached ?
>> > >> >>
>> > >> >Good idea. I do not see any places where
>> kvm_write_guest_uncached is
>> > >> >needed from a brief look. Avi?
>> > >> >
>> > >>
>> > >> kvm_write_guest_cached() needs something to supply the cache, and
>> > >> needs recurring writes to the same location. Neither of these are
>> > >> common (for example, instruction emulation doesn't have either).
>> > >>
>> > >Correct. Missed that. So what about changing steal time to use
>> > >kvm_write_guest_cached()?
>> >
>> > Makes sense, definitely. Want to post read_guest_cached() as well?
>> >
>> Glauber can you write read_guest_cached() as part of your series (should
>> be trivial), or do you want me to do it? I do not have code to test it
>> with though :)
>
> Yes.
>
> (you can write it, and Glauber can include it in the series)
>
Write it, hand me the patch, I'll include it and test it.

2011-06-20 13:39:13

by Gleb Natapov

[permalink] [raw]
Subject: Re: [PATCH 3/7] KVM-HV: KVM Steal time implementation

On Mon, Jun 20, 2011 at 09:42:31AM -0300, Glauber Costa wrote:
> On 06/20/2011 05:02 AM, Avi Kivity wrote:
> >On 06/20/2011 10:21 AM, Gleb Natapov wrote:
> >>On Sun, Jun 19, 2011 at 04:02:22PM +0300, Avi Kivity wrote:
> >>> On 06/19/2011 03:59 PM, Gleb Natapov wrote:
> >>> >On Sun, Jun 19, 2011 at 03:35:58PM +0300, Avi Kivity wrote:
> >>> >> On 06/15/2011 12:09 PM, Gleb Natapov wrote:
> >>> >> >>
> >>> >> >> Actually, I'd expect most read/writes to benefit from caching,
> >>no?
> >>> >> >> So why don't we just rename kvm_write_guest_cached() to
> >>> >> >> kvm_write_guest(), and the few places - if any - that need to
> >>force
> >>> >> traversing of the gfn mappings, get renamed to
> >>> >> >> kvm_write_guest_uncached ?
> >>> >> >>
> >>> >> >Good idea. I do not see any places where
> >>kvm_write_guest_uncached is
> >>> >> >needed from a brief look. Avi?
> >>> >> >
> >>> >>
> >>> >> kvm_write_guest_cached() needs something to supply the cache, and
> >>> >> needs recurring writes to the same location. Neither of these are
> >>> >> common (for example, instruction emulation doesn't have either).
> >>> >>
> >>> >Correct. Missed that. So what about changing steal time to use
> >>> >kvm_write_guest_cached()?
> >>>
> >>> Makes sense, definitely. Want to post read_guest_cached() as well?
> >>>
> >>Glauber can you write read_guest_cached() as part of your series (should
> >>be trivial), or do you want me to do it? I do not have code to test it
> >>with though :)
> >
> >Yes.
> >
> >(you can write it, and Glauber can include it in the series)
> >
> Write it, hand me the patch, I'll include it and test it.

Only compile tested.

===
Introduce kvm_read_guest_cached() function in addition to write one we
already have.

Signed-off-by: Gleb Natapov <[email protected]>
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fa2321a..bf62c76 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1414,6 +1414,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
}
EXPORT_SYMBOL_GPL(kvm_write_guest_cached);

+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, unsigned long len)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ int r;
+
+ if (slots->generation != ghc->generation)
+ kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+ if (kvm_is_error_hva(ghc->hva))
+ return -EFAULT;
+
+ r = __copy_from_user(data, (void __user *)ghc->hva, len);
+ if (r)
+ return -EFAULT;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
{
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
--
Gleb.

2011-06-20 14:42:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/7] KVM-GST: adjust scheduler cpu power

On Tue, 2011-06-14 at 22:26 -0300, Glauber Costa wrote:
> On 06/14/2011 07:42 AM, Peter Zijlstra wrote:
> > On Mon, 2011-06-13 at 19:31 -0400, Glauber Costa wrote:
> >> @@ -1981,12 +1987,29 @@ static void update_rq_clock_task(struct rq
> >> *rq, s64 delta)
> >>
> >> rq->prev_irq_time += irq_delta;
> >> delta -= irq_delta;
> >> +#endif
> >> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> >> + if (static_branch((&paravirt_steal_rq_enabled))) {
> >
> > Why is that a different variable from the touch_steal_time() one?
>
> because they track different things; touch_steal_time() and
> update_rq_clock() are
> called from different places in different situations.
>
> If we advance prev_steal_time in touch_steal_time(), and later on call
> update_rq_clock_task(), we won't discount the time already flushed from
> the rq_clock. Conversely, if we call update_rq_clock_task(), and only
> then arrive at touch_steal_time, we won't account steal time properly.

But that's about prev_steal_time vs prev_steal_time_acc; I agree those
should be different.

> update_rq_clock_task() is called whenever update_rq_clock() is called.
> touch_steal_time is called every tick. If there is a causal relation
> between them that would allow us to track it in a single location, I
> fail to see it.

Both are steal time muck; I was wondering why we'd want to do one and
not the other when we have a high-res steal time clock.

> >> +
> >> + steal = paravirt_steal_clock(cpu_of(rq));
> >> + steal -= rq->prev_steal_time_acc;
> >> +
> >> + rq->prev_steal_time_acc += steal;
> >
> > You have this addition in the wrong place, when you clip:
>
> I begin by disagreeing
> >> + if (steal> delta)
> >> + steal = delta;
> >
> > you just lost your steal delta, so the addition to prev_steal_time_acc
> > needs to be after the clip.

> Unlike irq time, steal time can be extremely large. Just think of a
> virtual machine that got interrupted for a very long time. We'd have
> steal >> delta, leading to steal == delta for a large number of iterations.
> That would affect cpu power for an extended period of time, not
> reflecting present situation, just the past. So I like to think of delta
> as a hard cap for steal time.
>
> Obviously, I am open to debate.

I'm failing to see how this would happen; if the virtual machine wasn't
scheduled for a long while, delta would be huge too. But suppose it
does happen: wouldn't it be likely that the virtual machine would
receive similarly bad service in the near future, making the total
accounting relevant?
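The two placements of the prev_steal_time_acc update being debated can be contrasted in a small userspace model. This is illustrative only: u64 and the field name are borrowed from the patch, everything else is made up. Variant A matches the patch as posted (advance the accumulator before clipping, so any steal beyond delta is discarded); variant B is the suggested alternative (advance it after the clip, so the excess carries over to later calls):

```c
#include <assert.h>

typedef unsigned long long u64;

struct rq_model {
	u64 prev_steal_time_acc;
};

/* Variant A (patch as posted): steal beyond delta is forgotten,
 * so delta acts as a hard cap on accounted steal. */
static u64 clip_discard(struct rq_model *rq, u64 steal_clock, u64 delta)
{
	u64 steal = steal_clock - rq->prev_steal_time_acc;

	rq->prev_steal_time_acc += steal;	/* before the clip */
	if (steal > delta)
		steal = delta;
	return delta - steal;	/* time left credited to the task */
}

/* Variant B (suggested): only the clipped amount is consumed, so a
 * large steal backlog keeps eating delta on subsequent calls. */
static u64 clip_carry(struct rq_model *rq, u64 steal_clock, u64 delta)
{
	u64 steal = steal_clock - rq->prev_steal_time_acc;

	if (steal > delta)
		steal = delta;
	rq->prev_steal_time_acc += steal;	/* after the clip */
	return delta - steal;
}
```

With a one-off huge steal (e.g. the VM was stopped for a long time), variant A forgets the backlog after one call, while variant B keeps reporting steal == delta for many calls afterwards; that lingering effect on cpu power is exactly the concern raised above.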