This patchset builds on the previous one, incorporating the suggestions
made mainly by Gleb, Peter, and Marcelo.
SCHEDSTATS is now used when available to provide information about
time spent on the runqueue by the cpu threads. Information is passed to/from
the guest using the *_cached() functions, as suggested by Gleb.
On the guest side, the two "prev_steal_time" variables are collapsed
into a single one. Steal time accounting is also done inside
update_rq_clock().
Glauber Costa (8):
KVM-HDR Add constant to represent KVM MSRs enabled bit
KVM-HDR: KVM Steal time implementation
KVM-HV: KVM Steal time implementation
KVM-HV: use schedstats to calculate steal time
KVM-GST: Add a pv_ops stub for steal time
KVM-GST: KVM Steal time accounting
KVM-GST: adjust scheduler cpu power
KVM-GST: KVM Steal time registration
Gleb Natapov (1):
introduce kvm_read_guest_cached
Documentation/kernel-parameters.txt | 4 ++
Documentation/virtual/kvm/msr.txt | 33 ++++++++++++
arch/x86/Kconfig | 12 ++++
arch/x86/include/asm/kvm_host.h | 8 +++
arch/x86/include/asm/kvm_para.h | 15 +++++
arch/x86/include/asm/paravirt.h | 9 +++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/kvm.c | 73 +++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
arch/x86/kernel/paravirt.c | 9 +++
arch/x86/kvm/x86.c | 67 ++++++++++++++++++++++-
include/linux/kvm_host.h | 2 +
kernel/sched.c | 94 +++++++++++++++++++++++++++++---
kernel/sched_features.h | 4 +-
virt/kvm/kvm_main.c | 20 +++++++
15 files changed, 339 insertions(+), 14 deletions(-)
--
1.7.3.4
From: Gleb Natapov <[email protected]>
Introduce a kvm_read_guest_cached() function, in addition to the write one we
already have.
[ by glauber: export function signature in kvm header ]
Signed-off-by: Gleb Natapov <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
2 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 31ebb59..f7df0a3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -381,6 +381,8 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset,
int kvm_read_guest_atomic(struct kvm *kvm, gpa_t gpa, void *data,
unsigned long len);
int kvm_read_guest(struct kvm *kvm, gpa_t gpa, void *data, unsigned long len);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, unsigned long len);
int kvm_write_guest_page(struct kvm *kvm, gfn_t gfn, const void *data,
int offset, int len);
int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 11d2783..d5ef9eb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1418,6 +1418,26 @@ int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
}
EXPORT_SYMBOL_GPL(kvm_write_guest_cached);
+int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, unsigned long len)
+{
+ struct kvm_memslots *slots = kvm_memslots(kvm);
+ int r;
+
+ if (slots->generation != ghc->generation)
+ kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa);
+
+ if (kvm_is_error_hva(ghc->hva))
+ return -EFAULT;
+
+ r = __copy_from_user(data, (void __user *)ghc->hva, len);
+ if (r)
+ return -EFAULT;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_read_guest_cached);
+
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len)
{
return kvm_write_guest_page(kvm, gfn, (const void *) empty_zero_page,
--
1.7.3.4
This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index a427bf7..d6cd79b 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -30,6 +30,7 @@
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
+#define KVM_MSR_ENABLED 1
/* Custom MSRs falls in the range 0x4b564d00-0x4b564dff */
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
--
1.7.3.4
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.
In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time
This patch contains the headers for it. I am keeping it separate to facilitate
backports for people who want to backport the kernel part but not the
hypervisor, or the other way around.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 33 +++++++++++++++++++++++++++++++++
arch/x86/include/asm/kvm_para.h | 9 +++++++++
2 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index d079aed..fab9b30 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -185,3 +185,36 @@ MSR_KVM_ASYNC_PF_EN: 0x4b564d02
Currently type 2 APF will be always delivered on the same vcpu as
type 1 was, but guest should not rely on that.
+
+MSR_KVM_STEAL_TIME: 0x4b564d03
+
+ data: 64-byte alignment physical address of a memory area which must be
+ in guest RAM, plus an enable bit in bit 0. This memory is expected to
+ hold a copy of the following structure:
+
+ struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[12];
+ }
+
+ whose data will be filled in by the hypervisor periodically. Only one
+ write, or registration, is needed for each VCPU. The interval between
+ updates of this structure is arbitrary and implementation-dependent.
+ The hypervisor may update this structure at any time it sees fit until
+ anything with bit0 == 0 is written to it. Guest is required to make sure
+ this structure is initialized to zero.
+
+ Fields have the following meanings:
+
+ version: guest has to check version before and after grabbing
+ time information and check that they are both equal and even.
+ An odd version indicates an in-progress update.
+
+ flags: At this point, always zero. May be used to indicate
+ changes in this structure in the future.
+
+ steal: the amount of time in which this vCPU did not run, in
+ nanoseconds.
+
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index d6cd79b..65f8bb9 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -21,6 +21,7 @@
*/
#define KVM_FEATURE_CLOCKSOURCE2 3
#define KVM_FEATURE_ASYNC_PF 4
+#define KVM_FEATURE_STEAL_TIME 5
/* The last 8 bits are used to indicate how to interpret the flags field
* in pvclock structure. If no bits are set, all flags are ignored.
@@ -35,6 +36,14 @@
#define MSR_KVM_WALL_CLOCK_NEW 0x4b564d00
#define MSR_KVM_SYSTEM_TIME_NEW 0x4b564d01
#define MSR_KVM_ASYNC_PF_EN 0x4b564d02
+#define MSR_KVM_STEAL_TIME 0x4b564d03
+
+struct kvm_steal_time {
+ __u64 steal;
+ __u32 version;
+ __u32 flags;
+ __u32 pad[12];
+};
#define KVM_MAX_MMU_OP_BATCH 32
--
1.7.3.4
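For reference, a guest-side read of the documented structure follows the
version protocol described above. A minimal sketch (kernel context assumed;
the guest code added later in this series, kvm_steal_clock(), takes the same
form):

    /* Sketch only: return a consistent steal value, retrying while an
     * update is in progress (odd version) or the version changed between
     * the two reads. */
    static u64 read_steal_time(struct kvm_steal_time *st)
    {
            u64 steal;
            u32 version;

            do {
                    version = st->version;
                    rmb();          /* read steal after version */
                    steal = st->steal;
                    rmb();          /* re-check version after steal */
            } while ((version & 1) || (version != st->version));

            return steal;
    }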
To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.
In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time
This patch contains the hypervisor part for it. I am keeping it separate from
the headers to facilitate backports for people who want to backport the kernel
part but not the hypervisor, or the other way around.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 8 ++++++
arch/x86/include/asm/kvm_para.h | 4 +++
arch/x86/kvm/x86.c | 53 ++++++++++++++++++++++++++++++++++++--
3 files changed, 62 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da6bbee..f85ed56 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -389,6 +389,14 @@ struct kvm_vcpu_arch {
unsigned int hw_tsc_khz;
unsigned int time_offset;
struct page *time_page;
+
+ struct {
+ u64 msr_val;
+ u64 this_time_out;
+ struct gfn_to_hva_cache stime;
+ struct kvm_steal_time steal;
+ } st;
+
u64 last_guest_tsc;
u64 last_kernel_ns;
u64 last_tsc_nsec;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index 65f8bb9..c484ba8 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -45,6 +45,10 @@ struct kvm_steal_time {
__u32 pad[12];
};
+#define KVM_STEAL_ALIGNMENT_BITS 5
+#define KVM_STEAL_VALID_BITS ((-1ULL << (KVM_STEAL_ALIGNMENT_BITS + 1)))
+#define KVM_STEAL_RESERVED_MASK (((1 << KVM_STEAL_ALIGNMENT_BITS) - 1 ) << 1)
+
#define KVM_MAX_MMU_OP_BATCH 32
#define KVM_ASYNC_PF_ENABLED (1 << 0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7167717..df9d274 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -808,12 +808,12 @@ EXPORT_SYMBOL_GPL(kvm_get_dr);
* kvm-specific. Those are put in the beginning of the list.
*/
-#define KVM_SAVE_MSRS_BEGIN 8
+#define KVM_SAVE_MSRS_BEGIN 9
static u32 msrs_to_save[] = {
MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
- HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN,
+ HV_X64_MSR_APIC_ASSIST_PAGE, MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
MSR_STAR,
#ifdef CONFIG_X86_64
@@ -1491,6 +1491,26 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
}
}
+static void record_steal_time(struct kvm_vcpu *vcpu)
+{
+ u64 delta;
+
+ if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+ return;
+
+ if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
+ return;
+
+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+
+ vcpu->arch.st.steal.steal += delta;
+ vcpu->arch.st.steal.version += 2;
+
+ kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
+ &vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
+}
+
int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
{
switch (msr) {
@@ -1573,6 +1593,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (kvm_pv_enable_async_pf(vcpu, data))
return 1;
break;
+ case MSR_KVM_STEAL_TIME:
+ vcpu->arch.st.msr_val = data;
+
+ if (!(data & KVM_MSR_ENABLED)) {
+ break;
+ }
+
+ if (data & KVM_STEAL_RESERVED_MASK)
+ return 1;
+
+ if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
+ data & KVM_STEAL_VALID_BITS))
+ return 1;
+
+ vcpu->arch.st.this_time_out = get_kernel_ns();
+ record_steal_time(vcpu);
+
+ break;
+
case MSR_IA32_MCG_CTL:
case MSR_IA32_MCG_STATUS:
case MSR_IA32_MC0_CTL ... MSR_IA32_MC0_CTL + 4 * KVM_MAX_MCE_BANKS - 1:
@@ -1858,6 +1897,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
case MSR_KVM_ASYNC_PF_EN:
data = vcpu->arch.apf.msr_val;
break;
+ case MSR_KVM_STEAL_TIME:
+ data = vcpu->arch.st.msr_val;
+ break;
case MSR_IA32_P5_MC_ADDR:
case MSR_IA32_P5_MC_TYPE:
case MSR_IA32_MCG_CAP:
@@ -2169,6 +2211,8 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_migrate_timers(vcpu);
vcpu->cpu = cpu;
}
+
+ record_steal_time(vcpu);
}
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
@@ -2176,6 +2220,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
+ vcpu->arch.st.this_time_out = get_kernel_ns();
}
static int is_efer_nx(void)
@@ -2489,7 +2534,8 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
(1 << KVM_FEATURE_ASYNC_PF) |
- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_STEAL_TIME);
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
@@ -6209,6 +6255,7 @@ int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
vcpu->arch.apf.msr_val = 0;
+ vcpu->arch.st.msr_val = 0;
kvmclock_reset(vcpu);
--
1.7.3.4
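As an illustration of the MSR layout checked above (not part of the patch):
with KVM_STEAL_ALIGNMENT_BITS == 5, the value written by the guest decomposes
into bit 0 = KVM_MSR_ENABLED, bits 1-5 = reserved and must be zero
(KVM_STEAL_RESERVED_MASK), and bits 6-63 = the 64-byte aligned physical
address of the area (KVM_STEAL_VALID_BITS). A hypothetical guest-side helper
composing such a value (kernel context assumed):

    /* Sketch only: 'pa' must be the 64-byte aligned guest-physical
     * address of the kvm_steal_time area. */
    static inline u64 steal_time_msr_val(u64 pa)
    {
            BUG_ON(pa & ~KVM_STEAL_VALID_BITS);     /* area not 64-byte aligned */
            return pa | KVM_MSR_ENABLED;
    }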
SCHEDSTATS provides a precise source of information about the time tasks
spend on a runqueue but not running (among other things). It is
especially useful for the steal time implementation, because it doesn't
record halt time at all.
To avoid a hard dependency on schedstats, since it is possible one won't
want to record statistics about all processes running, the previous method
of time measurement on put/load vcpu is kept for !SCHEDSTATS.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
CC: Marcelo Tosatti <[email protected]>
---
arch/x86/kvm/x86.c | 22 ++++++++++++++++++----
1 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index df9d274..7e87159 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1502,7 +1502,13 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;
- delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+ if (likely(sched_info_on())) {
+ delta = current->sched_info.run_delay - vcpu->arch.st.this_time_out;
+ vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+ } else
+#endif
+ delta = (get_kernel_ns() - vcpu->arch.st.this_time_out);
vcpu->arch.st.steal.steal += delta;
vcpu->arch.st.steal.version += 2;
@@ -1607,9 +1613,14 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
data & KVM_STEAL_VALID_BITS))
return 1;
- vcpu->arch.st.this_time_out = get_kernel_ns();
- record_steal_time(vcpu);
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+ if (likely(sched_info_on()))
+ vcpu->arch.st.this_time_out = current->sched_info.run_delay;
+ else
+#endif
+ vcpu->arch.st.this_time_out = get_kernel_ns();
+ record_steal_time(vcpu);
break;
case MSR_IA32_MCG_CTL:
@@ -2220,7 +2231,10 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
kvm_get_msr(vcpu, MSR_IA32_TSC, &vcpu->arch.last_guest_tsc);
- vcpu->arch.st.this_time_out = get_kernel_ns();
+#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT)
+ if (unlikely(!sched_info_on()))
+#endif
+ vcpu->arch.st.this_time_out = get_kernel_ns();
}
static int is_efer_nx(void)
--
1.7.3.4
This patch adds a function pointer in one of the many paravirt_ops
structs, to allow guests to register a steal time function.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/include/asm/paravirt.h | 9 +++++++++
arch/x86/include/asm/paravirt_types.h | 1 +
arch/x86/kernel/paravirt.c | 9 +++++++++
3 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index ebbc4d8..a7d2db9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -230,6 +230,15 @@ static inline unsigned long long paravirt_sched_clock(void)
return PVOP_CALL0(unsigned long long, pv_time_ops.sched_clock);
}
+struct jump_label_key;
+extern struct jump_label_key paravirt_steal_enabled;
+extern struct jump_label_key paravirt_steal_rq_enabled;
+
+static inline u64 paravirt_steal_clock(int cpu)
+{
+ return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+}
+
static inline unsigned long long paravirt_read_pmc(int counter)
{
return PVOP_CALL1(u64, pv_cpu_ops.read_pmc, counter);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8288509..2c76521 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -89,6 +89,7 @@ struct pv_lazy_ops {
struct pv_time_ops {
unsigned long long (*sched_clock)(void);
+ unsigned long long (*steal_clock)(int cpu);
unsigned long (*get_tsc_khz)(void);
};
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 869e1ae..613a793 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -202,6 +202,14 @@ static void native_flush_tlb_single(unsigned long addr)
__native_flush_tlb_single(addr);
}
+struct jump_label_key paravirt_steal_enabled;
+struct jump_label_key paravirt_steal_rq_enabled;
+
+static u64 native_steal_clock(int cpu)
+{
+ return 0;
+}
+
/* These are in entry.S */
extern void native_iret(void);
extern void native_irq_enable_sysexit(void);
@@ -307,6 +315,7 @@ struct pv_init_ops pv_init_ops = {
struct pv_time_ops pv_time_ops = {
.sched_clock = native_sched_clock,
+ .steal_clock = native_steal_clock,
};
struct pv_irq_ops pv_irq_ops = {
--
1.7.3.4
This patch accounts steal time in kernel/sched.
I kept it from the last proposal, because I still see advantages
in it: Doing it here will give us easier access to scheduler
variables such as the cpu rq. The next patch shows an example of
usage for it.
Since functions like account_idle_time() can be called from
multiple places, not only account_process_tick(), steal time
grabbing is repeated in each account function separately.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
kernel/sched.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 52 insertions(+), 0 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 3f2e502..c166863 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -75,6 +75,7 @@
#include <asm/tlb.h>
#include <asm/irq_regs.h>
#include <asm/mutex.h>
+#include <asm/paravirt.h>
#include "sched_cpupri.h"
#include "workqueue_sched.h"
@@ -528,6 +529,7 @@ struct rq {
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
#endif
+ u64 prev_steal_time;
/* calc_load related fields */
unsigned long calc_load_update;
@@ -1953,6 +1955,48 @@ void account_system_vtime(struct task_struct *curr)
}
EXPORT_SYMBOL_GPL(account_system_vtime);
+#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+
+static inline u64 steal_ticks(u64 steal)
+{
+ if (unlikely(steal > NSEC_PER_SEC))
+ return steal / TICK_NSEC;
+
+ return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
+}
+
+/*
+ * We have to flush steal time information every time something else
+ * is accounted. Since the accounting functions are all visible to the rest
+ * of the kernel, it gets tricky to do them in one place. This helper function
+ * helps us.
+ *
+ * When the system is idle, the concept of steal time does not apply. We just
+ * tell the underlying hypervisor that we grabbed the data, but skip steal time
+ * accounting
+ */
+static noinline bool touch_steal_time(int is_idle)
+{
+ u64 steal, st = 0;
+
+ if (static_branch(&paravirt_steal_enabled)) {
+
+ steal = paravirt_steal_clock(smp_processor_id());
+
+ steal -= this_rq()->prev_steal_time;
+
+ st = steal_ticks(steal);
+ this_rq()->prev_steal_time += st * TICK_NSEC;
+
+ if (is_idle || st == 0)
+ return false;
+
+ account_steal_time(st);
+ return true;
+ }
+ return false;
+}
+
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
s64 irq_delta;
@@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t tmp;
+ if (touch_steal_time(0))
+ return;
+
/* Add user time to process. */
p->utime = cputime_add(p->utime, cputime);
p->utimescaled = cputime_add(p->utimescaled, cputime_scaled);
@@ -3802,6 +3849,9 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t *target_cputime64;
+ if (touch_steal_time(0))
+ return;
+
if ((p->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0)) {
account_guest_time(p, cputime, cputime_scaled);
return;
@@ -3839,6 +3889,8 @@ void account_idle_time(cputime_t cputime)
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
+ touch_steal_time(1);
+
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
--
1.7.3.4
This is a first proposal for using steal time information
to influence the scheduler. There are a lot of optimizations
and fine grained adjustments to be done, but it is working reasonably
so far for me (mostly)
With this patch (and some host pinnings to demonstrate the situation),
two vcpus with very different steal time (Say 80 % vs 1 %) will not get
an even distribution of processes. This is a situation that can naturally
arise, especially in overcommitted scenarios. Previously, the guest scheduler
would wrongly think that all cpus have the same ability to run processes,
lowering the overall throughput.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
arch/x86/Kconfig | 12 ++++++++++++
kernel/sched.c | 44 ++++++++++++++++++++++++++++++++++----------
kernel/sched_features.h | 4 ++--
3 files changed, 48 insertions(+), 12 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index da34972..b26f312 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -512,6 +512,18 @@ menuconfig PARAVIRT_GUEST
if PARAVIRT_GUEST
+config PARAVIRT_TIME_ACCOUNTING
+ bool "Paravirtual steal time accounting"
+ select PARAVIRT
+ default n
+ ---help---
+ Select this option to enable fine granularity task steal time
+ accounting. Time spent executing other tasks in parallel with
+ the current vCPU is discounted from the vCPU power. To account for
+ that, there can be a small performance impact.
+
+ If in doubt, say N here.
+
source "arch/x86/xen/Kconfig"
config KVM_CLOCK
diff --git a/kernel/sched.c b/kernel/sched.c
index c166863..3ef0de9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1975,7 +1975,7 @@ static inline u64 steal_ticks(u64 steal)
* tell the underlying hypervisor that we grabbed the data, but skip steal time
* accounting
*/
-static noinline bool touch_steal_time(int is_idle)
+static noinline bool __touch_steal_time(int is_idle, u64 max_steal, u64 *ticks)
{
u64 steal, st = 0;
@@ -1985,8 +1985,13 @@ static noinline bool touch_steal_time(int is_idle)
steal -= this_rq()->prev_steal_time;
+ if (steal > max_steal)
+ steal = max_steal;
+
st = steal_ticks(steal);
this_rq()->prev_steal_time += st * TICK_NSEC;
+ if (ticks)
+ *ticks = st;
if (is_idle || st == 0)
return false;
@@ -1997,10 +2002,16 @@ static noinline bool touch_steal_time(int is_idle)
return false;
}
+static inline bool touch_steal_time(int is_idle)
+{
+ return __touch_steal_time(is_idle, UINT_MAX, NULL);
+}
+
static void update_rq_clock_task(struct rq *rq, s64 delta)
{
- s64 irq_delta;
+ s64 irq_delta = 0, steal = 0;
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
/*
@@ -2023,12 +2034,30 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
rq->prev_irq_time += irq_delta;
delta -= irq_delta;
+#endif
+#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
+ if (static_branch((&paravirt_steal_rq_enabled))) {
+ int is_idle;
+ u64 st;
+
+ is_idle = ((rq->curr != rq->idle) ||
+ irq_count() != HARDIRQ_OFFSET);
+
+ __touch_steal_time(is_idle, delta, &st);
+
+ steal = st * TICK_NSEC;
+
+ delta -= steal;
+ }
+#endif
+
rq->clock_task += delta;
- if (irq_delta && sched_feat(NONIRQ_POWER))
- sched_rt_avg_update(rq, irq_delta);
+ if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
+ sched_rt_avg_update(rq, irq_delta + steal);
}
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
static int irqtime_account_hi_update(void)
{
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
@@ -2063,12 +2092,7 @@ static int irqtime_account_si_update(void)
#define sched_clock_irqtime (0)
-static void update_rq_clock_task(struct rq *rq, s64 delta)
-{
- rq->clock_task += delta;
-}
-
-#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
+#endif
#include "sched_idletask.c"
#include "sched_fair.c"
diff --git a/kernel/sched_features.h b/kernel/sched_features.h
index be40f73..ca3b025 100644
--- a/kernel/sched_features.h
+++ b/kernel/sched_features.h
@@ -61,9 +61,9 @@ SCHED_FEAT(LB_BIAS, 1)
SCHED_FEAT(OWNER_SPIN, 1)
/*
- * Decrement CPU power based on irq activity
+ * Decrement CPU power based on time not spent running tasks
*/
-SCHED_FEAT(NONIRQ_POWER, 1)
+SCHED_FEAT(NONTASK_POWER, 1)
/*
* Queue remote wakeups on the target CPU and process them
--
1.7.3.4
Register steal time within KVM. Every time we sample the steal time
information, we update a local variable that tells what was the
last time read. We then account the difference.
Signed-off-by: Glauber Costa <[email protected]>
CC: Rik van Riel <[email protected]>
CC: Jeremy Fitzhardinge <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Avi Kivity <[email protected]>
CC: Anthony Liguori <[email protected]>
CC: Eric B Munson <[email protected]>
---
Documentation/kernel-parameters.txt | 4 ++
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kernel/kvm.c | 73 +++++++++++++++++++++++++++++++++++
arch/x86/kernel/kvmclock.c | 2 +
4 files changed, 80 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index fd248a31..a722574 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1737,6 +1737,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
fault handling.
+ no-steal-acc [X86,KVM] Disable paravirtualized steal time accounting.
+ steal time is computed, but won't influence scheduler
+ behaviour
+
nolapic [X86-32,APIC] Do not enable or use the local APIC.
nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index c484ba8..35d732d 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -94,6 +94,7 @@ struct kvm_vcpu_pv_apf_data {
extern void kvmclock_init(void);
extern int kvm_register_clock(char *txt);
+extern void kvm_disable_steal_time(void);
/* This instruction is vmcall. On non-VT architectures, it will generate a
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 33c07b0..58331c2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -51,6 +51,15 @@ static int parse_no_kvmapf(char *arg)
early_param("no-kvmapf", parse_no_kvmapf);
+static int steal_acc = 1;
+static int parse_no_stealacc(char *arg)
+{
+ steal_acc = 0;
+ return 0;
+}
+
+early_param("no-steal-acc", parse_no_stealacc);
+
struct kvm_para_state {
u8 mmu_queue[MMU_QUEUE_SIZE];
int mmu_queue_len;
@@ -58,6 +67,8 @@ struct kvm_para_state {
static DEFINE_PER_CPU(struct kvm_para_state, para_state);
static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64);
+static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64);
+static int has_steal_clock = 0;
static struct kvm_para_state *kvm_para_state(void)
{
@@ -441,6 +452,21 @@ static void __init paravirt_ops_setup(void)
#endif
}
+static void kvm_register_steal_time(void)
+{
+ int cpu = smp_processor_id();
+ struct kvm_steal_time *st = &per_cpu(steal_time, cpu);
+
+ if (!has_steal_clock)
+ return;
+
+ memset(st, 0, sizeof(*st));
+
+ wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
+ printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
+ cpu, __pa(st));
+}
+
void __cpuinit kvm_guest_cpu_init(void)
{
if (!kvm_para_available())
@@ -457,6 +483,9 @@ void __cpuinit kvm_guest_cpu_init(void)
printk(KERN_INFO"KVM setup async PF for cpu %d\n",
smp_processor_id());
}
+
+ if (has_steal_clock)
+ kvm_register_steal_time();
}
static void kvm_pv_disable_apf(void *unused)
@@ -483,6 +512,31 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};
+static u64 kvm_steal_clock(int cpu)
+{
+ u64 steal;
+ struct kvm_steal_time *src;
+ int version;
+
+ src = &per_cpu(steal_time, cpu);
+ do {
+ version = src->version;
+ rmb();
+ steal = src->steal;
+ rmb();
+ } while ((version & 1) || (version != src->version));
+
+ return steal;
+}
+
+void kvm_disable_steal_time(void)
+{
+ if (!has_steal_clock)
+ return;
+
+ wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -500,6 +554,7 @@ static void __cpuinit kvm_guest_cpu_online(void *dummy)
static void kvm_guest_cpu_offline(void *dummy)
{
+ kvm_disable_steal_time();
kvm_pv_disable_apf(NULL);
apf_task_wake_all();
}
@@ -548,6 +603,11 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF))
x86_init.irqs.trap_init = kvm_apf_trap_init;
+ if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ has_steal_clock = 1;
+ pv_time_ops.steal_clock = kvm_steal_clock;
+ }
+
#ifdef CONFIG_SMP
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu;
register_cpu_notifier(&kvm_cpu_notifier);
@@ -555,3 +615,16 @@ void __init kvm_guest_init(void)
kvm_guest_cpu_init();
#endif
}
+
+static __init int activate_jump_labels(void)
+{
+ if (has_steal_clock) {
+ jump_label_inc(&paravirt_steal_enabled);
+ if (steal_acc)
+ jump_label_inc(&paravirt_steal_rq_enabled);
+ }
+
+ return 0;
+}
+arch_initcall(activate_jump_labels);
+
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 6389a6b..c1a0188 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -160,6 +160,7 @@ static void __cpuinit kvm_setup_secondary_clock(void)
static void kvm_crash_shutdown(struct pt_regs *regs)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_crash_shutdown(regs);
}
#endif
@@ -167,6 +168,7 @@ static void kvm_crash_shutdown(struct pt_regs *regs)
static void kvm_shutdown(void)
{
native_write_msr(msr_kvm_system_time, 0, 0);
+ kvm_disable_steal_time();
native_machine_shutdown();
}
--
1.7.3.4
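Putting the pieces together: with this patch, booting a guest with
no-steal-acc enables only paravirt_steal_enabled, so steal time is still
sampled and reported (it shows up in /proc/stat via account_steal_time()),
while paravirt_steal_rq_enabled stays off and the scheduler's rq clock is
left untouched - matching the kernel-parameters description above.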
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch is simple, put in a different commit so it can be more easily
> shared between guest and hypervisor. It just defines a named constant
> to indicate the enable bit for KVM-specific MSRs.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
My mail provider seems to have dropped patch 1 of the series so I can't reply
directly to it, please add my Tested-by there as well.
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the headers for it. I am keeping it separate to facilitate
> backports for people who want to backport the kernel part but not the
> hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> To implement steal time, we need the hypervisor to pass the guest information
> about how much time was spent running other processes outside the VM.
> This is per-vcpu, and using the kvmclock structure for that is an abuse
> we decided not to make.
>
> In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
> holds the memory area address containing information about steal time
>
> This patch contains the hypervisor part for it. I am keeping it separate from
> the headers to facilitate backports for people who want to backport the kernel
> part but not the hypervisor, or the other way around.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> SCHEDSTATS provides a precise source of information about the time tasks
> spend on a runqueue but not running (among other things). It is
> especially useful for the steal time implementation, because it doesn't
> record halt time at all.
>
> To avoid a hard dependency on schedstats, since it is possible one won't
> want to record statistics about all processes running, the previous method
> of time measurement on put/load vcpu is kept for !SCHEDSTATS.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
> CC: Marcelo Tosatti <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch adds a function pointer in one of the many paravirt_ops
> structs, to allow guests to register a steal time function.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This patch accounts steal time in kernel/sched.
> I kept it from the last proposal, because I still see advantages
> in it: Doing it here will give us easier access to scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separately.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> This is a first proposal for using steal time information
> to influence the scheduler. There are a lot of optimizations
> and fine grained adjustments to be done, but it is working reasonably
> so far for me (mostly)
>
> With this patch (and some host pinnings to demonstrate the situation),
> two vcpus with very different steal time (Say 80 % vs 1 %) will not get
> an even distribution of processes. This is a situation that can naturally
> arise, especially in overcommitted scenarios. Previously, the guest scheduler
> would wrongly think that all cpus have the same ability to run processes,
> lowering the overall throughput.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On Wed, 29 Jun 2011, Glauber Costa wrote:
> Register steal time within KVM. Every time we sample the steal time
> information, we update a local variable that tells what was the
> last time read. We then account the difference.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Rik van Riel <[email protected]>
> CC: Jeremy Fitzhardinge <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Avi Kivity <[email protected]>
> CC: Anthony Liguori <[email protected]>
> CC: Eric B Munson <[email protected]>
Tested-by: Eric B Munson <[email protected]>
On 06/30/2011 12:56 AM, Eric B Munson wrote:
> My mail provider seems to have dropped patch 1 of the series so I can't reply
> directly to it, please add my Tested-by there as well.
How did you test it then?
--
error compiling committee.c: too many arguments to function
On Thu, 30 Jun 2011, Avi Kivity wrote:
> On 06/30/2011 12:56 AM, Eric B Munson wrote:
> >My mail provider seems to have dropped patch 1 of the series so I can't reply
> >directly to it, please add my Tested-by there as well.
>
> How did you test it then?
>
I built host and guest kernels with the patches and pinned a while(1) and the
CPU thread from qemu to CPU 2 on the host. I then started the same while(1)
process in guest and verified that I see ~50% steal time reported.
I then built 2.6.39 (just the code I had present) on the guest and timed it
while it was competing with the while(1) on the host for CPU time. Next I
built the guest kernel with STEAL_TIME=N and reran the kernel compile to make
sure that there weren't any huge variations in performance.
Eric
On 06/30/2011 09:59 AM, Eric B Munson wrote:
> On Thu, 30 Jun 2011, Avi Kivity wrote:
>
>> On 06/30/2011 12:56 AM, Eric B Munson wrote:
>>> My mail provider seems to have dropped patch 1 of the series so I can't reply
>>> directly to it, please add my Tested-by there as well.
>>
>> How did you test it then?
>>
>
> I built host and guest kernels with the patches and pinned a while(1) and the
> CPU thread from qemu to CPU 2 on the host. I then started the same while(1)
> process in guest and verified that I see ~50% steal time reported.
>
> I then built 2.6.39 (just the code I had present) on the guest and timed it
> while it was competing with the while(1) on the host for CPU time. Next I
> built the guest kernel with STEAL_TIME=N and reran the kernel compile to make
> sure that there weren't any huge variations in performance.
>
> Eric
I think what Avi means is that it won't even compile without PATCH 1/9. If
you don't have it, how could you test it?
On Thu, 30 Jun 2011, Glauber Costa wrote:
> On 06/30/2011 09:59 AM, Eric B Munson wrote:
> >On Thu, 30 Jun 2011, Avi Kivity wrote:
> >
> >>On 06/30/2011 12:56 AM, Eric B Munson wrote:
> >>>My mail provider seems to have dropped patch 1 of the series so I can't reply
> >>>directly to it, please add my Tested-by there as well.
> >>
> >>How did you test it then?
> >>
> >
> >I built host and guest kernels with the patches and pinned a while(1) and the
> >CPU thread from qemu to CPU 2 on the host. I then started the same while(1)
> >process in guest and verified that I see ~50% steal time reported.
> >
> >I then built 2.6.39 (just the code I had present) on the guest and timed it
> >while it was competing with the while(1) on the host for CPU time. Next I
> >built the guest kernel with STEAL_TIME=N and reran the kernel compile to make
> >sure that there weren't any huge variations in performance.
> >
> >Eric
> I think what Avi means is that it won't even compile without PATCH 1/9.
> If you don't have it, how could you test it?
>
It made it to several lkml archives.
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> + if (static_branch((&paravirt_steal_rq_enabled))) {
> + int is_idle;
> + u64 st;
> +
> + is_idle = ((rq->curr != rq->idle) ||
> + irq_count() != HARDIRQ_OFFSET);
Now that hurt my brain. If the vcpu is idle, why does it want to run?
How can an idle vcpu ever rack up steal time?
Also, what's that HARDIRQ_OFFSET bit about? sorely lacking in
explanation, and the Changelog to this patch is about as bad as the last
one.
> + __touch_steal_time(is_idle, delta, &st);
> +
> + steal = st * TICK_NSEC;
> +
> + delta -= steal;
> + }
> +#endif
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> + return __touch_steal_time(is_idle, UINT_MAX, NULL);
That wants to be ULLONG_MAX, because max_steal is a u64, with UINT_MAX
the comparison:
+ if (steal > max_steal)
Isn't true per-se and the compiler cannot optimize the branch away.
Also, make sure it does indeed optimize that branch away, otherwise
simply duplicate the code. This is on the very hottest scheduler paths,
branches do count.
Same for:
+ if (ticks)
with ticks == NULL, make sure it never emits code for the above call.
Ah, also, all that requires you strip that noinline crap you put on it;
if it cannot inline, it cannot assume anything about the input and will
thus have to emit all this nonsense.
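A sketch of the wrapper with those suggestions applied (an inline wrapper
passing ULLONG_MAX, names taken from the patch above):

    static inline bool touch_steal_time(int is_idle)
    {
            /* ULLONG_MAX: the 'steal > max_steal' clamp can never fire,
             * so the compiler can drop it when inlining this caller. */
            return __touch_steal_time(is_idle, ULLONG_MAX, NULL);
    }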
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> @@ -2063,12 +2092,7 @@ static int irqtime_account_si_update(void)
>
> #define sched_clock_irqtime (0)
>
> -static void update_rq_clock_task(struct rq *rq, s64 delta)
> -{
> - rq->clock_task += delta;
> -}
> -
> -#endif /* CONFIG_IRQ_TIME_ACCOUNTING */
That relies on the compiler realizing that irq_delta and steal will
never be used in the below function:
> static void update_rq_clock_task(struct rq *rq, s64 delta)
> {
> + s64 irq_delta = 0, steal = 0;
>
> +#ifdef CONFIG_IRQ_TIME_ACCOUNTING
> irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
>
> /*
> @@ -2023,12 +2034,30 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>
> rq->prev_irq_time += irq_delta;
> delta -= irq_delta;
> +#endif
> +#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
> + if (static_branch((&paravirt_steal_rq_enabled))) {
> + int is_idle;
> + u64 st;
> +
> + is_idle = ((rq->curr != rq->idle) ||
> + irq_count() != HARDIRQ_OFFSET);
> +
> + __touch_steal_time(is_idle, delta, &st);
> +
> + steal = st * TICK_NSEC;
> +
> + delta -= steal;
> + }
> +#endif
> +
> rq->clock_task += delta;
>
> + if ((irq_delta + steal) && sched_feat(NONTASK_POWER))
> + sched_rt_avg_update(rq, irq_delta + steal);
> }
And thus trim the last bit as if(0) and DCE. Did you verify that the
compiler does indeed do that for !PARAVIRT !IRQ_TIME_ACCOUNTING?
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> + if (static_branch(&paravirt_steal_enabled)) {
How is that going to compile on !CONFIG_PARAVIRT or !x86 in general?
Only x86-PARAVIRT will provide that variable.
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> This patch accounts steal time in kernel/sched.
> I kept it from the last proposal, because I still see advantages
> in it: Doing it here will give us easier access to scheduler
> variables such as the cpu rq. The next patch shows an example of
> usage for it.
>
> Since functions like account_idle_time() can be called from
> multiple places, not only account_process_tick(), steal time
> grabbing is repeated in each account function separately.
>
That changelog is so going to frustrate someone trying to re-create your
thinking in a year's time. They really don't care about the last
proposals and the next patch.
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> +static noinline bool touch_steal_time(int is_idle)
That noinline is very unlucky there,
> +{
> + u64 steal, st = 0;
> +
> + if (static_branch(&paravirt_steal_enabled)) {
> +
> + steal = paravirt_steal_clock(smp_processor_id());
> +
> + steal -= this_rq()->prev_steal_time;
> +
> + st = steal_ticks(steal);
> + this_rq()->prev_steal_time += st * TICK_NSEC;
> +
> + if (is_idle || st == 0)
> + return false;
> +
> + account_steal_time(st);
> + return true;
> + }
> + return false;
> +}
> +
> static void update_rq_clock_task(struct rq *rq, s64 delta)
> {
> s64 irq_delta;
> @@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p,
> cputime_t cputime,
> struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> cputime64_t tmp;
>
> + if (touch_steal_time(0))
> + return;
Means we have an unconditional call here, even if the static_branch() is
patched out.
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> +static inline u64 steal_ticks(u64 steal)
> +{
> + if (unlikely(steal > NSEC_PER_SEC))
> + return steal / TICK_NSEC;
That won't compile on a number of 32bit architecture, use div_u64 or
something similar.
> +
> + return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
> +}
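A possible 32-bit-safe variant (a sketch, not from the posted series), using
div_u64() from <linux/math64.h> for the u64-by-u32 division:

    static inline u64 steal_ticks(u64 steal)
    {
            if (unlikely(steal > NSEC_PER_SEC))
                    return div_u64(steal, TICK_NSEC);

            return __iter_div_u64_rem(steal, TICK_NSEC, &steal);
    }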
On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
> + version: guest has to check version before and after grabbing
> + time information and check that they are both equal and even.
> + An odd version indicates an in-progress update.
That's generally called a sequence count.
On 06/30/2011 06:54 PM, Peter Zijlstra wrote:
> On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
>> +static noinline bool touch_steal_time(int is_idle)
>
> That noinline is very unlucky there,
>
>> +{
>> + u64 steal, st = 0;
>> +
>> + if (static_branch(&paravirt_steal_enabled)) {
>> +
>> + steal = paravirt_steal_clock(smp_processor_id());
>> +
>> + steal -= this_rq()->prev_steal_time;
>> +
>> + st = steal_ticks(steal);
>> + this_rq()->prev_steal_time += st * TICK_NSEC;
>> +
>> + if (is_idle || st == 0)
>> + return false;
>> +
>> + account_steal_time(st);
>> + return true;
>> + }
>> + return false;
>> +}
>> +
>> static void update_rq_clock_task(struct rq *rq, s64 delta)
>> {
>> s64 irq_delta;
>> @@ -3716,6 +3760,9 @@ void account_user_time(struct task_struct *p,
>> cputime_t cputime,
>> struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
>> cputime64_t tmp;
>>
>> + if (touch_steal_time(0))
>> + return;
>
> Means we have an unconditional call here, even if the static_branch() is
> patched out.
Ok.
I was under the impression that the proper use of jump labels required
each label to be tied to a single location. If we make it inline, the
same key would point to multiple locations, and we would have trouble
altering all of the locations. I might be wrong, of course. Isn't it the
case?
On 06/30/2011 06:54 PM, Peter Zijlstra wrote:
> On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
>> This patch accounts steal time in kernel/sched.
>> I kept it from the last proposal, because I still see advantages
>> in it: Doing it here will give us easier access to scheduler
>> variables such as the cpu rq. The next patch shows an example of
>> usage for it.
>>
>> Since functions like account_idle_time() can be called from
>> multiple places, not only account_process_tick(), steal time
>> grabbing is repeated in each account function separately.
>>
>
> That changelog is so going to frustrate someone trying to re-create your
> thinking in a year's time. They really don't care about the last
> proposals and the next patch.
You're right. I'll review all changelogs and rewrite them as needed.
On 06/30/2011 06:54 PM, Peter Zijlstra wrote:
> On Wed, 2011-06-29 at 11:29 -0400, Glauber Costa wrote:
>> + if (static_branch(&paravirt_steal_enabled)) {
>
> How is that going to compile on !CONFIG_PARAVIRT or !x86 in general?
> Only x86-PARAVIRT will provide that variable.
>
>
Good point. I'd wrap it into CONFIG_PARAVIRT.
To be clear, the reason I did not put it inside
CONFIG_PARAVIRT_TIME_ACCOUNTING is that I wanted to keep the mere
display of steal time separate from the rest - unless, of course, you
object to this idea.
Using CONFIG_PARAVIRT achieves this goal well.
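A minimal sketch of that direction (not from the posted series): compile the
sampling block out entirely when CONFIG_PARAVIRT is not set, since only x86
paravirt provides paravirt_steal_enabled and paravirt_steal_clock():

    static bool touch_steal_time(int is_idle)
    {
    #ifdef CONFIG_PARAVIRT
            u64 steal, st = 0;

            if (static_branch(&paravirt_steal_enabled)) {
                    steal = paravirt_steal_clock(smp_processor_id());
                    steal -= this_rq()->prev_steal_time;

                    st = steal_ticks(steal);
                    this_rq()->prev_steal_time += st * TICK_NSEC;

                    if (is_idle || st == 0)
                            return false;

                    account_steal_time(st);
                    return true;
            }
    #endif
            return false;
    }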