2012-11-26 20:36:37

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 0/5] Alter steal time reporting in KVM

In the case of where you have a system that is running in a
capped or overcommitted environment the user may see steal time
being reported in accounting tools such as top or vmstat. This can
cause confusion for the end user. To ease the confusion this patch set
adds the idea of consigned (expected steal) time. The host will separate
the consigned time from the steal time. The consignment limit passed to the
host will be the amount of steal time expected within a fixed period of
time. Any other steal time accruing during that period will show as the
traditional steal time.

---

Michael Wolf (5):
Alter the amount of steal time reported by the guest.
Expand the steal time msr to also contain the consigned time.
Add the code to send the consigned time from the host to the guest
Add a timer to allow the separation of consigned from steal time.
Add an ioctl to communicate the consign limit to the host.


arch/x86/include/asm/kvm_host.h | 11 +++++++
arch/x86/include/asm/kvm_para.h | 3 +-
arch/x86/include/asm/paravirt.h | 4 +--
arch/x86/include/asm/paravirt_types.h | 2 +
arch/x86/kernel/kvm.c | 8 ++---
arch/x86/kernel/paravirt.c | 4 +--
arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
fs/proc/stat.c | 9 +++++-
include/linux/kernel_stat.h | 2 +
include/linux/kvm_host.h | 2 +
include/uapi/linux/kvm.h | 2 +
kernel/sched/core.c | 10 ++++++-
kernel/sched/cputime.c | 21 +++++++++++++-
kernel/sched/sched.h | 2 +
virt/kvm/kvm_main.c | 7 +++++
15 files changed, 120 insertions(+), 17 deletions(-)

--
Signature


2012-11-26 20:36:50

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 1/5] Alter the amount of steal time reported by the guest.

Modify the amount of stealtime that the kernel reports via the /proc interface.
Steal time will now be broken down into steal_time and consigned_time.
Consigned_time will represent the amount of time that is expected to be lost
due to overcommitment of the physical cpu or by using cpu capping. The amount
consigned_time will be passed in using an ioctl. The time will be expressed in
the number of nanoseconds to be lost in during the fixed period. The fixed period
is currently 1/10th of a second.

Signed-off-by: Michael Wolf <[email protected]>
---
fs/proc/stat.c | 9 +++++++--
include/linux/kernel_stat.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
int i, j;
unsigned long jif;
u64 user, nice, system, idle, iowait, irq, softirq, steal;
- u64 guest, guest_nice;
+ u64 guest, guest_nice, consign;
u64 sum = 0;
u64 sum_softirq = 0;
unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)

user = nice = system = idle = iowait =
irq = softirq = steal = 0;
- guest = guest_nice = 0;
+ guest = guest_nice = consign = 0;
getboottime(&boottime);
jif = boottime.tv_sec;

+
for_each_possible_cpu(i) {
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+ consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
sum += kstat_cpu_irqs_sum(i);
sum += arch_irq_stat_cpu(i);

@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+ seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');

for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+ consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
seq_printf(p, "cpu%d", i);
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+ seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
seq_putc(p, '\n');
}
seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 1865b1f..e5978b0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+ CPUTIME_CONSIGN,
NR_STATS,
};

2012-11-26 20:36:59

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

Add a consigned field. This field will hold the time lost due to capping or overcommit.
The rest of the time will still show up in the steal-time field.

Signed-off-by: Michael Wolf <[email protected]>
---
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/include/asm/paravirt_types.h | 2 +-
arch/x86/kernel/kvm.c | 7 ++-----
kernel/sched/core.c | 10 +++++++++-
kernel/sched/cputime.c | 2 +-
5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a0facf3..a5f9f30 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
{
- return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+ PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
}

static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {

struct pv_time_ops {
unsigned long long (*sched_clock)(void);
- unsigned long long (*steal_clock)(int cpu);
+ void (*steal_clock)(int cpu, unsigned long long *steal);
unsigned long (*get_tsc_khz)(void);
};

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..ac357b3 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,9 +372,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};

-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
{
- u64 steal;
struct kvm_steal_time *src;
int version;

@@ -382,11 +381,9 @@ static u64 kvm_steal_clock(int cpu)
do {
version = src->version;
rmb();
- steal = src->steal;
+ *steal = src->steal;
rmb();
} while ((version & 1) || (version != src->version));
-
- return steal;
}

void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2e077c..b21d92d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -748,6 +748,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
*/
#if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
s64 steal = 0, irq_delta = 0;
+ u64 consigned = 0;
#endif
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -776,8 +777,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
if (static_key_false((&paravirt_steal_rq_enabled))) {
u64 st;
+ u64 cs;

- steal = paravirt_steal_clock(cpu_of(rq));
+ paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+ /*
+ * since we are not assigning the steal time to cpustats
+ * here, just combine the steal and consigned times to
+ * do the rest of the calculations.
+ */
+ steal += consigned;
steal -= rq->prev_steal_time_rq;

if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8d859da..593b647 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;

- steal = paravirt_steal_clock(smp_processor_id());
+ paravirt_steal_clock(smp_processor_id(), &steal);
steal -= this_rq()->prev_steal_time;

st = steal_ticks(steal);

2012-11-26 20:37:13

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 3/5] Add the code to send the consigned time from the host to the guest

Add the code to send the consigned time from the host to the guest.

Signed-off-by: Michael Wolf <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/include/asm/kvm_para.h | 3 ++-
arch/x86/include/asm/paravirt.h | 4 ++--
arch/x86/kernel/kvm.c | 3 ++-
arch/x86/kernel/paravirt.c | 4 ++--
arch/x86/kvm/x86.c | 2 ++
include/linux/kernel_stat.h | 1 +
kernel/sched/cputime.c | 21 +++++++++++++++++++--
kernel/sched/sched.h | 2 ++
9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b2e11f4..434d378 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -426,6 +426,7 @@ struct kvm_vcpu_arch {
u64 msr_val;
u64 last_steal;
u64 accum_steal;
+ u64 accum_consigned;
struct gfn_to_hva_cache stime;
struct kvm_steal_time steal;
} st;
diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
index eb3e9d8..1763369 100644
--- a/arch/x86/include/asm/kvm_para.h
+++ b/arch/x86/include/asm/kvm_para.h
@@ -42,9 +42,10 @@

struct kvm_steal_time {
__u64 steal;
+ __u64 consigned;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u32 pad[10];
};

#define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index a5f9f30..d39e8d0 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
+static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
- PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+ PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
}

static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index ac357b3..4439a5c 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -372,7 +372,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
.notifier_call = kvm_pv_reboot_notify,
};

-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
struct kvm_steal_time *src;
int version;
@@ -382,6 +382,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
version = src->version;
rmb();
*steal = src->steal;
+ *consigned = src->consigned;
rmb();
} while ((version & 1) || (version != src->version));
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
struct static_key paravirt_steal_enabled;
struct static_key paravirt_steal_rq_enabled;

-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
- return 0;
+ *steal = *consigned = 0;
}

/* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1eefebe..683b531 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1565,8 +1565,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
return;

vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+ vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
vcpu->arch.st.steal.version += 2;
vcpu->arch.st.accum_steal = 0;
+ vcpu->arch.st.accum_consigned = 0;

kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e5978b0..91afaa3 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *);
extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
extern void account_idle_time(cputime_t);

extern void account_process_tick(struct task_struct *, int user);
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593b647..53bd0be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
}

/*
+ * This accounts for the time that is split out of steal time.
+ * Consigned time represents the amount of time that the cpu was
+ * expected to be somewhere else.
+ */
+void account_consigned_time(cputime_t cputime)
+{
+ u64 *cpustat = kcpustat_this_cpu->cpustat;
+
+ cpustat[CPUTIME_CONSIGN] += (__force u64) cputime;
+}
+
+/*
* Account for involuntary wait time.
* @cputime: the cpu time spent in involuntary wait
*/
@@ -274,15 +286,20 @@ static __always_inline bool steal_account_process_tick(void)
#ifdef CONFIG_PARAVIRT
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal, st = 0;
+ u64 consigned, cs = 0;

- paravirt_steal_clock(smp_processor_id(), &steal);
+ paravirt_steal_clock(smp_processor_id(), &steal, &consigned);
steal -= this_rq()->prev_steal_time;
+ consigned -= this_rq()->prev_consigned_time;

st = steal_ticks(steal);
+ cs = steal_ticks(consigned);
this_rq()->prev_steal_time += st * TICK_NSEC;
+ this_rq()->prev_consigned_time += cs * TICK_NSEC;

account_steal_time(st);
- return st;
+ account_consigned_time(cs);
+ return st || cs;
}
#endif
return false;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 508e77e..83fe976 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -450,9 +450,11 @@ struct rq {
#endif
#ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
+ u64 prev_consigned_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
u64 prev_steal_time_rq;
+ u64 prev_consigned_time_rq;
#endif

/* calc_load related fields */

2012-11-26 20:37:23

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 4/5] Add a timer to allow the separation of consigned from steal time.

Add a timer to the host. This will define the period. During a period
the first n ticks will go into the consigned bucket. Any other ticks that
occur within the period will be placed in the stealtime bucket.

Signed-off-by: Michael Wolf <[email protected]>
---
arch/x86/include/asm/kvm_host.h | 10 +++++++++
arch/x86/include/asm/paravirt.h | 2 +-
arch/x86/kvm/x86.c | 42 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 434d378..4794c95 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -41,6 +41,8 @@
#define KVM_PIO_PAGE_OFFSET 1
#define KVM_COALESCED_MMIO_PAGE_OFFSET 2

+#define KVM_STEAL_TIMER_DELAY 100000000UL
+
#define CR0_RESERVED_BITS \
(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
| X86_CR0_ET | X86_CR0_NE | X86_CR0_WP | X86_CR0_AM \
@@ -353,6 +355,14 @@ struct kvm_vcpu_arch {
bool tpr_access_reporting;

/*
+ * timer used to determine if the time should be counted as
+ * steal time or consigned time.
+ */
+ struct hrtimer steal_timer;
+ u64 current_consigned;
+ u64 consigned_limit;
+
+ /*
* Paging state of the vcpu
*
* If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d39e8d0..6db79f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,7 +196,7 @@ struct static_key;
extern struct static_key paravirt_steal_enabled;
extern struct static_key paravirt_steal_rq_enabled;

-static inline u64 paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
{
PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 683b531..c91f4c9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1546,13 +1546,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
static void accumulate_steal_time(struct kvm_vcpu *vcpu)
{
u64 delta;
+ u64 steal_delta;
+ u64 consigned_delta;

if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
return;

delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
- vcpu->arch.st.accum_steal = delta;
+
+ /* split the delta into steal and consigned */
+ if (vcpu->arch.current_consigned < vcpu->arch.consigned_limit) {
+ vcpu->arch.current_consigned += delta;
+ if (vcpu->arch.current_consigned > vcpu->arch.consigned_limit) {
+ steal_delta = vcpu->arch.current_consigned
+ - vcpu->arch.consigned_limit;
+ consigned_delta = delta - steal_delta;
+ } else {
+ consigned_delta = delta;
+ steal_delta = 0;
+ }
+ } else {
+ consigned_delta = 0;
+ steal_delta = delta;
+ }
+ vcpu->arch.st.accum_steal = steal_delta;
+ vcpu->arch.st.accum_consigned = consigned_delta;
}

static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -6203,11 +6222,25 @@ bool kvm_vcpu_compatible(struct kvm_vcpu *vcpu)

struct static_key kvm_no_apic_vcpu __read_mostly;

+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+ struct kvm_vcpu *vcpu;
+ ktime_t now;
+
+ vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+ vcpu->arch.current_consigned = 0;
+ now = ktime_get();
+ hrtimer_forward(&vcpu->arch.steal_timer, now,
+ ktime_set(0, KVM_STEAL_TIMER_DELAY));
+ return HRTIMER_RESTART;
+}
+
int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
{
struct page *page;
struct kvm *kvm;
int r;
+ ktime_t ktime;

BUG_ON(vcpu->kvm == NULL);
kvm = vcpu->kvm;
@@ -6251,6 +6284,12 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)

kvm_async_pf_hash_reset(vcpu);
kvm_pmu_init(vcpu);
+ /* Initialize and start a timer to capture steal and consigned time */
+ hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ vcpu->arch.steal_timer.function = &steal_timer_fn;
+ ktime = ktime_set(0, KVM_STEAL_TIMER_DELAY);
+ hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);

return 0;
fail_free_mce_banks:
@@ -6269,6 +6308,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
{
int idx;

+ hrtimer_cancel(&vcpu->arch.steal_timer);
kvm_pmu_destroy(vcpu);
kfree(vcpu->arch.mce_banks);
kvm_free_lapic(vcpu);

2012-11-26 20:37:34

by Michael Wolf

[permalink] [raw]
Subject: [PATCH 5/5] Add an ioctl to communicate the consign limit to the host.

Add an ioctl to communicate the consign limit to the host.

Signed-off-by: Michael Wolf <[email protected]>
---
arch/x86/kvm/x86.c | 6 ++++++
include/linux/kvm_host.h | 2 ++
include/uapi/linux/kvm.h | 2 ++
virt/kvm/kvm_main.c | 7 +++++++
4 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c91f4c9..5d57469 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5938,6 +5938,12 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
}

+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu, long entitlement)
+{
+ vcpu->arch.consigned_limit = entitlement;
+ return 0;
+}
+
int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct i387_fxsave_struct *fxsave =
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0e2212f..de13648 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,8 @@ void kvm_arch_hardware_unsetup(void);
void kvm_arch_check_processor_compat(void *rtn);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm_vcpu *vcpu,
+ long entitlement);

void kvm_free_physmem(struct kvm *kvm);

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0a6d6ba..86f24bb 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -921,6 +921,8 @@ struct kvm_s390_ucas_mapping {
#define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg)
/* VM is being stopped by host */
#define KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad)
+/* Set the consignment limit which will be used to separete steal time */
+#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long)

#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..c712fe5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2062,6 +2062,13 @@ out_free2:
r = 0;
break;
}
+ case KVM_SET_ENTITLEMENT: {
+ r = kvm_arch_vcpu_ioctl_set_entitlement(vcpu, arg);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
default:
r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);
}

2012-11-27 08:48:23

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

Hi,

On 11/27/2012 12:36 AM, Michael Wolf wrote:
> In the case of where you have a system that is running in a
> capped or overcommitted environment the user may see steal time
> being reported in accounting tools such as top or vmstat. This can
> cause confusion for the end user. To ease the confusion this patch set
> adds the idea of consigned (expected steal) time. The host will separate
> the consigned time from the steal time. The consignment limit passed to the
> host will be the amount of steal time expected within a fixed period of
> time. Any other steal time accruing during that period will show as the
> traditional steal time.

If you submit this again, please include a version number in your series.

It would also be helpful to include a small changelog about what changed
between last version and this version, so we could focus on that.

As for the rest, I answered your previous two submissions saying I don't
agree with the concept. If you hadn't changed anything, resending it
won't change my mind.

I could of course, be mistaken or misguided. But I had also not seen any
wave of support in favor of this previously, so basically I have no new
data to make me believe I should see it any differently.

Let's try this again:

* Rik asked you in your last submission how does ppc handle this. You
said, and I quote: "In the case of lpar on POWER systems they simply
report steal time and do not alter it in any way.
They do however report how much processor is assigned to the partition
and that information is in /proc/ppc64/lparcfg."

Now, that is a *way* more sensible thing to do. Much more. "Confusing
users" is something extremely subjective. This is specially true about
concepts that are know for quite some time, like steal time. If you out
of a sudden change the meaning of this, it is sure to confuse a lot more
users than it would clarify.





>
> ---
>
> Michael Wolf (5):
> Alter the amount of steal time reported by the guest.
> Expand the steal time msr to also contain the consigned time.
> Add the code to send the consigned time from the host to the guest
> Add a timer to allow the separation of consigned from steal time.
> Add an ioctl to communicate the consign limit to the host.
>
>
> arch/x86/include/asm/kvm_host.h | 11 +++++++
> arch/x86/include/asm/kvm_para.h | 3 +-
> arch/x86/include/asm/paravirt.h | 4 +--
> arch/x86/include/asm/paravirt_types.h | 2 +
> arch/x86/kernel/kvm.c | 8 ++---
> arch/x86/kernel/paravirt.c | 4 +--
> arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
> fs/proc/stat.c | 9 +++++-
> include/linux/kernel_stat.h | 2 +
> include/linux/kvm_host.h | 2 +
> include/uapi/linux/kvm.h | 2 +
> kernel/sched/core.c | 10 ++++++-
> kernel/sched/cputime.c | 21 +++++++++++++-
> kernel/sched/sched.h | 2 +
> virt/kvm/kvm_main.c | 7 +++++
> 15 files changed, 120 insertions(+), 17 deletions(-)
>

2012-11-27 15:10:43

by Michael Wolf

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/27/2012 02:48 AM, Glauber Costa wrote:
> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat. This can
>> cause confusion for the end user. To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time. The host will separate
>> the consigned time from the steal time. The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time. Any other steal time accruing during that period will show as the
>> traditional steal time.
> If you submit this again, please include a version number in your series.
Will do. The patchset was sent twice yesterday by mistake. Got an
error the first time and didn't
think the patches went out. This has been corrected.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
yes, will do that. When I took the RFC off the patches I was looking at
it as a new patchset which was
a mistake. I will make sure to add a changelog when I submit again.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept. If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided. But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this. You
> said, and I quote: "In the case of lpar on POWER systems they simply
> report steal time and do not alter it in any way.
> They do however report how much processor is assigned to the partition
> and that information is in /proc/ppc64/lparcfg."
Yes, but we still get questions from users asking what is steal time?
why am I seeing this?
>
> Now, that is a *way* more sensible thing to do. Much more. "Confusing
> users" is something extremely subjective. This is specially true about
> concepts that are know for quite some time, like steal time. If you out
> of a sudden change the meaning of this, it is sure to confuse a lot more
> users than it would clarify.
Something like this could certainly be done. But when I was submitting
the patch set as
an RFC then qemu was passing a cpu percentage that would be used by the
guest kernel
to adjust the steal time. This percentage was being stored on the guest
as a sysctl value.
Avi stated he didn't like that kind of coupling, and that the value
could get out of sync. Anthony stated "The guest shouldn't need to know
it's entitlement. Or at least, it's up to a management tool to report
that in a way that's meaningful for the guest."

So perhaps I misunderstood what they were suggesting, but I took it to
mean that they did not
want the guest to know what the entitlement was. That the host should
take care of it and just
report the already adjusted data to the guest. So in this version of
the code the host would use a set
period for a timer and be passed essentially a number of ticks of
expected steal time. The host
would then use the timer to break out the steal time into consigned and
steal buckets which would be
reported to the guest.

Both the consigned and the steal would be reported via /proc/stat. So
anyone needing to see total
time away could add the two fields together. The user, however, when
using tools like top or vmstat
would see the usage based on what the guest is entitled to.

Do you have suggestions for how I can build consensus around one of the
two approaches?


>
>
>
>
>> ---
>>
>> Michael Wolf (5):
>> Alter the amount of steal time reported by the guest.
>> Expand the steal time msr to also contain the consigned time.
>> Add the code to send the consigned time from the host to the guest
>> Add a timer to allow the separation of consigned from steal time.
>> Add an ioctl to communicate the consign limit to the host.
>>
>>
>> arch/x86/include/asm/kvm_host.h | 11 +++++++
>> arch/x86/include/asm/kvm_para.h | 3 +-
>> arch/x86/include/asm/paravirt.h | 4 +--
>> arch/x86/include/asm/paravirt_types.h | 2 +
>> arch/x86/kernel/kvm.c | 8 ++---
>> arch/x86/kernel/paravirt.c | 4 +--
>> arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
>> fs/proc/stat.c | 9 +++++-
>> include/linux/kernel_stat.h | 2 +
>> include/linux/kvm_host.h | 2 +
>> include/uapi/linux/kvm.h | 2 +
>> kernel/sched/core.c | 10 ++++++-
>> kernel/sched/cputime.c | 21 +++++++++++++-
>> kernel/sched/sched.h | 2 +
>> virt/kvm/kvm_main.c | 7 +++++
>> 15 files changed, 120 insertions(+), 17 deletions(-)
>>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2012-11-27 21:03:42

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
> Add a consigned field. This field will hold the time lost due to capping or overcommit.
> The rest of the time will still show up in the steal-time field.
>
> Signed-off-by: Michael Wolf <[email protected]>
> ---
> arch/x86/include/asm/paravirt.h | 4 ++--
> arch/x86/include/asm/paravirt_types.h | 2 +-
> arch/x86/kernel/kvm.c | 7 ++-----
> kernel/sched/core.c | 10 +++++++++-
> kernel/sched/cputime.c | 2 +-
> 5 files changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index a0facf3..a5f9f30 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -196,9 +196,9 @@ struct static_key;
> extern struct static_key paravirt_steal_enabled;
> extern struct static_key paravirt_steal_rq_enabled;
>
> -static inline u64 paravirt_steal_clock(int cpu)
> +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)

So its u64 here.
> {
> - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
> + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
> }
>
> static inline unsigned long long paravirt_read_pmc(int counter)
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 142236e..5d4fc8b 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -95,7 +95,7 @@ struct pv_lazy_ops {
>
> struct pv_time_ops {
> unsigned long long (*sched_clock)(void);
> - unsigned long long (*steal_clock)(int cpu);
> + void (*steal_clock)(int cpu, unsigned long long *steal);

But not u64 here? Any particular reason?

2012-11-27 23:25:18

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
> In the case of where you have a system that is running in a
> capped or overcommitted environment the user may see steal time
> being reported in accounting tools such as top or vmstat.

The definition of stolen time is 'time during which the virtual CPU is
runnable to not running'. Overcommit is the main scenario which steal
time helps to detect.

Can you describe the 'capped' case?

> This can
> cause confusion for the end user. To ease the confusion this patch set
> adds the idea of consigned (expected steal) time. The host will separate
> the consigned time from the steal time. The consignment limit passed to the
> host will be the amount of steal time expected within a fixed period of
> time. Any other steal time accruing during that period will show as the
> traditional steal time.
>
> ---
>
> Michael Wolf (5):
> Alter the amount of steal time reported by the guest.
> Expand the steal time msr to also contain the consigned time.
> Add the code to send the consigned time from the host to the guest
> Add a timer to allow the separation of consigned from steal time.
> Add an ioctl to communicate the consign limit to the host.
>
>
> arch/x86/include/asm/kvm_host.h | 11 +++++++
> arch/x86/include/asm/kvm_para.h | 3 +-
> arch/x86/include/asm/paravirt.h | 4 +--
> arch/x86/include/asm/paravirt_types.h | 2 +
> arch/x86/kernel/kvm.c | 8 ++---
> arch/x86/kernel/paravirt.c | 4 +--
> arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
> fs/proc/stat.c | 9 +++++-
> include/linux/kernel_stat.h | 2 +
> include/linux/kvm_host.h | 2 +
> include/uapi/linux/kvm.h | 2 +
> kernel/sched/core.c | 10 ++++++-
> kernel/sched/cputime.c | 21 +++++++++++++-
> kernel/sched/sched.h | 2 +
> virt/kvm/kvm_main.c | 7 +++++
> 15 files changed, 120 insertions(+), 17 deletions(-)
>
> --
> Signature
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-11-28 00:51:52

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On Tue, Nov 27, 2012 at 09:24:42PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
> > In the case of where you have a system that is running in a
> > capped or overcommitted environment the user may see steal time
> > being reported in accounting tools such as top or vmstat.
>
> The definition of stolen time is 'time during which the virtual CPU is
> runnable to not running'. Overcommit is the main scenario which steal
> time helps to detect.

Meant 'runnable but not running'.


> Can you describe the 'capped' case?

2012-11-28 08:45:57

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/27/2012 07:10 PM, Michael Wolf wrote:
> On 11/27/2012 02:48 AM, Glauber Costa wrote:
>> Hi,
>>
>> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>>> In the case of where you have a system that is running in a
>>> capped or overcommitted environment the user may see steal time
>>> being reported in accounting tools such as top or vmstat. This can
>>> cause confusion for the end user. To ease the confusion this patch set
>>> adds the idea of consigned (expected steal) time. The host will
>>> separate
>>> the consigned time from the steal time. The consignment limit passed
>>> to the
>>> host will be the amount of steal time expected within a fixed period of
>>> time. Any other steal time accruing during that period will show as the
>>> traditional steal time.
>> If you submit this again, please include a version number in your series.
> Will do. The patchset was sent twice yesterday by mistake. Got an
> error the first time and didn't
> think the patches went out. This has been corrected.
>>
>> It would also be helpful to include a small changelog about what changed
>> between last version and this version, so we could focus on that.
> yes, will do that. When I took the RFC off the patches I was looking at
> it as a new patchset which was
> a mistake. I will make sure to add a changelog when I submit again.
>>
>> As for the rest, I answered your previous two submissions saying I don't
>> agree with the concept. If you hadn't changed anything, resending it
>> won't change my mind.
>>
>> I could of course, be mistaken or misguided. But I had also not seen any
>> wave of support in favor of this previously, so basically I have no new
>> data to make me believe I should see it any differently.
>>
>> Let's try this again:
>>
>> * Rik asked you in your last submission how does ppc handle this. You
>> said, and I quote: "In the case of lpar on POWER systems they simply
>> report steal time and do not alter it in any way.
>> They do however report how much processor is assigned to the partition
>> and that information is in /proc/ppc64/lparcfg."
> Yes, but we still get questions from users asking what is steal time?
> why am I seeing this?
>>
>> Now, that is a *way* more sensible thing to do. Much more. "Confusing
>> users" is something extremely subjective. This is specially true about
>> concepts that are know for quite some time, like steal time. If you out
>> of a sudden change the meaning of this, it is sure to confuse a lot more
>> users than it would clarify.
> Something like this could certainly be done. But when I was submitting
> the patch set as
> an RFC then qemu was passing a cpu percentage that would be used by the
> guest kernel
> to adjust the steal time. This percentage was being stored on the guest
> as a sysctl value.
> Avi stated he didn't like that kind of coupling, and that the value
> could get out of sync. Anthony stated "The guest shouldn't need to know
> it's entitlement. Or at least, it's up to a management tool to report
> that in a way that's meaningful for the guest."
>
> So perhaps I misunderstood what they were suggesting, but I took it to
> mean that they did not
> want the guest to know what the entitlement was. That the host should
> take care of it and just
> report the already adjusted data to the guest. So in this version of
> the code the host would use a set
> period for a timer and be passed essentially a number of ticks of
> expected steal time. The host
> would then use the timer to break out the steal time into consigned and
> steal buckets which would be
> reported to the guest.
>
> Both the consigned and the steal would be reported via /proc/stat. So
> anyone needing to see total
> time away could add the two fields together. The user, however, when
> using tools like top or vmstat
> would see the usage based on what the guest is entitled to.
>
> Do you have suggestions for how I can build consensus around one of the
> two approaches?
>

Before I answer this, can you please detail which mechanism are you
using to enforce the entitlement? Is it the cgroup cpu controller, or
something else?

2012-11-28 15:23:43

by Michael Wolf

[permalink] [raw]
Subject: Re: [PATCH 2/5] Expand the steal time msr to also contain the consigned time.

On 11/27/2012 03:03 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 26, 2012 at 02:36:45PM -0600, Michael Wolf wrote:
>> Add a consigned field. This field will hold the time lost due to capping or overcommit.
>> The rest of the time will still show up in the steal-time field.
>>
>> Signed-off-by: Michael Wolf <[email protected]>
>> ---
>> arch/x86/include/asm/paravirt.h | 4 ++--
>> arch/x86/include/asm/paravirt_types.h | 2 +-
>> arch/x86/kernel/kvm.c | 7 ++-----
>> kernel/sched/core.c | 10 +++++++++-
>> kernel/sched/cputime.c | 2 +-
>> 5 files changed, 15 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index a0facf3..a5f9f30 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -196,9 +196,9 @@ struct static_key;
>> extern struct static_key paravirt_steal_enabled;
>> extern struct static_key paravirt_steal_rq_enabled;
>>
>> -static inline u64 paravirt_steal_clock(int cpu)
>> +static inline u64 paravirt_steal_clock(int cpu, u64 *steal)
> So its u64 here.
>> {
>> - return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
>> + PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
>> }
>>
>> static inline unsigned long long paravirt_read_pmc(int counter)
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 142236e..5d4fc8b 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -95,7 +95,7 @@ struct pv_lazy_ops {
>>
>> struct pv_time_ops {
>> unsigned long long (*sched_clock)(void);
>> - unsigned long long (*steal_clock)(int cpu);
>> + void (*steal_clock)(int cpu, unsigned long long *steal);
> But not u64 here? Any particular reason?
>
It should be void everywhere, thanks for catching that I will change the
code.

2012-11-28 18:43:34

by Michael Wolf

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat.
> The definition of stolen time is 'time during which the virtual CPU is
> runnable to not running'. Overcommit is the main scenario which steal
> time helps to detect.
>
> Can you describe the 'capped' case?
In the capped case, the time that the guest spends waiting due to it
having used its full allottment of time shows up as steal time. The way
my patchset currently stands is that you would set up the
bandwidth control and you would have to pass it a matching value from
qemu. In the future, it would
be possible to have something parse the bandwidth setting and
automatically adjust the setting in the
host used for steal time reporting.
>
>> This can
>> cause confusion for the end user. To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time. The host will separate
>> the consigned time from the steal time. The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time. Any other steal time accruing during that period will show as the
>> traditional steal time.
>>
>> ---
>>
>> Michael Wolf (5):
>> Alter the amount of steal time reported by the guest.
>> Expand the steal time msr to also contain the consigned time.
>> Add the code to send the consigned time from the host to the guest
>> Add a timer to allow the separation of consigned from steal time.
>> Add an ioctl to communicate the consign limit to the host.
>>
>>
>> arch/x86/include/asm/kvm_host.h | 11 +++++++
>> arch/x86/include/asm/kvm_para.h | 3 +-
>> arch/x86/include/asm/paravirt.h | 4 +--
>> arch/x86/include/asm/paravirt_types.h | 2 +
>> arch/x86/kernel/kvm.c | 8 ++---
>> arch/x86/kernel/paravirt.c | 4 +--
>> arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
>> fs/proc/stat.c | 9 +++++-
>> include/linux/kernel_stat.h | 2 +
>> include/linux/kvm_host.h | 2 +
>> include/uapi/linux/kvm.h | 2 +
>> kernel/sched/core.c | 10 ++++++-
>> kernel/sched/cputime.c | 21 +++++++++++++-
>> kernel/sched/sched.h | 2 +
>> virt/kvm/kvm_main.c | 7 +++++
>> 15 files changed, 120 insertions(+), 17 deletions(-)
>>
>> --
>> Signature
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2012-11-28 18:45:10

by Michael Wolf

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/28/2012 02:45 AM, Glauber Costa wrote:
> On 11/27/2012 07:10 PM, Michael Wolf wrote:
>> On 11/27/2012 02:48 AM, Glauber Costa wrote:
>>> Hi,
>>>
>>> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>>>> In the case of where you have a system that is running in a
>>>> capped or overcommitted environment the user may see steal time
>>>> being reported in accounting tools such as top or vmstat. This can
>>>> cause confusion for the end user. To ease the confusion this patch set
>>>> adds the idea of consigned (expected steal) time. The host will
>>>> separate
>>>> the consigned time from the steal time. The consignment limit passed
>>>> to the
>>>> host will be the amount of steal time expected within a fixed period of
>>>> time. Any other steal time accruing during that period will show as the
>>>> traditional steal time.
>>> If you submit this again, please include a version number in your series.
>> Will do. The patchset was sent twice yesterday by mistake. Got an
>> error the first time and didn't
>> think the patches went out. This has been corrected.
>>> It would also be helpful to include a small changelog about what changed
>>> between last version and this version, so we could focus on that.
>> yes, will do that. When I took the RFC off the patches I was looking at
>> it as a new patchset which was
>> a mistake. I will make sure to add a changelog when I submit again.
>>> As for the rest, I answered your previous two submissions saying I don't
>>> agree with the concept. If you hadn't changed anything, resending it
>>> won't change my mind.
>>>
>>> I could of course, be mistaken or misguided. But I had also not seen any
>>> wave of support in favor of this previously, so basically I have no new
>>> data to make me believe I should see it any differently.
>>>
>>> Let's try this again:
>>>
>>> * Rik asked you in your last submission how does ppc handle this. You
>>> said, and I quote: "In the case of lpar on POWER systems they simply
>>> report steal time and do not alter it in any way.
>>> They do however report how much processor is assigned to the partition
>>> and that information is in /proc/ppc64/lparcfg."
>> Yes, but we still get questions from users asking what is steal time?
>> why am I seeing this?
>>> Now, that is a *way* more sensible thing to do. Much more. "Confusing
>>> users" is something extremely subjective. This is specially true about
>>> concepts that are know for quite some time, like steal time. If you out
>>> of a sudden change the meaning of this, it is sure to confuse a lot more
>>> users than it would clarify.
>> Something like this could certainly be done. But when I was submitting
>> the patch set as
>> an RFC then qemu was passing a cpu percentage that would be used by the
>> guest kernel
>> to adjust the steal time. This percentage was being stored on the guest
>> as a sysctl value.
>> Avi stated he didn't like that kind of coupling, and that the value
>> could get out of sync. Anthony stated "The guest shouldn't need to know
>> it's entitlement. Or at least, it's up to a management tool to report
>> that in a way that's meaningful for the guest."
>>
>> So perhaps I misunderstood what they were suggesting, but I took it to
>> mean that they did not
>> want the guest to know what the entitlement was. That the host should
>> take care of it and just
>> report the already adjusted data to the guest. So in this version of
>> the code the host would use a set
>> period for a timer and be passed essentially a number of ticks of
>> expected steal time. The host
>> would then use the timer to break out the steal time into consigned and
>> steal buckets which would be
>> reported to the guest.
>>
>> Both the consigned and the steal would be reported via /proc/stat. So
>> anyone needing to see total
>> time away could add the two fields together. The user, however, when
>> using tools like top or vmstat
>> would see the usage based on what the guest is entitled to.
>>
>> Do you have suggestions for how I can build consensus around one of the
>> two approaches?
>>
> Before I answer this, can you please detail which mechanism are you
> using to enforce the entitlement? Is it the cgroup cpu controller, or
> something else?
It is setup using cpu overcommit. But the request was for something
that would work in both
the overcommit environment as well as when hard capping is being used.

>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2012-11-28 19:16:28

by Anthony Liguori

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

Glauber Costa <[email protected]> writes:

> Hi,
>
> On 11/27/2012 12:36 AM, Michael Wolf wrote:
>> In the case of where you have a system that is running in a
>> capped or overcommitted environment the user may see steal time
>> being reported in accounting tools such as top or vmstat. This can
>> cause confusion for the end user. To ease the confusion this patch set
>> adds the idea of consigned (expected steal) time. The host will separate
>> the consigned time from the steal time. The consignment limit passed to the
>> host will be the amount of steal time expected within a fixed period of
>> time. Any other steal time accruing during that period will show as the
>> traditional steal time.
>
> If you submit this again, please include a version number in your series.
>
> It would also be helpful to include a small changelog about what changed
> between last version and this version, so we could focus on that.
>
> As for the rest, I answered your previous two submissions saying I don't
> agree with the concept. If you hadn't changed anything, resending it
> won't change my mind.
>
> I could of course, be mistaken or misguided. But I had also not seen any
> wave of support in favor of this previously, so basically I have no new
> data to make me believe I should see it any differently.
>
> Let's try this again:
>
> * Rik asked you in your last submission how does ppc handle this. You
> said, and I quote: "In the case of lpar on POWER systems they simply
> report steal time and do not alter it in any way.
> They do however report how much processor is assigned to the partition
> and that information is in /proc/ppc64/lparcfg."

This only is helpful for static entitlements.

But if we allow dynamic entitlements--which is a very useful feature,
think buying an online "upgrade" in a cloud environment--then you need
to account for entitlement loss at the same place where you do the rest
of the accounting: in /proc/stat.

> Now, that is a *way* more sensible thing to do. Much more. "Confusing
> users" is something extremely subjective. This is specially true about
> concepts that are know for quite some time, like steal time. If you out
> of a sudden change the meaning of this, it is sure to confuse a lot more
> users than it would clarify.

I'll bring you a nice bottle of scotch at the next KVM Forum if you can
find me one user that can accurately describe what steal time is.

The semantics are so incredibly subtle that I have a hard time believing
anyone actually understands what it means today.

Regards,

Anthony Liguori
>
>
>
>
>
>>
>> ---
>>
>> Michael Wolf (5):
>> Alter the amount of steal time reported by the guest.
>> Expand the steal time msr to also contain the consigned time.
>> Add the code to send the consigned time from the host to the guest
>> Add a timer to allow the separation of consigned from steal time.
>> Add an ioctl to communicate the consign limit to the host.
>>
>>
>> arch/x86/include/asm/kvm_host.h | 11 +++++++
>> arch/x86/include/asm/kvm_para.h | 3 +-
>> arch/x86/include/asm/paravirt.h | 4 +--
>> arch/x86/include/asm/paravirt_types.h | 2 +
>> arch/x86/kernel/kvm.c | 8 ++---
>> arch/x86/kernel/paravirt.c | 4 +--
>> arch/x86/kvm/x86.c | 50 ++++++++++++++++++++++++++++++++-
>> fs/proc/stat.c | 9 +++++-
>> include/linux/kernel_stat.h | 2 +
>> include/linux/kvm_host.h | 2 +
>> include/uapi/linux/kvm.h | 2 +
>> kernel/sched/core.c | 10 ++++++-
>> kernel/sched/cputime.c | 21 +++++++++++++-
>> kernel/sched/sched.h | 2 +
>> virt/kvm/kvm_main.c | 7 +++++
>> 15 files changed, 120 insertions(+), 17 deletions(-)
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-11-28 20:55:47

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/28/2012 10:43 PM, Michael Wolf wrote:
> On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
>> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>>> In the case of where you have a system that is running in a
>>> capped or overcommitted environment the user may see steal time
>>> being reported in accounting tools such as top or vmstat.
>> The definition of stolen time is 'time during which the virtual CPU is
>> runnable to not running'. Overcommit is the main scenario which steal
>> time helps to detect.
>>
>> Can you describe the 'capped' case?
> In the capped case, the time that the guest spends waiting due to it
> having used its full allottment of time shows up as steal time. The way
> my patchset currently stands is that you would set up the
> bandwidth control and you would have to pass it a matching value from
> qemu. In the future, it would
> be possible to have something parse the bandwidth setting and
> automatically adjust the setting in the
> host used for steal time reporting.

Ok, so correct me if I am wrong, but I believe you would be using
something like the bandwidth capper in the cpu cgroup to set those
entitlements, right?

Some time has passed since I last looked into it, but IIRC, after you
get are out of your quota, you should be out of the runqueue. In the
lovely world of KVM, we approximate steal time as runqueue time:

arch/x86/kvm/x86.c:
delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
vcpu->arch.st.last_steal = current->sched_info.run_delay;
vcpu->arch.st.accum_steal = delta;

include/linux/sched.h:
unsigned long long run_delay; /* time spent waiting on a runqueue */

So if you are out of the runqueue, you won't get steal time accounted,
and then I truly fail to understand what you are doing.

In case I am wrong, and run_delay also includes the time you can't run
because you are out of capacity, then maybe what we should do, is to
just subtract it from run_delay in kvm/x86.c before we pass it on. In
summary:


>>> Alter the amount of steal time reported by the guest.
Maybe this should go away.

>>> Expand the steal time msr to also contain the consigned time.
Maybe this should go away

>>> Add the code to send the consigned time from the host to the
>>> guest
This definitely should be heavily modified

>>> Add a timer to allow the separation of consigned from steal time.
Maybe this should go away

>>> Add an ioctl to communicate the consign limit to the host.
This definitely should go away.

More specifically, *whatever* way we use to cap the processor, the host
system will have all the information at all times.


2012-11-29 17:43:54

by Michael Wolf

[permalink] [raw]
Subject: Re: [PATCH 0/5] Alter steal time reporting in KVM

On 11/28/2012 02:55 PM, Glauber Costa wrote:
> On 11/28/2012 10:43 PM, Michael Wolf wrote:
>> On 11/27/2012 05:24 PM, Marcelo Tosatti wrote:
>>> On Mon, Nov 26, 2012 at 02:36:24PM -0600, Michael Wolf wrote:
>>>> In the case of where you have a system that is running in a
>>>> capped or overcommitted environment the user may see steal time
>>>> being reported in accounting tools such as top or vmstat.
>>> The definition of stolen time is 'time during which the virtual CPU is
>>> runnable to not running'. Overcommit is the main scenario which steal
>>> time helps to detect.
>>>
>>> Can you describe the 'capped' case?
>> In the capped case, the time that the guest spends waiting due to it
>> having used its full allottment of time shows up as steal time. The way
>> my patchset currently stands is that you would set up the
>> bandwidth control and you would have to pass it a matching value from
>> qemu. In the future, it would
>> be possible to have something parse the bandwidth setting and
>> automatically adjust the setting in the
>> host used for steal time reporting.
> Ok, so correct me if I am wrong, but I believe you would be using
> something like the bandwidth capper in the cpu cgroup to set those
> entitlements, right?
Yes, in the context above I'm referring to the cfs bandwidth control.
>
> Some time has passed since I last looked into it, but IIRC, after you
> get are out of your quota, you should be out of the runqueue. In the
> lovely world of KVM, we approximate steal time as runqueue time:
>
> arch/x86/kvm/x86.c:
> delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
> vcpu->arch.st.last_steal = current->sched_info.run_delay;
> vcpu->arch.st.accum_steal = delta;
>
> include/linux/sched.h:
> unsigned long long run_delay; /* time spent waiting on a runqueue */
>
> So if you are out of the runqueue, you won't get steal time accounted,
> and then I truly fail to understand what you are doing.
So I looked at something like this in the past. To make sure things
haven't changed
I set up a cgroup on my test server running a kernel built from the
latest tip tree.

[root]# cat cpu.cfs_quota_us
50000
[root]# cat cpu.cfs_period_us
100000
[root]# cat cpuset.cpus
1
[root]# cat cpuset.mems
0

Next I put the PID from the cpu thread into tasks. When I start a
script that will hog the cpu I see the
following in top on the guest
Cpu(s): 1.9%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 48.3%hi, 0.0%si,
49.8%st

So the steal time here is in line with the bandwidth control settings.
>
> In case I am wrong, and run_delay also includes the time you can't run
> because you are out of capacity, then maybe what we should do, is to
> just subtract it from run_delay in kvm/x86.c before we pass it on. In
> summary:
About a year ago I was playing with this patch. It is out of date now
but will give you
an idea of what I was looking at.

kernel/sched_fair.c | 4 ++--
kernel/sched_stats.h | 7 ++++++-
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5c9e679..a837e4e 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -707,7 +707,7 @@ account_entity_dequeue(struct cfs_rq *cfs_rq,
struct sched_entity *se)

#ifdef CONFIG_FAIR_GROUP_SCHED
/* we need this in update_cfs_load and load-balance functions below */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
# ifdef CONFIG_SMP
static void update_cfs_rq_load_contribution(struct cfs_rq *cfs_rq,
int global_update)
@@ -1420,7 +1420,7 @@ static inline int cfs_rq_throttled(struct
cfs_rq *cfs_rq)
}

/* check whether cfs_rq, or any parent, is throttled */
-static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
+inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
return cfs_rq->throttle_count;
}
diff --git a/kernel/sched_stats.h b/kernel/sched_stats.h
index 87f9e36..e30ff26 100644
--- a/kernel/sched_stats.h
+++ b/kernel/sched_stats.h
@@ -213,14 +213,19 @@ static inline void sched_info_queued(struct
task_struct *t)
* sched_info_queued() to mark that it has now again started waiting on
* the runqueue.
*/
+extern inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
static inline void sched_info_depart(struct task_struct *t)
{
+ struct task_group *tg = task_group(t);
+ struct cfs_rq *cfs_rq;
unsigned long long delta = task_rq(t)->clock -
t->sched_info.last_arrival;

+ cfs_rq = tg->cfs_rq[smp_processor_id()];
rq_sched_info_depart(task_rq(t), delta);

- if (t->state == TASK_RUNNING)
+
+ if (t->state == TASK_RUNNING && !throttled_hierarchy(cfs_rq))
sched_info_queued(t);
}


So then the steal time did not show on the guest. You have no value
that needs to be passed
around. What I did not like about this approach was
* only works for cfs bandwidth control. If another type of hard limit
was added to the kernel
the code would potentially need to change.
* This approach doesn't help if the limits are set by overcommitting the
cpus. It is my understanding
that this is a common approach.

>
>>>> Alter the amount of steal time reported by the guest.
> Maybe this should go away.
>
>>>> Expand the steal time msr to also contain the consigned time.
> Maybe this should go away
>
>>>> Add the code to send the consigned time from the host to the
>>>> guest
> This definitely should be heavily modified
>
>>>> Add a timer to allow the separation of consigned from steal time.
> Maybe this should go away
>
>>>> Add an ioctl to communicate the consign limit to the host.
> This definitely should go away.
>
> More specifically, *whatever* way we use to cap the processor, the host
> system will have all the information at all times.
I'm not understanding that comment. If you are capping by simply
controlling the amount of
overcommit on the host then wouldn't you still need some value to
indicate the desired amount.
If another guest was inadvertently started or something else on the host
started taking more
processor than expected, you would need to report the steal time.


>
>
>