2012-08-23 23:14:30

by Michael Wolf

Subject: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

This is an RFC regarding the reporting of steal time. When a guest
system is running on partial processors, as it may be under KVM, the
user may see steal time reported in accounting tools such as top or
vmstat. This can confuse the end user. To ease the confusion, this patch
set adds a sysctl interface to set the cpu entitlement: the percentage
of cpu that the guest system is expected to receive. As long as the
steal time is within its expected range, it will show up as 0 in
/proc/stat. The user will then see in the accounting tools that they are
getting full utilization of the cpu resources assigned to them.

This patchset is changing the contents/output of /proc/stat and could affect
user tools. However the default setting is that the cpu is entitled to 100%
so the code will act as before. Also another field could be added to the
/proc/stat output and show the unaltered steal time. Since this additional
field could cause more confusion than it would clear up I have left it out
for now.

Michael Wolf (3):
Add a sysctl interface to control the cpu entitlement setting.
Add a hypercall to retrieve the cpu entitlement value from the host.
Modify the amount of stealtime that the kernel reports via the /proc interface.

---
Documentation/sysctl/kernel.txt | 14 +++++++
arch/x86/kvm/x86.c | 8 ++++
fs/proc/stat.c | 80 ++++++++++++++++++++++++++++++++++++++-
include/linux/kernel_stat.h | 2 +
include/linux/kvm.h | 3 ++
include/linux/kvm_host.h | 3 ++
include/linux/kvm_para.h | 1 +
kernel/sysctl.c | 10 +++++
virt/kvm/kvm_main.c | 7 ++++
9 files changed, 126 insertions(+), 2 deletions(-)


2012-08-23 23:14:47

by Michael Wolf

Subject: [PATCH RFC 1/3] Add a sysctl interface to control and report the cpu entitlement setting.

This setting will later be used to compute an expected steal time.

Signed-off-by: Michael Wolf <[email protected]>
---
Documentation/sysctl/kernel.txt | 14 ++++++++++++++
fs/proc/stat.c | 1 +
kernel/sysctl.c | 10 ++++++++++
3 files changed, 25 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 6d78841..0f617dc 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -28,6 +28,7 @@ show up in /proc/sys/kernel:
- core_pattern
- core_pipe_limit
- core_uses_pid
+- cpu_entitlement
- ctrl-alt-del
- dmesg_restrict
- domainname
@@ -226,6 +227,19 @@ the filename.

==============================================================

+cpu_entitlement:
+
+The cpu_entitlement is the percentage of cpu utilization that
+the system expects to receive. By default this is set to 100;
+in a guest system it can be set to a value between 0 and 100.
+This value is used to adjust the amount of steal time that the
+process accounting code in the guest displays. The end effect
+is that steal time is only reported when the percentage of
+steal time is greater than 100 - cpu_entitlement, and then only
+the excess above that threshold is shown.
+
+==============================================================
+
ctrl-alt-del:

When the value in this file is 0, ctrl-alt-del is trapped and
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 64c3b31..14e26c8 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -12,6 +12,7 @@
#include <asm/cputime.h>
#include <linux/tick.h>

+int cpu_entitlement = 100;
#ifndef arch_irq_stat_cpu
#define arch_irq_stat_cpu(cpu) 0
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87174ef..85efbc2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -109,6 +109,7 @@ extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
extern int sysctl_nr_open_min, sysctl_nr_open_max;
+extern int cpu_entitlement;
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
@@ -673,6 +674,15 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "cpu_entitlement",
+ .data = &cpu_entitlement,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#if defined CONFIG_PRINTK
{
.procname = "printk",

2012-08-23 23:14:56

by Michael Wolf

Subject: [PATCH RFC 3/3] Modify the amount of stealtime that the kernel reports via the /proc interface.

Steal time will be adjusted based on the cpu entitlement setting. The user
supplies cpu_entitlement, the percentage of cpu the guest can expect to
receive. The expected steal time is based on the expected steal percentage,
which is 100 - cpu_entitlement. If steal_time is less than the expected
steal time, the reported steal_time is changed to 0; no other fields are
changed. If steal_time is greater than expected_steal, only the difference
is reported. By default cpu_entitlement is 100% and steal time is reported
without any modification.

Signed-off-by: Michael Wolf <[email protected]>
---
fs/proc/stat.c | 70 ++++++++++++++++++++++++++++++++++++++++++-
include/linux/kernel_stat.h | 2 +
2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index cf66665..efbaa03 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -73,6 +73,68 @@ static u64 get_iowait_time(int cpu)

#endif

+/*
+ * This function will alter the steal time value that is written out
+ * to /proc/stat. The cpu_entitlement is set by the user/admin and is
+ * meant to reflect the percentage of the processor that is expected to
+ * be used. So as long as the amount of steal time is less than the
+ * expected steal time (based on cpu_entitlement) then report steal time
+ * as zero.
+ */
+static void kstat_adjust_steal_time(int currcpu)
+{
+ int j;
+ u64 cpustat_delta[NR_STATS];
+ u64 total_elapsed_time;
+ int expected_steal_pct;
+ u64 expected_steal;
+ u64 *currstat, *prevstat;
+
+ /*
+ * if cpu_entitlement = 100% then the expected steal time is 0
+ * so we don't need to do any adjustments to the fields.
+ */
+ if (cpu_entitlement == 100) {
+ kcpustat_cpu(currcpu).cpustat[CPUTIME_ADJ_STEAL] =
+ kcpustat_cpu(currcpu).cpustat[CPUTIME_STEAL];
+ return;
+ }
+ /*
+ * For the user it is more intuitive to think in terms of
+ * cpu entitlement. To do the calculations it is easier to
+ * think in terms of allowed steal time. So convert the percentage
+ * from cpu_entitlement to expected_steal_pct.
+ */
+ expected_steal_pct = 100 - cpu_entitlement;
+
+ total_elapsed_time = 0;
+ /* determine the total time elapsed between calls */
+ currstat = kcpustat_cpu(currcpu).cpustat;
+ prevstat = kcpustat_cpu(currcpu).prev_cpustat;
+ for (j = CPUTIME_USER; j < CPUTIME_GUEST; j++) {
+ cpustat_delta[j] = currstat[j] - prevstat[j];
+ prevstat[j] = currstat[j];
+ total_elapsed_time = total_elapsed_time + cpustat_delta[j];
+ }
+
+ /*
+ * calculate the amount of expected steal time. Add 50 (half the
+ * divisor) so that the division by 100 rounds to nearest.
+ */
+
+ expected_steal = (total_elapsed_time * expected_steal_pct + 50) / 100;
+ if (cpustat_delta[CPUTIME_STEAL] < expected_steal)
+ cpustat_delta[CPUTIME_STEAL] = 0;
+ else
+ cpustat_delta[CPUTIME_STEAL] -= expected_steal;
+
+ /* Adjust the steal time accordingly */
+ currstat[CPUTIME_ADJ_STEAL] = prevstat[CPUTIME_ADJ_STEAL]
+ + cpustat_delta[CPUTIME_STEAL];
+ prevstat[CPUTIME_ADJ_STEAL] = currstat[CPUTIME_ADJ_STEAL];
+}
+
+
static int show_stat(struct seq_file *p, void *v)
{
int i, j;
@@ -90,7 +152,11 @@ static int show_stat(struct seq_file *p, void *v)
getboottime(&boottime);
jif = boottime.tv_sec;

+
for_each_possible_cpu(i) {
+ /* adjust the steal time based on the processor entitlement */
+ kstat_adjust_steal_time(i);
+
user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
system += kcpustat_cpu(i).cpustat[CPUTIME_SYSTEM];
@@ -98,7 +164,7 @@ static int show_stat(struct seq_file *p, void *v)
iowait += get_iowait_time(i);
irq += kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
softirq += kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
- steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+ steal += kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL];
guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
sum += kstat_cpu_irqs_sum(i);
@@ -135,7 +201,7 @@ static int show_stat(struct seq_file *p, void *v)
iowait = get_iowait_time(i);
irq = kcpustat_cpu(i).cpustat[CPUTIME_IRQ];
softirq = kcpustat_cpu(i).cpustat[CPUTIME_SOFTIRQ];
- steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
+ steal = kcpustat_cpu(i).cpustat[CPUTIME_ADJ_STEAL];
guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
seq_printf(p, "cpu%d", i);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index bbe5d15..a4f6d1c 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -27,11 +27,13 @@ enum cpu_usage_stat {
CPUTIME_STEAL,
CPUTIME_GUEST,
CPUTIME_GUEST_NICE,
+ CPUTIME_ADJ_STEAL,
NR_STATS,
};

struct kernel_cpustat {
u64 cpustat[NR_STATS];
+ u64 prev_cpustat[NR_STATS];
};

struct kernel_stat {

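The adjustment that kstat_adjust_steal_time() applies to each CPU can be examined on its own. Below is a minimal standalone sketch, and it is not part of the patch set. adjusted_steal_delta() is a hypothetical name. The sketch takes the per-interval deltas as plain arguments instead of reading kernel_cpustat. It assumes round-to-nearest is intended and adds half the divisor (50) before dividing by 100.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical standalone version of the per-interval arithmetic in
 * kstat_adjust_steal_time(), detached from the per-cpu kernel data.
 *
 * elapsed:     sum of the cputime deltas for the interval (user through
 *              steal, i.e. excluding the guest fields)
 * steal:       raw steal time accumulated during the same interval
 * entitlement: cpu_entitlement in percent (0..100)
 *
 * Returns the steal time that would be reported for the interval.
 */
static uint64_t adjusted_steal_delta(uint64_t elapsed, uint64_t steal,
				     int entitlement)
{
	uint64_t expected_steal;

	assert(entitlement >= 0 && entitlement <= 100);

	/* 100% entitlement: no steal is expected, report it unchanged. */
	if (entitlement == 100)
		return steal;

	/* Expected steal for the interval, rounded to nearest. */
	expected_steal = (elapsed * (uint64_t)(100 - entitlement) + 50) / 100;

	/* Steal time within the expected amount is hidden ... */
	if (steal < expected_steal)
		return 0;
	/* ... and only the excess is reported. */
	return steal - expected_steal;
}
```

Take cpu_entitlement set to 75 with 1000 ticks elapsed in an interval. The expected steal is 250 ticks. A raw steal of 200 ticks is then reported as 0, and a raw steal of 300 ticks is reported as 50.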
2012-08-23 23:15:41

by Michael Wolf

Subject: [PATCH RFC 2/3] Add a hypercall to retrieve the cpu entitlement value from the host.

If the hypercall is not implemented on the host, a default value of
100 will be used. This value will be stored in /proc/sys/kernel/cpu_entitlement.

Signed-off-by: Michael Wolf <[email protected]>
---
arch/x86/kvm/x86.c | 8 ++++++++
fs/proc/stat.c | 9 +++++++++
include/linux/kvm.h | 3 +++
include/linux/kvm_host.h | 3 +++
include/linux/kvm_para.h | 1 +
virt/kvm/kvm_main.c | 7 +++++++
6 files changed, 31 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42bce48..734bc3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5052,6 +5052,9 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
case KVM_HC_VAPIC_POLL_IRQ:
ret = 0;
break;
+ case KVM_HC_CPU_ENTITLEMENT:
+ ret = vcpu->kvm->vcpu_entitlement;
+ break;
default:
ret = -KVM_ENOSYS;
break;
@@ -5897,6 +5900,11 @@ int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
return 0;
}

+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm, long entitlement)
+{
+ kvm->vcpu_entitlement = (int) entitlement;
+ return 0;
+}
int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
struct i387_fxsave_struct *fxsave =
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 14e26c8..cf66665 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -11,6 +11,7 @@
#include <linux/irqnr.h>
#include <asm/cputime.h>
#include <linux/tick.h>
+#include <linux/kvm_para.h>

int cpu_entitlement = 100;
#ifndef arch_irq_stat_cpu
@@ -213,6 +214,14 @@ static const struct file_operations proc_stat_operations = {

static int __init proc_stat_init(void)
{
+ long ret;
+
+ if (kvm_para_available()) {
+ ret = kvm_hypercall0(KVM_HC_CPU_ENTITLEMENT);
+ if (ret > 0)
+ cpu_entitlement = (int) ret;
+ }
+
proc_create("stat", 0, NULL, &proc_stat_operations);
return 0;
}
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 2ce09aa..fccd08e 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -904,6 +904,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg)
/* VM is being stopped by host */
#define KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad)
+/* Set the cpu entitlement this will be used to adjust stealtime reporting */
+#define KVM_SET_ENTITLEMENT _IOW(KVMIO, 0xae, unsigned long)
+

#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..71e3d73 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -276,6 +276,7 @@ struct kvm {
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
int last_boosted_vcpu;
+ int vcpu_entitlement;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
@@ -538,6 +539,8 @@ void kvm_arch_hardware_unsetup(void);
void kvm_arch_check_processor_compat(void *rtn);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+int kvm_arch_vcpu_ioctl_set_entitlement(struct kvm *kvm,
+ long entitlement);

void kvm_free_physmem(struct kvm *kvm);

diff --git a/include/linux/kvm_para.h b/include/linux/kvm_para.h
index ff476dd..95f3387 100644
--- a/include/linux/kvm_para.h
+++ b/include/linux/kvm_para.h
@@ -19,6 +19,7 @@
#define KVM_HC_MMU_OP 2
#define KVM_HC_FEATURES 3
#define KVM_HC_PPC_MAP_MAGIC_PAGE 4
+#define KVM_HC_CPU_ENTITLEMENT 5

/*
* hypercalls use architecture specific
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2468523..a0a4939 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2093,6 +2093,13 @@ static long kvm_vm_ioctl(struct file *filp,
break;
}
#endif
+ case KVM_SET_ENTITLEMENT: {
+ r = kvm_arch_vcpu_ioctl_set_entitlement(kvm, arg);
+ if (r)
+ goto out;
+ r = 0;
+ break;
+ }
default:
r = kvm_arch_vm_ioctl(filp, ioctl, arg);
if (r == -ENOTTY)

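The guest-side fallback that patch 2/3 adds to proc_stat_init() can be summarized as a small pure function. This is a sketch under stated assumptions, not code from the patch. entitlement_from_hypercall() and DEFAULT_CPU_ENTITLEMENT are hypothetical names, and 'ret' stands for the value the KVM_HC_CPU_ENTITLEMENT hypercall returns. The sketch does not model the case where kvm_para_available() is false. Unlike the patch, it also rejects values above 100.

```c
/* Hypothetical name for the default used when the host gives no value. */
#define DEFAULT_CPU_ENTITLEMENT 100

/*
 * Sketch of the fallback logic: 'ret' is the hypercall's return value.
 * A host that does not know the hypercall returns a negative error
 * (-KVM_ENOSYS). Assuming vcpu_entitlement starts out as 0, a host whose
 * VM never received KVM_SET_ENTITLEMENT returns 0. Both cases fall back
 * to the default. Values above 100 are treated as invalid here, which is
 * an extra check the patch itself does not make.
 */
static int entitlement_from_hypercall(long ret)
{
	if (ret <= 0 || ret > 100)
		return DEFAULT_CPU_ENTITLEMENT;
	return (int)ret;
}
```

Both this sketch and the patch treat 0 as "not set". As a result, an entitlement of 0 cannot be delivered through the hypercall, even though the sysctl accepts it.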
2012-08-24 04:56:45

by Glauber Costa

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/24/2012 03:14 AM, Michael Wolf wrote:
> This is an RFC regarding the reporting of steal time. When a guest
> system is running on partial processors, as it may be under KVM, the
> user may see steal time reported in accounting tools such as top or
> vmstat. This can confuse the end user. To ease the confusion, this patch
> set adds a sysctl interface to set the cpu entitlement: the percentage
> of cpu that the guest system is expected to receive. As long as the
> steal time is within its expected range, it will show up as 0 in
> /proc/stat. The user will then see in the accounting tools that they are
> getting full utilization of the cpu resources assigned to them.
>

And how is such a knob not confusing?

Steal time is pretty well defined in meaning and has been shown in top for
ages. I really don't see the point for this.

2012-08-24 15:11:25

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
> And how is such a knob not confusing?
>
> Steal time is pretty well defined in meaning and has been shown in top for
> ages. I really don't see the point for this.

Currently you can see the steal time but you have no way of knowing if
the cpu utilization you are seeing on the guest is the expected amount.
I decided on making it a knob because a guest could be migrated to
another system and its entitlement could change because of hardware or
load differences. It could simply be a /proc file and report the
current entitlement if needed. As things are currently implemented I
don't see how someone knows if the guest is running as expected or
whether there is a problem.

2012-08-25 23:39:42

by Glauber Costa

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/24/2012 11:11 AM, Michael Wolf wrote:
> On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
>> And how is such a knob not confusing?
>>
>> Steal time is pretty well defined in meaning and has been shown in top for
>> ages. I really don't see the point for this.
>
> Currently you can see the steal time but you have no way of knowing if
> the cpu utilization you are seeing on the guest is the expected amount.
> I decided on making it a knob because a guest could be migrated to
> another system and its entitlement could change because of hardware or
> load differences. It could simply be a /proc file and report the
> current entitlement if needed. As things are currently implemented I
> don't see how someone knows if the guest is running as expected or
> whether there is a problem.
>

Turning off steal time display won't get even close to displaying the
information you want. What you probably want is a guest-visible way to
say how many milliseconds you are expected to run each second. Right?

2012-08-27 15:51:26

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
> On 08/24/2012 11:11 AM, Michael Wolf wrote:
> > Currently you can see the steal time but you have no way of knowing if
> > the cpu utilization you are seeing on the guest is the expected amount.
> > I decided on making it a knob because a guest could be migrated to
> > another system and its entitlement could change because of hardware or
> > load differences. It could simply be a /proc file and report the
> > current entitlement if needed. As things are currently implemented I
> > don't see how someone knows if the guest is running as expected or
> > whether there is a problem.
> >
>
> Turning off steal time display won't get even close to displaying the
> information you want. What you probably want is a guest-visible way to
> say how many milliseconds you are expected to run each second. Right?

It is not clear to me how knowing how many milliseconds you are
expecting to run will help the user. Currently the users will run top
to see how well the guest is running. If they see _any_ steal time some
users think they are not getting the full use of their processor
entitlement.

Maybe I'm missing what you are proposing, but even if you knew the
milliseconds that you were expecting for your processor, you would have
to adjust the top output in your head, so to speak. You would see the
utilization and then say 'ok, that matches the number of milliseconds I
expected to run...'. If we take away the steal time (as long as it is
equal to or less than the expected amount of steal time), then the user
running top will see 100% utilization.


2012-08-27 18:53:40

by Glauber Costa

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/27/2012 08:50 AM, Michael Wolf wrote:
> On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
>> Turning off steal time display won't get even close to displaying the
>> information you want. What you probably want is a guest-visible way to
>> say how many milliseconds you are expected to run each second. Right?
>
> It is not clear to me how knowing how many milliseconds you are
> expecting to run will help the user. Currently the users will run top
> to see how well the guest is running. If they see _any_ steal time some
> users think they are not getting the full use of their processor
> entitlement.
>

And your plan is just to selectively lie about it, by disabling it with
a knob?


2012-08-27 18:55:33

by Avi Kivity

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/23/2012 04:14 PM, Michael Wolf wrote:
> This is an RFC regarding the reporting of steal time. When a guest
> system is running on partial processors, as it may be under KVM, the
> user may see steal time reported in accounting tools such as top or
> vmstat. This can confuse the end user. To ease the confusion, this patch
> set adds a sysctl interface to set the cpu entitlement: the percentage
> of cpu that the guest system is expected to receive. As long as the
> steal time is within its expected range, it will show up as 0 in
> /proc/stat. The user will then see in the accounting tools that they are
> getting full utilization of the cpu resources assigned to them.
>
> This patchset is changing the contents/output of /proc/stat and could affect
> user tools. However the default setting is that the cpu is entitled to 100%
> so the code will act as before. Also another field could be added to the
> /proc/stat output and show the unaltered steal time. Since this additional
> field could cause more confusion than it would clear up I have left it out
> for now.
>

How would a guest know what its entitlement is?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2012-08-27 20:19:33

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote:
> On 08/27/2012 08:50 AM, Michael Wolf wrote:
> > It is not clear to me how knowing how many milliseconds you are
> > expecting to run will help the user. Currently the users will run top
> > to see how well the guest is running. If they see _any_ steal time some
> > users think they are not getting the full use of their processor
> > entitlement.
> >
>
> And your plan is just to selectively lie about it, by disabling it with
> a knob?

It is about making it very obvious to the end user whether they are
receiving their cpu entitlement. If there is more steal time than
expected, that will still show up. I have experimented with putting the
raw steal time at the end of each cpu line in /proc/stat, and it seems
to work. That way the raw data is there as well.

Do you have another suggestion to communicate to the user whether they
are receiving their full entitlement? At the very least shouldn't the
entitlement reside in a /proc file somewhere so that the user could look
up the value and "do the math"?



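The "do the math" check raised above can be made concrete without any kernel change. A user who knows the entitlement can compare it against the raw steal time taken from two samples of the aggregate cpu line in /proc/stat. The sketch below is illustrative only, and the helper names are hypothetical. It assumes the usual field order of that line: user, nice, system, idle, iowait, irq, softirq, steal, all counted in USER_HZ ticks.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse the first eight counters of the aggregate
 * "cpu" line of /proc/stat (user nice system idle iowait irq softirq
 * steal). guest/guest_nice follow but are already counted in user/nice.
 * Returns 0 on success, -1 if the line could not be parsed.
 */
static int parse_cpu_line(const char *line, unsigned long long f[8])
{
	int n = sscanf(line, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
		       &f[0], &f[1], &f[2], &f[3], &f[4], &f[5], &f[6], &f[7]);

	return n == 8 ? 0 : -1;
}

/*
 * Raw steal time between two samples, as a percentage of all time
 * elapsed in that interval. Returns -1.0 if either sample is malformed.
 */
static double steal_pct(const char *before, const char *after)
{
	unsigned long long a[8], b[8], total = 0;
	int i;

	if (parse_cpu_line(before, a) || parse_cpu_line(after, b))
		return -1.0;
	for (i = 0; i < 8; i++)
		total += b[i] - a[i];
	if (total == 0)
		return 0.0;
	return 100.0 * (double)(b[7] - a[7]) / (double)total;
}

/* Nonzero if steal stayed within the (100 - entitlement) allowance. */
static int within_entitlement(double steal, int entitlement)
{
	return steal <= 100.0 - entitlement;
}
```

In practice, the two strings would be the first line of /proc/stat read a few seconds apart.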
2012-08-27 20:23:17

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Mon, 2012-08-27 at 11:55 -0700, Avi Kivity wrote:
> How would a guest know what its entitlement is?
>
>

Currently the Admin/management tool setting up the guests will put it on
the qemu commandline. From this it is passed via an ioctl to the host.
The guest will get the value from the host via a hypercall.

In the future the host could try and do some of it automatically in some
cases.

2012-08-27 20:31:44

by Avi Kivity

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/27/2012 01:23 PM, Michael Wolf wrote:
> >
> > How would a guest know what its entitlement is?
> >
> >
>
> Currently the Admin/management tool setting up the guests will put it on
> the qemu commandline. From this it is passed via an ioctl to the host.
> The guest will get the value from the host via a hypercall.
>
> In the future the host could try and do some of it automatically in some
> cases.

Seems to me it's a meaningless value for the guest. Suppose it is
migrated to a host that is more powerful, and as a result its relative
entitlement is reduced. The value needs to be adjusted.

This is best taken care of from the host side.

2012-08-27 20:55:11

by Glauber Costa

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/27/2012 01:19 PM, Michael Wolf wrote:
> On Mon, 2012-08-27 at 11:50 -0700, Glauber Costa wrote:
>> On 08/27/2012 08:50 AM, Michael Wolf wrote:
>>> On Sat, 2012-08-25 at 19:36 -0400, Glauber Costa wrote:
>>>> On 08/24/2012 11:11 AM, Michael Wolf wrote:
>>>>> On Fri, 2012-08-24 at 08:53 +0400, Glauber Costa wrote:
>>>>>> On 08/24/2012 03:14 AM, Michael Wolf wrote:
>>>>>>> This is an RFC regarding the reporting of stealtime. In the case of
>>>>>>> where you have a system that is running with partial processors such as
>>>>>>> KVM the user may see steal time being reported in accounting tools such
>>>>>>> as top or vmstat. This can cause confusion for the end user. To
>>>>>>> ease the confusion this patch set adds a sysctl interface to set the
>>>>>>> cpu entitlement. This is the percentage of cpu that the guest system is
>>>>>>> expected to receive. As long as the steal time is within its expected
>>>>>>> range it will show up as 0 in /proc/stat. The user will then see in the
>>>>>>> accounting tools that they are getting a full utilization of the cpu
>>>>>>> resources assigned to them.
>>>>>>>
>>>>>>
>>>>>> And how is such a knob not confusing?
>>>>>>
>>>>>> Steal time is pretty well defined in meaning and is shown in top for
>>>>>> ages. I really don't see the point for this.
>>>>>
>>>>> Currently you can see the steal time but you have no way of knowing if
>>>>> the cpu utilization you are seeing on the guest is the expected amount.
>>>>> I decided on making it a knob because a guest could be migrated to
>>>>> another system and its entitlement could change because of hardware or
>>>>> load differences. It could simply be a /proc file and report the
>>>>> current entitlement if needed. As things are currently implemented I
>>>>> don't see how someone knows if the guest is running as expected or
>>>>> whether there is a problem.
>>>>>
>>>>
>>>> Turning off steal time display won't get even close to displaying the
>>>> information you want. What you probably want is a guest-visible way to
>>>> say how many milliseconds you are expected to run each second. Right?
>>>
>>> It is not clear to me how knowing how many milliseconds you are
>>> expecting to run will help the user. Currently the users will run top
>>> to see how well the guest is running. If they see _any_ steal time some
>>> users think they are not getting the full use of their processor
>>> entitlement.
>>>
>>
>> And your plan is just to selectively lie about it, by disabling it with
>> a knob?
>
> It is about making it very obvious to the end user whether they are
> receiving their cpu entitlement. If there is more steal time than
> expected, that will still show up. I have experimented with putting the
> raw steal time at the end of each cpu line in /proc/stat, and it seems
> to work. That way the raw data is there as well.
>
> Do you have another suggestion to communicate to the user whether they
> are receiving their full entitlement? At the very least shouldn't the
> entitlement reside in a /proc file somewhere so that the user could look
> up the value and "do the math"?
>

I personally believe Avi is right here. This is something to be done at
the host side. The user can learn this from any tool he is using to
manage his VMs.

Now if you absolutely must inform him from inside the guest, I would go
with the latter, informing him in another location. (I am not saying I
agree with this, just that it is the lesser evil.)

2012-08-27 21:27:23

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
> On 08/27/2012 01:23 PM, Michael Wolf wrote:
> > >
> > > How would a guest know what its entitlement is?
> > >
> > >
> >
> > Currently the Admin/management tool setting up the guests will put it on
> > the qemu commandline. From this it is passed via an ioctl to the host.
> > The guest will get the value from the host via a hypercall.
> >
> > In the future the host could try and do some of it automatically in some
> > cases.
>
> Seems to me it's a meaningless value for the guest. Suppose it is
> migrated to a host that is more powerful, and as a result its relative
> entitlement is reduced. The value needs to be adjusted.

This is why I chose to manage the value from the sysctl interface rather
than just have it stored as a value in /proc. Whatever tool was used to
migrate the vm could hopefully adjust the sysctl value on the guest.
>
> This is best taken care of from the host side.

Not sure what you are getting at here. If you are running in a cloud
environment, you purchase a VM with the understanding that you are
getting certain resources. As this type of user I don't believe you
have any access to the host to see this type of information. So the
user still wouldn't have a way to confirm that they are receiving what
they should be in the way of processor resources.

Would you please elaborate a little more on this?
>


2012-08-27 21:44:20

by Glauber Costa

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/27/2012 02:27 PM, Michael Wolf wrote:
> On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
>> On 08/27/2012 01:23 PM, Michael Wolf wrote:
>>>>
>>>> How would a guest know what its entitlement is?
>>>>
>>>>
>>>
>>> Currently the Admin/management tool setting up the guests will put it on
>>> the qemu commandline. From this it is passed via an ioctl to the host.
>>> The guest will get the value from the host via a hypercall.
>>>
>>> In the future the host could try and do some of it automatically in some
>>> cases.
>>
>> Seems to me it's a meaningless value for the guest. Suppose it is
>> migrated to a host that is more powerful, and as a result its relative
>> entitlement is reduced. The value needs to be adjusted.
>
> This is why I chose to manage the value from the sysctl interface rather
> than just have it stored as a value in /proc. Whatever tool was used to
> migrate the vm could hopefully adjust the sysctl value on the guest.
>>
>> This is best taken care of from the host side.
>
> Not sure what you are getting at here. If you are running in a cloud
> environment, you purchase a VM with the understanding that you are
> getting certain resources. As this type of user I don't believe you
> have any access to the host to see this type of information. So the
> user still wouldn't have a way to confirm that they are receiving what
> they should be in the way of processor resources.
>
> Would you please elaborate a little more on this?

What do you mean they have no access to the host?
They have access to all sorts of tools that display information from the
host. Speaking of a view-only resource, those are strictly equivalent.


2012-08-27 21:52:27

by Avi Kivity

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On 08/27/2012 02:27 PM, Michael Wolf wrote:
> On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
> > On 08/27/2012 01:23 PM, Michael Wolf wrote:
> > > >
> > > > How would a guest know what its entitlement is?
> > > >
> > > >
> > >
> > > Currently the Admin/management tool setting up the guests will put it on
> > > the qemu commandline. From this it is passed via an ioctl to the host.
> > > The guest will get the value from the host via a hypercall.
> > >
> > > In the future the host could try and do some of it automatically in some
> > > cases.
> >
> > Seems to me it's a meaningless value for the guest. Suppose it is
> > migrated to a host that is more powerful, and as a result its relative
> > entitlement is reduced. The value needs to be adjusted.
>
> This is why I chose to manage the value from the sysctl interface rather
> than just have it stored as a value in /proc. Whatever tool was used to
> migrate the vm could hopefully adjust the sysctl value on the guest.

We usually try to avoid this type of coupling. What if the guest is
rebooting while this is happening? What if it's not running Linux at all?

> >
> > This is best taken care of from the host side.
>
> Not sure what you are getting at here. If you are running in a cloud
> environment, you purchase a VM with the understanding that you are
> getting certain resources. As this type of user I don't believe you
> have any access to the host to see this type of information. So the
> user still wouldn't have a way to confirm that they are receiving what
> they should be in the way of processor resources.
>
> Would you please elaborate a little more on this?

I meant not reporting this time as steal time. But that cripples steal
time reporting.

Looks like for each quantum we need to report how much real time has
passed, how much the guest was actually using, and how much the guest
was not using due to overcommit (with the remainder being unallocated
time). The guest could then present it any way it wanted to.

2012-08-27 21:55:23

by Michael Wolf

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

On Mon, 2012-08-27 at 14:41 -0700, Glauber Costa wrote:
> On 08/27/2012 02:27 PM, Michael Wolf wrote:
> > On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
> >> On 08/27/2012 01:23 PM, Michael Wolf wrote:
> >>>>
> >>>> How would a guest know what its entitlement is?
> >>>>
> >>>>
> >>>
> >>> Currently the Admin/management tool setting up the guests will put it on
> >>> the qemu commandline. From this it is passed via an ioctl to the host.
> >>> The guest will get the value from the host via a hypercall.
> >>>
> >>> In the future the host could try and do some of it automatically in some
> >>> cases.
> >>
> >> Seems to me it's a meaningless value for the guest. Suppose it is
> >> migrated to a host that is more powerful, and as a result its relative
> >> entitlement is reduced. The value needs to be adjusted.
> >
> > This is why I chose to manage the value from the sysctl interface rather
> > than just have it stored as a value in /proc. Whatever tool was used to
> > migrate the vm could hopefully adjust the sysctl value on the guest.
> >>
> >> This is best taken care of from the host side.
> >
> > Not sure what you are getting at here. If you are running in a cloud
> > environment, you purchase a VM with the understanding that you are
> > getting certain resources. As this type of user I don't believe you
> > have any access to the host to see this type of information. So the
> > user still wouldn't have a way to confirm that they are receiving what
> > they should be in the way of processor resources.
> >
> > Would you please elaborate a little more on this?
>
> What do you mean they have no access to the host?
> They have access to all sorts of tools that display information from the
> host. Speaking of a view-only resource, those are strictly equivalent.
>
>
>

ok. I will go look at those resources.

2012-08-28 16:01:23

by Anthony Liguori

Subject: Re: [PATCH RFC 0/3] Add guest cpu_entitlement reporting

Avi Kivity <[email protected]> writes:

> On 08/27/2012 02:27 PM, Michael Wolf wrote:
>> On Mon, 2012-08-27 at 13:31 -0700, Avi Kivity wrote:
>> > On 08/27/2012 01:23 PM, Michael Wolf wrote:
>> > > >
>> > > > How would a guest know what its entitlement is?
>> > > >
>> > > >
>> > >
>> > > Currently the Admin/management tool setting up the guests will put it on
>> > > the qemu commandline. From this it is passed via an ioctl to the host.
>> > > The guest will get the value from the host via a hypercall.
>> > >
>> > > In the future the host could try and do some of it automatically in some
>> > > cases.
>> >
>> > Seems to me it's a meaningless value for the guest. Suppose it is
>> > migrated to a host that is more powerful, and as a result its relative
>> > entitlement is reduced. The value needs to be adjusted.
>>
>> This is why I chose to manage the value from the sysctl interface rather
>> than just have it stored as a value in /proc. Whatever tool was used to
>> migrate the vm could hopefully adjust the sysctl value on the guest.
>
> We usually try to avoid this type of coupling. What if the guest is
> rebooting while this is happening? What if it's not running Linux at
> all?

The guest shouldn't need to know its entitlement. Or at least, it's up
to a management tool to report that in a way that's meaningful for the
guest.

For instance, with a hosting provider, you may have 3 service levels
(small, medium, large). How you present whether the guest is small,
medium, or large to the guest is up to the hosting provider.

>
>> >
>> > This is best taken care of from the host side.
>>
>> Not sure what you are getting at here. If you are running in a cloud
>> environment, you purchase a VM with the understanding that you are
>> getting certain resources. As this type of user I don't believe you
>> have any access to the host to see this type of information. So the
>> user still wouldn't have a way to confirm that they are receiving what
>> they should be in the way of processor resources.
>>
>> Would you please elaborate a little more on this?
>
> I meant not reporting this time as steal time. But that cripples steal
> time reporting.
>
> Looks like for each quantum we need to report how much real time has
> passed, how much the guest was actually using, and how much the guest
> was not using due to overcommit (with the remainder being unallocated
> time). The guest could then present it any way it wanted to.

What I had previously suggested was splitting entitlement loss out of
steal time and reporting it as a separate metric (but not reporting a
fixed notion of entitlement).

You're missing the entitlement loss bit above. But you need to call
out entitlement loss in order to report idle time correctly.

I think changing steal time (as this patch does) is wrong.

Regards,

Anthony Liguori
