2012-10-29 14:11:09

by Raghavendra K T

[permalink] [raw]
Subject: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios

In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may
prove very costly, because there is no need to iterate over vcpus
and do unsuccessful yield_to calls that burn CPU.

Similarly, when we have a large number of small guests, it is
possible that a spinning vcpu fails to yield_to any vcpu of the same
VM and goes back to spinning. This is also not effective when we are
over-committed. Instead, we do a yield() so that we give other VMs
a chance to run.

This patch series tries to optimize the above scenarios.

The first patch optimizes yield_to by bailing out when there
is no need to continue (i.e., when there is only one task
on both the source and target rq).

The second patch uses that in the PLE handler.

The third patch uses overall system load knowledge to decide whether
to continue in the yield_to handler, and also whether to yield in
overcommit cases. To be precise:
* loadavg is converted to a scale of 2048, per CPU
* a load value of less than 1024 is considered undercommit, and we
return from the PLE handler in those cases
* a load value of greater than 3584 (1.75 * 2048) is considered overcommit,
and we yield to other VMs in such cases.

(let threshold = 2048)
Rationale for using threshold/2 as the undercommit limit:
Requiring a load below (0.5 * threshold) is used to avoid (the concern
raised by Rik) scenarios where we still have a lock-holder-preempted vcpu
waiting to be scheduled. (This scenario arises when the rq length is > 1
even when we are undercommitted.)

Rationale for using (1.75 * threshold) for the overcommit scenario:
This is a heuristic; at that point we should probably see rq lengths > 1
and a vcpu of a different VM waiting to be scheduled.
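
For illustration, a minimal userspace sketch (not part of the series) of
the same fixed-point arithmetic; FIXED_1 is 2048, as in the kernel's
loadavg code, and the macro names mirror those introduced in patch 3:

#include <stdio.h>

#define FIXED_1               (1 << 11)                 /* 2048 == 1.0  */
#define COMMIT_THRESHOLD      (FIXED_1)                 /* 1:1 commit   */
#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)   /* 0.5  -> 1024 */
#define OVERCOMMIT_THRESHOLD  ((COMMIT_THRESHOLD << 1) - \
                               (COMMIT_THRESHOLD >> 2)) /* 1.75 -> 3584 */

int main(void)
{
        /* e.g. a 1-minute loadavg of 24.0 on a 32-cpu host */
        unsigned long avenrun0 = 24 * FIXED_1;
        unsigned long load = (avenrun0 + FIXED_1 / 200) / 32; /* ~1536 */

        if (load < UNDERCOMMIT_THRESHOLD)
                printf("undercommit: return from PLE handler\n");
        else if (load > OVERCOMMIT_THRESHOLD)
                printf("overcommit: yield() to other VMs\n");
        else
                printf("in between: continue directed yield\n");
        return 0;
}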

Related future work (independent of this series):

- Dynamically changing PLE window depending on system load.

Results on the 3.7.0-rc1 kernel show around a 146% improvement for ebizzy 1x
on a 32-core PLE machine with a 32-vcpu guest.
I believe we should get very good improvements for overcommit (especially > 2)
on large machines with small-vcpu guests. (Could not test this as I do not
have access to a bigger machine.)

base = 3.7.0-rc1
machine: 32 core mx3850 x5 PLE mc

--+-----------+-----------+-----------+------------+-----------+
               ebizzy (rec/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
         base       stdev     patched        stdev    %improve
--+-----------+-----------+-----------+------------+-----------+
1x  2543.3750     20.2903   6279.3750      82.5226   146.89143
2x  2410.8750     96.4327   2450.7500     207.8136     1.65396
3x  2184.9167    205.5226   2178.3333      97.2034    -0.30131
--+-----------+-----------+-----------+------------+-----------+

--+-----------+-----------+-----------+------------+-----------+
          dbench (throughput in MB/sec, higher is better)
--+-----------+-----------+-----------+------------+-----------+
         base       stdev     patched        stdev    %improve
--+-----------+-----------+-----------+------------+-----------+
1x  5545.4330    596.4344   7042.8510    1012.0924    27.00272
2x  1993.0970     43.6548   1990.6200      75.7837    -0.12428
3x  1295.3867     22.3997   1315.5208      36.0075     1.55429
--+-----------+-----------+-----------+------------+-----------+

Changes since V1:
- Discard the idea of exporting nr_running and optimize in the core scheduler (Peter)
- Use yield() instead of schedule() in overcommit scenarios (Rik)
- Use loadavg knowledge to detect undercommit/overcommit

Peter Zijlstra (1):
Bail out of yield_to when source and target runqueue has one task

Raghavendra K T (2):
Handle yield_to failure return for potential undercommit case
Check system load and handle different commit cases accordingly

Please let me know your comments and suggestions.

Link for V1:
https://lkml.org/lkml/2012/9/21/168

kernel/sched/core.c | 25 +++++++++++++++++++------
virt/kvm/kvm_main.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 65 insertions(+), 16 deletions(-)


2012-10-29 14:11:29

by Raghavendra K T

[permalink] [raw]
Subject: [PATCH V2 RFC 1/3] sched: Bail out of yield_to when source and target runqueue has one task

From: Peter Zijlstra <[email protected]>

In undercommitted scenarios, especially with large guests,
yield_to overhead is significantly high. When the run queue length of
both source and target is one, take the opportunity to bail out and return
-ESRCH. This return condition can be further exploited to quickly come
out of the PLE handler.

Signed-off-by: Peter Zijlstra <[email protected]>
[Raghavendra: added the check on the target vcpu's rq length]
Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---

kernel/sched/core.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..fc219a5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
* It's the caller's job to ensure that the target task struct
* can't go away on us before we can do any checks.
*
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ * true (>0) if we indeed boosted the target task.
+ * false (0) if we failed to boost the target.
+ * -ESRCH if there's no task to yield to.
*/
bool __sched yield_to(struct task_struct *p, bool preempt)
{
@@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)

again:
p_rq = task_rq(p);
+ /*
+ * If we're the only runnable task on the rq and target rq also
+ * has only one task, there's absolutely no point in yielding.
+ */
+ if (rq->nr_running == 1 && p_rq->nr_running == 1) {
+ yielded = -ESRCH;
+ goto out_irq;
+ }
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4310,13 +4322,13 @@ again:
}

if (!curr->sched_class->yield_to_task)
- goto out;
+ goto out_unlock;

if (curr->sched_class != p->sched_class)
- goto out;
+ goto out_unlock;

if (task_running(p_rq, p) || p->state)
- goto out;
+ goto out_unlock;

yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4329,11 +4341,12 @@ again:
resched_task(p_rq->curr);
}

-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);

- if (yielded)
+ if (yielded > 0)
schedule();

return yielded;

2012-10-29 14:11:44

by Raghavendra K T

[permalink] [raw]
Subject: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

From: Raghavendra K T <[email protected]>

yield_to returns -ESRCH when there is no task to yield to (i.e., when the
source and target run queue lengths are one). Treat that as a potential
undercommit hint and abort the vcpu iteration in the PLE handler.
Also, we do not update the last boosted vcpu in failure cases.

Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---

virt/kvm/kvm_main.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..e376434 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
struct pid *pid;
struct task_struct *task = NULL;
+ bool ret = false;

rcu_read_lock();
pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
task = get_pid_task(target->pid, PIDTYPE_PID);
rcu_read_unlock();
if (!task)
- return false;
+ return ret;
if (task->flags & PF_VCPU) {
put_task_struct(task);
- return false;
- }
- if (yield_to(task, 1)) {
- put_task_struct(task);
- return true;
+ return ret;
}
+ ret = yield_to(task, 1);
put_task_struct(task);
- return false;
+
+ return ret;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);

@@ -1697,6 +1696,7 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
return eligible;
}
#endif
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
@@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
- if (kvm_vcpu_yield_to(vcpu)) {
+
+ yielded = kvm_vcpu_yield_to(vcpu);
+ if (yielded > 0)
kvm->last_boosted_vcpu = i;
- yielded = 1;
+ if (yielded)
break;
- }
}
}
kvm_vcpu_set_in_spin_loop(me, false);

2012-10-29 14:12:30

by Raghavendra K T

[permalink] [raw]
Subject: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

From: Raghavendra K T <[email protected]>

This patch introduces a helper function that calculates the system load
(idea borrowed from the loadavg calculation). The load is normalized to
2048, i.e., a return value (threshold) of 2048 implies an approximately 1:1
committed guest.

In undercommit cases (load < threshold/2) we simply return from the PLE
handler. In overcommit cases (load > 1.75 * threshold) we do a yield(). The
rationale is to allow other VMs of the host to run instead of burning CPU
cycles.

Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---
The idea of yielding in overcommit cases (especially in the case of a large
number of small guests) was
Acked-by: Rik van Riel <[email protected]>
Andrew Theurer has also stressed the importance of reducing yield_to
overhead and using yield().

(let threshold = 2048)
Rationale for using threshold/2 as the undercommit limit:
Requiring a load below (0.5 * threshold) is used to avoid (the concern
raised by Rik) scenarios where we still have a lock-holder-preempted vcpu
waiting to be scheduled. (This scenario arises when the rq length is > 1
even when we are undercommitted.)

Rationale for using (1.75 * threshold) for the overcommit scenario:
This is a heuristic; at that point we should probably see rq lengths > 1
and a vcpu of a different VM waiting to be scheduled.

virt/kvm/kvm_main.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e376434..28bbdfb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1697,15 +1697,43 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
}
#endif

+/*
+ * A load of 2048 corresponds to 1:1 overcommit
+ * the undercommit threshold is half the 1:1 overcommit threshold
+ * the overcommit threshold is 1.75 times the 1:1 overcommit threshold
+ */
+#define COMMIT_THRESHOLD (FIXED_1)
+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - (COMMIT_THRESHOLD >> 2))
+
+unsigned long kvm_system_load(void)
+{
+ unsigned long load;
+
+ load = avenrun[0] + FIXED_1/200;
+ load = load / num_online_cpus();
+
+ return load;
+}
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
+ unsigned long load;
int pass;
int i;

+ load = kvm_system_load();
+ /*
+ * When we are undercommitted, let us not waste time in
+ * iterating over all the VCPUs.
+ */
+ if (load < UNDERCOMMIT_THRESHOLD)
+ return;
+
kvm_vcpu_set_in_spin_loop(me, true);
/*
* We boost the priority of a VCPU that is runnable but not
@@ -1735,6 +1763,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
break;
}
}
+ /*
+ * If we are not able to yield especially in overcommit cases
+ * let us be courteous to other VM's VCPUs waiting to be scheduled.
+ */
+ if (!yielded && load > OVERCOMMIT_THRESHOLD)
+ yield();
+
kvm_vcpu_set_in_spin_loop(me, false);

/* Ensure vcpu is not eligible during next spinloop */

2012-10-29 17:55:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
> +/*
> + * A load of 2048 corresponds to 1:1 overcommit
> + * the undercommit threshold is half the 1:1 overcommit threshold
> + * the overcommit threshold is 1.75 times the 1:1 overcommit threshold
> + */
> +#define COMMIT_THRESHOLD (FIXED_1)
> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - (COMMIT_THRESHOLD >> 2))
> +
> +unsigned long kvm_system_load(void)
> +{
> + unsigned long load;
> +
> + load = avenrun[0] + FIXED_1/200;
> + load = load / num_online_cpus();
> +
> + return load;
> +}

ARGH.. no that's wrong.. very wrong.

1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
joke.

2) avenrun[] is a global load, do not ever use a global load measure

3) avenrun[] has nothing whatsoever to do with runqueue lengths;
someone with a gazillion tasks in D state will get a huge load but the
cpu is very idle.

2012-10-30 06:02:34

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
>> [...]
>
> ARGH.. no that's wrong.. very wrong.
>
> 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
> joke.

Okay.

> 2) avenrun[] is a global load, do not ever use a global load measure

This makes sense. Using a local optimization that leads to a near-global
optimization is the way to go.

>
> 3) avenrun[] has nothing whatsoever to do with runqueue lengths;
> someone with a gazillion tasks in D state will get a huge load but the
> cpu is very idle.
>

I used loadavg as an alternative measure, but the above condition
poses a concern for that.

Okay, now IIUC, usage of *any* global measure is bad?

Because I was also thinking of using nr_running() / num_online_cpus() to
get an idea of the global overcommit sense. (Of course, since this involves
iteration over the per-CPU nr_running, I wanted to calculate this
periodically.)

The overall logic of having an overcommit_threshold and an
undercommit_threshold is something I wanted to use even for dynamic
ple_window tuning.

So the logic was:
load < undercommit_threshold => 16k ple_window
load > overcommit_threshold => 4k ple_window
for the in-between cases, scale the ple_window accordingly
(rough sketch below).
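
A rough sketch of that scaling (hypothetical, reusing the COMMIT_THRESHOLD
macros from patch 3; not code from this series):

#define PLE_WINDOW_MIN 4096
#define PLE_WINDOW_MAX 16384

static unsigned int scale_ple_window(unsigned long load)
{
        if (load < UNDERCOMMIT_THRESHOLD)
                return PLE_WINDOW_MAX;  /* 16k: spin longer */
        if (load > OVERCOMMIT_THRESHOLD)
                return PLE_WINDOW_MIN;  /* 4k: exit to host sooner */

        /* linearly map [UNDERCOMMIT_THRESHOLD, OVERCOMMIT_THRESHOLD]
         * onto [16k, 4k] */
        return PLE_WINDOW_MAX - (PLE_WINDOW_MAX - PLE_WINDOW_MIN) *
               (load - UNDERCOMMIT_THRESHOLD) /
               (OVERCOMMIT_THRESHOLD - UNDERCOMMIT_THRESHOLD);
}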

The alternative was to decide depending on how the ple handler succeeded
in yield_to. But I thought that is too sensitive and has more overhead.

This topic may deserve a different thread, but I thought I would table it
here.

So, thinking about the alternatives to implement, the logic of the current
patches is:

(a) if (undercommitted)
    just go back and spin rather than going for the yield_to iteration.
(b) if (overcommitted)
    better to yield rather than spin.

[Of course, (a) is already met to a large extent by your patches.]

So I think everything boils down to

"how do we measure these two thresholds without much overhead in a
compliant way"

Ideas welcome..

2012-10-30 06:35:20

by Andrew Jones

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
> [...]
> So I think everything boils down to
>
> "how do we measure these two thresholds without much overhead in a
> compliant way"
>
> Ideas welcome..
>

What happened to Avi's preempt notifier idea for determining
under/overcommit? If nobody has picked that up yet, then I'll go ahead and
try to prototype it.

Drew

2012-10-30 07:36:52

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On 10/30/2012 12:04 PM, Andrew Jones wrote:
> On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
>> [...]
>
> What happened to Avi's preempt notifier idea for determining
> under/overcommit? If nobody has picked that up yet, then I'll go ahead and
> try to prototype it.

Hi Drew,

I had assumed my priority order to be:
1) this patch series, 2) dynamic ple window, 3) preempt notifiers.

But I do not have any problem re-prioritizing / helping on these,
as long as we are clear on what we are looking into.

I was thinking about the preempt notifier idea as a tool to refine
candidate VCPUs. But you are right; Avi also said we can use the
bitmap/counter itself as an indicator to decide whether we go ahead
with yield_to at all.

IMO, only patch (3) has some conflict because of the various approaches we
can try. Maybe we should attack the problem via all 3 solutions at once
and decide?

To be frank, within each of the approaches, trying/analyzing all the
possibilities has made things slow (on my end).

Suggestions..?

2012-10-30 08:14:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote:
> Okay, now IIUC, usage of *any* global measure is bad?

Yep, people like to carve up their machines, especially now that they're
somewhat bigger than they used to be. This can result in very asymmetric
loads; no global measure can ever deal with that.

2012-10-30 09:07:55

by Andrew Jones

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On Tue, Oct 30, 2012 at 01:01:54PM +0530, Raghavendra K T wrote:
> [...]
> Suggestions..?
>

I agree, it's a complex problem that needs lots of trial+error work. We
should definitely work in parallel on multiple ideas. I'll go ahead and
dig into the preempt notifiers.

Drew

2012-10-30 12:17:30

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios

On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
> [...]
>
> --+-----------+-----------+-----------+------------+-----------+
>                ebizzy (rec/sec, higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>          base       stdev     patched        stdev    %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x  2543.3750     20.2903   6279.3750      82.5226   146.89143
> 2x  2410.8750     96.4327   2450.7500     207.8136     1.65396
> 3x  2184.9167    205.5226   2178.3333      97.2034    -0.30131
> --+-----------+-----------+-----------+------------+-----------+
>
> --+-----------+-----------+-----------+------------+-----------+
>           dbench (throughput in MB/sec, higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>          base       stdev     patched        stdev    %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x  5545.4330    596.4344   7042.8510    1012.0924    27.00272
> 2x  1993.0970     43.6548   1990.6200      75.7837    -0.12428
> 3x  1295.3867     22.3997   1315.5208      36.0075     1.55429
> --+-----------+-----------+-----------+------------+-----------+

Could you include a PLE-off result for 1x over-commit, so we know what
the best possible result is?

Looks like skipping the yield_to() for rq = 1 helps, but I'd like to
know if the performance is the same as PLE off for 1x. I am concerned
the vcpu to task lookup is still expensive.

Based on Peter's comments I would say the 3rd patch and the 2x,3x
results are not conclusive at this time.

I think we should also discuss what we think a good target is. We
should know what our high-water mark is, and IMO, if we cannot get
close, then I do not feel we are heading down the right path. For
example, if dbench aggregate throughput for 1x with PLE off is 10000
MB/sec, then the best possible 2x,3x result should be a little lower
than that due to task switching of the vcpus and sharing of caches. This
should be quite evident with the current PLE handler and smaller VMs (like
10 vcpus or less).

-Andrew Theurer

2012-10-31 06:15:22

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On 10/30/2012 01:44 PM, Peter Zijlstra wrote:
> On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote:
>> Okay, now IIUC, usage of *any* global measure is bad?
>
> Yep, people like to carve up their machines, esp. now that they're
> somewhat bigger than they used to be. This can result in very asymmetric
> loads, no global measure can ever deal with that.

Thanks for explaining the concerns. Very true; if the load is very
asymmetric due to power optimization and other constraints, this may
affect the approach.

2012-10-31 06:41:20

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios

On 10/30/2012 05:47 PM, Andrew Theurer wrote:
> On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
>> [...]
>
> Could you include a PLE-off result for 1x over-commit, so we know what
> the best possible result is?

Yes:

base, no PLE

ebizzy_1x  7651.3000 rec/sec
ebizzy_2x    51.5000 rec/sec

For ebizzy we are closer.

dbench_1x  12631.4210 MB/sec
dbench_2x     45.0842 MB/sec

(Strangely, the dbench 1x result is not always consistent despite 10 runs
of 3 min + 30 sec warmup runs on a 3G tmpfs. But it surely shows the trend.)

>
> Looks like skipping the yield_to() for rq = 1 helps, but I'd like to
> know if the performance is the same as PLE off for 1x. I am concerned
> the vcpu to task lookup is still expensive.
>

Yes. I still see that.

> Based on Peter's comments I would say the 3rd patch and the 2x,3x
> results are not conclusive at this time.

Avi, IMO patches 1 and 2 seem good to go. Please let me know.

>
> I think we should also discuss what we think a good target is. We
> should know what our high-water mark is, and IMO, if we cannot get
> close, then I do not feel we are heading down the right path. For
> example, if dbench aggregate throughput for 1x with PLE off is 10000
> MB/sec, then the best possible 2x,3x result should be a little lower
> than that due to task switching of the vcpus and sharing of caches. This
> should be quite evident with the current PLE handler and smaller VMs (like
> 10 vcpus or less).

Very much agree here. If we look at the 2x/3x results (all/any of them),
the aggregate is not near 1x. Maybe even 70% is a good target.

2012-10-31 12:29:32

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 3/3] kvm: Check system load and handle different commit cases accordingly

On 10/30/2012 02:37 PM, Andrew Jones wrote:
> [...]
>
> I agree, it's a complex problem that needs lots of trial+error work. We
> should definitely work in parallel on multiple ideas. I'll go ahead and
> dig into the preempt notifiers.
>

Okay, thank you. I will concentrate on the dynamic ple window, but I think
the implementation needs some overlapping details from the preempt
notifiers.

For the dynamic ple window, to summarize what we thought of
doing:

(I hope we have to keep the ple window between 4k and 16k throughout.)

From preempt notifiers:

(1) From the preempt notifier, check for the overcommit case; if so, shrink
the ple window (a rough sketch of the counting follows below).
Questions:
How do we say we are overcommitted?
- Is it the number of preemptions we keep track of vs. the total number of
vcpus? I think so. But we have to convert that into some formula; we shall
decrease the ple window by some factor (unless we hit 4k).

(2) How can we say we are undercommitted?
Perhaps very few vcpus are scheduled out currently; we then tend to set
the ple window closer to the max (16k).

From yield_to failures:

If yield_to fails with -ESRCH, it potentially indicates undercommit, and
we can again use the logic of increasing the ple window.

Did we miss anything?
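
A hypothetical sketch of the preemption counting (names illustrative; kvm
already registers preempt notifiers per vcpu task, so these are the hooks
we would extend):

#include <linux/preempt.h>
#include <linux/atomic.h>

/* hypothetical global: number of vcpu tasks currently scheduled out */
static atomic_t kvm_preempted_vcpus = ATOMIC_INIT(0);

static void vcpu_sched_out_count(struct preempt_notifier *pn,
                                 struct task_struct *next)
{
        atomic_inc(&kvm_preempted_vcpus);
}

static void vcpu_sched_in_count(struct preempt_notifier *pn, int cpu)
{
        atomic_dec(&kvm_preempted_vcpus);
}

/*
 * A preempted/total ratio close to 0 would suggest undercommit
 * (ple window toward 16k); close to 1 suggests overcommit
 * (ple window toward 4k).
 */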

2012-10-31 12:38:57

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

On 10/29/2012 04:07 PM, Raghavendra K T wrote:
> From: Raghavendra K T <[email protected]>
>
> Also we do not update last boosted vcpu in failure cases.
>
> #endif
> +
> void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> {
> struct kvm *kvm = me->kvm;
> @@ -1727,11 +1727,12 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> continue;
> if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> continue;
> - if (kvm_vcpu_yield_to(vcpu)) {
> +
> + yielded = kvm_vcpu_yield_to(vcpu);
> + if (yielded > 0)
> kvm->last_boosted_vcpu = i;
> - yielded = 1;
> + if (yielded)
> break;
> - }
> }

If yielded == -ESRCH, should we not try to yield to another vcpu?


--
error compiling committee.c: too many arguments to function

2012-10-31 12:46:16

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

On 10/31/2012 06:08 PM, Avi Kivity wrote:
> On 10/29/2012 04:07 PM, Raghavendra K T wrote:
>> [...]
>
> If yielded == -ESRCH, should we not try to yield to another vcpu?
>

Yes, the plan is to abort the iteration, since it means we are mostly
undercommitted.

2012-10-31 13:20:22

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

On 10/31/2012 06:11 PM, Raghavendra K T wrote:
> On 10/31/2012 06:08 PM, Avi Kivity wrote:
>> On 10/29/2012 04:07 PM, Raghavendra K T wrote:
>>> [...]
>>
>> If yielded == -ESRCH, should we not try to yield to another vcpu?
>>
>
> Yes, the plan is to abort the iteration, since it means we are mostly
> undercommitted.

Sorry if that was ambiguous. I wanted to say we do not want to continue
yielding to another vcpu.

2012-10-31 13:41:52

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

On 10/31/2012 03:15 PM, Raghavendra K T wrote:
> On 10/31/2012 06:11 PM, Raghavendra K T wrote:
>> On 10/31/2012 06:08 PM, Avi Kivity wrote:
>>> On 10/29/2012 04:07 PM, Raghavendra K T wrote:
>>>> [...]
>>>
>>> If yielded == -ESRCH, should we not try to yield to another vcpu?
>>>
>>
>> Yes, the plan is to abort the iteration, since it means we are mostly
>> undercommitted.
>
> Sorry if that was ambiguous. I wanted to say we do not want to continue
> yielding to another vcpu.
>


Why not? We found that this particular vcpu is running and therefore
likely not a lock holder. That says nothing about other vcpus. The
next in line might be runnable-but-not-running on another runqueue.


--
error compiling committee.c: too many arguments to function

2012-10-31 17:11:30

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

On 10/31/2012 07:11 PM, Avi Kivity wrote:
> On 10/31/2012 03:15 PM, Raghavendra K T wrote:
>> On 10/31/2012 06:11 PM, Raghavendra K T wrote:
>>> On 10/31/2012 06:08 PM, Avi Kivity wrote:
>>>> On 10/29/2012 04:07 PM, Raghavendra K T wrote:
>>>>> [...]
>>>>
>>>> If yielded == -ESRCH, should we not try to yield to another vcpu?
>>>>
>>>
>>> Yes, the plan is to abort the iteration, since it means we are mostly
>>> undercommitted.
>>
>> Sorry if that was ambiguous. I wanted to say we do not want to continue
>> yielding to another vcpu.
>>
>
>
> Why not? We found that this particular vcpu is running and therefore
> likely not a lock holder. That says nothing about other vcpus. The
> next in line might be runnable-but-not-running on another runqueue.

Agree that the next in line might be runnable-but-not-running. But here we
are optimistic that that is not the case, and we save time by
returning instead of iterating, assuming we are mostly in the
undercommitted case and each vcpu has a dedicated cpu.

Probably an alternative we have here is to look for, say, 2-3 successive
failures before breaking out?

2012-11-07 10:30:35

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

* Raghavendra K T <[email protected]> [2012-10-31 22:36:25]:

> On 10/31/2012 07:11 PM, Avi Kivity wrote:
> > [...]
> >Why not? We found that this particular vcpu is running and therefore
> >likely not a lock holder. That says nothing about other vcpus. The
> >next in line might be runnable-but-not-running on another runqueue.
>
> Agree that the next in line might be runnable-but-not-running. But here we
> are optimistic that that is not the case, and we save time by
> returning instead of iterating, assuming we are mostly in the
> undercommitted case and each vcpu has a dedicated cpu.
>
> Probably an alternative we have here is to look for, say, 2-3 successive
> failures before breaking out?

Hi Avi,

I tried the idea of bailing out only when we have successive failures,
too (hunk below). The results are as follows.

base = 3.7.0-rc1
A = base + patch 1 + patch 2 (original series except patch 3)
B = A + bail out on successive failures patch (hunk below)

% improvement w.r.t. the base kernel, 32 vcpu guest on a 32 core PLE machine

                       A            B
kernbench_1x     1.83959      0.95158
kernbench_2x     5.58283      8.31783

ebizzy_1x      144.96959    147.47995
ebizzy_2x      -11.52278     -4.52835
ebizzy_3x       -7.59141     -5.17241

dbench_1x       87.46935     61.14888
dbench_2x       -7.16749     -4.17130
dbench_3x       -0.34115     -3.18740


IMO,
the original patch would have affected moderate overcommits more,
where we have an average runqueue length of less than two but are still
overcommitted.

With this patch, though we may get a little less improvement for
1x overcommit, the probability of this affecting the moderate overcommit
situation reduces drastically.

If we consider cases where we have an average run queue length as follows,
the corresponding probability of us (wrongly) considering it an
undercommitted case, with rough/simple math, is as follows (correct me if
something is wrong here):

(In precise math it should be the probability of hitting a source and
target runqueue length of one, when we are overcommitted, for two
successive trials, noting that the source of the yield_to remains the
same.)

(Probability = p^3, where p is the probability that we hit a cpu having
rq length = 1: one factor of p for the common source and one for each of
the two targets. I have taken #cpus out of the calculation here, though
it matters.)

average     Probability
rq length
---------------------------------
1.25        27/64   (Note: we are slightly overcommitted here and
                     hopefully it does not affect much.)
1.5         1/8
1.75        1/64

For the original patch it was p^2, and hence 9/16, 1/4, and 1/16
respectively.
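
A quick way to sanity-check these numbers (assuming every rq has length
either 1 or 2, so an average length L gives a fraction p = 2 - L of cpus
with rq length 1):

#include <stdio.h>

int main(void)
{
        double avg[] = { 1.25, 1.5, 1.75 };
        int i;

        for (i = 0; i < 3; i++) {
                /* fraction of cpus with rq length 1 */
                double p = 2.0 - avg[i];

                printf("avg rq len %.2f: p^2 = %.6f, p^3 = %.6f\n",
                       avg[i], p * p, p * p * p);
        }
        return 0;
}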

I think continuing with yield_to beyond this would take us back to
square one, unnecessarily wasting time.

Please let me know if you have any comments on the idea of bailing out
after successive failed trials.

(Side note: a dynamic ple window built on top of this logic is already
done, and I will be posting it with results in a separate patch.)

Changed hunk:
---8<---
@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
return eligible;
}
#endif
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
+ int try = 2;
int pass;
int i;

@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
* VCPU is holding the lock that we need and will release it.
* We approximate round-robin by starting at the last boosted VCPU.
*/
- for (pass = 0; pass < 2 && !yielded; pass++) {
+ for (pass = 0; pass < 2 && !yielded && try; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!pass && i <= last_boosted_vcpu) {
i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
- if (kvm_vcpu_yield_to(vcpu)) {
+
+ yielded = kvm_vcpu_yield_to(vcpu);
+ if (yielded > 0) {
kvm->last_boosted_vcpu = i;
- yielded = 1;
break;
+ } else if (yielded < 0) {
+ try--;
+ if (!try)
+ break;
}
}
}

2012-11-09 08:43:27

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH V2 RESEND RFC 2/3] kvm: Handle yield_to failure return code for potential undercommit case

Handle yield_to failure return code for potential undercommit case

From: Raghavendra K T <[email protected]>

yield_to returns -ESRCH when the source and target run queue lengths
are one. When we see two successive failures of yield_to, we assume we
are in a potential undercommit case and abort from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst case scenarios, such as an average runqueue length
between 1 and 2.

Note that we do not update the last boosted vcpu in failure cases.
Thanks to Avi for raising the question about aborting after the first
failure of yield_to.

Reviewed-by: Srikar Dronamraju <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---

virt/kvm/kvm_main.c | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)


diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..9f390e7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1639,6 +1639,7 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
struct pid *pid;
struct task_struct *task = NULL;
+ bool ret = false;

rcu_read_lock();
pid = rcu_dereference(target->pid);
@@ -1646,17 +1647,15 @@ bool kvm_vcpu_yield_to(struct kvm_vcpu *target)
task = get_pid_task(target->pid, PIDTYPE_PID);
rcu_read_unlock();
if (!task)
- return false;
+ return ret;
if (task->flags & PF_VCPU) {
put_task_struct(task);
- return false;
- }
- if (yield_to(task, 1)) {
- put_task_struct(task);
- return true;
+ return ret;
}
+ ret = yield_to(task, 1);
put_task_struct(task);
- return false;
+
+ return ret;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_yield_to);

@@ -1697,12 +1696,14 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
return eligible;
}
#endif
+
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
struct kvm *kvm = me->kvm;
struct kvm_vcpu *vcpu;
int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
int yielded = 0;
+ int try = 2;
int pass;
int i;

@@ -1714,7 +1715,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
* VCPU is holding the lock that we need and will release it.
* We approximate round-robin by starting at the last boosted VCPU.
*/
- for (pass = 0; pass < 2 && !yielded; pass++) {
+ for (pass = 0; pass < 2 && !yielded && try; pass++) {
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!pass && i <= last_boosted_vcpu) {
i = last_boosted_vcpu;
@@ -1727,10 +1728,15 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
continue;
if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
continue;
- if (kvm_vcpu_yield_to(vcpu)) {
+
+ yielded = kvm_vcpu_yield_to(vcpu);
+ if (yielded > 0) {
kvm->last_boosted_vcpu = i;
- yielded = 1;
break;
+ } else if (yielded < 0) {
+ try--;
+ if (!try)
+ break;
}
}
}