2012-06-19 20:22:25

by Raghavendra K T

Subject: Regarding improving ple handler (vcpu_on_spin)


In the PLE handler code, the last_boosted_vcpu (lbv) variable serves as
the reference point for where to start when we enter.

lbv = kvm->lbv;
for each vcpu i of kvm:
    if i is eligible:
        if yield_to(i) succeeds:
            lbv = i
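
For reference, the surrounding loop in kvm_vcpu_on_spin() looks roughly
like this (a simplified sketch from memory, with the eligibility checks
abbreviated; treat it as an approximation rather than the exact source):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int last_boosted_vcpu = kvm->last_boosted_vcpu;
	int yielded = 0;
	int pass, i;

	/* Two passes: first from lbv onwards, then wrap around from 0. */
	for (pass = 0; pass < 2 && !yielded; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i < last_boosted_vcpu) {
				i = last_boosted_vcpu;
				continue;
			} else if (pass && i > last_boosted_vcpu)
				break;
			if (vcpu == me)
				continue;
			if (waitqueue_active(&vcpu->wq))
				continue;	/* sleeping, not spinning */
			if (kvm_vcpu_yield_to(vcpu)) {
				/* updated only on successful yield */
				kvm->last_boosted_vcpu = i;
				yielded = 1;
				break;
			}
		}
	}
}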

Currently this variable is per-VM, and it is set only after we do a
successful yield_to(target). Unfortunately it may take a little longer
than we expect (depending on the target's lag in the rb-tree) to come
back from the yield and set the value.

So when several PLE-handler entries happen before it is set, all of
them start from the same place (and the overall round-robin is also
slower).

Also, statistical analysis (below) shows that lbv is not very well
distributed with the current approach.

Naturally, the first approach is to move the lbv update to before
yield_to, without bothering about the failure case, to make the
round-robin fast (this was in Rik's V4 vcpu_on_spin patch series).

But when I did a performance analysis, in the no-overcommit scenario I
saw violent/cascaded directed yields happening, leading to more CPU
wasted in spinning (a huge degradation at 1x and an improvement at 3x;
I assume this was the reason it was moved back after yield_to in V5 of
the vcpu_on_spin series).

The second approach I tried was to:
(1) get rid of the per-kvm lbv variable, and
(2) have everybody who enters the handler start from a random vcpu as
the reference point (see the sketch below).
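
A minimal sketch of the idea (random32() here is purely illustrative;
I will post the actual patch separately):

	/* Instead of reading the shared kvm->last_boosted_vcpu ... */
	int last_boosted_vcpu = random32() % atomic_read(&kvm->online_vcpus);

	/* ... then run the same two-pass scan as before, but never write
	 * the chosen index back to a per-VM variable on successful yield. */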

The above gave a good distribution of starting points (and a
performance improvement in the 32-vcpu guest I tested), and IMO it
also scales well for larger VMs.

Analysis
=============
Four 32-vcpu guests running, with one of them running kernbench.

"PLE handler yield stat" is the per-vcpu-index count of successful
yields (for 32 vcpus).

"PLE handler start stat" is the frequency with which each vcpu index
was used as the starting point (for 32 vcpus).

snapshot1
=============
PLE handler yield stat :
274391 33088 32554 46688 46653 48742 48055 37491
38839 31799 28974 30303 31466 45936 36208 51580
32754 53441 28956 30738 37940 37693 26183 40022
31725 41879 23443 35826 40985 30447 37352 35445

PLE handler start stat :
433590 383318 204835 169981 193508 203954 175960 139373
153835 125245 118532 140092 135732 134903 119349 149467
109871 160404 117140 120554 144715 125099 108527 125051
111416 141385 94815 138387 154710 116270 123130 173795

snapshot2
============
PLE handler yield stat :
1957091 59383 67866 65474 100335 77683 80958 64073
53783 44620 80131 81058 66493 56677 74222 74974
42398 132762 48982 70230 78318 65198 54446 104793
59937 57974 73367 96436 79922 59476 58835 63547

PLE handler start stat :
2555089 611546 461121 346769 435889 452398 407495 314403
354277 298006 364202 461158 344783 288263 342165 357270
270887 451660 300020 332120 378403 317848 307969 414282
351443 328501 352840 426094 375050 330016 347540 371819

So the questions I have in mind are:

1. Do you think going for a randomized last_boosted_vcpu and getting
rid of the per-VM variable is better?

2. Can we have (or do we already have) a mechanism by which we can
decide not to yield to a vcpu that is doing frequent PLE exits
(possibly because it is doing unnecessary busy-waits), or to yield_to
a better candidate instead?

On a side note: with the PV patches I have tried doing a yield_to to
the kicked VCPU in the vcpu_block path, and it gives some performance
improvement.

Please let me know if you have any comments/suggestions.


2012-06-19 20:51:44

by Rik van Riel

Subject: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On Wed, 20 Jun 2012 01:50:50 +0530
Raghavendra K T <[email protected]> wrote:

>
> In ple handler code, last_boosted_vcpu (lbv) variable is
> serving as reference point to start when we enter.

> Also statistical analysis (below) is showing lbv is not very well
> distributed with current approach.

You are the second person to spot this bug today (yes, today).

Due to time zones, the first person has not had a chance yet to
test the patch below, which might fix the issue...

Please let me know how it goes.

====8<====

If last_boosted_vcpu == 0, then we fall through all test cases and
may end up with all VCPUs pouncing on vcpu 0. With a large enough
guest, this can result in enormous runqueue lock contention, which
can prevent vcpu0 from running, leading to a livelock.

Changing < to <= makes sure we properly handle that case.

Signed-off-by: Rik van Riel <[email protected]>
---
virt/kvm/kvm_main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..1da542b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	 */
 	for (pass = 0; pass < 2 && !yielded; pass++) {
 		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i < last_boosted_vcpu) {
+			if (!pass && i <= last_boosted_vcpu) {
 				i = last_boosted_vcpu;
 				continue;
 			} else if (pass && i > last_boosted_vcpu)

2012-06-20 20:13:52

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/20/2012 02:21 AM, Rik van Riel wrote:
> On Wed, 20 Jun 2012 01:50:50 +0530
> Raghavendra K T <[email protected]> wrote:
>
>>
>> In ple handler code, last_boosted_vcpu (lbv) variable is
>> serving as reference point to start when we enter.
>
>> Also statistical analysis (below) is showing lbv is not very well
>> distributed with current approach.
>
> You are the second person to spot this bug today (yes, today).

Oh! really interesting.

>
> Due to time zones, the first person has not had a chance yet to
> test the patch below, which might fix the issue...

Maybe his timezone is also close to mine. I am also pretty late
now. :)

>
> Please let me know how it goes.

Yes, I got the results today; too tired to summarize. Got a better
performance result too. Will come back again tomorrow morning. I also
have to post the randomized starting point patch, which I discussed,
to get opinions.

>
> ====8<====
>
> If last_boosted_vcpu == 0, then we fall through all test cases and
> may end up with all VCPUs pouncing on vcpu 0. With a large enough
> guest, this can result in enormous runqueue lock contention, which
> can prevent vcpu0 from running, leading to a livelock.
>
> Changing < to <= makes sure we properly handle that case.

Analysis shows the distribution is flatter now than before.
Here are the snapshots:
snapshot1
PLE handler yield stat :
66447 132222 75510 65875 121298 92543 111267 79523
118134 105366 116441 114195 107493 66666 86779 87733
84415 105778 94210 73197 55626 93036 112959 92035
95742 78558 72190 101719 94667 108593 63832 81580

PLE handler start stat :
334301 687807 384077 344917 504917 343988 439810 371389
466908 415509 394304 484276 376510 292821 370478 363727
366989 423441 392949 309706 292115 437900 413763 346135
364181 323031 348405 399593 336714 373995 302301 347383


snapshot2
PLE handler yield stat :
320547 267528 264316 164213 249246 182014 246468 225386
277179 310659 349767 310281 238680 187645 225791 266290
216202 316974 231077 216586 151679 356863 266031 213047
306229 182629 229334 241204 275975 265086 282218 242207

PLE handler start stat :
1335370 1378184 1252001 925414 1196973 951298 1219835 1108788
1265427 1290362 1308553 1271066 1107575 980036 1077210 1278611
1110779 1365130 1151200 1049859 937159 1577830 1209099 993391
1173766 987307 1144775 1102960 1100082 1177134 1207862 1119551


>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> virt/kvm/kvm_main.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7e14068..1da542b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> */
> for (pass = 0; pass < 2 && !yielded; pass++) {
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - if (!pass && i < last_boosted_vcpu) {
> + if (!pass && i <= last_boosted_vcpu) {

Hmm, true, great catch. It was biased towards zero earlier.

> i = last_boosted_vcpu;
> continue;
> } else if (pass && i > last_boosted_vcpu)
>
>

2012-06-21 02:11:18

by Rik van Riel

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/20/2012 04:12 PM, Raghavendra K T wrote:
> On 06/20/2012 02:21 AM, Rik van Riel wrote:

>> Please let me know how it goes.
>
> Yes, have got result today, too tired to summarize. got better
> performance result too. will come back again tomorrow morning.
> have to post, randomized start point patch also, which I discussed to
> know the opinion.

The other person's problem has also gone away with this
patch.

Avi, could I convince you to apply this obvious bugfix
to kvm.git? :)

>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0. With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing < to <= makes sure we properly handle that case.

>>
>> Signed-off-by: Rik van Riel <[email protected]>
>> ---
>> virt/kvm/kvm_main.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>> */
>> for (pass = 0; pass < 2 && !yielded; pass++) {
>> kvm_for_each_vcpu(i, vcpu, kvm) {
>> - if (!pass && i < last_boosted_vcpu) {
>> + if (!pass && i <= last_boosted_vcpu) {
>> i = last_boosted_vcpu;
>> continue;
>> } else if (pass && i > last_boosted_vcpu)
>>
>>
>


--
All rights reversed

2012-06-21 06:44:09

by Gleb Natapov

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
> On Wed, 20 Jun 2012 01:50:50 +0530
> Raghavendra K T <[email protected]> wrote:
>
> >
> > In ple handler code, last_boosted_vcpu (lbv) variable is
> > serving as reference point to start when we enter.
>
> > Also statistical analysis (below) is showing lbv is not very well
> > distributed with current approach.
>
> You are the second person to spot this bug today (yes, today).
>
> Due to time zones, the first person has not had a chance yet to
> test the patch below, which might fix the issue...
>
> Please let me know how it goes.
>
> ====8<====
>
> If last_boosted_vcpu == 0, then we fall through all test cases and
> may end up with all VCPUs pouncing on vcpu 0. With a large enough
> guest, this can result in enormous runqueue lock contention, which
> can prevent vcpu0 from running, leading to a livelock.
>
> Changing < to <= makes sure we properly handle that case.
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> virt/kvm/kvm_main.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7e14068..1da542b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
> */
> for (pass = 0; pass < 2 && !yielded; pass++) {
> kvm_for_each_vcpu(i, vcpu, kvm) {
> - if (!pass && i < last_boosted_vcpu) {
> + if (!pass && i <= last_boosted_vcpu) {
> i = last_boosted_vcpu;
> continue;
> } else if (pass && i > last_boosted_vcpu)
>
Looks correct. We can simplify this by introducing something like:

#define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
	for (n = atomic_read(&kvm->online_vcpus); \
	     n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
	     n--, idx = (idx + 1) % atomic_read(&kvm->online_vcpus))
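
The scan in kvm_vcpu_on_spin() could then become a single loop,
something like (untested sketch):

	int idx = (last_boosted_vcpu + 1) % atomic_read(&kvm->online_vcpus);
	int n;

	kvm_for_each_vcpu_from(idx, n, vcpu, kvm) {
		if (vcpu == me || waitqueue_active(&vcpu->wq))
			continue;
		if (kvm_vcpu_yield_to(vcpu)) {
			kvm->last_boosted_vcpu = idx;
			break;
		}
	}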

--
Gleb.

2012-06-21 10:25:25

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/21/2012 12:13 PM, Gleb Natapov wrote:
> On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T <[email protected]> wrote:
>>
>>>
>>> In ple handler code, last_boosted_vcpu (lbv) variable is
>>> serving as reference point to start when we enter.
>>
>>> Also statistical analysis (below) is showing lbv is not very well
>>> distributed with current approach.
>>
>> You are the second person to spot this bug today (yes, today).
>>
>> Due to time zones, the first person has not had a chance yet to
>> test the patch below, which might fix the issue...
>>
>> Please let me know how it goes.
>>
>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0. With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing < to <= makes sure we properly handle that case.
>>
>> Signed-off-by: Rik van Riel <[email protected]>
>> ---
>> virt/kvm/kvm_main.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>> */
>> for (pass = 0; pass < 2 && !yielded; pass++) {
>> kvm_for_each_vcpu(i, vcpu, kvm) {
>> - if (!pass && i < last_boosted_vcpu) {
>> + if (!pass && i <= last_boosted_vcpu) {
>> i = last_boosted_vcpu;
>> continue;
>> } else if (pass && i > last_boosted_vcpu)
>>
> Looks correct. We can simplify this by introducing something like:
>
> #define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
> for (n = atomic_read(&kvm->online_vcpus); \
> n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))
>

Thumbs up for this simplification. This really helps in all the places
where we want to start iterating from the middle.

2012-06-21 11:27:49

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/21/2012 01:42 AM, Raghavendra K T wrote:
> On 06/20/2012 02:21 AM, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T <[email protected]> wrote:
>>
[...]
>> Please let me know how it goes.
>
> Yes, have got result today, too tired to summarize. got better
> performance result too. will come back again tomorrow morning.
> have to post, randomized start point patch also, which I discussed to
> know the opinion.
>

Here are the results from kernbench.

PS: I think the only take-away should be that both patches perform
better, rather than reading into the actual numbers, since I am seeing
more variance, especially at 3x. Maybe I can test with a more stable
benchmark if somebody points one out.

+----------+-------------+------------+------------+-----------+
| base | Rik patch | % improve |Random patch| %improve |
+----------+-------------+------------+------------+-----------+
| 49.98 | 49.935 | 0.0901172 | 49.924286 | 0.111597 |
| 106.0051 | 89.25806 | 18.7625 | 88.122217 | 20.2933 |
| 189.82067| 175.58783 | 8.10582 | 166.99989 | 13.6651 |
+----------+-------------+------------+------------+-----------+

I have also posted the results of the randomized starting point patch below.

I agree that Rik's fix should ideally go into git ASAP. And when the
above patches go into git, feel free to add:

Tested-by: Raghavendra K T <[email protected]>

But I still see some questions unanswered.
1) Why can't we move the setting of last_boosted_vcpu up? It gives more
randomness. (As I said earlier, it gave a degradation in the 1x case
because of violent yields but a performance benefit in the 3x case; the
degradation comes from most of them yielding back to the same spinning
guy, increasing busy-wait. It gives a huge benefit with ple_window set
to higher values such as 32k/64k, but that is a different issue
altogether.)

2) Having the update of last_boosted_vcpu after yield_to does not seem
entirely correct, and having a common variable as the starting point
may not be that good either. Also, the round-robin is a little slower.

Suppose we have a 64-vcpu guest and 4 vcpus enter the PLE handler; all
of them jumping on the same guy to yield to may not be good. Rather, I
personally feel each of them starting at a different point would be a
good idea.

But this alone will not help; we need more filtering of eligible
VCPUs. For example, in the first pass, don't choose a VCPU that has
recently done a PL exit (thanks Vatsa for brainstorming this). Maybe
Peter/Avi/Rik/Vatsa can give more ideas in this area (I mean, how can
we identify that a vcpu has done a PL exit, or has exited from
spinlock context, etc.).

Another idea may be something like identifying the next eligible
lock-holder (which is already possible with the PV patches), and doing
a yield_to to him. A rough sketch of the PL-exit filter idea is below.
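
Something along these lines is what I have in mind (a completely
untested sketch; the field name and the one-jiffy window are
hypothetical, just to show the shape of the filter):

	/* hypothetical field in struct kvm_vcpu: */
	unsigned long last_pl_exit;	/* jiffies at the last PAUSE-loop exit */

	/* in the PLE / pause-filter exit path: */
	vcpu->last_pl_exit = jiffies;

	/* in the first pass of kvm_vcpu_on_spin(): */
	if (!pass && time_before(jiffies, vcpu->last_pl_exit + 1))
		continue;	/* spun very recently, unlikely to hold the lock */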

Here are the stats from the randomized starting point patch. We can see
that the patch has amazing fairness w.r.t. the starting point. IMO,
this would be great only after we add more eligibility criteria for
the target vcpus (of yield_to).

Randomizing start index
===========================
snapshot1
PLE handler yield stat :
218416 176802 164554 141184 148495 154709 159871 145157
135476 158025 139997 247638 152498 133338 122774 248228
158469 121825 138542 113351 164988 120432 136391 129855
172764 214015 158710 133049 83485 112134 81651 190878

PLE handler start stat :
547772 547725 547545 547931 547836 548656 548272 547849
548879 549012 547285 548185 548700 547132 548310 547286
547236 547307 548328 548059 547842 549152 547870 548340
548170 546996 546678 547842 547716 548096 547918 547546

snapshot2
==============
PLE handler yield stat :
310690 222992 275829 156876 187354 185373 187584 155534
151578 205994 223731 320894 194995 167011 153415 286910
181290 143653 173988 181413 194505 170330 194455 181617
251108 226577 192070 143843 137878 166393 131405 250657

PLE handler start stat :
781335 782388 781837 782942 782025 781357 781950 781695
783183 783312 782004 782804 783766 780825 783232 781013
781587 781228 781642 781595 781665 783530 781546 781950
782268 781443 781327 781666 781907 781593 782105 781073


Sorry for attaching the patch inline; I am using a dumb client. Will
post it separately if needed.

====8<====

Currently the PLE handler uses a per-VM variable as the starting
point. Get rid of the variable and use a randomized starting point.
Thanks to Vatsa for scheduler-related clarifications.

Suggested-by: Srikar <[email protected]>
Signed-off-by: Raghavendra K T <[email protected]>
---


Attachments:
randomize_starting_vcpu.patch (1.90 kB)

2012-06-22 15:12:49

by Andrew Jones

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
> Here are the results from kernbench.
>
> PS: I think we have to only take that, both the patches perform better,
> than reading into actual numbers since I am seeing more variance in
> especially 3x. may be I can test with some more stable benchmark if
> somebody points
>

Hi Raghu,

I wonder if we should back up and try to determine the best
benchmark/test environment first. I think kernbench is good, but
I wonder about how to simulate the overcommit, and to what degree
(1x, 3x, ??). What are you currently running to simulate overcommit
now? Originally we were running kernbench in one VM and cpu hogs
(bash infinite loops) in other VMs. Then we added vcpus and infinite
loops to get up to the desired overcommit. I saw later that you've
experimented with running kernbench in the other VMs as well, rather
than cpu hogs. Is that still the case?

I started playing with benchmarking these proposals myself, but so
far have stuck to the cpu hog, since I wanted to keep variability
limited. However, when targeting a reasonable host loadavg with a
bunch of cpu hog vcpus, it limits the overcommit too much. I certainly
haven't tried 3x this way. So I'm inclined to throw out the cpu hog
approach as well. The question is, what to replace it with? It appears
that the performance of the PLE and pvticketlock proposals is quite
dependent on the level of overcommit, so we should choose a target
overcommit level and also a constraint on the host loadavg first,
then determine how to setup a test environment that fits it and yields
results with low variance.

Here are results from my 1.125x overcommit test environment using
cpu hogs.

kcbench (a.k.a kernbench) results; 'mean-time (stddev)'
base-noPLE: 235.730 (25.932)
base-PLE: 238.820 (11.199)
rand_start-PLE: 283.193 (23.262)
pvticketlocks-noPLE: 244.987 (7.562)
pvticketlocks-PLE: 247.597 (17.200)

base kernel: 3.5.0-rc3 + Rik's new last_boosted patch
rand_start kernel: 3.5.0-rc3 + Raghu's proposed random start patch
pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch
+ Raghu's pvticketlock series

The relative standard deviations are as high as 11%. So I'm not
real pleased with the results, and they show degradation everywhere.
Below are the details of the benchmarking. Everything is there except
the kernel config, but our benchmarking should be reproducible with
nearly random configs anyway.

Drew

= host =
- Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
- 64 cpus, 4 nodes, 64G mem
- Fedora 17 with test kernels (see tests)

= benchmark =
- one cpu hog F17 VM
- 64 vcpus, 8G mem
- all vcpus run a bash infinite loop
- kernel: 3.5.0-rc3
- one kcbench (a.k.a kernbench) F17 VM
- 8 vcpus, 8G mem
- 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs
- kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch,
kcbench-data-0.1-9.fc17.noarch
- gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
- kernel: same test kernel as host

= test 1: base, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16): 4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
Run 2 (-j 16): 3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
Run 3 (-j 16): 4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)

mean: 235.730 stddev: 25.932

= test 2: base, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch

Run 1 (-j 16): 4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
Run 2 (-j 16): 4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
Run 3 (-j 16): 3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)

mean: 238.820 stddev: 11.199

= test 3: rand_start, PLE enabled =
- kernel: 3.5.0-rc3 + Raghu's random start patch

Run 1 (-j 16): 3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
Run 2 (-j 16): 3341 (e:299.27 P:633% U:857.49 S:1039.62 F:0)
Run 3 (-j 16): 3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)

mean: 283.193 stddev: 23.262

= test 4: pvticketlocks, PLE disabled (ple_gap=0) =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
+ PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16): 3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
Run 2 (-j 16): 4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
Run 3 (-j 16): 4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)

mean: 244.987 stddev: 7.562

= test 5: pvticketlocks, PLE enabled =
- kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
+ PARAVIRT_SPINLOCKS=y config change

Run 1 (-j 16): 3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
Run 2 (-j 16): 4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
Run 3 (-j 16): 3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)

mean: 247.597 stddev: 17.200

2012-06-22 21:02:43

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/22/2012 08:41 PM, Andrew Jones wrote:
> On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
>> Here are the results from kernbench.
>>
>> PS: I think we have to only take that, both the patches perform better,
>> than reading into actual numbers since I am seeing more variance in
>> especially 3x. may be I can test with some more stable benchmark if
>> somebody points
>>
>
> Hi Raghu,
>

First of all, thank you for your testing and for raising valid points.
It also opens an avenue for discussing all the different experiments
done over the last month (apart from tuning/benchmarking), which may
bring more feedback and valuable ideas from the community to optimize
the performance further.

I shall discuss that in a separate reply to this mail.

> I wonder if we should back up and try to determine the best
> benchmark/test environment first.

I agree; we have to be able to reproduce similar results
independently. So far sysbench (and even pgbench) has been consistent.
I am currently trying to see whether other benchmarks like hackbench
(with modified #loops), ebizzy and dbench have low variance.

[But they too are dependent on #clients/threads etc.]

> I think kernbench is good, but

Yes, kernbench at least helped me to tune the SPIN_THRESHOLD to a good
extent. But Jeremy had also pointed out that kernbench is a little
inconsistent.

> I wonder about how to simulate the overcommit, and to what degree
> (1x, 3x, ??). What are you currently running to simulate overcommit
> now? Originally we were running kernbench in one VM and cpu hogs
> (bash infinite loops) in other VMs. Then we added vcpus and infinite
> loops to get up to the desired overcommit. I saw later that you've
> experimented with running kernbench in the other VMs as well, rather
> than cpu hogs. Is that still the case?
>

Yes, I am now running the same benchmark on all the guests.

On non-PLE hardware, the while(1) cpu hogs played a good role in
simulating LHP, but on the PLE machine that did not seem to be the
case.

> I started playing with benchmarking these proposals myself, but so
> far have stuck to the cpu hog, since I wanted to keep variability
> limited. However, when targeting a reasonable host loadavg with a
> bunch of cpu hog vcpus, it limits the overcommit too much. I certainly
> haven't tried 3x this way. So I'm inclined to throw out the cpu hog
> approach as well. The question is, what to replace it with? It appears
> that the performance of the PLE and pvticketlock proposals are quite
> dependant on the level of overcommit, so we should choose a target
> overcommit level and also a constraint on the host loadavg first,
> then determine how to setup a test environment that fits it and yields
> results with low variance.
>
> Here are results from my 1.125x overcommit test environment using
> cpu hogs.

At first the results seemed backward, but after looking at the
individual runs and variations, it seems that, except for rand start,
all the results should converge to zero difference. So if we run the
same again we may get completely different results.

IMO, on a 64-vcpu guest, running -j16 may not represent a 1x load, so
what I believe is that it has resulted in more of an under-commit/
nearly-1x result. Maybe we should try at least #threads = #vcpu or
2*#vcpu.

>
> kcbench (a.k.a kernbench) results; 'mean-time (stddev)'
> base-noPLE: 235.730 (25.932)
> base-PLE: 238.820 (11.199)
> rand_start-PLE: 283.193 (23.262)

The problem currently, as we know, is that in the PLE handler we may
end up choosing the same VCPU that was in the spin loop, which would
unfortunately result in more cpu burning.

And with a randomized start_vcpu we are increasing that probability.
We need logic to not choose a vcpu that has recently PL-exited, since
it cannot be a lock-holder; and the next eligible lock-holder can be
picked up easily with the PV patches.

> pvticketlocks-noPLE: 244.987 (7.562)
> pvticketlocks-PLE: 247.597 (17.200)
>
> base kernel: 3.5.0-rc3 + Rik's new last_boosted patch
> rand_start kernel: 3.5.0-rc3 + Raghu's proposed random start patch
> pvticketlocks kernel: 3.5.0-rc3 + Rik's new last_boosted patch
> + Raghu's pvticketlock series

OK, I believe the SPIN_THRESHOLD was 2k, right? What I had observed is
that with a 2k threshold we see halt-exit overheads. Currently I am
mostly trying with 4k.

>
> The relative standard deviations are as high as 11%. So I'm not
> real pleased with the results, and they show degradation everywhere.
> Below are the details of the benchmarking. Everything is there except
> the kernel config, but our benchmarking should be reproducible with
> nearly random configs anyway.
>
> Drew
>
> = host =
> - Intel(R) Xeon(R) CPU X7560 @ 2.27GHz
> - 64 cpus, 4 nodes, 64G mem
> - Fedora 17 with test kernels (see tests)
>
> = benchmark =
> - one cpu hog F17 VM
> - 64 vcpus, 8G mem
> - all vcpus run a bash infinite loop
> - kernel: 3.5.0-rc3
> - one kcbench (a.k.a kernbench) F17 VM
> - 8 vcpus, 8G mem
> - 'kcbench -d /mnt/ram', /mnt/ram is 1G ramfs

Maybe we have to check whether 1 GB of RAM is OK when we have 128
threads; not sure.

> - kcbench-0.3-8.1.noarch, kcbench-data-2.6.38-0.1-9.fc17.noarch,
> kcbench-data-0.1-9.fc17.noarch
> - gcc (GCC) 4.7.0 20120507 (Red Hat 4.7.0-5)
> - kernel: same test kernel as host
>
> = test 1: base, PLE disabled (ple_gap=0) =
> - kernel: 3.5.0-rc3 + Rik's last_boosted patch
>
> Run 1 (-j 16): 4211 (e:237.43 P:637% U:697.98 S:815.46 F:0)
> Run 2 (-j 16): 3834 (e:260.77 P:631% U:729.69 S:917.56 F:0)
> Run 3 (-j 16): 4784 (e:208.99 P:644% U:638.17 S:708.63 F:0)
>
> mean: 235.730 stddev: 25.932
>
> = test 2: base, PLE enabled =
> - kernel: 3.5.0-rc3 + Rik's last_boosted patch
>
> Run 1 (-j 16): 4335 (e:230.67 P:639% U:657.74 S:818.28 F:0)
> Run 2 (-j 16): 4269 (e:234.20 P:647% U:743.43 S:772.52 F:0)
> Run 3 (-j 16): 3974 (e:251.59 P:639% U:724.29 S:884.21 F:0)
>
> mean: 238.820 stddev: 11.199
>
> = test 3: rand_start, PLE enabled =
> - kernel: 3.5.0-rc3 + Raghu's random start patch
>
> Run 1 (-j 16): 3898 (e:256.52 P:639% U:756.14 S:884.63 F:0)
> Run 2 (-j 16): 3341 (e:299.27 P:633% U:857.49 S:1039.62 F:0)
> Run 3 (-j 16): 3403 (e:293.79 P:635% U:857.21 S:1008.83 F:0)
>
> mean: 283.193 stddev: 23.262
>
> = test 4: pvticketlocks, PLE disabled (ple_gap=0) =
> - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
> + PARAVIRT_SPINLOCKS=y config change
>
> Run 1 (-j 16): 3963 (e:252.29 P:647% U:736.43 S:897.16 F:0)
> Run 2 (-j 16): 4216 (e:237.19 P:650% U:706.68 S:837.42 F:0)
> Run 3 (-j 16): 4073 (e:245.48 P:649% U:709.46 S:884.68 F:0)
>
> mean: 244.987 stddev: 7.562
>
> = test 5: pvticketlocks, PLE enabled =
> - kernel: 3.5.0-rc3 + Rik's last_boosted patch + Raghu's pvticketlock series
> + PARAVIRT_SPINLOCKS=y config change
>
> Run 1 (-j 16): 3978 (e:251.32 P:629% U:758.86 S:824.29 F:0)
> Run 2 (-j 16): 4369 (e:228.84 P:634% U:708.32 S:743.71 F:0)
> Run 3 (-j 16): 3807 (e:262.63 P:626% U:767.03 S:877.96 F:0)
>
> mean: 247.597 stddev: 17.200
>
>

OK, in summary, can we agree on something like: for kernbench,
1x = -j (2*#vcpu) in 1 VM;
1.5x = -j (2*#vcpu) in 1 VM and -j (#vcpu) in another, and so on;
also a SPIN_THRESHOLD of 4k?

Any ideas on benchmarks are welcome from all.

- Raghu

2012-06-23 18:36:19

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/23/2012 02:30 AM, Raghavendra K T wrote:
> On 06/22/2012 08:41 PM, Andrew Jones wrote:
>> On Thu, Jun 21, 2012 at 04:56:08PM +0530, Raghavendra K T wrote:
>>> Here are the results from kernbench.
>>>
>>> PS: I think we have to only take that, both the patches perform better,
>>> than reading into actual numbers since I am seeing more variance in
>>> especially 3x. may be I can test with some more stable benchmark if
>>> somebody points
>>>
[...]
> can we agree like, for kernbench 1x= -j (2*#vcpu) in 1 vm.
> 1.5x = -j (2*#vcpu) in 1 vm and -j (#vcpu) in other.. and so on.
> also a SPIN_THRESHOLD of 4k?

Please forget about the 1.5x definition above; I am not too sure about that.

>
> Any ideas on benchmarks is welcome from all.
>

My runs for the other benchmarks did not have Rik's patch, so I am
re-spinning everything with it now.

Here is the detailed info on the environment and benchmarks I am
currently running. Let me know if you have any comments.

=======
Kernel 3.5.0-rc1 with Rik's PLE handler fix as base

Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
32 core machine

Host: enterprise Linux, gcc version 4.4.6 20120305 (Red Hat 4.4.6-4)
(GCC), with test kernels
Guest: Fedora 16 with different kernels built from the same source
tree. 32 vcpus, 8GB memory. (Configs not changed across patches except
for CONFIG_PARAVIRT_SPINLOCK.)

Note: for the PV patches, SPIN_THRESHOLD is set to 4k.

Benchmarks:
1) kernbench: kernbench-0.50

cmd:
echo "3" > /proc/sys/vm/drop_caches
ccache -C
kernbench -f -H -M -o 2*vcpu

Very first run in kernbench is omitted.

2) dbench: dbench version 4.00
cmd: dbench --warmup=30 -t 120 2*vcpu

3) hackbench:
https://build.opensuse.org/package/files?package=hackbench&project=benchmark
hackbench.c modified with loops=10000
used hackbench with num-threads = 2* vcpu

4) Specjbb: specjbb2000-1.02
Input Properties:
ramp_up_seconds = 30
measurement_seconds = 120
forcegc = true
starting_number_warehouses = 1
increment_number_warehouses = 1
ending_number_warehouses = 8


5) sysbench: 0.4.12
sysbench --test=oltp --db-driver=pgsql prepare
sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp
--oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
Note that the database driver for this is pgsql.


6) ebizzy: release 0.3
cmd: ebizzy -S 120

- specjbb ran for 1x and 2x; the others mostly for 1x, 2x, 3x overcommit.
- An overcommit of 2x means the same benchmark running on 2 guests.
- The sample size for each overcommit level is mostly 8.

Note: I ran kernbench with the old kernbench 0.50; maybe I can try
kcbench with ramfs if necessary.

Will come back soon with detailed results.
> - Raghu

2012-06-27 20:29:14

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/24/2012 12:04 AM, Raghavendra K T wrote:
> On 06/23/2012 02:30 AM, Raghavendra K T wrote:
>> On 06/22/2012 08:41 PM, Andrew Jones wrote:
[...]
> My run for other benchmarks did not have Rik's patches, so re-spinning
> everything with that now.
>
> Here is the detailed info on env and benchmark I am currently trying.
> Let me know if you have any comments
>
> =======
> kernel 3.5.0-rc1 with Rik's Ple handler fix as base
>
> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
> 32 core machine
>
> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4)
> (GCC) with test kernels
> Guest: fedora 16 with different built-in kernel from same source tree.
> 32 vcpus 8GB memory. (configs not changed with patches except for
> CONFIG_PARAVIRT_SPINLOCK)
>
> Note: for Pv patches, SPIN_THRESHOLD is set to 4k
>
> Benchmarks:
> 1) kernbench: kernbench-0.50
>
> cmd:
> echo "3" > /proc/sys/vm/drop_caches
> ccache -C
> kernbench -f -H -M -o 2*vcpu
>
> Very first run in kernbench is omitted.
>
> 2) dbench: dbench version 4.00
> cmd: dbench --warmup=30 -t 120 2*vcpu
>
> 3) hackbench:
>https://build.opensuse.org/package/files?package=hackbench&project=benchmark
>
> hackbench.c modified with loops=10000
> used hackbench with num-threads = 2* vcpu
>
> 4) Specjbb: specjbb2000-1.02
> Input Properties:
> ramp_up_seconds = 30
> measurement_seconds = 120
> forcegc = true
> starting_number_warehouses = 1
> increment_number_warehouses = 1
> ending_number_warehouses = 8
>
>
> 5) sysbench: 0.4.12
> sysbench --test=oltp --db-driver=pgsql prepare
> sysbench --num-threads=2*vcpu --max-requests=100000 --test=oltp
> --oltp-table-size=500000 --db-driver=pgsql --oltp-read-only run
> Note that driver for this pgsql.
>
>
> 6) ebizzy: release 0.3
> cmd: ebizzy -S 120
>
> - specjbb ran for 1x and 2x others mostly for 1x, 2x, 3x overcommit.
> - overcommit of 2x means same benchmark running on 2 guests.
> - sample for each overcommit is mostly 8
>
> Note: I ran kernbench with old kernbench0.50, may be I can try kcbench
> with ramfs if necessary
>
> will soon come with detailed results

With the above environment, here are the results I have for a 4k
SPIN_THRESHOLD.

Lower is better for following benchmarks:
kernbench: (time in sec)
hackbench: (time in sec)
sysbench : (time in sec)

Higher is better for following benchmarks:
specjbb: score (Throughput)
dbench : Throughput in MB/sec
ebizzy : records/sec

In summary, the current PV has a huge benefit on the non-PLE machine.

On the PLE machine, the results become very sensitive to load, type of
workload and SPIN_THRESHOLD. PLE interference also has a significant
effect on them. But PV still has a slight edge over non-PV.

Overall, specjbb, sysbench and kernbench seem to do well with PV.

dbench has been a little unreliable (which is the same reason I have
not published the 2x, 3x results, though the experimental values are
included in the tarball), but it seems to be on par with PV.

With PV, hackbench is better in the non-overcommit case and ebizzy is
better in the overcommit case. [ebizzy seems to be very sensitive
w.r.t. SPIN_THRESHOLD.]

I have still not experimented with SPIN_THRESHOLDs of 2k/8k, or
with/without PLE, after applying Rik's fix.

+-----------+-----------+-----------+------------+---------+
specjbb
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
|114232.2500|21774.0660 |122591.0000| 18239.0900 | 7.31733 |
|112154.5000|19696.6860 |113386.2500| 22262.5890 | 1.09826 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
kernbench
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
| 48.9150 | 0.8608 | 48.5550 | 0.7372 | 0.74143 |
| 96.3691 | 7.9724 | 96.6367 | 1.6938 |-0.27691 |
| 192.6972 | 9.1881 | 188.3195 | 8.1267 | 2.32461 |
| 320.6500 | 29.6892 | 302.1225 | 16.0515 | 6.13245 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
sysbench
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
| 12.4082 | 0.2370 | 12.2797 | 0.1037 | 1.04644 |
| 14.1705 | 0.4272 | 14.0300 | 1.1478 | 1.00143 |
| 19.3769 | 1.0833 | 18.9745 | 0.0560 | 2.12074 |
| 24.5373 | 1.3237 | 22.3078 | 0.8999 | 9.99426 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
hackbench
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
| 73.2627 | 11.2413 | 67.5125 | 2.5722 | 8.51724|
| 134.4294 | 1.9688 | 153.6160 | 5.2033 |-12.48998|
| 215.4521 | 3.8672 | 238.8965 | 3.0035 | -9.81362|
| 303.8553 | 5.0427 | 310.3569 | 6.1463 | -2.09488|
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
ebizzy
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
| 1108.6250 | 19.3090 | 1088.2500 | 11.0809 |-1.83786 |
| 1662.6250 | 150.5466 | 1064.0000 | 2.8284 |-36.00481|
| 1394.0000 | 85.0867 | 1073.2857 | 10.3877 |-23.00676|
| 1172.1250 | 20.3501 | 1245.8750 | 25.3852 | 6.29199 |
+-----------+-----------+-----------+------------+---------+

+-----------+-----------+-----------+------------+---------+
dbench
+-----------+-----------+-----------+------------+---------+
| value | stdev | value | stdev | %improve|
+-----------+-----------+-----------+------------+---------+
| 29.0378 | 1.1625 | 28.8466 | 1.1132 |-0.65845 |
+-----------+-----------+-----------+------------+---------+

(benchmark values will be attached in reply to this mail)

Planning to post the patches rebased to 3.5-rc. Avi, Ingo: please let me know.

2012-06-27 20:31:04

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case with benchmark detail attachment

On 06/28/2012 01:57 AM, Raghavendra K T wrote:
> On 06/24/2012 12:04 AM, Raghavendra K T wrote:
>> On 06/23/2012 02:30 AM, Raghavendra K T wrote:
>>> On 06/22/2012 08:41 PM, Andrew Jones wrote:
[...]
>
> (benchmark values will be attached in reply to this mail)


Attachments:
pv_benchmark_summary.bz2 (6.90 kB)

2012-06-28 02:16:14

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/21/2012 12:13 PM, Gleb Natapov wrote:
> On Tue, Jun 19, 2012 at 04:51:04PM -0400, Rik van Riel wrote:
>> On Wed, 20 Jun 2012 01:50:50 +0530
>> Raghavendra K T <[email protected]> wrote:
>>
>>>
>>> In ple handler code, last_boosted_vcpu (lbv) variable is
>>> serving as reference point to start when we enter.
>>
>>> Also statistical analysis (below) is showing lbv is not very well
>>> distributed with current approach.
>>
>> You are the second person to spot this bug today (yes, today).
>>
>> Due to time zones, the first person has not had a chance yet to
>> test the patch below, which might fix the issue...
>>
>> Please let me know how it goes.
>>
>> ====8<====
>>
>> If last_boosted_vcpu == 0, then we fall through all test cases and
>> may end up with all VCPUs pouncing on vcpu 0. With a large enough
>> guest, this can result in enormous runqueue lock contention, which
>> can prevent vcpu0 from running, leading to a livelock.
>>
>> Changing < to <= makes sure we properly handle that case.
>>
>> Signed-off-by: Rik van Riel <[email protected]>
>> ---
>> virt/kvm/kvm_main.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 7e14068..1da542b 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1586,7 +1586,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>> */
>> for (pass = 0; pass < 2 && !yielded; pass++) {
>> kvm_for_each_vcpu(i, vcpu, kvm) {
>> - if (!pass && i < last_boosted_vcpu) {
>> + if (!pass && i <= last_boosted_vcpu) {
>> i = last_boosted_vcpu;
>> continue;
>> } else if (pass && i > last_boosted_vcpu)
>>
> Looks correct. We can simplify this by introducing something like:
>
> #define kvm_for_each_vcpu_from(idx, n, vcpup, kvm) \
> for (n = atomic_read(&kvm->online_vcpus); \
> n && (vcpup = kvm_get_vcpu(kvm, idx)) != NULL; \
> n--, idx = (idx+1) % atomic_read(&kvm->online_vcpus))
>

Gleb, Rik,
Any updates on this, or on the status of Rik's patch?
I can come up with the above suggested cleanup patch, with Gleb's
From and Signed-off-by.

Please let me know.

2012-06-28 16:02:06

by Andrew Jones

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case



----- Original Message -----
> In summary, current PV has huge benefit on non-PLE machine.
>
> On PLE machine, the results become very sensitive to load, type of
> workload and SPIN_THRESHOLD. Also PLE interference has significant
> effect on them. But still it has slight edge over non PV.
>

Hi Raghu,

sorry for my slow response. I'm on vacation right now (until the
9th of July) and I have limited access to mail. Also, thanks for
continuing the benchmarking. Question, when you compare PLE vs.
non-PLE, are you using different machines (one with and one
without), or are you disabling its use by loading the kvm module
with the ple_gap=0 modparam as I did?

Drew

2012-06-28 16:24:08

by Raghavendra K T

Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/28/2012 09:30 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> In summary, current PV has huge benefit on non-PLE machine.
>>
>> On PLE machine, the results become very sensitive to load, type of
>> workload and SPIN_THRESHOLD. Also PLE interference has significant
>> effect on them. But still it has slight edge over non PV.
>>
>
> Hi Raghu,
>
> sorry for my slow response. I'm on vacation right now (until the
> 9th of July) and I have limited access to mail.

Ok. Happy Vacation :)

> Also, thanks for
> continuing the benchmarking. Question, when you compare PLE vs.
> non-PLE, are you using different machines (one with and one
> without), or are you disabling its use by loading the kvm module
> with the ple_gap=0 modparam as I did?

Yes, I am doing the same when I say "with PLE disabled" and comparing
the benchmarks (i.e. loading the kvm module with ple_gap=0).
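
(Concretely, on this Intel host that amounts to something like:

 rmmod kvm_intel
 modprobe kvm_intel ple_gap=0

with no guests running while the module is reloaded.)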

But the older non-PLE results were on a different machine altogether
(I had limited access to the PLE machine).

2012-06-28 22:57:03

by Vinod, Chegu

Subject: RE: [PATCH] kvm: handle last_boosted_vcpu = 0 case

Hello,

I am just catching up on this email thread...

Perhaps one of you may be able to help answer this query, preferably
along with some data. [BTW, I do understand the basic intent behind
PLE in the typical (sweet spot) use case where there is oversubscription
etc., and the need to optimize the PLE handler in the host.]

In a use case where the host has fewer but much larger guests (say 40
VCPUs and higher) and there is no oversubscription (i.e. the # of vcpus
across guests <= physical cpus in the host, and perhaps each guest has
its vcpus pinned to specific physical cpus for other reasons), I would
like to understand if/how PLE really helps. For these use cases, would
it be OK to turn PLE off (ple_gap=0), since there is no real need to
take an exit and find some other VCPU to yield to?

Thanks
Vinod

-----Original Message-----
From: Raghavendra K T [mailto:[email protected]]
Sent: Thursday, June 28, 2012 9:22 AM
To: Andrew Jones
Cc: Rik van Riel; Marcelo Tosatti; Srikar; Srivatsa Vaddagiri; Peter Zijlstra; Nikunj A. Dadhania; KVM; LKML; Gleb Natapov; Vinod, Chegu; Jeremy Fitzhardinge; Avi Kivity; Ingo Molnar
Subject: Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

On 06/28/2012 09:30 PM, Andrew Jones wrote:
>
>
> ----- Original Message -----
>> In summary, current PV has huge benefit on non-PLE machine.
>>
>> On PLE machine, the results become very sensitive to load, type of
>> workload and SPIN_THRESHOLD. Also PLE interference has significant
>> effect on them. But still it has slight edge over non PV.
>>
>
> Hi Raghu,
>
> sorry for my slow response. I'm on vacation right now (until the 9th
> of July) and I have limited access to mail.

Ok. Happy Vacation :)

> Also, thanks for
> continuing the benchmarking. Question, when you compare PLE vs.
> non-PLE, are you using different machines (one with and one without),
> or are you disabling its use by loading the kvm module with the
> ple_gap=0 modparam as I did?

Yes, I am doing the same when I say with PLE disabled and comparing the benchmarks (i.e loading kvm module with ple_gap=0).

But older non-PLE results were on a different machine altogether. (I had limited access to PLE machine).

