2012-10-03 12:26:12

by Raghavendra K T

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

* Avi Kivity <[email protected]> [2012-09-24 17:41:19]:

> On 09/21/2012 08:24 PM, Raghavendra K T wrote:
> > On 09/21/2012 06:32 PM, Rik van Riel wrote:
> >> On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> >>> From: Raghavendra K T <[email protected]>
> >>>
> >>> When total number of VCPUs of system is less than or equal to physical
> >>> CPUs,
> >>> PLE exits become costly since each VCPU can have dedicated PCPU, and
> >>> trying to find a target VCPU to yield_to just burns time in PLE handler.
> >>>
> >>> This patch reduces overhead, by simply doing a return in such
> >>> scenarios by
> >>> checking the length of current cpu runqueue.
> >>
> >> I am not convinced this is the way to go.
> >>
> >> The VCPU that is holding the lock, and is not releasing it,
> >> probably got scheduled out. That implies that VCPU is on a
> >> runqueue with at least one other task.
> >
> > I see your point here, we have two cases:
> >
> > case 1)
> >
> > rq1 : vcpu1->wait(lockA) (spinning)
> > rq2 : vcpu2->holding(lockA) (running)
> >
> > Here Ideally vcpu1 should not enter PLE handler, since it would surely
> > get the lock within ple_window cycle. (assuming ple_window is tuned for
> > that workload perfectly).
> >
> > May be this explains why we are not seeing benefit with kernbench.
> >
> > On the other side, Since we cannot have a perfect ple_window tuned for
> > all type of workloads, for those workloads, which may need more than
> > 4096 cycles, we gain. Is that what we are seeing in the benefited
> > cases?
>
> Maybe we need to increase the ple window regardless. 4096 cycles is 2
> microseconds or less (call it t_spin). The overhead from
> kvm_vcpu_on_spin() and the associated task switches is at least a few
> microseconds, increasing as contention is added (call it t_yield). The
> time for a natural context switch is several milliseconds (call it
> t_slice). There is also the time the lock holder owns the lock,
> assuming no contention (t_hold).
>
> If t_yield > t_spin, then in the undercommitted case it dominates
> t_spin. If t_hold > t_spin we lose badly.
>
> If t_spin > t_yield, then the undercommitted case doesn't suffer as much
> as most of the spinning happens in the guest instead of the host, so it
> can pick up the unlock timely. We don't lose too much in the
> overcommitted case provided the values aren't too far apart (say a
> factor of 3).
>
> Obviously t_spin must be significantly smaller than t_slice, otherwise
> it accomplishes nothing.
>
> Regarding t_hold: if it is small, then a larger t_spin helps avoid false
> exits. If it is large, then we're not very sensitive to t_spin. It
> doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
> yielding for several milliseconds.
>
> So I think it's worth trying again with ple_window of 20000-40000.
>
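
(To make the quoted t_spin / t_yield arithmetic concrete, here is a small
stand-alone C model; every timing value in it is an illustrative assumption,
not a measurement from these runs.)

/* Toy model of the t_spin / t_yield / t_hold / t_slice reasoning above.
 * All values are illustrative assumptions (in nanoseconds). */
#include <stdio.h>

int main(void)
{
	double t_spin  = 2000.0;    /* time spent spinning before a PLE exit (~4096 cycles) */
	double t_yield = 5000.0;    /* assumed cost of kvm_vcpu_on_spin() plus task switches */
	double t_hold  = 1000.0;    /* assumed uncontended lock hold time */
	double t_slice = 3000000.0; /* natural scheduler time slice */

	/* Undercommitted: the holder keeps running, so the lock frees after ~t_hold.
	 * Pure spinning costs min(t_hold, t_spin); a PLE exit adds t_yield on top. */
	double cost_spin_only = t_hold < t_spin ? t_hold : t_spin;
	double cost_with_exit = t_spin + t_yield;

	printf("undercommitted: spin-only %.0f ns vs. spin+exit %.0f ns\n",
	       cost_spin_only, cost_with_exit);

	/* Overcommitted: the holder may be preempted for up to t_slice, so the
	 * exit (and yield_to) is what saves us from spinning for milliseconds. */
	printf("overcommitted: spinning alone could burn up to %.0f ns per wait\n",
	       t_slice);
	printf("t_spin must stay well below t_slice for the exit to pay off\n");
	return 0;
}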

Hi Avi,

I ran different benchmarks with increasing ple_window, and the results do
not seem encouraging for increasing ple_window.

Results:
16 core PLE machine with 16 vcpu guest.

base kernel = 3.6-rc5 + ple handler optimization patch
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k
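
(The window actually in effect for each run can be read back on the host; a
minimal C sketch, assuming kvm_intel is the loaded module and that it exposes
ple_window as a module parameter under /sys/module:)

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/module/kvm_intel/parameters/ple_window", "r");
	unsigned long ple_window = 0;

	if (!f) {
		perror("open ple_window");
		return 1;
	}
	if (fscanf(f, "%lu", &ple_window) != 1) {
		fclose(f);
		fprintf(stderr, "could not parse ple_window\n");
		return 1;
	}
	printf("kvm_intel ple_window = %lu\n", ple_window);
	fclose(f);
	return 0;
}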


Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096

                 base_pleopt_8k   base_pleopt_16k   base_pleopt_32k
-----------------------------------------------------------------
kernbench_1x        -5.54915        -15.94529         -44.31562
kernbench_2x        -7.89399        -17.75039         -37.73498
-----------------------------------------------------------------
sysbench_1x          0.45955         -0.98778           0.05252
sysbench_2x          1.44071         -0.81625           1.35620
sysbench_3x          0.45549          1.51795          -0.41573
-----------------------------------------------------------------
hackbench_1x        -3.80272        -13.91456         -40.79059
hackbench_2x        -4.78999         -7.61382          -7.24475
-----------------------------------------------------------------
ebizzy_1x           -2.54626        -16.86050         -38.46109
ebizzy_2x           -8.75526        -19.29116         -48.33314
-----------------------------------------------------------------

I also got perf top output to analyse the difference. The difference comes
from flushtlb (and also the spinlock).

Ebizzy run for 4k ple_window
- 87.20% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
+ 52.89% release_pages
+ 47.10% pagevec_lru_move_fn
- 5.71% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
+ 86.03% default_send_IPI_mask_allbutself_phys
+ 13.96% default_send_IPI_mask_sequence_phys
- 3.10% [kernel] [k] smp_call_function_many
smp_call_function_many


Ebizzy run for 32k ple_window

- 91.40% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
+ 53.13% release_pages
+ 46.86% pagevec_lru_move_fn
- 4.38% [kernel] [k] smp_call_function_many
smp_call_function_many
- 2.51% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
+ 90.76% default_send_IPI_mask_allbutself_phys
+ 9.24% default_send_IPI_mask_sequence_phys


Below is the detailed result:
patch = base_pleopt_8k
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patch stdev %improve
+-----------+-----------+-----------+------------+-----------+
41.0027 0.7990 43.2780 0.5180 -5.54915
89.2983 1.2406 96.3475 1.8891 -7.89399
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
9.9010 0.0558 9.8555 0.1246 0.45955
19.7611 0.4290 19.4764 0.0835 1.44071
29.1775 0.9903 29.0446 0.8641 0.45549
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
hackbench
+-----------+-----------+-----------+------------+-----------+
77.1580 1.9787 80.0921 2.9696 -3.80272
239.2490 1.5660 250.7090 2.6074 -4.78999
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
4256.2500 186.8053 4147.8750 206.1840 -2.54626
2197.2500 93.1048 2004.8750 85.7995 -8.75526
+-----------+-----------+-----------+------------+-----------+

patch = base_pleopt_16k
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patch stdev %improve
+-----------+-----------+-----------+------------+-----------+
41.0027 0.7990 47.5407 0.5739 -15.94529
89.2983 1.2406 105.1491 1.2244 -17.75039
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
9.9010 0.0558 9.9988 0.1106 -0.98778
19.7611 0.4290 19.9224 0.9016 -0.81625
29.1775 0.9903 28.7346 0.2788 1.51795
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
hackbench
+-----------+-----------+-----------+------------+-----------+
77.1580 1.9787 87.8942 2.2132 -13.91456
239.2490 1.5660 257.4650 5.3674 -7.61382
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
4256.2500 186.8053 3538.6250 101.1165 -16.86050
2197.2500 93.1048 1773.3750 91.8414 -19.29116
+-----------+-----------+-----------+------------+-----------+

patch = base_pleopt_32k
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patch stdev %improve
+-----------+-----------+-----------+------------+-----------+
41.0027 0.7990 59.1733 0.8102 -44.31562
89.2983 1.2406 122.9950 1.5534 -37.73498
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
9.9010 0.0558 9.8958 0.0593 0.05252
19.7611 0.4290 19.4931 0.1767 1.35620
29.1775 0.9903 29.2988 1.0420 -0.41573
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
hackbench
+-----------+-----------+-----------+------------+-----------+
77.1580 1.9787 108.6312 13.1500 -40.79059
239.2490 1.5660 256.5820 2.2722 -7.24475
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
4256.2500 186.8053 2619.2500 80.8150 -38.46109
2197.2500 93.1048 1135.2500 22.2887 -48.33314
+-----------+-----------+-----------+------------+-----------+


2012-10-03 17:06:31

by Avi Kivity

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>> So I think it's worth trying again with ple_window of 20000-40000.
>>
>
> Hi Avi,
>
> I ran different benchmarks increasing ple_window, and results does not
> seem to be encouraging for increasing ple_window.

Thanks for testing! Comments below.

> Results:
> 16 core PLE machine with 16 vcpu guest.
>
> base kernel = 3.6-rc5 + ple handler optimization patch
> base_pleopt_8k = base kernel + ple window = 8k
> base_pleopt_16k = base kernel + ple window = 16k
> base_pleopt_32k = base kernel + ple window = 32k
>
>
> Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096
>
> base_pleopt_8k base_pleopt_16k base_pleopt_32k
> -----------------------------------------------------------------
> kernbench_1x -5.54915 -15.94529 -44.31562
> kernbench_2x -7.89399 -17.75039 -37.73498

So, 44% degradation even with no overcommit? That's surprising.

> I also got perf top output to analyse the difference. Difference comes
> because of flushtlb (and also spinlock).

That's in the guest, yes?

>
> Ebizzy run for 4k ple_window
> - 87.20% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> - 100.00% _raw_spin_unlock_irqrestore
> + 52.89% release_pages
> + 47.10% pagevec_lru_move_fn
> - 5.71% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> + 86.03% default_send_IPI_mask_allbutself_phys
> + 13.96% default_send_IPI_mask_sequence_phys
> - 3.10% [kernel] [k] smp_call_function_many
> smp_call_function_many
>
>
> Ebizzy run for 32k ple_window
>
> - 91.40% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> - 100.00% _raw_spin_unlock_irqrestore
> + 53.13% release_pages
> + 46.86% pagevec_lru_move_fn
> - 4.38% [kernel] [k] smp_call_function_many
> smp_call_function_many
> - 2.51% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> + 90.76% default_send_IPI_mask_allbutself_phys
> + 9.24% default_send_IPI_mask_sequence_phys
>

Both the 4k and the 32k results are crazy. Why is
arch_local_irq_restore() so prominent? Do you have a very high
interrupt rate in the guest?




--
error compiling committee.c: too many arguments to function

2012-10-04 10:53:36

by Raghavendra K T

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/03/2012 10:35 PM, Avi Kivity wrote:
> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>
>>
>> Hi Avi,
>>
>> I ran different benchmarks increasing ple_window, and results does not
>> seem to be encouraging for increasing ple_window.
>
> Thanks for testing! Comments below.
>
>> Results:
>> 16 core PLE machine with 16 vcpu guest.
>>
>> base kernel = 3.6-rc5 + ple handler optimization patch
>> base_pleopt_8k = base kernel + ple window = 8k
>> base_pleopt_16k = base kernel + ple window = 16k
>> base_pleopt_32k = base kernel + ple window = 32k
>>
>>
>> Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096
>>
>> base_pleopt_8k base_pleopt_16k base_pleopt_32k
>> -----------------------------------------------------------------
>> kernbench_1x -5.54915 -15.94529 -44.31562
>> kernbench_2x -7.89399 -17.75039 -37.73498
>
> So, 44% degradation even with no overcommit? That's surprising.

Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is
spending 8 times the original ple_window cycles for 16 vcpus
significant?

>
>> I also got perf top output to analyse the difference. Difference comes
>> because of flushtlb (and also spinlock).
>
> That's in the guest, yes?

Yes. Perf is in guest.

>
>>
>> Ebizzy run for 4k ple_window
>> - 87.20% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> - 100.00% _raw_spin_unlock_irqrestore
>> + 52.89% release_pages
>> + 47.10% pagevec_lru_move_fn
>> - 5.71% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> + 86.03% default_send_IPI_mask_allbutself_phys
>> + 13.96% default_send_IPI_mask_sequence_phys
>> - 3.10% [kernel] [k] smp_call_function_many
>> smp_call_function_many
>>
>>
>> Ebizzy run for 32k ple_window
>>
>> - 91.40% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> - 100.00% _raw_spin_unlock_irqrestore
>> + 53.13% release_pages
>> + 46.86% pagevec_lru_move_fn
>> - 4.38% [kernel] [k] smp_call_function_many
>> smp_call_function_many
>> - 2.51% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> + 90.76% default_send_IPI_mask_allbutself_phys
>> + 9.24% default_send_IPI_mask_sequence_phys
>>
>
> Both the 4k and the 32k results are crazy. Why is
> arch_local_irq_restore() so prominent? Do you have a very high
> interrupt rate in the guest?

How do I measure whether I have a high interrupt rate in the guest?
From the /proc/interrupts numbers I am not able to judge :(

I went back and got the results on a 32 core machine with a 32 vcpu guest.
Strangely, I got results supporting the claim that increasing ple_window
helps in the non-overcommitted scenario.

32 core 32 vcpu guest 1x scenarios.

ple_gap = 0
kernbench: Elapsed Time 38.61
ebizzy: 7463 records/s

ple_window = 4k
kernbench: Elapsed Time 43.5067
ebizzy: 2528 records/s

ple_window = 32k
kernbench: Elapsed Time 39.4133
ebizzy: 7196 records/s


perf top for ebizzy for above:
ple_gap = 0
- 84.74% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
+ 50.96% release_pages
+ 49.02% pagevec_lru_move_fn
- 6.57% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
+ 92.54% default_send_IPI_mask_allbutself_phys
+ 7.46% default_send_IPI_mask_sequence_phys
- 1.54% [kernel] [k] smp_call_function_many
smp_call_function_many

ple_window = 32k
- 84.47% [kernel] [k] arch_local_irq_restore
+ arch_local_irq_restore
- 6.46% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
+ 93.51% default_send_IPI_mask_allbutself_phys
+ 6.49% default_send_IPI_mask_sequence_phys
- 1.80% [kernel] [k] smp_call_function_many
- smp_call_function_many
+ 99.98% native_flush_tlb_others


ple_window = 4k
- 91.35% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
+ 53.19% release_pages
+ 46.81% pagevec_lru_move_fn
- 3.90% [kernel] [k] smp_call_function_many
smp_call_function_many
- 2.94% [kernel] [k] arch_local_irq_restore
- arch_local_irq_restore
+ 93.12% default_send_IPI_mask_allbutself_phys
+ 6.88% default_send_IPI_mask_sequence_phys

Let me know if I can try something here..
/me confused :(

2012-10-04 12:41:45

by Avi Kivity

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>
>>>
>>> Hi Avi,
>>>
>>> I ran different benchmarks increasing ple_window, and results does not
>>> seem to be encouraging for increasing ple_window.
>>
>> Thanks for testing! Comments below.
>>
>>> Results:
>>> 16 core PLE machine with 16 vcpu guest.
>>>
>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>> base_pleopt_8k = base kernel + ple window = 8k
>>> base_pleopt_16k = base kernel + ple window = 16k
>>> base_pleopt_32k = base kernel + ple window = 32k
>>>
>>>
>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>> ple_window = 4096
>>>
>>> base_pleopt_8k base_pleopt_16k base_pleopt_32k
>>> -----------------------------------------------------------------
>>>
>>> kernbench_1x -5.54915 -15.94529 -44.31562
>>> kernbench_2x -7.89399 -17.75039 -37.73498
>>
>> So, 44% degradation even with no overcommit? That's surprising.
>
> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> spending 8 times the original ple_window cycles for 16 vcpus
> significant?

A PLE exit when not overcommitted cannot do any good; it is better to
spin in the guest rather than look for candidates on the host. In fact,
when we benchmark we often disable PLE completely.

>
>>
>>> I also got perf top output to analyse the difference. Difference comes
>>> because of flushtlb (and also spinlock).
>>
>> That's in the guest, yes?
>
> Yes. Perf is in guest.
>
>>
>>>
>>> Ebizzy run for 4k ple_window
>>> - 87.20% [kernel] [k] arch_local_irq_restore
>>> - arch_local_irq_restore
>>> - 100.00% _raw_spin_unlock_irqrestore
>>> + 52.89% release_pages
>>> + 47.10% pagevec_lru_move_fn
>>> - 5.71% [kernel] [k] arch_local_irq_restore
>>> - arch_local_irq_restore
>>> + 86.03% default_send_IPI_mask_allbutself_phys
>>> + 13.96% default_send_IPI_mask_sequence_phys
>>> - 3.10% [kernel] [k] smp_call_function_many
>>> smp_call_function_many
>>>
>>>
>>> Ebizzy run for 32k ple_window
>>>
>>> - 91.40% [kernel] [k] arch_local_irq_restore
>>> - arch_local_irq_restore
>>> - 100.00% _raw_spin_unlock_irqrestore
>>> + 53.13% release_pages
>>> + 46.86% pagevec_lru_move_fn
>>> - 4.38% [kernel] [k] smp_call_function_many
>>> smp_call_function_many
>>> - 2.51% [kernel] [k] arch_local_irq_restore
>>> - arch_local_irq_restore
>>> + 90.76% default_send_IPI_mask_allbutself_phys
>>> + 9.24% default_send_IPI_mask_sequence_phys
>>>
>>
>> Both the 4k and the 32k results are crazy. Why is
>> arch_local_irq_restore() so prominent? Do you have a very high
>> interrupt rate in the guest?
>
> How to measure if I have high interrupt rate in guest?
> From /proc/interrupt numbers I am not able to judge :(

'vmstat 1'
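
(In 'vmstat 1' output the "in" column is interrupts per second and "cs" is
context switches per second. As a rough cross-check from inside the guest, a
minimal C sketch that samples the system-wide interrupt total from /proc/stat
one second apart:)

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* total interrupts serviced since boot: first field of the "intr" line */
static unsigned long long total_interrupts(void)
{
	FILE *f = fopen("/proc/stat", "r");
	char line[8192];
	unsigned long long total = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "intr ", 5) == 0) {
			sscanf(line + 5, "%llu", &total);
			break;
		}
	}
	fclose(f);
	return total;
}

int main(void)
{
	unsigned long long before = total_interrupts();
	sleep(1);
	unsigned long long after = total_interrupts();

	printf("~%llu interrupts/sec in this guest\n", after - before);
	return 0;
}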

>
> I went back and got the results on a 32 core machine with 32 vcpu guest.
> Strangely, I got result supporting the claim that increasing ple_window
> helps for non-overcommitted scenario.
>
> 32 core 32 vcpu guest 1x scenarios.
>
> ple_gap = 0
> kernbench: Elapsed Time 38.61
> ebizzy: 7463 records/s
>
> ple_window = 4k
> kernbench: Elapsed Time 43.5067
> ebizzy: 2528 records/s
>
> ple_window = 32k
> kernebench : Elapsed Time 39.4133
> ebizzy: 7196 records/s

So maybe something was wrong with the first measurement.

>
>
> perf top for ebizzy for above:
> ple_gap = 0
> - 84.74% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> - 100.00% _raw_spin_unlock_irqrestore
> + 50.96% release_pages
> + 49.02% pagevec_lru_move_fn
> - 6.57% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> + 92.54% default_send_IPI_mask_allbutself_phys
> + 7.46% default_send_IPI_mask_sequence_phys
> - 1.54% [kernel] [k] smp_call_function_many
> smp_call_function_many

Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.

>
> ple_window = 32k
> - 84.47% [kernel] [k] arch_local_irq_restore
> + arch_local_irq_restore
> - 6.46% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> + 93.51% default_send_IPI_mask_allbutself_phys
> + 6.49% default_send_IPI_mask_sequence_phys
> - 1.80% [kernel] [k] smp_call_function_many
> - smp_call_function_many
> + 99.98% native_flush_tlb_others
>
>
> ple_window = 4k
> - 91.35% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> - 100.00% _raw_spin_unlock_irqrestore
> + 53.19% release_pages
> + 46.81% pagevec_lru_move_fn
> - 3.90% [kernel] [k] smp_call_function_many
> smp_call_function_many
> - 2.94% [kernel] [k] arch_local_irq_restore
> - arch_local_irq_restore
> + 93.12% default_send_IPI_mask_allbutself_phys
> + 6.88% default_send_IPI_mask_sequence_phys
>
> Let me know if I can try something here..
> /me confused :(
>

I'm even more confused. Please try 'perf kvm' from the host, it does
fewer dirty tricks with the PMU and so may be more accurate.

--
error compiling committee.c: too many arguments to function

2012-10-04 13:07:41

by Peter Zijlstra

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction.

Smells like a software fallback that doesn't do NMI, hrtimer based
sampling typically hits popf where we re-enable interrupts.
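
(A quick way to check for that fallback from inside the guest is to try
opening a hardware cycle counter directly; a minimal sketch using the
perf_event_open(2) syscall, which is the same interface perf uses:)

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;

	/* pid = 0 (self), cpu = -1 (any), no group leader, no flags */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open(HW_CPU_CYCLES)");
		printf("no hardware cycle counter: perf falls back to software (hrtimer) sampling\n");
		return 1;
	}
	printf("hardware cycle counter available: NMI-based sampling should work\n");
	close(fd);
	return 0;
}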

2012-10-04 14:41:54

by Andrew Theurer

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> > On 10/03/2012 10:35 PM, Avi Kivity wrote:
> >> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
> >>>> So I think it's worth trying again with ple_window of 20000-40000.
> >>>>
> >>>
> >>> Hi Avi,
> >>>
> >>> I ran different benchmarks increasing ple_window, and results does not
> >>> seem to be encouraging for increasing ple_window.
> >>
> >> Thanks for testing! Comments below.
> >>
> >>> Results:
> >>> 16 core PLE machine with 16 vcpu guest.
> >>>
> >>> base kernel = 3.6-rc5 + ple handler optimization patch
> >>> base_pleopt_8k = base kernel + ple window = 8k
> >>> base_pleopt_16k = base kernel + ple window = 16k
> >>> base_pleopt_32k = base kernel + ple window = 32k
> >>>
> >>>
> >>> Percentage improvements of benchmarks w.r.t base_pleopt with
> >>> ple_window = 4096
> >>>
> >>> base_pleopt_8k base_pleopt_16k base_pleopt_32k
> >>> -----------------------------------------------------------------
> >>>
> >>> kernbench_1x -5.54915 -15.94529 -44.31562
> >>> kernbench_2x -7.89399 -17.75039 -37.73498
> >>
> >> So, 44% degradation even with no overcommit? That's surprising.
> >
> > Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> > spending 8 times the original ple_window cycles for 16 vcpus
> > significant?
>
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather that look for candidates on the host. In fact
> when we benchmark we often disable PLE completely.

Agreed. However, I really do not understand why kernbench regressed
with a bigger ple_window. It should stay the same or improve. Raghu, do
you have perf data for the kernbench runs?
>
> >
> >>
> >>> I also got perf top output to analyse the difference. Difference comes
> >>> because of flushtlb (and also spinlock).
> >>
> >> That's in the guest, yes?
> >
> > Yes. Perf is in guest.
> >
> >>
> >>>
> >>> Ebizzy run for 4k ple_window
> >>> - 87.20% [kernel] [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>> - 100.00% _raw_spin_unlock_irqrestore
> >>> + 52.89% release_pages
> >>> + 47.10% pagevec_lru_move_fn
> >>> - 5.71% [kernel] [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>> + 86.03% default_send_IPI_mask_allbutself_phys
> >>> + 13.96% default_send_IPI_mask_sequence_phys
> >>> - 3.10% [kernel] [k] smp_call_function_many
> >>> smp_call_function_many
> >>>
> >>>
> >>> Ebizzy run for 32k ple_window
> >>>
> >>> - 91.40% [kernel] [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>> - 100.00% _raw_spin_unlock_irqrestore
> >>> + 53.13% release_pages
> >>> + 46.86% pagevec_lru_move_fn
> >>> - 4.38% [kernel] [k] smp_call_function_many
> >>> smp_call_function_many
> >>> - 2.51% [kernel] [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>> + 90.76% default_send_IPI_mask_allbutself_phys
> >>> + 9.24% default_send_IPI_mask_sequence_phys
> >>>
> >>
> >> Both the 4k and the 32k results are crazy. Why is
> >> arch_local_irq_restore() so prominent? Do you have a very high
> >> interrupt rate in the guest?
> >
> > How to measure if I have high interrupt rate in guest?
> > From /proc/interrupt numbers I am not able to judge :(
>
> 'vmstat 1'
>
> >
> > I went back and got the results on a 32 core machine with 32 vcpu guest.
> > Strangely, I got result supporting the claim that increasing ple_window
> > helps for non-overcommitted scenario.
> >
> > 32 core 32 vcpu guest 1x scenarios.
> >
> > ple_gap = 0
> > kernbench: Elapsed Time 38.61
> > ebizzy: 7463 records/s
> >
> > ple_window = 4k
> > kernbench: Elapsed Time 43.5067
> > ebizzy: 2528 records/s
> >
> > ple_window = 32k
> > kernebench : Elapsed Time 39.4133
> > ebizzy: 7196 records/s
>
> So maybe something was wrong with the first measurement.

OK, this is more in line with what I expected for kernbench. FWIW, in
order to show an improvement for a larger ple_window, we really need a
workload which we know has a longer lock holding time (without factoring
in LHP). We have noticed this on IO based locks mostly. We saw it with
a massive disk IO test (qla2xxx lock), and also with a large web serving
test (some vfs related lock, but I forget what exactly it was).
>
> >
> >
> > perf top for ebizzy for above:
> > ple_gap = 0
> > - 84.74% [kernel] [k] arch_local_irq_restore
> > - arch_local_irq_restore
> > - 100.00% _raw_spin_unlock_irqrestore
> > + 50.96% release_pages
> > + 49.02% pagevec_lru_move_fn
> > - 6.57% [kernel] [k] arch_local_irq_restore
> > - arch_local_irq_restore
> > + 92.54% default_send_IPI_mask_allbutself_phys
> > + 7.46% default_send_IPI_mask_sequence_phys
> > - 1.54% [kernel] [k] smp_call_function_many
> > smp_call_function_many
>
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction.

I do have a feeling that ebizzy just has too many variables and LHP is
just one of many problems. However, I am curious what perf kvm from the
host shows, as Avi suggested below.
>
> >
> > ple_window = 32k
> > - 84.47% [kernel] [k] arch_local_irq_restore
> > + arch_local_irq_restore
> > - 6.46% [kernel] [k] arch_local_irq_restore
> > - arch_local_irq_restore
> > + 93.51% default_send_IPI_mask_allbutself_phys
> > + 6.49% default_send_IPI_mask_sequence_phys
> > - 1.80% [kernel] [k] smp_call_function_many
> > - smp_call_function_many
> > + 99.98% native_flush_tlb_others
> >
> >
> > ple_window = 4k
> > - 91.35% [kernel] [k] arch_local_irq_restore
> > - arch_local_irq_restore
> > - 100.00% _raw_spin_unlock_irqrestore
> > + 53.19% release_pages
> > + 46.81% pagevec_lru_move_fn
> > - 3.90% [kernel] [k] smp_call_function_many
> > smp_call_function_many
> > - 2.94% [kernel] [k] arch_local_irq_restore
> > - arch_local_irq_restore
> > + 93.12% default_send_IPI_mask_allbutself_phys
> > + 6.88% default_send_IPI_mask_sequence_phys
> >
> > Let me know if I can try something here..
> > /me confused :(
> >
>
> I'm even more confused. Please try 'perf kvm' from the host, it does
> fewer dirty tricks with the PMU and so may be more accurate.
>

2012-10-04 15:01:07

by Avi Kivity

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>
>> Again the numbers are ridiculously high for arch_local_irq_restore.
>> Maybe there's a bad perf/kvm interaction when we're injecting an
>> interrupt, I can't believe we're spending 84% of the time running the
>> popf instruction.
>
> Smells like a software fallback that doesn't do NMI, hrtimer based
> sampling typically hits popf where we re-enable interrupts.

Good nose, that's probably it. Raghavendra, can you ensure that the PMU
is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
host will expose it (and a good idea anyway to get best performance).
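
(On Intel-compatible guests the same information is also available
programmatically via CPUID leaf 0xA; a minimal C sketch, complementary to
checking dmesg:)

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
	unsigned int version, gp_counters;

	/* CPUID leaf 0xA describes architectural performance monitoring */
	if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx)) {
		printf("CPUID leaf 0xA not available: no architectural PMU visible\n");
		return 1;
	}

	version     = eax & 0xff;        /* 0 means no PMU exposed */
	gp_counters = (eax >> 8) & 0xff; /* general-purpose counters per logical cpu */

	if (version == 0)
		printf("no architectural PMU exposed to this (guest) cpu\n");
	else
		printf("architectural PMU v%u with %u general-purpose counters\n",
		       version, gp_counters);
	return 0;
}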

--
error compiling committee.c: too many arguments to function

2012-10-05 09:07:01

by Raghavendra K T

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/04/2012 06:11 PM, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
>> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>>
>>>>
>>>> Hi Avi,
>>>>
>>>> I ran different benchmarks increasing ple_window, and results does not
>>>> seem to be encouraging for increasing ple_window.
>>>
>>> Thanks for testing! Comments below.
>>>
>>>> Results:
>>>> 16 core PLE machine with 16 vcpu guest.
>>>>
>>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>>> base_pleopt_8k = base kernel + ple window = 8k
>>>> base_pleopt_16k = base kernel + ple window = 16k
>>>> base_pleopt_32k = base kernel + ple window = 32k
>>>>
>>>>
>>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>>> ple_window = 4096
>>>>
>>>> base_pleopt_8k base_pleopt_16k base_pleopt_32k
>>>> -----------------------------------------------------------------
>>>>
>>>> kernbench_1x -5.54915 -15.94529 -44.31562
>>>> kernbench_2x -7.89399 -17.75039 -37.73498
>>>
>>> So, 44% degradation even with no overcommit? That's surprising.
>>
>> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
>> spending 8 times the original ple_window cycles for 16 vcpus
>> significant?
>
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather that look for candidates on the host. In fact
> when we benchmark we often disable PLE completely.
>
>>
>>>
>>>> I also got perf top output to analyse the difference. Difference comes
>>>> because of flushtlb (and also spinlock).
>>>
>>> That's in the guest, yes?
>>
>> Yes. Perf is in guest.
>>
>>>
>>>>
>>>> Ebizzy run for 4k ple_window
>>>> - 87.20% [kernel] [k] arch_local_irq_restore
>>>> - arch_local_irq_restore
>>>> - 100.00% _raw_spin_unlock_irqrestore
>>>> + 52.89% release_pages
>>>> + 47.10% pagevec_lru_move_fn
>>>> - 5.71% [kernel] [k] arch_local_irq_restore
>>>> - arch_local_irq_restore
>>>> + 86.03% default_send_IPI_mask_allbutself_phys
>>>> + 13.96% default_send_IPI_mask_sequence_phys
>>>> - 3.10% [kernel] [k] smp_call_function_many
>>>> smp_call_function_many
>>>>
>>>>
>>>> Ebizzy run for 32k ple_window
>>>>
>>>> - 91.40% [kernel] [k] arch_local_irq_restore
>>>> - arch_local_irq_restore
>>>> - 100.00% _raw_spin_unlock_irqrestore
>>>> + 53.13% release_pages
>>>> + 46.86% pagevec_lru_move_fn
>>>> - 4.38% [kernel] [k] smp_call_function_many
>>>> smp_call_function_many
>>>> - 2.51% [kernel] [k] arch_local_irq_restore
>>>> - arch_local_irq_restore
>>>> + 90.76% default_send_IPI_mask_allbutself_phys
>>>> + 9.24% default_send_IPI_mask_sequence_phys
>>>>
>>>
>>> Both the 4k and the 32k results are crazy. Why is
>>> arch_local_irq_restore() so prominent? Do you have a very high
>>> interrupt rate in the guest?
>>
>> How to measure if I have high interrupt rate in guest?
>> From /proc/interrupt numbers I am not able to judge :(
>
> 'vmstat 1'
>

Thank you, I'll note this. Apart from in and cs, I think r (the number of
processes waiting for run time) would also be useful for me in vmstat.

>>
>> I went back and got the results on a 32 core machine with 32 vcpu guest.
>> Strangely, I got result supporting the claim that increasing ple_window
>> helps for non-overcommitted scenario.
>>
>> 32 core 32 vcpu guest 1x scenarios.
>>
>> ple_gap = 0
>> kernbench: Elapsed Time 38.61
>> ebizzy: 7463 records/s
>>
>> ple_window = 4k
>> kernbench: Elapsed Time 43.5067
>> ebizzy: 2528 records/s
>>
>> ple_window = 32k
>> kernebench : Elapsed Time 39.4133
>> ebizzy: 7196 records/s
>
> So maybe something was wrong with the first measurement.

Maybe I was not clear. The first run was on an x240 (SandyBridge)
16 core cpu; I then ran on a 32 core x3850 to confirm the perf top results.
But yes, both had the

[ 0.018997] Performance Events: Broken PMU hardware detected, using
software events only.

problem, as rightly pointed out by you and PeterZ.

After -cpu host, I see that is fixed on the x240:

[ 0.017997] Performance Events: 16-deep LBR, SandyBridge events,
Intel PMU driver.
[ 0.018868] NMI watchdog: enabled on all CPUs, permanently consumes
one hw-PMU counter.

So I'll try it on the x240 again.

(Somehow the x3850 with -cpu host resulted in
[ 0.026995] Performance Events: unsupported p6 CPU model 26 no PMU
driver, software events only.
I think qemu needs a fix, as pointed out in
http://www.mail-archive.com/[email protected]/msg55836.html )


>
>>
>>
>> perf top for ebizzy for above:
>> ple_gap = 0
>> - 84.74% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> - 100.00% _raw_spin_unlock_irqrestore
>> + 50.96% release_pages
>> + 49.02% pagevec_lru_move_fn
>> - 6.57% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> + 92.54% default_send_IPI_mask_allbutself_phys
>> + 7.46% default_send_IPI_mask_sequence_phys
>> - 1.54% [kernel] [k] smp_call_function_many
>> smp_call_function_many
>
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt, I can't believe we're spending 84% of the time running the
> popf instruction.
>
>>
>> ple_window = 32k
>> - 84.47% [kernel] [k] arch_local_irq_restore
>> + arch_local_irq_restore
>> - 6.46% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> + 93.51% default_send_IPI_mask_allbutself_phys
>> + 6.49% default_send_IPI_mask_sequence_phys
>> - 1.80% [kernel] [k] smp_call_function_many
>> - smp_call_function_many
>> + 99.98% native_flush_tlb_others
>>
>>
>> ple_window = 4k
>> - 91.35% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> - 100.00% _raw_spin_unlock_irqrestore
>> + 53.19% release_pages
>> + 46.81% pagevec_lru_move_fn
>> - 3.90% [kernel] [k] smp_call_function_many
>> smp_call_function_many
>> - 2.94% [kernel] [k] arch_local_irq_restore
>> - arch_local_irq_restore
>> + 93.12% default_send_IPI_mask_allbutself_phys
>> + 6.88% default_send_IPI_mask_sequence_phys
>>
>> Let me know if I can try something here..
>> /me confused :(
>>
>
> I'm even more confused. Please try 'perf kvm' from the host, it does
> fewer dirty tricks with the PMU and so may be more accurate.
>

I will try 'perf kvm' from the host this time.

2012-10-05 09:11:03

by Raghavendra K T

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/04/2012 08:11 PM, Andrew Theurer wrote:
> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
>>> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>>>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>>>
>>>>>
>>>>> Hi Avi,
>>>>>
>>>>> I ran different benchmarks increasing ple_window, and results does not
>>>>> seem to be encouraging for increasing ple_window.
>>>>
>>>> Thanks for testing! Comments below.
>>>>
>>>>> Results:
>>>>> 16 core PLE machine with 16 vcpu guest.
>>>>>
>>>>> base kernel = 3.6-rc5 + ple handler optimization patch
>>>>> base_pleopt_8k = base kernel + ple window = 8k
>>>>> base_pleopt_16k = base kernel + ple window = 16k
>>>>> base_pleopt_32k = base kernel + ple window = 32k
>>>>>
>>>>>
>>>>> Percentage improvements of benchmarks w.r.t base_pleopt with
>>>>> ple_window = 4096
>>>>>
>>>>> base_pleopt_8k base_pleopt_16k base_pleopt_32k
>>>>> -----------------------------------------------------------------
>>>>>
>>>>> kernbench_1x -5.54915 -15.94529 -44.31562
>>>>> kernbench_2x -7.89399 -17.75039 -37.73498
>>>>
>>>> So, 44% degradation even with no overcommit? That's surprising.
>>>
>>> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
>>> spending 8 times the original ple_window cycles for 16 vcpus
>>> significant?
>>
>> A PLE exit when not overcommitted cannot do any good, it is better to
>> spin in the guest rather that look for candidates on the host. In fact
>> when we benchmark we often disable PLE completely.
>
> Agreed. However, I really do not understand why the kernbench regressed
> with bigger ple_window. It should stay the same or improve. Raghu, do
> you have perf data for the kernbench runs?

Andrew, no. I'll get this with perf kvm.

2012-10-09 18:55:32

by Raghavendra K T

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

* Avi Kivity <[email protected]> [2012-10-04 17:00:28]:

> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>
> >> Again the numbers are ridiculously high for arch_local_irq_restore.
> >> Maybe there's a bad perf/kvm interaction when we're injecting an
> >> interrupt, I can't believe we're spending 84% of the time running the
> >> popf instruction.
> >
> > Smells like a software fallback that doesn't do NMI, hrtimer based
> > sampling typically hits popf where we re-enable interrupts.
>
> Good nose, that's probably it. Raghavendra, can you ensure that the PMU
> is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
> host will expose it (and a good idea anyway to get best performance).
>

Hi Avi, you are right. The SandyBridge machine result was not proper.
I cleaned up the services, enabled the PMU, and re-ran all the tests.

Here is the summary:
We do get a good benefit by increasing the ple window. Though we don't
see much benefit for kernbench and sysbench, for ebizzy we get a huge
improvement in the 1x scenario (almost 2/3rd of the ple disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.

I am experimenting with the V2 version of the undercommit improvement (this)
patch series, but if you wish to go for an increase of the
default ple_window, then we would have to measure the benefit of the patches
with ple_window = 16k.

I can respin the whole series including this default ple_window change.

I also have the perf kvm top results for both ebizzy and kernbench.
I think they are along expected lines now.

Improvements
================

16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
---------------+-----------------+-----------------+-------------------+
               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
---------------+-----------------+-----------------+-------------------+
kernbench_1x   |     0.42371     |     1.15164     |      0.09320      |
kernbench_2x   |    -1.40981     |   -17.48282     |   -570.77053      |
---------------+-----------------+-----------------+-------------------+
sysbench_1x    |    -0.92367     |     0.24241     |     -0.27027      |
sysbench_2x    |    -2.22706     |    -0.30896     |     -1.27573      |
sysbench_3x    |    -0.75509     |     0.09444     |     -2.97756      |
---------------+-----------------+-----------------+-------------------+
ebizzy_1x      |    54.99976     |    67.29460     |     74.14076      |
ebizzy_2x      |    -8.83386     |   -27.38403     |    -96.22066      |
---------------+-----------------+-----------------+-------------------+

perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
========================================================================
pleopt ple_gap=0
--------------------
ebizzy : 18131 records/s
63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
5.65% [guest.kernel] [g] smp_call_function_many
3.12% [guest.kernel] [g] clear_page
3.02% [guest.kernel] [g] down_read_trylock
1.85% [guest.kernel] [g] async_page_fault
1.81% [guest.kernel] [g] up_read
1.76% [guest.kernel] [g] native_apic_mem_write
1.70% [guest.kernel] [g] find_vma

kernbench :Elapsed Time 29.4933 (27.6007)
5.72% [guest.kernel] [g] async_page_fault
3.48% [guest.kernel] [g] pvclock_clocksource_read
2.68% [guest.kernel] [g] copy_user_generic_unrolled
2.58% [guest.kernel] [g] clear_page
2.09% [guest.kernel] [g] page_cache_get_speculative
2.00% [guest.kernel] [g] do_raw_spin_lock
1.78% [guest.kernel] [g] unmap_single_vma
1.74% [guest.kernel] [g] kmem_cache_alloc

pleopt ple_window = 4k
---------------------------
ebizzy: 10176 records/s
69.17% [guest.kernel] [g] _raw_spin_lock_irqsave
3.34% [guest.kernel] [g] clear_page
2.16% [guest.kernel] [g] down_read_trylock
1.94% [guest.kernel] [g] async_page_fault
1.89% [guest.kernel] [g] native_apic_mem_write
1.63% [guest.kernel] [g] smp_call_function_many
1.58% [guest.kernel] [g] SetPageLRU
1.37% [guest.kernel] [g] up_read
1.01% [guest.kernel] [g] find_vma


kernbench: 29.9533
Events: 240K cycles
6.04% [guest.kernel] [g] async_page_fault
4.17% [guest.kernel] [g] pvclock_clocksource_read
3.28% [guest.kernel] [g] clear_page
2.57% [guest.kernel] [g] copy_user_generic_unrolled
2.30% [guest.kernel] [g] do_raw_spin_lock
2.13% [guest.kernel] [g] _raw_spin_lock_irqsave
1.93% [guest.kernel] [g] page_cache_get_speculative
1.92% [guest.kernel] [g] unmap_single_vma
1.77% [guest.kernel] [g] kmem_cache_alloc
1.61% [guest.kernel] [g] __d_lookup_rcu
1.19% [guest.kernel] [g] find_vma
1.19% [guest.kernel] [g] __list_del_entry


pleopt: ple_window=16k
-------------------------
ebizzy: 16990
62.35% [guest.kernel] [g] _raw_spin_lock_irqsave
5.22% [guest.kernel] [g] smp_call_function_many
3.57% [guest.kernel] [g] down_read_trylock
3.20% [guest.kernel] [g] clear_page
2.16% [guest.kernel] [g] up_read
1.89% [guest.kernel] [g] find_vma
1.86% [guest.kernel] [g] async_page_fault
1.81% [guest.kernel] [g] native_apic_mem_write

kernbench: 28.5
6.24% [guest.kernel] [g] async_page_fault
4.16% [guest.kernel] [g] pvclock_clocksource_read
3.33% [guest.kernel] [g] clear_page
2.50% [guest.kernel] [g] copy_user_generic_unrolled
2.08% [guest.kernel] [g] do_raw_spin_lock
1.98% [guest.kernel] [g] unmap_single_vma
1.89% [guest.kernel] [g] kmem_cache_alloc
1.82% [guest.kernel] [g] page_cache_get_speculative
1.46% [guest.kernel] [g] __d_lookup_rcu
1.42% [guest.kernel] [g] _raw_spin_lock_irqsave
1.15% [guest.kernel] [g] __list_del_entry
1.10% [guest.kernel] [g] find_vma



Detailed result for the run
=============================
patched = base_pleopt_16k
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patched stdev %improve
+-----------+-----------+-----------+------------+-----------+
1x 30.0440 1.1896 29.9167 1.6755 0.42371
2x 62.0083 3.4884 62.8825 2.5509 -1.40981
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
1x 7.1779 0.0577 7.2442 0.0479 -0.92367
2x 15.5362 0.3370 15.8822 0.3591 -2.22706
3x 23.8249 0.1513 24.0048 0.1844 -0.75509
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
1x 10358.0000 442.6598 16054.8750 252.5088 54.99976
2x 2705.5000 130.0286 2466.5000 120.0024 -8.83386
+-----------+-----------+-----------+------------+-----------+

patched = base_pleopt_32k
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patched stdev %improve
+-----------+-----------+-----------+------------+-----------+
1x 30.0440 1.1896 29.6980 0.6760 1.15164
2x 62.0083 3.4884 72.8491 4.4616 -17.48282
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
1x 7.1779 0.0577 7.1605 0.0447 0.24241
2x 15.5362 0.3370 15.5842 0.1731 -0.30896
3x 23.8249 0.1513 23.8024 0.2342 0.09444
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
1x 10358.0000 442.6598 17328.3750 281.4569 67.29460
2x 2705.5000 130.0286 1964.6250 143.0793 -27.38403
+-----------+-----------+-----------+------------+-----------+

patched = base_pleopt_nople
+-----------+-----------+-----------+------------+-----------+
kernbench
+-----------+-----------+-----------+------------+-----------+
base stddev patched stdev %improve
+-----------+-----------+-----------+------------+-----------+
1x 30.0440 1.1896 30.0160 0.7523 0.09320
2x 62.0083 3.4884 415.9334 189.9901 -570.77053
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
sysbench
+-----------+-----------+-----------+------------+-----------+
1x 7.1779 0.0577 7.1973 0.0354 -0.27027
2x 15.5362 0.3370 15.7344 0.2315 -1.27573
3x 23.8249 0.1513 24.5343 0.3437 -2.97756
+-----------+-----------+-----------+------------+-----------+
+-----------+-----------+-----------+------------+-----------+
ebizzy
+-----------+-----------+-----------+------------+-----------+
1x 10358.0000 442.6598 18037.5000 315.2074 74.14076
2x 2705.5000 130.0286 102.2500 104.3521 -96.22066
+-----------+-----------+-----------+------------+-----------+

2012-10-10 03:00:01

by Andrew Theurer

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
>
> > On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> > >>
> > >> Again the numbers are ridiculously high for arch_local_irq_restore.
> > >> Maybe there's a bad perf/kvm interaction when we're injecting an
> > >> interrupt, I can't believe we're spending 84% of the time running the
> > >> popf instruction.
> > >
> > > Smells like a software fallback that doesn't do NMI, hrtimer based
> > > sampling typically hits popf where we re-enable interrupts.
> >
> > Good nose, that's probably it. Raghavendra, can you ensure that the PMU
> > is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
> > host will expose it (and a good idea anyway to get best performance).
> >
>
> Hi Avi, you are right. SandyBridge machine result was not proper.
> I cleaned up the services, enabled PMU, re-ran all the test again.
>
> Here is the summary:
> We do get good benefit by increasing ple window. Though we don't
> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
>
> Let me know if you think we can increase the default ple_window
> itself to 16k.
>
> I am experimenting with V2 version of undercommit improvement(this) patch
> series, But I think if you wish to go for increase of
> default ple_window, then we would have to measure the benefit of patches
> when ple_window = 16k.
>
> I can respin the whole series including this default ple_window change.
>
> I also have the perf kvm top result for both ebizzy and kernbench.
> I think they are in expected lines now.
>
> Improvements
> ================
>
> 16 core PLE machine with 16 vcpu guest
>
> base = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k = base + ple_window = 16k
> base_pleopt_32k = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
> kernbench, hackbench, sysbench (time in sec lower is better)
> ebizzy (rec/sec higher is better)
>
> % improvements w.r.t base (ple_window = 4k)
> ---------------+---------------+-----------------+-------------------+
> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> ---------------+---------------+-----------------+-------------------+
> kernbench_1x | 0.42371 | 1.15164 | 0.09320 |
> kernbench_2x | -1.40981 | -17.48282 | -570.77053 |
> ---------------+---------------+-----------------+-------------------+
> sysbench_1x | -0.92367 | 0.24241 | -0.27027 |
> sysbench_2x | -2.22706 |-0.30896 | -1.27573 |
> sysbench_3x | -0.75509 | 0.09444 | -2.97756 |
> ---------------+---------------+-----------------+-------------------+
> ebizzy_1x | 54.99976 | 67.29460 | 74.14076 |
> ebizzy_2x | -8.83386 |-27.38403 | -96.22066 |
> ---------------+---------------+-----------------+-------------------+
>
> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> ========================================================================

Is the perf data for 1x overcommit?

> pleopt ple_gap=0
> --------------------
> ebizzy : 18131 records/s
> 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
> 5.65% [guest.kernel] [g] smp_call_function_many
> 3.12% [guest.kernel] [g] clear_page
> 3.02% [guest.kernel] [g] down_read_trylock
> 1.85% [guest.kernel] [g] async_page_fault
> 1.81% [guest.kernel] [g] up_read
> 1.76% [guest.kernel] [g] native_apic_mem_write
> 1.70% [guest.kernel] [g] find_vma

Does 'perf kvm top' not give host samples at the same time? Would be
nice to see the host overhead as a function of varying ple window. I
would expect that to be the major difference between 4/16/32k window
sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with. I do not think we should
try to optimize such a bad workload.

> kernbench :Elapsed Time 29.4933 (27.6007)
> 5.72% [guest.kernel] [g] async_page_fault
> 3.48% [guest.kernel] [g] pvclock_clocksource_read
> 2.68% [guest.kernel] [g] copy_user_generic_unrolled
> 2.58% [guest.kernel] [g] clear_page
> 2.09% [guest.kernel] [g] page_cache_get_speculative
> 2.00% [guest.kernel] [g] do_raw_spin_lock
> 1.78% [guest.kernel] [g] unmap_single_vma
> 1.74% [guest.kernel] [g] kmem_cache_alloc

>
> pleopt ple_window = 4k
> ---------------------------
> ebizzy: 10176 records/s
> 69.17% [guest.kernel] [g] _raw_spin_lock_irqsave
> 3.34% [guest.kernel] [g] clear_page
> 2.16% [guest.kernel] [g] down_read_trylock
> 1.94% [guest.kernel] [g] async_page_fault
> 1.89% [guest.kernel] [g] native_apic_mem_write
> 1.63% [guest.kernel] [g] smp_call_function_many
> 1.58% [guest.kernel] [g] SetPageLRU
> 1.37% [guest.kernel] [g] up_read
> 1.01% [guest.kernel] [g] find_vma
>
>
> kernbench: 29.9533
> nts: 240K cycles
> 6.04% [guest.kernel] [g] async_page_fault
> 4.17% [guest.kernel] [g] pvclock_clocksource_read
> 3.28% [guest.kernel] [g] clear_page
> 2.57% [guest.kernel] [g] copy_user_generic_unrolled
> 2.30% [guest.kernel] [g] do_raw_spin_lock
> 2.13% [guest.kernel] [g] _raw_spin_lock_irqsave
> 1.93% [guest.kernel] [g] page_cache_get_speculative
> 1.92% [guest.kernel] [g] unmap_single_vma
> 1.77% [guest.kernel] [g] kmem_cache_alloc
> 1.61% [guest.kernel] [g] __d_lookup_rcu
> 1.19% [guest.kernel] [g] find_vma
> 1.19% [guest.kernel] [g] __list_del_entry
>
>
> pleopt: ple_window=16k
> -------------------------
> ebizzy: 16990
> 62.35% [guest.kernel] [g] _raw_spin_lock_irqsave
> 5.22% [guest.kernel] [g] smp_call_function_many
> 3.57% [guest.kernel] [g] down_read_trylock
> 3.20% [guest.kernel] [g] clear_page
> 2.16% [guest.kernel] [g] up_read
> 1.89% [guest.kernel] [g] find_vma
> 1.86% [guest.kernel] [g] async_page_fault
> 1.81% [guest.kernel] [g] native_apic_mem_write
>
> kernbench: 28.5
> 6.24% [guest.kernel] [g] async_page_fault
> 4.16% [guest.kernel] [g] pvclock_clocksource_read
> 3.33% [guest.kernel] [g] clear_page
> 2.50% [guest.kernel] [g] copy_user_generic_unrolled
> 2.08% [guest.kernel] [g] do_raw_spin_lock
> 1.98% [guest.kernel] [g] unmap_single_vma
> 1.89% [guest.kernel] [g] kmem_cache_alloc
> 1.82% [guest.kernel] [g] page_cache_get_speculative
> 1.46% [guest.kernel] [g] __d_lookup_rcu
> 1.42% [guest.kernel] [g] _raw_spin_lock_irqsave
> 1.15% [guest.kernel] [g] __list_del_entry
> 1.10% [guest.kernel] [g] find_vma
>
>
>
> Detailed result for the run
> =============================
> patched = base_pleopt_16k
> +-----------+-----------+-----------+------------+-----------+
> kernbench
> +-----------+-----------+-----------+------------+-----------+
> base stddev patched stdev %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x 30.0440 1.1896 29.9167 1.6755 0.42371
> 2x 62.0083 3.4884 62.8825 2.5509 -1.40981
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> sysbench
> +-----------+-----------+-----------+------------+-----------+
> 1x 7.1779 0.0577 7.2442 0.0479 -0.92367
> 2x 15.5362 0.3370 15.8822 0.3591 -2.22706
> 3x 23.8249 0.1513 24.0048 0.1844 -0.75509
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> ebizzy
> +-----------+-----------+-----------+------------+-----------+
> 1x 10358.0000 442.6598 16054.8750 252.5088 54.99976
> 2x 2705.5000 130.0286 2466.5000 120.0024 -8.83386
> +-----------+-----------+-----------+------------+-----------+
>
> patched = base_pleopt_32k
> +-----------+-----------+-----------+------------+-----------+
> kernbench
> +-----------+-----------+-----------+------------+-----------+
> base stddev patched stdev %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x 30.0440 1.1896 29.6980 0.6760 1.15164
> 2x 62.0083 3.4884 72.8491 4.4616 -17.48282
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> sysbench
> +-----------+-----------+-----------+------------+-----------+
> 1x 7.1779 0.0577 7.1605 0.0447 0.24241
> 2x 15.5362 0.3370 15.5842 0.1731 -0.30896
> 3x 23.8249 0.1513 23.8024 0.2342 0.09444
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> ebizzy
> +-----------+-----------+-----------+------------+-----------+
> 1x 10358.0000 442.6598 17328.3750 281.4569 67.29460
> 2x 2705.5000 130.0286 1964.6250 143.0793 -27.38403
> +-----------+-----------+-----------+------------+-----------+
>
> patched = base_pleopt_nople
> +-----------+-----------+-----------+------------+-----------+
> kernbench
> +-----------+-----------+-----------+------------+-----------+
> base stddev patched stdev %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x 30.0440 1.1896 30.0160 0.7523 0.09320
> 2x 62.0083 3.4884 415.9334 189.9901 -570.77053
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> sysbench
> +-----------+-----------+-----------+------------+-----------+
> 1x 7.1779 0.0577 7.1973 0.0354 -0.27027
> 2x 15.5362 0.3370 15.7344 0.2315 -1.27573
> 3x 23.8249 0.1513 24.5343 0.3437 -2.97756
> +-----------+-----------+-----------+------------+-----------+
> +-----------+-----------+-----------+------------+-----------+
> ebizzy
> +-----------+-----------+-----------+------------+-----------+
> 1x 10358.0000 442.6598 18037.5000 315.2074 74.14076
> 2x 2705.5000 130.0286 102.2500 104.3521 -96.22066
> +-----------+-----------+-----------+------------+-----------+
>

2012-10-10 14:25:35

by Andrew Theurer

Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results. I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record'). The Y
axis is the host cpus, each row being 10 pixels high. For these tests,
there are 80 host cpus, so the total height is 800 pixels. The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds. The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.

Each row (each host cpu) is assigned a color based on what thread is
running. vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color. There are a maximum of 12 assignable colors, so vCPUs in any VMs
beyond the first 12 revert to gray. I would use more colors, but it becomes
harder to distinguish one color from another. The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE. These are 10-way VMs, 16 of them (as
described above, only 12 of the VMs have a color; the rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler. They are pretty well aligned across all
cpus. Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test. We have 2x over-commit and we
generally see the switching of threads at 4ms. One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on. This is most likely
because of the yield_to initiated by the PLE handler. In this case
there is not that much yielding to do. It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This one looks quite different. In short, it's a mess. The switching
between tasks can be lower than 10 microseconds. It basically never
recovers. There is constant yielding all the time.

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches. While I am not recommending gang scheduling, I
think it's a good data point. The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Note that the task switching intervals of 4ms are quite obvious again,
and this time all vCPUs from same VM run at the same time. It
represents the best possible outcome.


Anyway, I thought the bitmaps might help better visualize what's going
on.

-Andrew


2012-10-10 17:48:03

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> I ran 'perf sched map' on the dbench workload for medium and large VMs,
> and I thought I would share some of the results. I think it helps to
> visualize what's going on regarding the yielding.
>
> These files are png bitmaps, generated from processing output from 'perf
> sched map' (and perf data generated from 'perf sched record'). The Y
> axis is the host cpus, each row being 10 pixels high. For these tests,
> there are 80 host cpus, so the total height is 800 pixels. The X axis
> is time (in microseconds), with each pixel representing 1 microsecond.
> Each bitmap plots 30,000 microseconds. The bitmaps are quite wide
> obviously, and zooming in/out while viewing is recommended.
>
> Each row (each host cpu) is assigned a color based on what thread is
> running. vCPUs of the same VM are assigned a common color (like red,
> blue, magenta, etc), and each vCPU has a unique brightness for that
> color. There are a maximum of 12 assignable colors, so in any VMs >12
> revert to vCPU color of gray. I would use more colors, but it becomes
> harder to distinguish one color from another. The white color
> represents missing data from perf, and black color represents any thread
> which is not a vCPU.
>
> For the following tests, VMs were pinned to host NUMA nodes and to
> specific cpus to help with consistency and operate within the
> constraints of the last test (gang scheduler).
>
> Here is a good example of PLE. These are 10-way VMs, 16 of them (as
> described above only 12 of the VMs have a color, rest are gray).
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This is a very nice way to visualize what is happening. The beginning of
the graph looks a little messy, but later it is clear.

>
> If you zoom out and look at the whole bitmap, you may notice the 4ms
> intervals of the scheduler. They are pretty well aligned across all
> cpus. Normally, for cpu bound workloads, we would expect to see each
> thread to run for 4 ms, then something else getting to run, and so on.
> That is mostly true in this test. We have 2x over-commit and we
> generally see the switching of threads at 4ms. One thing to note is
> that not all vCPU threads for the same VM run at exactly the same time,
> and that is expected and the whole reason for lock-holder preemption.
> Now, if you zoom in on the bitmap, you should notice within the 4ms
> intervals there is some task switching going on. This is most likely
> because of the yield_to initiated by the PLE handler. In this case
> there is not that much yielding to do. It's quite clean, and the
> performance is quite good.
>
> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> CPU over-commit is still 2x.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

I think this link is still the 10x16 one. Could you paste the link again?

>
> This one looks quite different. In short, it's a mess. The switching
> between tasks can be lower than 10 microseconds. It basically never
> recovers. There is constant yielding all the time.
>
> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> scheduling patches. While I am not recommending gang scheduling, I
> think it's a good data point. The performance is 3.88x the PLE result.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
>
> Note that the task switching intervals of 4ms are quite obvious again,
> and this time all vCPUs from same VM run at the same time. It
> represents the best possible outcome.
>
>
> Anyway, I thought the bitmaps might help better visualize what's going
> on.
>
> -Andrew
>
>
>
>

2012-10-10 18:18:53

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/10/2012 11:33 PM, David Ahern wrote:
> On 10/10/12 11:54 AM, Raghavendra K T wrote:
>> No, I did something like this
>> perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes that is a
>> good idea.
>>
>> (I am getting some segfaults with perf top, I think it is already fixed
>> but yet to see the patch that fixes)
>
> What version of perf: perf --version
>

perf version 2.6.32-279.el6.x86_64.debug

(I found that it is fixed in 288, but could not dig out the actual patch.)

2012-10-10 17:58:21

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
>>
>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>
>>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
>>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
>>>>> interrupt, I can't believe we're spending 84% of the time running the
>>>>> popf instruction.
>>>>
>>>> Smells like a software fallback that doesn't do NMI, hrtimer based
>>>> sampling typically hits popf where we re-enable interrupts.
>>>
>>> Good nose, that's probably it. Raghavendra, can you ensure that the PMU
>>> is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
>>> host will expose it (and a good idea anyway to get best performance).
>>>
>>
>> Hi Avi, you are right. SandyBridge machine result was not proper.
>> I cleaned up the services, enabled PMU, re-ran all the test again.
>>
>> Here is the summary:
>> We do get good benefit by increasing ple window. Though we don't
>> see good benefit for kernbench and sysbench, for ebizzy, we get huge
>> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
>>
>> Let me know if you think we can increase the default ple_window
>> itself to 16k.
>>
>> I am experimenting with V2 version of undercommit improvement(this) patch
>> series, But I think if you wish to go for increase of
>> default ple_window, then we would have to measure the benefit of patches
>> when ple_window = 16k.
>>
>> I can respin the whole series including this default ple_window change.
>>
>> I also have the perf kvm top result for both ebizzy and kernbench.
>> I think they are in expected lines now.
>>
>> Improvements
>> ================
>>
>> 16 core PLE machine with 16 vcpu guest
>>
>> base = 3.6.0-rc5 + ple handler optimization patches
>> base_pleopt_16k = base + ple_window = 16k
>> base_pleopt_32k = base + ple_window = 32k
>> base_pleopt_nople = base + ple_gap = 0
>> kernbench, hackbench, sysbench (time in sec lower is better)
>> ebizzy (rec/sec higher is better)
>>
>> % improvements w.r.t base (ple_window = 4k)
>> ---------------+---------------+-----------------+-------------------+
>> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
>> ---------------+---------------+-----------------+-------------------+
>> kernbench_1x | 0.42371 | 1.15164 | 0.09320 |
>> kernbench_2x | -1.40981 | -17.48282 | -570.77053 |
>> ---------------+---------------+-----------------+-------------------+
>> sysbench_1x | -0.92367 | 0.24241 | -0.27027 |
>> sysbench_2x | -2.22706 |-0.30896 | -1.27573 |
>> sysbench_3x | -0.75509 | 0.09444 | -2.97756 |
>> ---------------+---------------+-----------------+-------------------+
>> ebizzy_1x | 54.99976 | 67.29460 | 74.14076 |
>> ebizzy_2x | -8.83386 |-27.38403 | -96.22066 |
>> ---------------+---------------+-----------------+-------------------+
>>
>> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
>> ========================================================================
>
> Is the perf data for 1x overcommit?

Yes, a 16-vcpu guest on 16 cores.

>
>> pleopt ple_gap=0
>> --------------------
>> ebizzy : 18131 records/s
>> 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
>> 5.65% [guest.kernel] [g] smp_call_function_many
>> 3.12% [guest.kernel] [g] clear_page
>> 3.02% [guest.kernel] [g] down_read_trylock
>> 1.85% [guest.kernel] [g] async_page_fault
>> 1.81% [guest.kernel] [g] up_read
>> 1.76% [guest.kernel] [g] native_apic_mem_write
>> 1.70% [guest.kernel] [g] find_vma
>
> Does 'perf kvm top' not give host samples at the same time? Would be
> nice to see the host overhead as a function of varying ple window. I
> would expect that to be the major difference between 4/16/32k window
> sizes.

No, I did something like this
perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes that is a
good idea.

(I am getting some segfaults with perf top, I think it is already fixed
but yet to see the patch that fixes)



>
> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> has just terrible scalability to begin with. I do not think we should
> try to optimize such a bad workload.
>

I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?

2012-10-10 18:04:02

by David Ahern

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/10/12 11:54 AM, Raghavendra K T wrote:
> No, I did something like this
> perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes that is a
> good idea.
>
> (I am getting some segfaults with perf top, I think it is already fixed
> but yet to see the patch that fixes)

What version of perf: perf --version

2012-10-10 19:28:14

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> > I ran 'perf sched map' on the dbench workload for medium and large VMs,
> > and I thought I would share some of the results. I think it helps to
> > visualize what's going on regarding the yielding.
> >
> > These files are png bitmaps, generated from processing output from 'perf
> > sched map' (and perf data generated from 'perf sched record'). The Y
> > axis is the host cpus, each row being 10 pixels high. For these tests,
> > there are 80 host cpus, so the total height is 800 pixels. The X axis
> > is time (in microseconds), with each pixel representing 1 microsecond.
> > Each bitmap plots 30,000 microseconds. The bitmaps are quite wide
> > obviously, and zooming in/out while viewing is recommended.
> >
> > Each row (each host cpu) is assigned a color based on what thread is
> > running. vCPUs of the same VM are assigned a common color (like red,
> > blue, magenta, etc), and each vCPU has a unique brightness for that
> > color. There are a maximum of 12 assignable colors, so in any VMs >12
> > revert to vCPU color of gray. I would use more colors, but it becomes
> > harder to distinguish one color from another. The white color
> > represents missing data from perf, and black color represents any thread
> > which is not a vCPU.
> >
> > For the following tests, VMs were pinned to host NUMA nodes and to
> > specific cpus to help with consistency and operate within the
> > constraints of the last test (gang scheduler).
> >
> > Here is a good example of PLE. These are 10-way VMs, 16 of them (as
> > described above only 12 of the VMs have a color, rest are gray).
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>
> This looks very nice to visualize what is happening. Beginning of the
> graph looks little messy but later it is clear.
>
> >
> > If you zoom out and look at the whole bitmap, you may notice the 4ms
> > intervals of the scheduler. They are pretty well aligned across all
> > cpus. Normally, for cpu bound workloads, we would expect to see each
> > thread to run for 4 ms, then something else getting to run, and so on.
> > That is mostly true in this test. We have 2x over-commit and we
> > generally see the switching of threads at 4ms. One thing to note is
> > that not all vCPU threads for the same VM run at exactly the same time,
> > and that is expected and the whole reason for lock-holder preemption.
> > Now, if you zoom in on the bitmap, you should notice within the 4ms
> > intervals there is some task switching going on. This is most likely
> > because of the yield_to initiated by the PLE handler. In this case
> > there is not that much yielding to do. It's quite clean, and the
> > performance is quite good.
> >
> > Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> > CPU over-commit is still 2x.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>
> I think this link still 10x16. Could you paste the link again?

Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ

>
> >
> > This one looks quite different. In short, it's a mess. The switching
> > between tasks can be lower than 10 microseconds. It basically never
> > recovers. There is constant yielding all the time.
> >
> > Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> > scheduling patches. While I am not recommending gang scheduling, I
> > think it's a good data point. The performance is 3.88x the PLE result.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
> >
> > Note that the task switching intervals of 4ms are quite obvious again,
> > and this time all vCPUs from same VM run at the same time. It
> > represents the best possible outcome.
> >
> >
> > Anyway, I thought the bitmaps might help better visualize what's going
> > on.
> >
> > -Andrew
> >
> >
> >
> >
>

2012-10-10 19:36:42

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
> >>
> >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>
> >>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
> >>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
> >>>>> interrupt, I can't believe we're spending 84% of the time running the
> >>>>> popf instruction.
> >>>>
> >>>> Smells like a software fallback that doesn't do NMI, hrtimer based
> >>>> sampling typically hits popf where we re-enable interrupts.
> >>>
> >>> Good nose, that's probably it. Raghavendra, can you ensure that the PMU
> >>> is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu
> >>> host will expose it (and a good idea anyway to get best performance).
> >>>
> >>
> >> Hi Avi, you are right. SandyBridge machine result was not proper.
> >> I cleaned up the services, enabled PMU, re-ran all the test again.
> >>
> >> Here is the summary:
> >> We do get good benefit by increasing ple window. Though we don't
> >> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> >> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> >>
> >> Let me know if you think we can increase the default ple_window
> >> itself to 16k.
> >>
> >> I am experimenting with V2 version of undercommit improvement(this) patch
> >> series, But I think if you wish to go for increase of
> >> default ple_window, then we would have to measure the benefit of patches
> >> when ple_window = 16k.
> >>
> >> I can respin the whole series including this default ple_window change.
> >>
> >> I also have the perf kvm top result for both ebizzy and kernbench.
> >> I think they are in expected lines now.
> >>
> >> Improvements
> >> ================
> >>
> >> 16 core PLE machine with 16 vcpu guest
> >>
> >> base = 3.6.0-rc5 + ple handler optimization patches
> >> base_pleopt_16k = base + ple_window = 16k
> >> base_pleopt_32k = base + ple_window = 32k
> >> base_pleopt_nople = base + ple_gap = 0
> >> kernbench, hackbench, sysbench (time in sec lower is better)
> >> ebizzy (rec/sec higher is better)
> >>
> >> % improvements w.r.t base (ple_window = 4k)
> >> ---------------+---------------+-----------------+-------------------+
> >> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> >> ---------------+---------------+-----------------+-------------------+
> >> kernbench_1x | 0.42371 | 1.15164 | 0.09320 |
> >> kernbench_2x | -1.40981 | -17.48282 | -570.77053 |
> >> ---------------+---------------+-----------------+-------------------+
> >> sysbench_1x | -0.92367 | 0.24241 | -0.27027 |
> >> sysbench_2x | -2.22706 |-0.30896 | -1.27573 |
> >> sysbench_3x | -0.75509 | 0.09444 | -2.97756 |
> >> ---------------+---------------+-----------------+-------------------+
> >> ebizzy_1x | 54.99976 | 67.29460 | 74.14076 |
> >> ebizzy_2x | -8.83386 |-27.38403 | -96.22066 |
> >> ---------------+---------------+-----------------+-------------------+
> >>
> >> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> >> ========================================================================
> >
> > Is the perf data for 1x overcommit?
>
> Yes, 16vcpu guest on 16 core
>
> >
> >> pleopt ple_gap=0
> >> --------------------
> >> ebizzy : 18131 records/s
> >> 63.78% [guest.kernel] [g] _raw_spin_lock_irqsave
> >> 5.65% [guest.kernel] [g] smp_call_function_many
> >> 3.12% [guest.kernel] [g] clear_page
> >> 3.02% [guest.kernel] [g] down_read_trylock
> >> 1.85% [guest.kernel] [g] async_page_fault
> >> 1.81% [guest.kernel] [g] up_read
> >> 1.76% [guest.kernel] [g] native_apic_mem_write
> >> 1.70% [guest.kernel] [g] find_vma
> >
> > Does 'perf kvm top' not give host samples at the same time? Would be
> > nice to see the host overhead as a function of varying ple window. I
> > would expect that to be the major difference between 4/16/32k window
> > sizes.
>
> No, I did something like this
> perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3. Yes that is a
> good idea.
>
> (I am getting some segfaults with perf top, I think it is already fixed
> but yet to see the patch that fixes)
>
>
>
> >
> > A big concern I have (if this is 1x overcommit) for ebizzy is that it
> > has just terrible scalability to begin with. I do not think we should
> > try to optimize such a bad workload.
> >
>
> I think my way of running dbench has some flaw, so I went to ebizzy.
> Could you let me know how you generally run dbench?

I mount a tmpfs and then specify that mount for dbench to run on. This
eliminates all IO. I use a 300 second run time and number of threads is
equal to number of vcpus. All of the VMs of course need to have a
synchronized start.

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved. Without any lock-holder
preemption, the time in spin_lock should be very low:


> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
> 2.81% 10176 dbench dbench [.] child_run
> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
> 2.33% 8423 dbench dbench [.] next_token
> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
> 1.38% 5009 dbench libc-2.12.so [.] memmove
> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit

-Andrew

2012-10-11 10:40:37

by Nikunj A Dadhania

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Wed, 10 Oct 2012 09:24:55 -0500, Andrew Theurer <[email protected]> wrote:
>
> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> scheduling patches. While I am not recommending gang scheduling, I
> think it's a good data point. The performance is 3.88x the PLE result.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

That looks pretty good and serves the purpose. And the result says it all.

> Note that the task switching intervals of 4ms are quite obvious again,
> and this time all vCPUs from same VM run at the same time. It
> represents the best possible outcome.
>
>
> Anyway, I thought the bitmaps might help better visualize what's going
> on.
>
> -Andrew
>

Regards
Nikunj

2012-10-11 17:18:32

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/11/2012 12:57 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
>> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
>>> I ran 'perf sched map' on the dbench workload for medium and large VMs,
>>> and I thought I would share some of the results. I think it helps to
>>> visualize what's going on regarding the yielding.
>>>
>>> These files are png bitmaps, generated from processing output from 'perf
>>> sched map' (and perf data generated from 'perf sched record'). The Y
>>> axis is the host cpus, each row being 10 pixels high. For these tests,
>>> there are 80 host cpus, so the total height is 800 pixels. The X axis
>>> is time (in microseconds), with each pixel representing 1 microsecond.
>>> Each bitmap plots 30,000 microseconds. The bitmaps are quite wide
>>> obviously, and zooming in/out while viewing is recommended.
>>>
>>> Each row (each host cpu) is assigned a color based on what thread is
>>> running. vCPUs of the same VM are assigned a common color (like red,
>>> blue, magenta, etc), and each vCPU has a unique brightness for that
>>> color. There are a maximum of 12 assignable colors, so in any VMs >12
>>> revert to vCPU color of gray. I would use more colors, but it becomes
>>> harder to distinguish one color from another. The white color
>>> represents missing data from perf, and black color represents any thread
>>> which is not a vCPU.
>>>
>>> For the following tests, VMs were pinned to host NUMA nodes and to
>>> specific cpus to help with consistency and operate within the
>>> constraints of the last test (gang scheduler).
>>>
>>> Here is a good example of PLE. These are 10-way VMs, 16 of them (as
>>> described above only 12 of the VMs have a color, rest are gray).
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> This looks very nice to visualize what is happening. Beginning of the
>> graph looks little messy but later it is clear.
>>
>>>
>>> If you zoom out and look at the whole bitmap, you may notice the 4ms
>>> intervals of the scheduler. They are pretty well aligned across all
>>> cpus. Normally, for cpu bound workloads, we would expect to see each
>>> thread to run for 4 ms, then something else getting to run, and so on.
>>> That is mostly true in this test. We have 2x over-commit and we
>>> generally see the switching of threads at 4ms. One thing to note is
>>> that not all vCPU threads for the same VM run at exactly the same time,
>>> and that is expected and the whole reason for lock-holder preemption.
>>> Now, if you zoom in on the bitmap, you should notice within the 4ms
>>> intervals there is some task switching going on. This is most likely
>>> because of the yield_to initiated by the PLE handler. In this case
>>> there is not that much yielding to do. It's quite clean, and the
>>> performance is quite good.
>>>
>>> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
>>> CPU over-commit is still 2x.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> I think this link still 10x16. Could you paste the link again?
>
> Oops
> https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ
>
>>
>>>
>>> This one looks quite different. In short, it's a mess. The switching
>>> between tasks can be lower than 10 microseconds. It basically never
>>> recovers. There is constant yielding all the time.
>>>
>>> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
>>> scheduling patches. While I am not recommending gang scheduling, I
>>> think it's a good data point. The performance is 3.88x the PLE result.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Yes, we see a lot of yields.

2012-10-15 12:14:40

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
>>>>
>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>
[...]
>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>> has just terrible scalability to begin with. I do not think we should
>>> try to optimize such a bad workload.
>>>
>>
>> I think my way of running dbench has some flaw, so I went to ebizzy.
>> Could you let me know how you generally run dbench?
>
> I mount a tmpfs and then specify that mount for dbench to run on. This
> eliminates all IO. I use a 300 second run time and number of threads is
> equal to number of vcpus. All of the VMs of course need to have a
> synchronized start.
>
> I would also make sure you are using a recent kernel for dbench, where
> the dcache scalability is much improved. Without any lock-holder
> preemption, the time in spin_lock should be very low:
>
>
>> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
>> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
>> 2.81% 10176 dbench dbench [.] child_run
>> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
>> 2.33% 8423 dbench dbench [.] next_token
>> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
>> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
>> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
>> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
>> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
>> 1.38% 5009 dbench libc-2.12.so [.] memmove
>> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
>> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
>

Hi Andrew,
I ran the dbench test on tmpfs. I do not see any improvement in
dbench with a 16k ple window.

So it seems that, apart from ebizzy, no workload benefited from it, and I
agree that it may not be good to optimize for ebizzy.
I shall drop the change to a 16k default window and continue with the
original patch series. I need to experiment with the latest kernel.

(PS: Thanks for pointing me towards perf in the latest kernel. It works fine.)

Results:
dbench run for 120 sec 30 sec warmup 8 iterations using tmpfs
base = 3.6.0-rc5 with ple handler optimization patch.

x => base + ple_window = 4k
+ => base + ple_window = 16k
* => base + ple_gap = 0

dbench 1x overcommit case
=========================
N Min Max Median Avg Stddev
x 8 5322.5 5519.05 5482.71 5461.0962 63.522276
+ 8 5255.45 5530.55 5496.94 5455.2137 93.070363
* 8 5350.85 5477.81 5408.065 5418.4338 44.762697


dbench 2x overcommit case
==========================

N Min Max Median Avg Stddev
x 8 3054.32 3194.47 3137.33 3132.625 54.491615
+ 8 3040.8 3148.87 3088.615 3088.1887 32.862336
* 8 3031.51 3171.99 3083.6 3097.4612 50.526977

2012-10-15 14:35:33

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
> >>>>
> >>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>
> [...]
> >>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>> has just terrible scalability to begin with. I do not think we should
> >>> try to optimize such a bad workload.
> >>>
> >>
> >> I think my way of running dbench has some flaw, so I went to ebizzy.
> >> Could you let me know how you generally run dbench?
> >
> > I mount a tmpfs and then specify that mount for dbench to run on. This
> > eliminates all IO. I use a 300 second run time and number of threads is
> > equal to number of vcpus. All of the VMs of course need to have a
> > synchronized start.
> >
> > I would also make sure you are using a recent kernel for dbench, where
> > the dcache scalability is much improved. Without any lock-holder
> > preemption, the time in spin_lock should be very low:
> >
> >
> >> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
> >> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
> >> 2.81% 10176 dbench dbench [.] child_run
> >> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
> >> 2.33% 8423 dbench dbench [.] next_token
> >> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
> >> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
> >> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
> >> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
> >> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
> >> 1.38% 5009 dbench libc-2.12.so [.] memmove
> >> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
> >> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
> >
>
> Hi Andrew,
> I ran the test with dbench with tmpfs. I do not see any improvements in
> dbench for 16k ple window.
>
> So it seems apart from ebizzy no workload benefited by that. and I
> agree that, it may not be good to optimize for ebizzy.
> I shall drop changing to 16k default window and continue with other
> original patch series. Need to experiment with latest kernel.

Thanks for running this again. I do believe there are some workloads that,
when run at 1x overcommit, would benefit from a larger ple_window [with
the current ple handling code], but I also do not want to potentially
degrade >1x with a larger window. I do, however, think there may be
another option. I have not fully worked this out, but I think I am on
to something.

I decided to revert to just a yield() instead of a yield_to(). My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go.... Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again. The more we can do to
get all vcpus running at the same time, the less we have to deal with the
preemption problem. The other benefit is that yield() is far, far lower
overhead than yield_to().
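
To make that contrast concrete, here is a minimal sketch (illustrative
only, not the tested code) of what the PLE handler reduces to when it
only yields:

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/*
	 * Sketch: give up the pcpu on every PLE exit, in the hope that a
	 * vcpu of another VM (or the preempted lock holder) runs next.
	 */
	yield();
}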

This does assume that vcpus from the same VM do not share runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not. My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue. I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to(). The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself. If we use yield() most of the time, this overhead
will go away.

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE 426 +/- 11.03%
no PLE w/ gangsched 32001 +/- .37%
PLE with yield() 29207 +/- .28%
PLE with yield_to() 8175 +/- 1.37%

Yield() is far and way better than yield_to() here and almost approaches
gang sched result. Here is a link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU

The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang scheduling without needing to
actually implement gang scheduling.

I did test a smaller VM:

dbench with 10-way VMs, 16 of them on 80-way host:

no PLE 6248 +/- 7.69%
no PLE w/ gangsched 28379 +/- .07%
PLE with yield() 29196 +/- 1.62%
PLE with yield_to() 32217 +/- 1.76%

There is some degradation with yield() compared to yield_to() here, but it
is not nearly as large as the uplift we see on the larger VMs. Regardless,
I have an idea to fix that: Instead of using yield() all the time, we could use
yield_to(), but limit the rate per vcpu to something like 1 per jiffie.
All other exits use yield(). That rate of yield_to() should be more
than enough for the smaller VMs, and the result should be hopefully just
the same as the current code. I have not coded this up yet, but it's my
next step.

I am also hopeful that limiting yield_to() will make the 1x issue just
go away (even with a 4096 ple_window). The vast majority of exits will
result in yield(), which should be harmless.

Keep in mind this did require ensuring sibling vcpus do not share host
runqueues - I do think that can be made possible given some optional
scheduler tweaks.

>
> (PS: Thanks for pointing towards, perf in latest kernel. It works fine.)
>
> Results:
> dbench run for 120 sec 30 sec warmup 8 iterations using tmpfs
> base = 3.6.0-rc5 with ple handler optimization patch.
>
> x => base + ple_window = 4k
> + => base + ple_window = 16k
> * => base + ple_gap = 0
>
> dbench 1x overcommit case
> =========================
> N Min Max Median Avg Stddev
> x 8 5322.5 5519.05 5482.71 5461.0962 63.522276
> + 8 5255.45 5530.55 5496.94 5455.2137 93.070363
> * 8 5350.85 5477.81 5408.065 5418.4338 44.762697
>
>
> dbench 2x overcommit case
> ==========================
>
> N Min Max Median Avg Stddev
> x 8 3054.32 3194.47 3137.33 3132.625 54.491615
> + 8 3040.8 3148.87 3088.615 3088.1887 32.862336
> * 8 3031.51 3171.99 3083.6 3097.4612 50.526977
>

-Andrew

2012-10-18 12:40:17

by Avi Kivity

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/09/2012 08:51 PM, Raghavendra K T wrote:
> Here is the summary:
> We do get good benefit by increasing ple window. Though we don't
> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
>
> Let me know if you think we can increase the default ple_window
> itself to 16k.
>

I think so, there is no point running with untuned defaults.

>
> I can respin the whole series including this default ple_window change.

It can come as a separate patch.

>
> I also have the perf kvm top result for both ebizzy and kernbench.
> I think they are in expected lines now.
>
> Improvements
> ================
>
> 16 core PLE machine with 16 vcpu guest
>
> base = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k = base + ple_window = 16k
> base_pleopt_32k = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
> kernbench, hackbench, sysbench (time in sec lower is better)
> ebizzy (rec/sec higher is better)
>
> % improvements w.r.t base (ple_window = 4k)
> ---------------+---------------+-----------------+-------------------+
> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> ---------------+---------------+-----------------+-------------------+
> kernbench_1x | 0.42371 | 1.15164 | 0.09320 |
> kernbench_2x | -1.40981 | -17.48282 | -570.77053 |
> ---------------+---------------+-----------------+-------------------+
> sysbench_1x | -0.92367 | 0.24241 | -0.27027 |
> sysbench_2x | -2.22706 |-0.30896 | -1.27573 |
> sysbench_3x | -0.75509 | 0.09444 | -2.97756 |
> ---------------+---------------+-----------------+-------------------+
> ebizzy_1x | 54.99976 | 67.29460 | 74.14076 |
> ebizzy_2x | -8.83386 |-27.38403 | -96.22066 |
> ---------------+---------------+-----------------+-------------------+

So it seems we want dynamic PLE windows. As soon as we enter overcommit
we need to decrease the window.
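
For concreteness, a rough sketch of the kind of adjustment that implies
(the per-vcpu ple_window field, helper name and bounds here are
hypothetical, not something in the posted patches):

#define PLE_WINDOW_MIN	 4096
#define PLE_WINDOW_MAX	32768

/*
 * Hypothetical sketch: halve the window once we detect overcommit, grow
 * it while the host looks undercommitted, and program the new value
 * into the VMCS for the next entry.
 */
static void vmx_adjust_ple_window(struct vcpu_vmx *vmx, bool overcommitted)
{
	if (overcommitted)
		vmx->ple_window = max(vmx->ple_window / 2, PLE_WINDOW_MIN);
	else
		vmx->ple_window = min(vmx->ple_window * 2, PLE_WINDOW_MAX);

	vmcs_write32(PLE_WINDOW, vmx->ple_window);
}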


--
error compiling committee.c: too many arguments to function

2012-10-19 08:23:47

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/18/2012 06:09 PM, Avi Kivity wrote:
> On 10/09/2012 08:51 PM, Raghavendra K T wrote:
>> Here is the summary:
>> We do get good benefit by increasing ple window. Though we don't
>> see good benefit for kernbench and sysbench, for ebizzy, we get huge
>> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
>>
>> Let me know if you think we can increase the default ple_window
>> itself to 16k.
>>
>
> I think so, there is no point running with untuned defaults.
>

Okay.

>>
>> I can respin the whole series including this default ple_window change.
>
> It can come as a separate patch.

Yes. Will spin it separately.

>
>>
>> I also have the perf kvm top result for both ebizzy and kernbench.
>> I think they are in expected lines now.
>>
>> Improvements
>> ================
>>
>> 16 core PLE machine with 16 vcpu guest
>>
>> base = 3.6.0-rc5 + ple handler optimization patches
>> base_pleopt_16k = base + ple_window = 16k
>> base_pleopt_32k = base + ple_window = 32k
>> base_pleopt_nople = base + ple_gap = 0
>> kernbench, hackbench, sysbench (time in sec lower is better)
>> ebizzy (rec/sec higher is better)
>>
>> % improvements w.r.t base (ple_window = 4k)
>> ---------------+---------------+-----------------+-------------------+
>> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
>> ---------------+---------------+-----------------+-------------------+
>> kernbench_1x | 0.42371 | 1.15164 | 0.09320 |
>> kernbench_2x | -1.40981 | -17.48282 | -570.77053 |
>> ---------------+---------------+-----------------+-------------------+
>> sysbench_1x | -0.92367 | 0.24241 | -0.27027 |
>> sysbench_2x | -2.22706 |-0.30896 | -1.27573 |
>> sysbench_3x | -0.75509 | 0.09444 | -2.97756 |
>> ---------------+---------------+-----------------+-------------------+
>> ebizzy_1x | 54.99976 | 67.29460 | 74.14076 |
>> ebizzy_2x | -8.83386 |-27.38403 | -96.22066 |
>> ---------------+---------------+-----------------+-------------------+
>
> So it seems we want dynamic PLE windows. As soon as we enter overcommit
> we need to decrease the window.
>

Okay.
I have a rough idea of the implementation. I'll try that after these
V2 experiments are over.
So in brief, I have this in my queue, priority-wise:

1) V2 version of this patch series (in progress)
2) default PLE window
3) preemption notifiers
4) Pv spinlock

2012-10-19 08:35:08

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
>> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>>>> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
>>>>>>
>>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>>>
>> [...]
>>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>>>> has just terrible scalability to begin with. I do not think we should
>>>>> try to optimize such a bad workload.
>>>>>
>>>>
>>>> I think my way of running dbench has some flaw, so I went to ebizzy.
>>>> Could you let me know how you generally run dbench?
>>>
>>> I mount a tmpfs and then specify that mount for dbench to run on. This
>>> eliminates all IO. I use a 300 second run time and number of threads is
>>> equal to number of vcpus. All of the VMs of course need to have a
>>> synchronized start.
>>>
>>> I would also make sure you are using a recent kernel for dbench, where
>>> the dcache scalability is much improved. Without any lock-holder
>>> preemption, the time in spin_lock should be very low:
>>>
>>>
>>>> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
>>>> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
>>>> 2.81% 10176 dbench dbench [.] child_run
>>>> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
>>>> 2.33% 8423 dbench dbench [.] next_token
>>>> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
>>>> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
>>>> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
>>>> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
>>>> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
>>>> 1.38% 5009 dbench libc-2.12.so [.] memmove
>>>> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
>>>> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
>>>
>>
>> Hi Andrew,
>> I ran the test with dbench with tmpfs. I do not see any improvements in
>> dbench for 16k ple window.
>>
>> So it seems apart from ebizzy no workload benefited by that. and I
>> agree that, it may not be good to optimize for ebizzy.
>> I shall drop changing to 16k default window and continue with other
>> original patch series. Need to experiment with latest kernel.
>
> Thanks for running this again. I do believe there are some workloads,
> when run at 1x overcommit, would benefit from a larger ple_window [with
> he current ple handling code], but I do not also want to potentially
> degrade >1x with a larger window. I do, however, think there may be a
> another option. I have not fully worked this out, but I think I am on
> to something.
>
> I decided to revert back to just a yield() instead of a yield_to(). My
> motivation was that yield_to() [for large VMs] is like a dog chasing its
> tail, round and round we go.... Just yield(), in particular a yield()
> which results in yielding to something -other- than the current VM's
> vcpus, helps synchronize the execution of sibling vcpus by deferring
> them until the lock holder vcpu is running again. The more we can do to
> get all vcpus running at the same time, the far less we deal with the
> preemption problem. The other benefit is that yield() is far, far lower
> overhead than yield_to()
>
> This does assume that vcpus from same VM do not share same runqueues.
> Yielding to a sibling vcpu with yield() is not productive for larger VMs
> in the same way that yield_to() is not. My recent results include
> restricting vcpu placement so that sibling vcpus do not get to run on
> the same runqueue. I do believe we could implement a initial placement
> and load balance policy to strive for this restriction (making it purely
> optional, but I bet could also help user apps which use spin locks).
>
> For 1x VMs which still vm_exit due to PLE, I believe we could probably
> just leave the ple_window alone, as long as we mostly use yield()
> instead of yield_to(). The problem with the unneeded exits in this case
> has been the overhead in routines leading up to yield_to() and the
> yield_to() itself. If we use yield() most of the time, this overhead
> will go away.
>
> Here is a comparison of yield_to() and yield():
>
> dbench with 20-way VMs, 8 of them on 80-way host:
>
> no PLE 426 +/- 11.03%
> no PLE w/ gangsched 32001 +/- .37%
> PLE with yield() 29207 +/- .28%
> PLE with yield_to() 8175 +/- 1.37%
>
> Yield() is far and way better than yield_to() here and almost approaches
> gang sched result. Here is a link for the perf sched map bitmap:
>
> https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
>
> The thrashing is way down and sibling vcpus tend to run together,
> approximating the behavior of the gang scheduling without needing to
> actually implement gang scheduling.
>
> I did test a smaller VM:
>
> dbench with 10-way VMs, 16 of them on 80-way host:
>
> no PLE 6248 +/- 7.69%
> no PLE w/ gangsched 28379 +/- .07%
> PLE with yield() 29196 +/- 1.62%
> PLE with yield_to() 32217 +/- 1.76%

Hi Andrew, the results are encouraging.

>
> There is some degrade from yield() to yield_to() here, but nearly as
> large as the uplift we see on the larger VMs. Regardless, I have an
> idea to fix that: Instead of using yield() all the time, we could use
> yield_to(), but limit the rate per vcpu to something like 1 per jiffie.
> All other exits use yield(). That rate of yield_to() should be more
> than enough for the smaller VMs, and the result should be hopefully just
> the same as the current code. I have not coded this up yet, but it's my
> next step.

I personally feel rate limiting yield_to may be a good idea.

>
> I am also hopeful the limitation of yield_to() will also make the 1x
> issue just go away as well (even with 4096 ple_window). The vast
> majority of exits will result in yield() which should be harmless.
>
> Keep in mind this did require ensuring sibling vcpus do not share host
> runqueues -I do think that can be possible given some optional scheduler
> tweaks.

I think this (placement) is a concern. Having the rate limit alone may
suffice. Maybe tuning it while also taking the overcommitted/
non-overcommitted scenario into account would be better.

Okay, below is the V2 implementation I am experimenting with:

1) check source -and- target runq to decide on exiting the ple handler
2)

vcpu_on_spin()
{
	.....
	if yield_to_same_vm did not succeed and we are overcommitted
		yield()
}
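
Fleshed out a little, the combination of (1) and (2) would look roughly
like this (a sketch only; both helpers are hypothetical and not the code
in the RFC or the V2 series):

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	/* First try to boost a preempted vcpu of the same VM, as today. */
	if (kvm_try_yield_to_same_vm(me))	/* hypothetical helper */
		return;

	/*
	 * Nothing worth boosting.  Only give up the pcpu when the source
	 * or target runqueue indicates overcommit (the check from (1));
	 * in the undercommitted case just return and let the guest keep
	 * spinning within its ple_window.
	 */
	if (host_looks_overcommitted(me))	/* hypothetical helper */
		yield();
}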

I think combining your thoughts and (2) complicates the scenario a bit.
Anyway, let me see how my experiment goes. I will also check how yield()
performs without any pinning.

2012-10-19 13:31:43

by Andrew Theurer

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity <[email protected]> [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with. I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on. This
> >>> eliminates all IO. I use a 300 second run time and number of threads is
> >>> equal to number of vcpus. All of the VMs of course need to have a
> >>> synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved. Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>
> >>>> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
> >>>> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
> >>>> 2.81% 10176 dbench dbench [.] child_run
> >>>> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
> >>>> 2.33% 8423 dbench dbench [.] next_token
> >>>> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
> >>>> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
> >>>> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
> >>>> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
> >>>> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
> >>>> 1.38% 5009 dbench libc-2.12.so [.] memmove
> >>>> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
> >>>> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench with tmpfs. I do not see any improvements in
> >> dbench for 16k ple window.
> >>
> >> So it seems apart from ebizzy no workload benefited by that. and I
> >> agree that, it may not be good to optimize for ebizzy.
> >> I shall drop changing to 16k default window and continue with other
> >> original patch series. Need to experiment with latest kernel.
> >
> > Thanks for running this again. I do believe there are some workloads,
> > when run at 1x overcommit, would benefit from a larger ple_window [with
> > he current ple handling code], but I do not also want to potentially
> > degrade >1x with a larger window. I do, however, think there may be a
> > another option. I have not fully worked this out, but I think I am on
> > to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to(). My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go.... Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again. The more we can do to
> > get all vcpus running at the same time, the far less we deal with the
> > preemption problem. The other benefit is that yield() is far, far lower
> > overhead than yield_to()
> >
> > This does assume that vcpus from same VM do not share same runqueues.
> > Yielding to a sibling vcpu with yield() is not productive for larger VMs
> > in the same way that yield_to() is not. My recent results include
> > restricting vcpu placement so that sibling vcpus do not get to run on
> > the same runqueue. I do believe we could implement a initial placement
> > and load balance policy to strive for this restriction (making it purely
> > optional, but I bet could also help user apps which use spin locks).
> >
> > For 1x VMs which still vm_exit due to PLE, I believe we could probably
> > just leave the ple_window alone, as long as we mostly use yield()
> > instead of yield_to(). The problem with the unneeded exits in this case
> > has been the overhead in routines leading up to yield_to() and the
> > yield_to() itself. If we use yield() most of the time, this overhead
> > will go away.
> >
> > Here is a comparison of yield_to() and yield():
> >
> > dbench with 20-way VMs, 8 of them on 80-way host:
> >
> > no PLE 426 +/- 11.03%
> > no PLE w/ gangsched 32001 +/- .37%
> > PLE with yield() 29207 +/- .28%
> > PLE with yield_to() 8175 +/- 1.37%
> >
> > Yield() is far and way better than yield_to() here and almost approaches
> > gang sched result. Here is a link for the perf sched map bitmap:
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
> >
> > The thrashing is way down and sibling vcpus tend to run together,
> > approximating the behavior of the gang scheduling without needing to
> > actually implement gang scheduling.
> >
> > I did test a smaller VM:
> >
> > dbench with 10-way VMs, 16 of them on 80-way host:
> >
> > no PLE 6248 +/- 7.69%
> > no PLE w/ gangsched 28379 +/- .07%
> > PLE with yield() 29196 +/- 1.62%
> > PLE with yield_to() 32217 +/- 1.76%
>
> Hi Andrew, Results are encouraging.
>
> >
> > There is some degrade from yield() to yield_to() here, but nearly as
> > large as the uplift we see on the larger VMs. Regardless, I have an
> > idea to fix that: Instead of using yield() all the time, we could use
> > yield_to(), but limit the rate per vcpu to something like 1 per jiffie.
> > All other exits use yield(). That rate of yield_to() should be more
> > than enough for the smaller VMs, and the result should be hopefully just
> > the same as the current code. I have not coded this up yet, but it's my
> > next step.
>
> I personally feel rate limiting yield_to may be a good idea.
>
> >
> > I am also hopeful the limitation of yield_to() will also make the 1x
> > issue just go away as well (even with 4096 ple_window). The vast
> > majority of exits will result in yield() which should be harmless.
> >
> > Keep in mind this did require ensuring sibling vcpus do not share host
> > runqueues -I do think that can be possible given some optional scheduler
> > tweaks.
>
> I think this is a concern (placing). Having rate limit alone may
> suffice.May be tuning that taking into overcommitted/non-overcommitted
> scenario also into account would be better.
>
> Okay below is my V2 implementation I am experimenting
>
> 1) check source -and- target runq to decide on exiting the ple handler
> 2)
>
> vcpu_on_spin()
> {
>
> .....
> if yield_to_same_vm did not succeed and we are overcommitted
> yield()
>
> }
>
> I think combining your thoughts and (2) complicates scenario a bit.
> anyways let me see how my experiment goes. I will also check how yield
> performs without any pinning.

FWIW, below is the latest with throttling yield_to(). Results were
slightly higher than the above with just yield(). Although I can see an
improvement when not forcing non-shared runqueues among same-VM vcpus
(via binding), it's not as effective. I am more concerned this problem
requires a multi-part solution, and reducing lock-holder preemption is
the other part (by not allowing sequential execution of same-VM vcpus by
virtue of sharing runqueues).

Signed-off-by: Andrew Theurer <[email protected]>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b70b48b..595ef3e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -153,6 +153,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;

 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d617f69..1f0ec36 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include <linux/slab.h>
 #include <linux/sort.h>
 #include <linux/bsearch.h>
+#include <linux/jiffies.h>

 #include <asm/processor.h>
 #include <asm/io.h>
@@ -228,6 +229,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->pid = NULL;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
+	vcpu->last_yield_to = 0;

 	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
 	if (!page) {
@@ -1590,27 +1592,39 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int i;

 	/*
+	 * A yield_to() can be quite expensive, so we try to limit its use
+	 * to one per jiffie.  Subsequent exits just yield the current vcpu
+	 * in hopes of having it run again when the lock holding vcpu
+	 * gets to run again.  This is most effective when vcpus from
+	 * the same VM do not share a runqueue
+	 */
+	if (me->last_yield_to == jiffies) {
+		yield();
+	} else {
+		/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}