2013-04-22 05:55:35

by Raghavendra K T

Subject: Re: Preemptable Ticket Spinlock

On 04/21/2013 03:42 AM, Jiannan Ouyang wrote:
> Hello Everyone,
>
> I recently came up with a spinlock algorithm that can adapt to
> preemption, which you may be interested in.

Overall it is a great and clever idea, as Rik already mentioned.

> The intuition is to
> downgrade a fair lock to an unfair lock automatically upon preemption,
> and preserve the fairness otherwise.

I also hope that being a little unfair does not defeat the original
intention of introducing ticket spinlocks. There was some discussion of
this a while back in this thread:
https://lkml.org/lkml/2010/6/3/331

> It is a guest side optimization,
> and can be used as a complementary technique to host side optimizations
> like co-scheduling and Pause-Loop Exiting.
>
> In my experiments, it improves VM performance by 5.32X on average, when
> running on a non paravirtual VMM, and by 7.91X when running on a VMM
> that supports a paravirtual locking interface (using a pv preemptable
> ticket spinlock), when executing a set of microbenchmarks as well as a
> realistic e-commerce benchmark.

AFAIU, the experiments are on non-PLE machines, and it would be worth
experimenting on PLE machines too, and also on bigger machines (we may
get some surprises there otherwise).
I'll wait for your next iteration of the patches with the "using lower
bit" changes.


>
> A detailed algorithm description can be found in my VEE 2013 paper,
> Preemptable Ticket Spinlocks: Improving Consolidated Performance in the
> Cloud
> Jiannan Ouyang, John R. Lange
> ouyang,[email protected]
> University of Pittsburgh
> http://people.cs.pitt.edu/~ouyang/files/publication/preemptable_lock-ouyang-vee13.pdf
>
> The patch is based on stock Linux kernel 3.5.0, and tested on kernel
> 3.4.41 as well.
> http://www.cs.pitt.edu/~ouyang/files/preemptable_lock.tar.gz
>
> Thanks
> --Jiannan
>
> I'm not familiar with sending patches over email, so I just paste it
> below; sorry for the inconvenience.
> ======================
> diff --git a/arch/x86/include/asm/spinlock.h
> b/arch/x86/include/asm/spinlock.h
> index b315a33..895d3b3 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -48,18 +48,35 @@
> * in the high part, because a wide xadd increment of the low part would carry
> * up and contaminate the high part.
> */
> +#define TIMEOUT_UNIT (1<<14)

This value seems to be at the higher end, but I hope you have
experimented enough to come up with it. It would be better, again, to
test all these tunables on PLE machines too.

> static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
> {
>         register struct __raw_tickets inc = { .tail = 1 };
> +       unsigned int timeout = 0;
> +       __ticket_t current_head;
>
>         inc = xadd(&lock->tickets, inc);
> -
> +       if (likely(inc.head == inc.tail))
> +               goto spin;
> +
> +       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);
> +       do {
> +               current_head = ACCESS_ONCE(lock->tickets.head);
> +               if (inc.tail <= current_head) {
> +                       goto spin;
> +               } else if (inc.head != current_head) {
> +                       inc.head = current_head;
> +                       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);

Good idea indeed to base the loop on the head and tail difference. But
for virtualization, I believe this "directly proportional" notion is a
little tricky too.
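
For anyone who wants to poke at the idea outside the kernel, here is a
minimal user-space C11 model I put together from the description above.
It is a sketch, not the posted patch: in particular, the acquisition
step (an inner test-and-set "grant" flag that both fair and timed-out
waiters contend for, in the spirit of the "using lower bit" variant
discussed here) is my assumption about one safe way to handle timeout
expiry.

/* pt_lock.c: user-space model of a preemptable ticket lock (a sketch,
 * NOT the posted patch).  Mutual exclusion comes from the grant flag;
 * the ticket pair only orders who may contend for it, and a waiter
 * whose proportional timeout expires starts contending early. */
#include <stdatomic.h>
#include <stdint.h>
#include <sched.h>

#define TIMEOUT_UNIT (1 << 14)

struct pt_lock {
        _Atomic uint16_t head;  /* completed critical sections */
        _Atomic uint16_t tail;  /* next ticket to hand out */
        atomic_flag      grant; /* the actual lock bit */
};
#define PT_LOCK_INIT { 0, 0, ATOMIC_FLAG_INIT }

static inline int ticket_passed(uint16_t head, uint16_t me)
{
        return (int16_t)(head - me) >= 0;       /* wrap-safe compare */
}

static void pt_lock(struct pt_lock *l)
{
        uint16_t me = atomic_fetch_add_explicit(&l->tail, 1,
                                                memory_order_relaxed);
        uint16_t cur = atomic_load_explicit(&l->head, memory_order_acquire);
        unsigned int timeout = TIMEOUT_UNIT * (uint16_t)(me - cur);

        while (!ticket_passed(cur, me)) {
                uint16_t now = atomic_load_explicit(&l->head,
                                                    memory_order_acquire);
                if (now != cur) {
                        /* Queue moved: recompute the proportional budget. */
                        cur = now;
                        timeout = TIMEOUT_UNIT * (uint16_t)(me - cur);
                } else if (timeout-- == 0) {
                        /* Suspected preemption: stop honoring ticket
                         * order and contend for the grant flag now. */
                        break;
                }
        }
        while (atomic_flag_test_and_set_explicit(&l->grant,
                                                 memory_order_acquire))
                sched_yield();  /* user-space stand-in for cpu_relax() */
}

static void pt_unlock(struct pt_lock *l)
{
        atomic_flag_clear_explicit(&l->grant, memory_order_release);
        /* Every acquire, fair or timed-out, advances head exactly once,
         * so some queued waiter always has permission to contend. */
        atomic_fetch_add_explicit(&l->head, 1, memory_order_release);
}

Fairness here is advisory by design: a timed-out waiter can overtake
the queue, which is exactly the downgrade-to-unfair behavior described
in the paper.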


2013-04-22 16:42:53

by Jiannan Ouyang

Subject: Re: Preemptable Ticket Spinlock

On Mon, Apr 22, 2013 at 1:58 AM, Raghavendra K T
<[email protected]> wrote:

>> The intuition is to
>> downgrade a fair lock to an unfair lock automatically upon preemption,
>> and preserve the fairness otherwise.
>
> I also hope that being a little unfair does not defeat the original
> intention of introducing ticket spinlocks. There was some discussion of
> this a while back in this thread:
> https://lkml.org/lkml/2010/6/3/331
>

Good point. I have also wondered why not simply use an unfair lock in a
virtualized environment, and whether fairness is really a big issue.
However, given that the current kernel uses ticket locks, I assume
fairness is a necessary spinlock feature.

Regarding the fairness of the preemptable lock, I did a user-space
experiment using 8 pCPUs competing on one spinlock, counting the number
of lock acquisitions per thread. The results show that lock acquisition
counts are *almost* evenly distributed across threads with the
preemptable lock.
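
FWIW, a harness along those lines is easy to reproduce. Here is a rough
pthread version (assuming the user-space pt_lock model sketched earlier
in this thread; the thread and iteration counts are made up, and for a
real run you would pin one thread per pCPU):

/* fairness.c: count lock acquisitions per thread, as in the
 * experiment described above.  Build together with the pt_lock
 * sketch posted earlier in the thread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS  8
#define TOTAL_ACQ 8000000L

static struct pt_lock lock = PT_LOCK_INIT;
static long counts[NTHREADS];
static long total;                      /* protected by the lock */

static void *worker(void *arg)
{
        long id = (long)arg;

        for (;;) {
                pt_lock(&lock);
                if (total >= TOTAL_ACQ) {
                        pt_unlock(&lock);
                        break;
                }
                total++;
                counts[id]++;
                pt_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        long i;

        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        /* A roughly even split across threads suggests the timeout
         * path is not destroying fairness. */
        for (i = 0; i < NTHREADS; i++)
                printf("thread %ld: %ld acquisitions\n", i, counts[i]);
        return 0;
}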

>
> AFAIU, the experiments are on non-PLE machines, and it would be worth
> experimenting on PLE machines too, and also on bigger machines (we may
> get some surprises there otherwise).
> I'll wait for your next iteration of the patches with the "using lower
> bit" changes.
>

Yes, they are on PLE machines. The current implementation and
evaluation are still at the proof-of-concept stage. More experiments
(with PLE, bigger machines, etc.) are needed to better understand the
lock behavior.

>> +#define TIMEOUT_UNIT (1<<14)
>
>
> This value seems to be at the higher end, but I hope you have
> experimented enough to come up with it. It would be better, again, to
> test all these tunables on PLE machines too.
>
>
I actually didn't tune this parameter at all... But yes, finding a
better value would be necessary if this were to go into production code.

>> static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>> {
>>         register struct __raw_tickets inc = { .tail = 1 };
>> +       unsigned int timeout = 0;
>> +       __ticket_t current_head;
>>
>>         inc = xadd(&lock->tickets, inc);
>> -
>> +       if (likely(inc.head == inc.tail))
>> +               goto spin;
>> +
>> +       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);
>> +       do {
>> +               current_head = ACCESS_ONCE(lock->tickets.head);
>> +               if (inc.tail <= current_head) {
>> +                       goto spin;
>> +               } else if (inc.head != current_head) {
>> +                       inc.head = current_head;
>> +                       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);
>
>
> Good idea indeed to base the loop on the head and tail difference. But
> for virtualization, I believe this "directly proportional" notion is a
> little tricky too.
>

Could you explain your concern a little bit more?

Thanks
--Jiannan

2013-04-23 01:51:32

by Raghavendra K T

Subject: Re: Preemptable Ticket Spinlock

On 04/22/2013 10:12 PM, Jiannan Ouyang wrote:
> On Mon, Apr 22, 2013 at 1:58 AM, Raghavendra K T
> <[email protected]> wrote:

[...]

>>> static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
>>> {
>>>         register struct __raw_tickets inc = { .tail = 1 };
>>> +       unsigned int timeout = 0;
>>> +       __ticket_t current_head;
>>>
>>>         inc = xadd(&lock->tickets, inc);
>>> -
>>> +       if (likely(inc.head == inc.tail))
>>> +               goto spin;
>>> +
>>> +       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);

I forgot to mention this: for the immediate-wait case (a waiter right
behind the current holder), you can busy-loop instead of starting a
timeout, i.e.

timeout = TIMEOUT_UNIT * (inc.tail - inc.head - 1);

This idea was used by Rik in his spinlock backoff patches.
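
In context, the change would read something like this (my reading of
the suggestion, applied to the quoted hunk; a next-in-line waiter then
gets a zero budget and effectively busy-loops for the lock instead of
counting down):

+       /* Don't count the slot right behind the holder, as in Rik's
+        * backoff patches: for inc.tail - inc.head == 1 the budget is
+        * zero and the waiter just busy-waits for its turn. */
+       timeout = TIMEOUT_UNIT * (inc.tail - inc.head - 1);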

>>> +       do {
>>> +               current_head = ACCESS_ONCE(lock->tickets.head);
>>> +               if (inc.tail <= current_head) {
>>> +                       goto spin;
>>> +               } else if (inc.head != current_head) {
>>> +                       inc.head = current_head;
>>> +                       timeout = TIMEOUT_UNIT * (inc.tail - inc.head);
>>
>>
>> Good idea indeed to base the loop on the head and tail difference. But
>> for virtualization, I believe this "directly proportional" notion is a
>> little tricky too.
>>
>
> Could you explain your concern a little bit more?
>

Consider a big machine with 2 VMs running.
If n vCPUs of, say, VM1 are waiting in the queue, the question is:

Do we have to have all n vCPUs busy-looping, together burning roughly
the sum over k = 1..n of k * TIMEOUT_UNIT, i.e.
(n * (n + 1) / 2) * TIMEOUT_UNIT cycles?

OR

Is it worth having a vCPU far back in the queue give its time back, so
that some other vCPU of VM1 doing useful work, or a vCPU of VM2, can
benefit from it?

I mean: the farther back a vCPU is in the queue, the sooner it should
yield voluntarily (an inversely proportional notion, precisely because
it is a vCPU). And of course, for some queue distance n < THRESHOLD we
can still use the directly proportional wait idea.

Does this idea sound good?
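
To make that concrete, here is a rough sketch of the hybrid policy,
building on the user-space pt_lock model from earlier in the thread
(assumptions: THRESHOLD is a made-up tunable, and pv_yield() is a
hypothetical stand-in for a hypercall that donates the vCPU's timeslice
back to the host; the timeout/barging step from the patch is omitted
for brevity):

#define THRESHOLD 4

extern void pv_yield(void);     /* hypothetical yield hypercall */

static void wait_for_turn(struct pt_lock *l, uint16_t me)
{
        for (;;) {
                uint16_t cur = atomic_load(&l->head);
                uint16_t dist = (uint16_t)(me - cur);

                if (ticket_passed(cur, me))
                        return;         /* our turn has come (or passed) */

                if (dist < THRESHOLD) {
                        /* Near the front: directly proportional
                         * busy-wait, as in the posted patch. */
                        for (unsigned int t = TIMEOUT_UNIT * dist;
                             t && atomic_load(&l->head) == cur; t--)
                                ;       /* cpu_relax() in kernel code */
                } else {
                        /* Far back in the queue: give the timeslice
                         * back so a runnable vCPU, of this VM or of
                         * another, can use it. */
                        pv_yield();
                }
        }
}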