On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled
>
> The nice spinlock functions that enable interrupts while spinning are only
> usable if GENERIC_LOCKBREAK is set (and lock debugging is off, but lock
> debugging would not be used in a production configuration anyway).
>
> Factor out the dependencies on the ->break_lock field from the nice functions
> into a set of BREAKLOCK macros (these cannot be functions since the type of
> the lock variable varies).
>
> The nice spinlock functions can then also be used if !GENERIC_LOCKBREAK.
>
> Signed-off-by: Christoph Lameter <[email protected]>
>
> ---
> kernel/spinlock.c | 40 ++++++++++++++++++++++++++++++----------
> 1 file changed, 30 insertions(+), 10 deletions(-)
>
> Index: linux-2.6/kernel/spinlock.c
> ===================================================================
> --- linux-2.6.orig/kernel/spinlock.c 2008-05-07 11:19:31.000000000 -0700
> +++ linux-2.6/kernel/spinlock.c 2008-05-07 11:40:56.000000000 -0700
> @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
> * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
> * not re-enabled during lock-acquire (which the preempt-spin-ops do):
> */
> -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
Maybe I'm just blind, but doesn't this change effectively disable any
arch-specific optimized code for _raw_*_lock?
If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
be set, so in that case the debugging versions of _raw_*_lock are used.
But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
with _trylock and _can_lock primitives.
What am I missing here?
Petr Tesarik
On Mon, 2008-06-23 at 19:19 +0200, Petr Tesarik wrote:
> On Wed, 2008-05-07 at 11:49 -0700, Christoph Lameter wrote:
> > Subject: Spinlocks: Factor out GENERIC_LOCKBREAK in order to avoid spinning with irqs disabled
> >
> > The nice spinlock functions that enable interrupts while spinning are only
> > usable if GENERIC_LOCKBREAK is set (and lock debugging is off, but lock
> > debugging would not be used in a production configuration anyway).
> >
> > Factor out the dependencies on the ->break_lock field from the nice functions
> > into a set of BREAKLOCK macros (these cannot be functions since the type of
> > the lock variable varies).
> >
> > The nice spinlock functions can then also be used if !GENERIC_LOCKBREAK.
> >
> > Signed-off-by: Christoph Lameter <[email protected]>
> >
> > ---
> > kernel/spinlock.c | 40 ++++++++++++++++++++++++++++++----------
> > 1 file changed, 30 insertions(+), 10 deletions(-)
> >
> > Index: linux-2.6/kernel/spinlock.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/spinlock.c 2008-05-07 11:19:31.000000000 -0700
> > +++ linux-2.6/kernel/spinlock.c 2008-05-07 11:40:56.000000000 -0700
> > @@ -65,7 +65,7 @@ EXPORT_SYMBOL(_write_trylock);
> > * even on CONFIG_PREEMPT, because lockdep assumes that interrupts are
> > * not re-enabled during lock-acquire (which the preempt-spin-ops do):
> > */
> > -#if !defined(CONFIG_GENERIC_LOCKBREAK) || defined(CONFIG_DEBUG_LOCK_ALLOC)
> > +#ifdef CONFIG_DEBUG_LOCK_ALLOC
>
> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?
>
> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
>
> What am I missing here?
No, I think you're right.
I've been playing with these patches:
http://programming.kicks-ass.net/kernel-patches/spinlocks/
But as it stands it breaks !CONFIG_PREEMPT... I ought to find some other
cycles to spend on it.
On Mon, 23 Jun 2008, Petr Tesarik wrote:
> Maybe I'm just blind, but doesn't this change effectively disable any
> arch-specific optimized code for _raw_*_lock?
True. Only the __raw_xxx_trylock is still used.
> If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> be set, so in that case the debugging versions of _raw_*_lock are used.
> But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> with _trylock and _can_lock primitives.
>
> What am I missing here?
It is good that the locks are built with _trylock and _can_lock because
then we can re-enable interrupts while spinning.
On Mon, 2008-06-23 at 11:20 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Petr Tesarik wrote:
>
> > Maybe I'm just blind, but doesn't this change effectively disable any
> > arch-specific optimized code for _raw_*_lock?
>
> True. Only the __raw_xxx_trylock is still used.
>
> > If CONFIG_DEBUG_LOCK_ALLOC is set, then CONFIG_DEBUG_SPINLOCK must also
> > be set, so in that case the debugging versions of _raw_*_lock are used.
> > But if CONFIG_DEBUG_LOCK_ALLOC is _not_ set, then the locks are built
> > with _trylock and _can_lock primitives.
> >
> > What am I missing here?
>
> It is good that the locks are built with _trylock and _can_lock because
> then we can re-enable interrupts while spinning.
Well, good and bad, the flip side is that fairness schemes like ticket
locks are utterly defeated.
On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> > It is good that the locks are built with _trylock and _can_lock because
> > then we can re-enable interrupts while spinning.
>
> Well, good and bad, the flip side is that fairness schemes like ticket
> locks are utterly defeated.
True. But maybe we can make these fairness schemes more generic so that
they can go into core code?
On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>
> > > It is good that the locks are built with _trylock and _can_lock because
> > > then we can re-enable interrupts while spinning.
> >
> > Well, good and bad, the flip side is that fairness schemes like ticket
> > locks are utterly defeated.
>
> True. But maybe we can make these fairness schemes more generic so that
> they can go into core code?
The trouble with ticket locks is that they can't handle waiters going
away - or in this case getting preempted by irq handlers. The one who
took the ticket must pass it on, so if you're preempted it just sits
there being idle, until you get back to deal with the lock.
But yeah, perhaps another fairness scheme might work in the generic
code..
Peter Zijlstra wrote:
> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>
>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>
>>
>>>> It is good that the locks are built with _trylock and _can_lock because
>>>> then we can re-enable interrupts while spinning.
>>>>
>>> Well, good and bad, the flip side is that fairness schemes like ticket
>>> locks are utterly defeated.
>>>
>> True. But maybe we can make these fairness schemes more generic so that
>> they can go into core code?
>>
>
> The trouble with ticket locks is that they can't handle waiters going
> away - or in this case getting preempted by irq handlers. The one who
> took the ticket must pass it on, so if you're preempted it just sits
> there being idle, until you get back to deal with the lock.
>
> But yeah, perhaps another fairness scheme might work in the generic
> code..
Thomas Friebel presented results at the Xen Summit this week showing
that ticket locks are an absolute disaster for scalability in a virtual
environment, for a similar reason. It's a bit irritating if the lock
holder vcpu gets preempted by the hypervisor, but it's much worse when
it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
with the next ticket, it can waste up to N timeslices spinning.
I'm experimenting with adding a pvops hook to allow you to put in new
spinlock implementations on the fly. If nothing else, it will be useful
for experimenting with different algorithms. But it definitely seems
like the old unfair lock algorithm played much better with a virtual
environment, because the next cpu to get the lock is the next one the
scheduler gives time, rather than dictating an order - and the scheduler
should mitigate the unfairness that ticket locks were designed to solve.
J
On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>
> >>
> >>>> It is good that the locks are built with _trylock and _can_lock because
> >>>> then we can re-enable interrupts while spinning.
> >>>>
> >>> Well, good and bad, the flip side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>>
> >> True. But maybe we can make these fairness schemes more generic so that
> >> they can go into core code?
> >>
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
>
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> > holder vcpu gets preempted by the hypervisor, but it's much worse when
> > it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
>
> > I'm experimenting with adding a pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
Paravirt spinlocks sound like a good idea anyway; that way you can make
them scheduling locks (from the host's POV) when the lock owner (vcpu)
isn't running.
Burning time spinning on !running vcpus seems like a waste to me.
As for the scheduler solving the unfairness that ticket locks solve,
that cannot be done. The ticket lock solves intra-cpu fairness for a
resource other than time. The cpu scheduler only cares about fairness in
time, and its intra-cpu fairness is on a larger scale than most spinlock
hold times - so even if time and the locked resource would overlap it
wouldn't work.
The simple scenario is running N tasks on N cpus that all pound the same
lock; cache issues will make it unlikely the lock would migrate away
from whatever cpu it's on, essentially starving all the other N-1 cpus.
Ticket locks solve that exact issue; all the scheduler can do is ensure
they're all spending an equal amount of time on the cpu, whether that is
spinning for lock acquisition or getting actual work done is beyond its
scope.
On Wed, 2008-06-25 at 19:51 -0700, Jeremy Fitzhardinge wrote:
> [...]
> I'm experimenting with adding a pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
We really should paravirtualize spin locks, because there's always
something better to do than just burn time spinning. But in a
non-virtualized environment, tickets (or a similar scheme) should be
preserved.
We should probably re-think the whole locking scheme, because spinlocks
were designed to be held for a short period of time. This was a fair
assumption when they were introduced, but obviously it is now false in
many cases (such as virtualization).
Ticket-based spinlocks have actually already changed the original
design, so why not implement a generic "lock scheduler" on top of
spinlock_t and rwlock_t?
Petr Tesarik
Peter Zijlstra wrote:
> Paravirt spinlocks sound like a good idea anyway; that way you can make
> them scheduling locks (from the host's POV) when the lock owner (vcpu)
> isn't running.
>
> Burning time spinning on !running vcpus seems like a waste to me.
>
In theory. But in practice Linux locks are so low-contention that not
much time seems to get wasted. I've been doing experiments with
spin-a-while-then-block locks, but they never got to the -then-block
part in my test. Burning cycles spinning only gets expensive if the
lock-holder vcpu gets preempted and there are other cpus spinning on that
lock; but if locks are held only briefly, then there's little chance of
being preempted while holding the lock.
At least that's at the scale I've been testing, with only two cores. I
expect things look different with 8 or 16 cores and similarly scaled guests.
> As for the scheduler solving the unfairness that ticket locks solve,
>
No, I never said the scheduler would solve the problem, merely mitigate it.
> that cannot be done. The ticket lock solves intra-cpu fairness for a
> resource other than time. The cpu scheduler only cares about fairness in
> time, and its intra-cpu fairness is on a larger scale than most spinlock
> hold times - so even if time and the locked resource would overlap it
> wouldn't work.
>
> The simple scenario is running N tasks on N cpus that all pound the same
> lock; cache issues will make it unlikely the lock would migrate away
> from whatever cpu it's on, essentially starving all the other N-1 cpus.
>
Yep. But in practice, the scheduler will steal the real cpu from under
the vcpu dominating the lock and upset the pathological pattern. I'm
not saying it's ideal, but the starvation case that ticket locks solve is
pretty rare in the grand scheme of things.
Also, ticket locks don't help if the lock is always
transitioning between locked->unlocked->locked on all cpus. They only
help in the case of one cpu doing rapid lock->unlock transitions while
others wait on the lock.
> Ticket locks solve that exact issue; all the scheduler can do is ensure
> they're all spending an equal amount of time on the cpu, whether that is
> spinning for lock acquisition or getting actual work done is beyond its
> scope.
>
Yes. But the problem with ticket locks is that they dictate a
scheduling order, and if you fail to schedule in that order vast amounts
of time are wasted. You can get into this state:
1. vcpu A takes a lock
2. vcpu A is preempted, effectively making a 5us lock be held for 30ms
3. vcpus E,D,C,B try to take the lock in that order
4. they all spin, wasting time. bad, but no worse than the old lock
algorithm
5. vcpu A eventually runs again and releases the lock
6. vcpu B runs, spinning until preempted
7. vcpu C runs, spinning until preempted
8. vcpu D runs, spinning until preempted
9. vcpu E runs, and takes the lock and releases it
10. (repeat spinning on B,C,D until D gets the lock)
11. (repeat spinning on B,C until C gets the lock)
12. B finally gets the lock
Steps 6-12 are all caused by ticket locks, and the situation is
exacerbated by vcpus F-Z trying to get the lock in the meantime while
it's all tangled up handing out tickets in the right order.
The problem is that the old lock-byte locks made no fairness guarantees,
and interacted badly with the hardware causing severe starvation in some
cases. Ticket locks are too fair, and absolutely dictate the order in
which the lock is taken. Really, all that's needed is the weaker
assertion that "when I release the lock, any current spinner should get
the lock".
J
On Thu, 26 Jun 2008, Petr Tesarik wrote:
> We should probably re-think the whole locking scheme, because spinlocks
> were designed to be held for a short period of time. This was a fair
> assumption when they were introduced, but obviously it is now false in
> many cases (such as virtualization).
And NUMA.
> Ticket-based spinlocks have actually already changed the original
> design, so why not implement a generic "lock scheduler" on top of
> spinlock_t and rwlock_t?
How about making the semaphore / mutex code as fast as spinlocks and just
use that?
On Thu, 2008-06-26 at 10:02 -0700, Christoph Lameter wrote:
> On Thu, 26 Jun 2008, Petr Tesarik wrote:
>
> > We should probably re-think the whole locking scheme, because spinlocks
> > were designed to be held for a short period of time. This was a fair
> > assumption when they were introduced, but obviously it is now false in
> > many cases (such as virtualization).
>
> And NUMA.
>
> > Ticket-based spinlocks have actually already changed the original
> > design, so why not implement a generic "lock scheduler" on top of
> > spinlock_t and rwlock_t?
>
> How about making the semaphore / mutex code as fast as spinlocks and just
> use that?
Hehe, please have a look at a recent -rt tree and possibly test some of
that if you're interested.
It has adaptive spinning for rt_mutex, which is to say that as long as the
lock owner is current on its cpu, we'll spin trying to acquire the lock. The
reasoning here is that as long as the owner is running, there is a fair
chance it'll release the lock soonish.
This is another case where semaphores can never match, due to their lack
of ownership.
On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
> >> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
> >>>> It is good that the locks are built with _trylock and _can_lock
> >>>> because then we can re-enable interrupts while spinning.
> >>>
> >>> Well, good and bad, the flip side is that fairness schemes like ticket
> >>> locks are utterly defeated.
> >>
> >> True. But maybe we can make these fairness schemes more generic so that
> >> they can go into core code?
> >
> > The trouble with ticket locks is that they can't handle waiters going
> > away - or in this case getting preempted by irq handlers. The one who
> > took the ticket must pass it on, so if you're preempted it just sits
> > there being idle, until you get back to deal with the lock.
> >
> > But yeah, perhaps another fairness scheme might work in the generic
> > code..
>
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but it's much worse when
> it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
I didn't realise it is good practice to run multiple "virtual CPUs"
of the same guest on a single physical CPU on the host...
> I'm experimenting with adding a pvops hook to allow you to put in new
> spinlock implementations on the fly. If nothing else, it will be useful
> for experimenting with different algorithms. But it definitely seems
> like the old unfair lock algorithm played much better with a virtual
> environment, because the next cpu to get the lock is the next one the
> scheduler gives time, rather than dictating an order - and the scheduler
> should mitigate the unfairness that ticket locks were designed to solve.
... if it is good practice, then, virtualizing spinlocks I guess is
reasonable. If not, then "don't do that". Considering that probably
many bare metal systems will run pv kernels, every little cost adds
up.
On Monday 07 July 2008 21:50, Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
> > Thomas Friebel presented results at the Xen Summit this week showing
> > that ticket locks are an absolute disaster for scalability in a virtual
> > environment, for a similar reason. It's a bit irritating if the lock
> > holder vcpu gets preempted by the hypervisor, but it's much worse when
> > it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
> > with the next ticket, it can waste up to N timeslices spinning.
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>
> > I'm experimenting with adding a pvops hook to allow you to put in new
> > spinlock implementations on the fly. If nothing else, it will be useful
> > for experimenting with different algorithms. But it definitely seems
> > like the old unfair lock algorithm played much better with a virtual
> > environment, because the next cpu to get the lock is the next one the
> > scheduler gives time, rather than dictating an order - and the scheduler
> > should mitigate the unfairness that ticket locks were designed to solve.
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up.
Although, you wouldn't need to oversubscribe physical CPUs to hit
suboptimal behaviour.
Basically, I just ask for performance improvement to be measured
with some "realistic" configuration, then it should be easier to
justify.
Nick Piggin wrote:
> On Thursday 26 June 2008 12:51, Jeremy Fitzhardinge wrote:
>
>> Peter Zijlstra wrote:
>>
>>> On Mon, 2008-06-23 at 13:45 -0700, Christoph Lameter wrote:
>>>
>>>> On Mon, 23 Jun 2008, Peter Zijlstra wrote:
>>>>
>>>>>> It is good that the locks are built with _trylock and _can_lock
>>>>>> because then we can re-enable interrupts while spinning.
>>>>>>
>>>>> Well, good and bad, the flip side is that fairness schemes like ticket
>>>>> locks are utterly defeated.
>>>>>
>>>> True. But maybe we can make these fairness schemes more generic so that
>>>> they can go into core code?
>>>>
>>> The trouble with ticket locks is that they can't handle waiters going
>>> away - or in this case getting preempted by irq handlers. The one who
>>> took the ticket must pass it on, so if you're preempted it just sits
>>> there being idle, until you get back to deal with the lock.
>>>
>>> But yeah, perhaps another fairness scheme might work in the generic
>>> code..
>>>
>> Thomas Friebel presented results at the Xen Summit this week showing
>> that ticket locks are an absolute disaster for scalability in a virtual
>> environment, for a similar reason. It's a bit irritating if the lock
>> holder vcpu gets preempted by the hypervisor, but it's much worse when
>> it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
>> with the next ticket, it can waste up to N timeslices spinning.
>>
>
> I didn't realise it is good practice to run multiple "virtual CPUs"
> of the same guest on a single physical CPU on the host...
>
It isn't. It makes no sense at all to give a guest more vcpus than
physical cpus, so that kind of contention won't happen in general. But
the bad locking scenario happens when there's any system-wide
contention, so it could happen if some other virtual machine preempts a
vcpu holding a lock. And once a lock ends up being (effectively) held
for 30ms rather than 30us, the likelihood of going into contention goes
way up, and you can enter the catastrophic N^2 unlock->relock state.
My measurements show that reverting to the old lock-byte algorithm
avoids the worst case, and just results in a bit of excessive spinning.
Replacing it with a smarter spin-then-block-vcpu algorithm doesn't
really benefit the specific guest VM very much (kernbench elapsed time
is only slightly improved), but its consumption of physical cpu time can
go down by ~10%.
>> I'm experimenting with adding a pvops hook to allow you to put in new
>> spinlock implementations on the fly. If nothing else, it will be useful
>> for experimenting with different algorithms. But it definitely seems
>> like the old unfair lock algorithm played much better with a virtual
>> environment, because the next cpu to get the lock is the next one the
>> scheduler gives time, rather than dictating an order - and the scheduler
>> should mitigate the unfairness that ticket locks were designed to solve.
>>
>
> ... if it is good practice, then, virtualizing spinlocks I guess is
> reasonable. If not, then "don't do that". Considering that probably
> many bare metal systems will run pv kernels, every little cost adds
> up
I'm aware of that. In my current implementation the overhead amounts to
an extra direct call in the lock/unlock path, but that can be eliminated
with a small amount of restructuring (by making spin_lock/unlock()
inline functions, and having the call to raw_spin_lock/unlock within
it). The pvops patching machinery removes any indirect calls or jumps.
J
Nick Piggin wrote:
> Although, you wouldn't need to oversubscribe physical CPUs to hit
> suboptimal behaviour.
>
> Basically, I just ask for performance improvement to be measured
> with some "realistic" configuration, then it should be easier to
> justify.
Overcommitting cpus system-wide is pretty common. A use-case is
consolidating a pile of underused non-performance-critical servers into
one physical machine with a lot of VMs. It doesn't take much load to
cause a dhcp request to the dhcp server to steal a cpu away from the
mysql server while it holds a lock...
I'll mail out my patches so we can have a more concrete discussion shortly.
J
On Wed, 25 Jun 2008 19:51:12 -0700
Jeremy Fitzhardinge <[email protected]> wrote:
> Thomas Friebel presented results at the Xen Summit this week showing
> that ticket locks are an absolute disaster for scalability in a virtual
> environment, for a similar reason. It's a bit irritating if the lock
> holder vcpu gets preempted by the hypervisor, but it's much worse when
> it releases the lock: unless the vcpu scheduler gives a cpu to the vcpu
> with the next ticket, it can waste up to N timeslices spinning.
>
> I'm experimenting with adding a pvops hook to allow you to put in new
> spinlock implementations on the fly.
Alternatively, the guest could tell the host which vcpus
are next in line for a ticket spinlock, or a vcpu that gets
scheduled but is not supposed to grab the lock yet can give
some CPU time to the vcpu that should get the lock next.
I believe the IBM PPC64 people have done some work to implement
just that.
--
All Rights Reversed
Rik van Riel wrote:
> Alternatively, the guest could tell the host which vcpus
> are next in line for a ticket spinlock, or a vcpu that gets
> scheduled but is not supposed to grab the lock yet can give
> some CPU time to the vcpu that should get the lock next.
>
Those are possible, but would either 1) require hypervisor changes,
and/or 2) require changes no less extensive than the ones I had to make anyway.
Thomas's proposal was to modify the scheduler to try to avoid
preempting vcpus while they're in kernel mode. That's nice because it
requires no guest changes, and seems at least somewhat successful at
mitigating the problem. But it can't completely solve the problem, and
you end up with a bunch of heuristics in the hypervisor to decide who to
preempt.
The other point, of course, is that ticket locks are massive overkill
for the problem they're trying to solve. It's one thing to introduce an
element of fairness into spinlocks, it's another to impose strict FIFO
ordering. It would be enough to make the locks "polite" by preventing a
new lock-holder from taking the lock while it's under contention.
Something like:
union lock {
unsigned short word;
struct { unsigned char lock, count; };
};
spin_lock: # ebx - lock pointer
movw $0x0001, %ax # add 1 to lock, 0 to count
lock xaddw %ax, (%ebx) # attempt to take lock and test user count
testw %ax,%ax
jnz slow
taken: ret
# slow path
slow: lock incb 1(%ebx) # inc count
1: rep;nop
cmpb $0,(%ebx)
jnz 1b # wait for unlocked
movb $1,%al # attempt to take lock (count already increased)
xchgb %al,(%ebx)
testb %al,%al
jnz 1b
lock decb 1(%ebx) # drop count
jmp taken
spin_unlock:
movb $0,(%ebx)
ret
The uncontended fastpath is similar to the pre-ticket locks, but it
refuses to take the lock if there are other waiters, even if the lock is
not currently held. This prevents the rapid lock-unlock cycle on one
CPU from starving another CPU, which I understand was the original
problem ticket locks were trying to solve.
But it also means that all the contended spinners get the lock in
whatever order the system decides to give it to them, rather than
imposing a strict order.
> I believe the IBM PPC64 people have done some work to implement
> just that.
>
Do you have any references?
J
On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:
> The other point, of course, is that ticket locks are massive overkill
> for the problem they're trying to solve.
No they aren't.
> It's one thing to introduce an
> element of fairness into spinlocks, it's another to impose strict FIFO
> ordering. It would be enough to make the locks "polite" by preventing a
> new lock-holder from taking the lock while it's under contention.
> Something like:
>
> union lock {
> unsigned short word;
> struct { unsigned char lock, count; };
> };
>
> spin_lock: # ebx - lock pointer
> movw $0x0001, %ax # add 1 to lock, 0 to count
> lock xaddw %ax, (%ebx) # attempt to take lock and test user count
> testw %ax,%ax
> jnz slow
>
> taken: ret
>
> # slow path
> slow: lock incb 1(%ebx) # inc count
>
> 1: rep;nop
> cmpb $0,(%ebx)
> jnz 1b # wait for unlocked
>
> movb $1,%al # attempt to take lock (count already increased)
> xchgb %al,(%ebx)
> testb %al,%al
> jnz 1b
>
> lock decb 1(%ebx) # drop count
> jmp taken
>
> spin_unlock:
> movb $0,(%ebx)
> ret
>
>
> The uncontended fastpath is similar to the pre-ticket locks, but it
> refuses to take the lock if there are other waiters, even if the lock is
> not currently held. This prevents the rapid lock-unlock cycle on one
> CPU from starving another CPU, which I understand was the original
> problem ticket locks were trying to solve.
They prevent lots of unfairness and starvation problems. The most
prominent one (ie. actually observed in Linux) was a single CPU
being totally starved by N others (to the point where lockup timers
would kick in).
As an aside, these locks you propose are also a lot more costly in
the contended path. 4 vs 1 atomic operations on the lock cacheline
is not so great.
> But it also means that all the contended spinners get the lock in
> whatever order the system decides to give it to them, rather than
> imposing a strict order.
The exact problem is that the system actively does the wrong thing
when you allow it to decide.
Unlike simple cacheline access, I don't believe it is such a good
idea to batch up locks many times on the same CPU for example. While
it surely could improve performance in specific situations, I think
that if code is taking and releasing a lock many times, then it most
probably should be either reworked to hold the lock for longer, or
changed completely. And once locks become *really* contended, then
the cost of moving the critical section to another CPU is really
drowned out by the cost of contention itself (all the idle time,
and taking/releasing the lock cacheline).
So far my theory has held up (except for virtualized systems).
On Tuesday 08 July 2008 01:56, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > Although, you wouldn't need to oversubscribe physical CPUs to hit
> > suboptimal behaviour.
> >
> > Basically, I just ask for performance improvement to be measured
> > with some "realistic" configuration, then it should be easier to
> > justify.
>
> Overcommitting cpus system-wide is pretty common.
(yes, sorry I meant: oversubscribing with a single guest)
Nick Piggin wrote:
> On Tuesday 08 July 2008 06:14, Jeremy Fitzhardinge wrote:
>
>
>> The other point, of course, is that ticket locks are massive overkill
>> for the problem they're trying to solve.
>>
>
> No they aren't.
>
>
>
>> It's one thing to introduce an
>> element of fairness into spinlocks, it's another to impose strict FIFO
>> ordering. It would be enough to make the locks "polite" by preventing a
>> new lock-holder from taking the lock while it's under contention.
>> Something like:
>>
>> union lock {
>> unsigned short word;
>> struct { unsigned char lock, count; };
>> };
>>
>> spin_lock: # ebx - lock pointer
>> movw $0x0001, %ax # add 1 to lock, 0 to count
>> lock xaddw %ax, (%ebx) # attempt to take lock and test user count
>> testw %ax,%ax
>> jnz slow
>>
>> taken: ret
>>
>> # slow path
>> slow: lock incb 1(%ebx) # inc count
>>
>> 1: rep;nop
>> cmpb $0,(%ebx)
>> jnz 1b # wait for unlocked
>>
>> movb $1,%al # attempt to take lock (count already increased)
>> xchgb %al,(%ebx)
>> testb %al,%al
>> jnz 1b
>>
>> lock decb 1(%ebx) # drop count
>> jmp taken
>>
>> spin_unlock:
>> movb $0,(%ebx)
>> ret
>>
>>
>> The uncontended fastpath is similar to the pre-ticket locks, but it
>> refuses to take the lock if there are other waiters, even if the lock is
>> not currently held. This prevents the rapid lock-unlock cycle on one
>> CPU from starving another CPU, which I understand was the original
>> problem ticket locks were trying to solve.
>>
>
> They prevent lots of unfairness and starvation problems. The most
> prominent one (ie. actually observed in Linux) was a single CPU
> being totally starved by N others (to the point where lockup timers
> would kick in).
>
Yep. My understanding was that the specific case was that cpu A was
repeatedly taking and releasing the lock, while other cpus spin waiting
for it, and that the cache coherence logic kept the cacheline owned by A
(presumably because it kept modifying it). Ticket locks work well in
this case because, as you say, they enforce a fairness policy that the
hardware doesn't implement. Are there other cases that ticket locks
help with? Does the algorithm above solve the starvation issue?
> As an aside, these locks you propose are also a lot more costly in
> the contended path. 4 vs 1 atomic operations on the lock cacheline
> is not so great.
>
Yep, that's not great. But it doesn't bounce cache lines around as much
either, so perhaps it doesn't make much difference.
But really, I was being my own devil's advocate, to see if there's some
other lock algorithm which satisfies both the normal ticket-lock case
and the virtualization case.
I have no real objections to ticket locks, so long as I can turn them off ;)
J
On Tuesday 08 July 2008 15:57, Jeremy Fitzhardinge wrote:
> Nick Piggin wrote:
> > They prevent lots of unfairness and starvation problems. The most
> > prominent one (ie. actually observed in Linux) was a single CPU
> > being totally starved by N others (to the point where lockup timers
> > would kick in).
>
> Yep. My understanding was that the specific case was that cpu A was
> repeatedly taking and releasing the lock, while other cpus spin waiting
> for it, and that the cache coherence logic kept the cacheline owned by A
> (presumably because it kept modifying it).
In that case, the CPUs I've tested seem to do a reasonable job of
avoiding hogging the lock. And they all obviously make something of
an attempt at fairness (and starvation avoidance) for simple cacheline access...
But spinlocks require cacheline access *at exactly the right time*
in order to make progress. So simple cacheline fairness can break
down quite easily.
> Ticket locks work well in
> this case because, as you say, they enforce a fairness policy that the
> hardware doesn't implement. Are there other cases that ticket locks
> help with?
Well just starvation and fairness mainly.
Interestingly, it actually has some better scalability properties under
high contention sometimes (because the lock acquisition slowpath requires
no further stores to the cacheline).
If a single CPU is allowed to take and release a contended lock
multiple times, I don't consider that in itself a bad property of
a lock implementation. I just reasoned that (maybe a little
counterintuitively) it isn't such an important thing to have either...
> Does the algorithm above solve the starvation issue?
I don't believe so because there is still no guarantee you get out of
the slowpath (it's fairly similar to old spinlocks in that regard).
> > As an aside, these locks you propose are also a lot more costly in
> > the contended path. 4 vs 1 atomic operations on the lock cacheline
> > is not so great.
>
> Yep, that's not great. But it doesn't bounce cache lines around as much
> either, so perhaps it doesn't make much difference.
>
> But really, I was being my own devil's advocate, to see if there's some
> other lock algorithm which satisfies both the normal ticket-lock case
> and the virtualization case.
Yeah, I'm not sure. I expect it would be hard to match the ticket lock's
simplicity and guarantees.
> I have no real objections to ticket locks, so long as I can turn them off
> ;)
Sure. Btw. I have no problems with your patchset, but if SMP guests
are seriously used on x86, just remember the fairness issue. At some
point you might find you'll need an even more advanced lock for Xen.
Nick Piggin wrote:
> Sure. Btw. I have no problems with your patchset, but if SMP guests
> are seriously used on x86, just remember the fairness issue. At some
> point you might find you'll need an even more advanced lock for Xen.
>
Perhaps. The spin lock has a bounded number of iterations before it
goes into the vcpu-blocking path. At that point, it's no longer subject
to the whims of the cache coherency protocol, and is explicitly woken.
But it may still have issues when competing with a rapid lock-unlocker.
A ticket-lock variant with some kind of directed yield might be the way
to go if it turns out to be a problem.
J