On 5/26/05, john cooper <[email protected]> wrote:
> given design. Clearly we aren't buying anything to trade off
> a spinlock protecting the update of a single pointer with a
> blocking lock and associated context switching.
Thinking out loud here; I doubt I'm covering any new ground.
Just curious, are there any statistics about how often spinlocks
are spinning? A spinlock only has to wait when there is actual
contention for the resource, right? If that is a very rare thing, the
performance difference is divided by the chance of the resource
being in contention, right? So if the normal, uncontended case,
in an SMP system, is, check something visible to all CPUs, set that
thing, do the protected thing, unset the visible thing, continue on, and
the setting/unsetting is the same speed, there would be no difference
in performance, except when there is actual contention?
If there were fifty CPUs and two hundred threads, all occasionally needing the
same resource protected by the locking scheme, even if that resource is
something very small, and if the section of the program that uses the
resource uses it a dozen times, locking and unlocking each time, it seems like
queueing theory could be applied to determine the factors that would minimize
lock contention.
On contention, and only on contention, we are faced with the question of what
to do. Do we wait, or do we go away and come back later? What information is
available to us to base the decision on? We can't gather any more information,
because that would take longer than spin-waiting. If the "spinaphore" told us,
on acquisition failure, how many other threads were asking for it, we
could implement
a tunable lock, that surrenders context when there are more than N
threads waiting for
the resource, and that otherwise waits its turn, or its chance, as a compromise
and synthesis.
Of course we'd then have to spin while waiting our turn to increment (and
decrement, when we gave up) the spinaphore, unless we have machine code to
do that without frying the memory bus too severely, which we probably do in
any machine architecture that is designed to support SMP, so that should not
be a problem. So: one cache fault to check a global lock state, which we
can't avoid -- or does the hardware take care of writing back? I bet it
does -- so, presuming no contention, we breeze through and hit the memory
bus twice, at lock and unlock, in any lock scheme.
The spinaphore would work something like this:
int try_and_obtain (spinaphore_t *S, void **EntryPoint)
returns zero on acquisition, or an integer in which the low bits
indicate the number of threads waiting for it, including you, and
the high bits indicate which wait slot your non-null EntryPoint is
registered in.
void abandon(spinaphore_t *S, int slot)
to notify the spinaphore that you no longer wish to receive a notification
when it is your turn for the resource
int try_and_obtain_or_abandon (spinaphore_t *S)
doesn't register anything with S if we can't get it
void release(spinaphore_t *S)
to be called only on spinaphores you hold; causes selection of the
next registered EntryPoint and a message sent to that CPU -- an interrupt of
some kind -- to resume at the right place in the context that was waiting for
the resource. (This needs to retry until the message, if any, gets through.)
void requeue(spinaphore_t, entry_point)
when a CPU receives an interrupt concerning an available spinaphore that
had been requested by a process with affinity to that CPU, it might do this
to the spinaphore instead of giving control back to the thread. This would
entail putting the entry_point at the back of the line for the resource and
sending the resource_available message to a process on a different CPU. This
might best work as a response -- "NOT NOW" -- to a resource-available message,
instead of as something more involved that the receiving CPU would have to do,
leaving the CPU that is trying to release the spinaphore free to try to
release it to someone else, which might again turn into spinning, but spinning
through the other CPUs instead of retesting a bit.
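Pulled together as a rough header sketch (the types, the slot encoding, and
the entry_point type are only illustrative; nothing here is settled):

/* Illustrative declarations only, restating the prose above. */
typedef struct spinaphore spinaphore_t;

/* Returns zero on success.  On failure the low bits hold the number of
 * threads waiting (including the caller) and the high bits hold the wait
 * slot in which the non-null EntryPoint was registered. */
int try_and_obtain(spinaphore_t *S, void **EntryPoint);

/* Withdraw a previously registered EntryPoint from wait slot 'slot'. */
void abandon(spinaphore_t *S, int slot);

/* Like try_and_obtain(), but registers nothing with S on failure. */
int try_and_obtain_or_abandon(spinaphore_t *S);

/* Only for spinaphores you hold: pick the next registered EntryPoint and
 * notify that CPU, retrying until the message (if any) gets through. */
void release(spinaphore_t *S);

/* Push a just-notified EntryPoint to the back of the line, e.g. when the
 * receiving CPU is too busy to resume the waiter right now. */
void requeue(spinaphore_t *S, void *entry_point);
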
So, some code would attempt try_and_obtain_or_abandon a few times,
then would register itself with the spinaphore, and yield its CPU. When the
message comes from the spinaphore that it is now this code's turn, the CPU
would either requeue the entry point if it is really busy --
interrupting an interrupt
or something like that -- or switch context back to whatever had registered with
the spinaphore.
Questions: Just how many CPUs, and how busy of a resource, do you have
to have for this kind of thing to work better than spinning, and what kind of
resource would it be protecting? How important is affinity?
Looking at the linux/spinlock.h file from 2.6.11, there are a lot of
flavors of lock to
choose between already. What's one or two or ten more?
On May 27, 2005, at 18:31:38, David Nicol wrote:
> On 5/26/05, john cooper <[email protected]> wrote:
>
>> given design. Clearly we aren't buying anything to trade off
>> a spinlock protecting the update of a single pointer with a
>> blocking lock and associated context switching.
>
> On contention, and only on contention, we are faced with the
> question of what
> to do. Do we wait, or do we go away and come back later? What
> information is
> available to us to base the decision on? We can't gather any more
> information,
> because that would take longer than spin-waiting. If the
> "spinaphore" told us,
> on acquisition failure, how many other threads were asking for it, we
> could implement
> a tunable lock, that surrenders context when there are more than N
> threads waiting for
> the resource, and that otherwise waits its turn, or its chance, as
> a compromise
> and synthesis.
Here is an example naive implementation which could perhaps be
optimized further
for architectures based on memory and synchronization requirements.
A quick summary:
Each time the lock is taken and released, a "hold_time" is updated
which indicates
the average time that the lock is held. During contention, each CPU
checks the
current average hold time and the number of CPUs waiting against a
predefined
"context switch + useful work" time, and goes to sleep if it thinks
it has enough
time to spare.
Problems:
You can't nest these. You also can't take a normal semaphore inside
one. The
only usable locking order for these is:
..., semaphore, semaphore, spinaphore, spinlock, spinlock, ...
Possible Solution:
If you had a reliable way of determining when it is safe to sleep,
you could call
a "cond_resched_if_nonatomic()" function instead of cond_resched()
and allow
nesting of spinaphores within each other and within spinlocks. I
_do_ implement a
spinaphore_lock_atomic which is guaranteed not to sleep and could be
used within
other locks instead.
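As a rough sketch of that idea (the body is only a guess, assuming the
preempt count and IRQ state are enough to decide; a real version would have
to account for held locks exactly):

/* Hypothetical helper: only yield when it is actually safe to sleep. */
static inline void cond_resched_if_nonatomic(void)
{
	if (!in_atomic() && !irqs_disabled())
		cond_resched();
}
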
struct spinaphore {
	atomic_t	queued;
	atomic_t	hold_time;
	spinlock_t	spinlock;
	unsigned long	acquire_time;
};

void spinaphore_lock (struct spinaphore *sph) {
	unsigned long start_time = fast_monotonic_count();
	int queue_me = 1;

	while (unlikely(!spin_trylock(&sph->spinlock))) {
		/* Get the queue count (and ensure we're queued in the process) */
		unsigned int queued = queue_me ?
				atomic_inc_return(&sph->queued) :
				atomic_get(&sph->queued);
		queue_me = 0;

		/* Figure out if we should switch away */
		if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
				( queued*atomic_get(&sph->hold_time) -
				  fast_monotonic_count() - start_time
				))) {
			/* Remove ourselves from the wait pool (remember to re-add later) */
			atomic_dec(&sph->queued);
			queue_me = 1;

			/* Go to sleep */
			cond_resched();
		}
	}

	/* Dequeue ourselves and update the acquire time */
	atomic_dec(&sph->queued);
	sph->acquire_time = fast_monotonic_count();
}

void spinaphore_lock_atomic (struct spinaphore *sph) {
	/* Let the other processes know what we're doing */
	atomic_inc(&sph->queued);

	/* Just get the lock the old fashioned way */
	spin_lock(&sph->spinlock);

	/* Dequeue ourselves and update the acquire time */
	atomic_dec(&sph->queued);
	sph->acquire_time = fast_monotonic_count();
}

int spinaphore_trylock (struct spinaphore *sph) {
	/* Try to get the lock, and if so, update the acquire time */
	if (spin_trylock(&sph->spinlock)) {
		sph->acquire_time = fast_monotonic_count();
		return 1;
	}
	return 0;
}

void spinaphore_unlock (struct spinaphore *sph) {
	/* Update the running average hold time */
	atomic_set(&sph->hold_time, (4*atomic_get(&sph->hold_time) +
			(fast_monotonic_count() - sph->acquire_time))/5);

	/* Actually unlock the spinlock */
	spin_unlock(&sph->spinlock);
}
Cheers,
Kyle Moffett
David Nicol wrote:
> So if the normal, uncontended case,
> in a SMP system, is, check something visible to all CPUs, set that
> thing, do the protected thing, unset the visible thing, continue on, and
> the setting/unsetting is the same speed, there would be no difference
> in performance, except when there is actual contention?
Yes.
> On contention, and only on contention, we are faced with the question of what
> to do. Do we wait, or do we go away and come back later? What information is
> available to us to base the decision on? We can't gather any more information,
> because that would take longer than spin-waiting. If the "spinaphore" told us,
> on acquisition failure, how many other threads were asking for it, we
> could implement
> a tunable lock, that surrenders context when there are more than N
> threads waiting for
> the resource, and that otherwise waits its turn, or its chance, as a compromise
> and synthesis.
The trick is making the correct spin/block decision
every time, as quickly as possible, with as little
information as possible. As that ideal is probably
not attainable, partial solutions such as the adaptive
mutex have been used, where the spin/block decision is
made upon simple heuristics such as whether the lock
owner is currently running on another cpu.
Pushing lock acquisition policy back to the lock user is
probably going to cause more confusion/misuse/bugs than
it will improve overall performance. In any case it is unlikely
to become popular in general usage and consequently won't
have the opportunity to attempt its goal.
> So, some code would attempt try_and_obtain_or_abandon a few times,
> then would register itself with the spinaphore, and yield its CPU. When the
> message comes from the spinaphore that it is now this code's turn, the CPU
> would either requeue the entry point if it is really busy --
> interrupting an interrupt
> or something like that -- or switch context back to whatever had registered with
> the spinaphore.
I think you may have answered your own question above.
Optimizing for the contended case is likely past the
point of diminishing returns. The effort may be better
spent on avoiding contention in the first place
through partitioning, reader/writer semantics, etc..
After all a lock by definition is designed to effect
synchronization by introducing delay.
> Looking at the linux/spinlock.h file from 2.6.11, there are a lot of
> flavors of lock to
> choose between already. What's one or two or ten more?
I'd hazard that was more a case of sprawling evolution
rather than a conscious design decision.
-john
--
[email protected]
One thing you are forgetting is that we are not just talking about the
latencies of contention. We are also talking about the latency from when a
high priority process wakes up to the time it runs. Most of the time
a spin lock stops preemption, either with preempt_disable (CONFIG_PREEMPT)
or by simply turning off interrupts. With Ingo's mutexes,
the places with spin_locks are now preemptable. So there are probably
lots of times when it would be better to just spin on contention, but
that's not what Ingo's spin_locks are buying us; the point is to keep most
of the kernel preemptable.
The priority inheritance of spin_locks is simply there to protect from
priority inversion.
-- Steve
Looks like a description of buzz locks used by various user space
libraries - spin briefly then assume the other cause of contention is
slow (or in user space that we pre-empted it) and be polite.
Alan
On 5/27/05, Kyle Moffett <[email protected]> wrote:
> Here is an example naive implementation which could perhaps be
> optimized further
> for architectures based on memory and synchronization requirements.
Fantastic! I have done some slight edits over what are probably typos.
> A quick summary:
> Each time the lock is taken and released, a "hold_time" is updated
> which indicates
> the average time that the lock is held. During contention, each CPU
> checks the
> current average hold time and the number of CPUs waiting against a
> predefined
per spinaphore instance -- don't know what these are protecting
exactly at lib compile time
> "context switch + useful work" time, and goes to sleep if it thinks
> it has enough
> time to spare.
>
> Problems:
> You can't nest these. You also can't take a normal semaphore inside
> one. The
> only useable locking order for these is:
> ..., semaphore, semaphore, spinaphore, spinlock, spinlock, ...
I don't see why very careful nesting wouldn't work. Because you could
get the count up on a locked-out lock? The problems of VMS
asynchronous traps :)
The outer ones would have higher hold times than the inner ones.
> Possible Solution:
> If you had a reliable way of determining when it is safe to sleep,
> you could call
> a "cond_resched_if_nonatomic()" function instead of cond_resched()
> and allow
> nesting of spinaphores within each other and within spinlocks. I
> _do_ implement a
> spinaphore_lock_atomic which is guaranteed not to sleep and could be
> used within
> other locks instead.
>
> struct spinaphore {
> atomic_t queued;
> atomic_t hold_time;
> spinlock_t spinlock;
> unsigned long acquire_time;
unsigned long acceptable_wait_time; /* dynamic tuning */
> };
>
> void spinaphore_lock (struct spinaphore *sph) {
> unsigned long start_time = fast_monotonic_count();
> int queue_me = 1;
int contention = 0; /* see below */
> until (likely(spin_trylock(&sph->spinlock))) {
contention = 1;
>
> /* Get the queue count (And ensure we're queued in the
> process) */
> unsigned int queued = queue_me ?
> atomic_inc_return(&sph->queued) :
> queued = atomic_get(&sph->queued);
> queue_me = 0;
>
> /* Figure out if we should switch away */
> if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
> ( queued*atomic_get(&sph->hold_time) -
> fast_monotonic_count() - start_time
We could subtract the average lock-held time from the time that
the current lock has been held, to get an expected time until
the lock becomes free, so we only try spinning when the current
holder of the lock is nearly done. Hmm, what other metrics would
be easy to gather?
> ))) {
> /* Remove ourselves from the wait pool (remember to re-
> add later) */
> atomic_dec(&sph->queued);
> queue_me = 1;
>
> /* Go to sleep */
> cond_resched();
> }
> }
>
> /* Dequeue ourselves and update the acquire time */
> atomic_dec(&sph->queued);
if(contention)atomic_dec(&sph->queued);
when there was no contention we didn't increment.
> sph->acquire_time = fast_monotonic_count();
> }
[snip]
> void spinaphore_unlock (struct spinaphore *sph) {
> /* Update the running average hold time */
> atomic_set(&sph->hold_time, (4*atomic_get(&sph->hold_time) +
> (fast_monotonic_count() - sph->acquire_time))/5);
These don't need to be atomic functions, since we haven't released
the lock yet, or is there a risk that nonatomic gets and sets will get
deferred? No, I'm sorry, atomic_[get|set] pertains to operations on
atomic_t data, is that correct?
> /* Actually unlock the spinlock */
> spin_unlock(&sph->spinlock);
> }
>
> Cheers,
> Kyle Moffett
Is there a schedule-that-function-next call? The spinaphore idea is that
instead of simply yielding until later (cond_resched) we register ourselves
with the sph object, with a linked list, an actual queue instead of a count
of queued threads -- and at unlocking time, if there's a queue, the head of
the line gets the service next. Which would scale to a lot of CPUs, still with
a spinlock around the setting of the head-of-line pointer.
I guess I need to look at Ingo's mutexes before wasting more of everyone's time
David L Nicol
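A sketch of the "spin only when the holder is nearly done" test suggested
above, reusing the thread's struct spinaphore and its atomic_get() /
fast_monotonic_count() names; the helper and its use of the context-switch
threshold are illustrative only:

/* Estimate how much longer the current holder needs; spin only if that
 * estimate is cheaper than a context switch.  Racy reads are tolerable
 * here because this is just a heuristic. */
static int worth_spinning(struct spinaphore *sph)
{
	unsigned long held_so_far = fast_monotonic_count() - sph->acquire_time;
	unsigned long avg_hold    = atomic_get(&sph->hold_time);
	unsigned long remaining   = (held_so_far < avg_hold) ?
					avg_hold - held_so_far : 0;

	return remaining < CONFIG_SPINAPHORE_CONTEXT_SWITCH;
}
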
Kyle Moffett writes:
[...]
>
> struct spinaphore {
> atomic_t queued;
> atomic_t hold_time;
> spinlock_t spinlock;
> unsigned long acquire_time;
> };
>
> void spinaphore_lock (struct spinaphore *sph) {
> unsigned long start_time = fast_monotonic_count();
fast_monotonic_count() should be per-cpu, otherwise spinaphore_lock()
would require two atomic operations in the best case (and be twice as
expensive as a spin-lock). A per-cpu counter is OK, as long as the thread is
not allowed to schedule with the spinaphore held.
> int queue_me = 1;
> until (likely(spin_trylock(&sph->spinlock))) {
>
> /* Get the queue count (And ensure we're queued in the
> process) */
> unsigned int queued = queue_me ?
> atomic_inc_return(&sph->queued) :
> queued = atomic_get(&sph->queued);
> queue_me = 0;
>
> /* Figure out if we should switch away */
> if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
> ( queued*atomic_get(&sph->hold_time) -
> fast_monotonic_count() - start_time
> ))) {
> /* Remove ourselves from the wait pool (remember to re-
> add later) */
> atomic_dec(&sph->queued);
> queue_me = 1;
>
> /* Go to sleep */
> cond_resched();
> }
> }
>
> /* Dequeue ourselves and update the acquire time */
> atomic_dec(&sph->queued);
atomic_dec() should only be done if atomic_inc_return() above was, i.e.,
not in contentionless fast-path, right?
[...]
>
> void spinaphore_unlock (struct spinaphore *sph) {
> /* Update the running average hold time */
> atomic_set(&sph->hold_time, (4*atomic_get(&sph->hold_time) +
> (fast_monotonic_count() - sph->acquire_time))/5);
>
> /* Actually unlock the spinlock */
> spin_unlock(&sph->spinlock);
> }
It is not good that unlock requires an additional atomic operation. Why
is ->hold_time atomic in the first place? It is only updated by the lock
holder, and as it is approximate statistics anyway, non-atomic reads in
spinaphore_lock() would be fine.
>
> Cheers,
> Kyle Moffett
Nikita.
On Fri, 27 May 2005 21:04:37 -0400, Kyle Moffett <[email protected]>
wrote:
> On May 27, 2005, at 18:31:38, David Nicol wrote:
>> On 5/26/05, john cooper <[email protected]> wrote:
>>
>>> given design. Clearly we aren't buying anything to trade off
>>> a spinlock protecting the update of a single pointer with a
>>> blocking lock and associated context switching.
>>
>> On contention, and only on contention, we are faced with the question
>> of what
>> to do. Do we wait, or do we go away and come back later? What
>> information is
>> available to us to base the decision on? We can't gather any more
>> information,
>> because that would take longer than spin-waiting. If the "spinaphore"
>> told us,
>> on acquisition failure, how many other threads were asking for it, we
>> could implement
>> a tunable lock, that surrenders context when there are more than N
>> threads waiting for
>> the resource, and that otherwise waits its turn, or its chance, as a
>> compromise
>> and synthesis.
>
> Here is an example naive implementation which could perhaps be optimized
> further
> for architectures based on memory and synchronization requirements.
>
> A quick summary:
> Each time the lock is taken and released, a "hold_time" is updated which
> indicates
> the average time that the lock is held. During contention, each CPU
> checks the
> current average hold time and the number of CPUs waiting against a
> predefined
> "context switch + useful work" time, and goes to sleep if it thinks it
> has enough
> time to spare.
If you went with a bakery algorithm and could tolerate FIFO service order,
you could use
the expected service time as the ticket increment value instead of 1.
Before a thread
gets a ticket, it examines the expected queue wait time, the difference
between the
current ticket and the next available ticket, to decide which increment to
be applied
to the next ticket value. The two possible increment values would be the
uncontended
resource service time and that value plus thread suspend/resume
overhead. If the
expected wait time is greater than the latter, it uses the latter as the
increment value
and suspends rather than spins.
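A rough sketch of that variable-increment ticket idea (the calibration
constants are made up, the fields are assumed zero-initialised, and memory
ordering details are glossed over; it only illustrates the scheme, not a
complete lock):

#define SERVICE			100	/* uncontended hold time, in counter units */
#define SERVICE_PLUS_SWITCH	(SERVICE + 1000)	/* plus suspend/resume cost */

struct ticket_sph {
	atomic_t     next_ticket;	/* next ticket to hand out */
	unsigned int current_ticket;	/* ticket now being served */
	unsigned int holder_increment;	/* increment chosen by the current holder */
};

void ticket_sph_lock(struct ticket_sph *t)
{
	unsigned int wait = (unsigned int)atomic_read(&t->next_ticket) -
				t->current_ticket;
	unsigned int inc  = (wait > SERVICE_PLUS_SWITCH) ?
				SERVICE_PLUS_SWITCH : SERVICE;
	unsigned int mine = (unsigned int)atomic_add_return(inc, &t->next_ticket)
				- inc;

	while (t->current_ticket != mine) {
		if (inc == SERVICE_PLUS_SWITCH)
			cond_resched();	/* expected wait is long: yield */
		else
			cpu_relax();	/* expected wait is short: spin */
	}
	t->holder_increment = inc;	/* only the holder writes this */
}

void ticket_sph_unlock(struct ticket_sph *t)
{
	/* No interlocked operation here: only the holder advances the ticket. */
	t->current_ticket += t->holder_increment;
}

The FIFO handoff and the staggered interlocked increments fall out of the
ticket structure itself; each waiter's turn arrives when the current ticket
has advanced by exactly the increments of those ahead of it.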
Bakery algorithms have other nice properties. The lock release
(incrementing the
current ticket) doesn't require an interlocked operation, though the
release memory
barrier and other memory barriers required to determine if there are any
waiters may
make that somewhat moot. The next and current tickets could be kept in
separate
cache lines. And the get ticket interlocked operations are staggered
out, unlike a
conventional spin lock where once the lock is released, *all* waiting
processors
attempt to acquire the lock cache line exclusive all at once. This slows down
the lock release since the cache line is guaranteed to be held by another
processor
if there was lock contention.
Also bakery spin locks can be rather easily modified to be rwlocks with no
extra
overhead to speak of. Basically, you get rwlock capability for free.
--
Joe Seigh
On May 29, 2005, at 01:25:15, David Nicol wrote:
> On 5/27/05, Kyle Moffett <[email protected]> wrote:
>> "context switch + useful work" time, and goes to sleep if it thinks
>> it has enough
>> time to spare.
>>
>> Problems:
>> You can't nest these. You also can't take a normal semaphore inside
>> one. The
>> only useable locking order for these is:
>> ..., semaphore, semaphore, spinaphore, spinlock, spinlock, ...
>>
>
> I don't see why very careful nesting wouldn't work. Because you
> could get the count up on a locked-out lock? The problems of VMS
> asynchronous traps :) the outer ones would have higher hold times
> than the inner ones.
No, more due to the nature of sleeping while holding a spinlock :-D
Under my current implementation, I use the literal spinlock code,
which disables preemption, etc. If someone were to use a semaphore
or a normal spinaphore_lock() (vs spinaphore_lock_atomic()) within
a spinaphore, it would BUG("scheduling while atomic").
>> struct spinaphore {
>> atomic_t queued;
>> atomic_t hold_time;
>> spinlock_t spinlock;
>> unsigned long acquire_time;
> unsigned long acceptable_wait_time; /* dynamic tuning */
Uhh, the "acceptable_wait_time" == hold_time, which must be an
atomic_t in the naive C implementation, so that it can be
properly re-read in each loop without getting cached in a register
or something.
>> };
>>
>> void spinaphore_lock (struct spinaphore *sph) {
>> unsigned long start_time = fast_monotonic_count();
>> int queue_me = 1;
>>
>> until (likely(spin_trylock(&sph->spinlock))) {
>>
>> /* Get the queue count (And ensure we're queued in the
>> process) */
>> unsigned int queued = queue_me ?
>> atomic_inc_return(&sph->queued) :
>> queued = atomic_get(&sph->queued);
>> queue_me = 0;
>>
>> /* Figure out if we should switch away */
>> if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
>> ( queued*atomic_get(&sph->hold_time) -
>> fast_monotonic_count() - start_time
>>
>
> we could subtract the average lock-held time from the time that
> the current lock has been held to find an expected time until
> the lock becomes free, so we only try spinning when the current
> holder of the lock is nearly done. Hmm what other metrics would
> be easy to gather?
Oops, it should be this:
CONFIG_SPINAPHORE_CONTEXT_SWITCH < queued * atomic_get(&sph->hold_time)
Basically, the queued * hold_time term is 2 * average_wait_time (because
we're only going to wait for 1/2 of those holders to finish, on average),
so if spinning would cost as much on average (from now, with the current
queued and hold_time) as switching away to do useful work, then go off
and do something useful.
>> ))) {
>> /* Remove ourselves from the wait pool (remember to re-
>> add later) */
>> atomic_dec(&sph->queued);
>> queue_me = 1;
>>
>> /* Go to sleep */
>> cond_resched();
>> }
>> }
>>
>> /* Dequeue ourselves and update the acquire time */
>> atomic_dec(&sph->queued);
> if(contention)atomic_dec(&sph->queued);
>
> when there was no contention we didn't increment.
Ah, yeah. How about removing the "contention" variable and using this:
if (!queue_me) atomic_dec(&sph->queued);
>> sph->acquire_time = fast_monotonic_count();
>> }
>>
>
>> void spinaphore_unlock (struct spinaphore *sph) {
>> /* Update the running average hold time */
>> atomic_set(&sph->hold_time, (4*atomic_get(&sph->hold_time) +
>> (fast_monotonic_count() - sph->acquire_time))/5);
>>
>
> These don't need to be atomic functions, since we haven't released
> the lock yet, or is there a risk that nonatomic gets and sets will get
> deferred? no I'm sorry atomic_[get|set] pertains to operations on
> atomic_t data is that correct?
Yeah. In the lock functions, we read the data atomically _before_ we've
obtained the lock, so here we must use atomic get/set in order to modify
that data (Because it's in an atomic_t structure).
>
>> /* Actually unlock the spinlock */
>> spin_unlock(&sph->spinlock);
>> }
>>
>> Cheers,
>> Kyle Moffett
>
> is there a schedule-that-function-next call? The spinaphore idea
> is that
> instead of simply yielding until later (cond_resched) we register
> ourselves
> with the sph object, with a linked list, an actual queue instead of
> a count
> of queued threads -- and at unlocking time, if there's a queue, the
> head of
> the line gets the service next. Which would scale to a lot of
> CPUs, still with
> a spinlock around the setting of the head-of-line pointer.
Yeah, that could be a next level implementation more in line with
what Ingo has
written already.
Cheers,
Kyle Moffett
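A very rough sketch of that queued "next level" variant: waiters that decide
to sleep add themselves to a list protected by its own lock, and the releaser
wakes the head of the line. Only the release path is shown, and all names
here are made up:

struct spinaphore_waiter {
	struct list_head    list;
	struct task_struct *task;
};

struct queued_spinaphore {
	spinlock_t       spinlock;	/* the lock actually protecting the data */
	spinlock_t       wait_lock;	/* protects the waiter list below */
	struct list_head waiters;	/* FIFO of sleeping waiters */
};

void queued_spinaphore_unlock(struct queued_spinaphore *sph)
{
	struct spinaphore_waiter *w = NULL;

	/* Pull the head of the line, if anyone is sleeping. */
	spin_lock(&sph->wait_lock);
	if (!list_empty(&sph->waiters)) {
		w = list_entry(sph->waiters.next,
			       struct spinaphore_waiter, list);
		list_del(&w->list);
	}
	spin_unlock(&sph->wait_lock);

	spin_unlock(&sph->spinlock);
	if (w)
		wake_up_process(w->task);	/* head of the line goes next */
}
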
On May 29, 2005, at 04:42:43, Nikita Danilov wrote:
> Kyle Moffett writes:
>
> [...]
>
>
>>
>> struct spinaphore {
>> atomic_t queued;
>> atomic_t hold_time;
>> spinlock_t spinlock;
>> unsigned long acquire_time;
>> };
>>
>> void spinaphore_lock (struct spinaphore *sph) {
>> unsigned long start_time = fast_monotonic_count();
>>
>
> fast_monotonic_count() should be per-cpu, otherwise spinaphore_lock()
> would require two atomic operations in the best case (and be twice as
> expensive as a spin-lock). Per-cpu counter is OK, as long as thread is
> not allowed to schedule with spinaphore held.
Absolutely. I agree on all points (and that's why I added the
spinaphore_lock_atomic() function, so that they can be nested a little bit).
>
>> int queue_me = 1;
>> until (likely(spin_trylock(&sph->spinlock))) {
>>
>> /* Get the queue count (And ensure we're queued in the
>> process) */
>> unsigned int queued = queue_me ?
>> atomic_inc_return(&sph->queued) :
>> queued = atomic_get(&sph->queued);
>> queue_me = 0;
>>
>> /* Figure out if we should switch away */
>> if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
>> ( queued*atomic_get(&sph->hold_time) -
>> fast_monotonic_count() - start_time
>> ))) {
>> /* Remove ourselves from the wait pool (remember to re-
>> add later) */
>> atomic_dec(&sph->queued);
>> queue_me = 1;
>>
>> /* Go to sleep */
>> cond_resched();
>> }
>> }
>>
>> /* Dequeue ourselves and update the acquire time */
>> atomic_dec(&sph->queued);
>>
>
> atomic_dec() should only be done if atomic_inc_return() above was,
> i.e.,
> not in contentionless fast-path, right?
Oops, agreed, see other email for a fix.
>>
>> void spinaphore_unlock (struct spinaphore *sph) {
>> /* Update the running average hold time */
>> atomic_set(&sph->hold_time, (4*atomic_get(&sph->hold_time) +
>> (fast_monotonic_count() - sph->acquire_time))/5);
>>
>> /* Actually unlock the spinlock */
>> spin_unlock(&sph->spinlock);
>> }
>>
>
> It is not good that unlock requires additional atomic operation. Why
> ->hold_time is atomic in the first place? It is only updated by the
> lock
> holder, and as it is approximate statistics anyway, non-atomic
> reads in
> spinaphore_lock() would be fine.
The reason for the atomic reads there is so that the value is updated,
instead of cached in a register.
In an assembly implementation, this would be optimized such that it all
fits in one cacheline and uses a minimum number of atomic operations and
cache flushes. This is a naive C implementation based on existing
primitives.
Cheers,
Kyle Moffett
On May 29, 2005, at 09:29:37, Joe Seigh wrote:
> If you went with a bakery algorithm and could tolerate FIFO service
> order,
> you could use the expected service time as the ticket increment value
> instead of 1. Before a thread gets a ticket, it examines the
> expected queue
> wait time, the difference between the current ticket and the next
> available
> ticket, to decide which increment to be applied to the next ticket
> value.
> The two possible increment values would be the uncontended resource
> service
> time and that value plus thread suspend/resume overhead. If the
> expected
> wait time is greater than the latter, it uses the latter as the
> increment
> value and suspends rather than spins.
Ah, interesting idea. Perhaps we ought to try implementing several of
the
ideas and benchmarking them. I'll work on a user-space operable
version of
my naive spinaphores, as well as an optimized assembly version, if I can
find the time in the next day or so.
Cheers,
Kyle Moffett
Kyle Moffett <[email protected]> writes:
> if (unlikely(CONFIG_SPINAPHORE_CONTEXT_SWITCH <
> ( queued*atomic_get(&sph->hold_time) -
> fast_monotonic_count() - start_time
On many architectures (including popular ones like AMD x86-64)
there is no reliable fast monotonic (1) count unfortunately - except perhaps
jiffies, but that would be far too coarse-grained for your purpose.
-Andi
(1) reliable, fast, monotonic - choose any two instead.
Andi Kleen wrote:
> On many architectures (including popular ones like AMD x86-64)
> there is no reliable fast monotonic (1) count unfortunately
What about rdtsc?
Chris
On Mon, May 30, 2005 at 08:52:13AM -0600, Chris Friesen wrote:
> Andi Kleen wrote:
>
> >On many architectures (including popular ones like AMD x86-64)
> >there is no reliable fast monotonic (1) count unfortunately
>
> What about rdtsc?
It fails the reliable and monotonic tests on AMD; on Intel EM64T machines
it fails the fast test (although the alternatives are even slower), and it
also fails the first ones on a few bigger machines.
-Andi
Andi Kleen wrote:
> On Mon, May 30, 2005 at 08:52:13AM -0600, Chris Friesen wrote:
>>What about rdtsc?
> It fails the reliable and monotonic test on AMD;
tsc goes backwards on AMD? Under what circumstances (I'm curious, since
I'm running one...)
Chris
Chris Friesen <[email protected]> writes:
> Andi Kleen wrote:
>> On Mon, May 30, 2005 at 08:52:13AM -0600, Chris Friesen wrote:
>
>>>What about rdtsc?
>
>> It fails the reliable and monotonic test on AMD;
>
> tsc goes backwards on AMD? Under what circumstances (I'm curious,
> since I'm running one...)
It is not synchronized between CPUs and slowly drifts and can even run
at completely different frequencies there when you use powernow! on an
MP system.
It can be used reliably when you only do deltas on same CPU
and correct for the changing frequency. However then you run
into other problems, like it being quite slow on Intel.
I suspect any attempt to use time stamps in locks is a bad
idea because of this.
Note that at least on modern Intel systems you can just use
MONITOR/MWAIT which is much more efficient, if you are willing to eat
the fake wakeups due to the cache line padding they use.
My impression is that the aggressive bus access avoidance the
original poster was trying to implement is not that useful
on modern systems anyways which have fast busses. Also
it is not even clear it even saves anything; after all the
CPU will always snoop cache accesses for all cache lines
and polling for the EXCLUSIVE transition of the local cache line
is probably either free or very cheap.
Optimizing for the congested case is always a bad idea anyways; in
case lock congestion is a problem it is better to fix the lock. And
when you have a really congested lock that for some reason cannot be
fixed then using some kind of queued lock (which also gives you
fairness on NUMA) is probably a better idea. But again you should
really fix the lock instead; everything else is the wrong answer.
BTW Getting the fairness on the backoff scheme right would have been
probably a major problem.
The thing that seems to tickle CPU vendors much more is to avoid
wasting CPU time on SMT or on Hypervisors while spinning. That can be
all done with much simpler hints (like rep ; nop).
-Andi
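For what it's worth, the kernel already wraps that hint: cpu_relax() is
"rep ; nop" (PAUSE) on x86, so any spinaphore-style spin loop picks it up
simply by spinning the usual way, e.g.:

static void spin_for_lock(spinlock_t *lock)
{
	while (!spin_trylock(lock))
		cpu_relax();	/* polite to SMT siblings and hypervisors */
}
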
On May 30, 2005, at 13:46:35, Andi Kleen wrote:
>> tsc goes backwards on AMD? Under what circumstances (I'm curious,
>> since I'm running one...)
>
> It is not synchronized between CPUs and slowly drifts and can even run
> at completely different frequencies there when you use powernow! on a
> MP system.
Unsynchronized is fine, we're only taking differences of short times.
Slow drift is likewise OK too. As to the different frequencies, how
different are we talking about? Also, on such a system, is there any
way to determine a relative frequency, even if unreliable or
occasionally
off?
> It can be used reliably when you only do deltas on same CPU
> and correct for the changing frequency. However then you run
> into other problems, like it being quite slow on Intel.
The deltas here are generally short, always on the same CPU, and can be
corrected for changing frequency, assuming that said frequency is
available somehow.
Ideally it would be some sort of CPU cycle-counter, just so I can say
"In between lock and unlock, the CPU did 1000 cycles", for some value
of "cycle".
> I suspect any attempt to use time stamps in locks is a bad
> idea because of this.
Something like this could be built only for CPUs that do support that
kind of cycle counter.
> My impression is that the aggressive bus access avoidance the
> original poster was trying to implement is not that useful
> on modern systems anyways which have fast busses. Also
> it is not even clear it even saves anything; after all the
> CPU will always snoop cache accesses for all cache lines
> and polling for the EXCLUSIVE transition of the local cache line
> is probably either free or very cheap.
The idea behind these locks is for bigger systems (8-way or more) for
heavily contended locks (say 32 threads doing write() on the same fd).
In such a system, cacheline snooping isn't practical at the hardware
level, and a lock such as this should be able to send several CPUs to
do other useful work (If there's any to be done) and minimize the
cacheline banging.
> Optimizing for the congested case is always a bad idea anyways; in
> case lock congestion is a problem it is better to fix the lock. And
> when you have a really congested lock that for some reason cannot be
> fixed then using some kind of queued lock (which also gives you
> fairness on NUMA) is probably a better idea. But again you should
> really fix the the lock instead, everything else is the wrong answer.
Good point
> BTW Getting the fairness on the backoff scheme right would have been
> probably a major problem.
Largely this is a fast guess heuristic, so max_time_before_schedule
could be this (and this would even work with priority inheritance):
effective_priority * waiters * avg_wait_time / 10;
> The thing that seems to tickle CPU vendors much more is to avoid
> wasting CPU time on SMT or on Hypervisors while spinning. That can be
> all done with much simpler hints (like rep ; nop).
On such a system, a lock like this would take advantage of all the
properties of normal spinlocks, including any changes like that.
Cheers,
Kyle Moffett
On Mon, May 30, 2005 at 02:04:36PM -0400, Kyle Moffett wrote:
> On May 30, 2005, at 13:46:35, Andi Kleen wrote:
> >>tsc goes backwards on AMD? Under what circumstances (I'm curious,
> >>since I'm running one...)
> >
> >It is not synchronized between CPUs and slowly drifts and can even run
> >at completely different frequencies there when you use powernow! on a
> >MP system.
>
> Unsynchronized is fine, we're only taking differences of short times.
If you ask on two CPUs in a quick succession, you can get a negative
time difference.
> Slow drift is likewise OK too. As to the different frequencies, how
> different are we talking about?
1GHz vs 2GHz, for example.
There is cpufreq, which changes the CPU clocks in large steps under the
kernel's control, and there is spread spectrum, which wiggles them just
a tiny bit (1-4%) back and forth (independently on different CPUs) to
minimize EMI.
> Also, on such a system, is there any way to determine a relative
> frequency, even if unreliable or occasionally off?
You can measure it. With cpufreq you have a good guess when you switch
that the new frequency will have a certain ratio to the old one.
> >It can be used reliably when you only do deltas on same CPU
> >and correct for the changing frequency. However then you run
> >into other problems, like it being quite slow on Intel.
>
> The deltas here are generally short, always on the same CPU, and can be
> corrected for changing frequency, assuming that said frequency is
> available somehow.
The fact the deltas are on the same CPU helps. The shortness of the
interval doesn't, since on AMD CPUs RDTSC is executed speculatively and
out-of-order, and the order of two close RDTSC instructions isn't
guaranteed, as far as I remember.
> Ideally it would be some sort of CPU cycle-counter, just so I can say
> "In between lock and unlock, the CPU did 1000 cycles", for some value
> of "cycle".
This may be doable if precision isn't required.
> >I suspect any attempt to use time stamps in locks is a bad
> >idea because of this.
>
> Something like this could be built only for CPUs that do support that
> kind of cycle counter.
RDTSC on older Intel CPUs takes something like 6 cycles. On P4's it
takes much more, since it's decoded to a microcode MSR access.
--
Vojtech Pavlik
SuSE Labs, SuSE CR
On May 30, 2005, at 14:40:59, Vojtech Pavlik wrote:
> On Mon, May 30, 2005 at 02:04:36PM -0400, Kyle Moffett wrote:
>> Unsynchronized is fine, we're only taking differences of short times.
>
> If you ask on two CPUs in a quick succession, you can get a negative
> time difference.
But not on the same CPU, which is the only thing done in the given
implementation.
>
>> Slow drift is likewise OK too. As to the different frequencies, how
>> different are we talking about?
>
> 1GHz vs 2GHz, for example.
Fine, a ratio of 2 or even probably 4 is ok. If it was a ratio of 10:1
or 20:1, that might be an issue if we had no between-CPU ratio
adjustment,
but otherwise it's just guesstimation, so it's not that important to be
accurate.
>> Also, on such a system, is there any way to determine a relative
>> frequency, even if unreliable or occasionally off?
>
> You can measure it. With cpufreq you have a good guess when you switch
> that the new frequency will have a certain ratio to the old one.
The time periods under consideration are so small (and due to the way
it's used, especially WRT the running average) that if the CPU changes
frequencies in mid-calculation we don't really care about same-CPU
changes, as long as we can usually guess at between-CPU ratios.
>> The deltas here are generally short, always on the same CPU, and
>> can be
>> corrected for changing frequency, assuming that said frequency is
>> available somehow.
>
> The fact the deltas are on the same CPU helps. The shortness of the
> interval doesn't, since on AMD CPUs RDTSC is executed speculatively
> and
> out-of-order, and the order of two close RDTSC instructions isn't
> guaranteed, as far as I remember.
Not exactly that short. There will usually be several hundred
instructions
in between, at _least_, otherwise you'd just use a normal spinlock to
minimize overhead.
>> Ideally it would be some sort of CPU cycle-counter, just so I can say
>> "In between lock and unlock, the CPU did 1000 cycles", for some value
>> of "cycle".
> This may be doable if precision isn't required.
This is a best-guess as to whether the CPU should put the process to
sleep
and go do something else if it can or just poll. If we're wrong either
way it doesn't matter much, we'll effectively default back to the old
dumb behavior.
> RDTSC on older Intel CPUs takes something like 6 cycles. On P4's it
> takes much more, since it's decoded to a microcode MSR access.
Ick! Is there any other kind of fast cycle counter on a P4?
Cheers,
Kyle Moffett
> > >I suspect any attempt to use time stamps in locks is a bad
> > >idea because of this.
> >
> > Something like this could be built only for CPUs that do support that
> > kind of cycle counter.
>
> RDTSC on older Intel CPUs takes something like 6 cycles. On P4's it
> takes much more, since it's decoded to a microcode MSR access.
It actually seems to flush the trace cache, because Intel figured
out that out-of-order RDTSC is probably not too useful (which is right),
and the only way to ensure that on Netburst seems to be to stop the trace
cache in its tracks. That can be pretty slow; we're talking 1000+ cycles
here.
Now on the other hand if you only execute it in the slow path
of a lock it might not be that bad (since the machine should
be pretty synchronized at this point anyways), but still it
is probably not something you want to do often.
-Andi
On Mon, May 30, 2005 at 02:04:36PM -0400, Kyle Moffett wrote:
> >I suspect any attempt to use time stamps in locks is a bad
> >idea because of this.
>
> Something like this could be built only for CPUs that do support that
> kind of cycle counter.
That gets you into a problem with binary distribution kernels.
While binary patching works to some extent, it also becomes
messy pretty quickly.
> >My impression is that the aggressive bus access avoidance the
> >original poster was trying to implement is not that useful
> >on modern systems anyways which have fast busses. Also
> >it is not even clear it even saves anything; after all the
> >CPU will always snoop cache accesses for all cache lines
> >and polling for the EXCLUSIVE transition of the local cache line
> >is probably either free or very cheap.
>
> The idea behind these locks is for bigger systems (8-way or more) for
> heavily contended locks (say 32 threads doing write() on the same fd).
Didn't Dipankar & co just fix that with their latest RCU patchkit?
(assuming you mean the FD locks)
> In such a system, cacheline snooping isn't practical at the hardware
> level, and a lock such as this should be able to send several CPUs to
Why not? Cache snooping always has to work with low overhead, otherwise the
machine is not usefully coherent. I assume that any bigger system
has a cache directory anyways, which should minimize the traffic;
and for smaller setups listening to broadcasts works fine.
-Andi
On May 30, 2005, at 15:28:26, Andi Kleen wrote:
> On Mon, May 30, 2005 at 02:04:36PM -0400, Kyle Moffett wrote:
>>> I suspect any attempt to use time stamps in locks is a bad
>>> idea because of this.
>>
>> Something like this could be built only for CPUs that do support that
>> kind of cycle counter.
>
> That gets you into a problem with binary distribution kernels.
> While binary patching works to some extent, it also becomes
> messy pretty quickly.
Well, a runtime-set function-pointer isn't all that bad. Basically
if a simple cycle-counter driver is available, it would use that,
otherwise it would fall back to ordinary spinlock behavior.
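One way to picture that, purely as a sketch (the selection hook and the stub
are invented names; nothing here is a real interface):

/* With no usable counter the stub returns 0, so hold_time stays 0 and the
 * sleep heuristic never fires: the lock degrades to plain spinning, i.e.
 * ordinary spinlock behaviour. */
static unsigned long no_counter(void)
{
	return 0;
}

static unsigned long (*fast_monotonic_count)(void) = no_counter;

/* A cycle-counter driver, if present, passes in its routine at init time. */
void spinaphore_select_counter(unsigned long (*arch_counter)(void))
{
	if (arch_counter)
		fast_monotonic_count = arch_counter;
}
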
>> The idea behind these locks is for bigger systems (8-way or more) for
>> heavily contended locks (say 32 threads doing write() on the same
>> fd).
>
> Didn't Dipankar & co just fix that with their latest RCU patchkit?
> (assuming you mean the FD locks)
That's lock-free FD lookup (open, close, and read/write to a certain
extent). I'm handling something more like serialization on a fifo
accessed in both user context and interrupt context, which in a
traditional implementation would use a spinlock, but with a spinaphore,
one could do this:
In interrupt context:

	spinaphore_lock_atomic(&fifo->sph);
	put_stuff_in_fifo(fifo,stuff);
	spinaphore_unlock(&fifo->sph);

In user context:

	spinaphore_lock(&fifo->sph);
	put_stuff_in_fifo(fifo,stuff);
	spinaphore_unlock(&fifo->sph);
If there are multiple devices generating interrupts to put stuff in the
fifo and multiple processes also trying to put stuff in the fifo, all
bursting on different CPUs, then the fifo lock would be heavily loaded.
If the system had other things it could be doing while waiting for the
FIFO to become available, then it should be able to do those.
Cheers,
Kyle Moffett
On Mon, May 30, 2005 at 03:39:04PM -0400, Kyle Moffett wrote:
> On May 30, 2005, at 15:28:26, Andi Kleen wrote:
> >On Mon, May 30, 2005 at 02:04:36PM -0400, Kyle Moffett wrote:
> >>The idea behind these locks is for bigger systems (8-way or more) for
> >>heavily contended locks (say 32 threads doing write() on the same
> >>fd).
> >
> >Didn't Dipankar & co just fix that with their latest RCU patchkit?
> >(assuming you mean the FD locks)
>
> That's lock-free FD lookup (open, close, and read/write to a certain
> extent). I'm handling something more like serialization on a fifo
> accessed in both user context and interrupt context, which in a
> traditional implementation would use a spinlock, but with a spinaphore,
> one could do this:
>
> In interrupt context:
>
> spinaphore_lock_atomic(&fifo->sph);
> put_stuff_in_fifo(fifo,stuff);
> spinaphore_unlock(&fifo->sph);
>
> In user context:
>
> spinaphore_lock(&fifo->sph);
> put_stuff_in_fifo(fifo,stuff);
> spinaphore_unlock(&fifo->sph);
>
> If there are multiple devices generating interrupts to put stuff in the
> fifo and multiple processes also trying to put stuff in the fifo, all
> bursting on different CPUs, then the fifo lock would be heavily loaded.
> If the system had other things it could be doing while waiting for the
> FIFO to become available, then it should be able to do those.
If you have lots of concurrent writes, then ordering cannot be guaranteed,
right? If ordering cannot be guaranteed, why not split the fifo into
multiple parallel streams, and have the writers randomly select one
of the streams?
If a given writer's data must be ordered, one could hash based on the
task-struct pointer for processes, and any convenient pointer for
interrupt context.
That way, you decrease the contention, greatly reducing the spinning and
the context switches.
Or am I missing something here?
Thanx, Paul
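A sketch of that split-fifo idea (the stream count, the hash choice, and the
fifo helpers are all illustrative; put_stuff_in_fifo() and struct fifo are
assumed from the earlier example, and hash_ptr() is the kernel's pointer
hash):

#define NR_STREAMS 8	/* must stay a power of two for the hash below */

struct fifo_stream {
	spinlock_t  lock;
	struct fifo fifo;
};

struct split_fifo {
	struct fifo_stream stream[NR_STREAMS];
};

/* key is the task_struct pointer for processes, or any convenient per-source
 * pointer in interrupt context, so one writer's data stays ordered. */
static struct fifo_stream *pick_stream(struct split_fifo *f, void *key)
{
	return &f->stream[hash_ptr(key, 3)];	/* 2^3 == NR_STREAMS */
}

void split_fifo_put(struct split_fifo *f, void *key, void *stuff)
{
	struct fifo_stream *s = pick_stream(f, key);

	/* irq-safe locking (spin_lock_irqsave etc.) elided for brevity */
	spin_lock(&s->lock);
	put_stuff_in_fifo(&s->fifo, stuff);
	spin_unlock(&s->lock);
}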