2010-04-01 16:31:21

by Avi Kivity

Subject: Re: RFC: Ideal Adaptive Spinning Conditions

On 04/01/2010 02:21 AM, Darren Hart wrote:
> I'm looking at some adaptive spinning with futexes as a way to help
> reduce the dependence on sched_yield() to implement userspace
> spinlocks. Chris, I included you in the CC after reading your comments
> regarding sched_yield() at kernel summit and I thought you might be
> interested.
>
> I have an experimental patchset that implements FUTEX_LOCK and
> FUTEX_LOCK_ADAPTIVE in the kernel and uses something akin to
> mutex_spin_on_owner() for the first waiter to spin. What I'm finding
> is that adaptive spinning actually hurts my particular test case, so I
> was hoping to poll people for context regarding the existing adaptive
> spinning implementations in the kernel as to where we see benefit.
> Under which conditions does adaptive spinning help?
>
> I presume locks with a short average hold time stand to gain the
> most: the longer the lock is held, the more likely it is that the
> spinner will expire its timeslice or that the scheduling gain becomes
> noise in the acquisition time. My test case simply calls
> "lock();unlock()" for a fixed number of iterations and reports the
> iterations per second at the end of the run. It can run with an
> arbitrary number of threads as well. I typically run with 256 threads
> for 10M iterations.
>
> futex_lock: Result: 635 Kiter/s
> futex_lock_adaptive: Result: 542 Kiter/s

A lock(); unlock(); loop spends most of its time with the lock held or
contended. Can you try something like this:


lock();
for (i = 0; i < 1000; ++i)
        asm volatile ("" : : : "memory");
unlock();
for (i = 0; i < 10000; ++i)
        asm volatile ("" : : : "memory");

This simulates a lock hold ratio of roughly 10% (the lock is held for
1000 of every 11000 busy-loop iterations), with the lock hold time
exceeding the acquisition time. It will be interesting to lower both
loop bounds as well.
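
For concreteness, here is a minimal sketch of a complete per-thread
loop along those lines. It uses a pthread mutex as a stand-in for the
experimental FUTEX_LOCK interface (an assumption on my part), and the
NTHREADS/ITERS values just mirror the 256-thread, 10M-iteration runs:

#include <pthread.h>

#define NTHREADS 256
#define ITERS    (10000000 / NTHREADS)

static pthread_mutex_t lock_var = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
        int i;
        long n;

        for (n = 0; n < ITERS; n++) {
                pthread_mutex_lock(&lock_var);    /* lock();   */
                for (i = 0; i < 1000; ++i)        /* ~10% held */
                        asm volatile ("" : : : "memory");
                pthread_mutex_unlock(&lock_var);  /* unlock(); */
                for (i = 0; i < 10000; ++i)       /* ~90% free */
                        asm volatile ("" : : : "memory");
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NTHREADS];
        int t;

        for (t = 0; t < NTHREADS; t++)
                pthread_create(&threads[t], NULL, worker, NULL);
        for (t = 0; t < NTHREADS; t++)
                pthread_join(threads[t], NULL);
        return 0;
}

Build with gcc -O2 -pthread; timing the run and dividing total
iterations by elapsed seconds gives the same Kiter/s figure the
testcase reports.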

--
error compiling committee.c: too many arguments to function


2010-04-01 15:55:07

by Darren Hart

Subject: Re: RFC: Ideal Adaptive Spinning Conditions

Avi Kivity wrote:
> On 04/01/2010 02:21 AM, Darren Hart wrote:
>> I'm looking at some adaptive spinning with futexes as a way to help
>> reduce the dependence on sched_yield() to implement userspace
>> spinlocks. Chris, I included you in the CC after reading your comments
>> regarding sched_yield() at kernel summit and I thought you might be
>> interested.
>>
>> I have an experimental patchset that implements FUTEX_LOCK and
>> FUTEX_LOCK_ADAPTIVE in the kernel and uses something akin to
>> mutex_spin_on_owner() for the first waiter to spin. What I'm finding
>> is that adaptive spinning actually hurts my particular test case, so I
>> was hoping to poll people for context regarding the existing adaptive
>> spinning implementations in the kernel as to where we see benefit.
>> Under which conditions does adaptive spinning help?
>>
>> I presume locks with a short average hold time stand to gain the
>> most: the longer the lock is held, the more likely it is that the
>> spinner will expire its timeslice or that the scheduling gain becomes
>> noise in the acquisition time. My test case simply calls
>> "lock();unlock()" for a fixed number of iterations and reports the
>> iterations per second at the end of the run. It can run with an
>> arbitrary number of threads as well. I typically run with 256 threads
>> for 10M iterations.
>>
>> futex_lock: Result: 635 Kiter/s
>> futex_lock_adaptive: Result: 542 Kiter/s
>
> A lock(); unlock(); loop spends most of its time with the lock held or
> contended. Can you try something like this:
>
>
> lock();
> for (i = 0; i < 1000; ++i)
>         asm volatile ("" : : : "memory");
> unlock();
> for (i = 0; i < 10000; ++i)
>         asm volatile ("" : : : "memory");


Great idea. I'll be doing a more rigorous investigation on this of
course, but I thought I'd share the results of just dumping this into
the testcase:

# ./futex_lock -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
Arguments: iterations=10000000 threads=256 adaptive=0
Result: 420 Kiter/s
lock calls: 9999872
lock syscalls: 665824 (6.66%)
unlock calls: 9999872
unlock syscalls: 861240 (8.61%)

# ./futex_lock -a -i10000000
futex_lock: Measure FUTEX_LOCK operations per second
Arguments: iterations=10000000 threads=256 adaptive=1
Result: 426 Kiter/s
lock calls: 9999872
lock syscalls: 558787 (5.59%)
unlock calls: 9999872
unlock syscalls: 603412 (6.03%)

This is the first time I've seen adaptive locking have an advantage! The
second set of runs showed a slightly greater advantage. Note that this
was still with spinners being limited to one.

My thanks to everyone for their insight. I'll be preparing some result
matrices and will share the patches and testcases here shortly.

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-01 16:39:05

by Avi Kivity

Subject: Re: RFC: Ideal Adaptive Spinning Conditions

On 04/01/2010 06:54 PM, Darren Hart wrote:
>> A lock(); unlock(); loop spends most of its time with the lock held
>> or contended. Can you try something like this:
>>
>>
>> lock();
>> for (i = 0; i < 1000; ++i)
>>         asm volatile ("" : : : "memory");
>> unlock();
>> for (i = 0; i < 10000; ++i)
>>         asm volatile ("" : : : "memory");
>
>
>
> Great idea. I'll be doing a more rigorous investigation on this of
> course, but I thought I'd share the results of just dumping this into
> the testcase:
>
> # ./futex_lock -i10000000
> futex_lock: Measure FUTEX_LOCK operations per second
> Arguments: iterations=10000000 threads=256 adaptive=0
> Result: 420 Kiter/s
> lock calls: 9999872
> lock syscalls: 665824 (6.66%)
> unlock calls: 9999872
> unlock syscalls: 861240 (8.61%)
>
> # ./futex_lock -a -i10000000
> futex_lock: Measure FUTEX_LOCK operations per second
> Arguments: iterations=10000000 threads=256 adaptive=1
> Result: 426 Kiter/s
> lock calls: 9999872
> lock syscalls: 558787 (5.59%)
> unlock calls: 9999872
> unlock syscalls: 603412 (6.03%)
>
> This is the first time I've seen adaptive locking have an advantage!
> The second set of runs showed a slightly greater advantage. Note that
> this was still with spinners being limited to one.

Question - do all threads finish at the same time, or wildly different
times?

--
error compiling committee.c: too many arguments to function

2010-04-01 17:10:20

by Darren Hart

Subject: Re: RFC: Ideal Adaptive Spinning Conditions

Avi Kivity wrote:
> On 04/01/2010 06:54 PM, Darren Hart wrote:
>>> A lock(); unlock(); loop spends most of its time with the lock held
>>> or contended. Can you try something like this:
>>>
>>>
>>> lock();
>>> for (i = 0; i < 1000; ++i)
>>>         asm volatile ("" : : : "memory");
>>> unlock();
>>> for (i = 0; i < 10000; ++i)
>>>         asm volatile ("" : : : "memory");
>>
>>
>>
>> Great idea. I'll be doing a more rigorous investigation on this of
>> course, but I thought I'd share the results of just dumping this into
>> the testcase:
>>
>> # ./futex_lock -i10000000
>> futex_lock: Measure FUTEX_LOCK operations per second
>> Arguments: iterations=10000000 threads=256 adaptive=0
>> Result: 420 Kiter/s
>> lock calls: 9999872
>> lock syscalls: 665824 (6.66%)
>> unlock calls: 9999872
>> unlock syscalls: 861240 (8.61%)
>>
>> # ./futex_lock -a -i10000000
>> futex_lock: Measure FUTEX_LOCK operations per second
>> Arguments: iterations=10000000 threads=256 adaptive=1
>> Result: 426 Kiter/s
>> lock calls: 9999872
>> lock syscalls: 558787 (5.59%)
>> unlock calls: 9999872
>> unlock syscalls: 603412 (6.03%)
>>
>> This is the first time I've seen adaptive locking have an advantage!
>> The second set of runs showed a slightly greater advantage. Note that
>> this was still with spinners being limited to one.
>
> Question - do all threads finish at the same time, or wildly different
> times?

I'm not sure; I can add some fairness metrics to the test that will
help characterize how that's working. My suspicion is that there will
be several threads that don't make any progress until the very end,
since adaptive spinning is an "unfair" locking technique.
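
A first cut at such a metric (a sketch; the finish[] array and
report_fairness() helper are hypothetical, not part of the current
testcase) would be to timestamp each thread as it completes its
iterations and then report the spread between the first and last
finisher:

#include <stdio.h>
#include <time.h>

#define NTHREADS 256

/* Each worker fills in its own entry with
 * clock_gettime(CLOCK_MONOTONIC, &finish[id]) just before returning. */
static struct timespec finish[NTHREADS];

static double ts_to_sec(const struct timespec *ts)
{
        return ts->tv_sec + ts->tv_nsec / 1e9;
}

/* Called after all threads are joined: a spread that is large relative
 * to total runtime means some threads made little progress until the
 * very end. */
static void report_fairness(void)
{
        double first, last, f;
        int t;

        first = last = ts_to_sec(&finish[0]);
        for (t = 1; t < NTHREADS; t++) {
                f = ts_to_sec(&finish[t]);
                if (f < first)
                        first = f;
                if (f > last)
                        last = f;
        }
        printf("finish-time spread: %.3f s\n", last - first);
}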

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team

2010-04-01 17:15:51

by Avi Kivity

Subject: Re: RFC: Ideal Adaptive Spinning Conditions

On 04/01/2010 08:10 PM, Darren Hart wrote:
> Avi Kivity wrote:
>> On 04/01/2010 06:54 PM, Darren Hart wrote:
>>>> A lock(); unlock(); loop spends most of its time with the lock held
>>>> or contended. Can you try something like this:
>>>>
>>>>
>>>> lock();
>>>> for (i = 0; i < 1000; ++i)
>>>>         asm volatile ("" : : : "memory");
>>>> unlock();
>>>> for (i = 0; i < 10000; ++i)
>>>>         asm volatile ("" : : : "memory");
>>>
>>>
>>>
>>> Great idea. I'll be doing a more rigorous investigation on this of
>>> course, but I thought I'd share the results of just dumping this
>>> into the testcase:
>>>
>>> # ./futex_lock -i10000000
>>> futex_lock: Measure FUTEX_LOCK operations per second
>>> Arguments: iterations=10000000 threads=256 adaptive=0
>>> Result: 420 Kiter/s
>>> lock calls: 9999872
>>> lock syscalls: 665824 (6.66%)
>>> unlock calls: 9999872
>>> unlock syscalls: 861240 (8.61%)
>>>
>>> # ./futex_lock -a -i10000000
>>> futex_lock: Measure FUTEX_LOCK operations per second
>>> Arguments: iterations=10000000 threads=256 adaptive=1
>>> Result: 426 Kiter/s
>>> lock calls: 9999872
>>> lock syscalls: 558787 (5.59%)
>>> unlock calls: 9999872
>>> unlock syscalls: 603412 (6.03%)
>>>
>>> This is the first time I've seen adaptive locking have an advantage!
>>> The second set of runs showed a slightly greater advantage. Note
>>> that this was still with spinners being limited to one.
>>
>> Question - do all threads finish at the same time, or wildly
>> different times?
>
> I'm not sure; I can add some fairness metrics to the test that will
> help characterize how that's working. My suspicion is that there will
> be several threads that don't make any progress until the very end,
> since adaptive spinning is an "unfair" locking technique.
>

Well, if the amount of unfairness differs between the tests (unfair
unfairness?), then you may see results that are not directly related to
spin vs. yield. You need to make the test more self-regulating so that
the results are more repeatable (yet without making it a round-robin
test).
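
One way to get there (again only a sketch, and using a pthread mutex as
a stand-in for the FUTEX_LOCK operations under test): run each thread
against a shared deadline rather than a fixed iteration count, so every
run measures the same wall-clock window:

#include <pthread.h>

static pthread_mutex_t lock_var = PTHREAD_MUTEX_INITIALIZER;
static volatile int stop;   /* set by the main thread at the deadline */

static void *worker(void *arg)
{
        long *count = arg;  /* per-thread iteration counter */

        while (!stop) {
                pthread_mutex_lock(&lock_var);
                /* critical-section busy work goes here */
                pthread_mutex_unlock(&lock_var);
                /* non-critical busy work goes here */
                (*count)++;
        }
        return NULL;
}

The main thread spawns the workers, sleeps for the measurement window,
sets stop = 1, and joins. The sum of the counters gives throughput, the
distribution across counters gives a direct fairness measure, and
because every run covers the same window, results are repeatable
without imposing round-robin ordering.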

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.