Longman
Thanks for your suggestion.
We will look for a real-world scenario to test. Could you please suggest
some benchmarks for spinlocks?
Regards
Ling
>
> Your new spinlock code completely changes the API and the semantics of the
> existing spinlock calls. That requires changes to thousands of places
> throughout the whole kernel. It also increases the size of the spinlock from
> 4 bytes to 32 bytes. It is basically a no-go.
>
> However, if you can improve the performance within the existing API and
> locking semantics, it will be much more acceptable.
>
> Cheers,
> Longman
On 11/05/2015 11:28 PM, Ling Ma wrote:
> Could you please suggest some benchmarks for spinlocks?
The kernel has been well optimized for most common workloads, so spinlock
contention is usually not a performance bottleneck. There are still corner
cases, though, where there is heavy spinlock contention.
I used a spinlock loop microbenchmark similar to what you are doing, as well
as AIM7 for application-level testing.
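As a rough illustration only (this is not the benchmark referred to above),
a minimal user-space spinlock loop microbenchmark might look like the sketch
below; the pthread spinlock stands in for the kernel lock, and the thread and
iteration counts are arbitrary:

/* build: gcc -O2 -pthread spinbench.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8
#define ITERS    1000000L

static pthread_spinlock_t lock;
static uint64_t counter;                        /* protected by lock */

static void *worker(void *arg)
{
        for (long i = 0; i < ITERS; i++) {
                pthread_spin_lock(&lock);
                counter++;                      /* deliberately tiny critical section */
                pthread_spin_unlock(&lock);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        struct timespec t0, t1;

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("counter=%llu elapsed=%.3fs\n", (unsigned long long)counter,
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;
}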
Cheers,
Longman
Hi Longman,
The attachments include the user-space application thread.c and the kernel
patch spinlock-test.patch, based on kernel 4.3.0-rc4.
We ran thread.c with the kernel patch, testing the original and the new
spinlock respectively. perf top -G indicates that thread.c causes the
cache_alloc_refill and cache_flusharray functions to spend ~25% of their time
on the original spinlock; after introducing the new spinlock in these two
functions, that cost drops to ~22%.
The printed data also show that the new spinlock improves performance by
about 15% (93841765576 / 81036259588) on an E5-2699 v3.
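The attached thread.c is not reproduced in this thread; as a purely
illustrative stand-in, a stress loop of the kind sketched below drives the
slab allocator, and hence cache_alloc_refill()/cache_flusharray() and the
list_lock they contend on, from many CPUs at once. The thread count and the
open/close pattern are arbitrary choices, not taken from the attachment:

/* build: gcc -O2 -pthread slabstress.c; observe with perf top -G */
#include <pthread.h>
#include <fcntl.h>
#include <unistd.h>

#define NTHREADS 32

static void *worker(void *arg)
{
        for (;;) {
                /* each open()/close() allocates and frees kernel objects
                 * (e.g. struct file) through the slab allocator */
                int fd = open("/dev/null", O_RDONLY);
                if (fd >= 0)
                        close(fd);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        pause();                /* run until interrupted (Ctrl-C) */
        return 0;
}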
Appreciate your comments.
Thanks
Ling
Any comments about it?
Thanks
Ling
On 11/23/2015 04:41 AM, Ling Ma wrote:
> The printed data also show that the new spinlock improves performance by
> about 15% (93841765576 / 81036259588) on an E5-2699 v3.
I see that you made the following change in the code:
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
        u32 val;
-
+repeat:
        val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
        if (likely(val == 0))
                return;
-       queued_spin_lock_slowpath(lock, val);
+       goto repeat;
+       //queued_spin_lock_slowpath(lock, val);
}
This effectively changes the queued spinlock into an unfair byte lock.
Without a pause to moderate the cmpxchg() call, that is especially bad
for performance. Does the performance data above refer to the unfair byte
lock versus your new spinlock?
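For illustration, "moderating" the cmpxchg() means something like the
test-and-test-and-set loop sketched below, which only attempts the atomic
when the lock looks free and backs off with cpu_relax() otherwise. This is
just a sketch of the idea, not a proposed or tested patch, and it assumes the
same context as the snippet above (struct qspinlock, _Q_LOCKED_VAL,
cpu_relax()):

/* illustrative sketch: pause-moderated unfair spin on the same lock word */
static __always_inline void unfair_spin_lock(struct qspinlock *lock)
{
        for (;;) {
                /* read first to avoid hammering the cache line with cmpxchg */
                if (atomic_read(&lock->val) == 0 &&
                    atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0)
                        return;
                cpu_relax();    /* e.g. the PAUSE instruction on x86 */
        }
}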
Cheers,
Longman
Hi Longman,
All the compared data come from the two operations below in spinlock-test.patch:
+#if ORG_QUEUED_SPINLOCK
+ org_queued_spin_lock((struct qspinlock *)&pa.n->list_lock);
+ refill_fn(&pa);
+ org_queued_spin_unlock((struct qspinlock *)&pa.n->list_lock);
+#else
+ new_spin_lock((struct nspinlock *)&pa.n->list_lock, refill_fn, &pa);
+#endif
and
+#if ORG_QUEUED_SPINLOCK
+ org_queued_spin_lock((struct qspinlock *)&pa.n->list_lock);
+ flusharray_fn(&pa);
+ org_queued_spin_unlock((struct qspinlock *)&pa.n->list_lock);
+#else
+ new_spin_lock((struct nspinlock *)&pa.n->list_lock, flusharray_fn, &pa);
+#endif
So the result is correct and fair.
Yes, we updated the code in include/asm-generic/qspinlock.h to simplify the
modification and to avoid a kernel crash. For example, if there are 10 lock
scenarios that could use the new spinlock but the bottleneck comes from only
one or two of them, we modify only those; the other scenarios continue to use
the lock in qspinlock.h. We must modify that code, otherwise an operation
would stay queued there and never be woken up.
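For readers without the attachment, the shape of the API difference can be
sketched in plain user-space C as below. The names and the trivial
lock-run-unlock body are only an illustrative stand-in; the real
new_spin_lock() in the patch does more than this:

/* build: gcc -O2 -pthread api_sketch.c */
#include <pthread.h>

struct refill_args { int *freelist_len; int batch; };

/* the critical section, packaged as a callback taking one argument */
static void refill_fn(void *p)
{
        struct refill_args *a = p;
        *a->freelist_len += a->batch;
}

/* callback-style API: the lock routine runs the critical section itself */
static void run_locked(pthread_spinlock_t *lock, void (*fn)(void *), void *arg)
{
        pthread_spin_lock(lock);
        fn(arg);
        pthread_spin_unlock(lock);
}

/* classic API for comparison: the caller brackets the critical section */
static void classic(pthread_spinlock_t *lock, struct refill_args *a)
{
        pthread_spin_lock(lock);
        *a->freelist_len += a->batch;
        pthread_spin_unlock(lock);
}

int main(void)
{
        pthread_spinlock_t lock;
        int len = 0;
        struct refill_args a = { &len, 16 };

        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        run_locked(&lock, refill_fn, &a);       /* new-style call */
        classic(&lock, &a);                     /* old-style call */
        return 0;
}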
Thanks
Ling
Running thread.c on a clean kernel 4.3.0-rc4, perf top -G also indicates that
the cache_flusharray and cache_alloc_refill functions spend 25.6% of their
time in queued_spin_lock_slowpath in total. This means the compared data from
our spinlock-test.patch are reliable.
Thanks
Ling
Any comments? Is the patch acceptable?
Thanks
Ling
On 11/30/2015 01:17 AM, Ling Ma wrote:
> Any comments? Is the patch acceptable?
Ling,
The core idea of your current patch hasn't changed from your previous
patch.
My comment is that you should not attempt to sell it as a replacement for
the current spinlock mechanism. I just don't see that happening, given the
change in API semantics. Also, I think there are probably cases where your
patch cannot be applied. So treat it as a separate synchronization mechanism
that can be useful in some scenarios.
Cheers,
Longman
Longman,
We further optimized the kernel spinlock in the attached ali-spin-lock.patch,
based on kernel 4.3.0-rc4.
Running thread.c in user space with the kernel patch (ali-spin-lock.patch) on
an E5-2699 v3 and comparing against the original spinlock: the printed data
indicate that performance in the critical path improves by 1.91x
(92715428576 cycles / 48475891244 cycles), and perf top -d1 also shows that
the time spent on the spinlock drops from 25% to 15%.
All the compared data come from the two operations below in ali-spin-lock.patch:
+#if ORG_QUEUED_SPINLOCK
+ org_queued_spin_lock((struct qspinlock *)&pa.n->list_lock);
+ refill_fn(&pa);
+ org_queued_spin_unlock((struct qspinlock *)&pa.n->list_lock);
+#else
+ ali_spin_lock((struct alispinlock *)&pa.n->list_lock, refill_fn, &pa);
+#endif
and
+#if ORG_QUEUED_SPINLOCK
+ org_queued_spin_lock((struct qspinlock *)&pa.n->list_lock);
+ flusharray_fn(&pa);
+ org_queued_spin_unlock((struct qspinlock *)&pa.n->list_lock);
+#else
+ ali_spin_lock((struct alispinlock *)&pa.n->list_lock, flusharray_fn, &pa);
+#endif
We will send the formal patch as a separate synchronization mechanism soon.
Appreciate your comments.
Thanks
Ling