2016-04-19 06:30:06

by Pan Xinhui

Subject: [PATCH V2] powerpc: Implement {cmp}xchg for u8 and u16

From: Pan Xinhui <[email protected]>

Implement xchg{u8,u16}{local,relaxed}, and
cmpxchg{u8,u16}{,local,acquire,relaxed}.

It works on all ppc.

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
---
change from V1:
rework totally.
---
arch/powerpc/include/asm/cmpxchg.h | 83 ++++++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 44efe73..79a1f45 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -7,6 +7,37 @@
#include <asm/asm-compat.h>
#include <linux/bug.h>

+#ifdef __BIG_ENDIAN
+#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
+#else
+#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
+#endif
+
+static __always_inline unsigned long
+__cmpxchg_u32_local(volatile unsigned int *p, unsigned long old,
+ unsigned long new);
+
+#define __XCHG_GEN(cmp, type, sfx, u32sfx, skip, v) \
+static __always_inline u32 \
+__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
+{ \
+ int size = sizeof (type); \
+ int off = (unsigned long)ptr % sizeof(u32); \
+ volatile u32 *p = ptr - off; \
+ int bitoff = BITOFF_CAL(size, off); \
+ u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
+ u32 oldv, newv; \
+ u32 ret; \
+ do { \
+ oldv = READ_ONCE(*p); \
+ ret = (oldv & bitmask) >> bitoff; \
+ if (skip && ret != old) \
+ break; \
+ newv = (oldv & ~bitmask) | (new << bitoff); \
+ } while (__cmpxchg_u32##u32sfx((v void*)p, oldv, newv) != oldv);\
+ return ret; \
+}
+
/*
* Atomic exchange
*
@@ -14,6 +45,19 @@
* the previous value stored there.
*/

+#define XCHG_GEN(type, sfx, v) \
+ __XCHG_GEN(_, type, sfx, _local, 0, v) \
+static __always_inline u32 __xchg_##type##sfx(v void *p, u32 n) \
+{ \
+ return ___xchg_##type##sfx(p, 0, n); \
+}
+
+XCHG_GEN(u8, _local, volatile);
+XCHG_GEN(u8, _relaxed, );
+XCHG_GEN(u16, _local, volatile);
+XCHG_GEN(u16, _relaxed, );
+#undef XCHG_GEN
+
static __always_inline unsigned long
__xchg_u32_local(volatile void *p, unsigned long val)
{
@@ -88,6 +132,10 @@ static __always_inline unsigned long
__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_local(ptr, x);
+ case 2:
+ return __xchg_u16_local(ptr, x);
case 4:
return __xchg_u32_local(ptr, x);
#ifdef CONFIG_PPC64
@@ -103,6 +151,10 @@ static __always_inline unsigned long
__xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_relaxed(ptr, x);
+ case 2:
+ return __xchg_u16_relaxed(ptr, x);
case 4:
return __xchg_u32_relaxed(ptr, x);
#ifdef CONFIG_PPC64
@@ -226,6 +278,21 @@ __cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
return prev;
}

+
+#define CMPXCHG_GEN(type, sfx, v) \
+ __XCHG_GEN(cmp, type, sfx, sfx, 1, v)
+
+CMPXCHG_GEN(u8, , volatile);
+CMPXCHG_GEN(u8, _local, volatile);
+CMPXCHG_GEN(u8, _relaxed, );
+CMPXCHG_GEN(u8, _acquire, );
+CMPXCHG_GEN(u16, , volatile);
+CMPXCHG_GEN(u16, _local, volatile);
+CMPXCHG_GEN(u16, _relaxed, );
+CMPXCHG_GEN(u16, _acquire, );
+#undef CMPXCHG_GEN
+#undef __XCHG_GEN
+
#ifdef CONFIG_PPC64
static __always_inline unsigned long
__cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
@@ -316,6 +383,10 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16(ptr, old, new);
case 4:
return __cmpxchg_u32(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -332,6 +403,10 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_local(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_local(ptr, old, new);
case 4:
return __cmpxchg_u32_local(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -348,6 +423,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_relaxed(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_relaxed(ptr, old, new);
case 4:
return __cmpxchg_u32_relaxed(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -364,6 +443,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_acquire(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_acquire(ptr, old, new);
case 4:
return __cmpxchg_u32_acquire(ptr, old, new);
#ifdef CONFIG_PPC64
--
2.4.3
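
As a quick illustration of the BITOFF_CAL() arithmetic in the patch above, the
following standalone userspace sketch (not part of the patch) shows the bit
offsets it produces for sub-word accesses on big- and little-endian layouts:

/* Standalone sketch of the BITOFF_CAL() idea; not part of the patch. */
#include <assert.h>
#include <stdint.h>

#define BITS_PER_BYTE 8

/* Big endian: byte 0 holds the most significant bits of the u32. */
#define BITOFF_BE(size, off) ((sizeof(uint32_t) - (size) - (off)) * BITS_PER_BYTE)
/* Little endian: byte 0 holds the least significant bits. */
#define BITOFF_LE(size, off) ((off) * BITS_PER_BYTE)

int main(void)
{
	/* A u8 at byte offset 3 of its aligned u32 container: */
	assert(BITOFF_BE(sizeof(uint8_t), 3) == 0);	/* lowest bits on BE */
	assert(BITOFF_LE(sizeof(uint8_t), 3) == 24);	/* highest bits on LE */

	/* A u16 at byte offset 2: */
	assert(BITOFF_BE(sizeof(uint16_t), 2) == 0);
	assert(BITOFF_LE(sizeof(uint16_t), 2) == 16);
	return 0;
}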


2016-04-19 09:18:29

by Boqun Feng

Subject: Re: [PATCH V2] powerpc: Implement {cmp}xchg for u8 and u16

Hi Xinhui,

On Tue, Apr 19, 2016 at 02:29:34PM +0800, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> Implement xchg{u8,u16}{local,relaxed}, and
> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>
> It works on all ppc.
>

Nice work!

AFAICT, your work doesn't depend on anything that is ppc-specific, right?
So maybe we can use it as a general approach for a fallback
implementation on archs without u8/u16 atomics. ;-)

> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Pan Xinhui <[email protected]>
> ---
> change from V1:
> rework totally.
> ---
> arch/powerpc/include/asm/cmpxchg.h | 83 ++++++++++++++++++++++++++++++++++++++
> 1 file changed, 83 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> index 44efe73..79a1f45 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -7,6 +7,37 @@
> #include <asm/asm-compat.h>
> #include <linux/bug.h>
>
> +#ifdef __BIG_ENDIAN
> +#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
> +#else
> +#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
> +#endif
> +
> +static __always_inline unsigned long
> +__cmpxchg_u32_local(volatile unsigned int *p, unsigned long old,
> + unsigned long new);
> +
> +#define __XCHG_GEN(cmp, type, sfx, u32sfx, skip, v) \
> +static __always_inline u32 \
> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
> +{ \
> + int size = sizeof (type); \
> + int off = (unsigned long)ptr % sizeof(u32); \
> + volatile u32 *p = ptr - off; \
> + int bitoff = BITOFF_CAL(size, off); \
> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
> + u32 oldv, newv; \
> + u32 ret; \
> + do { \
> + oldv = READ_ONCE(*p); \
> + ret = (oldv & bitmask) >> bitoff; \
> + if (skip && ret != old) \
> + break; \
> + newv = (oldv & ~bitmask) | (new << bitoff); \
> + } while (__cmpxchg_u32##u32sfx((v void*)p, oldv, newv) != oldv);\

Forgive me if this is too paranoid, but I think we can save the
READ_ONCE() in the loop if we change the code into the following,
because cmpxchg returns the value it actually found in *p, so on a
failed cmp we already have a fresh value to retry with.

newv = READ_ONCE(*p);

do {
oldv = newv;
ret = (oldv & bitmask) >> bitoff;
if (skip && ret != old)
break;
newv = (oldv & ~bitmask) | (new << bitoff);
newv = __cmpxchg_u32##u32sfx((void *)p, oldv, newv);
} while(newv != oldv);

> + return ret; \
> +}
> +
> /*
> * Atomic exchange
> *
> @@ -14,6 +45,19 @@
> * the previous value stored there.
> */
>
> +#define XCHG_GEN(type, sfx, v) \
> + __XCHG_GEN(_, type, sfx, _local, 0, v) \
^^^^^^^

This should be sfx, right? Otherwise, all the newly added xchg variants will
call __cmpxchg_u32_local, which will result in the wrong ordering guarantees.

> +static __always_inline u32 __xchg_##type##sfx(v void *p, u32 n) \
> +{ \
> + return ___xchg_##type##sfx(p, 0, n); \
> +}
> +
> +XCHG_GEN(u8, _local, volatile);

I don't think we need the "volatile" modifier here, because READ_ONCE()
and __cmpxchg_u32_* all have "volatile" semantics IIUC, so maybe we can
save a parameter for the __XCHG_GEN macro.

Regards,
Boqun

> +XCHG_GEN(u8, _relaxed, );
> +XCHG_GEN(u16, _local, volatile);
> +XCHG_GEN(u16, _relaxed, );
> +#undef XCHG_GEN
> +
> static __always_inline unsigned long
> __xchg_u32_local(volatile void *p, unsigned long val)
> {
> @@ -88,6 +132,10 @@ static __always_inline unsigned long
> __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __xchg_u8_local(ptr, x);
> + case 2:
> + return __xchg_u16_local(ptr, x);
> case 4:
> return __xchg_u32_local(ptr, x);
> #ifdef CONFIG_PPC64
> @@ -103,6 +151,10 @@ static __always_inline unsigned long
> __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __xchg_u8_relaxed(ptr, x);
> + case 2:
> + return __xchg_u16_relaxed(ptr, x);
> case 4:
> return __xchg_u32_relaxed(ptr, x);
> #ifdef CONFIG_PPC64
> @@ -226,6 +278,21 @@ __cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
> return prev;
> }
>
> +
> +#define CMPXCHG_GEN(type, sfx, v) \
> + __XCHG_GEN(cmp, type, sfx, sfx, 1, v)
> +
> +CMPXCHG_GEN(u8, , volatile);
> +CMPXCHG_GEN(u8, _local, volatile);
> +CMPXCHG_GEN(u8, _relaxed, );
> +CMPXCHG_GEN(u8, _acquire, );
> +CMPXCHG_GEN(u16, , volatile);
> +CMPXCHG_GEN(u16, _local, volatile);
> +CMPXCHG_GEN(u16, _relaxed, );
> +CMPXCHG_GEN(u16, _acquire, );
> +#undef CMPXCHG_GEN
> +#undef __XCHG_GEN
> +
> #ifdef CONFIG_PPC64
> static __always_inline unsigned long
> __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
> @@ -316,6 +383,10 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16(ptr, old, new);
> case 4:
> return __cmpxchg_u32(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -332,6 +403,10 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_local(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_local(ptr, old, new);
> case 4:
> return __cmpxchg_u32_local(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -348,6 +423,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_relaxed(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_relaxed(ptr, old, new);
> case 4:
> return __cmpxchg_u32_relaxed(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -364,6 +443,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_acquire(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_acquire(ptr, old, new);
> case 4:
> return __cmpxchg_u32_acquire(ptr, old, new);
> #ifdef CONFIG_PPC64
> --
> 2.4.3
>

2016-04-20 03:41:29

by Pan Xinhui

Subject: Re: [PATCH V2] powerpc: Implement {cmp}xchg for u8 and u16


Hello, Boqun

On 2016年04月19日 17:18, Boqun Feng wrote:
> Hi Xinhui,
>
> On Tue, Apr 19, 2016 at 02:29:34PM +0800, Pan Xinhui wrote:
>> From: Pan Xinhui <[email protected]>
>>
>> Implement xchg{u8,u16}{local,relaxed}, and
>> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>>
>> It works on all ppc.
>>
>
> Nice work!
>
thank you.

> AFAICT, your work doesn't depend on anything that ppc-specific, right?
> So maybe we can use it as a general approach for a fallback
> implementation on the archs without u8/u16 atomics. ;-)
>
>> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Pan Xinhui <[email protected]>
>> ---
>> change from V1:
>> rework totally.
>> ---
>> arch/powerpc/include/asm/cmpxchg.h | 83 ++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 83 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
>> index 44efe73..79a1f45 100644
>> --- a/arch/powerpc/include/asm/cmpxchg.h
>> +++ b/arch/powerpc/include/asm/cmpxchg.h
>> @@ -7,6 +7,37 @@
>> #include <asm/asm-compat.h>
>> #include <linux/bug.h>
>>
>> +#ifdef __BIG_ENDIAN
>> +#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
>> +#else
>> +#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
>> +#endif
>> +
>> +static __always_inline unsigned long
>> +__cmpxchg_u32_local(volatile unsigned int *p, unsigned long old,
>> + unsigned long new);
>> +
>> +#define __XCHG_GEN(cmp, type, sfx, u32sfx, skip, v) \
>> +static __always_inline u32 \
>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
>> +{ \
>> + int size = sizeof (type); \
>> + int off = (unsigned long)ptr % sizeof(u32); \
>> + volatile u32 *p = ptr - off; \
>> + int bitoff = BITOFF_CAL(size, off); \
>> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
>> + u32 oldv, newv; \
>> + u32 ret; \
>> + do { \
>> + oldv = READ_ONCE(*p); \
>> + ret = (oldv & bitmask) >> bitoff; \
>> + if (skip && ret != old) \
>> + break; \
>> + newv = (oldv & ~bitmask) | (new << bitoff); \
>> + } while (__cmpxchg_u32##u32sfx((v void*)p, oldv, newv) != oldv);\
>
> Forgive me if this is too paranoid, but I think we can save the
> READ_ONCE() in the loop if we change the code into the following,
> because cmpxchg will return the "new" value, if the cmp part fails.
>
> newv = READ_ONCE(*p);
>
> do {
> oldv = newv;
> ret = (oldv & bitmask) >> bitoff;
> if (skip && ret != old)
> break;
> newv = (oldv & ~bitmask) | (new << bitoff);
> newv = __cmpxchg_u32##u32sfx((void *)p, oldv, newv);
> } while(newv != oldv);
>
>> + return ret; \
>> +}
That's a nice little optimization. Patch V3 will include your code, thanks.

>> +
>> /*
>> * Atomic exchange
>> *
>> @@ -14,6 +45,19 @@
>> * the previous value stored there.
>> */
>>
>> +#define XCHG_GEN(type, sfx, v) \
>> + __XCHG_GEN(_, type, sfx, _local, 0, v) \
> ^^^^^^^
>
> This should be sfx, right? Otherwise, all the newly added xchg will
> call __cmpxchg_u32_local, this will result in wrong ordering guarantees.
>
I meant that. But I will think about the ordering issue for a while. :)

>> +static __always_inline u32 __xchg_##type##sfx(v void *p, u32 n) \
>> +{ \
>> + return ___xchg_##type##sfx(p, 0, n); \
>> +}
>> +
>> +XCHG_GEN(u8, _local, volatile);
>
> I don't think we need the "volatile" modifier here, because READ_ONCE()
> and __cmpxchg_u32_* all have "volatile" semantics IIUC, so maybe we can
> save a paramter for the __XCHG_GEN macro.
>
Such cleanup work can be done in a separate patch. Here I just make the compiler happy.

thanks
xinhui
> Regards,
> Boqun
>
>> +XCHG_GEN(u8, _relaxed, );
>> +XCHG_GEN(u16, _local, volatile);
>> +XCHG_GEN(u16, _relaxed, );
>> +#undef XCHG_GEN
>> +
>> static __always_inline unsigned long
>> __xchg_u32_local(volatile void *p, unsigned long val)
>> {
>> @@ -88,6 +132,10 @@ static __always_inline unsigned long
>> __xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __xchg_u8_local(ptr, x);
>> + case 2:
>> + return __xchg_u16_local(ptr, x);
>> case 4:
>> return __xchg_u32_local(ptr, x);
>> #ifdef CONFIG_PPC64
>> @@ -103,6 +151,10 @@ static __always_inline unsigned long
>> __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __xchg_u8_relaxed(ptr, x);
>> + case 2:
>> + return __xchg_u16_relaxed(ptr, x);
>> case 4:
>> return __xchg_u32_relaxed(ptr, x);
>> #ifdef CONFIG_PPC64
>> @@ -226,6 +278,21 @@ __cmpxchg_u32_acquire(u32 *p, unsigned long old, unsigned long new)
>> return prev;
>> }
>>
>> +
>> +#define CMPXCHG_GEN(type, sfx, v) \
>> + __XCHG_GEN(cmp, type, sfx, sfx, 1, v)
>> +
>> +CMPXCHG_GEN(u8, , volatile);
>> +CMPXCHG_GEN(u8, _local, volatile);
>> +CMPXCHG_GEN(u8, _relaxed, );
>> +CMPXCHG_GEN(u8, _acquire, );
>> +CMPXCHG_GEN(u16, , volatile);
>> +CMPXCHG_GEN(u16, _local, volatile);
>> +CMPXCHG_GEN(u16, _relaxed, );
>> +CMPXCHG_GEN(u16, _acquire, );
>> +#undef CMPXCHG_GEN
>> +#undef __XCHG_GEN
>> +
>> #ifdef CONFIG_PPC64
>> static __always_inline unsigned long
>> __cmpxchg_u64(volatile unsigned long *p, unsigned long old, unsigned long new)
>> @@ -316,6 +383,10 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
>> unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __cmpxchg_u8(ptr, old, new);
>> + case 2:
>> + return __cmpxchg_u16(ptr, old, new);
>> case 4:
>> return __cmpxchg_u32(ptr, old, new);
>> #ifdef CONFIG_PPC64
>> @@ -332,6 +403,10 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
>> unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __cmpxchg_u8_local(ptr, old, new);
>> + case 2:
>> + return __cmpxchg_u16_local(ptr, old, new);
>> case 4:
>> return __cmpxchg_u32_local(ptr, old, new);
>> #ifdef CONFIG_PPC64
>> @@ -348,6 +423,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
>> unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __cmpxchg_u8_relaxed(ptr, old, new);
>> + case 2:
>> + return __cmpxchg_u16_relaxed(ptr, old, new);
>> case 4:
>> return __cmpxchg_u32_relaxed(ptr, old, new);
>> #ifdef CONFIG_PPC64
>> @@ -364,6 +443,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
>> unsigned int size)
>> {
>> switch (size) {
>> + case 1:
>> + return __cmpxchg_u8_acquire(ptr, old, new);
>> + case 2:
>> + return __cmpxchg_u16_acquire(ptr, old, new);
>> case 4:
>> return __cmpxchg_u32_acquire(ptr, old, new);
>> #ifdef CONFIG_PPC64
>> --
>> 2.4.3
>>
>

2016-04-20 13:25:06

by Pan Xinhui

Subject: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

From: Pan Xinhui <[email protected]>

Implement xchg{u8,u16}{local,relaxed}, and
cmpxchg{u8,u16}{,local,acquire,relaxed}.

It works on all ppc.
The basic idea is from commit 3226aad81aa6 ("sh: support 1 and 2 byte xchg")

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
---
change from v2:
in the do{}while(), we save one load and use corresponding cmpxchg suffix.
Also add corresponding __cmpxchg_u32 function declaration in the __XCHG_GEN
change from V1:
rework totally.
---
arch/powerpc/include/asm/cmpxchg.h | 83 ++++++++++++++++++++++++++++++++++++++
1 file changed, 83 insertions(+)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 44efe73..2aec04e 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -7,6 +7,38 @@
#include <asm/asm-compat.h>
#include <linux/bug.h>

+#ifdef __BIG_ENDIAN
+#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
+#else
+#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
+#endif
+
+#define __XCHG_GEN(cmp, type, sfx, skip, v) \
+static __always_inline unsigned long \
+__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
+ unsigned long new); \
+static __always_inline u32 \
+__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
+{ \
+ int size = sizeof (type); \
+ int off = (unsigned long)ptr % sizeof(u32); \
+ volatile u32 *p = ptr - off; \
+ int bitoff = BITOFF_CAL(size, off); \
+ u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
+ u32 oldv, newv, tmp; \
+ u32 ret; \
+ oldv = READ_ONCE(*p); \
+ do { \
+ ret = (oldv & bitmask) >> bitoff; \
+ if (skip && ret != old) \
+ break; \
+ newv = (oldv & ~bitmask) | (new << bitoff); \
+ tmp = oldv; \
+ oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
+ } while (tmp != oldv); \
+ return ret; \
+}
+
/*
* Atomic exchange
*
@@ -14,6 +46,19 @@
* the previous value stored there.
*/

+#define XCHG_GEN(type, sfx, v) \
+ __XCHG_GEN(_, type, sfx, 0, v) \
+static __always_inline u32 __xchg_##type##sfx(v void *p, u32 n) \
+{ \
+ return ___xchg_##type##sfx(p, 0, n); \
+}
+
+XCHG_GEN(u8, _local, volatile);
+XCHG_GEN(u8, _relaxed, );
+XCHG_GEN(u16, _local, volatile);
+XCHG_GEN(u16, _relaxed, );
+#undef XCHG_GEN
+
static __always_inline unsigned long
__xchg_u32_local(volatile void *p, unsigned long val)
{
@@ -88,6 +133,10 @@ static __always_inline unsigned long
__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_local(ptr, x);
+ case 2:
+ return __xchg_u16_local(ptr, x);
case 4:
return __xchg_u32_local(ptr, x);
#ifdef CONFIG_PPC64
@@ -103,6 +152,10 @@ static __always_inline unsigned long
__xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_relaxed(ptr, x);
+ case 2:
+ return __xchg_u16_relaxed(ptr, x);
case 4:
return __xchg_u32_relaxed(ptr, x);
#ifdef CONFIG_PPC64
@@ -131,6 +184,20 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
* and return the old value of *p.
*/

+#define CMPXCHG_GEN(type, sfx, v) \
+ __XCHG_GEN(cmp, type, sfx, 1, v)
+
+CMPXCHG_GEN(u8, , volatile);
+CMPXCHG_GEN(u8, _local, volatile);
+CMPXCHG_GEN(u8, _relaxed, );
+CMPXCHG_GEN(u8, _acquire, );
+CMPXCHG_GEN(u16, , volatile);
+CMPXCHG_GEN(u16, _local, volatile);
+CMPXCHG_GEN(u16, _relaxed, );
+CMPXCHG_GEN(u16, _acquire, );
+#undef CMPXCHG_GEN
+#undef __XCHG_GEN
+
static __always_inline unsigned long
__cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
{
@@ -316,6 +383,10 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16(ptr, old, new);
case 4:
return __cmpxchg_u32(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -332,6 +403,10 @@ __cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_local(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_local(ptr, old, new);
case 4:
return __cmpxchg_u32_local(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -348,6 +423,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_relaxed(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_relaxed(ptr, old, new);
case 4:
return __cmpxchg_u32_relaxed(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -364,6 +443,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_acquire(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_acquire(ptr, old, new);
case 4:
return __cmpxchg_u32_acquire(ptr, old, new);
#ifdef CONFIG_PPC64
--
2.4.3
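
Expanded out of the macro, the V3 loop has roughly the shape below. This is a
hedged, standalone C sketch for the u8 case only: it substitutes GCC's
__atomic_compare_exchange_n() builtin for the kernel's __cmpxchg_u32*
primitives and hard-codes the little-endian bit offset, so it illustrates the
approach rather than reproducing the kernel code:

#include <stdint.h>

#define BITS_PER_BYTE 8
/* Little endian shown; big endian would use (4 - size - off) * BITS_PER_BYTE. */
#define BITOFF(size, off) ((off) * BITS_PER_BYTE)

/* Stand-in for the kernel's __cmpxchg_u32*(): returns the value found in *p. */
uint32_t cmpxchg_u32(uint32_t *p, uint32_t old, uint32_t new)
{
	__atomic_compare_exchange_n(p, &old, new, 0,
				    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return old;	/* on failure, updated with the current value */
}

uint8_t cmpxchg_u8(uint8_t *ptr, uint8_t old, uint8_t new)
{
	unsigned long off = (unsigned long)ptr % sizeof(uint32_t);
	uint32_t *p = (uint32_t *)((char *)ptr - off);
	unsigned int bitoff = BITOFF(sizeof(uint8_t), off);
	uint32_t bitmask = 0xffu << bitoff;
	uint32_t oldv, newv, tmp, ret;

	oldv = *(volatile uint32_t *)p;			/* single initial load */
	do {
		ret = (oldv & bitmask) >> bitoff;
		if (ret != old)				/* cmp part failed */
			break;
		newv = (oldv & ~bitmask) | ((uint32_t)new << bitoff);
		tmp = oldv;
		oldv = cmpxchg_u32(p, oldv, newv);	/* reuse the returned value */
	} while (tmp != oldv);

	return (uint8_t)ret;
}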

2016-04-20 14:24:24

by Peter Zijlstra

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:

> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
> +static __always_inline unsigned long \
> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
> + unsigned long new); \
> +static __always_inline u32 \
> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
> +{ \
> + int size = sizeof (type); \
> + int off = (unsigned long)ptr % sizeof(u32); \
> + volatile u32 *p = ptr - off; \
> + int bitoff = BITOFF_CAL(size, off); \
> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
> + u32 oldv, newv, tmp; \
> + u32 ret; \
> + oldv = READ_ONCE(*p); \
> + do { \
> + ret = (oldv & bitmask) >> bitoff; \
> + if (skip && ret != old) \
> + break; \
> + newv = (oldv & ~bitmask) | (new << bitoff); \
> + tmp = oldv; \
> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
> + } while (tmp != oldv); \
> + return ret; \
> +}

So for an LL/SC based arch, using cmpxchg() like that is sub-optimal.

Why did you choose to write it entirely in C?

2016-04-21 15:49:48

by Boqun Feng

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> On 2016年04月20日 22:24, Peter Zijlstra wrote:
> > On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> >
> >> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
> >> +static __always_inline unsigned long \
> >> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
> >> + unsigned long new); \
> >> +static __always_inline u32 \
> >> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
> >> +{ \
> >> + int size = sizeof (type); \
> >> + int off = (unsigned long)ptr % sizeof(u32); \
> >> + volatile u32 *p = ptr - off; \
> >> + int bitoff = BITOFF_CAL(size, off); \
> >> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
> >> + u32 oldv, newv, tmp; \
> >> + u32 ret; \
> >> + oldv = READ_ONCE(*p); \
> >> + do { \
> >> + ret = (oldv & bitmask) >> bitoff; \
> >> + if (skip && ret != old) \
> >> + break; \
> >> + newv = (oldv & ~bitmask) | (new << bitoff); \
> >> + tmp = oldv; \
> >> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
> >> + } while (tmp != oldv); \
> >> + return ret; \
> >> +}
> >
> > So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> >
> > Why did you choose to write it entirely in C?
> >
> yes, you are right. more load/store will be done in C code.
> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
> So just wrote in C, for simple. :)
>
> Of course I have done xchg tests.
> we run code just like xchg((u8*)&v, j++); in several threads.
> and the result is,
> [ 768.374264] use time[1550072]ns in xchg_u8_asm

How was xchg_u8_asm() implemented: using lbarx, or using a 32-bit ll/sc
loop with shifting and masking in it?

Regards,
Boqun

> [ 768.377102] use time[2826802]ns in xchg_u8_c
>
> I think this is because there is one more load in C.
> If possible, we can move such code in asm-generic/.
>
> thanks
> xinhui
>



2016-04-21 16:07:01

by Pan Xinhui

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On 2016年04月20日 22:24, Peter Zijlstra wrote:
> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
>
>> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
>> +static __always_inline unsigned long \
>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
>> + unsigned long new); \
>> +static __always_inline u32 \
>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
>> +{ \
>> + int size = sizeof (type); \
>> + int off = (unsigned long)ptr % sizeof(u32); \
>> + volatile u32 *p = ptr - off; \
>> + int bitoff = BITOFF_CAL(size, off); \
>> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
>> + u32 oldv, newv, tmp; \
>> + u32 ret; \
>> + oldv = READ_ONCE(*p); \
>> + do { \
>> + ret = (oldv & bitmask) >> bitoff; \
>> + if (skip && ret != old) \
>> + break; \
>> + newv = (oldv & ~bitmask) | (new << bitoff); \
>> + tmp = oldv; \
>> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
>> + } while (tmp != oldv); \
>> + return ret; \
>> +}
>
> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
>
> Why did you choose to write it entirely in C?
>
Yes, you are right. More loads/stores will be done in the C code.
However, such xchg_u8/u16 is only used by qspinlock now, and I did not see any performance regression.
So I just wrote it in C, for simplicity. :)

Of course I have done xchg tests.
We run code just like xchg((u8*)&v, j++); in several threads,
and the result is:
[ 768.374264] use time[1550072]ns in xchg_u8_asm
[ 768.377102] use time[2826802]ns in xchg_u8_c

I think this is because there is one more load in the C version.
If possible, we can move such code into asm-generic/.

thanks
xinhui
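
For reference, a userspace approximation of that kind of contention test could
look like the sketch below. It uses C11 atomic_exchange() on a shared byte as a
stand-in, since the in-kernel xchg_u8_asm/xchg_u8_c helpers being compared are
not reproduced here; the thread and iteration counts are made up:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4			/* made-up numbers, illustration only */
#define NITERS   1000000

static _Atomic uint8_t v;

static void *worker(void *arg)
{
	(void)arg;
	for (unsigned int j = 0; j < NITERS; j++)
		atomic_exchange(&v, (uint8_t)j);	/* ~ xchg((u8 *)&v, j++) */
	return NULL;
}

int main(void)
{
	pthread_t t[NTHREADS];
	struct timespec a, b;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);

	printf("use time[%ld]ns\n",
	       (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec));
	return 0;
}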

2016-04-21 16:14:25

by Peter Zijlstra

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> yes, you are right. more load/store will be done in C code.
> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
> So just wrote in C, for simple. :)

Which is fine; but worthy of a note in your Changelog.

> Of course I have done xchg tests.
> we run code just like xchg((u8*)&v, j++); in several threads.
> and the result is,
> [ 768.374264] use time[1550072]ns in xchg_u8_asm
> [ 768.377102] use time[2826802]ns in xchg_u8_c
>
> I think this is because there is one more load in C.
> If possible, we can move such code in asm-generic/.

So I'm not actually _that_ familiar with the PPC LL/SC implementation;
but there are things a CPU can do to optimize these loops.

For example, a CPU might choose to not release the exclusive hold of the
line for a number of cycles, except when it passes SC or an interrupt
happens. This way there's a smaller chance the SC fails and inhibits
forward progress.

By doing the modification outside of the LL/SC you lose such
advantages.

And yes, doing a !exclusive load prior to the exclusive load leads to an
even bigger window where the data can get changed out from under you.

2016-04-22 02:00:34

by Pan Xinhui

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On 2016年04月21日 23:52, Boqun Feng wrote:
> On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
>> On 2016年04月20日 22:24, Peter Zijlstra wrote:
>>> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
>>>
>>>> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
>>>> +static __always_inline unsigned long \
>>>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
>>>> + unsigned long new); \
>>>> +static __always_inline u32 \
>>>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
>>>> +{ \
>>>> + int size = sizeof (type); \
>>>> + int off = (unsigned long)ptr % sizeof(u32); \
>>>> + volatile u32 *p = ptr - off; \
>>>> + int bitoff = BITOFF_CAL(size, off); \
>>>> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
>>>> + u32 oldv, newv, tmp; \
>>>> + u32 ret; \
>>>> + oldv = READ_ONCE(*p); \
>>>> + do { \
>>>> + ret = (oldv & bitmask) >> bitoff; \
>>>> + if (skip && ret != old) \
>>>> + break; \
>>>> + newv = (oldv & ~bitmask) | (new << bitoff); \
>>>> + tmp = oldv; \
>>>> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
>>>> + } while (tmp != oldv); \
>>>> + return ret; \
>>>> +}
>>>
>>> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
>>>
>>> Why did you choose to write it entirely in C?
>>>
>> yes, you are right. more load/store will be done in C code.
>> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
>> So just wrote in C, for simple. :)
>>
>> Of course I have done xchg tests.
>> we run code just like xchg((u8*)&v, j++); in several threads.
>> and the result is,
>> [ 768.374264] use time[1550072]ns in xchg_u8_asm
>
> How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
> loop with shifting and masking in it?
>
Yes, using 32-bit ll/sc loops.

It looks like:
__asm__ __volatile__(
"1: lwarx %0,0,%3\n"
" and %1,%0,%5\n"
" or %1,%1,%4\n"
PPC405_ERR77(0,%2)
" stwcx. %1,0,%3\n"
" bne- 1b"
: "=&r" (_oldv), "=&r" (tmp), "+m" (*(volatile unsigned int *)_p)
: "r" (_p), "r" (_newv), "r" (_oldv_mask)
: "cc", "memory");


> Regards,
> Boqun
>
>> [ 768.377102] use time[2826802]ns in xchg_u8_c
>>
>> I think this is because there is one more load in C.
>> If possible, we can move such code in asm-generic/.
>>
>> thanks
>> xinhui
>>

2016-04-22 03:13:16

by Boqun Feng

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On Fri, Apr 22, 2016 at 09:59:22AM +0800, Pan Xinhui wrote:
> On 2016年04月21日 23:52, Boqun Feng wrote:
> > On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
> >> On 2016年04月20日 22:24, Peter Zijlstra wrote:
> >>> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
> >>>
> >>>> +#define __XCHG_GEN(cmp, type, sfx, skip, v) \
> >>>> +static __always_inline unsigned long \
> >>>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old, \
> >>>> + unsigned long new); \
> >>>> +static __always_inline u32 \
> >>>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new) \
> >>>> +{ \
> >>>> + int size = sizeof (type); \
> >>>> + int off = (unsigned long)ptr % sizeof(u32); \
> >>>> + volatile u32 *p = ptr - off; \
> >>>> + int bitoff = BITOFF_CAL(size, off); \
> >>>> + u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff; \
> >>>> + u32 oldv, newv, tmp; \
> >>>> + u32 ret; \
> >>>> + oldv = READ_ONCE(*p); \
> >>>> + do { \
> >>>> + ret = (oldv & bitmask) >> bitoff; \
> >>>> + if (skip && ret != old) \
> >>>> + break; \
> >>>> + newv = (oldv & ~bitmask) | (new << bitoff); \
> >>>> + tmp = oldv; \
> >>>> + oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv); \
> >>>> + } while (tmp != oldv); \
> >>>> + return ret; \
> >>>> +}
> >>>
> >>> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
> >>>
> >>> Why did you choose to write it entirely in C?
> >>>
> >> yes, you are right. more load/store will be done in C code.
> >> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
> >> So just wrote in C, for simple. :)
> >>
> >> Of course I have done xchg tests.
> >> we run code just like xchg((u8*)&v, j++); in several threads.
> >> and the result is,
> >> [ 768.374264] use time[1550072]ns in xchg_u8_asm
> >
> > How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
> > loop with shifting and masking in it?
> >
> yes, using 32bit ll/sc loops.
>
> looks like:
> __asm__ __volatile__(
> "1: lwarx %0,0,%3\n"
> " and %1,%0,%5\n"
> " or %1,%1,%4\n"
> PPC405_ERR77(0,%2)
> " stwcx. %1,0,%3\n"
> " bne- 1b"
> : "=&r" (_oldv), "=&r" (tmp), "+m" (*(volatile unsigned int *)_p)
> : "r" (_p), "r" (_newv), "r" (_oldv_mask)
> : "cc", "memory");
>

Good, so this works for all ppc ISAs too.

Given the performance benefit (maybe caused by the reason Peter
mentioned), I think we should use this as the implementation of u8/u16
{cmp}xchg for now. For Power7 and later, we can always switch to a
lbarx/lharx version if an observable performance benefit can be achieved.

But the choice is left to you. After all, as you said, qspinlock is the
only user ;-)

Regards,
Boqun

>
> > Regards,
> > Boqun
> >
> >> [ 768.377102] use time[2826802]ns in xchg_u8_c
> >>
> >> I think this is because there is one more load in C.
> >> If possible, we can move such code in asm-generic/.
> >>
> >> thanks
> >> xinhui
> >>
>



2016-04-25 10:12:20

by Pan Xinhui

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16


On 2016年04月22日 00:13, Peter Zijlstra wrote:
> On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
>> yes, you are right. more load/store will be done in C code.
>> However such xchg_u8/u16 is just used by qspinlock now. and I did not see any performance regression.
>> So just wrote in C, for simple. :)
>
> Which is fine; but worthy of a note in your Changelog.
>
Will do that.

>> Of course I have done xchg tests.
>> we run code just like xchg((u8*)&v, j++); in several threads.
>> and the result is,
>> [ 768.374264] use time[1550072]ns in xchg_u8_asm
>> [ 768.377102] use time[2826802]ns in xchg_u8_c
>>
>> I think this is because there is one more load in C.
>> If possible, we can move such code in asm-generic/.
>
> So I'm not actually _that_ familiar with the PPC LL/SC implementation;
> but there are things a CPU can do to optimize these loops.
>
> For example, a CPU might choose to not release the exclusive hold of the
> line for a number of cycles, except when it passes SC or an interrupt
> happens. This way there's a smaller chance the SC fails and inhibits
> forward progress.
I am not sure if there is such a hardware optimization.

>
> By doing the modification outside of the LL/SC you loose such
> advantages.
>
> And yes, doing a !exclusive load prior to the exclusive load leads to an
> even bigger window where the data can get changed out from under you.
>
You are right.
We have observed such data changes between the two different loads.

2016-04-25 15:37:26

by Peter Zijlstra

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16

On Mon, Apr 25, 2016 at 06:10:51PM +0800, Pan Xinhui wrote:
> > So I'm not actually _that_ familiar with the PPC LL/SC implementation;
> > but there are things a CPU can do to optimize these loops.
> >
> > For example, a CPU might choose to not release the exclusive hold of the
> > line for a number of cycles, except when it passes SC or an interrupt
> > happens. This way there's a smaller chance the SC fails and inhibits
> > forward progress.

> I am not sure if there is such hardware optimization.

So I think the hardware must do _something_, otherwise competing cores
doing load-exclusive could live-lock a system, each one endlessly
breaking the exclusive ownership of the other and the store-conditional
always failing.

Of course, there are such implementations, and they tend to have to put
in explicit backoff loops; however, IIRC, PPC doesn't need that. (See
ARC for an example that needs to do this.)

2016-04-26 11:35:24

by Pan Xinhui

Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16


On 2016年04月25日 23:37, Peter Zijlstra wrote:
> On Mon, Apr 25, 2016 at 06:10:51PM +0800, Pan Xinhui wrote:
>>> So I'm not actually _that_ familiar with the PPC LL/SC implementation;
>>> but there are things a CPU can do to optimize these loops.
>>>
>>> For example, a CPU might choose to not release the exclusive hold of the
>>> line for a number of cycles, except when it passes SC or an interrupt
>>> happens. This way there's a smaller chance the SC fails and inhibits
>>> forward progress.
>
>> I am not sure if there is such hardware optimization.
>
> So I think the hardware must do _something_, otherwise competing cores
> doing load-exlusive could life-lock a system, each one endlessly
> breaking the exclusive ownership of the other and the store-conditional
> always failing.
>
It seems there is no such optimization.

We have observed SC failing almost all the time in a contention test, and then the code gets stuck in the loop. :(
One thread modifies val with LL/SC, and other threads just modify val without any respect to LL/SC.

So in the end, I chose to rewrite this patch in asm. :)

> Of course, there are such implementations, and they tend to have to put
> in explicit backoff loops; however, IIRC, PPC doesn't need that. (See
> ARC for an example that needs to do this.)
>

2016-04-27 09:19:50

by Pan Xinhui

Subject: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

From: Pan Xinhui <[email protected]>

Implement xchg{u8,u16}{local,relaxed}, and
cmpxchg{u8,u16}{,local,acquire,relaxed}.

It works on all ppc.

remove volatile of first parameter in __cmpxchg_local and __cmpxchg

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
---
change from v3:
rewrite in asm for the LL/SC.
remove volatile in __cmpxchg_local and __cmpxchg.
change from v2:
in the do{}while(), we save one load and use corresponding cmpxchg suffix.
Also add corresponding __cmpxchg_u32 function declaration in the __XCHG_GEN
change from V1:
rework totally.
---
arch/powerpc/include/asm/cmpxchg.h | 109 ++++++++++++++++++++++++++++++++++++-
1 file changed, 106 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
index 44efe73..8a3735f 100644
--- a/arch/powerpc/include/asm/cmpxchg.h
+++ b/arch/powerpc/include/asm/cmpxchg.h
@@ -7,6 +7,71 @@
#include <asm/asm-compat.h>
#include <linux/bug.h>

+#ifdef __BIG_ENDIAN
+#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
+#else
+#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
+#endif
+
+#define XCHG_GEN(type, sfx, cl) \
+static inline u32 __xchg_##type##sfx(void *p, u32 val) \
+{ \
+ unsigned int prev, prev_mask, tmp, bitoff, off; \
+ \
+ off = (unsigned long)p % sizeof(u32); \
+ bitoff = BITOFF_CAL(sizeof(type), off); \
+ p -= off; \
+ val <<= bitoff; \
+ prev_mask = (u32)(type)-1 << bitoff; \
+ \
+ __asm__ __volatile__( \
+"1: lwarx %0,0,%3\n" \
+" andc %1,%0,%5\n" \
+" or %1,%1,%4\n" \
+ PPC405_ERR77(0,%3) \
+" stwcx. %1,0,%3\n" \
+" bne- 1b\n" \
+ : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
+ : "r" (p), "r" (val), "r" (prev_mask) \
+ : "cc", cl); \
+ \
+ return prev >> bitoff; \
+}
+
+#define CMPXCHG_GEN(type, sfx, br, br2, cl) \
+static inline \
+u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new) \
+{ \
+ unsigned int prev, prev_mask, tmp, bitoff, off; \
+ \
+ off = (unsigned long)p % sizeof(u32); \
+ bitoff = BITOFF_CAL(sizeof(type), off); \
+ p -= off; \
+ old <<= bitoff; \
+ new <<= bitoff; \
+ prev_mask = (u32)(type)-1 << bitoff; \
+ \
+ __asm__ __volatile__( \
+ br \
+"1: lwarx %0,0,%3\n" \
+" and %1,%0,%6\n" \
+" cmpw 0,%1,%4\n" \
+" bne- 2f\n" \
+" andc %1,%0,%6\n" \
+" or %1,%1,%5\n" \
+ PPC405_ERR77(0,%3) \
+" stwcx. %1,0,%3\n" \
+" bne- 1b\n" \
+ br2 \
+ "\n" \
+"2:" \
+ : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
+ : "r" (p), "r" (old), "r" (new), "r" (prev_mask) \
+ : "cc", cl); \
+ \
+ return prev >> bitoff; \
+}
+
/*
* Atomic exchange
*
@@ -14,6 +79,11 @@
* the previous value stored there.
*/

+XCHG_GEN(u8, _local, "memory");
+XCHG_GEN(u8, _relaxed, "cc");
+XCHG_GEN(u16, _local, "memory");
+XCHG_GEN(u16, _relaxed, "cc");
+
static __always_inline unsigned long
__xchg_u32_local(volatile void *p, unsigned long val)
{
@@ -85,9 +155,13 @@ __xchg_u64_relaxed(u64 *p, unsigned long val)
#endif

static __always_inline unsigned long
-__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
+__xchg_local(void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_local(ptr, x);
+ case 2:
+ return __xchg_u16_local(ptr, x);
case 4:
return __xchg_u32_local(ptr, x);
#ifdef CONFIG_PPC64
@@ -103,6 +177,10 @@ static __always_inline unsigned long
__xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
{
switch (size) {
+ case 1:
+ return __xchg_u8_relaxed(ptr, x);
+ case 2:
+ return __xchg_u16_relaxed(ptr, x);
case 4:
return __xchg_u32_relaxed(ptr, x);
#ifdef CONFIG_PPC64
@@ -131,6 +209,15 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
* and return the old value of *p.
*/

+CMPXCHG_GEN(u8, , PPC_ATOMIC_ENTRY_BARRIER, PPC_ATOMIC_EXIT_BARRIER, "memory");
+CMPXCHG_GEN(u8, _local, , , "memory");
+CMPXCHG_GEN(u8, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
+CMPXCHG_GEN(u8, _relaxed, , , "cc");
+CMPXCHG_GEN(u16, , PPC_ATOMIC_ENTRY_BARRIER, PPC_ATOMIC_EXIT_BARRIER, "memory");
+CMPXCHG_GEN(u16, _local, , , "memory");
+CMPXCHG_GEN(u16, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
+CMPXCHG_GEN(u16, _relaxed, , , "cc");
+
static __always_inline unsigned long
__cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
{
@@ -312,10 +399,14 @@ __cmpxchg_u64_acquire(u64 *p, unsigned long old, unsigned long new)
#endif

static __always_inline unsigned long
-__cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
+__cmpxchg(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16(ptr, old, new);
case 4:
return __cmpxchg_u32(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -328,10 +419,14 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
}

static __always_inline unsigned long
-__cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
+__cmpxchg_local(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_local(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_local(ptr, old, new);
case 4:
return __cmpxchg_u32_local(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -348,6 +443,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_relaxed(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_relaxed(ptr, old, new);
case 4:
return __cmpxchg_u32_relaxed(ptr, old, new);
#ifdef CONFIG_PPC64
@@ -364,6 +463,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
unsigned int size)
{
switch (size) {
+ case 1:
+ return __cmpxchg_u8_acquire(ptr, old, new);
+ case 2:
+ return __cmpxchg_u16_acquire(ptr, old, new);
case 4:
return __cmpxchg_u32_acquire(ptr, old, new);
#ifdef CONFIG_PPC64
--
2.4.3
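
With these helpers in place, the size-dispatching xchg*()/cmpxchg*() wrappers
in this header accept 1- and 2-byte operands. A rough usage sketch follows;
the struct and field names are illustrative, not taken from qspinlock itself:

/* Usage sketch only; struct and field names are made up. */
#include <linux/atomic.h>
#include <linux/types.h>

struct example_lock {
	u8	locked;		/* 1-byte field, now usable with xchg/cmpxchg */
	u8	pending;
	u16	tail;
};

static inline bool example_trylock(struct example_lock *l)
{
	/* Dispatches to __cmpxchg_u8_acquire() via the size == 1 case. */
	return cmpxchg_acquire(&l->locked, 0, 1) == 0;
}

static inline void example_clear_pending(struct example_lock *l)
{
	/* Dispatches to __xchg_u8_relaxed() via the size == 1 case. */
	(void)xchg_relaxed(&l->pending, 0);
}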

2016-04-27 13:55:15

by Boqun Feng

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> Implement xchg{u8,u16}{local,relaxed}, and
> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>
> It works on all ppc.
>
> remove volatile of first parameter in __cmpxchg_local and __cmpxchg
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Pan Xinhui <[email protected]>
> ---
> change from v3:
> rewrite in asm for the LL/SC.
> remove volatile in __cmpxchg_local and __cmpxchg.
> change from v2:
> in the do{}while(), we save one load and use corresponding cmpxchg suffix.
> Also add corresponding __cmpxchg_u32 function declaration in the __XCHG_GEN
> change from V1:
> rework totally.
> ---
> arch/powerpc/include/asm/cmpxchg.h | 109 ++++++++++++++++++++++++++++++++++++-
> 1 file changed, 106 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> index 44efe73..8a3735f 100644
> --- a/arch/powerpc/include/asm/cmpxchg.h
> +++ b/arch/powerpc/include/asm/cmpxchg.h
> @@ -7,6 +7,71 @@
> #include <asm/asm-compat.h>
> #include <linux/bug.h>
>
> +#ifdef __BIG_ENDIAN
> +#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
> +#else
> +#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
> +#endif
> +
> +#define XCHG_GEN(type, sfx, cl) \
> +static inline u32 __xchg_##type##sfx(void *p, u32 val) \
> +{ \
> + unsigned int prev, prev_mask, tmp, bitoff, off; \
> + \
> + off = (unsigned long)p % sizeof(u32); \
> + bitoff = BITOFF_CAL(sizeof(type), off); \
> + p -= off; \
> + val <<= bitoff; \
> + prev_mask = (u32)(type)-1 << bitoff; \
> + \
> + __asm__ __volatile__( \
> +"1: lwarx %0,0,%3\n" \
> +" andc %1,%0,%5\n" \
> +" or %1,%1,%4\n" \
> + PPC405_ERR77(0,%3) \
> +" stwcx. %1,0,%3\n" \
> +" bne- 1b\n" \
> + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \

I think we can save the "tmp" here by:

__asm__ volatile__(
"1: lwarx %0,0,%2\n"
" andc %0,%0,%4\n"
" or %0,%0,%3\n"
PPC405_ERR77(0,%2)
" stwcx. %0,0,%2\n"
" bne- 1b\n"
: "=&r" (prev), "+m" (*(u32*)p)
: "r" (p), "r" (val), "r" (prev_mask)
: "cc", cl);

right?

> + : "r" (p), "r" (val), "r" (prev_mask) \
> + : "cc", cl); \
> + \
> + return prev >> bitoff; \
> +}
> +
> +#define CMPXCHG_GEN(type, sfx, br, br2, cl) \
> +static inline \
> +u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new) \
> +{ \
> + unsigned int prev, prev_mask, tmp, bitoff, off; \
> + \
> + off = (unsigned long)p % sizeof(u32); \
> + bitoff = BITOFF_CAL(sizeof(type), off); \
> + p -= off; \
> + old <<= bitoff; \
> + new <<= bitoff; \
> + prev_mask = (u32)(type)-1 << bitoff; \
> + \
> + __asm__ __volatile__( \
> + br \
> +"1: lwarx %0,0,%3\n" \
> +" and %1,%0,%6\n" \
> +" cmpw 0,%1,%4\n" \
> +" bne- 2f\n" \
> +" andc %1,%0,%6\n" \
> +" or %1,%1,%5\n" \
> + PPC405_ERR77(0,%3) \
> +" stwcx. %1,0,%3\n" \
> +" bne- 1b\n" \
> + br2 \
> + "\n" \
> +"2:" \
> + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \

And "tmp" here could also be saved by:

"1: lwarx %0,0,%2\n" \
" xor %3,%0,%3\n" \
" and. %3,%3,%5\n" \
" bne- 2f\n" \
" andc %0,%0,%5\n" \
" or %0,%0,%4\n" \
PPC405_ERR77(0,%2) \
" stwcx. %0,0,%2\n" \
" bne- 1b\n" \
br2 \
"\n" \
"2:" \
: "=&r" (prev), "+m" (*(u32*)p) \
: "r" (p), "r" (old), "r" (new), "r" (prev_mask) \
: "cc", cl); \

right?

IIUC, saving the local variable "tmp" will result in saving a general
register for the compiler to use for other variables.

So thoughts?

Regards,
Boqun

> + : "r" (p), "r" (old), "r" (new), "r" (prev_mask) \
> + : "cc", cl); \
> + \
> + return prev >> bitoff; \
> +}
> +
> /*
> * Atomic exchange
> *
> @@ -14,6 +79,11 @@
> * the previous value stored there.
> */
>
> +XCHG_GEN(u8, _local, "memory");
> +XCHG_GEN(u8, _relaxed, "cc");
> +XCHG_GEN(u16, _local, "memory");
> +XCHG_GEN(u16, _relaxed, "cc");
> +
> static __always_inline unsigned long
> __xchg_u32_local(volatile void *p, unsigned long val)
> {
> @@ -85,9 +155,13 @@ __xchg_u64_relaxed(u64 *p, unsigned long val)
> #endif
>
> static __always_inline unsigned long
> -__xchg_local(volatile void *ptr, unsigned long x, unsigned int size)
> +__xchg_local(void *ptr, unsigned long x, unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __xchg_u8_local(ptr, x);
> + case 2:
> + return __xchg_u16_local(ptr, x);
> case 4:
> return __xchg_u32_local(ptr, x);
> #ifdef CONFIG_PPC64
> @@ -103,6 +177,10 @@ static __always_inline unsigned long
> __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __xchg_u8_relaxed(ptr, x);
> + case 2:
> + return __xchg_u16_relaxed(ptr, x);
> case 4:
> return __xchg_u32_relaxed(ptr, x);
> #ifdef CONFIG_PPC64
> @@ -131,6 +209,15 @@ __xchg_relaxed(void *ptr, unsigned long x, unsigned int size)
> * and return the old value of *p.
> */
>
> +CMPXCHG_GEN(u8, , PPC_ATOMIC_ENTRY_BARRIER, PPC_ATOMIC_EXIT_BARRIER, "memory");
> +CMPXCHG_GEN(u8, _local, , , "memory");
> +CMPXCHG_GEN(u8, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
> +CMPXCHG_GEN(u8, _relaxed, , , "cc");
> +CMPXCHG_GEN(u16, , PPC_ATOMIC_ENTRY_BARRIER, PPC_ATOMIC_EXIT_BARRIER, "memory");
> +CMPXCHG_GEN(u16, _local, , , "memory");
> +CMPXCHG_GEN(u16, _acquire, , PPC_ACQUIRE_BARRIER, "memory");
> +CMPXCHG_GEN(u16, _relaxed, , , "cc");
> +
> static __always_inline unsigned long
> __cmpxchg_u32(volatile unsigned int *p, unsigned long old, unsigned long new)
> {
> @@ -312,10 +399,14 @@ __cmpxchg_u64_acquire(u64 *p, unsigned long old, unsigned long new)
> #endif
>
> static __always_inline unsigned long
> -__cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
> +__cmpxchg(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16(ptr, old, new);
> case 4:
> return __cmpxchg_u32(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -328,10 +419,14 @@ __cmpxchg(volatile void *ptr, unsigned long old, unsigned long new,
> }
>
> static __always_inline unsigned long
> -__cmpxchg_local(volatile void *ptr, unsigned long old, unsigned long new,
> +__cmpxchg_local(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_local(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_local(ptr, old, new);
> case 4:
> return __cmpxchg_u32_local(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -348,6 +443,10 @@ __cmpxchg_relaxed(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_relaxed(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_relaxed(ptr, old, new);
> case 4:
> return __cmpxchg_u32_relaxed(ptr, old, new);
> #ifdef CONFIG_PPC64
> @@ -364,6 +463,10 @@ __cmpxchg_acquire(void *ptr, unsigned long old, unsigned long new,
> unsigned int size)
> {
> switch (size) {
> + case 1:
> + return __cmpxchg_u8_acquire(ptr, old, new);
> + case 2:
> + return __cmpxchg_u16_acquire(ptr, old, new);
> case 4:
> return __cmpxchg_u32_acquire(ptr, old, new);
> #ifdef CONFIG_PPC64
> --
> 2.4.3
>



2016-04-27 14:13:08

by Boqun Feng

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 27, 2016 at 09:58:17PM +0800, Boqun Feng wrote:
> On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> > From: Pan Xinhui <[email protected]>
> >
> > Implement xchg{u8,u16}{local,relaxed}, and
> > cmpxchg{u8,u16}{,local,acquire,relaxed}.
> >
> > It works on all ppc.
> >
> > remove volatile of first parameter in __cmpxchg_local and __cmpxchg
> >
> > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Pan Xinhui <[email protected]>
> > ---
> > change from v3:
> > rewrite in asm for the LL/SC.
> > remove volatile in __cmpxchg_local and __cmpxchg.
> > change from v2:
> > in the do{}while(), we save one load and use corresponding cmpxchg suffix.
> > Also add corresponding __cmpxchg_u32 function declaration in the __XCHG_GEN
> > change from V1:
> > rework totally.
> > ---
> > arch/powerpc/include/asm/cmpxchg.h | 109 ++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 106 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> > index 44efe73..8a3735f 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -7,6 +7,71 @@
> > #include <asm/asm-compat.h>
> > #include <linux/bug.h>
> >
> > +#ifdef __BIG_ENDIAN
> > +#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
> > +#else
> > +#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
> > +#endif
> > +
> > +#define XCHG_GEN(type, sfx, cl) \
> > +static inline u32 __xchg_##type##sfx(void *p, u32 val) \
> > +{ \
> > + unsigned int prev, prev_mask, tmp, bitoff, off; \
> > + \
> > + off = (unsigned long)p % sizeof(u32); \
> > + bitoff = BITOFF_CAL(sizeof(type), off); \
> > + p -= off; \
> > + val <<= bitoff; \
> > + prev_mask = (u32)(type)-1 << bitoff; \
> > + \
> > + __asm__ __volatile__( \
> > +"1: lwarx %0,0,%3\n" \
> > +" andc %1,%0,%5\n" \
> > +" or %1,%1,%4\n" \
> > + PPC405_ERR77(0,%3) \
> > +" stwcx. %1,0,%3\n" \
> > +" bne- 1b\n" \
> > + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
>
> I think we can save the "tmp" here by:
>
> __asm__ volatile__(
> "1: lwarx %0,0,%2\n"
> " andc %0,%0,%4\n"
> " or %0,%0,%3\n"
> PPC405_ERR77(0,%2)
> " stwcx. %0,0,%2\n"
> " bne- 1b\n"
> : "=&r" (prev), "+m" (*(u32*)p)
> : "r" (p), "r" (val), "r" (prev_mask)
> : "cc", cl);
>
> right?
>
> > + : "r" (p), "r" (val), "r" (prev_mask) \
> > + : "cc", cl); \
> > + \
> > + return prev >> bitoff; \
> > +}
> > +
> > +#define CMPXCHG_GEN(type, sfx, br, br2, cl) \
> > +static inline \
> > +u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new) \
> > +{ \
> > + unsigned int prev, prev_mask, tmp, bitoff, off; \
> > + \
> > + off = (unsigned long)p % sizeof(u32); \
> > + bitoff = BITOFF_CAL(sizeof(type), off); \
> > + p -= off; \
> > + old <<= bitoff; \
> > + new <<= bitoff; \
> > + prev_mask = (u32)(type)-1 << bitoff; \
> > + \
> > + __asm__ __volatile__( \
> > + br \
> > +"1: lwarx %0,0,%3\n" \
> > +" and %1,%0,%6\n" \
> > +" cmpw 0,%1,%4\n" \
> > +" bne- 2f\n" \
> > +" andc %1,%0,%6\n" \
> > +" or %1,%1,%5\n" \
> > + PPC405_ERR77(0,%3) \
> > +" stwcx. %1,0,%3\n" \
> > +" bne- 1b\n" \
> > + br2 \
> > + "\n" \
> > +"2:" \
> > + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
>
> And "tmp" here could also be saved by:
>
> "1: lwarx %0,0,%2\n" \
> " xor %3,%0,%3\n" \
> " and. %3,%3,%5\n" \
> " bne- 2f\n" \
> " andc %0,%0,%5\n" \
> " or %0,%0,%4\n" \
> PPC405_ERR77(0,%2) \
> " stwcx. %0,0,%2\n" \
> " bne- 1b\n" \
> br2 \
> "\n" \
> "2:" \
> : "=&r" (prev), "+m" (*(u32*)p) \
> : "r" (p), "r" (old), "r" (new), "r" (prev_mask) \
> : "cc", cl); \
>

Oops, this should be:

"1: lwarx %0,0,%3\n" \
" xor %2,%0,%2\n" \
" and. %2,%2,%5\n" \
" bne- 2f\n" \
" andc %0,%0,%5\n" \
" or %0,%0,%4\n" \
PPC405_ERR77(0,%3) \
" stwcx. %0,0,%3\n" \
" bne- 1b\n" \
br2 \
"\n" \
"2:" \
: "=&r" (prev), "+m" (*(u32*)p), "+&r" (old) \
: "r" (p), "r" (new), "r" (prev_mask) \
: "cc", cl); \

Regards,
Boqun



2016-04-27 14:47:30

by Boqun Feng

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 27, 2016 at 09:58:17PM +0800, Boqun Feng wrote:
> On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> > From: Pan Xinhui <[email protected]>
> >
> > Implement xchg{u8,u16}{local,relaxed}, and
> > cmpxchg{u8,u16}{,local,acquire,relaxed}.
> >
> > It works on all ppc.
> >
> > remove volatile of first parameter in __cmpxchg_local and __cmpxchg
> >
> > Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> > Signed-off-by: Pan Xinhui <[email protected]>
> > ---
> > change from v3:
> > rewrite in asm for the LL/SC.
> > remove volatile in __cmpxchg_local and __cmpxchg.
> > change from v2:
> > in the do{}while(), we save one load and use corresponding cmpxchg suffix.
> > Also add corresponding __cmpxchg_u32 function declaration in the __XCHG_GEN
> > change from V1:
> > rework totally.
> > ---
> > arch/powerpc/include/asm/cmpxchg.h | 109 ++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 106 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/cmpxchg.h b/arch/powerpc/include/asm/cmpxchg.h
> > index 44efe73..8a3735f 100644
> > --- a/arch/powerpc/include/asm/cmpxchg.h
> > +++ b/arch/powerpc/include/asm/cmpxchg.h
> > @@ -7,6 +7,71 @@
> > #include <asm/asm-compat.h>
> > #include <linux/bug.h>
> >
> > +#ifdef __BIG_ENDIAN
> > +#define BITOFF_CAL(size, off) ((sizeof(u32) - size - off) * BITS_PER_BYTE)
> > +#else
> > +#define BITOFF_CAL(size, off) (off * BITS_PER_BYTE)
> > +#endif
> > +
> > +#define XCHG_GEN(type, sfx, cl) \
> > +static inline u32 __xchg_##type##sfx(void *p, u32 val) \
> > +{ \
> > + unsigned int prev, prev_mask, tmp, bitoff, off; \
> > + \
> > + off = (unsigned long)p % sizeof(u32); \
> > + bitoff = BITOFF_CAL(sizeof(type), off); \
> > + p -= off; \
> > + val <<= bitoff; \
> > + prev_mask = (u32)(type)-1 << bitoff; \
> > + \
> > + __asm__ __volatile__( \
> > +"1: lwarx %0,0,%3\n" \
> > +" andc %1,%0,%5\n" \
> > +" or %1,%1,%4\n" \
> > + PPC405_ERR77(0,%3) \
> > +" stwcx. %1,0,%3\n" \
> > +" bne- 1b\n" \
> > + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
>
> I think we can save the "tmp" here by:
>
> __asm__ volatile__(
> "1: lwarx %0,0,%2\n"
> " andc %0,%0,%4\n"
> " or %0,%0,%3\n"
> PPC405_ERR77(0,%2)
> " stwcx. %0,0,%2\n"
> " bne- 1b\n"
> : "=&r" (prev), "+m" (*(u32*)p)
> : "r" (p), "r" (val), "r" (prev_mask)
> : "cc", cl);
>
> right?
>
> > + : "r" (p), "r" (val), "r" (prev_mask) \
> > + : "cc", cl); \
> > + \
> > + return prev >> bitoff; \
> > +}
> > +
> > +#define CMPXCHG_GEN(type, sfx, br, br2, cl) \
> > +static inline \
> > +u32 __cmpxchg_##type##sfx(void *p, u32 old, u32 new) \
> > +{ \
> > + unsigned int prev, prev_mask, tmp, bitoff, off; \
> > + \
> > + off = (unsigned long)p % sizeof(u32); \
> > + bitoff = BITOFF_CAL(sizeof(type), off); \
> > + p -= off; \
> > + old <<= bitoff; \
> > + new <<= bitoff; \
> > + prev_mask = (u32)(type)-1 << bitoff; \
> > + \
> > + __asm__ __volatile__( \
> > + br \
> > +"1: lwarx %0,0,%3\n" \
> > +" and %1,%0,%6\n" \
> > +" cmpw 0,%1,%4\n" \
> > +" bne- 2f\n" \
> > +" andc %1,%0,%6\n" \
> > +" or %1,%1,%5\n" \
> > + PPC405_ERR77(0,%3) \
> > +" stwcx. %1,0,%3\n" \
> > +" bne- 1b\n" \
> > + br2 \
> > + "\n" \
> > +"2:" \
> > + : "=&r" (prev), "=&r" (tmp), "+m" (*(u32*)p) \
>
> And "tmp" here could also be saved by:
>
> "1: lwarx %0,0,%2\n" \
> " xor %3,%0,%3\n" \
> " and. %3,%3,%5\n" \
> " bne- 2f\n" \
> " andc %0,%0,%5\n" \
> " or %0,%0,%4\n" \
> PPC405_ERR77(0,%2) \
> " stwcx. %0,0,%2\n" \
> " bne- 1b\n" \
> br2 \
> "\n" \
> "2:" \
> : "=&r" (prev), "+m" (*(u32*)p) \
> : "r" (p), "r" (old), "r" (new), "r" (prev_mask) \
> : "cc", cl); \
>
> right?
>

Sorry, my bad, we can't implement cmpxchg like this.. please ignore
this, I should really go to bed soon...

But still, we can save the "tmp" for xchg() I think.

Regards,
Boqun

> IIUC, saving the local variable "tmp" will result in saving a general
> register for the compilers to use for other variables.
>
> So thoughts?
>
> Regards,
> Boqun
>



2016-04-27 14:56:34

by Boqun Feng

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 27, 2016 at 10:50:34PM +0800, Boqun Feng wrote:
>
> Sorry, my bad, we can't implement cmpxchg like this.. please ignore
> this, I should really go to bed soon...
>
> But still, we can save the "tmp" for xchg() I think.
>

No.. we can't. Sorry for all the noise.

This patch looks good to me.

FWIW, you can add

Acked-by: Boqun Feng <[email protected]>

Regards,
Boqun



2016-04-28 07:59:23

by Peter Zijlstra

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> Implement xchg{u8,u16}{local,relaxed}, and
> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>
> It works on all ppc.
>
> remove volatile of first parameter in __cmpxchg_local and __cmpxchg
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Pan Xinhui <[email protected]>

Generally has the right shape; and I trust others to double check the
ppc-asm minutiae.

Acked-by: Peter Zijlstra (Intel) <[email protected]>


2016-04-28 10:22:46

by Pan Xinhui

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16



On 2016年04月27日 22:59, Boqun Feng wrote:
> On Wed, Apr 27, 2016 at 10:50:34PM +0800, Boqun Feng wrote:
>>
>> Sorry, my bad, we can't implement cmpxchg like this.. please ignore
>> this, I should really go to bed soon...
>>
>> But still, we can save the "tmp" for xchg() I think.
>>
>
> No.. we can't. Sorry for all the noise.
>
> This patch looks good to me.
>
> FWIW, you can add
>
> Acked-by: Boqun Feng <[email protected]>
>
thanks!

> Regards,
> Boqun
>

2016-04-28 11:54:31

by Pan Xinhui

Subject: Re: [PATCH V4] powerpc: Implement {cmp}xchg for u8 and u16



On 2016年04月28日 15:59, Peter Zijlstra wrote:
> On Wed, Apr 27, 2016 at 05:16:45PM +0800, Pan Xinhui wrote:
>> From: Pan Xinhui <[email protected]>
>>
>> Implement xchg{u8,u16}{local,relaxed}, and
>> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>>
>> It works on all ppc.
>>
>> remove volatile of first parameter in __cmpxchg_local and __cmpxchg
>>
>> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
>> Signed-off-by: Pan Xinhui <[email protected]>
>
> Generally has the right shape; and I trust others to double check the
> ppc-asm minutia.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>
>
>
thanks!

2016-11-25 00:04:28

by Michael Ellerman

Subject: Re: [V4] powerpc: Implement {cmp}xchg for u8 and u16

On Wed, 2016-04-27 at 09:16:45 UTC, xinhui wrote:
> From: Pan Xinhui <[email protected]>
>
> Implement xchg{u8,u16}{local,relaxed}, and
> cmpxchg{u8,u16}{,local,acquire,relaxed}.
>
> It works on all ppc.
>
> remove volatile of first parameter in __cmpxchg_local and __cmpxchg
>
> Suggested-by: Peter Zijlstra (Intel) <[email protected]>
> Signed-off-by: Pan Xinhui <[email protected]>
> Acked-by: Boqun Feng <[email protected]>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d0563a1297e234ed37f6b51c2e9321

cheers