Message-ID: <5719857A.5080201@linux.vnet.ibm.com>
Date: Fri, 22 Apr 2016 09:59:22 +0800
From: Pan Xinhui <xinhui@linux.vnet.ibm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.8.0
MIME-Version: 1.0
To: Boqun Feng <boqun.feng@gmail.com>
CC: Peter Zijlstra <peterz@infradead.org>, linux-kernel@vger.kernel.org,
        linuxppc-dev@lists.ozlabs.org, benh@kernel.crashing.org,
        paulus@samba.org, mpe@ellerman.id.au, paulmck@linux.vnet.ibm.com,
        tglx@linutronix.de
Subject: Re: [PATCH V3] powerpc: Implement {cmp}xchg for u8 and u16
References: <5715D04E.9050009@linux.vnet.ibm.com> <571782F0.2020201@linux.vnet.ibm.com> <20160420142408.GF3430@twins.programming.kicks-ass.net> <5718F32B.3050409@linux.vnet.ibm.com> <20160421155257.GA20657@insomnia>
In-Reply-To: <20160421155257.GA20657@insomnia>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2474
Lines: 73

On 2016-04-21 23:52, Boqun Feng wrote:
> On Thu, Apr 21, 2016 at 11:35:07PM +0800, Pan Xinhui wrote:
>> On 2016-04-20 22:24, Peter Zijlstra wrote:
>>> On Wed, Apr 20, 2016 at 09:24:00PM +0800, Pan Xinhui wrote:
>>>
>>>> +#define __XCHG_GEN(cmp, type, sfx, skip, v)				\
>>>> +static __always_inline unsigned long					\
>>>> +__cmpxchg_u32##sfx(v unsigned int *p, unsigned long old,		\
>>>> +			 unsigned long new);				\
>>>> +static __always_inline u32						\
>>>> +__##cmp##xchg_##type##sfx(v void *ptr, u32 old, u32 new)		\
>>>> +{									\
>>>> +	int size = sizeof (type);					\
>>>> +	int off = (unsigned long)ptr % sizeof(u32);			\
>>>> +	volatile u32 *p = ptr - off;					\
>>>> +	int bitoff = BITOFF_CAL(size, off);				\
>>>> +	u32 bitmask = ((0x1 << size * BITS_PER_BYTE) - 1) << bitoff;	\
>>>> +	u32 oldv, newv, tmp;						\
>>>> +	u32 ret;							\
>>>> +	oldv = READ_ONCE(*p);						\
>>>> +	do {								\
>>>> +		ret = (oldv & bitmask) >> bitoff;			\
>>>> +		if (skip && ret != old)					\
>>>> +			break;						\
>>>> +		newv = (oldv & ~bitmask) | (new << bitoff);		\
>>>> +		tmp = oldv;						\
>>>> +		oldv = __cmpxchg_u32##sfx((v u32*)p, oldv, newv);	\
>>>> +	} while (tmp != oldv);						\
>>>> +	return ret;							\
>>>> +}
>>>
>>> So for an LL/SC based arch using cmpxchg() like that is sub-optimal.
>>>
>>> Why did you choose to write it entirely in C?
>>>
>> Yes, you are right: the C version performs more loads and stores.
>> However, xchg_u8/u16 is currently used only by qspinlock, and I did not see any performance regression there.
>> So I just wrote it in C, for simplicity. :)
>>
>> Of course, I have run xchg tests.
>> We ran code like xchg((u8*)&v, j++); in several threads,
>> and the result is:
>> [  768.374264] use time[1550072]ns in xchg_u8_asm
> 
> How was xchg_u8_asm() implemented, using lbarx or using a 32bit ll/sc
> loop with shifting and masking in it?
> 
Yes, it uses a 32-bit ll/sc loop.

It looks like this:
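        __asm__ __volatile__(
"1:     lwarx   %0,0,%3\n"      /* load-reserve the containing 32-bit word */
"       and     %1,%0,%5\n"     /* clear the target field (assuming _oldv_mask has those bits zero) */
"       or      %1,%1,%4\n"     /* merge in _newv (assumed to be pre-shifted into position) */
        PPC405_ERR77(0,%2)      /* PPC405 erratum #77 workaround */
"       stwcx.  %1,0,%3\n"      /* store-conditional; fails if the reservation was lost */
"       bne-    1b"             /* retry on failure */
        : "=&r" (_oldv), "=&r" (tmp), "+m" (*(volatile unsigned int *)_p)
        : "r" (_p), "r" (_newv), "r" (_oldv_mask)
        : "cc", "memory");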
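
For comparison, here is a rough sketch of the same masked update written
against a 32-bit compare-and-swap in portable C11. It is only an
illustration of the technique, not the code that was benchmarked:
xchg_u8_in_word() and its bitoff parameter are hypothetical names, and the
caller is assumed to have already computed the bit offset of the byte
inside the aligned word (the job BITOFF_CAL does in the patch).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): exchange one byte that lives
 * inside an aligned 32-bit word, using only a 32-bit CAS.
 * bitoff is the position of the byte within the word value: 0, 8, 16 or 24.
 */
static uint8_t xchg_u8_in_word(_Atomic uint32_t *word, unsigned bitoff,
			       uint8_t newval)
{
	assert(bitoff <= 24 && bitoff % 8 == 0);

	const uint32_t mask = (uint32_t)0xff << bitoff;
	const uint32_t newbits = (uint32_t)newval << bitoff;

	/* Extra load before the loop; the ll/sc version does not need it. */
	uint32_t oldw = atomic_load_explicit(word, memory_order_relaxed);

	/* On failure oldw is refreshed with the current word, so just retry. */
	while (!atomic_compare_exchange_weak_explicit(word, &oldw,
						      (oldw & ~mask) | newbits,
						      memory_order_seq_cst,
						      memory_order_relaxed))
		;

	/* oldw now holds the word that was actually replaced. */
	return (uint8_t)((oldw & mask) >> bitoff);
}
```

The initial atomic_load_explicit() is the kind of additional load a
cmpxchg-based loop needs and an lwarx/stwcx. loop avoids, since lwarx
itself supplies the old word.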


> Regards,
> Boqun
> 
>> [  768.377102] use time[2826802]ns in xchg_u8_c
>>
>> I think this is because the C version performs one more load.
>> If possible, we could move this code into asm-generic/.
>>
>> thanks
>> xinhui
>>