Date: Wed, 26 Sep 2018 22:52:08 +0200
From: Andrea Parri
To: Peter Zijlstra
Cc: will.deacon@arm.com, mingo@kernel.org, linux-kernel@vger.kernel.org,
	longman@redhat.com, tglx@linutronix.de
Subject: Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
Message-ID: <20180926205208.GA4864@andrea>
References: <20180926110117.405325143@infradead.org>
	<20180926111307.513429499@infradead.org>
In-Reply-To: <20180926111307.513429499@infradead.org>

On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> On x86 we cannot do fetch_or with a single instruction and end up
> using a cmpxchg loop, which reduces determinism. Replace the fetch_or
> with a very tricky composite xchg8 + load.
>
> The basic idea is that we use xchg8 to test-and-set the pending bit
> (when it is a byte) and then a load to fetch the whole word. Using
> two instructions of course opens a window we previously did not have.
> In particular the ordering between pending and tail is of interest,
> because that is where the split happens.
>
> The claim is that if we order them, it all works out just fine. There
> are two specific cases where the pending,tail state changes:
>
>  - when the 3rd lock(er) comes in and finds pending set, it'll queue
>    and set tail; since we set tail while pending is set, the ordering
>    of the split is not important (and not fundamentally different from
>    fetch_or). [*]
>
>  - when the last queued lock holder acquires the lock (uncontended),
>    we clear the tail and set the lock byte. By first setting the
>    pending bit this cmpxchg will fail and the later load must then
>    see the remaining tail.
>
> Another interesting scenario is where there are only 2 threads:
>
> 	lock := (0,0,0)
>
> 	CPU 0			CPU 1
>
> 	lock()			lock()
> 	  trylock(-> 0,0,1)	  trylock() /* fail */
> 	  return;		  xchg_relaxed(pending, 1) (-> 0,1,1)
> 				  mb()
> 				  val = smp_load_acquire(*lock);
>
> Where, without the mb() the load would've been allowed to return 0 for
> the locked byte.
If this were true, we would have a violation of "coherence":

C peterz

{}

P0(atomic_t *val)
{
	int r0;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1);
}

P1(atomic_t *val)
{
	int r0;
	int r1;

	/* in queued_spin_trylock() */
	r0 = atomic_cmpxchg_acquire(val, 0, 1); /* or atomic_read(val) */

	/* in queued_spin_lock_slowpath() -- set_pending_fetch_acquire() */
	r1 = atomic_read_acquire(val);
}

exists (0:r0=0 /\ ~1:r0=0 /\ 1:r1=0)

  Andrea


>
> [*] there is a fun scenario where the 3rd locker observes the pending
> bit, but before it sets the tail, the 1st lock (owner) unlocks and the
> 2nd locker acquires the lock, leaving us with a temporary 0,0,1 state.
> But this is not different between fetch_or and this new method.
>
> Suggested-by: Will Deacon
> Signed-off-by: Peter Zijlstra (Intel)
> ---
>  arch/x86/include/asm/qspinlock.h |    2 +
>  kernel/locking/qspinlock.c       |   56 ++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 57 insertions(+), 1 deletion(-)
>
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -9,6 +9,8 @@
>  
>  #define _Q_PENDING_LOOPS	(1 << 9)
>  
> +#define _Q_NO_FETCH_OR
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
>  extern void __pv_init_lock_hash(void);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -232,6 +232,60 @@ static __always_inline u32 xchg_tail(str
>  #endif /* _Q_PENDING_BITS == 8 */
>  
>  /**
> + * set_pending_fetch_acquire - fetch the whole lock value and set pending and locked
> + * @lock : Pointer to queued spinlock structure
> + * Return: The previous lock value
> + *
> + * *,*,* -> *,1,*
> + */
> +static __always_inline u32 set_pending_fetch_acquire(struct qspinlock *lock)
> +{
> +#if defined(_Q_NO_FETCH_OR) && _Q_PENDING_BITS == 8
> +	u32 val;
> +
> +	/*
> +	 * For the !LL/SC archs that do not have a native atomic_fetch_or
> +	 * but do have a native xchg (x86) we can do trickery and avoid the
> +	 * cmpxchg-loop based fetch-or to improve determinism.
> +	 *
> +	 * We first xchg the pending byte to set PENDING, and then issue a load
> +	 * for the rest of the word and mask out the pending byte to compute a
> +	 * 'full' value.
> +	 */
> +	val = xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;
> +	/*
> +	 * Ensures the tail load happens after the xchg().
> +	 *
> +	 *	  lock  unlock    (a)
> +	 *   xchg ---------------.
> +	 *    (b)  lock  unlock  +----- fetch_or
> +	 *   load ---------------'
> +	 *	  lock  unlock    (c)
> +	 *
> +	 * For both lock and unlock, (a) and (c) are the same as fetch_or(),
> +	 * since either is fully before or after. The only new case is (b),
> +	 * where a concurrent lock or unlock can interleave with the composite
> +	 * operation.
> +	 *
> +	 * In either case, it is the queueing case that is of interest; otherwise
> +	 * the whole operation is covered by the xchg() and the tail will be 0.
> +	 *
> +	 * For lock-(b); we only care if we set the PENDING bit or not. If we lost
> +	 * the PENDING race, we queue. Otherwise we'd observe the tail anyway.
> +	 *
> +	 * For unlock-(b); since we'll have set PENDING, the uncontended claim
> +	 * will fail and we'll observe a !0 tail.
> +	 */
> +	smp_mb__after_atomic();
> +	val |= atomic_read_acquire(&lock->val) & ~_Q_PENDING_MASK;
> +
> +	return val;
> +#else
> +	return atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +#endif
> +}
> +
> +/**
>   * set_locked - Set the lock bit and own the lock
>   * @lock: Pointer to queued spinlock structure
>   *
> @@ -328,7 +382,7 @@ void queued_spin_lock_slowpath(struct qs
>  	 *
>  	 * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
>  	 */
> -	val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);
> +	val = set_pending_fetch_acquire(lock);
>  
>  	/*
>  	 * If we observe contention, there was a concurrent lock.
>
>
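
A side note on the litmus test above: assuming herd7 and the Linux-kernel
memory model from tools/memory-model/ are installed (the file name
"peterz.litmus" below is only a placeholder for wherever the test is saved),
it can be checked against the model roughly as described in the
tools/memory-model/README:

	$ cd tools/memory-model
	$ herd7 -conf linux-kernel.cfg peterz.litmus

herd7 then reports whether the "exists" clause is ever satisfied, i.e.
whether the coherence-violating outcome singled out above is observable
under the model.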