Date: Wed, 27 Jan 2021 22:21:58 +0000
From: Will Deacon
To: Alexander A Sverdlin
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
	Russell King, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer
Message-ID: <20210127222158.GB848@willie-the-truck>
References: <20210127200109.16412-1-alexander.sverdlin@nokia.com>
In-Reply-To: <20210127200109.16412-1-alexander.sverdlin@nokia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin
> 
> Ensure writes are pushed out of the core write buffer to prevent code
> waiting on other cores from spinning longer than necessary.
> 
> 6 threads running a tight spinlock loop competing for the same lock
> on 6 cores on MIPS/Octeon do 1000000 iterations...
> 
> before the patch in: 4.3 sec
> after the patch in:  1.2 sec

If you only have 6 cores, I'm not sure qspinlock makes any sense...

> Same 6-core Octeon machine:
> sysbench --test=mutex --num-threads=64 --memory-scope=local run
> 
> w/o patch:  1.53s
> with patch: 1.28s
> 
> This will also allow removing the smp_wmb() in
> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
> issue?).
> 
> Finally, our internal, quite diverse test suite covering different
> IPC/network aspects didn't detect any regressions on ARM/ARM64/x86_64.
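
For reference, the "tight spinlock loop" quoted above presumably boils down
to something like the following user-space approximation (an illustrative
sketch only: it hammers a pthread spinlock rather than the in-kernel
qspinlock, and the thread and iteration counts are simply taken from the
description above):

/*
 * Sketch of the described benchmark shape, not the submitter's harness.
 * Build with: gcc -O2 -pthread spin_bench.c -o spin_bench
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS	6
#define NITERS		1000000

static pthread_spinlock_t lock;
static unsigned long counter;

static void *worker(void *arg)
{
	for (int i = 0; i < NITERS; i++) {
		pthread_spin_lock(&lock);	/* all threads fight over one lock */
		counter++;			/* deliberately tiny critical section */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	printf("total increments: %lu\n", counter);
	return 0;
}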
> 
> Signed-off-by: Alexander Sverdlin
> ---
>  kernel/locking/mcs_spinlock.h | 5 +++++
>  kernel/locking/qspinlock.c    | 6 ++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>  		return;
>  	}
>  	WRITE_ONCE(prev->next, node);
> +	/*
> +	 * This is necessary to make sure that the corresponding "while" in the
> +	 * mcs_spin_unlock() doesn't loop forever
> +	 */
> +	smp_wmb();

If it loops forever, that's broken hardware design; store buffers need to
drain. I don't think we should add unconditional barriers to bodge this.

>  	/* Wait until the lock holder passes the lock down. */
>  	arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> 
>  		/* Link @node into the waitqueue. */
>  		WRITE_ONCE(prev->next, node);
> +		/*
> +		 * This is necessary to make sure that the corresponding
> +		 * smp_cond_load_relaxed() below (running on another core)
> +		 * doesn't spin forever.
> +		 */
> +		smp_wmb();

Likewise.

Will
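
For readers following the thread, the interaction being argued about is
roughly the following (a deliberately simplified, user-space sketch using
C11 atomics in place of WRITE_ONCE()/arch_mcs_spin_lock_contended()/
smp_cond_load_relaxed(); it approximates the contended path in
kernel/locking/mcs_spinlock.h and is not the kernel code):

/* Simplified MCS-style handoff sketch (C11 atomics), not kernel code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool locked;		/* set when the lock is passed to us */
};

/* Waiter: queue behind @prev, then spin until the lock is handed over. */
static void mcs_lock_contended(struct mcs_node *prev, struct mcs_node *node)
{
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&node->locked, false, memory_order_relaxed);

	/* Stands in for WRITE_ONCE(prev->next, node). */
	atomic_store_explicit(&prev->next, node, memory_order_release);

	/*
	 * The patch adds smp_wmb() here so that the store above leaves the
	 * core's write buffer quickly; the objection above is that store
	 * buffers must drain anyway, so the unlocker's spin below terminates
	 * without an extra barrier.
	 */

	/* Stands in for arch_mcs_spin_lock_contended(&node->locked). */
	while (!atomic_load_explicit(&node->locked, memory_order_acquire))
		;
}

/* Unlocker: wait for the successor to show up, then pass the lock on. */
static void mcs_unlock_contended(struct mcs_node *node)
{
	struct mcs_node *next;

	/* Stands in for the smp_cond_load_relaxed() spin on node->next. */
	while (!(next = atomic_load_explicit(&node->next, memory_order_acquire)))
		;

	atomic_store_explicit(&next->locked, true, memory_order_release);
}

The real unlock path also copes with the window where no successor has
linked itself in yet (via a cmpxchg on the queue tail); that part is omitted
here, since the disagreement is only about whether the marked store needs an
explicit barrier to leave the write buffer promptly.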