2021-01-28 00:21:14

by Alexander Sverdlin

Subject: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

From: Alexander Sverdlin <[email protected]>

Ensure writes are pushed out of the core write buffer to prevent code
waiting on other cores from spinning longer than necessary.

6 threads running a tight spinlock loop, competing for the same lock
on 6 cores on MIPS/Octeon, complete 1000000 iterations...

before the patch in: 4.3 sec
after the patch in: 1.2 sec
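
For reference, a minimal sketch of that kind of test (hypothetical
kernel-module code, not the harness actually used; all names here are
made up):

/*
 * Sketch only: one kthread per online CPU, all hammering one spinlock.
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/cpumask.h>
#include <linux/sched.h>
#include <linux/err.h>

static DEFINE_SPINLOCK(bench_lock);
static unsigned long bench_counter;

static int bench_thread(void *unused)
{
	int i;

	for (i = 0; i < 1000000; i++) {
		spin_lock(&bench_lock);
		bench_counter++;	/* trivial critical section */
		spin_unlock(&bench_lock);
	}
	return 0;
}

static int __init bench_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct task_struct *t;

		t = kthread_create(bench_thread, NULL, "lock_bench/%d", cpu);
		if (IS_ERR(t))
			continue;
		kthread_bind(t, cpu);	/* one spinner per core */
		wake_up_process(t);
	}
	return 0;
}
module_init(bench_init);
MODULE_LICENSE("GPL");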

Same 6-core Octeon machine:
sysbench --test=mutex --num-threads=64 --memory-scope=local run

w/o patch: 1.53s
with patch: 1.28s

This will also allow removing the smp_wmb() in
arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
issue?).

Finally, our internal, quite diverse test suite covering different
IPC/network aspects didn't detect any regressions on ARM/ARM64/x86_64.

Signed-off-by: Alexander Sverdlin <[email protected]>
---
kernel/locking/mcs_spinlock.h | 5 +++++
kernel/locking/qspinlock.c | 6 ++++++
2 files changed, 11 insertions(+)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153..10e497a 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
return;
}
WRITE_ONCE(prev->next, node);
+ /*
+ * This is necessary to make sure that the corresponding "while" in the
+ * mcs_spin_unlock() doesn't loop forever
+ */
+ smp_wmb();

/* Wait until the lock holder passes the lock down. */
arch_mcs_spin_lock_contended(&node->locked);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba..577fe01 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
+ /*
+ * This is necessary to make sure that the corresponding
+ * smp_cond_load_relaxed() below (running on another core)
+ * doesn't spin forever.
+ */
+ smp_wmb();

pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
--
2.10.2


2021-01-28 00:22:46

by Alexander Sverdlin

Subject: [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended()

From: Alexander Sverdlin <[email protected]>

Drop the smp_wmb() in arch_mcs_spin_lock_contended() after adding it to the
arch-independent code.

Signed-off-by: Alexander Sverdlin <[email protected]>
---
arch/arm/include/asm/mcs_spinlock.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf..d8fa921 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -8,8 +8,6 @@
/* MCS spin-locking. */
#define arch_mcs_spin_lock_contended(lock) \
do { \
- /* Ensure prior stores are observed before we enter wfe. */ \
- smp_mb(); \
while (!(smp_load_acquire(lock))) \
wfe(); \
} while (0) \
--
2.10.2

2021-01-28 00:44:20

by Will Deacon

Subject: Re: [PATCH 2/2] ARM: mcs_spinlock: Drop smp_wmb in arch_mcs_spin_lock_contended()

On Wed, Jan 27, 2021 at 09:01:09PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <[email protected]>
>
> Drop smp_wmb in arch_mcs_spin_lock_contended() after adding in into
> ARCH-independent code.
>
> Signed-off-by: Alexander Sverdlin <[email protected]>
> ---
> arch/arm/include/asm/mcs_spinlock.h | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
> index 529d2cf..d8fa921 100644
> --- a/arch/arm/include/asm/mcs_spinlock.h
> +++ b/arch/arm/include/asm/mcs_spinlock.h
> @@ -8,8 +8,6 @@
> /* MCS spin-locking. */
> #define arch_mcs_spin_lock_contended(lock) \
> do { \
> - /* Ensure prior stores are observed before we enter wfe. */ \
> - smp_mb(); \

I think this is the right place for the barrier, not in the core code.

Will

2021-01-28 00:44:26

by Peter Zijlstra

Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <[email protected]>
>
> Ensure writes are pushed out of core write buffer to prevent waiting code
> on another cores from spinning longer than necessary.

Our smp_wmb() as defined does not have that property. You're relying on
some arch specific details which do not belong in common code.
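
For reference, what smp_wmb() does guarantee is plain store ordering, as in
this message-passing litmus test (a sketch in LKMM litmus syntax, similar to
the MP tests under tools/memory-model/litmus-tests/):

C MP+wmb+rmb-sketch

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();	/* orders the two stores, nothing more */
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	int r1;
	int r2;

	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

(* The outcome below is forbidden: the stores are ordered.  Nothing in
   this guarantee bounds how long either store may sit in a write
   buffer, though. *)
exists (1:r1=1 /\ 1:r2=0)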


> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> return;
> }
> WRITE_ONCE(prev->next, node);
> + /*
> + * This is necessary to make sure that the corresponding "while" in the
> + * mcs_spin_unlock() doesn't loop forever
> + */

This comment is utterly inadequate, since it does not describe an
explicit ordering between two (or more) stores.

> + smp_wmb();
>
> /* Wait until the lock holder passes the lock down. */
> arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>
> /* Link @node into the waitqueue. */
> WRITE_ONCE(prev->next, node);
> + /*
> + * This is necessary to make sure that the corresponding
> + * smp_cond_load_relaxed() below (running on another core)
> + * doesn't spin forever.
> + */
> + smp_wmb();

That's insane, cache coherency should not allow that to happen in the
first place. Our smp_wmb() cannot help with that.

> pv_wait_node(node, prev);
> arch_mcs_spin_lock_contended(&node->locked);

2021-01-28 00:45:08

by Will Deacon

Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <[email protected]>
>
> Ensure writes are pushed out of core write buffer to prevent waiting code
> on another cores from spinning longer than necessary.
>
> 6 threads running tight spinlock loop competing for the same lock
> on 6 cores on MIPS/Octeon do 1000000 iterations...
>
> before the patch in: 4.3 sec
> after the patch in: 1.2 sec

If you only have 6 cores, I'm not sure qspinlock makes any sense...

> Same 6-core Octeon machine:
> sysbench --test=mutex --num-threads=64 --memory-scope=local run
>
> w/o patch: 1.53s
> with patch: 1.28s
>
> This will also allow to remove the smp_wmb() in
> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
> issue?).
>
> Finally our internal quite diverse test suite of different IPC/network
> aspects didn't detect any regressions on ARM/ARM64/x86_64.
>
> Signed-off-by: Alexander Sverdlin <[email protected]>
> ---
> kernel/locking/mcs_spinlock.h | 5 +++++
> kernel/locking/qspinlock.c | 6 ++++++
> 2 files changed, 11 insertions(+)
>
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index 5e10153..10e497a 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> return;
> }
> WRITE_ONCE(prev->next, node);
> + /*
> + * This is necessary to make sure that the corresponding "while" in the
> + * mcs_spin_unlock() doesn't loop forever
> + */
> + smp_wmb();

If it loops forever, that's broken hardware design; store buffers need to
drain. I don't think we should add unconditional barriers to bodge this.

> /* Wait until the lock holder passes the lock down. */
> arch_mcs_spin_lock_contended(&node->locked);
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cbff6ba..577fe01 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>
> /* Link @node into the waitqueue. */
> WRITE_ONCE(prev->next, node);
> + /*
> + * This is necessary to make sure that the corresponding
> + * smp_cond_load_relaxed() below (running on another core)
> + * doesn't spin forever.
> + */
> + smp_wmb();

Likewise.

Will

2021-01-28 07:39:31

by Alexander Sverdlin

Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

Hi!

On 27/01/2021 23:21, Will Deacon wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <[email protected]>
>>
>> Ensure writes are pushed out of core write buffer to prevent waiting code
>> on another cores from spinning longer than necessary.
>>
>> 6 threads running tight spinlock loop competing for the same lock
>> on 6 cores on MIPS/Octeon do 1000000 iterations...
>>
>> before the patch in: 4.3 sec
>> after the patch in: 1.2 sec
> If you only have 6 cores, I'm not sure qspinlock makes any sense...

That's my impression as well and I even proposed to revert qspinlocks for MIPS:
https://lore.kernel.org/linux-mips/[email protected]/T/#ma1506c80472129b2ac41554cc2d863c9a24518c0

>> Same 6-core Octeon machine:
>> sysbench --test=mutex --num-threads=64 --memory-scope=local run
>>
>> w/o patch: 1.53s
>> with patch: 1.28s
>>
>> This will also allow to remove the smp_wmb() in
>> arch/arm/include/asm/mcs_spinlock.h (was it actually addressing the same
>> issue?).
>>
>> Finally our internal quite diverse test suite of different IPC/network
>> aspects didn't detect any regressions on ARM/ARM64/x86_64.
>>
>> Signed-off-by: Alexander Sverdlin <[email protected]>
>> ---
>> kernel/locking/mcs_spinlock.h | 5 +++++
>> kernel/locking/qspinlock.c | 6 ++++++
>> 2 files changed, 11 insertions(+)
>>
>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>> return;
>> }
>> WRITE_ONCE(prev->next, node);
>> + /*
>> + * This is necessary to make sure that the corresponding "while" in the
>> + * mcs_spin_unlock() doesn't loop forever
>> + */
>> + smp_wmb();
> If it loops forever, that's broken hardware design; store buffers need to
> drain. I don't think we should add unconditional barriers to bodge this.

The comment exaggerates the situation a bit, but the behaviour is non-deterministic and you can see
the measurements above. Something is wrong in the qspinlock code; please consider this patch "RFC",
but something has to be done here.

--
Best regards,
Alexander Sverdlin.

2021-01-28 07:44:51

by Alexander Sverdlin

Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

Hi!

On 27/01/2021 23:43, Peter Zijlstra wrote:
> On Wed, Jan 27, 2021 at 09:01:08PM +0100, Alexander A Sverdlin wrote:
>> From: Alexander Sverdlin <[email protected]>
>>
>> Ensure writes are pushed out of core write buffer to prevent waiting code
>> on another cores from spinning longer than necessary.
> Our smp_wmb() as defined does not have that property. You're relying on
> some arch specific details which do not belong in common code.

Yes, my intention was SYNCW on Octeon, which happens to be smp_wmb().
Do you think that the core write buffer is an Octeon-only feature and there
will be no others?

Should I re-implement arch_mcs_spin_lock_contended() for Octeon only,
as has been done for ARM?
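
Roughly along these lines, mirroring the ARM header (a hypothetical sketch;
the file path and the use of CONFIG_CPU_CAVIUM_OCTEON here are assumptions):

/* Hypothetical arch/mips/include/asm/mcs_spinlock.h, sketch only */
#ifndef __ASM_MCS_SPINLOCK_H
#define __ASM_MCS_SPINLOCK_H

#ifdef CONFIG_CPU_CAVIUM_OCTEON
/*
 * Push prior stores (the WRITE_ONCE(prev->next, node)) out of the core
 * write buffer via SYNCW before spinning, so the previous lock holder
 * sees us promptly.
 */
#define arch_mcs_spin_lock_contended(lock)				\
do {									\
	smp_wmb();							\
	while (!(smp_load_acquire(lock)))				\
		cpu_relax();						\
} while (0)
#endif

#endif	/* __ASM_MCS_SPINLOCK_H */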

>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>> index 5e10153..10e497a 100644
>> --- a/kernel/locking/mcs_spinlock.h
>> +++ b/kernel/locking/mcs_spinlock.h
>> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
>> return;
>> }
>> WRITE_ONCE(prev->next, node);
>> + /*
>> + * This is necessary to make sure that the corresponding "while" in the
>> + * mcs_spin_unlock() doesn't loop forever
>> + */
> This comment is utterly inadequate, since it does not describe an
> explicit ordering between two (or more) stores.
>
>> + smp_wmb();
>>
>> /* Wait until the lock holder passes the lock down. */
>> arch_mcs_spin_lock_contended(&node->locked);
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index cbff6ba..577fe01 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -469,6 +469,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>>
>> /* Link @node into the waitqueue. */
>> WRITE_ONCE(prev->next, node);
>> + /*
>> + * This is necessary to make sure that the corresponding
>> + * smp_cond_load_relaxed() below (running on another core)
>> + * doesn't spin forever.
>> + */
>> + smp_wmb();
> That's insane, cache coherency should not allow that to happen in the
> first place. Our smp_wmb() cannot help with that.
>

--
Best regards,
Alexander Sverdlin.

2021-01-28 11:28:28

by Peter Zijlstra

Subject: Re: [PATCH 1/2] qspinlock: Ensure writes are pushed out of core write buffer

On Thu, Jan 28, 2021 at 08:36:24AM +0100, Alexander Sverdlin wrote:

> >> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> >> index 5e10153..10e497a 100644
> >> --- a/kernel/locking/mcs_spinlock.h
> >> +++ b/kernel/locking/mcs_spinlock.h
> >> @@ -89,6 +89,11 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
> >> return;
> >> }
> >> WRITE_ONCE(prev->next, node);
> >> + /*
> >> + * This is necessary to make sure that the corresponding "while" in the
> >> + * mcs_spin_unlock() doesn't loop forever
> >> + */
> >> + smp_wmb();
> > If it loops forever, that's broken hardware design; store buffers need to
> > drain. I don't think we should add unconditional barriers to bodge this.
>
> The comment is a bit exaggerating the situation, but it's undeterministic and you see the
> measurements above. Something is wrong in the qspinlocks code, please consider this patch
> "RFC", but something has to be done here.

The qspinlock code has been TLA+ modelled and has had extensive memory
ordering analysis. It has had lots of runtime on extremely large x86,
arm64, and Power machines. I'm fairly confident there is nothing wrong.

What I do think is more likely is that your platform is broken; it
wouldn't be the first MIPS that's got store buffers completely wrong,
see commit:

a30718868915 ("MIPS: Change definition of cpu_relax() for Loongson-3")

Working around micro-architectural store-buffer issues is not something the
generic code is for.
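
For context, that commit confined the workaround to arch code, roughly as
follows (a sketch from memory, not a verbatim quote of the commit):

/* arch/mips/include/asm/processor.h, rough sketch of that change */
#ifdef CONFIG_CPU_LOONGSON3
/*
 * Loongson-3's store-fill-buffer can hold writes back while a tight
 * read loop runs, so force it to drain from cpu_relax().
 */
#define cpu_relax()	smp_mb()
#else
#define cpu_relax()	barrier()
#endif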