2017-04-10 21:35:20

by Yury Norov

Subject: [RFC PATCH 0/3] arm64: queued spinlocks and rw-locks

Jan Glauber's patch enables queued spinlocks on arm64. I rebased it onto the
latest kernel sources and added a couple of header fixes so that it applies
cleanly.

However, the locktorture test shows a significant performance degradation in
read acquisition of the rw-lock on qemu:

                      Before      After  Change (%)
spin_lock-torture:  38957034   37076367       -4.83
rw_lock-torture W:   5369471   18971957      253.33
rw_lock-torture R:   6413179    3668160      -42.80
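
The last column of the table above is the relative change from the first run
to the second, in percent. A minimal sketch of that arithmetic, assuming the
two numeric columns are the locktorture totals before and after the series
(the helper name percent_change is illustrative only and not part of
locktorture):

```c
/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, percent_change(6413179, 3668160) is roughly -42.80, which matches
the rw_lock-torture R row.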

I don't have much experience with locking, so I wonder how simply switching to
the generic queued rw-lock can cause such a significant performance
degradation when in theory it should improve performance. Moreover, x86
probably has no such problem.

I also think that patches 1 and 2 are correct and useful, and should be applied
anyway.

Any comments appreciated.

Yury.

Jan Glauber (1):
arm64/locking: qspinlocks and qrwlocks support

Yury Norov (2):
kernel/locking: #include <asm/spinlock.h> in qrwlock.c
asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h

arch/arm64/Kconfig | 2 ++
arch/arm64/include/asm/qrwlock.h | 7 +++++++
arch/arm64/include/asm/qspinlock.h | 20 ++++++++++++++++++++
arch/arm64/include/asm/spinlock.h | 12 ++++++++++++
arch/arm64/include/asm/spinlock_types.h | 14 +++++++++++---
include/asm-generic/qspinlock.h | 1 +
include/asm-generic/qspinlock_types.h | 8 --------
kernel/locking/qrwlock.c | 1 +
8 files changed, 54 insertions(+), 11 deletions(-)
create mode 100644 arch/arm64/include/asm/qrwlock.h
create mode 100644 arch/arm64/include/asm/qspinlock.h

--
2.7.4


2017-04-10 21:35:29

by Yury Norov

Subject: [PATCH 1/3] kernel/locking: #include <asm/spinlock.h> in qrwlock.c

qrwlock.c calls arch_spin_lock() and arch_spin_unlock() but doesn't
include asm/spinlock.h, where those functions are defined. This
may produce "implicit declaration of function" errors. Fix it by
including the header.

Signed-off-by: Yury Norov <[email protected]>
---
kernel/locking/qrwlock.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index cc3ed0c..6fb4292 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -20,6 +20,7 @@
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <asm/spinlock.h>
#include <asm/qrwlock.h>

/*
--
2.7.4

2017-04-10 21:35:37

by Yury Norov

Subject: [PATCH 2/3] asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h

qspinlock_types.h doesn't need linux/atomic.h directly. Because of
that, and because including it there requires protection against
recursive inclusion, it looks reasonable to move the inclusion to
exactly where it is needed. This change affects x86_64, which is
currently the only user of qspinlocks. I have build-tested the
change on x86_64 with CONFIG_PARAVIRT enabled and disabled.

Signed-off-by: Yury Norov <[email protected]>
---
include/asm-generic/qspinlock.h | 1 +
include/asm-generic/qspinlock_types.h | 8 --------
2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 9f0681b..5f4d42a 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -20,6 +20,7 @@
#define __ASM_GENERIC_QSPINLOCK_H

#include <asm-generic/qspinlock_types.h>
+#include <linux/atomic.h>

/**
* queued_spin_unlock_wait - wait until the _current_ lock holder releases the lock
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index 034acd0..a13cc90 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -18,15 +18,7 @@
#ifndef __ASM_GENERIC_QSPINLOCK_TYPES_H
#define __ASM_GENERIC_QSPINLOCK_TYPES_H

-/*
- * Including atomic.h with PARAVIRT on will cause compilation errors because
- * of recursive header file incluson via paravirt_types.h. So don't include
- * it if PARAVIRT is on.
- */
-#ifndef CONFIG_PARAVIRT
#include <linux/types.h>
-#include <linux/atomic.h>
-#endif

typedef struct qspinlock {
atomic_t val;
--
2.7.4

2017-04-10 21:36:01

by Yury Norov

Subject: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

From: Jan Glauber <[email protected]>

Ported from x86_64 with paravirtualization support removed.

Signed-off-by: Jan Glauber <[email protected]>

Note: this patch removes the protection against direct inclusion of
arch/arm64/include/asm/spinlock_types.h. This is done because
kernel/locking/qrwlock.c includes it through the header
include/asm-generic/qrwlock_types.h. Until now the only user
of qrwlock.c was x86, which has no such protection either.

I'm not happy to remove the protection, but if it's OK for x86,
it should also be OK for arm64. If not, I think we should fix it
for x86 and add the protection there too.

Yury

Signed-off-by: Yury Norov <[email protected]>
---
arch/arm64/Kconfig | 2 ++
arch/arm64/include/asm/qrwlock.h | 7 +++++++
arch/arm64/include/asm/qspinlock.h | 20 ++++++++++++++++++++
arch/arm64/include/asm/spinlock.h | 12 ++++++++++++
arch/arm64/include/asm/spinlock_types.h | 14 +++++++++++---
5 files changed, 52 insertions(+), 3 deletions(-)
create mode 100644 arch/arm64/include/asm/qrwlock.h
create mode 100644 arch/arm64/include/asm/qspinlock.h

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f2b0b52..ac1c170 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -24,6 +24,8 @@ config ARM64
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
select ARCH_WANT_FRAME_POINTERS
select ARCH_HAS_UBSAN_SANITIZE_ALL
+ select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_USE_QUEUED_RWLOCKS
select ARM_AMBA
select ARM_ARCH_TIMER
select ARM_GIC
diff --git a/arch/arm64/include/asm/qrwlock.h b/arch/arm64/include/asm/qrwlock.h
new file mode 100644
index 0000000..626f6eb
--- /dev/null
+++ b/arch/arm64/include/asm/qrwlock.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_ARM64_QRWLOCK_H
+#define _ASM_ARM64_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_ARM64_QRWLOCK_H */
diff --git a/arch/arm64/include/asm/qspinlock.h b/arch/arm64/include/asm/qspinlock.h
new file mode 100644
index 0000000..98f50fc
--- /dev/null
+++ b/arch/arm64/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_ARM64_QSPINLOCK_H
+#define _ASM_ARM64_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define queued_spin_unlock queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ smp_store_release((u8 *)lock, 0);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_ARM64_QSPINLOCK_H */
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cae331d..3771339 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -20,6 +20,10 @@
#include <asm/spinlock_types.h>
#include <asm/processor.h>

+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#else
+
/*
* Spinlock implementation.
*
@@ -187,6 +191,12 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
}
#define arch_spin_is_contended arch_spin_is_contended

+#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+#ifdef CONFIG_QUEUED_RWLOCKS
+#include <asm/qrwlock.h>
+#else
+
/*
* Write lock implementation.
*
@@ -351,6 +361,8 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
/* read_can_lock - would read_trylock() succeed? */
#define arch_read_can_lock(x) ((x)->lock < 0x80000000)

+#endif /* CONFIG_QUEUED_RWLOCKS */
+
#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)

diff --git a/arch/arm64/include/asm/spinlock_types.h b/arch/arm64/include/asm/spinlock_types.h
index 55be59a..0f0f156 100644
--- a/arch/arm64/include/asm/spinlock_types.h
+++ b/arch/arm64/include/asm/spinlock_types.h
@@ -16,9 +16,9 @@
#ifndef __ASM_SPINLOCK_TYPES_H
#define __ASM_SPINLOCK_TYPES_H

-#if !defined(__LINUX_SPINLOCK_TYPES_H) && !defined(__ASM_SPINLOCK_H)
-# error "please don't include this file directly"
-#endif
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else

#include <linux/types.h>

@@ -36,10 +36,18 @@ typedef struct {

#define __ARCH_SPIN_LOCK_UNLOCKED { 0 , 0 }

+#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+#ifdef CONFIG_QUEUED_RWLOCKS
+#include <asm-generic/qrwlock_types.h>
+#else
+
typedef struct {
volatile unsigned int lock;
} arch_rwlock_t;

#define __ARCH_RW_LOCK_UNLOCKED { 0 }

+#endif /* CONFIG_QUEUED_RWLOCKS */
+
#endif
--
2.7.4

2017-04-12 17:05:02

by Adam Wallis

Subject: Re: [RFC PATCH 0/3] arm64: queued spinlocks and rw-locks

On 4/10/2017 5:35 PM, Yury Norov wrote:
> The patch of Jan Glauber enables queued spinlocks on arm64. I rebased it on
> latest kernel sources, and added a couple of fixes to headers to apply it
> smoothly.
>
> Though, locktorture test shows significant performance degradation in the
> acquisition of rw-lock for read on qemu:
>
> Before After
> spin_lock-torture: 38957034 37076367 -4.83
> rw_lock-torture W: 5369471 18971957 253.33
> rw_lock-torture R: 6413179 3668160 -42.80
>

On our 48 core QDF2400 part, I am seeing huge improvements with these patches on
the torture tests. The improvements go up even further when I apply Jason Low's
MCS Spinlock patch: https://lkml.org/lkml/2016/4/20/725

> I'm not much experienced in locking, and so wonder how it's possible that
> simple switching to generic queued rw-lock causes so significant performance
> degradation, while in theory it should improve it. Even more, on x86 there
> are no such problems probably.
>
> I also think that patches 1 and 2 are correct and useful, and should be applied
> anyway.
>
> Any comments appreciated.
>
> Yury.
>

I will be happy to test these patches more thoroughly after you get some
additional comments/feedback.

> Jan Glauber (1):
> arm64/locking: qspinlocks and qrwlocks support
>
> Yury Norov (2):
> kernel/locking: #include <asm/spinlock.h> in qrwlock.c
> asm-generic: don't #include <linux/atomic.h> in qspinlock_types.h
>
> arch/arm64/Kconfig | 2 ++
> arch/arm64/include/asm/qrwlock.h | 7 +++++++
> arch/arm64/include/asm/qspinlock.h | 20 ++++++++++++++++++++
> arch/arm64/include/asm/spinlock.h | 12 ++++++++++++
> arch/arm64/include/asm/spinlock_types.h | 14 +++++++++++---
> include/asm-generic/qspinlock.h | 1 +
> include/asm-generic/qspinlock_types.h | 8 --------
> kernel/locking/qrwlock.c | 1 +
> 8 files changed, 54 insertions(+), 11 deletions(-)
> create mode 100644 arch/arm64/include/asm/qrwlock.h
> create mode 100644 arch/arm64/include/asm/qspinlock.h
>

Thanks

--
Adam Wallis
Qualcomm Datacenter Technologies as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project.

2017-04-13 10:33:27

by Yury Norov

Subject: Re: [RFC PATCH 0/3] arm64: queued spinlocks and rw-locks

On Wed, Apr 12, 2017 at 01:04:55PM -0400, Adam Wallis wrote:
> On 4/10/2017 5:35 PM, Yury Norov wrote:
> > The patch of Jan Glauber enables queued spinlocks on arm64. I rebased it on
> > latest kernel sources, and added a couple of fixes to headers to apply it
> > smoothly.
> >
> > Though, locktorture test shows significant performance degradation in the
> > acquisition of rw-lock for read on qemu:
> >
> > Before After
> > spin_lock-torture: 38957034 37076367 -4.83
> > rw_lock-torture W: 5369471 18971957 253.33
> > rw_lock-torture R: 6413179 3668160 -42.80
> >
>
> On our 48 core QDF2400 part, I am seeing huge improvements with these patches on
> the torture tests. The improvements go up even further when I apply Jason Low's
> MCS Spinlock patch: https://lkml.org/lkml/2016/4/20/725

That sounds great. So the performance issue looks like a local problem
on my side, most probably because I ran the tests in a Qemu VM.

I don't see any problems with this series, other than performance,
and if it looks fine now, I think it's good enough for upstream.

Yury.

2017-04-13 18:12:24

by Peter Zijlstra

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Tue, Apr 11, 2017 at 01:35:04AM +0400, Yury Norov wrote:

> +++ b/arch/arm64/include/asm/qspinlock.h
> @@ -0,0 +1,20 @@
> +#ifndef _ASM_ARM64_QSPINLOCK_H
> +#define _ASM_ARM64_QSPINLOCK_H
> +
> +#include <asm-generic/qspinlock_types.h>
> +
> +#define queued_spin_unlock queued_spin_unlock
> +/**
> + * queued_spin_unlock - release a queued spinlock
> + * @lock : Pointer to queued spinlock structure
> + *
> + * A smp_store_release() on the least-significant byte.
> + */
> +static inline void queued_spin_unlock(struct qspinlock *lock)
> +{
> + smp_store_release((u8 *)lock, 0);
> +}

I'm afraid this isn't enough for arm64. I suspect you want your own
variant of queued_spin_unlock_wait() and queued_spin_is_locked() as
well.

Much memory ordering fun to be had there.
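
To make the discussion concrete, here is a hedged user-space sketch in C11
atomics of the unlock from the patch and of one possible is_locked variant.
It is not the kernel code: demo_qspinlock, DEMO_LOCKED_MASK and the function
names are made up for illustration, the unlock clears the locked byte with a
32-bit release read-modify-write rather than the byte store used in the
patch, and the seq_cst fence only approximates smp_mb(). Whether barriers of
this kind are sufficient on arm64 is exactly the open question here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_LOCKED_MASK 0xffu	/* low byte: locked flag (assumed layout) */

struct demo_qspinlock {
	_Atomic uint32_t val;
};

/* Release: earlier critical-section accesses are ordered before the clear. */
static void demo_spin_unlock(struct demo_qspinlock *lock)
{
	atomic_fetch_and_explicit(&lock->val, ~(uint32_t)DEMO_LOCKED_MASK,
				  memory_order_release);
}

/* Full fence first, then treat any non-zero state as locked. */
static bool demo_spin_is_locked(struct demo_qspinlock *lock)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&lock->val, memory_order_relaxed) != 0;
}
```

Note that in this sketch a word with only higher bits set (for example a
pending waiter) still counts as locked, even though the locked byte is zero.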

2017-04-20 18:23:35

by Yury Norov

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Thu, Apr 13, 2017 at 08:12:12PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 11, 2017 at 01:35:04AM +0400, Yury Norov wrote:
>
> > +++ b/arch/arm64/include/asm/qspinlock.h
> > @@ -0,0 +1,20 @@
> > +#ifndef _ASM_ARM64_QSPINLOCK_H
> > +#define _ASM_ARM64_QSPINLOCK_H
> > +
> > +#include <asm-generic/qspinlock_types.h>
> > +
> > +#define queued_spin_unlock queued_spin_unlock
> > +/**
> > + * queued_spin_unlock - release a queued spinlock
> > + * @lock : Pointer to queued spinlock structure
> > + *
> > + * A smp_store_release() on the least-significant byte.
> > + */
> > +static inline void queued_spin_unlock(struct qspinlock *lock)
> > +{
> > + smp_store_release((u8 *)lock, 0);
> > +}
>
> I'm afraid this isn't enough for arm64. I suspect you want your own
> variant of queued_spin_unlock_wait() and queued_spin_is_locked() as
> well.
>
> Much memory ordering fun to be had there.

Hi Peter,

Is there some test to reproduce the locking failure in this case? I
ask because I ran locktorture for many hours on my qemu (emulating
cortex-a57) and saw no failures in the test reports. Jan did the same
on ThunderX, and Adam on QDF2400, without any problems. So even if I
rework those functions, how could I check them for correctness?

Anyway, regarding queued_spin_unlock_wait(), is my understanding
correct that you mean adding smp_mb() before entering the for(;;)
loop, and using ldaxr/stxr instead of atomic_read()?

Yury

2017-04-20 19:01:17

by Mark Rutland

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Thu, Apr 20, 2017 at 09:23:18PM +0300, Yury Norov wrote:
> On Thu, Apr 13, 2017 at 08:12:12PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 11, 2017 at 01:35:04AM +0400, Yury Norov wrote:
> >
> > > +++ b/arch/arm64/include/asm/qspinlock.h
> > > @@ -0,0 +1,20 @@
> > > +#ifndef _ASM_ARM64_QSPINLOCK_H
> > > +#define _ASM_ARM64_QSPINLOCK_H
> > > +
> > > +#include <asm-generic/qspinlock_types.h>
> > > +
> > > +#define queued_spin_unlock queued_spin_unlock
> > > +/**
> > > + * queued_spin_unlock - release a queued spinlock
> > > + * @lock : Pointer to queued spinlock structure
> > > + *
> > > + * A smp_store_release() on the least-significant byte.
> > > + */
> > > +static inline void queued_spin_unlock(struct qspinlock *lock)
> > > +{
> > > + smp_store_release((u8 *)lock, 0);
> > > +}
> >
> > I'm afraid this isn't enough for arm64. I suspect you want your own
> > variant of queued_spin_unlock_wait() and queued_spin_is_locked() as
> > well.
> >
> > Much memory ordering fun to be had there.
>
> Hi Peter,
>
> Is there some test to reproduce the locking failure for the case. I
> ask because I run locktorture for many hours on my qemu (emulating
> cortex-a57), and I see no failures in the test reports.

Even with multi-threaded TCG, a system emulated with QEMU will have far
stronger memory ordering than a real platform. So stress tests on such a
system are useless for testing memory ordering properties.

I would strongly advise that you use a real platform for anything beyond
basic tests when touching code in this area.

> And Jan did it on ThunderX, and Adam on QDF2400 without any problems.
> So even if I rework those functions, how could I check them for
> correctness?

Given the variation the architecture permits, and how difficult it is to
diagnose issues in this area, testing isn't enough here.

You need at least some informal proof as to the primitives doing what
they should, i.e. you should be able to explain why the code is correct.
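
As a hedged illustration of what such an informal argument can look like,
consider the classic message-passing pattern, written here with C11 atomics
instead of the kernel primitives (all names are made up for this sketch). The
claim to argue is: if the reader observes the flag set, it must also observe
the payload that was written before the flag.

```c
#include <stdatomic.h>

static int data;		/* plain payload */
static atomic_int ready;	/* flag published with release semantics */

/* Writer: the release store orders the write to 'data' before 'ready'. */
static void mp_publish(int value)
{
	data = value;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/*
 * Reader: the acquire load pairs with the release store, so seeing
 * ready == 1 guarantees that the earlier write to 'data' is visible.
 * Returns 1 and fills *out if the message was observed, else 0.
 */
static int mp_try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return 0;
	*out = data;
	return 1;
}
```

The argument rests on the release/acquire pairing, not on any test run: with
both accesses relaxed, a weakly ordered CPU is allowed to let the reader see
the flag but a stale payload, even if a stress test on a more strongly
ordered system never shows it.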

Thanks,
Mark.

2017-04-20 19:17:09

by Peter Zijlstra

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Thu, Apr 20, 2017 at 09:23:18PM +0300, Yury Norov wrote:
> Is there some test to reproduce the locking failure for the case.

Possibly sysvsem stress before commit:

27d7be1801a4 ("ipc/sem.c: avoid using spin_unlock_wait()")

Although a similar scheme is also used in nf_conntrack, see commit:

b316ff783d17 ("locking/spinlock, netfilter: Fix nf_conntrack_lock() barriers")

> I
> ask because I run locktorture for many hours on my qemu (emulating
> cortex-a57), and I see no failures in the test reports. And Jan did it
> on ThunderX, and Adam on QDF2400 without any problems. So even if I
> rework those functions, how could I check them for correctness?

Running them doesn't prove them correct. Memory ordering bugs have been
in the kernel for many years without 'ever' triggering. This is stuff
you have to think about.

> Anyway, regarding the queued_spin_unlock_wait(), is my understanding
> correct that you assume adding smp_mb() before entering the for(;;)
> cycle, and using ldaxr/stxr instead of atomic_read()?

You'll have to ask Will, I always forget the arm64 details.

2017-04-24 13:36:50

by Will Deacon

Subject: Re: [RFC PATCH 0/3] arm64: queued spinlocks and rw-locks

On Wed, Apr 12, 2017 at 01:04:55PM -0400, Adam Wallis wrote:
> On 4/10/2017 5:35 PM, Yury Norov wrote:
> > The patch of Jan Glauber enables queued spinlocks on arm64. I rebased it on
> > latest kernel sources, and added a couple of fixes to headers to apply it
> > smoothly.
> >
> > Though, locktorture test shows significant performance degradation in the
> > acquisition of rw-lock for read on qemu:
> >
> > Before After
> > spin_lock-torture: 38957034 37076367 -4.83
> > rw_lock-torture W: 5369471 18971957 253.33
> > rw_lock-torture R: 6413179 3668160 -42.80
> >
>
> On our 48 core QDF2400 part, I am seeing huge improvements with these patches on
> the torture tests. The improvements go up even further when I apply Jason Low's
> MCS Spinlock patch: https://lkml.org/lkml/2016/4/20/725

Does the QDF2400 implement the large system extensions? If so, how do the
queued lock implementations compare to the LSE-based ticket locks?

Will

2017-04-26 12:40:12

by Yury Norov

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Thu, Apr 20, 2017 at 09:05:30PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 20, 2017 at 09:23:18PM +0300, Yury Norov wrote:
> > Is there some test to reproduce the locking failure for the case.
>
> Possibly sysvsem stress before commit:
>
> 27d7be1801a4 ("ipc/sem.c: avoid using spin_unlock_wait()")
>
> Although a similar scheme is also used in nf_conntrack, see commit:
>
> b316ff783d17 ("locking/spinlock, netfilter: Fix nf_conntrack_lock() barriers")
>
> > I
> > ask because I run locktorture for many hours on my qemu (emulating
> > cortex-a57), and I see no failures in the test reports. And Jan did it
> > on ThunderX, and Adam on QDF2400 without any problems. So even if I
> > rework those functions, how could I check them for correctness?
>
> Running them doesn't prove them correct. Memory ordering bugs have been
> in the kernel for many years without 'ever' triggering. This is stuff
> you have to think about.
>
> > Anyway, regarding the queued_spin_unlock_wait(), is my understanding
> > correct that you assume adding smp_mb() before entering the for(;;)
> > cycle, and using ldaxr/stxr instead of atomic_read()?
>
> You'll have to ask Will, I always forget the arm64 details.

So, below is what I have. For queued_spin_unlock_wait() the generated
code looks like this:
ffff0000080983a0 <queued_spin_unlock_wait>:
ffff0000080983a0: d5033bbf dmb ish
ffff0000080983a4: b9400007 ldr w7, [x0]
ffff0000080983a8: 350000c7 cbnz w7, ffff0000080983c0 <queued_spin_unlock_wait+0x20>
ffff0000080983ac: 1400000e b ffff0000080983e4 <queued_spin_unlock_wait+0x44>
ffff0000080983b0: d503203f yield
ffff0000080983b4: d5033bbf dmb ish
ffff0000080983b8: b9400007 ldr w7, [x0]
ffff0000080983bc: 34000147 cbz w7, ffff0000080983e4 <queued_spin_unlock_wait+0x44>
ffff0000080983c0: f2401cff tst x7, #0xff
ffff0000080983c4: 54ffff60 b.eq ffff0000080983b0 <queued_spin_unlock_wait+0x10>
ffff0000080983c8: 14000003 b ffff0000080983d4 <queued_spin_unlock_wait+0x34>
ffff0000080983cc: d503201f nop
ffff0000080983d0: d503203f yield
ffff0000080983d4: d5033bbf dmb ish
ffff0000080983d8: b9400007 ldr w7, [x0]
ffff0000080983dc: f2401cff tst x7, #0xff
ffff0000080983e0: 54ffff81 b.ne ffff0000080983d0 <queued_spin_unlock_wait+0x30>
ffff0000080983e4: d50339bf dmb ishld
ffff0000080983e8: d65f03c0 ret
ffff0000080983ec: d503201f nop

If I understand the documentation correctly, this is enough to check the lock
properly. If not, please give me a clue. Will?

Yury

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 22dbde97eefa..2d80161ee367 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -25,6 +25,8 @@ config ARM64
select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
select ARCH_WANT_FRAME_POINTERS
select ARCH_HAS_UBSAN_SANITIZE_ALL
+ select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_USE_QUEUED_RWLOCKS
select ARM_AMBA
select ARM_ARCH_TIMER
select ARM_GIC
diff --git a/arch/arm64/include/asm/qrwlock.h b/arch/arm64/include/asm/qrwlock.h
new file mode 100644
index 000000000000..626f6ebfb52d
--- /dev/null
+++ b/arch/arm64/include/asm/qrwlock.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_ARM64_QRWLOCK_H
+#define _ASM_ARM64_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_ARM64_QRWLOCK_H */
diff --git a/arch/arm64/include/asm/qspinlock.h b/arch/arm64/include/asm/qspinlock.h
new file mode 100644
index 000000000000..09ef4f13f549
--- /dev/null
+++ b/arch/arm64/include/asm/qspinlock.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_ARM64_QSPINLOCK_H
+#define _ASM_ARM64_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+#include <asm/atomic.h>
+
+extern void queued_spin_unlock_wait(struct qspinlock *lock);
+#define queued_spin_unlock_wait queued_spin_unlock_wait
+
+#define queued_spin_unlock queued_spin_unlock
+/**
+ * queued_spin_unlock - release a queued spinlock
+ * @lock : Pointer to queued spinlock structure
+ *
+ * A smp_store_release() on the least-significant byte.
+ */
+static __always_inline void queued_spin_unlock(struct qspinlock *lock)
+{
+ smp_store_release((u8 *)lock, 0);
+}
+
+#define queued_spin_is_locked queued_spin_is_locked
+/**
+ * queued_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queued spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
+{
+ /*
+ * See queued_spin_unlock_wait().
+ *
+ * Any !0 state indicates it is locked, even if _Q_LOCKED_VAL
+ * isn't immediately observable.
+ */
+ smp_mb();
+ return atomic_read(&lock->val);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_ARM64_QSPINLOCK_H */
diff --git a/arch/arm64/include/asm/spinlock.h b/arch/arm64/include/asm/spinlock.h
index cae331d553f8..37713397e0c5 100644
--- a/arch/arm64/include/asm/spinlock.h
+++ b/arch/arm64/include/asm/spinlock.h
@@ -20,6 +20,10 @@
#include <asm/spinlock_types.h>
#include <asm/processor.h>

+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm/qspinlock.h>
+#else
+
/*
* Spinlock implementation.
*
@@ -187,6 +191,12 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock)
}
#define arch_spin_is_contended arch_spin_is_contended

+#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+#ifdef CONFIG_QUEUED_RWLOCKS
+#include <asm/qrwlock.h>
+#else
+
/*
* Write lock implementation.
*
@@ -351,6 +361,8 @@ static inline int arch_read_trylock(arch_rwlock_t *rw)
/* read_can_lock - would read_trylock() succeed? */
#define arch_read_can_lock(x) ((x)->lock < 0x80000000)

+#endif /* CONFIG_QUEUED_RWLOCKS */
+
#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)

diff --git a/arch/arm64/include/asm/spinlock_types.h b/arch/arm64/include/asm/spinlock_types.h
index 55be59a35e3f..0f0f1561ab6a 100644
--- a/arch/arm64/include/asm/spinlock_types.h
+++ b/arch/arm64/include/asm/spinlock_types.h
@@ -16,9 +16,9 @@
#ifndef __ASM_SPINLOCK_TYPES_H
#define __ASM_SPINLOCK_TYPES_H

-#if !defined(__LINUX_SPINLOCK_TYPES_H) && !defined(__ASM_SPINLOCK_H)
-# error "please don't include this file directly"
-#endif
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#include <asm-generic/qspinlock_types.h>
+#else

#include <linux/types.h>

@@ -36,10 +36,18 @@ typedef struct {

#define __ARCH_SPIN_LOCK_UNLOCKED { 0 , 0 }

+#endif /* CONFIG_QUEUED_SPINLOCKS */
+
+#ifdef CONFIG_QUEUED_RWLOCKS
+#include <asm-generic/qrwlock_types.h>
+#else
+
typedef struct {
volatile unsigned int lock;
} arch_rwlock_t;

#define __ARCH_RW_LOCK_UNLOCKED { 0 }

+#endif /* CONFIG_QUEUED_RWLOCKS */
+
#endif
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 9d56467dc223..f48f6256e893 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -56,6 +56,7 @@ arm64-obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o \
arm64-obj-$(CONFIG_ARM64_RELOC_TEST) += arm64-reloc-test.o
arm64-reloc-test-y := reloc_test_core.o reloc_test_syms.o
arm64-obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
+arm64-obj-$(CONFIG_QUEUED_SPINLOCKS) += qspinlock.o

obj-y += $(arm64-obj-y) vdso/ probes/
obj-$(CONFIG_ARM64_ILP32) += vdso-ilp32/
diff --git a/arch/arm64/kernel/qspinlock.c b/arch/arm64/kernel/qspinlock.c
new file mode 100644
index 000000000000..924f19953adb
--- /dev/null
+++ b/arch/arm64/kernel/qspinlock.c
@@ -0,0 +1,34 @@
+#include <asm/qspinlock.h>
+#include <asm/processor.h>
+
+void queued_spin_unlock_wait(struct qspinlock *lock)
+{
+ u32 val;
+
+ for (;;) {
+ smp_mb();
+ val = atomic_read(&lock->val);
+
+ if (!val) /* not locked, we're done */
+ goto done;
+
+ if (val & _Q_LOCKED_MASK) /* locked, go wait for unlock */
+ break;
+
+ /* not locked, but pending, wait until we observe the lock */
+ cpu_relax();
+ }
+
+ for (;;) {
+ smp_mb();
+ val = atomic_read(&lock->val);
+ if (!(val & _Q_LOCKED_MASK)) /* any unlock is good */
+ break;
+
+ cpu_relax();
+ }
+
+done:
+ smp_acquire__after_ctrl_dep();
+}
+EXPORT_SYMBOL(queued_spin_unlock_wait);
--
2.11.0
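
The two-loop structure of queued_spin_unlock_wait() in the patch above can be
sketched in user space with C11 atomics. This is an approximation for reading
purposes only: the sk_* names are invented, the seq_cst fence stands in for
smp_mb(), the acquire fence for smp_acquire__after_ctrl_dep(), and
cpu_relax() is omitted. It says nothing about whether the barriers are
sufficient on arm64.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SK_LOCKED_MASK 0xffu	/* low byte = locked (assumed, mirrors _Q_LOCKED_MASK) */

enum sk_state { SK_UNLOCKED, SK_PENDING_ONLY, SK_LOCKED };

/* Classify a snapshot of the lock word the way the first loop does. */
static enum sk_state sk_classify(uint32_t val)
{
	if (!val)
		return SK_UNLOCKED;
	if (val & SK_LOCKED_MASK)
		return SK_LOCKED;
	return SK_PENDING_ONLY;	/* waiters exist, owner not visible yet */
}

/* Full fence, then a relaxed load: stands in for smp_mb() + atomic_read(). */
static uint32_t sk_read(_Atomic uint32_t *val)
{
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(val, memory_order_relaxed);
}

static void sk_unlock_wait(_Atomic uint32_t *val)
{
	/* Phase 1: spin until the word is either free or visibly locked. */
	for (;;) {
		enum sk_state s = sk_classify(sk_read(val));

		if (s == SK_UNLOCKED)
			goto done;
		if (s == SK_LOCKED)
			break;
	}

	/* Phase 2: spin until the locked byte clears; any unlock is good. */
	while (sk_read(val) & SK_LOCKED_MASK)
		;

done:
	/* Approximates smp_acquire__after_ctrl_dep(). */
	atomic_thread_fence(memory_order_acquire);
}
```

Called on a word that is already zero, sk_unlock_wait() returns immediately;
on any other value it spins until another thread changes the word.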

2017-04-28 15:38:07

by Will Deacon

Subject: Re: [RFC PATCH 0/3] arm64: queued spinlocks and rw-locks

On Thu, Apr 13, 2017 at 01:33:09PM +0300, Yury Norov wrote:
> On Wed, Apr 12, 2017 at 01:04:55PM -0400, Adam Wallis wrote:
> > On 4/10/2017 5:35 PM, Yury Norov wrote:
> > > The patch of Jan Glauber enables queued spinlocks on arm64. I rebased it on
> > > latest kernel sources, and added a couple of fixes to headers to apply it
> > > smoothly.
> > >
> > > Though, locktorture test shows significant performance degradation in the
> > > acquisition of rw-lock for read on qemu:
> > >
> > > Before After
> > > spin_lock-torture: 38957034 37076367 -4.83
> > > rw_lock-torture W: 5369471 18971957 253.33
> > > rw_lock-torture R: 6413179 3668160 -42.80
> > >
> >
> > On our 48 core QDF2400 part, I am seeing huge improvements with these patches on
> > the torture tests. The improvements go up even further when I apply Jason Low's
> > MCS Spinlock patch: https://lkml.org/lkml/2016/4/20/725
>
> It sounds great. So performance issue is looking like my local
> problem, most probably because I ran tests on Qemu VM.
>
> I don't see any problems with this series, other than performance,
> and if it looks fine now, I think it's good enough for upstream.

I would still like to understand why you see such a significant performance
degradation, and whether or not you also see that on native hardware (i.e.
without Qemu involved).

Will

2017-04-28 15:44:27

by Will Deacon

Subject: Re: [PATCH 3/3] arm64/locking: qspinlocks and qrwlocks support

On Wed, Apr 26, 2017 at 03:39:47PM +0300, Yury Norov wrote:
> On Thu, Apr 20, 2017 at 09:05:30PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 20, 2017 at 09:23:18PM +0300, Yury Norov wrote:
> > > Is there some test to reproduce the locking failure for the case.
> >
> > Possibly sysvsem stress before commit:
> >
> > 27d7be1801a4 ("ipc/sem.c: avoid using spin_unlock_wait()")
> >
> > Although a similar scheme is also used in nf_conntrack, see commit:
> >
> > b316ff783d17 ("locking/spinlock, netfilter: Fix nf_conntrack_lock() barriers")
> >
> > > I
> > > ask because I run locktorture for many hours on my qemu (emulating
> > > cortex-a57), and I see no failures in the test reports. And Jan did it
> > > on ThunderX, and Adam on QDF2400 without any problems. So even if I
> > > rework those functions, how could I check them for correctness?
> >
> > Running them doesn't prove them correct. Memory ordering bugs have been
> > in the kernel for many years without 'ever' triggering. This is stuff
> > you have to think about.
> >
> > > Anyway, regarding the queued_spin_unlock_wait(), is my understanding
> > > correct that you assume adding smp_mb() before entering the for(;;)
> > > cycle, and using ldaxr/stxr instead of atomic_read()?
> >
> > You'll have to ask Will, I always forget the arm64 details.
>
> So, below is what I have. For queued_spin_unlock_wait() the generated
> code is looking like this:
> ffff0000080983a0 <queued_spin_unlock_wait>:
> ffff0000080983a0: d5033bbf dmb ish
> ffff0000080983a4: b9400007 ldr w7, [x0]
> ffff0000080983a8: 350000c7 cbnz w7, ffff0000080983c0 <queued_spin_unlock_wait+0x20>
> ffff0000080983ac: 1400000e b ffff0000080983e4 <queued_spin_unlock_wait+0x44>
> ffff0000080983b0: d503203f yield
> ffff0000080983b4: d5033bbf dmb ish
> ffff0000080983b8: b9400007 ldr w7, [x0]
> ffff0000080983bc: 34000147 cbz w7, ffff0000080983e4 <queued_spin_unlock_wait+0x44>
> ffff0000080983c0: f2401cff tst x7, #0xff
> ffff0000080983c4: 54ffff60 b.eq ffff0000080983b0 <queued_spin_unlock_wait+0x10>
> ffff0000080983c8: 14000003 b ffff0000080983d4 <queued_spin_unlock_wait+0x34>
> ffff0000080983cc: d503201f nop
> ffff0000080983d0: d503203f yield
> ffff0000080983d4: d5033bbf dmb ish
> ffff0000080983d8: b9400007 ldr w7, [x0]
> ffff0000080983dc: f2401cff tst x7, #0xff
> ffff0000080983e0: 54ffff81 b.ne ffff0000080983d0 <queued_spin_unlock_wait+0x30>
> ffff0000080983e4: d50339bf dmb ishld
> ffff0000080983e8: d65f03c0 ret
> ffff0000080983ec: d503201f nop
>
> If I understand the documentation correctly, it's enough to check the lock
> properly. If not - please give me the clue. Will?

Sorry, but I haven't had time to page this back in recently, so I can't give
you an answer straight off the bat. I'll need to go back and revisit the
qspinlock parts and, in particular, use of WFE before I'm comfortable with
this. I also don't want this on by default for the arm64 kernel, and I'd
like to see numbers comparing with our ticket locks on silicon with and
without the large system extensions, for low (<=8), medium (8-32) and high
(>32) core counts.

I'm very nervous about switching our locking implementation over to
something that's largely been developed and tested for x86, which has a
stronger memory model.

Will