In converting some existing spinlocks to rwlocks, it was found that the
write lock slowpath performance isn't as good as that of the qspinlock.
With a workload that added a large number of inodes to the superblock
and was rate-limited by the inode_sb_list_lock, converting that spinlock
into a write lock slowed the workload down from about 22s to 36s.
This patch series tries to squeeze out as much performance as possible
to close the gap between the qspinlock and the qrwlock. With all the
patches applied, the workload time improves to about 24-25s, which is
much better than before, though still a bit slower than with the
spinlock.
With this patch series in place, we can start converting some spinlocks
back to rwlocks where it makes sense and the lock size increase isn't
a concern.
Waiman Long (4):
locking/qrwlock: Better optimization for interrupt context readers
locking/qrwlock: Reduce reader/writer to reader lock transfer latency
locking/qrwlock: Reduce writer to writer lock transfer latency
locking/qrwlock: Use direct MCS lock/unlock in slowpath
arch/x86/include/asm/qrwlock.h | 4 +
include/asm-generic/qrwlock.h | 4 +-
include/asm-generic/qrwlock_types.h | 26 ++++-
kernel/locking/qrwlock.c | 185 ++++++++++++++++++++++++++--------
kernel/locking/qspinlock.c | 9 +-
5 files changed, 174 insertions(+), 54 deletions(-)
The qrwlock is fair in process context, but becomes unfair in interrupt
context to support use cases like the tasklist_lock. The current code
doesn't document well what happens in interrupt context. The
rspin_until_writer_unlock() function will only spin if the writer has
already gotten the lock. If the writer is still in the waiting state,
the increment of the reader count will keep the writer in the waiting
state and the new interrupt-context reader will get the lock and return
immediately. The current code, however, does an additional read of the
lock value which is not necessary, as that information is already
available in the fast path. This may sometimes cause an additional
cacheline transfer when the lock is highly contended.
This patch passes the lock value obtained in the fast path to the slow
path to eliminate the additional read. It also documents the behavior
of interrupt-context readers more clearly.
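For reference, rspin_until_writer_unlock() in the current
kernel/locking/qrwlock.c looks roughly like the sketch below; it only
spins while the writer byte is _QW_LOCKED, so an interrupt-context
reader returns immediately when the writer is merely waiting:
static __always_inline void
rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
{
	/* Spin only while a writer actually holds the lock */
	while ((cnts & _QW_WMASK) == _QW_LOCKED) {
		cpu_relax_lowlatency();
		cnts = smp_load_acquire((u32 *)&lock->cnts);
	}
}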
Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/qrwlock.h | 4 ++--
kernel/locking/qrwlock.c | 13 +++++++------
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 6383d54..865d021 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -36,7 +36,7 @@
/*
* External function declarations
*/
-extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts);
extern void queue_write_lock_slowpath(struct qrwlock *lock);
/**
@@ -105,7 +105,7 @@ static inline void queue_read_lock(struct qrwlock *lock)
return;
/* The slowpath will decrement the reader count, if necessary. */
- queue_read_lock_slowpath(lock);
+ queue_read_lock_slowpath(lock, cnts);
}
/**
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 6c5da48..81bae99 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -62,20 +62,21 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
/**
* queue_read_lock_slowpath - acquire read lock of a queue rwlock
* @lock: Pointer to queue rwlock structure
+ * @cnts: Current qrwlock lock value
*/
-void queue_read_lock_slowpath(struct qrwlock *lock)
+void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
{
- u32 cnts;
-
/*
* Readers come here when they cannot get the lock without waiting
*/
if (unlikely(in_interrupt())) {
/*
- * Readers in interrupt context will spin until the lock is
- * available without waiting in the queue.
+ * Readers in interrupt context will get the lock immediately
+ * if the writer is just waiting (not holding the lock yet).
+ * The rspin_until_writer_unlock() function returns immediately
+ * in this case. Otherwise, they will spin until the lock
+ * is available without waiting in the queue.
*/
- cnts = smp_load_acquire((u32 *)&lock->cnts);
rspin_until_writer_unlock(lock, cnts);
return;
}
--
1.7.1
Currently, a reader will first check that the writer mode byte is
cleared before incrementing the reader count. That waiting is not
really necessary. It increases the latency of the reader/writer to
reader transition and reduces reader performance.
This patch eliminates that waiting. It also has the side effect
of reducing the chance of writer lock stealing and improving the
fairness of the lock. Using a locking microbenchmark, a 10-thread 5M
locking loop of mostly readers (RW ratio = 10,000:1) has the following
performance numbers on a Haswell-EX box:
Kernel Locking Rate (Kops/s)
------ ---------------------
4.1.1 15,063,081
Patched 4.1.1 17,241,552
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qrwlock.c | 12 ++++--------
1 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 81bae99..ecd2d19 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
arch_spin_lock(&lock->lock);
/*
- * At the head of the wait queue now, wait until the writer state
- * goes to 0 and then try to increment the reader count and get
- * the lock. It is possible that an incoming writer may steal the
- * lock in the interim, so it is necessary to check the writer byte
- * to make sure that the write lock isn't taken.
+ * At the head of the wait queue now, increment the reader count
+ * and wait until the writer, if it has the lock, has gone away.
+ * At this stage, it is not possible for a writer to remain in the
+ * waiting state (_QW_WAITING). So there won't be any deadlock.
*/
- while (atomic_read(&lock->cnts) & _QW_WMASK)
- cpu_relax_lowlatency();
-
cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
rspin_until_writer_unlock(lock, cnts);
--
1.7.1
In most cases, a writer acquires the lock in two steps - first setting
the writer mode byte to _QW_WAITING and then to _QW_LOCKED, so two
atomic operations are required. This 2-step dance is only needed if
readers are present. This patch modifies the logic so that a writer
will try to acquire the lock in a single step for as long as possible,
until it sees some readers.
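For reference, the two steps in the current slowpath are roughly the
following (condensed from the existing code):
	/* Step 1: set the waiting flag so new readers queue behind us */
	for (;;) {
		struct __qrwlock *l = (struct __qrwlock *)lock;
		if (!READ_ONCE(l->wmode) &&
		    (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
			break;
		cpu_relax_lowlatency();
	}
	/* Step 2: once all readers are gone, upgrade to _QW_LOCKED */
	for (;;) {
		if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
		    (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
				    _QW_LOCKED) == _QW_WAITING))
			break;
		cpu_relax_lowlatency();
	}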
Using a locking microbenchmark, a 10-thread 5M locking loop of only
writers has the following performance numbers on a Haswell-EX box:
Kernel Locking Rate (Kops/s)
------ ---------------------
4.1.1 11,939,648
Patched 4.1.1 12,906,593
Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qrwlock.c | 20 +++++++++++++-------
1 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index ecd2d19..87e2d6b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -109,15 +109,22 @@ EXPORT_SYMBOL(queue_read_lock_slowpath);
*/
void queue_write_lock_slowpath(struct qrwlock *lock)
{
- u32 cnts;
-
/* Put the writer into the wait queue */
arch_spin_lock(&lock->lock);
/* Try to acquire the lock directly if no reader is present */
- if (!atomic_read(&lock->cnts) &&
- (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0))
- goto unlock;
+ for (;;) {
+ u32 cnts = atomic_read(&lock->cnts);
+
+ if (!cnts) {
+ cnts = atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED);
+ if (cnts == 0)
+ goto unlock;
+ }
+ if (cnts & ~_QW_WMASK)
+ break; /* Reader is present */
+ cpu_relax_lowlatency();
+ }
/*
* Set the waiting flag to notify readers that a writer is pending,
@@ -135,8 +142,7 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
/* When no more readers, set the locked flag */
for (;;) {
- cnts = atomic_read(&lock->cnts);
- if ((cnts == _QW_WAITING) &&
+ if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
(atomic_cmpxchg(&lock->cnts, _QW_WAITING,
_QW_LOCKED) == _QW_WAITING))
break;
--
1.7.1
Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
as the waiting queue. This is slower than using the MCS lock directly
because the extra level of indirection causes more atomic operations
to be used, and because 2 waiting threads spin on the lock cacheline
instead of only one.
This patch changes the lock waiting code to use direct MCS lock/unlock
on bare metal, but keeps using the spinlock in VM guests to take
advantage of the pvqspinlock and unfair lock functionality of the
qspinlock code.
With this patch, we saw further improvement in reader and writer
performance on a Haswell-EX box using a locking microbenchmark.
Before patch:
Locker Locking Rate (Kops/s)
------ ---------------------
reader 17,241,552
writer 12,906,593
After patch:
Locker Locking Rate (Kops/s)
------ ---------------------
reader 17,436,786
writer 14,394,326
Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/qrwlock.h | 4 +
include/asm-generic/qrwlock_types.h | 26 ++++++-
kernel/locking/qrwlock.c | 142 +++++++++++++++++++++++++++++------
kernel/locking/qspinlock.c | 9 +-
4 files changed, 149 insertions(+), 32 deletions(-)
diff --git a/arch/x86/include/asm/qrwlock.h b/arch/x86/include/asm/qrwlock.h
index ae0e241..7fab5ad 100644
--- a/arch/x86/include/asm/qrwlock.h
+++ b/arch/x86/include/asm/qrwlock.h
@@ -3,6 +3,10 @@
#include <asm-generic/qrwlock_types.h>
+#if defined(CONFIG_HYPERVISOR_GUEST) && !defined(static_cpu_has_hypervisor)
+#define static_cpu_has_hypervisor static_cpu_has(X86_FEATURE_HYPERVISOR)
+#endif
+
#ifndef CONFIG_X86_PPRO_FENCE
#define queue_write_unlock queue_write_unlock
static inline void queue_write_unlock(struct qrwlock *lock)
diff --git a/include/asm-generic/qrwlock_types.h b/include/asm-generic/qrwlock_types.h
index 4d76f24..8efd4b9 100644
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -3,19 +3,37 @@
#include <linux/types.h>
#include <asm/spinlock_types.h>
+#include <asm/byteorder.h>
/*
* The queue read/write lock data structure
*/
typedef struct qrwlock {
- atomic_t cnts;
- arch_spinlock_t lock;
+ union {
+ atomic_t cnts;
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u8 wmode; /* Writer mode */
+ u8 rcnts[3]; /* Reader counts */
+#else
+ u8 rcnts[3]; /* Reader counts */
+ u8 wmode; /* Writer mode */
+#endif
+ };
+ };
+ union {
+ u32 tail;
+ arch_spinlock_t lock;
+ };
} arch_rwlock_t;
-#define __ARCH_RW_LOCK_UNLOCKED { \
+/*
+ * Assuming that the spinlock is also initialized to 0.
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { \
.cnts = ATOMIC_INIT(0), \
- .lock = __ARCH_SPIN_LOCK_UNLOCKED, \
+ .tail = 0, \
}
#endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 87e2d6b..d3e40c1 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -11,7 +11,7 @@
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
- * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
*
* Authors: Waiman Long <[email protected]>
*/
@@ -21,26 +21,117 @@
#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <asm/qrwlock.h>
+#include "mcs_spinlock.h"
/*
- * This internal data structure is used for optimizing access to some of
- * the subfields within the atomic_t cnts.
+ * Use the internal qspinlock MCS nodes, if available
*/
-struct __qrwlock {
- union {
- atomic_t cnts;
- struct {
-#ifdef __LITTLE_ENDIAN
- u8 wmode; /* Writer mode */
- u8 rcnts[3]; /* Reader counts */
+#ifdef CONFIG_QUEUED_SPINLOCKS
+extern struct mcs_spinlock _mcs_qnodes[];
#else
- u8 rcnts[3]; /* Reader counts */
- u8 wmode; /* Writer mode */
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[4]);
#endif
- };
- };
- arch_spinlock_t lock;
-};
+
+#ifndef static_cpu_has_hypervisor
+#define static_cpu_has_hypervisor false
+#endif
+
+/*
+ * The following qlock()/qunlock() functions are simplified versions of the
+ * locking code in qspinlock.c. In bare metal, the MCS locking and unlocking
+ * code will be used to minimize the performance difference between qspinlock
+ * and qwriter lock. In VM guest, however, the qspinlock functions will be
+ * called to take advantage of the pvqspinlock and unfair lock functionality
+ * present in the qspinlock code at the expense of a bit of performance.
+ */
+#define TAIL_CPU_OFFSET 2
+#define TAIL_IDX_MASK 3
+
+static inline u32 encode_tail(int cpu, int idx)
+{
+ return ((cpu + 1) << TAIL_CPU_OFFSET) | (idx & TAIL_IDX_MASK);
+}
+
+static inline struct mcs_spinlock *decode_tail(u32 tail)
+{
+ int cpu = (tail >> TAIL_CPU_OFFSET) - 1;
+ int idx = tail & TAIL_IDX_MASK;
+
+ return per_cpu_ptr(&_mcs_qnodes[idx], cpu);
+}
+
+/*
+ * Enter the MCS lock waiting queue
+ */
+static inline u32 qlock(struct qrwlock *lock, struct mcs_spinlock **nptr)
+{
+ struct mcs_spinlock *prev, *node;
+ u32 old, tail;
+ int idx;
+
+ /*
+ * Use arch_spin_lock() if under hypervisor
+ */
+ if (static_cpu_has_hypervisor) {
+ arch_spin_lock(&lock->lock);
+ return 0;
+ }
+
+ /*
+ * MCS node initialization
+ */
+ node = this_cpu_ptr(&_mcs_qnodes[0]);
+ idx = node->count++;
+ tail = encode_tail(smp_processor_id(), idx);
+ node += idx;
+ node->locked = 0;
+ node->next = NULL;
+
+ old = xchg(&lock->tail, tail);
+
+ if (old) {
+ prev = decode_tail(old);
+ WRITE_ONCE(prev->next, node);
+ arch_mcs_spin_lock_contended(&node->locked);
+ }
+
+ /* Got the lock now */
+ *nptr = node;
+ return tail;
+}
+
+/*
+ * Exit the MCS lock waiting queue
+ */
+static inline void
+qunlock(struct qrwlock *lock, struct mcs_spinlock *node, u32 tail)
+{
+ struct mcs_spinlock *next;
+
+ /*
+ * Use arch_spin_unlock() if under hypervisor
+ */
+ if (static_cpu_has_hypervisor) {
+ arch_spin_unlock(&lock->lock);
+ return;
+ }
+
+
+ if ((READ_ONCE(lock->tail) == tail) &&
+ (cmpxchg(&lock->tail, tail, 0) == tail))
+ goto release;
+ /*
+ * Contended path; wait for next, release.
+ */
+ while (!(next = READ_ONCE(node->next)))
+ cpu_relax();
+ arch_mcs_spin_unlock_contended(&next->locked);
+release:
+ /*
+ * Release the node
+ */
+ this_cpu_dec(_mcs_qnodes[0].count);
+}
/**
* rspin_until_writer_unlock - inc reader count & spin until writer is gone
@@ -66,6 +157,10 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
*/
void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
{
+ struct mcs_spinlock *node = NULL;
+ u32 tail;
+
+
/*
* Readers come here when they cannot get the lock without waiting
*/
@@ -85,7 +180,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
/*
* Put the reader into the wait queue
*/
- arch_spin_lock(&lock->lock);
+ tail = qlock(lock, &node);
/*
* At the head of the wait queue now, increment the reader count
@@ -99,7 +194,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
/*
* Signal the next one in queue to become queue head
*/
- arch_spin_unlock(&lock->lock);
+ qunlock(lock, node, tail);
}
EXPORT_SYMBOL(queue_read_lock_slowpath);
@@ -109,8 +204,9 @@ EXPORT_SYMBOL(queue_read_lock_slowpath);
*/
void queue_write_lock_slowpath(struct qrwlock *lock)
{
+ struct mcs_spinlock *node = NULL;
/* Put the writer into the wait queue */
- arch_spin_lock(&lock->lock);
+ u32 tail = qlock(lock, &node);
/* Try to acquire the lock directly if no reader is present */
for (;;) {
@@ -131,10 +227,8 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
* or wait for a previous writer to go away.
*/
for (;;) {
- struct __qrwlock *l = (struct __qrwlock *)lock;
-
- if (!READ_ONCE(l->wmode) &&
- (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
+ if (!READ_ONCE(lock->wmode) &&
+ (cmpxchg(&lock->wmode, 0, _QW_WAITING) == 0))
break;
cpu_relax_lowlatency();
@@ -150,6 +244,6 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
cpu_relax_lowlatency();
}
unlock:
- arch_spin_unlock(&lock->lock);
+ qunlock(lock, node, tail);
}
EXPORT_SYMBOL(queue_write_lock_slowpath);
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 38c4920..3812498 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -81,8 +81,9 @@
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
*
* PV doubles the storage and uses the second cacheline for PV state.
+ * The MCS nodes are also shared with qrwlock.
*/
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
+DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[MAX_NODES]);
/*
* We must be able to distinguish between no-tail and the tail at 0:0,
@@ -107,7 +108,7 @@ static inline struct mcs_spinlock *decode_tail(u32 tail)
int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
- return per_cpu_ptr(&mcs_nodes[idx], cpu);
+ return per_cpu_ptr(&_mcs_qnodes[idx], cpu);
}
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
@@ -358,7 +359,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* queuing.
*/
queue:
- node = this_cpu_ptr(&mcs_nodes[0]);
+ node = this_cpu_ptr(&_mcs_qnodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -446,7 +447,7 @@ release:
/*
* release the node
*/
- this_cpu_dec(mcs_nodes[0].count);
+ this_cpu_dec(_mcs_qnodes[0].count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);
--
1.7.1
Hi Waiman,
On Mon, Jul 06, 2015 at 04:43:04PM +0100, Waiman Long wrote:
> Currently, a reader will check first to make sure that the writer mode
> byte is cleared before incrementing the reader count. That waiting is
> not really necessary. It increases the latency in the reader/writer
> to reader transition and reduces readers performance.
>
> This patch eliminates that waiting. It also has the side effect
> of reducing the chance of writer lock stealing and improving the
> fairness of the lock. Using a locking microbenchmark, a 10-threads 5M
> locking loop of mostly readers (RW ratio = 10,000:1) has the following
> performance numbers in a Haswell-EX box:
>
> Kernel Locking Rate (Kops/s)
> ------ ---------------------
> 4.1.1 15,063,081
> Patched 4.1.1 17,241,552
>
> Signed-off-by: Waiman Long <[email protected]>
I've just finished rebasing my arm64 qrwlock stuff, but I think it will
conflict with these patches. Do you mind if I post them for review anyway,
so we can at least co-ordinate our efforts?
> ---
> kernel/locking/qrwlock.c | 12 ++++--------
> 1 files changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 81bae99..ecd2d19 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> arch_spin_lock(&lock->lock);
>
> /*
> - * At the head of the wait queue now, wait until the writer state
> - * goes to 0 and then try to increment the reader count and get
> - * the lock. It is possible that an incoming writer may steal the
> - * lock in the interim, so it is necessary to check the writer byte
> - * to make sure that the write lock isn't taken.
> + * At the head of the wait queue now, increment the reader count
> + * and wait until the writer, if it has the lock, has gone away.
> + * At ths stage, it is not possible for a writer to remain in the
> + * waiting state (_QW_WAITING). So there won't be any deadlock.
> */
> - while (atomic_read(&lock->cnts) & _QW_WMASK)
> - cpu_relax_lowlatency();
Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
rmode until it hits zero?
Will
On 07/06/2015 02:23 PM, Will Deacon wrote:
> Hi Waiman,
>
> On Mon, Jul 06, 2015 at 04:43:04PM +0100, Waiman Long wrote:
>> Currently, a reader will check first to make sure that the writer mode
>> byte is cleared before incrementing the reader count. That waiting is
>> not really necessary. It increases the latency in the reader/writer
>> to reader transition and reduces readers performance.
>>
>> This patch eliminates that waiting. It also has the side effect
>> of reducing the chance of writer lock stealing and improving the
>> fairness of the lock. Using a locking microbenchmark, a 10-threads 5M
>> locking loop of mostly readers (RW ratio = 10,000:1) has the following
>> performance numbers in a Haswell-EX box:
>>
>> Kernel Locking Rate (Kops/s)
>> ------ ---------------------
>> 4.1.1 15,063,081
>> Patched 4.1.1 17,241,552
>>
>> Signed-off-by: Waiman Long<[email protected]>
> I've just finished rebasing my arm64 qrwlock stuff, but I think it will
> conflict with these patches. Do you mind if I post them for review anyway,
> so we can at least co-ordinate our efforts?
Yes, sure. I would also like to coordinate my changes with yours to
minimize conflict. BTW, I just got 2 tip-bot messages about the commits:
locking/qrwlock: Better optimization for interrupt context readers
locking/qrwlock: Rename functions to queued_*()
So I need to rebase my patches also.
>> ---
>> kernel/locking/qrwlock.c | 12 ++++--------
>> 1 files changed, 4 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
>> index 81bae99..ecd2d19 100644
>> --- a/kernel/locking/qrwlock.c
>> +++ b/kernel/locking/qrwlock.c
>> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
>> arch_spin_lock(&lock->lock);
>>
>> /*
>> - * At the head of the wait queue now, wait until the writer state
>> - * goes to 0 and then try to increment the reader count and get
>> - * the lock. It is possible that an incoming writer may steal the
>> - * lock in the interim, so it is necessary to check the writer byte
>> - * to make sure that the write lock isn't taken.
>> + * At the head of the wait queue now, increment the reader count
>> + * and wait until the writer, if it has the lock, has gone away.
>> + * At ths stage, it is not possible for a writer to remain in the
>> + * waiting state (_QW_WAITING). So there won't be any deadlock.
>> */
>> - while (atomic_read(&lock->cnts)& _QW_WMASK)
>> - cpu_relax_lowlatency();
> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> rmode until it hits zero?
No, this is how we make the lock fair so that an incoming stream of
later readers won't block a writer from getting the lock.
Cheers,
Longman
On Mon, Jul 06, 2015 at 08:49:33PM +0100, Waiman Long wrote:
> On 07/06/2015 02:23 PM, Will Deacon wrote:
> > I've just finished rebasing my arm64 qrwlock stuff, but I think it will
> > conflict with these patches. Do you mind if I post them for review anyway,
> > so we can at least co-ordinate our efforts?
>
> Yes, sure. I would also like to coordinate my changes with yours to
> minimize conflict. BTW, I just got 2 tip-bot messages about the commits:
>
> locking/qrwlock: Better optimization for interrupt context readers
> locking/qrwlock: Rename functions to queued_*()
>
> So I need to rebase my patches also.
Yeah, I've been carrying those two on my branch as well, but everything
should rebase cleanly.
> >> ---
> >> kernel/locking/qrwlock.c | 12 ++++--------
> >> 1 files changed, 4 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> >> index 81bae99..ecd2d19 100644
> >> --- a/kernel/locking/qrwlock.c
> >> +++ b/kernel/locking/qrwlock.c
> >> @@ -88,15 +88,11 @@ void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
> >> arch_spin_lock(&lock->lock);
> >>
> >> /*
> >> - * At the head of the wait queue now, wait until the writer state
> >> - * goes to 0 and then try to increment the reader count and get
> >> - * the lock. It is possible that an incoming writer may steal the
> >> - * lock in the interim, so it is necessary to check the writer byte
> >> - * to make sure that the write lock isn't taken.
> >> + * At the head of the wait queue now, increment the reader count
> >> + * and wait until the writer, if it has the lock, has gone away.
> >> + * At ths stage, it is not possible for a writer to remain in the
> >> + * waiting state (_QW_WAITING). So there won't be any deadlock.
> >> */
> >> - while (atomic_read(&lock->cnts)& _QW_WMASK)
> >> - cpu_relax_lowlatency();
> > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > rmode until it hits zero?
>
> No, this is how we make the lock fair so that an incoming streams of
> later readers won't block a writer from getting the lock.
But won't those readers effectively see that the lock is held for write
(because we set wmode to _QW_LOCKED before the existing reader had drained)
and therefore fall down the slow-path and get held up on the spinlock?
Will
On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > > rmode until it hits zero?
> >
> > No, this is how we make the lock fair so that an incoming streams of
> > later readers won't block a writer from getting the lock.
>
> But won't those readers effectively see that the lock is held for write
> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> and therefore fall down the slow-path and get held up on the spinlock?
Yes, that's the entire point. Once there's a writer pending, new readers
should queue too.
On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
> Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
> as the waiting queue. This is slower than using MCS lock directly
> because of the extra level of indirection causing more atomics to
> be used as well as 2 waiting threads spinning on the lock cacheline
> instead of only one.
This needs a better explanation. Didn't we find with the qspinlock thing
that the pending spinner improved performance on light loads?
Taking it out seems counterintuitive; we would very much like these two
to be the same.
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[4]);
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -81,8 +81,9 @@
> * Exactly fits one 64-byte cacheline on a 64-bit architecture.
> *
> * PV doubles the storage and uses the second cacheline for PV state.
> + * The MCS nodes are also shared with qrwlock.
> */
> -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
> +DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, _mcs_qnodes[MAX_NODES]);
Except you don't...
On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > > > Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > > > from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > > > rmode until it hits zero?
> > >
> > > No, this is how we make the lock fair so that an incoming streams of
> > > later readers won't block a writer from getting the lock.
> >
> > But won't those readers effectively see that the lock is held for write
> > (because we set wmode to _QW_LOCKED before the existing reader had drained)
> > and therefore fall down the slow-path and get held up on the spinlock?
>
> Yes, that's the entire point. Once there's a writer pending, new readers
> should queue too.
Agreed. My point was that we can achieve the same result without
a separate _QW_WAITING flag afaict.
Will
On 07/07/2015 07:49 AM, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
>> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
>>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
>>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
>>>>> rmode until it hits zero?
>>>> No, this is how we make the lock fair so that an incoming streams of
>>>> later readers won't block a writer from getting the lock.
>>> But won't those readers effectively see that the lock is held for write
>>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
>>> and therefore fall down the slow-path and get held up on the spinlock?
>> Yes, that's the entire point. Once there's a writer pending, new readers
>> should queue too.
> Agreed. My point was that we can achieve the same result without
> a separate _QW_WAITING flag afaict.
>
> Will
>
_QW_WAITING and _QW_LOCKED have different semantics and are necessary
for the proper handshake between readers and writer. We set _QW_WAITING
when readers own the lock and the writer is waiting for the readers to
go away. The _QW_WAITING flag forces new readers to go into the queue
while the writer is waiting. We set _QW_LOCKED when a writer owns the
lock, and it can only be set atomically when no reader is present.
Without the intermediate _QW_WAITING step, a continuous stream of
incoming readers (which keeps the reader count from ever reaching 0)
could deny a writer the lock indefinitely.
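For reference, the reader fast path (roughly as in
asm-generic/qrwlock.h) treats any nonzero writer byte the same way,
which is why _QW_WAITING alone is enough to push new process-context
readers into the queue:
static inline void queue_read_lock(struct qrwlock *lock)
{
	u32 cnts;
	cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
	if (likely(!(cnts & _QW_WMASK)))
		return;
	/* Writer byte set (_QW_WAITING or _QW_LOCKED): take the slowpath */
	queue_read_lock_slowpath(lock, cnts);
}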
Cheers,
Longman
On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
> On 07/07/2015 07:49 AM, Will Deacon wrote:
> > On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> >> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> >>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> >>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> >>>>> rmode until it hits zero?
> >>>> No, this is how we make the lock fair so that an incoming streams of
> >>>> later readers won't block a writer from getting the lock.
> >>> But won't those readers effectively see that the lock is held for write
> >>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> >>> and therefore fall down the slow-path and get held up on the spinlock?
> >> Yes, that's the entire point. Once there's a writer pending, new readers
> >> should queue too.
> > Agreed. My point was that we can achieve the same result without
> > a separate _QW_WAITING flag afaict.
>
> _QW_WAITING and _QW_LOCKED has different semantics and are necessary for
> the proper handshake between readers and writer. We set _QW_WAITING when
> readers own the lock and the writer is waiting for the readers to go
> away. The _QW_WAITING flag will force new readers to go to queuing while
> the writer is waiting. We set _QW_LOCKED when a writer own the lock and
> it can only be set atomically when no reader is present. Without the
> intermediate _QW_WAITING step, a continuous stream of incoming readers
> (which make the reader count never 0) could deny a writer from getting
> the lock indefinitely.
It's probably best if I try to implement something and we can either pick
holes in the patch or I'll realise why I'm wrong in the process :)
Will
On Tue, Jul 07, 2015 at 06:27:18PM +0100, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
> > On 07/07/2015 07:49 AM, Will Deacon wrote:
> > > On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
> > >> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
> > >>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
> > >>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
> > >>>>> rmode until it hits zero?
> > >>>> No, this is how we make the lock fair so that an incoming streams of
> > >>>> later readers won't block a writer from getting the lock.
> > >>> But won't those readers effectively see that the lock is held for write
> > >>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
> > >>> and therefore fall down the slow-path and get held up on the spinlock?
> > >> Yes, that's the entire point. Once there's a writer pending, new readers
> > >> should queue too.
> > > Agreed. My point was that we can achieve the same result without
> > > a separate _QW_WAITING flag afaict.
> >
> > _QW_WAITING and _QW_LOCKED has different semantics and are necessary for
> > the proper handshake between readers and writer. We set _QW_WAITING when
> > readers own the lock and the writer is waiting for the readers to go
> > away. The _QW_WAITING flag will force new readers to go to queuing while
> > the writer is waiting. We set _QW_LOCKED when a writer own the lock and
> > it can only be set atomically when no reader is present. Without the
> > intermediate _QW_WAITING step, a continuous stream of incoming readers
> > (which make the reader count never 0) could deny a writer from getting
> > the lock indefinitely.
>
> It's probably best if I try to implement something and we can either pick
> holes in the patch or I'll realise why I'm wrong in the process :)
Hmm, wasn't very enlightening. What's wrong with the following?
Will
--->8
diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index deb9e8b0eb9e..be8dc5c6fdbd 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -27,7 +27,6 @@
/*
* Writer states & reader shift and bias
*/
-#define _QW_WAITING 1 /* A writer is waiting */
#define _QW_LOCKED 0xff /* A writer holds the lock */
#define _QW_WMASK 0xff /* Writer mask */
#define _QR_SHIFT 8 /* Reader count shift */
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 9f644933f6d4..4006aa1fbd0b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
}
/*
- * Set the waiting flag to notify readers that a writer is pending,
- * or wait for a previous writer to go away.
+ * Wait for a previous writer to go away, then set the locked
+ * flag to notify future readers/writers that we are pending.
*/
for (;;) {
struct __qrwlock *l = (struct __qrwlock *)lock;
if (!READ_ONCE(l->wmode) &&
- (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
+ (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
break;
cpu_relax_lowlatency();
}
- /* When no more readers, set the locked flag */
- for (;;) {
- if ((atomic_read(&lock->cnts) == _QW_WAITING) &&
- (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
- _QW_LOCKED) == _QW_WAITING))
- break;
-
+ /* Wait for the readers to drain */
+ while (smp_load_acquire((u32 *)&lock->cnts) & ~_QW_WMASK)
cpu_relax_lowlatency();
- }
+
unlock:
arch_spin_unlock(&lock->lock);
}
On 07/07/2015 02:10 PM, Will Deacon wrote:
> On Tue, Jul 07, 2015 at 06:27:18PM +0100, Will Deacon wrote:
>> On Tue, Jul 07, 2015 at 03:30:22PM +0100, Waiman Long wrote:
>>> On 07/07/2015 07:49 AM, Will Deacon wrote:
>>>> On Tue, Jul 07, 2015 at 12:17:31PM +0100, Peter Zijlstra wrote:
>>>>> On Tue, Jul 07, 2015 at 10:17:11AM +0100, Will Deacon wrote:
>>>>>>>> Thinking about it, can we kill _QW_WAITING altogether and set (cmpxchg
>>>>>>>> from 0) wmode to _QW_LOCKED in the write_lock slowpath, polling (acquire)
>>>>>>>> rmode until it hits zero?
>>>>>>> No, this is how we make the lock fair so that an incoming streams of
>>>>>>> later readers won't block a writer from getting the lock.
>>>>>> But won't those readers effectively see that the lock is held for write
>>>>>> (because we set wmode to _QW_LOCKED before the existing reader had drained)
>>>>>> and therefore fall down the slow-path and get held up on the spinlock?
>>>>> Yes, that's the entire point. Once there's a writer pending, new readers
>>>>> should queue too.
>>>> Agreed. My point was that we can achieve the same result without
>>>> a separate _QW_WAITING flag afaict.
>>> _QW_WAITING and _QW_LOCKED has different semantics and are necessary for
>>> the proper handshake between readers and writer. We set _QW_WAITING when
>>> readers own the lock and the writer is waiting for the readers to go
>>> away. The _QW_WAITING flag will force new readers to go to queuing while
>>> the writer is waiting. We set _QW_LOCKED when a writer own the lock and
>>> it can only be set atomically when no reader is present. Without the
>>> intermediate _QW_WAITING step, a continuous stream of incoming readers
>>> (which make the reader count never 0) could deny a writer from getting
>>> the lock indefinitely.
>> It's probably best if I try to implement something and we can either pick
>> holes in the patch or I'll realise why I'm wrong in the process :)
> Hmm, wasn't very enlightening. What's wrong with the following?
>
> Will
>
> --->8
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index deb9e8b0eb9e..be8dc5c6fdbd 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -27,7 +27,6 @@
> /*
> * Writer states& reader shift and bias
> */
> -#define _QW_WAITING 1 /* A writer is waiting */
> #define _QW_LOCKED 0xff /* A writer holds the lock */
> #define _QW_WMASK 0xff /* Writer mask */
> #define _QR_SHIFT 8 /* Reader count shift */
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 9f644933f6d4..4006aa1fbd0b 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> }
>
> /*
> - * Set the waiting flag to notify readers that a writer is pending,
> - * or wait for a previous writer to go away.
> + * Wait for a previous writer to go away, then set the locked
> + * flag to notify future readers/writers that we are pending.
> */
> for (;;) {
> struct __qrwlock *l = (struct __qrwlock *)lock;
>
> if (!READ_ONCE(l->wmode)&&
> - (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
> + (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
> break;
>
> cpu_relax_lowlatency();
> }
>
> - /* When no more readers, set the locked flag */
> - for (;;) {
> - if ((atomic_read(&lock->cnts) == _QW_WAITING)&&
> - (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
> - _QW_LOCKED) == _QW_WAITING))
> - break;
> -
> + /* Wait for the readers to drain */
> + while (smp_load_acquire((u32 *)&lock->cnts)& ~_QW_WMASK)
> cpu_relax_lowlatency();
> - }
> +
> unlock:
> arch_spin_unlock(&lock->lock);
> }
That changes the handshaking protocol. In this case, the readers will
have to decrement their reader count to enable the writer to continue.
The interrupt-context reader code has to be changed. This gives
preference to the writer and puts readers at a disadvantage. I prefer
the current setting, as with yours you won't know whether the writer
has the lock just from a snapshot of the lock value. You need the whole
time sequence to figure it out, so it will be more prone to error.
Cheers,
Longman
On 07/07/2015 07:24 AM, Peter Zijlstra wrote:
> On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
>> Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
>> as the waiting queue. This is slower than using MCS lock directly
>> because of the extra level of indirection causing more atomics to
>> be used as well as 2 waiting threads spinning on the lock cacheline
>> instead of only one.
> This needs a better explanation. Didn't we find with the qspinlock thing
> that the pending spinner improved performance on light loads?
>
> Taking it out seems counter intuitive, we could very much like these two
> the be the same.
Yes, for the lightly loaded case, using raw_spin_lock should have an
advantage. It is a different matter when the lock is highly contended;
in that case, having the indirection in qspinlock makes it slower. I
struggled over whether to duplicate the locking code in qrwlock, so I
sent this patch out to test the water. I won't insist if you think
this is not a good idea, but I do want to get the previous 2 patches
in, which should not be controversial.
Cheers,
Longman
On Tue, Jul 07, 2015 at 05:59:59PM -0400, Waiman Long wrote:
> On 07/07/2015 07:24 AM, Peter Zijlstra wrote:
> >On Mon, Jul 06, 2015 at 11:43:06AM -0400, Waiman Long wrote:
> >>Lock waiting in the qrwlock uses the spinlock (qspinlock for x86)
> >>as the waiting queue. This is slower than using MCS lock directly
> >>because of the extra level of indirection causing more atomics to
> >>be used as well as 2 waiting threads spinning on the lock cacheline
> >>instead of only one.
> >This needs a better explanation. Didn't we find with the qspinlock thing
> >that the pending spinner improved performance on light loads?
> >
> >Taking it out seems counter intuitive, we could very much like these two
> >the be the same.
>
> Yes, for lightly loaded case, using raw_spin_lock should have an advantage.
> It is a different matter when the lock is highly contended. In this case,
> having the indirection in qspinlock will make it slower.
But we should not optimize the lock for the complete saturation case;
if you encounter that, change the locking. The light contention case is
much more important.
On Tue, Jul 07, 2015 at 05:29:50PM -0400, Waiman Long wrote:
> On 07/07/2015 02:10 PM, Will Deacon wrote:
> >diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> >index deb9e8b0eb9e..be8dc5c6fdbd 100644
> >--- a/include/asm-generic/qrwlock.h
> >+++ b/include/asm-generic/qrwlock.h
> >@@ -27,7 +27,6 @@
> > /*
> > * Writer states& reader shift and bias
> > */
> >-#define _QW_WAITING 1 /* A writer is waiting */
> > #define _QW_LOCKED 0xff /* A writer holds the lock */
> > #define _QW_WMASK 0xff /* Writer mask */
> > #define _QR_SHIFT 8 /* Reader count shift */
> >diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> >index 9f644933f6d4..4006aa1fbd0b 100644
> >--- a/kernel/locking/qrwlock.c
> >+++ b/kernel/locking/qrwlock.c
> >@@ -127,28 +127,23 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> > }
> >
> > /*
> >- * Set the waiting flag to notify readers that a writer is pending,
> >- * or wait for a previous writer to go away.
> >+ * Wait for a previous writer to go away, then set the locked
> >+ * flag to notify future readers/writers that we are pending.
> > */
> > for (;;) {
> > struct __qrwlock *l = (struct __qrwlock *)lock;
> >
> > if (!READ_ONCE(l->wmode)&&
> >- (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
> >+ (cmpxchg(&l->wmode, 0, _QW_LOCKED) == 0))
> > break;
> >
> > cpu_relax_lowlatency();
> > }
> >
> >- /* When no more readers, set the locked flag */
> >- for (;;) {
> >- if ((atomic_read(&lock->cnts) == _QW_WAITING)&&
> >- (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
> >- _QW_LOCKED) == _QW_WAITING))
> >- break;
> >-
> >+ /* Wait for the readers to drain */
> >+ while (smp_load_acquire((u32 *)&lock->cnts)& ~_QW_WMASK)
> > cpu_relax_lowlatency();
> >- }
> >+
> > unlock:
> > arch_spin_unlock(&lock->lock);
> > }
>
> That changes the handshaking protocol. In this case, the readers will have
> to decrement its reader count to enable the writer to continue.
It already needs to, no?
> The interrupt context reader code has to be changed.
Agreed.
> This gives preference to writer and reader will be in a disadvantage.
I don't see that; everybody is still ordered by the wait queue / lock.
> I prefer the current setting as you won't know if the writer has the
> lock or not when you take a snapshot of the value of the lock. You
> need the whole time sequence in this case to figure it out and so will
> be more prone to error.
I still need to wake up, but I suspect we need to change
queue_read_{try,}lock() to use cmpxchg/inc_not_zero like things, which
is fine for ARM, but not so much for x86.
So I think I agree with Waiman, but am willing to be shown differently.
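A rough, hypothetical sketch of the kind of cmpxchg-based read path
this would imply (not part of any posted patch; the name and shape are
only illustrative - it adds the reader bias only while the writer byte
is clear):
static inline int queued_read_trylock_cmpxchg(struct qrwlock *lock)
{
	u32 cnts = atomic_read(&lock->cnts);
	/* Only add a reader bias while no writer byte is set */
	while (!(cnts & _QW_WMASK)) {
		u32 old = atomic_cmpxchg(&lock->cnts, cnts, cnts + _QR_BIAS);
		if (old == cnts)
			return 1;	/* got the read lock */
		cnts = old;		/* lost the race; re-evaluate */
	}
	return 0;			/* writer present; caller must queue */
}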
On Wed, Jul 08, 2015 at 10:52:48AM +0100, Peter Zijlstra wrote:
> On Tue, Jul 07, 2015 at 05:29:50PM -0400, Waiman Long wrote:
> > I prefer the current setting as you won't know if the writer has the
> > lock or not when you take a snapshot of the value of the lock. You
> > need the whole time sequence in this case to figure it out and so will
> > be more prone to error.
>
> I still need to wake up, but I suspect we need to change
> queue_read_{try,}lock() to use cmpxchg/inc_not_zero like things, which
> is fine for ARM, but not so much for x86.
>
> So I think I agree with Waiman, but am willing to be shown differently.
That's fine; I just wanted to make sure I wasn't going round the twist!
Will