2014-01-24 04:29:33

by Waiman Long

Subject: [PATCH v11 0/4] Introducing a queue read/write lock implementation

v10->v11:
- Insert appropriate smp_mb__{before,after}_atomic_* calls to make sure
that the lock and unlock functions provide the proper memory barrier.

v9->v10:
- Eliminate the temporary smp_load_acquire()/smp_store_release() macros
by merging v9 patch 4 into patch 1.
- Include <linux/mutex.h> & remove xadd() macro check.
- Incorporate review comments from PeterZ.

v8->v9:
- Rebase to the tip branch which has the PeterZ's
smp_load_acquire()/smp_store_release() patch.
- Only pass integer type arguments to smp_load_acquire() &
smp_store_release() functions.
- Add a new patch to treat any data type with a size less than or equal
to that of a long as atomic or native in x86.
- Modify write_unlock() to use atomic_sub() if the writer field is
not atomic.

v7->v8:
- Use atomic_t functions (which are implemented in all arch's) to
modify reader counts.
- Use smp_load_acquire() & smp_store_release() for barriers.
- Further tuning in slowpath performance.

v6->v7:
- Remove support for unfair lock, so only fair qrwlock will be provided.
- Move qrwlock.c to the kernel/locking directory.

v5->v6:
- Modify queue_read_can_lock() to avoid false positive result.
- Move the two slowpath functions' performance tuning change from
patch 4 to patch 1.
- Add a new optional patch to use the new smp_store_release() function
if that is merged.

v4->v5:
- Fix wrong definitions for QW_MASK_FAIR & QW_MASK_UNFAIR macros.
- Add an optional patch 4 which should only be applied after the
mcs_spinlock.h header file is merged.

v3->v4:
- Optimize the fast path with better cold cache behavior and
performance.
- Removing some testing code.
- Make x86 use queue rwlock with no user configuration.

v2->v3:
- Make read lock stealing the default and fair rwlock an option with
a different initializer.
- In queue_read_lock_slowpath(), check irq_count() and force spinning
and lock stealing in interrupt context.
- Unify the fair and classic read-side code path, and make the write
side use cmpxchg with 2 different writer states. This slows down the
write lock fastpath to make the read side more efficient, but is
still slightly faster than a spinlock.

v1->v2:
- Improve lock fastpath performance.
- Optionally provide classic read/write lock behavior for backward
compatibility.
- Use xadd instead of cmpxchg for the fair reader code path to make it
immune to reader contention.
- Run more performance testing.

As mentioned in the LWN article http://lwn.net/Articles/364583/,
the read/write lock suffers from an unfairness problem: it is
possible for a stream of incoming readers to block a waiting writer
from getting the lock for a long time. Also, a waiting reader/writer
contending for a rwlock in local memory will have a higher chance of
acquiring the lock than a reader/writer on a remote node.

This patch set introduces a queue-based read/write lock implementation
that can largely solve this unfairness problem.

The read lock slowpath will check if the reader is in an interrupt
context. If so, it will force lock spinning and stealing without
waiting in a queue. This is to ensure the read lock will be granted
as soon as possible.

The queue write lock can also be used as a replacement for ticket
spinlocks that are highly contended, if an increase in lock size is
not an issue.

The first 2 patches provide the base queue read/write lock support
on the x86 architecture. Support for other architectures can be added
later on once the architecture-specific support infrastructure is in
place and proper testing is done.

Patch 3 optimizes performance on the x86 architecture.

The optional patch 4 depends on the mcs_spinlock.h header file,
which has not been merged yet. So this patch should only be applied
after the mcs_spinlock.h header file is merged.

Waiman Long (4):
qrwlock: A queue read/write lock implementation
qrwlock, x86: Enable x86 to use queue read/write lock
qrwlock, x86: Add char and short as atomic data type in x86
qrwlock: Use the mcs_spinlock helper functions for MCS queuing

arch/x86/Kconfig | 1 +
arch/x86/include/asm/barrier.h | 18 +++
arch/x86/include/asm/spinlock.h | 2 +
arch/x86/include/asm/spinlock_types.h | 4 +
include/asm-generic/qrwlock.h | 215 +++++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qrwlock.c | 183 ++++++++++++++++++++++++++++
8 files changed, 431 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qrwlock.h
create mode 100644 kernel/locking/qrwlock.c


2014-01-24 04:29:36

by Waiman Long

Subject: [PATCH v11 1/4] qrwlock: A queue read/write lock implementation

This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This will
improve performance in highly contended situations by reducing the
cache line bouncing effect.

The queue read/write lock (qrwlock) is a fair lock even though there
is still a slight chance of lock stealing if a reader or writer comes
at the right moment. Other than that, lock granting is done in a
FIFO manner. As a result, it is possible to determine a maximum time
period after which the waiting is over and the lock can be acquired.

Internally, however, there is a second type of reader that tries to
steal the lock aggressively. Such a reader simply increments the reader
count and waits until the writer releases the lock. The transition to
an aggressive reader happens in the read lock slowpath when the reader
is:

1. in an interrupt context, or
2. at the head of the wait queue and sees the release of a write lock.

The queue read lock is safe to use in an interrupt context (softirq
or hardirq) as it will switch to become an aggressive reader in such
an environment, allowing recursive read locking.

The only downside of the queue rwlock is that it increases the size
of the lock structure by 4 bytes for 32-bit systems and by 12 bytes
for 64-bit systems.

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
CPUs. The following table shows the average time (in ns) for a single
lock/unlock sequence (including the looping and timing overhead):

Lock Type 2.4GHz 2.93GHz
--------- ------ -------
Ticket spinlock 14.9 12.3
Read lock 17.0 13.5
Write lock 17.0 13.5
Queue read lock 16.0 13.4
Queue write lock 9.2 7.8

The queue read lock is slightly slower than the spinlock, but is
slightly faster than the read lock. The queue write lock, however,
is the fastest of all. It is almost twice as fast as the write lock
and about 1.5x as fast as the spinlock.

With lock contention, the speed of each individual lock/unlock function
is less important than the amount of contention-induced delays.

To investigate the performance characteristics of the queue rwlock
compared with the regular rwlock, Ingo's anon_vmas patch that converts
rwsem to rwlock was applied to a 3.12 kernel. This kernel was then
tested under the following 3 conditions:

1) Plain 3.12
2) Ingo's patch
3) Ingo's patch + qrwlock

Each of the 3 kernels was booted up twice, with and without the
"idle=poll" kernel parameter, which keeps the CPUs in the C0 state
while idling instead of a more energy-saving sleep state. The jobs
per minute (JPM) results of the AIM7 high_systime workload at 1500
users on an 8-socket 80-core DL980 (HT off) were:

Kernel JPMs %Change from (1)
------ ---- ----------------
1 145704/227295 -
2 229750/236066 +58%/+3.8%
3 240062/248606 +65%/+9.4%

The first JPM number is without the "idle=poll" kernel parameter;
the second number is with that parameter. It can be seen that most
of the performance benefit of converting rwsem to rwlock actually
comes from the latency improvement of not needing to wake up a CPU
from a deep sleep state when work is available.

The use of non-sleeping locks did improve performance by eliminating
the context switching cost. Using the queue rwlock nearly tripled
the performance gain. The performance gain was reduced somewhat with
a fair lock, which was to be expected.

Looking at the perf profiles (with idle=poll) below, we can clearly see
that other bottlenecks were constraining the performance improvement.

Perf profile of kernel (2):

18.65% reaim [kernel.kallsyms] [k] __write_lock_failed
9.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.21% swapper [kernel.kallsyms] [k] cpu_idle_loop
3.08% reaim [kernel.kallsyms] [k] mspin_lock
2.50% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.00% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.29% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.21% reaim [kernel.kallsyms] [k] __read_lock_failed
1.12% reaim [kernel.kallsyms] [k] _raw_spin_lock
1.10% reaim [kernel.kallsyms] [k] perf_event_aux
1.09% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave

Perf profile of kernel (3):

20.14% swapper [kernel.kallsyms] [k] cpu_idle_loop
7.94% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.41% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
5.01% reaim [kernel.kallsyms] [k] mspin_lock
2.12% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.07% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.58% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.25% reaim [kernel.kallsyms] [k] queue_write_3step_lock
1.18% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
1.14% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.95% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

20.55% swapper [kernel.kallsyms] [k] cpu_idle_loop
7.74% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
6.47% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
4.41% reaim [kernel.kallsyms] [k] mspin_lock
2.18% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.07% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
1.49% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.43% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.17% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
0.94% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

The spinlock bottlenecks are shown below.

7.74% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--65.94%-- release_pages
|--31.37%-- pagevec_lru_move_fn
|--0.79%-- get_page_from_freelist
|--0.62%-- __page_cache_release
--1.28%-- [...]

For both the release_pages() and pagevec_lru_move_fn() functions, the
spinlock contention was on zone->lru_lock.

Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
machine. It was found that the 11% performance improvement was the
same with either the regular rwlock or the queue rwlock.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
include/asm-generic/qrwlock.h | 218 +++++++++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 7 ++
kernel/locking/Makefile | 1 +
kernel/locking/qrwlock.c | 240 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 466 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qrwlock.h
create mode 100644 kernel/locking/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..da0efee
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,218 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/bitops.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+#include <asm/byteorder.h>
+
+#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
+#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
+#endif
+
+/*
+ * The queue read/write lock data structure
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * _QR_BIAS to the rw field to increment the reader count won't disturb
+ * the writer field. The least significant 8 bits is the writer field
+ * whereas the remaining 24 bits is the reader count.
+ */
+struct qrwnode {
+ struct qrwnode *next;
+ int wait; /* Waiting flag */
+};
+
+typedef struct qrwlock {
+ union qrwcnts {
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u8 writer; /* Writer state */
+#else
+ u16 r16; /* Reader count - msb */
+ u8 r8; /* Reader count - lsb */
+ u8 writer; /* Writer state */
+#endif
+ };
+ atomic_t rwa; /* Reader/writer atomic */
+ u32 rwc; /* Reader/writer counts */
+ } cnts;
+ struct qrwnode *waitq; /* Tail of waiting queue */
+} arch_rwlock_t;
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define _QW_WAITING 1 /* A writer is waiting */
+#define _QW_LOCKED 0xff /* A writer holds the lock */
+#define _QW_WMASK 0xff /* Writer mask */
+#define _QR_SHIFT 8 /* Reader count shift */
+#define _QR_BIAS (1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+ return !ACCESS_ONCE(lock->cnts.writer);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ return !ACCESS_ONCE(lock->cnts.rwc);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+ union qrwcnts cnts;
+
+ cnts.rwc = ACCESS_ONCE(lock->cnts.rwc);
+ if (likely(!cnts.writer)) {
+ cnts.rwc = (u32)atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
+ if (likely(!cnts.writer)) {
+ smp_mb__after_atomic_inc();
+ return 1;
+ }
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+ }
+ return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+ union qrwcnts old, new;
+
+ old.rwc = ACCESS_ONCE(lock->cnts.rwc);
+ if (likely(!old.rwc)) {
+ new.rwc = old.rwc;
+ new.writer = _QW_LOCKED;
+ if (likely(cmpxchg(&lock->cnts.rwc, old.rwc, new.rwc)
+ == old.rwc))
+ return 1;
+ }
+ return 0;
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+ union qrwcnts cnts;
+
+ cnts.rwc = atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
+ if (likely(!cnts.writer)) {
+ smp_mb__after_atomic_inc();
+ return;
+ }
+ /*
+ * Slowpath will decrement the reader count, if necessary
+ */
+ queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+ /*
+ * Optimize for the unfair lock case where the fair flag is 0.
+ */
+ if (cmpxchg(&lock->cnts.rwc, 0, _QW_LOCKED) == 0)
+ return;
+ queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * If the writer field is atomic, it can be cleared directly.
+ * Otherwise, an atomic subtraction will be used to clear it.
+ */
+ if (__native_word(lock->cnts.writer))
+ smp_store_release(&lock->cnts.writer, 0);
+ else {
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QW_LOCKED, &lock->cnts.rwa);
+ }
+}
+
+/*
+ * Initializer
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { .cnts = { .rwc = 0 }, .waitq = NULL }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l) queue_read_can_lock(l)
+#define arch_write_can_lock(l) queue_write_can_lock(l)
+#define arch_read_lock(l) queue_read_lock(l)
+#define arch_write_lock(l) queue_write_lock(l)
+#define arch_read_trylock(l) queue_read_trylock(l)
+#define arch_write_trylock(l) queue_write_trylock(l)
+#define arch_read_unlock(l) queue_read_unlock(l)
+#define arch_write_unlock(l) queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..35536d9 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_RWLOCK
+ bool
+
+config QUEUE_RWLOCK
+ def_bool y if ARCH_USE_QUEUE_RWLOCK
+ depends on SMP
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..3e7bab1 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
new file mode 100644
index 0000000..c76b8ce
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,240 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. Even though there is a slight chance of stealing the lock if one comes
+ * at the right moment, the granting of the lock is mostly in FIFO order.
+ * 2. It is usually faster in high contention situation.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with x86 ticket spinlock, the queue rwlock is faster in high
+ * contention situation. The writer lock is also faster in single thread
+ * operations. Therefore, queue rwlock can be considered as a replacement
+ * for those spinlocks that are highly contended as long as an increase
+ * in lock size is not an issue.
+ */
+
+/*
+ * Simulate an exchange-add with atomic function
+ */
+#define qrw_xadd(rw, inc) (u32)(atomic_add_return(inc, &(rw).rwa) - inc)
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ */
+static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *prev;
+
+ node->next = NULL;
+ node->wait = true;
+ prev = xchg(&lock->waitq, node);
+ if (prev) {
+ prev->next = node;
+ /*
+ * Wait until the waiting flag is off
+ */
+ while (smp_load_acquire(&node->wait))
+ arch_mutex_cpu_relax();
+ }
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to the current head of queue
+ */
+static inline void signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *next;
+
+ /*
+ * Try to notify the next node first without disturbing the cacheline
+ * of the lock. If that fails, check to see if it is the last node
+ * and so should clear the wait queue.
+ */
+ next = ACCESS_ONCE(node->next);
+ if (likely(next))
+ goto notify_next;
+
+ /*
+ * Clear the wait queue if it is the last node
+ */
+ if ((ACCESS_ONCE(lock->waitq) == node) &&
+ (cmpxchg(&lock->waitq, node, NULL) == node))
+ return;
+ /*
+ * Wait until the next one in queue set up the next field
+ */
+ while (likely(!(next = ACCESS_ONCE(node->next))))
+ arch_mutex_cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+notify_next:
+ smp_store_release(&next->wait, false);
+}
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock : Pointer to queue rwlock structure
+ * @rwc : Current queue rwlock reader/writer counts
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, u32 rwc)
+{
+ while ((rwc & _QW_WMASK) == _QW_LOCKED) {
+ arch_mutex_cpu_relax();
+ rwc = smp_load_acquire(&lock->cnts.rwc);
+ }
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+ struct qrwnode node;
+ union qrwcnts cnts;
+
+ /*
+ * Readers come here when they cannot get the lock without waiting
+ */
+ if (unlikely(irq_count())) {
+ /*
+ * Readers in interrupt context will spin until the lock is
+ * available without waiting in the queue.
+ */
+ cnts.rwc = smp_load_acquire(&lock->cnts.rwc);
+ rspin_until_writer_unlock(lock, cnts.rwc);
+ return;
+ }
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+
+ /*
+ * Put the reader into the wait queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * At the head of the wait queue now, wait until the writer state
+ * goes to 0 and then try to increment the reader count and get
+ * the lock. It is possible that an incoming writer may steal the
+ * lock in the interim, so it is necessary to check the writer byte
+ * to make sure that the write lock isn't taken.
+ */
+ while (ACCESS_ONCE(lock->cnts.writer))
+ arch_mutex_cpu_relax();
+ cnts.rwc = qrw_xadd(lock->cnts, _QR_BIAS);
+ rspin_until_writer_unlock(lock, cnts.rwc);
+
+ /*
+ * Signal the next one in queue to become queue head
+ */
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * qwrite_trylock - Try to acquire the write lock
+ * @lock : Pointer to queue rwlock structure
+ * @old : The current queue rwlock counts
+ * Return: 1 if lock acquired, 0 otherwise
+ */
+static __always_inline int
+qwrite_trylock(struct qrwlock *lock, u32 old)
+{
+ if (likely(cmpxchg(&lock->cnts.rwc, old, _QW_LOCKED) == old))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_write_3step_lock - acquire write lock in 3 steps
+ * @lock : Pointer to queue rwlock structure
+ *
+ * Step 1 - Try to acquire the lock directly if no reader is present
+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
+ * Step 3 - When the readers field goes to 0, set the locked flag
+ */
+static inline void queue_write_3step_lock(struct qrwlock *lock)
+{
+ u32 rwc;
+ u8 writer;
+
+ /* Step 1 */
+ if (!ACCESS_ONCE(lock->cnts.rwc) && qwrite_trylock(lock, 0))
+ return;
+
+ /* Step 2 */
+ writer = ACCESS_ONCE(lock->cnts.writer);
+ while (writer || (cmpxchg(&lock->cnts.writer, 0, _QW_WAITING) != 0)) {
+ arch_mutex_cpu_relax();
+ writer = ACCESS_ONCE(lock->cnts.writer);
+ }
+
+ /* Step 3 */
+ rwc = ACCESS_ONCE(lock->cnts.rwc);
+ while ((rwc > _QW_WAITING) || !qwrite_trylock(lock, _QW_WAITING)) {
+ arch_mutex_cpu_relax();
+ rwc = ACCESS_ONCE(lock->cnts.rwc);
+ }
+}
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+ struct qrwnode node;
+
+ /*
+ * Put the writer into the wait queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * At the head of the wait queue now, call queue_write_3step_lock()
+ * to acquire the lock until it is done.
+ */
+ queue_write_3step_lock(lock);
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
--
1.7.1

2014-01-24 04:29:54

by Waiman Long

Subject: [PATCH v11 3/4] qrwlock, x86: Add char and short as atomic data type in x86

The generic __native_word() macro defined in include/linux/compiler.h
only allows "int" and "long" data types to be treated as native and
atomic. The x86 architecture, however, allows the use of char and short
data types as atomic as well.

This patch extends the data types allowed in the __native_word() macro
so that char and short can also be used on the x86 architecture.

Signed-off-by: Waiman Long <[email protected]>
---
arch/x86/include/asm/barrier.h | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index 04a4890..0149a66 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -24,6 +24,24 @@
#define wmb() asm volatile("sfence" ::: "memory")
#endif

+/*
+ * Define x86 specific atomic types that can be used in read_acquire
+ * and store_release barriers.
+ */
+#define __arch_native_word(t) (sizeof(t) == sizeof(char) || \
+ sizeof(t) == sizeof(short) || \
+ sizeof(t) == sizeof(int) || \
+ sizeof(t) == sizeof(long))
+/*
+ * Due to cyclic header file dependency problem, the only way to modify
+ * the behavior of __native_word() macro within
+ * compiletime_assert_atomic_type() is to redefine it.
+ */
+#ifdef __native_word
+#undef __native_word
+#endif
+#define __native_word(t) __arch_native_word(t)
+
/**
* read_barrier_depends - Flush all pending reads that subsequents reads
* depend on.
--
1.7.1

2014-01-24 04:29:53

by Waiman Long

Subject: [PATCH v11 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing

There is a pending MCS lock patch series that adds generic MCS lock
helper functions to do MCS-style locking. This patch enables the
queue rwlock to use those generic MCS lock/unlock primitives for
internal queuing. This patch should only be merged after the merging
of that generic MCS locking patch.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
include/asm-generic/qrwlock.h | 7 +---
kernel/locking/qrwlock.c | 71 ++++-------------------------------------
2 files changed, 9 insertions(+), 69 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index da0efee..7b337b3 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -38,10 +38,7 @@
* the writer field. The least significant 8 bits is the writer field
* whereas the remaining 24 bits is the reader count.
*/
-struct qrwnode {
- struct qrwnode *next;
- int wait; /* Waiting flag */
-};
+struct mcs_spinlock;

typedef struct qrwlock {
union qrwcnts {
@@ -57,7 +54,7 @@ typedef struct qrwlock {
atomic_t rwa; /* Reader/writer atomic */
u32 rwc; /* Reader/writer counts */
} cnts;
- struct qrwnode *waitq; /* Tail of waiting queue */
+ struct mcs_spinlock *waitq; /* Tail of waiting queue */
} arch_rwlock_t;

/*
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index c76b8ce..cc8d5a1 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -21,6 +21,7 @@
#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <linux/mutex.h>
+#include <linux/mcs_spinlock.h>
#include <asm-generic/qrwlock.h>

/*
@@ -50,64 +51,6 @@
#define qrw_xadd(rw, inc) (u32)(atomic_add_return(inc, &(rw).rwa) - inc)

/**
- * wait_in_queue - Add to queue and wait until it is at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to be added to the queue
- */
-static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
-{
- struct qrwnode *prev;
-
- node->next = NULL;
- node->wait = true;
- prev = xchg(&lock->waitq, node);
- if (prev) {
- prev->next = node;
- /*
- * Wait until the waiting flag is off
- */
- while (smp_load_acquire(&node->wait))
- arch_mutex_cpu_relax();
- }
-}
-
-/**
- * signal_next - Signal the next one in queue to be at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to the current head of queue
- */
-static inline void signal_next(struct qrwlock *lock, struct qrwnode *node)
-{
- struct qrwnode *next;
-
- /*
- * Try to notify the next node first without disturbing the cacheline
- * of the lock. If that fails, check to see if it is the last node
- * and so should clear the wait queue.
- */
- next = ACCESS_ONCE(node->next);
- if (likely(next))
- goto notify_next;
-
- /*
- * Clear the wait queue if it is the last node
- */
- if ((ACCESS_ONCE(lock->waitq) == node) &&
- (cmpxchg(&lock->waitq, node, NULL) == node))
- return;
- /*
- * Wait until the next one in queue set up the next field
- */
- while (likely(!(next = ACCESS_ONCE(node->next))))
- arch_mutex_cpu_relax();
- /*
- * The next one in queue is now at the head
- */
-notify_next:
- smp_store_release(&next->wait, false);
-}
-
-/**
* rspin_until_writer_unlock - inc reader count & spin until writer is gone
* @lock : Pointer to queue rwlock structure
 * @rwc : Current queue rwlock reader/writer counts
@@ -130,7 +73,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 rwc)
*/
void queue_read_lock_slowpath(struct qrwlock *lock)
{
- struct qrwnode node;
+ struct mcs_spinlock node;
union qrwcnts cnts;

/*
@@ -150,7 +93,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
/*
* Put the reader into the wait queue
*/
- wait_in_queue(lock, &node);
+ mcs_spin_lock(&lock->waitq, &node);

/*
* At the head of the wait queue now, wait until the writer state
@@ -167,7 +110,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
/*
* Signal the next one in queue to become queue head
*/
- signal_next(lock, &node);
+ mcs_spin_unlock(&lock->waitq, &node);
}
EXPORT_SYMBOL(queue_read_lock_slowpath);

@@ -223,18 +166,18 @@ static inline void queue_write_3step_lock(struct qrwlock *lock)
*/
void queue_write_lock_slowpath(struct qrwlock *lock)
{
- struct qrwnode node;
+ struct mcs_spinlock node;

/*
* Put the writer into the wait queue
*/
- wait_in_queue(lock, &node);
+ mcs_spin_lock(&lock->waitq, &node);

/*
* At the head of the wait queue now, call queue_write_3step_lock()
* to acquire the lock until it is done.
*/
queue_write_3step_lock(lock);
- signal_next(lock, &node);
+ mcs_spin_unlock(&lock->waitq, &node);
}
EXPORT_SYMBOL(queue_write_lock_slowpath);
--
1.7.1

2014-01-24 04:30:55

by Waiman Long

Subject: [PATCH v11 2/4] qrwlock, x86: Enable x86 to use queue read/write lock

This patch makes the necessary changes at the x86 architecture-specific
layer so that the CONFIG_QUEUE_RWLOCK kernel option can replace the
regular read/write lock with the queue read/write lock.

It also enables the CONFIG_QUEUE_RWLOCK option by default for x86, which
forces the use of the queue read/write lock. That greatly improves the
fairness of the read/write lock and eliminates the live-lock situation
where one task may not get the lock for an indefinite period of time.

The code for the old x86 read/write lock will be removed in a later
release if no problem is found in the new queue read/write lock code.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/spinlock.h | 2 ++
arch/x86/include/asm/spinlock_types.h | 4 ++++
3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index d3b9186..5c95257 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+ select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
select COMPAT_OLD_SIGACTION if IA32_EMULATION
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..8fb88c5 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
cpu_relax();
}

+#ifndef CONFIG_QUEUE_RWLOCK
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
: "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}
+#endif /* CONFIG_QUEUE_RWLOCK */

#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..a585635 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }

+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm-generic/qrwlock.h>
+#else
#include <asm/rwlock.h>
+#endif

#endif /* _ASM_X86_SPINLOCK_TYPES_H */
--
1.7.1

2014-01-24 08:26:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 1/4] qrwlock: A queue read/write lock implementation

On Thu, Jan 23, 2014 at 11:28:48PM -0500, Waiman Long wrote:
> +/**
> + * queue_read_trylock - try to acquire read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_read_trylock(struct qrwlock *lock)
> +{
> + union qrwcnts cnts;
> +
> + cnts.rwc = ACCESS_ONCE(lock->cnts.rwc);
> + if (likely(!cnts.writer)) {
> + cnts.rwc = (u32)atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
> + if (likely(!cnts.writer)) {
> + smp_mb__after_atomic_inc();

That's superfluous, as atomic_add_return() is documented as being a full
barrier.

> + return 1;
> + }
> + atomic_sub(_QR_BIAS, &lock->cnts.rwa);
> + }
> + return 0;
> +}
> +
> +/**
> + * queue_write_trylock - try to acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + * Return: 1 if lock acquired, 0 if failed
> + */
> +static inline int queue_write_trylock(struct qrwlock *lock)
> +{
> + union qrwcnts old, new;
> +
> + old.rwc = ACCESS_ONCE(lock->cnts.rwc);
> + if (likely(!old.rwc)) {
> + new.rwc = old.rwc;
> + new.writer = _QW_LOCKED;
> + if (likely(cmpxchg(&lock->cnts.rwc, old.rwc, new.rwc)
> + == old.rwc))

One could actually use atomic_cmpxchg() and avoid one (ab)use of that
union :-)

> + return 1;
> + }
> + return 0;
> +}
> +/**
> + * queue_read_lock - acquire read lock of a queue rwlock
> + * @lock: Pointer to queue rwlock structure
> + */
> +static inline void queue_read_lock(struct qrwlock *lock)
> +{
> + union qrwcnts cnts;
> +
> + cnts.rwc = atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
> + if (likely(!cnts.writer)) {
> + smp_mb__after_atomic_inc();

Superfluous again.

> + return;
> + }
> + /*
> + * Slowpath will decrement the reader count, if necessary
> + */
> + queue_read_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_write_lock - acquire write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_lock(struct qrwlock *lock)
> +{
> + /*
> + * Optimize for the unfair lock case where the fair flag is 0.
> + */
> + if (cmpxchg(&lock->cnts.rwc, 0, _QW_LOCKED) == 0)

Could equally be atomic_cmpxchg() again.

> + return;
> + queue_write_lock_slowpath(lock);
> +}
> +
> +/**
> + * queue_read_unlock - release read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> + /*
> + * Atomically decrement the reader count
> + */
> + smp_mb__before_atomic_dec();
> + atomic_sub(_QR_BIAS, &lock->cnts.rwa);
> +}
> +
> +/**
> + * queue_write_unlock - release write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> + /*
> + * If the writer field is atomic, it can be cleared directly.
> + * Otherwise, an atomic subtraction will be used to clear it.
> + */
> + if (__native_word(lock->cnts.writer))
> + smp_store_release(&lock->cnts.writer, 0);
> + else {
> + smp_mb__before_atomic_dec();
> + atomic_sub(_QW_LOCKED, &lock->cnts.rwa);
> + }

Missing {}, Documentation/CodingStyle Chapter 3 near the very end.

> +}
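
The reader-trylock fast path under review can be modeled in plain C11 to show why the optimistic add plus writer-byte re-check works. This is a sketch with illustrative names and constants mirroring the patch, not the kernel interface itself:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Same layout as the patch: low byte = writer state, upper bits = readers. */
#define QW_WMASK  0xffu
#define QR_SHIFT  8
#define QR_BIAS   (1u << QR_SHIFT)

/* Model of queue_read_trylock(): returns 1 on success, 0 on failure. */
static int read_trylock(atomic_uint *cnts)
{
	uint32_t c = atomic_load(cnts);

	if (c & QW_WMASK)
		return 0;		/* a writer is active or waiting */

	/* Optimistically add a reader, then re-check the writer byte. */
	c = atomic_fetch_add(cnts, QR_BIAS) + QR_BIAS;
	if (!(c & QW_WMASK))
		return 1;

	/* A writer slipped in between the check and the add: back out. */
	atomic_fetch_sub(cnts, QR_BIAS);
	return 0;
}
```

The fetch-add is a full read-modify-write, which is why the extra smp_mb__after_atomic_inc() in the original was superfluous.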

2014-01-24 08:26:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing

On Thu, Jan 23, 2014 at 11:28:51PM -0500, Waiman Long wrote:
> There is a pending MCS lock patch series that adds generic MCS lock
> helper functions to do MCS-style locking. This patch will enable
> the queue rwlock to use that generic MCS lock/unlock primitives for
> internal queuing. This patch should only be merged after the merging
> of that generic MCS locking patch.

I would still very much like this patch to be merged into the first. It
saves having to review all the code removed again.

2014-01-25 04:31:28

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 4/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing

On 01/24/2014 03:26 AM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 11:28:51PM -0500, Waiman Long wrote:
>> There is a pending MCS lock patch series that adds generic MCS lock
>> helper functions to do MCS-style locking. This patch will enable
>> the queue rwlock to use that generic MCS lock/unlock primitives for
>> internal queuing. This patch should only be merged after the merging
>> of that generic MCS locking patch.
> I would still very much like this patch to be merged into the first. It
> saves having to review all the code removed again.

I will merge it into the first one once the MCS patches are in the
mainline or the tip branch.

-Longman

2014-01-25 04:39:48

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 1/4] qrwlock: A queue read/write lock implementation

On 01/24/2014 03:25 AM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 11:28:48PM -0500, Waiman Long wrote:
>> +/**
>> + * queue_read_trylock - try to acquire read lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static inline int queue_read_trylock(struct qrwlock *lock)
>> +{
>> + union qrwcnts cnts;
>> +
>> + cnts.rwc = ACCESS_ONCE(lock->cnts.rwc);
>> + if (likely(!cnts.writer)) {
>> + cnts.rwc = (u32)atomic_add_return(_QR_BIAS,&lock->cnts.rwa);
>> + if (likely(!cnts.writer)) {
>> + smp_mb__after_atomic_inc();
> That's superfluous, as atomic_add_return() is documented as being a full
> barrier.

Yes, you are right. I have reviewed memory-barriers.txt again and
atomic_add_return() is supposed to act as a memory barrier, so no extra
barrier is needed. I will correct that in the next version.

>> + return 1;
>> + }
>> + atomic_sub(_QR_BIAS,&lock->cnts.rwa);
>> + }
>> + return 0;
>> +}
>> +
>> +/**
>> + * queue_write_trylock - try to acquire write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static inline int queue_write_trylock(struct qrwlock *lock)
>> +{
>> + union qrwcnts old, new;
>> +
>> + old.rwc = ACCESS_ONCE(lock->cnts.rwc);
>> + if (likely(!old.rwc)) {
>> + new.rwc = old.rwc;
>> + new.writer = _QW_LOCKED;
>> + if (likely(cmpxchg(&lock->cnts.rwc, old.rwc, new.rwc)
>> + == old.rwc))
> One could actually use atomic_cmpxchg() and avoid one (ab)use of that
> union :-)

I think either one is fine. I would like to keep the original code if it
is not really a problem.

>> + return 1;
>> + }
>> + return 0;
>> +}
>> +/**
>> + * queue_read_lock - acquire read lock of a queue rwlock
>> + * @lock: Pointer to queue rwlock structure
>> + */
>> +static inline void queue_read_lock(struct qrwlock *lock)
>> +{
>> + union qrwcnts cnts;
>> +
>> + cnts.rwc = atomic_add_return(_QR_BIAS,&lock->cnts.rwa);
>> + if (likely(!cnts.writer)) {
>> + smp_mb__after_atomic_inc();
> Superfluous again.

Will remove that.

>> + return;
>> + queue_write_lock_slowpath(lock);
>> +}
>> +
>> +/**
>> + * queue_read_unlock - release read lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +static inline void queue_read_unlock(struct qrwlock *lock)
>> +{
>> + /*
>> + * Atomically decrement the reader count
>> + */
>> + smp_mb__before_atomic_dec();
>> + atomic_sub(_QR_BIAS,&lock->cnts.rwa);
>> +}
>> +
>> +/**
>> + * queue_write_unlock - release write lock of a queue rwlock
>> + * @lock : Pointer to queue rwlock structure
>> + */
>> +static inline void queue_write_unlock(struct qrwlock *lock)
>> +{
>> + /*
>> + * If the writer field is atomic, it can be cleared directly.
>> + * Otherwise, an atomic subtraction will be used to clear it.
>> + */
>> + if (__native_word(lock->cnts.writer))
>> + smp_store_release(&lock->cnts.writer, 0);
>> + else {
>> + smp_mb__before_atomic_dec();
>> + atomic_sub(_QW_LOCKED,&lock->cnts.rwa);
>> + }
> Missing {}, Documentation/CodingStyle Chapter 3 near the very end.

Thank for spotting that. Will fix it in the next version.

-Longman
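
The two write-unlock paths discussed here, with both branches braced as CodingStyle requires, can be modeled in userspace C11. This is a sketch under stated assumptions: `writer_is_native` stands in for __native_word(lock->cnts.writer), and the kernel barriers are approximated with stdatomic release ordering:

```c
#include <stdatomic.h>
#include <stdint.h>

#define QW_LOCKED 0xffu
#define QR_BIAS   (1u << 8)

/* Userspace model of queue_write_unlock(). Either branch clears the
 * writer byte while leaving the reader count untouched. */
static void write_unlock(atomic_uint *cnts, int writer_is_native)
{
	if (writer_is_native) {
		/* Models smp_store_release(&lock->cnts.writer, 0). */
		atomic_fetch_and_explicit(cnts, ~QW_LOCKED,
					  memory_order_release);
	} else {
		/* Models smp_mb__before_atomic_dec(); atomic_sub(). */
		atomic_fetch_sub_explicit(cnts, QW_LOCKED,
					  memory_order_release);
	}
}
```

Subtracting _QW_LOCKED and masking it out are equivalent here because the writer byte is exactly 0xff when the write lock is held.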

2014-01-30 13:05:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation


So I took out that ugly union and rewrote the code to be mostly
atomic_*(); gcc generates acceptable code and it's smaller too.

824 0 0 824 338 defconfig-build/kernel/locking/qrwlock.o
776 0 0 776 308 defconfig-build/kernel/locking/qrwlock.o

I don't think I wrecked it, but I've not actually tried it yet.

For now it uses the atomic_sub() unlock path, but by using a #ifndef we
can have arch overrides for that.

Eg, x86 could have something like:

#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
#define queue_write_unlock queue_write_unlock
static inline void queue_write_unlock(struct qrwlock *lock)
{
barrier();
ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
}
#endif
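
Why the single byte store is enough: the writer state occupies the least significant 8 bits of the counter word, so on a little-endian machine byte 0 of the u32 is exactly the writer byte. A small C sketch of that layout argument (illustrative, and it assumes little-endian byte order, which is the point of making this an x86-only override):

```c
#include <stdint.h>
#include <string.h>

/* Model of ACCESS_ONCE(*(u8 *)&lock->cnts) = 0 on little-endian:
 * zeroing byte 0 of the word clears only the writer byte and leaves
 * the reader count in the upper 24 bits intact. */
static uint32_t byte_store_unlock(uint32_t cnts)
{
	uint8_t bytes[4];

	memcpy(bytes, &cnts, sizeof(cnts));
	bytes[0] = 0;			/* the writer byte on little-endian */
	memcpy(&cnts, bytes, sizeof(cnts));
	return cnts;
}
```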

---
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,40 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+
+/*
+ * Initializer
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { .cnts = { .rwc = 0 }, .waitq = NULL }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l) queue_read_can_lock(l)
+#define arch_write_can_lock(l) queue_write_can_lock(l)
+#define arch_read_lock(l) queue_read_lock(l)
+#define arch_write_lock(l) queue_write_lock(l)
+#define arch_read_trylock(l) queue_read_trylock(l)
+#define arch_write_trylock(l) queue_write_trylock(l)
+#define arch_read_unlock(l) queue_read_unlock(l)
+#define arch_write_unlock(l) queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_RWLOCK
+ bool
+
+config QUEUE_RWLOCK
+ def_bool y if ARCH_USE_QUEUE_RWLOCK
+ depends on SMP
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,310 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/mcs_spinlock.h>
+// #include <asm-generic/qrwlock.h>
+
+
+// XXX stolen from header to get a single translation unit to compare
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+/*
+ * The queue read/write lock data structure
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * _QR_BIAS to the rw field to increment the reader count won't disturb
+ * the writer field. The least significant 8 bits is the writer field
+ * whereas the remaining 24 bits is the reader count.
+ */
+struct mcs_spinlock;
+
+struct qrwlock {
+ atomic_t cnts; /* Reader/writer atomic */
+ struct mcs_spinlock *waitq; /* Tail of waiting queue */
+};
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define _QW_WAITING 1 /* A writer is waiting */
+#define _QW_LOCKED 0xff /* A writer holds the lock */
+#define _QW_WMASK 0xff /* Writer mask */
+#define _QR_SHIFT 8 /* Reader count shift */
+#define _QR_BIAS (1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+inline int queue_read_can_lock(struct qrwlock *lock)
+{
+ return !(atomic_read(&lock->cnts) & _QW_WMASK);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ return !atomic_read(&lock->cnts);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+inline int queue_read_trylock(struct qrwlock *lock)
+{
+ u32 cnts;
+
+ cnts = atomic_read(&lock->cnts);
+ if (likely(!(cnts & _QW_WMASK))) {
+ cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
+ if (likely(!(cnts & _QW_WMASK)))
+ return 1;
+ atomic_sub(_QR_BIAS, &lock->cnts);
+ }
+ return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+inline int queue_write_trylock(struct qrwlock *lock)
+{
+ u32 old, new;
+
+ old = atomic_read(&lock->cnts);
+ if (unlikely(old))
+ return 0;
+
+ new = old | _QW_LOCKED;
+ return likely(atomic_cmpxchg(&lock->cnts, old, new) == old);
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+inline void queue_read_lock(struct qrwlock *lock)
+{
+ u32 cnts;
+
+ cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
+ if (likely(!(cnts & _QW_WMASK)))
+ return;
+
+ /* The slowpath will decrement the reader count, if necessary. */
+ queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+inline void queue_write_lock(struct qrwlock *lock)
+{
+ /* Optimize for the unfair lock case where the fair flag is 0. */
+ if (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0)
+ return;
+
+ queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+inline void queue_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QR_BIAS, &lock->cnts);
+}
+
+#ifndef queue_write_unlock
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * If the writer field is atomic, it can be cleared directly.
+ * Otherwise, an atomic subtraction will be used to clear it.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QW_LOCKED, &lock->cnts);
+}
+#endif
+
+// XXX
+
+/*
+ * Compared with regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. Even though there is a slight chance of stealing the lock if one
+ * comes at the right moment, the granting of the lock is mostly in
+ * FIFO order.
+ * 2. It is usually faster in high contention situation.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with x86 ticket spinlock, the queue rwlock is faster in high
+ * contention situation. The writer lock is also faster in single thread
+ * operations. Therefore, queue rwlock can be considered as a replacement
+ * for those spinlocks that are highly contended as long as an increase
+ * in lock size is not an issue.
+ */
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock : Pointer to queue rwlock structure
+ * @writer: Current queue rwlock writer status byte
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
+{
+ while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+ arch_mutex_cpu_relax();
+ cnts = smp_load_acquire((u32 *)&lock->cnts);
+ }
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+ struct mcs_spinlock node;
+ u32 cnts;
+
+ /*
+ * Readers come here when they cannot get the lock without waiting
+ */
+ if (unlikely(irq_count())) {
+ /*
+ * Readers in interrupt context will spin until the lock is
+ * available without waiting in the queue.
+ */
+ cnts = smp_load_acquire((u32 *)&lock->cnts);
+ rspin_until_writer_unlock(lock, cnts);
+ return;
+ }
+ atomic_sub(_QR_BIAS, &lock->cnts);
+
+ /*
+ * Put the reader into the wait queue
+ */
+ mcs_spin_lock(&lock->waitq, &node);
+
+ /*
+ * At the head of the wait queue now, wait until the writer state
+ * goes to 0 and then try to increment the reader count and get
+ * the lock. It is possible that an incoming writer may steal the
+ * lock in the interim, so it is necessary to check the writer byte
+ * to make sure that the write lock isn't taken.
+ */
+ while (atomic_read(&lock->cnts) & _QW_WMASK)
+ arch_mutex_cpu_relax();
+
+ cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
+ rspin_until_writer_unlock(lock, cnts);
+
+ /*
+ * Signal the next one in queue to become queue head
+ */
+ mcs_spin_unlock(&lock->waitq, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+ struct mcs_spinlock node;
+ u32 cnts, old;
+
+ /* Put the writer into the wait queue */
+ mcs_spin_lock(&lock->waitq, &node);
+
+ /* Try to acquire the lock directly if no reader is present */
+ if (!atomic_read(&lock->cnts) &&
+ (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0))
+ goto unlock;
+
+ /*
+ * Set the waiting flag to notify readers that a writer is pending,
+ * or wait for a previous writer to go away.
+ */
+ cnts = atomic_read(&lock->cnts);
+ for (;;) {
+ if (!(cnts & _QW_WMASK) &&
+ ((old = atomic_cmpxchg(&lock->cnts,
+ cnts, cnts | _QW_WAITING)) == cnts))
+ break;
+
+ cnts = old;
+ arch_mutex_cpu_relax();
+ }
+
+ /* When no more readers, set the locked flag */
+ cnts = atomic_read(&lock->cnts);
+ for (;;) {
+ if ((cnts == _QW_WAITING) &&
+ ((old = atomic_cmpxchg(&lock->cnts,
+ _QW_WAITING, _QW_LOCKED)) == _QW_WAITING))
+ break;
+
+ cnts = old;
+ arch_mutex_cpu_relax();
+ }
+unlock:
+ mcs_spin_unlock(&lock->waitq, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
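
The two cmpxchg loops in queue_write_lock_slowpath() above can be exercised single-threaded with a small C11 model. This is a sketch with illustrative names; the real code spins with arch_mutex_cpu_relax() between attempts rather than returning to the caller:

```c
#include <stdatomic.h>
#include <stdint.h>

#define QW_WAITING 1u
#define QW_LOCKED  0xffu
#define QW_WMASK   0xffu
#define QR_BIAS    (1u << 8)

/* Phase 1: announce a pending writer by setting _QW_WAITING once no
 * other writer owns the writer byte. Returns 1 on success. */
static int set_waiting(atomic_uint *cnts)
{
	uint32_t c = atomic_load(cnts);

	if (c & QW_WMASK)
		return 0;	/* another writer present: caller respins */
	return atomic_compare_exchange_strong(cnts, &c, c | QW_WAITING);
}

/* Phase 2: once every reader has drained, promote WAITING to LOCKED. */
static int try_promote(atomic_uint *cnts)
{
	uint32_t expected = QW_WAITING;

	return atomic_compare_exchange_strong(cnts, &expected, QW_LOCKED);
}
```

Phase 2 only succeeds when the whole word equals _QW_WAITING, i.e. the reader count is zero, which is what makes the hand-off to the writer safe.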

2014-01-30 15:17:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 02:04:53PM +0100, Peter Zijlstra wrote:
>
> So I took out that ugly union and rewrote the code to be mostly
> atomic_*(); gcc generates acceptable code and it's smaller too.
>
> 824 0 0 824 338 defconfig-build/kernel/locking/qrwlock.o
> 776 0 0 776 308 defconfig-build/kernel/locking/qrwlock.o
>
> I don't think I wrecked it, but I've not actually tried it yet.

I did wreck it.. :-)

The below is still small and actually works.

---
arch/x86/Kconfig | 1
arch/x86/include/asm/spinlock.h | 2
arch/x86/include/asm/spinlock_types.h | 4
b/arch/x86/include/asm/qrwlock.h | 18 +++
b/include/asm-generic/qrwlock.h | 174 ++++++++++++++++++++++++++++++++++
b/include/asm-generic/qrwlock_types.h | 17 +++
b/kernel/locking/qrwlock.c | 157 ++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1
9 files changed, 381 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -119,6 +119,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+ select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
select COMPAT_OLD_SIGACTION if IA32_EMULATION
--- /dev/null
+++ b/arch/x86/include/asm/qrwlock.h
@@ -0,0 +1,18 @@
+#ifndef _ASM_X86_QRWLOCK_H
+#define _ASM_X86_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+
+#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
+#define queue_write_unlock queue_write_unlock
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ barrier();
+ ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
+}
+#endif
+
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_X86_QRWLOCK_H */
+
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait
cpu_relax();
}

+#ifndef CONFIG_QUEUE_RWLOCK
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arc
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
: "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}
+#endif /* CONFIG_QUEUE_RWLOCK */

#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }

+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm/qrwlock.h>
+#else
#include <asm/rwlock.h>
+#endif

#endif /* _ASM_X86_SPINLOCK_TYPES_H */
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,174 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+#include <asm-generic/qrwlock_types.h>
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define _QW_WAITING 1 /* A writer is waiting */
+#define _QW_LOCKED 0xff /* A writer holds the lock */
+#define _QW_WMASK 0xff /* Writer mask */
+#define _QR_SHIFT 8 /* Reader count shift */
+#define _QR_BIAS (1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+ return !(atomic_read(&lock->cnts) & _QW_WMASK);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ return !atomic_read(&lock->cnts);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+ u32 cnts;
+
+ cnts = atomic_read(&lock->cnts);
+ if (likely(!(cnts & _QW_WMASK))) {
+ cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts);
+ if (likely(!(cnts & _QW_WMASK)))
+ return 1;
+ atomic_sub(_QR_BIAS, &lock->cnts);
+ }
+ return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+ u32 cnts;
+
+ cnts = atomic_read(&lock->cnts);
+ if (unlikely(cnts))
+ return 0;
+
+ return likely(atomic_cmpxchg(&lock->cnts,
+ cnts, cnts | _QW_LOCKED) == cnts);
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+ u32 cnts;
+
+ cnts = atomic_add_return(_QR_BIAS, &lock->cnts);
+ if (likely(!(cnts & _QW_WMASK)))
+ return;
+
+ /* The slowpath will decrement the reader count, if necessary. */
+ queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+ /* Optimize for the unfair lock case where the fair flag is 0. */
+ if (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0)
+ return;
+
+ queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QR_BIAS, &lock->cnts);
+}
+
+#ifndef queue_write_unlock
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * If the writer field is atomic, it can be cleared directly.
+ * Otherwise, an atomic subtraction will be used to clear it.
+ */
+ smp_mb__before_atomic_dec();
+ atomic_sub(_QW_LOCKED, &lock->cnts);
+}
+#endif
+
+typedef struct qrwlock arch_rwlock_t;
+
+/*
+ * Initializer
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { .cnts = ATOMIC_INIT(0), .waitq = NULL, }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l) queue_read_can_lock(l)
+#define arch_write_can_lock(l) queue_write_can_lock(l)
+#define arch_read_lock(l) queue_read_lock(l)
+#define arch_write_lock(l) queue_write_lock(l)
+#define arch_read_trylock(l) queue_read_trylock(l)
+#define arch_write_trylock(l) queue_write_trylock(l)
+#define arch_read_unlock(l) queue_read_unlock(l)
+#define arch_write_unlock(l) queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
--- /dev/null
+++ b/include/asm-generic/qrwlock_types.h
@@ -0,0 +1,17 @@
+#ifndef __ASM_GENERIC_QRWLOCK_TYPES_H
+#define __ASM_GENERIC_QRWLOCK_TYPES_H
+
+#include <linux/atomic.h>
+
+struct mcs_spinlock;
+
+/*
+ * The queue read/write lock data structure
+ */
+
+struct qrwlock {
+ atomic_t cnts;
+ struct mcs_spinlock *waitq;
+};
+
+#endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_USE_QUEUE_RWLOCK
+ bool
+
+config QUEUE_RWLOCK
+ def_bool y if ARCH_USE_QUEUE_RWLOCK
+ depends on SMP
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,157 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <linux/mutex.h>
+#include <linux/mcs_spinlock.h>
+#include <asm/qrwlock.h>
+
+/*
+ * Compared with regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. Even though there is a slight chance of stealing the lock if one
+ * comes at the right moment, the granting of the lock is mostly in
+ * FIFO order.
+ * 2. It is usually faster in high contention situation.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with the x86 ticket spinlock, the queue rwlock is faster in
+ * high contention situations. The writer lock is also faster in single
+ * thread operations. Therefore, the queue rwlock can be considered a
+ * replacement for those spinlocks that are highly contended, as long as
+ * an increase in lock size is not an issue.
+ */
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock : Pointer to queue rwlock structure
+ * @cnts : Current queue rwlock count value
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
+{
+ while ((cnts & _QW_WMASK) == _QW_LOCKED) {
+ arch_mutex_cpu_relax();
+ cnts = smp_load_acquire((u32 *)&lock->cnts);
+ }
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+ struct mcs_spinlock node;
+ u32 cnts;
+
+ /*
+ * Readers come here when they cannot get the lock without waiting
+ */
+ if (unlikely(in_interrupt())) {
+ /*
+ * Readers in interrupt context will spin until the lock is
+ * available without waiting in the queue.
+ */
+ cnts = smp_load_acquire((u32 *)&lock->cnts);
+ rspin_until_writer_unlock(lock, cnts);
+ return;
+ }
+ atomic_sub(_QR_BIAS, &lock->cnts);
+
+ /*
+ * Put the reader into the wait queue
+ */
+ mcs_spin_lock(&lock->waitq, &node);
+
+ /*
+ * At the head of the wait queue now, wait until the writer state
+ * goes to 0 and then try to increment the reader count and get
+ * the lock. It is possible that an incoming writer may steal the
+ * lock in the interim, so it is necessary to check the writer byte
+ * to make sure that the write lock isn't taken.
+ */
+ while (atomic_read(&lock->cnts) & _QW_WMASK)
+ arch_mutex_cpu_relax();
+
+ cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
+ rspin_until_writer_unlock(lock, cnts);
+
+ /*
+ * Signal the next one in queue to become queue head
+ */
+ mcs_spin_unlock(&lock->waitq, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+ struct mcs_spinlock node;
+ u32 cnts;
+
+ /* Put the writer into the wait queue */
+ mcs_spin_lock(&lock->waitq, &node);
+
+ /* Try to acquire the lock directly if no reader is present */
+ if (!atomic_read(&lock->cnts) &&
+ (atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0))
+ goto unlock;
+
+ /*
+ * Set the waiting flag to notify readers that a writer is pending,
+ * or wait for a previous writer to go away.
+ */
+ for (;;) {
+ cnts = atomic_read(&lock->cnts);
+ if (!(cnts & _QW_WMASK) &&
+ (atomic_cmpxchg(&lock->cnts, cnts,
+ cnts | _QW_WAITING) == cnts))
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+
+ /* When no more readers, set the locked flag */
+ for (;;) {
+ cnts = atomic_read(&lock->cnts);
+ if ((cnts == _QW_WAITING) &&
+ (atomic_cmpxchg(&lock->cnts, _QW_WAITING,
+ _QW_LOCKED) == _QW_WAITING))
+ break;
+
+ arch_mutex_cpu_relax();
+ }
+unlock:
+ mcs_spin_unlock(&lock->waitq, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);

2014-01-30 15:43:31

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On 01/30/2014 10:17 AM, Peter Zijlstra wrote:
> On Thu, Jan 30, 2014 at 02:04:53PM +0100, Peter Zijlstra wrote:
>> So I took out that ugly union and rewrote the code to be mostly
>> atomic_*(), gcc generates acceptable code and its smaller too.
>>
>> 824 0 0 824 338 defconfig-build/kernel/locking/qrwlock.o
>> 776 0 0 776 308 defconfig-build/kernel/locking/qrwlock.o
>>
>> I don't think I wrecked it, but I've not actually tried it yet.
> I did wreck it.. :-)
>
> The below is still small and actually works.
>
> ---
> arch/x86/Kconfig | 1
> arch/x86/include/asm/spinlock.h | 2
> arch/x86/include/asm/spinlock_types.h | 4
> b/arch/x86/include/asm/qrwlock.h | 18 +++
> b/include/asm-generic/qrwlock.h | 174 ++++++++++++++++++++++++++++++++++
> b/include/asm-generic/qrwlock_types.h | 17 +++
> b/kernel/locking/qrwlock.c | 157 ++++++++++++++++++++++++++++++
> kernel/Kconfig.locks | 7 +
> kernel/locking/Makefile | 1
> 9 files changed, 381 insertions(+)
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
>

OK, I see what you are trying to do. I can apply the change to my patch
& send out v12. So I presume that you are now OK with it. Can I add your
sign-off line?

-Longman

2014-01-30 15:44:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
> --- /dev/null
> +++ b/arch/x86/include/asm/qrwlock.h
> @@ -0,0 +1,18 @@
> +#ifndef _ASM_X86_QRWLOCK_H
> +#define _ASM_X86_QRWLOCK_H
> +
> +#include <asm-generic/qrwlock_types.h>
> +
> +#if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE)
> +#define queue_write_unlock queue_write_unlock
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> + barrier();
> + ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
> +}
> +#endif
> +
> +#include <asm-generic/qrwlock.h>
> +
> +#endif /* _ASM_X86_QRWLOCK_H */
> +

> +/**
> + * queue_read_unlock - release read lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> + /*
> + * Atomically decrement the reader count
> + */
> + smp_mb__before_atomic_dec();
> + atomic_sub(_QR_BIAS, &lock->cnts);
> +}
> +
> +#ifndef queue_write_unlock
> +/**
> + * queue_write_unlock - release write lock of a queue rwlock
> + * @lock : Pointer to queue rwlock structure
> + */
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> + /*
> + * If the writer field is atomic, it can be cleared directly.
> + * Otherwise, an atomic subtraction will be used to clear it.
> + */
> + smp_mb__before_atomic_dec();
> + atomic_sub(_QW_LOCKED, &lock->cnts);
> +}
> +#endif


Something like this would work for ARM and PPC, although I didn't do the
PPC variant of atomic_sub_release().


--- a/arch/arm64/include/asm/atomic.h
+++ b/arch/arm64/include/asm/atomic.h
@@ -90,6 +90,21 @@ static inline void atomic_sub(int i, ato
: "cc");
}

+static inline void atomic_sub_release(int i, atomic_t *v)
+{
+ unsigned long tmp;
+ int result;
+
+ asm volatile("// atomic_sub\n"
+"1: ldxr %w0, %2\n"
+" sub %w0, %w0, %w3\n"
+" stlxr %w1, %w0, %2\n"
+" cbnz %w1, 1b"
+ : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
+ : "Ir" (i)
+ : "cc");
+}
+
static inline int atomic_sub_return(int i, atomic_t *v)
{
unsigned long tmp;
--- /dev/null
+++ b/arch/arm64/include/asm/qrwlock.h
@@ -0,0 +1,21 @@
+#ifndef _ASM_ARM64_QRWLOCK_H
+#define _ASM_ARM64_QRWLOCK_H
+
+#include <asm-generic/qrwlock_types.h>
+
+#define queue_read_unlock queue_read_unlock
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+ atomic_sub_release(_QR_BIAS, &lock->cnts);
+}
+
+#define queue_write_unlock queue_write_unlock
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ atomic_sub_release(_QW_LOCKED, &lock->cnts);
+}
+
+#include <asm-generic/qrwlock.h>
+
+#endif /* _ASM_ARM64_QRWLOCK_H */
+
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -122,6 +122,7 @@ static inline void queue_write_lock(stru
queue_write_lock_slowpath(lock);
}

+#ifndef queue_read_unlock
/**
* queue_read_unlock - release read lock of a queue rwlock
* @lock : Pointer to queue rwlock structure
@@ -134,6 +135,7 @@ static inline void queue_read_unlock(str
smp_mb__before_atomic_dec();
atomic_sub(_QR_BIAS, &lock->cnts);
}
+#endif

#ifndef queue_write_unlock
/**

2014-01-30 15:51:16

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On 01/30/2014 10:43 AM, Waiman Long wrote:
> On 01/30/2014 10:17 AM, Peter Zijlstra wrote:
>> On Thu, Jan 30, 2014 at 02:04:53PM +0100, Peter Zijlstra wrote:
>>> So I took out that ugly union and rewrote the code to be mostly
>>> atomic_*(), gcc generates acceptable code and its smaller too.
>>>
>>> 824 0 0 824 338
>>> defconfig-build/kernel/locking/qrwlock.o
>>> 776 0 0 776 308
>>> defconfig-build/kernel/locking/qrwlock.o
>>>
>>> I don't think I wrecked it, but I've not actually tried it yet.
>> I did wreck it.. :-)
>>
>> The below is still small and actually works.
>>
>> ---
>> arch/x86/Kconfig | 1
>> arch/x86/include/asm/spinlock.h | 2
>> arch/x86/include/asm/spinlock_types.h | 4
>> b/arch/x86/include/asm/qrwlock.h | 18 +++
>> b/include/asm-generic/qrwlock.h | 174
>> ++++++++++++++++++++++++++++++++++
>> b/include/asm-generic/qrwlock_types.h | 17 +++
>> b/kernel/locking/qrwlock.c | 157
>> ++++++++++++++++++++++++++++++
>> kernel/Kconfig.locks | 7 +
>> kernel/locking/Makefile | 1
>> 9 files changed, 381 insertions(+)
>>
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>>
>
> OK, I see what you are trying to do. I can apply the change to my
> patch & send out v12. So I presume that you are now OK with it. Can I
> add your sign-off line?
>
> -Longman
>
>

One more thing, I often see a line like

#define queue_write_unlock queue_write_unlock

So exactly what effect does this macro have?

-Longman

2014-01-30 15:54:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 10:50:41AM -0500, Waiman Long wrote:
> One more thing, I often see a line like
>
> #define queue_write_unlock queue_write_unlock
>
> So exactly what effect does this macro have?

Makes sure the below doesn't emit another version.

#ifndef queue_write_unlock
/**
* queue_write_unlock - release write lock of a queue rwlock
* @lock : Pointer to queue rwlock structure
*/
static inline void queue_write_unlock(struct qrwlock *lock)
{
/*
* If the writer field is atomic, it can be cleared directly.
* Otherwise, an atomic subtraction will be used to clear it.
*/
smp_mb__before_atomic_dec();
atomic_sub(_QW_LOCKED, &lock->cnts);
}
#endif

2014-01-30 17:54:35

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

Hi Peter,

On Thu, Jan 30, 2014 at 03:44:00PM +0000, Peter Zijlstra wrote:
> Something like this would work for ARM and PPC, although I didn't do the
> PPC variant of atomic_sub_release().
>
>
> --- a/arch/arm64/include/asm/atomic.h
> +++ b/arch/arm64/include/asm/atomic.h
> @@ -90,6 +90,21 @@ static inline void atomic_sub(int i, ato
> : "cc");
> }
>
> +static inline void atomic_sub_release(int i, atomic_t *v)
> +{
> + unsigned long tmp;
> + int result;
> +
> + asm volatile("// atomic_sub\n"
> +"1: ldxr %w0, %2\n"
> +" sub %w0, %w0, %w3\n"
> +" stlxr %w1, %w0, %2\n"
> +" cbnz %w1, 1b"
> + : "=&r" (result), "=&r" (tmp), "+Q" (v->counter)
> + : "Ir" (i)
> + : "cc");

Probably want to replace this "cc" with a "memory".

> --- /dev/null
> +++ b/arch/arm64/include/asm/qrwlock.h
> @@ -0,0 +1,21 @@
> +#ifndef _ASM_ARM64_QRWLOCK_H
> +#define _ASM_ARM64_QRWLOCK_H
> +
> +#include <asm-generic/qrwlock_types.h>
> +
> +#define queue_read_unlock queue_read_unlock
> +static inline void queue_read_unlock(struct qrwlock *lock)
> +{
> + atomic_sub_release(_QR_BIAS, &lock->cnts);
> +}
> +
> +#define queue_write_unlock queue_write_unlock
> +static inline void queue_write_unlock(struct qrwlock *lock)
> +{
> + atomic_sub_release(_QW_LOCKED, &lock->cnts);
> +}
> +
> +#include <asm-generic/qrwlock.h>
> +
> +#endif /* _ASM_ARM64_QRWLOCK_H */

It would be nice if these were default implementations of the unlock, then
architectures just implement atomic_sub_release how they like.

One thing worth mentioning: I have a fairly invasive set of changes pending
for arch/arm64/include/asm/atomic.h, so if you do decide to go with this,
I'm more than happy to take the sub_release part via the arm64 tree. I guess
it depends on when this is likely to get merged.

Cheers,

Will

2014-01-30 18:06:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 05:52:12PM +0000, Will Deacon wrote:
> It would be nice if these were default implementations of the unlock, then
> architectures just implement atomic_sub_release how they like.

Yes, I suppose that makes sense. Last time I proposed the primitive
nobody yelled at me, so I suppose that means people agree :-)

> One thing worth mentioning: I have a fairly invasive set of changes pending
> for arch/arm64/include/asm/atomic.h, so if you do decide to go with this,
> I'm more than happy to take the sub_release part via the arm64 tree. I guess
> it depends on when this is likely to get merged.

I suppose it depends on when I get enough courage to do: vim
arch/*/include/asm/atomic*.h :-)

There's a few other cleanups I want to do, like today I found
atomic_{set,clear}_mask() instead of the more natural atomic_{or,and}()
functions.

I also think we can get rid of the {inc,dec} variants of
smp_mb__{before,after}_atomic() since these barriers should be the same
for _all_ atomic ops that do not already imply full mb semantics, and
they're certainly the same for all current inc/dec.

If tomorrow is another slow day and I get through enough of the review
backlog I might just give it a go.

Anyway, I'll base them on your arm64 changes, I know where to find
those.

2014-01-30 18:13:03

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 06:05:33PM +0000, Peter Zijlstra wrote:
> On Thu, Jan 30, 2014 at 05:52:12PM +0000, Will Deacon wrote:
> > It would be nice if these were default implementations of the unlock, then
> > architectures just implement atomic_sub_release how they like.
>
> Yes, I suppose that makes sense. Last time I proposed the primitive
> nobody yelled at me, so I suppose that means people agree :-)

If it's useful for these qrwlocks, that's good enough for me!

> > One thing worth mentioning: I have a fairly invasive set of changes pending
> > for arch/arm64/include/asm/atomic.h, so if you do decide to go with this,
> > I'm more than happy to take the sub_release part via the arm64 tree. I guess
> > it depends on when this is likely to get merged.
>
> I suppose it depends on when I get enough courage to do: vim
> arch/*/include/asm/atomic*.h :-)

Hehe.

> There's a few other cleanups I want to do, like today I found
> atomic_{set,clear}_mask() instead of the more natural atomic_{or,and}()
> functions.

Have you looked at the OpenCL atomic intrinsics at all?

http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomicFunctions.html

There's a good chance that they can be implemented efficiently on any
architectures that care about OpenCL. As you've noticed, composing them
together can be more efficient on LL/SC-based architectures too.

> I also think we can get rid of the {inc,dec} variants of
> smp_mb__{before,after}_atomic() since these barriers should be the same
> for _all_ atomic ops that do not already imply full mb semantics, and
> they're certainly the same for all current inc/dec.

Makes sense.

> If tomorrow is another slow day and I get through enough of the review
> backlog I might just give it a go.
>
> Anyway, I'll base them on your arm64 changes, I know where to find
> those.

Okey doke. If you need a stable (non-rebasing) branch, just holler.

Cheers,

Will

2014-01-30 18:17:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 06:11:36PM +0000, Will Deacon wrote:
> On Thu, Jan 30, 2014 at 06:05:33PM +0000, Peter Zijlstra wrote:
> > On Thu, Jan 30, 2014 at 05:52:12PM +0000, Will Deacon wrote:
> > > It would be nice if these were default implementations of the unlock, then
> > > architectures just implement atomic_sub_release how they like.
> >
> > Yes, I suppose that makes sense. Last time I proposed the primitive
> > nobody yelled at me, so I suppose that means people agree :-)
>
> If it's useful for these qrwlocks, that's good enough for me!

There's the qspinlock that can also use it.

> Have you looked at the OpenCL atomic intrinsics at all?
>
> http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/atomicFunctions.html
>
> There's a good chance that they can be implemented efficiently on any
> architectures that care about OpenCL. As you've noticed, composing them
> together can be more efficient on LL/SC-based architectures too.

Never looked at OpenCL, I'll have a look.

> Okey doke. If you need a stable (non-rebasing) branch, just holler.

Nah, who cares about those anyway :-)

2014-01-31 09:26:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
> The below is still small and actually works.

OK, so having actually worked through the thing; I realized we can
actually do a version without MCS lock and instead use a ticket lock for
the waitqueue.

This is both smaller (back to 8 bytes for the rwlock_t), and should be
faster under moderate contention for not having to touch extra
cachelines.

Completely untested and with a rather crude generic ticket lock
implementation to illustrate the concept:

---
--- a/include/asm-generic/qrwlock_types.h
+++ b/include/asm-generic/qrwlock_types.h
@@ -3,15 +3,13 @@

#include <linux/atomic.h>

-struct mcs_spinlock;
-
/*
* The queue read/write lock data structure
*/

struct qrwlock {
atomic_t cnts;
- struct mcs_spinlock *waitq;
+ atomic_t tickets;
};

#endif /* __ASM_GENERIC_QRWLOCK_TYPES_H */
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -21,29 +21,32 @@
#include <linux/percpu.h>
#include <linux/hardirq.h>
#include <linux/mutex.h>
-#include <linux/mcs_spinlock.h>
#include <asm/qrwlock.h>

-/*
- * Compared with regular rwlock, the queue rwlock has the following
- * advantages:
- * 1. Even though there is a slight chance of stealing the lock if one comes
- * at the right moment, the granting of the lock is mostly in FIFO order.
- * 2. It is usually faster in high contention situation.
- *
- * The only downside is that the lock is 4 bytes larger in 32-bit systems
- * and 12 bytes larger in 64-bit systems.
- *
- * There are two queues for writers. The writer field of the lock is a
- * one-slot wait queue. The writers that follow will have to wait in the
- * combined reader/writer queue (waitq).
- *
- * Compared with x86 ticket spinlock, the queue rwlock is faster in high
- * contention situation. The writer lock is also faster in single thread
- * operations. Therefore, queue rwlock can be considered as a replacement
- * for those spinlocks that are highly contended as long as an increase
- * in lock size is not an issue.
- */
+#define TICKET_MSB 0x8000
+#define TICKET_MASK 0x7FFF
+
+#define atomic_xadd(v, a) (atomic_add_return((v), (a)) - (v))
+
+static inline void ticket_spin_lock(atomic_t *tickets)
+{
+ u32 val = atomic_xadd(1 << 16, tickets);
+ u32 ticket = (val >> 16) & TICKET_MASK;
+
+ if (unlikely(val & TICKET_MSB))
+ atomic_clear_mask(TICKET_MSB, tickets);
+
+ if (unlikely((val & TICKET_MASK) != ticket)) {
+ while ((smp_load_acquire((u32 *)tickets) & TICKET_MASK) != ticket)
+ cpu_relax();
+ }
+}
+
+static inline void ticket_spin_unlock(atomic_t *tickets)
+{
+ smp_mb__before_atomic_inc();
+ atomic_inc(tickets);
+}

/**
* rspin_until_writer_unlock - inc reader count & spin until writer is gone
@@ -68,7 +71,6 @@ rspin_until_writer_unlock(struct qrwlock
*/
void queue_read_lock_slowpath(struct qrwlock *lock)
{
- struct mcs_spinlock node;
u32 cnts;

/*
@@ -88,7 +90,7 @@ void queue_read_lock_slowpath(struct qrw
/*
* Put the reader into the wait queue
*/
- mcs_spin_lock(&lock->waitq, &node);
+ ticket_spin_lock(&lock->tickets);

/*
* At the head of the wait queue now, wait until the writer state
@@ -100,13 +102,13 @@ void queue_read_lock_slowpath(struct qrw
while (atomic_read(&lock->cnts) & _QW_WMASK)
arch_mutex_cpu_relax();

- cnts = atomic_add_return(_QR_BIAS, &lock->cnts) - _QR_BIAS;
+ cnts = atomic_xadd(_QR_BIAS, &lock->cnts);
rspin_until_writer_unlock(lock, cnts);

/*
* Signal the next one in queue to become queue head
*/
- mcs_spin_unlock(&lock->waitq, &node);
+ ticket_spin_unlock(&lock->tickets);
}
EXPORT_SYMBOL(queue_read_lock_slowpath);

@@ -116,11 +118,10 @@ EXPORT_SYMBOL(queue_read_lock_slowpath);
*/
void queue_write_lock_slowpath(struct qrwlock *lock)
{
- struct mcs_spinlock node;
u32 cnts;

/* Put the writer into the wait queue */
- mcs_spin_lock(&lock->waitq, &node);
+ ticket_spin_lock(&lock->tickets);

/* Try to acquire the lock directly if no reader is present */
if (!atomic_read(&lock->cnts) &&
@@ -152,6 +153,6 @@ void queue_write_lock_slowpath(struct qr
arch_mutex_cpu_relax();
}
unlock:
- mcs_spin_unlock(&lock->waitq, &node);
+ ticket_spin_unlock(&lock->tickets);
}
EXPORT_SYMBOL(queue_write_lock_slowpath);

2014-01-31 10:03:52

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

How about getting rid of that TICKET_MSB mess and doing something like:

#define TICKET_MASK 0xFFFF

static inline void ticket_spin_unlock(atomic_t *tickets)
{
u32 t = atomic_read(tickets);

smp_mb__before_atomic_inc();

/* Increment the low 16 bits without affecting the upper. */
if (unlikely((~t & TICKET_MASK) == 0))
atomic_add(-(int)TICKET_MASK, tickets);
else
atomic_inc(tickets);
}

That also allows up to 2^16 waiters, up from 2^15.
(Minus one on both cases, if you want to be fussy.)

2014-01-31 10:17:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, Jan 31, 2014 at 05:03:48AM -0500, George Spelvin wrote:
> How about getting rid of that TICKET_MSB mess and doing something like:
>
> #define TICKET_MASK 0xFFFF
>
> static inline void ticket_spin_unlock(atomic_t *tickets)
> {
> u32 t = atomic_read(tickets);
>
> smp_mb__before_atomic_inc();
>
> /* Increment the low 16 bits without affecting the upper. */
> if (unlikely((~t & TICKET_MASK) == 0))
> atomic_add(-(int)TICKET_MASK, tickets);
> else
> atomic_inc(tickets);
> }
>
> That also allows up to 2^16 waiters, up from 2^15.
> (Minus one on both cases, if you want to be fussy.)

Ah indeed. That'll work. That said, any arch that can single-copy
address shorts can probably do better than this generic atomic_t thing.

My main point was that we should seriously look at a ticket lock instead
of the MCS one, because while the MCS has better contention behaviour,
we shouldn't optimize locks for the worst contention.

2014-01-31 11:30:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, Jan 31, 2014 at 11:17:29AM +0100, Peter Zijlstra wrote:
> My main point was that we should seriously look at a ticket lock instead
> of the MCS one, because while the MCS has better contention behaviour,
> we shouldn't optimize locks for the worst contention.

In fact, I should have just used arch_spinlock_t and assumed it is a
ticket lock.

2014-01-31 18:59:11

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On 01/31/2014 04:26 AM, Peter Zijlstra wrote:
> On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
>> The below is still small and actually works.
> OK, so having actually worked through the thing; I realized we can
> actually do a version without MCS lock and instead use a ticket lock for
> the waitqueue.
>
> This is both smaller (back to 8 bytes for the rwlock_t), and should be
> faster under moderate contention for not having to touch extra
> cachelines.
>
> Completely untested and with a rather crude generic ticket lock
> implementation to illustrate the concept:
>

Using a ticket lock instead will have the same scalability problem as
the ticket spinlock as all the waiting threads will spin on the lock
cacheline causing a lot of cache bouncing traffic. That is the reason
why I want to replace ticket spinlock with queue spinlock. If the
16-byte size is an issue, I can use the same trick in the queue spinlock
patch to reduce its size down to 8 bytes with a bit more overhead in the
slowpath.

Another thing I want to discuss is whether a bit more overhead in
moderate contention cases is really such a big deal. With moderate
contention, I suppose the amount of time spent in the locking functions
will be just a few percent at most for real workloads. It won't really
be noticeable if the locking functions take, maybe, 50% more time to
finish. Anyway, I am going to do more performance testing on low end
machines.

-Longman

2014-01-31 19:47:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, Jan 31, 2014 at 01:59:02PM -0500, Waiman Long wrote:
> On 01/31/2014 04:26 AM, Peter Zijlstra wrote:
> >On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
> >>The below is still small and actually works.
> >OK, so having actually worked through the thing; I realized we can
> >actually do a version without MCS lock and instead use a ticket lock for
> >the waitqueue.
> >
> >This is both smaller (back to 8 bytes for the rwlock_t), and should be
> >faster under moderate contention for not having to touch extra
> >cachelines.
> >
> >Completely untested and with a rather crude generic ticket lock
> >implementation to illustrate the concept:
> >
>
> Using a ticket lock instead will have the same scalability problem as the
> ticket spinlock as all the waiting threads will spin on the lock cacheline
> causing a lot of cache bouncing traffic. That is the reason why I want to
> replace ticket spinlock with queue spinlock.

But but but, just fix such heavily contended locks. Don't make sensible
code that is lightly contended run slower because of it.

2014-01-31 20:14:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, Jan 31, 2014 at 01:59:02PM -0500, Waiman Long wrote:
> On 01/31/2014 04:26 AM, Peter Zijlstra wrote:
> >On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
> >>The below is still small and actually works.
> >OK, so having actually worked through the thing; I realized we can
> >actually do a version without MCS lock and instead use a ticket lock for
> >the waitqueue.
> >
> >This is both smaller (back to 8 bytes for the rwlock_t), and should be
> >faster under moderate contention for not having to touch extra
> >cachelines.
> >
> >Completely untested and with a rather crude generic ticket lock
> >implementation to illustrate the concept:
> >
>
> Using a ticket lock instead will have the same scalability problem as the
> ticket spinlock as all the waiting threads will spin on the lock cacheline
> causing a lot of cache bouncing traffic.

A much more important point for me is that a fair rwlock has a _much_
better worst case behaviour than the current mess. That's the reason I
was interested in the qrwlock thing. Not because it can run contended on
a 128 CPU system and be faster at being contended.

If you contend a lock with 128 CPUs you need to go fix that code that
causes this abysmal behaviour in the first place.

2014-01-31 21:09:42

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On 01/31/2014 03:14 PM, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 01:59:02PM -0500, Waiman Long wrote:
>> On 01/31/2014 04:26 AM, Peter Zijlstra wrote:
>>> On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
>>>> The below is still small and actually works.
>>> OK, so having actually worked through the thing; I realized we can
>>> actually do a version without MCS lock and instead use a ticket lock for
>>> the waitqueue.
>>>
>>> This is both smaller (back to 8 bytes for the rwlock_t), and should be
>>> faster under moderate contention for not having to touch extra
>>> cachelines.
>>>
>>> Completely untested and with a rather crude generic ticket lock
>>> implementation to illustrate the concept:
>>>
>> Using a ticket lock instead will have the same scalability problem as the
>> ticket spinlock as all the waiting threads will spin on the lock cacheline
>> causing a lot of cache bouncing traffic.
> A much more important point for me is that a fair rwlock has a _much_
> better worst case behaviour than the current mess. That's the reason I
> was interested in the qrwlock thing. Not because it can run contended on
> a 128 CPU system and be faster at being contended.
>
> If you contend a lock with 128 CPUs you need to go fix that code that
> causes this abysmal behaviour in the first place.
>
>

I am not against the use of ticket spinlock as the queuing mechanism on
small systems. I do have concern about the contended performance on
large NUMA systems which is my primary job responsibility. Depending on
the workload, contention can happens anywhere. So it is easier said than
done to fix whatever lock contention that may happen.

How about making the selection of MCS or ticket queuing either user
configurable or depending on the setting of NR_CPUS, NUMA, etc?

-Longman


2014-02-01 01:29:51

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, 2014-01-31 at 16:09 -0500, Waiman Long wrote:
> On 01/31/2014 03:14 PM, Peter Zijlstra wrote:
> > On Fri, Jan 31, 2014 at 01:59:02PM -0500, Waiman Long wrote:
> >> On 01/31/2014 04:26 AM, Peter Zijlstra wrote:
> >>> On Thu, Jan 30, 2014 at 04:17:15PM +0100, Peter Zijlstra wrote:
> >>>> The below is still small and actually works.
> >>> OK, so having actually worked through the thing; I realized we can
> >>> actually do a version without MCS lock and instead use a ticket lock for
> >>> the waitqueue.
> >>>
> >>> This is both smaller (back to 8 bytes for the rwlock_t), and should be
> >>> faster under moderate contention for not having to touch extra
> >>> cachelines.
> >>>
> >>> Completely untested and with a rather crude generic ticket lock
> >>> implementation to illustrate the concept:
> >>>
> >> Using a ticket lock instead will have the same scalability problem as the
> >> ticket spinlock as all the waiting threads will spin on the lock cacheline
> >> causing a lot of cache bouncing traffic.
> > A much more important point for me is that a fair rwlock has a _much_
> > better worst case behaviour than the current mess. That's the reason I
> > was interested in the qrwlock thing. Not because it can run contended on
> > a 128 CPU system and be faster at being contended.
> >
> > If you contend a lock with 128 CPUs you need to go fix that code that
> > causes this abysmal behaviour in the first place.
> >

But the kernel should also be prepared for such situations, whenever
possible.

> >
>
> I am not against the use of ticket spinlock as the queuing mechanism on
> small systems. I do have concern about the contended performance on
> large NUMA systems which is my primary job responsibility. Depending on
> the workload, contention can happens anywhere. So it is easier said than
> done to fix whatever lock contention that may happen.
>
> How about making the selection of MCS or ticket queuing either user
> configurable or depending on the setting of NR_CPUS, NUMA, etc?

Users have no business making these decisions or being exposed to these
kinds of internals. CONFIG_NUMA sounds reasonable to me.

Thanks,
Davidlohr

2014-02-01 10:38:25

by George Spelvin

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 01:59:02PM -0500, Waiman Long wrote:
>> Using a ticket lock instead will have the same scalability problem as the
>> ticket spinlock as all the waiting threads will spin on the lock cacheline
>> causing a lot of cache bouncing traffic. That is the reason why I want to
>> replace ticket spinlock with queue spinlock.

> But but but, just fix such heavily contended locks. Don't make sensible
> code that is lightly contended run slower because of it.

While I agree that zero slowdown for "good" code is the goal, it is
impossible for the kernel to consist of only "good" code.

In particular, obscure error conditions causing locked regions to take
much longer than expected will never be completely expurgated; there's
a point where you just say "I'm not working for a week to save 10 people
per year a 2-minute stall."

What Waiman noted is that ticket locks take O(n^2) cache line transfers
to clear n waiters from the queue. (Each write must be broadcast to
each spinning reader.) So if you *do* get most of a large multiprocessor
piled up on a ticket lock, the performance can be disastrous.

It can conceivably send a large system into a "congestion collapse"
where the queue never clears. And it can affect processors (such as
other partitions of a large machine) that aren't even contending for
the lock.

The MCS lock is O(1) per release and O(n) to clear n waiters. This is
a noticeable improvement on 4- or 8-way contention, and (Waiman reports)
a huge improvement on 50-way and up.

Yes, if such contention occurs with any frequency at all, it should be
fixed, but it does seem worth mitigating problems in the meantime.

(As an aside, I have in the past heard people criticize the Linux kernel
for being optimized for the average case at the expense of worst-case
corner cases.)

Are we agreed that *not* improving highly-contended performance on the
grounds that it would discourage other optimization is as stupid as not
wearing a seat-belt because that would discourage more careful driving?


While I do think *some* benchmarking on smaller SMP systems is wanted,
given that Waiman has managed to make the *uncontended* case faster,
and *that* is by far the most common case, it's quite plausible that it
will turn out to be a net performance improvement on 4- and 8-way systems.

2014-02-01 23:22:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Fri, Jan 31, 2014 at 11:17:29AM +0100, Peter Zijlstra wrote:
> On Fri, Jan 31, 2014 at 05:03:48AM -0500, George Spelvin wrote:
> > How about getting rid of that TICKET_MSB mess and doing something like:
> >
> > #define TICKET_MASK 0xFFFF
> >
> > static inline void ticket_spin_unlock(atomic_t *tickets)
> > {
> > u32 t = atomic_read(tickets);
> >
> > smp_mb__before_atomic_inc();
> >
> > /* Increment the low 16 bits without affecting the upper. */
> > if (unlikely((~t & TICKET_MASK) == 0))
> > atomic_add(-(int)TICKET_MASK, tickets);
> > else
> > atomic_inc(tickets);
> > }
> >
> > That also allows up to 2^16 waiters, up from 2^15.
> > (Minus one on both cases, if you want to be fussy.)
>
> Ah indeed. That'll work. That said, any arch that can single-copy
> address shorts can probably do better than this generic atomic_t thing.
>
> My main point was that we should seriously look at a ticket lock instead
> of the MCS one, because while the MCS has better contention behaviour,
> we shouldn't optimize locks for the worst contention.

We do need to articulate what we -are- optimizing for, or more generally,
what we are trying to accomplish with these locking primitives. Here
are some of the traditional goals:

1. Avoid performance collapse under heavy load. Here the goal is
to have throughput level off rather than decreasing when load
exceeds saturation levels.

2. Avoid starvation of requesters. Here the goal is a very
minimal level of fairness.

3. Achieve some reasonable level of fairness if the underlying
hardware provides fair access to memory. Here "reasonable"
might mean that lock-grant probabilities not differ by more
than (say) a factor of two.

4. Achieve some reasonable level of fairness even if the underlying
hardware is unfair. Rumor has it that some of the very large
systems in use today do not guarantee fair access to cache lines
under heavy memory contention.

5. Achieve some minimal level of scalability, so that throughput
increases monotonically with increasing load. Preferably
without penalizing small-system performance.

6. Achieve some reasonable level of scalability, so that adding
a given CPU provides at least some fraction (say, 80%) of
one CPU's worth of incremental throughput.

7. Achieve linear scalability.

Ticket locking has problems with #1 on some systems due to the thundering
herd of reads following each invalidation. (Yes, it is not as bad
as the atomic-operation thundering herd associated with test-and-set
spinlocks, but it can nevertheless be a problem on larger systems.)
Achieving #4 requires some sort of queued lock as far as I can see.
Although you -might- get #5 with a very clever NUMA-aware lock, #6 and
#7 will require some level of redesign.

Fancy locks can help if the common code path is fully parallelized,
but where some rare error path is not worth the effort. In this case,
using a fancy lock on the error path can at least keep the system moving
during the time that the error condition is in force, and hopefully
permit returning to full throughput once the error condition is gone.

So, what exactly are we trying to achieve with all this? ;-)

Thanx, Paul

2014-02-02 09:04:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation


* Waiman Long <[email protected]> wrote:

> How about making the selection of MCS or ticket queuing either user
> configurable or depending on the setting of NR_CPUS, NUMA, etc?

No!

There are lots of disadvantages to adding such CONFIG_NUMA Kconfig
variants for locking primitives:

- a doubling of the test matrix

- a doubling of the review matrix and a halving of effective review
capacity: we've just about got the capacity to review and validate
patches like this. Splitting out an 'only NUMA cares' variant is a
non-starter, really.

- but most importantly, there's absolutely no reason to not be fast
on 128 CPU systems in the low contended case either! Sacrificing
the low contended case with 'on 128 CPU systems it is the contended
path that matters' is an idiotic argument.

Essentially the only cases where we allow Kconfig dependencies are
unyielding physical forces: such as lots of CPUs needing a wider CPU
mask.

As Peter said, the right solution is to fix the contended case. If
that also happens to speed up or better organize the uncontended code
then that's good, but it should not make it worse.

Thanks,

Ingo

2014-02-03 11:39:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On Sun, Feb 02, 2014 at 10:03:57AM +0100, Ingo Molnar wrote:
>
> * Waiman Long <[email protected]> wrote:
>
> > How about making the selection of MCS or ticket queuing either user
> > configurable or depending on the setting of NR_CPUS, NUMA, etc?
>
> No!
>
> There are lots of disadvantages to adding such CONFIG_NUMA Kconfig
> variants for locking primitives:
>
> - a doubling of the test matrix
>
> - a doubling of the review matrix and a halving of effective review
> capacity: we've just about got the capacity to review and validate
> patches like this. Splitting out an 'only NUMA cares' variant is a
> non-starter, really.
>
> - but most importantly, there's absolutely no reason to not be fast
> on 128 CPU systems in the low contended case either! Sacrificing
> the low contended case with 'on 128 CPU systems it is the contended
> path that matters' is an idiotic argument.

Not to mention distros will only ever pick one.

Also, there are probably several orders of magnitude more single- and
dual-socket systems out there than there are quad-socket or bigger systems.

So even if the lightly contended case is say only 1% slower, that could
still add up to more cycles lost than won over all computers out there
running Linux.

2014-02-06 03:09:05

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v11 0/4] Introducing a queue read/write lock implementation

On 02/02/2014 04:03 AM, Ingo Molnar wrote:
> * Waiman Long<[email protected]> wrote:
>
>> How about making the selection of MCS or ticket queuing either user
>> configurable or depending on the setting of NR_CPUS, NUMA, etc?
> No!
>
> There are lots of disadvantages to adding such CONFIG_NUMA Kconfig
> variants for locking primitives:
>
> - a doubling of the test matrix
>
> - a doubling of the review matrix and a halving of effective review
> capacity: we've just about got the capacity to review and validate
> patches like this. Splitting out an 'only NUMA cares' variant is a
> non-starter really.
>
> - but most importantly, there's absolutely no reason to not be fast
> on 128 CPU systems in the low contended case either! Sacrificing
> the low contended case with 'on 128 CPU systems it is the contended
> path that matters' is an idiotic argument.
>
> Essentially the only cases where we allow Kconfig dependencies are
> unyielding physical forces: such as lots of CPUs needing a wider CPU
> mask.
>
> As Peter said, the right solution is to fix the contended case. If
> that also happens to speed up or better organize the uncontended code
> then that's good, but it should not make it worse.
>
> Thanks,
>
> Ingo

You are right. I am trying to measure the performance impact that MCS
queuing has on a lightly contended system. I need to write some custom
test code to get that information. With that information, I may be able
to tune it to perform more or less on par with the ticket lock.

As for the additional cache line access of the MCS lock, I don't think
it is really an issue, as the MCS node is allocated on the local stack,
which is likely to be in the cache anyway. I will report back when I
have more data.

-Longman