2014-01-08 17:00:33

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 0/4] qrwlock: Introducing a queue read/write lock implementation

v7->v8:
- Use atomic_t functions (which are implemented in all arch's) to
modify reader counts.
- Use smp_load_acquire() & smp_store_release() for barriers.
- Further tuning in slowpath performance.

v6->v7:
- Remove support for unfair lock, so only fair qrwlock will be provided.
- Move qrwlock.c to the kernel/locking directory.

v5->v6:
- Modify queue_read_can_lock() to avoid false positive result.
- Move the two slowpath functions' performance tuning change from
patch 4 to patch 1.
- Add a new optional patch to use the new smp_store_release() function
if that is merged.

v4->v5:
- Fix wrong definitions for QW_MASK_FAIR & QW_MASK_UNFAIR macros.
- Add an optional patch 4 which should only be applied after the
mcs_spinlock.h header file is merged.

v3->v4:
- Optimize the fast path with better cold cache behavior and
performance.
- Removing some testing code.
- Make x86 use queue rwlock with no user configuration.

v2->v3:
- Make read lock stealing the default and fair rwlock an option with
a different initializer.
- In queue_read_lock_slowpath(), check irq_count() and force spinning
and lock stealing in interrupt context.
- Unify the fair and classic read-side code path, and make the write
side use cmpxchg with 2 different writer states. This slows down the
write lock fastpath to make the read side more efficient, but is
still slightly faster than a spinlock.

v1->v2:
- Improve lock fastpath performance.
- Optionally provide classic read/write lock behavior for backward
compatibility.
- Use xadd instead of cmpxchg for the fair reader code path to make it
immune to reader contention.
- Run more performance testing.

As mentioned in the LWN article http://lwn.net/Articles/364583/,
the read/write lock suffers from an unfairness problem: it is
possible for a stream of incoming readers to block a waiting writer
from getting the lock for a long time. Also, a waiting reader/writer
contending a rwlock in local memory will have a higher chance of
acquiring the lock than a reader/writer in a remote node.

This patch set introduces a queue-based read/write lock implementation
that can largely solve this unfairness problem.

The read lock slowpath will check if the reader is in an interrupt
context. If so, it will force lock spinning and stealing without
waiting in a queue. This is to ensure the read lock will be granted
as soon as possible.

The queue write lock can also be used as a replacement for ticket
spinlocks that are highly contended if lock size increase is not
an issue.
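
For example, a heavily contended spinlock protecting a write-mostly
structure could be swapped for a write-only queue rwlock. A hypothetical
sketch (not part of this patch set; "my_table_lock", "my_table" and
"my_entry" are made-up names used only for illustration):

#include <linux/spinlock.h>
#include <linux/list.h>

struct my_entry {
        struct list_head node;
};

static DEFINE_RWLOCK(my_table_lock);    /* was: DEFINE_SPINLOCK(my_table_lock) */
static LIST_HEAD(my_table);

static void my_table_add(struct my_entry *e)
{
        write_lock(&my_table_lock);     /* was: spin_lock(&my_table_lock) */
        list_add(&e->node, &my_table);
        write_unlock(&my_table_lock);   /* was: spin_unlock(&my_table_lock) */
}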

This patch set currently provides queue read/write lock support on
x86 architecture only. Support for other architectures can be added
later on once architecture specific support infrastructure is added
and proper testing is done.

The optional patch 3 has a dependency on the mcs_spinlock.h header
file which has not been merged yet. So this patch should only be
applied after the mcs_spinlock.h header file is merged.

The optional patch 4 uses the new smp_store_release() and
smp_load_acquire() functions which are still under review and have
not been merged yet.

Waiman Long (4):
qrwlock: A queue read/write lock implementation
qrwlock x86: Enable x86 to use queue read/write lock
qrwlock: Use the mcs_spinlock helper functions for MCS queuing
qrwlock: Use smp_store_release() in write_unlock()

arch/x86/Kconfig | 1 +
arch/x86/include/asm/spinlock.h | 2 +
arch/x86/include/asm/spinlock_types.h | 4 +
include/asm-generic/qrwlock.h | 203 +++++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qrwlock.c | 191 +++++++++++++++++++++++++++++++
7 files changed, 409 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qrwlock.h
create mode 100644 kernel/locking/qrwlock.c


2014-01-08 17:00:43

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 3/4] qrwlock: Use the mcs_spinlock helper functions for MCS queuing

There is a pending MCS lock patch series that adds generic MCS lock
helper functions to do MCS-style locking. This patch enables the
queue rwlock to use those generic MCS lock/unlock primitives for its
internal queuing. This patch should only be merged after that generic
MCS locking patch has been merged.
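
For reference, the generic MCS helpers from that pending series are
expected to look roughly like the sketch below. This is an illustration
only, modeled on the qrwnode-based code being removed here; the
authoritative struct mcs_spinlock, mcs_spin_lock() and mcs_spin_unlock()
definitions are provided by the MCS patch series itself:

struct mcs_spinlock {
        struct mcs_spinlock *next;
        int locked;             /* 1 when the lock is passed to this node */
};

static inline void
mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
        struct mcs_spinlock *prev;

        node->locked = 0;
        node->next   = NULL;

        prev = xchg(lock, node);        /* queue at the tail */
        if (likely(!prev))
                return;                 /* queue was empty, lock acquired */
        ACCESS_ONCE(prev->next) = node;
        /* Wait until the previous owner hands the lock over */
        while (!ACCESS_ONCE(node->locked))
                arch_mutex_cpu_relax();
        smp_mb();
}

static inline void
mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
{
        struct mcs_spinlock *next = ACCESS_ONCE(node->next);

        if (likely(!next)) {
                /* No successor visible; try to release the queue tail */
                if (cmpxchg(lock, node, NULL) == node)
                        return;
                /* A successor is queuing; wait for it to set node->next */
                while (!(next = ACCESS_ONCE(node->next)))
                        arch_mutex_cpu_relax();
        }
        smp_mb();
        ACCESS_ONCE(next->locked) = 1;  /* pass the lock to the successor */
}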

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
include/asm-generic/qrwlock.h | 7 +---
kernel/locking/qrwlock.c | 71 ++++-------------------------------------
2 files changed, 9 insertions(+), 69 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index c26be3b..2b9a7b4 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -38,10 +38,7 @@
* the writer field. The least significant 8 bits is the writer field
* whereas the remaining 24 bits is the reader count.
*/
-struct qrwnode {
- struct qrwnode *next;
- bool wait; /* Waiting flag */
-};
+struct mcs_spinlock;

typedef struct qrwlock {
union qrwcnts {
@@ -57,7 +54,7 @@ typedef struct qrwlock {
atomic_t rwa; /* Reader/writer atomic */
u32 rwc; /* Reader/writer counts */
} cnts;
- struct qrwnode *waitq; /* Tail of waiting queue */
+ struct mcs_spinlock *waitq; /* Tail of waiting queue */
} arch_rwlock_t;

/*
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index ae45dae..1b3ffb2 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -20,6 +20,7 @@
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <linux/mcs_spinlock.h>
#include <asm-generic/qrwlock.h>

/*
@@ -92,64 +93,6 @@
#endif

/**
- * wait_in_queue - Add to queue and wait until it is at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to be added to the queue
- */
-static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
-{
- struct qrwnode *prev;
-
- node->next = NULL;
- node->wait = true;
- prev = xchg(&lock->waitq, node);
- if (prev) {
- prev->next = node;
- /*
- * Wait until the waiting flag is off
- */
- while (smp_load_acquire(&node->wait))
- arch_mutex_cpu_relax();
- }
-}
-
-/**
- * signal_next - Signal the next one in queue to be at the head
- * @lock: Pointer to queue rwlock structure
- * @node: Node pointer to the current head of queue
- */
-static inline void signal_next(struct qrwlock *lock, struct qrwnode *node)
-{
- struct qrwnode *next;
-
- /*
- * Try to notify the next node first without disturbing the cacheline
- * of the lock. If that fails, check to see if it is the last node
- * and so should clear the wait queue.
- */
- next = ACCESS_ONCE(node->next);
- if (likely(next))
- goto notify_next;
-
- /*
- * Clear the wait queue if it is the last node
- */
- if ((ACCESS_ONCE(lock->waitq) == node) &&
- (cmpxchg(&lock->waitq, node, NULL) == node))
- return;
- /*
- * Wait until the next one in queue set up the next field
- */
- while (likely(!(next = ACCESS_ONCE(node->next))))
- arch_mutex_cpu_relax();
- /*
- * The next one in queue is now at the head
- */
-notify_next:
- smp_store_release(&next->wait, false);
-}
-
-/**
* rspin_until_writer_unlock - inc reader count & spin until writer is gone
* @lock : Pointer to queue rwlock structure
* @writer: Current queue rwlock writer status byte
@@ -172,7 +115,7 @@ rspin_until_writer_unlock(struct qrwlock *lock, u8 writer)
*/
void queue_read_lock_slowpath(struct qrwlock *lock)
{
- struct qrwnode node;
+ struct mcs_spinlock node;
union qrwcnts cnts;

/*
@@ -192,7 +135,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
/*
* Put the reader into the wait queue
*/
- wait_in_queue(lock, &node);
+ mcs_spin_lock(&lock->waitq, &node);

/*
* At the head of the wait queue now, wait until the writer state
@@ -209,7 +152,7 @@ void queue_read_lock_slowpath(struct qrwlock *lock)
/*
* Signal the next one in queue to become queue head
*/
- signal_next(lock, &node);
+ mcs_spin_unlock(&lock->waitq, &node);
}
EXPORT_SYMBOL(queue_read_lock_slowpath);

@@ -265,18 +208,18 @@ static inline void queue_write_3step_lock(struct qrwlock *lock)
*/
void queue_write_lock_slowpath(struct qrwlock *lock)
{
- struct qrwnode node;
+ struct mcs_spinlock node;

/*
* Put the writer into the wait queue
*/
- wait_in_queue(lock, &node);
+ mcs_spin_lock(&lock->waitq, &node);

/*
* At the head of the wait queue now, call queue_write_3step_lock()
* to acquire the lock until it is done.
*/
queue_write_3step_lock(lock);
- signal_next(lock, &node);
+ mcs_spin_unlock(&lock->waitq, &node);
}
EXPORT_SYMBOL(queue_write_lock_slowpath);
--
1.7.1

2014-01-08 17:00:40

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 2/4] qrwlock x86: Enable x86 to use queue read/write lock

This patch makes the necessary changes at the x86 architecture-specific
layer so that the CONFIG_QUEUE_RWLOCK kernel option can replace the
regular read/write lock with the queue read/write lock.

It also enables the CONFIG_QUEUE_RWLOCK option by default for x86,
which forces the use of the queue read/write lock. That greatly
improves the fairness of the read/write lock and eliminates the
live-lock situation where one task may not get the lock for an
indefinite period of time.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/spinlock.h | 2 ++
arch/x86/include/asm/spinlock_types.h | 4 ++++
3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e903c71..c4a9c54 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -124,6 +124,7 @@ config X86
select RTC_LIB
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
+ select QUEUE_RWLOCK

config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index bf156de..8fb88c5 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -188,6 +188,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
cpu_relax();
}

+#ifndef CONFIG_QUEUE_RWLOCK
/*
* Read-write spinlocks, allowing multiple readers
* but only one writer.
@@ -270,6 +271,7 @@ static inline void arch_write_unlock(arch_rwlock_t *rw)
asm volatile(LOCK_PREFIX WRITE_LOCK_ADD(%1) "%0"
: "+m" (rw->write) : "i" (RW_LOCK_BIAS) : "memory");
}
+#endif /* CONFIG_QUEUE_RWLOCK */

#define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
#define arch_write_lock_flags(lock, flags) arch_write_lock(lock)
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index 4f1bea1..a585635 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -34,6 +34,10 @@ typedef struct arch_spinlock {

#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } }

+#ifdef CONFIG_QUEUE_RWLOCK
+#include <asm-generic/qrwlock.h>
+#else
#include <asm/rwlock.h>
+#endif

#endif /* _ASM_X86_SPINLOCK_TYPES_H */
--
1.7.1

2014-01-08 17:01:07

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

This patch modifies the queue_write_unlock() function to use the
new smp_store_release() function from another pending patch. It also
removes the temporary implementations of the smp_load_acquire() and
smp_store_release() functions in qrwlock.c.

This patch should only be merged after PeterZ's linux-arch patch
is merged.
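
The release store is intended to pair with the acquire load in
rspin_until_writer_unlock(), so a reader that observes the writer byte
going to 0 also observes all of the writer's critical-section stores.
A minimal illustration (not part of the patch; "shared_data",
"new_value" and "r" are stand-ins for data protected by the lock):

        /* Writer side, ending its critical section via queue_write_unlock(): */
        shared_data = new_value;                        /* protected store        */
        smp_store_release(&lock->cnts.writer, 0);       /* orders the store above */

        /* Reader side, spinning in rspin_until_writer_unlock(): */
        while (smp_load_acquire(&lock->cnts.writer) == _QW_LOCKED)
                arch_mutex_cpu_relax();
        r = shared_data;        /* guaranteed to see new_value here */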

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
---
include/asm-generic/qrwlock.h | 4 +---
kernel/locking/qrwlock.c | 34 ----------------------------------
2 files changed, 1 insertions(+), 37 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 2b9a7b4..4d4bd04 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -179,9 +179,7 @@ static inline void queue_write_unlock(struct qrwlock *lock)
/*
* Make sure that none of the critical section will be leaked out.
*/
- smp_mb__before_clear_bit();
- ACCESS_ONCE(lock->cnts.writer) = 0;
- smp_mb__after_clear_bit();
+ smp_store_release(&lock->cnts.writer, 0);
}

/*
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 1b3ffb2..3d3ba2b 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -48,40 +48,6 @@
# define arch_mutex_cpu_relax() cpu_relax()
#endif

-#ifndef smp_load_acquire
-# ifdef CONFIG_X86
-# define smp_load_acquire(p) \
- ({ \
- typeof(*p) ___p1 = ACCESS_ONCE(*p); \
- barrier(); \
- ___p1; \
- })
-# else
-# define smp_load_acquire(p) \
- ({ \
- typeof(*p) ___p1 = ACCESS_ONCE(*p); \
- smp_mb(); \
- ___p1; \
- })
-# endif
-#endif
-
-#ifndef smp_store_release
-# ifdef CONFIG_X86
-# define smp_store_release(p, v) \
- do { \
- barrier(); \
- ACCESS_ONCE(*p) = v; \
- } while (0)
-# else
-# define smp_store_release(p, v) \
- do { \
- smp_mb(); \
- ACCESS_ONCE(*p) = v; \
- } while (0)
-# endif
-#endif
-
/*
* If an xadd (exchange-add) macro isn't available, simulate one with
* the atomic_add_return() function.
--
1.7.1

2014-01-08 17:01:33

by Waiman Long

[permalink] [raw]
Subject: [PATCH v8 1/4] qrwlock: A queue read/write lock implementation

This patch introduces a new read/write lock implementation that puts
waiting readers and writers into a queue instead of actively contending
for the lock like the current read/write lock implementation. This
improves performance in highly contended situations by reducing the
cache line bouncing effect.

The queue read/write lock (qrwlock) is a fair lock even though there
is still a slight chance of lock stealing if a reader or writer comes
at the right moment. Other than that, lock granting is done in a
FIFO manner. As a result, it is possible to determine a maximum time
period after which the waiting is over and the lock can be acquired.

Internally, however, there is a second type of reader which tries to
steal the lock aggressively: it simply increments the reader count and
waits until the writer releases the lock. The transition to an
aggressive reader happens in the read lock slowpath when

1. the reader is in an interrupt context, or
2. the reader comes to the head of the wait queue and sees
the release of a write lock.

The queue read lock is safe to use in an interrupt context (softirq
or hardirq) as it will switch to being an aggressive reader in such an
environment, allowing recursive read locks.
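
For illustration, this is the kind of usage the aggressive-reader path
keeps deadlock-free. A hypothetical example, not taken from the patch
("my_lock", "my_irq_handler" and "my_reader" are made-up names):

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_RWLOCK(my_lock);          /* becomes a qrwlock with this series */

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
        /* in_irq(): spins as an aggressive reader, never queues */
        read_lock(&my_lock);
        /* ... read shared state ... */
        read_unlock(&my_lock);
        return IRQ_HANDLED;
}

static void my_reader(void)
{
        read_lock(&my_lock);
        /*
         * An interrupt taking my_lock for read here cannot deadlock,
         * even if a writer is already queued behind this reader.
         */
        read_unlock(&my_lock);
}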

The only downside of queue rwlock is the size increase in the lock
structure by 4 bytes for 32-bit systems and by 12 bytes for 64-bit
systems.

In terms of single-thread performance (no contention), a 256K
lock/unlock loop was run on 2.4GHz and 2.93GHz Westmere x86-64
CPUs. The following table shows the average time (in ns) for a single
lock/unlock sequence (including the looping and timing overhead):

Lock Type           2.4GHz   2.93GHz
---------           ------   -------
Ticket spinlock       14.9      12.3
Read lock             17.0      13.5
Write lock            17.0      13.5
Queue read lock       16.0      13.4
Queue write lock       9.2       7.8

The queue read lock is slightly slower than the ticket spinlock, but
is slightly faster than the read lock. The queue write lock, however,
is the fastest of all. It is almost twice as fast as the write lock
and about 1.5X as fast as the spinlock.
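
The measurement loop was essentially of the following form; this is a
sketch of the methodology only, as the actual test harness is not part
of this patch set ("time_read_lock" and "NR_LOOPS" are illustrative
names):

#include <linux/spinlock.h>
#include <linux/sched.h>        /* sched_clock() */
#include <linux/math64.h>       /* div_u64() */

#define NR_LOOPS        (256 * 1024)

static u64 time_read_lock(rwlock_t *lock)
{
        u64 start, end;
        int i;

        start = sched_clock();
        for (i = 0; i < NR_LOOPS; i++) {
                read_lock(lock);
                read_unlock(lock);
        }
        end = sched_clock();

        /* Average ns per lock/unlock pair, looping overhead included */
        return div_u64(end - start, NR_LOOPS);
}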

With lock contention, the speed of each individual lock/unlock function
is less important than the amount of contention-induced delays.

To investigate the performance characteristics of the queue rwlock
compared with the regular rwlock, Ingo's anon_vmas patch that converts
rwsem to rwlock was applied to a 3.12 kernel. This kernel was then
tested under the following 3 conditions:

1) Plain 3.12
2) Ingo's patch
3) Ingo's patch + qrwlock

Each of the 3 kernels was booted up twice, with and without the
"idle=poll" kernel parameter which keeps the CPUs in C0 state while
idling instead of a more energy-saving sleep state. The jobs per
minute (JPM) results of the AIM7 high_systime workload at 1500
users on an 8-socket 80-core DL980 (HT off) were:

Kernel      JPMs          %Change from (1)
------      ----          ----------------
  1     145704/227295            -
  2     229750/236066       +58%/+3.8%
  3     240062/248606       +65%/+9.4%

The first JPM number is without the "idle=poll" kernel parameter,
the second number is with that parameter. It can be seen that most
of the performance benefit of converting rwsem to rwlock actually
comes from the latency improvement of not needing to wake up a CPU
from a deep sleep state when work is available.

The use of non-sleeping locks did improve performance by eliminating
the context switching cost. Using the queue rwlock almost tripled
that performance gain. The gain was reduced somewhat by using a fair
lock, which was to be expected.

Looking at the perf profiles (with idle=poll) below, we can clearly see
that other bottlenecks were constraining the performance improvement.

Perf profile of kernel (2):

18.65% reaim [kernel.kallsyms] [k] __write_lock_failed
9.00% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.21% swapper [kernel.kallsyms] [k] cpu_idle_loop
3.08% reaim [kernel.kallsyms] [k] mspin_lock
2.50% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.00% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.29% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.21% reaim [kernel.kallsyms] [k] __read_lock_failed
1.12% reaim [kernel.kallsyms] [k] _raw_spin_lock
1.10% reaim [kernel.kallsyms] [k] perf_event_aux
1.09% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave

Perf profile of kernel (3):

20.14% swapper [kernel.kallsyms] [k] cpu_idle_loop
7.94% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.41% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
5.01% reaim [kernel.kallsyms] [k] mspin_lock
2.12% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
2.07% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.58% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.25% reaim [kernel.kallsyms] [k] queue_write_3step_lock
1.18% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
1.14% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.95% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

20.55% swapper [kernel.kallsyms] [k] cpu_idle_loop
7.74% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
6.47% reaim [kernel.kallsyms] [k] queue_write_lock_slowpath
4.41% reaim [kernel.kallsyms] [k] mspin_lock
2.18% ls [kernel.kallsyms] [k] _raw_spin_lock_irqsave
2.07% reaim [kernel.kallsyms] [k] anon_vma_interval_tree_insert
1.49% reaim [kernel.kallsyms] [k] update_cfs_rq_blocked_load
1.43% true [kernel.kallsyms] [k] _raw_spin_lock_irqsave
1.17% reaim [kernel.kallsyms] [k] queue_read_lock_slowpath
0.94% reaim [kernel.kallsyms] [k] mutex_spin_on_owner

The spinlock bottlenecks are shown below.

7.74% reaim [kernel.kallsyms] [k] _raw_spin_lock_irqsave
|--65.94%-- release_pages
|--31.37%-- pagevec_lru_move_fn
|--0.79%-- get_page_from_freelist
|--0.62%-- __page_cache_release
--1.28%-- [...]

For both the release_pages() and pagevec_lru_move_fn() functions, the
spinlock contention was on zone->lru_lock.

Tim Chen also tested the qrwlock with Ingo's patch on a 4-socket
machine. It was found that the performance improvement of 11% was the
same with either the regular rwlock or the queue rwlock.

Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/qrwlock.h | 208 ++++++++++++++++++++++++++++++
kernel/Kconfig.locks | 7 +
kernel/locking/Makefile | 1 +
kernel/locking/qrwlock.c | 282 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 498 insertions(+), 0 deletions(-)
create mode 100644 include/asm-generic/qrwlock.h
create mode 100644 kernel/locking/qrwlock.c

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
new file mode 100644
index 0000000..c26be3b
--- /dev/null
+++ b/include/asm-generic/qrwlock.h
@@ -0,0 +1,208 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __ASM_GENERIC_QRWLOCK_H
+#define __ASM_GENERIC_QRWLOCK_H
+
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <asm/bitops.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+#include <asm/byteorder.h>
+
+#if !defined(__LITTLE_ENDIAN) && !defined(__BIG_ENDIAN)
+#error "Missing either LITTLE_ENDIAN or BIG_ENDIAN definition."
+#endif
+
+/*
+ * The queue read/write lock data structure
+ *
+ * The layout of the structure is endian-sensitive to make sure that adding
+ * _QR_BIAS to the rw field to increment the reader count won't disturb
+ * the writer field. The least significant 8 bits is the writer field
+ * whereas the remaining 24 bits is the reader count.
+ */
+struct qrwnode {
+ struct qrwnode *next;
+ bool wait; /* Waiting flag */
+};
+
+typedef struct qrwlock {
+ union qrwcnts {
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u8 writer; /* Writer state */
+#else
+ u16 r16; /* Reader count - msb */
+ u8 r8; /* Reader count - lsb */
+ u8 writer; /* Writer state */
+#endif
+ };
+ atomic_t rwa; /* Reader/writer atomic */
+ u32 rwc; /* Reader/writer counts */
+ } cnts;
+ struct qrwnode *waitq; /* Tail of waiting queue */
+} arch_rwlock_t;
+
+/*
+ * Writer states & reader shift and bias
+ */
+#define _QW_WAITING 1 /* A writer is waiting */
+#define _QW_LOCKED 0xff /* A writer holds the lock */
+#define _QR_SHIFT 8 /* Reader count shift */
+#define _QR_BIAS (1U << _QR_SHIFT)
+
+/*
+ * External function declarations
+ */
+extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_write_lock_slowpath(struct qrwlock *lock);
+
+/**
+ * queue_read_can_lock- would read_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_read_can_lock(struct qrwlock *lock)
+{
+ return !ACCESS_ONCE(lock->cnts.writer);
+}
+
+/**
+ * queue_write_can_lock- would write_trylock() succeed?
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline int queue_write_can_lock(struct qrwlock *lock)
+{
+ return !ACCESS_ONCE(lock->cnts.rwc);
+}
+
+/**
+ * queue_read_trylock - try to acquire read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_read_trylock(struct qrwlock *lock)
+{
+ union qrwcnts cnts;
+
+ cnts.rwc = ACCESS_ONCE(lock->cnts.rwc);
+ if (likely(!cnts.writer)) {
+ cnts.rwc = (u32)atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
+ if (likely(!cnts.writer))
+ return 1;
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+ }
+ return 0;
+}
+
+/**
+ * queue_write_trylock - try to acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static inline int queue_write_trylock(struct qrwlock *lock)
+{
+ union qrwcnts old, new;
+
+ old.rwc = ACCESS_ONCE(lock->cnts.rwc);
+ if (likely(!old.rwc)) {
+ new.rwc = old.rwc;
+ new.writer = _QW_LOCKED;
+ if (likely(cmpxchg(&lock->cnts.rwc, old.rwc, new.rwc)
+ == old.rwc))
+ return 1;
+ }
+ return 0;
+}
+/**
+ * queue_read_lock - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+static inline void queue_read_lock(struct qrwlock *lock)
+{
+ union qrwcnts cnts;
+
+ cnts.rwc = atomic_add_return(_QR_BIAS, &lock->cnts.rwa);
+ if (likely(!cnts.writer))
+ return;
+ /*
+ * Slowpath will decrement the reader count, if necessary
+ */
+ queue_read_lock_slowpath(lock);
+}
+
+/**
+ * queue_write_lock - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_lock(struct qrwlock *lock)
+{
+ /*
+ * Optimize for the unfair lock case where the fair flag is 0.
+ */
+ if (cmpxchg(&lock->cnts.rwc, 0, _QW_LOCKED) == 0)
+ return;
+ queue_write_lock_slowpath(lock);
+}
+
+/**
+ * queue_read_unlock - release read lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_read_unlock(struct qrwlock *lock)
+{
+ /*
+ * Atomically decrement the reader count
+ */
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+}
+
+/**
+ * queue_write_unlock - release write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+static inline void queue_write_unlock(struct qrwlock *lock)
+{
+ /*
+ * Make sure that none of the critical section will be leaked out.
+ */
+ smp_mb__before_clear_bit();
+ ACCESS_ONCE(lock->cnts.writer) = 0;
+ smp_mb__after_clear_bit();
+}
+
+/*
+ * Initializer
+ */
+#define __ARCH_RW_LOCK_UNLOCKED { .cnts = { .rwc = 0 }, .waitq = NULL }
+
+/*
+ * Remapping rwlock architecture specific functions to the corresponding
+ * queue rwlock functions.
+ */
+#define arch_read_can_lock(l) queue_read_can_lock(l)
+#define arch_write_can_lock(l) queue_write_can_lock(l)
+#define arch_read_lock(l) queue_read_lock(l)
+#define arch_write_lock(l) queue_write_lock(l)
+#define arch_read_trylock(l) queue_read_trylock(l)
+#define arch_write_trylock(l) queue_write_trylock(l)
+#define arch_read_unlock(l) queue_read_unlock(l)
+#define arch_write_unlock(l) queue_write_unlock(l)
+
+#endif /* __ASM_GENERIC_QRWLOCK_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index d2b32ac..b665478 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -223,3 +223,10 @@ endif
config MUTEX_SPIN_ON_OWNER
def_bool y
depends on SMP && !DEBUG_MUTEXES
+
+config ARCH_QUEUE_RWLOCK
+ bool
+
+config QUEUE_RWLOCK
+ def_bool y if ARCH_QUEUE_RWLOCK
+ depends on SMP
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index baab8e5..3e7bab1 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
obj-$(CONFIG_PERCPU_RWSEM) += percpu-rwsem.o
+obj-$(CONFIG_QUEUE_RWLOCK) += qrwlock.o
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
new file mode 100644
index 0000000..ae45dae
--- /dev/null
+++ b/kernel/locking/qrwlock.c
@@ -0,0 +1,282 @@
+/*
+ * Queue read/write lock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <asm-generic/qrwlock.h>
+
+/*
+ * Compared with regular rwlock, the queue rwlock has the following
+ * advantages:
+ * 1. Even though there is a slight chance of stealing the lock if come at
+ * the right moment, the granting of the lock is mostly in FIFO order.
+ * 2. It is usually faster in high contention situation.
+ *
+ * The only downside is that the lock is 4 bytes larger in 32-bit systems
+ * and 12 bytes larger in 64-bit systems.
+ *
+ * There are two queues for writers. The writer field of the lock is a
+ * one-slot wait queue. The writers that follow will have to wait in the
+ * combined reader/writer queue (waitq).
+ *
+ * Compared with x86 ticket spinlock, the queue rwlock is faster in high
+ * contention situation. The writer lock is also faster in single thread
+ * operations. Therefore, queue rwlock can be considered as a replacement
+ * for those spinlocks that are highly contended as long as an increase
+ * in lock size is not an issue.
+ */
+
+#ifndef arch_mutex_cpu_relax
+# define arch_mutex_cpu_relax() cpu_relax()
+#endif
+
+#ifndef smp_load_acquire
+# ifdef CONFIG_X86
+# define smp_load_acquire(p) \
+ ({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ barrier(); \
+ ___p1; \
+ })
+# else
+# define smp_load_acquire(p) \
+ ({ \
+ typeof(*p) ___p1 = ACCESS_ONCE(*p); \
+ smp_mb(); \
+ ___p1; \
+ })
+# endif
+#endif
+
+#ifndef smp_store_release
+# ifdef CONFIG_X86
+# define smp_store_release(p, v) \
+ do { \
+ barrier(); \
+ ACCESS_ONCE(*p) = v; \
+ } while (0)
+# else
+# define smp_store_release(p, v) \
+ do { \
+ smp_mb(); \
+ ACCESS_ONCE(*p) = v; \
+ } while (0)
+# endif
+#endif
+
+/*
+ * If an xadd (exchange-add) macro isn't available, simulate one with
+ * the atomic_add_return() function.
+ */
+#ifdef xadd
+# define qrw_xadd(rw, inc) xadd(&(rw).rwc, inc)
+#else
+# define qrw_xadd(rw, inc) (u32)(atomic_add_return(inc, &(rw).rwa) - inc)
+#endif
+
+/**
+ * wait_in_queue - Add to queue and wait until it is at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to be added to the queue
+ */
+static inline void wait_in_queue(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *prev;
+
+ node->next = NULL;
+ node->wait = true;
+ prev = xchg(&lock->waitq, node);
+ if (prev) {
+ prev->next = node;
+ /*
+ * Wait until the waiting flag is off
+ */
+ while (smp_load_acquire(&node->wait))
+ arch_mutex_cpu_relax();
+ }
+}
+
+/**
+ * signal_next - Signal the next one in queue to be at the head
+ * @lock: Pointer to queue rwlock structure
+ * @node: Node pointer to the current head of queue
+ */
+static inline void signal_next(struct qrwlock *lock, struct qrwnode *node)
+{
+ struct qrwnode *next;
+
+ /*
+ * Try to notify the next node first without disturbing the cacheline
+ * of the lock. If that fails, check to see if it is the last node
+ * and so should clear the wait queue.
+ */
+ next = ACCESS_ONCE(node->next);
+ if (likely(next))
+ goto notify_next;
+
+ /*
+ * Clear the wait queue if it is the last node
+ */
+ if ((ACCESS_ONCE(lock->waitq) == node) &&
+ (cmpxchg(&lock->waitq, node, NULL) == node))
+ return;
+ /*
+ * Wait until the next one in queue set up the next field
+ */
+ while (likely(!(next = ACCESS_ONCE(node->next))))
+ arch_mutex_cpu_relax();
+ /*
+ * The next one in queue is now at the head
+ */
+notify_next:
+ smp_store_release(&next->wait, false);
+}
+
+/**
+ * rspin_until_writer_unlock - inc reader count & spin until writer is gone
+ * @lock : Pointer to queue rwlock structure
+ * @writer: Current queue rwlock writer status byte
+ *
+ * In interrupt context or at the head of the queue, the reader will just
+ * increment the reader count & wait until the writer releases the lock.
+ */
+static __always_inline void
+rspin_until_writer_unlock(struct qrwlock *lock, u8 writer)
+{
+ while (writer == _QW_LOCKED) {
+ arch_mutex_cpu_relax();
+ writer = smp_load_acquire(&lock->cnts.writer);
+ }
+}
+
+/**
+ * queue_read_lock_slowpath - acquire read lock of a queue rwlock
+ * @lock: Pointer to queue rwlock structure
+ */
+void queue_read_lock_slowpath(struct qrwlock *lock)
+{
+ struct qrwnode node;
+ union qrwcnts cnts;
+
+ /*
+ * Readers come here when they cannot get the lock without waiting
+ */
+ if (unlikely(irq_count())) {
+ /*
+ * Readers in interrupt context will spin until the lock is
+ * available without waiting in the queue.
+ */
+ cnts.rwc = smp_load_acquire(&lock->cnts.rwc);
+ rspin_until_writer_unlock(lock, cnts.writer);
+ return;
+ }
+ atomic_sub(_QR_BIAS, &lock->cnts.rwa);
+
+ /*
+ * Put the reader into the wait queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * At the head of the wait queue now, wait until the writer state
+ * goes to 0 and then try to increment the reader count and get
+ * the lock. It is possible that an incoming writer may steal the
+ * lock in the interim, so it is necessary to check the writer byte
+ * to make sure that the write lock isn't taken.
+ */
+ while (ACCESS_ONCE(lock->cnts.writer))
+ arch_mutex_cpu_relax();
+ cnts.rwc = qrw_xadd(lock->cnts, _QR_BIAS);
+ rspin_until_writer_unlock(lock, cnts.writer);
+
+ /*
+ * Signal the next one in queue to become queue head
+ */
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_read_lock_slowpath);
+
+/**
+ * qwrite_trylock - Try to acquire the write lock
+ * @lock : Pointer to queue rwlock structure
+ * @old : The current queue rwlock counts
+ * Return: 1 if lock acquired, 0 otherwise
+ */
+static __always_inline int
+qwrite_trylock(struct qrwlock *lock, u32 old)
+{
+ if (likely(cmpxchg(&lock->cnts.rwc, old, _QW_LOCKED) == old))
+ return 1;
+ return 0;
+}
+
+/**
+ * queue_write_3step_lock - acquire write lock in 3 steps
+ * @lock : Pointer to queue rwlock structure
+ *
+ * Step 1 - Try to acquire the lock directly if no reader is present
+ * Step 2 - Set the waiting flag to notify readers that a writer is waiting
+ * Step 3 - When the readers field goes to 0, set the locked flag
+ */
+static inline void queue_write_3step_lock(struct qrwlock *lock)
+{
+ u32 rwc;
+ u8 writer;
+
+ /* Step 1 */
+ if (!ACCESS_ONCE(lock->cnts.rwc) && qwrite_trylock(lock, 0))
+ return;
+
+ /* Step 2 */
+ writer = ACCESS_ONCE(lock->cnts.writer);
+ while (writer || (cmpxchg(&lock->cnts.writer, 0, _QW_WAITING) != 0)) {
+ arch_mutex_cpu_relax();
+ writer = ACCESS_ONCE(lock->cnts.writer);
+ }
+
+ /* Step 3 */
+ rwc = ACCESS_ONCE(lock->cnts.rwc);
+ while ((rwc > _QW_WAITING) || !qwrite_trylock(lock, _QW_WAITING)) {
+ arch_mutex_cpu_relax();
+ rwc = ACCESS_ONCE(lock->cnts.rwc);
+ }
+}
+
+/**
+ * queue_write_lock_slowpath - acquire write lock of a queue rwlock
+ * @lock : Pointer to queue rwlock structure
+ */
+void queue_write_lock_slowpath(struct qrwlock *lock)
+{
+ struct qrwnode node;
+
+ /*
+ * Put the writer into the wait queue
+ */
+ wait_in_queue(lock, &node);
+
+ /*
+ * At the head of the wait queue now, call queue_write_3step_lock()
+ * to acquire the lock until it is done.
+ */
+ queue_write_3step_lock(lock);
+ signal_next(lock, &node);
+}
+EXPORT_SYMBOL(queue_write_lock_slowpath);
--
1.7.1

2014-01-10 06:41:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 1/4] qrwlock: A queue read/write lock implementation

On Wed, Jan 08, 2014 at 11:59:33AM -0500, Waiman Long wrote:
> This patch introduces a new read/write lock implementation that puts
> waiting readers and writers into a queue instead of actively contending
> for the lock like the current read/write lock implementation. This
> improves performance in highly contended situations by reducing the
> cache line bouncing effect.

Reviewed-by: Paul E. McKenney <[email protected]>


2014-01-10 06:42:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 0/4] qrwlock: Introducing a queue read/write lock implementation

On Wed, Jan 08, 2014 at 11:59:32AM -0500, Waiman Long wrote:
> v7->v8:
> - Use atomic_t functions (which are implemented in all arch's) to
> modify reader counts.
> - Use smp_load_acquire() & smp_store_release() for barriers.
> - Further tuning in slowpath performance.

This version looks good to me. You now have my Reviewed-by on all
the patches.

Thanx, Paul


2014-01-10 15:34:54

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 0/4] qrwlock: Introducing a queue read/write lock implementation

On 01/10/2014 01:42 AM, Paul E. McKenney wrote:
> On Wed, Jan 08, 2014 at 11:59:32AM -0500, Waiman Long wrote:
>> v7->v8:
>> - Use atomic_t functions (which are implemented in all arch's) to
>> modify reader counts.
>> - Use smp_load_acquire()& smp_store_release() for barriers.
>> - Further tuning in slowpath performance.
> This version looks good to me. You now have my Reviewed-by on all
> the patches.
>
> Thanx, Paul

Thanks for your review!

-Longman

2014-01-13 03:26:19

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Thursday, 9 January 2014 01:10:03 UTC+8, Waiman Long wrote:
> This patch modifies the queue_write_unlock() function to use the
> new smp_store_release() function in another pending patch. It also
> removes the temporary implementation of smp_load_acquire() and
> smp_store_release() function in qrwlock.c.
>
> This patch should only be merged if PeterZ's linux-arch patch patch
> was merged.
>
> Signed-off-by: Waiman Long <[email protected]>
> Reviewed-by: Paul E. McKenney <[email protected]>
> ---
> include/asm-generic/qrwlock.h | 4 +---
> kernel/locking/qrwlock.c | 34 ----------------------------------
> 2 files changed, 1 insertions(+), 37 deletions(-)
>
> diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
> index 2b9a7b4..4d4bd04 100644
> --- a/include/asm-generic/qrwlock.h
> +++ b/include/asm-generic/qrwlock.h
> @@ -179,9 +179,7 @@ static inline void queue_write_unlock(struct qrwlock *lock)
> /*
> * Make sure that none of the critical section will be leaked out.
> */
> - smp_mb__before_clear_bit();
> - ACCESS_ONCE(lock->cnts.writer) = 0;
> - smp_mb__after_clear_bit();
> + smp_store_release(&lock->cnts.writer, 0)

This will fail compilation, so probably needs further testing with
Peter's load_acquire/store_release barrier patches.

Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale

2014-01-13 03:50:10

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Mon, Jan 13, 2014 at 10:47:36AM +0800, Daniel J Blueman wrote:
> On Thursday, 9 January 2014 01:10:03 UTC+8, Waiman Long wrote:
> > This patch modifies the queue_write_unlock() function to use the
> > new smp_store_release() function in another pending patch. It also
> > removes the temporary implementation of smp_load_acquire() and
> > smp_store_release() function in qrwlock.c.
> >
> > This patch should only be merged if PeterZ's linux-arch patch patch
> > was merged.
> >
> > Signed-off-by: Waiman Long <[email protected]>
> > Reviewed-by: Paul E. McKenney <[email protected]>
> > ---
> > include/asm-generic/qrwlock.h | 4 +---
> > kernel/locking/qrwlock.c | 34 ----------------------------------
> > 2 files changed, 1 insertions(+), 37 deletions(-)
> >
> > diff --git a/include/asm-generic/qrwlock.h
> b/include/asm-generic/qrwlock.h
> > index 2b9a7b4..4d4bd04 100644
> > --- a/include/asm-generic/qrwlock.h
> > +++ b/include/asm-generic/qrwlock.h
> > @@ -179,9 +179,7 @@ static inline void queue_write_unlock(struct
> qrwlock *lock)
> > /*
> > * Make sure that none of the critical section will be leaked out.
> > */
> > - smp_mb__before_clear_bit();
> > - ACCESS_ONCE(lock->cnts.writer) = 0;
> > - smp_mb__after_clear_bit();
> > + smp_store_release(&lock->cnts.writer, 0)
>
> This will fail compilation, so probably needs further testing with
> Peter's load_acquire/store_release barrier patches.

Peter's patch just hit -tip, so this should be easy to do. In Waiman's
defense, he does call attention to this in the commit log.

Thanx, Paul

2014-01-13 16:41:43

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/12/2014 09:47 PM, Daniel J Blueman wrote:
> On Thursday, 9 January 2014 01:10:03 UTC+8, Waiman Long wrote:
> > This patch modifies the queue_write_unlock() function to use the
> > new smp_store_release() function in another pending patch. It also
> > removes the temporary implementation of smp_load_acquire() and
> > smp_store_release() function in qrwlock.c.
> >
> > This patch should only be merged if PeterZ's linux-arch patch patch
> > was merged.
> >
> > Signed-off-by: Waiman Long <[email protected]>
> > Reviewed-by: Paul E. McKenney <[email protected]>
> > ---
> > include/asm-generic/qrwlock.h | 4 +---
> > kernel/locking/qrwlock.c | 34
> ----------------------------------
> > 2 files changed, 1 insertions(+), 37 deletions(-)
> >
> > diff --git a/include/asm-generic/qrwlock.h
> b/include/asm-generic/qrwlock.h
> > index 2b9a7b4..4d4bd04 100644
> > --- a/include/asm-generic/qrwlock.h
> > +++ b/include/asm-generic/qrwlock.h
> > @@ -179,9 +179,7 @@ static inline void queue_write_unlock(struct
> qrwlock *lock)
> > /*
> > * Make sure that none of the critical section will be leaked out.
> > */
> > - smp_mb__before_clear_bit();
> > - ACCESS_ONCE(lock->cnts.writer) = 0;
> > - smp_mb__after_clear_bit();
> > + smp_store_release(&lock->cnts.writer, 0)
>
> This will fail compilation, so probably needs further testing with
> Peter's load_acquire/store_release barrier patches.
>

Peter,

I found out that the build failure was caused by the fact that the
__native_word() macro (used internally by compiletime_assert_atomic())
allows only a size of 4 or 8 for x86-64. The data type that I used is a
byte. Is there a reason why byte and short are not considered native?

-Longman

2014-01-14 02:28:41

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 14/01/2014 00:41, Waiman Long wrote:
> On 01/12/2014 09:47 PM, Daniel J Blueman wrote:
>> On Thursday, 9 January 2014 01:10:03 UTC+8, Waiman Long wrote:
>> > This patch modifies the queue_write_unlock() function to use the
>> > new smp_store_release() function in another pending patch. It also
>> > removes the temporary implementation of smp_load_acquire() and
>> > smp_store_release() function in qrwlock.c.
>> >
>> > This patch should only be merged if PeterZ's linux-arch patch patch
>> > was merged.
>> >
>> > Signed-off-by: Waiman Long <[email protected]>
>> > Reviewed-by: Paul E. McKenney <[email protected]>
>> > ---
>> > include/asm-generic/qrwlock.h | 4 +---
>> > kernel/locking/qrwlock.c | 34
>> ----------------------------------
>> > 2 files changed, 1 insertions(+), 37 deletions(-)
>> >
>> > diff --git a/include/asm-generic/qrwlock.h
>> b/include/asm-generic/qrwlock.h
>> > index 2b9a7b4..4d4bd04 100644
>> > --- a/include/asm-generic/qrwlock.h
>> > +++ b/include/asm-generic/qrwlock.h
>> > @@ -179,9 +179,7 @@ static inline void queue_write_unlock(struct
>> qrwlock *lock)
>> > /*
>> > * Make sure that none of the critical section will be leaked out.
>> > */
>> > - smp_mb__before_clear_bit();
>> > - ACCESS_ONCE(lock->cnts.writer) = 0;
>> > - smp_mb__after_clear_bit();
>> > + smp_store_release(&lock->cnts.writer, 0)
>>
>> This will fail compilation, so probably needs further testing with
>> Peter's load_acquire/store_release barrier patches.
>>
>
> Peter,
>
> I found out that the build failure was caused by the fact that the
> __native_word() macro (used internally by compiletime_assert_atomic())
> allows only a size of 4 or 8 for x86-64. The data type that I used is a
> byte. Is there a reason why byte and short are not considered native?

It seems likely it was implemented like that since there was no existing
need; long can be relied on as the largest native type, so this should
suffice, and it works here:

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index fe7a686..dac91d7 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,8 +299,8 @@ void ftrace_likely_update(struct ftrace_branch_data *f, int val, int expect);
 #endif
 
 /* Is this type a native word size -- useful for atomic operations */
-#ifndef __native_word
-# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+#ifndef __native_type
+# define __native_type(t) (sizeof(t) <= sizeof(long))
 #endif
 
 /* Compile time object size, -1 for unknown */
@@ -343,8 +343,8 @@ void ftrace_likely_update(struct ftrace_branch_data *f, int val, int expect);
 	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
 
 #define compiletime_assert_atomic_type(t) \
-	compiletime_assert(__native_word(t), \
-		"Need native word sized stores/loads for atomicity.")
+	compiletime_assert(__native_type(t), \
+		"Need native sized stores/loads for atomicity.")
 
 /*
  * Prevent the compiler from merging or refetching accesses. The compiler

Signed-off-by: Daniel J Blueman <[email protected]>
--
Daniel J Blueman
Principal Software Engineer, Numascale

2014-01-14 11:03:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
> >Peter,
> >
> >I found out that the build failure was caused by the fact that the
> >__native_word() macro (used internally by compiletime_assert_atomic())
> >allows only a size of 4 or 8 for x86-64. The data type that I used is a
> >byte. Is there a reason why byte and short are not considered native?
>
> It seems likely it was implemented like that since there was no existing
> need; long can be relied on as the largest native type, so this should
> suffice and works here:

There are Alphas that cannot actually atomically address a byte; I do not
know if Linux cares about them, but if it does, we cannot in fact rely
on this in generic primitives like this.

2014-01-14 15:25:40

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/14/2014 06:03 AM, Peter Zijlstra wrote:
> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
>>> Peter,
>>>
>>> I found out that the build failure was caused by the fact that the
>>> __native_word() macro (used internally by compiletime_assert_atomic())
>>> allows only a size of 4 or 8 for x86-64. The data type that I used is a
>>> byte. Is there a reason why byte and short are not considered native?
>> It seems likely it was implemented like that since there was no existing
>> need; long can be relied on as the largest native type, so this should
>> suffice and works here:
> There's Alphas that cannot actually atomically adres a byte; I do not
> konw if Linux cares about them, but if it does, we cannot in fact rely
> on this in generic primitives like this.

Thanks for the explanation.

Can we allow architectural override of __native_word() macro? Like

#ifdef __arch_native_word
#define __native_word(t) __arch_native_word(t)
#else
#define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
#endif

In this way, we can allow x86 to support byte-based atomic type while
restricting the generic macro to int and long only. I will also modify
the code to use cmpxchg() when byte is not an atomic type.

-Longman
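For illustration, an architecture with byte-granular atomic stores could then opt in with something like the following (hypothetical placement and definition, not from any posted patch):

/* e.g. in an x86 header (hypothetical): byte and short stores are atomic. */
#define __arch_native_word(t)						\
	(sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) ||	\
	 sizeof(t) == sizeof(int)  || sizeof(t) == sizeof(long))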

2014-01-14 17:09:05

by Matt Turner

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 14, 2014 at 3:03 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
>> >Peter,
>> >
>> >I found out that the build failure was caused by the fact that the
>> >__native_word() macro (used internally by compiletime_assert_atomic())
>> >allows only a size of 4 or 8 for x86-64. The data type that I used is a
>> >byte. Is there a reason why byte and short are not considered native?
>>
>> It seems likely it was implemented like that since there was no existing
>> need; long can be relied on as the largest native type, so this should
>> suffice and works here:
>
> There's Alphas that cannot actually atomically adres a byte; I do not
> konw if Linux cares about them, but if it does, we cannot in fact rely
> on this in generic primitives like this.

That's right, and thanks for the heads-up. Alpha can only address 4
and 8 bytes atomically. (LDL_L, LDQ_L, STL_C, STQ_C).

The Byte-Word extension in EV56 doesn't add new atomics, so in fact no
Alphas can address < 4 bytes atomically.

2014-01-14 18:01:14

by Richard Henderson

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/14/2014 09:08 AM, Matt Turner wrote:
> On Tue, Jan 14, 2014 at 3:03 AM, Peter Zijlstra <[email protected]> wrote:
>> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
>>>> Peter,
>>>>
>>>> I found out that the build failure was caused by the fact that the
>>>> __native_word() macro (used internally by compiletime_assert_atomic())
>>>> allows only a size of 4 or 8 for x86-64. The data type that I used is a
>>>> byte. Is there a reason why byte and short are not considered native?
>>>
>>> It seems likely it was implemented like that since there was no existing
>>> need; long can be relied on as the largest native type, so this should
>>> suffice and works here:
>>
>> There's Alphas that cannot actually atomically adres a byte; I do not
>> konw if Linux cares about them, but if it does, we cannot in fact rely
>> on this in generic primitives like this.
>
> That's right, and thanks for the heads-up. Alpha can only address 4
> and 8 bytes atomically. (LDL_L, LDQ_L, STL_C, STQ_C).
>
> The Byte-Word extension in EV56 doesn't add new atomics, so in fact no
> Alphas can address < 4 bytes atomically.
>

Emulated with aligned 4 byte atomics, and masking. The same is true for arm,
ppc, mips which, depending on cpu, also lack < 4 byte atomics.


r~
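To make the masking approach concrete, an atomic byte store can be emulated with a compare-and-swap loop on the aligned 32-bit word that contains the byte, roughly like this (an illustrative sketch, not code from any of the patches; the helper name and the little-endian byte-offset math are assumptions):

/* Emulate an atomic store to byte 'byte' of the aligned word at 'word'. */
static inline void emulated_atomic_byte_store(u32 *word, int byte, u8 val)
{
	u32 shift = byte * 8;
	u32 mask  = 0xffU << shift;
	u32 old, new;

	do {
		old = ACCESS_ONCE(*word);
		new = (old & ~mask) | ((u32)val << shift);
	} while (cmpxchg(word, old, new) != old);
}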

2014-01-14 19:09:39

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/14/2014 01:01 PM, Richard Henderson wrote:
> On 01/14/2014 09:08 AM, Matt Turner wrote:
>> On Tue, Jan 14, 2014 at 3:03 AM, Peter Zijlstra<[email protected]> wrote:
>>> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
>>>>> Peter,
>>>>>
>>>>> I found out that the build failure was caused by the fact that the
>>>>> __native_word() macro (used internally by compiletime_assert_atomic())
>>>>> allows only a size of 4 or 8 for x86-64. The data type that I used is a
>>>>> byte. Is there a reason why byte and short are not considered native?
>>>> It seems likely it was implemented like that since there was no existing
>>>> need; long can be relied on as the largest native type, so this should
>>>> suffice and works here:
>>> There's Alphas that cannot actually atomically adres a byte; I do not
>>> konw if Linux cares about them, but if it does, we cannot in fact rely
>>> on this in generic primitives like this.
>> That's right, and thanks for the heads-up. Alpha can only address 4
>> and 8 bytes atomically. (LDL_L, LDQ_L, STL_C, STQ_C).
>>
>> The Byte-Word extension in EV56 doesn't add new atomics, so in fact no
>> Alphas can address< 4 bytes atomically.
>>
> Emulated with aligned 4 byte atomics, and masking. The same is true for arm,
> ppc, mips which, depending on cpu, also lack< 4 byte atomics.
>

I would like to know whether the action of writing out a byte (e.g. *byte =
0) is atomic on those architectures or is emulated by a
compiler-generated software read-modify-write.

-Longman

2014-01-14 20:20:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 14, 2014 at 02:09:30PM -0500, Waiman Long wrote:
> I would like to know if the action of writing out a byte (e.g. *byte = 0) is
> atomic in those architectures or is emulated by a compiler-generated
> software read-modify-write.

So on Alpha pre ev56 something like:

*(volatile u8 *)foo = 0;

_Should_ cause a compile error as the hardware has to do a rmw which is
not compatible with the requirements for volatile -- that said I do not
know if a compiler will actually generate this error.

I can well imagine other load-store archs suffering similar problems,
although I'm not aware of any.
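For reference, without byte stores the compiler lowers a plain byte assignment to a non-atomic read-modify-write of the containing word, roughly equivalent to the following sketch (schematic only; the function name is made up, and concurrent updates to the neighbouring bytes can be lost):

/* What "*(u8 *)p = val" degenerates to on a CPU without byte stores. */
static inline void plain_byte_store_rmw(unsigned long *word, int byte, u8 val)
{
	unsigned long shift = byte * 8;
	unsigned long mask  = 0xffUL << shift;
	unsigned long tmp;

	tmp = *word;			/* plain load               */
	tmp = (tmp & ~mask) | ((unsigned long)val << shift);
	*word = tmp;			/* plain, non-atomic store  */
}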

2014-01-14 23:44:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 14, 2014 at 10:01:04AM -0800, Richard Henderson wrote:
> On 01/14/2014 09:08 AM, Matt Turner wrote:
> > On Tue, Jan 14, 2014 at 3:03 AM, Peter Zijlstra <[email protected]> wrote:
> >> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
> >>>> Peter,
> >>>>
> >>>> I found out that the build failure was caused by the fact that the
> >>>> __native_word() macro (used internally by compiletime_assert_atomic())
> >>>> allows only a size of 4 or 8 for x86-64. The data type that I used is a
> >>>> byte. Is there a reason why byte and short are not considered native?
> >>>
> >>> It seems likely it was implemented like that since there was no existing
> >>> need; long can be relied on as the largest native type, so this should
> >>> suffice and works here:
> >>
> >> There's Alphas that cannot actually atomically adres a byte; I do not
> >> konw if Linux cares about them, but if it does, we cannot in fact rely
> >> on this in generic primitives like this.
> >
> > That's right, and thanks for the heads-up. Alpha can only address 4
> > and 8 bytes atomically. (LDL_L, LDQ_L, STL_C, STQ_C).
> >
> > The Byte-Word extension in EV56 doesn't add new atomics, so in fact no
> > Alphas can address < 4 bytes atomically.
>
> Emulated with aligned 4 byte atomics, and masking. The same is true for arm,
> ppc, mips which, depending on cpu, also lack < 4 byte atomics.

Which means that Alpha should be able to similarly emulate 1-byte and
2-byte atomics, correct?

Thanx, Paul

2014-01-15 00:25:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Wed, Jan 15, 2014 at 6:44 AM, Paul E. McKenney
<[email protected]> wrote:
>
> Which means that Alpha should be able to similarly emulate 1-byte and
> 2-byte atomics, correct?

Not reasonably, no.

The ldl/stc implementation on early alpha was so broken as to be
unusable. It's not actually done in the cache, it WENT OUT ON THE BUS.
We're talking 70's style "external lock signal" kind of things like
the 8086 did for locked cycles before the advent of caches, the kind
that nobody sane has done for a long long time.

So realistically, you absolutely do not want to use those things to
emulate atomic byte/word accesses. The whole point of "load_acquire()"
and "store_release()" is that it's supposed to be cheaper than a
locked access, and can be done with just a barrier instruction or a
special instruction flag.

If you just want to do a store release, on alpha you'd want to
implement that as a full memory barrier followed by a store. It
doesn't get the advantage of a real release consistency model, but at
least it's not doing an external bus access. But you can only do that
store as a 4-byte or 8-byte store on the older alphas (byte and word
stores work on newer ones).

Of course, it's entirely possible that nobody cares..

Linus
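A minimal sketch of the fallback Linus describes for an int- or long-sized field, i.e. a full barrier followed by a plain store (the macro name is made up; the real definitions live in Peter's barrier patches):

#define word_store_release(p, v)					\
do {									\
	smp_mb();	/* order the critical section before the store */ \
	ACCESS_ONCE(*(p)) = (v);					\
} while (0)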

2014-01-15 02:40:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Wed, Jan 15, 2014 at 07:25:04AM +0700, Linus Torvalds wrote:
> On Wed, Jan 15, 2014 at 6:44 AM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > Which means that Alpha should be able to similarly emulate 1-byte and
> > 2-byte atomics, correct?
>
> Not reasonably, no.
>
> The ldl/stc implementation on early alpha was so broken as to be
> unusable. It's not actually done in the cache, it WENT OUT ON THE BUS.
> We're talking 70's style "external lock signal" kind of things like
> the 8086 did for locked cycles before the advent of caches, the kind
> that nobody sane has done for a long long time.
>
> So realistically, you absolutely do not want to use those things to
> emulate atomic byte/word accesses. The whole point of "load_acquire()"
> and "store_release()" is that it's supposed to be cheaper than a
> locked access, and can be done with just a barrier instruction or a
> special instruction flag.
>
> If you just want to do a store release, on alpha you'd want to
> implement that as a full memory barrier followed by a store. It
> doesn't get the advantage of a real release consistency model, but at
> least it's not doing an external bus access. But you can only do that
> store as a 4-byte or 8-byte store.on the older alphas (byte and word
> stores work on newer ones).
>
> Of course, it's entirely possible that nobody cares..

That would be my hope. ;-)

If nobody cares about Alpha period, it is easy. However, the last time
that I tried that approach, they sent me a URL of a wiki showing Alpha
systems still running mainline. But a slow-but-working approach for
Alpha does seem reasonable, even for those still running Linux on Alpha.

Thanx, Paul

2014-01-15 03:08:30

by Daniel J Blueman

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/15/2014 07:44 AM, Paul E. McKenney wrote:
> On Tue, Jan 14, 2014 at 10:01:04AM -0800, Richard Henderson wrote:
>> On 01/14/2014 09:08 AM, Matt Turner wrote:
>>> On Tue, Jan 14, 2014 at 3:03 AM, Peter Zijlstra <[email protected]> wrote:
>>>> On Tue, Jan 14, 2014 at 10:28:23AM +0800, Daniel J Blueman wrote:
>>>>>> Peter,
>>>>>>
>>>>>> I found out that the build failure was caused by the fact that the
>>>>>> __native_word() macro (used internally by compiletime_assert_atomic())
>>>>>> allows only a size of 4 or 8 for x86-64. The data type that I used is a
>>>>>> byte. Is there a reason why byte and short are not considered native?
>>>>>
>>>>> It seems likely it was implemented like that since there was no existing
>>>>> need; long can be relied on as the largest native type, so this should
>>>>> suffice and works here:
>>>>
>>>> There's Alphas that cannot actually atomically adres a byte; I do not
>>>> konw if Linux cares about them, but if it does, we cannot in fact rely
>>>> on this in generic primitives like this.
>>>
>>> That's right, and thanks for the heads-up. Alpha can only address 4
>>> and 8 bytes atomically. (LDL_L, LDQ_L, STL_C, STQ_C).
>>>
>>> The Byte-Word extension in EV56 doesn't add new atomics, so in fact no
>>> Alphas can address < 4 bytes atomically.
>>
>> Emulated with aligned 4 byte atomics, and masking. The same is true for arm,
>> ppc, mips which, depending on cpu, also lack < 4 byte atomics.
>
> Which means that Alpha should be able to similarly emulate 1-byte and
> 2-byte atomics, correct?

If it's not possible to guarantee that GCC emits the 4-byte atomics by
using a union, then we could emit the instructions via assembly. We'd
introduce a macro to ensure lock word alignment, and this would be safe
for unsigned counting up to the packed type limit. Maybe that's just too
over-constrained though.

Thanks,
Daniel
--
Daniel J Blueman
Principal Software Engineer, Numascale

2014-01-15 08:08:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 14, 2014 at 06:39:58PM -0800, Paul E. McKenney wrote:
> > If you just want to do a store release, on alpha you'd want to
> > implement that as a full memory barrier followed by a store. It
> > doesn't get the advantage of a real release consistency model, but at
> > least it's not doing an external bus access. But you can only do that
> > store as a 4-byte or 8-byte store.on the older alphas (byte and word
> > stores work on newer ones).
> >
> > Of course, it's entirely possible that nobody cares..
>
> That would be my hope. ;-)
>
> If nobody cares about Alpha period, it is easy. However, the last time
> that I tried that approach, they sent me a URL of a wiki showing Alpha
> systems still running mainline. But a slow-but-working approach for
> Alpha does seem reasonable, even for those still running Linux on Alpha.

Well, if they're all EV56 or later we're still good, as they can actually
do what we need.

But I don't think an ll/sc implementation of the store_release can even
work, because in that case all users of the other bytes also need an
ll/sc around them -- and how are we to know about them?

So the only real way to allow store_release on 8/16-bit values is by
removing all Alpha support _pre_ EV56 :/

2014-01-15 20:53:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Wed, Jan 15, 2014 at 09:07:53AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 14, 2014 at 06:39:58PM -0800, Paul E. McKenney wrote:
> > > If you just want to do a store release, on alpha you'd want to
> > > implement that as a full memory barrier followed by a store. It
> > > doesn't get the advantage of a real release consistency model, but at
> > > least it's not doing an external bus access. But you can only do that
> > > store as a 4-byte or 8-byte store.on the older alphas (byte and word
> > > stores work on newer ones).
> > >
> > > Of course, it's entirely possible that nobody cares..
> >
> > That would be my hope. ;-)
> >
> > If nobody cares about Alpha period, it is easy. However, the last time
> > that I tried that approach, they sent me a URL of a wiki showing Alpha
> > systems still running mainline. But a slow-but-working approach for
> > Alpha does seem reasonable, even for those still running Linux on Alpha.
>
> Well, if they're all EV56 or later we're still good as they can actually
> do what we need.
>
> But I don't think a ll/sc implementation of the store_release can even
> work, because in that case all users of the other bytes also need a
> ll/sc around them but how are we to know about them?
>
> So the only real way to allow store_release on 8/16 bit values is by
> removing all Alpha support _pre_ EV56 :/

Fair point... We could demand Alpha-specific alignment, but that would
get really ugly really quickly.

But we did drop support for SMP i386 quite some time ago, so perhaps
it is time to drop support for SMP Alpha pre-EV56.

Thanx, Paul

2014-01-15 23:22:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Wed, Jan 15, 2014 at 12:53:46PM -0800, Paul E. McKenney wrote:
> But we did drop support for SMP i386 quite some time ago, so perhaps
> it is time to drop support for SMP Alpha pre-EV56.

So while the primitive is called smp_store_release(), the !SMP variant
still does:

	*(volatile __type *)ptr = val;

which should not compile on any Alpha pre EV56, SMP or no, for __type ==
u8.

2014-01-16 10:37:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Thu, Jan 16, 2014 at 06:39:23AM +0700, Linus Torvalds wrote:
> On Jan 16, 2014 6:22 AM, "Peter Zijlstra" <[email protected]> wrote:
> >
> > So while the primitive is called smp_store_release() the !SMP variant
> > still does:
> >
> > *(volatile __type *) = ptr;
> >
> > which should not compile on any Alpha pre EV56, SMP or no for __type ==
> > u8.
>
> I'm not sure where you get that "should not compile" theory from.
>
> I'm pretty sure it will compile just fine. It will just generate the same
> standard read-modify-write sequence (and not using the ldl/stc sequence
> either). Do you have any actual reason to believe it won't, apart from your
> theoretical wishes of how the world should work?

No, I earlier even said it probably would compile. My usage of 'should'
comes from how we've 'defined' volatile/ACCESS_ONCE in
Documentation/memory-barriers.txt. According to those constraints the
rmw cycle is not proper code.

2014-01-18 10:01:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Thu, Jan 16, 2014 at 11:36:59AM +0100, Peter Zijlstra wrote:
> On Thu, Jan 16, 2014 at 06:39:23AM +0700, Linus Torvalds wrote:
> > On Jan 16, 2014 6:22 AM, "Peter Zijlstra" <[email protected]> wrote:
> > >
> > > So while the primitive is called smp_store_release() the !SMP variant
> > > still does:
> > >
> > > *(volatile __type *) = ptr;
> > >
> > > which should not compile on any Alpha pre EV56, SMP or no for __type ==
> > > u8.
> >
> > I'm not sure where you get that "should not compile" theory from.
> >
> > I'm pretty sure it will compile just fine. It will just generate the same
> > standard read-modify-write sequence (and not using the ldl/stc sequence
> > either). Do you have any actual reason to believe it won't, apart from your
> > theoretical wishes of how the world should work?
>
> No, I earlier even said it probably would compile. My usage of 'should'
> comes from how we've 'defined' volatile/ACCESS_ONCE in
> Documentation/memory-barriers.txt. According to those constraints the
> rmw cycle is not proper code.

OK, I will bite... Aside from fine-grained code timing, what code could
you write to tell the difference between a real one-byte store and an
RMW emulating that store?

Thanx, Paul

2014-01-18 11:34:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 02:01:05AM -0800, Paul E. McKenney wrote:
> OK, I will bite... Aside from fine-grained code timing, what code could
> you write to tell the difference between a real one-byte store and an
> RMW emulating that store?

Why isn't fine-grained code timing an issue? I'm sure Alpha people will
love it when their machine magically keels over every so often.

Suppose we have two bytes in a word that get concurrent updates:

	union {
		struct {
			u8 a;
			u8 b;
		};
		int word;
	} ponies = { .word = 0, };

then two threads concurrently do:

	CPU0:			CPU1:

	ponies.a = 5		ponies.b = 10


At which point you'd expect: a == 5 && b == 10

However, with a rmw you could end up like:


	CPU0			CPU1

	load r, ponies.word
				load r, ponies.word
	and r, ~0xFF
	or r, 5
	store ponies.word, r
				and r, ~0xFF00
				or r, 10 << 8
				store ponies.word, r

which gives: a == 0 && b == 10

The same can be had on a single CPU if you make the second RMW an
interrupt.


In fact, we recently had such a RMW issue on PPC64 although from a
slightly different angle, but we managed to hit it quite consistently.
See commit ba1f14fbe7096.

The thing is, if we allow the above RMW 'atomic' store, we have to be
_very_ careful that there cannot be such overlapping stores, otherwise
things will go BOOM!

However, if we already have to make sure there's no overlapping stores,
we might as well write a wide store and not allow the narrow stores to
begin with, to force people to think about the issue.

2014-01-18 12:26:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 12:34:06PM +0100, Peter Zijlstra wrote:
> On Sat, Jan 18, 2014 at 02:01:05AM -0800, Paul E. McKenney wrote:
> > OK, I will bite... Aside from fine-grained code timing, what code could
> > you write to tell the difference between a real one-byte store and an
> > RMW emulating that store?
>
> Why isn't fine-grained code timing an issue? I'm sure Alpha people will
> love it when their machine magically keels over every so often.
>
> Suppose we have two bytes in a word that get concurrent updates:
>
> union {
> struct {
> u8 a;
> u8 b;
> };
> int word;
> } ponies = { .word = 0, };
>
> then two threads concurrently do:
>
> CPU0: CPU1:
>
> ponies.a = 5 ponies.b = 10
>
>
> At which point you'd expect: a == 5 && b == 10
>
> However, with a rmw you could end up like:
>
>
> load r, ponies.word
> load r, ponies.word
> and r, ~0xFF
> or r, 5
> store ponies.word, r
> and r, ~0xFF00
> or r, 10 << 8
> store ponies.word, r
>
> which gives: a == 0 && b == 10
>
> The same can be had on a single CPU if you make the second RMW an
> interrupt.
>
>
> In fact, we recently had such a RMW issue on PPC64 although from a
> slightly different angle, but we managed to hit it quite consistently.
> See commit ba1f14fbe7096.
>
> The thing is, if we allow the above RMW 'atomic' store, we have to be
> _very_ careful that there cannot be such overlapping stores, otherwise
> things will go BOOM!
>
> However, if we already have to make sure there's no overlapping stores,
> we might as well write a wide store and not allow the narrow stores to
> begin with, to force people to think about the issue.

Ah, I was assuming atomic rmw, which for Alpha would be implemented using
the LL and SC instructions. Yes, lots of overhead, but if the CPU
designers chose not to provide a load/store byte...

Thanx, Paul

2014-01-18 12:42:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 04:25:48AM -0800, Paul E. McKenney wrote:
> On Sat, Jan 18, 2014 at 12:34:06PM +0100, Peter Zijlstra wrote:
> > On Sat, Jan 18, 2014 at 02:01:05AM -0800, Paul E. McKenney wrote:
> > > OK, I will bite... Aside from fine-grained code timing, what code could
> > > you write to tell the difference between a real one-byte store and an
> > > RMW emulating that store?
> >
> > Why isn't fine-grained code timing an issue? I'm sure Alpha people will
> > love it when their machine magically keels over every so often.
> >
> > Suppose we have two bytes in a word that get concurrent updates:
> >
> > union {
> > struct {
> > u8 a;
> > u8 b;
> > };
> > int word;
> > } ponies = { .word = 0, };
> >
> > then two threads concurrently do:
> >
> > CPU0: CPU1:
> >
> > ponies.a = 5 ponies.b = 10
> >
> >
> > At which point you'd expect: a == 5 && b == 10
> >
> > However, with a rmw you could end up like:
> >
> >
> > load r, ponies.word
> > load r, ponies.word
> > and r, ~0xFF
> > or r, 5
> > store ponies.word, r
> > and r, ~0xFF00
> > or r, 10 << 8
> > store ponies.word, r
> >
> > which gives: a == 0 && b == 10
> >
> > The same can be had on a single CPU if you make the second RMW an
> > interrupt.
> >
> >
> > In fact, we recently had such a RMW issue on PPC64 although from a
> > slightly different angle, but we managed to hit it quite consistently.
> > See commit ba1f14fbe7096.
> >
> > The thing is, if we allow the above RMW 'atomic' store, we have to be
> > _very_ careful that there cannot be such overlapping stores, otherwise
> > things will go BOOM!
> >
> > However, if we already have to make sure there's no overlapping stores,
> > we might as well write a wide store and not allow the narrow stores to
> > begin with, to force people to think about the issue.
>
> Ah, I was assuming atomic rmw, which for Alpha would be implemented using
> the LL and SC instructions. Yes, lots of overhead, but if the CPU
> designers chose not to provide a load/store byte...

I don't see how ll/sc will help any. Suppose we do the a store as
smp_store_release() using ll/sc but the b store is unaware and doesn't
do an ll/sc.

Then we're still up shit creek without no paddle.

Whatever you're going to do, you need to be intimately aware of what the
other bits in your word are doing.

2014-01-18 21:22:39

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 01:41:36PM +0100, Peter Zijlstra wrote:
> On Sat, Jan 18, 2014 at 04:25:48AM -0800, Paul E. McKenney wrote:
> > On Sat, Jan 18, 2014 at 12:34:06PM +0100, Peter Zijlstra wrote:
> > > On Sat, Jan 18, 2014 at 02:01:05AM -0800, Paul E. McKenney wrote:
> > > > OK, I will bite... Aside from fine-grained code timing, what code could
> > > > you write to tell the difference between a real one-byte store and an
> > > > RMW emulating that store?
> > >
> > > Why isn't fine-grained code timing an issue? I'm sure Alpha people will
> > > love it when their machine magically keels over every so often.
> > >
> > > Suppose we have two bytes in a word that get concurrent updates:
> > >
> > > union {
> > > struct {
> > > u8 a;
> > > u8 b;
> > > };
> > > int word;
> > > } ponies = { .word = 0, };
> > >
> > > then two threads concurrently do:
> > >
> > > CPU0: CPU1:
> > >
> > > ponies.a = 5 ponies.b = 10
> > >
> > >
> > > At which point you'd expect: a == 5 && b == 10
> > >
> > > However, with a rmw you could end up like:
> > >
> > >
> > > load r, ponies.word
> > > load r, ponies.word
> > > and r, ~0xFF
> > > or r, 5
> > > store ponies.word, r
> > > and r, ~0xFF00
> > > or r, 10 << 8
> > > store ponies.word, r
> > >
> > > which gives: a == 0 && b == 10
> > >
> > > The same can be had on a single CPU if you make the second RMW an
> > > interrupt.
> > >
> > >
> > > In fact, we recently had such a RMW issue on PPC64 although from a
> > > slightly different angle, but we managed to hit it quite consistently.
> > > See commit ba1f14fbe7096.
> > >
> > > The thing is, if we allow the above RMW 'atomic' store, we have to be
> > > _very_ careful that there cannot be such overlapping stores, otherwise
> > > things will go BOOM!
> > >
> > > However, if we already have to make sure there's no overlapping stores,
> > > we might as well write a wide store and not allow the narrow stores to
> > > begin with, to force people to think about the issue.
> >
> > Ah, I was assuming atomic rmw, which for Alpha would be implemented using
> > the LL and SC instructions. Yes, lots of overhead, but if the CPU
> > designers chose not to provide a load/store byte...
>
> I don't see how ll/sc will help any. Suppose we do the a store as
> smp_store_release() using ll/sc but the b store is unaware and doesn't
> do an ll/sc.
>
> Then we're still up shit creek without no paddle.
>
> Whatever you're going to do, you need to be intimately aware of what the
> other bits in your word are doing.

Yes, this requires that -all- updates to the fields in the machine word
in question use atomic rmw. Which would not be pretty from a core-code
perspective. Hence my suggestion of ceasing Linux-kernel support for
DEC Alpha CPUs that don't support byte operations. Also need 16-bit
operations as well, of course...

Thanx, Paul

2014-01-19 00:57:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 1:22 PM, Paul E. McKenney
<[email protected]> wrote:
>
> Yes, this requires that -all- updates to the fields in the machine word
> in question use atomic rmw. Which would not be pretty from a core-code
> perspective. Hence my suggestion of ceasing Linux-kernel support for
> DEC Alpha CPUs that don't support byte operations. Also need 16-bit
> operations as well, of course...

I'm not seeing this.

Why the hell would you have byte- or halfword-sized versions of the
store_release or load_acquire things on alpha anyway?

What it means is that data structures that do locking or atomics need
to be "int" or "long" on alpha. That has always been true. What do
you claim has changed?

Linus

2014-01-19 08:04:17

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sat, Jan 18, 2014 at 04:57:05PM -0800, Linus Torvalds wrote:
> On Sat, Jan 18, 2014 at 1:22 PM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > Yes, this requires that -all- updates to the fields in the machine word
> > in question use atomic rmw. Which would not be pretty from a core-code
> > perspective. Hence my suggestion of ceasing Linux-kernel support for
> > DEC Alpha CPUs that don't support byte operations. Also need 16-bit
> > operations as well, of course...
>
> I'm not seeing this.
>
> Why the hell would you have byte- or halfword-sized versions of the
> store_release or load_acquire things on alpha anyway?
>
> What it means is that data structures that do locking or atomics need
> to be "int" or "long" on alpha. That has always been true. What do
> you claim has changed?

OK, another approach would be to never add "select ARCH_USE_QUEUE_RWLOCK"
on Alpha, at least if the queued rwlocks really do want to atomically
manipulate bytes. After all, the Alpha systems that I know about don't
have enough CPUs to make queued rwlocks necessary anyway.

Much simpler solution!

Is this what you were getting at, or am I missing your point?

Thanx, Paul

2014-01-19 19:56:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sun, Jan 19, 2014 at 12:04 AM, Paul E. McKenney
<[email protected]> wrote:
>
> OK, another approach would be to never add "select ARCH_USE_QUEUE_RWLOCK"
> on Alpha, at least if the queued rwlocks really do want to atomically
> manipulate bytes. After all, the Alpha systems that I know about don't
> have enough CPUs to make queued rwlocks necessary anyway.
>
> Much simpler solution!
>
> Is this what you were getting at, or am I missing your point?

You're missing something.

Just make the "writer" field be an "int" on little-endian architectures
(like alpha).

There is no reason for that field to be a "char" to begin with, as far
as I can tell, since the padding of the structure means that it
doesn't save any space. But even if that wasn't true, we could make an
arch-specific type for "minimum type for locking".

So my *point* was that it should be easy enough to just make sure that
any data structures used for locking have types that are appropriate
for that locking.

Linus

2014-01-20 00:52:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Sun, Jan 19, 2014 at 11:56:02AM -0800, Linus Torvalds wrote:
> On Sun, Jan 19, 2014 at 12:04 AM, Paul E. McKenney
> <[email protected]> wrote:
> >
> > OK, another approach would be to never add "select ARCH_USE_QUEUE_RWLOCK"
> > on Alpha, at least if the queued rwlocks really do want to atomically
> > manipulate bytes. After all, the Alpha systems that I know about don't
> > have enough CPUs to make queued rwlocks necessary anyway.
> >
> > Much simpler solution!
> >
> > Is this what you were getting at, or am I missing your point?
>
> You're missing something.
>
> Just make the "writer" field be an "int" on little-endian archiectures
> (like alpha).
>
> There is no reason for that field to be a "char" to begin with, as far
> as I can tell, since the padding of the structure means that it
> doesn't save any space. But even if that wasn't true, we could make an
> arch-specific type for "minimum type for locking".

On 64-bit systems (which includes Alpha), agreed, the field can be a
32-bit portion of a 64-bit structure that is then manipulated atomically.
Many 32-bit systems need the reader and writer counts to fit in 32 bits
in order to allow things like queue_read_trylock() to correctly account
for both readers and writers.

If there was a 32-bit system running Linux that did not support byte
accesses, there would be a problem, but I don't know of any such system.

> So my *point* was that it should be easy enough to just make sure that
> any data structures used for locking have types that are appropriate
> for that locking.

So something like the following for the qrwlock definition, with
appropriate C-preprocessor wrappers for the atomic-add accesses?

Thanx, Paul

------------------------------------------------------------------------

typedef struct qrwlock {
	union qrwcnts {
#ifdef CONFIG_64B
		struct {
			int writer;
			int reader;
		};
		atomic_long_t rwa;
		u64 rwc;
#else
		struct {
#ifdef __LITTLE_ENDIAN
			u8 writer;	/* Writer state */
#else
			u16 r16;	/* Reader count - msb */
			u8 r8;		/* Reader count - lsb */
			u8 writer;	/* Writer state */
#endif
		};
		atomic_t rwa;		/* Reader/writer atomic */
		u32 rwc;		/* Reader/writer counts */
#endif
	} cnts;
	struct mcs_spinlock *waitq;	/* Tail of waiting queue */
} arch_rwlock_t;
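One way the C-preprocessor wrappers Paul mentions could be spelled, so the slowpath code can use a single name for both layouts (a hypothetical sketch; the wrapper names and the CONFIG_64B symbol follow the sketch above):

/* Hypothetical wrappers: pick the atomic type matching the union layout. */
#ifdef CONFIG_64B
# define qrw_atomic_add(lock, val)	atomic_long_add((val), &(lock)->cnts.rwa)
# define qrw_atomic_sub(lock, val)	atomic_long_sub((val), &(lock)->cnts.rwa)
#else
# define qrw_atomic_add(lock, val)	atomic_add((val), &(lock)->cnts.rwa)
# define qrw_atomic_sub(lock, val)	atomic_sub((val), &(lock)->cnts.rwa)
#endif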

2014-01-21 15:02:14

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/19/2014 03:04 AM, Paul E. McKenney wrote:
> On Sat, Jan 18, 2014 at 04:57:05PM -0800, Linus Torvalds wrote:
>> On Sat, Jan 18, 2014 at 1:22 PM, Paul E. McKenney
>> <[email protected]> wrote:
>>> Yes, this requires that -all- updates to the fields in the machine word
>>> in question use atomic rmw. Which would not be pretty from a core-code
>>> perspective. Hence my suggestion of ceasing Linux-kernel support for
>>> DEC Alpha CPUs that don't support byte operations. Also need 16-bit
>>> operations as well, of course...
>> I'm not seeing this.
>>
>> Why the hell would you have byte- or halfword-sized versions of the
>> store_release or load_acquire things on alpha anyway?
>>
>> What it means is that data structures that do locking or atomics need
>> to be "int" or "long" on alpha. That has always been true. What do
>> you claim has changed?
> OK, another approach would be to never add "select ARCH_USE_QUEUE_RWLOCK"
> on Alpha, at least if the queued rwlocks really do want to atomically
> manipulate bytes. After all, the Alpha systems that I know about don't
> have enough CPUs to make queued rwlocks necessary anyway.
>
> Much simpler solution!
>
> Is this what you were getting at, or am I missing your point?
>
> Thanx, Paul
>

My latest v9 series of the qrwlock patch will automatically adapt to the
lack of atomic byte access by using an atomic integer instruction
instead. So the new series should work for pre-EV56 Alpha; it is just a
bit less efficient in this case.

-Longman

2014-01-21 15:41:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On Tue, Jan 21, 2014 at 10:02:06AM -0500, Waiman Long wrote:
> My latest v9 series of qrwlock patch will automatically adapt to the lack of
> atomic byte access by using an atomic integer instruction instead. So the
> new series should work for pre-EV56 Alpha, it is just a bit less efficient
> in this case.

See my other email; I don't think you can do that without also changing
the implementation of the queue_read_{try}lock() functions.

Without those changes you can have transient values in your 'read-count'
part of the word and a full word write will wreck things.

2014-01-21 16:16:09

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v8 4/4] qrwlock: Use smp_store_release() in write_unlock()

On 01/21/2014 10:41 AM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 10:02:06AM -0500, Waiman Long wrote:
>> My latest v9 series of qrwlock patch will automatically adapt to the lack of
>> atomic byte access by using an atomic integer instruction instead. So the
>> new series should work for pre-EV56 Alpha, it is just a bit less efficient
>> in this case.
> See my other email; I don't think you can do that without also changing
> the implementation of the queue_read_{try}lock() functions.
>
> Without those changes you can have transient values in your 'read-count'
> part of the word and a full word write will wreck things.

I don't see any problem with my current logic. If a writer has the write
lock, the writer byte has to have a value of 0xff. So atomically
subtracting 0xff from it will guarantee that the writer byte will become
zero, which is the same as assigning a zero value to that byte. The only
difference is that an atomic subtract instruction will need to be used
instead of a simple byte assignment.

Please let me know if there is any flaw in my logic.

-Longman
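For concreteness, the atomic-subtract unlock Waiman describes would look roughly like this with the v8 field names (a sketch of the idea and a made-up function name, not the actual v9 code):

static inline void queue_write_unlock_subtract(struct qrwlock *lock)
{
	/*
	 * The writer byte is _QW_LOCKED (0xff) while write-locked, so
	 * subtracting it clears the byte without disturbing the reader
	 * count held in the upper bits.
	 */
	smp_mb();	/* make the critical section visible before release */
	atomic_sub(_QW_LOCKED, &lock->cnts.rwa);
}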