2015-06-18 02:20:51

by Waiman Long

Subject: [PATCH v4 0/2] locking/qrwlock: More optimizations in qrwlock

v3->v4:
- Remove the unnecessary _QW_WMASK check in queue_read_lock_slowpath().

v2->v3:
- Fix incorrect commit log message in patch 1.

v1->v2:
- Add microbenchmark data for the second patch

This patch set contains 2 patches for qrwlock. The first one optimizes
the interrupt-context reader slowpath. The second one optimizes the
writer slowpath.

Waiman Long (2):
locking/qrwlock: Better optimization for interrupt context readers
locking/qrwlock: Don't contend with readers when setting _QW_WAITING

include/asm-generic/qrwlock.h | 4 ++--
kernel/locking/qrwlock.c | 41 +++++++++++++++++++++++++++++++----------
2 files changed, 33 insertions(+), 12 deletions(-)


2015-06-18 02:21:13

by Waiman Long

Subject: [PATCH v4 1/2] locking/qrwlock: Better optimization for interrupt context readers

The qrwlock is fair in process context, but becomes unfair in interrupt
context to support use cases like the tasklist_lock.

The current code isn't well documented about what happens in interrupt
context. rspin_until_writer_unlock() will only spin if the writer has
already taken the lock. If the writer is still in the waiting state,
the increment of the reader count will cause the writer to remain
waiting and the new interrupt-context reader will get the lock and
return immediately. The current code, however, does an additional read
of the lock value which is not necessary, as that information is
already available in the fast path. This may sometimes cause an extra
cacheline transfer when the lock is highly contended.

This patch passes the lock value obtained in the fast path to the
slow path to eliminate the additional read. It also documents the
behavior of interrupt-context readers more clearly.
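
For illustration, here is a standalone userspace sketch of the
interrupt-context path described above. It is not the kernel code: it
uses GCC/Clang __atomic builtins, drops the writer queue entirely, and
only borrows the lock-word constants from asm-generic/qrwlock.h of
that era.

/*
 * Standalone model of the interrupt-context reader path.  Illustrative
 * sketch only: userspace, no writer queue, no arch_spinlock.
 */
#include <stdint.h>
#include <stdio.h>

#define _QW_WAITING	0x01U		/* a writer is waiting */
#define _QW_LOCKED	0xffU		/* a writer holds the lock */
#define _QW_WMASK	0xffU		/* writer mode mask (low byte) */
#define _QR_SHIFT	8		/* reader count starts at bit 8 */
#define _QR_BIAS	(1U << _QR_SHIFT)

/*
 * Spin only while a writer actually holds the lock.  If the writer is
 * merely waiting, return at once: the reader count bumped in the fast
 * path keeps that writer parked.
 */
static void rspin_until_writer_unlock(uint32_t *cnts_p, uint32_t cnts)
{
	while ((cnts & _QW_WMASK) == _QW_LOCKED)
		cnts = __atomic_load_n(cnts_p, __ATOMIC_ACQUIRE);
}

int main(void)
{
	uint32_t cnts = _QW_WAITING;	/* a writer is already waiting */

	/* Reader fast path: add the bias and keep the value it returned. */
	uint32_t seen = __atomic_add_fetch(&cnts, _QR_BIAS, __ATOMIC_ACQUIRE);

	/*
	 * Interrupt-context slow path: reuse 'seen' instead of reloading
	 * the lock word.  (seen & _QW_WMASK) is _QW_WAITING, not
	 * _QW_LOCKED, so this returns immediately.
	 */
	rspin_until_writer_unlock(&cnts, seen);
	printf("irq reader in, cnts = 0x%x\n",
	       __atomic_load_n(&cnts, __ATOMIC_RELAXED));
	return 0;
}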

Signed-off-by: Waiman Long <[email protected]>
---
include/asm-generic/qrwlock.h | 4 ++--
kernel/locking/qrwlock.c | 13 +++++++------
2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/qrwlock.h b/include/asm-generic/qrwlock.h
index 6383d54..865d021 100644
--- a/include/asm-generic/qrwlock.h
+++ b/include/asm-generic/qrwlock.h
@@ -36,7 +36,7 @@
/*
* External function declarations
*/
-extern void queue_read_lock_slowpath(struct qrwlock *lock);
+extern void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts);
extern void queue_write_lock_slowpath(struct qrwlock *lock);

/**
@@ -105,7 +105,7 @@ static inline void queue_read_lock(struct qrwlock *lock)
return;

/* The slowpath will decrement the reader count, if necessary. */
- queue_read_lock_slowpath(lock);
+ queue_read_lock_slowpath(lock, cnts);
}

/**
diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 00c12bb..7fd223c 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -43,22 +43,23 @@ rspin_until_writer_unlock(struct qrwlock *lock, u32 cnts)
* queue_read_lock_slowpath - acquire read lock of a queue rwlock
* @lock: Pointer to queue rwlock structure
*/
-void queue_read_lock_slowpath(struct qrwlock *lock)
+void queue_read_lock_slowpath(struct qrwlock *lock, u32 cnts)
{
- u32 cnts;
-
/*
* Readers come here when they cannot get the lock without waiting
*/
if (unlikely(in_interrupt())) {
/*
- * Readers in interrupt context will spin until the lock is
- * available without waiting in the queue.
+ * Readers in interrupt context will get the lock immediately
+ * if the writer is just waiting (not holding the lock yet).
+ * The rspin_until_writer_unlock() function returns immediately
+ * in this case. Otherwise, they will spin until the lock
+ * is available without waiting in the queue.
*/
- cnts = smp_load_acquire((u32 *)&lock->cnts);
rspin_until_writer_unlock(lock, cnts);
return;
}
+
atomic_sub(_QR_BIAS, &lock->cnts);

/*
--
1.7.1

2015-06-18 02:21:05

by Waiman Long

Subject: [PATCH v4 2/2] locking/qrwlock: Don't contend with readers when setting _QW_WAITING

The current cmpxchg() loop that sets the _QW_WAITING flag for writers
in queue_write_lock_slowpath() will contend with incoming readers,
possibly causing extra, wasteful cmpxchg() operations. This patch
changes the code to do a byte cmpxchg() on the writer byte only,
eliminating contention with new readers.
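
To illustrate the idea outside the kernel, the following is a minimal
userspace sketch (GCC/Clang __atomic builtins, little-endian layout
assumed, no queueing; the union mirrors the __qrwlock overlay added by
this patch, everything else is simplified):

/*
 * Sketch of the "cmpxchg only the writer byte" idea.  Readers touch
 * bits 8-31 of the lock word, so their increments cannot make the
 * byte-wide cmpxchg on wmode fail.
 */
#include <stdint.h>
#include <stdio.h>

#define _QW_WAITING	0x01U		/* a writer is waiting */

struct qrwlock_model {
	union {
		uint32_t cnts;			/* whole lock word */
		struct {
			uint8_t	wmode;		/* writer byte (LE: lowest byte) */
			uint8_t	rcnts[3];	/* reader count bytes */
		};
	};
};

/* Writer slowpath step: claim the waiting slot with a byte cmpxchg. */
static int try_set_waiting(struct qrwlock_model *l)
{
	uint8_t expected = 0;

	return __atomic_compare_exchange_n(&l->wmode, &expected, _QW_WAITING,
					   0, __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

int main(void)
{
	struct qrwlock_model l = { .cnts = 0 };

	/* Simulate readers arriving: they only modify bits 8-31... */
	__atomic_fetch_add(&l.cnts, 1U << 8, __ATOMIC_ACQUIRE);

	/* ...so the byte-wide cmpxchg still succeeds on the first try. */
	printf("set _QW_WAITING: %s, cnts = 0x%x\n",
	       try_set_waiting(&l) ? "ok" : "failed",
	       __atomic_load_n(&l.cnts, __ATOMIC_RELAXED));
	return 0;
}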

A multithreaded microbenchmark running a 5M read_lock/write_lock loop
on an 8-socket, 80-core Westmere-EX machine running a 4.0-based kernel
with the qspinlock patch has the following execution times (in ms)
with and without the patch:

With R:W ratio = 5:1

Threads    w/o patch    with patch    % change
-------    ---------    ----------    --------
   2            990           895       -9.6%
   3           2136          1912      -10.5%
   4           3166          2830      -10.6%
   5           3953          3629       -8.2%
   6           4628          4405       -4.8%
   7           5344          5197       -2.8%
   8           6065          6004       -1.0%
   9           6826          6811       -0.2%
  10           7599          7599        0.0%
  15           9757          9766       +0.1%
  20          13767         13817       +0.4%

With a small number of contending threads, this patch can improve
locking performance by up to 10%. With more contending threads,
however, the gain diminishes.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/locking/qrwlock.c | 28 ++++++++++++++++++++++++----
1 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
index 7fd223c..38ad7e0 100644
--- a/kernel/locking/qrwlock.c
+++ b/kernel/locking/qrwlock.c
@@ -22,6 +22,26 @@
#include <linux/hardirq.h>
#include <asm/qrwlock.h>

+/*
+ * This internal data structure is used for optimizing access to some of
+ * the subfields within the atomic_t cnts.
+ */
+struct __qrwlock {
+ union {
+ atomic_t cnts;
+ struct {
+#ifdef __LITTLE_ENDIAN
+ u8 wmode; /* Writer mode */
+ u8 rcnts[3]; /* Reader counts */
+#else
+ u8 rcnts[3]; /* Reader counts */
+ u8 wmode; /* Writer mode */
+#endif
+ };
+ };
+ arch_spinlock_t lock;
+};
+
/**
* rspin_until_writer_unlock - inc reader count & spin until writer is gone
* @lock : Pointer to queue rwlock structure
@@ -108,10 +128,10 @@ void queue_write_lock_slowpath(struct qrwlock *lock)
* or wait for a previous writer to go away.
*/
for (;;) {
- cnts = atomic_read(&lock->cnts);
- if (!(cnts & _QW_WMASK) &&
- (atomic_cmpxchg(&lock->cnts, cnts,
- cnts | _QW_WAITING) == cnts))
+ struct __qrwlock *l = (struct __qrwlock *)lock;
+
+ if (!READ_ONCE(l->wmode) &&
+ (cmpxchg(&l->wmode, 0, _QW_WAITING) == 0))
break;

cpu_relax_lowlatency();
--
1.7.1

2015-06-18 12:41:22

by Will Deacon

Subject: Re: [PATCH v4 1/2] locking/qrwlock: Better optimization for interrupt context readers

On Thu, Jun 18, 2015 at 03:20:22AM +0100, Waiman Long wrote:
> The qrwlock is fair in process context, but becomes unfair in interrupt
> context to support use cases like the tasklist_lock.
>
> The current code isn't well documented about what happens in interrupt
> context. rspin_until_writer_unlock() will only spin if the writer has
> already taken the lock. If the writer is still in the waiting state,
> the increment of the reader count will cause the writer to remain
> waiting and the new interrupt-context reader will get the lock and
> return immediately. The current code, however, does an additional read
> of the lock value which is not necessary, as that information is
> already available in the fast path. This may sometimes cause an extra
> cacheline transfer when the lock is highly contended.
>
> This patch passes the lock value obtained in the fast path to the
> slow path to eliminate the additional read. It also documents the
> behavior of interrupt-context readers more clearly.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/asm-generic/qrwlock.h | 4 ++--
> kernel/locking/qrwlock.c | 13 +++++++------
> 2 files changed, 9 insertions(+), 8 deletions(-)

LGTM:

Reviewed-by: Will Deacon <[email protected]>

Will