2019-03-29 15:24:48

by Alex Kogan

Subject: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

This version addresses feedback from Peter and Waiman. In particular,
the CNA functionality has been moved to a separate file, and is controlled
by a config option (enabled by default if NUMA is enabled).
An optimization has been introduced to reduce the overhead of shuffling
threads between waiting queues when the lock is only lightly contended.

Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This series introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It can be
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).

CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. At unlock time, the lock
holder scans the main queue looking for a thread running on the same
node. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If such T is not found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue. To avoid starvation of threads in the secondary queue,
those threads are moved back to the head of the main queue
after a certain expected number of intra-node lock hand-offs.

More details are available at https://arxiv.org/abs/1810.05600.
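
For readers who prefer code, the hand-off policy described above roughly
corresponds to the following user-space sketch (illustrative only; struct
waiter, struct cna_queues and cna_pick_next are names made up for this
sketch, the data structures are simplified, and the real implementation,
which operates on the MCS queue nodes, is in patch 3):

#include <stddef.h>

struct waiter {
        struct waiter *next;
        int numa_node;
};

struct cna_queues {
        struct waiter *main;       /* head of the main queue */
        struct waiter *sec_head;   /* head of the secondary queue */
        struct waiter *sec_tail;   /* tail of the secondary queue */
};

/* Called by the lock holder at unlock time to pick the next lock holder. */
static struct waiter *cna_pick_next(struct cna_queues *q, int my_node)
{
        struct waiter *w = q->main, *prev = NULL;

        /* Look for the first main-queue waiter on the holder's node. */
        while (w && w->numa_node != my_node) {
                prev = w;
                w = w->next;
        }

        if (w) {
                /* Move the skipped waiters to the end of the secondary queue. */
                if (prev) {
                        prev->next = NULL;
                        if (q->sec_tail)
                                q->sec_tail->next = q->main;
                        else
                                q->sec_head = q->main;
                        q->sec_tail = prev;
                }
                q->main = w->next;
                return w;               /* same-node waiter found */
        }

        if (q->sec_head) {              /* no same-node waiter in the main queue */
                w = q->sec_head;
                q->sec_head = w->next;
                if (!q->sec_head)
                        q->sec_tail = NULL;
                return w;
        }

        w = q->main;                    /* fall back to the next main-queue waiter */
        if (w)
                q->main = w->next;
        return w;
}

The starvation-avoidance and shuffle-reduction tweaks from patches 4 and 5
are omitted from this sketch.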

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents the average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The
standard deviation is reported in parentheses; with a few
exceptions, it is about 3% of the mean. The 'stock' kernel is v5.0-rc8,
commit 28d49e282665 ("locking/lockdep: Shrink struct lock_class_key"),
compiled in the default configuration. 'patch' is the modified
kernel compiled with NUMA_AWARE_SPINLOCKS not set; it is included to show
that the performance impact of the changes to the existing qspinlock code is
essentially noise. 'patch-CNA' is the modified kernel with
NUMA_AWARE_SPINLOCKS set; the speedup is calculated by dividing
'patch-CNA' by 'stock'.

#thr stock patch patch-CNA speedup (patch-CNA/stock)
1 2.731 (0.102) 2.732 (0.093) 2.716 (0.082) 0.995
2 3.071 (0.124) 3.084 (0.109) 3.079 (0.113) 1.003
4 4.221 (0.138) 4.229 (0.087) 4.408 (0.103) 1.044
8 5.366 (0.154) 5.274 (0.094) 6.958 (0.233) 1.297
16 6.673 (0.164) 6.689 (0.095) 8.547 (0.145) 1.281
32 7.365 (0.177) 7.353 (0.183) 9.305 (0.202) 1.263
36 7.473 (0.198) 7.422 (0.181) 9.441 (0.196) 1.263
72 6.805 (0.182) 6.699 (0.170) 10.020 (0.218) 1.472
108 6.509 (0.082) 6.480 (0.115) 10.027 (0.194) 1.540
142 6.223 (0.109) 6.294 (0.100) 9.874 (0.183) 1.587
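
(For example, the 72-thread speedup above is 10.020 / 6.805 ~= 1.472.)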

The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads:

#thr stock patch patch-CNA speedup (patch-CNA/stock)
1 0.565 (0.004) 0.567 (0.001) 0.565 (0.003) 0.999
2 0.892 (0.021) 0.899 (0.022) 0.900 (0.018) 1.009
4 1.503 (0.031) 1.527 (0.038) 1.481 (0.025) 0.985
8 1.755 (0.105) 1.714 (0.079) 1.683 (0.106) 0.959
16 1.740 (0.095) 1.752 (0.087) 1.693 (0.098) 0.973
32 0.884 (0.080) 0.908 (0.090) 1.686 (0.092) 1.906
36 0.907 (0.095) 0.894 (0.088) 1.709 (0.081) 1.885
72 0.856 (0.041) 0.858 (0.043) 1.707 (0.082) 1.994
108 0.858 (0.039) 0.869 (0.037) 1.732 (0.076) 2.020
142 0.809 (0.044) 0.854 (0.044) 1.728 (0.083) 2.135

and will-it-scale/lock2_threads:

#thr stock patch patch-CNA speedup (patch-CNA/stock)
1 1.713 (0.004) 1.715 (0.004) 1.711 (0.004) 0.999
2 2.889 (0.057) 2.864 (0.078) 2.876 (0.066) 0.995
4 4.582 (1.032) 5.066 (0.787) 4.725 (0.959) 1.031
8 4.227 (0.196) 4.104 (0.274) 4.092 (0.365) 0.968
16 4.108 (0.141) 4.057 (0.138) 4.010 (0.168) 0.976
32 2.674 (0.125) 2.625 (0.171) 3.958 (0.156) 1.480
36 2.622 (0.107) 2.553 (0.150) 3.978 (0.116) 1.517
72 2.009 (0.090) 1.998 (0.092) 3.932 (0.114) 1.957
108 2.154 (0.069) 2.089 (0.090) 3.870 (0.081) 1.797
142 1.953 (0.106) 1.943 (0.111) 3.853 (0.100) 1.973

Further comments are welcome and appreciated.

Alex Kogan (5):
locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic
locking/qspinlock: Refactor the qspinlock slow path
locking/qspinlock: Introduce CNA into the slow path of qspinlock
locking/qspinlock: Introduce starvation avoidance into CNA
locking/qspinlock: Introduce the shuffle reduction optimization into
CNA

arch/arm/include/asm/mcs_spinlock.h | 4 +-
arch/x86/Kconfig | 14 ++
include/asm-generic/qspinlock_types.h | 13 ++
kernel/locking/mcs_spinlock.h | 16 ++-
kernel/locking/qspinlock.c | 77 +++++++++--
kernel/locking/qspinlock_cna.h | 245 ++++++++++++++++++++++++++++++++++
6 files changed, 354 insertions(+), 15 deletions(-)
create mode 100644 kernel/locking/qspinlock_cna.h

--
2.11.0 (Apple Git-81)



2019-03-29 15:22:24

by Alex Kogan

Subject: [PATCH v2 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

Choose the next lock holder among spinning threads running on the same
node with high probability rather than always. With small probability,
hand the lock to the first thread in the secondary queue or, if that
queue is empty, to the immediate successor of the current lock holder
in the main queue. Thus, assuming no failures while threads hold the
lock, every thread would be able to acquire the lock after a bounded
number of lock transitions, with high probability.
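
To make the probability concrete, here is a minimal user-space sketch (not
part of the patch) that replays the same xorshift/probably() gating used
below with INTRA_NODE_HANDOFF_PROB_ARG = 0x10000:

#include <stdio.h>
#include <stdint.h>

/* Same xorshift shift constants as in the patch; any non-zero seed works. */
static uint32_t state = 0x9e3779b9;

static uint32_t xor_random(void)
{
        uint32_t v = state;

        v ^= v << 6;
        v ^= v >> 21;
        v ^= v << 7;
        state = v;
        return v;
}

/* Returns 0 about once every @range calls; @range must be a power of 2. */
static unsigned int probably(unsigned int range)
{
        return xor_random() & (range - 1);
}

int main(void)
{
        unsigned long intra = 0, inter = 0, i;

        for (i = 0; i < 100000000UL; i++) {
                if (probably(0x10000))
                        intra++;        /* intra-node hand-off */
                else
                        inter++;        /* hand-off to the secondary queue */
        }
        printf("intra=%lu inter=%lu (inter expected to be ~intra/65535)\n",
               intra, inter);
        return 0;
}

In other words, a thread that ends up in the secondary queue waits, in
expectation, on the order of 2^16 lock hand-offs before the main-queue scan
is skipped and the secondary queue gets its turn.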

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
kernel/locking/qspinlock_cna.h | 55 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 53 insertions(+), 2 deletions(-)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 5bc5fd9586ea..5addf6439326 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -3,6 +3,8 @@
#error "do not include this file"
#endif

+#include <linux/random.h>
+
/*
* Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
*
@@ -15,7 +17,9 @@
* secondary queue, and the lock is passed to T. If such T is not found, the
* lock is passed to the first node in the secondary queue. Finally, if the
* secondary queue is empty, the lock is passed to the next thread in the
- * main queue.
+ * main queue. To avoid starvation of threads in the secondary queue,
+ * those threads are moved back to the head of the main queue
+ * after a certain expected number of intra-node lock hand-offs.
*
* For details, see https://arxiv.org/abs/1810.05600.
*
@@ -25,6 +29,18 @@

#define MCS_NODE(ptr) ((struct mcs_spinlock *)(ptr))

+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Controls the probability for intra-node lock hand-off. It can be
+ * tuned and depend, e.g., on the number of CPUs per node. For now,
+ * choose a value that provides reasonable long-term fairness without
+ * sacrificing performance compared to a version that does not have any
+ * fairness guarantees.
+ */
+#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
+
static inline __pure int decode_numa_node(u32 node_and_count)
{
int node = (node_and_count >> _Q_NODE_OFFSET) - 1;
@@ -102,6 +118,35 @@ static struct mcs_spinlock *find_successor(struct mcs_spinlock *me)
return NULL;
}

+/*
+ * xorshift function for generating pseudo-random numbers:
+ * https://en.wikipedia.org/wiki/Xorshift
+ */
+static inline u32 xor_random(void)
+{
+ u32 v;
+
+ v = this_cpu_read(seed);
+ if (v == 0)
+ get_random_bytes(&v, sizeof(u32));
+
+ v ^= v << 6;
+ v ^= v >> 21;
+ v ^= v << 7;
+ this_cpu_write(seed, v);
+
+ return v;
+}
+
+/*
+ * Return false with probability 1 / @range.
+ * @range must be a power of 2.
+ */
+static bool probably(unsigned int range)
+{
+ return xor_random() & (range - 1);
+}
+
static __always_inline int get_node_index(struct mcs_spinlock *node)
{
return decode_count(node->node_and_count++);
@@ -151,7 +196,13 @@ static inline void pass_mcs_lock(struct mcs_spinlock *node,
{
struct mcs_spinlock *succ = NULL;

- succ = find_successor(node);
+ /*
+ * Try to pass the lock to a thread running on the same node.
+ * For long-term fairness, search for such a thread with high
+ * probability rather than always.
+ */
+ if (probably(INTRA_NODE_HANDOFF_PROB_ARG))
+ succ = find_successor(node);

if (succ) {
arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
--
2.11.0 (Apple Git-81)


2019-03-29 15:22:30

by Alex Kogan

Subject: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. At unlock time,
the lock holder scans the main queue looking for a thread running on
the same node. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If such T is not found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue. For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
arch/x86/Kconfig | 14 +++
include/asm-generic/qspinlock_types.h | 13 +++
kernel/locking/mcs_spinlock.h | 10 ++
kernel/locking/qspinlock.c | 29 +++++-
kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
5 files changed, 236 insertions(+), 3 deletions(-)
create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68261430fe6e..e70c39a901f2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1554,6 +1554,20 @@ config NUMA

Otherwise, you should say N.

+config NUMA_AWARE_SPINLOCKS
+ bool "Numa-aware spinlocks"
+ depends on NUMA
+ default y
+ help
+ Introduce NUMA (Non Uniform Memory Access) awareness into
+ the slow path of spinlocks.
+
+ The kernel will try to keep the lock on the same node,
+ thus reducing the number of remote cache misses, while
+ trading some of the short term fairness for better performance.
+
+ Say N if you want absolute first come first serve fairness.
+
config AMD_NUMA
def_bool y
prompt "Old style AMD Opteron NUMA detection"
diff --git a/include/asm-generic/qspinlock_types.h b/include/asm-generic/qspinlock_types.h
index d10f1e7d6ba8..23368ca20d2d 100644
--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -109,4 +109,17 @@ typedef struct qspinlock {
#define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET)
#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)

+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+/*
+ * Bitfields in the non-atomic node_and_count value:
+ * 0- 1: queue node index (always < 4)
+ * 2-31: numa node id
+ */
+#define _Q_IDX_OFFSET 0
+#define _Q_IDX_BITS 2
+#define _Q_IDX_MASK _Q_SET_MASK(IDX)
+#define _Q_NODE_OFFSET (_Q_IDX_OFFSET + _Q_IDX_BITS)
+
+#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
+
#endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index bc6d3244e1af..71ee4b64c5d4 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,8 +17,18 @@

struct mcs_spinlock {
struct mcs_spinlock *next;
+#ifndef CONFIG_NUMA_AWARE_SPINLOCKS
int locked; /* 1 if lock acquired */
int count; /* nesting count, see qspinlock.c */
+#else /* CONFIG_NUMA_AWARE_SPINLOCKS */
+ uintptr_t locked; /* 1 if lock acquired, 0 if not, other values */
+ /* represent a pointer to the secondary queue head */
+ u32 node_and_count; /* node id on which this thread is running */
+ /* with two lower bits reserved for nesting */
+ /* count, see qspinlock.c */
+ u32 encoded_tail; /* encoding of this node as the main queue tail */
+ struct mcs_spinlock *tail; /* points to the secondary queue tail */
+#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
};

#ifndef arch_mcs_spin_lock_contended
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 074f65b9bedc..7cc923a59716 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -85,6 +85,8 @@
* to have more than 2 levels of slowpath nesting in actual use. We don't
* want to penalize pvqspinlocks to optimize for a rare case in native
* qspinlocks.
+ * Note that when CNA is used, the mcs_spinlock structure
+ * is 32 bytes in size, so two of them fit in one 64-byte cacheline.
*/
struct qnode {
struct mcs_spinlock mcs;
@@ -109,7 +111,8 @@ struct qnode {
* Per-CPU queue node structures; we can never have more than 4 nested
* contexts: task, softirq, hardirq, nmi.
*
- * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ * Exactly fits one 64-byte cacheline (or two, if CNA is used)
+ * on a 64-bit architecture.
*
* PV doubles the storage and uses the second cacheline for PV state.
*/
@@ -297,6 +300,8 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif

+#ifndef CONFIG_NUMA_AWARE_SPINLOCKS
+
static __always_inline int get_node_index(struct mcs_spinlock *node)
{
return node->count++;
@@ -334,6 +339,16 @@ static __always_inline void pass_mcs_lock(struct mcs_spinlock *node,
arch_mcs_spin_unlock_contended(&next->locked, 1);
}

+static __always_inline void cna_init_node(struct mcs_spinlock *node,
+ int cpuid, u32 tail) { }
+
+#else /* CONFIG_NUMA_AWARE_SPINLOCKS */
+
+#define _GEN_CNA_LOCK_SLOWPATH
+#include "qspinlock_cna.h"
+
+#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
+
#endif /* _GEN_PV_LOCK_SLOWPATH */

/**
@@ -361,7 +376,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
- int idx;
+ int idx, cpuid;

BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));

@@ -444,7 +459,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
idx = get_node_index(node);
- tail = encode_tail(smp_processor_id(), idx);
+ cpuid = smp_processor_id();
+ tail = encode_tail(cpuid, idx);

/*
* 4 nodes are allocated based on the assumption that there will
@@ -478,6 +494,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

node->locked = 0;
node->next = NULL;
+ cna_init_node(node, cpuid, tail);
pv_init_node(node);

/*
@@ -527,6 +544,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
next = READ_ONCE(node->next);
if (next)
prefetchw(next);
+ } else {
+ /* In CNA, we must pass a non-zero value to successor when
+ * we unlock. This store should be harmless performance-wise,
+ * as we just initialized @node.
+ */
+ node->locked = 1;
}

/*
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..5bc5fd9586ea
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,173 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a main queue for
+ * threads running on the same node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. At the unlock time,
+ * the lock holder scans the main queue looking for a thread running on
+ * the same node. If found (call it thread T), all threads in the main queue
+ * between the current lock holder and T are moved to the end of the
+ * secondary queue, and the lock is passed to T. If such T is not found, the
+ * lock is passed to the first node in the secondary queue. Finally, if the
+ * secondary queue is empty, the lock is passed to the next thread in the
+ * main queue.
+ *
+ * For details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <[email protected]>
+ * Dave Dice <[email protected]>
+ */
+
+#define MCS_NODE(ptr) ((struct mcs_spinlock *)(ptr))
+
+static inline __pure int decode_numa_node(u32 node_and_count)
+{
+ int node = (node_and_count >> _Q_NODE_OFFSET) - 1;
+
+ return node;
+}
+
+static inline __pure int decode_count(u32 node_and_count)
+{
+ int count = node_and_count & _Q_IDX_MASK;
+
+ return count;
+}
+
+static inline void store_numa_node(struct mcs_spinlock *node, int numa_node)
+{
+ u32 val;
+
+ val = (numa_node + 1) << _Q_NODE_OFFSET;
+ val |= decode_count(node->node_and_count);
+
+ node->node_and_count = val;
+}
+
+/**
+ * find_successor - Scan the main waiting queue looking for the first
+ * thread running on the same node as the lock holder. If found (call it
+ * thread T), move all threads in the main queue between the lock holder
+ * and T to the end of the secondary queue and return T; otherwise, return NULL.
+ */
+static struct mcs_spinlock *find_successor(struct mcs_spinlock *me)
+{
+ int my_node;
+ struct mcs_spinlock *head_other, *tail_other, *cur;
+ struct mcs_spinlock *next = me->next;
+
+ /* @next should be set, else we would not be calling this function. */
+ WARN_ON_ONCE(next == NULL);
+
+ /* Get node, which would not be set if we entered an empty queue. */
+ my_node = decode_numa_node(me->node_and_count);
+
+ /*
+ * Fast path - check whether the immediate successor runs on
+ * the same node.
+ */
+ if (decode_numa_node(next->node_and_count) == my_node)
+ return next;
+
+ head_other = next;
+ tail_other = next;
+
+ /*
+ * Traverse the main waiting queue starting from the successor of my
+ * successor, and look for a thread running on the same node.
+ */
+ cur = READ_ONCE(next->next);
+ while (cur) {
+ if (decode_numa_node(cur->node_and_count) == my_node) {
+ /*
+ * Found a thread on the same node. Move threads
+ * between me and that node into the secondary queue.
+ */
+ if (me->locked > 1)
+ MCS_NODE(me->locked)->tail->next = head_other;
+ else
+ me->locked = (uintptr_t)head_other;
+ tail_other->next = NULL;
+ MCS_NODE(me->locked)->tail = tail_other;
+ return cur;
+ }
+ tail_other = cur;
+ cur = READ_ONCE(cur->next);
+ }
+ return NULL;
+}
+
+static __always_inline int get_node_index(struct mcs_spinlock *node)
+{
+ return decode_count(node->node_and_count++);
+}
+
+static __always_inline void release_mcs_node(struct mcs_spinlock *node)
+{
+ __this_cpu_dec(node->node_and_count);
+}
+
+static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
+ u32 tail)
+{
+ if (decode_numa_node(node->node_and_count) == -1)
+ store_numa_node(node, numa_cpu_node(cpuid));
+ node->encoded_tail = tail;
+}
+
+static inline bool set_locked_empty_mcs(struct qspinlock *lock, u32 val,
+ struct mcs_spinlock *node)
+{
+ /* Check whether the secondary queue is empty. */
+ if (node->locked == 1) {
+ if (atomic_try_cmpxchg_relaxed(&lock->val, &val,
+ _Q_LOCKED_VAL))
+ return true; /* No contention */
+ } else {
+ /*
+ * Pass the lock to the first thread in the secondary
+ * queue, but first try to update the queue's tail to
+ * point to the last node in the secondary queue.
+ */
+ struct mcs_spinlock *succ = MCS_NODE(node->locked);
+ u32 new = succ->tail->encoded_tail + _Q_LOCKED_VAL;
+
+ if (atomic_try_cmpxchg_relaxed(&lock->val, &val, new)) {
+ arch_mcs_spin_unlock_contended(&succ->locked, 1);
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static inline void pass_mcs_lock(struct mcs_spinlock *node,
+ struct mcs_spinlock *next)
+{
+ struct mcs_spinlock *succ = NULL;
+
+ succ = find_successor(node);
+
+ if (succ) {
+ arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
+ } else if (node->locked > 1) {
+ /*
+ * If the secondary queue is not empty, pass the lock
+ * to the first node in that queue.
+ */
+ succ = MCS_NODE(node->locked);
+ succ->tail->next = next;
+ arch_mcs_spin_unlock_contended(&succ->locked, 1);
+ } else {
+ /*
+ * Otherwise, pass the lock to the immediate successor
+ * in the main queue.
+ */
+ arch_mcs_spin_unlock_contended(&next->locked, 1);
+ }
+}
--
2.11.0 (Apple Git-81)


2019-03-29 15:22:37

by Alex Kogan

Subject: [PATCH v2 1/5] locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic

Make the arch_mcs_spin_unlock_contended() macro accept the value to be
stored into its lock argument as an additional parameter. This allows the
same macro to be used in cases where the stored value is different from 1.
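
For illustration, later in the series (patch 3) CNA uses the extra argument
to hand its successor the current value of node->locked, which may encode
the head of the secondary queue rather than 1:

        arch_mcs_spin_unlock_contended(&succ->locked, node->locked);

while all existing call sites continue to pass 1.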

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
arch/arm/include/asm/mcs_spinlock.h | 4 ++--
kernel/locking/mcs_spinlock.h | 6 +++---
kernel/locking/qspinlock.c | 2 +-
3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..ae6d763477f4 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -14,9 +14,9 @@ do { \
wfe(); \
} while (0) \

-#define arch_mcs_spin_unlock_contended(lock) \
+#define arch_mcs_spin_unlock_contended(lock, val) \
do { \
- smp_store_release(lock, 1); \
+ smp_store_release(lock, (val)); \
dsb_sev(); \
} while (0)

diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..bc6d3244e1af 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -41,8 +41,8 @@ do { \
* operations in the critical section has been completed before
* unlocking.
*/
-#define arch_mcs_spin_unlock_contended(l) \
- smp_store_release((l), 1)
+#define arch_mcs_spin_unlock_contended(l, val) \
+ smp_store_release((l), (val))
#endif

/*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
}

/* Pass lock to next waiter. */
- arch_mcs_spin_unlock_contended(&next->locked);
+ arch_mcs_spin_unlock_contended(&next->locked, 1);
}

#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5e9247dc2515..5941ce3527ce 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -558,7 +558,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));

- arch_mcs_spin_unlock_contended(&next->locked);
+ arch_mcs_spin_unlock_contended(&next->locked, 1);
pv_kick_node(lock, next);

release:
--
2.11.0 (Apple Git-81)


2019-03-29 15:22:50

by Alex Kogan

Subject: [PATCH v2 5/5] locking/qspinlock: Introduce the shuffle reduction optimization into CNA

This optimization reduces the probability that threads will be shuffled
between the main and secondary queues when the secondary queue is empty.
It is helpful when the lock is only lightly contended.
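
With SHUFFLE_REDUCTION_PROB_ARG = 0x80 (added below), the unlock path now
skips the same-node scan with probability 127/128 (roughly 99.2%) whenever
the secondary queue is empty, falling back to a plain MCS hand-off to the
immediate successor.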

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
kernel/locking/qspinlock_cna.h | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 5addf6439326..29941677a6e1 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -41,6 +41,15 @@ static DEFINE_PER_CPU(u32, seed);
*/
#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000

+/*
+ * Controls the probability for enabling the scan of the main queue when
+ * the secondary queue is empty. The chosen value reduces the amount of
+ * unnecessary shuffling of threads between the two waiting queues when
+ * the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG 0x80
+
static inline __pure int decode_numa_node(u32 node_and_count)
{
int node = (node_and_count >> _Q_NODE_OFFSET) - 1;
@@ -197,6 +206,18 @@ static inline void pass_mcs_lock(struct mcs_spinlock *node,
struct mcs_spinlock *succ = NULL;

/*
+ * Limit thread shuffling when the secondary queue is empty.
+ * This copes with the overhead the shuffling creates when the
+ * lock is only lightly contended, and threads do not stay
+ * in the secondary queue long enough to reap the benefit of moving
+ * them there.
+ */
+ if (node->locked == 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
+ arch_mcs_spin_unlock_contended(&next->locked, 1);
+ return;
+ }
+
+ /*
* Try to pass the lock to a thread running on the same node.
* For long-term fairness, search for such a thread with high
* probability rather than always.
--
2.11.0 (Apple Git-81)


2019-03-29 15:24:04

by Alex Kogan

Subject: [PATCH v2 2/5] locking/qspinlock: Refactor the qspinlock slow path

Move some of the code manipulating MCS nodes into separate functions.
This allows easier integration of alternative ways to manipulate
those nodes.

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
kernel/locking/qspinlock.c | 48 +++++++++++++++++++++++++++++++++++++++-------
1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 5941ce3527ce..074f65b9bedc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -297,6 +297,43 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif

+static __always_inline int get_node_index(struct mcs_spinlock *node)
+{
+ return node->count++;
+}
+
+static __always_inline void release_mcs_node(struct mcs_spinlock *node)
+{
+ __this_cpu_dec(node->count);
+}
+
+/*
+ * set_locked_empty_mcs - Try to set the spinlock value to _Q_LOCKED_VAL,
+ * and by doing that unlock the MCS lock when its waiting queue is empty
+ * @lock: Pointer to queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ *
+ * *,*,* -> 0,0,1
+ */
+static __always_inline bool set_locked_empty_mcs(struct qspinlock *lock,
+ u32 val,
+ struct mcs_spinlock *node)
+{
+ return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * pass_mcs_lock - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void pass_mcs_lock(struct mcs_spinlock *node,
+ struct mcs_spinlock *next)
+{
+ arch_mcs_spin_unlock_contended(&next->locked, 1);
+}
+
#endif /* _GEN_PV_LOCK_SLOWPATH */

/**
@@ -406,7 +443,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
qstat_inc(qstat_lock_slowpath, true);
pv_queue:
node = this_cpu_ptr(&qnodes[0].mcs);
- idx = node->count++;
+ idx = get_node_index(node);
tail = encode_tail(smp_processor_id(), idx);

/*
@@ -541,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
- if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+ if (set_locked_empty_mcs(lock, val, node))
goto release; /* No contention */
}

@@ -558,14 +595,11 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));

- arch_mcs_spin_unlock_contended(&next->locked, 1);
+ pass_mcs_lock(node, next);
pv_kick_node(lock, next);

release:
- /*
- * release the node
- */
- __this_cpu_dec(qnodes[0].mcs.count);
+ release_mcs_node(&qnodes[0].mcs);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);

--
2.11.0 (Apple Git-81)


2019-04-01 09:08:13

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Fri, Mar 29, 2019 at 11:20:04AM -0400, Alex Kogan wrote:
> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> index bc6d3244e1af..71ee4b64c5d4 100644
> --- a/kernel/locking/mcs_spinlock.h
> +++ b/kernel/locking/mcs_spinlock.h
> @@ -17,8 +17,18 @@
>
> struct mcs_spinlock {
> struct mcs_spinlock *next;
> +#ifndef CONFIG_NUMA_AWARE_SPINLOCKS
> int locked; /* 1 if lock acquired */
> int count; /* nesting count, see qspinlock.c */
> +#else /* CONFIG_NUMA_AWARE_SPINLOCKS */
> + uintptr_t locked; /* 1 if lock acquired, 0 if not, other values */
> + /* represent a pointer to the secondary queue head */
> + u32 node_and_count; /* node id on which this thread is running */
> + /* with two lower bits reserved for nesting */
> + /* count, see qspinlock.c */
> + u32 encoded_tail; /* encoding of this node as the main queue tail */
> + struct mcs_spinlock *tail; /* points to the secondary queue tail */
> +#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
> };

Please, have another look at the paravirt code, in particular at struct
pv_node and its usage. This is horrible.

> #ifndef arch_mcs_spin_lock_contended
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 074f65b9bedc..7cc923a59716 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c

> @@ -527,6 +544,12 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> next = READ_ONCE(node->next);
> if (next)
> prefetchw(next);
> + } else {
> + /* In CNA, we must pass a non-zero value to successor when
> + * we unlock. This store should be harmless performance-wise,
> + * as we just initialized @node.
> + */

Buggered comment style, also, it confuses the heck out of me. What does
it want to say?

Also, why isn't it hidden in your pv_wait_head_or_lock() implementation?

> + node->locked = 1;
> }
>

2019-04-01 09:10:48

by Peter Zijlstra

Subject: Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

On Fri, Mar 29, 2019 at 11:20:01AM -0400, Alex Kogan wrote:
> The following locktorture results are from an Oracle X5-4 server
> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
> cores each).

The other interesting number is on a !NUMA machine. What do these
patches do there? Remember, most people do not in fact have numa.

2019-04-01 09:22:31

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Fri, Mar 29, 2019 at 11:20:04AM -0400, Alex Kogan wrote:
> +static inline void pass_mcs_lock(struct mcs_spinlock *node,
> + struct mcs_spinlock *next)
> +{
> + struct mcs_spinlock *succ = NULL;
> +
> + succ = find_successor(node);
> +
> + if (succ) {
> + arch_mcs_spin_unlock_contended(&succ->locked, node->locked);
> + } else if (node->locked > 1) {
> + /*
> + * If the secondary queue is not empty, pass the lock
> + * to the first node in that queue.
> + */
> + succ = MCS_NODE(node->locked);
> + succ->tail->next = next;
> + arch_mcs_spin_unlock_contended(&succ->locked, 1);
> + } else {
> + /*
> + * Otherwise, pass the lock to the immediate successor
> + * in the main queue.
> + */
> + arch_mcs_spin_unlock_contended(&next->locked, 1);
> + }
> +}

Note that something like:

static inline void
pass_mcs_lock(struct mcs_spinlock *node, struct mcs_spinlock *next)
{
struct mcs_spinlock *succ = NULL;
uintptr_t *var = &next->locked;
uintptr_t val = 1;

succ = find_successor(node);
if (succ) {
var = &succ->locked;
val = node->locked;
} else if (node->locked > 1) {
succ = MCS_NODE(node->locked);
succ->tail->next = next; /* WRITE_ONCE() !?!? */
var = &node->locked;
}

arch_mcs_spin_unlock_contended(var, val);
}

is shorter and generates much better code if
arch_mcs_spin_unlock_contended() is asm volatile ().

2019-04-01 09:35:05

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Mon, Apr 01, 2019 at 11:06:53AM +0200, Peter Zijlstra wrote:
> On Fri, Mar 29, 2019 at 11:20:04AM -0400, Alex Kogan wrote:
> > diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
> > index bc6d3244e1af..71ee4b64c5d4 100644
> > --- a/kernel/locking/mcs_spinlock.h
> > +++ b/kernel/locking/mcs_spinlock.h
> > @@ -17,8 +17,18 @@
> >
> > struct mcs_spinlock {
> > struct mcs_spinlock *next;
> > +#ifndef CONFIG_NUMA_AWARE_SPINLOCKS
> > int locked; /* 1 if lock acquired */
> > int count; /* nesting count, see qspinlock.c */
> > +#else /* CONFIG_NUMA_AWARE_SPINLOCKS */
> > + uintptr_t locked; /* 1 if lock acquired, 0 if not, other values */
> > + /* represent a pointer to the secondary queue head */
> > + u32 node_and_count; /* node id on which this thread is running */
> > + /* with two lower bits reserved for nesting */
> > + /* count, see qspinlock.c */
> > + u32 encoded_tail; /* encoding of this node as the main queue tail */
> > + struct mcs_spinlock *tail; /* points to the secondary queue tail */
> > +#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
> > };
>
> Please, have another look at the paravirt code, in particular at struct
> pv_node and its usage. This is horrible.

Thing is, this turns into a right mess when you also have PV spinlocks
on.

One thing we could maybe do is change locked and count to u8, then your
overlay structure could be something like:

struct mcs_spinlock {
struct mcs_spinlock *next;
u8 locked;
u8 count;
};

struct cna_node {
/* struct mcs_spinlock overlay */
struct mcs_spinlock *next;
u8 locked;
u8 count;

/* our CNA bits, consuming the slack and PV space */
u16 node;
u32 encoded_tail;
struct mcs_spinlock *head;
struct mcs_spinlock *tail;
};

Then you also don't need the first two patches.
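
A rough sketch of how such an overlay would presumably be used (an
illustrative sketch, not code from the thread; field names are taken from
the layout above, and the cast mirrors how struct pv_node is handled today):

        static void cna_init_node(struct mcs_spinlock *node, int cpuid, u32 tail)
        {
                struct cna_node *cn = (struct cna_node *)node;

                cn->node = numa_cpu_node(cpuid);
                cn->encoded_tail = tail;
        }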

2019-04-01 14:37:19

by Waiman Long

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 03/29/2019 11:20 AM, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a main queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. At the unlock time,
> the lock holder scans the main queue looking for a thread running on
> the same node. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If such T is not found, the
> lock is passed to the first node in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.
>
> Signed-off-by: Alex Kogan <[email protected]>
> Reviewed-by: Steve Sistare <[email protected]>
> ---
> arch/x86/Kconfig | 14 +++
> include/asm-generic/qspinlock_types.h | 13 +++
> kernel/locking/mcs_spinlock.h | 10 ++
> kernel/locking/qspinlock.c | 29 +++++-
> kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
> 5 files changed, 236 insertions(+), 3 deletions(-)
> create mode 100644 kernel/locking/qspinlock_cna.h
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 68261430fe6e..e70c39a901f2 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1554,6 +1554,20 @@ config NUMA
>
> Otherwise, you should say N.
>
> +config NUMA_AWARE_SPINLOCKS
> + bool "Numa-aware spinlocks"
> + depends on NUMA
> + default y
> + help
> + Introduce NUMA (Non Uniform Memory Access) awareness into
> + the slow path of spinlocks.
> +
> + The kernel will try to keep the lock on the same node,
> + thus reducing the number of remote cache misses, while
> + trading some of the short term fairness for better performance.
> +
> + Say N if you want absolute first come first serve fairness.
> +

The patch that I am looking for is to have a separate
numa_queued_spinlock_slowpath() that coexists with
native_queued_spinlock_slowpath() and
paravirt_queued_spinlock_slowpath(). At boot time, we select the most
appropriate one for the system at hand.

If you are going for the configuration option route, keep in mind that
we optimize for the most common cases which are single-socket systems.
Please default to "n" unless you can prove that your change won't
regress performance for single-socket systems.

Cheers,
Longman

2019-04-02 09:46:35

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Mon, Apr 01, 2019 at 10:36:19AM -0400, Waiman Long wrote:
> On 03/29/2019 11:20 AM, Alex Kogan wrote:
> > +config NUMA_AWARE_SPINLOCKS
> > + bool "Numa-aware spinlocks"
> > + depends on NUMA
> > + default y
> > + help
> > + Introduce NUMA (Non Uniform Memory Access) awareness into
> > + the slow path of spinlocks.
> > +
> > + The kernel will try to keep the lock on the same node,
> > + thus reducing the number of remote cache misses, while
> > + trading some of the short term fairness for better performance.
> > +
> > + Say N if you want absolute first come first serve fairness.
> > +
>
> The patch that I am looking for is to have a separate
> numa_queued_spinlock_slowpath() that coexists with
> native_queued_spinlock_slowpath() and
> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
> appropriate one for the system at hand.

Agreed; and until we have static_call, I think we can abuse the paravirt
stuff for this.

By the time we patch the paravirt stuff:

check_bugs()
alternative_instructions()
apply_paravirt()

we should already have enumerated the NODE topology and so nr_node_ids()
should be set.

So if we frob pv_ops.lock.queued_spin_lock_slowpath to
numa_queued_spin_lock_slowpath before that, it should all get patched
just right.

That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
PARAVIRT_SPINLOCK, which is a bit awkward...
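
A minimal sketch of that idea (numa_spinlock_setup() and
numa_queued_spin_lock_slowpath() are made-up names for the hook and the CNA
entry point, and the exact placement is an assumption):

        /* somewhere early in x86 setup, before apply_paravirt() runs: */
        static void __init numa_spinlock_setup(void)
        {
                if (nr_node_ids > 1)
                        pv_ops.lock.queued_spin_lock_slowpath =
                                        numa_queued_spin_lock_slowpath;
        }

As noted above, this ties NUMA_AWARE_SPINLOCKS to PARAVIRT_SPINLOCK until
static_call is available, since pv_ops.lock is what gets patched.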

2019-04-02 10:39:50

by Peter Zijlstra

Subject: Re: [PATCH v2 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Fri, Mar 29, 2019 at 11:20:05AM -0400, Alex Kogan wrote:
> @@ -25,6 +29,18 @@
>
> #define MCS_NODE(ptr) ((struct mcs_spinlock *)(ptr))
>
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Controls the probability for intra-node lock hand-off. It can be
> + * tuned and depend, e.g., on the number of CPUs per node. For now,
> + * choose a value that provides reasonable long-term fairness without
> + * sacrificing performance compared to a version that does not have any
> + * fairness guarantees.
> + */
> +#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
> +
> static inline __pure int decode_numa_node(u32 node_and_count)
> {
> int node = (node_and_count >> _Q_NODE_OFFSET) - 1;
> @@ -102,6 +118,35 @@ static struct mcs_spinlock *find_successor(struct mcs_spinlock *me)
> return NULL;
> }
>
> +/*
> + * xorshift function for generating pseudo-random numbers:
> + * https://en.wikipedia.org/wiki/Xorshift

Cute; so clearly you've read that page, but then you provide us a
variant that isn't actually listed there.

Your naming is also non-standard in that it does not relay the period.
The type seems to suggest 32bit, so the name should then have been:

xorshift32()

Now, where did you get those parameters from; is this a proper
xorshift32 ?

> + */
> +static inline u32 xor_random(void)
> +{
> + u32 v;
> +
> + v = this_cpu_read(seed);
> + if (v == 0)
> + get_random_bytes(&v, sizeof(u32));

Given xorshift is an LFSR subset, the above case will only ever happen
_once_, and it seems like bad form to stick it here instead of in an init
function.

Also, does it really matter, can't we simply initialize the variable
with a !0 value and call it a day?

As to that variable, seed is clearly a misnomer, the wiki page you
reference calls it state, which might be a little ambiguous, xs_state
otoh should work just fine.

> + v ^= v << 6;
> + v ^= v >> 21;
> + v ^= v << 7;
> + this_cpu_write(seed, v);
> +
> + return v;
> +}

Now, you've read that page and you know there's 'trivial' improvements
on the pure xorshift, why not pick one of those? xorwow seems cheap
enough, or that xorshift128plus() one.

> +
> +/*
> + * Return false with probability 1 / @range.
> + * @range must be a power of 2.
> + */
> +static bool probably(unsigned int range)
> +{
> + return xor_random() & (range - 1);
> +}

Uhh, you sure that's what it does? The only way for that to return false
is when all @range bits are 0, which happens once (2^32/range)-1 times,
or am I mistaken?

Also, linux/random.h includes next_pseudo_random32(), should we be using
that? Arguably that's more expensive on a number of platforms due to the
multiplication. Also, we actually have xorshift32 already in tree in
lib/test_hash.c.

The advantage of next_pseudo_random32() is that it doesn't have that 0
identity that pure LFSRs suffer from and it has 0 state. Now at a
glance, the xorwow/xorshift128plus variants don't seem to suffer that 0
identity, so that's good, but they still have fairly large state. It
also seems unfortunate to litter the tree with custom PRNGs. Ted?


2019-04-03 15:42:02

by Alex Kogan

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Peter, Longman, many thanks for your detailed comments!

A few follow-up questions are inlined below.

> On Apr 2, 2019, at 5:43 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Apr 01, 2019 at 10:36:19AM -0400, Waiman Long wrote:
>> On 03/29/2019 11:20 AM, Alex Kogan wrote:
>>> +config NUMA_AWARE_SPINLOCKS
>>> + bool "Numa-aware spinlocks"
>>> + depends on NUMA
>>> + default y
>>> + help
>>> + Introduce NUMA (Non Uniform Memory Access) awareness into
>>> + the slow path of spinlocks.
>>> +
>>> + The kernel will try to keep the lock on the same node,
>>> + thus reducing the number of remote cache misses, while
>>> + trading some of the short term fairness for better performance.
>>> +
>>> + Say N if you want absolute first come first serve fairness.
>>> +
>>
>> The patch that I am looking for is to have a separate
>> numa_queued_spinlock_slowpath() that coexists with
>> native_queued_spinlock_slowpath() and
>> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
>> appropriate one for the system at hand.
Is this how this selection works today for paravirt?
I see a PARAVIRT_SPINLOCKS config option, but IIUC you are talking about a different mechanism here.
Can you, please, elaborate or give me a link to a page that explains that?

>
> Agreed; and until we have static_call, I think we can abuse the paravirt
> stuff for this.
>
> By the time we patch the paravirt stuff:
>
> check_bugs()
> alternative_instructions()
> apply_paravirt()
>
> we should already have enumerated the NODE topology and so nr_node_ids()
> should be set.
>
> So if we frob pv_ops.lock.queued_spin_lock_slowpath to
> numa_queued_spin_lock_slowpath before that, it should all get patched
> just right.
>
> That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
> PARAVIRT_SPINLOCK, which is a bit awkward…
Just to mention here, the patch so far does not address paravirt, but our goal is to add this support once we address all the concerns for the native version.
So we will end up with four variants for the queued_spinlock_slowpath() — one for each combination of native/paravirt and NUMA/non-NUMA.
Or perhaps we do not need a NUMA/paravirt variant?

— Alex


2019-04-03 15:49:13

by Waiman Long

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 04/03/2019 11:39 AM, Alex Kogan wrote:
> Peter, Longman, many thanks for your detailed comments!
>
> A few follow-up questions are inlined below.
>
>> On Apr 2, 2019, at 5:43 AM, Peter Zijlstra <[email protected]> wrote:
>>
>> On Mon, Apr 01, 2019 at 10:36:19AM -0400, Waiman Long wrote:
>>> On 03/29/2019 11:20 AM, Alex Kogan wrote:
>>>> +config NUMA_AWARE_SPINLOCKS
>>>> + bool "Numa-aware spinlocks"
>>>> + depends on NUMA
>>>> + default y
>>>> + help
>>>> + Introduce NUMA (Non Uniform Memory Access) awareness into
>>>> + the slow path of spinlocks.
>>>> +
>>>> + The kernel will try to keep the lock on the same node,
>>>> + thus reducing the number of remote cache misses, while
>>>> + trading some of the short term fairness for better performance.
>>>> +
>>>> + Say N if you want absolute first come first serve fairness.
>>>> +
>>> The patch that I am looking for is to have a separate
>>> numa_queued_spinlock_slowpath() that coexists with
>>> native_queued_spinlock_slowpath() and
>>> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
>>> appropriate one for the system at hand.
> Is this how this selection works today for paravirt?
> I see a PARAVIRT_SPINLOCKS config option, but IIUC you are talking about a different mechanism here.
> Can you, please, elaborate or give me a link to a page that explains that?
>
>> Agreed; and until we have static_call, I think we can abuse the paravirt
>> stuff for this.
>>
>> By the time we patch the paravirt stuff:
>>
>> check_bugs()
>> alternative_instructions()
>> apply_paravirt()
>>
>> we should already have enumerated the NODE topology and so nr_node_ids()
>> should be set.
>>
>> So if we frob pv_ops.lock.queued_spin_lock_slowpath to
>> numa_queued_spin_lock_slowpath before that, it should all get patched
>> just right.
>>
>> That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
>> PARAVIRT_SPINLOCK, which is a bit awkward…
> Just to mention here, the patch so far does not address paravirt, but our goal is to add this support once we address all the concerns for the native version.
> So we will end up with four variants for the queued_spinlock_slowpath() — one for each combination of native/paravirt and NUMA/non-NUMA.
> Or perhaps we do not need a NUMA/paravirt variant?

I don't expect we need a numa variant for paravirt. First of all, the
NUMA information available in a VM guest is unreliable. So let's just not
go there. What we are looking for is to make sure your patch won't break
the paravirt code. So testing the paravirt qspinlock should be part of
your testing matrix.

Cheers,
Longman



2019-04-03 15:55:03

by Alex Kogan

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock


> On Apr 1, 2019, at 5:33 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Apr 01, 2019 at 11:06:53AM +0200, Peter Zijlstra wrote:
>> On Fri, Mar 29, 2019 at 11:20:04AM -0400, Alex Kogan wrote:
>>> diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
>>> index bc6d3244e1af..71ee4b64c5d4 100644
>>> --- a/kernel/locking/mcs_spinlock.h
>>> +++ b/kernel/locking/mcs_spinlock.h
>>> @@ -17,8 +17,18 @@
>>>
>>> struct mcs_spinlock {
>>> struct mcs_spinlock *next;
>>> +#ifndef CONFIG_NUMA_AWARE_SPINLOCKS
>>> int locked; /* 1 if lock acquired */
>>> int count; /* nesting count, see qspinlock.c */
>>> +#else /* CONFIG_NUMA_AWARE_SPINLOCKS */
>>> + uintptr_t locked; /* 1 if lock acquired, 0 if not, other values */
>>> + /* represent a pointer to the secondary queue head */
>>> + u32 node_and_count; /* node id on which this thread is running */
>>> + /* with two lower bits reserved for nesting */
>>> + /* count, see qspinlock.c */
>>> + u32 encoded_tail; /* encoding of this node as the main queue tail */
>>> + struct mcs_spinlock *tail; /* points to the secondary queue tail */
>>> +#endif /* CONFIG_NUMA_AWARE_SPINLOCKS */
>>> };
>>
>> Please, have another look at the paravirt code, in particular at struct
>> pv_node and its usage. This is horrible.
>
> Thing is, this turns into a right mess when you also have PV spinlocks
> on.
>
> One thing we could maybe do is change locked and count to u8, then your
> overlay structure could be something like:
>
> struct mcs_spinlock {
> struct mcs_spinlock *next;
> u8 locked;
> u8 count;
> };
I was trying to keep the size of the mcs_spinlock structure for the non-NUMA variant unchanged.
If this is not a huge concern, changing the fields as above would indeed simplify a few things.

— Alex

>
> struct cna_node {
> /* struct mcs_spinlock overlay */
> struct mcs_spinlock *next;
> u8 locked;
> u8 count;
>
> /* our CNA bits, consuming the slack and PV space */
> u16 node;
> u32 encoded_tail;
> struct mcs_spinlock *head;
> struct mcs_spinlock *tail;
> };
>
> Then you also don't need the first two patches.

2019-04-03 16:02:48

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Wed, Apr 03, 2019 at 11:39:09AM -0400, Alex Kogan wrote:

> >> The patch that I am looking for is to have a separate
> >> numa_queued_spinlock_slowpath() that coexists with
> >> native_queued_spinlock_slowpath() and
> >> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
> >> appropriate one for the system at hand.
> Is this how this selection works today for paravirt?
> I see a PARAVIRT_SPINLOCKS config option, but IIUC you are talking about a different mechanism here.
> Can you, please, elaborate or give me a link to a page that explains that?

Oh man, you ask us to explain how paravirt patching works... that's
magic :-)

Basically, the compiler will emit a bunch of indirect calls to the
various pv_ops.*.* functions.

Then, at alternative_instructions() <- apply_paravirt() it will rewrite
all these indirect calls to direct calls to the function pointers that
are in the pv_ops structure at that time (+- more magic).

So we initialize the pv_ops.lock.* methods to the normal
native_queued_spin*() stuff; if the KVM/Xen/whatever setup detects pv
spinlock support, it changes the methods to the paravirt_queued_*() stuff.

If you want more details, you'll just have to read
arch/x86/include/asm/paravirt*.h and arch/x86/kernel/paravirt*.c, I
don't think there's a coherent writeup of all that.

> > Agreed; and until we have static_call, I think we can abuse the paravirt
> > stuff for this.
> >
> > By the time we patch the paravirt stuff:
> >
> > check_bugs()
> > alternative_instructions()
> > apply_paravirt()
> >
> > we should already have enumerated the NODE topology and so nr_node_ids()
> > should be set.
> >
> > So if we frob pv_ops.lock.queued_spin_lock_slowpath to
> > numa_queued_spin_lock_slowpath before that, it should all get patched
> > just right.
> >
> > That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
> > PARAVIRT_SPINLOCK, which is a bit awkward…

> Just to mention here, the patch so far does not address paravirt, but
> our goal is to add this support once we address all the concerns for
> the native version. So we will end up with four variants for the
> queued_spinlock_slowpath() — one for each combination of
> native/paravirt and NUMA/non-NUMA. Or perhaps we do not need a
> NUMA/paravirt variant?

I wouldn't bother with a pv version of the numa aware code at all. If
you have overcommitted guests, topology is likely irrelevant anyway. If
you have 1:1 pinned guests, they'll not use pv spinlocks anyway.

So keep it to tertiary choice:

- native
- native/numa
- paravirt

2019-04-03 16:11:10

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Wed, Apr 03, 2019 at 11:53:53AM -0400, Alex Kogan wrote:

> > One thing we could maybe do is change locked and count to u8, then your
> > overlay structure could be something like:
> >
> > struct mcs_spinlock {
> > struct mcs_spinlock *next;
> > u8 locked;
> > u8 count;
> > };
> I was trying to keep the size of the mcs_spinlock structure for the non-NUMA variant unchanged.
> If this is not a huge concern, changing the fields as above would indeed simplify a few things.

Well, sizeof(struct mcs_spinlock) is unchanged by the above proposal
(for x86_64).

And I don't think it matters for x86, which is very good at byte
accesses, my only concern would be for other architectures that might
not be as good at byte accesses. For instance Alpha <EV56 would generate
shit code, but then, Alpha isn't using qspinlock anyway.


2019-04-03 16:35:02

by Waiman Long

Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 04/02/2019 05:43 AM, Peter Zijlstra wrote:
> On Mon, Apr 01, 2019 at 10:36:19AM -0400, Waiman Long wrote:
>> On 03/29/2019 11:20 AM, Alex Kogan wrote:
>>> +config NUMA_AWARE_SPINLOCKS
>>> + bool "Numa-aware spinlocks"
>>> + depends on NUMA
>>> + default y
>>> + help
>>> + Introduce NUMA (Non Uniform Memory Access) awareness into
>>> + the slow path of spinlocks.
>>> +
>>> + The kernel will try to keep the lock on the same node,
>>> + thus reducing the number of remote cache misses, while
>>> + trading some of the short term fairness for better performance.
>>> +
>>> + Say N if you want absolute first come first serve fairness.
>>> +
>> The patch that I am looking for is to have a separate
>> numa_queued_spinlock_slowpath() that coexists with
>> native_queued_spinlock_slowpath() and
>> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
>> appropriate one for the system at hand.
> Agreed; and until we have static_call, I think we can abuse the paravirt
> stuff for this.

I haven't checked Josh's patch to see what it is doing. The availability
of static_call will certainly make things easier for this case.

> By the time we patch the paravirt stuff:
>
> check_bugs()
> alternative_instructions()
> apply_paravirt()
>
> we should already have enumerated the NODE topology and so nr_node_ids()
> should be set.
>
> So if we frob pv_ops.lock.queued_spin_lock_slowpath to
> numa_queued_spin_lock_slowpath before that, it should all get patched
> just right.
>
> That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
> PARAVIRT_SPINLOCK, which is a bit awkward...

Yes, this is one way of doing it. Another way is to use a static key to
switch between the native and numa versions. So if PARAVIRT_SPINLOCK is
defined, we use the paravirt patching to point to the right function. If
PARAVIRT_SPINLOCK isn't enabled, we can do something like

static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32
val)
{
        if (static_branch_unlikely(&use_numa_spinlock))
                numa_queued_spin_lock_slowpath(lock, val);
        else   
                native_queued_spin_lock_slowpath(lock, val);
}

Alternatively, we can also call numa_queued_spin_lock_slowpath() in
native_queued_spin_lock_slowpath() if we don't want to increase the code
size of spinlock call sites.

Cheers,
Longman


2019-04-03 17:09:15

by Alex Kogan

Subject: Re: [PATCH v2 4/5] locking/qspinlock: Introduce starvation avoidance into CNA


> On Apr 2, 2019, at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 29, 2019 at 11:20:05AM -0400, Alex Kogan wrote:
>> @@ -25,6 +29,18 @@
>>
>> #define MCS_NODE(ptr) ((struct mcs_spinlock *)(ptr))
>>
>> +/* Per-CPU pseudo-random number seed */
>> +static DEFINE_PER_CPU(u32, seed);
>> +
>> +/*
>> + * Controls the probability for intra-node lock hand-off. It can be
>> + * tuned and depend, e.g., on the number of CPUs per node. For now,
>> + * choose a value that provides reasonable long-term fairness without
>> + * sacrificing performance compared to a version that does not have any
>> + * fairness guarantees.
>> + */
>> +#define INTRA_NODE_HANDOFF_PROB_ARG 0x10000
>> +
>> static inline __pure int decode_numa_node(u32 node_and_count)
>> {
>> int node = (node_and_count >> _Q_NODE_OFFSET) - 1;
>> @@ -102,6 +118,35 @@ static struct mcs_spinlock *find_successor(struct mcs_spinlock *me)
>> return NULL;
>> }
>>
>> +/*
>> + * xorshift function for generating pseudo-random numbers:
>> + * https://en.wikipedia.org/wiki/Xorshift
>
> Cute; so clearly you've read that page, but then you provide us a
> variant that isn't actually listed there.
>
> Your naming is also non-standard in that it does not relay the period.
> The type seems to suggest 32bit, so the name should then have been:
>
> xorshift32()
>
> Now, where did you get those parameters from; is this a proper
> xorshift32 ?
Apologies for the confusion.
We are using one of the shift tuples from the Xorshift paper (https://www.jstatsoft.org/v08/i14/paper), which is referenced by the wiki page.
Those tuples happen to be different from the ones on the wiki page itself.
I do not think we care much, though — if we keep using xorshift (see below), we will switch to the variant from the wiki to avoid any confusion.
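
For reference, the wiki's xorshift32 step (Marsaglia's 13/17/5 shift triple) is simply:

/*
 * xorshift32 as on the wiki page. The caller keeps the state, e.g. in a
 * per-CPU seed as the patch does; the seed must start out non-zero.
 */
static inline u32 xorshift32(u32 x)
{
	x ^= x << 13;
	x ^= x >> 17;
	x ^= x << 5;
	return x;
}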

>
> Now, you've read that page and you know there's 'trivial' improvements
> on the pure xorshift, why not pick one of those? xorwow seems cheap
> enough, or that xorshift128plus() one.
Xorshift seems to be working well enough for our purposes, while those other variants are slightly more expensive and have a larger state.

>
>> +
>> +/*
>> + * Return false with probability 1 / @range.
>> + * @range must be a power of 2.
>> + */
>> +static bool probably(unsigned int range)
>> +{
>> + return xor_random() & (range - 1);
>> +}
>
> Uhh, you sure that's what it does? The only way for that to return false
> is when all @range bits are 0, which happens once (2^32/range)-1 times,
> or am I mistaken?

probably() would return 0 if and only if xor_random() returns 0 in the lowest log2(range) bits,
which should happen with probability 1/2^log2(range) = 1/range. For example, with
range = 0x10000 the mask keeps the lowest 16 bits, so the function returns false roughly
once every 65536 calls.

>
> Also, linux/random.h includes next_pseudo_random32(), should we be using
> that? Arguably that's more expensive on a number of platforms due to the
> multiplication.
We will check the impact of using next_pseudo_random32().
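
For reference, next_pseudo_random32() appears to be a single linear
congruential step, which is where the multiplication comes from:

/* As in include/linux/random.h (Numerical Recipes LCG constants). */
static inline u32 next_pseudo_random32(u32 seed)
{
	return seed * 1664525 + 1013904223;
}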

> Also, we actually have xorshift32 already in tree in
> lib/test_hash.c.
I see that now. We can certainly use this function instead.
If we end up doing that, any suggestion on where it should be moved so it can be called from multiple places (the original lib/test_hash.c and our CNA code)?

Thanks,
— Alex

2019-04-03 17:15:21

by Alex Kogan

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock


> On Apr 1, 2019, at 5:09 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Mar 29, 2019 at 11:20:01AM -0400, Alex Kogan wrote:
>> The following locktorture results are from an Oracle X5-4 server
>> (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
>> cores each).
>
> The other interesting number is on a !NUMA machine. What do these
> patches do there? Remember, most people do not in fact have numa.
I will make sure to include numbers from a !NUMA machine in the next revision of the patch.

Regards,
— Alex

2019-04-03 17:17:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Wed, Apr 03, 2019 at 12:33:20PM -0400, Waiman Long wrote:
> static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> {
> 	if (static_branch_unlikely(&use_numa_spinlock))
> 		numa_queued_spin_lock_slowpath(lock, val);
> 	else
> 		native_queued_spin_lock_slowpath(lock, val);
> }

That's horrible for the exact reason you state.

> Alternatively, we can also call numa_queued_spin_lock_slowpath() in
> native_queued_spin_lock_slowpath() if we don't want to increase the code
> size of spinlock call sites.

Yeah, still don't much like that though, we're littering the fast path
of that slow path with all sorts of crap.

2019-04-03 17:41:21

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 04/03/2019 01:16 PM, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 12:33:20PM -0400, Waiman Long wrote:
>> static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> {
>> 	if (static_branch_unlikely(&use_numa_spinlock))
>> 		numa_queued_spin_lock_slowpath(lock, val);
>> 	else
>> 		native_queued_spin_lock_slowpath(lock, val);
>> }
> That's horrible for the exact reason you state.
>
>> Alternatively, we can also call numa_queued_spin_lock_slowpath() in
>> native_queued_spin_lock_slowpath() if we don't want to increase the code
>> size of spinlock call sites.
> Yeah, still don't much like that though, we're littering the fast path
> of that slow path with all sorts of crap.

Yes, I know it is less than ideal, but that is probably the only option
if we don't have static_call or paravirt. On the other hand, I am
perfectly fine with making CNA depend on PARAVIRT_SPINLOCKS for now
until static_call is available.

Cheers,
Longman

2019-04-04 02:05:23

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Hi Alex,

On 2019/3/29 23:20, Alex Kogan wrote:
> +
> +static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
> + u32 tail)
> +{
> + if (decode_numa_node(node->node_and_count) == -1)
> + store_numa_node(node, numa_cpu_node(cpuid));

How about using cpu_to_node() here and #include <linux/topology.h> in this
file, then the code can be reused for other architectures such as ARM64?

Thanks
Hanjun

2019-04-04 03:17:12

by Alex Kogan

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Hi, Hanjun.

> On Apr 3, 2019, at 10:02 PM, Hanjun Guo <[email protected]> wrote:
>
> Hi Alex,
>
> On 2019/3/29 23:20, Alex Kogan wrote:
>> +
>> +static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
>> + u32 tail)
>> +{
>> + if (decode_numa_node(node->node_and_count) == -1)
>> + store_numa_node(node, numa_cpu_node(cpuid));
>
> How about using cpu_to_node() here and #include <linux/topology.h> in this
> file, then the code can be reused for other architectures such as ARM64?
Good point. Thanks!

— Alex

>
> Thanks
> Hanjun
>

2019-04-04 05:06:39

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 03/04/2019 18:01, Peter Zijlstra wrote:
> On Wed, Apr 03, 2019 at 11:39:09AM -0400, Alex Kogan wrote:
>
>>>> The patch that I am looking for is to have a separate
>>>> numa_queued_spinlock_slowpath() that coexists with
>>>> native_queued_spinlock_slowpath() and
>>>> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
>>>> appropriate one for the system at hand.
>> Is this how this selection works today for paravirt?
>> I see a PARAVIRT_SPINLOCKS config option, but IIUC you are talking about a different mechanism here.
>> Can you, please, elaborate or give me a link to a page that explains that?
>
> Oh man, you ask us to explain how paravirt patching works... that's
> magic :-)
>
> Basically, the compiler will emit a bunch of indirect calls to the
> various pv_ops.*.* functions.
>
> Then, at alternative_instructions() <- apply_paravirt() it will rewrite
> all these indirect calls to direct calls to the function pointers that
> are in the pv_ops structure at that time (+- more magic).
>
> So we initialize the pv_ops.lock.* methods to the normal
> native_queued_spin*() stuff; if the KVM/Xen/whatever setup detects pv
> spinlock support, it changes the methods to the paravirt_queued_*() stuff.
>
> If you want more details, you'll just have to read
> arch/x86/include/asm/paravirt*.h and arch/x86/kernel/paravirt*.c, I
> don't think there's a coherent writeup of all that.
>
>>> Agreed; and until we have static_call, I think we can abuse the paravirt
>>> stuff for this.
>>>
>>> By the time we patch the paravirt stuff:
>>>
>>> check_bugs()
>>> alternative_instructions()
>>> apply_paravirt()
>>>
>>> we should already have enumerated the NODE topology and so nr_node_ids()
>>> should be set.
>>>
>>> So if we frob pv_ops.lock.queued_spin_lock_slowpath to
>>> numa_queued_spin_lock_slowpath before that, it should all get patched
>>> just right.
>>>
>>> That of course means the whole NUMA_AWARE_SPINLOCKS thing depends on
>>> PARAVIRT_SPINLOCK, which is a bit awkward…
>
>> Just to mention here, the patch so far does not address paravirt, but
>> our goal is to add this support once we address all the concerns for
>> the native version. So we will end up with four variants for the
>> queued_spinlock_slowpath() — one for each combination of
>> native/paravirt and NUMA/non-NUMA. Or perhaps we do not need a
>> NUMA/paravirt variant?
>
> I wouldn't bother with a pv version of the numa aware code at all. If
> you have overcommitted guests, topology is likely irrelevant anyway. If
> you have 1:1 pinned guests, they'll not use pv spinlocks anyway.
>
> So keep it to tertiary choice:
>
> - native
> - native/numa
> - paravirt

Just for the record: the paravirt variant could easily choose whether
it wants to include a numa version just by using the existing hooks.
With PARAVIRT_SPINLOCK configured I guess even the native case would
need to use the paravirt hooks for selection of native or native/numa.

Without PARAVIRT_SPINLOCK this would be just an alternative() then?

Maybe the resulting code would be much more readable if we'd just
make PARAVIRT_SPINLOCK usable without the other PARAVIRT hooks? So
splitting up PARAVIRT into PARAVIRT_GUEST (timer hooks et al) and
the patching infrastructure, with PARAVIRT_GUEST and PARAVIRT_SPINLOCK
selecting PARAVIRT, and PARAVIRT_XXL selecting PARAVIRT_GUEST.


Juergen

2019-04-04 09:41:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Thu, Apr 04, 2019 at 07:05:24AM +0200, Juergen Gross wrote:

> Without PARAVIRT_SPINLOCK this would be just an alternative() then?

That could maybe work yes. This is all early enough.

> Maybe the resulting code would be much more readable if we'd just
> make PARAVIRT_SPINLOCK usable without the other PARAVIRT hooks? So
> splitting up PARAVIRT into PARAVIRT_GUEST (timer hooks et al) and
> the patching infrastructure, with PARAVIRT_GUEST and PARAVIRT_SPINLOCK
> selecting PARAVIRT, and PARAVIRT_XXL selecting PARAVIRT_GUEST.

Well, ideally we'll get static_call() sorted and use that.

2019-04-04 18:05:42

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 04/04/2019 05:38 AM, Peter Zijlstra wrote:
> On Thu, Apr 04, 2019 at 07:05:24AM +0200, Juergen Gross wrote:
>
>> Without PARAVIRT_SPINLOCK this would be just an alternative() then?
> That could maybe work yes. This is all early enough.

Yes, alternative() should work as it is done before SMP boot. The
slowpath should never be executed before SMP boot.

Cheers,
Longman

2019-06-04 23:24:23

by Alex Kogan

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Hi, Peter, Longman,

> On Apr 3, 2019, at 12:01 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Apr 03, 2019 at 11:39:09AM -0400, Alex Kogan wrote:
>
>>>> The patch that I am looking for is to have a separate
>>>> numa_queued_spinlock_slowpath() that coexists with
>>>> native_queued_spinlock_slowpath() and
>>>> paravirt_queued_spinlock_slowpath(). At boot time, we select the most
>>>> appropriate one for the system at hand.
>> Is this how this selection works today for paravirt?
>> I see a PARAVIRT_SPINLOCKS config option, but IIUC you are talking about a different mechanism here.
>> Can you, please, elaborate or give me a link to a page that explains that?
>
> Oh man, you ask us to explain how paravirt patching works... that's
> magic :-)
>
> Basically, the compiler will emit a bunch of indirect calls to the
> various pv_ops.*.* functions.
>
> Then, at alternative_instructions() <- apply_paravirt() it will rewrite
> all these indirect calls to direct calls to the function pointers that
> are in the pv_ops structure at that time (+- more magic).
Trying to resume this work, I am looking for concrete steps required to integrate CNA with the paravirt patching.

Looking at alternative_instructions(), I wonder if I need to add another call, something like apply_numa() similar to apply_paravirt(), and do the patch work there.
Or perhaps I should “just" initialize the pv_ops structure with the corresponding numa_queued_spinlock_slowpath() in paravirt.c?

Also, the paravirt code is under arch/x86, while CNA is generic (not x86-specific).
Do you still want to see CNA-related patching residing under arch/x86?

We still need a config option (something like NUMA_AWARE_SPINLOCKS) to enable CNA patching under this config only, correct?

Thanks in advance,
— Alex

2019-06-05 20:41:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Tue, Jun 04, 2019 at 07:21:13PM -0400, Alex Kogan wrote:

> Trying to resume this work, I am looking for concrete steps required
> to integrate CNA with the paravirt patching.
>
> Looking at alternative_instructions(), I wonder if I need to add
> another call, something like apply_numa() similar to apply_paravirt(),
> and do the patch work there. Or perhaps I should “just" initialize
> the pv_ops structure with the corresponding
> numa_queued_spinlock_slowpath() in paravirt.c?

Yeah, just initialize the pv_ops.lock.* thingies to contain the numa
variant before apply_paravirt() happens.
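
A rough sketch of what that could look like (the function name and the
nr_node_ids check are illustrative, not from any posted patch):

/*
 * Illustrative only: point the paravirt op at the CNA slow path before
 * apply_paravirt() rewrites the call sites. numa_queued_spin_lock_slowpath()
 * is assumed to be provided by the CNA code under kernel/locking.
 */
void __init cna_configure_spin_lock_slowpath(void)
{
	if (nr_node_ids > 1)
		pv_ops.lock.queued_spin_lock_slowpath =
					numa_queued_spin_lock_slowpath;
}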

> Also, the paravirt code is under arch/x86, while CNA is generic (not
> x86-specific). Do you still want to see CNA-related patching residing
> under arch/x86?
>
> We still need a config option (something like NUMA_AWARE_SPINLOCKS) to
> enable CNA patching under this config only, correct?

There is the static_call() stuff that could be generic; I posted a new
version of that today (x86 only for now, but IIRC there's arm64 patches
for that around somewhere too).

https://lkml.kernel.org/r/[email protected]

Which would allow something a little like this:


diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index bd5ac6cc37db..01feaf912bd7 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -63,29 +63,7 @@ static inline bool vcpu_is_preempted(long cpu)
#endif

#ifdef CONFIG_PARAVIRT
-DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
-
void native_pv_lock_init(void) __init;
-
-#define virt_spin_lock virt_spin_lock
-static inline bool virt_spin_lock(struct qspinlock *lock)
-{
- if (!static_branch_likely(&virt_spin_lock_key))
- return false;
-
- /*
- * On hypervisors without PARAVIRT_SPINLOCKS support we fall
- * back to a Test-and-Set spinlock, because fair locks have
- * horrible lock 'holder' preemption issues.
- */
-
- do {
- while (atomic_read(&lock->val) != 0)
- cpu_relax();
- } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
-
- return true;
-}
#else
static inline void native_pv_lock_init(void)
{
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5169b8cc35bb..78be9e474e94 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -531,7 +531,7 @@ static void __init kvm_smp_prepare_cpus(unsigned int max_cpus)
{
native_smp_prepare_cpus(max_cpus);
if (kvm_para_has_hint(KVM_HINTS_REALTIME))
- static_branch_disable(&virt_spin_lock_key);
+ static_call_update(queued_spin_lock_slowpath, __queued_spin_lock_slowpath);
}

static void __init kvm_smp_prepare_boot_cpu(void)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 98039d7fb998..ae6d15f84867 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -105,12 +105,10 @@ static unsigned paravirt_patch_jmp(void *insn_buff, const void *target,
}
#endif

-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
-
void __init native_pv_lock_init(void)
{
- if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
- static_branch_disable(&virt_spin_lock_key);
+ if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+ static_call_update(queued_spin_lock_slowpath, __tas_spin_lock_slowpath);
}

unsigned paravirt_patch_default(u8 type, void *insn_buff,
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3776122c87cc..86808127b6e6 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -70,7 +70,7 @@ void xen_init_lock_cpu(int cpu)

if (!xen_pvspin) {
if (cpu == 0)
- static_branch_disable(&virt_spin_lock_key);
+ static_call_update(queued_spin_lock_slowpath, __queued_spin_lock_slowpath);
return;
}

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index fde943d180e0..8ca4dd9db931 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -65,7 +65,9 @@ static __always_inline int queued_spin_trylock(struct qspinlock *lock)
return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
}

-extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+DECLARE_STATIC_CALL(queued_spin_lock_slowpath, __queued_spin_lock_slowpath);

/**
* queued_spin_lock - acquire a queued spinlock
@@ -78,7 +80,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
return;

- queued_spin_lock_slowpath(lock, val);
+ static_call(queued_spin_lock_slowpath, lock, val);
}

#ifndef queued_spin_unlock
@@ -95,13 +97,6 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
}
#endif

-#ifndef virt_spin_lock
-static __always_inline bool virt_spin_lock(struct qspinlock *lock)
-{
- return false;
-}
-#endif
-
/*
* Remapping spinlock architecture specific functions to the corresponding
* queued spinlock functions.
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 2473f10c6956..0e9e61637d56 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -290,6 +290,20 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,

#endif /* _GEN_PV_LOCK_SLOWPATH */

+void __tas_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ /*
+ * On hypervisors without PARAVIRT_SPINLOCKS support we fall
+ * back to a Test-and-Set spinlock, because fair locks have
+ * horrible lock 'holder' preemption issues.
+ */
+
+ do {
+ while (atomic_read(&lock->val) != 0)
+ cpu_relax();
+ } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
+}
+
/**
* queued_spin_lock_slowpath - acquire the queued spinlock
* @lock: Pointer to queued spinlock structure
@@ -311,7 +325,7 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+void __queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
struct mcs_spinlock *prev, *next, *node;
u32 old, tail;
@@ -322,9 +336,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (pv_enabled())
goto pv_queue;

- if (virt_spin_lock(lock))
- return;
-
/*
* Wait for in-progress pending->locked hand-overs with a bounded
* number of spins so that we guarantee forward progress.
@@ -558,7 +569,9 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
*/
__this_cpu_dec(qnodes[0].mcs.count);
}
-EXPORT_SYMBOL(queued_spin_lock_slowpath);
+EXPORT_SYMBOL(__queued_spin_lock_slowpath);
+
+DEFINE_STATIC_CALL(queued_spin_lock_slowpath, __queued_spin_lock_slowpath);

/*
* Generate the paravirt code for queued_spin_unlock_slowpath().

2019-06-06 15:35:06

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 6/6/19 11:21 AM, Alex Kogan wrote:
>>> Also, the paravirt code is under arch/x86, while CNA is generic (not
>>> x86-specific). Do you still want to see CNA-related patching residing
>>> under arch/x86?
>>>
>>> We still need a config option (something like NUMA_AWARE_SPINLOCKS) to
>>> enable CNA patching under this config only, correct?
>> There is the static_call() stuff that could be generic; I posted a new
>> version of that today (x86 only for now, but IIRC there's arm64 patches
>> for that around somewhere too).
> The static_call technique appears to be the more desirable long-term approach, but I think it would be prudent to keep the patches decoupled for now so we can move forward with fewer entanglements.
> So unless anyone objects, we will work on plugging into the existing patching for pv.
> And we will keep that code under arch/x86, but will be open for any suggestion to move it elsewhere.
>
If you mean making the CNA code depend on PARAVIRT_SPINLOCKS for now,
that is fine. The code should be under kernel/locking. You shouldn't put
it somewhere under arch/x86.

-Longman

2019-06-06 18:13:44

by Alex Kogan

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

>> Also, the paravirt code is under arch/x86, while CNA is generic (not
>> x86-specific). Do you still want to see CNA-related patching residing
>> under arch/x86?
>>
>> We still need a config option (something like NUMA_AWARE_SPINLOCKS) to
>> enable CNA patching under this config only, correct?
>
> There is the static_call() stuff that could be generic; I posted a new
> version of that today (x86 only for now, but IIRC there's arm64 patches
> for that around somewhere too).

The static_call technique appears to be the more desirable long-term approach, but I think it would be prudent to keep the patches decoupled for now so we can move forward with fewer entanglements.
So unless anyone objects, we will work on plugging into the existing patching for pv.
And we will keep that code under arch/x86, but will be open for any suggestion to move it elsewhere.

Thanks!
— Alex

2019-06-06 19:39:24

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 6/6/19 11:32 AM, Waiman Long wrote:
> On 6/6/19 11:21 AM, Alex Kogan wrote:
>>>> Also, the paravirt code is under arch/x86, while CNA is generic (not
>>>> x86-specific). Do you still want to see CNA-related patching residing
>>>> under arch/x86?
>>>>
>>>> We still need a config option (something like NUMA_AWARE_SPINLOCKS) to
>>>> enable CNA patching under this config only, correct?
>>> There is the static_call() stuff that could be generic; I posted a new
>>> version of that today (x86 only for now, but IIRC there's arm64 patches
>>> for that around somewhere too).
>> The static_call technique appears to be the more desirable long-term approach, but I think it would be prudent to keep the patches decoupled for now so we can move forward with fewer entanglements.
>> So unless anyone objects, we will work on plugging into the existing patching for pv.
>> And we will keep that code under arch/x86, but will be open for any suggestion to move it elsewhere.
>>
> If you mean making the CNA code depend on PARAVIRT_SPINLOCKS for now,
> that is fine. The code should be under kernel/locking. You shouldn't put
> it somewhere under arch/x86.

I mean the core CNA code should be under kernel/locking. The paravirt-
specific code, however, should be close to the current paravirt code,
which is under arch/x86.

-Longman

2019-06-11 04:23:14

by liwei (GF)

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Hi Alex,

On 2019/3/29 23:20, Alex Kogan wrote:
> In CNA, spinning threads are organized in two queues, a main queue for
> threads running on the same node as the current lock holder, and a
> secondary queue for threads running on other nodes. At the unlock time,
> the lock holder scans the main queue looking for a thread running on
> the same node. If found (call it thread T), all threads in the main queue
> between the current lock holder and T are moved to the end of the
> secondary queue, and the lock is passed to T. If such T is not found, the
> lock is passed to the first node in the secondary queue. Finally, if the
> secondary queue is empty, the lock is passed to the next thread in the
> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>
> Note that this variant of CNA may introduce starvation by continuously
> passing the lock to threads running on the same node. This issue
> will be addressed later in the series.
>
> Enabling CNA is controlled via a new configuration option
> (NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.
>
> Signed-off-by: Alex Kogan <[email protected]>
> Reviewed-by: Steve Sistare <[email protected]>
> ---
> arch/x86/Kconfig | 14 +++
> include/asm-generic/qspinlock_types.h | 13 +++
> kernel/locking/mcs_spinlock.h | 10 ++
> kernel/locking/qspinlock.c | 29 +++++-
> kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
> 5 files changed, 236 insertions(+), 3 deletions(-)
> create mode 100644 kernel/locking/qspinlock_cna.h
>
(SNIP)
> +
> +static __always_inline int get_node_index(struct mcs_spinlock *node)
> +{
> + return decode_count(node->node_and_count++);
When the nesting level is > 4, this won't return an index >= 4 here, and the numa node number
gets changed by mistake. The code then goes down the wrong path instead of taking the following branch.


/*
* 4 nodes are allocated based on the assumption that there will
* not be nested NMIs taking spinlocks. That may not be true in
* some architectures even though the chance of needing more than
* 4 nodes will still be extremely unlikely. When that happens,
* we fall back to spinning on the lock directly without using
* any MCS node. This is not the most elegant solution, but is
* simple enough.
*/
if (unlikely(idx >= MAX_NODES)) {
while (!queued_spin_trylock(lock))
cpu_relax();
goto release;
}

> +}
> +
> +static __always_inline void release_mcs_node(struct mcs_spinlock *node)
> +{
> + __this_cpu_dec(node->node_and_count);
> +}
> +
> +static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
> + u32 tail)
> +{

Thanks,
Wei

2019-06-12 07:56:31

by Alex Kogan

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Hi, Wei.

> On Jun 11, 2019, at 12:22 AM, liwei (GF) <[email protected]> wrote:
>
> Hi Alex,
>
> On 2019/3/29 23:20, Alex Kogan wrote:
>> In CNA, spinning threads are organized in two queues, a main queue for
>> threads running on the same node as the current lock holder, and a
>> secondary queue for threads running on other nodes. At the unlock time,
>> the lock holder scans the main queue looking for a thread running on
>> the same node. If found (call it thread T), all threads in the main queue
>> between the current lock holder and T are moved to the end of the
>> secondary queue, and the lock is passed to T. If such T is not found, the
>> lock is passed to the first node in the secondary queue. Finally, if the
>> secondary queue is empty, the lock is passed to the next thread in the
>> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>>
>> Note that this variant of CNA may introduce starvation by continuously
>> passing the lock to threads running on the same node. This issue
>> will be addressed later in the series.
>>
>> Enabling CNA is controlled via a new configuration option
>> (NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.
>>
>> Signed-off-by: Alex Kogan <[email protected]>
>> Reviewed-by: Steve Sistare <[email protected]>
>> ---
>> arch/x86/Kconfig | 14 +++
>> include/asm-generic/qspinlock_types.h | 13 +++
>> kernel/locking/mcs_spinlock.h | 10 ++
>> kernel/locking/qspinlock.c | 29 +++++-
>> kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
>> 5 files changed, 236 insertions(+), 3 deletions(-)
>> create mode 100644 kernel/locking/qspinlock_cna.h
>>
> (SNIP)
>> +
>> +static __always_inline int get_node_index(struct mcs_spinlock *node)
>> +{
>> + return decode_count(node->node_and_count++);
> When nesting level is > 4, it won't return a index >= 4 here and the numa node number
> is changed by mistake. It will go into a wrong way instead of the following branch.
>
>
> /*
> * 4 nodes are allocated based on the assumption that there will
> * not be nested NMIs taking spinlocks. That may not be true in
> * some architectures even though the chance of needing more than
> * 4 nodes will still be extremely unlikely. When that happens,
> * we fall back to spinning on the lock directly without using
> * any MCS node. This is not the most elegant solution, but is
> * simple enough.
> */
> if (unlikely(idx >= MAX_NODES)) {
> while (!queued_spin_trylock(lock))
> cpu_relax();
> goto release;
> }
Good point.
This patch does not handle count overflows gracefully.
It can be easily fixed by allocating more bits for the count — we don’t really need 30 bits for #NUMA nodes.

However, I am working on a new revision of the patch, in which the cna node encapsulates the mcs node (following Peter’s suggestion and similarly to pv_node).
With that approach, this issue is gone.
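
Roughly, the idea is something like the following sketch (field names are
illustrative, not final):

/*
 * Illustrative sketch only; the layout may differ in the next revision.
 * The mcs node is embedded as the first field so that a struct
 * mcs_spinlock pointer can be cast to a struct cna_node, the same trick
 * the paravirt code plays with pv_node.
 */
struct cna_node {
	struct mcs_spinlock	mcs;		/* must be the first field */
	int			numa_node;	/* node of the queueing CPU */
	u32			encoded_tail;	/* tail encoding of this node */
};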

Best regards,
— Alex



2019-06-12 18:04:18

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On 6/12/19 12:38 AM, Alex Kogan wrote:
> Hi, Wei.
>
>> On Jun 11, 2019, at 12:22 AM, liwei (GF) <[email protected]> wrote:
>>
>> Hi Alex,
>>
>> On 2019/3/29 23:20, Alex Kogan wrote:
>>> In CNA, spinning threads are organized in two queues, a main queue for
>>> threads running on the same node as the current lock holder, and a
>>> secondary queue for threads running on other nodes. At the unlock time,
>>> the lock holder scans the main queue looking for a thread running on
>>> the same node. If found (call it thread T), all threads in the main queue
>>> between the current lock holder and T are moved to the end of the
>>> secondary queue, and the lock is passed to T. If such T is not found, the
>>> lock is passed to the first node in the secondary queue. Finally, if the
>>> secondary queue is empty, the lock is passed to the next thread in the
>>> main queue. For more details, see https://arxiv.org/abs/1810.05600.
>>>
>>> Note that this variant of CNA may introduce starvation by continuously
>>> passing the lock to threads running on the same node. This issue
>>> will be addressed later in the series.
>>>
>>> Enabling CNA is controlled via a new configuration option
>>> (NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.
>>>
>>> Signed-off-by: Alex Kogan <[email protected]>
>>> Reviewed-by: Steve Sistare <[email protected]>
>>> ---
>>> arch/x86/Kconfig | 14 +++
>>> include/asm-generic/qspinlock_types.h | 13 +++
>>> kernel/locking/mcs_spinlock.h | 10 ++
>>> kernel/locking/qspinlock.c | 29 +++++-
>>> kernel/locking/qspinlock_cna.h | 173 ++++++++++++++++++++++++++++++++++
>>> 5 files changed, 236 insertions(+), 3 deletions(-)
>>> create mode 100644 kernel/locking/qspinlock_cna.h
>>>
>> (SNIP)
>>> +
>>> +static __always_inline int get_node_index(struct mcs_spinlock *node)
>>> +{
>>> + return decode_count(node->node_and_count++);
>> When nesting level is > 4, it won't return a index >= 4 here and the numa node number
>> is changed by mistake. It will go into a wrong way instead of the following branch.
>>
>>
>> /*
>> * 4 nodes are allocated based on the assumption that there will
>> * not be nested NMIs taking spinlocks. That may not be true in
>> * some architectures even though the chance of needing more than
>> * 4 nodes will still be extremely unlikely. When that happens,
>> * we fall back to spinning on the lock directly without using
>> * any MCS node. This is not the most elegant solution, but is
>> * simple enough.
>> */
>> if (unlikely(idx >= MAX_NODES)) {
>> while (!queued_spin_trylock(lock))
>> cpu_relax();
>> goto release;
>> }
> Good point.
> This patch does not handle count overflows gracefully.
> It can be easily fixed by allocating more bits for the count — we don’t really need 30 bits for #NUMA nodes.

Actually, the default setting uses 2 bits for 4-level nesting and 14
bits for cpu numbers. That means it can support up to 16k-1 cpus, a
limit that is not likely to be exceeded in the foreseeable future.
qspinlock also supports an additional mode with 21 bits used for cpu
numbers, which can support up to 2M-1 cpus. However, that mode is a
little bit slower. That is why we don't want to use more than 2 bits for
nesting: I have never seen more than 2 levels of nesting used in my
testing, so it is highly unlikely we will ever hit more than 4 levels. I
am not saying that it is impossible, though.
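
For reference, the bit layout behind those numbers is roughly this (see
include/asm-generic/qspinlock_types.h):

#define _Q_LOCKED_OFFSET	0
#define _Q_LOCKED_BITS		8
#define _Q_PENDING_OFFSET	(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
#if CONFIG_NR_CPUS < (1U << 14)
#define _Q_PENDING_BITS		8	/* leaves 2 idx bits + 14 cpu bits */
#else
#define _Q_PENDING_BITS		1	/* leaves 2 idx bits + 21 cpu bits */
#endif
#define _Q_TAIL_IDX_OFFSET	(_Q_PENDING_OFFSET + _Q_PENDING_BITS)
#define _Q_TAIL_IDX_BITS	2	/* up to 4 levels of nesting */
#define _Q_TAIL_CPU_OFFSET	(_Q_TAIL_IDX_OFFSET + _Q_TAIL_IDX_BITS)
#define _Q_TAIL_CPU_BITS	(32 - _Q_TAIL_CPU_OFFSET)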

Cheers,
Longman

2019-07-03 11:59:00

by Jan Glauber

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

Hi Alex,
I've tried this series on arm64 (ThunderX2 with up to SMT=4 and 224 CPUs)
with the borderline testcase of accessing a single file from all
threads. With that
testcase the qspinlock slowpath is the top spot in the kernel.

The results look really promising:

CPUs normal numa-qspinlocks
---------------------------------------------
56 149.41 73.90
224 576.95 290.31

Also frontend-stalls are reduced to 50% and interconnect traffic is
greatly reduced.
Tested-by: Jan Glauber <[email protected]>

--Jan


2019-07-12 08:14:25

by Hanjun Guo

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] Add NUMA-awareness to qspinlock

On 2019/7/3 19:58, Jan Glauber wrote:
> Hi Alex,
> I've tried this series on arm64 (ThunderX2 with up to SMT=4 and 224 CPUs)
> with the borderline testcase of accessing a single file from all
> threads. With that
> testcase the qspinlock slowpath is the top spot in the kernel.
>
> The results look really promising:
>
> CPUs normal numa-qspinlocks
> ---------------------------------------------
> 56 149.41 73.90
> 224 576.95 290.31
>
> Also frontend-stalls are reduced to 50% and interconnect traffic is
> greatly reduced.
> Tested-by: Jan Glauber <[email protected]>

Tested this patchset on Kunpeng920 ARM64 server (96 cores,
4 NUMA nodes), and with the same test case from Jan, I can
see 150%+ boost! (Need to add a patch below [1].)

For real workloads such as Nginx I can see about 10%
performance improvement as well.

Tested-by: Hanjun Guo <[email protected]>

Please cc me for new versions and I'm willing to test it.

Thanks
Hanjun

[1]
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 657bbc5..72c1346 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -792,6 +792,20 @@ config NODES_SHIFT
Specify the maximum number of NUMA Nodes available on the target
system. Increases memory reserved to accommodate various tables.

+config NUMA_AWARE_SPINLOCKS
+ bool "Numa-aware spinlocks"
+ depends on NUMA
+ default y
+ help
+ Introduce NUMA (Non Uniform Memory Access) awareness into
+ the slow path of spinlocks.
+
+ The kernel will try to keep the lock on the same node,
+ thus reducing the number of remote cache misses, while
+ trading some of the short term fairness for better performance.
+
+ Say N if you want absolute first come first serve fairness.
+
config USE_PERCPU_NUMA_NODE_ID
def_bool y
depends on NUMA
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 2994167..be5dd44 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,7 +4,7 @@
#endif

#include <linux/random.h>
-
+#include <linux/topology.h>
/*
* Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
*
@@ -170,7 +170,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node, int cpuid,
u32 tail)
{
if (decode_numa_node(node->node_and_count) == -1)
- store_numa_node(node, numa_cpu_node(cpuid));
+ store_numa_node(node, cpu_to_node(cpuid));
node->encoded_tail = tail;
}