Change from v13:
----------------
Fix regression in stress-ng.
Reported-by: kernel test robot <[email protected]>
The fix is to move common-case stores into the 'partial_order' field
of the queue node from cna_wait_head_or_lock() (where those stores
can cause a cache miss during lock handover) to cna_init_node().
Summary
-------
Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
CNA is a NUMA-aware version of the MCS lock. Spinning threads are
organized in two queues, a primary queue for threads running on the same
node as the current lock holder, and a secondary queue for threads
running on other nodes. Threads store the ID of the node on which
they are running in their queue nodes. After acquiring the MCS lock and
before acquiring the spinlock, the MCS lock holder checks whether the next
waiter in the primary queue (if exists) is running on the same NUMA node.
If it is not, that waiter is detached from the main queue and moved into
the tail of the secondary queue. This way, we gradually filter the primary
queue, leaving only waiters running on the same preferred NUMA node. Note
that certain priortized waiters (e.g., in irq and nmi contexts) are
excluded from being moved to the secondary queue. We change the NUMA node
preference after a waiter at the head of the secondary queue spins for a
certain amount of time. We do that by flushing the secondary queue into
the head of the primary queue, effectively changing the preference to the
NUMA node of the waiter at the head of the secondary queue at the time of
the flush.
More details are available at https://arxiv.org/abs/1810.05600.
We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 25 runs) of the
total number of ops (x10^7) reported at the end of each run. The
standard deviation is also reported in (), and in general is about 3%
from the average. The 'stock' kernel is v5.12.0-rc5,
commit 99d4e8b4e609b, compiled in the default configuration.
'CNA' is the modified kernel with NUMA_AWARE_SPINLOCKS set;
The speedup is calculated by dividing the result of 'CNA' by the result
achieved with 'stock'.
#thr stock CNA / speedup
1 2.675 (0.088) 2.707 (0.100) / 1.012
2 2.781 (0.172) 2.790 (0.148) / 1.003
4 4.314 (0.115) 4.403 (0.205) / 1.021
8 5.085 (0.140) 7.292 (0.210) / 1.434
16 5.846 (0.114) 9.340 (0.186) / 1.598
32 6.282 (0.127) 10.074 (0.197) / 1.603
36 6.382 (0.144) 10.243 (0.198) / 1.605
72 6.175 (0.086) 10.482 (0.179) / 1.698
108 6.037 (0.073) 10.354 (0.164) / 1.715
142 5.796 (0.072) 10.275 (0.193) / 1.773
The following tables contain throughput results (ops/us) from the same
setup for will-it-scale/open1_threads:
#thr stock CNA / speedup
1 0.515 (0.002) 0.514 (0.001) / 0.999
2 0.774 (0.018) 0.780 (0.019) / 1.009
4 1.408 (0.030) 1.421 (0.034) / 1.009
8 1.785 (0.082) 1.724 (0.115) / 0.965
16 1.878 (0.118) 1.775 (0.114) / 0.945
32 0.914 (0.072) 1.755 (0.105) / 1.920
36 0.877 (0.064) 1.738 (0.100) / 1.981
72 0.810 (0.047) 1.658 (0.086) / 2.047
108 0.835 (0.036) 1.756 (0.091) / 2.103
142 0.809 (0.038) 1.766 (0.069) / 2.184
and will-it-scale/lock2_threads:
#thr stock CNA / speedup
1 1.605 (0.003) 1.606 (0.005) / 1.000
2 2.814 (0.077) 2.793 (0.068) / 0.992
4 5.318 (0.352) 5.463 (0.357) / 1.027
8 4.241 (0.265) 4.025 (0.366) / 0.949
16 4.125 (0.124) 3.914 (0.159) / 0.949
32 2.458 (0.113) 3.950 (0.131) / 1.607
36 2.457 (0.075) 3.924 (0.080) / 1.597
72 1.836 (0.102) 3.909 (0.104) / 2.129
108 1.868 (0.098) 3.871 (0.106) / 2.072
142 1.780 (0.124) 3.894 (0.089) / 2.188
Our evaluation shows that CNA also improves performance of user
applications that have hot pthread mutexes. Those mutexes are
blocking, and waiting threads park and unpark via the futex
mechanism in the kernel. Given that kernel futex chains, which
are hashed by the mutex address, are each protected by a
chain-specific spin lock, the contention on a user-mode mutex
translates into contention on a kernel level spinlock.
Here are the throughput results (ops/us) for the leveldb ‘readrandom’
benchmark:
#thr stock CNA / speedup
1 0.493 (0.034) 0.520 (0.026) / 1.053
2 0.796 (0.046) 0.838 (0.033) / 1.053
4 1.172 (0.076) 1.211 (0.043) / 1.033
8 1.215 (0.096) 1.207 (0.116) / 0.993
16 0.986 (0.136) 1.070 (0.129) / 1.085
32 0.756 (0.036) 1.148 (0.021) / 1.519
36 0.701 (0.032) 1.147 (0.017) / 1.636
72 0.625 (0.016) 1.149 (0.024) / 1.839
108 0.612 (0.015) 1.146 (0.014) / 1.872
142 0.606 (0.013) 1.131 (0.025) / 1.867
Further comments are welcome and appreciated.
Alex Kogan (6):
locking/qspinlock: Rename mcs lock/unlock macros and make them more
generic
locking/qspinlock: Refactor the qspinlock slow path
locking/qspinlock: Introduce CNA into the slow path of qspinlock
locking/qspinlock: Introduce starvation avoidance into CNA
locking/qspinlock: Avoid moving certain threads between waiting queues
in CNA
locking/qspinlock: Introduce the shuffle reduction optimization into
CNA
.../admin-guide/kernel-parameters.txt | 19 +
arch/arm/include/asm/mcs_spinlock.h | 6 +-
arch/x86/Kconfig | 20 +
arch/x86/include/asm/qspinlock.h | 4 +
arch/x86/kernel/alternative.c | 4 +
include/asm-generic/mcs_spinlock.h | 4 +-
kernel/locking/mcs_spinlock.h | 20 +-
kernel/locking/qspinlock.c | 82 +++-
kernel/locking/qspinlock_cna.h | 455 ++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 2 +-
10 files changed, 593 insertions(+), 23 deletions(-)
create mode 100644 kernel/locking/qspinlock_cna.h
--
2.24.3 (Apple Git-128)
The mcs unlock macro (arch_mcs_lock_handoff) should accept the value to be
stored into the lock argument as another argument. This allows using the
same macro in cases where the value to be stored when passing the lock is
different from 1.
Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
arch/arm/include/asm/mcs_spinlock.h | 6 +++---
include/asm-generic/mcs_spinlock.h | 4 ++--
kernel/locking/mcs_spinlock.h | 18 +++++++++---------
kernel/locking/qspinlock.c | 4 ++--
kernel/locking/qspinlock_paravirt.h | 2 +-
5 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
index 529d2cf4d06f..1eb4d733459c 100644
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ b/arch/arm/include/asm/mcs_spinlock.h
@@ -6,7 +6,7 @@
#include <asm/spinlock.h>
/* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock) \
+#define arch_mcs_spin_wait(lock) \
do { \
/* Ensure prior stores are observed before we enter wfe. */ \
smp_mb(); \
@@ -14,9 +14,9 @@ do { \
wfe(); \
} while (0) \
-#define arch_mcs_spin_unlock_contended(lock) \
+#define arch_mcs_lock_handoff(lock, val) \
do { \
- smp_store_release(lock, 1); \
+ smp_store_release((lock), (val)); \
dsb_sev(); \
} while (0)
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
index 10cd4ffc6ba2..f933d99c63e0 100644
--- a/include/asm-generic/mcs_spinlock.h
+++ b/include/asm-generic/mcs_spinlock.h
@@ -4,8 +4,8 @@
/*
* Architectures can define their own:
*
- * arch_mcs_spin_lock_contended(l)
- * arch_mcs_spin_unlock_contended(l)
+ * arch_mcs_spin_wait(l)
+ * arch_mcs_lock_handoff(l, val)
*
* See kernel/locking/mcs_spinlock.c.
*/
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 5e10153b4d3c..904ba5d0f3f4 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -21,7 +21,7 @@ struct mcs_spinlock {
int count; /* nesting count, see qspinlock.c */
};
-#ifndef arch_mcs_spin_lock_contended
+#ifndef arch_mcs_spin_wait
/*
* Using smp_cond_load_acquire() provides the acquire semantics
* required so that subsequent operations happen after the
@@ -29,20 +29,20 @@ struct mcs_spinlock {
* ARM64 would like to do spin-waiting instead of purely
* spinning, and smp_cond_load_acquire() provides that behavior.
*/
-#define arch_mcs_spin_lock_contended(l) \
-do { \
- smp_cond_load_acquire(l, VAL); \
+#define arch_mcs_spin_wait(l) \
+do { \
+ smp_cond_load_acquire(l, VAL); \
} while (0)
#endif
-#ifndef arch_mcs_spin_unlock_contended
+#ifndef arch_mcs_lock_handoff
/*
* smp_store_release() provides a memory barrier to ensure all
* operations in the critical section has been completed before
* unlocking.
*/
-#define arch_mcs_spin_unlock_contended(l) \
- smp_store_release((l), 1)
+#define arch_mcs_lock_handoff(l, val) \
+ smp_store_release((l), (val))
#endif
/*
@@ -91,7 +91,7 @@ void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
WRITE_ONCE(prev->next, node);
/* Wait until the lock holder passes the lock down. */
- arch_mcs_spin_lock_contended(&node->locked);
+ arch_mcs_spin_wait(&node->locked);
}
/*
@@ -115,7 +115,7 @@ void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
}
/* Pass lock to next waiter. */
- arch_mcs_spin_unlock_contended(&next->locked);
+ arch_mcs_lock_handoff(&next->locked, 1);
}
#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cbff6ba53d56..435d696f9250 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -471,7 +471,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
- arch_mcs_spin_lock_contended(&node->locked);
+ arch_mcs_spin_wait(&node->locked);
/*
* While waiting for the MCS lock, the next pointer may have
@@ -550,7 +550,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
- arch_mcs_spin_unlock_contended(&next->locked);
+ arch_mcs_lock_handoff(&next->locked, 1);
pv_kick_node(lock, next);
release:
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..619d80fd5ea8 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -368,7 +368,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
*
* Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
*
- * The write to next->locked in arch_mcs_spin_unlock_contended()
+ * The write to next->locked in arch_mcs_lock_handoff()
* must be ordered before the read of pn->state in the cmpxchg()
* below for the code to work correctly. To guarantee full ordering
* irrespective of the success or failure of the cmpxchg(),
--
2.24.3 (Apple Git-128)
Move some of the code manipulating the spin lock into separate functions.
This would allow easier integration of alternative ways to manipulate
that lock.
Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock.c | 38 ++++++++++++++++++++++++++++++++++++--
1 file changed, 36 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 435d696f9250..e3518709ffdc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -289,6 +289,34 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
#endif
+/*
+ * __try_clear_tail - try to clear tail by setting the lock value to
+ * _Q_LOCKED_VAL.
+ * @lock: Pointer to the queued spinlock structure
+ * @val: Current value of the lock
+ * @node: Pointer to the MCS node of the lock holder
+ */
+static __always_inline bool __try_clear_tail(struct qspinlock *lock,
+ u32 val,
+ struct mcs_spinlock *node)
+{
+ return atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL);
+}
+
+/*
+ * __mcs_lock_handoff - pass the MCS lock to the next waiter
+ * @node: Pointer to the MCS node of the lock holder
+ * @next: Pointer to the MCS node of the first waiter in the MCS queue
+ */
+static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
+ struct mcs_spinlock *next)
+{
+ arch_mcs_lock_handoff(&next->locked, 1);
+}
+
+#define try_clear_tail __try_clear_tail
+#define mcs_lock_handoff __mcs_lock_handoff
+
#endif /* _GEN_PV_LOCK_SLOWPATH */
/**
@@ -533,7 +561,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
* PENDING will make the uncontended transition fail.
*/
if ((val & _Q_TAIL_MASK) == tail) {
- if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
+ if (try_clear_tail(lock, val, node))
goto release; /* No contention */
}
@@ -550,7 +578,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
- arch_mcs_lock_handoff(&next->locked, 1);
+ mcs_lock_handoff(node, next);
pv_kick_node(lock, next);
release:
@@ -575,6 +603,12 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_kick_node
#undef pv_wait_head_or_lock
+#undef try_clear_tail
+#define try_clear_tail __try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff __mcs_lock_handoff
+
#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
--
2.24.3 (Apple Git-128)
This performance optimization chooses probabilistically to avoid moving
threads from the main queue into the secondary one when the secondary queue
is empty.
It is helpful when the lock is only lightly contended. In particular, it
makes CNA less eager to create a secondary queue, but does not introduce
any extra delays for threads waiting in that queue once it is created.
Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 29c3abbd3d94..983c6a47a221 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -5,6 +5,7 @@
#include <linux/topology.h>
#include <linux/sched/rt.h>
+#include <linux/random.h>
/*
* Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
return current_time - threshold > 0;
}
+/*
+ * Controls the probability for enabling the ordering of the main queue
+ * when the secondary queue is empty. The chosen value reduces the amount
+ * of unnecessary shuffling of threads between the two waiting queues
+ * when the contention is low, while responding fast enough and enabling
+ * the shuffling when the contention is high.
+ */
+#define SHUFFLE_REDUCTION_PROB_ARG (7)
+
+/* Per-CPU pseudo-random number seed */
+static DEFINE_PER_CPU(u32, seed);
+
+/*
+ * Return false with probability 1 / 2^@num_bits.
+ * Intuitively, the larger @num_bits the less likely false is to be returned.
+ * @num_bits must be a number between 0 and 31.
+ */
+static bool probably(unsigned int num_bits)
+{
+ u32 s;
+
+ s = this_cpu_read(seed);
+ s = next_pseudo_random32(s);
+ this_cpu_write(seed, s);
+
+ return s & ((1 << num_bits) - 1);
+}
+
static void __init cna_init_nodes_per_cpu(unsigned int cpu)
{
struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
{
struct cna_node *cn = (struct cna_node *)node;
+ if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
+ /*
+ * When the secondary queue is empty, skip the call to
+ * cna_order_queue() below with high probability. This optimization
+ * reduces the overhead of unnecessary shuffling of threads
+ * between waiting queues when the lock is only lightly contended.
+ */
+ return 0;
+ }
+
if (!cn->start_time || !intra_node_threshold_reached(cn)) {
/*
* We are at the head of the wait queue, no need to use
--
2.24.3 (Apple Git-128)
Prohibit moving certain threads (e.g., in irq and nmi contexts)
to the secondary queue. Those prioritized threads will always stay
in the primary queue, and so will have a shorter wait time for the lock.
Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
kernel/locking/qspinlock_cna.h | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 0513360c11fe..29c3abbd3d94 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -4,6 +4,7 @@
#endif
#include <linux/topology.h>
+#include <linux/sched/rt.h>
/*
* Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
@@ -35,7 +36,8 @@
* running on the same NUMA node. If it is not, that waiter is detached from the
* main queue and moved into the tail of the secondary queue. This way, we
* gradually filter the primary queue, leaving only waiters running on the same
- * preferred NUMA node.
+ * preferred NUMA node. Note that certain priortized waiters (e.g., in
+ * irq and nmi contexts) are excluded from being moved to the secondary queue.
*
* We change the NUMA node preference after a waiter at the head of the
* secondary queue spins for a certain amount of time (10ms, by default).
@@ -49,6 +51,8 @@
* Dave Dice <[email protected]>
*/
+#define CNA_PRIORITY_NODE 0xffff
+
struct cna_node {
struct mcs_spinlock mcs;
u16 numa_node;
@@ -121,9 +125,10 @@ static int __init cna_init_nodes(void)
static __always_inline void cna_init_node(struct mcs_spinlock *node)
{
+ bool priority = !in_task() || irqs_disabled() || rt_task(current);
struct cna_node *cn = (struct cna_node *)node;
- cn->numa_node = cn->real_numa_node;
+ cn->numa_node = priority ? CNA_PRIORITY_NODE : cn->real_numa_node;
cn->partial_order = LOCAL_WAITER_FOUND;
cn->start_time = 0;
}
@@ -266,11 +271,13 @@ static void cna_order_queue(struct mcs_spinlock *node)
next_numa_node = ((struct cna_node *)next)->numa_node;
if (next_numa_node != numa_node) {
- struct mcs_spinlock *nnext = READ_ONCE(next->next);
+ if (next_numa_node != CNA_PRIORITY_NODE) {
+ struct mcs_spinlock *nnext = READ_ONCE(next->next);
- if (nnext) {
- cna_splice_next(node, next, nnext);
- next = nnext;
+ if (nnext) {
+ cna_splice_next(node, next, nnext);
+ next = nnext;
+ }
}
/*
* Inherit NUMA node id of primary queue, to maintain the
@@ -287,6 +294,13 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
struct cna_node *cn = (struct cna_node *)node;
if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+ /*
+ * We are at the head of the wait queue, no need to use
+ * the fake NUMA node ID.
+ */
+ if (cn->numa_node == CNA_PRIORITY_NODE)
+ cn->numa_node = cn->real_numa_node;
+
/*
* Try and put the time otherwise spent spin waiting on
* _Q_LOCKED_PENDING_MASK to use by sorting our lists.
--
2.24.3 (Apple Git-128)
On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
> This performance optimization chooses probabilistically to avoid moving
> threads from the main queue into the secondary one when the secondary queue
> is empty.
>
> It is helpful when the lock is only lightly contended. In particular, it
> makes CNA less eager to create a secondary queue, but does not introduce
> any extra delays for threads waiting in that queue once it is created.
>
> Signed-off-by: Alex Kogan <[email protected]>
> Reviewed-by: Steve Sistare <[email protected]>
> Reviewed-by: Waiman Long <[email protected]>
> ---
> kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
> 1 file changed, 39 insertions(+)
>
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 29c3abbd3d94..983c6a47a221 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -5,6 +5,7 @@
>
> #include <linux/topology.h>
> #include <linux/sched/rt.h>
> +#include <linux/random.h>
>
> /*
> * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
> return current_time - threshold > 0;
> }
>
> +/*
> + * Controls the probability for enabling the ordering of the main queue
> + * when the secondary queue is empty. The chosen value reduces the amount
> + * of unnecessary shuffling of threads between the two waiting queues
> + * when the contention is low, while responding fast enough and enabling
> + * the shuffling when the contention is high.
> + */
> +#define SHUFFLE_REDUCTION_PROB_ARG (7)
Out of curiosity:
Have you used other values and done measurements what's an efficient
value for SHUFFLE_REDUCTION_PROB_ARG?
Maybe I miscalculated it, but if I understand it correctly this value
implies that the propability is 0.9921875 that below function returns
true.
My question is probably answered by following statement from
referenced paper:
"In our experiments with the shuffle reduction optimization enabled,
we set THRESHOLD2 to 0xff." (page with figure 5)
> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Return false with probability 1 / 2^@num_bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> + u32 s;
> +
> + s = this_cpu_read(seed);
> + s = next_pseudo_random32(s);
> + this_cpu_write(seed, s);
> +
> + return s & ((1 << num_bits) - 1);
> +}
> +
> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> {
> struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> {
> struct cna_node *cn = (struct cna_node *)node;
>
> + if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
Again if I understand it correctly with SHUFFLE_REDUCTION_PROB_ARG==7
it's roughly 1 out of 100 cases where probably() returns false.
Why/when is this beneficial?
I assume it has to do with following statement in the referenced
paper:
"The superior performance over MCS at 4 threads is the result of the
shuffling that does take place once in a while, organizing threads’
arrivals to the lock in a way that reduces the inter-socket lock
migration without the need to continuously modify the main queue. ..."
(page with figure 9; the paper has no page numbers)
But OTHO why this pseudo randomness?
How about deterministically treating every 100th execution differently
(it also matches "once in a while") and thus entirely removing the
pseudo randomness?
Have you tried this? If so why was it worse than pseudo randomness?
(Or maybe I missed something and pseudo randomness is required for
other reasons there.)
> + /*
> + * When the secondary queue is empty, skip the call to
> + * cna_order_queue() below with high probability. This optimization
> + * reduces the overhead of unnecessary shuffling of threads
> + * between waiting queues when the lock is only lightly contended.
> + */
> + return 0;
> + }
> +
> if (!cn->start_time || !intra_node_threshold_reached(cn)) {
> /*
> * We are at the head of the wait queue, no need to use
> --
> 2.24.3 (Apple Git-128)
>
--
Regards,
Andreas
Hi, Andreas.
Thanks for the great questions.
> On Apr 14, 2021, at 3:47 AM, Andreas Herrmann <[email protected]> wrote:
>
> On Thu, Apr 01, 2021 at 11:31:56AM -0400, Alex Kogan wrote:
>> This performance optimization chooses probabilistically to avoid moving
>> threads from the main queue into the secondary one when the secondary queue
>> is empty.
>>
>> It is helpful when the lock is only lightly contended. In particular, it
>> makes CNA less eager to create a secondary queue, but does not introduce
>> any extra delays for threads waiting in that queue once it is created.
>>
>> Signed-off-by: Alex Kogan <[email protected]>
>> Reviewed-by: Steve Sistare <[email protected]>
>> Reviewed-by: Waiman Long <[email protected]>
>> ---
>> kernel/locking/qspinlock_cna.h | 39 ++++++++++++++++++++++++++++++++++
>> 1 file changed, 39 insertions(+)
>>
>> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
>> index 29c3abbd3d94..983c6a47a221 100644
>> --- a/kernel/locking/qspinlock_cna.h
>> +++ b/kernel/locking/qspinlock_cna.h
>> @@ -5,6 +5,7 @@
>>
>> #include <linux/topology.h>
>> #include <linux/sched/rt.h>
>> +#include <linux/random.h>
>>
>> /*
>> * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
>> @@ -86,6 +87,34 @@ static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> return current_time - threshold > 0;
>> }
>>
>> +/*
>> + * Controls the probability for enabling the ordering of the main queue
>> + * when the secondary queue is empty. The chosen value reduces the amount
>> + * of unnecessary shuffling of threads between the two waiting queues
>> + * when the contention is low, while responding fast enough and enabling
>> + * the shuffling when the contention is high.
>> + */
>> +#define SHUFFLE_REDUCTION_PROB_ARG (7)
>
> Out of curiosity:
>
> Have you used other values and done measurements what's an efficient
> value for SHUFFLE_REDUCTION_PROB_ARG?
Yes, we did try other values. Small variations do not change the results much,
but if you bump the value significantly (e.g., 20), you end up with a lock that
hardly does any shuffling and thus performs on-par with the (MCS-based)
baseline.
> Maybe I miscalculated it, but if I understand it correctly this value
> implies that the propability is 0.9921875 that below function returns
> true.
Your analysis is correct. Intuitively, we tried to keep the probability around 1-2%,
so if we do decide to shuffle when we don’t really want to (i.e., when the
contention is low), the overall overhead of such wrong decisions would be small.
>
> My question is probably answered by following statement from
> referenced paper:
>
> "In our experiments with the shuffle reduction optimization enabled,
> we set THRESHOLD2 to 0xff." (page with figure 5)
Yeah, just to avoid any confusion — we used a different mechanism to draw
pseudo-random numbers in the paper, so that 0xff number is not directly
comparable to the range of possible values for SHUFFLE_REDUCTION_PROB_ARG,
but the idea was exactly the same.
>
>> +
>> +/* Per-CPU pseudo-random number seed */
>> +static DEFINE_PER_CPU(u32, seed);
>> +
>> +/*
>> + * Return false with probability 1 / 2^@num_bits.
>> + * Intuitively, the larger @num_bits the less likely false is to be returned.
>> + * @num_bits must be a number between 0 and 31.
>> + */
>> +static bool probably(unsigned int num_bits)
>> +{
>> + u32 s;
>> +
>> + s = this_cpu_read(seed);
>> + s = next_pseudo_random32(s);
>> + this_cpu_write(seed, s);
>> +
>> + return s & ((1 << num_bits) - 1);
>> +}
>> +
>> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
>> {
>> struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
>> @@ -293,6 +322,16 @@ static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> {
>> struct cna_node *cn = (struct cna_node *)node;
>>
>> + if (node->locked <= 1 && probably(SHUFFLE_REDUCTION_PROB_ARG)) {
>
> Again if I understand it correctly with SHUFFLE_REDUCTION_PROB_ARG==7
> it's roughly 1 out of 100 cases where probably() returns false.
>
> Why/when is this beneficial?
>
> I assume it has to do with following statement in the referenced
> paper:
>
> "The superior performance over MCS at 4 threads is the result of the
> shuffling that does take place once in a while, organizing threads’
> arrivals to the lock in a way that reduces the inter-socket lock
> migration without the need to continuously modify the main queue. ..."
> (page with figure 9; the paper has no page numbers)
This is correct. This performance optimization deals with a small regression
that might occur at the low-medium range of contention. And just to reiterate
what the patch comment says, this optimization has nothing to do with correctness
and does not introduce any extra delays for threads already waiting in the
secondary queue.
>
> But OTHO why this pseudo randomness?
>
> How about deterministically treating every 100th execution differently
> (it also matches "once in a while") and thus entirely removing the
> pseudo randomness?
Pseudo-randomness helps to avoid pathological cases where we would do the
same decision, despite having this optimization in place, for every lock acquisition.
Consider an application in which each thread cycles through X locks, e.g., X=10.
If we count deterministically, then for the first 9 locks we will _always avoid doing
any shuffling, missing on any opportunity for the benefit derived by the CNA
approach. At the same time, for lock #10 we will always attempt to shuffle, and so
the optimization would not have any effect.
>
> Have you tried this? If so why was it worse than pseudo randomness?
>
> (Or maybe I missed something and pseudo randomness is required for
> other reasons there.)
Hope this helps — let me know if you have any further questions!
Best,
— Alex