2021-04-01 18:02:28

by Alex Kogan

Subject: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

In CNA, spinning threads are organized in two queues, a primary queue for
threads running on the same NUMA node as the current lock holder, and a
secondary queue for threads running on other nodes. After acquiring the
MCS lock and before acquiring the spinlock, the MCS lock holder checks
whether the next waiter in the primary queue (if one exists) is running
on the same NUMA node. If it is not, that waiter is detached from the
primary queue and moved to the tail of the secondary queue. This way, we
gradually filter the primary queue, leaving only waiters running on the
same preferred NUMA node. For more details, see
https://arxiv.org/abs/1810.05600.
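
In pseudo-C, the filtering step amounts to the following (a simplified
sketch; numa_node_of() and splice_to_secondary_tail() are illustrative
stand-ins, not real kernel helpers -- the actual logic lives in
cna_order_queue() and cna_splice_next() below):

    struct mcs_spinlock *next = READ_ONCE(node->next);

    /* detach a remote waiter and append it to the secondary queue */
    if (next && numa_node_of(next) != numa_node_of(node))
            splice_to_secondary_tail(next);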

Note that this variant of CNA may introduce starvation by continuously
passing the lock between waiters in the primary queue. This issue will be
addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). By default, the CNA variant is patched in at
boot time only if we run on a multi-node machine in a native environment
and the new config is enabled. (For the time being, the patching requires
CONFIG_PARAVIRT_SPINLOCKS to be enabled as well. However, this should be
resolved once static_call() is available.) This default behavior can be
overridden with the new kernel boot command-line option
"numa_spinlock=auto/on/off" (default is "auto").

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
 .../admin-guide/kernel-parameters.txt  |  10 +
 arch/x86/Kconfig                       |  20 ++
 arch/x86/include/asm/qspinlock.h       |   4 +
 arch/x86/kernel/alternative.c          |   4 +
 kernel/locking/mcs_spinlock.h          |   2 +-
 kernel/locking/qspinlock.c             |  42 ++-
 kernel/locking/qspinlock_cna.h         | 330 ++++++++++++++++++
 7 files changed, 407 insertions(+), 5 deletions(-)
create mode 100644 kernel/locking/qspinlock_cna.h

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..ace55afd4441 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3475,6 +3475,16 @@
        numa_balancing= [KNL,X86] Enable or disable automatic NUMA balancing.
                        Allowed values are enable and disable

+       numa_spinlock=  [NUMA, PV_OPS] Select the NUMA-aware variant
+                       of spinlock. The options are:
+                       auto - Enable this variant if running on a
+                              multi-node machine in native environment.
+                       on   - Unconditionally enable this variant.
+                       off  - Unconditionally disable this variant.
+
+                       Not specifying this option is equivalent to
+                       numa_spinlock=auto.
+
        numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
                        'node', 'default' can be specified
                        This can be set from sysctl after boot.
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..c3b47077d7e4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1558,6 +1558,26 @@ config NUMA

          Otherwise, you should say N.

+config NUMA_AWARE_SPINLOCKS
+       bool "Numa-aware spinlocks"
+       depends on NUMA
+       depends on QUEUED_SPINLOCKS
+       depends on 64BIT
+       # For now, we depend on PARAVIRT_SPINLOCKS to make the patching work.
+       # This is awkward, but hopefully will be resolved once static_call()
+       # is available.
+       depends on PARAVIRT_SPINLOCKS
+       default y
+       help
+         Introduce NUMA (Non Uniform Memory Access) awareness into
+         the slow path of spinlocks.
+
+         In this variant of qspinlock, the kernel will try to keep the lock
+         on the same node, thus reducing the number of remote cache misses,
+         while trading some of the short-term fairness for better performance.
+
+         Say N if you want absolute first-come, first-served fairness.
+
config AMD_NUMA
def_bool y
prompt "Old style AMD Opteron NUMA detection"
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d86ab942219c..21d09e8db979 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -27,6 +27,10 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
return val;
}

+#ifdef CONFIG_NUMA_AWARE_SPINLOCKS
+extern void cna_configure_spin_lock_slowpath(void);
+#endif
+
#ifdef CONFIG_PARAVIRT_SPINLOCKS
extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
extern void __pv_init_lock_hash(void);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d..0178ff898d8a 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -741,6 +741,10 @@ void __init alternative_instructions(void)
}
#endif

+#if defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+       cna_configure_spin_lock_slowpath();
+#endif
+
        apply_paravirt(__parainstructions, __parainstructions_end);

        restart_nmi();
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
index 904ba5d0f3f4..5e47ffb3f08b 100644
--- a/kernel/locking/mcs_spinlock.h
+++ b/kernel/locking/mcs_spinlock.h
@@ -17,7 +17,7 @@

struct mcs_spinlock {
        struct mcs_spinlock *next;
-       int locked; /* 1 if lock acquired */
+       unsigned int locked; /* 1 if lock acquired */
        int count;  /* nesting count, see qspinlock.c */
};

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index e3518709ffdc..8c1a21b53913 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,7 +11,7 @@
* Peter Zijlstra <[email protected]>
*/

-#ifndef _GEN_PV_LOCK_SLOWPATH
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && !defined(_GEN_CNA_LOCK_SLOWPATH)

#include <linux/smp.h>
#include <linux/bug.h>
@@ -71,7 +71,8 @@
/*
* On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
* size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
+ * pvqspinlock, however, we need more space for extra data. The same also
+ * applies for the NUMA-aware variant of spinlocks (CNA). To accommodate
* that, we insert two more long words to pad it up to 32 bytes. IOW, only
* two of them can fit in a cacheline in this case. That is OK as it is rare
* to have more than 2 levels of slowpath nesting in actual use. We don't
@@ -80,7 +81,7 @@
*/
struct qnode {
        struct mcs_spinlock mcs;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) || defined(CONFIG_NUMA_AWARE_SPINLOCKS)
        long reserved[2];
 #endif
 };
@@ -104,6 +105,8 @@ struct qnode {
* Exactly fits one 64-byte cacheline on a 64-bit architecture.
*
* PV doubles the storage and uses the second cacheline for PV state.
+ * CNA also doubles the storage and uses the second cacheline for
+ * CNA-specific state.
*/
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);

@@ -317,7 +320,7 @@ static __always_inline void __mcs_lock_handoff(struct mcs_spinlock *node,
#define try_clear_tail __try_clear_tail
#define mcs_lock_handoff __mcs_lock_handoff

-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#endif /* _GEN_PV_LOCK_SLOWPATH && _GEN_CNA_LOCK_SLOWPATH */

/**
* queued_spin_lock_slowpath - acquire the queued spinlock
@@ -589,6 +592,37 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);

+/*
+ * Generate the code for NUMA-aware spinlocks
+ */
+#if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
+#define _GEN_CNA_LOCK_SLOWPATH
+
+#undef pv_init_node
+#define pv_init_node cna_init_node
+
+#undef pv_wait_head_or_lock
+#define pv_wait_head_or_lock cna_wait_head_or_lock
+
+#undef try_clear_tail
+#define try_clear_tail cna_try_clear_tail
+
+#undef mcs_lock_handoff
+#define mcs_lock_handoff cna_lock_handoff
+
+#undef queued_spin_lock_slowpath
+/*
+ * defer defining queued_spin_lock_slowpath until after the include to
+ * avoid a name clash with the identically named field in pv_ops.lock
+ * (see cna_configure_spin_lock_slowpath())
+ */
+#include "qspinlock_cna.h"
+#define queued_spin_lock_slowpath __cna_queued_spin_lock_slowpath
+
+#include "qspinlock.c"
+
+#endif
+
/*
* Generate the paravirt code for queued_spin_unlock_slowpath().
*/
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
new file mode 100644
index 000000000000..d689861a7b3d
--- /dev/null
+++ b/kernel/locking/qspinlock_cna.h
@@ -0,0 +1,330 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _GEN_CNA_LOCK_SLOWPATH
+#error "do not include this file"
+#endif
+
+#include <linux/topology.h>
+
+/*
+ * Implement a NUMA-aware version of MCS (aka CNA, or compact NUMA-aware lock).
+ *
+ * In CNA, spinning threads are organized in two queues, a primary queue for
+ * threads running on the same NUMA node as the current lock holder, and a
+ * secondary queue for threads running on other nodes. Schematically, it
+ * looks like this:
+ *
+ *      cna_node
+ *   +----------+     +--------+         +--------+
+ *   |mcs:next  | --> |mcs:next| --> ... |mcs:next| --> NULL [Primary queue]
+ *   |mcs:locked| -.  +--------+         +--------+
+ *   +----------+  |
+ *                 `----------------------.
+ *                                        v
+ *                    +--------+         +--------+
+ *                    |mcs:next| --> ... |mcs:next|   [Secondary queue]
+ *                    +--------+         +--------+
+ *                        ^                  |
+ *                        `------------------'
+ *
+ * N.B. locked := 1 if secondary queue is absent. Otherwise, it contains the
+ * encoded pointer to the tail of the secondary queue, which is organized as a
+ * circular list.
+ *
+ * After acquiring the MCS lock and before acquiring the spinlock, the MCS lock
+ * holder checks whether the next waiter in the primary queue (if one exists)
+ * is running on the same NUMA node. If it is not, that waiter is detached from
+ * the primary queue and moved to the tail of the secondary queue. This way, we
+ * gradually filter the primary queue, leaving only waiters running on the same
+ * preferred NUMA node.
+ *
+ * For more details, see https://arxiv.org/abs/1810.05600.
+ *
+ * Authors: Alex Kogan <[email protected]>
+ * Dave Dice <[email protected]>
+ */
+
+struct cna_node {
+       struct mcs_spinlock     mcs;
+       u16                     numa_node;
+       u16                     real_numa_node;
+       u32                     encoded_tail;   /* self */
+       u32                     partial_order;  /* enum val */
+};
+
+enum {
+       LOCAL_WAITER_FOUND,
+       LOCAL_WAITER_NOT_FOUND,
+};
+
+static void __init cna_init_nodes_per_cpu(unsigned int cpu)
+{
+       struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
+       int numa_node = cpu_to_node(cpu);
+       int i;
+
+       for (i = 0; i < MAX_NODES; i++) {
+               struct cna_node *cn = (struct cna_node *)grab_mcs_node(base, i);
+
+               cn->real_numa_node = numa_node;
+               cn->encoded_tail = encode_tail(cpu, i);
+               /*
+                * make sure @encoded_tail is not confused with other valid
+                * values for @locked (0 or 1)
+                */
+               WARN_ON(cn->encoded_tail <= 1);
+       }
+}
+
+static int __init cna_init_nodes(void)
+{
+       unsigned int cpu;
+
+       /*
+        * this will break on 32bit architectures, so we restrict
+        * the use of CNA to 64bit only (see arch/x86/Kconfig)
+        */
+       BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+       /* we store an encoded tail word in the node's @locked field */
+       BUILD_BUG_ON(sizeof(u32) > sizeof(unsigned int));
+
+       for_each_possible_cpu(cpu)
+               cna_init_nodes_per_cpu(cpu);
+
+       return 0;
+}
+
+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+       struct cna_node *cn = (struct cna_node *)node;
+
+       cn->numa_node = cn->real_numa_node;
+       cn->partial_order = LOCAL_WAITER_FOUND;
+}
+
+/*
+ * cna_splice_head -- splice the entire secondary queue onto the head of the
+ * primary queue.
+ *
+ * Returns the new primary head node or NULL on failure.
+ */
+static struct mcs_spinlock *
+cna_splice_head(struct qspinlock *lock, u32 val,
+               struct mcs_spinlock *node, struct mcs_spinlock *next)
+{
+       struct mcs_spinlock *head_2nd, *tail_2nd;
+       u32 new;
+
+       tail_2nd = decode_tail(node->locked);
+       head_2nd = tail_2nd->next;
+
+       if (next) {
+               /*
+                * If the primary queue is not empty, the primary tail doesn't
+                * need to change and we can simply link the secondary tail to
+                * the old primary head.
+                */
+               tail_2nd->next = next;
+       } else {
+               /*
+                * When the primary queue is empty, the secondary tail becomes
+                * the primary tail.
+                */
+
+               /*
+                * Speculatively break the secondary queue's circular link such
+                * that when the secondary tail becomes the primary tail it all
+                * works out.
+                */
+               tail_2nd->next = NULL;
+
+               /*
+                * tail_2nd->next = NULL;     old = xchg_tail(lock, tail);
+                *                            prev = decode_tail(old);
+                * try_cmpxchg_release(...);  WRITE_ONCE(prev->next, node);
+                *
+                * If the following cmpxchg() succeeds, our stores will not
+                * collide.
+                */
+               new = ((struct cna_node *)tail_2nd)->encoded_tail |
+                     _Q_LOCKED_VAL;
+               if (!atomic_try_cmpxchg_release(&lock->val, &val, new)) {
+                       /* Restore the secondary queue's circular link. */
+                       tail_2nd->next = head_2nd;
+                       return NULL;
+               }
+       }
+
+       /* The primary queue head now is what was the secondary queue head. */
+       return head_2nd;
+}
+
+static inline bool cna_try_clear_tail(struct qspinlock *lock, u32 val,
+                                      struct mcs_spinlock *node)
+{
+       /*
+        * We're here because the primary queue is empty; check the secondary
+        * queue for remote waiters.
+        */
+       if (node->locked > 1) {
+               struct mcs_spinlock *next;
+
+               /*
+                * When there are waiters on the secondary queue, try to move
+                * them back onto the primary queue and let them rip.
+                */
+               next = cna_splice_head(lock, val, node, NULL);
+               if (next) {
+                       arch_mcs_lock_handoff(&next->locked, 1);
+                       return true;
+               }
+
+               return false;
+       }
+
+       /* Both queues are empty. Do what MCS does. */
+       return __try_clear_tail(lock, val, node);
+}
+
+/*
+ * cna_splice_tail -- splice the next node from the primary queue onto
+ * the secondary queue.
+ */
+static void cna_splice_next(struct mcs_spinlock *node,
+                           struct mcs_spinlock *next,
+                           struct mcs_spinlock *nnext)
+{
+       /* remove 'next' from the main queue */
+       node->next = nnext;
+
+       /* stick `next` on the secondary queue tail */
+       if (node->locked <= 1) { /* if secondary queue is empty */
+               /* create secondary queue */
+               next->next = next;
+       } else {
+               /* add to the tail of the secondary queue */
+               struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
+               struct mcs_spinlock *head_2nd = tail_2nd->next;
+
+               tail_2nd->next = next;
+               next->next = head_2nd;
+       }
+
+       node->locked = ((struct cna_node *)next)->encoded_tail;
+}
+
+/*
+ * cna_order_queue - check whether the next waiter in the main queue is on
+ * the same NUMA node as the lock holder; if not, and it has a waiter behind
+ * it in the main queue, move the former onto the secondary queue.
+ */
+static void cna_order_queue(struct mcs_spinlock *node)
+{
+       struct mcs_spinlock *next = READ_ONCE(node->next);
+       struct cna_node *cn = (struct cna_node *)node;
+       int numa_node, next_numa_node;
+
+       if (!next) {
+               cn->partial_order = LOCAL_WAITER_NOT_FOUND;
+               return;
+       }
+
+       numa_node = cn->numa_node;
+       next_numa_node = ((struct cna_node *)next)->numa_node;
+
+       if (next_numa_node != numa_node) {
+               struct mcs_spinlock *nnext = READ_ONCE(next->next);
+
+               if (nnext) {
+                       cna_splice_next(node, next, nnext);
+                       next = nnext;
+               }
+               /*
+                * Inherit NUMA node id of primary queue, to maintain the
+                * preference even if the next waiter is on a different node.
+                */
+               ((struct cna_node *)next)->numa_node = numa_node;
+       }
+}
+
+/* Abuse the pv_wait_head_or_lock() hook to get some work done */
+static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
+                                                struct mcs_spinlock *node)
+{
+       /*
+        * Try and put the time otherwise spent spin waiting on
+        * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+        */
+       cna_order_queue(node);
+
+       return 0; /* we lied; we didn't wait, go do so now */
+}
+
+static inline void cna_lock_handoff(struct mcs_spinlock *node,
+                                   struct mcs_spinlock *next)
+{
+       struct cna_node *cn = (struct cna_node *)node;
+       u32 val = 1;
+
+       u32 partial_order = cn->partial_order;
+
+       if (partial_order == LOCAL_WAITER_NOT_FOUND)
+               cna_order_queue(node);
+
+       /*
+        * We have a local waiter, either real or fake one;
+        * reload @next in case it was changed by cna_order_queue().
+        */
+       next = node->next;
+       if (node->locked > 1)
+               val = node->locked;     /* preserve secondary queue */
+
+       arch_mcs_lock_handoff(&next->locked, val);
+}
+
+/*
+ * Constant (boot-param configurable) flag selecting the NUMA-aware variant
+ * of spinlock. Possible values: -1 (off) / 0 (auto, default) / 1 (on).
+ */
+static int numa_spinlock_flag;
+
+static int __init numa_spinlock_setup(char *str)
+{
+       if (!strcmp(str, "auto")) {
+               numa_spinlock_flag = 0;
+               return 1;
+       } else if (!strcmp(str, "on")) {
+               numa_spinlock_flag = 1;
+               return 1;
+       } else if (!strcmp(str, "off")) {
+               numa_spinlock_flag = -1;
+               return 1;
+       }
+
+       return 0;
+}
+__setup("numa_spinlock=", numa_spinlock_setup);
+
+void __cna_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+/*
+ * Switch to the NUMA-friendly slow path for spinlocks when we have
+ * multiple NUMA nodes in native environment, unless the user has
+ * overridden this default behavior by setting the numa_spinlock flag.
+ */
+void __init cna_configure_spin_lock_slowpath(void)
+{
+
+       if (numa_spinlock_flag < 0)
+               return;
+
+       if (numa_spinlock_flag == 0 && (nr_node_ids < 2 ||
+                   pv_ops.lock.queued_spin_lock_slowpath !=
+                       native_queued_spin_lock_slowpath))
+               return;
+
+       cna_init_nodes();
+
+       pv_ops.lock.queued_spin_lock_slowpath = __cna_queued_spin_lock_slowpath;
+
+       pr_info("Enabling CNA spinlock\n");
+}
--
2.24.3 (Apple Git-128)


2021-04-13 16:42:20

by Peter Zijlstra

Subject: Re: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

On Thu, Apr 01, 2021 at 11:31:53AM -0400, Alex Kogan wrote:

> +/*
> + * cna_splice_tail -- splice the next node from the primary queue onto
> + * the secondary queue.
> + */
> +static void cna_splice_next(struct mcs_spinlock *node,
> + struct mcs_spinlock *next,
> + struct mcs_spinlock *nnext)

You forgot to update the comment when you changed the name on this
thing.

> +/*
> + * cna_order_queue - check whether the next waiter in the main queue is on
> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
> + * it in the main queue, move the former onto the secondary queue.
> + */
> +static void cna_order_queue(struct mcs_spinlock *node)
> +{
> + struct mcs_spinlock *next = READ_ONCE(node->next);
> + struct cna_node *cn = (struct cna_node *)node;
> + int numa_node, next_numa_node;
> +
> + if (!next) {
> + cn->partial_order = LOCAL_WAITER_NOT_FOUND;
> + return;
> + }
> +
> + numa_node = cn->numa_node;
> + next_numa_node = ((struct cna_node *)next)->numa_node;
> +
> + if (next_numa_node != numa_node) {
> + struct mcs_spinlock *nnext = READ_ONCE(next->next);
> +
> + if (nnext) {
> + cna_splice_next(node, next, nnext);
> + next = nnext;
> + }
> + /*
> + * Inherit NUMA node id of primary queue, to maintain the
> + * preference even if the next waiter is on a different node.
> + */
> + ((struct cna_node *)next)->numa_node = numa_node;
> + }
> +}

So the obvious change since last time I looked at this is that it now
only looks 1 entry ahead. Which makes sense I suppose.

I'm not really a fan of the 'partial_order' name combined with that
silly enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND }. That's just
really bad naming all around. The enum is about having a waiter while
the variable is about partial order, that doesn't match at all.

If you rename the variable to 'has_waiter' and simply use 0,1 values,
things would be ever so much more readable. But I don't think that makes
sense, see below.

I'm also not sure about that whole numa_node thing, why would you
over-write the numa node, why at this point ?

> +
> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> + struct mcs_spinlock *node)
> +{
> + /*
> + * Try and put the time otherwise spent spin waiting on
> + * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> + */
> + cna_order_queue(node);
> +
> + return 0; /* we lied; we didn't wait, go do so now */

So here we inspect one entry ahead and then quit. I can't remember, but
did we try something like:

        /*
         * Try and put the time otherwise spent spin waiting on
         * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
         * Move one entry at a go until either the list is fully
         * sorted or we ran out of spin condition.
         */
        while (READ_ONCE(lock->val) & _Q_LOCKED_PENDING_MASK &&
               node->partial_order)
                cna_order_queue(node);

        return 0;

This will keep moving @next to the remote list until such a time that
we're forced to continue or @next is local.

> +}
> +
> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
> + struct mcs_spinlock *next)
> +{
> + struct cna_node *cn = (struct cna_node *)node;
> + u32 val = 1;
> +
> + u32 partial_order = cn->partial_order;
> +
> + if (partial_order == LOCAL_WAITER_NOT_FOUND)
> + cna_order_queue(node);
> +

AFAICT this is where playing silly games with ->numa_node belongs; but
right along with that goes a comment that describes why any of that
makes sense.

I mean, if you leave your node, for any reason, why bother coming back
to it, why not accept it is a sign of the gods you're overdue for a
node-change?

Was the efficacy of this scheme tested?

> + /*
> + * We have a local waiter, either real or fake one;
> + * reload @next in case it was changed by cna_order_queue().
> + */
> + next = node->next;
> + if (node->locked > 1)
> + val = node->locked; /* preserve secondary queue */

IIRC we used to do:

val |= node->locked;

Which is simpler for not having branches. Why change a good thing?

> +
> + arch_mcs_lock_handoff(&next->locked, val);
> +}

2021-04-14 11:57:31

by Alex Kogan

Subject: Re: [External] : Re: [PATCH v14 3/6] locking/qspinlock: Introduce CNA into the slow path of qspinlock

Peter, thanks for all the comments and suggestions!

> On Apr 13, 2021, at 7:30 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Apr 01, 2021 at 11:31:53AM -0400, Alex Kogan wrote:
>
>> +/*
>> + * cna_splice_tail -- splice the next node from the primary queue onto
>> + * the secondary queue.
>> + */
>> +static void cna_splice_next(struct mcs_spinlock *node,
>> + struct mcs_spinlock *next,
>> + struct mcs_spinlock *nnext)
>
> You forgot to update the comment when you changed the name on this
> thing.
Good catch, thanks.

>
>> +/*
>> + * cna_order_queue - check whether the next waiter in the main queue is on
>> + * the same NUMA node as the lock holder; if not, and it has a waiter behind
>> + * it in the main queue, move the former onto the secondary queue.
>> + */
>> +static void cna_order_queue(struct mcs_spinlock *node)
>> +{
>> + struct mcs_spinlock *next = READ_ONCE(node->next);
>> + struct cna_node *cn = (struct cna_node *)node;
>> + int numa_node, next_numa_node;
>> +
>> + if (!next) {
>> + cn->partial_order = LOCAL_WAITER_NOT_FOUND;
>> + return;
>> + }
>> +
>> + numa_node = cn->numa_node;
>> + next_numa_node = ((struct cna_node *)next)->numa_node;
>> +
>> + if (next_numa_node != numa_node) {
>> + struct mcs_spinlock *nnext = READ_ONCE(next->next);
>> +
>> + if (nnext) {
>> + cna_splice_next(node, next, nnext);
>> + next = nnext;
>> + }
>> + /*
>> + * Inherit NUMA node id of primary queue, to maintain the
>> + * preference even if the next waiter is on a different node.
>> + */
>> + ((struct cna_node *)next)->numa_node = numa_node;
>> + }
>> +}
>
> So the obvious change since last time I looked at this is that it now
> only looks 1 entry ahead. Which makes sense I suppose.
This is in response to the critique that the worst-case time complexity of
cna_order_queue() was O(n). With this change, the complexity is constant.
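
(Concretely, each call now inspects at most one waiter, node->next, and
performs at most one splice onto the secondary queue, instead of
potentially scanning the whole queue for a local waiter.)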

>
> I'm not really a fan of the 'partial_order' name combined with that
> silly enum { LOCAL_WAITER_FOUND, LOCAL_WAITER_NOT_FOUND }. That's just
> really bad naming all around. The enum is about having a waiter while
> the variable is about partial order, that doesn't match at all.
Fair enough.

> If you rename the variable to 'has_waiter' and simply use 0,1 values,
> things would be ever so much more readable. But I don't think that makes
> sense, see below.
>
> I'm also not sure about that whole numa_node thing, why would you
> over-write the numa node, why at this point ?
With this move-one-by-one approach, I want to keep the NUMA-node preference
of the lock holder even if the next-next waiter is on a different NUMA node.
Otherwise, we would end up switching the preference often, and the entire
scheme would not perform well. In particular, we might easily end up with
threads from the preferred node waiting in the secondary queue.
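
For illustration (hypothetical node IDs): suppose the holder runs on node 0
and the primary queue is A (node 1) followed by B (node 2). cna_order_queue()
splices A onto the secondary queue and stamps B with numa_node = 0, so when B
eventually becomes the holder, it keeps filtering for node-0 waiters rather
than flipping the preference to node 2.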

>
>> +
>> +/* Abuse the pv_wait_head_or_lock() hook to get some work done */
>> +static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
>> + struct mcs_spinlock *node)
>> +{
>> + /*
>> + * Try and put the time otherwise spent spin waiting on
>> + * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
>> + */
>> + cna_order_queue(node);
>> +
>> + return 0; /* we lied; we didn't wait, go do so now */
>
> So here we inspect one entry ahead and then quit. I can't remember, but
> did we try something like:
>
> /*
> * Try and put the time otherwise spent spin waiting on
> * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> * Move one entry at a go until either the list is fully
> * sorted or we ran out of spin condition.
> */
> while (READ_ONCE(lock->val) & _Q_LOCKED_PENDING_MASK &&
> node->partial_order)
> cna_order_queue(node);
>
> return 0;
>
> This will keep moving @next to the remote list until such a time that
> we're forced to continue or @next is local.
We have not tried that. It is actually an interesting idea, with its pros and
cons. We would likely filter “non-preferred” waiters into the secondary queue
faster, but we would also be more likely to run into a situation where the
lock becomes available while we are running the cna_order_queue() logic, thus
prolonging the handover time.

With this change, however, there is no need to call cna_order_queue() again
in cna_lock_handoff(), which is another advantage, I guess.

All in all, I am fine with switching to this alternative if you like it more.

BTW, we can return node->partial_order instead of 0, to skip the call to
atomic_cond_read_acquire() in the generic code.
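
For reference, this is the call site in the generic slow path
(kernel/locking/qspinlock.c) that consumes the hook's return value; a
non-zero return is treated as the lock value observed while waiting,
which lets the spin below it be skipped:

    if ((val = pv_wait_head_or_lock(lock, node)))
            goto locked;

    val = atomic_cond_read_acquire(&lock->val,
                                   !(VAL & _Q_LOCKED_PENDING_MASK));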

>
>> +}
>> +
>> +static inline void cna_lock_handoff(struct mcs_spinlock *node,
>> + struct mcs_spinlock *next)
>> +{
>> + struct cna_node *cn = (struct cna_node *)node;
>> + u32 val = 1;
>> +
>> + u32 partial_order = cn->partial_order;
>> +
>> + if (partial_order == LOCAL_WAITER_NOT_FOUND)
>> + cna_order_queue(node);
>> +
>
> AFAICT this is where playing silly games with ->numa_node belongs; but
> right along with that goes a comment that describes why any of that
> makes sense.
>
> I mean, if you leave your node, for any reason, why bother coming back
> to it, why not accept it is a sign of the gods you're overdue for a
> node-change?
I think there is some misunderstanding here — let me try to clarify.
In the current logic, we first call cna_order_queue() from cna_wait_head_or_lock().
If we find an appropriate “local” waiter (either real or fake), we set it as
next and make sure the numa_node of that node is the same as ours.
If not (i.e., when we do not have any next waiter), we set partial_order to
LOCAL_WAITER_NOT_FOUND (pardon the names) and go spin on the lock (calling
atomic_cond_read_acquire() in the generic code).
During this spin time, new waiters can join the queue.
Hence, we recheck the state of the queue by calling cna_order_queue() again
from cna_lock_handoff().

As just mentioned, if we change the logic as you suggest, there is
really no reason to call cna_order_queue() again.

>
> Was the efficacy of this scheme tested?
>
>> + /*
>> + * We have a local waiter, either real or fake one;
>> + * reload @next in case it was changed by cna_order_queue().
>> + */
>> + next = node->next;
>> + if (node->locked > 1)
>> + val = node->locked; /* preserve secondary queue */
>
> IIRC we used to do:
>
> val |= node->locked;
>
> Which is simpler for not having branches. Why change a good thing?
With |= we might set val to the encoded tail + 1 (rather than the encoded
tail). This would still work, as decode_tail(val) does not care about the
LSB, but it seems fragile to me.
We talked about that before:
https://lore.kernel.org/linux-arm-kernel/[email protected]
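
Concretely (a hypothetical tail value, for illustration): an encoded tail
always has bit 0 clear, so with the branchless form

    u32 val = 1;
    val |= node->locked;    /* e.g. 0x00050000 | 1 == 0x00050001 */

we would hand off encoded tail + 1. decode_tail() never looks at bit 0,
so the tail still decodes correctly, but the stored value is no longer
the exact encoded tail.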

If you think this is a non-issue, I will be happy to adopt your suggestion.

Regards,
— Alex