2021-04-01 17:46:59

by Alex Kogan

Subject: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

Keep track of the time the thread at the head of the secondary queue
has been waiting, and force inter-node handoff once this time passes
a preset threshold. The default value for the threshold (10ms) can be
overridden with the new kernel boot command-line option
"numa_spinlock_threshold". The ms value is translated internally to the
nearest rounded-up jiffies.
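
For illustration (assuming HZ=250, i.e. a 4ms jiffy, purely for the
arithmetic), the MSECS_TO_JIFFIES() helper added below computes:

	/* HZ=250 is an assumption for this example: MSEC_PER_SEC / HZ == 4 */
	MSECS_TO_JIFFIES(10) == (10 + 4 - 1) / 4 == 3	/* jiffies, i.e. 12ms */
	MSECS_TO_JIFFIES(1)  == (1 + 4 - 1) / 4  == 1	/* jiffy,   i.e.  4ms */

so the requested threshold is always rounded up, never down. Booting with,
e.g., "numa_spinlock_threshold=5" lowers the threshold to 5ms.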

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
---
.../admin-guide/kernel-parameters.txt | 9 ++
kernel/locking/qspinlock_cna.h | 96 ++++++++++++++++---
2 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ace55afd4441..5c959631a8c8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3485,6 +3485,15 @@
Not specifying this option is equivalent to
numa_spinlock=auto.

+ numa_spinlock_threshold= [NUMA, PV_OPS]
+ Set the time threshold in milliseconds for
+ intra-node lock hand-offs before the
+ NUMA-aware spinlock is forced to be passed to
+ a thread on another NUMA node. Valid values
+ are in the [1..100] range. Smaller values result
+ in a fairer, but less performant spinlock,
+ and vice versa. The default value is 10.
+
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
'node', 'default' can be specified
This can be set from sysctl after boot.
diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index d689861a7b3d..0513360c11fe 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -37,6 +37,12 @@
* gradually filter the primary queue, leaving only waiters running on the same
* preferred NUMA node.
*
+ * We change the NUMA node preference after a waiter at the head of the
+ * secondary queue spins for a certain amount of time (10ms, by default).
+ * We do that by flushing the secondary queue into the head of the primary queue,
+ * effectively changing the preference to the NUMA node of the waiter at the head
+ * of the secondary queue at the time of the flush.
+ *
* For more details, see https://arxiv.org/abs/1810.05600.
*
* Authors: Alex Kogan <[email protected]>
@@ -49,13 +55,33 @@ struct cna_node {
u16 real_numa_node;
u32 encoded_tail; /* self */
u32 partial_order; /* enum val */
+ s32 start_time;
};

enum {
LOCAL_WAITER_FOUND,
LOCAL_WAITER_NOT_FOUND,
+ FLUSH_SECONDARY_QUEUE
};

+/*
+ * Controls the threshold time in ms (default = 10) for intra-node lock
+ * hand-offs before the NUMA-aware variant of spinlock is forced to be
+ * passed to a thread on another NUMA node. The default setting can be
+ * changed with the "numa_spinlock_threshold" boot option.
+ */
+#define MSECS_TO_JIFFIES(m) \
+ (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
+static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
+
+static inline bool intra_node_threshold_reached(struct cna_node *cn)
+{
+ s32 current_time = (s32)jiffies;
+ s32 threshold = cn->start_time + intra_node_handoff_threshold;
+
+ return current_time - threshold > 0;
+}
+
static void __init cna_init_nodes_per_cpu(unsigned int cpu)
{
struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -99,6 +125,7 @@ static __always_inline void cna_init_node(struct mcs_spinlock *node)

cn->numa_node = cn->real_numa_node;
cn->partial_order = LOCAL_WAITER_FOUND;
+ cn->start_time = 0;
}

/*
@@ -198,8 +225,15 @@ static void cna_splice_next(struct mcs_spinlock *node,

/* stick `next` on the secondary queue tail */
if (node->locked <= 1) { /* if secondary queue is empty */
+ struct cna_node *cn = (struct cna_node *)node;
+
/* create secondary queue */
next->next = next;
+
+ cn->start_time = (s32)jiffies;
+ /* make sure start_time != 0 iff secondary queue is not empty */
+ if (!cn->start_time)
+ cn->start_time = 1;
} else {
/* add to the tail of the secondary queue */
struct mcs_spinlock *tail_2nd = decode_tail(node->locked);
@@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
struct mcs_spinlock *node)
{
- /*
- * Try and put the time otherwise spent spin waiting on
- * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
- */
- cna_order_queue(node);
+ struct cna_node *cn = (struct cna_node *)node;
+
+ if (!cn->start_time || !intra_node_threshold_reached(cn)) {
+ /*
+ * Try and put the time otherwise spent spin waiting on
+ * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
+ */
+ cna_order_queue(node);
+ } else {
+ cn->partial_order = FLUSH_SECONDARY_QUEUE;
+ }

return 0; /* we lied; we didn't wait, go do so now */
}
@@ -270,13 +310,28 @@ static inline void cna_lock_handoff(struct mcs_spinlock *node,
if (partial_order == LOCAL_WAITER_NOT_FOUND)
cna_order_queue(node);

- /*
- * We have a local waiter, either real or fake one;
- * reload @next in case it was changed by cna_order_queue().
- */
- next = node->next;
- if (node->locked > 1)
- val = node->locked; /* preserve secondary queue */
+ if (partial_order != FLUSH_SECONDARY_QUEUE) {
+ /*
+ * We have a local waiter, either real or fake one;
+ * reload @next in case it was changed by cna_order_queue().
+ */
+ next = node->next;
+ if (node->locked > 1) {
+ val = node->locked; /* preserve secondary queue */
+ ((struct cna_node *)next)->start_time = cn->start_time;
+ }
+ } else {
+ /*
+ * We decided to flush the secondary queue;
+ * this can only happen if that queue is not empty.
+ */
+ WARN_ON(node->locked <= 1);
+ /*
+ * Splice the secondary queue onto the primary queue and pass the lock
+ * to the longest waiting remote waiter.
+ */
+ next = cna_splice_head(NULL, 0, node, next);
+ }

arch_mcs_lock_handoff(&next->locked, val);
}
@@ -328,3 +383,20 @@ void __init cna_configure_spin_lock_slowpath(void)

pr_info("Enabling CNA spinlock\n");
}
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+ int param;
+
+ if (get_option(&str, &param)) {
+ /* valid value is between 1 and 100 */
+ if (param <= 0 || param > 100)
+ return 0;
+
+ intra_node_handoff_threshold = msecs_to_jiffies(param);
+ return 1;
+ }
+
+ return 0;
+}
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
--
2.24.3 (Apple Git-128)


2021-04-13 08:57:07

by Andi Kleen

Subject: Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

Alex Kogan <[email protected]> writes:
>
> + numa_spinlock_threshold= [NUMA, PV_OPS]
> + Set the time threshold in milliseconds for
> + intra-node lock hand-offs before the
> + NUMA-aware spinlock is forced to be passed to
> + a thread on another NUMA node. Valid values
> + are in the [1..100] range. Smaller values result
> + in a fairer, but less performant spinlock,
> + and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a module param which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter.

This would also make the code a lot shorter I guess.
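
For reference, a minimal sketch of what that might look like; illustrative
only, not code from the series, and it assumes the threshold variable keeps
its current name and MSECS_TO_JIFFIES() initializer, with the
__ro_after_init annotation dropped:

	#include <linux/moduleparam.h>

	/*
	 * Sketch: expose the threshold (in jiffies here, unlike the ms-based
	 * boot option) as a runtime-writable parameter. With 0644 permissions
	 * it shows up under /sys/module/<modname>/parameters/ even for code
	 * built into vmlinux.
	 */
	static int intra_node_handoff_threshold = MSECS_TO_JIFFIES(10);
	module_param(intra_node_handoff_threshold, int, 0644);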

-Andi

2021-04-13 08:58:50

by Andi Kleen

Subject: Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

Andi Kleen <[email protected]> writes:

> Alex Kogan <[email protected]> writes:
>>
>> + numa_spinlock_threshold= [NUMA, PV_OPS]
>> + Set the time threshold in milliseconds for
>> + intra-node lock hand-offs before the
>> + NUMA-aware spinlock is forced to be passed to
>> + a thread on another NUMA node. Valid values
>> + are in the [1..100] range. Smaller values result
>> + in a fairer, but less performant spinlock,
>> + and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.

Actually, thinking about it more, using jiffies is likely broken anyway:
if interrupts are disabled on the CPU that runs the main timer interrupt,
jiffies won't increase.

cpu_clock (better than ktime_get) or sched_clock would work.

-Andi

2021-04-13 16:41:13

by Peter Zijlstra

Subject: Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:

> @@ -49,13 +55,33 @@ struct cna_node {
> u16 real_numa_node;
> u32 encoded_tail; /* self */
> u32 partial_order; /* enum val */
> + s32 start_time;
> };

> +/*
> + * Controls the threshold time in ms (default = 10) for intra-node lock
> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
> + * passed to a thread on another NUMA node. The default setting can be
> + * changed with the "numa_spinlock_threshold" boot option.
> + */
> +#define MSECS_TO_JIFFIES(m) \
> + (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
> +
> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
> +{
> + s32 current_time = (s32)jiffies;
> + s32 threshold = cn->start_time + intra_node_handoff_threshold;
> +
> + return current_time - threshold > 0;
> +}

None of this makes any sense:

- why do you track time elapsed as a signed entity?
- why are you using jiffies; that's terrible granularity.

As Andi already said, 10ms is silly large. You've just inflated the
lock-acquire time for every contended lock to stupid land just because
NUMA.

And this also brings me to the whole premise of this series; *why* are
we optimizing this? What locks are so contended that this actually helps
and shouldn't you be spending your time breaking those locks? That would
improve throughput more than this ever can.

> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> {
> struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);

> @@ -250,11 +284,17 @@ static void cna_order_queue(struct mcs_spinlock *node)
> static __always_inline u32 cna_wait_head_or_lock(struct qspinlock *lock,
> struct mcs_spinlock *node)
> {
> - /*
> - * Try and put the time otherwise spent spin waiting on
> - * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> - */
> - cna_order_queue(node);
> + struct cna_node *cn = (struct cna_node *)node;
> +
> + if (!cn->start_time || !intra_node_threshold_reached(cn)) {
> + /*
> + * Try and put the time otherwise spent spin waiting on
> + * _Q_LOCKED_PENDING_MASK to use by sorting our lists.
> + */
> + cna_order_queue(node);
> + } else {
> + cn->partial_order = FLUSH_SECONDARY_QUEUE;

This is even worse naming...

> + }
>
> return 0; /* we lied; we didn't wait, go do so now */
> }

2021-04-14 08:16:53

by Alex Kogan

Subject: Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

Hi, Andi.

Thanks for your comments!

> On Apr 13, 2021, at 2:03 AM, Andi Kleen <[email protected]> wrote:
>
> Alex Kogan <[email protected]> writes:
>>
>> + numa_spinlock_threshold= [NUMA, PV_OPS]
>> + Set the time threshold in milliseconds for
>> + intra-node lock hand-offs before the
>> + NUMA-aware spinlock is forced to be passed to
>> + a thread on another NUMA node. Valid values
>> + are in the [1..100] range. Smaller values result
>> + in a fairer, but less performant spinlock,
>> + and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.
We are reading time when we are at the head of the (main) queue, but
don’t have the lock yet. Not sure about the latency of ktime_get(), but
anything reasonably fast but not necessarily precise should work.

> Could you turn that into a module param which can be changed at runtime?
> Would be strange to have to reboot just to play with this parameter.
Yes, good suggestion, thanks.

> This would also make the code a lot shorter I guess.
So you don’t think we need the command-line parameter, just the module_param?

Regards,
— Alex

2021-04-14 08:30:16

by Andi Kleen

Subject: Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

> > ms granularity seems very coarse grained for this. Surely
> > at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.

Actually cpu_clock / sched_clock (see my other email). These should
be fast without corner cases and also monotonic.

>
> > Could you turn that into a module param which can be changed at runtime?
> > Would be strange to have to reboot just to play with this parameter.
> Yes, good suggestion, thanks.
>
> > This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

module_params can be changed at the command line too, so yes.

-Andi

2021-04-14 11:57:31

by Alex Kogan

Subject: Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA



> On Apr 13, 2021, at 5:22 PM, Andi Kleen <[email protected]> wrote:
>
>>> ms granularity seems very coarse grained for this. Surely
>>> at some point of spinning you can afford a ktime_get? But ok.
>> We are reading time when we are at the head of the (main) queue, but
>> don’t have the lock yet. Not sure about the latency of ktime_get(), but
>> anything reasonably fast but not necessarily precise should work.
>
> Actually cpu_clock / sched_clock (see my other email). These should
> be fast without corner cases and also monotonic.
I see, thanks.

>
>>
>>> Could you turn that into a module param which can be changed at runtime?
>>> Would be strange to have to reboot just to play with this parameter.
>> Yes, good suggestion, thanks.
>>
>>> This would also make the code a lot shorter I guess.
>> So you don’t think we need the command-line parameter, just the module_param?
>
> module_params can be changed at the command line too, so yes.
Got it, thanks again.

— Alex

2021-04-15 00:35:30

by Waiman Long

Subject: Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

On 4/13/21 5:01 PM, Alex Kogan wrote:
> Hi, Andi.
>
> Thanks for your comments!
>
>> On Apr 13, 2021, at 2:03 AM, Andi Kleen <[email protected]> wrote:
>>
>> Alex Kogan <[email protected]> writes:
>>> + numa_spinlock_threshold= [NUMA, PV_OPS]
>>> + Set the time threshold in milliseconds for
>>> + intra-node lock hand-offs before the
>>> + NUMA-aware spinlock is forced to be passed to
>>> + a thread on another NUMA node. Valid values
>>> + are in the [1..100] range. Smaller values result
>>> + in a fairer, but less performant spinlock,
>>> + and vice versa. The default value is 10.
>> ms granularity seems very coarse grained for this. Surely
>> at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.
>
>> Could you turn that into a module param which can be changed at runtime?
>> Would be strange to have to reboot just to play with this parameter.
> Yes, good suggestion, thanks.
>
>> This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

The CNA code, if enabled, will be in vmlinux, not in a kernel module. As
a result, I think a module parameter will be no different from a kernel
command line parameter in this regard.

Cheers,
Longman


2021-04-16 04:12:32

by Alex Kogan

Subject: Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA



> On Apr 13, 2021, at 8:03 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Apr 01, 2021 at 11:31:54AM -0400, Alex Kogan wrote:
>
>> @@ -49,13 +55,33 @@ struct cna_node {
>> u16 real_numa_node;
>> u32 encoded_tail; /* self */
>> u32 partial_order; /* enum val */
>> + s32 start_time;
>> };
>
>> +/*
>> + * Controls the threshold time in ms (default = 10) for intra-node lock
>> + * hand-offs before the NUMA-aware variant of spinlock is forced to be
>> + * passed to a thread on another NUMA node. The default setting can be
>> + * changed with the "numa_spinlock_threshold" boot option.
>> + */
>> +#define MSECS_TO_JIFFIES(m) \
>> + (((m) + (MSEC_PER_SEC / HZ) - 1) / (MSEC_PER_SEC / HZ))
>> +static int intra_node_handoff_threshold __ro_after_init = MSECS_TO_JIFFIES(10);
>> +
>> +static inline bool intra_node_threshold_reached(struct cna_node *cn)
>> +{
>> + s32 current_time = (s32)jiffies;
>> + s32 threshold = cn->start_time + intra_node_handoff_threshold;
>> +
>> + return current_time - threshold > 0;
>> +}
>
> None of this makes any sense:
>
> - why do you track time elapsed as a signed entity?
> - why are you using jiffies; that's terrible granularity.
Good points. I will address that (see below). I will just mention that
those suggestions came from senior folks on this mailing list,
and it seemed prudent to take their counsel.

>
> As Andi already said, 10ms is silly large. You've just inflated the
> lock-acquire time for every contended lock to stupid land just because
> NUMA.
I just ran a few quick tests — local_clock() (a wrapper around sched_clock())
works well, so I will switch to using that.
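
A rough sketch of what that switch might look like (tentative, not what I
posted; local_clock() returns nanoseconds as a u64):

	/* needs linux/sched/clock.h for local_clock() */
	static u64 intra_node_handoff_threshold_ns = 10 * NSEC_PER_MSEC;

	static inline bool intra_node_threshold_reached(struct cna_node *cn)
	{
		/* both values are u64 nanoseconds, so no wraparound games */
		return local_clock() - cn->start_time >
		       intra_node_handoff_threshold_ns;
	}

with cn->start_time widened to u64 (keeping the non-zero sentinel trick)
and read from local_clock() when the secondary queue is created.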

I also took a few numbers with different thresholds. Looks like we can drop
the threshold to 1ms with a minor penalty to performance. However,
pushing the threshold to 100us has a more significant cost. Here are
the numbers for reference:

will-it-scale/lock2_threads:
threshold:               10ms     1ms      100us
speedup at 142 threads:  2.184    1.974    1.1418

will-it-scale/open1_threads:
threshold:               10ms     1ms      100us
speedup at 142 threads:  2.146    1.974    1.291

Would you be more comfortable with setting the default at 1ms?

> And this also brings me to the whole premise of this series; *why* are
> we optimizing this? What locks are so contended that this actually helps
> and shouldn't you be spending your time breaking those locks? That would
> improve throughput more than this ever can.

I think for the same reason the kernel switched from ticket locks to queue locks
several years back. There always will be applications with contended locks.
Sometimes the workarounds are easy, but many times they are not, like with
legacy applications or when the workload is skewed (e.g., every client tries to
update the metadata of the same file protected by the same lock). The results
show that for those cases we leave > 2x performance on the table. Those are not
only our numbers — LKP reports show similar or even better results
on a wide range of benchmarks, e.g.:
https://lists.01.org/hyperkitty/list/[email protected]/thread/HGVOCYDEE5KTLYPTAFBD2RXDQOCDPFUJ/
https://lists.01.org/hyperkitty/list/[email protected]/thread/OUPS7MZ3GJA2XYWM52GMU7H7EI25IT37/
https://lists.01.org/hyperkitty/list/[email protected]/thread/DNMEQPXJRQY2IKHZ3ERGRY6TUPWDTFUN/

Regards,
— Alex