2019-12-30 19:54:59

by Alex Kogan

Subject: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

Keep track of the number of intra-node lock handoffs, and force
inter-node handoff once this number reaches a preset threshold.
The default value for the threshold can be overridden with
the new kernel boot command-line option "numa_spinlock_threshold".

Signed-off-by: Alex Kogan <[email protected]>
Reviewed-by: Steve Sistare <[email protected]>
---
 .../admin-guide/kernel-parameters.txt |  8 ++++
 kernel/locking/qspinlock.c            |  3 ++
 kernel/locking/qspinlock_cna.h        | 41 ++++++++++++++++++-
 3 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index b68cb80e477f..30d79819a3b0 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3200,6 +3200,14 @@
 			Not specifying this option is equivalent to
 			numa_spinlock=auto.
 
+	numa_spinlock_threshold=	[NUMA, PV_OPS]
+			Set the threshold for the number of intra-node
+			lock hand-offs before the NUMA-aware spinlock
+			is forced to be passed to a thread on another NUMA node.
+			Valid values are in the [0..31] range. Smaller values
+			result in a more fair, but less performant spinlock, and
+			vice versa. The default value is 16.
+
 	cpu0_hotplug	[X86] Turn on CPU0 hotplug feature when
 			CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
 			Some features depend on CPU0. Known dependencies are:
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 609980a53841..e382d8946ccc 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -597,6 +597,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
 #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
 #define _GEN_CNA_LOCK_SLOWPATH
 
+#undef pv_init_node
+#define pv_init_node		cna_init_node
+
 #undef pv_wait_head_or_lock
 #define pv_wait_head_or_lock	cna_pre_scan

diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
index 3c99a4a6184b..30feff02865d 100644
--- a/kernel/locking/qspinlock_cna.h
+++ b/kernel/locking/qspinlock_cna.h
@@ -51,13 +51,25 @@ struct cna_node {
 	int numa_node;
 	u32 encoded_tail;
 	u32 pre_scan_result; /* encoded tail or enum val */
+	u32 intra_count;
 };
 
 enum {
 	LOCAL_WAITER_FOUND = 2, /* 0 and 1 are reserved for @locked */
+	FLUSH_SECONDARY_QUEUE = 3,
 	MIN_ENCODED_TAIL
 };

+/*
+ * Controls the threshold for the number of intra-node lock hand-offs before
+ * the NUMA-aware variant of spinlock is forced to be passed to a thread on
+ * another NUMA node. By default, the chosen value provides reasonable
+ * long-term fairness without sacrificing performance compared to a lock
+ * that does not have any fairness guarantees. The default setting can
+ * be changed with the "numa_spinlock_threshold" boot option.
+ */
+int intra_node_handoff_threshold __ro_after_init = 1 << 16;
+
 static void __init cna_init_nodes_per_cpu(unsigned int cpu)
 {
 	struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
@@ -97,6 +109,11 @@ static int __init cna_init_nodes(void)
 }
 early_initcall(cna_init_nodes);

+static __always_inline void cna_init_node(struct mcs_spinlock *node)
+{
+	((struct cna_node *)node)->intra_count = 0;
+}
+
 /* this function is called only when the primary queue is empty */
 static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
 				       struct mcs_spinlock *node)
@@ -233,7 +250,9 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
 {
 	struct cna_node *cn = (struct cna_node *)node;
 
-	cn->pre_scan_result = cna_scan_main_queue(node, node);
+	cn->pre_scan_result =
+		cn->intra_count == intra_node_handoff_threshold ?
+			FLUSH_SECONDARY_QUEUE : cna_scan_main_queue(node, node);
 
 	return 0;
 }
@@ -263,6 +282,9 @@ static inline void cna_pass_lock(struct mcs_spinlock *node,
 		 * if we acquired the MCS lock when its queue was empty
 		 */
 		val = node->locked ? node->locked : 1;
+		/* inc @intra_count if the secondary queue is not empty */
+		((struct cna_node *)next_holder)->intra_count =
+			cn->intra_count + (node->locked > 1);
 	} else if (node->locked > 1) {	/* if secondary queue is not empty */
 		/* next holder will be the first node in the secondary queue */
 		tail_2nd = decode_tail(node->locked);
@@ -317,3 +339,20 @@ void cna_configure_spin_lock_slowpath(void)
 		pr_info("Enabling CNA spinlock\n");
 	}
 }
+
+static int __init numa_spinlock_threshold_setup(char *str)
+{
+	int new_threshold_param;
+
+	if (get_option(&str, &new_threshold_param)) {
+		/* valid value is between 0 and 31 */
+		if (new_threshold_param < 0 || new_threshold_param > 31)
+			return 0;
+
+		intra_node_handoff_threshold = 1 << new_threshold_param;
+		return 1;
+	}
+
+	return 0;
+}
+__setup("numa_spinlock_threshold=", numa_spinlock_threshold_setup);
--
2.21.0 (Apple Git-122.2)


2020-01-06 15:35:07

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 12/30/19 2:40 PM, Alex Kogan wrote:
> Keep track of the number of intra-node lock handoffs, and force
> inter-node handoff once this number reaches a preset threshold.
> The default value for the threshold can be overridden with
> the new kernel boot command-line option "numa_spinlock_threshold".
>
> Signed-off-by: Alex Kogan <[email protected]>
> Reviewed-by: Steve Sistare <[email protected]>
> ---
> .../admin-guide/kernel-parameters.txt | 8 ++++
> kernel/locking/qspinlock.c | 3 ++
> kernel/locking/qspinlock_cna.h | 41 ++++++++++++++++++-
> 3 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index b68cb80e477f..30d79819a3b0 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -3200,6 +3200,14 @@
> Not specifying this option is equivalent to
> numa_spinlock=auto.
>
> + numa_spinlock_threshold= [NUMA, PV_OPS]
> + Set the threshold for the number of intra-node
> + lock hand-offs before the NUMA-aware spinlock
> + is forced to be passed to a thread on another NUMA node.
> + Valid values are in the [0..31] range. Smaller values
> + result in a more fair, but less performant spinlock, and
> + vice versa. The default value is 16.
> +
> cpu0_hotplug [X86] Turn on CPU0 hotplug feature when
> CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
> Some features depend on CPU0. Known dependencies are:
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 609980a53841..e382d8946ccc 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -597,6 +597,9 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
> #if !defined(_GEN_CNA_LOCK_SLOWPATH) && defined(CONFIG_NUMA_AWARE_SPINLOCKS)
> #define _GEN_CNA_LOCK_SLOWPATH
>
> +#undef pv_init_node
> +#define pv_init_node cna_init_node
> +
> #undef pv_wait_head_or_lock
> #define pv_wait_head_or_lock cna_pre_scan
>
> diff --git a/kernel/locking/qspinlock_cna.h b/kernel/locking/qspinlock_cna.h
> index 3c99a4a6184b..30feff02865d 100644
> --- a/kernel/locking/qspinlock_cna.h
> +++ b/kernel/locking/qspinlock_cna.h
> @@ -51,13 +51,25 @@ struct cna_node {
> int numa_node;
> u32 encoded_tail;
> u32 pre_scan_result; /* encoded tail or enum val */
> + u32 intra_count;
> };
>
> enum {
> LOCAL_WAITER_FOUND = 2, /* 0 and 1 are reserved for @locked */
> + FLUSH_SECONDARY_QUEUE = 3,
> MIN_ENCODED_TAIL
> };
>
> +/*
> + * Controls the threshold for the number of intra-node lock hand-offs before
> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
> + * another NUMA node. By default, the chosen value provides reasonable
> + * long-term fairness without sacrificing performance compared to a lock
> + * that does not have any fairness guarantees. The default setting can
> + * be changed with the "numa_spinlock_threshold" boot option.
> + */
> +int intra_node_handoff_threshold __ro_after_init = 1 << 16;

Another minor nit. (1 << 31) is negative for an int value. I will
suggest that you either make intra_node_handoff_threshold an unsigned
int or limit the upper bound to 30.
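
For illustration, a minimal sketch of the second suggestion (capping the
exponent at 30 so the shift can never produce a negative int) could look
like this; a sketch only, not a replacement patch:

static int __init numa_spinlock_threshold_setup(char *str)
{
	int new_threshold_param;

	if (get_option(&str, &new_threshold_param)) {
		/* cap at 30: 1 << 31 would overflow a signed int */
		if (new_threshold_param < 0 || new_threshold_param > 30)
			return 0;

		intra_node_handoff_threshold = 1 << new_threshold_param;
		return 1;
	}

	return 0;
}

Booting with, say, numa_spinlock_threshold=10 would then yield a
threshold of 1 << 10 = 1024 intra-node hand-offs.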

Cheers,
Longman


2020-01-21 13:32:12

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Mon, Dec 30, 2019 at 02:40:41PM -0500, Alex Kogan wrote:

> +/*
> + * Controls the threshold for the number of intra-node lock hand-offs before
> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
> + * another NUMA node. By default, the chosen value provides reasonable
> + * long-term fairness without sacrificing performance compared to a lock
> + * that does not have any fairness guarantees. The default setting can
> + * be changed with the "numa_spinlock_threshold" boot option.
> + */
> +int intra_node_handoff_threshold __ro_after_init = 1 << 16;

There is a distinct lack of quantitative data to back up that
'reasonable' claim there.

Where is the table of inter-node latencies observed for the various
values tested, and on what criteria is this number deemed reasonable?

To me, 64k lock hold times seems like a giant number, entirely outside
of reasonable.
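
(For a rough sense of scale, assuming ~1us per hand-off plus critical
section, 2^16 = 65536 intra-node hand-offs already add up to roughly
65ms before a waiter on another node gets a turn.)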

> +
> static void __init cna_init_nodes_per_cpu(unsigned int cpu)
> {
> struct mcs_spinlock *base = per_cpu_ptr(&qnodes[0].mcs, cpu);
> @@ -97,6 +109,11 @@ static int __init cna_init_nodes(void)
> }
> early_initcall(cna_init_nodes);
>
> +static __always_inline void cna_init_node(struct mcs_spinlock *node)
> +{
> + ((struct cna_node *)node)->intra_count = 0;
> +}
> +
> /* this function is called only when the primary queue is empty */
> static inline bool cna_try_change_tail(struct qspinlock *lock, u32 val,
> struct mcs_spinlock *node)
> @@ -233,7 +250,9 @@ __always_inline u32 cna_pre_scan(struct qspinlock *lock,
> {
> struct cna_node *cn = (struct cna_node *)node;
>
> - cn->pre_scan_result = cna_scan_main_queue(node, node);
> + cn->pre_scan_result =
> + cn->intra_count == intra_node_handoff_threshold ?
> + FLUSH_SECONDARY_QUEUE : cna_scan_main_queue(node, node);

Because:

	if (cn->intra_count < intra_node_handoff_threshold)
		cn->pre_scan_result = cna_scan_main_queue(node, node);
	else
		cn->pre_scan_result = FLUSH_SECONDARY_QUEUE;

was too readable?

2020-01-21 13:52:11

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Tue, Jan 21, 2020 at 02:29:49PM +0100, Peter Zijlstra wrote:
> On Mon, Dec 30, 2019 at 02:40:41PM -0500, Alex Kogan wrote:
>
> > +/*
> > + * Controls the threshold for the number of intra-node lock hand-offs before
> > + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
> > + * another NUMA node. By default, the chosen value provides reasonable
> > + * long-term fairness without sacrificing performance compared to a lock
> > + * that does not have any fairness guarantees. The default setting can
> > + * be changed with the "numa_spinlock_threshold" boot option.
> > + */
> > +int intra_node_handoff_threshold __ro_after_init = 1 << 16;
>
> There is a distinct lack of quantitative data to back up that
> 'reasonable' claim there.
>
> Where is the table of inter-node latencies observed for the various
> values tested, and on what criteria is this number deemed reasonable?
>
> To me, 64k lock hold times seems like a giant number, entirely outside
> of reasonable.

Daniel, IIRC you just did a paper on constructing worst case latencies
from measuring pieces. Do you have data on average lock hold times?

2020-01-21 15:47:07

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/21/20 8:29 AM, Peter Zijlstra wrote:
> On Mon, Dec 30, 2019 at 02:40:41PM -0500, Alex Kogan wrote:
>
>> +/*
>> + * Controls the threshold for the number of intra-node lock hand-offs before
>> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
>> + * another NUMA node. By default, the chosen value provides reasonable
>> + * long-term fairness without sacrificing performance compared to a lock
>> + * that does not have any fairness guarantees. The default setting can
>> + * be changed with the "numa_spinlock_threshold" boot option.
>> + */
>> +int intra_node_handoff_threshold __ro_after_init = 1 << 16;
> There is a distinct lack of quantitative data to back up that
> 'reasonable' claim there.
>
> Where is the table of inter-node latencies observed for the various
> values tested, and on what criteria is this number deemed reasonable?
>
> To me, 64k lock hold times seems like a giant number, entirely outside
> of reasonable.

I actually had a similar question before, but having the capability of
changing the default with a boot-time parameter alleviates some of my
concern. I would certainly like to see actual data on how different
values affect the performance of the code.

Cheers,
Longman

by Daniel

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/21/20 2:50 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2020 at 02:29:49PM +0100, Peter Zijlstra wrote:
>> On Mon, Dec 30, 2019 at 02:40:41PM -0500, Alex Kogan wrote:
>>
>>> +/*
>>> + * Controls the threshold for the number of intra-node lock hand-offs before
>>> + * the NUMA-aware variant of spinlock is forced to be passed to a thread on
>>> + * another NUMA node. By default, the chosen value provides reasonable
>>> + * long-term fairness without sacrificing performance compared to a lock
>>> + * that does not have any fairness guarantees. The default setting can
>>> + * be changed with the "numa_spinlock_threshold" boot option.
>>> + */
>>> +int intra_node_handoff_threshold __ro_after_init = 1 << 16;
>> There is a distinct lack of quantitative data to back up that
>> 'reasonable' claim there.
>>
>> Where is the table of inter-node latencies observed for the various
>> values tested, and on what criteria is this number deemed reasonable?
>>
>> To me, 64k lock hold times seems like a giant number, entirely outside
>> of reasonable.
> Daniel, IIRC you just did a paper on constructing worst case latencies
> from measuring pieces. Do you have data on average lock hold times?
>

I am still writing the paper, but I do not have the (avg) lock times.
It is in the TODO list, though!

-- Daniel

2020-01-24 07:55:53

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Thu, Jan 23, 2020 at 04:33:54PM -0500, Alex Kogan wrote:
> Let me put this question to you. What do you think the number should be?

I think it would be very good to keep the inter-node latency below 1ms.

But to realize that we need data on the lock hold times. Specifically
for the heavily contended locks that make CNA worth it in the first
place.

I don't see that data, so I don't see how we can argue about this let
alone call something reasonable.

2020-01-24 15:11:56

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 2:52 AM, Peter Zijlstra wrote:
> On Thu, Jan 23, 2020 at 04:33:54PM -0500, Alex Kogan wrote:
>> Let me put this question to you. What do you think the number should be?
> I think it would be very good to keep the inter-node latency below 1ms.
It is hard to guarantee that given that lock hold times can vary quite a
lot depending on the workload. What we can control is just how many
later lock waiters can jump ahead before a given waiter.
> But to realize that we need data on the lock hold times. Specifically
> for the heavily contended locks that make CNA worth it in the first
> place.
>
> I don't see that data, so I don't see how we can argue about this let
> alone call something reasonable.
>
In essence, CNA lock is for improving throughput on NUMA machines at the
expense of increasing worst case latency. If low latency is important,
it should be disabled. If CONFIG_PREEMPT_RT is on,
CONFIG_NUMA_AWARE_SPINLOCKS should be off.

Cheers,
Longman

2020-01-24 15:20:11

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Fri, Jan 24, 2020 at 09:42:42AM -0500, Waiman Long wrote:
> On 1/24/20 2:52 AM, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2020 at 04:33:54PM -0500, Alex Kogan wrote:
> >> Let me put this question to you. What do you think the number should be?
> > I think it would be very good to keep the inter-node latency below 1ms.
> It is hard to guarantee that given that lock hold times can vary quite a
> lot depending on the workload. What we can control is just how many
> later lock waiters can jump ahead before a given waiter.

We're not into this for easy. And exactly because it depends on a lot we
need a lot of data.

Worst case lock acquisition times directly translate into worst case
IRQ-off latencies, and even the most die hard throughput oriented
workloads don't like to experience multiple ticks worth of irq-off
latencies.

> > But to realize that we need data on the lock hold times. Specifically
> > for the heavily contended locks that make CNA worth it in the first
> > place.
> >
> > I don't see that data, so I don't see how we can argue about this let
> > alone call something reasonable.
> >
> In essence, CNA lock is for improving throughput on NUMA machines at the
> expense of increasing worst case latency. If low latency is important,

Latency is _always_ important. Otherwise we'd never have put so much
time and effort into fair locks to begin with. Unbounded latency sucks
unconditionally.

> it should be disabled. If CONFIG_PREEMPT_RT is on,
> CONFIG_NUMA_AWARE_SPINLOCKS should be off.

You're spouting nonsense. You cannot claim any random number is
reasonable without argument.

2020-01-24 15:21:46

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 9:42 AM, Waiman Long wrote:
> On 1/24/20 2:52 AM, Peter Zijlstra wrote:
>> On Thu, Jan 23, 2020 at 04:33:54PM -0500, Alex Kogan wrote:
>>> Let me put this question to you. What do you think the number should be?
>> I think it would be very good to keep the inter-node latency below 1ms.
> It is hard to guarantee that given that lock hold times can vary quite a
> lot depending on the workload. What we can control is just how many
> later lock waiters can jump ahead before a given waiter.
>> But to realize that we need data on the lock hold times. Specifically
>> for the heavily contended locks that make CNA worth it in the first
>> place.
>>
>> I don't see that data, so I don't see how we can argue about this let
>> alone call something reasonable.
>>
> In essence, CNA lock is for improving throughput on NUMA machines at the
> expense of increasing worst case latency. If low latency is important,
> it should be disabled. If CONFIG_PREEMPT_RT is on,
> CONFIG_NUMA_AWARE_SPINLOCKS should be off.

Actually, what we are worrying about is the additional latency that can
be added to important tasks or execution contexts that are waiting for a
lock. Maybe we can make the CNA lock behave somewhat like qrwlock in
that requests from interrupt context are given priority. We could add a
priority flag in the CNA node. If the flag is set, we will never put it
into the secondary queue. In fact, we can transfer control next to it
even if it is not on the same node. We may also set the priority flag if
it is an RT task that is trying to acquire the lock.

In this way, we can guarantee that important tasks or contexts will not
suffer a delay in acquiring the lock. Those less important tasks,
however, may need to wait a bit longer before they can get the lock.

What do you guys think about that?
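
As a rough sketch (the flag name and its placement are made up here, not
part of the posted patch), the CNA node could carry such a flag, set at
queue-entry time, and the scan that picks the next lock holder would
treat it like a numa_node match, so the waiter is never moved to the
secondary queue:

struct cna_node {
	/* ... fields as in the patch ... */
	u32	intra_count;
	bool	priority;	/* hypothetical: never demote to 2ndary queue */
};

static __always_inline void cna_init_node(struct mcs_spinlock *node)
{
	struct cna_node *cn = (struct cna_node *)node;

	cn->intra_count = 0;
	/* sketch: prioritize irq/nmi contexts and RT tasks */
	cn->priority = !in_task() || rt_task(current);
}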

Regards,
Longman

2020-01-24 21:01:00

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 1:40 PM, Waiman Long wrote:
> On 1/24/20 1:19 PM, Alex Kogan wrote:
>>
>>
>>> On Jan 24, 2020, at 11:46 AM, Waiman Long <[email protected]> wrote:
>>>
>>> On 1/24/20 11:29 AM, Alex Kogan wrote:
>>>>
>>>>
>>>>> On Jan 24, 2020, at 10:19 AM, Waiman Long <[email protected]> wrote:
>>>>>
>>>>> On 1/24/20 9:42 AM, Waiman Long wrote:
>>>>>> On 1/24/20 2:52 AM, Peter Zijlstra wrote:
>>>>>>> On Thu, Jan 23, 2020 at 04:33:54PM -0500, Alex Kogan wrote:
>>>>>>>> Let me put this question to you. What do you think the number
>>>>>>>> should be?
>>>>>>> I think it would be very good to keep the inter-node latency
>>>>>>> below 1ms.
>>>>>> It is hard to guarantee that given that lock hold times can vary
>>>>>> quite a
>>>>>> lot depending on the workload. What we can control is just how many
>>>>>> later lock waiters can jump ahead before a given waiter.
>>>> I totally agree. I do not think you can guarantee that latency even
>>>> today.
>>>> With the existing spinlock, you join the queue and wait for as long
>>>> as it takes
>>>> for each and every thread in front of you to execute its critical
>>>> section.
>>>>
>>>>>>> But to realize that we need data on the lock hold times.
>>>>>>> Specifically
>>>>>>> for the heavily contended locks that make CNA worth it in the first
>>>>>>> place.
>>>>>>>
>>>>>>> I don't see that data, so I don't see how we can argue about
>>>>>>> this let
>>>>>>> alone call something reasonable.
>>>>>>>
>>>>>> In essence, CNA lock is for improving throughput on NUMA machines
>>>>>> at the
>>>>>> expense of increasing worst case latency. If low latency is
>>>>>> important,
>>>>>> it should be disabled. If CONFIG_PREEMPT_RT is on,
>>>>>> CONFIG_NUMA_AWARE_SPINLOCKS should be off.
>>>>>
>>>>> Actually, what we are worrying about is the additional latency that
>>>>> can be added to important tasks or execution contexts that are
>>>>> waiting for a lock. Maybe we can make the CNA lock behave somewhat
>>>>> like qrwlock in that requests from interrupt context are given
>>>>> priority. We could add a priority flag in the CNA node. If the flag
>>>>> is set, we will never put it into the secondary queue. In fact, we
>>>>> can transfer control next to it even if it is not on the same node.
>>>>> We may also set the priority flag if it is an RT task that is trying
>>>>> to acquire the lock.
>>>> I think this is possible, and in fact, we have been thinking along
>>>> those lines about ways to better support RT tasks with CNA. However,
>>>> this will _probably_ require changes to the API and will _certainly_
>>>> complicate the code quite a bit.
>>>
>>> What you need to do is to modify cna_init_node() to check the
>>> current locking context and set the priority flag accordingly.
>>>
>> Is there a lightweight way to identify such a “prioritized” thread?
>
> You can use the in_task() macro in include/linux/preempt.h. This is
> just a percpu preempt_count read and test. If in_task() is false, it
> is in a {soft|hard}irq or nmi context. If it is true, you can check
> the rt_task() macro to see if it is an RT task. That will require
> access to the current task structure. So it may cost a little bit more
> if you want to handle the RT task the same way.
>
We may not need to do that for softIRQ context. If that is the case, you
can use in_irq() which checks for hardirq and nmi only. Peter, what is
your thought on that?

Cheers,
Longman


2020-01-24 21:13:53

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 3:09 PM, Alex Kogan wrote:
>>> We also probably do not want those “prioritized” threads to disrupt
>>> normal
>>> CNA operation. E.g., if the main queue looks like T1_1, P2_1, T1_2,
>>> …, where
>>> T1_x is a thread running on node 1 and P2_1 is a prioritized thread
>>> running
>>> on node 2, we want to pass the lock from T1_1 to P2_1 and then to T1_2
>>> (rather than have P2_1 to scan for another thread on node 2).
>>>
>>> There is a way to achieve that — when we pass the lock to P2_1,
>>> we can set its numa node field to 1. This means that we need to
>>> reset the numa
>>> node field in cna_init_node(), but AFAICT this is relatively cheap.
>>> The rest
>>> of the CNA logic should not change.
>>
>> I won't recommend doing that. If the lock cacheline has been moved
>> from node 1 to 2, I will say it is better to stick with node 2 rather
>> than switching back to node 1. That will mean that the secondary
>> queue may contain lock waiters from the same nodes, but they will
>> eventually be flushed back to the primary queue.
>>
> That’s right, assuming we do not reset intra_node count when
> transferring the
> lock to a prioritized thread from another node. Otherwise, we may starve
> waiters in the secondary queue. 
>
> Still, that can make the lock even less fair to non-prioritized
> threads. When
> you flush the secondary queue, the preference may remain with the same
> node. This will not happen in the current form of CNA, as we never get 
> threads from the preferred node in the secondary queue.

That is true.

However, it is no different from the current scheme, where a waiter from
another node may have to wait for 64k other waiters to go first before
it has a chance to get the lock. Now that waiter can be from the same
node as well.

>
> Besides, I think that will decrease the benefit of CNA, especially if such
> prioritized threads are frequent, switching the preference from one node
> to another. But this is something I can evaluate, unless
> there is some fundamental objection to the idea of tweaking the numa
> node of prioritized threads.

Usually the locks used in interrupt context are not contended to the
point that CNA can kick in. Of course, there are exceptions in some
circumstances, and we do not want to introduce additional latency to
their lock waiting time.

Cheers,
Longman

2020-01-24 21:32:21

by Alex Kogan

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA



> On Jan 24, 2020, at 4:12 PM, Waiman Long <[email protected]> wrote:
>
> On 1/24/20 3:09 PM, Alex Kogan wrote:
>>>> We also probably do not want those “prioritized” threads to disrupt
>>>> normal
>>>> CNA operation. E.g., if the main queue looks like T1_1, P2_1, T1_2,
>>>> …, where
>>>> T1_x is a thread running on node 1 and P2_1 is a prioritized thread
>>>> running
>>>> on node 2, we want to pass the lock from T1_1 to P2_1 and then to T1_2
>>>> (rather than have P2_1 to scan for another thread on node 2).
>>>>
>>>> There is a way to achieve that — when we pass the lock to P2_1,
>>>> we can set its numa node field to 1. This means that we need to
>>>> reset the numa
>>>> node field in cna_init_node(), but AFAICT this is relatively cheap.
>>>> The rest
>>>> of the CNA logic should not change.
>>>
>>> I won't recommend doing that. If the lock cacheline has been moved
>>> from node 1 to 2, I will say it is better to stick with node 2 rather
>>> than switching back to node 1. That will mean that the secondary
>>> queue may contain lock waiters from the same nodes, but they will
>>> eventually be flushed back to the primary queue.
>>>
>> That’s right, assuming we do not reset intra_node count when
>> transferring the
>> lock to a prioritized thread from another node. Otherwise, we may starve
>> waiters in the secondary queue.
>>
>> Still, that can make the lock even less fair to non-prioritized
>> threads. When
>> you flush the secondary queue, the preference may remain with the same
>> node. This will not happen in the current form of CNA, as we never get
>> threads from the preferred node in the secondary queue.
>
> That is true.
>
> However, it is no different from the current scheme, where a waiter from
> another node may have to wait for 64k other waiters to go first before
> it has a chance to get the lock. Now that waiter can be from the same
> node as well.

The difference is that in the current form of CNA, the preferred node _will_
change after 64k lock transitions. In the change you propose, this is no
longer the case. It may take another ~64k transitions for that to happen.
More generally, I think this makes the analysis of the lock behavior more
convoluted.

I think we should treat those prioritized threads as “wild” cards, passing the
lock through them, but keeping the preferred node intact. This will potentially
cost one extra lock migration, but will make reasoning about the lock
behavior easier.
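
As a sketch of that "wild card" hand-off (building on the hypothetical
priority flag discussed above, so all names here are assumptions), the
lock-passing path could overwrite the prioritized waiter's node with the
current preference rather than adopting its own:

	if (next_cn->priority) {
		/* keep the current preferred node intact */
		next_cn->numa_node = cn->numa_node;
		/* and do not reset the fairness counter */
		next_cn->intra_count = cn->intra_count;
	}

with numa_node then re-initialized in cna_init_node() on the next
acquisition attempt, as mentioned earlier in the thread.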

Regards,
— Alex

2020-01-25 00:43:26

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 4:27 PM, Alex Kogan wrote:
>
>> On Jan 24, 2020, at 4:12 PM, Waiman Long <[email protected]> wrote:
>>
>> On 1/24/20 3:09 PM, Alex Kogan wrote:
>>>>> We also probably do not want those “prioritized” threads to disrupt
>>>>> normal
>>>>> CNA operation. E.g., if the main queue looks like T1_1, P2_1, T1_2,
>>>>> …, where
>>>>> T1_x is a thread running on node 1 and P2_1 is a prioritized thread
>>>>> running
>>>>> on node 2, we want to pass the lock from T1_1 to P2_1 and then to T1_2
>>>>> (rather than have P2_1 to scan for another thread on node 2).
>>>>>
>>>>> There is a way to achieve that — when we pass the lock to P2_1,
>>>>> we can set its numa node field to 1. This means that we need to
>>>>> reset the numa
>>>>> node field in cna_init_node(), but AFAICT this is relatively cheap.
>>>>> The rest
>>>>> of the CNA logic should not change.
>>>> I won't recommend doing that. If the lock cacheline has been moved
>>>> from node 1 to 2, I will say it is better to stick with node 2 rather
>>>> than switching back to node 1. That will mean that the secondary
>>>> queue may contain lock waiters from the same nodes, but they will
>>>> eventually be flushed back to the primary queue.
>>>>
>>> That’s right, assuming we do not reset intra_node count when
>>> transferring the
>>> lock to a prioritized thread from another node. Otherwise, we may starve
>>> waiters in the secondary queue.
>>>
>>> Still, that can make the lock even less fair to non-prioritized
>>> threads. When
>>> you flush the secondary queue, the preference may remain with the same
>>> node. This will not happen in the current form of CNA, as we never get
>>> threads from the preferred node in the secondary queue.
>> That is true.
>>
>> However, it is no different from the current scheme, where a waiter from
>> another node may have to wait for 64k other waiters to go first before
>> it has a chance to get the lock. Now that waiter can be from the same
>> node as well.
> The difference is that in the current form of CNA, the preferred node _will_
> change after 64k lock transitions. In the change you propose, this is no
> longer the case. It may take another ~64k transitions for that to happen.
> More generally, I think this makes the analysis of the lock behavior more
> convoluted.
>
> I think we should treat those prioritized threads as “wild” cards, passing the
> lock through them, but keeping the preferred node intact. This will potentially
> cost one extra lock migration, but will make reasoning about the lock
> behavior easier.

It seems like you prefer mathematical purity over practical
considerations. I won't object to that if you insist that is the right
way to go. Just be mindful that you may need to add a preferred-node
value to do that. It will also complicate the code, but it is your choice.

Cheers,
Longman

2020-01-25 11:18:51

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Fri, Jan 24, 2020 at 11:46:53AM -0500, Waiman Long wrote:
> I also thought about that. As you said, it can be hard to guarantee that
> reliable time value can be retrieved in a timely manner across all the
> archs.

Remember that this code is limited to 64-bit archs that have NUMA; my
quick grep says that is limited to:

alpha, arm64, ia64, mips, powerpc, s390, sparc, x86

afaict, x86 is the one with the worst clocks among the lot of them
(with the exception of ia64, which has been completely buggered for a
while and nobody cares).

> Even if we can do that, we will introduce latency to important
> tasks or contexts. I like the first approach better.

In general, the kernel has no clues what is actually important.

2020-01-25 11:22:12

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Fri, Jan 24, 2020 at 01:19:05PM -0500, Alex Kogan wrote:

> Is there a lightweight way to identify such a “prioritized” thread?

No; people might for instance care about tail latencies between their
identically spec'ed worker tasks.

In general it turns out that priority is a dynamic concept, not a static
one.

2020-01-25 11:23:14

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Fri, Jan 24, 2020 at 01:51:34PM -0500, Waiman Long wrote:

< 71 lines of garbage >

> > You can use the in_task() macro in include/linux/preempt.h. This is
> > just a percpu preempt_count read and test. If in_task() is false, it
> > is in a {soft|hard}irq or nmi context. If it is true, you can check
> > the rt_task() macro to see if it is an RT task. That will require
> > access to the current task structure. So it may cost a little bit more
> > if you want to handle the RT task the same way.
> >
> We may not need to do that for softIRQ context. If that is the case, you
> can use in_irq() which checks for hardirq and nmi only. Peter, what is
> your thought on that?

Can you lot please start trimming emails when you reply?

2020-01-25 19:58:55

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 1/24/20 1:51 PM, Waiman Long wrote:
>> You can use the in_task() macro in include/linux/preempt.h. This is
>> just a percpu preempt_count read and test. If in_task() is false, it
>> is in a {soft|hard}irq or nmi context. If it is true, you can check
>> the rt_task() macro to see if it is an RT task. That will require
>> access to the current task structure. So it may cost a little bit more
>> if you want to handle the RT task the same way.
>>
> We may not need to do that for softIRQ context. If that is the case, you
> can use in_irq() which checks for hardirq and nmi only. Peter, what is
> your thought on that?

On second thought, we should do that for softIRQ as well. We may also
want to check whether irqs_disabled() is true, as set by calls like
spin_lock_irq() or spin_lock_irqsave(). We do not want to unnecessarily
prolong the irq-off period.
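
A sketch of such a check (a hypothetical helper, not part of the patch;
in_task(), irqs_disabled() and rt_task() are the existing kernel
predicates mentioned above) could look like:

static __always_inline bool cna_prioritize_waiter(void)
{
	/* hardirq, softirq and nmi contexts should not be made to wait */
	if (!in_task())
		return true;

	/* spin_lock_irq()/spin_lock_irqsave() callers: keep irq-off short */
	if (irqs_disabled())
		return true;

	/* RT tasks care about latency more than throughput */
	return rt_task(current);
}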

Cheers,
Longman

2020-01-30 22:08:43

by Alex Kogan

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA


> On Jan 25, 2020, at 6:19 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Fri, Jan 24, 2020 at 01:19:05PM -0500, Alex Kogan wrote:
>
>> Is there a lightweight way to identify such a “prioritized” thread?
>
> No; people might for instance care about tail latencies between their
> identically spec'ed worker tasks.

I would argue that those users need to tune/reduce the intra-node handoff
threshold for their needs. Or disable CNA altogether.

In general, Peter, it seems like you are not on board with the way Longman
suggested handling prioritized threads. Am I right?

Thanks,
— Alex

2020-02-03 15:34:18

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Thu, Jan 30, 2020 at 05:05:28PM -0500, Alex Kogan wrote:
>
> > On Jan 25, 2020, at 6:19 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Fri, Jan 24, 2020 at 01:19:05PM -0500, Alex Kogan wrote:
> >
> >> Is there a lightweight way to identify such a “prioritized” thread?
> >
> > No; people might for instance care about tail latencies between their
> > identically spec'ed worker tasks.
>
> I would argue that those users need to tune/reduce the intra-node handoff
> threshold for their needs. Or disable CNA altogether.

I really don't like boot time arguments (or tunables in generic) for a
machine to work as it should.

The default really should 'just work'.

> In general, Peter, seems like you are not on board with the way Longman
> suggested to handle prioritized threads. Am I right?

Right.

Presumably you have a workload where CNA is actually a win? That is,
what inspired you to go down this road? Which actual kernel lock is so
contended on NUMA machines that we need to do this?

2020-02-03 17:18:18

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 2/3/20 8:45 AM, Peter Zijlstra wrote:
> On Thu, Jan 30, 2020 at 05:05:28PM -0500, Alex Kogan wrote:
>>> On Jan 25, 2020, at 6:19 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Fri, Jan 24, 2020 at 01:19:05PM -0500, Alex Kogan wrote:
>>>
>>>> Is there a lightweight way to identify such a “prioritized” thread?
>>> No; people might for instance care about tail latencies between their
>>> identically spec'ed worker tasks.
>> I would argue that those users need to tune/reduce the intra-node handoff
>> threshold for their needs. Or disable CNA altogether.
> I really don't like boot time arguments (or tunables in generic) for a
> machine to work as it should.
>
> The default really should 'just work'.
That will be the ideal case. In reality, it usually takes a while for
the code to mature enough to do some kind of self-tuning. In the
meantime, having some configuration options available gives us more time
to figure out what the best configuration should be.
>> In general, Peter, seems like you are not on board with the way Longman
>> suggested to handle prioritized threads. Am I right?
> Right.
>
> Presumably you have a workload where CNA is actually a win? That is,
> what inspired you to go down this road? Which actual kernel lock is so
> contended on NUMA machines that we need to do this?

Today, a 2-socket Rome server can have 128 cores and 256 threads. If we
scale up more, we could easily have more than 1000 threads in a system.
With that many logical cpus available, it is easy to envision some heavy
spinlock contention can happen fairly regularly. This patch can
alleviate the congestion and improve performance under that
circumstance. Of course, the specific locks that are contended will
depend on the workloads.

Cheers,
Longman

2020-02-03 17:27:40

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Mon, Feb 03, 2020 at 09:59:12AM -0500, Waiman Long wrote:
> On 2/3/20 8:45 AM, Peter Zijlstra wrote:

> > Presumably you have a workload where CNA is actually a win? That is,
> > what inspired you to go down this road? Which actual kernel lock is so
> > contended on NUMA machines that we need to do this?
>
> Today, a 2-socket Rome server can have 128 cores and 256 threads. If we
> scale up more, we could easily have more than 1000 threads in a system.
> With that many logical cpus available, it is easy to envision some heavy
> spinlock contention can happen fairly regularly. This patch can
> alleviate the congestion and improve performance under that
> circumstance. Of course, the specific locks that are contended will
> depend on the workloads.

Not the point. If there isn't an issue today, we don't have anything to
fix.

Furthermore, we've always addressed specific issues by looking at the
locking granularity first.

So again, what specific lock inspired all these patches?

2020-02-03 17:47:11

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 2/3/20 10:28 AM, Peter Zijlstra wrote:
> On Mon, Feb 03, 2020 at 09:59:12AM -0500, Waiman Long wrote:
>> On 2/3/20 8:45 AM, Peter Zijlstra wrote:
>>> Presumably you have a workload where CNA is actually a win? That is,
>>> what inspired you to go down this road? Which actual kernel lock is so
>>> contended on NUMA machines that we need to do this?
>> Today, a 2-socket Rome server can have 128 cores and 256 threads. If we
>> scale up more, we could easily have more than 1000 threads in a system.
>> With that many logical cpus available, it is easy to envision some heavy
>> spinlock contention can happen fairly regularly. This patch can
>> alleviate the congestion and improve performance under that
>> circumstance. Of course, the specific locks that are contended will
>> depend on the workloads.
> Not the point. If there isn't an issue today, we don't have anything to
> fix.
>
> Furthermore, we've always adressed specific issues by looking at the
> locking granularity, first.

You are right in that. Unlike the ticket spinlock, where performance can
drop precipitously over a cliff when there is heavy contention, qspinlock
won't have this kind of performance drop. My suspicion is that slowdowns
caused by heavy spinlock contention in actual workloads are likely to be
more transient in nature and harder to pinpoint. These days, I seldom
get bug reports related to heavy spinlock contention.

>
> So again, what specific lock inspired all these patches?
>
Maybe Alex has some data to share.

Cheers,
Longman

2020-02-04 17:30:30

by Peter Zijlstra

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On Tue, Feb 04, 2020 at 11:54:02AM -0500, Alex Kogan wrote:
> > On Feb 3, 2020, at 10:47 AM, Waiman Long <[email protected]> wrote:
> >
> > On 2/3/20 10:28 AM, Peter Zijlstra wrote:
> >> On Mon, Feb 03, 2020 at 09:59:12AM -0500, Waiman Long wrote:
> >>> On 2/3/20 8:45 AM, Peter Zijlstra wrote:
> >>>> Presumably you have a workload where CNA is actually a win? That is,
> >>>> what inspired you to go down this road? Which actual kernel lock is so
> >>>> contended on NUMA machines that we need to do this?

> There are quite a few actually. files_struct.file_lock, file_lock_context.flc_lock
> and lockref.lock are some concrete examples that get very hot in will-it-scale
> benchmarks.

Right, that's all a variant of banging on the same resources across
nodes. I'm not sure there's anything fundamental we can fix there.

> And then there are spinlocks in __futex_data.queues,
> which get hot when applications have contended (pthread) locks —
> LevelDB is an example.

A numa aware rework of futexes has been on the todo list for years :/

> Our initial motivation was based on an observation that kernel qspinlock is not
> NUMA-aware. So what, you may ask. Much like people realized in the past that
> global spinning is bad for performance, and they switched from ticket lock to
> locks with local spinning (e.g., MCS), I think everyone would agree these days that
> bouncing a lock (and cache lines in general) across numa nodes is similarly bad.
> And as CNA demonstrates, we are easily leaving 2-3x speedups on the table by
> doing just that with the current qspinlock.

Actual benchmarks with performance numbers are required. It helps
motivate the patches as well as gives reviewers clues on how to
reproduce / inspect the claims made.

2020-02-04 17:41:06

by Waiman Long

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA

On 2/4/20 12:27 PM, Peter Zijlstra wrote:
> On Tue, Feb 04, 2020 at 11:54:02AM -0500, Alex Kogan wrote:
>>> On Feb 3, 2020, at 10:47 AM, Waiman Long <[email protected]> wrote:
>>>
>>> On 2/3/20 10:28 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 03, 2020 at 09:59:12AM -0500, Waiman Long wrote:
>>>>> On 2/3/20 8:45 AM, Peter Zijlstra wrote:
>>>>>> Presumably you have a workload where CNA is actually a win? That is,
>>>>>> what inspired you to go down this road? Which actual kernel lock is so
>>>>>> contended on NUMA machines that we need to do this?
>> There are quite a few actually. files_struct.file_lock, file_lock_context.flc_lock
>> and lockref.lock are some concrete examples that get very hot in will-it-scale
>> benchmarks.
> Right, that's all a variant of banging on the same resources across
> nodes. I'm not sure there's anything fundamental we can fix there.
>
>> And then there are spinlocks in __futex_data.queues,
>> which get hot when applications have contended (pthread) locks —
>> LevelDB is an example.
> A numa aware rework of futexes has been on the todo list for years :/
Now, we are going to get that for free with this patchset :-)
>
>> Our initial motivation was based on an observation that kernel qspinlock is not
>> NUMA-aware. So what, you may ask. Much like people realized in the past that
>> global spinning is bad for performance, and they switched from ticket lock to
>> locks with local spinning (e.g., MCS), I think everyone would agree these days that
>> bouncing a lock (and cache lines in general) across numa nodes is similarly bad.
>> And as CNA demonstrates, we are easily leaving 2-3x speedups on the table by
>> doing just that with the current qspinlock.
> Actual benchmarks with performance numbers are required. It helps
> motivate the patches as well as gives reviewers clues on how to
> reproduce / inspect the claims made.
>
I think the cover-letter does have some benchmark results listed. Are
you saying that some benchmark results should be put into the individual
patches themselves?

Cheers,
Longman

2020-02-04 17:55:54

by Alex Kogan

Subject: Re: [PATCH v8 4/5] locking/qspinlock: Introduce starvation avoidance into CNA



> On Feb 4, 2020, at 12:39 PM, Waiman Long <[email protected]> wrote:
>
> On 2/4/20 12:27 PM, Peter Zijlstra wrote:
>> On Tue, Feb 04, 2020 at 11:54:02AM -0500, Alex Kogan wrote:
>>>> On Feb 3, 2020, at 10:47 AM, Waiman Long <[email protected]> wrote:
>>>>
>>>> On 2/3/20 10:28 AM, Peter Zijlstra wrote:
>>>>> On Mon, Feb 03, 2020 at 09:59:12AM -0500, Waiman Long wrote:
>>>>>> On 2/3/20 8:45 AM, Peter Zijlstra wrote:
>>>>>>> Presumably you have a workload where CNA is actually a win? That is,
>>>>>>> what inspired you to go down this road? Which actual kernel lock is so
>>>>>>> contended on NUMA machines that we need to do this?
>>> There are quite a few actually. files_struct.file_lock, file_lock_context.flc_lock
>>> and lockref.lock are some concrete examples that get very hot in will-it-scale
>>> benchmarks.
>> Right, that's all a variant of banging on the same resources across
>> nodes. I'm not sure there's anything fundamental we can fix there.
Not much, except gaining that 2x from a better lock.

>>
>>> And then there are spinlocks in __futex_data.queues,
>>> which get hot when applications have contended (pthread) locks —
>>> LevelDB is an example.
>> A numa aware rework of futexes has been on the todo list for years :/
> Now, we are going to get that for free with this patchset:-)
Exactly!!

>>
>>> Our initial motivation was based on an observation that kernel qspinlock is not
>>> NUMA-aware. So what, you may ask. Much like people realized in the past that
>>> global spinning is bad for performance, and they switched from ticket lock to
>>> locks with local spinning (e.g., MCS), I think everyone would agree these days that
>>> bouncing a lock (and cache lines in general) across numa nodes is similarly bad.
>>> And as CNA demonstrates, we are easily leaving 2-3x speedups on the table by
>>> doing just that with the current qspinlock.
>> Actual benchmarks with performance numbers are required. It helps
>> motivate the patches as well as gives reviewers clues on how to
>> reproduce / inspect the claims made.
>>
> I think the cover-letter does have some benchmark results listed.
To clarify on that, I _used_ to include benchmark results in the cover letter
for previous revisions. I stopped doing that as the changes between revisions
were rather minor — maybe that is the missing part? If so, my apologies, I can
certainly publish them again.

The point is that we have numbers for actual benchmarks, and the kernel build
robot has sent quite a few reports of positive performance improvements in
AIM7 and other benchmarks due to CNA. ARM folks also noticed improvements in
their benchmarks, although I think those were mostly microbenchmarks; still,
it is evident that the improvements are cross-platform.

Regards,
— Alex