2023-12-29 20:52:23

by David Laight

Subject: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/[email protected]/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is a pretty much guaranteed cache line reload when
reading node->prev->cpu for the vcpu_is_preempted() check in the wakeup path,
which slows it down (even on bare metal).
(On bare metal the hypervisor call is patched out, but the argument is still read.)

Careful analysis shows that it isn't necessary to dirty the per-cpu data
on the osq_lock() fast path. This may be slightly beneficial.

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
	&per_cpu_data[smp_processor_id()]->member
i.e. there is a real function call, an array index and an add.
However, if raw_cpu_read() can be used instead (typically just an offset
from a register - %gs for x86-64) the code will be faster.
Putting the address of the per-cpu node into itself means that only one
cache line needs to be loaded.
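
Roughly, the difference is (illustrative sketch only; the 'self' field is
what patch 4/5 adds, and the exact code generated depends on compiler and
config):

	/* current code: function call + per-cpu offset lookup + add */
	node = this_cpu_ptr(&osq_node);

	/* with patch 4/5: a single %gs-relative load of a pointer that
	 * the node stores back to itself on first use.
	 */
	node = raw_cpu_read(osq_node.self);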

I can't see a list of per-cpu data initialisation functions, so the fields
are initialised on the first osq_lock() call.

The last patch avoids the cache line reload when calling vcpu_is_preempted()
by simply saving node->prev->cpu as node->prev_cpu and updating it when
node->prev changes.
This is simpler than the patch proposed by Waiman.
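
In outline, patch 5/5 does (sketch, see the full diff below):

	/* queueing path: remember the previous cpu's actual number */
	node->prev_cpu = old - 1;

	/* spin loop: no need to follow node->prev to get the cpu # */
	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(node->prev_cpu)))
		return true;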

David Laight (5):
Move the definition of optimistic_spin_node into osf_lock.c
Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
Clarify osq_wait_next()
Optimise per-cpu data accesses.
Optimise vcpu_is_preempted() check.

include/linux/osq_lock.h | 5 ----
kernel/locking/osq_lock.c | 61 +++++++++++++++++++++------------------
2 files changed, 33 insertions(+), 33 deletions(-)

--
2.17.1




2023-12-29 20:54:19

by David Laight

Subject: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight <[email protected]>
---
include/linux/osq_lock.h | 5 -----
kernel/locking/osq_lock.c | 7 +++++++
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
* An MCS like lock especially tailored for optimistic spinning for sleeping
* lock implementations (mutex, rwsem, etc).
*/
-struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
-};

struct optimistic_spin_queue {
/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
* called from interrupt context and we have preemption disabled while
* spinning.
*/
+
+struct optimistic_spin_node {
+ struct optimistic_spin_node *next, *prev;
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+};
+
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

/*
--
2.17.1



2023-12-29 20:56:33

by David Laight

Subject: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)
{
- struct optimistic_spin_node *next = NULL;
+ struct optimistic_spin_node *next;
int curr = node->cpu;
- int old;

/*
- * If there is a prev node in queue, then the 'old' value will be
- * the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
- * we're currently last in queue, then the queue will then become empty.
+ * If osq_lock() is being cancelled there must be a previous node
+ * and 'old' is its CPU #.
+ * For osq_unlock() there is never a previous node and old is set
+ * to OSQ_UNLOCKED_VAL.
*/
- old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;

for (;;) {
if (atomic_read(&lock->tail) == curr &&
@@ -69,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
* will now observe @lock and will complete its
* unlock()/unqueue().
*/
- break;
+ return NULL;
}

/*
@@ -85,13 +84,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
if (node->next) {
next = xchg(&node->next, NULL);
if (next)
- break;
+ return next;
}

cpu_relax();
}
-
- return next;
}

bool osq_lock(struct optimistic_spin_queue *lock)
@@ -192,7 +189,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* back to @prev.
*/

- next = osq_wait_next(lock, node, prev);
+ next = osq_wait_next(lock, node, prev->cpu);
if (!next)
return false;

@@ -232,7 +229,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
return;
}

- next = osq_wait_next(lock, node, NULL);
+ next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
if (next)
WRITE_ONCE(next->locked, 1);
}
--
2.17.1



2023-12-29 20:57:44

by David Laight

Subject: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
*/

struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
};
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,

bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
struct optimistic_spin_node *prev, *next;
int old;

- if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
- node->cpu = encode_cpu(smp_processor_id());
+ if (unlikely(!node)) {
+ int cpu = encode_cpu(smp_processor_id());
+ node = decode_cpu(cpu);
+ node->self = node;
+ node->cpu = cpu;
+ }

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
/*
* Second most likely case.
*/
- node = this_cpu_ptr(&osq_node);
+ node = raw_cpu_read(osq_node.self);
next = xchg(&node->next, NULL);
if (next) {
WRITE_ONCE(next->locked, 1);
--
2.17.1



2023-12-29 20:58:49

by David Laight

Subject: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@

struct optimistic_spin_node {
struct optimistic_spin_node *self, *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+ int prev_cpu; /* actual CPU # for vcpu_is_preempted() */
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline int node_cpu(struct optimistic_spin_node *node)
-{
- return node->cpu - 1;
-}
-
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

+ node->prev_cpu = old - 1;
prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* polling, be careful.
*/
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(node->prev_cpu)))
return true;

/* unqueue */
@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

+ WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

--
2.17.1



2023-12-29 21:01:16

by David Laight

Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;

/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;

prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);

/*
* Fast path for the uncontended case.
--
2.17.1



2023-12-29 22:12:07

by David Laight

Subject: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight <[email protected]>
---

Re-send without the 'RE:' on the subject line.

kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;

/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;

prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);

/*
* Fast path for the uncontended case.
--
2.17.1



2023-12-29 22:54:29

by Linus Torvalds

Subject: Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

On Fri, 29 Dec 2023 at 12:56, David Laight <[email protected]> wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.

Please split these two totally independent things out of the patch,
just to make things much more obvious.

I like the new calling convention, but I don't like how the patch
isn't obviously just that.

In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".

I'd also take the other part of #3 as a "clearly doesn't change
anything" but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.
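
I.e. something like (sketch of that separate clean-up):

	for (;;) {
		/* ... tail check as before ... */

		if (node->next) {
			struct optimistic_spin_node *next = xchg(&node->next, NULL);

			if (next)
				return next;
		}

		cpu_relax();
	}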

The other parts actually change some logic, and would need the OSQ
people to take a more serious look.

Linus

2023-12-30 01:59:55

by Waiman Long

Subject: Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

On 12/29/23 15:53, David Laight wrote:
> struct optimistic_spin_node is private to the implementation.
> Move it into the C file to ensure nothing is accessing it.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> include/linux/osq_lock.h | 5 -----
> kernel/locking/osq_lock.c | 7 +++++++
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..ea8fb31379e3 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -6,11 +6,6 @@
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> * lock implementations (mutex, rwsem, etc).
> */
> -struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> -};
>
> struct optimistic_spin_queue {
> /*
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..d414eef4bec6 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -11,6 +11,13 @@
> * called from interrupt context and we have preemption disabled while
> * spinning.
> */
> +
> +struct optimistic_spin_node {
> + struct optimistic_spin_node *next, *prev;
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> +};
> +
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> /*

Please correct the patch title "osf_lock.c" => "osq_lock.c".

After the fix, you can add

Acked-by: Waiman Long <[email protected]>


2023-12-30 02:56:01

by Waiman Long

Subject: Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()


On 12/29/23 15:56, David Laight wrote:
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
>
> Should have no effect on the generated code since gcc manages to
> assume that 'prev != NULL' due to an earlier dereference.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 23 ++++++++++-------------
> 1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 55f5db896c02..9bb3a077ba92 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> static inline struct optimistic_spin_node *
> osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *node,
> - struct optimistic_spin_node *prev)
> + int old)

Make the last argument name more descriptive, like "old_cpu" as the
"int" type does not provide enough context to allow people to guess what
"old" may be.

Cheers,
Longman



2023-12-30 03:08:19

by Waiman Long

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-84).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
> */
>
> struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> + struct optimistic_spin_node *self, *next, *prev;
> int locked; /* 1 if lock acquired */
> int cpu; /* encoded CPU # + 1 value */
> };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
   0x0000000000000029 <+25>:    mov    %rcx,%rdx
   0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
<osq_lock+36>

Which looks pretty optimized for me. Maybe older compiler may generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.

Cheers,
Longman


2023-12-30 03:14:00

by Waiman Long

Subject: Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.


On 12/29/23 15:58, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index b60b0add0161..89be63627434 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *self, *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> - return node->cpu - 1;
> -}
> -
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
> int cpu_nr = encoded_cpu_val - 1;
> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> + node->prev_cpu = old - 1;
> prev = decode_cpu(old);
> node->prev = prev;
> node->locked = 0;
> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(node->prev_cpu)))
> return true;
>
> /* unqueue */
> @@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * it will wait in Step-A.
> */
>
> + WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
Reviewed-by: Waiman Long <[email protected]>

>


2023-12-30 03:20:40

by Waiman Long

Subject: Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.


On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
>
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment,
> so locked can be set to zero just before that (along with the assignment
> to node->prev).
>
> Only initialise node->cpu once, after that use its value instead
> of smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight <[email protected]>
> ---
>
> Re-send without the 'RE:' on the subject line.
> kernel/locking/osq_lock.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *prev)
> {
> struct optimistic_spin_node *next = NULL;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = node->cpu;
> int old;
>
> /*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> - int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> - node->next = NULL;

I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not cleared
before it is used again. So you either have to prove that such a race
does not exist, or initialize it to NULL at the beginning like it is
today.

Cheers,
Longman

> - node->cpu = curr;
> + if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> + node->cpu = encode_cpu(smp_processor_id());
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> + old = atomic_xchg(&lock->tail, node->cpu);
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> prev = decode_cpu(old);
> node->prev = prev;
> + node->locked = 0;
>
> /*
> * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> void osq_unlock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node, *next;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = raw_cpu_read(osq_node.cpu);
>
> /*
> * Fast path for the uncontended case.


2023-12-30 11:09:35

by Ingo Molnar

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


* Waiman Long <[email protected]> wrote:

> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-84).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> > kernel/locking/osq_lock.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> > */
> > struct optimistic_spin_node {
> > - struct optimistic_spin_node *next, *prev;
> > + struct optimistic_spin_node *self, *next, *prev;
> > int locked; /* 1 if lock acquired */
> > int cpu; /* encoded CPU # + 1 value */
> > };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> My gcc 11 compiler produces the following x86-64 code:
>
> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> <osq_lock+36>
>
> Which looks pretty optimized for me. Maybe older compiler may generate more
> complex code. However, I do have some doubt as to the benefit of this patch
> at the expense of making the code a bit more complex.

GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.

Thanks,

Ingo

2023-12-30 11:36:02

by David Laight

Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Ingo Molnar
> Sent: 30 December 2023 11:09
>
>
> * Waiman Long <[email protected]> wrote:
>
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <[email protected]>
> > > ---
> > > kernel/locking/osq_lock.c | 14 +++++++++-----
> > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > > */
> > > struct optimistic_spin_node {
> > > - struct optimistic_spin_node *next, *prev;
> > > + struct optimistic_spin_node *self, *next, *prev;
> > > int locked; /* 1 if lock acquired */
> > > int cpu; /* encoded CPU # + 1 value */
> > > };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > > bool osq_lock(struct optimistic_spin_queue *lock)
> > > {
> > > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> > 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >    0x0000000000000029 <+25>:    mov    %rcx,%rdx
> >    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> > <osq_lock+36>
> >
> > Which looks pretty optimized for me. Maybe older compiler may generate more
> > complex code. However, I do have some doubt as to the benefit of this patch
> > at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
1f: 00
1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.

> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):
10: 49 89 fd mov %rdi,%r13
13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
16: R_X86_64_32S .data..percpu..shared_aligned
1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
1b: R_X86_64_PC32 debug_smp_processor_id-0x4
1f: 89 c0 mov %eax,%eax
21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
28: 00
25: R_X86_64_32S __per_cpu_offset
29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
2a: R_X86_64_PC32 debug_smp_processor_id-0x4
2e: 4c 01 e3 add %r12,%rbx
31: 83 c0 01 add $0x1,%eax
34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
42: 89 43 14 mov %eax,0x14(%rbx)
45: 41 87 45 00 xchg %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.

David



2023-12-30 15:50:22

by David Laight

Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

From: Waiman Long
> Sent: 30 December 2023 03:20
>
> On 12/29/23 17:11, David Laight wrote:
> > osq_lock() starts by setting node->next to NULL and node->locked to 0.
> > Careful analysis shows that node->next is always NULL on entry.
> >
> > node->locked is set non-zero by another cpu to force a wakeup.
> > This can only happen after the 'prev->next = node' assignment,
> > so locked can be set to zero just before that (along with the assignment
> > to node->prev).
> >
> > Only initialise node->cpu once, after that use its value instead
> > of smp_processor_id() - which is probably a real function call.
> >
> > Should reduce cache-line bouncing a little.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> >
> > Re-send without the 'RE:' on the subject line.
> > kernel/locking/osq_lock.c | 13 ++++++-------
> > 1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d414eef4bec6..55f5db896c02 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > struct optimistic_spin_node *prev)
> > {
> > struct optimistic_spin_node *next = NULL;
> > - int curr = encode_cpu(smp_processor_id());
> > + int curr = node->cpu;
> > int old;
> >
> > /*
> > @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > struct optimistic_spin_node *prev, *next;
> > - int curr = encode_cpu(smp_processor_id());
> > int old;
> >
> > - node->locked = 0;
> > - node->next = NULL;
>
> I have some concern about not clearing node->next at the beginning. I
> know that next is supposed to be cleared at the end. I do have some
> worry that there may exist a race condition that leaves next not cleared
> before it is used again. So you either have to prove that such a race
> does not exist, or initialize it to NULL at the beginning like it is
> today.

I've stared at the code some more :-)

There are two places where node->next is written non-NULL, both in osq_lock().
The first is at the top of the slow-path where prev->next = node
(this overwrites the NULL that was set (or not) on entry).
The second is at the bottom of osq_lock(), prev->next = next (Step C),
where the NULL written in Step A is overwritten with the correct 'next' node.
After either of those the 'node' cpu must later either take the unqueue
exit from osq_lock() or call osq_unlock().
Both require it to wait for node->next to be non-NULL and then NULL it.

If code calls osq_lock() twice all bets are off!

Even if (somehow) node->next was non-NULL on entry it will be set by an
osq_lock() call from another cpu.
The only problem would be if osq_unlock() were called before the queueing
cpu set prev->next = node.
That in itself is unlikely - but would happen if node->next were always set.

I don't completely understand the 'acquire'/'release' semantics (they didn't
exist when I started doing SMP kernel code in the late 1980s).
But it looks odd that osq_unlock()'s fast path uses _release but the very
similar code in osq_wait_next() uses _acquire.

Indeed, apart from some (assumed) optimisations, I think osq_unlock()
could just be:
	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;

I don't think the order of the tests for lock->tail and node->next
matters in osq_wait_next().
If they were swapped the 'Second most likely case' code from osq_unlock()
could be removed.
(The 'uncontended case' doesn't need to load the address of 'node'.)

David



2023-12-30 15:57:53

by Waiman Long

Subject: Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

On 12/29/23 22:13, Waiman Long wrote:
>
> On 12/29/23 15:58, David Laight wrote:
>> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
>>    cpu is no longer running.
>> Although patched out for bare-metal the code still needs the cpu number.
>> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
>> cache miss
>> when osq_unlock() is waking up the next cpu.
>>
>> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
>> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>>
>> This is simpler than checking for 'node->prev' changing and caching
>> 'prev->cpu'.
>>
>> Signed-off-by: David Laight <[email protected]>
>> ---
>>   kernel/locking/osq_lock.c | 14 ++++++--------
>>   1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index b60b0add0161..89be63627434 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -14,8 +14,9 @@
>>     struct optimistic_spin_node {
>>       struct optimistic_spin_node *self, *next, *prev;
>> -    int locked; /* 1 if lock acquired */
>> -    int cpu; /* encoded CPU # + 1 value */
>> +    int locked;    /* 1 if lock acquired */
>> +    int cpu;       /* encoded CPU # + 1 value */
>> +    int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
>>   };
>>     static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>>       return cpu_nr + 1;
>>   }
>>   -static inline int node_cpu(struct optimistic_spin_node *node)
>> -{
>> -    return node->cpu - 1;
>> -}
>> -
>>   static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>>   {
>>       int cpu_nr = encoded_cpu_val - 1;
>> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>       if (old == OSQ_UNLOCKED_VAL)
>>           return true;
>>   +    node->prev_cpu = old - 1;
>>       prev = decode_cpu(old);
>>       node->prev = prev;
>>       node->locked = 0;
>> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>        * polling, be careful.
>>        */
>>       if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
>> -                  vcpu_is_preempted(node_cpu(node->prev))))
>> +                  vcpu_is_preempted(node->prev_cpu)))

On second thought, I believe it is more correct to use READ_ONCE() to
access "node->prev_cpu" as this field is subject to change by a
WRITE_ONCE().
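
I.e. something like (sketch):

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(READ_ONCE(node->prev_cpu))))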

Cheers,
Longman



2023-12-30 19:41:11

by Linus Torvalds

Subject: Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
>
> David Laight (5):
> Move the definition of optimistic_spin_node into osf_lock.c
> Clarify osq_wait_next()

I took these two as preparatory independent patches, with that
osq_wait_next() clarification split into two.

I also did the renaming that Waiman asked for.

Linus

2023-12-30 20:37:46

by Ingo Molnar

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


* David Laight <[email protected]> wrote:

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;

This whole initialization sequence is suboptimal and needs to be
cleaned up first: the node->cpu field is constant once initialized, so
it should be initialized from appropriate init methods, not runtime in
osq_lock(), right?

Eliminating that initialization branch is a useful micro-optimization
in itself for the hot path.

Thanks,

Ingo

2023-12-30 20:41:44

by Linus Torvalds

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On Fri, 29 Dec 2023 at 12:57, David Laight <[email protected]> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;
> + }

The proper fix here is to not do that silly

node = this_cpu_ptr(&osq_node);
..
node->next = NULL;

dance at all, but to simply do

this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.
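
So the fast path would look something like this (rough sketch of the idea
only, not the attached diff; the remaining node initialisation is not
shown):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		int curr = encode_cpu(smp_processor_id());
		int old;

		/* single %gs-relative store, no node pointer needed yet */
		this_cpu_write(osq_node.next, NULL);

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;

		/* contended slow path as before */
		...
	}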

Yes, you'll eventually end up doing that

node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

subl $1, %eax #, cpu_nr
...
cltq
addq __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

addq __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

unsigned int cpu_nr = encoded_cpu_val;

return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

Linus


Attachments:
patch.diff (1.06 kB)

2023-12-30 20:59:38

by Linus Torvalds

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<[email protected]> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

-#define per_cpu_offset(x) (__per_cpu_offset[x])
+#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

subl $1, %eax #, cpu_nr
addq __per_cpu_offset(,%rax,8), %rcx

instead of just doing

addq __per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

Linus

2023-12-30 22:38:23

by David Laight

Subject: RE: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

From: Waiman Long
> Sent: 30 December 2023 15:57
>
> On 12/29/23 22:13, Waiman Long wrote:
> >
> > On 12/29/23 15:58, David Laight wrote:
> >> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> >>    cpu is no longer running.
> >> Although patched out for bare-metal the code still needs the cpu number.
> >> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
> >> cache miss
> >> when osq_unlock() is waking up the next cpu.
> >>
> >> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> >> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
> >>
> >> This is simpler than checking for 'node->prev' changing and caching
> >> 'prev->cpu'.
> >>
> >> Signed-off-by: David Laight <[email protected]>
> >> ---
> >>   kernel/locking/osq_lock.c | 14 ++++++--------
> >>   1 file changed, 6 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index b60b0add0161..89be63627434 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -14,8 +14,9 @@
> >>     struct optimistic_spin_node {
> >>       struct optimistic_spin_node *self, *next, *prev;
> >> -    int locked; /* 1 if lock acquired */
> >> -    int cpu; /* encoded CPU # + 1 value */
> >> +    int locked;    /* 1 if lock acquired */
> >> +    int cpu;       /* encoded CPU # + 1 value */
> >> +    int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
> >>   };
> >>     static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
> >> osq_node);
> >> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> >>       return cpu_nr + 1;
> >>   }
> >>   -static inline int node_cpu(struct optimistic_spin_node *node)
> >> -{
> >> -    return node->cpu - 1;
> >> -}
> >> -
> >>   static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >>   {
> >>       int cpu_nr = encoded_cpu_val - 1;
> >> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >>       if (old == OSQ_UNLOCKED_VAL)
> >>           return true;
> >>   +    node->prev_cpu = old - 1;
> >>       prev = decode_cpu(old);
> >>       node->prev = prev;
> >>       node->locked = 0;
> >> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >>        * polling, be careful.
> >>        */
> >>       if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> >> -                  vcpu_is_preempted(node_cpu(node->prev))))
> >> +                  vcpu_is_preempted(node->prev_cpu)))
>
> On second thought, I believe it is more correct to use READ_ONCE() to
> access "node->prev_cpu" as this field is subject to change by a
> WRITE_ONCE().

It can be done...

Aren't pretty much all the 'node' members accessed like that?
There are a sprinkling of READ_ONCE() and WRITE_ONCE() but they
are not always used.

Maybe the structure member(s) should just be marked 'volatile' :-)
That should have exactly the same effect as the volatile cast
inside READ/WRITE_ONCE().
(I know there is a document about not using volatile...)
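
(For reference, stripped of the type-checking wrappers they are roughly:

	#define __READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))
	#define __WRITE_ONCE(x, val)	do { *(volatile typeof(x) *)&(x) = (val); } while (0)

so a volatile member would get much the same accesses everywhere.)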

I've not actually checked whether the two existing WRITE_ONCE()
in 'Step C' need to be ordered - and whether that is guaranteed
by the code, especially on our good old friend 'Alpha' (is that
horrid cache system still supported?).

David


2023-12-30 22:39:46

by David Laight

Subject: RE: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

From: Linus Torvalds
> Sent: 30 December 2023 19:41
>
> On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
> >
> > David Laight (5):
> > Move the definition of optimistic_spin_node into osf_lock.c
> > Clarify osq_wait_next()
>
> I took these two as preparatory independent patches, with that
> osq_wait_next() clarification split into two.
>
> I also did the renaming that Waiman asked for.

Ok, I'll try to remove them from any respin.
(Or at least put them first!)

David


2023-12-30 22:47:49

by David Laight

Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Ingo Molnar
> Sent: 30 December 2023 20:38
>
> * David Laight <[email protected]> wrote:
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
>
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not runtime in
> osq_lock(), right?

I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.

David



2023-12-31 02:14:58

by Waiman Long

Subject: Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code


On 12/30/23 17:39, David Laight wrote:
> From: Linus Torvalds
>> Sent: 30 December 2023 19:41
>>
>> On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
>>> David Laight (5):
>>> Move the definition of optimistic_spin_node into osf_lock.c
>>> Clarify osq_wait_next()
>> I took these two as preparatory independent patches, with that
>> osq_wait_next() clarification split into two.
>>
>> I also did the renaming that Waiman asked for.
> Ok, I'll try to remove them from any respin.
> (Or at least put them first!)

The first 2 patches have been included in Linus' master branch. So you
should just remove them from a respin.

Cheers,
Longman



2023-12-31 03:04:26

by Waiman Long

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>>
>> * Waiman Long <[email protected]> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <[email protected]>
>>>> ---
>>>> kernel/locking/osq_lock.c | 14 +++++++++-----
>>>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>> */
>>>> struct optimistic_spin_node {
>>>> - struct optimistic_spin_node *next, *prev;
>>>> + struct optimistic_spin_node *self, *next, *prev;
>>>> int locked; /* 1 if lock acquired */
>>>> int cpu; /* encoded CPU # + 1 value */
>>>> };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>> bool osq_lock(struct optimistic_spin_queue *lock)
>>>> {
>>>> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>>>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
>>> <osq_lock+36>
>>>
>>> Which looks pretty optimized for me. Maybe older compiler may generate more
>>> complex code. However, I do have some doubt as to the benefit of this patch
>>> at the expense of making the code a bit more complex.
> My changed code is one instruction shorter!
> 18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
> 1f: 00
> 1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
> raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
> 10: 49 89 fd mov %rdi,%r13
> 13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
> 16: R_X86_64_32S .data..percpu..shared_aligned
> 1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
> 1b: R_X86_64_PC32 debug_smp_processor_id-0x4
> 1f: 89 c0 mov %eax,%eax
> 21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
> 28: 00
> 25: R_X86_64_32S __per_cpu_offset
> 29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
> 2a: R_X86_64_PC32 debug_smp_processor_id-0x4
> 2e: 4c 01 e3 add %r12,%rbx
> 31: 83 c0 01 add $0x1,%eax
> 34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
> 3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
> 42: 89 43 14 mov %eax,0x14(%rbx)
> 45: 41 87 45 00 xchg %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and gotten the following
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
 * Compared to the generic __my_cpu_offset version, the following
 * saves one instruction and avoids clobbering a temp register.
 */
#define arch_raw_cpu_ptr(ptr)                           \
({                                                      \
        unsigned long tcp_ptr__;                        \
        asm ("add " __percpu_arg(1) ", %0"              \
             : "=r" (tcp_ptr__)                         \
             : "m" (this_cpu_off), "0" (ptr));          \
        (typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
})

The presence of debug_smp_processor_id in your compiled code is likely
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2
instructions for this_cpu_ptr(). We are not going to optimize the code
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Cheers,
Longman


2023-12-31 10:37:14

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
>
> #ifdef CONFIG_DEBUG_PREEMPT
>   extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
>
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Yes, I discovered that late last night.
I've no idea why it is set.
It might even be inherited from the Ubuntu 18.04 (I think) .config
I started with (where I mostly just removed extra modules to reduce
compile time and the size of the uramdisk).

I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save
the cache-line bounce.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-31 11:42:09

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Linus Torvalds
> Sent: 30 December 2023 20:41
>
> On Fri, 29 Dec 2023 at 12:57, David Laight <[email protected]> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-84).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
>
> No, this is horrible.
>
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> No. Both of these are crap.
>
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
> > + }
>
> The proper fix here is to not do that silly
>
> node = this_cpu_ptr(&osq_node);
> ..
> node->next = NULL;
>
> dance at all, but to simply do
>
> this_cpu_write(osq_node.next, NULL);
>
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.

Is the equivalent true (i.e. an offset from a fixed register) for all SMP archs?
Or do some have to do something rather more complicated?
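
(Just to check I'm reading this right, the fast path would then look
something like this untested sketch:)

	int curr = encode_cpu(smp_processor_id());

	/*
	 * Untested sketch: each this_cpu_write() should end up as a single
	 * %gs-relative store on x86-64, so the uncontended path never needs
	 * the node pointer at all.
	 */
	this_cpu_write(osq_node.next, NULL);
	this_cpu_write(osq_node.locked, 0);
	this_cpu_write(osq_node.cpu, curr);

	if (atomic_xchg(&lock->tail, curr) == OSQ_UNLOCKED_VAL)
		return true;

	/* Only the contended path needs the real per-cpu pointer. */
	node = this_cpu_ptr(&osq_node);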

> Yes, you'll eventually end up doing that
>
> node = this_cpu_ptr(&osq_node);

For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember
setting it, and can't imagine why I might have.
Best guess is it was inherited from the Ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function
in order to check that preemption is disabled.
I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!

> thing because it then wants to use that raw pointer to do
>
> WRITE_ONCE(prev->next, node);
>
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.

I could claim that loading it is one instruction shorter and that
if the cache line containing 'node' is needed and 'pcpu_hot'
is (unexpectedly) not cached it saves a cache line load.
But I probably won't!

>
> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
>
> prev = decode_cpu(old);
>
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.

Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.
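
(For reference, decode_cpu() is (roughly) just the per_cpu_ptr() lookup
below, so keeping cpu numbers in the node would mean doing that
conversion on almost every access:)

	static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
	{
		int cpu_nr = encoded_cpu_val - 1;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}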

> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
>
> #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires that
the structure be correctly initialised, not just all zero.
That might show up some very strange bugs.
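
(osq_lock_init() is basically just the atomic_set() below; with
OSQ_UNLOCKED_VAL set to -1, a lock that was merely zeroed instead of
going through it would look as if it were owned by cpu 0:)

	static inline void osq_lock_init(struct optimistic_spin_queue *lock)
	{
		atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
	}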

> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
>
> subl $1, %eax #, cpu_nr
> ...
> cltq
> addq __per_cpu_offset(,%rax,8), %rcx
>
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.

Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits
are zero so it shouldn't need to zero extend.
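
(i.e. something like this untested variant:)

	/* Untested sketch: unsigned arithmetic so gcc has no sign extension to do. */
	static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
	{
		unsigned int cpu_nr = encoded_cpu_val - 1;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}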

> At that point, I think gcc might be able to just generate
>
> addq __per_cpu_offset-8(,%rax,8), %rcx

That would need the decrement to be 64bit.
A quick test failed to make that work.
Probably (as you mentioned in the next email) because gcc
doesn't know that the high bits from atomic_xchg() are zero.

> but honestly, I think it would be nicer to just have decode_cpu() do
>
> unsigned int cpu_nr = encoded_cpu_val;
>
> return per_cpu_ptr(&osq_node, cpu_nr);
>
> and not have the -1/+1 at all.

Unless you can persuade gcc that the high bits from atomic_xchg()
are zero, that will require a zero extend.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-31 11:56:37

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Linus Torvalds
> Sent: 30 December 2023 20:59
>
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <[email protected]> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
>
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
>
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

>
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
>
> -#define per_cpu_offset(x) (__per_cpu_offset[x])
> +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
>
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
__per_cpu_offset[(x) + 0u]
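
e.g. applied to the macro above (untested):

	/*
	 * Untested sketch: the usual arithmetic conversions make a plain 'int'
	 * index unsigned without truncating an 'unsigned long' one, unlike the
	 * (unsigned) cast above.
	 */
	#define per_cpu_offset(x) (__per_cpu_offset[(x) + 0u])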

> And gcc still generates
>
> subl $1, %eax #, cpu_nr
> addq __per_cpu_offset(,%rax,8), %rcx
>
> instead of just doing
>
> addq __per_cpu_offset-8(,%rax,8), %rcx
>
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location would be read if cpu_nr were zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
>
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-01-02 18:55:14

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they didn't
> exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.
>

The _release in osq_unlock() is needed since an unlock needs to be a
RELEASE so that lock+unlock can form a critical section (i.e. no memory
accesses can escape it). When osq_wait_next() is used in non-unlock cases,
the RELEASE is not required. As for the case where osq_wait_next() is
used in osq_unlock(), there is an xchg() preceding it, which provides a
full barrier, so things are fine.
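
To illustrate with made-up names (this only shows the ordering
requirement, not the real osq code):

	struct waiter { int data; int locked; };

	static void example_unlock(struct waiter *next)
	{
		/* store inside the critical section */
		next->data = 42;
		/* RELEASE: the ->data store above cannot be reordered after this */
		smp_store_release(&next->locked, 1);
	}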

/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.

> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
> next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> if (next)
> next->locked = 1;
>

If so we need to provide some sort of RELEASE semantics for the
osq_unlock() in all the cases.

Regards,
Boqun

> I don't think the order of the tests for lock->tail and node->next
> matter in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
>
> David
>
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2024-01-02 23:32:36

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

From: Boqun Feng
> Sent: 02 January 2024 18:54
>
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
> >
>
> The _release in osq_unlock() is needed since an unlock needs to be a
> RELEASE so that lock+unlock can form a critical section (i.e. no memory
> accesses can escape it). When osq_wait_next() is used in non-unlock cases,
> the RELEASE is not required. As for the case where osq_wait_next() is
> used in osq_unlock(), there is an xchg() preceding it, which provides a
> full barrier, so things are fine.

I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this code,
but is a FULL xchg always needed on node->next?

> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.

I wouldn't have worried about relaxed vs release.

> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> > if (next)
> > next->locked = 1;
> >
>
> If so we need to provide some sort of RELEASE semantics for the
> osq_unlock() in all the cases.

I wonder how often the unqueue code actually runs, and especially
how often it runs for the last cpu in the list.
I'd only expect need_resched() to return true after spinning for
a while - in which case it is perhaps more likely that there are
a lot of cpus in the queue and the cpu being removed won't be last.
So osq_wait_next() exits on xchg(&node->next, NULL) != NULL,
which is a full barrier.

On a slightly different note, I've also wondered whether 'osq_node'
actually needs to be cache line aligned?
You definitely don't want it spanning 2 cache lines, but I'd
expect that per-cpu data is mostly accessed by its own cpu.
So you really aren't going to get false sharing with some
other per-cpu data while the cpu is busy in this code.
So __aligned(16) would do?
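
(i.e. something along the lines of this untested sketch - the exact
per-cpu macro spelling may well need to be different:)

	/* Untested sketch: 16-byte alignment instead of a whole cache line per node. */
	static DEFINE_PER_CPU(struct optimistic_spin_node, osq_node) __aligned(16);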

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)