2023-12-31 21:50:26

by David Laight

Subject: [PATCH next v2 0/5] locking/osq_lock: Optimisations to osq_lock code.

This is an updated series of optimisations to osq_lock.c
Patches #1 and #3 from v1 have been applied by Linus.
Some of the generated code issues I was seeing were caused by
CONFIG_DEBUG_PREEMPT being set; no idea why, it isn't set any more.

Patch #1 is the node->locked part of the old #2.

Patch #2 removes the pretty much guaranteed cache line reload when reading
the cpu number (from node->prev) for the vcpu_is_preempted() check.
It is basically the old #5 with the addition of a READ_ONCE()
and keeping the '+ 1' offset (for patch 3).

Patch #3 ends up removing both node->cpu and node->prev.
This avoids having to initialise node->cpu.
Basically node->cpu was only ever read as node->prev->cpu in the unqueue code.
Most of the time it is the value read from lock->tail that was used to
obtain 'prev' in the first place.
The only time it is different is in the unlock race path where 'prev'
is re-read from node->prev, which is updated right at the bottom of osq_lock().
So the updated node->prev_cpu can be used (and 'prev' obtained from it) without
worrying about only one of node->prev and node->prev_cpu being updated.
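For orientation, roughly what the unqueue re-read ends up looking like after
this change (a sketch; the real code is in the 3/5 diff):

	/* A concurrent unqueue's step-C may have written a new prev_cpu. */
	int new_prev_cpu = READ_ONCE(node->prev_cpu);

	if (new_prev_cpu != prev_cpu) {
		prev_cpu = new_prev_cpu;
		prev = decode_cpu(prev_cpu);
	}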

Linus did suggest just saving the cpu numbers instead of pointers.
It actually works for 'prev' but not 'next'.

Patch #4 removes the 'should be unnecessary' node->next = NULL
assignment from the top of osq_lock().
Since longman was worried about race conditions, I've added a
WARN_ON_ONCE() check that ensures it is NULL.
This saves dirtying the 'node' cache line in the fast path, but the
check still requires the cache line be loaded.
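For reference, the check ends up as (see the 4/5 diff, nothing new here):

	if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
		raw_cpu_write(osq_node.next, NULL);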

Patch #5 just stops gcc using two separate instructions to decrement
the offset cpu number and then convert it to 64 bits.
Linus got annoyed with it, and I'd spotted it as well.
I don't seem to be able to get gcc to convert __per_cpu_offset[cpu - 1]
to (__per_cpu_offset - 1)[cpu] (cpu is offset by one) but, in any case,
it would still need zero extending in the common case.
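As a standalone sketch of the two indexing forms (__per_cpu_offset[] is the
array the generic per-cpu code indexes; the helper names below are made up
purely for illustration):

	extern unsigned long __per_cpu_offset[];

	/* what the source effectively asks for */
	unsigned long off_sub(unsigned int cpu_p1)
	{
		return __per_cpu_offset[cpu_p1 - 1];
	}

	/* the form gcc won't generate: fold the -1 into the base address */
	unsigned long off_base(unsigned int cpu_p1)
	{
		return (__per_cpu_offset - 1)[cpu_p1];
	}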

David Laight (5):
1) Defer clearing node->locked until the slow osq_lock() path.
2) Optimise vcpu_is_preempted() check.
3) Use node->prev_cpu instead of saving node->prev.
4) Avoid writing to node->next in the osq_lock() fast path.
5) Optimise decode_cpu() and per_cpu_ptr().

kernel/locking/osq_lock.c | 59 +++++++++++++++++++++------------------
1 file changed, 32 insertions(+), 27 deletions(-)

--
2.17.1




2023-12-31 21:51:52

by David Laight

Subject: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.

Since node->locked cannot be set before the assignment to prev->next
it is save to clear it in the slow path.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..e0bc74d85a76 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
node->next = NULL;
node->cpu = curr;

@@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

+ node->locked = 0;
prev = decode_cpu(old);
node->prev = prev;

--
2.17.1



2023-12-31 21:53:25

by David Laight

Subject: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.

Although patched out for bare-metal, the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead, save 'prev->cpu' in 'node->prev_cpu' and use that value.
Update it in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index e0bc74d85a76..eb8a6dfdb79d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@

struct optimistic_spin_node {
struct optimistic_spin_node *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+ int prev_cpu; /* encoded CPU # + 1 value */
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline int node_cpu(struct optimistic_spin_node *node)
-{
- return node->cpu - 1;
-}
-
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

- node->locked = 0;
+ node->prev_cpu = old;
prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* polling, be careful.
*/
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
return true;

/* unqueue */
@@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

+ WRITE_ONCE(next->prev_cpu, prev->cpu);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

--
2.17.1



2023-12-31 21:54:36

by David Laight

Subject: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

node->prev is only used to update 'prev' in the unlikely case
of concurrent unqueues.
This can be replaced by a check for node->prev_cpu changing
and then calling decode_cpu() to get the changed 'prev' pointer.

node->cpu (or, more particularly, prev->cpu) is only used for the
osq_wait_next() call in the unqueue path.
Normally this is exactly the value that the initial xchg() read
from lock->tail (used to obtain 'prev'), but can get updated
by concurrent unqueues.

Both the 'prev' and 'cpu' members of optimistic_spin_node are
now unused and can be deleted.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index eb8a6dfdb79d..27324b509f68 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,9 +13,8 @@
*/

struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *next;
int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
int prev_cpu; /* encoded CPU # + 1 value */
};

@@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
- int old;
+ int prev_cpu;

node->next = NULL;
- node->cpu = curr;

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
- if (old == OSQ_UNLOCKED_VAL)
+ prev_cpu = atomic_xchg(&lock->tail, curr);
+ if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;

- node->prev_cpu = old;
- prev = decode_cpu(old);
- node->prev = prev;
+ node->prev_cpu = prev_cpu;
+ prev = decode_cpu(prev_cpu);
node->locked = 0;

/*
@@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)

/*
* Or we race against a concurrent unqueue()'s step-B, in which
- * case its step-C will write us a new @node->prev pointer.
+ * case its step-C will write us a new @node->prev_cpu value.
*/
- prev = READ_ONCE(node->prev);
+ {
+ int new_prev_cpu = READ_ONCE(node->prev_cpu);
+
+ if (new_prev_cpu == prev_cpu)
+ continue;
+ prev_cpu = new_prev_cpu;
+ prev = decode_cpu(prev_cpu);
+ }
}

/*
@@ -186,7 +190,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* back to @prev.
*/

- next = osq_wait_next(lock, node, prev->cpu);
+ next = osq_wait_next(lock, node, prev_cpu);
if (!next)
return false;

@@ -198,8 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

- WRITE_ONCE(next->prev_cpu, prev->cpu);
- WRITE_ONCE(next->prev, prev);
+ WRITE_ONCE(next->prev_cpu, prev_cpu);
WRITE_ONCE(prev->next, next);

return false;
--
2.17.1



2023-12-31 21:55:33

by David Laight

Subject: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.

When osq_lock() returns false or osq_unlock() returns, static
analysis shows that node->next should always be NULL.
This means that it isn't necessary to explicitly set it to NULL
prior to atomic_xchg(&lock->tail, curr) on entry to osq_lock().

Just in case there is a non-obvious race condition that can leave it
non-NULL, check with WARN_ON_ONCE() and set it back to NULL if needed.
Note that without this check the fast path (adding at the list head)
doesn't need to access the per-cpu osq_node at all.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 27324b509f68..35bb99e96697 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,

bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
- struct optimistic_spin_node *prev, *next;
+ struct optimistic_spin_node *node, *prev, *next;
int curr = encode_cpu(smp_processor_id());
int prev_cpu;

- node->next = NULL;
+ /*
+ * node->next should be NULL on entry.
+ * Check just in case there is a race somewhere.
+ * Note that this is probably an unnecessary cache miss in the fast path.
+ */
+ if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
+ raw_cpu_write(osq_node.next, NULL);

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;

- node->prev_cpu = prev_cpu;
+ node = this_cpu_ptr(&osq_node);
prev = decode_cpu(prev_cpu);
+ node->prev_cpu = prev_cpu;
node->locked = 0;

/*
--
2.17.1



2023-12-31 21:56:27

by David Laight

Subject: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number to be 64bit.
However the value in osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are) so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0).
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).

Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not massive but saves two instructions.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
{
- int cpu_nr = encoded_cpu_val - 1;
-
- return per_cpu_ptr(&osq_node, cpu_nr);
+ return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
}

/*
--
2.17.1
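(As an aside, a hypothetical illustration of the extension point above, not
part of the patch: with a signed index the compiler has to sign extend the
32bit result before using it as a 64bit array index, whereas a 32bit unsigned
subtraction already leaves the high bits zero on x86-64:

	extern unsigned long table[];

	unsigned long idx_int(int v)           { return table[v - 1]; } /* needs a movslq/cltq */
	unsigned long idx_uint(unsigned int v) { return table[v - 1]; } /* subtraction zero extends */
)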



2024-01-01 04:08:28

by Waiman Long

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


On 12/31/23 16:51, David Laight wrote:
> Since node->locked cannot be set before the assignment to prev->next
> it is save to clear it in the slow path.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..e0bc74d85a76 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> node->next = NULL;
> node->cpu = curr;
>
> @@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> + node->locked = 0;
> prev = decode_cpu(old);
> node->prev = prev;
>

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:09:34

by Waiman Long

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.


On 12/31/23 16:52, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
>
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is a pretty much guaranteed have a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index e0bc74d85a76..eb8a6dfdb79d 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* encoded CPU # + 1 value */
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> - return node->cpu - 1;
> -}
> -
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
> int cpu_nr = encoded_cpu_val - 1;
> @@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->locked = 0;
> + node->prev_cpu = old;
> prev = decode_cpu(old);
> node->prev = prev;
> + node->locked = 0;
>
> /*
> * osq_lock() unqueue
> @@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
> return true;
>
> /* unqueue */
> @@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * it will wait in Step-A.
> */
>
> + WRITE_ONCE(next->prev_cpu, prev->cpu);
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
>

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:09:54

by Waiman Long

Subject: Re: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

On 12/31/23 16:54, David Laight wrote:
> node->prev is only used to update 'prev' in the unlikely case
> of concurrent unqueues.
> This can be replaced by a check for node->prev_cpu changing
> and then calling decode_cpu() to get the changed 'prev' pointer.
>
> node->cpu (or more particularly) prev->cpu is only used for the
> osq_wait_next() call in the unqueue path.
> Normally this is exactly the value that the initial xchg() read
> from lock->tail (used to obtain 'prev'), but can get updated
> by concurrent unqueues.
>
> Both the 'prev' and 'cpu' members of optimistic_spin_node are
> now unused and can be deleted.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 31 +++++++++++++++++--------------
> 1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index eb8a6dfdb79d..27324b509f68 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,9 +13,8 @@
> */
>
> struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> + struct optimistic_spin_node *next;
> int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> int prev_cpu; /* encoded CPU # + 1 value */
> };
>
> @@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> - int old;
> + int prev_cpu;
>
> node->next = NULL;
> - node->cpu = curr;
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> - if (old == OSQ_UNLOCKED_VAL)
> + prev_cpu = atomic_xchg(&lock->tail, curr);
> + if (prev_cpu == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->prev_cpu = old;
> - prev = decode_cpu(old);
> - node->prev = prev;
> + node->prev_cpu = prev_cpu;
> + prev = decode_cpu(prev_cpu);
> node->locked = 0;
>
> /*
> @@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>
> /*
> * Or we race against a concurrent unqueue()'s step-B, in which
> - * case its step-C will write us a new @node->prev pointer.
> + * case its step-C will write us a new @node->prev_cpu value.
> */
> - prev = READ_ONCE(node->prev);
> + {
> + int new_prev_cpu = READ_ONCE(node->prev_cpu);
> +
> + if (new_prev_cpu == prev_cpu)
> + continue;
> + prev_cpu = new_prev_cpu;
> + prev = decode_cpu(prev_cpu);
> + }

Just a minor nit. It is not that common in the kernel to add another
nesting level just to reduce the scope of the new_prev_cpu auto variable.
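(A sketch of the flatter form, with 'int new_prev_cpu' hoisted up to the
existing declarations at the top of osq_lock(); not something from the
posted patch:

	new_prev_cpu = READ_ONCE(node->prev_cpu);
	if (new_prev_cpu == prev_cpu)
		continue;
	prev_cpu = new_prev_cpu;
	prev = decode_cpu(prev_cpu);
)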

Anyway,

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:13:41

by Waiman Long

Subject: Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.


On 12/31/23 16:54, David Laight wrote:
> When osq_lock() returns false or osq_unlock() returns static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to atomic_xchg(&lock->tail, curr) on extry to osq_lock().
>
> Just in case there a non-obvious race condition that can leave it
> non-NULL check with WARN_ON_ONCE() and NULL if set.
> Note that without this check the fast path (adding at the list head)
> doesn't need to to access the per-cpu osq_node at all.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 27324b509f68..35bb99e96697 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> - struct optimistic_spin_node *prev, *next;
> + struct optimistic_spin_node *node, *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> int prev_cpu;
>
> - node->next = NULL;
> + /*
> + * node->next should be NULL on entry.
> + * Check just in case there is a race somewhere.
> + * Note that this is probably an unnecessary cache miss in the fast path.
> + */
> + if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
> + raw_cpu_write(osq_node.next, NULL);
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (prev_cpu == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->prev_cpu = prev_cpu;
> + node = this_cpu_ptr(&osq_node);
> prev = decode_cpu(prev_cpu);
> + node->prev_cpu = prev_cpu;
> node->locked = 0;
>
> /*
Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:14:36

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 12/31/23 16:55, David Laight wrote:
> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value is osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> }
>
> /*

You really like micro-optimization.

Anyway,

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 08:47:38

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 01 January 2024 04:14
...
> You really like micro-optimization.

They all add up :-)

David


2024-01-02 09:47:29

by Ingo Molnar

Subject: Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.


* David Laight <[email protected]> wrote:

> When osq_lock() returns false or osq_unlock() returns static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to atomic_xchg(&lock->tail, curr) on extry to osq_lock().
>
> Just in case there a non-obvious race condition that can leave it
> non-NULL check with WARN_ON_ONCE() and NULL if set.
> Note that without this check the fast path (adding at the list head)
> doesn't need to to access the per-cpu osq_node at all.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 27324b509f68..35bb99e96697 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> - struct optimistic_spin_node *prev, *next;
> + struct optimistic_spin_node *node, *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> int prev_cpu;
>
> - node->next = NULL;
> + /*
> + * node->next should be NULL on entry.
> + * Check just in case there is a race somewhere.
> + * Note that this is probably an unnecessary cache miss in the fast path.
> + */
> + if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
> + raw_cpu_write(osq_node.next, NULL);

The fix-uppery and explanation about something that shouldn't happen is
excessive: please just put a plain WARN_ON_ONCE() here - which we can
remove in a release or so.
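(Presumably just:

	WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL);

i.e. the bare check without the conditional NULL write; a sketch of the
suggestion, not an applied change.)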

Thanks,

Ingo

2024-01-02 09:49:42

by Ingo Molnar

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.


* David Laight <[email protected]> wrote:

> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
>
> Although patched out for bare-metal the code still needs the cpu number.

Comma missing.

> Reading this from 'prev->cpu' is a pretty much guaranteed have a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.

Throughout the series, in changelogs and comments, please do:

s/cpu
/CPU

Please be more careful about changelog quality.

> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* encoded CPU # + 1 value */
> };

s/ encoded CPU # + 1 value
/ encoded value: CPU+1

Thanks,

Ingo

2024-01-02 09:50:07

by Ingo Molnar

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


* David Laight <[email protected]> wrote:

> Since node->locked cannot be set before the assignment to prev->next
> it is save to clear it in the slow path.

s/save
/safe

Thanks,

Ingo

2024-01-02 09:51:47

by Ingo Molnar

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


Also, please don't put periods into titles:

s/[PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.
/[PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path

Usually maintainers remove them manually, but there's no reason to be
inconsistent: half of your series had a period. It just unnecessarily
distracts from review.

This rule applies to this series and to all future patches.

Thanks,

Ingo

2024-01-02 09:54:52

by Ingo Molnar

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


* David Laight <[email protected]> wrote:

> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value is osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);

So why do we 'encode' the CPU number to begin with?

Why not use -1 as the special value? Checks for negative values
generate similarly fast machine code compared to checking for 0, if
the value is also used (which it is in most cases here). What am I
missing? We seem to be going through a lot of unnecessary hoops, and
some of that is in the runtime path.

Thanks,

Ingo

2024-01-02 10:20:47

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Ingo Molnar
> Sent: 02 January 2024 09:54
>
>
> * David Laight <[email protected]> wrote:
>
> > per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> > This requires the cpu number be 64bit.
> > However the value is osq_lock() comes from a 32bit xchg() and there
> > isn't a way of telling gcc the high bits are zero (they are) so
> > there will always be an instruction to clear the high bits.
> >
> > The cpu number is also offset by one (to make the initialiser 0)
> > It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> > into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
> >
> > Converting the cpu number to 32bit unsigned prior to the decrement means
> > that gcc knows the decrement has set the high bits to zero and doesn't
> > add a register-register move (or cltq) to zero/sign extend the value.
> >
> > Not massive but saves two instructions.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> > kernel/locking/osq_lock.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 35bb99e96697..37a4fa872989 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> > return cpu_nr + 1;
> > }
> >
> > -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> > +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> > {
> > - int cpu_nr = encoded_cpu_val - 1;
> > -
> > - return per_cpu_ptr(&osq_node, cpu_nr);
> > + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>
> So why do we 'encode' the CPU number to begin with?
>
> Why not use -1 as the special value? Checks for negative values
> generates similarly fast machine code compared to checking for 0, if
> the value is also used (which it is in most cases here). What am I
> missing? We seem to be going through a lot of unnecessary hoops, and
> some of that is in the runtime path.

You'd really have to ask the person who did the original patch
that changed lock->tail from a pointer to an int (saving 8 bytes)
in every mutex and rwsem.

I suspect the reason is that it is so much safer to have the
initialiser be zero, rather than a non-zero value with zero
being a valid value.

It is also hard to avoid an extra instruction in the per_cpu_ptr()
code - something has to extend the 32bit result from xchg() to
a 64bit one for the array index.
The asm for an unsigned 32bit exchange could return a 64bit result
(which would have the desired effect), but that won't work for
a signed value.
The '-1' in the vcpu_is_preempted() path will be executed in parallel
with something else and is likely to have no measurable effect.

So it is a slightly risky change that has less benefit than the
other changes (which save cache line accesses).
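(For context, the current encoding keeps the unlocked initialiser all zero.
Roughly, from include/linux/osq_lock.h:

	struct optimistic_spin_queue {
		atomic_t tail;
	};

	#define OSQ_UNLOCKED_VAL (0)
	#define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }

With a -1 sentinel every embedded lock would need an explicit non-zero
initialiser, so a zeroed structure would no longer be a valid unlocked lock.)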

David



2024-01-08 07:42:38

by Oliver Sang

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.



Hello,

kernel test robot noticed a 10.7% improvement of stress-ng.netlink-task.ops_per_sec on:


commit: d93300891f810c9498d09a6ecea2403d7a3546f0 ("[PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.")
url: https://github.com/intel-lab-lkp/linux/commits/David-Laight/locking-osq_lock-Defer-clearing-node-locked-until-the-slow-osq_lock-path/20240101-055853
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 610a9b8f49fbcf1100716370d3b5f6f884a2835a
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

nr_threads: 100%
testtime: 60s
sc_pid_max: 4194304
class: scheduler
test: netlink-task
cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240108/[email protected]

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-icl-2sp8/netlink-task/stress-ng/60s

commit:
ff787c1fd0 ("locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.")
d93300891f ("locking/osq_lock: Optimise the vcpu_is_preempted() check.")

ff787c1fd0c13f9b d93300891f810c9498d09a6ecea
---------------- ---------------------------
%stddev %change %stddev
\ | \
3880 ? 7% +26.4% 4903 ? 18% vmstat.system.cs
0.48 ?126% -99.8% 0.00 ?141% perf-sched.sch_delay.avg.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
0.16 ? 23% -38.9% 0.10 ? 32% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.genl_rcv_msg
1.50 ?118% -99.9% 0.00 ?142% perf-sched.sch_delay.max.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
2.55 ? 97% -83.7% 0.42 ?145% perf-sched.wait_time.max.ms.__cond_resched.__mutex_lock.constprop.0.genl_rcv_msg
32244865 +10.7% 35709040 stress-ng.netlink-task.ops
537413 +10.7% 595150 stress-ng.netlink-task.ops_per_sec
38094 ? 12% +42.2% 54160 ? 27% stress-ng.time.involuntary_context_switches
42290 ? 11% +39.8% 59117 ? 23% stress-ng.time.voluntary_context_switches
143.50 ? 7% -28.8% 102.17 ? 15% perf-c2c.DRAM.local
4955 ? 3% -26.4% 3647 ? 4% perf-c2c.DRAM.remote
4038 ? 2% -18.8% 3277 ? 4% perf-c2c.HITM.local
3483 ? 3% -21.1% 2747 ? 5% perf-c2c.HITM.remote
7521 ? 2% -19.9% 6024 ? 4% perf-c2c.HITM.total
0.42 ? 3% -16.2% 0.35 ? 5% perf-stat.i.MPKI
1.066e+10 +9.6% 1.169e+10 perf-stat.i.branch-instructions
51.90 -2.5 49.42 ? 2% perf-stat.i.cache-miss-rate%
22517746 ? 3% -13.4% 19503564 ? 5% perf-stat.i.cache-misses
3730 ? 7% +29.2% 4819 ? 19% perf-stat.i.context-switches
3.99 -3.1% 3.86 perf-stat.i.cpi
9535 ? 3% +15.4% 11003 ? 5% perf-stat.i.cycles-between-cache-misses
0.00 ? 3% +0.0 0.00 ? 3% perf-stat.i.dTLB-load-miss-rate%
1.419e+10 -14.9% 1.207e+10 perf-stat.i.dTLB-loads
8.411e+08 +8.4% 9.118e+08 perf-stat.i.dTLB-stores
5.36e+10 +3.1% 5.524e+10 perf-stat.i.instructions
0.26 +7.0% 0.28 ? 5% perf-stat.i.ipc
837.29 ? 3% -9.8% 755.30 ? 4% perf-stat.i.metric.K/sec
401.41 -4.1% 385.10 perf-stat.i.metric.M/sec
6404966 -23.2% 4920722 perf-stat.i.node-load-misses
141818 ? 4% -22.2% 110404 ? 4% perf-stat.i.node-loads
69.54 +13.8 83.36 perf-stat.i.node-store-miss-rate%
3935319 +10.4% 4345724 perf-stat.i.node-store-misses
1626033 -52.6% 771187 ? 5% perf-stat.i.node-stores
0.42 ? 3% -16.0% 0.35 ? 5% perf-stat.overall.MPKI
0.11 -0.0 0.10 ? 8% perf-stat.overall.branch-miss-rate%
51.32 -2.5 48.79 ? 2% perf-stat.overall.cache-miss-rate%
4.06 -3.0% 3.94 perf-stat.overall.cpi
9677 ? 3% +15.6% 11187 ? 5% perf-stat.overall.cycles-between-cache-misses
0.00 ? 3% +0.0 0.00 ? 4% perf-stat.overall.dTLB-load-miss-rate%
0.25 +3.1% 0.25 perf-stat.overall.ipc
70.78 +14.2 84.94 perf-stat.overall.node-store-miss-rate%
1.049e+10 +9.5% 1.149e+10 perf-stat.ps.branch-instructions
22167740 ? 3% -13.4% 19186498 ? 5% perf-stat.ps.cache-misses
3667 ? 7% +29.1% 4735 ? 19% perf-stat.ps.context-switches
1.396e+10 -15.0% 1.187e+10 perf-stat.ps.dTLB-loads
8.273e+08 +8.3% 8.963e+08 perf-stat.ps.dTLB-stores
5.274e+10 +3.0% 5.433e+10 perf-stat.ps.instructions
6303682 -23.2% 4839978 perf-stat.ps.node-load-misses
140690 ? 4% -22.5% 109023 ? 4% perf-stat.ps.node-loads
3875362 +10.3% 4276026 perf-stat.ps.node-store-misses
1599985 -52.6% 758184 ? 5% perf-stat.ps.node-stores
3.297e+12 +3.0% 3.396e+12 perf-stat.total.instructions
96.07 -0.2 95.87 perf-profile.calltrace.cycles-pp.osq_lock.__mutex_lock.genl_rcv_msg.netlink_rcv_skb.genl_rcv
97.52 -0.1 97.37 perf-profile.calltrace.cycles-pp.__mutex_lock.genl_rcv_msg.netlink_rcv_skb.genl_rcv.netlink_unicast
98.98 -0.1 98.90 perf-profile.calltrace.cycles-pp.netlink_rcv_skb.genl_rcv.netlink_unicast.netlink_sendmsg.__sys_sendto
98.99 -0.1 98.92 perf-profile.calltrace.cycles-pp.genl_rcv.netlink_unicast.netlink_sendmsg.__sys_sendto.__x64_sys_sendto
98.97 -0.1 98.89 perf-profile.calltrace.cycles-pp.genl_rcv_msg.netlink_rcv_skb.genl_rcv.netlink_unicast.netlink_sendmsg
99.09 -0.1 99.04 perf-profile.calltrace.cycles-pp.netlink_unicast.netlink_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
99.47 -0.0 99.43 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.stress_netlink_taskstats_monitor.stress_netlink_task
99.44 -0.0 99.40 perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.stress_netlink_taskstats_monitor
99.35 -0.0 99.32 perf-profile.calltrace.cycles-pp.netlink_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
99.44 -0.0 99.40 perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
96.08 -0.2 95.89 perf-profile.children.cycles-pp.osq_lock
97.52 -0.1 97.38 perf-profile.children.cycles-pp.__mutex_lock
98.98 -0.1 98.90 perf-profile.children.cycles-pp.netlink_rcv_skb
99.00 -0.1 98.92 perf-profile.children.cycles-pp.genl_rcv
98.97 -0.1 98.89 perf-profile.children.cycles-pp.genl_rcv_msg
99.20 -0.0 99.15 perf-profile.children.cycles-pp.netlink_unicast
0.13 ? 3% -0.0 0.08 ? 7% perf-profile.children.cycles-pp.genl_cmd_full_to_split
0.14 ? 4% -0.0 0.10 ? 5% perf-profile.children.cycles-pp.genl_get_cmd
99.36 -0.0 99.32 perf-profile.children.cycles-pp.netlink_sendmsg
99.44 -0.0 99.41 perf-profile.children.cycles-pp.__x64_sys_sendto
99.44 -0.0 99.41 perf-profile.children.cycles-pp.__sys_sendto
99.59 -0.0 99.56 perf-profile.children.cycles-pp.sendto
0.07 ? 5% +0.0 0.08 ? 5% perf-profile.children.cycles-pp.genl_family_rcv_msg_attrs_parse
0.11 +0.0 0.12 ? 6% perf-profile.children.cycles-pp.apparmor_capable
0.18 ? 3% +0.0 0.20 ? 4% perf-profile.children.cycles-pp.netlink_recvmsg
0.36 +0.0 0.38 perf-profile.children.cycles-pp.fill_stats
0.13 ? 3% +0.0 0.15 ? 4% perf-profile.children.cycles-pp.ns_capable
0.20 ? 3% +0.0 0.23 ? 4% perf-profile.children.cycles-pp.sock_recvmsg
0.24 ? 3% +0.0 0.27 ? 3% perf-profile.children.cycles-pp.__sys_recvfrom
0.24 ? 3% +0.0 0.27 ? 4% perf-profile.children.cycles-pp.__x64_sys_recvfrom
0.31 ? 3% +0.0 0.34 ? 3% perf-profile.children.cycles-pp.recv
1.22 +0.0 1.26 perf-profile.children.cycles-pp.genl_family_rcv_msg
0.85 +0.1 0.90 perf-profile.children.cycles-pp.cmd_attr_pid
0.94 +0.1 1.01 perf-profile.children.cycles-pp.genl_family_rcv_msg_doit
1.11 +0.1 1.23 perf-profile.children.cycles-pp.mutex_spin_on_owner
95.80 -0.2 95.62 perf-profile.self.cycles-pp.osq_lock
0.13 ? 3% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.genl_cmd_full_to_split
0.11 ? 3% +0.0 0.12 ? 6% perf-profile.self.cycles-pp.apparmor_capable
1.11 +0.1 1.23 perf-profile.self.cycles-pp.mutex_spin_on_owner




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2024-05-03 15:59:59

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 12/31/23 23:14, Waiman Long wrote:
>
> On 12/31/23 16:55, David Laight wrote:
>> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
>> This requires the cpu number be 64bit.
>> However the value is osq_lock() comes from a 32bit xchg() and there
>> isn't a way of telling gcc the high bits are zero (they are) so
>> there will always be an instruction to clear the high bits.
>>
>> The cpu number is also offset by one (to make the initialiser 0)
>> It seems to be impossible to get gcc to convert
>> __per_cpu_offset[cpu_p1 - 1]
>> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
>> address).
>>
>> Converting the cpu number to 32bit unsigned prior to the decrement means
>> that gcc knows the decrement has set the high bits to zero and doesn't
>> add a register-register move (or cltq) to zero/sign extend the value.
>>
>> Not massive but saves two instructions.
>>
>> Signed-off-by: David Laight <[email protected]>
>> ---
>>   kernel/locking/osq_lock.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 35bb99e96697..37a4fa872989 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
>>       return cpu_nr + 1;
>>   }
>>   -static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
>> encoded_cpu_val)
>>   {
>> -    int cpu_nr = encoded_cpu_val - 1;
>> -
>> -    return per_cpu_ptr(&osq_node, cpu_nr);
>> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>>   }
>>     /*
>
> You really like micro-optimization.
>
> Anyway,
>
> Reviewed-by: Waiman Long <[email protected]>
>
David,

Could you respin the series based on the latest upstream code?

Thanks,
Longman


2024-05-03 16:23:18

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 17:00
> To: David Laight <[email protected]>; '[email protected]' <linux-
> [email protected]>; '[email protected]' <[email protected]>
> Cc: '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]'
> <[email protected]>; 'Linus Torvalds' <[email protected]>; '[email protected]
> foundation.org' <[email protected]>; 'Zeng Heng' <[email protected]>
> Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
>
>
> On 12/31/23 23:14, Waiman Long wrote:
> >
> > On 12/31/23 16:55, David Laight wrote:
> >> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> >> This requires the cpu number be 64bit.
> >> However the value is osq_lock() comes from a 32bit xchg() and there
> >> isn't a way of telling gcc the high bits are zero (they are) so
> >> there will always be an instruction to clear the high bits.
> >>
> >> The cpu number is also offset by one (to make the initialiser 0)
> >> It seems to be impossible to get gcc to convert
> >> __per_cpu_offset[cpu_p1 - 1]
> >> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
> >> address).
> >>
> >> Converting the cpu number to 32bit unsigned prior to the decrement means
> >> that gcc knows the decrement has set the high bits to zero and doesn't
> >> add a register-register move (or cltq) to zero/sign extend the value.
> >>
> >> Not massive but saves two instructions.
> >>
> >> Signed-off-by: David Laight <[email protected]>
> >> ---
> >>   kernel/locking/osq_lock.c | 6 ++----
> >>   1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 35bb99e96697..37a4fa872989 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> >>       return cpu_nr + 1;
> >>   }
> >>   -static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
> >> encoded_cpu_val)
> >>   {
> >> -    int cpu_nr = encoded_cpu_val - 1;
> >> -
> >> -    return per_cpu_ptr(&osq_node, cpu_nr);
> >> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> >>   }
> >>     /*
> >
> > You really like micro-optimization.
> >
> > Anyway,
> >
> > Reviewed-by: Waiman Long <[email protected]>
> >
> David,
>
> Could you respin the series based on the latest upstream code?

Looks like a wet bank holiday weekend.....

David


2024-05-03 21:13:06

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 17:00
...
> David,
>
> Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly and diffing the new patches to the old ones gives no differences.
So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changes either.
(Apart from a very local variable I used to keep a patch isolated.)

David


2024-05-03 22:13:53

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 5/3/24 17:10, David Laight wrote:
> From: Waiman Long
>> Sent: 03 May 2024 17:00
> ...
>> David,
>>
>> Could you respin the series based on the latest upstream code?
> I've just reapplied the patches to 'master' and they all apply
> cleanly and diffing the new patches to the old ones gives no differences.
> So I think they should still apply.
>
> Were you seeing a specific problem?
>
> I don't remember any suggested changed either.
> (Apart from a very local variable I used to keep a patch isolated.)

No, I just want to make sure that your patches will still apply. Anyway,
it will be easier for the maintainer to merge your remaining patches if
you can send out a new version even if they are almost the same as the
old ones.

Thanks,
Longman


2024-05-04 20:27:00

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 23:14
>
>
> On 5/3/24 17:10, David Laight wrote:
> > From: Waiman Long
> >> Sent: 03 May 2024 17:00
> > ...
> >> David,
> >>
> >> Could you respin the series based on the latest upstream code?
> > I've just reapplied the patches to 'master' and they all apply
> > cleanly and diffing the new patches to the old ones gives no differences.
> > So I think they should still apply.
> >
> > Were you seeing a specific problem?
> >
> > I don't remember any suggested changed either.
> > (Apart from a very local variable I used to keep a patch isolated.)
>
> No, I just want to make sure that your patches will still apply. Anyway,
> it will be easier for the maintainer to merge your remaining patches if
> you can send out a new version even if they are almost the same as the
> old ones.

I don't think any changes are needed.
So the existing versions are fine.
They applied (well my copy of what I think I sent applied) and built.
So there shouldn't be any issues.

David
