2023-12-31 21:50:26

by David Laight

Subject: [PATCH next v2 0/5] locking/osq_lock: Optimisations to osq_lock code.

This is an updated series of optimisations to osq_lock.c
Patches #1 and #3 from v1 have been applied by Linus.
Some of the generated code issues I was seeing were caused by
CONFIG_DEBUG_PREEMPT being set; no idea why, it isn't set any more.

Patch #1 is the node->locked part of the old #2.

Patch #2 removes the pretty much guaranteed cache line reload when reading
the cpu number (from node->prev) for the vcpu_is_preempted() check.
It is basically the old #5 with the addition of a READ_ONCE()
and keeping the '+ 1' offset (for patch 3).

Patch #3 ends up removing both node->cpu and node->prev.
This avoids having to initialise node->cpu.
Basically node->cpu was only ever read as node->prev->cpu in the unqueue code.
Most of the time it is the value read from lock->tail that was used to
obtain 'prev' in the first place.
The only time it is different is in the unlock race path where 'prev'
is re-read from node->prev, which is updated right at the bottom of osq_lock().
So the updated node->prev_cpu can be used (and 'prev' obtained from it) without
worrying about only one of node->prev and node->prev_cpu being updated.
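For orientation, roughly what the unqueue re-read ends up looking like after
this change (a sketch; the real code is in the 3/5 diff):

	/* A concurrent unqueue's step-C may have written a new prev_cpu. */
	int new_prev_cpu = READ_ONCE(node->prev_cpu);

	if (new_prev_cpu != prev_cpu) {
		prev_cpu = new_prev_cpu;
		prev = decode_cpu(prev_cpu);
	}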

Linus did suggest just saving the cpu numbers instead of pointers.
It actually works for 'prev' but not 'next'.

Patch #4 removes the 'should be unnecessary' node->next = NULL
assignment from the top of osq_lock().
Since longman was worried about race conditions, I've added a
WARN_ON_ONCE() check that ensures it is NULL.
This saves dirtying the 'node' cache line in the fast path, but the
check still requires the cache line be loaded.
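For reference, the check ends up as (see the 4/5 diff, nothing new here):

	if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
		raw_cpu_write(osq_node.next, NULL);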

Patch #5 just stops gcc using two separate instructions to decrement
the offset cpu number and then convert it to 64 bits.
Linus got annoyed with it, and I'd spotted it as well.
I don't seem to be able to get gcc to convert __per_cpu_offset[cpu - 1]
to (__per_cpu_offset - 1)[cpu] (cpu is offset by one) but, in any case,
it would still need zero extending in the common case.
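As a standalone sketch of the two indexing forms (__per_cpu_offset[] is the
array the generic per-cpu code indexes; the helper names below are made up
purely for illustration):

	extern unsigned long __per_cpu_offset[];

	/* what the source effectively asks for */
	unsigned long off_sub(unsigned int cpu_p1)
	{
		return __per_cpu_offset[cpu_p1 - 1];
	}

	/* the form gcc won't generate: fold the -1 into the base address */
	unsigned long off_base(unsigned int cpu_p1)
	{
		return (__per_cpu_offset - 1)[cpu_p1];
	}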

David Laight (5):
1) Defer clearing node->locked until the slow osq_lock() path.
2) Optimise vcpu_is_preempted() check.
3) Use node->prev_cpu instead of saving node->prev.
4) Avoid writing to node->next in the osq_lock() fast path.
5) Optimise decode_cpu() and per_cpu_ptr().

kernel/locking/osq_lock.c | 59 +++++++++++++++++++++------------------
1 file changed, 32 insertions(+), 27 deletions(-)

--
2.17.1




2023-12-31 21:51:52

by David Laight

Subject: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.

Since node->locked cannot be set before the assignment to prev->next
it is save to clear it in the slow path.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..e0bc74d85a76 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
node->next = NULL;
node->cpu = curr;

@@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

+ node->locked = 0;
prev = decode_cpu(old);
node->prev = prev;

--
2.17.1



2023-12-31 21:53:25

by David Laight

Subject: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.

Although patched out for bare-metal, the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead, save 'prev->cpu' in 'node->prev_cpu' and use that value.
Update it in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index e0bc74d85a76..eb8a6dfdb79d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@

struct optimistic_spin_node {
struct optimistic_spin_node *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+ int prev_cpu; /* encoded CPU # + 1 value */
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline int node_cpu(struct optimistic_spin_node *node)
-{
- return node->cpu - 1;
-}
-
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

- node->locked = 0;
+ node->prev_cpu = old;
prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* polling, be careful.
*/
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
return true;

/* unqueue */
@@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

+ WRITE_ONCE(next->prev_cpu, prev->cpu);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

--
2.17.1



2023-12-31 21:54:36

by David Laight

Subject: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

node->prev is only used to update 'prev' in the unlikely case
of concurrent unqueues.
This can be replaced by a check for node->prev_cpu changing
and then calling decode_cpu() to get the changed 'prev' pointer.

node->cpu (or, more particularly, prev->cpu) is only used for the
osq_wait_next() call in the unqueue path.
Normally this is exactly the value that the initial xchg() read
from lock->tail (used to obtain 'prev'), but can get updated
by concurrent unqueues.

Both the 'prev' and 'cpu' members of optimistic_spin_node are
now unused and can be deleted.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 31 +++++++++++++++++--------------
1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index eb8a6dfdb79d..27324b509f68 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,9 +13,8 @@
*/

struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *next;
int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
int prev_cpu; /* encoded CPU # + 1 value */
};

@@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
- int old;
+ int prev_cpu;

node->next = NULL;
- node->cpu = curr;

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
- if (old == OSQ_UNLOCKED_VAL)
+ prev_cpu = atomic_xchg(&lock->tail, curr);
+ if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;

- node->prev_cpu = old;
- prev = decode_cpu(old);
- node->prev = prev;
+ node->prev_cpu = prev_cpu;
+ prev = decode_cpu(prev_cpu);
node->locked = 0;

/*
@@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)

/*
* Or we race against a concurrent unqueue()'s step-B, in which
- * case its step-C will write us a new @node->prev pointer.
+ * case its step-C will write us a new @node->prev_cpu value.
*/
- prev = READ_ONCE(node->prev);
+ {
+ int new_prev_cpu = READ_ONCE(node->prev_cpu);
+
+ if (new_prev_cpu == prev_cpu)
+ continue;
+ prev_cpu = new_prev_cpu;
+ prev = decode_cpu(prev_cpu);
+ }
}

/*
@@ -186,7 +190,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* back to @prev.
*/

- next = osq_wait_next(lock, node, prev->cpu);
+ next = osq_wait_next(lock, node, prev_cpu);
if (!next)
return false;

@@ -198,8 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

- WRITE_ONCE(next->prev_cpu, prev->cpu);
- WRITE_ONCE(next->prev, prev);
+ WRITE_ONCE(next->prev_cpu, prev_cpu);
WRITE_ONCE(prev->next, next);

return false;
--
2.17.1



2023-12-31 21:55:33

by David Laight

Subject: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.

When osq_lock() returns false or osq_unlock() returns, static
analysis shows that node->next should always be NULL.
This means that it isn't necessary to explicitly set it to NULL
prior to atomic_xchg(&lock->tail, curr) on entry to osq_lock().

Just in case there is a non-obvious race condition that can leave it
non-NULL, check with WARN_ON_ONCE() and set it back to NULL if needed.
Note that without this check the fast path (adding at the list head)
doesn't need to access the per-cpu osq_node at all.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 27324b509f68..35bb99e96697 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,

bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
- struct optimistic_spin_node *prev, *next;
+ struct optimistic_spin_node *node, *prev, *next;
int curr = encode_cpu(smp_processor_id());
int prev_cpu;

- node->next = NULL;
+ /*
+ * node->next should be NULL on entry.
+ * Check just in case there is a race somewhere.
+ * Note that this is probably an unnecessary cache miss in the fast path.
+ */
+ if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
+ raw_cpu_write(osq_node.next, NULL);

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (prev_cpu == OSQ_UNLOCKED_VAL)
return true;

- node->prev_cpu = prev_cpu;
+ node = this_cpu_ptr(&osq_node);
prev = decode_cpu(prev_cpu);
+ node->prev_cpu = prev_cpu;
node->locked = 0;

/*
--
2.17.1



2023-12-31 21:56:27

by David Laight

Subject: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
This requires the cpu number to be 64bit.
However the value in osq_lock() comes from a 32bit xchg() and there
isn't a way of telling gcc the high bits are zero (they are) so
there will always be an instruction to clear the high bits.

The cpu number is also offset by one (to make the initialiser 0).
It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).

Converting the cpu number to 32bit unsigned prior to the decrement means
that gcc knows the decrement has set the high bits to zero and doesn't
add a register-register move (or cltq) to zero/sign extend the value.

Not massive but saves two instructions.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 35bb99e96697..37a4fa872989 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
+static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
{
- int cpu_nr = encoded_cpu_val - 1;
-
- return per_cpu_ptr(&osq_node, cpu_nr);
+ return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
}

/*
--
2.17.1
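(As an aside, a hypothetical illustration of the extension point above, not
part of the patch: with a signed index the compiler has to sign extend the
32bit result before using it as a 64bit array index, whereas a 32bit unsigned
subtraction already leaves the high bits zero on x86-64:

	extern unsigned long table[];

	unsigned long idx_int(int v)           { return table[v - 1]; } /* needs a movslq/cltq */
	unsigned long idx_uint(unsigned int v) { return table[v - 1]; } /* subtraction zero extends */
)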



2024-01-01 04:08:28

by Waiman Long

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


On 12/31/23 16:51, David Laight wrote:
> Since node->locked cannot be set before the assignment to prev->next
> it is save to clear it in the slow path.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..e0bc74d85a76 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -97,7 +97,6 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> node->next = NULL;
> node->cpu = curr;
>
> @@ -111,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> + node->locked = 0;
> prev = decode_cpu(old);
> node->prev = prev;
>

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:09:34

by Waiman Long

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.


On 12/31/23 16:52, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
>
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is a pretty much guaranteed have a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index e0bc74d85a76..eb8a6dfdb79d 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* encoded CPU # + 1 value */
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> - return node->cpu - 1;
> -}
> -
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
> int cpu_nr = encoded_cpu_val - 1;
> @@ -110,9 +106,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->locked = 0;
> + node->prev_cpu = old;
> prev = decode_cpu(old);
> node->prev = prev;
> + node->locked = 0;
>
> /*
> * osq_lock() unqueue
> @@ -144,7 +141,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(READ_ONCE(node->prev_cpu) - 1)))
> return true;
>
> /* unqueue */
> @@ -201,6 +198,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * it will wait in Step-A.
> */
>
> + WRITE_ONCE(next->prev_cpu, prev->cpu);
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
>

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:09:54

by Waiman Long

Subject: Re: [PATCH next v2 3/5] locking/osq_lock: Use node->prev_cpu instead of saving node->prev.

On 12/31/23 16:54, David Laight wrote:
> node->prev is only used to update 'prev' in the unlikely case
> of concurrent unqueues.
> This can be replaced by a check for node->prev_cpu changing
> and then calling decode_cpu() to get the changed 'prev' pointer.
>
> node->cpu (or more particularly) prev->cpu is only used for the
> osq_wait_next() call in the unqueue path.
> Normally this is exactly the value that the initial xchg() read
> from lock->tail (used to obtain 'prev'), but can get updated
> by concurrent unqueues.
>
> Both the 'prev' and 'cpu' members of optimistic_spin_node are
> now unused and can be deleted.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 31 +++++++++++++++++--------------
> 1 file changed, 17 insertions(+), 14 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index eb8a6dfdb79d..27324b509f68 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,9 +13,8 @@
> */
>
> struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> + struct optimistic_spin_node *next;
> int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> int prev_cpu; /* encoded CPU # + 1 value */
> };
>
> @@ -91,10 +90,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> - int old;
> + int prev_cpu;
>
> node->next = NULL;
> - node->cpu = curr;
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -102,13 +100,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> - if (old == OSQ_UNLOCKED_VAL)
> + prev_cpu = atomic_xchg(&lock->tail, curr);
> + if (prev_cpu == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->prev_cpu = old;
> - prev = decode_cpu(old);
> - node->prev = prev;
> + node->prev_cpu = prev_cpu;
> + prev = decode_cpu(prev_cpu);
> node->locked = 0;
>
> /*
> @@ -174,9 +171,16 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>
> /*
> * Or we race against a concurrent unqueue()'s step-B, in which
> - * case its step-C will write us a new @node->prev pointer.
> + * case its step-C will write us a new @node->prev_cpu value.
> */
> - prev = READ_ONCE(node->prev);
> + {
> + int new_prev_cpu = READ_ONCE(node->prev_cpu);
> +
> + if (new_prev_cpu == prev_cpu)
> + continue;
> + prev_cpu = new_prev_cpu;
> + prev = decode_cpu(prev_cpu);
> + }

Just a minor nit. It is not that common in the kernel to add another
nesting level just to reduce the scope of the new_prev_cpu auto variable.
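(A sketch of the flatter form, with 'int new_prev_cpu' hoisted up to the
existing declarations at the top of osq_lock(); not something from the
posted patch:

	new_prev_cpu = READ_ONCE(node->prev_cpu);
	if (new_prev_cpu == prev_cpu)
		continue;
	prev_cpu = new_prev_cpu;
	prev = decode_cpu(prev_cpu);
)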

Anyway,

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:13:41

by Waiman Long

Subject: Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.


On 12/31/23 16:54, David Laight wrote:
> When osq_lock() returns false or osq_unlock() returns static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to atomic_xchg(&lock->tail, curr) on extry to osq_lock().
>
> Just in case there a non-obvious race condition that can leave it
> non-NULL check with WARN_ON_ONCE() and NULL if set.
> Note that without this check the fast path (adding at the list head)
> doesn't need to to access the per-cpu osq_node at all.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 27324b509f68..35bb99e96697 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> - struct optimistic_spin_node *prev, *next;
> + struct optimistic_spin_node *node, *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> int prev_cpu;
>
> - node->next = NULL;
> + /*
> + * node->next should be NULL on entry.
> + * Check just in case there is a race somewhere.
> + * Note that this is probably an unnecessary cache miss in the fast path.
> + */
> + if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
> + raw_cpu_write(osq_node.next, NULL);
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -104,8 +109,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (prev_cpu == OSQ_UNLOCKED_VAL)
> return true;
>
> - node->prev_cpu = prev_cpu;
> + node = this_cpu_ptr(&osq_node);
> prev = decode_cpu(prev_cpu);
> + node->prev_cpu = prev_cpu;
> node->locked = 0;
>
> /*
Reviewed-by: Waiman Long <[email protected]>


2024-01-01 04:14:36

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 12/31/23 16:55, David Laight wrote:
> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value is osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> }
>
> /*

You really like micro-optimization.

Anyway,

Reviewed-by: Waiman Long <[email protected]>


2024-01-01 08:47:38

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 01 January 2024 04:14
...
> You really like micro-optimization.

They all add up :-)

David


2024-01-02 09:47:29

by Ingo Molnar

Subject: Re: [PATCH next v2 4/5] locking/osq_lock: Avoid writing to node->next in the osq_lock() fast path.


* David Laight <[email protected]> wrote:

> When osq_lock() returns false or osq_unlock() returns static
> analysis shows that node->next should always be NULL.
> This means that it isn't necessary to explicitly set it to NULL
> prior to atomic_xchg(&lock->tail, curr) on extry to osq_lock().
>
> Just in case there a non-obvious race condition that can leave it
> non-NULL check with WARN_ON_ONCE() and NULL if set.
> Note that without this check the fast path (adding at the list head)
> doesn't need to to access the per-cpu osq_node at all.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 27324b509f68..35bb99e96697 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -87,12 +87,17 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> - struct optimistic_spin_node *prev, *next;
> + struct optimistic_spin_node *node, *prev, *next;
> int curr = encode_cpu(smp_processor_id());
> int prev_cpu;
>
> - node->next = NULL;
> + /*
> + * node->next should be NULL on entry.
> + * Check just in case there is a race somewhere.
> + * Note that this is probably an unnecessary cache miss in the fast path.
> + */
> + if (WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL))
> + raw_cpu_write(osq_node.next, NULL);

The fix-uppery and explanation about something that shouldn't happen is
excessive: please just put a plain WARN_ON_ONCE() here - which we can
remove in a release or so.
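(Presumably just:

	WARN_ON_ONCE(raw_cpu_read(osq_node.next) != NULL);

i.e. the bare check without the conditional NULL write; a sketch of the
suggestion, not an applied change.)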

Thanks,

Ingo

2024-01-02 09:49:42

by Ingo Molnar

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.


* David Laight <[email protected]> wrote:

> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
>
> Although patched out for bare-metal the code still needs the cpu number.

Comma missing.

> Reading this from 'prev->cpu' is a pretty much guaranteed have a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.

Throughout the series, in changelogs and comments, please do:

s/cpu
/CPU

Please be more careful about changelog quality.

> struct optimistic_spin_node {
> struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* encoded CPU # + 1 value */
> };

s/ encoded CPU # + 1 value
/ encoded value: CPU+1

Thanks,

Ingo

2024-01-02 09:50:07

by Ingo Molnar

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


* David Laight <[email protected]> wrote:

> Since node->locked cannot be set before the assignment to prev->next
> it is save to clear it in the slow path.

s/save
/safe

Thanks,

Ingo

2024-01-02 09:51:47

by Ingo Molnar

Subject: Re: [PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.


Also, please don't put periods into titles:

s/[PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.
/[PATCH next v2 1/5] locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path

Usually maintainers remove them manually, but there's no reason to be
inconsistent: half of your series had a period. It just unnecessarily
distracts from review.

This rule applies to this series and to all future patches.

Thanks,

Ingo

2024-01-02 09:54:52

by Ingo Molnar

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


* David Laight <[email protected]> wrote:

> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> This requires the cpu number be 64bit.
> However the value is osq_lock() comes from a 32bit xchg() and there
> isn't a way of telling gcc the high bits are zero (they are) so
> there will always be an instruction to clear the high bits.
>
> The cpu number is also offset by one (to make the initialiser 0)
> It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
>
> Converting the cpu number to 32bit unsigned prior to the decrement means
> that gcc knows the decrement has set the high bits to zero and doesn't
> add a register-register move (or cltq) to zero/sign extend the value.
>
> Not massive but saves two instructions.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 35bb99e96697..37a4fa872989 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> {
> - int cpu_nr = encoded_cpu_val - 1;
> -
> - return per_cpu_ptr(&osq_node, cpu_nr);
> + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);

So why do we 'encode' the CPU number to begin with?

Why not use -1 as the special value? Checks for negative values
generate similarly fast machine code compared to checking for 0, if
the value is also used (which it is in most cases here). What am I
missing? We seem to be going through a lot of unnecessary hoops, and
some of that is in the runtime path.

Thanks,

Ingo

2024-01-02 10:20:47

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Ingo Molnar
> Sent: 02 January 2024 09:54
>
>
> * David Laight <[email protected]> wrote:
>
> > per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> > This requires the cpu number be 64bit.
> > However the value is osq_lock() comes from a 32bit xchg() and there
> > isn't a way of telling gcc the high bits are zero (they are) so
> > there will always be an instruction to clear the high bits.
> >
> > The cpu number is also offset by one (to make the initialiser 0)
> > It seems to be impossible to get gcc to convert __per_cpu_offset[cpu_p1 - 1]
> > into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the address).
> >
> > Converting the cpu number to 32bit unsigned prior to the decrement means
> > that gcc knows the decrement has set the high bits to zero and doesn't
> > add a register-register move (or cltq) to zero/sign extend the value.
> >
> > Not massive but saves two instructions.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> > kernel/locking/osq_lock.c | 6 ++----
> > 1 file changed, 2 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 35bb99e96697..37a4fa872989 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> > return cpu_nr + 1;
> > }
> >
> > -static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> > +static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
> > {
> > - int cpu_nr = encoded_cpu_val - 1;
> > -
> > - return per_cpu_ptr(&osq_node, cpu_nr);
> > + return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>
> So why do we 'encode' the CPU number to begin with?
>
> Why not use -1 as the special value? Checks for negative values
> generates similarly fast machine code compared to checking for 0, if
> the value is also used (which it is in most cases here). What am I
> missing? We seem to be going through a lot of unnecessary hoops, and
> some of that is in the runtime path.

You'd really have to ask the person who did the original patch
that changed lock->tail from a pointer to an int (saving 8 bytes)
in every mutex and rwsem.

I suspect the reason is that it is so much safer to have the
initialiser be zero, rather than a non-zero value with zero
being a valid value.

It is also hard to avoid an extra instruction in the per_cpu_ptr()
code - something has to extend the 32bit result from xchg() to
a 64bit one for the array index.
The asm for an unsigned 32bit exchange could return a 64bit result
(which would have the desired effect), but that won't work for
a signed value.
The '-1' in the vcpu_is_preempted() path will be executed in parallel
with something else and is likely to have no measurable effect.

So it is a slightly risky change that has less benefit than the
other changes (which save cache line accesses).
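(For context, the current encoding keeps the unlocked initialiser all zero.
Roughly, from include/linux/osq_lock.h:

	struct optimistic_spin_queue {
		atomic_t tail;
	};

	#define OSQ_UNLOCKED_VAL (0)
	#define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }

With a -1 sentinel every embedded lock would need an explicit non-zero
initialiser, so a zeroed structure would no longer be a valid unlocked lock.)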

David



2024-01-08 07:42:38

by Oliver Sang

Subject: Re: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.



Hello,

kernel test robot noticed a 10.7% improvement of stress-ng.netlink-task.ops_per_sec on:


commit: d93300891f810c9498d09a6ecea2403d7a3546f0 ("[PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.")
url: https://github.com/intel-lab-lkp/linux/commits/David-Laight/locking-osq_lock-Defer-clearing-node-locked-until-the-slow-osq_lock-path/20240101-055853
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 610a9b8f49fbcf1100716370d3b5f6f884a2835a
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH next v2 2/5] locking/osq_lock: Optimise the vcpu_is_preempted() check.

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

nr_threads: 100%
testtime: 60s
sc_pid_max: 4194304
class: scheduler
test: netlink-task
cpufreq_governor: performance






Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240108/[email protected]

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
scheduler/gcc-12/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-icl-2sp8/netlink-task/stress-ng/60s

commit:
ff787c1fd0 ("locking/osq_lock: Defer clearing node->locked until the slow osq_lock() path.")
d93300891f ("locking/osq_lock: Optimise the vcpu_is_preempted() check.")

ff787c1fd0c13f9b d93300891f810c9498d09a6ecea
---------------- ---------------------------
%stddev %change %stddev
\ | \
3880 ? 7% +26.4% 4903 ? 18% vmstat.system.cs
0.48 ?126% -99.8% 0.00 ?141% perf-sched.sch_delay.avg.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
0.16 ? 23% -38.9% 0.10 ? 32% perf-sched.sch_delay.avg.ms.schedule_preempt_disabled.__mutex_lock.constprop.0.genl_rcv_msg
1.50 ?118% -99.9% 0.00 ?142% perf-sched.sch_delay.max.ms.__cond_resched.aa_sk_perm.security_socket_recvmsg.sock_recvmsg.__sys_recvfrom
2.55 ? 97% -83.7% 0.42 ?145% perf-sched.wait_time.max.ms.__cond_resched.__mutex_lock.constprop.0.genl_rcv_msg
32244865 +10.7% 35709040 stress-ng.netlink-task.ops
537413 +10.7% 595150 stress-ng.netlink-task.ops_per_sec
38094 ? 12% +42.2% 54160 ? 27% stress-ng.time.involuntary_context_switches
42290 ? 11% +39.8% 59117 ? 23% stress-ng.time.voluntary_context_switches
143.50 ? 7% -28.8% 102.17 ? 15% perf-c2c.DRAM.local
4955 ? 3% -26.4% 3647 ? 4% perf-c2c.DRAM.remote
4038 ? 2% -18.8% 3277 ? 4% perf-c2c.HITM.local
3483 ? 3% -21.1% 2747 ? 5% perf-c2c.HITM.remote
7521 ? 2% -19.9% 6024 ? 4% perf-c2c.HITM.total
0.42 ? 3% -16.2% 0.35 ? 5% perf-stat.i.MPKI
1.066e+10 +9.6% 1.169e+10 perf-stat.i.branch-instructions
51.90 -2.5 49.42 ? 2% perf-stat.i.cache-miss-rate%
22517746 ? 3% -13.4% 19503564 ? 5% perf-stat.i.cache-misses
3730 ? 7% +29.2% 4819 ? 19% perf-stat.i.context-switches
3.99 -3.1% 3.86 perf-stat.i.cpi
9535 ? 3% +15.4% 11003 ? 5% perf-stat.i.cycles-between-cache-misses
0.00 ? 3% +0.0 0.00 ? 3% perf-stat.i.dTLB-load-miss-rate%
1.419e+10 -14.9% 1.207e+10 perf-stat.i.dTLB-loads
8.411e+08 +8.4% 9.118e+08 perf-stat.i.dTLB-stores
5.36e+10 +3.1% 5.524e+10 perf-stat.i.instructions
0.26 +7.0% 0.28 ? 5% perf-stat.i.ipc
837.29 ? 3% -9.8% 755.30 ? 4% perf-stat.i.metric.K/sec
401.41 -4.1% 385.10 perf-stat.i.metric.M/sec
6404966 -23.2% 4920722 perf-stat.i.node-load-misses
141818 ? 4% -22.2% 110404 ? 4% perf-stat.i.node-loads
69.54 +13.8 83.36 perf-stat.i.node-store-miss-rate%
3935319 +10.4% 4345724 perf-stat.i.node-store-misses
1626033 -52.6% 771187 ? 5% perf-stat.i.node-stores
0.42 ? 3% -16.0% 0.35 ? 5% perf-stat.overall.MPKI
0.11 -0.0 0.10 ? 8% perf-stat.overall.branch-miss-rate%
51.32 -2.5 48.79 ? 2% perf-stat.overall.cache-miss-rate%
4.06 -3.0% 3.94 perf-stat.overall.cpi
9677 ? 3% +15.6% 11187 ? 5% perf-stat.overall.cycles-between-cache-misses
0.00 ? 3% +0.0 0.00 ? 4% perf-stat.overall.dTLB-load-miss-rate%
0.25 +3.1% 0.25 perf-stat.overall.ipc
70.78 +14.2 84.94 perf-stat.overall.node-store-miss-rate%
1.049e+10 +9.5% 1.149e+10 perf-stat.ps.branch-instructions
22167740 ? 3% -13.4% 19186498 ? 5% perf-stat.ps.cache-misses
3667 ? 7% +29.1% 4735 ? 19% perf-stat.ps.context-switches
1.396e+10 -15.0% 1.187e+10 perf-stat.ps.dTLB-loads
8.273e+08 +8.3% 8.963e+08 perf-stat.ps.dTLB-stores
5.274e+10 +3.0% 5.433e+10 perf-stat.ps.instructions
6303682 -23.2% 4839978 perf-stat.ps.node-load-misses
140690 ? 4% -22.5% 109023 ? 4% perf-stat.ps.node-loads
3875362 +10.3% 4276026 perf-stat.ps.node-store-misses
1599985 -52.6% 758184 ? 5% perf-stat.ps.node-stores
3.297e+12 +3.0% 3.396e+12 perf-stat.total.instructions
96.07 -0.2 95.87 perf-profile.calltrace.cycles-pp.osq_lock.__mutex_lock.genl_rcv_msg.netlink_rcv_skb.genl_rcv
97.52 -0.1 97.37 perf-profile.calltrace.cycles-pp.__mutex_lock.genl_rcv_msg.netlink_rcv_skb.genl_rcv.netlink_unicast
98.98 -0.1 98.90 perf-profile.calltrace.cycles-pp.netlink_rcv_skb.genl_rcv.netlink_unicast.netlink_sendmsg.__sys_sendto
98.99 -0.1 98.92 perf-profile.calltrace.cycles-pp.genl_rcv.netlink_unicast.netlink_sendmsg.__sys_sendto.__x64_sys_sendto
98.97 -0.1 98.89 perf-profile.calltrace.cycles-pp.genl_rcv_msg.netlink_rcv_skb.genl_rcv.netlink_unicast.netlink_sendmsg
99.09 -0.1 99.04 perf-profile.calltrace.cycles-pp.netlink_unicast.netlink_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64
99.47 -0.0 99.43 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.stress_netlink_taskstats_monitor.stress_netlink_task
99.44 -0.0 99.40 perf-profile.calltrace.cycles-pp.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto.stress_netlink_taskstats_monitor
99.35 -0.0 99.32 perf-profile.calltrace.cycles-pp.netlink_sendmsg.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe
99.44 -0.0 99.40 perf-profile.calltrace.cycles-pp.__sys_sendto.__x64_sys_sendto.do_syscall_64.entry_SYSCALL_64_after_hwframe.sendto
96.08 -0.2 95.89 perf-profile.children.cycles-pp.osq_lock
97.52 -0.1 97.38 perf-profile.children.cycles-pp.__mutex_lock
98.98 -0.1 98.90 perf-profile.children.cycles-pp.netlink_rcv_skb
99.00 -0.1 98.92 perf-profile.children.cycles-pp.genl_rcv
98.97 -0.1 98.89 perf-profile.children.cycles-pp.genl_rcv_msg
99.20 -0.0 99.15 perf-profile.children.cycles-pp.netlink_unicast
0.13 ? 3% -0.0 0.08 ? 7% perf-profile.children.cycles-pp.genl_cmd_full_to_split
0.14 ? 4% -0.0 0.10 ? 5% perf-profile.children.cycles-pp.genl_get_cmd
99.36 -0.0 99.32 perf-profile.children.cycles-pp.netlink_sendmsg
99.44 -0.0 99.41 perf-profile.children.cycles-pp.__x64_sys_sendto
99.44 -0.0 99.41 perf-profile.children.cycles-pp.__sys_sendto
99.59 -0.0 99.56 perf-profile.children.cycles-pp.sendto
0.07 ? 5% +0.0 0.08 ? 5% perf-profile.children.cycles-pp.genl_family_rcv_msg_attrs_parse
0.11 +0.0 0.12 ? 6% perf-profile.children.cycles-pp.apparmor_capable
0.18 ? 3% +0.0 0.20 ? 4% perf-profile.children.cycles-pp.netlink_recvmsg
0.36 +0.0 0.38 perf-profile.children.cycles-pp.fill_stats
0.13 ? 3% +0.0 0.15 ? 4% perf-profile.children.cycles-pp.ns_capable
0.20 ? 3% +0.0 0.23 ? 4% perf-profile.children.cycles-pp.sock_recvmsg
0.24 ? 3% +0.0 0.27 ? 3% perf-profile.children.cycles-pp.__sys_recvfrom
0.24 ? 3% +0.0 0.27 ? 4% perf-profile.children.cycles-pp.__x64_sys_recvfrom
0.31 ? 3% +0.0 0.34 ? 3% perf-profile.children.cycles-pp.recv
1.22 +0.0 1.26 perf-profile.children.cycles-pp.genl_family_rcv_msg
0.85 +0.1 0.90 perf-profile.children.cycles-pp.cmd_attr_pid
0.94 +0.1 1.01 perf-profile.children.cycles-pp.genl_family_rcv_msg_doit
1.11 +0.1 1.23 perf-profile.children.cycles-pp.mutex_spin_on_owner
95.80 -0.2 95.62 perf-profile.self.cycles-pp.osq_lock
0.13 ? 3% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.genl_cmd_full_to_split
0.11 ? 3% +0.0 0.12 ? 6% perf-profile.self.cycles-pp.apparmor_capable
1.11 +0.1 1.23 perf-profile.self.cycles-pp.mutex_spin_on_owner




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2024-05-03 15:59:59

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 12/31/23 23:14, Waiman Long wrote:
>
> On 12/31/23 16:55, David Laight wrote:
>> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
>> This requires the cpu number be 64bit.
>> However the value is osq_lock() comes from a 32bit xchg() and there
>> isn't a way of telling gcc the high bits are zero (they are) so
>> there will always be an instruction to clear the high bits.
>>
>> The cpu number is also offset by one (to make the initialiser 0)
>> It seems to be impossible to get gcc to convert
>> __per_cpu_offset[cpu_p1 - 1]
>> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
>> address).
>>
>> Converting the cpu number to 32bit unsigned prior to the decrement means
>> that gcc knows the decrement has set the high bits to zero and doesn't
>> add a register-register move (or cltq) to zero/sign extend the value.
>>
>> Not massive but saves two instructions.
>>
>> Signed-off-by: David Laight <[email protected]>
>> ---
>>   kernel/locking/osq_lock.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 35bb99e96697..37a4fa872989 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
>>       return cpu_nr + 1;
>>   }
>>   -static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
>> encoded_cpu_val)
>>   {
>> -    int cpu_nr = encoded_cpu_val - 1;
>> -
>> -    return per_cpu_ptr(&osq_node, cpu_nr);
>> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
>>   }
>>     /*
>
> You really like micro-optimization.
>
> Anyway,
>
> Reviewed-by: Waiman Long <[email protected]>
>
David,

Could you respin the series based on the latest upstream code?

Thanks,
Longman


2024-05-03 16:23:18

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 17:00
> To: David Laight <[email protected]>; '[email protected]' <linux-
> [email protected]>; '[email protected]' <[email protected]>
> Cc: '[email protected]' <[email protected]>; '[email protected]' <[email protected]>; '[email protected]'
> <[email protected]>; 'Linus Torvalds' <[email protected]>; '[email protected]
> foundation.org' <[email protected]>; 'Zeng Heng' <[email protected]>
> Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().
>
>
> On 12/31/23 23:14, Waiman Long wrote:
> >
> > On 12/31/23 16:55, David Laight wrote:
> >> per_cpu_ptr() indexes __per_cpu_offset[] with the cpu number.
> >> This requires the cpu number be 64bit.
> >> However the value is osq_lock() comes from a 32bit xchg() and there
> >> isn't a way of telling gcc the high bits are zero (they are) so
> >> there will always be an instruction to clear the high bits.
> >>
> >> The cpu number is also offset by one (to make the initialiser 0)
> >> It seems to be impossible to get gcc to convert
> >> __per_cpu_offset[cpu_p1 - 1]
> >> into (__per_cpu_offset - 1)[cpu_p1] (transferring the offset to the
> >> address).
> >>
> >> Converting the cpu number to 32bit unsigned prior to the decrement means
> >> that gcc knows the decrement has set the high bits to zero and doesn't
> >> add a register-register move (or cltq) to zero/sign extend the value.
> >>
> >> Not massive but saves two instructions.
> >>
> >> Signed-off-by: David Laight <[email protected]>
> >> ---
> >>   kernel/locking/osq_lock.c | 6 ++----
> >>   1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index 35bb99e96697..37a4fa872989 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -29,11 +29,9 @@ static inline int encode_cpu(int cpu_nr)
> >>       return cpu_nr + 1;
> >>   }
> >>   -static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> +static inline struct optimistic_spin_node *decode_cpu(unsigned int
> >> encoded_cpu_val)
> >>   {
> >> -    int cpu_nr = encoded_cpu_val - 1;
> >> -
> >> -    return per_cpu_ptr(&osq_node, cpu_nr);
> >> +    return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
> >>   }
> >>     /*
> >
> > You really like micro-optimization.
> >
> > Anyway,
> >
> > Reviewed-by: Waiman Long <[email protected]>
> >
> David,
>
> Could you respin the series based on the latest upstream code?

Looks like a wet bank holiday weekend.....

David


2024-05-03 21:13:06

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 17:00
...
> David,
>
> Could you respin the series based on the latest upstream code?

I've just reapplied the patches to 'master' and they all apply
cleanly and diffing the new patches to the old ones gives no differences.
So I think they should still apply.

Were you seeing a specific problem?

I don't remember any suggested changes either.
(Apart from a very local variable I used to keep a patch isolated.)

David


2024-05-03 22:13:53

by Waiman Long

Subject: Re: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().


On 5/3/24 17:10, David Laight wrote:
> From: Waiman Long
>> Sent: 03 May 2024 17:00
> ...
>> David,
>>
>> Could you respin the series based on the latest upstream code?
> I've just reapplied the patches to 'master' and they all apply
> cleanly and diffing the new patches to the old ones gives no differences.
> So I think they should still apply.
>
> Were you seeing a specific problem?
>
> I don't remember any suggested changed either.
> (Apart from a very local variable I used to keep a patch isolated.)

No, I just want to make sure that your patches will still apply. Anyway,
it will be easier for the maintainer to merge your remaining patches if
you can send out a new version even if they are almost the same as the
old ones.

Thanks,
Longman


2024-05-04 20:27:00

by David Laight

Subject: RE: [PATCH next v2 5/5] locking/osq_lock: Optimise decode_cpu() and per_cpu_ptr().

From: Waiman Long
> Sent: 03 May 2024 23:14
>
>
> On 5/3/24 17:10, David Laight wrote:
> > From: Waiman Long
> >> Sent: 03 May 2024 17:00
> > ...
> >> David,
> >>
> >> Could you respin the series based on the latest upstream code?
> > I've just reapplied the patches to 'master' and they all apply
> > cleanly and diffing the new patches to the old ones gives no differences.
> > So I think they should still apply.
> >
> > Were you seeing a specific problem?
> >
> > I don't remember any suggested changed either.
> > (Apart from a very local variable I used to keep a patch isolated.)
>
> No, I just want to make sure that your patches will still apply. Anyway,
> it will be easier for the maintainer to merge your remaining patches if
> you can send out a new version even if they are almost the same as the
> old ones.

I don't think any changes are needed.
So the existing versions are fine.
They applied (well my copy of what I think I sent applied) and built.
So there shouldn't be any issues.

David
