Subject: Re: [PATCH v2] locking/osq_lock: Optimize osq_lock performance using per-NUMA
Date: Mon, 18 Mar 2024 21:59:35 -0400
From: Waiman Long
To: Guo Hui, peterz@infradead.org, mingo@redhat.com, will@kernel.org,
    boqun.feng@gmail.com, David.Laight@ACULAB.COM
Cc: linux-kernel@vger.kernel.org
In-Reply-To: <20240318055023.25546-1-guohui@uniontech.com>
References: <20240318055023.25546-1-guohui@uniontech.com>

On 3/18/24 01:50, Guo Hui wrote:
> Changes in version v1:
> The queue is divided according to NUMA nodes,
> but the tail of each NUMA node is still stored
> in the structure optimistic_spin_queue.
>
> Changes in version v2:
> 1. On the basis of v1,
> the tail of each NUMA node is separated from
> the structure optimistic_spin_queue,
> and a separate memory space is allocated.
> This memory space is a statically pre-allocated
> fixed-size array osnq_queue[1024]; the array length is 1024.
> Each array element of osnq_queue is an array of atomic_t.
> The length of this atomic_t array is MAX_NUMNODES.
> The total memory size of the statically allocated arrays is as follows:
>
>     4 * MAX_NUMNODES * 1024
>
> When MAX_NUMNODES is 64, the memory is 256K,
> and when it is 1024, the memory is 4M.
>
> The relationship between the statically allocated array osnq_queue
> and the structure optimistic_spin_queue is that the hash value of
> the optimistic_spin_queue structure pointer is used as the index of
> the array osnq_queue, and the array element
> corresponding to the osq_lock is obtained from osnq_queue.
> This array element stores the tail value of
> each NUMA node corresponding to that osq_lock.
>
> The form of the osnq_queue array is as follows:
>
>  ----------------------------------------------------------
> |                  |                                       |
> |                  |             MAX_NUMNODES              |
> |                  |                                       |
> |                  |---------------------------------------|
> |                  |               |               |       |
> |                  | atomic_t tail | atomic_t tail |  ...  |
> |                  |               |               |       |
> |                  |---------------------------------------|
> |                  |               |               |       |
> |                  | atomic_t tail | atomic_t tail |  ...  |
> |                  |               |               |       |
> |                  |---------------------------------------|
> |                  |               |               |       |
> | osnq_queue[1024] | atomic_t tail | atomic_t tail |  ...  |
> |                  |               |               |       |
> |                  |---------------------------------------|
> | The hash value ->|               |               |       |
> | of osq_lock is   | atomic_t tail | atomic_t tail |  ...  |
> | the index        |               |               |       |
> |                  |---------------------------------------|
> |                  |               |               |       |
> |                  |    ... ...    |    ... ...    |  ...  |
> |                  |               |               |       |
> |------------------|---------------------------------------|
>
> There is a very small probability that different osq_locks
> with the same hash value will run concurrently on different CPUs.
> After extensive testing, this probability is no greater than 0.01%,
> and this situation is also normal.
> In that case, two different osq_locks share the same
> osnq_queue array element, so these two different osq_locks
> share the same NUMA linked list. Putting CPU nodes waiting
> for different osq_locks into the same NUMA linked list
> means that CPU nodes with different osq_locks
> share the same lock of the same NUMA queue.
> This is essentially the same method as using chaining ("zippers")
> to resolve hash collisions.
>
> As an extreme case, the above osnq_queue array was reduced
> to a single array element. Then all osq_locks in the kernel
> share the same queue on each NUMA node.
> After verification, the kernel can still run normally,
> although the performance of some test cases deteriorates.
> This patch greatly reduces the probability of
> shared queues to less than 0.01%, greatly improving
> kernel osq_lock performance.

I don't like the idea of using a hash table and having hash collisions. Is
it possible to adopt the numa-aware qspinlock's idea of having a primary
queue for the current numa node and a secondary queue for all the other
nodes? I know it will be more complex because of the need to back out when
rescheduling is needed.

Also, I would prefer to make this numa-awareness optional, so that whether
the numa-aware version or the original version is used is determined at
boot time. At the beginning, the numa-aware osq_lock should default to off
unless it is forced on by a command-line or Kconfig option. We want to
minimize the risk of introducing bugs in the new code. (A rough sketch of
such a boot-time switch follows the quoted changelog below.)

>
> 2. Achieve fairness in transferring locks between nodes
> and prevent the same NUMA node from holding locks for a long time.
> This method borrows from the qspinlock numa-aware scheme.
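A minimal sketch of what such a default-off boot-time switch could look
like. The parameter name "numa_osq" and the helpers osq_lock_legacy() and
osq_lock_numa() are made up for illustration only; they are not part of the
patch:

#include <linux/init.h>

static bool numa_osq_enabled __ro_after_init;	/* hypothetical flag, off by default */

static int __init numa_osq_setup(char *str)
{
	numa_osq_enabled = true;		/* opt in via "numa_osq" on the command line */
	return 0;
}
early_param("numa_osq", numa_osq_setup);	/* hypothetical parameter name */

bool osq_lock(struct optimistic_spin_queue *lock)
{
	/* Fall back to the existing code unless explicitly enabled at boot. */
	if (!numa_osq_enabled)
		return osq_lock_legacy(lock);	/* hypothetical: current implementation */
	return osq_lock_numa(lock);		/* hypothetical: new NUMA-aware path */
}

Whether this is better done with a static key or a Kconfig default is a
separate question; the point is only that the existing osq_lock() behaviour
stays untouched unless the new code is explicitly requested.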
>
> The effect on the 96-core 4-NUMA ARM64 platform is as follows:
> System Benchmarks Partial Index           with patch  without patch   promote
> Execl Throughput                              7255.8        5632.8    +28.81%
> File Copy 1024 bufsize 2000 maxblocks         1817.2         910.9    +99.50%
> File Copy 256 bufsize 500 maxblocks           1168.1         570.4   +104.79%
> File Copy 4096 bufsize 8000 maxblocks         3321.1        2088.7    +59.00%
>
> The effect on the 128-core 8-NUMA X86_64 platform is as follows:
> System Benchmarks Partial Index           with patch  without patch   promote
> Execl Throughput                              3947.0        3533.6    +11.70%
> File Copy 1024 bufsize 2000 maxblocks          819.1         553.0    +48.12%
> File Copy 256 bufsize 500 maxblocks            508.5         330.2    +54.00%
> File Copy 4096 bufsize 8000 maxblocks         1982.2        1377.1    +43.94%
>
> Signed-off-by: Guo Hui
> ---
>  include/linux/osq_lock.h  |  29 ++++++++--
>  kernel/locking/osq_lock.c | 111 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 126 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index ea8fb31379e3..3433f13276b8 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -2,6 +2,13 @@
>  #ifndef __LINUX_OSQ_LOCK_H
>  #define __LINUX_OSQ_LOCK_H
>
> +#include
> +#include
> +
> +struct optimistic_spin_numa_queue {
> +        atomic_t tail[MAX_NUMNODES];    /* Store the tail of each NUMA queue */
> +};
> +
>  /*
>   * An MCS like lock especially tailored for optimistic spinning for sleeping
>   * lock implementations (mutex, rwsem, etc).
> @@ -9,12 +16,16 @@
>
>  struct optimistic_spin_queue {
>          /*
> -         * Stores an encoded value of the CPU # of the tail node in the queue.
> -         * If the queue is empty, then it's set to OSQ_UNLOCKED_VAL.
> +         * Stores the hash value of the structure type pointer.
>           */
>          atomic_t tail;
>  };
>
> +#define HASH_BITS_LEN 10
> +
> +/* this value is 2^@HASH_BITS_LEN */
> +#define NUMA_QUEUE_SIZE 1024
> +
>  #define OSQ_UNLOCKED_VAL (0)
>
>  /* Init macro and function. */
> @@ -22,15 +33,25 @@ struct optimistic_spin_queue {
>
>  static inline void osq_lock_init(struct optimistic_spin_queue *lock)
>  {
> -        atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
> +        atomic_set(&lock->tail, hash_ptr(lock, HASH_BITS_LEN));
>  }
>
>  extern bool osq_lock(struct optimistic_spin_queue *lock);
>  extern void osq_unlock(struct optimistic_spin_queue *lock);
>
> +extern struct optimistic_spin_numa_queue osnq_queue[NUMA_QUEUE_SIZE];
> +
>  static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
>  {
> -        return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
> +        int node;
> +        atomic_t *numa_tail = osnq_queue[atomic_read(&lock->tail)].tail;
> +
> +        for_each_online_node(node) {
> +                if (atomic_read(&numa_tail[node]) != OSQ_UNLOCKED_VAL)
> +                        return true;
> +        }
> +
> +        return false;
>  }

That greatly increases the cost of calling osq_is_locked().

>
>  #endif
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 75a6f6133866..bea6a2784b5e 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -2,6 +2,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>
>  /*
>   * An MCS like lock especially tailored for optimistic spinning for sleeping
> @@ -14,12 +16,48 @@
>
>  struct optimistic_spin_node {
>          struct optimistic_spin_node *next, *prev;
> +        atomic_t *tail;

Don't use the name "tail" here as it is also used in optimistic_spin_queue.
It is hard to tell which tail is which when reading the code. Make them
unique.
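For illustration only, one way the two fields could be disambiguated; the
names "hash" and "numa_tails" below are made up, not taken from the patch:

struct optimistic_spin_queue {
	atomic_t hash;			/* was "tail": hash index into osnq_queue[] */
};

struct optimistic_spin_node {
	struct optimistic_spin_node *next, *prev;
	atomic_t *numa_tails;		/* was "tail": this lock's per-NUMA-node tail array */
	int locked;
	int cpu;
	int numa_node;
};

Call sites would then read numa_tails[node->numa_node] rather than
tail[...], so each use makes it obvious whether the per-lock hash or a
per-node tail is meant.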
>          int locked; /* 1 if lock acquired */
>          int cpu; /* encoded CPU # + 1 value */
> +        int numa_node;
>  };
>
>  static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> +/* Use the hash value of the structure optimistic_spin_node type pointer as the index. */
> +struct optimistic_spin_numa_queue osnq_queue[NUMA_QUEUE_SIZE];
> +
> +#define INVALID_NUMA_NODE (-1)
> +
> +/* Per-CPU pseudo-random number seed */
> +static DEFINE_PER_CPU(u32, seed);
> +
> +/*
> + * Controls the probability for intra-node lock hand-off. It can be
> + * tuned and depend, e.g., on the number of CPUs per node. For now,
> + * choose a value that provides reasonable long-term fairness without
> + * sacrificing performance compared to a version that does not have any
> + * fairness guarantees.
> + */
> +#define INTRA_NODE_HANDOFF_PROB_ARG (16)
> +
> +
> +/*
> + * Return false with probability 1 / 2^@num_bits.
> + * Intuitively, the larger @num_bits the less likely false is to be returned.
> + * @num_bits must be a number between 0 and 31.
> + */
> +static bool probably(unsigned int num_bits)
> +{
> +        u32 s;
> +
> +        s = this_cpu_read(seed);
> +        s = next_pseudo_random32(s);
> +        this_cpu_write(seed, s);
> +
> +        return s & ((1 << num_bits) - 1);
> +}
> +
>  /*
>   * We use the value 0 to represent "no CPU", thus the encoded value
>   * will be the CPU number incremented by 1.
> @@ -58,8 +96,8 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>          int curr = encode_cpu(smp_processor_id());
>
>          for (;;) {
> -                if (atomic_read(&lock->tail) == curr &&
> -                    atomic_cmpxchg_acquire(&lock->tail, curr, old_cpu) == curr) {
> +                if (atomic_read(&node->tail[node->numa_node]) == curr &&
> +                    atomic_cmpxchg_acquire(&node->tail[node->numa_node], curr, old_cpu) == curr) {
>                          /*
>                           * We were the last queued, we moved @lock back. @prev
>                           * will now observe @lock and will complete its
> @@ -90,6 +128,19 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>          }
>  }
>
> +static atomic_t osq_lock_node = ATOMIC_INIT(-1);
> +
> +static void osq_wait_numa_node(struct optimistic_spin_node *node)
> +{
> +        int old_node;
> +
> +        while (!need_resched() &&
> +               ((old_node = atomic_cmpxchg_acquire(&osq_lock_node, INVALID_NUMA_NODE,
> +                                                   node->numa_node)) != INVALID_NUMA_NODE) &&
> +               (node->numa_node != old_node))
> +                cpu_relax();
> +}
> +
>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
>          struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> @@ -100,6 +151,8 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>          node->locked = 0;
>          node->next = NULL;
>          node->cpu = curr;
> +        node->numa_node = cpu_to_node(smp_processor_id());
> +        node->tail = osnq_queue[atomic_read(&lock->tail)].tail;
>
>          /*
>           * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -107,9 +160,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * the node fields we just initialised) semantics when updating
>           * the lock tail.
>           */
> -        old = atomic_xchg(&lock->tail, curr);
> -        if (old == OSQ_UNLOCKED_VAL)
> +        old = atomic_xchg(&node->tail[node->numa_node], curr);
> +        if (old == OSQ_UNLOCKED_VAL) {
> +                osq_wait_numa_node(node);
>                  return true;
> +        }
>
>          prev = decode_cpu(old);
>          node->prev = prev;
> @@ -144,8 +199,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>           * polling, be careful.
>           */
>          if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> -                                  vcpu_is_preempted(node_cpu(node->prev))))
> +                                  vcpu_is_preempted(node_cpu(node->prev)))) {
> +                osq_wait_numa_node(node);
>                  return true;
> +        }
>
>          /* unqueue */
>          /*
> @@ -170,8 +227,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>                   * in which case we should observe @node->locked becoming
>                   * true.
>                   */
> -                if (smp_load_acquire(&node->locked))
> +                if (smp_load_acquire(&node->locked)) {
> +                        osq_wait_numa_node(node);
>                          return true;
> +                }
>
>                  cpu_relax();
>
> @@ -207,29 +266,61 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>          return false;
>  }
>
> +/*
> + * Pass the lock to the next NUMA node.
> + */
> +static void pass_lock_numa_node(struct optimistic_spin_node *node)
> +{
> +        int curr_node = cpu_to_node(smp_processor_id());
> +        int numa_node = curr_node;
> +        int num_nodes = num_online_nodes();
> +
> +        do {
> +                numa_node = (numa_node + 1) % num_nodes;
> +                if (numa_node == curr_node) {
> +                        atomic_set(&osq_lock_node, INVALID_NUMA_NODE);
> +                        return;
> +                }
> +        } while (atomic_read(&node->tail[numa_node]) == OSQ_UNLOCKED_VAL);
> +        atomic_set(&osq_lock_node, numa_node);
> +}
> +
> +static inline void pass_lock_fair(struct optimistic_spin_node *node)
> +{
> +        if (!probably(INTRA_NODE_HANDOFF_PROB_ARG))
> +                pass_lock_numa_node(node);
> +}
> +
>  void osq_unlock(struct optimistic_spin_queue *lock)
>  {
>          struct optimistic_spin_node *node, *next;
>          int curr = encode_cpu(smp_processor_id());
>
> +        node = this_cpu_ptr(&osq_node);
> +
>          /*
>           * Fast path for the uncontended case.
>           */
> -        if (likely(atomic_cmpxchg_release(&lock->tail, curr,
> -                                          OSQ_UNLOCKED_VAL) == curr))
> +        if (likely(atomic_cmpxchg_release(&node->tail[node->numa_node], curr,
> +                                          OSQ_UNLOCKED_VAL) == curr)) {
> +                pass_lock_numa_node(node);
>                  return;
> +        }
>
>          /*
>           * Second most likely case.
>           */
> -        node = this_cpu_ptr(&osq_node);
>          next = xchg(&node->next, NULL);
>          if (next) {
>                  WRITE_ONCE(next->locked, 1);
> +                pass_lock_fair(node);
>                  return;
>          }
>
>          next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
> -        if (next)
> +        if (next) {
>                  WRITE_ONCE(next->locked, 1);
> +                pass_lock_fair(node);
> +        } else
> +                pass_lock_numa_node(node);
>  }

With numa-aware qspinlock, most of the action is in the unlock code to scan
the queue.

Cheers,
Longman