2023-12-29 20:52:23

by David Laight

Subject: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

Zeng Heng noted that heavy use of the osq (optimistic spin queue) code
used rather more cpu than might be expected. See:
https://lore.kernel.org/lkml/[email protected]/T/#mcc46eedd1ef22a0d668828b1d088508c9b1875b8

Part of the problem is a pretty much guaranteed cache line reload when
reading node->prev->cpu for the vcpu_is_preempted() check in the wakeup path,
which slows it down (even on bare metal).
(On bare metal the hypervisor call is patched out, but the argument is still read.)

Careful analysis shows that it isn't necessary to dirty the per-cpu data
on the osq_lock() fast path. This may be slightly beneficial.

The code also uses this_cpu_ptr() to get the address of the per-cpu data.
On x86-64 (at least) this is implemented as:
	&per_cpu_data[smp_processor_id()]->member
i.e. there is a real function call, an array index and an add.
However, if raw_cpu_read() can be used instead (typically just an offset
from a register - %gs for x86-64) the code will be faster.
Putting the address of the per-cpu node into itself means that only one
cache line needs to be loaded.
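
Roughly, the difference is (illustrative sketch only; the 'self' field is
what patch 4/5 adds, and the exact code generated depends on compiler and
config):

	/* current code: function call + per-cpu offset lookup + add */
	node = this_cpu_ptr(&osq_node);

	/* with patch 4/5: a single %gs-relative load of a pointer that
	 * the node stores back to itself on first use.
	 */
	node = raw_cpu_read(osq_node.self);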

I can't see a list of per-cpu data initialisation functions, so the fields
are initialised on the first osq_lock() call.

The last patch avoids the cache line reload when calling vcpu_is_preempted()
by simply saving node->prev->cpu as node->prev_cpu and updating it when
node->prev changes.
This is simpler than the patch proposed by Waiman.
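
In outline, patch 5/5 does (sketch, see the full diff below):

	/* queueing path: remember the previous cpu's actual number */
	node->prev_cpu = old - 1;

	/* spin loop: no need to follow node->prev to get the cpu # */
	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(node->prev_cpu)))
		return true;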

David Laight (5):
Move the definition of optimistic_spin_node into osf_lock.c
Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
Clarify osq_wait_next()
Optimise per-cpu data accesses.
Optimise vcpu_is_preempted() check.

include/linux/osq_lock.h | 5 ----
kernel/locking/osq_lock.c | 61 +++++++++++++++++++++------------------
2 files changed, 33 insertions(+), 33 deletions(-)

--
2.17.1




2023-12-29 20:54:19

by David Laight

Subject: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.

Signed-off-by: David Laight <[email protected]>
---
include/linux/osq_lock.h | 5 -----
kernel/locking/osq_lock.c | 7 +++++++
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
* An MCS like lock especially tailored for optimistic spinning for sleeping
* lock implementations (mutex, rwsem, etc).
*/
-struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
-};

struct optimistic_spin_queue {
/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
* called from interrupt context and we have preemption disabled while
* spinning.
*/
+
+struct optimistic_spin_node {
+ struct optimistic_spin_node *next, *prev;
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+};
+
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

/*
--
2.17.1



2023-12-29 20:56:33

by David Laight

Subject: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.

Also directly return NULL or 'next' instead of breaking the loop.

Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)
{
- struct optimistic_spin_node *next = NULL;
+ struct optimistic_spin_node *next;
int curr = node->cpu;
- int old;

/*
- * If there is a prev node in queue, then the 'old' value will be
- * the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
- * we're currently last in queue, then the queue will then become empty.
+ * If osq_lock() is being cancelled there must be a previous node
+ * and 'old' is its CPU #.
+ * For osq_unlock() there is never a previous node and old is set
+ * to OSQ_UNLOCKED_VAL.
*/
- old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;

for (;;) {
if (atomic_read(&lock->tail) == curr &&
@@ -69,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
* will now observe @lock and will complete its
* unlock()/unqueue().
*/
- break;
+ return NULL;
}

/*
@@ -85,13 +84,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
if (node->next) {
next = xchg(&node->next, NULL);
if (next)
- break;
+ return next;
}

cpu_relax();
}
-
- return next;
}

bool osq_lock(struct optimistic_spin_queue *lock)
@@ -192,7 +189,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* back to @prev.
*/

- next = osq_wait_next(lock, node, prev);
+ next = osq_wait_next(lock, node, prev->cpu);
if (!next)
return false;

@@ -232,7 +229,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
return;
}

- next = osq_wait_next(lock, node, NULL);
+ next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
if (next)
WRITE_ONCE(next->locked, 1);
}
--
2.17.1



2023-12-29 20:57:44

by David Laight

Subject: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).

Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
*/

struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
};
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,

bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
struct optimistic_spin_node *prev, *next;
int old;

- if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
- node->cpu = encode_cpu(smp_processor_id());
+ if (unlikely(!node)) {
+ int cpu = encode_cpu(smp_processor_id());
+ node = decode_cpu(cpu);
+ node->self = node;
+ node->cpu = cpu;
+ }

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
/*
* Second most likely case.
*/
- node = this_cpu_ptr(&osq_node);
+ node = raw_cpu_read(osq_node.self);
next = xchg(&node->next, NULL);
if (next) {
WRITE_ONCE(next->locked, 1);
--
2.17.1



2023-12-29 20:58:49

by David Laight

Subject: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
when osq_unlock() is waking up the next cpu.

Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.

This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@

struct optimistic_spin_node {
struct optimistic_spin_node *self, *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+ int prev_cpu; /* actual CPU # for vcpu_is_preempted() */
};

static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

-static inline int node_cpu(struct optimistic_spin_node *node)
-{
- return node->cpu - 1;
-}
-
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;

+ node->prev_cpu = old - 1;
prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* polling, be careful.
*/
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(node->prev_cpu)))
return true;

/* unqueue */
@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/

+ WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);

--
2.17.1



2023-12-29 21:01:16

by David Laight

Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight <[email protected]>
---
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;

/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;

prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);

/*
* Fast path for the uncontended case.
--
2.17.1



2023-12-29 22:12:07

by David Laight

Subject: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.

node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).

Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.

Should reduce cache-line bouncing a little.

Signed-off-by: David Laight <[email protected]>
---

Re-send without the 'RE:' on the subject line.

kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;

/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;

- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());

/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;

prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;

/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);

/*
* Fast path for the uncontended case.
--
2.17.1



2023-12-29 22:54:29

by Linus Torvalds

Subject: Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

On Fri, 29 Dec 2023 at 12:56, David Laight <[email protected]> wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.

Please split these two totally independent things out of the patch,
just to make things much more obvious.

I like the new calling convention, but I don't like how the patch
isn't obviously just that.

In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".

I'd also take the other part of #3 as a "clearly doesn't change
anything" but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.
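
I.e. something like (sketch of that separate clean-up):

	for (;;) {
		/* ... tail check as before ... */

		if (node->next) {
			struct optimistic_spin_node *next = xchg(&node->next, NULL);

			if (next)
				return next;
		}

		cpu_relax();
	}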

The other parts actually change some logic, and would need the OSQ
people to take a more serious look.

Linus

2023-12-30 01:59:55

by Waiman Long

Subject: Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c

On 12/29/23 15:53, David Laight wrote:
> struct optimistic_spin_node is private to the implementation.
> Move it into the C file to ensure nothing is accessing it.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> include/linux/osq_lock.h | 5 -----
> kernel/locking/osq_lock.c | 7 +++++++
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..ea8fb31379e3 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -6,11 +6,6 @@
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> * lock implementations (mutex, rwsem, etc).
> */
> -struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> -};
>
> struct optimistic_spin_queue {
> /*
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..d414eef4bec6 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -11,6 +11,13 @@
> * called from interrupt context and we have preemption disabled while
> * spinning.
> */
> +
> +struct optimistic_spin_node {
> + struct optimistic_spin_node *next, *prev;
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> +};
> +
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> /*

Please correct the patch title "osf_lock.c" => "osq_lock.c".

After the fix, you can add

Acked-by: Waiman Long <[email protected]>


2023-12-30 02:56:01

by Waiman Long

Subject: Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()


On 12/29/23 15:56, David Laight wrote:
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
>
> Should have no effect on the generated code since gcc manages to
> assume that 'prev != NULL' due to an earlier dereference.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 23 ++++++++++-------------
> 1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 55f5db896c02..9bb3a077ba92 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> static inline struct optimistic_spin_node *
> osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *node,
> - struct optimistic_spin_node *prev)
> + int old)

Make the last argument name more descriptive, like "old_cpu" as the
"int" type does not provide enough context to allow people to guess what
"old" may be.

Cheers,
Longman



2023-12-30 03:08:19

by Waiman Long

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-84).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
> */
>
> struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> + struct optimistic_spin_node *self, *next, *prev;
> int locked; /* 1 if lock acquired */
> int cpu; /* encoded CPU # + 1 value */
> };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

My gcc 11 compiler produces the following x86-64 code:

92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
   0x0000000000000029 <+25>:    mov    %rcx,%rdx
   0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
<osq_lock+36>

Which looks pretty optimized for me. Maybe older compiler may generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.

Cheers,
Longman


2023-12-30 03:14:00

by Waiman Long

Subject: Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.


On 12/29/23 15:58, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is pretty much guaranteed to cause a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight <[email protected]>
> ---
> kernel/locking/osq_lock.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index b60b0add0161..89be63627434 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *self, *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> - return node->cpu - 1;
> -}
> -
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
> int cpu_nr = encoded_cpu_val - 1;
> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> + node->prev_cpu = old - 1;
> prev = decode_cpu(old);
> node->prev = prev;
> node->locked = 0;
> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(node->prev_cpu)))
> return true;
>
> /* unqueue */
> @@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * it will wait in Step-A.
> */
>
> + WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
Reviewed-by: Waiman Long <[email protected]>

>


2023-12-30 03:20:40

by Waiman Long

Subject: Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.


On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
>
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment,
> so locked can be set to zero just before that (along with the assignment
> to node->prev).
>
> Only initialise node->cpu once, after that use its value instead
> of smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight <[email protected]>
> ---
>
> Re-send without the 'RE:' on the subject line.
> kernel/locking/osq_lock.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *prev)
> {
> struct optimistic_spin_node *next = NULL;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = node->cpu;
> int old;
>
> /*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> - int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> - node->next = NULL;

I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not cleared
before it is used again. So you either have to prove that such a race
does not exist, or initialize it to NULL at the beginning like it is
today.

Cheers,
Longman

> - node->cpu = curr;
> + if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> + node->cpu = encode_cpu(smp_processor_id());
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> + old = atomic_xchg(&lock->tail, node->cpu);
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> prev = decode_cpu(old);
> node->prev = prev;
> + node->locked = 0;
>
> /*
> * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> void osq_unlock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node, *next;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = raw_cpu_read(osq_node.cpu);
>
> /*
> * Fast path for the uncontended case.


2023-12-30 11:09:35

by Ingo Molnar

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


* Waiman Long <[email protected]> wrote:

> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-84).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> > kernel/locking/osq_lock.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> > */
> > struct optimistic_spin_node {
> > - struct optimistic_spin_node *next, *prev;
> > + struct optimistic_spin_node *self, *next, *prev;
> > int locked; /* 1 if lock acquired */
> > int cpu; /* encoded CPU # + 1 value */
> > };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> My gcc 11 compiler produces the following x86-64 code:
>
> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> <osq_lock+36>
>
> Which looks pretty optimized for me. Maybe older compiler may generate more
> complex code. However, I do have some doubt as to the benefit of this patch
> at the expense of making the code a bit more complex.

GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.

Thanks,

Ingo

2023-12-30 11:36:02

by David Laight

Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Ingo Molnar
> Sent: 30 December 2023 11:09
>
>
> * Waiman Long <[email protected]> wrote:
>
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <[email protected]>
> > > ---
> > > kernel/locking/osq_lock.c | 14 +++++++++-----
> > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > > */
> > > struct optimistic_spin_node {
> > > - struct optimistic_spin_node *next, *prev;
> > > + struct optimistic_spin_node *self, *next, *prev;
> > > int locked; /* 1 if lock acquired */
> > > int cpu; /* encoded CPU # + 1 value */
> > > };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > > bool osq_lock(struct optimistic_spin_queue *lock)
> > > {
> > > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> > 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> >    0x0000000000000029 <+25>:    mov    %rcx,%rdx
> >    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
> > <osq_lock+36>
> >
> > Which looks pretty optimized for me. Maybe older compiler may generate more
> > complex code. However, I do have some doubt as to the benefit of this patch
> > at the expense of making the code a bit more complex.

My changed code is one instruction shorter!
18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
1f: 00
1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.

> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.

There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.

I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
raw_cpu_read(per_cpu_data_base) + offset_to(xxx)

So the old code generates (after the prologue):
10: 49 89 fd mov %rdi,%r13
13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
16: R_X86_64_32S .data..percpu..shared_aligned
1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
1b: R_X86_64_PC32 debug_smp_processor_id-0x4
1f: 89 c0 mov %eax,%eax
21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
28: 00
25: R_X86_64_32S __per_cpu_offset
29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
2a: R_X86_64_PC32 debug_smp_processor_id-0x4
2e: 4c 01 e3 add %r12,%rbx
31: 83 c0 01 add $0x1,%eax
34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
42: 89 43 14 mov %eax,0x14(%rbx)
45: 41 87 45 00 xchg %eax,0x0(%r13)

I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.

David



2023-12-30 15:50:22

by David Laight

Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

From: Waiman Long
> Sent: 30 December 2023 03:20
>
> On 12/29/23 17:11, David Laight wrote:
> > osq_lock() starts by setting node->next to NULL and node->locked to 0.
> > Careful analysis shows that node->next is always NULL on entry.
> >
> > node->locked is set non-zero by another cpu to force a wakeup.
> > This can only happen after the 'prev->next = node' assignment,
> > so locked can be set to zero just before that (along with the assignment
> > to node->prev).
> >
> > Only initialise node->cpu once, after that use its value instead
> > of smp_processor_id() - which is probably a real function call.
> >
> > Should reduce cache-line bouncing a little.
> >
> > Signed-off-by: David Laight <[email protected]>
> > ---
> >
> > Re-send without the 'RE:' on the subject line.
> > kernel/locking/osq_lock.c | 13 ++++++-------
> > 1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d414eef4bec6..55f5db896c02 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > struct optimistic_spin_node *prev)
> > {
> > struct optimistic_spin_node *next = NULL;
> > - int curr = encode_cpu(smp_processor_id());
> > + int curr = node->cpu;
> > int old;
> >
> > /*
> > @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > struct optimistic_spin_node *prev, *next;
> > - int curr = encode_cpu(smp_processor_id());
> > int old;
> >
> > - node->locked = 0;
> > - node->next = NULL;
>
> I have some concern about not clearing node->next at the beginning. I
> know that next is supposed to be cleared at the end. I do have some
> worry that there may exist a race condition that leaves next not cleared
> before it is used again. So you either have to prove that such a race
> does not exist, or initialize it to NULL at the beginning like it is
> today.

I've stared at the code some more :-)

There are two places where node->next is written non-NULL, both in osq_lock().
The first is at the top of the slow-path where prev->next = node
(this overwrites the NULL that was set (or not) on entry).
The second is at the bottom of osq_lock(), prev->next = next (Step C),
where the NULL written in Step A is overwritten with the correct 'next' node.
After either of those the 'node' cpu must later either take the unqueue
exit from osq_lock() or call osq_unlock().
Both require it to wait for node->next to be non-NULL and then NULL it.

If code calls osq_lock() twice all bets are off!

Even if (somehow) node->next was non-NULL on entry it will be set by an
osq_lock() call from another cpu.
The only problem would be if osq_unlock() were called before the queueing
cpu set prev->next = node.
That in itself is unlikely - but would happen if node->next were always set.

I don't completely understand the 'acquire'/'release' semantics (they didn't
exist when I started doing SMP kernel code in the late 1980s).
But it looks odd that osq_unlock()'s fast path uses _release but the very
similar code in osq_wait_next() uses _acquire.

Indeed, apart from some (assumed) optimisations, I think osq_unlock()
could just be:
	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;

I don't think the order of the tests for lock->tail and node->next
matters in osq_wait_next().
If they were swapped the 'Second most likely case' code from osq_unlock()
could be removed.
(The 'uncontended case' doesn't need to load the address of 'node'.)

David



2023-12-30 15:57:53

by Waiman Long

Subject: Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

On 12/29/23 22:13, Waiman Long wrote:
>
> On 12/29/23 15:58, David Laight wrote:
>> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
>>    cpu is no longer running.
>> Although patched out for bare-metal the code still needs the cpu number.
>> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
>> cache miss
>> when osq_unlock() is waking up the next cpu.
>>
>> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
>> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>>
>> This is simpler than checking for 'node->prev' changing and caching
>> 'prev->cpu'.
>>
>> Signed-off-by: David Laight <[email protected]>
>> ---
>>   kernel/locking/osq_lock.c | 14 ++++++--------
>>   1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index b60b0add0161..89be63627434 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -14,8 +14,9 @@
>>     struct optimistic_spin_node {
>>       struct optimistic_spin_node *self, *next, *prev;
>> -    int locked; /* 1 if lock acquired */
>> -    int cpu; /* encoded CPU # + 1 value */
>> +    int locked;    /* 1 if lock acquired */
>> +    int cpu;       /* encoded CPU # + 1 value */
>> +    int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
>>   };
>>     static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>>       return cpu_nr + 1;
>>   }
>>   -static inline int node_cpu(struct optimistic_spin_node *node)
>> -{
>> -    return node->cpu - 1;
>> -}
>> -
>>   static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>>   {
>>       int cpu_nr = encoded_cpu_val - 1;
>> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>       if (old == OSQ_UNLOCKED_VAL)
>>           return true;
>>   +    node->prev_cpu = old - 1;
>>       prev = decode_cpu(old);
>>       node->prev = prev;
>>       node->locked = 0;
>> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>        * polling, be careful.
>>        */
>>       if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
>> -                  vcpu_is_preempted(node_cpu(node->prev))))
>> +                  vcpu_is_preempted(node->prev_cpu)))

On second thought, I believe it is more correct to use READ_ONCE() to
access "node->prev_cpu" as this field is subject to change by a
WRITE_ONCE().
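
I.e. something like (sketch):

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(READ_ONCE(node->prev_cpu))))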

Cheers,
Longman



2023-12-30 19:41:11

by Linus Torvalds

Subject: Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
>
> David Laight (5):
> Move the definition of optimistic_spin_node into osf_lock.c
> Clarify osq_wait_next()

I took these two as preparatory independent patches, with that
osq_wait_next() clarification split into two.

I also did the renaming that Waiman asked for.

Linus

2023-12-30 20:37:46

by Ingo Molnar

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


* David Laight <[email protected]> wrote:

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;

This whole initialization sequence is suboptimal and needs to be
cleaned up first: the node->cpu field is constant once initialized, so
it should be initialized from appropriate init methods, not runtime in
osq_lock(), right?

Eliminating that initialization branch is a useful micro-optimization
in itself for the hot path.

Thanks,

Ingo

2023-12-30 20:41:44

by Linus Torvalds

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On Fri, 29 Dec 2023 at 12:57, David Laight <[email protected]> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;
> + }

The proper fix here is to not do that silly

node = this_cpu_ptr(&osq_node);
..
node->next = NULL;

dance at all, but to simply do

this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.
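
So the fast path would look something like this (rough sketch of the idea
only, not the attached diff; the remaining node initialisation is not
shown):

	bool osq_lock(struct optimistic_spin_queue *lock)
	{
		int curr = encode_cpu(smp_processor_id());
		int old;

		/* single %gs-relative store, no node pointer needed yet */
		this_cpu_write(osq_node.next, NULL);

		old = atomic_xchg(&lock->tail, curr);
		if (old == OSQ_UNLOCKED_VAL)
			return true;

		/* contended slow path as before */
		...
	}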

Yes, you'll eventually end up doing that

node = this_cpu_ptr(&osq_node);

thing because it then wants to use that raw pointer to do

WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

subl $1, %eax #, cpu_nr
...
cltq
addq __per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

addq __per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

unsigned int cpu_nr = encoded_cpu_val;

return per_cpu_ptr(&osq_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

Linus


Attachments:
patch.diff (1.06 kB)

2023-12-30 20:59:38

by Linus Torvalds

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<[email protected]> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

-#define per_cpu_offset(x) (__per_cpu_offset[x])
+#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

subl $1, %eax #, cpu_nr
addq __per_cpu_offset(,%rax,8), %rcx

instead of just doing

addq __per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

Linus

2023-12-30 22:38:23

by David Laight

Subject: RE: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.

From: Waiman Long
> Sent: 30 December 2023 15:57
>
> On 12/29/23 22:13, Waiman Long wrote:
> >
> > On 12/29/23 15:58, David Laight wrote:
> >> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> >>    cpu is no longer running.
> >> Although patched out for bare-metal the code still needs the cpu number.
> >> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
> >> cache miss
> >> when osq_unlock() is waking up the next cpu.
> >>
> >> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> >> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
> >>
> >> This is simpler than checking for 'node->prev' changing and caching
> >> 'prev->cpu'.
> >>
> >> Signed-off-by: David Laight <[email protected]>
> >> ---
> >>   kernel/locking/osq_lock.c | 14 ++++++--------
> >>   1 file changed, 6 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index b60b0add0161..89be63627434 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -14,8 +14,9 @@
> >>     struct optimistic_spin_node {
> >>       struct optimistic_spin_node *self, *next, *prev;
> >> -    int locked; /* 1 if lock acquired */
> >> -    int cpu; /* encoded CPU # + 1 value */
> >> +    int locked;    /* 1 if lock acquired */
> >> +    int cpu;       /* encoded CPU # + 1 value */
> >> +    int prev_cpu;  /* actual CPU # for vcpu_is_preempted() */
> >>   };
> >>     static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
> >> osq_node);
> >> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> >>       return cpu_nr + 1;
> >>   }
> >>   -static inline int node_cpu(struct optimistic_spin_node *node)
> >> -{
> >> -    return node->cpu - 1;
> >> -}
> >> -
> >>   static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >>   {
> >>       int cpu_nr = encoded_cpu_val - 1;
> >> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >>       if (old == OSQ_UNLOCKED_VAL)
> >>           return true;
> >>   +    node->prev_cpu = old - 1;
> >>       prev = decode_cpu(old);
> >>       node->prev = prev;
> >>       node->locked = 0;
> >> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >>        * polling, be careful.
> >>        */
> >>       if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> >> -                  vcpu_is_preempted(node_cpu(node->prev))))
> >> +                  vcpu_is_preempted(node->prev_cpu)))
>
> On second thought, I believe it is more correct to use READ_ONCE() to
> access "node->prev_cpu" as this field is subject to change by a
> WRITE_ONCE().

It can be done...

Aren't pretty much all the 'node' members accessed like that?
There are a sprinkling of READ_ONCE() and WRITE_ONCE() but they
are not always used.

Maybe the structure member(s) should just be marked 'volatile' :-)
That should have exactly the same effect as the volatile cast
inside READ/WRITE_ONCE().
(I know there is a document about not using volatile...)
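
(For reference, stripped of the type-checking wrappers they are roughly:

	#define __READ_ONCE(x)	(*(const volatile typeof(x) *)&(x))
	#define __WRITE_ONCE(x, val)	do { *(volatile typeof(x) *)&(x) = (val); } while (0)

so a volatile member would get much the same accesses everywhere.)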

I've not actually checked whether the two existing WRITE_ONCE()
in 'Step C' need to be ordered - and whether that is guaranteed
by the code, especially on our good old friend 'Alpha' (is that
horrid cache system still supported?).

David


2023-12-30 22:39:46

by David Laight

Subject: RE: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

From: Linus Torvalds
> Sent: 30 December 2023 19:41
>
> On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
> >
> > David Laight (5):
> > Move the definition of optimistic_spin_node into osf_lock.c
> > Clarify osq_wait_next()
>
> I took these two as preparatory independent patches, with that
> osq_wait_next() clarification split into two.
>
> I also did the renaming that Waiman asked for.

Ok, I'll try to remove them from any respin.
(Or at least put them first!)

David


2023-12-30 22:47:49

by David Laight

Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Ingo Molnar
> Sent: 30 December 2023 20:38
>
> * David Laight <[email protected]> wrote:
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
>
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not runtime in
> osq_lock(), right?

I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.

David



2023-12-31 02:14:58

by Waiman Long

Subject: Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code


On 12/30/23 17:39, David Laight wrote:
> From: Linus Torvalds
>> Sent: 30 December 2023 19:41
>>
>> On Fri, 29 Dec 2023 at 12:52, David Laight <[email protected]> wrote:
>>> David Laight (5):
>>> Move the definition of optimistic_spin_node into osf_lock.c
>>> Clarify osq_wait_next()
>> I took these two as preparatory independent patches, with that
>> osq_wait_next() clarification split into two.
>>
>> I also did the renaming that Waiman asked for.
> Ok, I'll try to remove them from any respin.
> (Or at least put them first!)

The first 2 patches have been included in Linus' master branch. So you
should just remove them from a respin.

Cheers,
Longman



2023-12-31 03:04:26

by Waiman Long

Subject: Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.


On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>>
>> * Waiman Long <[email protected]> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <[email protected]>
>>>> ---
>>>> kernel/locking/osq_lock.c | 14 +++++++++-----
>>>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>> */
>>>> struct optimistic_spin_node {
>>>> - struct optimistic_spin_node *next, *prev;
>>>> + struct optimistic_spin_node *self, *next, *prev;
>>>> int locked; /* 1 if lock acquired */
>>>> int cpu; /* encoded CPU # + 1 value */
>>>> };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>> bool osq_lock(struct optimistic_spin_queue *lock)
>>>> {
>>>> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>> 92        struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>    0x0000000000000029 <+25>:    mov    %rcx,%rdx
>>>    0x000000000000002c <+28>:    add %gs:0x0(%rip),%rdx        # 0x34
>>> <osq_lock+36>
>>>
>>> Which looks pretty optimized for me. Maybe older compiler may generate more
>>> complex code. However, I do have some doubt as to the benefit of this patch
>>> at the expense of making the code a bit more complex.
> My changed code is one instruction shorter!
> 18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
> 1f: 00
> 1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
> raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
> 10: 49 89 fd mov %rdi,%r13
> 13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
> 16: R_X86_64_32S .data..percpu..shared_aligned
> 1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
> 1b: R_X86_64_PC32 debug_smp_processor_id-0x4
> 1f: 89 c0 mov %eax,%eax
> 21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
> 28: 00
> 25: R_X86_64_32S __per_cpu_offset
> 29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
> 2a: R_X86_64_PC32 debug_smp_processor_id-0x4
> 2e: 4c 01 e3 add %r12,%rbx
> 31: 83 c0 01 add $0x1,%eax
> 34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
> 3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
> 42: 89 43 14 mov %eax,0x14(%rbx)
> 45: 41 87 45 00 xchg %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.

I have looked up the definition of this_cpu_ptr() and gotten the following
results:

this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()

/*
 * Compared to the generic __my_cpu_offset version, the following
 * saves one instruction and avoids clobbering a temp register.
 */
#define arch_raw_cpu_ptr(ptr)                           \
({                                                      \
        unsigned long tcp_ptr__;                        \
        asm ("add " __percpu_arg(1) ", %0"              \
             : "=r" (tcp_ptr__)                         \
             : "m" (this_cpu_off), "0" (ptr));          \
        (typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
})

The presence of debug_smp_processor_id in your compiled code is likely
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.

#ifdef CONFIG_DEBUG_PREEMPT
  extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif

I don't have that config entry in my kernel config and so I only get 2
instructions for this_cpu_ptr(). We are not going to optimize the code
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Cheers,
Longman


2023-12-31 10:37:14

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
>
> #ifdef CONFIG_DEBUG_PREEMPT
>   extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
>
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.

Yes, I discovered that late last night.
I've no idea why it is set.
It might even be inherited from the Ubuntu 18.04 (I think) .config
I started with (where I mostly just removed extra modules to reduce
compile time and the size of the uramdisk).

I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save
the cache-line bounce.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-31 11:42:09

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Linus Torvalds
> Sent: 30 December 2023 20:41
>
> On Fri, 29 Dec 2023 at 12:57, David Laight <[email protected]> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-84).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
>
> No, this is horrible.
>
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> No. Both of these are crap.
>
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
> > + }
>
> The proper fix here is to not do that silly
>
> node = this_cpu_ptr(&osq_node);
> ..
> node->next = NULL;
>
> dance at all, but to simply do
>
> this_cpu_write(osq_node.next, NULL);
>
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.

Is the equivalent true (i.e. an offset from a fixed register) for all SMP archs?
Or do some have to do something rather more complicated?
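
(Just to check I'm reading this right, the fast path would then look
something like this untested sketch:)

	int curr = encode_cpu(smp_processor_id());

	/*
	 * Untested sketch: each this_cpu_write() should end up as a single
	 * %gs-relative store on x86-64, so the uncontended path never needs
	 * the node pointer at all.
	 */
	this_cpu_write(osq_node.next, NULL);
	this_cpu_write(osq_node.locked, 0);
	this_cpu_write(osq_node.cpu, curr);

	if (atomic_xchg(&lock->tail, curr) == OSQ_UNLOCKED_VAL)
		return true;

	/* Only the contended path needs the real per-cpu pointer. */
	node = this_cpu_ptr(&osq_node);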

> Yes, you'll eventually end up doing that
>
> node = this_cpu_ptr(&osq_node);

For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember
setting it, and can't imagine why I might have.
Best guess is it was inherited from the Ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function
in order to check that preemption is disabled.
I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!

> thing because it then wants to use that raw pointer to do
>
> WRITE_ONCE(prev->next, node);
>
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.

I could claim that loading it is one instruction shorter and that
if the cache line containing 'node' is needed and 'pcpu_hot'
is (unexpectedly) not cached it saves a cache line load.
But I probably won't!

>
> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
>
> prev = decode_cpu(old);
>
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.

Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.
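
(For reference, decode_cpu() is (roughly) just the per_cpu_ptr() lookup
below, so keeping cpu numbers in the node would mean doing that
conversion on almost every access:)

	static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
	{
		int cpu_nr = encoded_cpu_val - 1;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}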

> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
>
> #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires that
the structure be correctly initialised, not just all zero.
That might show up some very strange bugs.
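
(osq_lock_init() is basically just the atomic_set() below; with
OSQ_UNLOCKED_VAL set to -1, a lock that was merely zeroed instead of
going through it would look as if it were owned by cpu 0:)

	static inline void osq_lock_init(struct optimistic_spin_queue *lock)
	{
		atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
	}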

> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
>
> subl $1, %eax #, cpu_nr
> ...
> cltq
> addq __per_cpu_offset(,%rax,8), %rcx
>
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.

Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits
are zero so it shouldn't need to zero extend.
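
(i.e. something like this untested variant:)

	/* Untested sketch: unsigned arithmetic so gcc has no sign extension to do. */
	static inline struct optimistic_spin_node *decode_cpu(unsigned int encoded_cpu_val)
	{
		unsigned int cpu_nr = encoded_cpu_val - 1;

		return per_cpu_ptr(&osq_node, cpu_nr);
	}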

> At that point, I think gcc might be able to just generate
>
> addq __per_cpu_offset-8(,%rax,8), %rcx

That would need the decrement to be 64bit.
A quick test failed to make that work.
Probably (as you mentioned in the next email) because gcc
doesn't know that the high bits from atomic_xchg() are zero.

> but honestly, I think it would be nicer to just have decode_cpu() do
>
> unsigned int cpu_nr = encoded_cpu_val;
>
> return per_cpu_ptr(&osq_node, cpu_nr);
>
> and not have the -1/+1 at all.

Unless you can persuade gcc that the high bits from atomic_xchg()
are zero, that will require a zero extend.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-12-31 11:56:37

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

From: Linus Torvalds
> Sent: 30 December 2023 20:59
>
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <[email protected]> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
>
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
>
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).

I was only going to access the 'self' field in code that required
the 'node' cache line be present.

>
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
>
> -#define per_cpu_offset(x) (__per_cpu_offset[x])
> +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
>
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.

Doing the conversion using arithmetic might help, so:
__per_cpu_offset[(x) + 0u]
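
e.g. applied to the macro above (untested):

	/*
	 * Untested sketch: the usual arithmetic conversions make a plain 'int'
	 * index unsigned without truncating an 'unsigned long' one, unlike the
	 * (unsigned) cast above.
	 */
	#define per_cpu_offset(x) (__per_cpu_offset[(x) + 0u])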

> And gcc still generates
>
> subl $1, %eax #, cpu_nr
> addq __per_cpu_offset(,%rax,8), %rcx
>
> instead of just doing
>
> addq __per_cpu_offset-8(,%rax,8), %rcx
>
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.

Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location would be read if cpu_nr were zero.
I've tried that - but it still failed.

> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
>
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.

It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-01-02 18:55:14

by Boqun Feng

[permalink] [raw]
Subject: Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they didn't
> exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.
>

The _release in osq_unlock() is needed since an unlock needs to be a
RELEASE so that lock+unlock can form a critical section (i.e. no memory
accesses can escape it). When osq_wait_next() is used in non-unlock cases,
the RELEASE is not required. As for the case where osq_wait_next() is
used in osq_unlock(), there is an xchg() preceding it, which provides a
full barrier, so things are fine.
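
To illustrate with made-up names (this only shows the ordering
requirement, not the real osq code):

	struct waiter { int data; int locked; };

	static void example_unlock(struct waiter *next)
	{
		/* store inside the critical section */
		next->data = 42;
		/* RELEASE: the ->data store above cannot be reordered after this */
		smp_store_release(&next->locked, 1);
	}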

/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.

> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
> next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> if (next)
> next->locked = 1;
>

If so we need to provide some sort of RELEASE semantics for the
osq_unlock() in all the cases.

Regards,
Boqun

> I don't think the order of the tests for lock->tail and node->next
> matter in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
>
> David
>
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

2024-01-02 23:32:36

by David Laight

[permalink] [raw]
Subject: RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.

From: Boqun Feng
> Sent: 02 January 2024 18:54
>
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
> >
>
> The _release in osq_unlock() is needed since an unlock needs to be a
> RELEASE so that lock+unlock can form a critical section (i.e. no memory
> accesses can escape it). When osq_wait_next() is used in non-unlock cases,
> the RELEASE is not required. As for the case where osq_wait_next() is
> used in osq_unlock(), there is an xchg() preceding it, which provides a
> full barrier, so things are fine.

I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this code,
but is a FULL xchg always needed on node->next?

> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.

I wouldn't have worried about relaxed vs release.

> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> > if (next)
> > next->locked = 1;
> >
>
> If so we need to provide some sort of RELEASE semantics for the
> osq_unlock() in all the cases.

I wonder how often the unqueue code actually runs, and especially
how often it runs for the last cpu in the list.
I'd only expect need_resched() to return true after spinning for
a while - in which case it is perhaps more likely that there are
a lot of cpus in the queue and the cpu being removed won't be last.
So osq_wait_next() exits on xchg(&node->next, NULL) != NULL,
which is a full barrier.

On a slightly different note, I've also wondered whether 'osq_node'
actually needs to be cache line aligned?
You definitely don't want it spanning 2 cache lines, but I'd
expect that per-cpu data is mostly accessed by its own cpu.
So you really aren't going to get false sharing with some
other per-cpu data while the cpu is busy in this code.
So __aligned(16) would do?
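
(i.e. something along the lines of this untested sketch - the exact
per-cpu macro spelling may well need to be different:)

	/* Untested sketch: 16-byte alignment instead of a whole cache line per node. */
	static DEFINE_PER_CPU(struct optimistic_spin_node, osq_node) __aligned(16);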

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)