2024-05-23 20:07:51

by André Almeida

Subject: [PATCH v2 0/1] Add FUTEX_SPIN operation

Hi,

In the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
rseq interface to be able to implement spin locks in userspace correctly. Thomas
Gleixner agreed that this is something that Linux could improve, but asked for
an alternative proposal first: a futex operation that allows spinning on a user
lock inside the kernel. This patchset implements a prototype of this idea for
further discussion.

With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected
to be the TID of the lock owner. The kernel then gets the task_struct for that
TID and checks whether it is running. It spins until the futex is woken, the
owner task is scheduled out, or a timeout happens. If the lock owner is
scheduled out at any time, the syscall falls back to the normal path of
sleeping as usual. The user input is masked with FUTEX_TID_MASK, so we have
some spare bits to play with.

If the futex is woken while we are spinning, we can return to userspace
quickly, avoiding being scheduled out and back in again as in a normal
futex_wait(), thus speeding up the wait operation.
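
To make the intended usage concrete, here is a rough sketch of a TID-based
userspace lock on top of this interface. It is illustrative only: the wrapper
and its argument order assume the futex2 sys_futex_wait()/sys_futex_wake()
syscalls from Linux 6.6+, with the FUTEX2_SPIN value taken from this series.

#define _GNU_SOURCE             /* gettid() */
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef FUTEX2_SPIN
#define FUTEX2_SPIN 0x08        /* value proposed by this series */
#endif

/* args: (uaddr, val, mask, flags, timeout, clockid) */
static long futex_wait_spin(_Atomic uint32_t *uaddr, uint32_t owner_tid)
{
        return syscall(SYS_futex_wait, uaddr, owner_tid,
                       FUTEX_BITSET_MATCH_ANY,
                       FUTEX2_SIZE_U32 | FUTEX2_PRIVATE | FUTEX2_SPIN,
                       NULL, CLOCK_MONOTONIC);
}

/* futex word: 0 == unlocked, otherwise it holds the owner's TID */
static void lock(_Atomic uint32_t *futex)
{
        uint32_t expected = 0;

        while (!atomic_compare_exchange_strong(futex, &expected,
                                               (uint32_t)gettid())) {
                /* kernel spins while the task `expected' stays on-CPU */
                futex_wait_spin(futex, expected);
                expected = 0;
        }
}

static void unlock(_Atomic uint32_t *futex)
{
        atomic_store(futex, 0);
        /* no waiter tracking in this sketch, so always wake */
        syscall(SYS_futex_wake, futex, FUTEX_BITSET_MATCH_ANY, 1,
                FUTEX2_SIZE_U32 | FUTEX2_PRIVATE);
}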

Christian Brauner suggested using pidfd to avoid race conditions, and I will
implement that in the next patch iteration. I benchmarked the implementation by
measuring the time required to wait for a futex in a simple loop, using the code
at [2]. In my setup, the total wait time for 1000 futexes using the spin method
was almost 10% lower than with the normal futex wait:

Testing with FUTEX2_SPIN | FUTEX_WAIT
Total wait time: 8650089 usecs

Testing with FUTEX_WAIT
Total wait time: 9447291 usecs

However, as I varied how long the lock owner stays busy, the benchmark
results for spinning vs. not spinning converged, showing that spinning
is effective in some specific scheduling scenarios, but that, depending
on the wait time, there is no big difference between spinning and not.

[0] https://lpc.events/event/17/contributions/1481/

You can find a small snippet to play with this interface here:

[1] https://gist.github.com/andrealmeid/f0b8c93a3c7a5c50458247c47f7078e1

Changelog:

v1: - s/PID/TID
- masked user input with FUTEX_TID_MASK
- add benchmark tool to the cover letter
- dropped debug prints
- added missing put_task_struct()

André Almeida (1):
futex: Add FUTEX_SPIN operation

 include/uapi/linux/futex.h |  2 +-
 kernel/futex/futex.h       |  6 ++-
 kernel/futex/waitwake.c    | 78 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 82 insertions(+), 4 deletions(-)

--
2.45.1



2024-05-23 20:08:41

by André Almeida

Subject: [PATCH v2 1/1] futex: Add FUTEX_SPIN operation

Add a new mode for futex wait: the futex spin.

Given the FUTEX2_SPIN flag, parse the futex value as the TID of the lock
owner. Then, before going to the normal wait path, spin while the lock
owner is running on a different CPU, to avoid the whole context switch
operation and to quickly return to userspace. If the lock owner is not
running, just sleep as in the normal futex wait path.

The user value is masked with FUTEX_TID_MASK, to leave some bits free
for future use.

The check for whether the owner is running is important to avoid
spinning for something that won't be released quickly. Userspace is
responsible for providing the proper TID; the kernel only does a basic
check.

Signed-off-by: André Almeida <[email protected]>
---
 include/uapi/linux/futex.h |  2 +-
 kernel/futex/futex.h       |  6 ++-
 kernel/futex/waitwake.c    | 78 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index d2ee625ea189..d77d692ffac2 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -63,7 +63,7 @@
 #define FUTEX2_SIZE_U32         0x02
 #define FUTEX2_SIZE_U64         0x03
 #define FUTEX2_NUMA             0x04
-                        /*      0x08 */
+#define FUTEX2_SPIN             0x08
                         /*      0x10 */
                         /*      0x20 */
                         /*      0x40 */
diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index 8b195d06f4e8..180c1c10dc81 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -37,6 +37,7 @@
 #define FLAGS_HAS_TIMEOUT       0x0040
 #define FLAGS_NUMA              0x0080
 #define FLAGS_STRICT            0x0100
+#define FLAGS_SPIN              0x0200
 
 /* FUTEX_ to FLAGS_ */
 static inline unsigned int futex_to_flags(unsigned int op)
@@ -52,7 +53,7 @@ static inline unsigned int futex_to_flags(unsigned int op)
         return flags;
 }
 
-#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE)
+#define FUTEX2_VALID_MASK (FUTEX2_SIZE_MASK | FUTEX2_PRIVATE | FUTEX2_SPIN)
 
 /* FUTEX2_ to FLAGS_ */
 static inline unsigned int futex2_to_flags(unsigned int flags2)
@@ -65,6 +66,9 @@ static inline unsigned int futex2_to_flags(unsigned int flags2)
         if (flags2 & FUTEX2_NUMA)
                 flags |= FLAGS_NUMA;
 
+        if (flags2 & FUTEX2_SPIN)
+                flags |= FLAGS_SPIN;
+
         return flags;
 }

diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index 3a10375d9521..276c03804b92 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -372,6 +372,73 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
         __set_current_state(TASK_RUNNING);
 }
 
+static inline bool task_on_cpu(struct task_struct *p)
+{
+#ifdef CONFIG_SMP
+        return !!(p->on_cpu);
+#else
+        return false;
+#endif
+}
+
+static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
+                      struct hrtimer_sleeper *timeout, u32 uval)
+{
+        struct task_struct *p;
+        pid_t pid = uval & FUTEX_TID_MASK;
+
+        p = find_get_task_by_vpid(pid);
+
+        /* no task found, maybe it already exited */
+        if (!p) {
+                futex_q_unlock(hb);
+                return -EAGAIN;
+        }
+
+        /* can't spin in a kernel task */
+        if (unlikely(p->flags & PF_KTHREAD)) {
+                put_task_struct(p);
+                futex_q_unlock(hb);
+                return -EPERM;
+        }
+
+        futex_queue(q, hb);
+
+        if (timeout)
+                hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
+
+        while (1) {
+                if (likely(!plist_node_empty(&q->list))) {
+                        if (timeout && !timeout->task)
+                                goto exit;
+
+                        if (task_on_cpu(p)) {
+                                /* spin */
+                                continue;
+                        } else {
+                                /* task is not running, sleep */
+                                break;
+                        }
+                } else {
+                        goto exit;
+                }
+        }
+
+        /* spinning didn't work, go to the normal path */
+        set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
+
+        if (likely(!plist_node_empty(&q->list))) {
+                if (!timeout || timeout->task)
+                        schedule();
+        }
+
+        __set_current_state(TASK_RUNNING);
+
+exit:
+        put_task_struct(p);
+        return 0;
+}
+
 /**
  * futex_unqueue_multiple - Remove various futexes from their hash bucket
  * @v: The list of futexes to unqueue
@@ -665,8 +732,15 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
         if (ret)
                 return ret;
 
-        /* futex_queue and wait for wakeup, timeout, or a signal. */
-        futex_wait_queue(hb, &q, to);
+        if (flags & FLAGS_SPIN) {
+                ret = futex_spin(hb, &q, to, val);
+
+                if (ret)
+                        return ret;
+        } else {
+                /* futex_queue and wait for wakeup, timeout, or a signal. */
+                futex_wait_queue(hb, &q, to);
+        }
 
         /* If we were woken (and unqueued), we succeeded, whatever. */
         if (!futex_unqueue(&q))
--
2.45.1


2024-05-23 20:20:04

by Mathieu Desnoyers

Subject: Re: [PATCH v2 1/1] futex: Add FUTEX_SPIN operation

On 2024-05-23 16:07, André Almeida wrote:
> Add a new mode for futex wait: the futex spin.
>
> Given the FUTEX2_SPIN flag, parse the futex value as the TID of the lock
> owner. Then, before going to the normal wait path, spin while the lock
> owner is running on a different CPU, to avoid the whole context switch
> operation and to quickly return to userspace. If the lock owner is not
> running, just sleep as in the normal futex wait path.
>
> The user value is masked with FUTEX_TID_MASK, to leave some bits free
> for future use.
>
> The check for whether the owner is running is important to avoid
> spinning for something that won't be released quickly. Userspace is
> responsible for providing the proper TID; the kernel only does a basic
> check.
>
> Signed-off-by: André Almeida <[email protected]>
> ---

[...]

> +
> +static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
> +                      struct hrtimer_sleeper *timeout, u32 uval)
> +{
> +        struct task_struct *p;
> +        pid_t pid = uval & FUTEX_TID_MASK;
> +
> +        p = find_get_task_by_vpid(pid);
> +
> +        /* no task found, maybe it already exited */
> +        if (!p) {
> +                futex_q_unlock(hb);
> +                return -EAGAIN;
> +        }
> +
> +        /* can't spin in a kernel task */
> +        if (unlikely(p->flags & PF_KTHREAD)) {
> +                put_task_struct(p);
> +                futex_q_unlock(hb);
> +                return -EPERM;
> +        }
> +
> +        futex_queue(q, hb);
> +
> +        if (timeout)
> +                hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
> +
> +        while (1) {

Infinite loops in other kernel/futex/ files appear to use "for (;;) {"
instead.

> +                if (likely(!plist_node_empty(&q->list))) {
> +                        if (timeout && !timeout->task)
> +                                goto exit;
> +
> +                        if (task_on_cpu(p)) {
> +                                /* spin */

You may want to add a "cpu_relax();" here to lessen the
power consumption of this busy-loop.

> +                                continue;
> +                        } else {
> +                                /* task is not running, sleep */
> +                                break;
> +                        }
> +                } else {
> +                        goto exit;
> +                }
> +        }
> +
> +        /* spinning didn't work, go to the normal path */
> +        set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);

I wonder if flipping the order between:

set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);

as done in futex_wait_queue() and what is done here matters?
Does it introduce a race where a wakeup could be missed?

If it's an issue, then setting the current state could be done
before invoking futex_queue(), and whenever the spin exits,
set it back to TASK_RUNNING.
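
Roughly, as an untested sketch of that alternative, reusing the names
from the patch:

set_current_state(TASK_INTERRUPTIBLE|TASK_FREEZABLE);
futex_queue(q, hb);

/* still queued and owner on-CPU: keep spinning */
while (!plist_node_empty(&q->list) && task_on_cpu(p))
        cpu_relax();

/* woken, timed out, or owner off-CPU: sleep only if still queued */
if (!plist_node_empty(&q->list) && (!timeout || timeout->task))
        schedule();

__set_current_state(TASK_RUNNING);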


> +
> +        if (likely(!plist_node_empty(&q->list))) {
> +                if (!timeout || timeout->task)
> +                        schedule();
> +        }
> +
> +        __set_current_state(TASK_RUNNING);
> +
> +exit:
> +        put_task_struct(p);
> +        return 0;
> +}
> +
>  /**
>   * futex_unqueue_multiple - Remove various futexes from their hash bucket
>   * @v: The list of futexes to unqueue
> @@ -665,8 +732,15 @@ int __futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
>          if (ret)
>                  return ret;
>
> -        /* futex_queue and wait for wakeup, timeout, or a signal. */
> -        futex_wait_queue(hb, &q, to);
> +        if (flags & FLAGS_SPIN) {
> +                ret = futex_spin(hb, &q, to, val);
The empty line below could be removed.

Thanks,

Mathieu

> +
> +                if (ret)
> +                        return ret;
> +        } else {
> +                /* futex_queue and wait for wakeup, timeout, or a signal. */
> +                futex_wait_queue(hb, &q, to);
> +        }
>
>          /* If we were woken (and unqueued), we succeeded, whatever. */
>          if (!futex_unqueue(&q))

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-05-23 20:39:08

by Mathieu Desnoyers

Subject: Re: [PATCH v2 0/1] Add FUTEX_SPIN operation

On 2024-05-23 16:07, André Almeida wrote:
> Hi,
>
> In the last LPC, Mathieu Desnoyers and I presented[0] a proposal to extend the
> rseq interface to be able to implement spin locks in userspace correctly. Thomas
> Gleixner agreed that this is something that Linux could improve, but asked for
> an alternative proposal first: a futex operation that allows spinning on a user
> lock inside the kernel. This patchset implements a prototype of this idea for
> further discussion.
>
> With the FUTEX2_SPIN flag set during a futex_wait(), the futex value is expected
> to be the TID of the lock owner. The kernel then gets the task_struct for that
> TID and checks whether it is running. It spins until the futex is woken, the
> owner task is scheduled out, or a timeout happens. If the lock owner is
> scheduled out at any time, the syscall falls back to the normal path of
> sleeping as usual. The user input is masked with FUTEX_TID_MASK, so we have
> some spare bits to play with.
>
> If the futex is woken while we are spinning, we can return to userspace
> quickly, avoiding being scheduled out and back in again as in a normal
> futex_wait(), thus speeding up the wait operation.
>
> Christian Brauner suggested using pidfd to avoid race conditions, and I will
> implement that in the next patch iteration. I benchmarked the implementation by
> measuring the time required to wait for a futex in a simple loop, using the code
> at [2]. In my setup, the total wait time for 1000 futexes using the spin method
> was almost 10% lower than with the normal futex wait:
>
> Testing with FUTEX2_SPIN | FUTEX_WAIT
> Total wait time: 8650089 usecs
>
> Testing with FUTEX_WAIT
> Total wait time: 9447291 usecs
>
> However, as I varied how long the lock owner stays busy, the benchmark
> results for spinning vs. not spinning converged, showing that spinning
> is effective in some specific scheduling scenarios, but that, depending
> on the wait time, there is no big difference between spinning and not.
>
> [0] https://lpc.events/event/17/contributions/1481/
>
> You can find a small snippet to play with this interface here:
>
> [1] https://gist.github.com/andrealmeid/f0b8c93a3c7a5c50458247c47f7078e1

What exactly are you trying to benchmark here? I've looked at this toy
program, and I suspect that most of the delay you observe is due to
initial scheduling of a newly cloned thread, because this is what is
repeatedly being done in the delay you measure.

I would recommend changing this benchmark program to measure something
meaningful, e.g. (a rough skeleton follows below):

- N threads repeatedly contending on a lock, until a "stop" flag is set,
- run for "duration" seconds, after which main() sets the "stop" flag,
- delay loop of "work_delay" us within the lock critical section,
- delay loop of "inactive_delay" us between locking attempts,
- measure the time it takes to grab the lock, report stats on this,
- measure the total number of operations done within the given
"duration",
- report statistics on the number of operations per thread to see
the impact on fairness.

Then run the program with the following constraints:

- Pin one thread per core, with nb threads <= nb cores. This should
be the best-case scenario for spinning.
- Pin all threads to a single core. When nb threads > nb cores, this
should be the worst-case scenario for spinning.
- Group things between those two extremes to see how things evolve.

I would not be surprised if, once you measure relevant delays, you
observe much better speedups than what you currently have.
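
Something along these lines would be a bare-bones starting point
(hypothetical parameters; pinning and lock-latency stats still need to
be added where indicated):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4
#define DURATION 10                /* seconds */
#define WORK_DELAY_US 5            /* inside the critical section */
#define INACTIVE_DELAY_US 5        /* between locking attempts */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int stop;
static unsigned long ops[NTHREADS];

static void delay_loop_us(unsigned long us)
{
        /* crude, uncalibrated busy-wait; a sketch only */
        for (volatile unsigned long i = 0; i < us * 1000; i++)
                ;
}

static void *worker(void *arg)
{
        unsigned long *count = arg;

        while (!atomic_load(&stop)) {
                /* wrap this call with clock_gettime() for latency stats */
                pthread_mutex_lock(&lock);
                delay_loop_us(WORK_DELAY_US);
                pthread_mutex_unlock(&lock);
                delay_loop_us(INACTIVE_DELAY_US);
                (*count)++;
        }
        return NULL;
}

int main(void)
{
        pthread_t t[NTHREADS];

        /* pin threads here with pthread_setaffinity_np() as desired */
        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, &ops[i]);
        sleep(DURATION);
        atomic_store(&stop, 1);
        for (int i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], NULL);
                printf("thread %d: %lu ops\n", i, ops[i]);
        }
        return 0;
}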

Thanks,

Mathieu

>
> Changelog:
>
> v1: - s/PID/TID
> - masked user input with FUTEX_TID_MASK
> - add benchmark tool to the cover letter
> - dropped debug prints
> - added missing put_task_struct()
>
> André Almeida (1):
> futex: Add FUTEX_SPIN operation
>
>  include/uapi/linux/futex.h |  2 +-
>  kernel/futex/futex.h       |  6 ++-
>  kernel/futex/waitwake.c    | 78 +++++++++++++++++++++++++++++++++++++-
>  3 files changed, 82 insertions(+), 4 deletions(-)
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2024-05-24 07:52:52

by David Laight

Subject: RE: [PATCH v2 1/1] futex: Add FUTEX_SPIN operation

From: André Almeida
> Sent: 23 May 2024 21:07
>
> Add a new mode for futex wait: the futex spin.
>
> Given the FUTEX2_SPIN flag, parse the futex value as the TID of the lock
> owner. Then, before going to the normal wait path, spin while the lock
> owner is running on a different CPU, to avoid the whole context switch
> operation and to quickly return to userspace. If the lock owner is not
> running, just sleep as in the normal futex wait path.
>
> The user value is masked with FUTEX_TID_MASK, to leave some bits free
> for future use.
>
> The check for whether the owner is running is important to avoid
> spinning for something that won't be released quickly. Userspace is
> responsible for providing the proper TID; the kernel only does a basic
> check.

The kernel needs to do something to stop a user-process spinning in-kernel
indefinitely.

...
> +static inline bool task_on_cpu(struct task_struct *p)
> +{
> +#ifdef CONFIG_SMP
> +        return !!(p->on_cpu);
> +#else
> +        return false;
> +#endif
> +}

I suspect that isn't going to work in a VM where the entire 'cpu'
can be sleeping.

This is similar to the check (which I don't think works properly) in
the 'osq' (optimistic spin queue) used when waiting for the thread
spinning on a mutex/rwlock to change state.

IIRC that code also checks whether the current thread should
be pre-empted by a higher priority process.
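
For reference, the mutex optimistic-spin code guards against this
roughly as follows (a sketch based on mutex_can_spin_on_owner(), not
on this patch):

static inline bool task_on_cpu(struct task_struct *p)
{
#ifdef CONFIG_SMP
        /* don't spin when the CPU backing the owner is itself preempted */
        return READ_ONCE(p->on_cpu) && !vcpu_is_preempted(task_cpu(p));
#else
        return false;
#endif
}

The owner-spin loop there additionally bails out when need_resched()
is set, which is the preemption check mentioned above.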

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2024-05-25 09:36:59

by Mateusz Guzik

Subject: Re: [PATCH v2 1/1] futex: Add FUTEX_SPIN operation

On Thu, May 23, 2024 at 05:07:04PM -0300, André Almeida wrote:
> Add a new mode for futex wait: the futex spin.
>

I'll note that currently individuals who want to reduce performance
degradation due to contention in userspace either handroll some
speculative spinning + futex usage or straight up only spin, which btw
is a massive pet peeve of mine.

With this in mind the real reference point justifying this mechanism (or
showing it not being good enough to warrant inclusion) is a case where
all participating threads are on cpu and spinning is done in userspace
(in a sensible manner). Of course there is more than one way to spin,
but that is a minor point in the grand scheme of things.

The benchmark you linked in the cover letter could definitely use some
improvement; it was mentioned elsewhere that some of the time is spent
just waiting for threads to be created. The usual way to handle this is
to park everyone on a barrier before the measurement starts. You can
use will-it-scale as an example of how to do it, or better yet plug your
stuff into it as testcases instead. It even already has some contested
pthread mutex benchmarks.

So I would say a proper test matrix would include pthread mutexes as is,
FUTEX2_SPIN, and mere spinning (just load + cpu_relax; I can ship you
sample code for will-it-scale if you want).
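
For concreteness, "mere spinning" here means something like the
following (C11 atomics, x86 pause instruction; an illustrative sketch
only):

#include <stdatomic.h>
#include <immintrin.h>

static void spin_lock_user(_Atomic unsigned int *word)
{
        unsigned int expected;

        for (;;) {
                /* load + cpu_relax() until the lock looks free */
                while (atomic_load_explicit(word, memory_order_relaxed))
                        _mm_pause();
                expected = 0;
                if (atomic_compare_exchange_weak_explicit(word, &expected, 1,
                                memory_order_acquire, memory_order_relaxed))
                        return;
        }
}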

Even the best possible variant of FUTEX2_SPIN has to do worse than mere
spinning in userspace -- in a HT setup it hinders execution by coming
here instead of just cpu_relax-ing, and that aside it may introduce dead
time where the lock is free to use but the CPU is busy going in and out
of the kernel. The question is whether it is going to do well enough to
persuade people to not bother with their own hacks, which it plausibly
will.

> Given the FUTEX2_SPIN flag, parse the futex value as the TID of the lock
> owner. Then, before going to the normal wait path, spin while the lock
> owner is running on a different CPU, to avoid the whole context switch
> operation and to quickly return to userspace. If the lock owner is not
> running, just sleep as in the normal futex wait path.
>

Syscall costs are so high that by the time you get to this code, chances
are decent that the owner has already released the lock and it is now
held by someone else; also see below.

> +static int futex_spin(struct futex_hash_bucket *hb, struct futex_q *q,
> +                      struct hrtimer_sleeper *timeout, u32 uval)
> +{
> +        struct task_struct *p;
> +        pid_t pid = uval & FUTEX_TID_MASK;
> +
> +        p = find_get_task_by_vpid(pid);
> +

In order to even get here, all prospective spinners have to serialize on
a spinlock. Then they further dirty the task struct by grabbing a ref on
it. This seriously hinders the mechanism, but is avoidable.

> +        /* no task found, maybe it already exited */
> +        if (!p) {
> +                futex_q_unlock(hb);
> +                return -EAGAIN;
> +        }
> +
> +        /* can't spin in a kernel task */
> +        if (unlikely(p->flags & PF_KTHREAD)) {
> +                put_task_struct(p);
> +                futex_q_unlock(hb);
> +                return -EPERM;
> +        }
> +
> +        futex_queue(q, hb);
> +
> +        if (timeout)
> +                hrtimer_sleeper_start_expires(timeout, HRTIMER_MODE_ABS);
> +
> +        while (1) {
> +                if (likely(!plist_node_empty(&q->list))) {
> +                        if (timeout && !timeout->task)
> +                                goto exit;
> +
> +                        if (task_on_cpu(p)) {
> +                                /* spin */

cpu_relax() is the thing to do here; it was already mentioned by someone
else.

My comment is that, as mentioned above, the lock owner may be different
by now. So to my reading this spins on the lock for as long as p is
running, even though p may not even be holding it at the moment. Worse,
by now the task p may itself be spinning in this very place.

That is to say, I don't believe passing the TID as an argument to the
syscall is a viable approach.

Instead this code will have to read the futex value on its own every
time (with a special accessor which does not allow for page faults) and
do the find_get_task_by_vpid + task_on_cpu combo under RCU (no refing of
the proc).

The queue locking would also be dodged, except for the case where the
code decides to go off cpu.

Short of something like that, imho this has no hope of competing with
userspace spinning.
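
Concretely, the resulting loop could look something like this (a sketch
only, untested; futex_get_value_locked() is the existing futex accessor
that reads the user value with page faults disabled):

for (;;) {
        struct task_struct *p;
        bool on_cpu;
        u32 uval;

        /* re-read the futex word: the owner may have changed */
        if (futex_get_value_locked(&uval, uaddr))
                break;                        /* would fault: go sleep */

        rcu_read_lock();
        p = find_task_by_vpid(uval & FUTEX_TID_MASK);
        on_cpu = p && task_on_cpu(p);
        rcu_read_unlock();

        if (!on_cpu)
                break;                        /* owner off-CPU: go sleep */

        cpu_relax();
}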