2023-07-12 17:00:55

by Jens Axboe

Subject: [PATCHSET v2 0/8] Add io_uring futex/futexv support

Hi,

This patchset adds support for first futex wake and wait, and then
futexv. Patches 1..3 are just prep patches, patch 4 adds the wait
and wake support for io_uring, and then patches 5..7 are again prep
patches to end up with futexv support in patch 8.

For wait, wake, and waitv, we support the bitset variant, as the
"normal" variants can be easily implemented on top of that.

PI and requeue are not supported through io_uring, just the above
mentioned parts. This may change in the future, but in the spirit
of keeping this small (and based on what people have been asking for),
this is what we currently have.

When I did these patches, I forgot that Pavel had previously posted a
futex variant for io_uring. The major thing that had been holding me
back when people asked about futexes and io_uring is that I wanted
to do this in what I consider the right way: no usage of io-wq or
thread offload, but an actually async implementation that is efficient
to use and doesn't rely on a blocking thread for futex wait/waitv. This
is what this patchset attempts to do, while being minimally invasive
on the futex side. I believe the diffstat reflects that.

As far as I can recall, the first request for futex support with
io_uring came from Andres Freund, working on postgres. His aio rework
of postgres was one of the early adopters of io_uring, and futex
support was a natural extension for that. This is relevant both from
a usability point of view and for efficiency and performance.
In Andres's words, for the former:

"Futex wait support in io_uring makes it a lot easier to avoid deadlocks
in concurrent programs that have their own buffer pool: Obviously pages in
the application buffer pool have to be locked during IO. If the initiator
of IO A needs to wait for a held lock B, the holder of lock B might wait
for the IO A to complete. The ability to wait for a lock and IO
completions at the same time provides an efficient way to avoid such
deadlocks."

and in terms of efficiency, even without unlocking the full potential yet,
Andres says:

"Futex wake support in io_uring is useful because it allows for more
efficient directed wakeups. For some "locks" postgres has queues
implemented in userspace, with wakeup logic that cannot easily be
implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine
waiting for journal flushes to have completed up to a certain point). Thus
a "lock release" sometimes need to wake up many processes in a row. A
quick-and-dirty conversion to doing these wakeups via io_uring lead to a
3% throughput increase, with 12% fewer context switches, albeit in a
fairly extreme workload."

Some basic io_uring futex support and test cases are available in the
liburing 'futex' branch:

https://git.kernel.dk/cgit/liburing/log/?h=futex

testing all of the variants. I originally wrote this code about a
month ago and Andres has been using it with postgres, and I'm not
aware of any bugs in it. That's not to say it's perfect, obviously,
and I welcome some feedback so we can move this forward and hash out
any potential issues.

 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |   4 +
 io_uring/Makefile              |   4 +-
 io_uring/cancel.c              |   5 +
 io_uring/cancel.h              |   4 +
 io_uring/futex.c               | 376 +++++++++++++++++++++++++++++++++
 io_uring/futex.h               |  36 ++++
 io_uring/io_uring.c            |   5 +
 io_uring/opdef.c               |  35 ++-
 kernel/futex/futex.h           |  33 +++
 kernel/futex/requeue.c         |   3 +-
 kernel/futex/syscalls.c        |  27 ++-
 kernel/futex/waitwake.c        |  49 +++--
 13 files changed, 548 insertions(+), 36 deletions(-)

You can also find the code here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-futex

V2:
- Abstract out __futex_wake_mark() helper. Use it both in the
  futex and io_uring code. This also fixes a missing WARN_ON
  on the io_uring side.
- Have futex_op_to_flags() unconditionally clear flags to
  zero rather than do that in both callers.
- Remove comment on needing to open-code futex_queue(),
  and associated hunk doing that. This was a leftover
  from an earlier version.
- Expand the commit message logs in various patches.

--
Jens Axboe




2023-07-12 17:16:03

by Jens Axboe

Subject: [PATCH 1/8] futex: abstract out futex_op_to_flags() helper

Rather than needing to duplicate this for the io_uring hook of futexes,
abstract out a helper.

No functional changes intended in this patch.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/futex.h    | 17 +++++++++++++++++
 kernel/futex/syscalls.c | 13 +++----------
 2 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index b5379c0e6d6d..b8f454792304 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -291,4 +291,21 @@ extern int futex_unlock_pi(u32 __user *uaddr, unsigned int flags);

extern int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time, int trylock);

+static inline bool futex_op_to_flags(int op, int cmd, unsigned int *flags)
+{
+	*flags = 0;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		*flags |= FLAGS_SHARED;
+
+	if (op & FUTEX_CLOCK_REALTIME) {
+		*flags |= FLAGS_CLOCKRT;
+		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
+		    cmd != FUTEX_LOCK_PI2)
+			return false;
+	}
+
+	return true;
+}
+
#endif /* _FUTEX_H */
diff --git a/kernel/futex/syscalls.c b/kernel/futex/syscalls.c
index a8074079b09e..0b63d5bcdc77 100644
--- a/kernel/futex/syscalls.c
+++ b/kernel/futex/syscalls.c
@@ -86,17 +86,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
 	int cmd = op & FUTEX_CMD_MASK;
-	unsigned int flags = 0;
+	unsigned int flags;

-	if (!(op & FUTEX_PRIVATE_FLAG))
-		flags |= FLAGS_SHARED;
-
-	if (op & FUTEX_CLOCK_REALTIME) {
-		flags |= FLAGS_CLOCKRT;
-		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
-		    cmd != FUTEX_LOCK_PI2)
-			return -ENOSYS;
-	}
+	if (!futex_op_to_flags(op, cmd, &flags))
+		return -ENOSYS;

 	switch (cmd) {
 	case FUTEX_WAIT:
--
2.40.1
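
For readers who want to poke at the helper's logic outside the kernel,
the following is a userspace replica of futex_op_to_flags() together
with a caller that maps a false return to -ENOSYS, as do_futex() does in
the hunk above. It is only a sketch: the DEMO_FLAGS_* values are
placeholders standing in for the kernel-internal FLAGS_SHARED and
FLAGS_CLOCKRT, the demo_* names are made up for this example, and the
FUTEX_LOCK_PI2 fallback definition is an assumption for older userspace
headers.

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdbool.h>

/* Placeholder values; the real FLAGS_* definitions are kernel-internal. */
#define DEMO_FLAGS_SHARED	0x01u
#define DEMO_FLAGS_CLOCKRT	0x02u

#ifndef FUTEX_LOCK_PI2
#define FUTEX_LOCK_PI2		13	/* assumption: absent from older headers */
#endif

/* Userspace copy of the helper added by this patch. */
static bool demo_futex_op_to_flags(int op, int cmd, unsigned int *flags)
{
	*flags = 0;

	/* Futexes are shared between processes unless marked private. */
	if (!(op & FUTEX_PRIVATE_FLAG))
		*flags |= DEMO_FLAGS_SHARED;

	/* CLOCK_REALTIME is only accepted for a few commands. */
	if (op & FUTEX_CLOCK_REALTIME) {
		*flags |= DEMO_FLAGS_CLOCKRT;
		if (cmd != FUTEX_WAIT_BITSET && cmd != FUTEX_WAIT_REQUEUE_PI &&
		    cmd != FUTEX_LOCK_PI2)
			return false;
	}

	return true;
}

/* Caller in the style of do_futex(): reject unsupported combinations. */
static int demo_check_op(int op, unsigned int *flags)
{
	int cmd = op & FUTEX_CMD_MASK;

	if (!demo_futex_op_to_flags(op, cmd, flags))
		return -ENOSYS;
	return 0;
}
```

Note that flags is written even when the helper returns false, which is
why the caller no longer needs to initialize it.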


2023-07-12 17:19:58

by Jens Axboe

Subject: [PATCH 7/8] futex: make the vectored futex operations available

Rename unqueue_multiple() as futex_unqueue_multiple(), and make both
that and futex_wait_multiple_setup() available for external users. This
is in preparation for wiring up vectored waits in io_uring.

Signed-off-by: Jens Axboe <[email protected]>
---
 kernel/futex/futex.h    |  5 +++++
 kernel/futex/waitwake.c | 10 +++++-----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/futex/futex.h b/kernel/futex/futex.h
index f6598d8451fb..4d73d2978e50 100644
--- a/kernel/futex/futex.h
+++ b/kernel/futex/futex.h
@@ -290,6 +290,11 @@ extern int futex_parse_waitv(struct futex_vector *futexv,
unsigned int nr_futexes, futex_wake_fn *wake,
void *wake_data);

+extern int futex_wait_multiple_setup(struct futex_vector *vs, int count,
+ int *woken);
+
+extern int futex_unqueue_multiple(struct futex_vector *v, int count);
+
extern int futex_wait_multiple(struct futex_vector *vs, unsigned int count,
struct hrtimer_sleeper *to);

diff --git a/kernel/futex/waitwake.c b/kernel/futex/waitwake.c
index f8fb6550061d..0383da9f737f 100644
--- a/kernel/futex/waitwake.c
+++ b/kernel/futex/waitwake.c
@@ -369,7 +369,7 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
}

/**
- * unqueue_multiple - Remove various futexes from their hash bucket
+ * futex_unqueue_multiple - Remove various futexes from their hash bucket
* @v: The list of futexes to unqueue
* @count: Number of futexes in the list
*
@@ -379,7 +379,7 @@ void futex_wait_queue(struct futex_hash_bucket *hb, struct futex_q *q,
* - >=0 - Index of the last futex that was awoken;
* - -1 - No futex was awoken
*/
-static int unqueue_multiple(struct futex_vector *v, int count)
+int futex_unqueue_multiple(struct futex_vector *v, int count)
{
int ret = -1, i;

@@ -407,7 +407,7 @@ static int unqueue_multiple(struct futex_vector *v, int count)
* - 0 - Success
* - <0 - -EFAULT, -EWOULDBLOCK or -EINVAL
*/
-static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
+int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *woken)
{
struct futex_hash_bucket *hb;
bool retry = false;
@@ -469,7 +469,7 @@ static int futex_wait_multiple_setup(struct futex_vector *vs, int count, int *wo
* was woken, we don't return error and return this index to
* userspace
*/
- *woken = unqueue_multiple(vs, i);
+ *woken = futex_unqueue_multiple(vs, i);
if (*woken >= 0)
return 1;

@@ -554,7 +554,7 @@ int futex_wait_multiple(struct futex_vector *vs, unsigned int count,

__set_current_state(TASK_RUNNING);

- ret = unqueue_multiple(vs, count);
+ ret = futex_unqueue_multiple(vs, count);
if (ret >= 0)
return ret;

--
2.40.1


2023-07-12 17:46:42

by Jens Axboe

Subject: Re: [PATCHSET v2 0/8] Add io_uring futex/futexv support


Neglected to mention that I have run this through the LTP futex tests,
and there are no changes. All tests pass.

--
Jens Axboe