2020-06-16 17:24:51

by Peter Oskolkov

[permalink] [raw]
Subject: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

From 6fbe0261204692a7f488261ab3c4ac696b91db5c Mon Sep 17 00:00:00 2001
From: Peter Oskolkov <[email protected]>
Date: Tue, 9 Jun 2020 16:03:14 -0700
Subject: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

This is an RFC!

As Paul Turner presented at LPC in 2013 ...
- pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
- video: https://www.youtube.com/watch?v=KXuZi9aeGTw

... Google has developed an M:N userspace threading subsystem backed
by Google-private SwitchTo Linux Kernel API (page 17 in the pdf referenced
above). This subsystem provides latency-sensitive services at Google with
fine-grained user-space control/scheduling over what is running when,
and this subsystem is used widely internally (called schedulers or fibers).

This RFC patchset is the first step to open-source this work. As explained
in the linked pdf and video, SwitchTo API has three core operations: wait,
resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
on top of which user-space threading libraries can be built.

Another common use case for FUTEX_SWAP is message passing a-la RPC
between tasks: task/thread T1 prepares a message,
wakes T2 to work on it, and waits for the results; when T2 is done, it
wakes T1 and waits for more work to arrive. Currently the simplest
way to implement this is

a. T1: futex-wake T2, futex-wait
b. T2: wakes, does what it has been woken to do
c. T2: futex-wake T1, futex-wait

With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
that runs 5-10 times faster.

Patches in this patchset:

Patch 1: (this patch) introduce FUTEX_SWAP futex operation that,
internally, does wake + wait. The purpose of this patch is
to work out the API.
Patch 2: a first rough attempt to make FUTEX_SWAP faster than
what wake + wait can do.
Patch 3: a selftest that can also be used to benchmark FUTEX_SWAP vs
FUTEX_WAKE + FUTEX_WAIT.

Tested: see patch 3 in this patchset.

Signed-off-by: Peter Oskolkov <[email protected]>
---
include/uapi/linux/futex.h | 2 +
kernel/futex.c | 97 +++++++++++++++++++++++++++++++-------
2 files changed, 83 insertions(+), 16 deletions(-)

diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index a89eb0accd5e..c1d151d97dea 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -21,6 +21,7 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_SWAP 13

#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -40,6 +41,7 @@
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
+#define FUTEX_SWAP_PRIVATE (FUTEX_SWAP | FUTEX_PRIVATE_FLAG)

/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index b59532862bc0..f3833190886f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1592,16 +1592,16 @@ double_unlock_hb(struct futex_hash_bucket *hb1, struct futex_hash_bucket *hb2)
}

/*
- * Wake up waiters matching bitset queued on this futex (uaddr).
+ * Prepare wake queue matching bitset queued on this futex (uaddr).
*/
static int
-futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+prepare_wake_q(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset,
+ struct wake_q_head *wake_q)
{
struct futex_hash_bucket *hb;
struct futex_q *this, *next;
union futex_key key = FUTEX_KEY_INIT;
int ret;
- DEFINE_WAKE_Q(wake_q);

if (!bitset)
return -EINVAL;
@@ -1629,20 +1629,34 @@ futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
if (!(this->bitset & bitset))
continue;

- mark_wake_futex(&wake_q, this);
+ mark_wake_futex(wake_q, this);
if (++ret >= nr_wake)
break;
}
}

spin_unlock(&hb->lock);
- wake_up_q(&wake_q);
out_put_key:
put_futex_key(&key);
out:
return ret;
}

+/*
+ * Wake up waiters matching bitset queued on this futex (uaddr).
+ */
+static int
+futex_wake(u32 __user *uaddr, unsigned int flags, int nr_wake, u32 bitset)
+{
+ int ret;
+ DEFINE_WAKE_Q(wake_q);
+
+ ret = prepare_wake_q(uaddr, flags, nr_wake, bitset, &wake_q);
+ wake_up_q(&wake_q);
+
+ return ret;
+}
+
static int futex_atomic_op_inuser(unsigned int encoded_op, u32 __user *uaddr)
{
unsigned int op = (encoded_op & 0x70000000) >> 28;
@@ -2600,9 +2614,12 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
* @hb: the futex hash bucket, must be locked by the caller
* @q: the futex_q to queue up on
* @timeout: the prepared hrtimer_sleeper, or null for no timeout
+ * @next: if present, wake next and hint to the scheduler that we'd
+ * prefer to execute it locally.
*/
static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout)
+ struct hrtimer_sleeper *timeout,
+ struct task_struct *next)
{
/*
* The task state is guaranteed to be set before another task can
@@ -2627,10 +2644,27 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
- if (!timeout || timeout->task)
+ if (!timeout || timeout->task) {
+ if (next) {
+ /*
+ * wake_up_process() below will be replaced
+ * in the next patch with
+ * wake_up_process_prefer_current_cpu().
+ */
+ wake_up_process(next);
+ put_task_struct(next);
+ next = NULL;
+ }
freezable_schedule();
+ }
}
__set_current_state(TASK_RUNNING);
+
+ if (next) {
+ /* Maybe call wake_up_process_prefer_current_cpu()? */
+ wake_up_process(next);
+ put_task_struct(next);
+ }
}

/**
@@ -2710,7 +2744,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
}

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
- ktime_t *abs_time, u32 bitset)
+ ktime_t *abs_time, u32 bitset, struct task_struct *next)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
@@ -2734,7 +2768,8 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
goto out;

/* queue_me and wait for wakeup, timeout, or a signal. */
- futex_wait_queue_me(hb, &q, to);
+ futex_wait_queue_me(hb, &q, to, next);
+ next = NULL;

/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
@@ -2767,6 +2802,10 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
ret = -ERESTART_RESTARTBLOCK;

out:
+ if (next) {
+ wake_up_process(next);
+ put_task_struct(next);
+ }
if (to) {
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
@@ -2774,7 +2813,6 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
return ret;
}

-
static long futex_wait_restart(struct restart_block *restart)
{
u32 __user *uaddr = restart->futex.uaddr;
@@ -2786,10 +2824,35 @@ static long futex_wait_restart(struct restart_block *restart)
}
restart->fn = do_no_restart_syscall;

- return (long)futex_wait(uaddr, restart->futex.flags,
- restart->futex.val, tp, restart->futex.bitset);
+ return (long)futex_wait(uaddr, restart->futex.flags, restart->futex.val,
+ tp, restart->futex.bitset, NULL);
}

+static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
+ ktime_t *abs_time, u32 __user *uaddr2)
+{
+ u32 bitset = FUTEX_BITSET_MATCH_ANY;
+ struct task_struct *next = NULL;
+ DEFINE_WAKE_Q(wake_q);
+ int ret;
+
+ ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
+ if (!wake_q_empty(&wake_q)) {
+ /* Pull the first wakee out of the queue to swap into. */
+ next = container_of(wake_q.first, struct task_struct, wake_q);
+ wake_q.first = wake_q.first->next;
+ next->wake_q.next = NULL;
+ /*
+ * Note that wake_up_q does not touch wake_q.last, so we
+ * do not bother with it here.
+ */
+ wake_up_q(&wake_q);
+ }
+ if (ret < 0)
+ return ret;
+
+ return futex_wait(uaddr, flags, val, abs_time, bitset, next);
+}

/*
* Userspace tried a 0 -> TID atomic transition of the futex value
@@ -3275,7 +3338,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
}

/* Queue the futex_q, drop the hb lock, wait for wakeup. */
- futex_wait_queue_me(hb, &q, to);
+ futex_wait_queue_me(hb, &q, to, NULL);

spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
@@ -3805,7 +3868,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
val3 = FUTEX_BITSET_MATCH_ANY;
/* fall through */
case FUTEX_WAIT_BITSET:
- return futex_wait(uaddr, flags, val, timeout, val3);
+ return futex_wait(uaddr, flags, val, timeout, val3, NULL);
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
/* fall through */
@@ -3829,6 +3892,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+ case FUTEX_SWAP:
+ return futex_swap(uaddr, flags, val, timeout, uaddr2);
}
return -ENOSYS;
}
@@ -3845,7 +3910,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,

if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
cmd == FUTEX_WAIT_BITSET ||
- cmd == FUTEX_WAIT_REQUEUE_PI)) {
+ cmd == FUTEX_WAIT_REQUEUE_PI || cmd == FUTEX_SWAP)) {
if (unlikely(should_fail_futex(!(op & FUTEX_PRIVATE_FLAG))))
return -EFAULT;
if (get_timespec64(&ts, utime))
@@ -3854,7 +3919,7 @@ SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
return -EINVAL;

t = timespec64_to_ktime(ts);
- if (cmd == FUTEX_WAIT)
+ if (cmd == FUTEX_WAIT || cmd == FUTEX_SWAP)
t = ktime_add_safe(ktime_get(), t);
tp = &t;
}
--
2.25.1



2020-06-18 00:52:14

by Andrei Vagin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

On Tue, Jun 16, 2020 at 10:22:26AM -0700, Peter Oskolkov wrote:
> From 6fbe0261204692a7f488261ab3c4ac696b91db5c Mon Sep 17 00:00:00 2001
> From: Peter Oskolkov <[email protected]>
> Date: Tue, 9 Jun 2020 16:03:14 -0700
> Subject: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation
>
> This is an RFC!
>
> As Paul Turner presented at LPC in 2013 ...
> - pdf: http://pdxplumbers.osuosl.org/2013/ocw//system/presentations/1653/original/LPC%20-%20User%20Threading.pdf
> - video: https://www.youtube.com/watch?v=KXuZi9aeGTw
>
> ... Google has developed an M:N userspace threading subsystem backed
> by Google-private SwitchTo Linux Kernel API (page 17 in the pdf referenced
> above). This subsystem provides latency-sensitive services at Google with
> fine-grained user-space control/scheduling over what is running when,
> and this subsystem is used widely internally (called schedulers or fibers).
>
> This RFC patchset is the first step to open-source this work. As explained
> in the linked pdf and video, SwitchTo API has three core operations: wait,
> resume, and swap (=switch). So this patchset adds a FUTEX_SWAP operation
> that, in addition to FUTEX_WAIT and FUTEX_WAKE, will provide a foundation
> on top of which user-space threading libraries can be built.
>
> Another common use case for FUTEX_SWAP is message passing a-la RPC
> between tasks: task/thread T1 prepares a message,
> wakes T2 to work on it, and waits for the results; when T2 is done, it
> wakes T1 and waits for more work to arrive. Currently the simplest
> way to implement this is
>
> a. T1: futex-wake T2, futex-wait
> b. T2: wakes, does what it has been woken to do
> c. T2: futex-wake T1, futex-wait
>
> With FUTEX_SWAP, steps a and c above can be reduced to one futex operation
> that runs 5-10 times faster.
>

Hi Peter,

We have a good use-case in gVisor for this new futex command.

gVisor accesses a file system through a file proxy, called the Gofer.
The gofer runs as a separate process, that is isolated from the sandbox
(sentry). Gofer instances communicate with their respective sentry using
the 9P-like protocol. We used sockets as communication channels, but
recently we switched to the flipcall (1) library which improve
performance by using shared memory for data (reducing memory copies) and
using futexes for control signaling (which is much cheaper than
sendto/recvfrom/sendmsg/recvmsg).

I modified the flipcall library to use FUTEX_SWAP and I see a
significant performance improvement.

A low level benchmarks (2) shows that req/resp is more than five time
faster with FUTEX_SWAP than with FUTEX_WAKE&FUTEX_WAIT. This is more or
less the same test what you did.

* FUTEX_WAKE & FUTEX_WAIT
BenchmarkSendRecv-8 88396 13625 ns/op

* FUTEX_SWAP
BenchmarkSendRecv-8 479604 2524 ns/op

And a more high-level test (3) which benchmarks the open syscall in
gVisor shows about 40% improvements.

* FUTEX_WAKE & FUTEX_WAIT
BM_Open/1/real_time_mean 93996 ns

* FUTEX_SWAP
BM_Open/1/real_time_mean 53136 ns

I believe there are many use-cases for FUTEX_SWAP in other projects.

1. https://github.com/google/gvisor/tree/master/pkg/flipcall
2. https://github.com/google/gvisor/blob/master/pkg/flipcall/flipcall_test.go#L361
3. https://github.com/google/gvisor/blob/master/test/perf/linux/open_benchmark.cc

Thanks,
Andrei

2020-06-23 13:30:40

by Aaron Lu

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

On Tue, Jun 16, 2020 at 10:22:26AM -0700, Peter Oskolkov wrote:
> static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
> - struct hrtimer_sleeper *timeout)
> + struct hrtimer_sleeper *timeout,
> + struct task_struct *next)
> {
> /*
> * The task state is guaranteed to be set before another task can
> @@ -2627,10 +2644,27 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
> * flagged for rescheduling. Only call schedule if there
> * is no timeout, or if it has yet to expire.
> */
> - if (!timeout || timeout->task)
> + if (!timeout || timeout->task) {
> + if (next) {
> + /*
> + * wake_up_process() below will be replaced
> + * in the next patch with
> + * wake_up_process_prefer_current_cpu().
> + */
> + wake_up_process(next);
> + put_task_struct(next);
> + next = NULL;
> + }

So in futex_swap case, the wake up occurs in futex_wait_queue_me(). I
personally think it's more natural to do the wakeup in futex_swap()
instead.

> freezable_schedule();
> + }
> }
> __set_current_state(TASK_RUNNING);
> +
> + if (next) {
> + /* Maybe call wake_up_process_prefer_current_cpu()? */
> + wake_up_process(next);
> + put_task_struct(next);
> + }
> }
>
> /**

> +static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
> + ktime_t *abs_time, u32 __user *uaddr2)
> +{
> + u32 bitset = FUTEX_BITSET_MATCH_ANY;
> + struct task_struct *next = NULL;
> + DEFINE_WAKE_Q(wake_q);
> + int ret;
> +
> + ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
> + if (!wake_q_empty(&wake_q)) {
> + /* Pull the first wakee out of the queue to swap into. */
> + next = container_of(wake_q.first, struct task_struct, wake_q);
> + wake_q.first = wake_q.first->next;
> + next->wake_q.next = NULL;
> + /*
> + * Note that wake_up_q does not touch wake_q.last, so we
> + * do not bother with it here.
> + */
> + wake_up_q(&wake_q);

wake_up_q() doesn't seem to serve any purpose in that the above
assignment of wake_q.first shall make it an empty queue now?
Also, I don't see a need to touch wake_q.first either so I think we can
get rid of wake_q altogether here.

> + }
> + if (ret < 0)
> + return ret;
> +
> + return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> +}

I've cooked the below diff, on top of your patchset. It survived your
self test and schbench. Feel free to ignore it if you don't like it, or
merge it into your patchset if you think it looks better.

do wake up in futex_swap()

---
kernel/futex.c | 43 +++++++++++--------------------------------
1 file changed, 11 insertions(+), 32 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index a426671e4bbb..995bc881059c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2618,8 +2618,7 @@ static int fixup_owner(u32 __user *uaddr, struct futex_q *q, int locked)
* prefer to execute it locally.
*/
static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
- struct hrtimer_sleeper *timeout,
- struct task_struct *next)
+ struct hrtimer_sleeper *timeout)
{
/*
* The task state is guaranteed to be set before another task can
@@ -2644,22 +2643,11 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
* flagged for rescheduling. Only call schedule if there
* is no timeout, or if it has yet to expire.
*/
- if (!timeout || timeout->task) {
- if (next) {
- wake_up_process_prefer_current_cpu(next);
- put_task_struct(next);
- next = NULL;
- }
+ if (!timeout || timeout->task)
freezable_schedule();
- }
}
- __set_current_state(TASK_RUNNING);

- if (next) {
- /* Maybe call wake_up_process_prefer_current_cpu()? */
- wake_up_process(next);
- put_task_struct(next);
- }
+ __set_current_state(TASK_RUNNING);
}

/**
@@ -2739,7 +2727,7 @@ static int futex_wait_setup(u32 __user *uaddr, u32 val, unsigned int flags,
}

static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
- ktime_t *abs_time, u32 bitset, struct task_struct *next)
+ ktime_t *abs_time, u32 bitset)
{
struct hrtimer_sleeper timeout, *to;
struct restart_block *restart;
@@ -2763,8 +2751,7 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
goto out;

/* queue_me and wait for wakeup, timeout, or a signal. */
- futex_wait_queue_me(hb, &q, to, next);
- next = NULL;
+ futex_wait_queue_me(hb, &q, to);

/* If we were woken (and unqueued), we succeeded, whatever. */
ret = 0;
@@ -2797,10 +2784,6 @@ static int futex_wait(u32 __user *uaddr, unsigned int flags, u32 val,
ret = -ERESTART_RESTARTBLOCK;

out:
- if (next) {
- wake_up_process(next);
- put_task_struct(next);
- }
if (to) {
hrtimer_cancel(&to->timer);
destroy_hrtimer_on_stack(&to->timer);
@@ -2820,7 +2803,7 @@ static long futex_wait_restart(struct restart_block *restart)
restart->fn = do_no_restart_syscall;

return (long)futex_wait(uaddr, restart->futex.flags, restart->futex.val,
- tp, restart->futex.bitset, NULL);
+ tp, restart->futex.bitset);
}

static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
@@ -2835,18 +2818,14 @@ static int futex_swap(u32 __user *uaddr, unsigned int flags, u32 val,
if (!wake_q_empty(&wake_q)) {
/* Pull the first wakee out of the queue to swap into. */
next = container_of(wake_q.first, struct task_struct, wake_q);
- wake_q.first = wake_q.first->next;
next->wake_q.next = NULL;
- /*
- * Note that wake_up_q does not touch wake_q.last, so we
- * do not bother with it here.
- */
- wake_up_q(&wake_q);
+ wake_up_process_prefer_current_cpu(next);
+ put_task_struct(next);
}
if (ret < 0)
return ret;

- return futex_wait(uaddr, flags, val, abs_time, bitset, next);
+ return futex_wait(uaddr, flags, val, abs_time, bitset);
}

/*
@@ -3333,7 +3312,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags,
}

/* Queue the futex_q, drop the hb lock, wait for wakeup. */
- futex_wait_queue_me(hb, &q, to, NULL);
+ futex_wait_queue_me(hb, &q, to);

spin_lock(&hb->lock);
ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
@@ -3863,7 +3842,7 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
val3 = FUTEX_BITSET_MATCH_ANY;
/* fall through */
case FUTEX_WAIT_BITSET:
- return futex_wait(uaddr, flags, val, timeout, val3, NULL);
+ return futex_wait(uaddr, flags, val, timeout, val3);
case FUTEX_WAKE:
val3 = FUTEX_BITSET_MATCH_ANY;
/* fall through */
--
2.26.2

2020-06-23 18:32:00

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

Hi Aaron,

thanks a lot for your tests and feedback! My comments below.

Thanks,
Peter

On Tue, 2020-06-23 at 21:25 +0800, Aaron Lu wrote:
> On Tue, Jun 16, 2020 at 10:22:26AM -0700, Peter Oskolkov wrote:
> > static void futex_wait_queue_me(struct futex_hash_bucket *hb,
> > struct futex_q *q,
> > - struct hrtimer_sleeper *timeout)
> > + struct hrtimer_sleeper *timeout,
> > + struct task_struct *next)
> > {
> > /*
> > * The task state is guaranteed to be set before another task
> > can
> > @@ -2627,10 +2644,27 @@ static void futex_wait_queue_me(struct
> > futex_hash_bucket *hb, struct futex_q *q,
> > * flagged for rescheduling. Only call schedule if
> > there
> > * is no timeout, or if it has yet to expire.
> > */
> > - if (!timeout || timeout->task)
> > + if (!timeout || timeout->task) {
> > + if (next) {
> > + /*
> > + * wake_up_process() below will be
> > replaced
> > + * in the next patch with
> > + *
> > wake_up_process_prefer_current_cpu().
> > + */
> > + wake_up_process(next);
> > + put_task_struct(next);
> > + next = NULL;
> > + }
>
> So in futex_swap case, the wake up occurs in futex_wait_queue_me(). I
> personally think it's more natural to do the wakeup in futex_swap()
> instead.

I chose to do it the way I did so that
wake_up_process_prefer_current_cpu() is called only
when we are reasonably sure that the current task is
going to block/wait, as otherwise it makes no sense
to migrate the wakee/next to the current CPU. In addition, future
optimizations may include actually doing a context switch into
the wakee/next if the current task is going to block instead of
just calling schedule(), which also points to benefits of
having the waking/context-switching code
reside here rather than in futex_swap.

These reasons are somewhat hand-wavy, though, and I agree
that logically it seems that waking in futex_swap() is
easier to think through. Let's see what the maintainers say.

>
> > freezable_schedule();
> > + }
> > }
> > __set_current_state(TASK_RUNNING);
> > +
> > + if (next) {
> > + /* Maybe call wake_up_process_prefer_current_cpu()? */
> > + wake_up_process(next);
> > + put_task_struct(next);
> > + }
> > }
> >
> > /**
> > +static int futex_swap(u32 __user *uaddr, unsigned int flags, u32
> > val,
> > + ktime_t *abs_time, u32 __user *uaddr2)
> > +{
> > + u32 bitset = FUTEX_BITSET_MATCH_ANY;
> > + struct task_struct *next = NULL;
> > + DEFINE_WAKE_Q(wake_q);
> > + int ret;
> > +
> > + ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
> > + if (!wake_q_empty(&wake_q)) {
> > + /* Pull the first wakee out of the queue to swap into.
> > */
> > + next = container_of(wake_q.first, struct task_struct,
> > wake_q);
> > + wake_q.first = wake_q.first->next;
> > + next->wake_q.next = NULL;
> > + /*
> > + * Note that wake_up_q does not touch wake_q.last, so
> > we
> > + * do not bother with it here.
> > + */
> > + wake_up_q(&wake_q);
>
> wake_up_q() doesn't seem to serve any purpose in that the above
> assignment of wake_q.first shall make it an empty queue now?
> Also, I don't see a need to touch wake_q.first either so I think we
> can
> get rid of wake_q altogether here.

The futex at uaddr2 may have more than one waiter, so we cannot assume
that wake_q will be empty when we remove the first element.

>
> > + }
> > + if (ret < 0)
> > + return ret;
> > +
> > + return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> > +}
>
> I've cooked the below diff, on top of your patchset. It survived your
> self test and schbench. Feel free to ignore it if you don't like it,
> or
> merge it into your patchset if you think it looks better.
>
> do wake up in futex_swap()
>
> ---
> kernel/futex.c | 43 +++++++++++--------------------------------
> 1 file changed, 11 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index a426671e4bbb..995bc881059c 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -2618,8 +2618,7 @@ static int fixup_owner(u32 __user *uaddr,
> struct futex_q *q, int locked)
> * prefer to execute it locally.
> */
> static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct
> futex_q *q,
> - struct hrtimer_sleeper *timeout,
> - struct task_struct *next)
> + struct hrtimer_sleeper *timeout)
> {
> /*
> * The task state is guaranteed to be set before another task
> can
> @@ -2644,22 +2643,11 @@ static void futex_wait_queue_me(struct
> futex_hash_bucket *hb, struct futex_q *q,
> * flagged for rescheduling. Only call schedule if
> there
> * is no timeout, or if it has yet to expire.
> */
> - if (!timeout || timeout->task) {
> - if (next) {
> - wake_up_process_prefer_current_cpu(next
> );
> - put_task_struct(next);
> - next = NULL;
> - }
> + if (!timeout || timeout->task)
> freezable_schedule();
> - }
> }
> - __set_current_state(TASK_RUNNING);
>
> - if (next) {
> - /* Maybe call wake_up_process_prefer_current_cpu()? */
> - wake_up_process(next);
> - put_task_struct(next);
> - }
> + __set_current_state(TASK_RUNNING);
> }
>
> /**
> @@ -2739,7 +2727,7 @@ static int futex_wait_setup(u32 __user *uaddr,
> u32 val, unsigned int flags,
> }
>
> static int futex_wait(u32 __user *uaddr, unsigned int flags, u32
> val,
> - ktime_t *abs_time, u32 bitset, struct task_struct
> *next)
> + ktime_t *abs_time, u32 bitset)
> {
> struct hrtimer_sleeper timeout, *to;
> struct restart_block *restart;
> @@ -2763,8 +2751,7 @@ static int futex_wait(u32 __user *uaddr,
> unsigned int flags, u32 val,
> goto out;
>
> /* queue_me and wait for wakeup, timeout, or a signal. */
> - futex_wait_queue_me(hb, &q, to, next);
> - next = NULL;
> + futex_wait_queue_me(hb, &q, to);
>
> /* If we were woken (and unqueued), we succeeded, whatever. */
> ret = 0;
> @@ -2797,10 +2784,6 @@ static int futex_wait(u32 __user *uaddr,
> unsigned int flags, u32 val,
> ret = -ERESTART_RESTARTBLOCK;
>
> out:
> - if (next) {
> - wake_up_process(next);
> - put_task_struct(next);
> - }
> if (to) {
> hrtimer_cancel(&to->timer);
> destroy_hrtimer_on_stack(&to->timer);
> @@ -2820,7 +2803,7 @@ static long futex_wait_restart(struct
> restart_block *restart)
> restart->fn = do_no_restart_syscall;
>
> return (long)futex_wait(uaddr, restart->futex.flags, restart-
> >futex.val,
> - tp, restart->futex.bitset, NULL);
> + tp, restart->futex.bitset);
> }
>
> static int futex_swap(u32 __user *uaddr, unsigned int flags, u32
> val,
> @@ -2835,18 +2818,14 @@ static int futex_swap(u32 __user *uaddr,
> unsigned int flags, u32 val,
> if (!wake_q_empty(&wake_q)) {
> /* Pull the first wakee out of the queue to swap into.
> */
> next = container_of(wake_q.first, struct task_struct,
> wake_q);
> - wake_q.first = wake_q.first->next;
> next->wake_q.next = NULL;
> - /*
> - * Note that wake_up_q does not touch wake_q.last, so
> we
> - * do not bother with it here.
> - */
> - wake_up_q(&wake_q);
> + wake_up_process_prefer_current_cpu(next);
> + put_task_struct(next);
> }
> if (ret < 0)
> return ret;
>
> - return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> + return futex_wait(uaddr, flags, val, abs_time, bitset);
> }
>
> /*
> @@ -3333,7 +3312,7 @@ static int futex_wait_requeue_pi(u32 __user
> *uaddr, unsigned int flags,
> }
>
> /* Queue the futex_q, drop the hb lock, wait for wakeup. */
> - futex_wait_queue_me(hb, &q, to, NULL);
> + futex_wait_queue_me(hb, &q, to);
>
> spin_lock(&hb->lock);
> ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to);
> @@ -3863,7 +3842,7 @@ long do_futex(u32 __user *uaddr, int op, u32
> val, ktime_t *timeout,
> val3 = FUTEX_BITSET_MATCH_ANY;
> /* fall through */
> case FUTEX_WAIT_BITSET:
> - return futex_wait(uaddr, flags, val, timeout, val3,
> NULL);
> + return futex_wait(uaddr, flags, val, timeout, val3);
> case FUTEX_WAKE:
> val3 = FUTEX_BITSET_MATCH_ANY;
> /* fall through */

2020-06-23 19:46:10

by Andrei Vagin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

On Tue, Jun 23, 2020 at 11:30:30AM -0700, Peter Oskolkov wrote:
...
> > > /**
> > > +static int futex_swap(u32 __user *uaddr, unsigned int flags, u32
> > > val,
> > > + ktime_t *abs_time, u32 __user *uaddr2)
> > > +{
> > > + u32 bitset = FUTEX_BITSET_MATCH_ANY;
> > > + struct task_struct *next = NULL;
> > > + DEFINE_WAKE_Q(wake_q);
> > > + int ret;
> > > +
> > > + ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
> > > + if (!wake_q_empty(&wake_q)) {
> > > + /* Pull the first wakee out of the queue to swap into.
> > > */
> > > + next = container_of(wake_q.first, struct task_struct,
> > > wake_q);
> > > + wake_q.first = wake_q.first->next;
> > > + next->wake_q.next = NULL;
> > > + /*
> > > + * Note that wake_up_q does not touch wake_q.last, so
> > > we
> > > + * do not bother with it here.
> > > + */
> > > + wake_up_q(&wake_q);
> >
> > wake_up_q() doesn't seem to serve any purpose in that the above
> > assignment of wake_q.first shall make it an empty queue now?
> > Also, I don't see a need to touch wake_q.first either so I think we
> > can
> > get rid of wake_q altogether here.
>
> The futex at uaddr2 may have more than one waiter, so we cannot assume
> that wake_q will be empty when we remove the first element.

The third argument of prepare_wake_q is nr_wake which is one in this
case, so we can be sure that wake_q will be empty, can't we?

>
> >
> > > + }
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> > > +}
> >

2020-06-23 21:03:28

by Peter Oskolkov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/3 v2] futex: introduce FUTEX_SWAP operation

On Tue, Jun 23, 2020 at 12:45 PM Andrei Vagin <[email protected]> wrote:
>
> On Tue, Jun 23, 2020 at 11:30:30AM -0700, Peter Oskolkov wrote:
> ...
> > > > /**
> > > > +static int futex_swap(u32 __user *uaddr, unsigned int flags, u32
> > > > val,
> > > > + ktime_t *abs_time, u32 __user *uaddr2)
> > > > +{
> > > > + u32 bitset = FUTEX_BITSET_MATCH_ANY;
> > > > + struct task_struct *next = NULL;
> > > > + DEFINE_WAKE_Q(wake_q);
> > > > + int ret;
> > > > +
> > > > + ret = prepare_wake_q(uaddr2, flags, 1, bitset, &wake_q);
> > > > + if (!wake_q_empty(&wake_q)) {
> > > > + /* Pull the first wakee out of the queue to swap into.
> > > > */
> > > > + next = container_of(wake_q.first, struct task_struct,
> > > > wake_q);
> > > > + wake_q.first = wake_q.first->next;
> > > > + next->wake_q.next = NULL;
> > > > + /*
> > > > + * Note that wake_up_q does not touch wake_q.last, so
> > > > we
> > > > + * do not bother with it here.
> > > > + */
> > > > + wake_up_q(&wake_q);
> > >
> > > wake_up_q() doesn't seem to serve any purpose in that the above
> > > assignment of wake_q.first shall make it an empty queue now?
> > > Also, I don't see a need to touch wake_q.first either so I think we
> > > can
> > > get rid of wake_q altogether here.
> >
> > The futex at uaddr2 may have more than one waiter, so we cannot assume
> > that wake_q will be empty when we remove the first element.
>
> The third argument of prepare_wake_q is nr_wake which is one in this
> case, so we can be sure that wake_q will be empty, can't we?

Right, sorry. In an early draft it was possible to wake more than one waiter,
and the code carried over from then/there in this form. I'll remove wake_up_q()
if/when the API will be deemed acceptable by the maintainers.

>
> >
> > >
> > > > + }
> > > > + if (ret < 0)
> > > > + return ret;
> > > > +
> > > > + return futex_wait(uaddr, flags, val, abs_time, bitset, next);
> > > > +}
> > >