2021-03-04 23:43:34

by André Almeida

Subject: [RFC PATCH v2 00/13] Add futex2 syscall

Hi,

This patch series introduces the futex2 syscalls.

* What happened to the current futex()?

For some years now, developers have been trying to add new features to
futex, but maintainers have been reluctant to accept them, since the
multiplexed interface is full of legacy features and makes big changes
tricky. Some problems that people tried to address with patchsets are:
NUMA-awareness[0], smaller sized futexes[1], wait on multiple futexes[2].
NUMA, for instance, just doesn't fit the current API in a reasonable
way. Given that, it's not possible to merge new features into the
current futex.

** The NUMA problem

In the current implementation, all kernel-side futex infrastructure is
stored on a single NUMA node. Because of that, futex() calls issued by
processors located on other nodes pay a remote memory access penalty.

** The 32bit sized futex problem

Embedded systems, or anything with memory constraints, would benefit from
using smaller sizes for the futex userspace integer. Also, a mutex
implementation can be done using just three values, so 8 bits is enough
for various scenarios.
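
For illustration, here is a minimal sketch of such a three-value mutex
(0 = unlocked, 1 = locked, 2 = locked with waiters), written against the
existing futex() syscall on a 32-bit word; the same state machine would
fit an 8-bit futex once FUTEX_8 support lands. This is just an example of
the idea, not code from this series:

#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Three states: 0 = unlocked, 1 = locked, 2 = locked with waiters. */
static _Atomic uint32_t lock;

static void futex_wait_word(void *addr, uint32_t val)
{
	syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static void futex_wake_word(void *addr, int nr)
{
	syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, nr, NULL, NULL, 0);
}

void mutex_lock(void)
{
	uint32_t c = 0;

	/* Fast path: uncontended 0 -> 1 transition, no syscall at all. */
	if (atomic_compare_exchange_strong(&lock, &c, 1))
		return;

	/* Slow path: mark the lock contended and sleep until it is free. */
	if (c != 2)
		c = atomic_exchange(&lock, 2);
	while (c != 0) {
		futex_wait_word(&lock, 2);
		c = atomic_exchange(&lock, 2);
	}
}

void mutex_unlock(void)
{
	/* State 2 means someone may be sleeping; wake exactly one waiter. */
	if (atomic_exchange(&lock, 0) == 2)
		futex_wake_word(&lock, 1);
}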

** The wait on multiple problem

The use case lies in the Wine implementation of the Windows NT interface
WaitMultipleObjects. This Windows API function allows a thread to sleep
waiting on the first of a set of event sources (mutexes, timers, signals,
console input, etc.) to signal. Considering this is a primitive
synchronization operation for Windows applications, being able to quickly
signal events on the producer side, and to quickly go to sleep on the
consumer side, is essential for good performance of applications running
over Wine.

[0] https://lore.kernel.org/lkml/[email protected]/
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/

* The solution

As proposed by Peter Zijlstra and Florian Weimer[3], a new interface
is required to solve this, designed with those features in mind. futex2()
is that interface. As opposed to the current multiplexed interface, the
new one has one syscall per operation. This keeps the API maintainable
as it gets extended, and helps users with type checking of arguments.
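
For illustration, the per-operation surface ends up looking roughly like
this. This is a sketch derived from the userspace wrappers and
documentation in this series; the struct layouts and argument types are
assumptions and still subject to review, not a frozen ABI:

#include <time.h>

struct futex_waitv {
	void		*uaddr;		/* address to wait on */
	unsigned int	 val;		/* expected value at uaddr */
	unsigned int	 flags;		/* size (FUTEX_8/16/32), shared, ... */
};

struct futex_requeue {
	void		*uaddr;
	unsigned int	 flags;		/* per-address flags */
};

/* One syscall per operation, so each argument can be type checked: */
long futex_wait(void *uaddr, unsigned long val, unsigned long flags,
		struct timespec *timo);
long futex_wake(void *uaddr, unsigned long nr_wake, unsigned long flags);
long futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
		 unsigned long flags, struct timespec *timo);
long futex_requeue(struct futex_requeue *uaddr1, struct futex_requeue *uaddr2,
		   unsigned int nr_wake, unsigned int nr_requeue,
		   unsigned int cmpval, unsigned long flags);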

In particular, the new interface is extended to support the ability to
wait on any of a list of futexes at a time, which could be seen as a
vectored extension of the FUTEX_WAIT semantics.

[3] https://lore.kernel.org/lkml/[email protected]/

* The interface

The new interface can be seen in detail in the following patches, but
this is a high level summary of what the interface can do:

- Supports wake/wait semantics, as in futex()
- Supports requeue operations, similar to FUTEX_CMP_REQUEUE, but with
individual flags for each address
- Supports waiting for a vector of futexes, using a new syscall named
futex_waitv() (see the usage sketch after this list)
- Supports variable sized futexes (8, 16 and 32 bits)
- Supports NUMA-aware operations, where the user can specify which
memory node to operate on
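
As a concrete usage sketch of the vectored wait, the program below blocks
on two 32-bit futexes at once and reports which one woke it. Note the
assumptions: the futex_waitv syscall number (445), the FUTEX_32 value and
the struct layout are taken from this series and may change before merge;
per the documentation patch, the syscall should return the index of the
woken entry.

/* build: gcc -pthread waitv_demo.c -o waitv_demo
 * Needs a kernel with this series applied. */
#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, taken from this series (not a frozen ABI). */
#ifndef FUTEX_32
# define FUTEX_32 2
#endif
#ifndef __NR_futex_wake
# define __NR_futex_wake 444
#endif
#ifndef __NR_futex_waitv
# define __NR_futex_waitv 445
#endif

struct futex_waitv {			/* layout as sketched above */
	void		*uaddr;
	unsigned int	 val;
	unsigned int	 flags;
};

static unsigned int f1, f2;

static void *waker(void *arg)
{
	(void)arg;
	usleep(100 * 1000);
	/* Wake one waiter blocked on f2. */
	syscall(__NR_futex_wake, &f2, 1, FUTEX_32);
	return NULL;
}

int main(void)
{
	struct futex_waitv vec[] = {
		{ .uaddr = &f1, .val = 0, .flags = FUTEX_32 },
		{ .uaddr = &f2, .val = 0, .flags = FUTEX_32 },
	};
	pthread_t t;
	long idx;

	pthread_create(&t, NULL, waker, NULL);
	/* Blocks until any entry is woken; returns the index of that entry. */
	idx = syscall(__NR_futex_waitv, vec, 2, 0, NULL);
	printf("woken on entry %ld (expected 1)\n", idx);
	pthread_join(t, NULL);
	return 0;
}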

* Implementation

The internal implementation follows a similar design to the original futex.
Given that we want to replicate the same external behavior of the current
futex, this should be somewhat expected. For some functions, like the
init and the code to get a shared key, I literally copied code and
comments from kernel/futex.c. I decided to do so, instead of exposing the
original functions publicly, so that we can freely modify our
implementation if required, without any impact on the old futex. Also,
those comments precisely describe the details and corner cases of the
implementation.

Each patch contains a brief description of implementation, but patch 6
"docs: locking: futex2: Add documentation" adds a more complete document
about it.

* The patchset

This patchset can also be found in my git tree:

https://gitlab.collabora.com/tonyk/linux/-/tree/futex2-dev

- Patch 1: Implements wait/wake, and the basic foundations of futex2

- Patches 2-4: Implement the remaining features (shared, waitv, requeue).

- Patch 5: Adds the x86_x32 ABI handling. I kept it in a separate
patch since I'm not sure if x86_x32 is still a thing, or if it should
return -ENOSYS.

- Patch 6: Adds a documentation file which details the interface and
the internal implementation.

- Patches 7-12: Selftests for all operations along with perf
support for futex2.

- Patch 13: While working on porting glibc to futex2, I found out
that there's a futex_wake() call at the user thread exit path, if
that thread was created with clone(..., CLONE_CHILD_SETTID, ...). In
order to make pthreads work with futex2, it was required to add
this patch. Note that this is more a proof-of-concept of what we
will need to do in the future, rather than part of the interface, and
it shouldn't be merged as is.

* Testing:

This patchset provides selftests for each operation and their flags.
Along with that, the following work was done:

** Stability

To stress the interface in "real world scenarios":

- glibc[4]: nptl's low level locking was modified to use futex2 API
(except for robust and PI things). All relevant nptl/ tests passed.

- Wine[5]: Proton/Wine was modified in order to use futex2() for the
emulation of Windows NT sync mechanisms based on futex, called "fsync".
Triple-A games with huge CPU loads and tons of parallel jobs worked
as expected when compared with the previous FUTEX_WAIT_MULTIPLE
implementation in futex(). Some games issue 42k futex2() calls
per second.

- Full GNU/Linux distro: I installed the modified glibc on my host
machine, so all pthread programs would use futex2(). After tweaking
systemd[6] to allow futex2() calls in its seccomp filters, everything
worked as expected (web browsers do some syscall sandboxing and need
some configuration as well).

- perf: The perf benchmark tests can also be used to stress the
interface, and they can be found in this patchset.

** Performance

- For comparing futex() and futex2() performance, I used the artificial
benchmarks implemented in perf (wake, wake-parallel, hash and
requeue). The setup was 200 runs for each test, using 8, 80, 800 and
8000 threads. Note that for this test, I'm not using patch 13
("kernel: Enable waitpid() for futex2"), for reasons explained in
"The patchset" section.

- For the first three, I measured an average 4% gain in
performance. This is not a big step, but it shows that the new
interface is at least comparable in performance with the current one.

- For requeue, I measured an average 21% decrease in performance
compared to the original futex implementation. This is expected given
the new design with individual flags. The performance trade-offs are
explained in patch 4 ("futex2: Implement requeue operation").

[4] https://gitlab.collabora.com/tonyk/glibc/-/tree/futex2
[5] https://gitlab.collabora.com/tonyk/wine/-/tree/proton_5.13
[6] https://gitlab.collabora.com/tonyk/systemd

* FAQ

** "Where's the code for NUMA and FUTEX_8/16?"

The current code is already complex enough to take some time for
review, so I believe it's better to split that work out into a future
iteration of this patchset. Besides that, this RFC is the core part of the
infrastructure, and the following features will not pose big design
changes to it; the work will be mostly about wiring up the flags and
modifying some functions.

** "And what's about FUTEX_64?"

By supporting 64 bit futexes, the kernel structure for futex would
need to have a 64 bit field for the value, and that could defeat one of
the purposes of having different sized futexes in the first place:
supporting smaller ones to decrease memory usage. This might be
something that could be disabled for 32bit archs (and even for
CONFIG_BASE_SMALL).

Which use case would benefit from FUTEX_64? Is it worth the trade-offs?

** "Where's the PI/robust stuff?"

As said by Peter Zijlstra in [3], all those new features are related to
the "simple" futex interface, which doesn't use PI or robust. Do we want
to have this complexity in futex2(), and if so, should it be part of
this patchset or can it be future work?

Thanks,
André

* Changelog

Changes from v1:
- Unified futex_set_timer_and_wait and __futex_wait code
- Dropped _careful from linked list function calls
- Fixed typos on docs patch
- uAPI flags are now added as features are introduced, instead of all flags
in patch 1
- Removed struct futex_single_waiter in favor of an anon struct
v1: https://lore.kernel.org/lkml/[email protected]/


André Almeida (13):
futex2: Implement wait and wake functions
futex2: Add support for shared futexes
futex2: Implement vectorized wait
futex2: Implement requeue operation
futex2: Add compatibility entry point for x86_x32 ABI
docs: locking: futex2: Add documentation
selftests: futex2: Add wake/wait test
selftests: futex2: Add timeout test
selftests: futex2: Add wouldblock test
selftests: futex2: Add waitv test
selftests: futex2: Add requeue test
perf bench: Add futex2 benchmark tests
kernel: Enable waitpid() for futex2

Documentation/locking/futex2.rst | 198 +++
Documentation/locking/index.rst | 1 +
MAINTAINERS | 2 +-
arch/arm/tools/syscall.tbl | 4 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 8 +
arch/x86/entry/syscalls/syscall_32.tbl | 4 +
arch/x86/entry/syscalls/syscall_64.tbl | 4 +
fs/inode.c | 1 +
include/linux/compat.h | 23 +
include/linux/fs.h | 1 +
include/linux/syscalls.h | 18 +
include/uapi/asm-generic/unistd.h | 14 +-
include/uapi/linux/futex.h | 31 +
init/Kconfig | 7 +
kernel/Makefile | 1 +
kernel/fork.c | 2 +
kernel/futex2.c | 1239 +++++++++++++++++
kernel/sys_ni.c | 6 +
tools/arch/x86/include/asm/unistd_64.h | 12 +
tools/include/uapi/asm-generic/unistd.h | 11 +-
.../arch/x86/entry/syscalls/syscall_64.tbl | 4 +
tools/perf/bench/bench.h | 4 +
tools/perf/bench/futex-hash.c | 24 +-
tools/perf/bench/futex-requeue.c | 57 +-
tools/perf/bench/futex-wake-parallel.c | 41 +-
tools/perf/bench/futex-wake.c | 37 +-
tools/perf/bench/futex.h | 47 +
tools/perf/builtin-bench.c | 18 +-
.../selftests/futex/functional/.gitignore | 3 +
.../selftests/futex/functional/Makefile | 8 +-
.../futex/functional/futex2_requeue.c | 164 +++
.../selftests/futex/functional/futex2_wait.c | 209 +++
.../selftests/futex/functional/futex2_waitv.c | 157 +++
.../futex/functional/futex_wait_timeout.c | 58 +-
.../futex/functional/futex_wait_wouldblock.c | 33 +-
.../testing/selftests/futex/functional/run.sh | 6 +
.../selftests/futex/include/futex2test.h | 121 ++
38 files changed, 2527 insertions(+), 53 deletions(-)
create mode 100644 Documentation/locking/futex2.rst
create mode 100644 kernel/futex2.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_requeue.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c
create mode 100644 tools/testing/selftests/futex/functional/futex2_waitv.c
create mode 100644 tools/testing/selftests/futex/include/futex2test.h

--
2.30.1


2021-03-04 23:44:55

by André Almeida

Subject: [RFC PATCH v2 12/13] perf bench: Add futex2 benchmark tests

Add support at the existing futex benchmarking code base to enable
futex2 calls. `perf bench` tests can be used not only as a way to
measure the performance of implementation, but also as stress testing
for the kernel infrastructure.

Signed-off-by: André Almeida <[email protected]>
---
tools/arch/x86/include/asm/unistd_64.h | 12 ++++++
tools/perf/bench/bench.h | 4 ++
tools/perf/bench/futex-hash.c | 24 +++++++++--
tools/perf/bench/futex-requeue.c | 57 ++++++++++++++++++++------
tools/perf/bench/futex-wake-parallel.c | 41 +++++++++++++++---
tools/perf/bench/futex-wake.c | 37 +++++++++++++----
tools/perf/bench/futex.h | 47 +++++++++++++++++++++
tools/perf/builtin-bench.c | 18 ++++++--
8 files changed, 206 insertions(+), 34 deletions(-)

diff --git a/tools/arch/x86/include/asm/unistd_64.h b/tools/arch/x86/include/asm/unistd_64.h
index 4205ed4158bf..b65c51e8d675 100644
--- a/tools/arch/x86/include/asm/unistd_64.h
+++ b/tools/arch/x86/include/asm/unistd_64.h
@@ -17,3 +17,15 @@
#ifndef __NR_setns
#define __NR_setns 308
#endif
+
+#ifndef __NR_futex_wait
+# define __NR_futex_wait 443
+#endif
+
+#ifndef __NR_futex_wake
+# define __NR_futex_wake 444
+#endif
+
+#ifndef __NR_futex_requeue
+# define __NR_futex_requeue 446
+#endif
diff --git a/tools/perf/bench/bench.h b/tools/perf/bench/bench.h
index eac36afab2b3..12346844b354 100644
--- a/tools/perf/bench/bench.h
+++ b/tools/perf/bench/bench.h
@@ -38,9 +38,13 @@ int bench_mem_memcpy(int argc, const char **argv);
int bench_mem_memset(int argc, const char **argv);
int bench_mem_find_bit(int argc, const char **argv);
int bench_futex_hash(int argc, const char **argv);
+int bench_futex2_hash(int argc, const char **argv);
int bench_futex_wake(int argc, const char **argv);
+int bench_futex2_wake(int argc, const char **argv);
int bench_futex_wake_parallel(int argc, const char **argv);
+int bench_futex2_wake_parallel(int argc, const char **argv);
int bench_futex_requeue(int argc, const char **argv);
+int bench_futex2_requeue(int argc, const char **argv);
/* pi futexes */
int bench_futex_lock_pi(int argc, const char **argv);
int bench_epoll_wait(int argc, const char **argv);
diff --git a/tools/perf/bench/futex-hash.c b/tools/perf/bench/futex-hash.c
index b65373ce5c4f..1068749af40c 100644
--- a/tools/perf/bench/futex-hash.c
+++ b/tools/perf/bench/futex-hash.c
@@ -33,7 +33,7 @@ static unsigned int nthreads = 0;
static unsigned int nsecs = 10;
/* amount of futexes per thread */
static unsigned int nfutexes = 1024;
-static bool fshared = false, done = false, silent = false;
+static bool fshared = false, done = false, silent = false, futex2 = false;
static int futex_flag = 0;

struct timeval bench__start, bench__end, bench__runtime;
@@ -85,7 +85,10 @@ static void *workerfn(void *arg)
* such as internal waitqueue handling, thus enlarging
* the critical region protected by hb->lock.
*/
- ret = futex_wait(&w->futex[i], 1234, NULL, futex_flag);
+ if (!futex2)
+ ret = futex_wait(&w->futex[i], 1234, NULL, futex_flag);
+ else
+ ret = futex2_wait(&w->futex[i], 1234, futex_flag, NULL);
if (!silent &&
(!ret || errno != EAGAIN || errno != EWOULDBLOCK))
warn("Non-expected futex return call");
@@ -116,7 +119,7 @@ static void print_summary(void)
(int)bench__runtime.tv_sec);
}

-int bench_futex_hash(int argc, const char **argv)
+static int __bench_futex_hash(int argc, const char **argv)
{
int ret = 0;
cpu_set_t cpuset;
@@ -148,7 +151,9 @@ int bench_futex_hash(int argc, const char **argv)
if (!worker)
goto errmem;

- if (!fshared)
+ if (futex2)
+ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG);
+ else if (!fshared)
futex_flag = FUTEX_PRIVATE_FLAG;

printf("Run summary [PID %d]: %d threads, each operating on %d [%s] futexes for %d secs.\n\n",
@@ -228,3 +233,14 @@ int bench_futex_hash(int argc, const char **argv)
errmem:
err(EXIT_FAILURE, "calloc");
}
+
+int bench_futex_hash(int argc, const char **argv)
+{
+ return __bench_futex_hash(argc, argv);
+}
+
+int bench_futex2_hash(int argc, const char **argv)
+{
+ futex2 = true;
+ return __bench_futex_hash(argc, argv);
+}
diff --git a/tools/perf/bench/futex-requeue.c b/tools/perf/bench/futex-requeue.c
index 5fa23295ee5f..6cdd649b54f4 100644
--- a/tools/perf/bench/futex-requeue.c
+++ b/tools/perf/bench/futex-requeue.c
@@ -2,8 +2,8 @@
/*
* Copyright (C) 2013 Davidlohr Bueso <[email protected]>
*
- * futex-requeue: Block a bunch of threads on futex1 and requeue them
- * on futex2, N at a time.
+ * futex-requeue: Block a bunch of threads on addr1 and requeue them
+ * on addr2, N at a time.
*
* This program is particularly useful to measure the latency of nthread
* requeues without waking up any tasks -- thus mimicking a regular futex_wait.
@@ -28,7 +28,10 @@
#include <stdlib.h>
#include <sys/time.h>

-static u_int32_t futex1 = 0, futex2 = 0;
+static u_int32_t addr1 = 0, addr2 = 0;
+
+static struct futex_requeue rq1 = { .uaddr = &addr1, .flags = FUTEX_32 };
+static struct futex_requeue rq2 = { .uaddr = &addr2, .flags = FUTEX_32 };

/*
* How many tasks to requeue at a time.
@@ -37,7 +40,7 @@ static u_int32_t futex1 = 0, futex2 = 0;
static unsigned int nrequeue = 1;

static pthread_t *worker;
-static bool done = false, silent = false, fshared = false;
+static bool done = false, silent = false, fshared = false, futex2 = false;
static pthread_mutex_t thread_lock;
static pthread_cond_t thread_parent, thread_worker;
static struct stats requeuetime_stats, requeued_stats;
@@ -79,7 +82,11 @@ static void *workerfn(void *arg __maybe_unused)
pthread_cond_wait(&thread_worker, &thread_lock);
pthread_mutex_unlock(&thread_lock);

- futex_wait(&futex1, 0, NULL, futex_flag);
+ if (!futex2)
+ futex_wait(&addr1, 0, NULL, futex_flag);
+ else
+ futex2_wait(&addr1, 0, futex_flag, NULL);
+
return NULL;
}

@@ -111,7 +118,7 @@ static void toggle_done(int sig __maybe_unused,
done = true;
}

-int bench_futex_requeue(int argc, const char **argv)
+static int __bench_futex_requeue(int argc, const char **argv)
{
int ret = 0;
unsigned int i, j;
@@ -139,15 +146,20 @@ int bench_futex_requeue(int argc, const char **argv)
if (!worker)
err(EXIT_FAILURE, "calloc");

- if (!fshared)
+ if (futex2) {
+ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG);
+ rq1.flags |= FUTEX_SHARED_FLAG * fshared;
+ rq2.flags |= FUTEX_SHARED_FLAG * fshared;
+ } else if (!fshared) {
futex_flag = FUTEX_PRIVATE_FLAG;
+ }

if (nrequeue > nthreads)
nrequeue = nthreads;

printf("Run summary [PID %d]: Requeuing %d threads (from [%s] %p to %p), "
"%d at a time.\n\n", getpid(), nthreads,
- fshared ? "shared":"private", &futex1, &futex2, nrequeue);
+ fshared ? "shared":"private", &addr1, &addr2, nrequeue);

init_stats(&requeued_stats);
init_stats(&requeuetime_stats);
@@ -176,11 +188,15 @@ int bench_futex_requeue(int argc, const char **argv)
gettimeofday(&start, NULL);
while (nrequeued < nthreads) {
/*
- * Do not wakeup any tasks blocked on futex1, allowing
+ * Do not wakeup any tasks blocked on addr1, allowing
* us to really measure futex_wait functionality.
*/
- nrequeued += futex_cmp_requeue(&futex1, 0, &futex2, 0,
- nrequeue, futex_flag);
+ if (!futex2)
+ nrequeued += futex_cmp_requeue(&addr1, 0, &addr2,
+ 0, nrequeue, futex_flag);
+ else
+ nrequeued += futex2_requeue(&rq1, &rq2,
+ 0, nrequeue, 0, 0);
}

gettimeofday(&end, NULL);
@@ -194,8 +210,12 @@ int bench_futex_requeue(int argc, const char **argv)
j + 1, nrequeued, nthreads, runtime.tv_usec / (double)USEC_PER_MSEC);
}

- /* everybody should be blocked on futex2, wake'em up */
- nrequeued = futex_wake(&futex2, nrequeued, futex_flag);
+ /* everybody should be blocked on addr2, wake'em up */
+ if (!futex2)
+ nrequeued = futex_wake(&addr2, nrequeued, futex_flag);
+ else
+ nrequeued = futex2_wake(&addr2, nrequeued, futex_flag);
+
if (nthreads != nrequeued)
warnx("couldn't wakeup all tasks (%d/%d)", nrequeued, nthreads);

@@ -220,3 +240,14 @@ int bench_futex_requeue(int argc, const char **argv)
usage_with_options(bench_futex_requeue_usage, options);
exit(EXIT_FAILURE);
}
+
+int bench_futex_requeue(int argc, const char **argv)
+{
+ return __bench_futex_requeue(argc, argv);
+}
+
+int bench_futex2_requeue(int argc, const char **argv)
+{
+ futex2 = true;
+ return __bench_futex_requeue(argc, argv);
+}
diff --git a/tools/perf/bench/futex-wake-parallel.c b/tools/perf/bench/futex-wake-parallel.c
index 6e6f5247e1fe..cac90fc0bfb3 100644
--- a/tools/perf/bench/futex-wake-parallel.c
+++ b/tools/perf/bench/futex-wake-parallel.c
@@ -17,6 +17,12 @@ int bench_futex_wake_parallel(int argc __maybe_unused, const char **argv __maybe
pr_err("%s: pthread_barrier_t unavailable, disabling this test...\n", __func__);
return 0;
}
+
+int bench_futex2_wake_parallel(int argc __maybe_unused, const char **argv __maybe_unused)
+{
+ pr_err("%s: pthread_barrier_t unavailable, disabling this test...\n", __func__);
+ return 0;
+}
#else /* HAVE_PTHREAD_BARRIER */
/* For the CLR_() macros */
#include <string.h>
@@ -47,7 +53,7 @@ static unsigned int nwakes = 1;
static u_int32_t futex = 0;

static pthread_t *blocked_worker;
-static bool done = false, silent = false, fshared = false;
+static bool done = false, silent = false, fshared = false, futex2 = false;
static unsigned int nblocked_threads = 0, nwaking_threads = 0;
static pthread_mutex_t thread_lock;
static pthread_cond_t thread_parent, thread_worker;
@@ -78,7 +84,11 @@ static void *waking_workerfn(void *arg)

gettimeofday(&start, NULL);

- waker->nwoken = futex_wake(&futex, nwakes, futex_flag);
+ if (!futex2)
+ waker->nwoken = futex_wake(&futex, nwakes, futex_flag);
+ else
+ waker->nwoken = futex2_wake(&futex, nwakes, futex_flag);
+
if (waker->nwoken != nwakes)
warnx("couldn't wakeup all tasks (%d/%d)",
waker->nwoken, nwakes);
@@ -129,8 +139,13 @@ static void *blocked_workerfn(void *arg __maybe_unused)
pthread_mutex_unlock(&thread_lock);

while (1) { /* handle spurious wakeups */
- if (futex_wait(&futex, 0, NULL, futex_flag) != EINTR)
- break;
+ if (!futex2) {
+ if (futex_wait(&futex, 0, NULL, futex_flag) != EINTR)
+ break;
+ } else {
+ if (futex2_wait(&futex, 0, futex_flag, NULL) != EINTR)
+ break;
+ }
}

pthread_exit(NULL);
@@ -217,7 +232,7 @@ static void toggle_done(int sig __maybe_unused,
done = true;
}

-int bench_futex_wake_parallel(int argc, const char **argv)
+static int __bench_futex_wake_parallel(int argc, const char **argv)
{
int ret = 0;
unsigned int i, j;
@@ -261,7 +276,9 @@ int bench_futex_wake_parallel(int argc, const char **argv)
if (!blocked_worker)
err(EXIT_FAILURE, "calloc");

- if (!fshared)
+ if (futex2)
+ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG);
+ else if (!fshared)
futex_flag = FUTEX_PRIVATE_FLAG;

printf("Run summary [PID %d]: blocking on %d threads (at [%s] "
@@ -321,4 +338,16 @@ int bench_futex_wake_parallel(int argc, const char **argv)
free(blocked_worker);
return ret;
}
+
+int bench_futex_wake_parallel(int argc, const char **argv)
+{
+ return __bench_futex_wake_parallel(argc, argv);
+}
+
+int bench_futex2_wake_parallel(int argc, const char **argv)
+{
+ futex2 = true;
+ return __bench_futex_wake_parallel(argc, argv);
+}
+
#endif /* HAVE_PTHREAD_BARRIER */
diff --git a/tools/perf/bench/futex-wake.c b/tools/perf/bench/futex-wake.c
index 6d217868f53c..546d2818eed8 100644
--- a/tools/perf/bench/futex-wake.c
+++ b/tools/perf/bench/futex-wake.c
@@ -38,7 +38,7 @@ static u_int32_t futex1 = 0;
static unsigned int nwakes = 1;

pthread_t *worker;
-static bool done = false, silent = false, fshared = false;
+static bool done = false, silent = false, fshared = false, futex2 = false;
static pthread_mutex_t thread_lock;
static pthread_cond_t thread_parent, thread_worker;
static struct stats waketime_stats, wakeup_stats;
@@ -68,8 +68,13 @@ static void *workerfn(void *arg __maybe_unused)
pthread_mutex_unlock(&thread_lock);

while (1) {
- if (futex_wait(&futex1, 0, NULL, futex_flag) != EINTR)
- break;
+ if (!futex2) {
+ if (futex_wait(&futex1, 0, NULL, futex_flag) != EINTR)
+ break;
+ } else {
+ if (futex2_wait(&futex1, 0, futex_flag, NULL) != EINTR)
+ break;
+ }
}

pthread_exit(NULL);
@@ -117,7 +122,7 @@ static void toggle_done(int sig __maybe_unused,
done = true;
}

-int bench_futex_wake(int argc, const char **argv)
+static int __bench_futex_wake(int argc, const char **argv)
{
int ret = 0;
unsigned int i, j;
@@ -147,7 +152,9 @@ int bench_futex_wake(int argc, const char **argv)
if (!worker)
err(EXIT_FAILURE, "calloc");

- if (!fshared)
+ if (futex2)
+ futex_flag = FUTEX_32 | (fshared * FUTEX_SHARED_FLAG);
+ else if (!fshared)
futex_flag = FUTEX_PRIVATE_FLAG;

printf("Run summary [PID %d]: blocking on %d threads (at [%s] futex %p), "
@@ -179,9 +186,14 @@ int bench_futex_wake(int argc, const char **argv)

/* Ok, all threads are patiently blocked, start waking folks up */
gettimeofday(&start, NULL);
- while (nwoken != nthreads)
- nwoken += futex_wake(&futex1, nwakes, futex_flag);
+ while (nwoken != nthreads) {
+ if (!futex2)
+ nwoken += futex_wake(&futex1, nwakes, futex_flag);
+ else
+ nwoken += futex2_wake(&futex1, nwakes, futex_flag);
+ }
gettimeofday(&end, NULL);
+
timersub(&end, &start, &runtime);

update_stats(&wakeup_stats, nwoken);
@@ -211,3 +223,14 @@ int bench_futex_wake(int argc, const char **argv)
free(worker);
return ret;
}
+
+int bench_futex_wake(int argc, const char **argv)
+{
+ return __bench_futex_wake(argc, argv);
+}
+
+int bench_futex2_wake(int argc, const char **argv)
+{
+ futex2 = true;
+ return __bench_futex_wake(argc, argv);
+}
diff --git a/tools/perf/bench/futex.h b/tools/perf/bench/futex.h
index 31b53cc7d5bc..6b2213cf3f64 100644
--- a/tools/perf/bench/futex.h
+++ b/tools/perf/bench/futex.h
@@ -86,4 +86,51 @@ futex_cmp_requeue(u_int32_t *uaddr, u_int32_t val, u_int32_t *uaddr2, int nr_wak
return futex(uaddr, FUTEX_CMP_REQUEUE, nr_wake, nr_requeue, uaddr2,
val, opflags);
}
+
+/**
+ * futex2_wait - Wait at uaddr if *uaddr == val, until timo.
+ * @uaddr: User address to wait for
+ * @val: Expected value at uaddr
+ * @flags: Operation options
+ * @timo: Optional timeout
+ *
+ * Return: 0 on success, error code otherwise
+ */
+static inline int futex2_wait(volatile void *uaddr, unsigned long val,
+ unsigned long flags, struct timespec *timo)
+{
+ return syscall(__NR_futex_wait, uaddr, val, flags, timo);
+}
+
+/**
+ * futex2_wake - Wake a number of waiters waiting at uaddr
+ * @uaddr: Address to wake
+ * @nr: Number of waiters to wake
+ * @flags: Operation options
+ *
+ * Return: number of waked futexes
+ */
+static inline int futex2_wake(volatile void *uaddr, unsigned int nr, unsigned long flags)
+{
+ return syscall(__NR_futex_wake, uaddr, nr, flags);
+}
+
+/**
+ * futex2_requeue - Requeue waiters from an address to another one
+ * @uaddr1: Address where waiters are currently waiting on
+ * @uaddr2: New address to wait
+ * @nr_wake: Number of waiters at uaddr1 to be wake
+ * @nr_requeue: After waking nr_wake, number of waiters to be requeued
+ * @cmpval: Expected value at uaddr1
+ * @flags: Operation options
+ *
+ * Return: waked futexes + requeued futexes at uaddr1
+ */
+static inline int futex2_requeue(volatile struct futex_requeue *uaddr1,
+ volatile struct futex_requeue *uaddr2,
+ unsigned int nr_wake, unsigned int nr_requeue,
+ unsigned int cmpval, unsigned long flags)
+{
+ return syscall(__NR_futex_requeue, uaddr1, uaddr2, nr_wake, nr_requeue, cmpval, flags);
+}
#endif /* _FUTEX_H */
diff --git a/tools/perf/builtin-bench.c b/tools/perf/builtin-bench.c
index 62a7b7420a44..e41a95ad2db6 100644
--- a/tools/perf/builtin-bench.c
+++ b/tools/perf/builtin-bench.c
@@ -12,10 +12,11 @@
*
* sched ... scheduler and IPC performance
* syscall ... System call performance
- * mem ... memory access performance
- * numa ... NUMA scheduling and MM performance
- * futex ... Futex performance
- * epoll ... Event poll performance
+ * mem ... memory access performance
+ * numa ... NUMA scheduling and MM performance
+ * futex ... Futex performance
+ * futex2 ... Futex2 performance
+ * epoll ... Event poll performance
*/
#include <subcmd/parse-options.h>
#include "builtin.h"
@@ -75,6 +76,14 @@ static struct bench futex_benchmarks[] = {
{ NULL, NULL, NULL }
};

+static struct bench futex2_benchmarks[] = {
+ { "hash", "Benchmark for futex2 hash table", bench_futex2_hash },
+ { "wake", "Benchmark for futex2 wake calls", bench_futex2_wake },
+ { "wake-parallel", "Benchmark for parallel futex2 wake calls", bench_futex2_wake_parallel },
+ { "requeue", "Benchmark for futex2 requeue calls", bench_futex2_requeue },
+ { NULL, NULL, NULL }
+};
+
#ifdef HAVE_EVENTFD_SUPPORT
static struct bench epoll_benchmarks[] = {
{ "wait", "Benchmark epoll concurrent epoll_waits", bench_epoll_wait },
@@ -105,6 +114,7 @@ static struct collection collections[] = {
{ "numa", "NUMA scheduling and MM benchmarks", numa_benchmarks },
#endif
{"futex", "Futex stressing benchmarks", futex_benchmarks },
+ {"futex2", "Futex2 stressing benchmarks", futex2_benchmarks },
#ifdef HAVE_EVENTFD_SUPPORT
{"epoll", "Epoll stressing benchmarks", epoll_benchmarks },
#endif
--
2.30.1

2021-03-05 00:15:42

by Theodore Ts'o

Subject: Re: [RFC PATCH v2 00/13] Add futex2 syscall

On Wed, Mar 03, 2021 at 09:42:06PM -0300, André Almeida wrote:
> ** Performance
>
> - For comparing futex() and futex2() performance, I used the artificial
> benchmarks implemented at perf (wake, wake-parallel, hash and
> requeue). The setup was 200 runs for each test and using 8, 80, 800,
> 8000 for the number of threads, Note that for this test, I'm not using
> patch 14 ("kernel: Enable waitpid() for futex2") , for reasons explained
> at "The patchset" section.

How heavily contended were the benchmarks? One of the benefits of
the original futex was that no system call was necessary in the happy
path when the lock is uncontended. That is especially true on a non-NUMA
system (which is the far more common case), since that's where relying on
a single memory access was a huge win for the original futex. I would
expect that futex2 will fare worse in this particular case, since it
requires a system call entry for all operations --- the question is
how large the delta is between this worst case (for futex2) and best case
(for futex) scenario.

Cheers,

- Ted

2021-03-07 13:25:32

by Stefan Metzmacher

Subject: Re: [RFC PATCH v2 00/13] Add futex2 syscall


Hi André,
> ** The wait on multiple problem
>
> The use case lies in the Wine implementation of the Windows NT interface
> WaitMultipleObjects. This Windows API function allows a thread to sleep
> waiting on the first of a set of event sources (mutexes, timers, signal,
> console input, etc) to signal.

With that in mind would it be good to have some interaction with epoll (and similar calls)?

Instead of having a blocked futex_waitv(), would waiting on an fd (maybe a generic eventfd() or a new futex2fd())
be a better interface?

Or instead introduce an IORING_OP_FUTEX2_WAITV? Then the futex_waitv logic would wait
in an io-wq kernel thread...

I guess the io_uring way would mean we could keep that in mind as a future addition, which can be implemented
later...

metze

2021-03-08 11:24:51

by David Laight

Subject: RE: [RFC PATCH v2 00/13] Add futex2 syscall

From: Stefan Metzmacher
> Sent: 07 March 2021 11:35
>
> Hi André,
> > ** The wait on multiple problem
> >
> > The use case lies in the Wine implementation of the Windows NT interface
> > WaitMultipleObjects. This Windows API function allows a thread to sleep
> > waiting on the first of a set of event sources (mutexes, timers, signal,
> > console input, etc) to signal.

They are all events.
You can only wait on either events or sockets (using select).
There is a socket api to signal an event when data arrives (etc).
There is also the insane (these days) restriction of 64 events.

> With that in mind would it be good to have some interaction with epoll (and similar calls)?

Or hook something up so that pollwakeup can kick a futex as well
as waking up poll() and adding an event to epoll().

David


2021-03-08 11:54:32

by Stefan Metzmacher

Subject: Re: [RFC PATCH v2 00/13] Add futex2 syscall


On 07.03.21 at 12:56, Daurnimator wrote:
> On Sun, 7 Mar 2021 at 22:35, Stefan Metzmacher <[email protected]> wrote:
>> Instead of having a blocked futex_waitv() waiting on an fd (maybe a generic eventfd() or a new futex2fd())
>> would be a better interface?
>
> Like bring back FUTEX_FD? (which was removed back in 2.6.25)

Ah, ok, yes something like that.

But as that was removed because of races, it might not be a good idea to bring it back.

metze

2021-03-08 11:57:35

by Stefan Metzmacher

Subject: Re: [RFC PATCH v2 00/13] Add futex2 syscall

On 08.03.21 at 12:11, David Laight wrote:
> From: Stefan Metzmacher
>> Sent: 07 March 2021 11:35
>>
>> Hi André,
>>> ** The wait on multiple problem
>>>
>>> The use case lies in the Wine implementation of the Windows NT interface
>>> WaitMultipleObjects. This Windows API function allows a thread to sleep
>>> waiting on the first of a set of event sources (mutexes, timers, signal,
>>> console input, etc) to signal.
>
> They are all events.
> You can only wait on either events or sockets (using select).
> There is a socket api to signal an event when data arrives (etc).
> There is also the insane (these days) restriction of 64 events.

Ok.

>> With that in mind would it be good to have some interaction with epoll (and similar calls)?
>
> Or hook something up so that pollwakeup can kick a futex as well
> as waking up poll() and adding an event to epoll().

I guess as FUTEX_FD was already there and was removed, we can stop this discussion.

If there will ever be the need for an async call, I guess an io_uring based one would
be the best...

metze

2021-03-08 16:20:10

by Elizabeth Figura

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/13] Add futex2 syscall

On 3/3/21 6:42 PM, André Almeida wrote:
> ** The wait on multiple problem
>
> The use case lies in the Wine implementation of the Windows NT interface
> WaitMultipleObjects. This Windows API function allows a thread to sleep
> waiting on the first of a set of event sources (mutexes, timers, signal,
> console input, etc) to signal. Considering this is a primitive
> synchronization operation for Windows applications, being able to quickly
> signal events on the producer side, and quickly go to sleep on the
> consumer side is essential for good performance of those running over Wine.

It's probably worth pointing out, for better or for worse, that while this is
*a* use case, it's also limited to an out-of-tree patch set/forked
versions of Wine. I'm currently working on a different approach that
should be upstreamable to Wine proper, as detailed in [1].

[1]
https://lore.kernel.org/lkml/[email protected]/

2021-03-08 17:36:57

by David Laight

Subject: RE: [RFC PATCH v2 00/13] Add futex2 syscall



From: Zebediah Figura
> Sent: 08 March 2021 16:18
>
> On 3/3/21 6:42 PM, André Almeida wrote:
> > ** The wait on multiple problem
> >
> > The use case lies in the Wine implementation of the Windows NT interface
> > WaitMultipleObjects. This Windows API function allows a thread to sleep
> > waiting on the first of a set of event sources (mutexes, timers, signal,
> > console input, etc) to signal. Considering this is a primitive
> > synchronization operation for Windows applications, being able to quickly
> > signal events on the producer side, and quickly go to sleep on the
> > consumer side is essential for good performance of those running over Wine.
>
> It's probably worth pointing out, for better or for worse, while this is
> *a* use case, it's also limited to an out-of-tree patch set/forked
> versions of Wine. I'm currently working on a different approach that
> should be upstreamable to Wine proper, as detailed in [1].
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/

* NtPulseEvent can't work right. We badly emulate it by setting and then
immediately resetting the event, but due to the above gap between poll()
and read(), most threads end up missing the wakeup anyway.

As you stated later, PulseEvent() is completely broken anyway.
At least one of the problems is that in order to complete an async io
(and all io is async) the final 'copy_to_user' must be done in the
context of the initiating thread.
So if the thread is in WaitMultipleObjects (it usually is) and an async io
completes (e.g. data is received on a TCP connection), the thread stops waiting
while the io completion callback is done.
If a PulseEvent happens during that window then it is lost.

Mind you, there was (and maybe still is) a bug in WMO on 64bit Windows
that means the process completely misses io completion callbacks
if (I think) they happen while the process is being scheduled.
There is a loop in WMO that fails to recover because interrupts
are disabled, plus a 30 second timer that unblocks things.
I had to add code to write to the ioapic to request the hardware
interrupt to unblock everything :-)

David
