Hi,
This patchset implements a general ABI to exchange per-thread data
between kernel and user-space. The initial feature implemented is a
cache for the CPU number of the currently running thread in user-space.
This ABI is extensible to add more features in the future.
Benchmarks comparing this approach to a system-call-based getcpu on
ARM show a 44x speedup. On x86-64, they show a 16.5x speedup compared
to executing "lsl" from the vDSO through glibc.
There is a man page in the changelog of patch 1/5, which shows an
example usage of this new system call.
This patchset is sent as an RFC and applies on Linux 4.5. Prior
versions of this patchset were known as the "getcpu_cache" system call.
Feedback is welcome,
Thanks!
Mathieu
Mathieu Desnoyers (5):
Thread-local ABI system call: cache CPU number of running thread
Thread-local ABI cpu_id: ARM resume notifier
Thread-local ABI: wire up ARM system call
Thread-local ABI cpu_id: x86 32/64 resume notifier
Thread-local ABI: wire up x86 32/64 system call
MAINTAINERS | 7 +++
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 3 +-
arch/arm/kernel/signal.c | 1 +
arch/x86/entry/common.c | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/exec.c | 1 +
include/linux/sched.h | 66 +++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/thread_local_abi.h | 83 ++++++++++++++++++++++++++
init/Kconfig | 14 +++++
kernel/Makefile | 1 +
kernel/fork.c | 4 ++
kernel/sched/sched.h | 1 +
kernel/sys_ni.c | 3 +
kernel/thread_local_abi.c | 103 +++++++++++++++++++++++++++++++++
18 files changed, 292 insertions(+), 2 deletions(-)
create mode 100644 include/uapi/linux/thread_local_abi.h
create mode 100644 kernel/thread_local_abi.c
--
2.1.4
Wire up the thread-local ABI system call on 32-bit ARM.
This provides an ABI that improves the speed of a getcpu operation
on ARM by skipping the getcpu system call on the fast path.
The thread-local ABI can be extended to add features in the future.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/arm/include/asm/unistd.h | 2 +-
arch/arm/include/uapi/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 3 ++-
3 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 7b84657..194b699 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -19,7 +19,7 @@
* This may need to be greater than __NR_last_syscall+1 in order to
* account for the padding in the syscall table
*/
-#define __NR_syscalls (392)
+#define __NR_syscalls (396)
#define __ARCH_WANT_STAT64
#define __ARCH_WANT_SYS_GETHOSTNAME
diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 5dd2528..aaa9221 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -418,6 +418,7 @@
#define __NR_membarrier (__NR_SYSCALL_BASE+389)
#define __NR_mlock2 (__NR_SYSCALL_BASE+390)
#define __NR_copy_file_range (__NR_SYSCALL_BASE+391)
+#define __NR_thread_local_abi (__NR_SYSCALL_BASE+392)
/*
* The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index dfc7cd6..d6a0fe9 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -399,8 +399,9 @@
CALL(sys_execveat)
CALL(sys_userfaultfd)
CALL(sys_membarrier)
- CALL(sys_mlock2)
+/* 390 */ CALL(sys_mlock2)
CALL(sys_copy_file_range)
+ CALL(sys_thread_local_abi)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
--
2.1.4
Wire up the thread-local ABI system call on x86 32/64.
This provides an ABI that improves the speed of a getcpu operation
on x86 by removing the need to perform a function call, "lsl"
instruction, or system call on the fast path.
The thread-local ABI can be extended to add features in the future.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index cb713df..e31d5a5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -384,3 +384,4 @@
375 i386 membarrier sys_membarrier
376 i386 mlock2 sys_mlock2
377 i386 copy_file_range sys_copy_file_range
+378 i386 thread_local_abi sys_thread_local_abi
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index dc1040a..6aaddb4b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -333,6 +333,7 @@
324 common membarrier sys_membarrier
325 common mlock2 sys_mlock2
326 common copy_file_range sys_copy_file_range
+327 common thread_local_abi sys_thread_local_abi
#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.1.4
Call the tlabi_cpu_id_handle_notify_resume() function on return to
userspace if the TIF_NOTIFY_RESUME thread flag is set.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/x86/entry/common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 0366374..8dbdde5 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -249,6 +249,7 @@ static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
if (cached_flags & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ tlabi_cpu_id_handle_notify_resume(current);
}
if (cached_flags & _TIF_USER_RETURN_NOTIFY)
--
2.1.4
Expose a new system call allowing each thread to register one userspace
memory area in which to store the CPU number on which the calling thread
is running. Scheduler migration sets the TIF_NOTIFY_RESUME flag on the
current thread. Upon return to user-space, a notify-resume handler
updates the current CPU value within each registered user-space memory
area. User-space can then read the current CPU number directly from
memory.
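As a minimal user-space usage sketch (the names below follow the
proposed uapi header and the man page example further down; the
tlabi_area variable name is only illustrative and error handling is
omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/thread_local_abi.h>

/* One registered structure per thread; the kernel keeps cpu_id current. */
static __thread volatile struct thread_local_abi tlabi_area;

static int tlabi_register(void)
{
        /* tlabi_nr 0, request the cpu_id feature, flags must be 0. */
        return syscall(__NR_thread_local_abi, 0, &tlabi_area,
                       TLABI_FEATURE_CPU_ID, 0);
}

static int32_t tlabi_current_cpu(void)
{
        if (!(tlabi_area.features & TLABI_FEATURE_CPU_ID))
                return sched_getcpu();    /* fallback */
        return tlabi_area.cpu_id;         /* plain read from registered memory */
}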
This thread-local ABI can be extended to add features in the future.
One future feature extension is the restartable critical sections
(percpu atomics) work undertaken by Paul Turner and Andrew Hunter,
which lets the kernel handle restart of critical sections. [1] [2]
This cpu_id cache is an improvement over the current mechanisms
available for reading the current CPU number, and has the following
benefits:
- 44x speedup on ARM vs system call through glibc,
- 20x speedup on x86 compared to calling glibc, which calls the vdso
executing a "lsl" instruction,
- 16x speedup on x86 compared to an inlined "lsl" instruction,
- Unlike vdso approaches, this cached value can be read from inline
assembly, which makes it a useful building block for restartable
sequences.
- The cpu_id cache approach is portable (e.g. ARM), which is not the
case for the lsl-based x86 vdso.
On x86, yet another possible approach would be to use the gs segment
selector to point to user-space per-cpu data. This approach performs
similarly to the cpu id cache, but it has two disadvantages: it is
not portable, and it is incompatible with existing applications already
using the gs segment selector for other purposes.
Benchmarking various approaches for reading the current CPU number:
ARMv7 Processor rev 10 (v7l)
Machine model: Wandboard i.MX6 Quad Board
- Baseline (empty loop): 10.1 ns
- Read CPU from __thread_local_abi.cpu_id: 10.1 ns
- Read CPU from __thread_local_abi.cpu_id (lazy register): 12.4 ns
- glibc 2.19-0ubuntu6.6 getcpu: 445.6 ns
- getcpu system call: 322.2 ns
x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
- Baseline (empty loop): 0.8 ns
- Read CPU from __thread_local_abi.cpu_id: 0.8 ns
- Read CPU from __thread_local_abi.cpu_id (lazy register): 1.6 ns
- Read using gs segment selector: 0.8 ns
- "lsl" inline assembly: 13.0 ns
- glibc 2.19-0ubuntu6 getcpu: 16.5 ns
- getcpu system call: 52.5 ns
- Speed
Running 10 runs of hackbench -l 100000 seems to indicate that the sched
switch impact of this new configuration option is within the noise:
Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
kernel parameter), with a Linux v4.5 defconfig+localyesconfig,
thread-local ABI series applied.
* CONFIG_THREAD_LOCAL_ABI_CPU_ID=n
avg.: 40.42 s
std.dev.: 0.29 s
* CONFIG_THREAD_LOCAL_ABI_CPU_ID=y
avg.: 40.60 s
std.dev.: 0.17 s
- Size
On x86-64, between CONFIG_THREAD_LOCAL_ABI_CPU_ID=n/y, the text size
increase of vmlinux is 640 bytes, and the data size increase of vmlinux
is 512 bytes.
* CONFIG_THREAD_LOCAL_ABI_CPU_ID=n
text data bss dec hex filename
17018635 2762368 1564672 21345675 145b58b vmlinux
* CONFIG_THREAD_LOCAL_ABI_CPU_ID=y
text data bss dec hex filename
17019275 2762880 1564672 21346827 145ba0b vmlinux
[1] https://lwn.net/Articles/650333/
[2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: "H. Peter Anvin" <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Michael Kerrisk <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
Changes since v1:
- Return -1, errno=EINVAL if cpu_cache pointer is not aligned on
sizeof(int32_t).
- Update man page to describe the pointer alignment requirements and
update atomicity guarantees.
- Add MAINTAINERS file GETCPU_CACHE entry.
- Remove dynamic memory allocation: go back to having a single
getcpu_cache entry per thread. Update documentation accordingly.
- Rebased on Linux 4.4.
Changes since v2:
- Introduce a "cmd" argument, along with an enum with GETCPU_CACHE_GET
and GETCPU_CACHE_SET. Introduce a uapi header linux/getcpu_cache.h
defining this enumeration.
- Split resume notifier architecture implementation from the system call
wire up in the following arch-specific patches.
- Man pages updates.
- Handle 32-bit compat pointers.
- Simplify handling of getcpu_cache GETCPU_CACHE_SET compiler barrier:
set the current cpu cache pointer before doing the cache update, and
set it back to NULL if the update fails. Setting it back to NULL on
error ensures that no resume notifier will trigger a SIGSEGV if a
migration happened concurrently.
Changes since v3:
- Fix __user annotations in compat code,
- Update memory ordering comments.
- Rebased on kernel v4.5-rc5.
Changes since v4:
- Inline getcpu_cache_fork, getcpu_cache_execve, and getcpu_cache_exit.
- Add new line between if() and switch() to improve readability.
- Added sched switch benchmarks (hackbench) and size overhead comparison
to change log.
Changes since v5:
- Rename "getcpu_cache" to "thread_local_abi", allowing to extend
this system call to cover future features such as restartable critical
sections. Generalizing this system call ensures that we can add
features similar to the cpu_id field within the same cache-line
without having to track one pointer per feature within the task
struct.
- Add a tlabi_nr parameter to the system call, thus allowing to extend
the ABI beyond the initial 64-byte structure by registering structures
with tlabi_nr greater than 0. The initial ABI structure is associated
with tlabi_nr 0.
- Rebased on kernel v4.5.
Man page associated:
THREAD_LOCAL_ABI(2) Linux Programmer's Manual THREAD_LOCAL_ABI(2)
NAME
thread_local_abi - Shared memory interface between user-space
threads and the kernel
SYNOPSIS
#include <linux/thread_local_abi.h>
int thread_local_abi(uint32_t tlabi_nr, void * tlabi, uint32_t feature_mask, int flags);
DESCRIPTION
The thread_local_abi() system call accelerates some frequent
user-space operations by defining a shared data structure ABI
between each user-space thread and the kernel.
The tlabi_nr argument is the thread-local ABI structure number.
Currently, only tlabi_nr 0 is supported. tlabi_nr 0 expects tlabi
to hold a pointer to a struct thread_local_abi (whose features and
layout are described below), or NULL.
The layout of struct thread_local_abi is as follows:
Structure alignment
This structure needs to be aligned on multiples of 64
bytes.
Structure size
This structure has a fixed size of 64 bytes.
Fields
features
Bitmask of the features enabled for this thread's tlabi_nr
0.
cpu_id
Cache of the CPU number on which the calling thread is run‐
ning.
The tlabi argument is a pointer to the thread-local ABI structure
to be shared between kernel and user-space. If tlabi is NULL, the
currently registered address will be used.
The feature_mask is a bitmask of the features to enable. For
tlabi_nr 0, it is an OR'd mask of the following features:
TLABI_FEATURE_CPU_ID
Cache the CPU number on which the calling thread is running
into the cpu_id field of the struct thread_local_abi struc‐
ture.
The flags argument is currently unused and must be specified as 0.
Typically, a library or application will keep the thread-local ABI
in a thread-local storage variable, or in another memory area
belonging to each thread. It is recommended to perform volatile
reads of the thread-local cache to prevent the compiler from doing
load tearing. An alternative approach is to read the cpu number
cache from inline assembly in a single instruction.
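As an illustration of the single-instruction alternative, here is a
minimal sketch using GNU C inline assembly on x86-64 (it assumes the
__thread_local_abi variable declared as in the EXAMPLE section below;
the function name is only illustrative):

static inline int32_t read_cpu_id_single_insn(void)
{
        int32_t cpu;

        /* One 32-bit load of the cpu_id field; a single instruction
           cannot be torn by the compiler. */
        __asm__ __volatile__ ("movl %1, %0"
                              : "=r" (cpu)
                              : "m" (__thread_local_abi.cpu_id));
        return cpu;
}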
Each thread is responsible for registering its thread-local ABI
structure. Only one thread-local ABI structure address can be
registered per thread for each tlabi_nr number. Once set, the
thread-local ABI address associated with a tlabi_nr number is fixed
for a given thread: re-registering the same address is allowed, but
registering a different address fails.
The symbol __thread_local_abi is recommended to be used across
libraries and applications wishing to register the thread-local
ABI structure for tlabi_nr 0. The attribute "weak" is recommended
when declaring this variable in libraries. Applications can
choose to define their own version of this symbol without the weak
attribute as a performance improvement.
In a typical usage scenario, the thread registering the
thread-local ABI structure will be performing reads from that
structure. It is, however, also allowed to read that structure
from other threads. The thread-local ABI field updates performed
by the kernel provide single-copy atomicity semantics, which
guarantee that other threads performing single-copy atomic reads
of the cpu number cache will always observe a consistent value.
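For instance, here is a minimal sketch of such a cross-thread read
(the peer_tlabi pointer and function name are only illustrative; the
registering thread is assumed to publish the address of its structure
after a successful registration):

/* Published by the registering thread once thread_local_abi() succeeded. */
static volatile struct thread_local_abi *peer_tlabi;

static int32_t read_peer_cpu_id(void)
{
        volatile struct thread_local_abi *t = peer_tlabi;

        if (!t || !(t->features & TLABI_FEATURE_CPU_ID))
                return -1;
        /* Aligned 32-bit read: the kernel's single-copy atomic update
           guarantees a consistent value is observed. */
        return t->cpu_id;
}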
Memory registered as a thread-local ABI structure should never be
deallocated before the thread which registered it exits:
specifically, it should not be freed, and the library containing
the registered thread-local storage should not be dlclose'd.
Violating this constraint may cause a SIGSEGV signal to be
delivered to the thread.
Unregistration of the associated thread-local ABI structures is
implicitly performed when a thread or process exits.
RETURN VALUE
A return value of 0 indicates success. On error, -1 is returned,
and errno is set appropriately.
ERRORS
EINVAL Either flags is non-zero, an unexpected tlabi_nr has been
specified, tlabi contains an address which is not appro‐
priately aligned, or a feature specified in the fea‐
ture_mask is not available.
ENOSYS The thread_local_abi() system call is not implemented by
this kernel.
EFAULT tlabi is an invalid address.
EBUSY The tlabi argument contains a non-NULL address which dif‐
fers from the memory location already registered for this
thread for the given tlabi_nr number.
ENOENT The tlabi argument is NULL, but no memory location is cur‐
rently registered for this thread for the given tlabi_nr
number.
VERSIONS
The thread_local_abi() system call was added in Linux 4.X (TODO).
CONFORMING TO
thread_local_abi() is Linux-specific.
EXAMPLE
The following code uses the thread_local_abi() system call to keep
a thread-local storage variable up to date with the current CPU
number, with a fallback on sched_getcpu(3) if the cache is not
available. For simplicity of this example, it is done in main(),
but multithreaded programs would need to invoke thread_local_abi()
from each program thread.
#define _GNU_SOURCE
#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <stdint.h>
#include <sched.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <linux/thread_local_abi.h>
static inline int
thread_local_abi(uint32_t tlabi_nr,
volatile struct thread_local_abi *tlabi,
uint32_t feature_mask, int flags)
{
return syscall(__NR_thread_local_abi, tlabi_nr, tlabi,
feature_mask, flags);
}
/*
* __thread_local_abi is recommended as symbol name for the
* thread-local ABI. Weak attribute is recommended when declaring
* this variable in libraries.
*/
__thread __attribute__((weak))
volatile struct thread_local_abi __thread_local_abi;
static int
tlabi_cpu_id_register(void)
{
if (thread_local_abi(0, &__thread_local_abi,
TLABI_FEATURE_CPU_ID, 0))
return -1;
return 0;
}
static int32_t
read_cpu_id(void)
{
if (!(__thread_local_abi.features & TLABI_FEATURE_CPU_ID))
return sched_getcpu();
return __thread_local_abi.cpu_id;
}
int
main(int argc, char **argv)
{
if (tlabi_cpu_id_register()) {
fprintf(stderr,
"Unable to initialize thread-local ABI cpu_id feature.\n");
fprintf(stderr, "Using sched_getcpu() as fallback.\n");
}
printf("Current CPU number: %d\n", read_cpu_id());
printf("TLABI features: 0x%x\n", __thread_local_abi.features);
exit(EXIT_SUCCESS);
}
SEE ALSO
sched_getcpu(3)
Linux 2016-01-27 THREAD_LOCAL_ABI(2)
---
MAINTAINERS | 7 +++
fs/exec.c | 1 +
include/linux/sched.h | 66 ++++++++++++++++++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/thread_local_abi.h | 83 +++++++++++++++++++++++++++
init/Kconfig | 14 +++++
kernel/Makefile | 1 +
kernel/fork.c | 4 ++
kernel/sched/sched.h | 1 +
kernel/sys_ni.c | 3 +
kernel/thread_local_abi.c | 103 ++++++++++++++++++++++++++++++++++
11 files changed, 284 insertions(+)
create mode 100644 include/uapi/linux/thread_local_abi.h
create mode 100644 kernel/thread_local_abi.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 6ee06ea..9b5b613 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4787,6 +4787,13 @@ M: Joe Perches <[email protected]>
S: Maintained
F: scripts/get_maintainer.pl
+THREAD LOCAL ABI SUPPORT
+M: Mathieu Desnoyers <[email protected]>
+L: [email protected]
+S: Supported
+F: kernel/thread_local_abi.c
+F: include/uapi/linux/thread_local_abi.h
+
GFS2 FILE SYSTEM
M: Steven Whitehouse <[email protected]>
M: Bob Peterson <[email protected]>
diff --git a/fs/exec.c b/fs/exec.c
index dcd4ac7..b41903c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1594,6 +1594,7 @@ static int do_execveat_common(int fd, struct filename *filename,
/* execve succeeded */
current->fs->in_exec = 0;
current->in_execve = 0;
+ thread_local_abi_execve(current);
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a10494a..7dcc910 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -59,6 +59,7 @@ struct sched_param {
#include <linux/gfp.h>
#include <linux/magic.h>
#include <linux/cgroup-defs.h>
+#include <linux/thread_local_abi.h>
#include <asm/processor.h>
@@ -1830,6 +1831,10 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_THREAD_LOCAL_ABI
+ uint32_t tlabi_features;
+ struct thread_local_abi __user *tlabi;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
@@ -3207,4 +3212,65 @@ static inline unsigned long rlimit_max(unsigned int limit)
return task_rlimit_max(current, limit);
}
+#ifdef CONFIG_THREAD_LOCAL_ABI
+/*
+ * If parent process has a thread-local ABI, the child inherits. Only
+ * applies when forking a process, not a thread.
+ */
+static inline void thread_local_abi_fork(struct task_struct *t)
+{
+ t->tlabi_features = current->tlabi_features;
+ t->tlabi = current->tlabi;
+}
+static inline void thread_local_abi_execve(struct task_struct *t)
+{
+ t->tlabi_features = 0;
+ t->tlabi = NULL;
+}
+static inline void thread_local_abi_exit(struct task_struct *t)
+{
+ t->tlabi_features = 0;
+ t->tlabi = NULL;
+}
+#else
+static inline void thread_local_abi_fork(struct task_struct *t)
+{
+}
+static inline void thread_local_abi_execve(struct task_struct *t)
+{
+}
+static inline void thread_local_abi_exit(struct task_struct *t)
+{
+}
+#endif
+
+#ifdef CONFIG_THREAD_LOCAL_ABI_CPU_ID
+void __tlabi_cpu_id_handle_notify_resume(struct task_struct *t);
+static inline void tlabi_cpu_id_set_notify_resume(struct task_struct *t)
+{
+ if (t->tlabi_features & TLABI_FEATURE_CPU_ID)
+ set_tsk_thread_flag(t, TIF_NOTIFY_RESUME);
+}
+static inline void tlabi_cpu_id_handle_notify_resume(struct task_struct *t)
+{
+ if (t->tlabi_features & TLABI_FEATURE_CPU_ID)
+ __tlabi_cpu_id_handle_notify_resume(t);
+}
+static inline bool tlabi_cpu_id_feature_available(void)
+{
+ return true;
+}
+#else
+static inline void tlabi_cpu_id_set_notify_resume(struct task_struct *t)
+{
+}
+static inline void tlabi_cpu_id_handle_notify_resume(struct task_struct *t)
+{
+}
+static inline bool tlabi_cpu_id_feature_available(void)
+{
+ return false;
+}
+#endif
+
#endif
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index ebd10e6..96f6f32 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -398,6 +398,7 @@ header-y += tcp_metrics.h
header-y += telephony.h
header-y += termios.h
header-y += thermal.h
+header-y += thread_local_abi.h
header-y += time.h
header-y += times.h
header-y += timex.h
diff --git a/include/uapi/linux/thread_local_abi.h b/include/uapi/linux/thread_local_abi.h
new file mode 100644
index 0000000..48e685a
--- /dev/null
+++ b/include/uapi/linux/thread_local_abi.h
@@ -0,0 +1,83 @@
+#ifndef _UAPI_LINUX_THREAD_LOCAL_ABI_H
+#define _UAPI_LINUX_THREAD_LOCAL_ABI_H
+
+/*
+ * linux/thread_local_abi.h
+ *
+ * Thread-local ABI system call API
+ *
+ * Copyright (c) 2015-2016 Mathieu Desnoyers <[email protected]>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifdef __KERNEL__
+# include <linux/types.h>
+#else /* #ifdef __KERNEL__ */
+# include <stdint.h>
+#endif /* #else #ifdef __KERNEL__ */
+
+/*
+ * The initial thread-local ABI shared structure is associated with
+ * the tlabi_nr parameter value 0 passed to the thread_local_abi system
+ * call. It will be henceforth referred to as "tlabi 0".
+ *
+ * This tlabi 0 structure is strictly required to be aligned on 64
+ * bytes. The tlabi 0 structure has a fixed length of 64 bytes. Each of
+ * its fields should be naturally aligned so no padding is necessary.
+ * The size of tlabi 0 structure is fixed to 64 bytes to ensure that
+ * neither the kernel nor user-space have to perform size checks. The
+ * choice of 64 bytes matches the L1 cache size on common architectures.
+ *
+ * If more fields are needed than the available 64 bytes, a new tlabi
+ * number should be reserved, associated to its own shared structure
+ * layout.
+ */
+#define TLABI_LEN 64
+
+enum thread_local_abi_feature {
+ TLABI_FEATURE_NONE = 0,
+ TLABI_FEATURE_CPU_ID = (1 << 0),
+};
+
+struct thread_local_abi {
+ /*
+ * Thread-local ABI features field.
+ * Updated by the kernel, and read by user-space with
+ * single-copy atomicity semantics. Aligned on 32-bit.
+ * This field contains a mask of enabled features.
+ */
+ uint32_t features;
+
+ /*
+ * Thread-local ABI cpu_id field.
+ * Updated by the kernel, and read by user-space with
+ * single-copy atomicity semantics. Aligned on 32-bit.
+ */
+ uint32_t cpu_id;
+
+ /*
+ * Add new fields here, before padding. Increment TLABI_BYTES_USED
+ * accordingly.
+ */
+#define TLABI_BYTES_USED 8
+ char padding[TLABI_LEN - TLABI_BYTES_USED];
+} __attribute__ ((aligned(TLABI_LEN)));
+
+#endif /* _UAPI_LINUX_THREAD_LOCAL_ABI_H */
diff --git a/init/Kconfig b/init/Kconfig
index 2232080..3f64a2f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1589,6 +1589,20 @@ config MEMBARRIER
If unsure, say Y.
+config THREAD_LOCAL_ABI
+ bool
+
+config THREAD_LOCAL_ABI_CPU_ID
+ bool "Enable thread-local CPU number cache" if EXPERT
+ default y
+ select THREAD_LOCAL_ABI
+ help
+ Enable the thread-local CPU number cache. It provides a
+ user-space cache for the current CPU number value, which
+ speeds up getting the current CPU number from user-space.
+
+ If unsure, say Y.
+
config EMBEDDED
bool "Embedded system"
option allnoconfig_y
diff --git a/kernel/Makefile b/kernel/Makefile
index 53abf00..327fbd9 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -103,6 +103,7 @@ obj-$(CONFIG_TORTURE_TEST) += torture.o
obj-$(CONFIG_MEMBARRIER) += membarrier.o
obj-$(CONFIG_HAS_IOMEM) += memremap.o
+obj-$(CONFIG_THREAD_LOCAL_ABI) += thread_local_abi.o
$(obj)/configs.o: $(obj)/config_data.h
diff --git a/kernel/fork.c b/kernel/fork.c
index 2e391c7..055f37d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -252,6 +252,7 @@ void __put_task_struct(struct task_struct *tsk)
WARN_ON(tsk == current);
cgroup_free(tsk);
+ thread_local_abi_exit(tsk);
task_numa_free(tsk);
security_task_free(tsk);
exit_creds(tsk);
@@ -1552,6 +1553,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
*/
copy_seccomp(p);
+ if (!(clone_flags & CLONE_THREAD))
+ thread_local_abi_fork(p);
+
/*
* Process group and session signals need to be delivered to just the
* parent before the fork or both the parent and the child after the
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 10f1637..a67d732 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -971,6 +971,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
set_task_rq(p, cpu);
#ifdef CONFIG_SMP
+ tlabi_cpu_id_set_notify_resume(p);
/*
* After ->cpu is set up to a new value, task_rq_lock(p, ...) can be
* successfuly executed on another CPU. We must ensure that updates of
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2c5e3a8..ce1f466 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -250,3 +250,6 @@ cond_syscall(sys_execveat);
/* membarrier */
cond_syscall(sys_membarrier);
+
+/* thread-local ABI */
+cond_syscall(sys_thread_local_abi);
diff --git a/kernel/thread_local_abi.c b/kernel/thread_local_abi.c
new file mode 100644
index 0000000..91adbb8
--- /dev/null
+++ b/kernel/thread_local_abi.c
@@ -0,0 +1,103 @@
+/*
+ * Copyright (C) 2015-2016 Mathieu Desnoyers <[email protected]>
+ *
+ * Thread-local ABI system call
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+#include <linux/syscalls.h>
+#include <linux/compat.h>
+#include <linux/thread_local_abi.h>
+
+#define TLABI_FEATURES_UNKNOWN (~TLABI_FEATURE_CPU_ID)
+
+/*
+ * This resume handler should always be executed between a migration
+ * triggered by preemption and return to user-space.
+ */
+void __tlabi_cpu_id_handle_notify_resume(struct task_struct *t)
+{
+ if (unlikely(t->flags & PF_EXITING))
+ return;
+ if (put_user(raw_smp_processor_id(), &t->tlabi->cpu_id))
+ force_sig(SIGSEGV, t);
+}
+
+/*
+ * sys_thread_local_abi - setup thread-local ABI for caller thread
+ */
+SYSCALL_DEFINE4(thread_local_abi, uint32_t, tlabi_nr, void *, _tlabi,
+ uint32_t, feature_mask, int, flags)
+{
+ struct thread_local_abi __user *tlabi =
+ (struct thread_local_abi __user *)_tlabi;
+ uint32_t orig_feature_mask;
+
+ /* Sanity check on size of ABI structure. */
+ BUILD_BUG_ON(sizeof(struct thread_local_abi) != TLABI_LEN);
+
+ if (unlikely(flags || tlabi_nr))
+ return -EINVAL;
+ /* Ensure requested features are available. */
+ if (feature_mask & TLABI_FEATURES_UNKNOWN)
+ return -EINVAL;
+ if ((feature_mask & TLABI_FEATURE_CPU_ID)
+ && !tlabi_cpu_id_feature_available())
+ return -EINVAL;
+
+ if (tlabi) {
+ if (current->tlabi) {
+ /*
+ * If tlabi is already registered, check
+ * whether the provided address differs from the
+ * prior one.
+ */
+ if (current->tlabi != tlabi)
+ return -EBUSY;
+ } else {
+ /*
+ * If there was no tlabi previously registered,
+ * we need to ensure the provided tlabi is
+ * properly aligned and valid.
+ */
+ if (!IS_ALIGNED((unsigned long)tlabi, TLABI_LEN))
+ return -EINVAL;
+ if (!access_ok(VERIFY_WRITE, tlabi,
+ sizeof(struct thread_local_abi)))
+ return -EFAULT;
+ current->tlabi = tlabi;
+ }
+ } else {
+ if (!current->tlabi)
+ return -ENOENT;
+ }
+
+ /* Update feature mask for current thread. */
+ orig_feature_mask = current->tlabi_features;
+ current->tlabi_features |= feature_mask;
+ if (put_user(current->tlabi_features, &current->tlabi->features)) {
+ current->tlabi = NULL;
+ current->tlabi_features = 0;
+ return -EFAULT;
+ }
+
+ /*
+ * If the CPU_ID feature was previously inactive, and has just
+ * been requested, ensure the cpu_id field is updated before
+ * returning to user-space.
+ */
+ if (!(orig_feature_mask & TLABI_FEATURE_CPU_ID))
+ tlabi_cpu_id_set_notify_resume(current);
+ return 0;
+}
--
2.1.4
Call the tlabi_cpu_id_handle_notify_resume() function on return to
userspace if the TIF_NOTIFY_RESUME thread flag is set.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Russell King <[email protected]>
CC: Catalin Marinas <[email protected]>
CC: Will Deacon <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Paul Turner <[email protected]>
CC: Andrew Hunter <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Andy Lutomirski <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Dave Watson <[email protected]>
CC: Chris Lameter <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Ben Maurer <[email protected]>
CC: Steven Rostedt <[email protected]>
CC: "Paul E. McKenney" <[email protected]>
CC: Josh Triplett <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: Andrew Morton <[email protected]>
CC: Boqun Feng <[email protected]>
CC: [email protected]
---
arch/arm/kernel/signal.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c
index 7b8f214..95418d3 100644
--- a/arch/arm/kernel/signal.c
+++ b/arch/arm/kernel/signal.c
@@ -594,6 +594,7 @@ do_work_pending(struct pt_regs *regs, unsigned int thread_flags, int syscall)
} else {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
+ tlabi_cpu_id_handle_notify_resume(current);
}
}
local_irq_disable();
--
2.1.4
On 04/04/16 10:01, Mathieu Desnoyers wrote:
>
> Changes since v5:
> - Rename "getcpu_cache" to "thread_local_abi", allowing to extend
> this system call to cover future features such as restartable critical
> sections. Generalizing this system call ensures that we can add
> features similar to the cpu_id field within the same cache-line
> without having to track one pointer per feature within the task
> struct.
> - Add a tlabi_nr parameter to the system call, thus allowing to extend
> the ABI beyond the initial 64-byte structure by registering structures
> with tlabi_nr greater than 0. The initial ABI structure is associated
> with tlabi_nr 0.
> - Rebased on kernel v4.5.
>
This seems absolutely insanely complex, both for the kernel and for
userspace.
A much saner way would be for userspace to query the kernel for the size
of the structure; userspace then allocates the maximum of what it knows
and what the kernel knows. That way, the kernel doesn't need to
conditionalize its accesses to user space, and libc doesn't need to
conditionalize its accesses either.
-hpa
----- On Apr 4, 2016, at 1:11 PM, H. Peter Anvin [email protected] wrote:
> On 04/04/16 10:01, Mathieu Desnoyers wrote:
>>
>> Changes since v5:
>> - Rename "getcpu_cache" to "thread_local_abi", allowing to extend
>> this system call to cover future features such as restartable critical
>> sections. Generalizing this system call ensures that we can add
>> features similar to the cpu_id field within the same cache-line
>> without having to track one pointer per feature within the task
>> struct.
>> - Add a tlabi_nr parameter to the system call, thus allowing to extend
>> the ABI beyond the initial 64-byte structure by registering structures
>> with tlabi_nr greater than 0. The initial ABI structure is associated
>> with tlabi_nr 0.
>> - Rebased on kernel v4.5.
>>
>
> This seems absolutely insanely complex, both for the kernel and for
> userspace.
>
> A much saner way would be for userspace to query the kernel for the size
> of the structure; userspace then allocates the maximum of what it knows
> and what the kernel knows. That way, the kernel doesn't need to
> conditionalize its accesses to user space, and libc doesn't need to
> conditionalize its accesses either.
If we go down the route of having user-space dynamically allocating
the structure, my understanding is that we need to associate the
user-space TLS symbol with a pointer to the structure, and test for
NULL each time, thus requiring user-space to touch one more cache-line
(read the pointer), and add one conditional per user-space fast-path,
compared to a statically-sized definition approach. Or perhaps you have
some clever trick in mind for "allocation by user-space" that I'm missing ?
Besides the NULL pointer check, another issue is feature detection.
As we extend the feature set, my proposal has a 32-bit features
mask at the beginning of the TLS structure, within the same
cache-line containing the structure fields, so user-space can quickly
check whether the required feature is enabled (adds one conditional
on the user-space fast path, but does not require to touch another
cache-line). This allows adding new features without requiring to
reserve the value "0" within each field of the structure to mean
"feature unavailable", which I find terminally unaesthetic.
I propose here a fixed-size 64 bytes layout for the first structure,
for which a 32-bit feature mask should be enough. If we ever fill
up these 64 bytes, we can then use the following tlabi_nr number (1),
which will define its own structure size and feature mask. This
seems like a good compromise between fast-path speed, feature detection
flexibility, optimal use of cache-lines, and extensibility.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
----- On Apr 4, 2016, at 3:46 PM, Mathieu Desnoyers [email protected] wrote:
> ----- On Apr 4, 2016, at 1:11 PM, H. Peter Anvin [email protected] wrote:
>
>> On 04/04/16 10:01, Mathieu Desnoyers wrote:
>>>
>>> Changes since v5:
>>> - Rename "getcpu_cache" to "thread_local_abi", allowing to extend
>>> this system call to cover future features such as restartable critical
>>> sections. Generalizing this system call ensures that we can add
>>> features similar to the cpu_id field within the same cache-line
>>> without having to track one pointer per feature within the task
>>> struct.
>>> - Add a tlabi_nr parameter to the system call, thus allowing to extend
>>> the ABI beyond the initial 64-byte structure by registering structures
>>> with tlabi_nr greater than 0. The initial ABI structure is associated
>>> with tlabi_nr 0.
>>> - Rebased on kernel v4.5.
>>>
>>
>> This seems absolutely insanely complex, both for the kernel and for
>> userspace.
>>
>> A much saner way would be for userspace to query the kernel for the size
>> of the structure; userspace then allocates the maximum of what it knows
>> and what the kernel knows. That way, the kernel doesn't need to
>> conditionalize its accesses to user space, and libc doesn't need to
>> conditionalize its accesses either.
>
> If we go down the route of having user-space dynamically allocating
> the structure, my understanding is that we need to associate the
> user-space TLS symbol with a pointer to the structure, and test for
> NULL each time, thus requiring user-space to touch one more cache-line
> (read the pointer), and add one conditional per user-space fast-path,
> compared to a statically-sized definition approach. Or perhaps you have
> some clever trick in mind for "allocation by user-space" that I'm missing ?
>
> Besides the NULL pointer check, another issue is feature detection.
> As we extend the feature set, my proposal has a 32-bit features
> mask at the beginning of the TLS structure, within the same
> cache-line containing the structure fields, so user-space can quickly
> check whether the required feature is enabled (adds one conditional
> on the user-space fast path, but does not require to touch another
> cache-line). This allows adding new features without requiring to
> reserve the value "0" within each field of the structure to mean
> "feature unavailable", which I find terminally unaesthetic.
>
> I propose here a fixed-size 64 bytes layout for the first structure,
> for which a 32-bit feature mask should be enough. If we ever fill
> up these 64 bytes, we can then use the following tlabi_nr number (1),
> which will define its own structure size and feature mask. This
> seems like a good compromise between fast-path speed, feature detection
> flexibility, optimal use of cache-lines, and extensibility.
Moreover, the feature set that the application knows about, glibc
knows about, and the kernel knows about are three different things.
My intent here is to have glibc stay out of the way as much as possible,
since this is really an interface between various applications/libraries
and the kernel.
Even if glibc allocates a structure large enough for the union of
the features it knows about and the features the kernel implements,
the application could be built against kernel headers that expose
more features than glibc knows about, and would therefore need to
have a structure length check, for an added branch on the fast path
if we dynamically allocate the tlabi structure.
A statically-sized structure allows application and libraries to
skip pointer load, NULL checks, and structure length checks on
the user-space fast-path.
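To make the comparison concrete, here is a rough sketch of the two
user-space fast paths (names are only illustrative; the static
variant assumes the declarations from the man page example, the
dynamic variant assumes user-space allocated the structure and
recorded its size, plus <stddef.h> for offsetof):

/* Statically-sized TLS structure: one TLS access plus one feature check. */
static inline int32_t cpu_id_static(void)
{
        if (!(__thread_local_abi.features & TLABI_FEATURE_CPU_ID))
                return sched_getcpu();
        return __thread_local_abi.cpu_id;
}

/* Dynamically-allocated structure: extra pointer load, NULL check, and
   length check before the field can be touched. */
extern __thread volatile struct thread_local_abi *tlabi_ptr;
extern __thread size_t tlabi_len;

static inline int32_t cpu_id_dynamic(void)
{
        if (!tlabi_ptr || tlabi_len <
            offsetof(struct thread_local_abi, cpu_id) + sizeof(int32_t))
                return sched_getcpu();
        return tlabi_ptr->cpu_id;
}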
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On 04/04/2016 10:48 PM, Mathieu Desnoyers wrote:
> Moreover, the feature set that the application knows about, glibc
> knows about, and the kernel knows about are three different things.
> My intent here is to have glibc stay out of the way as much as possible,
> since this is really an interface between various applications/libraries
> and the kernel.
Surely glibc can allocate the space based on what is advertised as
needed by the kernel? Why would it limit itself to what is supported by
the kernel headers it is compiled against if the actual size can be
queried from the kernel?
Florian
On Tue, Apr 05, 2016 at 06:02:25PM +0200, Florian Weimer wrote:
> On 04/04/2016 10:48 PM, Mathieu Desnoyers wrote:
>
> > Moreover, the feature set that the application knows about, glibc
> > knows about, and the kernel knows about are three different things.
> > My intent here is to have glibc stay out of the way as much as possible,
> > since this is really an interface between various applications/libraries
> > and the kernel.
>
> Surely glibc can allocate the space based on what is advertised as
> needed by the kernel? Why would it limit itself to what is supported by
> the kernel headers it is compiled against if the actual size can be
> queried from the kernel?
I guess the question is; can we do thread local variable arrays like:
__thread uint32_t[x]; /* with x being a runtime constant */
Because then we can do:
__thread struct thread_local_abi tla;
where sizeof(struct thread_local_abi) is a runtime variable.
Without that we cannot have this thread-local-abi structure be part of
the immediately addressable TLS space. That is, we then need a pointer
like:
__thread struct thread_local_abi *tla;
and every usage will need the extra pointer deref.
Because ideally this structure would be part of the initial (glibc) TCB
with fixed offset etc.
On 04/05/2016 06:47 PM, Peter Zijlstra wrote:
> On Tue, Apr 05, 2016 at 06:02:25PM +0200, Florian Weimer wrote:
>> On 04/04/2016 10:48 PM, Mathieu Desnoyers wrote:
>>
>>> Moreover, the feature set that the application knows about, glibc
>>> knows about, and the kernel knows about are three different things.
>>> My intent here is to have glibc stay out of the way as much as possible,
>>> since this is really an interface between various applications/libraries
>>> and the kernel.
>>
>> Surely glibc can allocate the space based on what is advertised as
>> needed by the kernel? Why would it limit itself to what is supported by
>> the kernel headers it is compiled against if the actual size can be
>> queried from the kernel?
>
> I guess the question is; can we do thread local variable arrays like:
>
> __thread uint32_t[x]; /* with x being a runtime constant */
>
> Because then we can do:
>
> __thread struct thread_local_abi tla;
>
> where sizeof(struct thread_local_abi) is a runtime variable.
It's slightly complicated. ELF TLS in the GNU ABI will give you a
static offset only with static linking.
> Without that we cannot have this thread-local-abi structure be part of
> the immediately addressable TLS space. That is, we then need a pointer
> like:
>
> __thread struct thread_local_abi *tla;
>
> and every usage will need the extra pointer deref.
The offset relative to the base will be dynamic anyway and need an extra
load (which can be hoisted out of loops etc., but it's still there in
some cases).
> Because ideally this structure would be part of the initial (glibc) TCB
> with fixed offset etc.
This is not possible because we have layering violations and code
assumes it knows the precise size of the glibc TCB. I think Address
Sanitizer is in this category. This means we cannot adjust the TCB size
based on the kernel headers used to compile glibc, and there will have
to be some indirection.
Florian
On Thu, Apr 07, 2016 at 11:01:25AM +0200, Florian Weimer wrote:
> > Because ideally this structure would be part of the initial (glibc) TCB
> > with fixed offset etc.
>
> This is not possible because we have layering violations and code
> assumes it knows the precise of the glibc TCB. I think Address
> Sanitizer is in this category. This means we cannot adjust the TCB size
> based on the kernel headers used to compile glibc, and there will have
> to be some indirection.
So with the proposed fixed sized object it would work, right?
Which is part of the reason its being proposed as a fixed sized object.
On 04/07/2016 12:31 PM, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 11:01:25AM +0200, Florian Weimer wrote:
>>> Because ideally this structure would be part of the initial (glibc) TCB
>>> with fixed offset etc.
>>
>> This is not possible because we have layering violations and code
>> assumes it knows the precise of the glibc TCB. I think Address
>> Sanitizer is in this category. This means we cannot adjust the TCB size
>> based on the kernel headers used to compile glibc, and there will have
>> to be some indirection.
>
> So with the proposed fixed sized object it would work, right?
I didn't see a proposal for a fixed size buffer, in the sense that the
size of struct sockaddr_in is fixed.
Florian
On 04/04/2016 07:01 PM, Mathieu Desnoyers wrote:
> NAME
> thread_local_abi - Shared memory interface between user-space
> threads and the kernel
We already have set_robust_list, which is conceptually similar.
Florian
On Thu, Apr 07, 2016 at 12:39:21PM +0200, Florian Weimer wrote:
> On 04/07/2016 12:31 PM, Peter Zijlstra wrote:
> > On Thu, Apr 07, 2016 at 11:01:25AM +0200, Florian Weimer wrote:
> >>> Because ideally this structure would be part of the initial (glibc) TCB
> >>> with fixed offset etc.
> >>
> >> This is not possible because we have layering violations and code
> >> assumes it knows the precise of the glibc TCB. I think Address
> >> Sanitizer is in this category. This means we cannot adjust the TCB size
> >> based on the kernel headers used to compile glibc, and there will have
> >> to be some indirection.
> >
> > So with the proposed fixed sized object it would work, right?
>
> I didn't see a proposal for a fixed size buffer, in the sense that the
> size of struct sockaddr_in is fixed.
This thing proposed a single 64byte structure (with the possibility of
eventually adding more 64byte structures). Basically:
struct tlabi {
union {
__u8[64] __foo;
struct {
/* fields go here */
};
};
} __aligned__(64);
People objected against the fixed size scheme, but it being possible to
get a fixed TCB offset and reduce indirections is a big win IMO.
On 04/07/2016 01:19 PM, Peter Zijlstra wrote:
> On Thu, Apr 07, 2016 at 12:39:21PM +0200, Florian Weimer wrote:
>> On 04/07/2016 12:31 PM, Peter Zijlstra wrote:
>>> On Thu, Apr 07, 2016 at 11:01:25AM +0200, Florian Weimer wrote:
>>>>> Because ideally this structure would be part of the initial (glibc) TCB
>>>>> with fixed offset etc.
>>>>
>>>> This is not possible because we have layering violations and code
>>>> assumes it knows the precise of the glibc TCB. I think Address
>>>> Sanitizer is in this category. This means we cannot adjust the TCB size
>>>> based on the kernel headers used to compile glibc, and there will have
>>>> to be some indirection.
>>>
>>> So with the proposed fixed sized object it would work, right?
>>
>> I didn't see a proposal for a fixed size buffer, in the sense that the
>> size of struct sockaddr_in is fixed.
>
> This thing proposed a single 64byte structure (with the possibility of
> eventually adding more 64byte structures). Basically:
>
> struct tlabi {
> union {
> __u8[64] __foo;
> struct {
> /* fields go here */
> };
> };
> } __aligned__(64);
That's not really “fixed size” as far as an ABI is concerned, due to the
possibility of future extensions.
> People objected against the fixed size scheme, but it being possible to
> get a fixed TCB offset and reduce indirections is a big win IMO.
It's a difficult trade-off. It's not an indirection as such; it avoids
loading the dynamic TLS offset.
Let me repeat that the ELF TLS GNU ABI has very limited support for
static offsets at present, and it is difficult to make them available
more widely without code generation at run time (in the form of text
relocations, but still).
Florian
On Thu, Apr 07, 2016 at 02:03:53PM +0200, Florian Weimer wrote:
> > struct tlabi {
> > union {
> > __u8[64] __foo;
> > struct {
> > /* fields go here */
> > };
> > };
> > } __aligned__(64);
>
> That's not really “fixed size” as far as an ABI is concerned, due to the
> possibility of future extensions.
sizeof(struct tlabi) is always the same, right? How is that not fixed?
> > People objected against the fixed size scheme, but it being possible to
> > get a fixed TCB offset and reduce indirections is a big win IMO.
>
> It's a difficult trade-off. It's not an indirection as such, it's avoid
> loading the dynamic TLS offset.
What we _want_ is being able to use %[gf]s:offset and have it work (I
forever forget which segment register userspace TLS uses).
> Let me repeat that the ELF TLS GNU ABI has very limited support for
> static offsets at present, and it is difficult to make them available
> more widely without code generation at run time (in the form of text
> relocations, but still).
Do you have a pointer to something I can read? Because I'm clearly not
understanding the full issue here.
----- On Apr 7, 2016, at 8:03 AM, Florian Weimer [email protected] wrote:
> On 04/07/2016 01:19 PM, Peter Zijlstra wrote:
>> On Thu, Apr 07, 2016 at 12:39:21PM +0200, Florian Weimer wrote:
>>> On 04/07/2016 12:31 PM, Peter Zijlstra wrote:
>>>> On Thu, Apr 07, 2016 at 11:01:25AM +0200, Florian Weimer wrote:
>>>>>> Because ideally this structure would be part of the initial (glibc) TCB
>>>>>> with fixed offset etc.
>>>>>
>>>>> This is not possible because we have layering violations and code
>>>>> assumes it knows the precise of the glibc TCB. I think Address
>>>>> Sanitizer is in this category. This means we cannot adjust the TCB size
>>>>> based on the kernel headers used to compile glibc, and there will have
>>>>> to be some indirection.
>>>>
>>>> So with the proposed fixed sized object it would work, right?
>>>
>>> I didn't see a proposal for a fixed size buffer, in the sense that the
>>> size of struct sockaddr_in is fixed.
>>
>> This thing proposed a single 64byte structure (with the possibility of
>> eventually adding more 64byte structures). Basically:
>>
>> struct tlabi {
>> union {
>> __u8[64] __foo;
>> struct {
>> /* fields go here */
>> };
>> };
>> } __aligned__(64);
>
> That's not really “fixed size” as far as an ABI is concerned, due to the
> possibility of future extensions.
Hi Florian,
Let me try to spell out how I'm proposing to combine a
fixed-size structure with future extensions.
I understand that this trick might be a bit counter-
intuitive.
Initially, we define a fixed-size struct tlabi, whose
length is 64 bytes. It is zero-padded, and will never be
extended beyond 64 bytes. When we register it through the
system call, we pass value 0 as the tlabi_nr parameter.
So far, the kernel only has to track a single pointer
and a 32-bit features mask per thread.
If we ever fill up those 64 bytes, then we need to assign
a new struct tlabi_1. Its length may also be fixed at 64 bytes,
or another size that we can decide at that time. When userspace
registers this new structure, it will pass value 1 as the
tlabi_nr parameter. At that point, the kernel will need to
track two pointers per thread, one for tlabi_nr=0 and one for
tlabi_nr=1. However, the kernel can combine the features_mask
of the two tlabi structures internally into a single uint64_t
bitmask per thread, and then we can extend this to a larger
bitmask if we ever have more than 64 features.
Now the question is: Peter Anvin claims that this scheme is
too complex, and that we should just have a dynamically-sized
area, whose size is the max between the size known by the
kernel and the size known by glibc. I'm trying to figure out
whether we can do that without adding NULL pointer checks, size
checks, and all sorts of extra code to the user-space fast path;
if we cannot, that would justify going down the fixed-size
structure route.
Hopefully my summary here helps clarify a few points.
Thanks!
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On 04/07/2016 02:25 PM, Peter Zijlstra wrote:
> Do you have a pointer to something I can read? Because I'm clearly not
> understanding the full issue here.
I believe the canonical reference still is this document:
<https://www.akkadia.org/drepper/tls.pdf>
For newer architectures (ppc64le, aarch64), you'll have to check the
psABI supplements.
Florian
----- On Apr 7, 2016, at 8:25 AM, Peter Zijlstra [email protected] wrote:
> On Thu, Apr 07, 2016 at 02:03:53PM +0200, Florian Weimer wrote:
>> > struct tlabi {
>> > union {
>> > __u8[64] __foo;
>> > struct {
>> > /* fields go here */
>> > };
>> > };
>> > } __aligned__(64);
>>
>> That's not really “fixed size” as far as an ABI is concerned, due to the
>> possibility of future extensions.
>
> sizeof(struct tlabi) is always the same, right? How is that not fixed?
>
>> > People objected against the fixed size scheme, but it being possible to
>> > get a fixed TCB offset and reduce indirections is a big win IMO.
>>
>> It's a difficult trade-off. It's not an indirection as such, it's avoid
>> loading the dynamic TLS offset.
>
> What we _want_ is being able to use %[gf]s:offset and have it work (I
> forever forget which segment register userspace TLS uses).
>
>> Let me repeat that the ELF TLS GNU ABI has very limited support for
>> static offsets at present, and it is difficult to make them available
>> more widely without code generation at run time (in the form of text
>> relocations, but still).
>
> Do you have a pointer to something I can read? Because I'm clearly not
> understanding the full issue here.
For what it is worth, here are a couple of objdump snippets of my
test program without and with -fPIC:
* Compiled with -O2, *without* -fPIC, x86-64:
__thread __attribute__((weak)) volatile struct thread_local_abi __thread_local_abi;
static
int32_t read_cpu_id(void)
{
if (unlikely(!(__thread_local_abi.features & TLABI_FEATURE_CPU_ID)))
40064e: 64 8b 04 25 c0 ff ff mov %fs:0xffffffffffffffc0,%eax
400655: ff
400656: a8 01 test $0x1,%al
400658: 74 71 je 4006cb <main+0xab>
return sched_getcpu();
return __thread_local_abi.cpu_id;
40065a: 64 8b 14 25 c4 ff ff mov %fs:0xffffffffffffffc4,%edx
400661: ff
}
* Compiled with -O2, with -fPIC, x86_64:
__thread __attribute__((weak)) volatile struct thread_local_abi __thread_local_abi;
4006de: 64 48 8b 04 25 00 00 mov %fs:0x0,%rax
4006e5: 00 00
static
int32_t read_cpu_id(void)
{
if (unlikely(!(__thread_local_abi.features & TLABI_FEATURE_CPU_ID)))
4006e7: 48 8d 80 c0 ff ff ff lea -0x40(%rax),%rax
4006ee: 8b 10 mov (%rax),%edx
4006f0: 83 e2 01 and $0x1,%edx
4006f3: 0f 84 80 00 00 00 je 400779 <main+0xc9>
return sched_getcpu();
return __thread_local_abi.cpu_id;
4006f9: 8b 50 04 mov 0x4(%rax),%edx
}
So with -fPIC (libraries), TLS adds an extra indirection. However,
it just needs to load the base address once, and can then access
both "features" and "cpu_id" fields as offsets from that base.
For executables compiled without -fPIC, there is no indirection.
This justifies the following paragraph in the proposed man page:
The symbol __thread_local_abi is recommended to be used across
libraries and applications wishing to register the thread-local
ABI structure for tlabi_nr 0. The attribute "weak" is recommended
when declaring this variable in libraries. Applications can
choose to define their own version of this symbol without the weak
attribute as a performance improvement.
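Concretely, the declarations could look like this (sketch only,
mirroring the snippet above; struct thread_local_abi comes from the
patchset's <linux/thread_local_abi.h>):

/* In a library: weak definition, shared with other users of the ABI. */
__thread __attribute__((weak)) volatile struct thread_local_abi
		__thread_local_abi;

/* In an application that opts for a non-weak definition instead: */
__thread volatile struct thread_local_abi __thread_local_abi;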
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Apr 7, 2016 at 4:19 AM, Peter Zijlstra <[email protected]> wrote:
>
> People objected against the fixed size scheme, but it being possible to
> get a fixed TCB offset and reduce indirections is a big win IMO.
Guys, I'm going to just make an executive decision here, because this
whole "fixed vs some strange size that is a superset of kernel and
user mode knowledge" discussion has been going on for too long.
Here's the executive decision: I will not merge anything that doesn't
have a (small) fixed size as far as the kernel is concerned.
Why?
I don't think there is *any* possible reason for the kernel to care
about the size. There will be no future extensions. The kernel will
not magically start doing bigger things, and change more fields, or
anything like that.
Put another way: if the interface cannot be designed so that the
kernel simply DOES NOT HAVE TO CARE about the rest of the crap, I will
not merge this patch series. Ever.
So I don't want to hear more idiotic emails about "extensible sizes".
The people who want to push this interface had better be able to show
that the kernel will never care about what user space does, and *all*
the kernel has to do is to invalidate a single field when a thread is
moved.
In other words:
- get rid of the stupid "abi features" bitfield. Anybody who feels it
is needed had better take a deep breath and ask themselves why.
- that leaves us with *one* single 32-bit field that the kernel cares
about: "cpu_id".
- we specify that the *only* thing the kernel will ever do is that single
put_user(raw_smp_processor_id(), &t->tlabi->cpu_id)
and absolutely nothing else.
End result? That damn data structure is 32 bits. No more, no less.
I'm perfectly happy to make a strict requirement that it is some
16-byte aligned thing, and we can add padding values, but quite
frankly, I'm not really sure even that is required.
And if the kernel ever has to care about anything else, I say "no".
Can anybody give a *coherent* and actual *real* reason why the kernel
would ever care about anything else?
Because if not, then this discussion is done for. Stop with the
f*cking idiotic "let's look at some kernel size and user-space size
and try to match them up". The kernel doesn't care. The kernel MUST
NOT care. The kernel will touch one single word, and that's all the
kernel does, and user space had better be able to make up their own
semantics around that.
Linus
On Thu, Apr 7, 2016 at 9:39 AM, Linus Torvalds
<[email protected]> wrote:
> Can anybody give a *coherent* and actual *real* reason why the kernel
> would ever care about anything else?
The rseq data structure, which is still being designed.
OTOH, that thing could be registered separately by userspace.
--Andy
On 04/07/2016 06:39 PM, Linus Torvalds wrote:
> Can anybody give a *coherent* and actual *real* reason why the kernel
> would ever care about anything else?
We already have a similar per-thread data structure, the robust mutex
list. The CPU ID is another one. So it's conceivable we might get
further such fields in the future.
(AFAICS set_robust_list was designed with such extensions in mind.)
Florian
On Thu, Apr 7, 2016 at 9:39 AM, Linus Torvalds
<[email protected]> wrote:
>
> Because if not, then this discussion is done for. Stop with the
> f*cking idiotic "let's look at some kernel size and user-space size
> and try to match them up". The kernel doesn't care. The kernel MUST
> NOT care. The kernel will touch one single word, and that's all the
> kernel does, and user space had better be able to make up their own
> semantics around that.
.. and btw - if people aren't sure that that is a "good enough"
interface, then I'm sure as hell not going to merge that patch anyway.
Andy mentions rseq. Yeah, I'm not going to merge anything where part
of the discussion is "and we might want to do something else for X".
Either the suggested patches are useful and generic enough that people
can do this, or they aren't.
If people can agree that "yes, this whole cpu id cache is a great
interface that we can build up interesting user-space constructs
around", then great. Such a new kernel interface may be worth merging.
But if people cannot be convinced that it is sufficient, then I don't
want to merge some half-arsed interface that generates these kinds of
discussions.
So the fact that currently makes me go "no way will I merge any of
this" is the very fact that these discussions continue and are still
going on.
Linus
On Thu, Apr 7, 2016 at 9:50 AM, Florian Weimer <[email protected]> wrote:
>
> (AFAICS set_robust_list was designed with such extensions in mind.)
This is a disease among people who have been taught computer science.
People think that "designing with extensions in mind" is a good idea.
It's a _horrible_ idea.
If you think that "design with extensions in mind" is a good idea,
you're basically saying "I don't know what I might want to do".
I'm not interested in those kinds of kernel interfaces. EVERY SINGLE
TIME when we add a new random non-standard interface that isn't
already used by lots and lots of people, the end result is the same:
nobody actually uses it. There might be one or two very obscure
libraries that use it, and then a couple of special applications that
use those libraries. And that's it.
So unless there is a clear use-case, and clear semantics that people
can agree on as being truly generic and useful for a lot of different
cases, excuse me for being less than impressed.
Anything with a "let's add feature fields" is broken shit. BY DEFINITION.
See my argument?
And btw, ask yourself how well that set_robust_list() extension worked?
(Answer sheet to the above question: it was pure garbage. Instead of
actually ever being extended, the "struct robust_list_head" not only
is fixed, it was horribly misdesigned to the point of requiring a
compat system call. Pure garbage, in other words, and an example of
how *not* to do user interfaces).
Linus
----- On Apr 7, 2016, at 12:52 PM, Linus Torvalds [email protected] wrote:
> On Thu, Apr 7, 2016 at 9:39 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Because if not, then this discussion is done for. Stop with the
>> f*cking idiotic "let's look at some kernel size and user-space size
>> and try to match them up". The kernel doesn't care. The kernel MUST
>> NOT care. The kernel will touch one single word, and that's all the
>> kernel does, and user space had better be able to make up their own
>> semantics around that.
>
> .. and btw - if people aren't sure that that is a "good enough"
> interface, then I'm sure as hell not going to merge that patch anyway.
> Andy mentions rseq. Yeah, I'm not going to merge anything where part
> of the discussion is "and we might want to do something else for X".
>
> Either the suggested patches are useful and generic enough that people
> can do this, or they aren't.
>
> If people can agree that "yes, this whole cpu id cache is a great
> interface that we can build up interesting user-space constructs
> around", then great. Such a new kernel interface may be worth merging.
One basic use of cpu id cache is to speed up the sched_getcpu(3)
implementation in glibc. This is why I'm proposing it as a stand-alone
feature that does not require restartable sequences. It can
also be used directly from applications to remove the function call
overhead of sched_getcpu, which further accelerates this operation.
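For instance (sketch only; fallback_getcpu() is a stand-in for glibc's
existing vdso/syscall path, and __thread_local_abi / TLABI_FEATURE_CPU_ID
are reused from the snippets earlier in this thread):

int sched_getcpu(void)
{
	/* Fast path: read the kernel-maintained per-thread cache. */
	if (__thread_local_abi.features & TLABI_FEATURE_CPU_ID)
		return __thread_local_abi.cpu_id;
	/* Fallback when the thread-local ABI is not registered. */
	return fallback_getcpu();
}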
>
> But if people cannot be convinced that it is sufficient, then I don't
> want to merge some half-arsed interface that generates these kinds of
> discussions.
>
> So the fact that currently makes me go "no way will I merge any of
> this" is the very fact that these discussions continue and are still
> going on.
The intent of this RFC patchset is to get people to agree on the proper
way to introduce both the "cpu id" and the "rseq (restartable critical
section)" features. I have so far proposed two ways of doing it: one
system call per feature, or one system call to register all the features.
My previous patch rounds added a system call specific to the
cpu_id field, registering a pointer to a 32-bit per-thread integer
(the getcpu_cache system call). Based on prior email exchanges I had with
you on other topics, I was inclined to go for the specific getcpu_cache
system call route, adding future features as separate system calls.
hpa pointed out that this would mean keeping track of one pointer
per task_struct for cpu_id, and eventually another pointer per
task_struct for rseq fields, thus degrading cache locality. In
order to address his concerns, I proposed this "thread-local ABI"
system call, which registers a fixed-size 64-byte structure that
starts with a feature mask.
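A minimal registration sketch for tlabi_nr 0 (the actual prototype is in
patch 1/5; the argument order and wrapper below are my illustrative
assumptions, not the definitive ABI):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/thread_local_abi.h>	/* struct thread_local_abi, TLABI_FEATURE_CPU_ID */

__thread __attribute__((weak)) volatile struct thread_local_abi
		__thread_local_abi;

static int tlabi_register_cpu_id(void)
{
	/* Register the 64-byte structure as tlabi_nr 0 and request
	 * the cpu_id feature; argument order is assumed here. */
	return syscall(__NR_thread_local_abi, 0, &__thread_local_abi,
		       TLABI_FEATURE_CPU_ID, 0);
}

Once registered, the read_cpu_id() fast path shown earlier tests the
feature bit and reads cpu_id without any system call.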
The other route we could take is to just implement one "rseq" system
call, which would contain all fields needed for the rseq feature,
which happen to include the cpu_id. The main downside of this
approach is that whenever we want to port the cpu_id feature to
another architecture, it _needs_ to come with the implemented
"rseq" feature too, which is rather more complex. I don't mind
going that way either if that's preferred.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
> One basic use of cpu id cache is to speed up the sched_getcpu(3)
> implementation in glibc. This is why I'm proposing it as a stand-alone
I don't think rseq is needed for faster getcpu.
User space has to be able to handle stale return values anyway, as it
has no way to lock itself to a cpu while it is using the return value.
So it can be only a hint.
The original version of getcpu just had a jiffies-based cache. The CPU
value was valid for up to a jiffy (until jiffies next changes), and then it
gets looked up again.
Processes are unlikely to switch CPUs more often than once a jiffy, so it's
good enough as a hint.
This doesn't need any new kernel interfaces at all because jiffies is already
exported to the vdso.
It just needs a new entry point into the vdso that handles the jiffie
check.
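A rough sketch of such a jiffies-based cache (illustrative only; it
assumes a jiffies counter readable from user-space, here called
__vdso_jiffies, which is an invented name):

#define _GNU_SOURCE
#include <sched.h>	/* sched_getcpu() */

extern volatile unsigned long __vdso_jiffies;	/* assumed vdso export */

static __thread unsigned long cached_jiffies;
static __thread int cached_cpu = -1;

/* Returns a CPU number hint; may be stale, as discussed above. */
static int getcpu_cached(void)
{
	unsigned long j = __vdso_jiffies;

	if (cached_cpu < 0 || j != cached_jiffies) {
		cached_cpu = sched_getcpu();	/* slow path */
		cached_jiffies = j;
	}
	return cached_cpu;
}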
-Andi
----- On Apr 7, 2016, at 4:22 PM, Andi Kleen [email protected] wrote:
>> One basic use of cpu id cache is to speed up the sched_getcpu(3)
>> implementation in glibc. This is why I'm proposing it as a stand-alone
>
> I don't think rseq is needed for faster getcpu.
I agree that rseq is not needed for faster getcpu. This is why I was proposing
to make the "cpu_id" feature configurable separately from the rseq feature.
E.g. a kernel configuration that does not want to take the hit of rseq handling
in signal delivery and preemption could just enable the cpu_id feature, and
would thus only need to add work in the migration code path and when returning
to userspace. Also, if a thread only registers the cpu_id feature, the kernel
can quickly skip the rseq code in signal delivery and preemption too.
>
> User space has to be able to handle stale return values anyway, as it
> has no way to lock itself to a cpu while it is using the return value.
> So it can be only a hint.
>
> The original version of getcpu just had a jiffies based cache. The CPU
> value was valid up to a jiffie (the next time jiffie changes), and then it
> gets looked up again.
>
> Processes are unlikely to switch CPUs more often than a jiffie, so it's
> good enough as a hint.
One example use-case where this would hurt: we use the CPU id heavily when
tracing to a ring buffer in user-space. Having one event written into the
wrong buffer once in a while is not a big deal, but tracing a whole burst
of events within a jiffy (e.g. 4ms at 250Hz) to the wrong cpu buffer
whenever the thread migrates is really an unwanted side-effect latency-wise.
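As a sketch only (buffer layout, sizes and names below are hypothetical),
the fast path of such a tracer looks roughly like this:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAX_CPUS	256	/* hypothetical upper bound */
#define BUF_SIZE	65536

struct percpu_buf {
	size_t pos;
	char data[BUF_SIZE];
};

static struct percpu_buf bufs[MAX_CPUS];

int32_t read_cpu_id(void);	/* fast path shown earlier in this thread */

/* A stale cpu_id once in a while only misplaces a single event,
 * which is acceptable for this use-case. */
static void trace_event(const void *ev, size_t len)
{
	struct percpu_buf *b = &bufs[read_cpu_id()];

	if (b->pos + len <= BUF_SIZE) {
		memcpy(b->data + b->pos, ev, len);
		b->pos += len;
	}
}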
>
> This doesn't need any new kernel interfaces at all because jiffies is already
> exported to the vdso.
My understanding is that your assumptions about the availability of
those features in the vdso are true for x86 32/64, but they do not
currently apply to ARM32.
ARM32 is my main target architecture for the CPU id cache work. x86 32/64
simply happens to benefit from that work too (see my benchmark numbers
in the changelog of patch 1/5).
> It just needs a new entry point into the vdso that handles the jiffie
> check.
This would likely require extending the ARM vdso page to expose the jiffies
counter to user-space, and updating user-space libraries to use this counter
in sched_getcpu. But it would still be slower than the cpu_id cache I propose,
due to the required function call to sched_getcpu, unless you want to open-code
the jiffies check within all applications as an ABI. It would also be bad for
fast bursts of cpu id use (e.g. per-cpu ring buffers).
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com