2021-04-14 12:46:22

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

We already have process_vm_readv and process_vm_writev to read and write
process memory faster than we can with ptrace. Now it is time for
process_vm_exec, which allows executing code in the address space of
another process. We can do this with ptrace, but it is much slower.

= Use-cases =

Here are two known use-cases. The first one is “application kernel”
sandboxes like User-mode Linux and gVisor. In this case, we have a
process that runs the sandbox kernel and a set of stub processes that
are used to manage guest address spaces. Guest code is executed in the
context of stub processes, but all system calls are intercepted and
handled in the sandbox kernel. Right now, this sort of sandbox uses
PTRACE_SYSEMU to trap system calls, but process_vm_exec can
speed them up significantly.

Another use-case is CRIU (Checkpoint/Restore in User-space). Several
process properties can be obtained only from inside the process itself.
Right now, we use parasite code that is injected into the process. We do
this with ptrace, but it is slow, unsafe, and tricky. process_vm_exec can
simplify injecting parasite code, and it will allow pre-dumping memory
without stopping processes. A pre-dump is when we enable a memory
tracker and dump the memory while a process continues running. On each
iteration, we dump the memory that has changed since the previous
iteration. In the final step, we stop the processes and dump their full
state. Right now, the most effective way to dump process memory is to
create a set of pipes and splice memory into these pipes from the
parasite code. With process_vm_exec, we will be able to call vmsplice
directly, which means we will not need to stop a process to inject the
parasite code.

= How it works =

process_vm_exec has two modes:

* Execute code in the address space of a target process and stop on any
signal or system call.

* Execute a system call in the address space of a target process.

int process_vm_exec(pid_t pid, struct sigcontext *uctx,
unsigned long flags, siginfo_t *siginfo,
sigset_t *sigmask, size_t sizemask)

pid - the target process identifier. We could consider using a pidfd
instead of a PID here.

sigcontext contains the process state with which the process will be
resumed after switching the address space; when the process is stopped
again, its state is saved back to sigcontext.

siginfo holds information about the signal that interrupted the process.
If the process is interrupted by a system call, siginfo will contain a
synthetic siginfo for the SIGSYS signal.

sigmask is the set of signals that process_vm_exec returns via siginfo.

= How fast is it =

In the fourth patch, you can find two benchmarks that execute a function
calling system calls in a loop. ptrace_vm_exec uses ptrace to trap
system calls; process_vm_exec uses the process_vm_exec syscall to do the
same thing.

ptrace_vm_exec:  1446 ns/syscall
process_vm_exec:  289 ns/syscall

PS: This version is just a prototype. Its goal is to collect initial
feedback, to discuss the interfaces, and maybe to get some advice on the
implementation.

Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Anton Ivanov <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dmitry Safonov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jeff Dike <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Michael Kerrisk (man-pages) <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Richard Weinberger <[email protected]>
Cc: Thomas Gleixner <[email protected]>

Andrei Vagin (4):
signal: add a helper to restore a process state from sigcontext
arch/x86: implement the process_vm_exec syscall
arch/x86: allow to execute syscalls via process_vm_exec
selftests: add tests for process_vm_exec

arch/Kconfig | 15 ++
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 19 +++
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/sigcontext.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/process_vm_exec.c | 160 ++++++++++++++++++
arch/x86/kernel/signal.c | 125 ++++++++++----
include/linux/entry-common.h | 2 +
include/linux/process_vm_exec.h | 17 ++
include/linux/sched.h | 7 +
include/linux/syscalls.h | 6 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/process_vm_exec.h | 8 +
kernel/entry/common.c | 2 +-
kernel/fork.c | 9 +
kernel/sys_ni.c | 2 +
.../selftests/process_vm_exec/Makefile | 7 +
tools/testing/selftests/process_vm_exec/log.h | 26 +++
.../process_vm_exec/process_vm_exec.c | 105 ++++++++++++
.../process_vm_exec/process_vm_exec_fault.c | 111 ++++++++++++
.../process_vm_exec/process_vm_exec_syscall.c | 81 +++++++++
.../process_vm_exec/ptrace_vm_exec.c | 111 ++++++++++++
23 files changed, 785 insertions(+), 37 deletions(-)
create mode 100644 arch/x86/kernel/process_vm_exec.c
create mode 100644 include/linux/process_vm_exec.h
create mode 100644 include/uapi/linux/process_vm_exec.h
create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
create mode 100644 tools/testing/selftests/process_vm_exec/log.h
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c

--
2.29.2


2021-04-14 12:47:07

by Andrei Vagin

Subject: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

This change introduces the new system call:
process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
siginfo_t *uinfo, sigset_t *sigmask, size_t sizemask)

process_vm_exec allows executing the current process in the address
space of another process.

process_vm_exec swaps the current address space with the address space
of a specified process, sets the state from sigcontext, and resumes the
process. When the process receives a signal or calls a system call,
process_vm_exec saves the process state back to sigcontext, restores the
original address space, restores the original process state, and returns
to userspace.

If the process was interrupted by a signal and the signal is in
user_mask, the signal is dequeued and information about it is saved in
uinfo. If process_vm_exec is interrupted by a system call, a synthetic
siginfo for the SIGSYS signal is generated.

The behavior of this system call is similar to PTRACE_SYSEMU, but
everything happens in the context of one process, so
process_vm_exec performs better.

PTRACE_SYSEMU is primarily used to implement sandboxes (application
kernels) like User-mode Linux or gVisor. These types of sandboxes
intercept application system calls and act as the guest kernel.
A simple benchmark, where a "tracee" process executes system calls in a
loop and a "tracer" process traps the syscalls and handles them by just
incrementing the tracee's instruction pointer to skip the syscall
instruction, shows that process_vm_exec works more than 5 times faster
than PTRACE_SYSEMU.

Signed-off-by: Andrei Vagin <[email protected]>
---
arch/Kconfig | 15 +++
arch/x86/Kconfig | 1 +
arch/x86/entry/common.c | 16 +++
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/x86/include/asm/sigcontext.h | 2 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/process_vm_exec.c | 133 +++++++++++++++++++++++++
arch/x86/kernel/signal.c | 47 +++++++++
include/linux/process_vm_exec.h | 15 +++
include/linux/sched.h | 7 ++
include/linux/syscalls.h | 6 ++
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 9 ++
kernel/sys_ni.c | 2 +
14 files changed, 258 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/process_vm_exec.c
create mode 100644 include/linux/process_vm_exec.h

diff --git a/arch/Kconfig b/arch/Kconfig
index ba4e966484ab..3ed9b8fb1727 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -514,6 +514,21 @@ config SECCOMP_FILTER

See Documentation/userspace-api/seccomp_filter.rst for details.

+config HAVE_ARCH_PROCESS_VM_EXEC
+ bool
+ help
+ An arch should select this symbol to support the process_vm_exec system call.
+
+config PROCESS_VM_EXEC
+ prompt "Enable the process_vm_exec syscall"
+ def_bool y
+ depends on HAVE_ARCH_PROCESS_VM_EXEC
+ help
+ process_vm_exec allows executing code and system calls in a specified
+ address space.
+
+ If unsure, say Y.
+
config HAVE_ARCH_STACKLEAK
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index fbf26e0f7a6a..1c7ebb58865e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -27,6 +27,7 @@ config X86_64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_USE_CMPXCHG_LOCKREF
select HAVE_ARCH_SOFT_DIRTY
+ select HAVE_ARCH_PROCESS_VM_EXEC
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
select SWIOTLB
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 870efeec8bda..42eac459b25b 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -19,6 +19,7 @@
#include <linux/nospec.h>
#include <linux/syscalls.h>
#include <linux/uaccess.h>
+#include <linux/process_vm_exec.h>

#ifdef CONFIG_XEN_PV
#include <xen/xen-ops.h>
@@ -38,6 +39,21 @@
#ifdef CONFIG_X86_64
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
+#ifdef CONFIG_PROCESS_VM_EXEC
+ if (current->exec_mm && current->exec_mm->ctx) {
+ kernel_siginfo_t info = {
+ .si_signo = SIGSYS,
+ .si_call_addr = (void __user *)KSTK_EIP(current),
+ .si_arch = syscall_get_arch(current),
+ .si_syscall = nr,
+ };
+ restore_vm_exec_context(regs);
+ regs->ax = copy_siginfo_to_user(current->exec_mm->siginfo, &info);
+ syscall_exit_to_user_mode(regs);
+ return;
+ }
+#endif
+
nr = syscall_enter_from_user_mode(regs, nr);

instrumentation_begin();
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 379819244b91..2a8e27b2d87e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -362,6 +362,7 @@
438 common pidfd_getfd sys_pidfd_getfd
439 common faccessat2 sys_faccessat2
440 common process_madvise sys_process_madvise
+441 64 process_vm_exec sys_process_vm_exec

#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/x86/include/asm/sigcontext.h b/arch/x86/include/asm/sigcontext.h
index 140d890c2c98..e390410cc3e9 100644
--- a/arch/x86/include/asm/sigcontext.h
+++ b/arch/x86/include/asm/sigcontext.h
@@ -6,4 +6,6 @@

#include <uapi/asm/sigcontext.h>

+extern long swap_vm_exec_context(struct sigcontext __user *uctx);
+
#endif /* _ASM_X86_SIGCONTEXT_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 68608bd892c0..d053289fd19e 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -163,3 +163,4 @@ ifeq ($(CONFIG_X86_64),y)
endif

obj-$(CONFIG_IMA_SECURE_AND_OR_TRUSTED_BOOT) += ima_arch.o
+obj-$(CONFIG_PROCESS_VM_EXEC) += process_vm_exec.o
diff --git a/arch/x86/kernel/process_vm_exec.c b/arch/x86/kernel/process_vm_exec.c
new file mode 100644
index 000000000000..28b32330f744
--- /dev/null
+++ b/arch/x86/kernel/process_vm_exec.c
@@ -0,0 +1,133 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <asm/syscall.h>
+#include <asm/sigframe.h>
+#include <asm/signal.h>
+#include <asm/mmu_context.h>
+#include <asm/sigcontext.h>
+
+#include <linux/types.h>
+#include <linux/mm.h>
+#include <linux/sched/mm.h>
+#include <linux/syscalls.h>
+#include <linux/vmacache.h>
+#include <linux/process_vm_exec.h>
+
+static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
+{
+ struct task_struct *tsk = current;
+ struct mm_struct *active_mm;
+
+ task_lock(tsk);
+ /* Hold off tlb flush IPIs while switching mm's */
+ local_irq_disable();
+
+ sync_mm_rss(prev_mm);
+
+ vmacache_flush(tsk);
+
+ active_mm = tsk->active_mm;
+ if (active_mm != target_mm) {
+ mmgrab(target_mm);
+ tsk->active_mm = target_mm;
+ }
+ tsk->mm = target_mm;
+ switch_mm_irqs_off(active_mm, target_mm, tsk);
+ local_irq_enable();
+ task_unlock(tsk);
+#ifdef finish_arch_post_lock_switch
+ finish_arch_post_lock_switch();
+#endif
+
+ if (active_mm != target_mm)
+ mmdrop(active_mm);
+}
+
+void restore_vm_exec_context(struct pt_regs *regs)
+{
+ struct sigcontext __user *uctx;
+ struct mm_struct *prev_mm, *target_mm;
+
+ uctx = current->exec_mm->ctx;
+ current->exec_mm->ctx = NULL;
+
+ target_mm = current->exec_mm->mm;
+ current->exec_mm->mm = NULL;
+ prev_mm = current->mm;
+
+ swap_mm(prev_mm, target_mm);
+
+ mmput(prev_mm);
+ mmdrop(target_mm);
+
+ swap_vm_exec_context(uctx);
+}
+
+SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct sigcontext __user *, uctx,
+ unsigned long, flags, siginfo_t __user *, uinfo,
+ sigset_t __user *, user_mask, size_t, sizemask)
+{
+ struct mm_struct *prev_mm, *mm;
+ struct task_struct *tsk;
+ long ret = -ESRCH;
+
+ sigset_t mask;
+
+ if (flags)
+ return -EINVAL;
+
+ if (sizemask != sizeof(sigset_t))
+ return -EINVAL;
+ if (copy_from_user(&mask, user_mask, sizeof(mask)))
+ return -EFAULT;
+
+ sigdelsetmask(&mask, sigmask(SIGKILL) | sigmask(SIGSTOP));
+ signotset(&mask);
+
+ tsk = find_get_task_by_vpid(pid);
+ if (!tsk) {
+ ret = -ESRCH;
+ goto err;
+ }
+ mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
+ put_task_struct(tsk);
+ if (!mm || IS_ERR(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ goto err;
+ }
+
+ current_pt_regs()->ax = 0;
+ ret = swap_vm_exec_context(uctx);
+ if (ret < 0)
+ goto err_mm_put;
+
+ if (!current->exec_mm) {
+ ret = -ENOMEM;
+ current->exec_mm = kmalloc(sizeof(*current->exec_mm), GFP_KERNEL);
+ if (current->exec_mm == NULL)
+ goto err_mm_put;
+ }
+ current->exec_mm->ctx = uctx;
+ current->exec_mm->mm = current->mm;
+ current->exec_mm->flags = flags;
+ current->exec_mm->sigmask = mask;
+ current->exec_mm->siginfo = uinfo;
+ prev_mm = current->mm;
+
+ mmgrab(prev_mm);
+ swap_mm(prev_mm, mm);
+
+ ret = current_pt_regs()->ax;
+
+ return ret;
+err_mm_put:
+ mmput(mm);
+err:
+ return ret;
+}
+
+void free_exec_mm_struct(struct task_struct *p)
+{
+ kfree(p->exec_mm);
+ p->exec_mm = NULL;
+}
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index cc269a20dd5f..51286c79062b 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -27,6 +27,7 @@
#include <linux/context_tracking.h>
#include <linux/entry-common.h>
#include <linux/syscalls.h>
+#include <linux/process_vm_exec.h>

#include <asm/processor.h>
#include <asm/ucontext.h>
@@ -816,6 +817,23 @@ void arch_do_signal(struct pt_regs *regs)
{
struct ksignal ksig;

+#ifdef CONFIG_PROCESS_VM_EXEC
+ if (current->exec_mm && current->exec_mm->ctx) {
+ kernel_siginfo_t info;
+ int ret;
+
+ restore_vm_exec_context(current_pt_regs());
+
+ spin_lock_irq(&current->sighand->siglock);
+ ret = dequeue_signal(current, &current->exec_mm->sigmask, &info);
+ spin_unlock_irq(&current->sighand->siglock);
+
+ if (ret > 0)
+ ret = copy_siginfo_to_user(current->exec_mm->siginfo, &info);
+ regs->ax = ret;
+ }
+#endif
+
if (get_signal(&ksig)) {
/* Whee! Actually deliver the signal. */
handle_signal(&ksig, regs);
@@ -896,3 +914,32 @@ COMPAT_SYSCALL_DEFINE0(x32_rt_sigreturn)
return 0;
}
#endif
+
+#ifdef CONFIG_PROCESS_VM_EXEC
+long swap_vm_exec_context(struct sigcontext __user *uctx)
+{
+ struct sigcontext ctx = {};
+ sigset_t set = {};
+
+
+ if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
+ return -EFAULT;
+ /* A floating point state is managed from user-space. */
+ if (ctx.fpstate != 0)
+ return -EINVAL;
+ if (!user_access_begin(uctx, sizeof(*uctx)))
+ return -EFAULT;
+ unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
+ user_access_end();
+
+ if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
+ goto badframe;
+
+ return 0;
+Efault:
+ user_access_end();
+badframe:
+ signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
+ return -EFAULT;
+}
+#endif
diff --git a/include/linux/process_vm_exec.h b/include/linux/process_vm_exec.h
new file mode 100644
index 000000000000..a02535fbd5c8
--- /dev/null
+++ b/include/linux/process_vm_exec.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_PROCESS_VM_EXEC_H
+#define _LINUX_PROCESS_VM_EXEC_H
+
+struct exec_mm {
+ struct sigcontext *ctx;
+ struct mm_struct *mm;
+ unsigned long flags;
+ sigset_t sigmask;
+ siginfo_t __user *siginfo;
+};
+
+void free_exec_mm_struct(struct task_struct *tsk);
+
+#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 76cd21fa5501..864a8fdd0ed7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -64,6 +64,7 @@ struct signal_struct;
struct task_delay_info;
struct task_group;
struct io_uring_task;
+struct exec_mm;

/*
* Task state bitmask. NOTE! These bits are also
@@ -637,6 +638,8 @@ struct wake_q_node {
struct wake_q_node *next;
};

+struct exec_mm;
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -757,6 +760,10 @@ struct task_struct {
struct mm_struct *mm;
struct mm_struct *active_mm;

+#ifdef CONFIG_PROCESS_VM_EXEC
+ struct exec_mm *exec_mm;
+#endif
+
/* Per-thread vma caching: */
struct vmacache vmacache;

diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 37bea07c12f2..bdea75a14975 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1347,4 +1347,10 @@ int __sys_getsockopt(int fd, int level, int optname, char __user *optval,
int __user *optlen);
int __sys_setsockopt(int fd, int level, int optname, char __user *optval,
int optlen);
+
+#ifdef CONFIG_PROCESS_VM_EXEC
+void restore_vm_exec_context(struct pt_regs *regs);
+#else
+static inline void restore_vm_exec_context(struct pt_regs *regs) {}
+#endif
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 2056318988f7..60acbd9cf511 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
__SYSCALL(__NR_faccessat2, sys_faccessat2)
#define __NR_process_madvise 440
__SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_process_vm_exec 441
+__SYSCALL(__NR_process_vm_exec, sys_process_vm_exec)

#undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442

/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 6d266388d380..61ca7a4a1130 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -96,6 +96,7 @@
#include <linux/kasan.h>
#include <linux/scs.h>
#include <linux/io_uring.h>
+#include <linux/process_vm_exec.h>

#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -461,6 +462,9 @@ void free_task(struct task_struct *tsk)
arch_release_task_struct(tsk);
if (tsk->flags & PF_KTHREAD)
free_kthread_struct(tsk);
+#ifdef CONFIG_PROCESS_VM_EXEC
+ free_exec_mm_struct(tsk);
+#endif
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -943,6 +947,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_MEMCG
tsk->active_memcg = NULL;
#endif
+
+#ifdef CONFIG_PROCESS_VM_EXEC
+ tsk->exec_mm = NULL;
+#endif
+
return tsk;

free_stack:
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index f27ac94d5fa7..2545a409bb07 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -350,6 +350,8 @@ COND_SYSCALL(pkey_mprotect);
COND_SYSCALL(pkey_alloc);
COND_SYSCALL(pkey_free);

+/* execute in another address space */
+COND_SYSCALL(process_vm_exec);

/*
* Architecture specific weak syscall entries.
--
2.29.2

2021-04-14 13:44:45

by Andrei Vagin

Subject: [PATCH 1/4] signal: add a helper to restore a process state from sigcontext

It will be used to implement process_vm_exec.

Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/kernel/signal.c | 78 ++++++++++++++++++++++------------------
1 file changed, 43 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index be0d7d4152ec..cc269a20dd5f 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -79,51 +79,43 @@ static void force_valid_ss(struct pt_regs *regs)
# define CONTEXT_COPY_SIZE sizeof(struct sigcontext)
#endif

-static int restore_sigcontext(struct pt_regs *regs,
- struct sigcontext __user *usc,
+static int __restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext __user *sc,
unsigned long uc_flags)
{
- struct sigcontext sc;
-
- /* Always make any pending restarted system calls return -EINTR */
- current->restart_block.fn = do_no_restart_syscall;
-
- if (copy_from_user(&sc, usc, CONTEXT_COPY_SIZE))
- return -EFAULT;
-
#ifdef CONFIG_X86_32
- set_user_gs(regs, sc.gs);
- regs->fs = sc.fs;
- regs->es = sc.es;
- regs->ds = sc.ds;
+ set_user_gs(regs, sc->gs);
+ regs->fs = sc->fs;
+ regs->es = sc->es;
+ regs->ds = sc->ds;
#endif /* CONFIG_X86_32 */

- regs->bx = sc.bx;
- regs->cx = sc.cx;
- regs->dx = sc.dx;
- regs->si = sc.si;
- regs->di = sc.di;
- regs->bp = sc.bp;
- regs->ax = sc.ax;
- regs->sp = sc.sp;
- regs->ip = sc.ip;
+ regs->bx = sc->bx;
+ regs->cx = sc->cx;
+ regs->dx = sc->dx;
+ regs->si = sc->si;
+ regs->di = sc->di;
+ regs->bp = sc->bp;
+ regs->ax = sc->ax;
+ regs->sp = sc->sp;
+ regs->ip = sc->ip;

#ifdef CONFIG_X86_64
- regs->r8 = sc.r8;
- regs->r9 = sc.r9;
- regs->r10 = sc.r10;
- regs->r11 = sc.r11;
- regs->r12 = sc.r12;
- regs->r13 = sc.r13;
- regs->r14 = sc.r14;
- regs->r15 = sc.r15;
+ regs->r8 = sc->r8;
+ regs->r9 = sc->r9;
+ regs->r10 = sc->r10;
+ regs->r11 = sc->r11;
+ regs->r12 = sc->r12;
+ regs->r13 = sc->r13;
+ regs->r14 = sc->r14;
+ regs->r15 = sc->r15;
#endif /* CONFIG_X86_64 */

/* Get CS/SS and force CPL3 */
- regs->cs = sc.cs | 0x03;
- regs->ss = sc.ss | 0x03;
+ regs->cs = sc->cs | 0x03;
+ regs->ss = sc->ss | 0x03;

- regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc.flags & FIX_EFLAGS);
+ regs->flags = (regs->flags & ~FIX_EFLAGS) | (sc->flags & FIX_EFLAGS);
/* disable syscall checks */
regs->orig_ax = -1;

@@ -136,10 +128,26 @@ static int restore_sigcontext(struct pt_regs *regs,
force_valid_ss(regs);
#endif

- return fpu__restore_sig((void __user *)sc.fpstate,
+ return fpu__restore_sig((void __user *)sc->fpstate,
IS_ENABLED(CONFIG_X86_32));
}

+static int restore_sigcontext(struct pt_regs *regs,
+ struct sigcontext __user *usc,
+ unsigned long uc_flags)
+{
+ struct sigcontext sc;
+
+ /* Always make any pending restarted system calls return -EINTR */
+ current->restart_block.fn = do_no_restart_syscall;
+
+ if (copy_from_user(&sc, usc, CONTEXT_COPY_SIZE))
+ return -EFAULT;
+
+ return __restore_sigcontext(regs, &sc, uc_flags);
+}
+
+
static __always_inline int
__unsafe_setup_sigcontext(struct sigcontext __user *sc, void __user *fpstate,
struct pt_regs *regs, unsigned long mask)
--
2.29.2

2021-04-14 13:44:54

by Andrei Vagin

Subject: [PATCH 3/4] arch/x86: allow to execute syscalls via process_vm_exec

process_vm_exec allows executing code in the address space of another
process. It changes the current address space to the target address
space and resumes the current process with registers from the sigcontext
that is passed in the arguments.

This change adds the PROCESS_VM_EXEC_SYSCALL flag; if it is set,
process_vm_exec will execute a system call with arguments from
sigcontext.

process_vm_exec returns 0 if the system call has been executed and an
error code in other cases.

The return code of the system call can be found in the appropriate
register in sigcontext.

Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/entry/common.c | 5 ++++-
arch/x86/kernel/process_vm_exec.c | 29 +++++++++++++++++++++++++++-
include/linux/entry-common.h | 2 ++
include/linux/process_vm_exec.h | 2 ++
include/uapi/linux/process_vm_exec.h | 8 ++++++++
kernel/entry/common.c | 2 +-
6 files changed, 45 insertions(+), 3 deletions(-)
create mode 100644 include/uapi/linux/process_vm_exec.h

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 42eac459b25b..8de02ca19aca 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -40,7 +40,10 @@
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
#ifdef CONFIG_PROCESS_VM_EXEC
- if (current->exec_mm && current->exec_mm->ctx) {
+ struct exec_mm *exec_mm = current->exec_mm;
+
+ if (exec_mm && exec_mm->ctx &&
+ !(exec_mm->flags & PROCESS_VM_EXEC_SYSCALL)) {
kernel_siginfo_t info = {
.si_signo = SIGSYS,
.si_call_addr = (void __user *)KSTK_EIP(current),
diff --git a/arch/x86/kernel/process_vm_exec.c b/arch/x86/kernel/process_vm_exec.c
index 28b32330f744..9124b23f1e9b 100644
--- a/arch/x86/kernel/process_vm_exec.c
+++ b/arch/x86/kernel/process_vm_exec.c
@@ -11,6 +11,7 @@
#include <linux/sched/mm.h>
#include <linux/syscalls.h>
#include <linux/vmacache.h>
+#include <linux/entry-common.h>
#include <linux/process_vm_exec.h>

static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
@@ -73,7 +74,7 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct sigcontext __user *, uctx,

sigset_t mask;

- if (flags)
+ if (flags & ~PROCESS_VM_EXEC_SYSCALL)
return -EINVAL;

if (sizemask != sizeof(sigset_t))
@@ -97,6 +98,9 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct sigcontext __user *, uctx,
}

current_pt_regs()->ax = 0;
+ if (flags & PROCESS_VM_EXEC_SYSCALL)
+ syscall_exit_to_user_mode_prepare(current_pt_regs());
+
ret = swap_vm_exec_context(uctx);
if (ret < 0)
goto err_mm_put;
@@ -117,6 +121,29 @@ SYSCALL_DEFINE6(process_vm_exec, pid_t, pid, struct sigcontext __user *, uctx,
mmgrab(prev_mm);
swap_mm(prev_mm, mm);

+ if (flags & PROCESS_VM_EXEC_SYSCALL) {
+ struct pt_regs *regs = current_pt_regs();
+ kernel_siginfo_t info;
+ int sysno;
+
+ regs->orig_ax = regs->ax;
+ regs->ax = -ENOSYS;
+ sysno = syscall_get_nr(current, regs);
+
+ do_syscall_64(sysno, regs);
+
+ restore_vm_exec_context(regs);
+ info.si_signo = SIGSYS;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_arch = syscall_get_arch(current);
+ info.si_syscall = sysno;
+ ret = copy_siginfo_to_user(current->exec_mm->siginfo, &info);
+ current_pt_regs()->orig_ax = __NR_process_vm_exec;
+ current_pt_regs()->ax = -ENOSYS;
+ syscall_enter_from_user_mode_work(current_pt_regs(), current_pt_regs()->orig_ax);
+ return ret;
+ }
+
ret = current_pt_regs()->ax;

return ret;
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 474f29638d2c..d0ebbe9ca9e4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -285,6 +285,8 @@ static inline void arch_syscall_exit_tracehook(struct pt_regs *regs, bool step)
}
#endif

+void syscall_exit_to_user_mode_prepare(struct pt_regs *regs);
+
/**
* syscall_exit_to_user_mode - Handle work before returning to user mode
* @regs: Pointer to currents pt_regs
diff --git a/include/linux/process_vm_exec.h b/include/linux/process_vm_exec.h
index a02535fbd5c8..2e04b4875a92 100644
--- a/include/linux/process_vm_exec.h
+++ b/include/linux/process_vm_exec.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_PROCESS_VM_EXEC_H
#define _LINUX_PROCESS_VM_EXEC_H

+#include <uapi/linux/process_vm_exec.h>
+
struct exec_mm {
struct sigcontext *ctx;
struct mm_struct *mm;
diff --git a/include/uapi/linux/process_vm_exec.h b/include/uapi/linux/process_vm_exec.h
new file mode 100644
index 000000000000..35465b5d3ebf
--- /dev/null
+++ b/include/uapi/linux/process_vm_exec.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_PROCESS_VM_EXEC_H
+#define _UAPI_LINUX_PROCESS_VM_EXEC_H
+
+#define PROCESS_VM_EXEC_SYSCALL 0x1UL
+
+#endif
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index e9e2df3f3f9e..c325a2e5ecf4 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -235,7 +235,7 @@ static void syscall_exit_work(struct pt_regs *regs, unsigned long ti_work)
* Syscall specific exit to user mode preparation. Runs with interrupts
* enabled.
*/
-static void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
+void syscall_exit_to_user_mode_prepare(struct pt_regs *regs)
{
u32 cached_flags = READ_ONCE(current_thread_info()->flags);
unsigned long nr = syscall_get_nr(current, regs);
--
2.29.2

2021-04-14 15:01:23

by Benjamin Berg

Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, 2021-04-14 at 09:34 +0200, Johannes Berg wrote:
> On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote:
> > On 14/04/2021 06:52, Andrei Vagin wrote:
> > > We already have process_vm_readv and process_vm_writev to read and
> > > write
> > > to a process memory faster than we can do this with ptrace. And now
> > > it
> > > is time for process_vm_exec that allows executing code in an
> > > address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > >
> > > = Use-cases =
> > >
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes
> > > that
> > > are used to manage guest address spaces. Guest code is executed in
> > > the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sort of sandboxes
> > > use
> > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > significantly speed them up.
> >
> > Certainly interesting, but will require um to rework most of its
> > memory
> > management and we will most likely need extra mm support to make use
> > of
> > it in UML. We are not likely to get away just with one syscall there.
>
> Might help the seccomp mode though:
>
> https://patchwork.ozlabs.org/project/linux-um/list/?series=231980

Hmm, to me it sounds like it replaces both ptrace and seccomp mode
while completely avoiding the scheduling overhead that these techniques
have. I think everything UML needs is covered:

* The new API can do syscalls in the target memory space
(we can modify the address space)
* The new API can run code until the next syscall happens
(or a signal happens, which means SIGALRM for scheduling works)
* Single step tracing should work by setting EFLAGS

I think the memory management itself stays fundamentally the same. We
just do the initial clone() using CLONE_STOPPED. We don't need any stub
code/data and we have everything we need to modify the address space
and run the userspace process.

Benjamin



2021-04-14 15:31:46

by Jann Horn

Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <[email protected]> wrote:
>
> * Andrei Vagin:
>
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> We also have some vaguely related within the same address space: running
> code on another thread, without modifying its stack, while it has signal
> handlers blocked, and without causing system calls to fail with EINTR.
> This can be used to implement certain kinds of memory barriers.

That's what the membarrier() syscall is for, right? Unless you don't
want to register all threads for expedited membarrier use?

> It is
> also necessary to implement set*id with POSIX semantics in userspace.
> (Linux only changes the current thread credentials, POSIX requires
> process-wide changes.) We currently use a signal for set*id, but it has
> issues (it can be blocked, the signal could come from somewhere, etc.).
> We can't use signals for barriers because of the EINTR issue, and
> because the signal context is stored on the stack.

This essentially becomes a question of "how much is set*id allowed to
block and what level of guarantee should there be by the time it
returns that no threads will perform privileged actions anymore after
it returns", right?

Like, if some piece of kernel code grabs a pointer to the current
credentials or acquires a temporary reference to some privileged
resource, then blocks on reading an argument from userspace, and then
performs a privileged action using the previously-grabbed credentials
or resource, what behavior do you want? Should setuid() block until
that privileged action has completed? Should it abort that action
(which is kinda what you get with the signals approach)? Should it
just return immediately even though an attacker who can write to
process memory at that point might still be able to influence a
privileged operation that hasn't read all its inputs yet? Should the
kernel be designed to keep track of whether it is currently holding a
privileged resource? Or should the kernel just specifically permit
credential changes in specific places where it is known that a task
might block for a long time and it is not holding any privileged
resources (kinda like the approach taken for freezer stuff)?

If userspace wants multithreaded setuid() without syscall aborting,
things get gnarly really fast; and having an interface to remotely
perform operations under another task's context isn't really relevant
to the core problem here, I think.

2021-04-14 18:48:12

by Andrei Vagin

[permalink] [raw]
Subject: [PATCH 4/4] selftests: add tests for process_vm_exec

Output:
$ make run_tests
TAP version 13
1..4
# selftests: process_vm_exec: process_vm_exec
# 1..1
# ok 1 275 ns/syscall
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: process_vm_exec: process_vm_exec
# selftests: process_vm_exec: process_vm_exec_fault
# 1..1
# ok 1 789 ns/signal
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 2 selftests: process_vm_exec: process_vm_exec_fault
# selftests: process_vm_exec: ptrace_vm_exec
# 1..1
# ok 1 1378 ns/syscall
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 3 selftests: process_vm_exec: ptrace_vm_exec
# selftests: process_vm_exec: process_vm_exec_syscall
# 1..1
# ok 1 write works as expected
# # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 4 selftests: process_vm_exec: process_vm_exec_syscall

Signed-off-by: Andrei Vagin <[email protected]>
---
.../selftests/process_vm_exec/Makefile | 7 ++
tools/testing/selftests/process_vm_exec/log.h | 26 ++++
.../process_vm_exec/process_vm_exec.c | 105 +++++++++++++++++
.../process_vm_exec/process_vm_exec_fault.c | 111 ++++++++++++++++++
.../process_vm_exec/process_vm_exec_syscall.c | 81 +++++++++++++
.../process_vm_exec/ptrace_vm_exec.c | 111 ++++++++++++++++++
6 files changed, 441 insertions(+)
create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
create mode 100644 tools/testing/selftests/process_vm_exec/log.h
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c

diff --git a/tools/testing/selftests/process_vm_exec/Makefile b/tools/testing/selftests/process_vm_exec/Makefile
new file mode 100644
index 000000000000..bdf7fcf0fdd3
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+UNAME_M := $(shell uname -m)
+TEST_GEN_PROGS_x86_64 := process_vm_exec process_vm_exec_fault ptrace_vm_exec process_vm_exec_syscall
+TEST_GEN_PROGS += $(TEST_GEN_PROGS_$(UNAME_M))
+
+include ../lib.mk
diff --git a/tools/testing/selftests/process_vm_exec/log.h b/tools/testing/selftests/process_vm_exec/log.h
new file mode 100644
index 000000000000..ef268c2cf2b8
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/log.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_PROCESS_VM_EXEC_LOG_H__
+#define __SELFTEST_PROCESS_VM_EXEC_LOG_H__
+
+#define pr_msg(fmt, lvl, ...) \
+ ksft_print_msg("[%s] (%s:%d)\t" fmt "\n", \
+ lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...) func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...) \
+ ({ \
+ ksft_test_result_error(fmt "\n", ##__VA_ARGS__); \
+ -1; \
+ })
+
+#define pr_fail(fmt, ...) \
+ ({ \
+ ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \
+ -1; \
+ })
+
+#define pr_perror(fmt, ...) pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/process_vm_exec/process_vm_exec.c b/tools/testing/selftests/process_vm_exec/process_vm_exec.c
new file mode 100644
index 000000000000..aa4009c43e01
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/process_vm_exec.c
@@ -0,0 +1,105 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <signal.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/user.h>
+#include <sys/uio.h>
+#include <sys/prctl.h>
+#include <asm/unistd.h>
+#include <time.h>
+#include <sys/mman.h>
+
+#include "../kselftest.h"
+#include "log.h"
+
+#ifndef __NR_process_vm_exec
+#define __NR_process_vm_exec 441
+#endif
+
+#define TEST_SYSCALL 123
+#define TEST_SYSCALL_RET 456
+#define TEST_MARKER 789
+#define TEST_TIMEOUT 5
+#define TEST_STACK_SIZE 65536
+
+static inline long __syscall1(long n, long a1)
+{
+ unsigned long ret;
+
+ __asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", "r11", "memory");
+
+ return ret;
+}
+
+int marker;
+
+static void guest(void)
+{
+ while (1)
+ if (__syscall1(TEST_SYSCALL, marker) != TEST_SYSCALL_RET)
+ abort();
+}
+
+int main(int argc, char **argv)
+{
+ struct sigcontext ctx = {};
+ struct timespec start, cur;
+ int status, ret;
+ pid_t pid;
+ long sysnr;
+ void *stack;
+
+ ksft_set_plan(1);
+
+ stack = mmap(NULL, TEST_STACK_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (stack == MAP_FAILED)
+ return pr_perror("mmap");
+
+ pid = fork();
+ if (pid == 0) {
+ prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
+ marker = TEST_MARKER;
+ kill(getpid(), SIGSTOP);
+ abort();
+ return 0;
+ }
+
+ ctx.rip = (long)guest;
+ ctx.rsp = (long)stack + TEST_STACK_SIZE;
+ ctx.cs = 0x33;
+
+ sysnr = 0;
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ while (1) {
+ unsigned long long sigmask = 0xffffffff;
+ siginfo_t siginfo;
+
+ clock_gettime(CLOCK_MONOTONIC, &cur);
+ if (start.tv_sec + TEST_TIMEOUT < cur.tv_sec ||
+ (start.tv_sec + TEST_TIMEOUT == cur.tv_sec &&
+ start.tv_nsec < cur.tv_nsec))
+ break;
+
+ ret = syscall(__NR_process_vm_exec, pid, &ctx, 0, &siginfo, &sigmask, 8);
+#ifdef __DEBUG
+ ksft_print_msg("ret %d signo %d sysno %d ip %lx\n",
+ ret, siginfo.si_signo, siginfo.si_syscall, ctx.rip);
+#endif
+ if (ret != 0)
+ pr_fail("unexpected return code: ret %d errno %d", ret, errno);
+ if (siginfo.si_signo != SIGSYS)
+ pr_fail("unexpected signal: %d", siginfo.si_signo);
+ if (siginfo.si_syscall != TEST_SYSCALL)
+ pr_fail("unexpected syscall: %d", siginfo.si_syscall);
+ ctx.rax = TEST_SYSCALL_RET;
+ sysnr++;
+ }
+ ksft_test_result_pass("%ld ns/syscall\n", 1000000000 / sysnr);
+ ksft_exit_pass();
+ return 0;
+}
diff --git a/tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c b/tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
new file mode 100644
index 000000000000..b2c49095f386
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <time.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/mman.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <sys/user.h>
+#include <sys/uio.h>
+#include <asm/unistd.h>
+
+#include "../kselftest.h"
+#include "log.h"
+
+#ifndef __NR_process_vm_exec
+#define __NR_process_vm_exec 441
+#endif
+
+#define TEST_TIMEOUT 5
+#define TEST_STACK_SIZE 65536
+
+#define TEST_VAL 0xaabbccddee
+
+unsigned long test_val;
+
+static inline void fault(unsigned long addr)
+{
+ unsigned long val = 0;
+
+ __asm__ __volatile__ (
+ "movq %%rcx, (%%rax)\n"
+ :
+ : "a"(addr), "c"(val)
+ :);
+}
+
+
+int marker;
+
+static void guest(void)
+{
+ unsigned long addr = 0;
+
+ while (1) {
+ addr = (addr + 1) % 8;
+ fault(addr);
+ if (test_val != TEST_VAL)
+ _exit(1);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ siginfo_t siginfo;
+ unsigned long long sigmask = 0xffffffff;
+ struct sigcontext ctx = {};
+ struct timespec start, cur;
+ unsigned long addr;
+ int status, ret;
+ char *stack;
+ pid_t pid;
+ long faults;
+
+ ksft_set_plan(1);
+
+ stack = mmap(NULL, TEST_STACK_SIZE, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+ if (stack == MAP_FAILED)
+ return pr_perror("mmap");
+
+ pid = fork();
+ if (pid == 0) {
+ prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
+ marker = 789;
+ kill(getpid(), SIGSTOP);
+ abort();
+ return 0;
+ }
+
+ ctx.rip = (long)guest;
+ ctx.rsp = (long)stack + TEST_STACK_SIZE;
+ ctx.cs = 0x33;
+
+ faults = 0;
+ addr = 0;
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ while (1) {
+ addr = (addr + 1) % 8;
+
+ clock_gettime(CLOCK_MONOTONIC, &cur);
+ if (start.tv_sec + TEST_TIMEOUT < cur.tv_sec ||
+ (start.tv_sec + TEST_TIMEOUT == cur.tv_sec &&
+ start.tv_nsec < cur.tv_nsec))
+ break;
+
+ ret = syscall(__NR_process_vm_exec, pid, &ctx, 0, &siginfo, &sigmask, 8);
+ if (addr % 8 != ctx.rax)
+ return pr_fail("unexpected address: %lx", addr);
+ ctx.rax = (long)&test_val;
+ ctx.rcx = TEST_VAL;
+ faults++;
+ }
+ ksft_test_result_pass("%ld ns/signal\n", 1000000000 / faults);
+ ksft_exit_pass();
+ return 0;
+}
diff --git a/tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c b/tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
new file mode 100644
index 000000000000..c0a7f6ee5b1a
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <signal.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <time.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <sys/user.h>
+#include <sys/uio.h>
+#include <asm/unistd.h>
+
+#include "../kselftest.h"
+#include "log.h"
+
+#ifndef __NR_process_vm_exec
+#define __NR_process_vm_exec 441
+#endif
+
+#ifndef PROCESS_VM_EXEC_SYSCALL
+#define PROCESS_VM_EXEC_SYSCALL 0x1
+#endif
+
+#define TEST_VAL 0x1e511e51
+
+int test_val = TEST_VAL;
+
+int main(int argc, char **argv)
+{
+ struct sigcontext ctx = {};
+ unsigned long long sigmask;
+ int ret, p[2], val;
+ siginfo_t siginfo;
+ pid_t pid;
+
+ ksft_set_plan(1);
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("fork");
+ if (pid == 0) {
+ prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
+ kill(getpid(), SIGSTOP);
+ return 0;
+ }
+
+ test_val = 0;
+ if (pipe(p))
+ return pr_perror("pipe");
+
+ ctx.rax = __NR_write;
+ ctx.rdi = p[1];
+ ctx.rsi = (unsigned long) &test_val;
+ ctx.rdx = sizeof(test_val);
+ ctx.r10 = 0;
+ ctx.r8 = 0;
+ ctx.r9 = 0;
+ sigmask = 0xffffffff;
+ ret = syscall(__NR_process_vm_exec, pid, &ctx, PROCESS_VM_EXEC_SYSCALL,
+ &siginfo, &sigmask, 8);
+ if (ret != 0)
+ return pr_perror("process_vm_exec");
+ if (siginfo.si_signo != SIGSYS)
+ return pr_fail("unexpected signal: %d", siginfo.si_signo);
+ if (ctx.rax != sizeof(test_val))
+ pr_fail("unexpected rax: %lx", ctx.rax);
+ if (kill(pid, SIGKILL))
+ return pr_perror("kill");
+ if (wait(NULL) != pid)
+ return pr_perror("wait");
+ if (read(p[0], &val, sizeof(val)) != sizeof(val))
+ pr_perror("read");
+ if (val != TEST_VAL)
+ pr_fail("unexpected data: %x", val);
+ ksft_test_result_pass("process_vm_exec(..., PROCESS_VM_EXEC_SYSCALL, ...)\n");
+ ksft_exit_pass();
+ return 0;
+}
diff --git a/tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c b/tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c
new file mode 100644
index 000000000000..aac14c2e8f11
--- /dev/null
+++ b/tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <signal.h>
+#include <sys/ptrace.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <linux/unistd.h>
+#include <stdio.h>
+#include <sys/user.h>
+#include <sys/uio.h>
+#include <time.h>
+
+#include "../kselftest.h"
+#include "log.h"
+
+static inline long __syscall1(long n, long a1)
+{
+ unsigned long ret;
+
+ __asm__ __volatile__ ("syscall"
+ : "=a"(ret)
+ : "a"(n), "D"(a1)
+ : "rcx", "r11", "memory");
+ return ret;
+}
+
+#define TEST_SYSCALL 444
+#define TEST_SYSCALL_RET 555
+#define TEST_MARKER 789
+#define TEST_TIMEOUT 5
+
+static int marker;
+
+static void guest(void)
+{
+ while (1) {
+ int ret;
+
+ ret = __syscall1(TEST_SYSCALL, marker);
+ if (ret != TEST_SYSCALL_RET)
+ abort();
+ }
+}
+
+int main(int argc, char **argv)
+{
+ struct user_regs_struct regs = {};
+ struct timespec start, cur;
+ int status;
+ long sysnr;
+ pid_t pid;
+
+ ksft_set_plan(1);
+
+ pid = fork();
+ if (pid == 0) {
+ marker = TEST_MARKER;
+ kill(getpid(), SIGSTOP);
+ /* unreachable */
+ abort();
+ return 0;
+ }
+
+ if (waitpid(pid, &status, WUNTRACED) != pid)
+ return pr_perror("waitpid");
+ if (ptrace(PTRACE_ATTACH, pid, 0, 0))
+ return pr_perror("PTRACE_ATTACH");
+ if (wait(&status) != pid)
+ return pr_perror("waitpid");
+ if (ptrace(PTRACE_CONT, pid, 0, 0))
+ return pr_perror("PTRACE_CONT");
+ if (waitpid(pid, &status, 0) != pid)
+ return pr_perror("waitpid");
+
+ if (ptrace(PTRACE_SETOPTIONS, pid, 0, PTRACE_O_EXITKILL))
+ return pr_perror("PTRACE_SETOPTIONS");
+ if (ptrace(PTRACE_GETREGS, pid, NULL, &regs))
+ return pr_perror("PTRACE_GETREGS");
+ regs.rip = (long)guest;
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ for (sysnr = 0; ; sysnr++) {
+ int status;
+
+ clock_gettime(CLOCK_MONOTONIC, &cur);
+ if (start.tv_sec + TEST_TIMEOUT < cur.tv_sec ||
+ (start.tv_sec + TEST_TIMEOUT == cur.tv_sec &&
+ start.tv_nsec < cur.tv_nsec))
+ break;
+ if (ptrace(PTRACE_SETREGS, pid, NULL, &regs))
+ return pr_perror("PTRACE_SETREGS");
+ if (ptrace(PTRACE_SYSEMU, pid, 0, 0))
+ return pr_perror("PTRACE_SYSEMU");
+ if (waitpid(pid, &status, 0) != pid)
+ return pr_perror("waitpid");
+ if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
+ return pr_err("unexpected status: %d", status);
+ if (ptrace(PTRACE_GETREGS, pid, NULL, &regs))
+ return pr_perror("PTRACE_GETREGS");
+ if (regs.rdi != TEST_MARKER)
+ return pr_err("unexpected marker: %llu", regs.rdi);
+ if (regs.orig_rax != TEST_SYSCALL)
+ return pr_err("unexpected syscall: %llu", regs.orig_rax);
+ regs.rax = TEST_SYSCALL_RET;
+ }
+ ksft_test_result_pass("%ld ns/syscall\n", 1000000000 / sysnr);
+ ksft_exit_pass();
+ return 0;
+}
--
2.29.2

2021-04-14 19:08:47

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based:
a set of syscalls to create a new mm_struct,
change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based:
like /dev/kvm, an API to create a new pagetable, mirror parts of
the mm_struct's pagetables over into it with modified permissions
(like KVM_SET_USER_MEMORY_REGION),
and run code under that context.
page fault handling would first handle the fault against mm->pgd
as normal, then mirror the PTE over into the secondary pagetables.
invalidation could be handled with MMU notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory? That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting parasite code and it will allow
> pre-dumping memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process continues
> running. On each iteration we dump memory that has been changed since
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?

2021-04-14 20:23:41

by Anton Ivanov

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On 14/04/2021 06:52, Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =
>
> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

Certainly interesting, but it will require UML to rework most of its memory
management, and we will most likely need extra mm support to make use of
it in UML. We are not likely to get away with just one syscall there.

>
> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky. process_vm_exec can
> simplify the process of injecting parasite code and it will allow
> pre-dumping memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process continues
> running. On each iteration we dump memory that has been changed since
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.
>
> = How it works =
>
> process_vm_exec has two modes:
>
> * Execute code in an address space of a target process and stop on any
> signal or system call.
>
> * Execute a system call in an address space of a target process.
>
> int process_vm_exec(pid_t pid, struct sigcontext uctx,
> unsigned long flags, siginfo_t siginfo,
> sigset_t *sigmask, size_t sizemask)
>
> PID - target process identification. We can consider to use pidfd
> instead of PID here.
>
> sigcontext contains the process state with which the process will be
> resumed after switching the address space; when the process is stopped
> again, its state is saved back to sigcontext.
>
> siginfo is information about a signal that has interrupted the process.
> If a process is interrupted by a system call, siginfo will contain a
> synthetic siginfo of the SIGSYS signal.
>
> sigmask is the set of signals that process_vm_exec returns via siginfo.
>
> # How fast is it
>
> In the fourth patch, you can find two benchmarks that execute a function
> that calls system calls in a loop. ptrace_vm_exec uses ptrace to trap
> system calls, process_vm_exec uses the process_vm_exec syscall to do the
> same thing.
>
> ptrace_vm_exec: 1446 ns/syscall
> process_vm_exec: 289 ns/syscall
>
> PS: This version is just a prototype. Its goal is to collect the initial
> feedback, to discuss the interfaces, and maybe to get some advice on
> implementation.
>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Anton Ivanov <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: Dmitry Safonov <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Jeff Dike <[email protected]>
> Cc: Mike Rapoport <[email protected]>
> Cc: Michael Kerrisk (man-pages) <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Richard Weinberger <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
>
> Andrei Vagin (4):
> signal: add a helper to restore a process state from sigcontext
> arch/x86: implement the process_vm_exec syscall
> arch/x86: allow to execute syscalls via process_vm_exec
> selftests: add tests for process_vm_exec
>
> arch/Kconfig | 15 ++
> arch/x86/Kconfig | 1 +
> arch/x86/entry/common.c | 19 +++
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> arch/x86/include/asm/sigcontext.h | 2 +
> arch/x86/kernel/Makefile | 1 +
> arch/x86/kernel/process_vm_exec.c | 160 ++++++++++++++++++
> arch/x86/kernel/signal.c | 125 ++++++++++----
> include/linux/entry-common.h | 2 +
> include/linux/process_vm_exec.h | 17 ++
> include/linux/sched.h | 7 +
> include/linux/syscalls.h | 6 +
> include/uapi/asm-generic/unistd.h | 4 +-
> include/uapi/linux/process_vm_exec.h | 8 +
> kernel/entry/common.c | 2 +-
> kernel/fork.c | 9 +
> kernel/sys_ni.c | 2 +
> .../selftests/process_vm_exec/Makefile | 7 +
> tools/testing/selftests/process_vm_exec/log.h | 26 +++
> .../process_vm_exec/process_vm_exec.c | 105 ++++++++++++
> .../process_vm_exec/process_vm_exec_fault.c | 111 ++++++++++++
> .../process_vm_exec/process_vm_exec_syscall.c | 81 +++++++++
> .../process_vm_exec/ptrace_vm_exec.c | 111 ++++++++++++
> 23 files changed, 785 insertions(+), 37 deletions(-)
> create mode 100644 arch/x86/kernel/process_vm_exec.c
> create mode 100644 include/linux/process_vm_exec.h
> create mode 100644 include/uapi/linux/process_vm_exec.h
> create mode 100644 tools/testing/selftests/process_vm_exec/Makefile
> create mode 100644 tools/testing/selftests/process_vm_exec/log.h
> create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec.c
> create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_fault.c
> create mode 100644 tools/testing/selftests/process_vm_exec/process_vm_exec_syscall.c
> create mode 100644 tools/testing/selftests/process_vm_exec/ptrace_vm_exec.c
>


--
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661
https://www.cambridgegreys.com/

2021-04-14 20:25:56

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, 2021-04-14 at 08:22 +0100, Anton Ivanov wrote:
> On 14/04/2021 06:52, Andrei Vagin wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
> >
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
>
> Certainly interesting, but will require um to rework most of its memory
> management and we will most likely need extra mm support to make use of
> it in UML. We are not likely to get away just with one syscall there.

Might help the seccomp mode though:

https://patchwork.ozlabs.org/project/linux-um/list/?series=231980

johannes


2021-04-14 22:46:36

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

* Andrei Vagin:

> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

We also have some vaguely related needs within the same address space:
running code on another thread, without modifying its stack, while it has
signal handlers blocked, and without causing system calls to fail with
EINTR. This can be used to implement certain kinds of memory barriers. It is
also necessary to implement set*id with POSIX semantics in userspace.
(Linux only changes the current thread credentials, POSIX requires
process-wide changes.) We currently use a signal for set*id, but it has
issues (it can be blocked, the signal could come from somewhere else, etc.).
We can't use signals for barriers because of the EINTR issue, and
because the signal context is stored on the stack.

Thanks,
Florian

2021-04-15 00:19:38

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

* Jann Horn:

> On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <[email protected]> wrote:
>>
>> * Andrei Vagin:
>>
>> > We already have process_vm_readv and process_vm_writev to read and write
>> > to a process memory faster than we can do this with ptrace. And now it
>> > is time for process_vm_exec that allows executing code in an address
>> > space of another process. We can do this with ptrace but it is much
>> > slower.
>> >
>> > = Use-cases =
>>
>> We also have some vaguely related within the same address space: running
>> code on another thread, without modifying its stack, while it has signal
>> handlers blocked, and without causing system calls to fail with EINTR.
>> This can be used to implement certain kinds of memory barriers.
>
> That's what the membarrier() syscall is for, right? Unless you don't
> want to register all threads for expedited membarrier use?

membarrier is not sufficiently powerful for revoking biased locks, for
example.

For the EINTR issue, <https://github.com/golang/go/issues/38836> is an
example. I believe CIFS has since seen a few fixes (after someone
reported that tar on CIFS wouldn't work because a SIGCHLD caused
utimensat to fail, and there isn't even a signal handler for SIGCHLD!),
but the time it took to get to this point doesn't give me confidence
that it is safe to send signals to a thread that is running unknown
code.

But as you explained regarding the set*id broadcast, it seems that if we
had this run-on-another-thread functionality, we would likely encounter
issues similar to those with SA_RESTART. We don't see the issue with
set*id today because it's a rare operation, and multi-threaded file
servers that need to change credentials frequently opt out of the set*id
broadcast anyway. (What I have in mind is a future world where any
printf call, any malloc call, can trigger such a broadcast.)

The cross-VM CRIU scenario would probably land somewhere in between (not
quite the printf/malloc level, but more frequent than set*id).

Thanks,
Florian

2021-04-15 00:29:33

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, Apr 14, 2021 at 2:20 PM Florian Weimer <[email protected]> wrote:
>
> * Jann Horn:
>
> > On Wed, Apr 14, 2021 at 12:27 PM Florian Weimer <[email protected]> wrote:
> >>
> >> * Andrei Vagin:
> >>
> >> > We already have process_vm_readv and process_vm_writev to read and write
> >> > to a process memory faster than we can do this with ptrace. And now it
> >> > is time for process_vm_exec that allows executing code in an address
> >> > space of another process. We can do this with ptrace but it is much
> >> > slower.
> >> >
> >> > = Use-cases =
> >>
> >> We also have some vaguely related within the same address space: running
> >> code on another thread, without modifying its stack, while it has signal
> >> handlers blocked, and without causing system calls to fail with EINTR.
> >> This can be used to implement certain kinds of memory barriers.
> >
> > That's what the membarrier() syscall is for, right? Unless you don't
> > want to register all threads for expedited membarrier use?
>
> membarrier is not sufficiently powerful for revoking biased locks, for
> example.

But on Linux >=5.10, together with rseq, it is, right? Then lock
acquisition could look roughly like this, in pseudo-C (yes, I know,
real rseq doesn't quite look like that, you'd need inline asm for that
unless the compiler adds special support for this):


enum local_state {
  STATE_FREE_OR_BIASED,
  STATE_LOCKED
};
#define OWNER_LOCKBIT (1U<<31)
#define OWNER_WAITER_BIT (1U<<30) /* notify futex when OWNER_LOCKBIT is cleared */
struct biased_lock {
  unsigned int owner_with_lockbit;
  enum local_state local_state;
};

void lock(struct biased_lock *L) {
  unsigned int my_tid = THREAD_SELF->tid;
  RSEQ_SEQUENCE_START(); // restart here on failure
  if (READ_ONCE(L->owner_with_lockbit) == my_tid) {
    if (READ_ONCE(L->local_state) == STATE_LOCKED) {
      RSEQ_SEQUENCE_END();
      /*
       * Deadlock, abort execution.
       * Note that we are not necessarily actually *holding* the lock;
       * this can also happen if we entered a signal handler while we
       * were in the process of acquiring the lock.
       * But in that case it could just as well have happened that we
       * already grabbed the lock, so the caller is wrong anyway.
       */
      fatal_error();
    }
    RSEQ_COMMIT(L->local_state = STATE_LOCKED);
    return; /* fastpath success */
  }
  RSEQ_SEQUENCE_END();

  /* slowpath */
  /* acquire and lock owner field */
  unsigned int old_owner_with_lockbit;
  while (1) {
    old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
    if (old_owner_with_lockbit & OWNER_LOCKBIT) {
      if (!__sync_bool_compare_and_swap(&L->owner_with_lockbit,
          old_owner_with_lockbit, my_tid | OWNER_LOCKBIT | OWNER_WAITER_BIT))
        continue;
      futex(&L->owner_with_lockbit, FUTEX_WAIT,
            old_owner_with_lockbit, NULL, NULL, 0);
      continue;
    } else {
      if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
          old_owner_with_lockbit, my_tid | OWNER_LOCKBIT))
        break;
    }
  }

  /*
   * ensure old owner won't lock local_state anymore.
   * we only have to worry about the owner that directly preceded us here;
   * it will have done this step for the owners that preceded it before
   * clearing the LOCKBIT; so if we were the old owner, we don't have to sync.
   */
  if (old_owner_with_lockbit != my_tid) {
    if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ, 0, 0))
      fatal_error();
  }

  /*
   * As soon as the lock becomes STATE_FREE_OR_BIASED, we own it; but
   * at this point it might still be locked.
   */
  while (READ_ONCE(L->local_state) == STATE_LOCKED) {
    futex(&L->local_state, FUTEX_WAIT, STATE_LOCKED, NULL, NULL, 0);
  }

  /* OK, now the lock is biased to us and we can grab it. */
  WRITE_ONCE(L->local_state, STATE_LOCKED);

  /* drop lockbit */
  while (1) {
    old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
    if (__sync_bool_compare_and_swap(&L->owner_with_lockbit,
        old_owner_with_lockbit, my_tid))
      break;
  }
  if (old_owner_with_lockbit & OWNER_WAITER_BIT)
    futex(&L->owner_with_lockbit, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

void unlock(struct biased_lock *L) {
unsigned int my_tid = THREAD_SELF->tid;

/*
* If we run before the membarrier(), the lock() path will immediately
* see the lock as uncontended, and we don't need to call futex().
* If we run after the membarrier(), the ->owner_with_lockbit read
* here will observe the new owner and we'll wake the futex.
*/
RSEQ_SEQUENCE_START();
unsigned int old_owner_with_lockbit = READ_ONCE(L->owner_with_lockbit);
RSEQ_COMMIT(WRITE_ONCE(L->local_state, STATE_FREE_OR_BIASED));
if (old_owner_with_lockbit != my_tid)
futex(&L->local_state, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}

2021-04-15 00:37:04

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On 04/13, Andrei Vagin wrote:
>
> +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> +{
> + struct task_struct *tsk = current;
> + struct mm_struct *active_mm;
> +
> + task_lock(tsk);
> + /* Hold off tlb flush IPIs while switching mm's */
> + local_irq_disable();
> +
> + sync_mm_rss(prev_mm);
> +
> + vmacache_flush(tsk);
> +
> + active_mm = tsk->active_mm;
> + if (active_mm != target_mm) {
> + mmgrab(target_mm);
> + tsk->active_mm = target_mm;
> + }
> + tsk->mm = target_mm;
> + switch_mm_irqs_off(active_mm, target_mm, tsk);
> + local_irq_enable();
> + task_unlock(tsk);
> +#ifdef finish_arch_post_lock_switch
> + finish_arch_post_lock_switch();
> +#endif
> +
> + if (active_mm != target_mm)
> + mmdrop(active_mm);
> +}

I think this should be unified with kthread_use_mm() somehow...

And does it really need the "prev_mm" argument? It must be tsk->mm, no?

Oleg.

2021-04-15 00:49:30

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...

We definitely can invent more specific interfaces for each of these
problems. Sure, they will handle their use-cases a bit better than this
generic one. But do we want to have two very specific interfaces with
separate kernel implementations? My previous experience has shown that
the kernel community doesn't like interfaces that are specific to only
one narrow use-case.

So when I was working on process_vm_exec, I was thinking about how to
make one interface that would be good enough for all these use-cases.

>
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
>
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
>
> 1. mm_struct-based:
> a set of syscalls to create a new mm_struct,
> change memory mappings under that mm_struct, and switch to it
> 2. pagetable-mirroring-based:
> like /dev/kvm, an API to create a new pagetable, mirror parts of
> the mm_struct's pagetables over into it with modified permissions
> (like KVM_SET_USER_MEMORY_REGION),
> and run code under that context.
> page fault handling would first handle the fault against mm->pgd
> as normal, then mirror the PTE over into the secondary pagetables.
> invalidation could be handled with MMU notifiers.

We are ready to discuss this sort of interface if the community agrees
to accept it. Are there any other users besides sandboxes that would
need something like this? Is the sandbox use-case enough to justify the
addition of this interface?

>
> > Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> > process properties can be received only from the process itself. Right
> > now, we use a parasite code that is injected into the process. We do
> > this with ptrace but it is slow, unsafe, and tricky.
>
> But this API will only let you run code under the *mm* of the target
> process, not fully in the context of a target *task*, right? So you
> still won't be able to use this for accessing anything other than
> memory? That doesn't seem very generically useful to me.

You are right, this will not rid us of the need to run parasite code.
I wrote that it will make the process of injecting the parasite code a
bit simpler.

>
> Also, I don't doubt that anything involving ptrace is kinda tricky,
> but it would be nice to have some more detail on what exactly makes
> this slow, unsafe and tricky. Are there API additions for ptrace that
> would make this work better? I imagine you're thinking of things like
> an API for injecting a syscall into the target process without having
> to first somehow find an existing SYSCALL instruction in the target
> process?


You described the first problem correctly. We need to find or inject a
syscall instruction into a target process.
Right now, we need to perform these steps to execute a system call:

* inject the syscall instruction (PTRACE_PEEKDATA/PTRACE_POKEDATA)
* get the original registers
* set new registers
* get the signal mask
* block signals
* resume the process
* stop it on the next syscall-exit
* get the registers
* restore the original registers
* restore the signal mask

One of the CRIU principles is to avoid changing a process's state, so if
criu is interrupted, processes must be resumed and continue running. The
procedure of injecting a system call creates a window when a process is
in an inconsistent state, and CRIU disappearing at such a moment would
be fatal for the process. We don't think that we can eliminate such
windows, but we want to make them smaller.

In CRIU, we have a self-healing parasite. The idea is to inject the
parasite code with a signal frame that contains the original process
state. The parasite runs in an "RPC daemon mode" and gets commands from
criu via a unix socket. If it detects that criu has disappeared, it
calls rt_sigreturn and resumes the original process.

As for the performance of ptrace, there are a few reasons why it is
slow. First, there is the number of steps we need to perform. Second,
there are the two synchronous context switches. Even if we solve the
first problem with a new ptrace command, it will not be enough to stop
using a parasite in CRIU.

>
> > process_vm_exec can
> > simplify the process of injecting a parasite code and it will allow
> > pre-dump memory without stopping processes. The pre-dump here is when we
> > enable a memory tracker and dump the memory while a process is continue
> > running. On each interaction we dump memory that has been changed from
> > the previous iteration. In the final step, we will stop processes and
> > dump their full state. Right now the most effective way to dump process
> > memory is to create a set of pipes and splice memory into these pipes
> > from the parasite code. With process_vm_exec, we will be able to call
> > vmsplice directly. It means that we will not need to stop a process to
> > inject the parasite code.
>
> Alternatively you could add splice support to /proc/$pid/mem or add a
> syscall similar to process_vm_readv() that splices into a pipe, right?

We sent patches to introduce process_vm_splice:
https://lore.kernel.org/patchwork/cover/871116/

but they were not merged, and the main reason was a lack of enough users
to justify its addition.

2021-04-16 19:57:40

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Tue, Apr 13, 2021 at 10:52:13PM -0700, Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.

I'd like to add that there are cases when using ptrace is hardly possible:
in my situation, one process needs to modify the address space of another
process while that target process is blocked in a pagefault. From
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/notes.txt#L149-171 ,
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L395-397 :

---- 8< ----
Client cannot be ptraced while under pagefault
==============================================

We cannot use ptrace to run code on client thread that is under pagefault:

The kernel sends SIGSTOP to interrupt tracee, but the signal will be
processed only when the process returns from kernel space, e.g. here

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/entry/common.c?id=v4.19-rc8-151-g23469de647c4#n160

This way the tracer won't receive obligatory information that tracee
stopped (via wait...) and even though ptrace(ATTACH) succeeds, all other
ptrace commands will fail:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n1140
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/ptrace.c?id=v4.19-rc8-151-g23469de647c4#n207

My original idea was to use ptrace to run code in the process to change its
memory mappings, while the triggering process is under pagefault/read
to wcfs, and the above shows it won't work - trying to ptrace the
client from under wcfs will just block forever (the kernel will be
waiting for read operation to finish for ptrace, and read will be first
waiting on ptrace stopping to complete = deadlock)

...

// ( one could imagine adjusting mappings synchronously via running
// wcfs-trusted code via ptrace that wcfs injects into clients, but ptrace
// won't work when client thread is blocked under pagefault or syscall(^) )
---- 8< ----

To work around that, I need to add a special thread into the target
process and implement a custom additional "isolation protocol" between my
filesystem and the client processes that use it:

https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/wcfs.go#L94-182
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.h#L20-96
https://lab.nexedi.com/kirr/wendelin.core/blob/539ec405/wcfs/client/wcfs.cpp#L24-203

Most parts of that dance would be much easier, or completely
unnecessary, if it were possible to reliably make changes to the address
space of a target process from outside.

Kirill

2021-04-23 07:05:38

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Wed, Apr 14, 2021 at 07:09:15PM +0200, Oleg Nesterov wrote:
> On 04/13, Andrei Vagin wrote:
> >
> > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > +{
> > + struct task_struct *tsk = current;
> > + struct mm_struct *active_mm;
> > +
> > + task_lock(tsk);
> > + /* Hold off tlb flush IPIs while switching mm's */
> > + local_irq_disable();
> > +
> > + sync_mm_rss(prev_mm);
> > +
> > + vmacache_flush(tsk);
> > +
> > + active_mm = tsk->active_mm;
> > + if (active_mm != target_mm) {
> > + mmgrab(target_mm);
> > + tsk->active_mm = target_mm;
> > + }
> > + tsk->mm = target_mm;
> > + switch_mm_irqs_off(active_mm, target_mm, tsk);
> > + local_irq_enable();
> > + task_unlock(tsk);
> > +#ifdef finish_arch_post_lock_switch
> > + finish_arch_post_lock_switch();
> > +#endif
> > +
> > + if (active_mm != target_mm)
> > + mmdrop(active_mm);
> > +}
>
> I think this should be unified with kthread_use_mm() somehow...

I agree.

>
> And does it really need the "prev_mm" argument? It must be tsk->mm, no?

No, it doesn't. It is a leftover from unuse_mm(). BTW, why do we pass mm to kthread_unuse_mm()?

Thanks,
Andrei.

2021-06-28 16:16:21

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> This change introduces the new system call:
> process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
>
> process_vm_exec allows to execute the current process in an address
> space of another process.
[...]

I still think that this whole API is fundamentally the wrong approach
because it tries to shoehorn multiple usecases with different
requirements into a single API. But that aside:

> +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> +{
> + struct task_struct *tsk = current;
> + struct mm_struct *active_mm;
> +
> + task_lock(tsk);
> + /* Hold off tlb flush IPIs while switching mm's */
> + local_irq_disable();
> +
> + sync_mm_rss(prev_mm);
> +
> + vmacache_flush(tsk);
> +
> + active_mm = tsk->active_mm;
> + if (active_mm != target_mm) {
> + mmgrab(target_mm);
> + tsk->active_mm = target_mm;
> + }
> + tsk->mm = target_mm;

I'm pretty sure you're not currently allowed to overwrite the ->mm
pointer of a userspace thread. For example, zap_threads() assumes that
all threads running under a process have the same ->mm. (And if you're
fiddling with ->mm stuff, you should probably CC linux-mm@.)

As far as I understand, only kthreads are allowed to do this (as
implemented in kthread_use_mm()).

> + switch_mm_irqs_off(active_mm, target_mm, tsk);
> + local_irq_enable();
> + task_unlock(tsk);
> +#ifdef finish_arch_post_lock_switch
> + finish_arch_post_lock_switch();
> +#endif
> +
> + if (active_mm != target_mm)
> + mmdrop(active_mm);
> +}

2021-06-28 16:33:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall



On Mon, Jun 28, 2021, at 9:13 AM, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows to execute the current process in an address
> > space of another process.
> [...]
>
> I still think that this whole API is fundamentally the wrong approach
> because it tries to shoehorn multiple usecases with different
> requirements into a single API. But that aside:
>
> > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > +{
> > + struct task_struct *tsk = current;
> > + struct mm_struct *active_mm;
> > +
> > + task_lock(tsk);
> > + /* Hold off tlb flush IPIs while switching mm's */
> > + local_irq_disable();
> > +
> > + sync_mm_rss(prev_mm);
> > +
> > + vmacache_flush(tsk);
> > +
> > + active_mm = tsk->active_mm;
> > + if (active_mm != target_mm) {
> > + mmgrab(target_mm);
> > + tsk->active_mm = target_mm;
> > + }
> > + tsk->mm = target_mm;
>
> I'm pretty sure you're not currently allowed to overwrite the ->mm
> pointer of a userspace thread. For example, zap_threads() assumes that
> all threads running under a process have the same ->mm. (And if you're
> fiddling with ->mm stuff, you should probably CC linux-mm@.)

exec_mmap() does it, so it can’t be entirely impossible.

2021-06-28 23:39:08

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Mon, Jun 28, 2021 at 6:30 PM Andy Lutomirski <[email protected]> wrote:
> On Mon, Jun 28, 2021, at 9:13 AM, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > > This change introduces the new system call:
> > > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> > > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> > >
> > > process_vm_exec allows to execute the current process in an address
> > > space of another process.
> > [...]
> >
> > I still think that this whole API is fundamentally the wrong approach
> > because it tries to shoehorn multiple usecases with different
> > requirements into a single API. But that aside:
> >
> > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > +{
> > > + struct task_struct *tsk = current;
> > > + struct mm_struct *active_mm;
> > > +
> > > + task_lock(tsk);
> > > + /* Hold off tlb flush IPIs while switching mm's */
> > > + local_irq_disable();
> > > +
> > > + sync_mm_rss(prev_mm);
> > > +
> > > + vmacache_flush(tsk);
> > > +
> > > + active_mm = tsk->active_mm;
> > > + if (active_mm != target_mm) {
> > > + mmgrab(target_mm);
> > > + tsk->active_mm = target_mm;
> > > + }
> > > + tsk->mm = target_mm;
> >
> > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > pointer of a userspace thread. For example, zap_threads() assumes that
> > all threads running under a process have the same ->mm. (And if you're
> > fiddling with ->mm stuff, you should probably CC linux-mm@.)
>
> exec_mmap() does it, so it can’t be entirely impossible.

Yeah, true, execve can do it - I guess the thing that makes that
special is that it's running after de_thread(), so it's guaranteed to
be single-threaded?

2021-06-28 23:39:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

Jann Horn <[email protected]> writes:

> On Mon, Jun 28, 2021 at 6:30 PM Andy Lutomirski <[email protected]> wrote:
>> On Mon, Jun 28, 2021, at 9:13 AM, Jann Horn wrote:
>> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
>> > > This change introduces the new system call:
>> > > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
>> > > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
>> > >
>> > > process_vm_exec allows to execute the current process in an address
>> > > space of another process.
>> > [...]
>> >
>> > I still think that this whole API is fundamentally the wrong approach
>> > because it tries to shoehorn multiple usecases with different
>> > requirements into a single API. But that aside:
>> >
>> > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
>> > > +{
>> > > + struct task_struct *tsk = current;
>> > > + struct mm_struct *active_mm;
>> > > +
>> > > + task_lock(tsk);
>> > > + /* Hold off tlb flush IPIs while switching mm's */
>> > > + local_irq_disable();
>> > > +
>> > > + sync_mm_rss(prev_mm);
>> > > +
>> > > + vmacache_flush(tsk);
>> > > +
>> > > + active_mm = tsk->active_mm;
>> > > + if (active_mm != target_mm) {
>> > > + mmgrab(target_mm);
>> > > + tsk->active_mm = target_mm;
>> > > + }
>> > > + tsk->mm = target_mm;
>> >
>> > I'm pretty sure you're not currently allowed to overwrite the ->mm
>> > pointer of a userspace thread. For example, zap_threads() assumes that
>> > all threads running under a process have the same ->mm. (And if you're
>> > fiddling with ->mm stuff, you should probably CC linux-mm@.)
>>
>> exec_mmap() does it, so it can’t be entirely impossible.
>
> Yeah, true, execve can do it - I guess the thing that makes that
> special is that it's running after de_thread(), so it's guaranteed to
> be single-threaded?

Even the implementation detail of swapping the mm aside, the very idea
of swapping the mm is completely broken, as endless system calls
depend upon the state held in task_struct. io_uring just tried running
system calls of a process in a different context, and we ultimately had
to make the threads part of the original process to make enough things
work to keep the problem tractable.

System calls deeply and fundamentally depend on task_struct and
signal_struct.

I can think of two possibilities.
1) Hijack an existing process thread.
2) Inject a new thread into an existing process.

Anything else is just an exercise in trouble. Of these, I think hijacking
an existing thread is the only one that won't require lots of tracking
down of special cases. I seem to remember audit is still struggling
with how to properly audit io_uring threads.

Eric

2021-06-29 01:09:29

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Mon, Jun 28, 2021 at 01:18:07PM -0500, Eric W. Biederman wrote:
> Jann Horn <[email protected]> writes:
>
> > On Mon, Jun 28, 2021 at 6:30 PM Andy Lutomirski <[email protected]> wrote:
> >> On Mon, Jun 28, 2021, at 9:13 AM, Jann Horn wrote:
> >> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> >> > > This change introduces the new system call:
> >> > > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> >> > > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >> > >
> >> > > process_vm_exec allows to execute the current process in an address
> >> > > space of another process.
> >> > [...]
> >> >
> >> > I still think that this whole API is fundamentally the wrong approach
> >> > because it tries to shoehorn multiple usecases with different
> >> > requirements into a single API. But that aside:
> >> >
> >> > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> >> > > +{
> >> > > + struct task_struct *tsk = current;
> >> > > + struct mm_struct *active_mm;
> >> > > +
> >> > > + task_lock(tsk);
> >> > > + /* Hold off tlb flush IPIs while switching mm's */
> >> > > + local_irq_disable();
> >> > > +
> >> > > + sync_mm_rss(prev_mm);
> >> > > +
> >> > > + vmacache_flush(tsk);
> >> > > +
> >> > > + active_mm = tsk->active_mm;
> >> > > + if (active_mm != target_mm) {
> >> > > + mmgrab(target_mm);
> >> > > + tsk->active_mm = target_mm;
> >> > > + }
> >> > > + tsk->mm = target_mm;
> >> >
> >> > I'm pretty sure you're not currently allowed to overwrite the ->mm
> >> > pointer of a userspace thread. For example, zap_threads() assumes that
> >> > all threads running under a process have the same ->mm. (And if you're
> >> > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> >>
> >> exec_mmap() does it, so it can’t be entirely impossible.
> >
> > Yeah, true, execve can do it - I guess the thing that makes that
> > special is that it's running after de_thread(), so it's guaranteed to
> > be single-threaded?
>
> Even the implementation detail of swapping the mm aside. Even the idea
> of swaping the mm is completely broken, as an endless system calls
> depend upon the state held in task_struct. io_uring just tried running
> system calls of a process in a different context and we ultimately had
> to make the threads part of the original process to make enough things
> work to keep the problem tractable.
>
> System calls deeply and fundamentally depend on task_struct and
> signal_struct.

In contrast to io_uring, process_vm_exec doesn't intend to run system
calls in the context of the target process. We initially declare that
system calls are executed in the context of the current process, just
with another mm. If we are talking about user-mode kernels, they will
need just two system calls: mmap and munmap. In the case of CRIU,
vmsplice will be used too.

>
> I can think of two possibilities.
> 1) Hijack and existing process thread.
> 2) Inject a new thread into an existing process.

I am not sure that I understand what you mean here, but it sounds like
we would need to do a context switch to execute anything in the context
of a hijacked thread. If I am right, that kills the main idea of
process_vm_exec. If I misunderstand your idea, maybe you can describe it
in more detail.

Thanks,
Andrei

2021-07-02 06:28:17

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows to execute the current process in an address
> > space of another process.
> [...]
>
> I still think that this whole API is fundamentally the wrong approach
> because it tries to shoehorn multiple usecases with different
> requirements into a single API. But that aside:

Here, I can't agree with you, but this is discussed in the parallel
thread.

>
> > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > +{
> > + struct task_struct *tsk = current;
> > + struct mm_struct *active_mm;
> > +
> > + task_lock(tsk);
> > + /* Hold off tlb flush IPIs while switching mm's */
> > + local_irq_disable();
> > +
> > + sync_mm_rss(prev_mm);
> > +
> > + vmacache_flush(tsk);
> > +
> > + active_mm = tsk->active_mm;
> > + if (active_mm != target_mm) {
> > + mmgrab(target_mm);
> > + tsk->active_mm = target_mm;
> > + }
> > + tsk->mm = target_mm;
>
> I'm pretty sure you're not currently allowed to overwrite the ->mm
> pointer of a userspace thread. For example, zap_threads() assumes that
> all threads running under a process have the same ->mm. (And if you're
> fiddling with ->mm stuff, you should probably CC linux-mm@.)
>
> As far as I understand, only kthreads are allowed to do this (as
> implemented in kthread_use_mm()).

kthread_use_mm() was renamed from use_mm() in the v5.8 kernel. Before
that, it wasn't used for user processes in the kernel, but it was
exported for modules, and we used it without any visible problems. We
understood that there could be some issues like zap_threads, and that
was one of the reasons why we decided to introduce this system call.

I understand that there are no places in the kernel where we change the
mm of user threads back and forth, but are there any real concerns why
we should not do that? I agree that zap_threads should be fixed, but it
will be an easy one.

>
> > + switch_mm_irqs_off(active_mm, target_mm, tsk);
> > + local_irq_enable();
> > + task_unlock(tsk);
> > +#ifdef finish_arch_post_lock_switch
> > + finish_arch_post_lock_switch();
> > +#endif
> > +
> > + if (active_mm != target_mm)
> > + mmdrop(active_mm);
> > +}

2021-07-02 07:10:24

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...
>
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
>
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
>
> 1. mm_struct-based:
> a set of syscalls to create a new mm_struct,
> change memory mappings under that mm_struct, and switch to it

I like the idea of having a handle for an mm. Instead of a pid, we will
pass this handle to process_vm_exec. We have pidfd for processes, and we
can introduce mmfd for mm_struct.


> 2. pagetable-mirroring-based:
> like /dev/kvm, an API to create a new pagetable, mirror parts of
> the mm_struct's pagetables over into it with modified permissions
> (like KVM_SET_USER_MEMORY_REGION),
> and run code under that context.
> page fault handling would first handle the fault against mm->pgd
> as normal, then mirror the PTE over into the secondary pagetables.
> invalidation could be handled with MMU notifiers.
>

I found this idea interesting and decided to look at it more closely.
After reading the kernel code for a few days, I realized that it would
not be easy to implement something like this, but more importantly, I
don't understand what problem it solves. Will it simplify the
user-space code? I don't think so. Will it improve performance? That is
unclear to me too.

First, in the KVM case, we have a few big linear mappings and need to
support one "shadow" address space. In the case of sandboxes, we can
have a tremendous number of mappings and many address spaces that we
need to manage. Memory mappings will be mapped at different addresses
in the supervisor address space and the "guest" address spaces. If guest
address spaces do not have their own mm_structs, we will need to reinvent
VMAs in some form. If guest address spaces have mm_structs, this will
look similar to https://lwn.net/Articles/830648/.

Second, each pagetable is tied to an mm_struct. You suggest creating
new pagetables that will not have their own mm_struct-s (sorry if I
misunderstood something). I am not sure that it will be easy to
implement. How many corner cases will there be?

As for page faults in a secondary address space, we would need to find
the fault address in the main address space, handle the fault there, and
then mirror the PTE to the secondary pagetable. Effectively, it means
that page faults would be handled in two address spaces. Right now, we
use memfd and shared mappings. It means that each fault is handled only
in one address space, and we map a guest memory region into the
supervisor address space only when we need to access it. A large portion
of guest anonymous memory is never mapped into the supervisor address
space. Will the overhead of mirrored address spaces be smaller than
memfd shared mappings? I am not sure.

Third, this approach does not remove the need for process_vm_exec. We
will still need to switch to a guest address space with a specified
state and switch back on faults or syscalls. If the main concern is the
ability to run syscalls on a remote mm, we can think about how to fix
this. I see two ways we could go here:

* Specify the exact list of system calls that are allowed. The first
three candidates are mmap, munmap, and vmsplice.

* Instead of allowing arbitrary system calls, we can implement this in
the form of commands. In the case of sandboxes, we need to implement
only two commands, to create and destroy memory mappings in a target
address space.

Thanks,
Andrei

2021-07-02 08:53:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall


I'm terrified of all of this...

On Tue, Apr 13, 2021 at 10:52:15PM -0700, Andrei Vagin wrote:

> +long swap_vm_exec_context(struct sigcontext __user *uctx)
> +{
> + struct sigcontext ctx = {};
> + sigset_t set = {};
> +
> +
> + if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> + return -EFAULT;
> + /* A floating point state is managed from user-space. */
> + if (ctx.fpstate != 0)
> + return -EINVAL;
> + if (!user_access_begin(uctx, sizeof(*uctx)))
> + return -EFAULT;
> + unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> + user_access_end();

But here you save the sigcontext without FPU state.

> +
> + if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> + goto badframe;

And here you restore sigcontext, *with* FPU state. At which point your
FPU state is irrecoverably lost.

Also, I'm not at all convinced this can ever do the right thing when the
tasks don't agree on what the FPU state is. I suppose in the best case
the save will EFAULT.

> +
> + return 0;
> +Efault:
> + user_access_end();
> +badframe:
> + signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> + return -EFAULT;
> +}

2021-07-02 11:53:06

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Fri, Jul 2, 2021 at 8:25 AM Andrei Vagin <[email protected]> wrote:
> On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > +{
> > > + struct task_struct *tsk = current;
> > > + struct mm_struct *active_mm;
> > > +
> > > + task_lock(tsk);
> > > + /* Hold off tlb flush IPIs while switching mm's */
> > > + local_irq_disable();
> > > +
> > > + sync_mm_rss(prev_mm);
> > > +
> > > + vmacache_flush(tsk);
> > > +
> > > + active_mm = tsk->active_mm;
> > > + if (active_mm != target_mm) {
> > > + mmgrab(target_mm);
> > > + tsk->active_mm = target_mm;
> > > + }
> > > + tsk->mm = target_mm;
> >
> > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > pointer of a userspace thread. For example, zap_threads() assumes that
> > all threads running under a process have the same ->mm. (And if you're
> > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> >
> > As far as I understand, only kthreads are allowed to do this (as
> > implemented in kthread_use_mm()).
>
> kthread_use_mm() was renamed from use_mm() in the v5.8 kernel. Before
> that, it wasn't used for user processes in the kernel, but it was
> exported for modules, and we used it without any visible problems. We
> understood that there could be some issues like zap_threads, and that
> was one of the reasons why we decided to introduce this system call.
>
> I understand that there are no places in the kernel where we change the
> mm of user threads back and forth, but are there any real concerns why
> we should not do that? I agree that zap_threads should be fixed, but it
> will be the easy one.

My point is that if you break a preexisting assumption like this,
you'll have to go through the kernel and search for places that rely
on this assumption, and fix them up, which may potentially require
thinking about what kinds of semantics would actually be appropriate
there. Like the MCE killing logic (collect_procs_anon() and such). And
current_is_single_threaded(), in which the current patch probably
leads to logic security bugs. And __uprobe_perf_filter(). Before my
refactoring of the ELF coredump logic in kernel 5.10 (commit
b2767d97f5ff75 and the ones before it), you'd have also probably
created memory corruption bugs in races between elf_core_dump() and
syscalls like mmap()/munmap(). (Note that this is not necessarily an
exhaustive list.)

2021-07-02 15:22:54

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <[email protected]> wrote:
> On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > > We already have process_vm_readv and process_vm_writev to read and write
> > > to a process memory faster than we can do this with ptrace. And now it
> > > is time for process_vm_exec that allows executing code in an address
> > > space of another process. We can do this with ptrace but it is much
> > > slower.
> > >
> > > = Use-cases =
> >
> > It seems to me like your proposed API doesn't really fit either one of
> > those usecases well...
> >
> > > Here are two known use-cases. The first one is “application kernel”
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes that
> > > are used to manage guest address spaces. Guest code is executed in the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > significantly speed them up.
> >
> > In this case, since you really only want an mm_struct to run code
> > under, it seems weird to create a whole task with its own PID and so
> > on. It seems to me like something similar to the /dev/kvm API would be
> > more appropriate here? Implementation options that I see for that
> > would be:
> >
> > 1. mm_struct-based:
> > a set of syscalls to create a new mm_struct,
> > change memory mappings under that mm_struct, and switch to it
>
> I like the idea to have a handle for mm. Instead of pid, we will pass
> this handle to process_vm_exec. We have pidfd for processes and we can
> introduce mmfd for mm_struct.

I personally think that it might be quite unwieldy when it comes to
the restrictions you get from trying to have shared memory with the
owning process - I'm having trouble figuring out how you can implement
copy-on-write semantics without relying on copy-on-write logic in the
host OS and without being able to use userfaultfd.

But if that's not a problem somehow, and you can find some reasonable
way to handle memory usage accounting and fix up everything that
assumes that multithreaded userspace threads don't switch ->mm, I
guess this might work for your usecase.

> > 2. pagetable-mirroring-based:
> > like /dev/kvm, an API to create a new pagetable, mirror parts of
> > the mm_struct's pagetables over into it with modified permissions
> > (like KVM_SET_USER_MEMORY_REGION),
> > and run code under that context.
> > page fault handling would first handle the fault against mm->pgd
> > as normal, then mirror the PTE over into the secondary pagetables.
> > invalidation could be handled with MMU notifiers.
> >
>
> I found this idea interesting and decided to look at it more closely.
> After reading the kernel code for a few days, I realized that it would
> not be easy to implement something like this,

Yeah, it might need architecture-specific code to flip the page tables
on userspace entry/exit, and maybe also for mirroring them. And for
the TLB flushing logic...

> but more important is that
> I don’t understand what problem it solves. Will it simplify the
> user-space code? I don’t think so. Will it improve performance? It is
> unclear for me too.

Some reasons I can think of are:

- direct guest memory access: I imagined you'd probably want to be able to
directly access userspace memory from the supervisor, and
with this approach that'd become easy.

- integration with on-demand paging of the host OS: You'd be able to
create things like file-backed copy-on-write mappings from the
host filesystem, or implement your own mappings backed by some kind
of storage using userfaultfd.

- sandboxing: For sandboxing usecases (not your usecase), it would be
possible to e.g. create a read-only clone of the entire address space of a
process and give write access to specific parts of it, or something
like that.
These address space clones could potentially be created and destroyed
fairly quickly.

- accounting: memory usage would be automatically accounted to the
supervisor process, so even without a parasite process, you'd be able
to see the memory usage correctly in things like "top".

- small (non-pageable) memory footprint in the host kernel:
The only things the host kernel would have to persistently store would be
the normal MM data structures for the supervisor plus the mappings
from "guest userspace" memory ranges to supervisor memory ranges;
userspace pagetables would be discardable, and could even be shared
with those of the supervisor in cases where the alignment fits.
So with this, large anonymous mappings with 4K granularity only cost you
~0.20% overhead across host and guest address space; without this, if you
used shared mappings instead, you'd pay twice that for every 2MiB range
from which parts are accessed in both contexts, plus probably another
~0.2% or so for the "struct address_space"?

- all memory-management-related syscalls could be directly performed
in the "kernel" process

But yeah, some of those aren't really relevant for your usecase, and I
guess things like the accounting aspect could just as well be solved
differently...

> First, in the KVM case, we have a few big linear mappings and need to
> support one “shadow” address space. In the case of sandboxes, we can
> have a tremendous amount of mappings and many address spaces that we
> need to manage. Memory mappings will be mapped with different addresses
> in a supervisor address space and “guest” address spaces. If guest
> address spaces will not have their mm_structs, we will need to reinvent
> vma-s in some form. If guest address spaces have mm_structs, this will
> look similar to https://lwn.net/Articles/830648/.
>
> Second, each pagetable is tied to an mm_struct. You suggest creating
> new pagetables that will not have their own mm_structs (sorry if I
> misunderstood something).

Yeah, that's what I had in mind, page tables without an mm_struct.

> I am not sure that it will be easy to
> implement. How many corner cases will there be?

Yeah, it would require some work around TLB flushing and entry/exit
from userspace. But from a high-level perspective it feels to me like
a change with less systematic impact. Maybe I'm wrong about that.

> As for page faults in a secondary address space, we will need to find a
> fault address in the main address space, handle the fault there and then
> mirror the PTE to the secondary pagetable.

Right.

> Effectively, it means that
> page faults will be handled in two address spaces. Right now, we use
> memfd and shared mappings. It means that each fault is handled only in
> one address space, and we map a guest memory region to the supervisor
> address space only when we need to access it. A large portion of guest
> anonymous memory is never mapped to the supervisor address space.
> Will the overhead of mirrored address spaces be smaller than that of
> memfd shared mappings? I am not sure.

But as long as the mappings are sufficiently big and aligned properly,
or you explicitly manage the supervisor address space, some of that
cost disappears: E.g. even if a page is mapped in both address spaces,
you wouldn't have a memory cost for the second mapping if the page
tables are shared.

> Third, this approach will not get rid of having process_vm_exec. We will
> need to switch to a guest address space with a specified state and
> switch back on faults or syscalls.

Yeah, you'd still need a syscall for running code under a different
set of page tables. But that's something that KVM _almost_ already
does.

> If the main concern is the ability to
> run syscalls on a remote mm, we can think about how to fix this. I see
> two ways we could go here:
>
> * Specify the exact list of system calls that are allowed. The first
> three candidates are mmap, munmap, and vmsplice.
>
> * Instead of allowing us to run system calls, we can implement this in
> the form of commands. In the case of sandboxes, we need to implement
> only two commands to create and destroy memory mappings in a target
> address space.

FWIW, there is precedent for something similar: The Android folks
already added process_madvise() for remotely messing with the VMAs of
another process to some degree.

2021-07-02 20:43:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall



On Fri, Jul 2, 2021, at 4:51 AM, Jann Horn wrote:
> On Fri, Jul 2, 2021 at 8:25 AM Andrei Vagin <[email protected]> wrote:
> > On Mon, Jun 28, 2021 at 06:13:29PM +0200, Jann Horn wrote:
> > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > > > +static void swap_mm(struct mm_struct *prev_mm, struct mm_struct *target_mm)
> > > > +{
> > > > + struct task_struct *tsk = current;
> > > > + struct mm_struct *active_mm;
> > > > +
> > > > + task_lock(tsk);
> > > > + /* Hold off tlb flush IPIs while switching mm's */
> > > > + local_irq_disable();
> > > > +
> > > > + sync_mm_rss(prev_mm);
> > > > +
> > > > + vmacache_flush(tsk);
> > > > +
> > > > + active_mm = tsk->active_mm;
> > > > + if (active_mm != target_mm) {
> > > > + mmgrab(target_mm);
> > > > + tsk->active_mm = target_mm;
> > > > + }
> > > > + tsk->mm = target_mm;
> > >
> > > I'm pretty sure you're not currently allowed to overwrite the ->mm
> > > pointer of a userspace thread. For example, zap_threads() assumes that
> > > all threads running under a process have the same ->mm. (And if you're
> > > fiddling with ->mm stuff, you should probably CC linux-mm@.)
> > >
> > > As far as I understand, only kthreads are allowed to do this (as
> > > implemented in kthread_use_mm()).
> >
> > kthread_use_mm() was renamed from use_mm in the v5.8 kernel. Before
> > that, it wasn't used for user processes in the kernel, but it was
> > exported for modules, and we used it without any visible problems. We
> > understood that there could be some issues like zap_threads and it was
> > one of reasons why we decided to introduce this system call.
> >
> > I understand that there are no places in the kernel where we change mm
> > of user threads back and forth, but are there any real concerns why we
> > should not do that? I agree that zap_threads should be fixed, but it
> > will the easy one.
>
> My point is that if you break a preexisting assumption like this,
> you'll have to go through the kernel and search for places that rely
> on this assumption, and fix them up, which may potentially require
> thinking about what kinds of semantics would actually be appropriate
> there. Like the MCE killing logic (collect_procs_anon() and such). And
> current_is_single_threaded(), in which the current patch probably
> leads to logic security bugs. And __uprobe_perf_filter(). Before my
> refactoring of the ELF coredump logic in kernel 5.10 (commit
> b2767d97f5ff75 and the ones before it), you'd have also probably
> created memory corruption bugs in races between elf_core_dump() and
> syscalls like mmap()/munmap(). (Note that this is not necessarily an
> exhaustive list.)
>

There’s nmi_uaccess_okay(), and its callers assume that, when a task is perf tracing itself, an event on that task with nmi_uaccess_okay() means that uaccess will access that task’s memory.

Core dump code probably expects that dumping memory will access the correct mm.

I cannot fathom why any kind of remote VM access would touch FPU state at all.

What PKRU value is supposed to be used when doing mm swap shenanigans? How about PASID?

What happens if one task attempts to issue a KVM ioctl while its mm is swapped?

2021-07-02 21:00:32

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> This change introduces the new system call:
> process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
>
> process_vm_exec allows executing the current process in the address
> space of another process.
>
> process_vm_exec swaps the current address space with the address space
> of a specified process, sets a state from sigcontext, and resumes the
> process. When a process receives a signal or calls a system call,
> process_vm_exec saves the process state back to sigcontext, restores
> the original address space, restores the original process state, and
> returns to userspace.
>
> If it was interrupted by a signal and the signal is in the user_mask,
> the signal is dequeued and information about it is saved in uinfo.
> If process_vm_exec is interrupted by a system call, a synthetic siginfo
> for the SIGSYS signal is generated.
>
> The behavior of this system call is similar to PTRACE_SYSEMU, but
> everything is happening in the context of one process, so
> process_vm_exec shows better performance.
>
> PTRACE_SYSEMU is primarily used to implement sandboxes (application
> kernels) like User-mode Linux or gVisor. These types of sandboxes
> intercept application system calls and act as the guest kernel.
> A simple benchmark, where a "tracee" process executes system calls in a
> loop and a "tracer" process traps syscalls and handles them by just
> incrementing the tracee's instruction pointer to skip the syscall
> instruction, shows that process_vm_exec works more than 5 times faster
> than PTRACE_SYSEMU.
[...]
> +long swap_vm_exec_context(struct sigcontext __user *uctx)
> +{
> + struct sigcontext ctx = {};
> + sigset_t set = {};
> +
> +
> + if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> + return -EFAULT;
> + /* A floating point state is managed from user-space. */
> + if (ctx.fpstate != 0)
> + return -EINVAL;
> + if (!user_access_begin(uctx, sizeof(*uctx)))
> + return -EFAULT;
> + unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> + user_access_end();
> +
> + if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> + goto badframe;
> +
> + return 0;
> +Efault:
> + user_access_end();
> +badframe:
> + signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> + return -EFAULT;
> +}

Comparing the pieces of context that restore_sigcontext() restores
with what a normal task switch does (see __switch_to() and callees), I
noticed: On CPUs with FSGSBASE support, I think sandboxed code could
overwrite FSBASE/GSBASE using the WRFSBASE/WRGSBASE instructions,
causing the supervisor to access attacker-controlled addresses when it
tries to access a thread-local variable like "errno"? Signal handling
saves the segment registers, but not the FS/GS base addresses.


jannh@laptop:~/test$ cat signal_gsbase.c
// compile with -mfsgsbase
#include <stdio.h>
#include <signal.h>
#include <immintrin.h>

void signal_handler(int sig, siginfo_t *info, void *ucontext_) {
puts("signal handler");
_writegsbase_u64(0x12345678);
}

int main(void) {
struct sigaction new_act = {
.sa_sigaction = signal_handler,
.sa_flags = SA_SIGINFO
};
sigaction(SIGUSR1, &new_act, NULL);

printf("original gsbase is 0x%lx\n", _readgsbase_u64());
raise(SIGUSR1);
printf("post-signal gsbase is 0x%lx\n", _readgsbase_u64());
}
jannh@laptop:~/test$ gcc -o signal_gsbase signal_gsbase.c -mfsgsbase
jannh@laptop:~/test$ ./signal_gsbase
original gsbase is 0x0
signal handler
post-signal gsbase is 0x12345678
jannh@laptop:~/test$


So to make this usable for a sandboxing usecase, you'd also have to
save and restore FSBASE/GSBASE, just like __switch_to().

2021-07-02 22:31:06

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Fri, Jul 02, 2021 at 10:51:13AM +0200, Peter Zijlstra wrote:
>
> I'm terrified of all of this...
>
> On Tue, Apr 13, 2021 at 10:52:15PM -0700, Andrei Vagin wrote:
>
> > +long swap_vm_exec_context(struct sigcontext __user *uctx)
> > +{
> > + struct sigcontext ctx = {};
> > + sigset_t set = {};
> > +
> > +
> > + if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> > + return -EFAULT;
> > + /* A floating point state is managed from user-space. */
> > + if (ctx.fpstate != 0)
> > + return -EINVAL;

Here, we check that ctx doesn't have an FPU state.

> > + if (!user_access_begin(uctx, sizeof(*uctx)))
> > + return -EFAULT;
> > + unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> > + user_access_end();
>
> But here you save the sigcontext without FPU state.
>
> > +
> > + if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> > + goto badframe;
>
> And here you restore sigcontext, *with* FPU state. At which point your
> FPU state is irrecoverably lost.

process_vm_exec doesn't change a process's FPU state. Unlike signals,
here we can control it from user-space. A process can set an FPU state
before process_vm_exec and then restore its FPU state after the call.

This version of the patches has a bug that I fixed in my tree when I
implemented the user-space part for gVisor. I didn't take into account
that restore_sigcontext(ctx) clears a process's FPU state if ctx->fpstate
is zero. I moved fpu__restore_sig out of __restore_sigcontext to fix
this issue:

https://github.com/avagin/linux-task-diag/commit/55b7194d00ff

>
> Also, I'm not at all convinced this can ever do the right thing when the
> tasks don't agree on what the FPU state is. I suppose in the best case
> the save will EFAULT.
>
> > +
> > + return 0;
> > +Efault:
> > + user_access_end();
> > +badframe:
> > + signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> > + return -EFAULT;
> > +}

2021-07-02 22:54:46

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall

On Fri, Jul 02, 2021 at 10:56:38PM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > This change introduces the new system call:
> > process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
> > siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
> >
> > process_vm_exec allows executing the current process in the address
> > space of another process.
> >
> > process_vm_exec swaps the current address space with the address space
> > of a specified process, sets a state from sigcontext, and resumes the
> > process. When a process receives a signal or calls a system call,
> > process_vm_exec saves the process state back to sigcontext, restores
> > the original address space, restores the original process state, and
> > returns to userspace.
> >
> > If it was interrupted by a signal and the signal is in the user_mask,
> > the signal is dequeued and information about it is saved in uinfo.
> > If process_vm_exec is interrupted by a system call, a synthetic siginfo
> > for the SIGSYS signal is generated.
> >
> > The behavior of this system call is similar to PTRACE_SYSEMU, but
> > everything is happening in the context of one process, so
> > process_vm_exec shows better performance.
> >
> > PTRACE_SYSEMU is primarily used to implement sandboxes (application
> > kernels) like User-mode Linux or gVisor. These types of sandboxes
> > intercept application system calls and act as the guest kernel.
> > A simple benchmark, where a "tracee" process executes system calls in a
> > loop and a "tracer" process traps syscalls and handles them by just
> > incrementing the tracee's instruction pointer to skip the syscall
> > instruction, shows that process_vm_exec works more than 5 times faster
> > than PTRACE_SYSEMU.
> [...]
> > +long swap_vm_exec_context(struct sigcontext __user *uctx)
> > +{
> > + struct sigcontext ctx = {};
> > + sigset_t set = {};
> > +
> > +
> > + if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> > + return -EFAULT;
> > + /* A floating point state is managed from user-space. */
> > + if (ctx.fpstate != 0)
> > + return -EINVAL;
> > + if (!user_access_begin(uctx, sizeof(*uctx)))
> > + return -EFAULT;
> > + unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> > + user_access_end();
> > +
> > + if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> > + goto badframe;
> > +
> > + return 0;
> > +Efault:
> > + user_access_end();
> > +badframe:
> > + signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> > + return -EFAULT;
> > +}
>
> Comparing the pieces of context that restore_sigcontext() restores
> with what a normal task switch does (see __switch_to() and callees), I
> noticed: On CPUs with FSGSBASE support, I think sandboxed code could
> overwrite FSBASE/GSBASE using the WRFSBASE/WRGSBASE instructions,
> causing the supervisor to access attacker-controlled addresses when it
> tries to access a thread-local variable like "errno"? Signal handling
> saves the segment registers, but not the FS/GS base addresses.
>
>
> jannh@laptop:~/test$ cat signal_gsbase.c
> // compile with -mfsgsbase
> #include <stdio.h>
> #include <signal.h>
> #include <immintrin.h>
>
> void signal_handler(int sig, siginfo_t *info, void *ucontext_) {
> puts("signal handler");
> _writegsbase_u64(0x12345678);
> }
>
> int main(void) {
> struct sigaction new_act = {
> .sa_sigaction = signal_handler,
> .sa_flags = SA_SIGINFO
> };
> sigaction(SIGUSR1, &new_act, NULL);
>
> printf("original gsbase is 0x%lx\n", _readgsbase_u64());
> raise(SIGUSR1);
> printf("post-signal gsbase is 0x%lx\n", _readgsbase_u64());
> }
> jannh@laptop:~/test$ gcc -o signal_gsbase signal_gsbase.c -mfsgsbase
> jannh@laptop:~/test$ ./signal_gsbase
> original gsbase is 0x0
> signal handler
> post-signal gsbase is 0x12345678
> jannh@laptop:~/test$
>
>
> So to make this usable for a sandboxing usecase, you'd also have to
> save and restore FSBASE/GSBASE, just like __switch_to().

You are right. I found this too when I implemented the gVisor user-space
part.

Here is the tree where this problem has been fixed:
https://github.com/avagin/linux-task-diag/commits/wip/gvisor-5.10

2021-07-02 22:58:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On 4/13/21 10:52 PM, Andrei Vagin wrote:

> process_vm_exec has two modes:
>
> * Execute code in an address space of a target process and stop on any
> signal or system call.

We already have a perfectly good context switch mechanism: context
switches. If you execute code, you are basically guaranteed to be
subject to being hijacked, which means you pretty much can't allow
syscalls. But there's a lot of non-syscall state, and I think context
switching needs to be done with extreme care.

(Just as an example, suppose you switch mms, then set %gs to point to the
LDT, then switch back. Now you're in a weird state. With %ss the plot
is a bit thicker. And there are emulated vsyscalls and such.)

If you, PeterZ, and the UMCG folks could all find an acceptable,
efficient way to wake-and-wait so you can switch into an injected task in
the target process and switch back quickly, then I think a much nicer
solution will become available.

>
> * Execute a system call in an address space of a target process.

I could get behind this, but there are plenty of cans of worms to watch
out for. Serious auditing would be needed.

2021-07-18 00:52:05

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Fri, Jul 02, 2021 at 05:12:02PM +0200, Jann Horn wrote:
> On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin <[email protected]> wrote:
> > On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin <[email protected]> wrote:
> > > > We already have process_vm_readv and process_vm_writev to read and write
> > > > to a process memory faster than we can do this with ptrace. And now it
> > > > is time for process_vm_exec that allows executing code in an address
> > > > space of another process. We can do this with ptrace but it is much
> > > > slower.
> > > >
> > > > = Use-cases =
> > >
> > > It seems to me like your proposed API doesn't really fit either one of
> > > those usecases well...
> > >
> > > > Here are two known use-cases. The first one is “application kernel”
> > > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > > process that runs the sandbox kernel and a set of stub processes that
> > > > are used to manage guest address spaces. Guest code is executed in the
> > > > context of stub processes but all system calls are intercepted and
> > > > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > > > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > > > significantly speed them up.
> > >
> > > In this case, since you really only want an mm_struct to run code
> > > under, it seems weird to create a whole task with its own PID and so
> > > on. It seems to me like something similar to the /dev/kvm API would be
> > > more appropriate here? Implementation options that I see for that
> > > would be:
> > >
> > > 1. mm_struct-based:
> > > a set of syscalls to create a new mm_struct,
> > > change memory mappings under that mm_struct, and switch to it
> >
> > I like the idea to have a handle for mm. Instead of pid, we will pass
> > this handle to process_vm_exec. We have pidfd for processes and we can
> > introduce mmfd for mm_struct.
>
> I personally think that it might be quite unwieldy when it comes to
> the restrictions you get from trying to have shared memory with the
> owning process - I'm having trouble figuring out how you can implement
> copy-on-write semantics without relying on copy-on-write logic in the
> host OS and without being able to use userfaultfd.

It is easy: COW mappings are mapped into guest address spaces without
write permission. If one of the processes wants to write something, it
triggers a fault that is handled in the Sentry (the supervisor/kernel).

>
> But if that's not a problem somehow, and you can find some reasonable
> way to handle memory usage accounting and fix up everything that
> assumes that multithreaded userspace threads don't switch ->mm, I
> guess this might work for your usecase.
>
> > > 2. pagetable-mirroring-based:
> > > like /dev/kvm, an API to create a new pagetable, mirror parts of
> > > the mm_struct's pagetables over into it with modified permissions
> > > (like KVM_SET_USER_MEMORY_REGION),
> > > and run code under that context.
> > > page fault handling would first handle the fault against mm->pgd
> > > as normal, then mirror the PTE over into the secondary pagetables.
> > > invalidation could be handled with MMU notifiers.
> > >
> >
> > I found this idea interesting and decided to look at it more closely.
> > After reading the kernel code for a few days, I realized that it would
> > not be easy to implement something like this,
>
> Yeah, it might need architecture-specific code to flip the page tables
> on userspace entry/exit, and maybe also for mirroring them. And for
> the TLB flushing logic...
>
> > but more important is that
> > I don’t understand what problem it solves. Will it simplify the
> > user-space code? I don’t think so. Will it improve performance? It is
> > unclear for me too.
>
> Some reasons I can think of are:
>
> - direct guest memory access: I imagined you'd probably want to be able to
> directly access userspace memory from the supervisor, and
> with this approach that'd become easy.

Right now, we use shared memory regions for that and they work fine. As
I already mentioned, most of the guest memory is never mapped into the
supervisor address space.

>
> - integration with on-demand paging of the host OS: You'd be able to
> create things like file-backed copy-on-write mappings from the
> host filesystem, or implement your own mappings backed by some kind
> of storage using userfaultfd.

This isn't a problem either...

>
> - sandboxing: For sandboxing usecases (not your usecase), it would be
> possible to e.g. create a read-only clone of the entire address space of a
> process and give write access to specific parts of it, or something
> like that.
> These address space clones could potentially be created and destroyed
> fairly quickly.

This is a very valid example and I would assume this is where your idea
was coming from. I have some doubts about the idea of additional
sub-page-tables in the kernel, but I know a good way to implement your
idea with KVM. You can look at how the KVM platform is implemented in
gVisor; this sort of sandboxing can be implemented in the same way.

In a few words, we create a KVM virtual machine, replicate the process
address space in the guest ring 0, and implement basic
operating-system-level stubs so that the process can jump between the
host ring 3 and the guest ring 0.

https://github.com/google/gvisor/blob/master/pkg/ring0/
https://github.com/google/gvisor/tree/master/pkg/sentry/platform/kvm

When we have all these bits, we can create arbitrary page tables for a
guest ring 3 and run untrusted code there. The sandbox process switches
to the guest ring 0 and then switches to a guest ring 3 with the
specified page tables and state.

https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/machine_amd64.go;l=356

With this scheme, the sandbox process will have direct access to page
tables and will be able to change them.

>
> - accounting: memory usage would be automatically accounted to the
> supervisor process, so even without a parasite process, you'd be able
> to see the memory usage correctly in things like "top".
>
> - small (non-pageable) memory footprint in the host kernel:
> The only things the host kernel would have to persistently store would be
> the normal MM data structures for the supervisor plus the mappings
> from "guest userspace" memory ranges to supervisor memory ranges;
> userspace pagetables would be discardable, and could even be shared
> with those of the supervisor in cases where the alignment fits.
> So with this, large anonymous mappings with 4K granularity only cost you
> ~0.20% overhead across host and guest address space; without this, if you
> used shared mappings instead, you'd pay twice that for every 2MiB range
> from which parts are accessed in both contexts, plus probably another
> ~0.2% or so for the "struct address_space"?

If we use shared mappings, most of the guest memory is never mapped into
the supervisor address space, so there are no page tables for it there.
I would say it is an open question which approach has the smaller memory
footprint...

>
> - all memory-management-related syscalls could be directly performed
> in the "kernel" process
>
> But yeah, some of those aren't really relevant for your usecase, and I
> guess things like the accounting aspect could just as well be solved
> differently...
>
> > First, in the KVM case, we have a few big linear mappings and need to
> > support one “shadow” address space. In the case of sandboxes, we can
> > have a tremendous amount of mappings and many address spaces that we
> > need to manage. Memory mappings will be mapped with different addresses
> > in a supervisor address space and “guest” address spaces. If guest
> > address spaces will not have their mm_structs, we will need to reinvent
> > vma-s in some form. If guest address spaces have mm_structs, this will
> > look similar to https://lwn.net/Articles/830648/.
> >
> > Second, each pagetable is tied up with mm_stuct. You suggest creating
> > new pagetables that will not have their mm_struct-s (sorry if I
> > misunderstood something).
>
> Yeah, that's what I had in mind, page tables without an mm_struct.
>
> > I am not sure that it will be easy to
> > implement. How many corner cases will be there?
>
> Yeah, it would require some work around TLB flushing and entry/exit
> from userspace. But from a high-level perspective it feels to me like
> a change with less systematic impact. Maybe I'm wrong about that.
>
> > As for page faults in a secondary address space, we will need to find a
> > fault address in the main address space, handle the fault there and then
> > mirror the PTE to the secondary pagetable.
>
> Right.
>
> > Effectively, it means that
> > page faults will be handled in two address spaces. Right now, we use
> > memfd and shared mappings. It means that each fault is handled only in
> > one address space, and we map a guest memory region to the supervisor
> > address space only when we need to access it. A large portion of guest
> > anonymous memory is never mapped to the supervisor address space.
> > Will an overhead of mirrored address spaces be smaller than memfd shared
> > mappings? I am not sure.
>
> But as long as the mappings are sufficiently big and aligned properly,
> or you explicitly manage the supervisor address space, some of that
> cost disappears: E.g. even if a page is mapped in both address spaces,
> you wouldn't have a memory cost for the second mapping if the page
> tables are shared.

You are right. It is interesting how many PTEs will actually be shared.
For example, if a guest process forks a child, all anonymous memory
becomes COW; this means we have to clear the W bit in the PTEs, and so
we have to allocate separate PTEs for both processes...

>
> > Third, this approach will not get rid of having process_vm_exec. We will
> > need to switch to a guest address space with a specified state and
> > switch back on faults or syscalls.
>
> Yeah, you'd still need a syscall for running code under a different
> set of page tables. But that's something that KVM _almost_ already
> does.

I don't understand this analogy with KVM...

>
> > If the main concern is the ability to
> > run syscalls on a remote mm, we can think about how to fix this. I see
> > two ways what we can do here:
> >
> > * Specify the exact list of system calls that are allowed. The first
> > three candidates are mmap, munmap, and vmsplice.
> >
> > * Instead of allowing us to run system calls, we can implement this in
> > the form of commands. In the case of sandboxes, we need to implement
> > only two commands to create and destroy memory mappings in a target
> > address space.
>
> FWIW, there is precedent for something similar: The Android folks
> already added process_madvise() for remotely messing with the VMAs of
> another process to some degree.

I know. We tried to implement process_vm_mmap and process_vm_splice:

https://lkml.org/lkml/2018/1/9/32
https://patchwork.kernel.org/project/linux-mm/cover/155836064844.2441.10911127801797083064.stgit@localhost.localdomain/

Thanks,
Andrei

2021-07-18 01:40:15

by Andrei Vagin

[permalink] [raw]
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space

On Fri, Jul 02, 2021 at 03:44:41PM -0700, Andy Lutomirski wrote:
> On 4/13/21 10:52 PM, Andrei Vagin wrote:
>
> > process_vm_exec has two modes:
> >
> > * Execute code in an address space of a target process and stop on any
> > signal or system call.
>
> We already have a perfectly good context switch mechanism: context
> switches. If you execute code, you are basically guaranteed to be
> subject to being hijacked, which means you pretty much can't allow
> syscalls. But there's a lot of non-syscall state, and I think context
> switching needs to be done with extreme care.
>
> (Just as example, suppose you switch mms, then set %gs to point to the
> LDT, then switch back. Now you're in a weird state. With %ss the plot
> is a bit thicker. And there are emulated vsyscalls and such.)
>
> If you, PeterZ, and the UMCG could all find an acceptable, efficient way
> to wake-and-wait so you can switch into an injected task in the target
> process and switch back quickly, then I think a much nicer solution will
> become available.

I know about UMCG and I even did a prototype that used futex_swap (the
predecessor of UMCG). Here are a few problems; maybe you will have some
ideas on how to solve them.

The main question is how to hijack a stub process where guest code is
executing. We need to trap system calls, memory faults, and other
exceptions and handle them in the Sentry (supervisor/kernel). All events
of interest except system calls generate signals, and we can use seccomp
to get signals on system calls too. In my prototype, guest code runs in
stub processes, one stub process per guest address space. In a stub
process, I set signal handlers for SIGSEGV, SIGBUS, SIGFPE, SIGSYS, and
SIGILL, set an alternate signal stack, and set seccomp rules. The signal
handler communicates with the Sentry (supervisor/kernel) via shared
memory and uses futex_swap to make fast switches to the Sentry and back
to a stub process.

Here are a few problems. First, the signal handler code, its stack, and
a shared memory region live in the guest address space, and we need to
guarantee that guest code cannot use them to do something unexpected.

The second problem is performance. This approach is much faster than the
ptrace platform, but it is still a few times slower than
process_vm_exec. Signal handling is expensive: the kernel has to
generate a signal frame, execute the signal handler, and then call
rt_sigreturn. futex_swap makes context switches fast, but it is still
slower than process_vm_exec. UMCG should be faster because it doesn’t
have the futex overhead.

Andy, what do you think about the idea of reworking process_vm_exec so
that it executes code and syscalls in the context of a target process?
Or maybe you see other ways we could “hijack” a remote process?

Thanks,
Andrei

>
> >
> > * Execute a system call in an address space of a target process.
>
> I could get behind this, but there are plenty of cans of worms to watch
> out for. Serious auditing would be needed.