References: <20210414055217.543246-1-avagin@gmail.com> <20210414055217.543246-3-avagin@gmail.com>
In-Reply-To: <20210414055217.543246-3-avagin@gmail.com>
From: Jann Horn
Date: Fri, 2 Jul 2021 22:56:38 +0200
Subject: Re: [PATCH 2/4] arch/x86: implement the process_vm_exec syscall
To: Andrei Vagin, "the arch/x86 maintainers", Andy Lutomirski
Cc: linux-kernel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com,
    Andrew Morton, Anton Ivanov, Christian Brauner,
    Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar, Jeff Dike,
    Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra,
    Richard Weinberger, Thomas Gleixner
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> This change introduces the new system call:
> process_vm_exec(pid_t pid, struct sigcontext *uctx, unsigned long flags,
>                 siginfo_t * uinfo, sigset_t *sigmask, size_t sizemask)
>
> process_vm_exec allows to execute the current process in an address
> space of another process.
>
> process_vm_exec swaps the current address space with an address space of
> a specified process, sets a state from sigcontext and resumes the process.
> When a process receives a signal or calls a system call,
> process_vm_exec saves the process state back to sigcontext, restores the
> origin address space, restores the origin process state, and returns to
> userspace.
>
> If it was interrupted by a signal and the signal is in the user_mask,
> the signal is dequeued and information about it is saved in uinfo.
> If process_vm_exec is interrupted by a system call, a synthetic siginfo
> for the SIGSYS signal is generated.
>
> The behavior of this system call is similar to PTRACE_SYSEMU but
> everything is happening in the context of one process, so
> process_vm_exec shows a better performance.
>
> PTRACE_SYSEMU is primarily used to implement sandboxes (application
> kernels) like User-mode Linux or gVisor. These types of sandboxes
> intercept applications' system calls and act as the guest kernel.
> A simple benchmark, where a "tracee" process executes system calls in a
> loop and a "tracer" process traps syscalls and handles them by just
> incrementing the tracee instruction pointer to skip the syscall
> instruction shows that process_vm_exec works more than 5 times faster
> than PTRACE_SYSEMU.
[...]
> +long swap_vm_exec_context(struct sigcontext __user *uctx)
> +{
> +	struct sigcontext ctx = {};
> +	sigset_t set = {};
> +
> +
> +	if (copy_from_user(&ctx, uctx, CONTEXT_COPY_SIZE))
> +		return -EFAULT;
> +	/* A floating point state is managed from user-space. */
> +	if (ctx.fpstate != 0)
> +		return -EINVAL;
> +	if (!user_access_begin(uctx, sizeof(*uctx)))
> +		return -EFAULT;
> +	unsafe_put_sigcontext(uctx, NULL, current_pt_regs(), (&set), Efault);
> +	user_access_end();
> +
> +	if (__restore_sigcontext(current_pt_regs(), &ctx, 0))
> +		goto badframe;
> +
> +	return 0;
> +Efault:
> +	user_access_end();
> +badframe:
> +	signal_fault(current_pt_regs(), uctx, "swap_vm_exec_context");
> +	return -EFAULT;
> +}

Comparing the pieces of context that restore_sigcontext() restores
with what a normal task switch does (see __switch_to() and callees), I
noticed: On CPUs with FSGSBASE support, I think sandboxed code could
overwrite FSBASE/GSBASE using the WRFSBASE/WRGSBASE instructions,
causing the supervisor to access attacker-controlled addresses when it
tries to access a thread-local variable like "errno"? Signal handling
saves the segment registers, but not the FS/GS base addresses.


jannh@laptop:~/test$ cat signal_gsbase.c
// compile with -mfsgsbase
#include <stdio.h>
#include <signal.h>
#include <immintrin.h>

void signal_handler(int sig, siginfo_t *info, void *ucontext_)
{
  puts("signal handler");
  _writegsbase_u64(0x12345678);
}
int main(void) {
  struct sigaction new_act = {
    .sa_sigaction = signal_handler,
    .sa_flags = SA_SIGINFO
  };
  sigaction(SIGUSR1, &new_act, NULL);

  printf("original gsbase is 0x%lx\n", _readgsbase_u64());
  raise(SIGUSR1);
  printf("post-signal gsbase is 0x%lx\n", _readgsbase_u64());
}
jannh@laptop:~/test$ gcc -o signal_gsbase signal_gsbase.c -mfsgsbase
jannh@laptop:~/test$ ./signal_gsbase
original gsbase is 0x0
signal handler
post-signal gsbase is 0x12345678
jannh@laptop:~/test$


So to make this usable for a sandboxing use case, you'd also have to
save and restore FSBASE/GSBASE, just like __switch_to().
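
For illustration only, the gap described above can be modeled in plain C
without touching real registers. This is a rough sketch under loud
assumptions: struct cpu_state, struct sig_frame, struct full_frame and
all function names are hypothetical and are not kernel types or kernel
APIs. The model only shows that a frame which records the GS selector
but not the GS base lets a guest-written base survive the restore, while
a frame that also carries the base does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of one thread's register state. */
struct cpu_state {
	uint64_t ip;
	uint64_t sp;
	uint16_t gs;     /* segment selector */
	uint64_t gsbase; /* base address; user code can set it with WRGSBASE */
};

/* Models a sigcontext-like frame: the selector is saved, the base is not. */
struct sig_frame {
	uint64_t ip;
	uint64_t sp;
	uint16_t gs;
};

/* Models a frame that also carries the base, as a task switch would. */
struct full_frame {
	struct sig_frame regs;
	uint64_t gsbase;
};

static void save_sig(struct sig_frame *f, const struct cpu_state *c)
{
	f->ip = c->ip;
	f->sp = c->sp;
	f->gs = c->gs;
}

static void restore_sig(struct cpu_state *c, const struct sig_frame *f)
{
	c->ip = f->ip;
	c->sp = f->sp;
	c->gs = f->gs;
}

/* Stand-in for sandboxed code: moves ip/sp and "executes" WRGSBASE. */
static void run_guest(struct cpu_state *c, uint64_t guest_gsbase)
{
	c->ip += 0x40;
	c->sp -= 0x80;
	c->gsbase = guest_gsbase;
}

/* GS base the supervisor sees when only the sigcontext-like frame is used. */
uint64_t gsbase_after_sig_roundtrip(uint64_t supervisor_gsbase,
				    uint64_t guest_gsbase)
{
	struct cpu_state cpu = { .ip = 0x1000, .sp = 0x2000, .gs = 0,
				 .gsbase = supervisor_gsbase };
	struct sig_frame saved;

	save_sig(&saved, &cpu);
	run_guest(&cpu, guest_gsbase);
	restore_sig(&cpu, &saved);
	assert(cpu.ip == 0x1000 && cpu.sp == 0x2000); /* general state is back */
	return cpu.gsbase; /* never restored, so still the guest's value */
}

/* GS base the supervisor sees when the base is saved and restored too. */
uint64_t gsbase_after_full_roundtrip(uint64_t supervisor_gsbase,
				     uint64_t guest_gsbase)
{
	struct cpu_state cpu = { .ip = 0x1000, .sp = 0x2000, .gs = 0,
				 .gsbase = supervisor_gsbase };
	struct full_frame saved;

	save_sig(&saved.regs, &cpu);
	saved.gsbase = cpu.gsbase;
	run_guest(&cpu, guest_gsbase);
	restore_sig(&cpu, &saved.regs);
	cpu.gsbase = saved.gsbase;
	assert(cpu.ip == 0x1000 && cpu.sp == 0x2000);
	return cpu.gsbase;
}
```

With the values from the transcript above, the first helper models the
observed behavior (the base stays at 0x12345678 after the round trip),
and the second models the suggested fix (the base returns to the
supervisor's original value).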