From: Jann Horn
Date: Fri, 2 Jul 2021 17:12:02 +0200
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
To: Andrei Vagin
Cc: kernel list, Linux API, linux-um@lists.infradead.org, criu@openvz.org,
    avagin@google.com, Andrew Morton, Andy Lutomirski, Anton Ivanov,
    Christian Brauner, Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar,
    Jeff Dike, Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra,
    Richard Weinberger, Thomas Gleixner, linux-mm@kvack.org
References: <20210414055217.543246-1-avagin@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jul 2, 2021 at 9:01 AM Andrei Vagin wrote:
> On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> > On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> > > We already have process_vm_readv and process_vm_writev to read and
> > > write a process's memory faster than we can do this with ptrace.
> > > And now it is time for process_vm_exec, which allows executing code
> > > in the address space of another process. We can do this with ptrace
> > > but it is much slower.
> > >
> > > = Use-cases =
> >
> > It seems to me like your proposed API doesn't really fit either one of
> > those usecases well...
> >
> > > Here are two known use-cases. The first one is "application kernel"
> > > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > > process that runs the sandbox kernel and a set of stub processes that
> > > are used to manage guest address spaces. Guest code is executed in the
> > > context of stub processes but all system calls are intercepted and
> > > handled in the sandbox kernel. Right now, these sorts of sandboxes use
> > > PTRACE_SYSEMU to trap system calls, but process_vm_exec can
> > > significantly speed them up.
> >
> > In this case, since you really only want an mm_struct to run code
> > under, it seems weird to create a whole task with its own PID and so
> > on. It seems to me like something similar to the /dev/kvm API would be
> > more appropriate here? Implementation options that I see for that
> > would be:
> >
> > 1. mm_struct-based:
> >    a set of syscalls to create a new mm_struct,
> >    change memory mappings under that mm_struct, and switch to it
>
> I like the idea of having a handle for an mm. Instead of a pid, we will
> pass this handle to process_vm_exec. We have pidfd for processes and we
> can introduce mmfd for mm_struct.

I personally think that it might be quite unwieldy when it comes to the
restrictions you get from trying to have shared memory with the owning
process - I'm having trouble figuring out how you can implement
copy-on-write semantics without relying on copy-on-write logic in the
host OS and without being able to use userfaultfd.

But if that's not a problem somehow, and you can find some reasonable
way to handle memory usage accounting and fix up everything that assumes
that multithreaded userspace threads don't switch ->mm, I guess this
might work for your usecase.

> > 2. pagetable-mirroring-based:
> >    like /dev/kvm, an API to create a new pagetable, mirror parts of
> >    the mm_struct's pagetables over into it with modified permissions
> >    (like KVM_SET_USER_MEMORY_REGION),
> >    and run code under that context.
> >    page fault handling would first handle the fault against mm->pgd
> >    as normal, then mirror the PTE over into the secondary pagetables.
> >    invalidation could be handled with MMU notifiers.
>
> I found this idea interesting and decided to look at it more closely.
> After reading the kernel code for a few days, I realized that it would
> not be easy to implement something like this,

Yeah, it might need architecture-specific code to flip the page tables
on userspace entry/exit, and maybe also for mirroring them. And for the
TLB flushing logic...

> but more important is that I don't understand what problem it solves.
> Will it simplify the user-space code? I don't think so. Will it improve
> performance? It is unclear for me too.

Some reasons I can think of are:

 - direct guest memory access: I imagined you'd probably want to be able
   to directly access userspace memory from the supervisor, and with
   this approach that'd become easy.

 - integration with on-demand paging of the host OS: You'd be able to
   create things like file-backed copy-on-write mappings from the host
   filesystem, or implement your own mappings backed by some kind of
   storage using userfaultfd (rough sketch of what I mean below, after
   this list).

 - sandboxing: For sandboxing usecases (not your usecase), it would be
   possible to e.g. create a read-only clone of the entire address space
   of a process and give write access to specific parts of it, or
   something like that. These address space clones could potentially be
   created and destroyed fairly quickly.

 - accounting: memory usage would be automatically accounted to the
   supervisor process, so even without a parasite process, you'd be able
   to see the memory usage correctly in things like "top".

 - small (non-pageable) memory footprint in the host kernel: The only
   things the host kernel would have to persistently store would be the
   normal MM data structures for the supervisor plus the mappings from
   "guest userspace" memory ranges to supervisor memory ranges; userspace
   pagetables would be discardable, and could even be shared with those
   of the supervisor in cases where the alignment fits. So with this,
   large anonymous mappings with 4K granularity only cost you ~0.20%
   overhead (one 8-byte PTE per 4 KiB page) across host and guest
   address space; without this, if you used shared mappings instead,
   you'd pay twice that for every 2MiB range from which parts are
   accessed in both contexts, plus probably another ~0.2% or so for the
   "struct address_space"?

 - all memory-management-related syscalls could be directly performed
   in the "kernel" process
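
(To illustrate the userfaultfd point above - this is only a rough sketch
of the register/resolve flow, not anything from the patch series;
fetch_page_from_backing_store() is a placeholder and error handling is
omitted:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

void fetch_page_from_backing_store(unsigned long addr, void *buf);

static int uffd_register_area(void *area, size_t len)
{
        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* "area" is an existing (e.g. anonymous) mapping in this process */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)area, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);
        return uffd;
}

/* fault-handling thread: pull faults off the uffd and install pages */
static void handle_one_fault(int uffd, void *page_buf)
{
        struct uffd_msg msg;

        read(uffd, &msg, sizeof(msg));          /* blocks until a fault arrives */
        if (msg.event != UFFD_EVENT_PAGEFAULT)
                return;

        unsigned long fault_page = msg.arg.pagefault.address & ~0xfffUL;
        fetch_page_from_backing_store(fault_page, page_buf);

        struct uffdio_copy copy = {
                .dst = fault_page,
                .src = (unsigned long)page_buf,
                .len = 0x1000,                  /* assumes 4K pages */
        };
        ioctl(uffd, UFFDIO_COPY, &copy);        /* maps the page, wakes the faulter */
}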

But yeah, some of those aren't really relevant for your usecase, and I
guess things like the accounting aspect could just as well be solved
differently...

> First, in the KVM case, we have a few big linear mappings and need to
> support one "shadow" address space. In the case of sandboxes, we can
> have a tremendous number of mappings and many address spaces that we
> need to manage. Memory mappings will be mapped at different addresses
> in the supervisor address space and the "guest" address spaces. If
> guest address spaces do not have their own mm_structs, we will need to
> reinvent vma-s in some form. If guest address spaces have mm_structs,
> this will look similar to https://lwn.net/Articles/830648/.
>
> Second, each pagetable is tied up with an mm_struct. You suggest
> creating new pagetables that will not have their own mm_struct-s
> (sorry if I misunderstood something).

Yeah, that's what I had in mind, page tables without an mm_struct.

> I am not sure that it will be easy to implement. How many corner cases
> will there be?

Yeah, it would require some work around TLB flushing and entry/exit from
userspace. But from a high-level perspective it feels to me like a
change with less systematic impact. Maybe I'm wrong about that.

> As for page faults in a secondary address space, we will need to find
> the fault address in the main address space, handle the fault there and
> then mirror the PTE to the secondary pagetable.

Right.

> Effectively, it means that page faults will be handled in two address
> spaces. Right now, we use memfd and shared mappings. It means that each
> fault is handled only in one address space, and we map a guest memory
> region into the supervisor address space only when we need to access
> it. A large portion of guest anonymous memory is never mapped into the
> supervisor address space. Will the overhead of mirrored address spaces
> be smaller than that of memfd shared mappings? I am not sure.
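
(For concreteness, I'm assuming the memfd scheme you describe looks
roughly like this - just a sketch, with made-up names and sizes:)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_SIZE (1UL << 30)      /* made-up size */

static int guest_memfd;

static void guest_mem_init(void)
{
        /* one memfd backs the guest's anonymous memory; the stub
         * process maps it MAP_SHARED at the guest's addresses */
        guest_memfd = memfd_create("guest-anon", MFD_CLOEXEC);
        ftruncate(guest_memfd, GUEST_MEM_SIZE);
}

/* the supervisor maps a window only when it actually has to touch
 * guest memory, so most guest pages never show up in its own
 * address space */
static void *supervisor_map_guest_range(off_t offset, size_t len)
{
        return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                    guest_memfd, offset);
}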

But with the mirrored-pagetable approach, as long as the mappings are
sufficiently big and aligned properly, or you explicitly manage the
supervisor address space, some of that cost disappears: e.g. even if a
page is mapped in both address spaces, you wouldn't have a memory cost
for the second mapping if the page tables are shared.

> Third, this approach will not get rid of having process_vm_exec. We
> will need to switch to a guest address space with a specified state and
> switch back on faults or syscalls.

Yeah, you'd still need a syscall for running code under a different set
of page tables. But that's something that KVM _almost_ already does.

> If the main concern is the ability to run syscalls on a remote mm, we
> can think about how to fix this. I see two ways we could do this:
>
>  * Specify the exact list of system calls that are allowed. The first
>    three candidates are mmap, munmap, and vmsplice.
>
>  * Instead of allowing us to run system calls, we can implement this in
>    the form of commands. In the case of sandboxes, we need to implement
>    only two commands to create and destroy memory mappings in a target
>    address space.

FWIW, there is precedent for something similar: the Android folks
already added process_madvise() for remotely messing with the VMAs of
another process to some degree.
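
(Sketch of what that looks like from userspace - using the raw syscall
numbers since older libcs don't have wrappers; the target pid and range
are made up, error handling omitted:)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434             /* x86-64 */
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440        /* x86-64, Linux 5.10+ */
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* ask the kernel to reclaim a range of another process's memory */
static long remote_pageout(pid_t pid, void *remote_addr, size_t len)
{
        long ret;
        int pidfd = syscall(__NR_pidfd_open, pid, 0);
        struct iovec iov = {
                .iov_base = remote_addr,   /* address in the target's address space */
                .iov_len  = len,
        };

        /* process_madvise(pidfd, iovec, vlen, advice, flags) */
        ret = syscall(__NR_process_madvise, pidfd, &iov, 1, MADV_PAGEOUT, 0);
        close(pidfd);
        return ret;
}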