Date: Thu, 1 Jul 2021 23:57:53 -0700
From: Andrei Vagin
To: Jann Horn
Cc: kernel list, Linux API, linux-um@lists.infradead.org, criu@openvz.org,
    avagin@google.com, Andrew Morton, Andy Lutomirski, Anton Ivanov,
    Christian Brauner, Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar,
    Jeff Dike, Mike Rapoport, Michael Kerrisk, Oleg Nesterov,
    Peter Zijlstra, Richard Weinberger, Thomas Gleixner, linux-mm@kvack.org
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
References: <20210414055217.543246-1-avagin@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org
On Wed, Apr 14, 2021 at 08:46:40AM +0200, Jann Horn wrote:
> On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> > We already have process_vm_readv and process_vm_writev to read and write
> > to a process memory faster than we can do this with ptrace. And now it
> > is time for process_vm_exec that allows executing code in an address
> > space of another process. We can do this with ptrace but it is much
> > slower.
> >
> > = Use-cases =
>
> It seems to me like your proposed API doesn't really fit either one of
> those usecases well...
>
> > Here are two known use-cases. The first one is “application kernel”
> > sandboxes like User-mode Linux and gVisor. In this case, we have a
> > process that runs the sandbox kernel and a set of stub processes that
> > are used to manage guest address spaces. Guest code is executed in the
> > context of stub processes but all system calls are intercepted and
> > handled in the sandbox kernel. Right now, these sort of sandboxes use
> > PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> > significantly speed them up.
>
> In this case, since you really only want an mm_struct to run code
> under, it seems weird to create a whole task with its own PID and so
> on. It seems to me like something similar to the /dev/kvm API would be
> more appropriate here? Implementation options that I see for that
> would be:
>
> 1. mm_struct-based:
>    a set of syscalls to create a new mm_struct,
>    change memory mappings under that mm_struct, and switch to it

I like the idea of having a handle for an mm. Instead of a pid, we would
pass this handle to process_vm_exec. We have pidfd for processes, and we
can introduce mmfd for mm_struct (a rough sketch of what this could look
like is below).

> 2. pagetable-mirroring-based:
>    like /dev/kvm, an API to create a new pagetable, mirror parts of
>    the mm_struct's pagetables over into it with modified permissions
>    (like KVM_SET_USER_MEMORY_REGION),
>    and run code under that context.
>    page fault handling would first handle the fault against mm->pgd
>    as normal, then mirror the PTE over into the secondary pagetables.
>    invalidation could be handled with MMU notifiers.

I found this idea interesting and decided to look at it more closely.
After reading the kernel code for a few days, I realized that it would
not be easy to implement, but, more importantly, I don't understand
what problem it solves. Will it simplify the user-space code? I don't
think so. Will it improve performance? That is unclear to me as well.

First, in the KVM case, we have a few big linear mappings and need to
support only one "shadow" address space. In the case of sandboxes, we
can have a tremendous number of mappings and many address spaces to
manage. Memory mappings will be mapped at different addresses in the
supervisor address space and in the "guest" address spaces. If guest
address spaces do not have their own mm_structs, we will need to
reinvent vma-s in some form. If guest address spaces do have their own
mm_structs, this will look similar to https://lwn.net/Articles/830648/.

Second, each pagetable is tied to an mm_struct, and you suggest
creating new pagetables that will not have their own mm_struct-s (sorry
if I misunderstood something). I am not sure that this will be easy to
implement; how many corner cases will there be?
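(As an aside, to make the mmfd idea above a little more concrete, here
is a purely hypothetical userspace sketch. Nothing below exists today:
mmfd_create(), mmfd_mmap(), and this process_vm_exec() prototype are
made-up names for illustration only.)

/*
 * Hypothetical sketch: create an mm handle, populate it, and run
 * guest code under it until the first syscall or fault.
 */
#include <signal.h>         /* siginfo_t */
#include <sys/mman.h>       /* PROT_*, MAP_* */
#include <asm/sigcontext.h> /* struct sigcontext */

/* None of these exist; declared here only to keep the sketch consistent. */
int mmfd_create(unsigned long flags);
void *mmfd_mmap(int mmfd, void *addr, size_t len, int prot,
                int flags, int fd, off_t off);
long process_vm_exec(int mmfd, struct sigcontext *ctx,
                     unsigned long flags, siginfo_t *info);

void run_guest(void)
{
	/* Create an empty mm and get a handle to it. */
	int mmfd = mmfd_create(0);

	/* Set up mappings in the guest mm, not in the caller's. */
	mmfd_mmap(mmfd, (void *)0x400000, 0x10000,
	          PROT_READ | PROT_EXEC,
	          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/*
	 * Switch this thread into the guest mm with the register state
	 * in ctx; return to the caller's mm on the first syscall or
	 * fault, which is described in info.
	 */
	struct sigcontext ctx = { 0 /* guest registers */ };
	siginfo_t info;
	process_vm_exec(mmfd, &ctx, 0, &info);
}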
Back to the pagetable-mirroring idea. As for page faults in a secondary
address space, we would need to find the fault address in the main
address space, handle the fault there, and then mirror the PTE into the
secondary pagetable. Effectively, this means that each page fault is
handled in two address spaces. Right now, we use memfd and shared
mappings, so each fault is handled in only one address space, and we
map a guest memory region into the supervisor address space only when
we need to access it. A large portion of guest anonymous memory is
never mapped into the supervisor address space. Will the overhead of
mirrored address spaces be smaller than that of memfd shared mappings?
I am not sure.

Third, this approach does not remove the need for process_vm_exec. We
still need to switch to a guest address space with a specified state
and switch back on faults or syscalls. If the main concern is the
ability to run arbitrary syscalls on a remote mm, we can think about
how to fix this. I see two ways to do that:

* Specify an exact list of system calls that are allowed. The first
  three candidates are mmap, munmap, and vmsplice.

* Instead of allowing arbitrary system calls, implement this in the
  form of commands. In the case of sandboxes, we need only two
  commands: to create and to destroy memory mappings in a target
  address space. A rough sketch of such an interface follows below.
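(Again purely as illustration: every name below is hypothetical, and
the layout is only one possible shape for a command-based interface.)

/*
 * Instead of running arbitrary syscalls in the target mm, the
 * supervisor submits one of two commands against its handle.
 */
enum process_vm_cmd {
	PVM_CMD_MMAP,	/* create a mapping in the target mm */
	PVM_CMD_MUNMAP,	/* destroy a mapping in the target mm */
};

struct pvm_mmap {
	unsigned long addr;	/* address in the target mm */
	unsigned long len;
	unsigned long prot;
	unsigned long flags;
	int fd;			/* e.g. a memfd shared with the supervisor */
	unsigned long off;
};

/*
 * Usage could look something like:
 *
 *	struct pvm_mmap arg = {
 *		.addr = 0x400000, .len = 0x10000,
 *		.prot = PROT_READ | PROT_WRITE,
 *		.flags = MAP_SHARED, .fd = guest_memfd, .off = 0,
 *	};
 *	process_vm_cmd(mmfd, PVM_CMD_MMAP, &arg, sizeof(arg));
 */

Thanks,
Andrei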