From: Jann Horn
Date: Wed, 14 Apr 2021 08:46:40 +0200
Subject: Re: [PATCH 0/4 POC] Allow executing code and syscalls in another address space
To: Andrei Vagin
Cc: kernel list, Linux API, linux-um@lists.infradead.org, criu@openvz.org, avagin@google.com, Andrew Morton, Andy Lutomirski, Anton Ivanov, Christian Brauner, Dmitry Safonov <0x7f454c46@gmail.com>, Ingo Molnar, Jeff Dike, Mike Rapoport, Michael Kerrisk, Oleg Nesterov, Peter Zijlstra, Richard Weinberger, Thomas Gleixner
References: <20210414055217.543246-1-avagin@gmail.com>
In-Reply-To: <20210414055217.543246-1-avagin@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 14, 2021 at 7:59 AM Andrei Vagin wrote:
> We already have process_vm_readv and process_vm_writev to read and write
> to a process memory faster than we can do this with ptrace. And now it
> is time for process_vm_exec that allows executing code in an address
> space of another process. We can do this with ptrace but it is much
> slower.
>
> = Use-cases =

It seems to me like your proposed API doesn't really fit either one of
those usecases well...

> Here are two known use-cases. The first one is “application kernel”
> sandboxes like User-mode Linux and gVisor. In this case, we have a
> process that runs the sandbox kernel and a set of stub processes that
> are used to manage guest address spaces. Guest code is executed in the
> context of stub processes but all system calls are intercepted and
> handled in the sandbox kernel. Right now, these sort of sandboxes use
> PTRACE_SYSEMU to trap system calls, but the process_vm_exec can
> significantly speed them up.

In this case, since you really only want an mm_struct to run code
under, it seems weird to create a whole task with its own PID and so
on. It seems to me like something similar to the /dev/kvm API would be
more appropriate here? Implementation options that I see for that
would be:

1. mm_struct-based: a set of syscalls to create a new mm_struct,
   change memory mappings under that mm_struct, and switch to it
2. pagetable-mirroring-based: like /dev/kvm, an API to create a new
   pagetable, mirror parts of the mm_struct's pagetables over into it
   with modified permissions (like KVM_SET_USER_MEMORY_REGION), and run
   code under that context. page fault handling would first handle the
   fault against mm->pgd as normal, then mirror the PTE over into the
   secondary pagetables. invalidation could be handled with MMU
   notifiers.

> Another use-case is CRIU (Checkpoint/Restore in User-space). Several
> process properties can be received only from the process itself. Right
> now, we use a parasite code that is injected into the process. We do
> this with ptrace but it is slow, unsafe, and tricky.

But this API will only let you run code under the *mm* of the target
process, not fully in the context of a target *task*, right? So you
still won't be able to use this for accessing anything other than
memory?
That doesn't seem very generically useful to me.

Also, I don't doubt that anything involving ptrace is kinda tricky,
but it would be nice to have some more detail on what exactly makes
this slow, unsafe and tricky. Are there API additions for ptrace that
would make this work better? I imagine you're thinking of things like
an API for injecting a syscall into the target process without having
to first somehow find an existing SYSCALL instruction in the target
process?

> process_vm_exec can
> simplify the process of injecting a parasite code and it will allow
> pre-dump memory without stopping processes. The pre-dump here is when we
> enable a memory tracker and dump the memory while a process is continue
> running. On each interaction we dump memory that has been changed from
> the previous iteration. In the final step, we will stop processes and
> dump their full state. Right now the most effective way to dump process
> memory is to create a set of pipes and splice memory into these pipes
> from the parasite code. With process_vm_exec, we will be able to call
> vmsplice directly. It means that we will not need to stop a process to
> inject the parasite code.

Alternatively you could add splice support to /proc/$pid/mem or add a
syscall similar to process_vm_readv() that splices into a pipe, right?
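
For reference, here is a minimal sketch of the existing /proc/$pid/mem
read path that such splice support would build on. It only shows the
copying read that is available today, not the proposed splice
variant. The helper name read_remote_mem() and its signature are made
up for illustration, and the sketch assumes a 64-bit off_t and that
the caller has ptrace-level access to the target.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (name and signature are illustrative only):
 * copy len bytes starting at remote_addr in process pid into buf by
 * reading /proc/<pid>/mem. Returns the number of bytes read, or -1
 * on error. Assumes a 64-bit off_t.
 */
ssize_t read_remote_mem(pid_t pid, const void *remote_addr,
                        void *buf, size_t len)
{
    char path[64];
    ssize_t n;
    int fd;

    snprintf(path, sizeof(path), "/proc/%ld/mem", (long)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* The file offset is interpreted as a virtual address in the target. */
    n = pread(fd, buf, len, (off_t)(uintptr_t)remote_addr);
    close(fd);
    return n;
}
```

Calling read_remote_mem(getpid(), src, dst, len) on the current
process is an easy way to try it out. The data is copied through a
userspace buffer here, which is exactly the copy that a splice-capable
interface would be meant to avoid.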