Date: Wed, 6 Apr 2022 13:05:18 +0000
From: Quentin Perret
To: Andy Lutomirski
Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
    Linux Kernel Mailing List, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Linux API, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
    Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
    Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton,
    Mike Rapoport, "Maciej S . Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
    Dave Hansen, Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory
References: <80aad2f9-9612-4e87-a27a-755d3fa97c92@www.fastmail.com>
    <83fd55f8-cd42-4588-9bf6-199cbce70f33@www.fastmail.com>
    <54acbba9-f4fd-48c1-9028-d596d9f63069@www.fastmail.com>
In-Reply-To: <54acbba9-f4fd-48c1-9028-d596d9f63069@www.fastmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tuesday 05 Apr 2022 at 10:51:36 (-0700), Andy Lutomirski wrote:
> Let's try actually counting syscalls and mode transitions, at least approximately. For non-direct IO (DMA allocation on guest side, not straight to/from pagecache or similar):
>
> Guest writes to shared DMA buffer. Assume the guest is smart and reuses the buffer.
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
> host does pread() to read the DMA buffer or reads mmapped buffer
> host does the IO
> resume guest
> *** host -> hypervisor -> guest ***
>
> This is essentially optimal in terms of transitions. The data is copied on the guest side (which may well be mandatory depending on what guest userspace did to initiate the IO) and on the host (which may well be mandatory depending on what the host is doing with the data).
>
> Now let's try straight-from-guest-pagecache or otherwise zero-copy on the guest side. Without nondestructive changes, the guest needs a bounce buffer and it looks just like the above. One extra copy, zero extra mode transitions. With nondestructive changes, it's a bit more like physical hardware with an IOMMU:
>
> Guest shares the page.
> *** guest -> hypervisor ***
> Hypervisor adds a PTE. Let's assume we're being very optimal and the host is not synchronously notified.
> *** hypervisor -> guest ***
> Guest writes descriptor to shared virtio ring.
> Guest rings virtio doorbell, which causes an exit.
> *** guest -> hypervisor -> host ***
> host reads virtio ring (mmaped shared memory)
>
> mmap *** syscall ***
> host does the IO
> munmap *** syscall, TLBI ***
>
> resume guest
> *** host -> hypervisor -> guest ***
> Guest unshares the page.
> *** guest -> hypervisor ***
> Hypervisor removes PTE. TLBI.
> *** hypervisor -> guest ***
>
> This is quite expensive. For small IO, pread() or splice() in the host may be a lot faster. Even for large IO, splice() may still win.

Right, that would work nicely for pages that are shared transiently, but
less so for long-term shares. But I guess your proposal below should do
the trick.

> I can imagine clever improvements. First, let's get rid of mmap() + munmap(). Instead use a special device mapping with special semantics, not regular memory. (mmap and munmap are expensive even ignoring any arch and TLB stuff.) The rule is that, if the page is shared, access works, and if private, access doesn't, but it's still mapped. The hypervisor and the host cooperate to make it so.

As long as the page can't be GUP'd I _think_ this shouldn't be a
problem. We can have the hypervisor re-inject the fault in the host. And
the host fault handler will deal with it just fine if the fault was
taken from userspace (inject a SEGV), or from the kernel through uaccess
macros.
But we do get into issues if the host kernel can be tricked into
accessing the page via e.g. kmap(). I've been able to trigger this by
strace-ing a userspace process which passes a pointer to private memory
to a syscall. strace will inspect the syscall argument using
process_vm_readv(), which will pin_user_pages_remote() and access the
page via kmap(), and then we're in trouble. But preventing GUP would
prevent this by construction I think?

FWIW memfd_secret() did look like a good solution to this, but it lacks
the bidirectional notifiers with KVM that this patch series offers.
Those notifiers are needed to allow KVM to handle guest faults, and they
also offer a good framework to support future extensions (e.g.
hypervisor-assisted page migration, swap, ...). So yes, ideally pKVM
would use a kind of hybrid between memfd_secret and the private fd
proposed here, or something else providing similar properties.

Thanks,
Quentin
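
For reference, the per-IO costs in the quoted walkthrough can be tallied with a short sketch. This is only an illustration of the counting in the thread, not code from the patch series: the event labels and list names are hypothetical, each line marked `***` in the walkthrough is counted as one world switch, and the bounce-buffer flow is assumed to read the DMA buffer through an existing mapping rather than pread().

```python
from collections import Counter

# Hypothetical event labels for this sketch; not kernel or KVM identifiers.
SWITCH = "world_switch"      # one line marked "*** a -> b ***" in the walkthrough
SYSCALL = "host_syscall"     # a host-side syscall made per IO
TLBI = "tlb_invalidate"      # a TLB invalidation made per IO

# Bounce-buffer flow: the guest reuses one long-lived shared DMA buffer.
# Assumption: the host reads the buffer through an existing mmap, so no
# pread() syscall is counted.
BOUNCE_BUFFER = [
    SWITCH,   # doorbell exit: guest -> hypervisor -> host
    SWITCH,   # resume: host -> hypervisor -> guest
]

# Share-per-IO flow: the guest shares and unshares the page around each IO.
SHARE_PER_IO = [
    SWITCH,   # guest -> hypervisor (share; hypervisor adds a PTE)
    SWITCH,   # hypervisor -> guest
    SWITCH,   # doorbell exit: guest -> hypervisor -> host
    SYSCALL,  # host mmap()
    SYSCALL,  # host munmap()
    TLBI,     # invalidation that comes with munmap()
    SWITCH,   # resume: host -> hypervisor -> guest
    SWITCH,   # guest -> hypervisor (unshare; hypervisor removes the PTE)
    TLBI,     # invalidation that comes with removing the PTE
    SWITCH,   # hypervisor -> guest
]


def tally(flow):
    """Count how often each event kind occurs in one IO round trip."""
    return Counter(flow)


if __name__ == "__main__":
    for name, flow in (("bounce buffer", BOUNCE_BUFFER),
                       ("share per IO", SHARE_PER_IO)):
        print(name, dict(tally(flow)))
```

Under these assumptions the share-per-IO flow pays three times the world switches of the bounce-buffer flow, plus two syscalls and two TLB invalidations, which is the overhead the special-mapping idea in the thread tries to remove.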