Date: Tue, 5 Apr 2022 10:36:26 +0000
From: Quentin Perret
To: Andy Lutomirski
Cc: Sean Christopherson, Steven Price, Chao Peng, kvm list,
    Linux Kernel Mailing List, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, Linux API, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
    Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
    Hugh Dickins, Jeff Layton, "J. Bruce Fields", Andrew Morton,
    Mike Rapoport, "Maciej S. Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
    Dave Hansen, Andi Kleen, David Hildenbrand, Marc Zyngier, Will Deacon
Subject: Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting
    KVM guest private memory

On Monday 04 Apr 2022 at 15:04:17 (-0700), Andy Lutomirski wrote:
>
>
> On Mon, Apr 4, 2022, at 10:06 AM, Sean Christopherson wrote:
> > On Mon, Apr 04, 2022, Quentin Perret wrote:
> >> On Friday 01 Apr 2022 at 12:56:50 (-0700), Andy Lutomirski wrote:
> >> FWIW, there are a couple of reasons why I'd like to have in-place
> >> conversions:
> >>
> >> - one goal of pKVM is to migrate some things away from the Arm
> >>   Trustzone environment (e.g. DRM and the likes) and into protected
> >>   VMs instead. This will give Linux a fighting chance to defend
> >>   itself against these things -- they currently have access to
> >>   _all_ memory. And transitioning pages between Linux and Trustzone
> >>   (donations and shares) is fast and non-destructive, so we really
> >>   do not want pKVM to regress by requiring the hypervisor to memcpy
> >>   things;
> >
> > Is there actually a _need_ for the conversion to be non-destructive?
> > E.g.
> > I assume
> > the "trusted" side of things will need to be reworked to run as a
> > pKVM guest, at which point reworking its logic to understand that
> > conversions are destructive and slow-ish doesn't seem too onerous.
> >
> >> - it can be very useful for protected VMs to do shared=>private
> >>   conversions. Think of a VM receiving some data from the host in a
> >>   shared buffer, and then it wants to operate on that buffer
> >>   without risking leaking confidential information in a transient
> >>   state. In that case the most logical thing to do is to convert
> >>   the buffer back to private, do whatever needs to be done on that
> >>   buffer (decrypting a frame, ...), and then share it back with the
> >>   host to consume it;
> >
> > If performance is a motivation, why would the guest want to do two
> > conversions instead of just doing internal memcpy() to/from a
> > private page? I would be quite surprised if multiple exits and TLB
> > shootdowns is actually faster, especially at any kind of scale where
> > zapping stage-2 PTEs will cause lock contention and IPIs.
>
> I don't know the numbers or all the details, but this is arm64, which
> is a rather better architecture than x86 in this regard. So maybe
> it's not so bad, at least in very simple cases, ignoring all
> implementation details. (But see below.) Also the systems in question
> tend to have fewer CPUs than some of the massive x86 systems out
> there.

Yep. I can try and do some measurements if that's really necessary, but
I'm really convinced the cost of the TLBI for the shared->private
conversion is going to be significantly smaller than the cost of
memcpy'ing the buffer twice in the guest for us. To be fair, although
the cost of the CPU update is going to be low, the cost of IOMMU
updates _might_ be higher, but that very much depends on the hardware.
On systems that use e.g.
the Arm SMMU, the IOMMUs can use the CPU page-tables directly, and the
IOTLB invalidation is done on the back of the CPU invalidation. So, on
systems with sane hardware, the overhead is *really* quite small.

Also, memcpy requires double the memory, is pretty bad for power, and
causes memory traffic, which can't be a good thing for anything running
concurrently.

> If we actually wanted to support transitioning the same page between
> shared and private, though, we have a bit of an awkward situation.
> Private to shared is conceptually easy -- do some bookkeeping,
> reconstitute the direct map entry, and it's done. The other direction
> is a mess: all existing uses of the page need to be torn down. If the
> page has been recently used for DMA, this includes IOMMU entries.
>
> Quentin: let's ignore any API issues for now. Do you have a concept
> of how a nondestructive shared -> private transition could work well,
> even in principle?

I had a high-level idea for the workflow, but I haven't looked into the
implementation details. The idea would be to allow KVM *or* userspace
to take a reference to a page in the fd in an exclusive manner. KVM
could take a reference on a page (which would be necessary before
donating it to a guest) using some kind of memfile_notifier as proposed
in this series, and userspace could do the same some other way (mmap
presumably?). In both cases, the operation might fail.

I would imagine the boot and private->shared flow as follows:

 - the VMM uses fallocate() on the private fd, and associates the fd
   with a memslot;
 - the guest boots, and as part of that KVM takes references to all
   the pages that are donated to the guest. If userspace happens to
   have a mapping to a page, KVM will fail to take the reference,
   which would be fatal for the guest;
 - once the guest has booted, it issues a hypercall to share a page
   back with the host;
 - KVM is notified, and at that point it drops its reference to the
   page.
   It then exits to userspace to notify it of the share;
 - host userspace receives the share, and mmaps the shared page with
   MAP_FIXED to access it, which takes a reference on the fd-backed
   page.

There are variations of that idea: e.g. allow userspace to mmap the
entire private fd but without taking a reference on pages mapped with
PROT_NONE, and then have the VMM use mprotect() in response to
share/unshare requests. I think Marc liked that idea as it keeps the
userspace API closer to normal KVM -- there actually is a
straightforward gpa->hva relation. Not sure how much that would impact
the implementation at this point.

For the shared=>private conversion, this would be something like so:

 - the guest issues a hypercall to unshare a page;
 - the hypervisor forwards the request to the host;
 - the host kernel forwards the request to userspace;
 - userspace then munmap()s the shared page;
 - KVM then tries to take a reference to the page. If it succeeds, it
   re-enters the guest with a flag of some sort saying that the
   unshare succeeded, and the hypervisor will adjust pgtables
   accordingly. If KVM failed to take a reference, it flags this and
   the hypervisor will be responsible for communicating that back to
   the guest. This means the guest must handle failures (possibly
   fatal).

(There are probably many ways in which we can optimize this, e.g. by
having the host proactively munmap() pages it no longer needs so that
the unshare hypercall from the guest doesn't need to exit all the way
back to host userspace.)

A nice side-effect of the above is that it allows userspace to dump a
payload into the private fd before booting the guest. It just needs to
mmap the fd, copy what it wants in there, munmap, and only then pass
the fd to KVM, which will be happy enough as long as there are no
current references to the pages.
Note: in a previous email I said that Android doesn't need this (which
is correct, as our guest bootloader currently receives the payload over
virtio), but this might change some day, and there might be other
implementations as well, so it's a nice bonus if we can make this work.

> The best I can come up with is a special type of shared page that is
> not GUP-able and maybe not even mmappable, having a clear option for
> transitions to fail, and generally preventing the nasty cases from
> happening in the first place.

Right, that sounds reasonable to me.

> Maybe there could be a special mode for the private memory fds in
> which specific pages are marked as "managed by this fd but actually
> shared". pread() and pwrite() would work on those pages, but not
> mmap(). (Or maybe mmap() but the resulting mappings would not permit
> GUP.) And transitioning them would be a special operation on the fd
> that is specific to pKVM and wouldn't work on TDX or SEV.

Aha, didn't think of pread()/pwrite(). Very interesting. I'd need to
check what our VMM actually does, but as an initial reaction it feels
like this might require a pretty significant rework in userspace.
Maybe that's a good thing? Dunno. Maybe more importantly, those shared
pages are used for virtio communications, so the cost of issuing
syscalls every time the VMM needs to access a shared page will need to
be considered...

Thanks,
Quentin