Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp2420880pxu; Fri, 18 Dec 2020 12:51:56 -0800 (PST) X-Google-Smtp-Source: ABdhPJxiBey1T6F5J3v9CICkxQq3uyPEnZQd6PG3O3zSWEiHf5Wwu/RMU5Oy/jAhZti4+5mVCPP0 X-Received: by 2002:a17:906:1a01:: with SMTP id i1mr5894713ejf.315.1608324716697; Fri, 18 Dec 2020 12:51:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1608324716; cv=none; d=google.com; s=arc-20160816; b=TyQpUImc0fjtXGyeXUM0HHTmgtqU1mtBeM98BMGnLRnRGu3B71Csv9N3mkCp2bkUNZ jB8+WqUDe512SDilKNwuudPnZbVD6L6LXoMNf1H/uXX+LLiraYL243Gy7X2CqK/iWDQg Sq/Jemif6bu0anVfxTpd7iWqkTx8k74Pcw4nMD8oLESDN3NHbgHK065Rn4lPrj/j2RAf im9lUd5lTCq5sIykQooQSyZ3q/medw5RTCTwc+fJ8IjaWg94DLDZ9yHKDNO3c58lNSqT GZu6JFFT3Ba3H66KQZfPLOdBuv6iyPkzBqi1etyT34b2XnPksBI1fiqRj3mLySM9PpD9 IAHg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=qTqL06D7b4fb7cUvZJDJjp0eb7f5gIGXVqXJeXuddJU=; b=BsuJRqrOfgV628TnubrsUNdSvGJeak8vmanQcTeIxSbBgfURTUjBzKQlOurxICHjm5 Uwz8RZGXcbqovss42psFP3ZEEmjs5gV1sncYSMDg0HaHTK8PO0jfSVhYGbGe6IfWyi3l 54k7+KZ9UU1To3SCoIedpdhh9Qm+n+ZOtjcI9zH/Uq5PE22FilaMA7BsPZRRUiMGdjy2 XgEvBEbqGymkh626Fps+ir6Y6jq0wcazdXt8Usg0Uhodg0B3hrzheFKl5/pyuQ7JEP2n ZkqtFrKStU6u7ShIi2jIW3loi8y57wyifDzxSzYZ51zncbtyMc6NorGbWwYIs4CoM/vl /qog== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ScUXSgQJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id h11si6870917edl.78.2020.12.18.12.51.34; Fri, 18 Dec 2020 12:51:56 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ScUXSgQJ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1733252AbgLRTln (ORCPT + 99 others); Fri, 18 Dec 2020 14:41:43 -0500 Received: from us-smtp-delivery-124.mimecast.com ([63.128.21.124]:59764 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1733244AbgLRTlm (ORCPT ); Fri, 18 Dec 2020 14:41:42 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1608320415; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=qTqL06D7b4fb7cUvZJDJjp0eb7f5gIGXVqXJeXuddJU=; b=ScUXSgQJ2d2eX22j0oGcmv3K9GcM232GdvvZtruYadUN0W6q1pNrAUVfYRAPem1rmXXuBO 5YtjAz3a6SUUOnI6qrOaK9PvTPHDiB1xztP6SC5qeAuKOE1Lz6AV9oHOlv52kozzjQrXmd jSVpXb7Nf+RtiHyxUNunJSB44W4Bw3k= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-340-yIelHva4MIC_T5AEBbVHEQ-1; Fri, 18 Dec 2020 14:40:06 -0500 X-MC-Unique: yIelHva4MIC_T5AEBbVHEQ-1 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 965086D522; Fri, 18 Dec 2020 19:40:03 +0000 (UTC) Received: from work-vm (ovpn-114-200.ams2.redhat.com [10.36.114.200]) by smtp.corp.redhat.com (Postfix) with ESMTPS id BD4172CE56; Fri, 18 Dec 2020 19:39:58 +0000 (UTC) Date: Fri, 18 Dec 2020 19:39:56 +0000 From: "Dr. David Alan Gilbert" To: Ashish Kalra Cc: Brijesh Singh , Steve Rutherford , Sean Christopherson , Paolo Bonzini , Thomas Gleixner , Ingo Molnar , "H. Peter Anvin" , Joerg Roedel , Borislav Petkov , Tom Lendacky , X86 ML , KVM list , LKML , dovmurik@linux.vnet.ibm.com, tobin@ibm.com, jejb@linux.ibm.com, frankeh@us.ibm.com Subject: Re: [PATCH v2 1/9] KVM: x86: Add AMD SEV specific Hypercall3 Message-ID: <20201218193956.GJ2956@work-vm> References: <765f86ae-7c68-6722-c6e0-c6150ce69e59@amd.com> <20201211225542.GA30409@ashkalra_ubuntu_server> <20201212045603.GA27415@ashkalra_ubuntu_server> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20201212045603.GA27415@ashkalra_ubuntu_server> User-Agent: Mutt/1.14.6 (2020-07-11) X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org * Ashish Kalra (ashish.kalra@amd.com) wrote: > On Fri, Dec 11, 2020 at 10:55:42PM +0000, Ashish Kalra wrote: > > Hello All, > > > > On Tue, Dec 08, 2020 at 10:29:05AM -0600, Brijesh Singh wrote: > > > > > > On 12/7/20 9:09 PM, Steve Rutherford wrote: > > > > On Mon, Dec 7, 2020 at 12:42 PM Sean Christopherson wrote: > > > >> On Sun, Dec 06, 2020, Paolo Bonzini wrote: > > > >>> On 03/12/20 01:34, Sean Christopherson wrote: > > > >>>> On Tue, Dec 01, 2020, Ashish Kalra wrote: > > > >>>>> From: Brijesh Singh > > > >>>>> > > > >>>>> KVM hypercall framework relies on alternative framework to patch the > > > >>>>> VMCALL -> VMMCALL on AMD platform. If a hypercall is made before > > > >>>>> apply_alternative() is called then it defaults to VMCALL. The approach > > > >>>>> works fine on non SEV guest. A VMCALL would causes #UD, and hypervisor > > > >>>>> will be able to decode the instruction and do the right things. But > > > >>>>> when SEV is active, guest memory is encrypted with guest key and > > > >>>>> hypervisor will not be able to decode the instruction bytes. > > > >>>>> > > > >>>>> Add SEV specific hypercall3, it unconditionally uses VMMCALL. The hypercall > > > >>>>> will be used by the SEV guest to notify encrypted pages to the hypervisor. > > > >>>> What if we invert KVM_HYPERCALL and X86_FEATURE_VMMCALL to default to VMMCALL > > > >>>> and opt into VMCALL? It's a synthetic feature flag either way, and I don't > > > >>>> think there are any existing KVM hypercalls that happen before alternatives are > > > >>>> patched, i.e. it'll be a nop for sane kernel builds. > > > >>>> > > > >>>> I'm also skeptical that a KVM specific hypercall is the right approach for the > > > >>>> encryption behavior, but I'll take that up in the patches later in the series. > > > >>> Do you think that it's the guest that should "donate" memory for the bitmap > > > >>> instead? > > > >> No. Two things I'd like to explore: > > > >> > > > >> 1. Making the hypercall to announce/request private vs. shared common across > > > >> hypervisors (KVM, Hyper-V, VMware, etc...) and technologies (SEV-* and TDX). > > > >> I'm concerned that we'll end up with multiple hypercalls that do more or > > > >> less the same thing, e.g. KVM+SEV, Hyper-V+SEV, TDX, etc... Maybe it's a > > > >> pipe dream, but I'd like to at least explore options before shoving in KVM- > > > >> only hypercalls. > > > >> > > > >> > > > >> 2. Tracking shared memory via a list of ranges instead of a using bitmap to > > > >> track all of guest memory. For most use cases, the vast majority of guest > > > >> memory will be private, most ranges will be 2mb+, and conversions between > > > >> private and shared will be uncommon events, i.e. the overhead to walk and > > > >> split/merge list entries is hopefully not a big concern. I suspect a list > > > >> would consume far less memory, hopefully without impacting performance. > > > > For a fancier data structure, I'd suggest an interval tree. Linux > > > > already has an rbtree-based interval tree implementation, which would > > > > likely work, and would probably assuage any performance concerns. > > > > > > > > Something like this would not be worth doing unless most of the shared > > > > pages were physically contiguous. A sample Ubuntu 20.04 VM on GCP had > > > > 60ish discontiguous shared regions. This is by no means a thorough > > > > search, but it's suggestive. If this is typical, then the bitmap would > > > > be far less efficient than most any interval-based data structure. > > > > > > > > You'd have to allow userspace to upper bound the number of intervals > > > > (similar to the maximum bitmap size), to prevent host OOMs due to > > > > malicious guests. There's something nice about the guest donating > > > > memory for this, since that would eliminate the OOM risk. > > > > > > > > > Tracking the list of ranges may not be bad idea, especially if we use > > > the some kind of rbtree-based data structure to update the ranges. It > > > will certainly be better than bitmap which grows based on the guest > > > memory size and as you guys see in the practice most of the pages will > > > be guest private. I am not sure if guest donating a memory will cover > > > all the cases, e.g what if we do a memory hotplug (increase the guest > > > ram from 2GB to 64GB), will donated memory range will be enough to store > > > the metadata. > > > > > >. > > > > With reference to internal discussions regarding the above, i am going > > to look into specific items as listed below : > > > > 1). "hypercall" related : > > a). Explore the SEV-SNP page change request structure (included in GHCB), > > see if there is something common there than can be re-used for SEV/SEV-ES > > page encryption status hypercalls. > > b). Explore if there is any common hypercall framework i can use in > > Linux/KVM. > > > > 2). related to the "backing" data structure - explore using a range-based > > list or something like rbtree-based interval tree data structure > > (as mentioned by Steve above) to replace the current bitmap based > > implementation. > > > > > > I do agree that a range-based list or an interval tree data structure is a > really good "logical" fit for the guest page encryption status tracking. > > We can only keep track of the guest unencrypted shared pages in the > range(s) list (which will keep the data structure quite compact) and all > the guest private/encrypted memory does not really need any tracking in > the list, anything not in the list will be encrypted/private. > > Also looking at a more "practical" use case, here is the current log of > page encryption status hypercalls when booting a linux guest : > > ... > [ 56.146336] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 1 > [ 56.146351] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147261] page_enc_status_hc invoked, gpa = 1f00e000, npages = 1, enc = 0 > [ 56.147271] page_enc_status_hc invoked, gpa = 1f018000, npages = 1, enc = 0 .... > [ 56.180730] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 0 > [ 56.180741] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 0 > [ 56.180768] page_enc_status_hc invoked, gpa = 1f008000, npages = 1, enc = 1 > [ 56.180782] page_enc_status_hc invoked, gpa = 1f006000, npages = 1, enc = 1 .... > [ 56.197110] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 0 > [ 56.197120] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 0 > [ 56.197136] page_enc_status_hc invoked, gpa = 1f007000, npages = 1, enc = 1 > [ 56.197148] page_enc_status_hc invoked, gpa = 1f005000, npages = 1, enc = 1 .... > [ 56.222679] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 0 > [ 56.222691] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 0 > [ 56.222707] page_enc_status_hc invoked, gpa = 1e83b000, npages = 1, enc = 1 > [ 56.222720] page_enc_status_hc invoked, gpa = 1e839000, npages = 1, enc = 1 .... > [ 56.313747] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 0 > [ 56.313771] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 0 > [ 56.313789] page_enc_status_hc invoked, gpa = 1e5eb000, npages = 1, enc = 1 > [ 56.313803] page_enc_status_hc invoked, gpa = 1e5e9000, npages = 1, enc = 1 .... > [ 56.459276] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.459428] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 > [ 56.460037] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 1 > [ 56.460216] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 0 > [ 56.460299] page_enc_status_hc invoked, gpa = 1d767000, npages = 100, enc = 0 > [ 56.460448] page_enc_status_hc invoked, gpa = 1e501000, npages = 1, enc = 1 .... > As can be observed here, all guest MMIO ranges are initially setup as > shared, and those are all contigious guest page ranges. > > After that the encryption status hypercalls are invoked when DMA gets > triggered during disk i/o while booting the guest ... here again the > guest page ranges are contigious, though mostly single page is touched > and a lot of page re-use is observed. > > So a range-based list/structure will be a "good" fit for such usage > scenarios. It seems surprisingly common to flick the same pages back and forth between encrypted and clear for quite a while; why is this? Dave > Thanks, > Ashish > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK