Date: Mon, 19 Sep 2022 19:10:15 +0000
From: Sean Christopherson
To: David Hildenbrand
Cc: Chao Peng, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
    qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov,
    Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, x86@kernel.org, "H . Peter Anvin", Hugh Dickins,
    Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
    Steven Price, "Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve,
    Yu Zhang, "Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com,
    dave.hansen@intel.com, ak@linux.intel.com, aarcange@redhat.com,
    ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth,
    mhocko@suse.com, Muchun Song, wei.w.wang@intel.com, Will Deacon,
    Marc Zyngier, Fuad Tabba
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
    <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

+Will, Marc and Fuad (apologies if I missed other pKVM folks)

On Mon, Sep 19, 2022, David Hildenbrand wrote:
> On 15.09.22 16:29, Chao Peng wrote:
> > From: "Kirill A. Shutemov"
> >
> > KVM can use memfd-provided memory for guest memory. For normal userspace
> > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > virtual address space and then tells KVM to use the virtual address to
> > setup the mapping in the secondary page table (e.g. EPT).
> >
> > With confidential computing technologies like Intel TDX, the
> > memfd-provided memory may be encrypted with special key for special
> > software domain (e.g. KVM guest) and is not expected to be directly
> > accessed by userspace. Precisely, userspace access to such encrypted
> > memory may lead to host crash so it should be prevented.
>
> Initially my thought was that this whole inaccessible thing is TDX specific
> and there is no need to force that on other mechanisms.
> That's why I suggested to not expose this to user space but handle the
> notifier requirements internally.
>
> IIUC now, protected KVM has similar demands. Either access (read/write) of
> guest RAM would result in a fault and possibly crash the hypervisor (at
> least not the whole machine IIUC).

Yep.  The missing piece for pKVM is the ability to convert from shared to
private while preserving the contents, e.g. to hand off a large buffer
(hundreds of MiB) for processing in the protected VM.  Thoughts on this at
the bottom.

> > This patch introduces userspace inaccessible memfd (created with
> > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > in-kernel interface so KVM can directly interact with core-mm without
> > the need to map the memory into KVM userspace.
>
> With secretmem we decided to not add such "concept switch" flags and instead
> use a dedicated syscall.
>

I have no personal preference whatsoever between a flag and a dedicated
syscall, but a dedicated syscall does seem like it would give the kernel a
bit more flexibility.

> What about memfd_inaccessible()? Especially, sealing and hugetlb are not
> even supported and it might take a while to support either.

Don't know about sealing, but hugetlb support for "inaccessible" memory needs
to come sooner rather than later.  "inaccessible" in quotes because we might
want to choose a less binary name, e.g. "restricted"?

Regarding pKVM's use case, with the shim approach I believe this can be done
by allowing userspace to mmap() the "hidden" memfd, but with a ton of
restrictions piled on top.

My first thought was to make the uAPI a set of KVM ioctls so that KVM could
tightly control usage without taking on too much complexity in the kernel,
but working through things, routing the behavior through the shim itself
might not be all that horrific.
IIRC, we discarded the idea of allowing userspace to map the "private" fd
because things got too complex, but with the shim it doesn't seem _that_ bad.

E.g. on the memfd side:

  1. The entire memfd must be mapped, and at most one mapping is allowed,
     i.e. mapping is all or nothing.

  2. Acquiring a reference via get_pfn() is disallowed if there's a mapping
     for the restricted memfd.

  3. Add notifier hooks to allow downstream users to further restrict things.

  4. Disallow splitting VMAs, e.g. to force userspace to munmap() everything
     in one shot.

  5. Require that there are no outstanding references at munmap().  Or if
     this can't be guaranteed by userspace, maybe add some way for userspace
     to wait until it's ok to convert to private?  E.g. so that get_pfn()
     doesn't need to do an expensive check every time.

static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
{
	if (vma->vm_pgoff)
		return -EINVAL;

	if ((vma->vm_end - vma->vm_start) != <size of memfd>)
		return -EINVAL;

	mutex_lock(&data->lock);

	if (data->has_mapping) {
		r = -EINVAL;
		goto err;
	}
	list_for_each_entry(notifier, &data->notifiers, list) {
		r = notifier->ops->mmap_start(notifier, ...);
		if (r)
			goto abort;
	}

	notifier->ops->mmap_end(notifier, ...);
	mutex_unlock(&data->lock);
	return 0;

abort:
	/* Unwind only the notifiers whose mmap_start() already succeeded. */
	list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
		notifier->ops->mmap_abort(notifier, ...);
err:
	mutex_unlock(&data->lock);
	return r;
}

static void memfd_restricted_close(struct vm_area_struct *vma)
{
	mutex_lock(...);

	/*
	 * Destroy the memfd and disable all future accesses if there are
	 * outstanding refcounts (or other unsatisfied restrictions?).
	 */
	if (<outstanding references> || ???)
		memfd_restricted_destroy(...);
	else
		data->has_mapping = false;

	mutex_unlock(...);
}

static int memfd_restricted_may_split(struct vm_area_struct *area,
				      unsigned long addr)
{
	return -EINVAL;
}

static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
{
	return -EINVAL;
}

Then on the KVM side, its mmap_start() + mmap_end() sequence would:

  1. Not be supported for TDX or SEV-SNP because they don't allow adding
     non-zero memory into the guest (after pre-boot phase).

  2. Be mutually exclusive with shared<=>private conversions, and be allowed
     if and only if the entire gfn range of the associated memslot is shared.
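
To make the memfd-side rules above easier to poke at, here is a rough
userspace model of them.  It is only a sketch and not kernel code: every name
(model_mmap(), model_get_pfn(), model_pin(), model_munmap(), struct notifier
and its veto flag) is hypothetical.  The veto flag stands in for a downstream
user such as KVM rejecting mmap_start(), e.g. because not every gfn in the
memslot is shared.  "pins" is one possible reading of "outstanding
references", namely references obtained through the userspace mapping.
Calling the end hook on every notifier and setting has_mapping on success are
assumptions of the model as well.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every name below is hypothetical, not a kernel API. */
#define MAX_NOTIFIERS 4

struct notifier {
	bool veto;			/* if set, mmap_start() rejects the mapping */
	int started, ended, aborted;	/* call counters for inspection */
};

struct restricted_memfd {
	size_t size;		/* size of the memfd in bytes */
	bool has_mapping;	/* rule 1: at most one mapping */
	bool destroyed;		/* set when munmap() finds leftover references */
	int pins;		/* references taken through the userspace mapping */
	struct notifier *notifiers[MAX_NOTIFIERS];
	int nr_notifiers;
};

static int notifier_mmap_start(struct notifier *n)
{
	n->started++;
	return n->veto ? -EBUSY : 0;
}

/* Rules 1, 3 and 4: whole-file mapping only, subject to notifier approval. */
int model_mmap(struct restricted_memfd *m, size_t offset, size_t len)
{
	int i, r;

	assert(m && m->nr_notifiers <= MAX_NOTIFIERS);

	if (m->destroyed || offset != 0 || len != m->size || m->has_mapping)
		return -EINVAL;

	for (i = 0; i < m->nr_notifiers; i++) {
		r = notifier_mmap_start(m->notifiers[i]);
		if (r)
			goto abort;
	}
	for (i = 0; i < m->nr_notifiers; i++)
		m->notifiers[i]->ended++;
	m->has_mapping = true;
	return 0;

abort:
	/* Unwind only the notifiers that already accepted, newest first. */
	while (--i >= 0)
		m->notifiers[i]->aborted++;
	return r;
}

/* Rule 2: no in-kernel references while userspace has the memfd mapped. */
int model_get_pfn(const struct restricted_memfd *m)
{
	return (m->destroyed || m->has_mapping) ? -EINVAL : 0;
}

/* Stand-in for a reference obtained through the mapping (e.g. a page pin). */
int model_pin(struct restricted_memfd *m)
{
	if (!m->has_mapping)
		return -EINVAL;
	m->pins++;
	return 0;
}

/* Rule 5: leftover references at munmap() disable the memfd for good. */
void model_munmap(struct restricted_memfd *m)
{
	if (!m->has_mapping)
		return;
	if (m->pins > 0)
		m->destroyed = true;
	else
		m->has_mapping = false;
}
```

With a 4096-byte model memfd, model_mmap(&m, 0, 2048) fails with -EINVAL
without touching any notifier, a full-size mapping makes model_get_pfn() fail
until model_munmap(), and unmapping with a pin still held leaves the memfd
permanently unusable.  Rule 4 is modeled implicitly: model_munmap() takes no
range, so there is no way to unmap part of the mapping.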