From: Vishal Annapurve
Date: Tue, 18 Oct 2022 19:12:10 +0530
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: "Kirill A . Shutemov"
Cc: "Gupta, Pankaj", Vlastimil Babka, Chao Peng, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
    Jonathan Corbet, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
    Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
    Borislav Petkov, x86@kernel.org, "H . Peter Anvin", Hugh Dickins,
    Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan,
    Mike Rapoport, Steven Price, "Maciej S . Szmigiero", Yu Zhang,
    luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
    ak@linux.intel.com, david@redhat.com, aarcange@redhat.com,
    ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth,
    mhocko@suse.com, Muchun Song, wei.w.wang@intel.com
In-Reply-To: <20221017215640.hobzcz47es7dq2bi@box.shutemov.name>
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
    <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
    <20221017161955.t4gditaztbwijgcn@box.shutemov.name>
    <20221017215640.hobzcz47es7dq2bi@box.shutemov.name>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct 18, 2022 at 3:27 AM Kirill A . Shutemov wrote:
>
> On Mon, Oct 17, 2022 at 06:39:06PM +0200, Gupta, Pankaj wrote:
> > On 10/17/2022 6:19 PM, Kirill A . Shutemov wrote:
> > > On Mon, Oct 17, 2022 at 03:00:21PM +0200, Vlastimil Babka wrote:
> > > > On 9/15/22 16:29, Chao Peng wrote:
> > > > > From: "Kirill A. Shutemov"
> > > > >
> > > > > KVM can use memfd-provided memory for guest memory. For normal userspace
> > > > > accessible memory, KVM userspace (e.g. QEMU) mmaps the memfd into its
> > > > > virtual address space and then tells KVM to use the virtual address to
> > > > > setup the mapping in the secondary page table (e.g. EPT).
> > > > >
> > > > > With confidential computing technologies like Intel TDX, the
> > > > > memfd-provided memory may be encrypted with special key for special
> > > > > software domain (e.g. KVM guest) and is not expected to be directly
> > > > > accessed by userspace. Precisely, userspace access to such encrypted
> > > > > memory may lead to host crash so it should be prevented.
> > > > >
> > > > > This patch introduces userspace inaccessible memfd (created with
> > > > > MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> > > > > ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> > > > > in-kernel interface so KVM can directly interact with core-mm without
> > > > > the need to map the memory into KVM userspace.
> > > > >
> > > > > It provides semantics required for KVM guest private(encrypted) memory
> > > > > support that a file descriptor with this flag set is going to be used as
> > > > > the source of guest memory in confidential computing environments such
> > > > > as Intel TDX/AMD SEV.
> > > > >
> > > > > KVM userspace is still in charge of the lifecycle of the memfd. It
> > > > > should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> > > > > in this patch to obtain the physical memory address and then populate
> > > > > the secondary page table entries.
> > > > >
> > > > > The userspace inaccessible memfd can be fallocate-ed and hole-punched
> > > > > from userspace. When hole-punching happens, KVM can get notified through
> > > > > inaccessible_notifier it then gets chance to remove any mapped entries
> > > > > of the range in the secondary page tables.
> > > > >
> > > > > The userspace inaccessible memfd itself is implemented as a shim layer
> > > > > on top of real memory file systems like tmpfs/hugetlbfs but this patch
> > > > > only implemented tmpfs. The allocated memory is currently marked as
> > > > > unmovable and unevictable, this is required for current confidential
> > > > > usage. But in future this might be changed.
> > > > >
> > > > > Signed-off-by: Kirill A. Shutemov
> > > > > Signed-off-by: Chao Peng
> > > > > ---
> > > >
> > > > ...
> > > >
> > > > > +static long inaccessible_fallocate(struct file *file, int mode,
> > > > > +                                   loff_t offset, loff_t len)
> > > > > +{
> > > > > +        struct inaccessible_data *data = file->f_mapping->private_data;
> > > > > +        struct file *memfd = data->memfd;
> > > > > +        int ret;
> > > > > +
> > > > > +        if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > > +                if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > > +                        return -EINVAL;
> > > > > +        }
> > > > > +
> > > > > +        ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > > +        inaccessible_notifier_invalidate(data, offset, offset + len);
> > > >
> > > > Wonder if invalidate should precede the actual hole punch, otherwise we open
> > > > a window where the page tables point to memory no longer valid?
> > >
> > > Yes, you are right. Thanks for catching this.
> >
> > I also noticed this. But then thought the memory would be anyways zeroed
> > (hole punched) before this call?
>
> Hole punching can free pages, given that offset/len covers full page.
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov

I think moving inaccessible_notifier_invalidate() before the fallocate
call may not solve the problem completely. Is it possible that, between
the invalidate and the fallocate, KVM handles a page fault for the guest
VM from another vcpu and uses the pages that are about to be freed to
back gpa ranges?

Should hole punching here also update mem_attr first, to tell KVM that
the corresponding gpa ranges are no longer backed by the inaccessible
memfd?
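
To make the window concrete, here is a toy userspace model of the
ordering I am worried about. It is only a sketch, not kernel code: it
tracks a single backing page and a single secondary page table entry,
and every toy_* name is hypothetical. The "bracketed" variant is an
assumption on my side, loosely modeled on the start/end style of
invalidation that mmu notifiers use; it is not something the patch
implements.

```c
#include <stdbool.h>

/* Toy model: one backing page and one secondary page table entry. */
struct toy_state {
        bool page_allocated;        /* backing page still present in the memfd */
        bool spte_mapped;           /* guest mapping points at that page */
        int invalidate_in_progress; /* hypothetical begin/end bracket counter */
};

/* Models the notifier zapping the secondary page table entry. */
static void toy_invalidate(struct toy_state *s)
{
        s->spte_mapped = false;
}

/* Models the hole punch freeing the backing page. */
static void toy_punch_hole(struct toy_state *s)
{
        s->page_allocated = false;
}

/*
 * Models a fault from another vcpu. With honor_bracket set, the fault
 * backs off while an invalidation is in flight. Returns true if a
 * mapping was installed.
 */
static bool toy_fault(struct toy_state *s, bool honor_bracket)
{
        if (honor_bracket && s->invalidate_in_progress > 0)
                return false;
        if (!s->page_allocated)
                return false;
        s->spte_mapped = true;
        return true;
}

/* A stale mapping is an entry that points at a page already freed. */
static bool toy_stale(const struct toy_state *s)
{
        return s->spte_mapped && !s->page_allocated;
}

/* Invalidate first, then punch, with a fault landing in between. */
static bool toy_race_unbracketed(void)
{
        struct toy_state s = { .page_allocated = true, .spte_mapped = true };

        toy_invalidate(&s);
        toy_fault(&s, false);   /* re-installs the mapping in the window */
        toy_punch_hole(&s);
        return toy_stale(&s);
}

/* Same interleaving, but the punch sits inside a begin/end bracket. */
static bool toy_race_bracketed(void)
{
        struct toy_state s = { .page_allocated = true, .spte_mapped = true };

        s.invalidate_in_progress++;     /* begin */
        toy_invalidate(&s);
        toy_fault(&s, true);            /* backs off, nothing installed */
        toy_punch_hole(&s);
        s.invalidate_in_progress--;     /* end */
        return toy_stale(&s);
}
```

In this model toy_race_unbracketed() returns true (the entry points at
a freed page) while toy_race_bracketed() returns false. Whether the
right fix is a bracket like this, a mem_attr update, or something else
is exactly the open question above.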