Date: Fri, 2 Sep 2022 18:27:57 +0800
From: Chao Peng
To: "Kirill A . Shutemov"
Cc: Hugh Dickins, "Kirill A. Shutemov", kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
	linux-kselftest@vger.kernel.org, Paolo Bonzini, Jonathan Corbet,
	Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86@kernel.org, "H . Peter Anvin", Jeff Layton, "J . Bruce Fields",
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	"Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
	luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
	ak@linux.intel.com, david@redhat.com, aarcange@redhat.com,
	ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth,
	mhocko@suse.com, Muchun Song, "Gupta, Pankaj", Elena Reshetova
Subject: Re: [PATCH v7 00/14] KVM: mm: fd-based approach for supporting KVM guest private memory
Message-ID: <20220902102757.GB1712673@chaop.bj.intel.com>
Reply-To: Chao Peng
References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com>
	<20220818132421.6xmjqduempmxnnu2@box>
	<20220820002700.6yflrxklmpsavdzi@box.shutemov.name>
	<20220831142439.65q2gi4g2d2z4ofh@box.shutemov.name>
In-Reply-To: <20220831142439.65q2gi4g2d2z4ofh@box.shutemov.name>

On Wed, Aug 31, 2022 at 05:24:39PM +0300, Kirill A . Shutemov wrote:
> On Sat, Aug 20, 2022 at 10:15:32PM -0700, Hugh Dickins wrote:
> > > I will try next week to rework it as shim to top of shmem. Does it work
> > > for you?
> >
> > Yes, please do, thanks. It's a compromise between us: the initial TDX
> > case has no justification to use shmem at all, but doing it that way
> > will help you with some of the infrastructure, and will probably be
> > easiest for KVM to extend to other more relaxed fd cases later.
>
> Okay, below is my take on the shim approach.
>
> I don't hate how it turned out. It is easier to understand without
> callback exchange thing.
>
> The only caveat is I had to introduce external lock to protect against
> race between lookup and truncate. Otherwise, looks pretty reasonable to me.
>
> I did very limited testing. And it lacks integration with KVM, but API
> changed not substantially, any it should be easy to adopt.

I have integrated this patch with the other KVM patches and verified that
the functionality works well in a TDX environment, with one minor fix
below.

>
> Any comments?
>
> ...
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 08f5f8304746..1853a90f49ff 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -261,7 +261,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>  #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>  #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> +		       MFD_INACCESSIBLE)
>
>  SYSCALL_DEFINE2(memfd_create,
>  		const char __user *, uname,
> @@ -283,6 +284,14 @@ SYSCALL_DEFINE2(memfd_create,
>  			return -EINVAL;
>  	}
>
> +	/* Disallow sealing when MFD_INACCESSIBLE is set. */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_ALLOW_SEALING))
> +		return -EINVAL;
> +
> +	/* TODO: add hugetlb support */
> +	if ((flags & MFD_INACCESSIBLE) && (flags & MFD_HUGETLB))
> +		return -EINVAL;
> +
>  	/* length includes terminating zero */
>  	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
>  	if (len <= 0)
> @@ -331,10 +340,24 @@ SYSCALL_DEFINE2(memfd_create,
>  		*file_seals &= ~F_SEAL_SEAL;
>  	}
>
> +	if (flags & MFD_INACCESSIBLE) {
> +		struct file *inaccessible_file;
> +
> +		inaccessible_file = memfd_mkinaccessible(file);
> +		if (IS_ERR(inaccessible_file)) {
> +			error = PTR_ERR(inaccessible_file);
> +			goto err_file;
> +		}

The new file should also be marked as O_LARGEFILE; otherwise, setting an
initial size greater than 2^31 on the fd will be refused by ftruncate().

+		inaccessible_file->f_flags |= O_LARGEFILE;
+

> +
> +		file = inaccessible_file;
> +	}
> +
>  	fd_install(fd, file);
>  	kfree(name);
>  	return fd;
>
> +err_file:
> +	fput(file);
>  err_fd:
>  	put_unused_fd(fd);
>  err_name: