Date: Wed, 18 Jan 2023 16:16:41 +0800
From: Chao Peng
To: Sean Christopherson
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
    Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Arnd Bergmann,
    Naoya Horiguchi, Miaohe Lin, x86@kernel.org, "H . Peter Anvin",
    Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton, Shuah Khan,
    Mike Rapoport, Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
    Vishal Annapurve, Yu Zhang, "Kirill A . Shutemov", luto@kernel.org,
    jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
    david@redhat.com, aarcange@redhat.com, ddutile@redhat.com,
    dhildenb@redhat.com, Quentin Perret, tabba@google.com, Michael Roth,
    mhocko@suse.com, wei.w.wang@intel.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
    create restricted user memory
Message-ID: <20230118081641.GA303785@chaop.bj.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
    <20221202061347.1070246-2-chao.p.peng@linux.intel.com>
    <20230117124107.GA273037@chaop.bj.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Chao Peng wrote:
> > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > >
> > > Two major design issues that we overlooked long ago:
> > >
> > >   1. Blindly invoking notifiers will not scale.  E.g. if userspace configures a
> > >      VM with a large number of convertible memslots that are all backed by a
> > >      single large restrictedmem instance, then converting a single page will
> > >      result in a linear walk through all memslots.  I don't expect anyone to
> > >      actually do something silly like that, but I also never expected there to be
> > >      a legitimate usecase for thousands of memslots.
> > >
> > >   2. This approach fails to provide the ability for KVM to ensure a guest has
> > >      exclusive access to a page.  As discussed in the past, the kernel can rely
> > >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > >      only for SNP and TDX VMs.  For VMs where userspace is trusted to some extent,
> > >      e.g. SEV, there is value in ensuring a 1:1 association.
> > >
> > >      And probably more importantly, relying on hardware for SNP and TDX yields a
> > >      poor ABI and complicates KVM's internals.  If the kernel doesn't guarantee a
> > >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> > >      from a restrictedmem instance to multiple VMs, then failure will occur only
> > >      when KVM tries to assign the page to the second VM.  That will happen deep
> > >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> > >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> >
> > It may not be a valid usage, but in my TDX environment I do meet below
> > issue.
> >
> > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> >
> > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > and slot#2 fails due to below exclusive check.
> >
> > Currently I changed QEMU code to mark these alias slots as shared
> > instead of private but I'm not 100% confident this is correct fix.
>
> That's a QEMU bug of sorts.  SMM is mutually exclusive with TDX, QEMU shouldn't
> be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.

Thanks for the confirmation. As long as we only bind one notifier for
each address, using xarray does make things simple.

Chao
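
For readers following the thread, the "one notifier per address" idea can
be illustrated with a small userspace model. This is only a sketch under
stated assumptions, not the code from the patch series: every name below
(model_restrictedmem, model_notifier, model_bind, model_invalidate) is
hypothetical, and a fixed-size pointer array stands in for the kernel
xarray. It shows two properties discussed above: binding a range that
overlaps an existing binding is rejected with -EINVAL (the ret=-22 seen for
Slot#2 in the trace), and invalidation only touches the owners found in
the affected range instead of walking a list of all notifiers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
#define MODEL_NR_PAGES 1024

struct model_notifier {
	int id;                          /* illustrative owner tag, e.g. a memslot */
	unsigned long nr_invalidations;  /* how many times this owner was notified */
};

struct model_restrictedmem {
	/* Stand-in for an xarray: page index -> owning notifier, NULL if unbound. */
	struct model_notifier *bindings[MODEL_NR_PAGES];
};

/* Bind pages [start, end) to one notifier; fail if any page already has an owner. */
int model_bind(struct model_restrictedmem *rm, struct model_notifier *n,
	       size_t start, size_t end)
{
	size_t i;

	if (start >= end || end > MODEL_NR_PAGES)
		return -EINVAL;

	/* Exclusivity check first, so a failed bind leaves no partial state. */
	for (i = start; i < end; i++)
		if (rm->bindings[i])
			return -EINVAL;

	for (i = start; i < end; i++)
		rm->bindings[i] = n;

	return 0;
}

/*
 * Invalidate pages [start, end): notify once per contiguous run of pages
 * owned by the same notifier. Only the affected range is scanned.
 */
void model_invalidate(struct model_restrictedmem *rm, size_t start, size_t end)
{
	struct model_notifier *last = NULL;
	size_t i;

	for (i = start; i < end && i < MODEL_NR_PAGES; i++) {
		struct model_notifier *n = rm->bindings[i];

		if (n && n != last)
			n->nr_invalidations++;
		last = n;
	}
}
```

In this model, an aliasing slot such as the SMRAM case would fail at bind
time rather than deep inside a later page assignment, which is the ABI
property argued for in point 2 of the quoted review.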