Date: Fri, 5 Aug 2022 15:38:59 +0200
Subject: Re: [PATCHv7 02/14] mm: Add support for unaccepted memory
To: David Hildenbrand, "Kirill A.
 Shutemov", Borislav Petkov, Andy Lutomirski, Sean Christopherson,
 Andrew Morton, Joerg Roedel, Ard Biesheuvel
Cc: Andi Kleen, Kuppuswamy Sathyanarayanan, David Rientjes, Tom Lendacky,
 Thomas Gleixner, Peter Zijlstra, Paolo Bonzini, Ingo Molnar,
 Dario Faggioli, Dave Hansen, Mike Rapoport,
 marcelo.cerri@canonical.com, tim.gardner@canonical.com,
 khalid.elmously@canonical.com, philip.cox@canonical.com, x86@kernel.org,
 linux-mm@kvack.org, linux-coco@lists.linux.dev,
 linux-efi@vger.kernel.org, linux-kernel@vger.kernel.org, Mike Rapoport,
 Mel Gorman
References: <20220614120231.48165-1-kirill.shutemov@linux.intel.com>
 <20220614120231.48165-3-kirill.shutemov@linux.intel.com>
 <8cf143e7-2b62-1a1e-de84-e3dcc6c027a4@suse.cz>
From: Vlastimil Babka
X-Mailing-List: linux-kernel@vger.kernel.org

On 8/5/22 14:09, David Hildenbrand wrote:
> On 05.08.22 13:49, Vlastimil Babka wrote:
>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory
>>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>>> SEV-SNP, require memory to be accepted before it can be used by the
>>> guest. Accepting happens via a protocol specific to the Virtual Machine
>>> platform.
>>>
>>> There are several ways kernel can deal with unaccepted memory:
>>>
>>>  1. Accept all the memory during the boot. It is easy to implement and
>>>     it doesn't have runtime cost once the system is booted. The downside
>>>     is very long boot time.
>>>
>>>     Accept can be parallelized to multiple CPUs to keep it manageable
>>>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>>     memory bandwidth and does not scale beyond the point.
>>>
>>>  2. Accept a block of memory on the first use. It requires more
>>>     infrastructure and changes in page allocator to make it work, but
>>>     it provides good boot time.
>>>
>>>     On-demand memory accept means latency spikes every time kernel steps
>>>     onto a new memory block. The spikes will go away once workload data
>>>     set size gets stabilized or all memory gets accepted.
>>>
>>>  3. Accept all memory in background. Introduce a thread (or multiple)
>>>     that gets memory accepted proactively. It will minimize time the
>>>     system experience latency spikes on memory allocation while keeping
>>>     low boot time.
>>>
>>>     This approach cannot function on its own. It is an extension of #2:
>>>     background memory acceptance requires functional scheduler, but the
>>>     page allocator may need to tap into unaccepted memory before that.
>>>
>>>     The downside of the approach is that these threads also steal CPU
>>>     cycles and memory bandwidth from the user's workload and may hurt
>>>     user experience.
>>>
>>> Implement #2 for now. It is a reasonable default. Some workloads may
>>> want to use #1 or #3 and they can be implemented later based on user's
>>> demands.
>>>
>>> Support of unaccepted memory requires a few changes in core-mm code:
>>>
>>>   - memblock has to accept memory on allocation;
>>>
>>>   - page allocator has to accept memory on the first allocation of the
>>>     page;
>>>
>>> Memblock change is trivial.
>>>
>>> The page allocator is modified to accept pages on the first allocation.
>>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>>> used to indicate that the page requires acceptance.
>>>
>>> Architecture has to provide two helpers if it wants to support
>>> unaccepted memory:
>>>
>>>  - accept_memory() makes a range of physical addresses accepted.
>>>
>>>  - range_contains_unaccepted_memory() checks anything within the range
>>>    of physical addresses requires acceptance.
>>>
>>> Signed-off-by: Kirill A. Shutemov
>>> Acked-by: Mike Rapoport # memblock
>>> Reviewed-by: David Hildenbrand
>>
>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>> before, but it's really not great how this affects the core page allocator
>> paths. Wouldn't it be possible to only release pages to page allocator when
>> accepted, and otherwise use some new per-zone variables together with the
>> bitmap to track how much exactly is where to accept? Then it could be hooked
>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
>> accept them and continue. With a static key to flip in case we eventually
>> accept everything. Because this is really similar scenario to the deferred
>> init and that one was solved in a way that adds minimal overhead.
>
> I kind of like just having the memory stats being correct (e.g., free
> memory) and acceptance being an internal detail to be triggered when
> allocating pages -- just like the arch_alloc_page() callback.

Hm, good point about the stats. Could be tweaked perhaps so it appears
correct on the outside, but might be tricky.

> I'm sure we could optimize for the !unaccepted memory via static keys
> also in this version with some checks at the right places if we find
> this to hurt performance?

It would be great if we would at least somehow hit the necessary code only
when dealing with a >=pageblock size block. The bitmap approach and
accepting everything smaller upfront actually seem rather compatible. Yet
in the current patch we e.g. check PageUnaccepted(buddy) on every buddy
size while merging.

What about a list that sits beside the existing free_area, contains only
blocks of unaccepted pages of >=pageblock order (no migratetype
distinguished), and that we tap into approximately before
__rmqueue_fallback()? There would be some trickery around releasing the
zone lock for doing accept_memory(), but it should be manageable.
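
A rough userspace sketch of the idea above, in plain C rather than kernel
code. It only models the control flow: a per-zone side list of unaccepted
blocks that is consulted when the watermark check fails, plus a flag
standing in for the static key. Every name here (zone_model,
unaccepted_block, zone_alloc_pages, accept_memory_model, the 512-page
pageblock size) is hypothetical and exists only for illustration; locking
and migratetypes are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: a pageblock is 512 pages (order 9 with 4 KiB pages). */
#define PAGEBLOCK_NR_PAGES 512UL

/* One unaccepted block of at least pageblock size; hypothetical type. */
struct unaccepted_block {
	unsigned long start_pfn;
	unsigned long nr_pages;
	struct unaccepted_block *next;
};

/* Toy stand-in for the parts of a zone that matter to this sketch. */
struct zone_model {
	unsigned long free_pages;       /* accepted pages available to allocate */
	unsigned long unaccepted_pages; /* pages still parked on the side list */
	unsigned long watermark;
	bool have_unaccepted;           /* plays the role of the static key */
	struct unaccepted_block *unaccepted; /* side list, no migratetype */
};

static unsigned long accept_calls; /* counts modelled accept operations */

/* Stand-in for the arch helper; a real one would talk to the platform. */
static void accept_memory_model(unsigned long start_pfn, unsigned long nr_pages)
{
	(void)start_pfn;
	(void)nr_pages;
	accept_calls++;
}

/* Park a block on the side list instead of releasing it as free pages. */
static void zone_add_unaccepted(struct zone_model *z, struct unaccepted_block *b)
{
	b->next = z->unaccepted;
	z->unaccepted = b;
	z->unaccepted_pages += b->nr_pages;
	z->have_unaccepted = true;
}

/* Accept one block and hand its pages over to the free page count. */
static bool zone_accept_one_block(struct zone_model *z)
{
	struct unaccepted_block *b = z->unaccepted;

	if (!b)
		return false;
	z->unaccepted = b->next;
	/* A kernel version would have to drop the zone lock around this call. */
	accept_memory_model(b->start_pfn, b->nr_pages);
	z->unaccepted_pages -= b->nr_pages;
	z->free_pages += b->nr_pages;
	if (!z->unaccepted)
		z->have_unaccepted = false; /* everything accepted: flip the key */
	return true;
}

/* Allocate nr_pages; touch unaccepted memory only if the watermark fails. */
static bool zone_alloc_pages(struct zone_model *z, unsigned long nr_pages)
{
	while (z->free_pages < z->watermark + nr_pages) {
		if (!z->have_unaccepted || !zone_accept_one_block(z))
			return false;
	}
	z->free_pages -= nr_pages;
	return true;
}
```

In this model an allocation that passes the watermark check never looks at
the side list, and once the list is empty the flag keeps later allocations
from looking at it at all, which is the low-overhead property being asked
for. Note that free_pages here excludes unaccepted memory, so the stats
concern raised above is not addressed by this sketch.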