Date: Thu, 31 Aug 2023 11:28:49 +0100
From: Mel Gorman
To: Muchun Song
Cc: Linux-MM, Mike Kravetz, Mike Rapoport, LKML, Muchun Song,
    fam.zheng@bytedance.com, liangma@liangbit.com,
    punit.agrawal@bytedance.com, Andrew Morton, Usama Arif
Subject: Re: [External] [v3 4/4] mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO
Message-ID: <20230831102849.zsbwebyq4hkyvwyb@techsingularity.net>
References: <20230825111836.1715308-1-usama.arif@bytedance.com>
    <20230825111836.1715308-5-usama.arif@bytedance.com>
    <486CFF93-3BB1-44CD-B0A0-A47F560F2CAE@linux.dev>
    <20230831095801.76rtpgdsvdijbw5t@techsingularity.net>
    <07E9202B-CA8B-4E1E-93FC-7BF84CB8E988@linux.dev>
In-Reply-To: <07E9202B-CA8B-4E1E-93FC-7BF84CB8E988@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 31, 2023 at 06:01:08PM +0800, Muchun Song wrote:
>
>
> > On Aug 31, 2023, at 17:58, Mel Gorman wrote:
> >
> > On Thu, Aug 31, 2023 at 02:21:06PM +0800, Muchun Song wrote:
> >>
> >>
> >>> On Aug 30, 2023, at 18:27, Usama Arif wrote:
> >>> On 28/08/2023 12:33, Muchun Song wrote:
> >>>>> On Aug 25, 2023, at 19:18, Usama Arif wrote:
> >>>>>
> >>>>> The new boot flow when it comes to initialization of gigantic pages
> >>>>> is as follows:
> >>>>> - At boot time, for a gigantic page during __alloc_bootmem_hugepage,
> >>>>> the region after the first struct page is marked as noinit.
> >>>>> - This results in only the first struct page being
> >>>>> initialized in reserve_bootmem_region. As the tail struct pages are
> >>>>> not initialized at this point, there can be a significant saving
> >>>>> in boot time if HVO succeeds later on.
> >>>>> - Later on in the boot, HVO is attempted. If it's successful, only the first
> >>>>> HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
> >>>>> after the head struct page are initialized. If it is not successful,
> >>>>> then all of the tail struct pages are initialized.
> >>>>>
> >>>>> Signed-off-by: Usama Arif
> >>>> This edition is simpler than ever before, thanks for your work.
> >>>> There is a premise that other subsystems do not access vmemmap pages
> >>>> before the initialization of vmemmap pages associated with HugeTLB
> >>>> pages allocated from bootmem for your optimization. However, IIUC, the
> >>>> compacting path could access an arbitrary struct page when memory fails
> >>>> to be allocated via the buddy allocator. So we should make sure that
> >>>> those struct pages are not referenced in this routine. And I know
> >>>> if CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, it will encounter
> >>>> the same issue, but I don't find any code to prevent this from
> >>>> happening. I need more time to confirm this, if someone already knows,
> >>>> please let me know, thanks. So I think HugeTLB should adopt a similar
> >>>> way to prevent this.
> >>>> Thanks.
> >>>
> >>> Thanks for the reviews.
> >>>
> >>> So if I understand it correctly, the uninitialized pages due to the optimization in this patch and due to DEFERRED_STRUCT_PAGE_INIT should be treated in the same way during compaction. I see that in isolate_freepages during compaction there is a check to see if the PageBuddy flag is set and also there are calls like __pageblock_pfn_to_page to check if the pageblock is valid.
> >>>
> >>> But if the struct pages are uninitialized then they would contain random data and these checks could pass if certain bits were set?
> >>>
> >>> Compaction is done on the free list. I think the uninitialized struct pages, at least from DEFERRED_STRUCT_PAGE_INIT, would be part of the free list, so I think their pfns would be considered for compaction.
> >>>
> >>> Could someone more familiar with DEFERRED_STRUCT_PAGE_INIT and compaction confirm how the uninitialized struct pages are handled when compaction happens? Thanks!
> >>
> >> Hi Mel,
> >>
> >> Could you help us answer this question? I think you must be the expert on
> >> CONFIG_DEFERRED_STRUCT_PAGE_INIT. I summarize the context here.
> >> As we all know, some struct pages are uninitialized when
> >> CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled. If someone allocates a larger
> >> block of memory (e.g. order 4) via the buddy allocator and the allocation
> >> fails, then we will go into the compacting routine, which will traverse all
> >> pfns and use pfn_to_page to access each struct page. However, those struct
> >> pages may be uninitialized (so they hold arbitrary data).
> >> Our question is how to prevent the compacting routine from accessing those
> >> uninitialized struct pages? We would appreciate it if you know the answer.
> >>
> >
> > I didn't check the code but IIRC, the struct pages should be at least
> > valid and not contain arbitrary data once page_alloc_init_late finishes.
>
> However, the buddy allocator is ready before page_alloc_init_late(), so it
> may access arbitrary data in the compacting routine, right?
>

Again, I didn't check the code but given that there is a minimum amount of
the zone that must be initialised and only the highest zone is deferred
for initialisation (again, didn't check this), there may be an implicit
assumption that compaction is not required in early boot. Even if it was
attempted, it would likely have nothing to do as fragmentation-related
events that can be resolved by compaction should not occur that early in
boot.

-- 
Mel Gorman
SUSE Labs
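
The saving described in the quoted patch description can be put into rough numbers. The sketch below is illustrative arithmetic only. The 4 KiB base page, the 64-byte struct page, the 1 GiB gigantic page, and HUGETLB_VMEMMAP_RESERVE_SIZE being equal to one base page are all assumptions (typical for x86-64, but not stated anywhere in this thread), and the function names are hypothetical.

```python
# Illustrative arithmetic only. Every constant below is an assumption,
# not a value taken from this thread.
PAGE_SIZE = 4096                           # assumed base page size in bytes
STRUCT_PAGE_SIZE = 64                      # assumed sizeof(struct page)
HUGETLB_VMEMMAP_RESERVE_SIZE = PAGE_SIZE   # assumed: one vmemmap page is kept
GIGANTIC_PAGE_SIZE = 1 << 30               # assumed 1 GiB gigantic page


def tail_pages_total(huge_size=GIGANTIC_PAGE_SIZE, page_size=PAGE_SIZE):
    """Tail struct pages of one gigantic page: every struct page except the head."""
    return huge_size // page_size - 1


def tail_pages_after_hvo(reserve=HUGETLB_VMEMMAP_RESERVE_SIZE,
                         struct_page_size=STRUCT_PAGE_SIZE):
    """Tail struct pages initialized when HVO succeeds, using the formula
    HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 from the patch text."""
    return reserve // struct_page_size - 1


if __name__ == "__main__":
    total = tail_pages_total()
    kept = tail_pages_after_hvo()
    print("tail struct pages without HVO:", total)
    print("tail struct pages with HVO:   ", kept)
    print("initializations skipped:      ", total - kept)
```

Under those assumptions, a successful HVO leaves 63 tail struct pages to initialize per gigantic page instead of 262143.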