From: Muchun Song
Date: Wed, 21 Apr 2021 11:42:31 +0800
Subject: Re: [External] Re: [PATCH v20 6/9] mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page
To: Mike Kravetz
Cc: Jonathan Corbet, Thomas Gleixner, Ingo Molnar, bp@alien8.de, X86 ML,
 hpa@zytor.com, dave.hansen@linux.intel.com, luto@kernel.org,
 Peter Zijlstra, Alexander Viro, Andrew Morton, paulmck@kernel.org,
 pawan.kumar.gupta@linux.intel.com, Randy Dunlap, oneukum@suse.com,
 anshuman.khandual@arm.com, jroedel@suse.de, Mina Almasry,
 David Rientjes, Matthew Wilcox, Oscar Salvador, Michal Hocko,
 "Song Bao Hua (Barry Song)", David Hildenbrand,
 HORIGUCHI NAOYA (堀口 直也), Joao Martins, Xiongchun duan,
 fam.zheng@bytedance.com, linux-doc@vger.kernel.org, LKML,
 Linux Memory Management List, linux-fsdevel
In-Reply-To: <8de3d7a0-f100-5d50-fe54-b83af07570f4@oracle.com>
References: <20210415084005.25049-1-songmuchun@bytedance.com>
 <20210415084005.25049-7-songmuchun@bytedance.com>
 <5f914142-009b-3bfc-9cb2-46154f610e29@oracle.com>
 <8de3d7a0-f100-5d50-fe54-b83af07570f4@oracle.com>

On Wed, Apr 21, 2021 at 1:48 AM Mike Kravetz wrote:
>
> On 4/20/21 1:46 AM, Muchun Song wrote:
> > On Tue, Apr 20, 2021 at 7:20 AM Mike Kravetz wrote:
> >>
> >> On 4/15/21 1:40 AM, Muchun Song wrote:
> >>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> >>> index 0abed7e766b8..6e970a7d3480 100644
> >>> --- a/include/linux/hugetlb.h
> >>> +++ b/include/linux/hugetlb.h
> >>> @@ -525,6 +525,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
> >>>   * code knows it has only reference. All other examinations and
> >>>   * modifications require hugetlb_lock.
> >>>   * HPG_freed - Set when page is on the free lists.
> >>> + * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
> >>>   * Synchronization: hugetlb_lock held for examination and modification.
> >>
> >> I like the per-page flag.  In previous versions of the series, you just
> >> checked free_vmemmap_pages_per_hpage() to determine if vmemmap
> >> should be allocated.  Is there any change in functionality that makes it
> >> necessary to set the flag in each page, or is it mostly for flexibility
> >> going forward?
> >
> > Actually, only the routine that dissolves the page cares whether
> > the page is on the buddy free list when update_and_free_page
> > returns.  But we cannot change the return type of
> > update_and_free_page (e.g. from 'void' to 'int').  Why?  If the
> > hugepage is freed through a kworker, we cannot know the return
> > value by the time update_and_free_page returns, so adding a
> > return value seems odd.
> >
> > In the dissolving routine, we can allocate the vmemmap pages first;
> > if that succeeds, we can be sure that update_and_free_page will
> > successfully free the page.  So I need some way to mark a page
> > whose vmemmap pages do not need to be allocated.
> >
> > On the surface, we seem to have a straightforward way to do this:
> > add a new parameter 'alloc_vmemmap' to update_and_free_page() to
> > tell it whether it still needs to allocate the vmemmap pages (a
> > caller that has already allocated them passes false).  Just like
> > below.
> >
> > void update_and_free_page(struct hstate *h, struct page *page, bool atomic,
> >                           bool alloc_vmemmap)
> > {
> >         if (alloc_vmemmap)
> >                 // allocate vmemmap pages
> > }
> >
> > But if the page is freed through a kworker, how do we pass
> > 'alloc_vmemmap' to the kworker?  We can embed this information in
> > a per-page flag instead.  So even with an 'alloc_vmemmap'
> > parameter, HPG_vmemmap_optimized would still be necessary for the
> > kworker case.
> >
> > So it seems that introducing HPG_vmemmap_optimized is a good choice.
>
> Thanks for the explanation!
>
> Agree that the flag is a good choice.  How about adding a comment like
> this above the alloc_huge_page_vmemmap call in dissolve_free_huge_page?
>
> /*
>  * Normally update_and_free_page will allocate required vmemmap before
>  * freeing the page.  update_and_free_page will fail to free the page
>  * if it can not allocate required vmemmap.  We need to adjust
>  * max_huge_pages if the page is not freed.  Attempt to allocate
>  * vmemmap here so that we can take appropriate action on failure.
>  */

Thanks. I will add this comment.
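
Just to confirm we are looking at the same call site, I think
dissolve_free_huge_page would end up looking roughly like below.
This is an untested, simplified sketch: it assumes
alloc_huge_page_vmemmap() returns 0 on success as in this series,
and the exact locking and error paths may still change.

		int rc;

		/*
		 * Normally update_and_free_page will allocate required
		 * vmemmap before freeing the page.  update_and_free_page
		 * will fail to free the page if it can not allocate the
		 * required vmemmap.  We need to adjust max_huge_pages if
		 * the page is not freed.  Attempt to allocate vmemmap here
		 * so that we can take appropriate action on failure.
		 */
		rc = alloc_huge_page_vmemmap(h, head);
		if (!rc) {
			/* Vmemmap is in place, so this free cannot fail. */
			update_and_free_page(h, head, false);
		} else {
			/*
			 * Put the page back on the hugetlb free list and
			 * undo the max_huge_pages adjustment.  The new
			 * add_hugetlb_page helper asserts hugetlb_lock is
			 * held, so take it here.
			 */
			spin_lock_irq(&hugetlb_lock);
			add_hugetlb_page(h, head, false);
			h->max_huge_pages++;
			spin_unlock_irq(&hugetlb_lock);
		}

		return rc;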

>
> ...
> >>> +static void add_hugetlb_page(struct hstate *h, struct page *page,
> >>> +                             bool adjust_surplus)
> >>> +{
> >>
> >> We need to be a bit careful with hugepage specific flags that may be
> >> set.  The routine remove_hugetlb_page, which is called for 'page'
> >> before this routine, will not clear any of the hugepage specific
> >> flags.  If the calling path goes through free_huge_page, most but
> >> not all flags are cleared.
> >>
> >> We had a discussion about clearing the page->private field in Oscar's
> >> series.  In the case of 'new' pages we can assume page->private is
> >> cleared, but perhaps we should not make that assumption here.  Since
> >> we hope to rarely call this routine, it might be safer to do
> >> something like:
> >>
> >>         set_page_private(page, 0);
> >>         SetHPageVmemmapOptimized(page);
> >
> > Agree. Thanks for your reminder. I will fix this.
> >
> >>
> >>> +       int nid = page_to_nid(page);
> >>> +
> >>> +       lockdep_assert_held(&hugetlb_lock);
> >>> +
> >>> +       INIT_LIST_HEAD(&page->lru);
> >>> +       h->nr_huge_pages++;
> >>> +       h->nr_huge_pages_node[nid]++;
> >>> +
> >>> +       if (adjust_surplus) {
> >>> +               h->surplus_huge_pages++;
> >>> +               h->surplus_huge_pages_node[nid]++;
> >>> +       }
> >>> +
> >>> +       set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
> >>> +
> >>> +       /*
> >>> +        * The refcount can possibly be increased by memory-failure or
> >>> +        * soft_offline handlers.
> >>> +        */
> >>> +       if (likely(put_page_testzero(page))) {
> >>
> >> In the existing code there is no such test.  Is the need for the test
> >> because of something introduced in the new code?
> >
> > No.
> >
> >> Or, should this test be in the existing code?
> >
> > Yes. gather_surplus_pages should be fixed. I can fix it in a
> > separate patch.
> >
> > The possible bad scenario:
> >
> > CPU0:                            CPU1:
> >                                  set_compound_page_dtor(HUGETLB_PAGE_DTOR);
> > memory_failure_hugetlb
> >   get_hwpoison_page
> >     __get_hwpoison_page
> >       get_page_unless_zero
> >                                  put_page_testzero()
> >
> > put_page(page)
> >
> > More details and discussion can be found at:
> >
> > https://lore.kernel.org/linux-doc/CAMZfGtVRSBkKe=tKAKLY8dp_hywotq3xL+EJZNjXuSKt3HK3bQ@mail.gmail.com/
> >
>
> Thank you!  I did not remember that discussion.
>
> It would be helpful to add a separate patch for gather_surplus_pages.
> Otherwise, we have the VM_BUG_ON there and not in add_hugetlb_page.
>

Agree. Will do.

> --
> Mike Kravetz
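
P.S. For the separate gather_surplus_pages patch, I am thinking of
something roughly like the following. This is only an untested sketch
that mirrors the put_page_testzero() handling in add_hugetlb_page
above; whether we should keep a VM_BUG_ON here instead of skipping
the enqueue is exactly the open question about where the VM_BUG_ON
belongs.

	/* Free the needed pages to the hugetlb pool. */
	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
		if ((--needed) < 0)
			break;
		/*
		 * The refcount can possibly be increased by memory-failure
		 * or soft_offline handlers, which can take a reference with
		 * get_page_unless_zero() once the hugetlb destructor is set.
		 * So do not assume this put drops the last reference; if it
		 * does not, the page will be enqueued later from
		 * free_huge_page() when that temporary reference is dropped.
		 */
		if (likely(put_page_testzero(page)))
			enqueue_huge_page(h, page);
	}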