From: Muchun Song
Date: Mon, 25 Jan 2021 15:41:29 +0800
Subject: Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page
References: <20210117151053.24600-1-songmuchun@bytedance.com> <20210117151053.24600-6-songmuchun@bytedance.com> <6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>
To: David Rientjes
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com, bp@alien8.de,
    x86@kernel.org, hpa@zytor.com, dave.hansen@linux.intel.com, luto@kernel.org,
    Peter Zijlstra, viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com, Randy Dunlap,
    oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de, Mina Almasry,
    Matthew Wilcox, Oscar Salvador, Michal Hocko,
    "Song Bao Hua (Barry Song)", David Hildenbrand, HORIGUCHI NAOYA (堀口 直也),
    Xiongchun duan, linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
    linux-fsdevel

On Mon, Jan 25, 2021 at 2:40 PM Muchun Song wrote:
>
> On Mon, Jan 25, 2021 at 8:05 AM David Rientjes wrote:
> >
> > On Sun, 17 Jan 2021, Muchun Song wrote:
> >
> > > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > > index ce4be1fa93c2..3b146d5949f3 100644
> > > --- a/mm/sparse-vmemmap.c
> > > +++ b/mm/sparse-vmemmap.c
> > > @@ -29,6 +29,7 @@
> > >  #include
> > >  #include
> > >  #include
> > > +#include
> > >
> > >  #include
> > >  #include
> > > @@ -40,7 +41,8 @@
> > >   * @remap_pte:		called for each non-empty PTE (lowest-level) entry.
> > >   * @reuse_page:		the page which is reused for the tail vmemmap pages.
> > >   * @reuse_addr:		the virtual address of the @reuse_page page.
> > > - * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
> > > + * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> > > + *			or is mapped from.
> > >   */
> > >  struct vmemmap_remap_walk {
> > >  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> > > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> > >  	struct list_head *vmemmap_pages;
> > >  };
> > >
> > > +/* The gfp mask of allocating vmemmap page */
> > > +#define GFP_VMEMMAP_PAGE		\
> > > +	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > > +
> >
> > This is unnecessary, just use the gfp mask directly in the allocator.
>
> Will do. Thanks.
>
> >
> > >  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> > >  			      unsigned long end,
> > >  			      struct vmemmap_remap_walk *walk)
> > > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> > >  	free_vmemmap_page_list(&vmemmap_pages);
> > >  }
> > >
> > > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > > +				struct vmemmap_remap_walk *walk)
> > > +{
> > > +	pgprot_t pgprot = PAGE_KERNEL;
> > > +	struct page *page;
> > > +	void *to;
> > > +
> > > +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> > > +
> > > +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > > +	list_del(&page->lru);
> > > +	to = page_to_virt(page);
> > > +	copy_page(to, (void *)walk->reuse_addr);
> > > +
> > > +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > > +}
> > > +
> > > +static void alloc_vmemmap_page_list(struct list_head *list,
> > > +				    unsigned long start, unsigned long end)
> > > +{
> > > +	unsigned long addr;
> > > +
> > > +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > +		struct page *page;
> > > +		int nid = page_to_nid((const void *)addr);
> > > +
> > > +retry:
> > > +		page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > > +		if (unlikely(!page)) {
> > > +			msleep(100);
> > > +			/*
> > > +			 * We should retry infinitely, because we cannot
> > > +			 * handle allocation failures. Once we allocate
> > > +			 * vmemmap pages successfully, then we can free
> > > +			 * a HugeTLB page.
> > > +			 */
> > > +			goto retry;
> >
> > Ugh, I don't think this will work, there's no guarantee that we'll ever
> > succeed and now we can't free a 2MB hugepage because we cannot allocate a
> > 4KB page. We absolutely have to ensure we make forward progress here.
>
> This can trigger an OOM when there is no memory, killing someone to release
> some memory. Right?
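To make those two review points concrete, below is a minimal illustrative
sketch (not the next version of the patch) of an allocator that passes the
gfp mask directly, as suggested, and returns an error to the caller instead
of retrying forever, so that the caller decides how to guarantee forward
progress; the int return value and the rollback loop are assumptions made
only for this example:

static int alloc_vmemmap_page_list(struct list_head *list,
				   unsigned long start, unsigned long end)
{
	gfp_t gfp_mask = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN |
			 __GFP_THISNODE;
	unsigned long addr;
	struct page *page, *next;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		int nid = page_to_nid((const void *)addr);

		/* No retry loop: report failure instead of sleeping forever. */
		page = alloc_pages_node(nid, gfp_mask, 0);
		if (!page)
			goto out;
		list_add_tail(&page->lru, list);
	}
	return 0;
out:
	/* Roll back the pages allocated so far and let the caller decide. */
	list_for_each_entry_safe(page, next, list, lru) {
		list_del(&page->lru);
		__free_page(page);
	}
	return -ENOMEM;
}

Whether the caller then falls back to a stronger allocation or to reusing
part of the HugeTLB page itself is a separate decision.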
>
> >
> > We're going to be freeing the hugetlb page after this succeeds, can we
> > not use part of the hugetlb page that we're freeing for this memory
> > instead?
>
> It seems a good idea. We can try to allocate memory first; if that succeeds,
> just use the new pages to remap (this can reduce memory fragmentation).
> If it fails, we can use part of the hugetlb page to remap. What's your
> opinion about this?

If the HugeTLB page is a gigantic page allocated from CMA, then we cannot
use part of the HugeTLB page for the remapping. Right? (A rough sketch of
this allocate-then-fallback idea follows the quoted patch at the end of
this mail.)

> >
> > > +		}
> > > +		list_add_tail(&page->lru, list);
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > > + *			to the page which is from the @vmemmap_pages
> > > + *			respectively.
> > > + * @start:	start address of the vmemmap virtual address range.
> > > + * @end:	end address of the vmemmap virtual address range.
> > > + * @reuse:	reuse address.
> > > + */
> > > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > > +			 unsigned long reuse)
> > > +{
> > > +	LIST_HEAD(vmemmap_pages);
> > > +	struct vmemmap_remap_walk walk = {
> > > +		.remap_pte	= vmemmap_restore_pte,
> > > +		.reuse_addr	= reuse,
> > > +		.vmemmap_pages	= &vmemmap_pages,
> > > +	};
> > > +
> > > +	might_sleep();
> > > +
> > > +	/* See the comment in the vmemmap_remap_free(). */
> > > +	BUG_ON(start - reuse != PAGE_SIZE);
> > > +
> > > +	alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > > +	vmemmap_remap_range(reuse, end, &walk);
> > > +}
> > > +
> > >  /*
> > >   * Allocate a block of memory to be used to back the virtual memory map
> > >   * or to back the page tables that are used to create the mapping.
> > > --
> > > 2.11.0
> > >
> > >
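For discussion, here is a rough sketch of the allocate-then-fallback idea:
allocate fresh vmemmap pages first, and only if that fails carve the needed
pages out of the HugeTLB page that is about to be freed, skipping the
fallback for gigantic pages that live in CMA. It builds on the fallible
alloc_vmemmap_page_list() sketched earlier in this mail; get_vmemmap_pages()
and take_pages_from_hugetlb() are hypothetical placeholders, and the CMA
check shown is only approximate, so none of this is code from the series:

/* Hypothetical sketch only; take_pages_from_hugetlb() does not exist. */
static int get_vmemmap_pages(struct hstate *h, struct page *head,
			     struct list_head *list,
			     unsigned long start, unsigned long end)
{
	/* Preferred path: brand new pages, keeps the HugeTLB page intact. */
	if (!alloc_vmemmap_page_list(list, start, end))
		return 0;

	/*
	 * A gigantic page sitting in a CMA area has to be returned to CMA
	 * as a whole, so its own pages cannot be reused for the vmemmap.
	 */
	if (hstate_is_gigantic(h) && is_migrate_cma_page(head))
		return -ENOMEM;

	/* Fallback: reuse tail pages of the HugeTLB page being freed. */
	return take_pages_from_hugetlb(head, list, start, end);
}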