From: Muchun Song <songmuchun@bytedance.com>
Date: Mon, 25 Jan 2021 14:40:27 +0800
Subject: Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the
	vmemmap pages associated with each HugeTLB page
To: David Rientjes
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
	viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
	mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
	Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
	jroedel@suse.de, Mina Almasry,
	Matthew Wilcox, Oscar Salvador, Michal Hocko,
	"Song Bao Hua (Barry Song)", David Hildenbrand,
	HORIGUCHI NAOYA (堀口 直也), Xiongchun duan,
	linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
	linux-fsdevel
In-Reply-To: <6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>
References: <20210117151053.24600-1-songmuchun@bytedance.com>
	<20210117151053.24600-6-songmuchun@bytedance.com>
	<6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>

On Mon, Jan 25, 2021 at 8:05 AM David Rientjes wrote:
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index ce4be1fa93c2..3b146d5949f3 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -29,6 +29,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include
> >  #include
> > @@ -40,7 +41,8 @@
> >   * @remap_pte:		called for each non-empty PTE (lowest-level) entry.
> >   * @reuse_page:		the page which is reused for the tail vmemmap pages.
> >   * @reuse_addr:		the virtual address of the @reuse_page page.
> > - * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
> > + * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> > + *			or is mapped from.
> >   */
> >  struct vmemmap_remap_walk {
> >  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> >  	struct list_head *vmemmap_pages;
> >  };
> >
> > +/* The gfp mask of allocating vmemmap page */
> > +#define GFP_VMEMMAP_PAGE	\
> > +	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > +
>
> This is unnecessary, just use the gfp mask directly in the allocator call.

Will do. Thanks.

> >  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> >  			      unsigned long end,
> >  			      struct vmemmap_remap_walk *walk)
> > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> >  	free_vmemmap_page_list(&vmemmap_pages);
> >  }
> >
> > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > +				struct vmemmap_remap_walk *walk)
> > +{
> > +	pgprot_t pgprot = PAGE_KERNEL;
> > +	struct page *page;
> > +	void *to;
> > +
> > +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> > +
> > +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > +	list_del(&page->lru);
> > +	to = page_to_virt(page);
> > +	copy_page(to, (void *)walk->reuse_addr);
> > +
> > +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > +}
> > +
> > +static void alloc_vmemmap_page_list(struct list_head *list,
> > +				    unsigned long start, unsigned long end)
> > +{
> > +	unsigned long addr;
> > +
> > +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> > +		struct page *page;
> > +		int nid = page_to_nid((const void *)addr);
> > +
> > +retry:
> > +		page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > +		if (unlikely(!page)) {
> > +			msleep(100);
> > +			/*
> > +			 * We should retry infinitely, because we cannot
> > +			 * handle allocation failures. Once we allocate
> > +			 * vmemmap pages successfully, then we can free
> > +			 * a HugeTLB page.
> > +			 */
> > +			goto retry;
>
> Ugh, I don't think this will work, there's no guarantee that we'll ever
> succeed and now we can't free a 2MB hugepage because we cannot allocate a
> 4KB page.  We absolutely have to ensure we make forward progress here.

This can trigger an OOM when there is no memory, which would kill some
process to release memory. Right?

> We're going to be freeing the hugetlb page after this succeeds, can we
> not use part of the hugetlb page that we're freeing for this memory
> instead?

That seems like a good idea. We can try to allocate memory first; if the
allocation succeeds, we just use the new pages to remap (which also reduces
memory fragmentation). If it fails, we can fall back to using part of the
HugeTLB page to remap. What's your opinion on this?
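Roughly something like the following (only a sketch to show the idea;
split_huge_vmemmap_page() here is a made-up helper that would hand out 4KB
pieces of the HugeTLB page being freed, not an existing function):

    static void alloc_vmemmap_page_list(struct list_head *list,
					struct page *head,
					unsigned long start, unsigned long end)
    {
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct page *page;
		int nid = page_to_nid((const void *)addr);

		/* Prefer a fresh page to reduce memory fragmentation. */
		page = alloc_pages_node(nid, GFP_KERNEL | __GFP_NOWARN |
					__GFP_THISNODE, 0);
		if (!page)
			/*
			 * Fall back to a 4KB piece of the HugeTLB page we
			 * are about to free, so forward progress is always
			 * guaranteed. (made-up helper, just to show the idea)
			 */
			page = split_huge_vmemmap_page(head, addr);

		list_add_tail(&page->lru, list);
	}
    }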
> > +		}
> > +		list_add_tail(&page->lru, list);
> > +	}
> > +}
> > +
> > +/**
> > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > + *			 to the page which is from the @vmemmap_pages
> > + *			 respectively.
> > + * @start:	start address of the vmemmap virtual address range.
> > + * @end:	end address of the vmemmap virtual address range.
> > + * @reuse:	reuse address.
> > + */
> > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > +			 unsigned long reuse)
> > +{
> > +	LIST_HEAD(vmemmap_pages);
> > +	struct vmemmap_remap_walk walk = {
> > +		.remap_pte	= vmemmap_restore_pte,
> > +		.reuse_addr	= reuse,
> > +		.vmemmap_pages	= &vmemmap_pages,
> > +	};
> > +
> > +	might_sleep();
> > +
> > +	/* See the comment in the vmemmap_remap_free(). */
> > +	BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > +	alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > +	vmemmap_remap_range(reuse, end, &walk);
> > +}
> > +
> >  /*
> >   * Allocate a block of memory to be used to back the virtual memory map
> >   * or to back the page tables that are used to create the mapping.
> > --
> > 2.11.0
> >
> >
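For reference, my understanding of how the HugeTLB side would drive
vmemmap_remap_alloc() when freeing a HugeTLB page (a sketch only;
RESERVE_VMEMMAP_SIZE and free_vmemmap_pages_size_per_hpage() are
placeholder names, not necessarily what the series uses):

    static void alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
    {
	/* The vmemmap virtual range that describes this HugeTLB page. */
	unsigned long vmemmap_addr = (unsigned long)head;
	unsigned long vmemmap_end, vmemmap_reuse;

	vmemmap_addr += RESERVE_VMEMMAP_SIZE;
	vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
	/* The page before @start is reused; satisfies the BUG_ON() above. */
	vmemmap_reuse = vmemmap_addr - PAGE_SIZE;

	/* Re-populate the freed vmemmap range before releasing the page. */
	vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse);
    }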