Date: Fri, 5 Feb 2021 09:54:37 +0100
From: Oscar Salvador
To: Muchun Song
Cc: corbet@lwn.net, mike.kravetz@oracle.com, tglx@linutronix.de,
    mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com,
    dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
    viro@zeniv.linux.org.uk, akpm@linux-foundation.org, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
    rdunlap@infradead.org, oneukum@suse.com, anshuman.khandual@arm.com,
    jroedel@suse.de, almasrymina@google.com, rientjes@google.com,
    willy@infradead.org, mhocko@suse.com, song.bao.hua@hisilicon.com,
    david@redhat.com, naoya.horiguchi@nec.com, duanxiongchun@bytedance.com,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v14 3/8] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
Message-ID: <20210205085437.GB13848@linux>
References: <20210204035043.36609-1-songmuchun@bytedance.com>
 <20210204035043.36609-4-songmuchun@bytedance.com>
In-Reply-To: <20210204035043.36609-4-songmuchun@bytedance.com>

On Thu, Feb 04, 2021 at 11:50:38AM +0800, Muchun Song wrote:
> Every HugeTLB has more than one struct page structure. We __know__ that
> we only use the first 4(HUGETLB_CGROUP_MIN_ORDER) struct page structures
> to store metadata associated with each HugeTLB.
>
> There are a lot of struct page structures associated with each HugeTLB
> page. For tail pages, the value of compound_head is the same. So we can
> reuse first page of tail page structures. We map the virtual addresses
> of the remaining pages of tail page structures to the first tail page
> struct, and then free these page frames. Therefore, we need to reserve
> two pages as vmemmap areas.
>
> When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> pages associated with each HugeTLB page. It is more appropriate to do it
> in the prep_new_huge_page().
>
> The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> pages associated with a HugeTLB page can be freed, returns zero for
> now, which means the feature is disabled. We will enable it once all
> the infrastructure is there.
>
> Signed-off-by: Muchun Song
> ---
[...]
> +void free_huge_page_vmemmap(struct hstate *h, struct page *head)
> +{
> +        unsigned long vmemmap_addr = (unsigned long)head;
> +        unsigned long vmemmap_end, vmemmap_reuse;
> +
> +        if (!free_vmemmap_pages_per_hpage(h))
> +                return;
> +
> +        vmemmap_addr += RESERVE_VMEMMAP_SIZE;
> +        vmemmap_end = vmemmap_addr + free_vmemmap_pages_size_per_hpage(h);
> +        vmemmap_reuse = vmemmap_addr - PAGE_SIZE;
> +
> +        /*
> +         * Remap the vmemmap virtual address range [@vmemmap_addr, @vmemmap_end)
> +         * to the page which @vmemmap_reuse is mapped to, then free the vmemmap
> +         * pages which the range are mapped to.

"then free the pages which the range [@vmemmap_addr, @vmemmap_end] is
mapped to."

I am not a native speaker, but that sounds better to me.
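Also, mostly to double-check my own reading of the layout this ends up with
(just a sketch on my side, assuming x86_64, the usual 64 bytes per struct page
and the two reserved vmemmap pages behind RESERVE_VMEMMAP_SIZE; the exact
numbers of course depend on the architecture and config):

        2MB HugeTLB page -> 512 struct pages -> 8 vmemmap pages

        head                          vmemmap page 0 (kept: head + first tails)
        head + PAGE_SIZE              vmemmap page 1 (kept, reused)  == vmemmap_reuse
        head + RESERVE_VMEMMAP_SIZE   vmemmap page 2                 == vmemmap_addr
        ...
        head + 8 * PAGE_SIZE                                         == vmemmap_end

        so vmemmap pages 2..7 get remapped to the page frame backing vmemmap
        page 1, and their 6 page frames go back to the allocator.

If I got that right, feel free to ignore this; I only wanted to make sure I
read the arithmetic correctly.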
> +         */
> +        vmemmap_remap_free(vmemmap_addr, vmemmap_end, vmemmap_reuse);
> +}
> diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
> new file mode 100644
> index 000000000000..6923f03534d5
> --- /dev/null
> +++ b/mm/hugetlb_vmemmap.h
[...]
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 16183d85a7d5..50c1dc00b686 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -27,8 +27,215 @@
> #include
> #include
> #include
> +#include
> +#include
> +
> #include
> #include
> +#include
> +
> +/**
> + * vmemmap_remap_walk - walk vmemmap page table
> + *
> + * @remap_pte:          called for each non-empty PTE (lowest-level) entry.

Well, we BUG_ON on an empty PTE, so I am not sure it is worth pointing
that out here. It sounds as if we simply do nothing when it is empty.
Maybe: "called for each lowest-level entry (PTE)"

> + * @reuse_page:         the page which is reused for the tail vmemmap pages.
> + * @reuse_addr:         the virtual address of the @reuse_page page.
> + * @vmemmap_pages:      the list head of the vmemmap pages that can be freed.
> + */
> +struct vmemmap_remap_walk {
> +        void (*remap_pte)(pte_t *pte, unsigned long addr,
> +                          struct vmemmap_remap_walk *walk);
> +        struct page *reuse_page;
> +        unsigned long reuse_addr;
> +        struct list_head *vmemmap_pages;
> +};
> +
> +static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> +                              unsigned long end,
> +                              struct vmemmap_remap_walk *walk)
> +{
> +        pte_t *pte;
> +
> +        pte = pte_offset_kernel(pmd, addr);
> +
> +        /*
> +         * The reuse_page is found 'first' in table walk before we start
> +         * remapping (which is calling @walk->remap_pte).
> +         */
> +        if (!walk->reuse_page) {
> +                BUG_ON(pte_none(*pte) || walk->reuse_addr != addr);

I would rather have them on separate lines:

        BUG_ON(pte_none(*pte));
        BUG_ON(walk->reuse_addr != addr);

It helps when trying to figure out which of the two conditions made us
explode. One could dig into the registers, but let's make it easier to
find out.

> +
[...]
> +static void vmemmap_remap_range(unsigned long start, unsigned long end,
> +                                struct vmemmap_remap_walk *walk)
> +{
> +        unsigned long addr = start;
> +        unsigned long next;
> +        pgd_t *pgd;
> +
> +        VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE));
> +        VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE));
> +
> +        pgd = pgd_offset_k(addr);
> +        do {
> +                BUG_ON(pgd_none(*pgd));
> +
> +                next = pgd_addr_end(addr, end);
> +                vmemmap_p4d_range(pgd, addr, next, walk);
> +        } while (pgd++, addr = next, addr != end);
> +
> +        /*
> +         * We do not change the mapping of the vmemmap virtual address range
> +         * [@start, @start + PAGE_SIZE) which belongs to the reuse range.
> +         * So we not need to flush the TLB.
> +         */
> +        flush_tlb_kernel_range(start + PAGE_SIZE, end);

I find that comment a bit confusing. I would rather describe what we are
flushing instead of what we are not.
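Something along these lines, maybe (just a sketch of what I mean, the exact
wording is of course up to you):

        /*
         * The first page of the range, [@start, @start + PAGE_SIZE), maps
         * the reused page (@walk->reuse_addr) and is left untouched, so
         * only flush the part of the range that was actually remapped.
         */
        flush_tlb_kernel_range(start + PAGE_SIZE, end);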
> +}
> +
> +/*
> + * Free a vmemmap page. A vmemmap page can be allocated from the memblock
> + * allocator or buddy allocator. If the PG_reserved flag is set, it means
> + * that it allocated from the memblock allocator, just free it via the
> + * free_bootmem_page(). Otherwise, use __free_page().
> + */
> +static inline void free_vmemmap_page(struct page *page)
> +{
> +        if (PageReserved(page))
> +                free_bootmem_page(page);
> +        else
> +                __free_page(page);
> +}
> +
> +/* Free a list of the vmemmap pages */
> +static void free_vmemmap_page_list(struct list_head *list)
> +{
> +        struct page *page, *next;
> +
> +        list_for_each_entry_safe(page, next, list, lru) {
> +                list_del(&page->lru);
> +                free_vmemmap_page(page);
> +        }
> +}
> +
> +static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> +                              struct vmemmap_remap_walk *walk)
> +{
> +        /*
> +         * Remap the tail pages as read-only to catch illegal write operation
> +         * to the tail pages.
> +         */
> +        pgprot_t pgprot = PAGE_KERNEL_RO;
> +        pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +        struct page *page = pte_page(*pte);
> +
> +        list_add(&page->lru, walk->vmemmap_pages);
> +        set_pte_at(&init_mm, addr, pte, entry);
> +}
> +
> +/**
> + * vmemmap_remap_free - remap the vmemmap virtual address range [@start, @end)
> + *                      to the page which @reuse is mapped to, then free vmemmap
> + *                      which the range are mapped to.
> + * @start:      start address of the vmemmap virtual address range that we want
> + *              to remap.
> + * @end:        end address of the vmemmap virtual address range that we want to
> + *              remap.
> + * @reuse:      reuse address.
> + *
> + * Note: This function depends on vmemmap being base page mapped. Please make
> + * sure that the architecture disables PMD mapping of vmemmap pages when calling
> + * this function.

Well, we do not really depend on the architecture not mapping the vmemmap
range with PMDs, right? IIUC, that is driven by your boot parameter
(patch #5), which overrides whatever the architecture can do.

Functional changes look good to me, so with all the above fixes, you can add:

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE L3