From: Muchun Song
Date: Wed, 20 Jan 2021 20:52:50 +0800
Subject: Re: [PATCH v13 00/12] Free some vmemmap pages of HugeTLB page
To: Oscar Salvador, Mike Kravetz
Cc: Xiongchun duan, Jonathan Corbet, Thomas Gleixner, paulmck@kernel.org,
    dave.hansen@linux.intel.com, anshuman.khandual@arm.com, oneukum@suse.com,
    bp@alien8.de, hpa@zytor.com, x86@kernel.org, Randy Dunlap,
    mingo@redhat.com, mchehab+huawei@kernel.org, luto@kernel.org,
    Andrew Morton, viro@zeniv.linux.org.uk, Peter Zijlstra, David Rientjes,
    Michal Hocko, jroedel@suse.de, Mina Almasry,
    pawan.kumar.gupta@linux.intel.com, HORIGUCHI NAOYA(堀口 直也),
    David Hildenbrand, "Song Bao Hua (Barry Song)", linux-doc@vger.kernel.org,
    LKML, Linux Memory Management List, linux-fsdevel, Matthew Wilcox
In-Reply-To: <20210117151053.24600-1-songmuchun@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Jan 17, 2021 at 11:12 PM Muchun Song wrote:
>
> Hi all,
>
> This patch series frees some vmemmap pages (struct page structures)
> associated with each HugeTLB page when it is preallocated, in order
> to save memory.
>
> To reduce the difficulty of the first round of code review, from this
> version on we disable PMD/huge page mapping of the vmemmap when this
> feature is enabled. This actually eliminates a bunch of complex code
> doing page table manipulation. Once this patch series is solid, we
> can add the vmemmap page table manipulation code back in the future.
>
> The struct page structures (page structs) are used to describe a
> physical page frame. By default, there is a one-to-one mapping from a
> page frame to its corresponding page struct.
>
> HugeTLB pages consist of multiple base-page-size pages and are
> supported by many architectures. See hugetlbpage.rst in the
> Documentation directory for more details. On the x86 architecture,
> HugeTLB pages of size 2MB and 1GB are currently supported. Since the
> base page size on x86 is 4KB, a 2MB HugeTLB page consists of 512 base
> pages and a 1GB HugeTLB page consists of 262144 base pages. For each
> base page, there is a corresponding page struct.
>
> Within the HugeTLB subsystem, only the first 4 page structs are used
> to contain unique information about a HugeTLB page.
> HUGETLB_CGROUP_MIN_ORDER provides this upper limit. The only 'useful'
> information in the remaining page structs is the compound_head field,
> and this field is the same for all tail pages.
>
> By removing redundant page structs for HugeTLB pages, memory can be
> returned to the buddy allocator for other uses.
>
> When the system boots up, every 2MB HugeTLB page has 512 struct page
> structs which occupy 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
>
>    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> |           |                     |     0     | -------------> |     0     |
> |           |                     +-----------+                +-----------+
> |           |                     |     1     | -------------> |     1     |
> |           |                     +-----------+                +-----------+
> |           |                     |     2     | -------------> |     2     |
> |           |                     +-----------+                +-----------+
> |           |                     |     3     | -------------> |     3     |
> |           |                     +-----------+                +-----------+
> |           |                     |     4     | -------------> |     4     |
> |    2MB    |                     +-----------+                +-----------+
> |           |                     |     5     | -------------> |     5     |
> |           |                     +-----------+                +-----------+
> |           |                     |     6     | -------------> |     6     |
> |           |                     +-----------+                +-----------+
> |           |                     |     7     | -------------> |     7     |
> |           |                     +-----------+                +-----------+
> |           |
> |           |
> |           |
> +-----------+
>
> The value of page->compound_head is the same for all tail pages. The
> first page of page structs (page 0) associated with the HugeTLB page
> contains the 4 page structs necessary to describe the HugeTLB page.
> The only use of the remaining pages of page structs (page 1 to page 7)
> is to point to page->compound_head. Therefore, we can remap pages 2 to
> 7 to page 1. Only 2 pages of page structs will be used for each
> HugeTLB page. This will allow us to free the remaining 6 pages to the
> buddy allocator.
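>
> For reference, the remapping conceptually boils down to the sketch
> below. This is illustrative only, not the code in this series:
> vmemmap_pte_of_addr() is a made-up placeholder helper, while
> set_pte_at(), mk_pte() and flush_tlb_kernel_range() are the usual
> kernel primitives, and PAGE_KERNEL_RO matches the read-only remapping
> used since v3 of this series.
>
>   /*
>    * Remap the vmemmap range [start, end) so that every PTE refers to
>    * the page frame backing 'reuse' (page 1 in the diagram above). The
>    * page frames that previously backed the range can then be freed.
>    */
>   static void remap_tail_vmemmap(unsigned long start, unsigned long end,
>                                  struct page *reuse)
>   {
>           unsigned long addr;
>
>           for (addr = start; addr < end; addr += PAGE_SIZE) {
>                   pte_t *pte = vmemmap_pte_of_addr(addr);
>
>                   set_pte_at(&init_mm, addr, pte,
>                              mk_pte(reuse, PAGE_KERNEL_RO));
>           }
>           flush_tlb_kernel_range(start, end);
>   }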
>
> Here is how things look after remapping.
>
>    HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> |           |                     |     0     | -------------> |     0     |
> |           |                     +-----------+                +-----------+
> |           |                     |     1     | -------------> |     1     |
> |           |                     +-----------+                +-----------+
> |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> |           |                     +-----------+                   | | | | |
> |           |                     |     3     | ------------------+ | | | |
> |           |                     +-----------+                     | | | |
> |           |                     |     4     | --------------------+ | | |
> |    2MB    |                     +-----------+                       | | |
> |           |                     |     5     | ----------------------+ | |
> |           |                     +-----------+                         | |
> |           |                     |     6     | ------------------------+ |
> |           |                     +-----------+                           |
> |           |                     |     7     | --------------------------+
> |           |                     +-----------+
> |           |
> |           |
> |           |
> +-----------+
>
> When a HugeTLB page is freed to the buddy system, we must allocate 6
> pages for the vmemmap and restore the previous mapping relationship.
>
> Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page.
> It is similar to the 2MB HugeTLB page, and we can use the same
> approach to free its vmemmap pages.
>
> In this case, for a 1GB HugeTLB page, we can save 4094 pages. This is
> a very substantial gain. On our servers, some SPDK/QEMU applications
> use 1024GB of hugetlb pages. With this feature enabled, we can save
> ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of memory.
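>
> To sanity-check those numbers (simple arithmetic, not a new
> measurement): 1024GB of 2MB huge pages is 524288 huge pages, and
> freeing 6 of the 8 vmemmap pages per huge page saves
> 524288 * 6 * 4KB = 12GB. For 1GB huge pages it is 1024 huge pages *
> 4094 pages * 4KB ~= 16GB.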
>
> Because the vmemmap page tables are rebuilt on the freeing/allocating
> path, this adds some overhead. Here is an overhead analysis.
>
> 1) Allocating 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real     0m0.166s
> user     0m0.000s
> sys      0m0.166s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [8K, 16K)           8360 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16K, 32K)          1868 |@@@@@@@@@@@                                         |
> [32K, 64K)            10 |                                                    |
> [64K, 128K)            2 |                                                    |
>
> b) Without this patch series:
> # time echo 10240 > /proc/sys/vm/nr_hugepages
>
> real     0m0.066s
> user     0m0.000s
> sys      0m0.066s
>
> # bpftrace -e 'kprobe:alloc_fresh_huge_page { @start[tid] = nsecs; } kretprobe:alloc_fresh_huge_page /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K)           10176 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)             62 |                                                    |
> [16K, 32K)             2 |                                                    |
>
> Summary: with this feature, allocation is about ~2x slower than
> before.
>
> 2) Freeing 10240 2MB hugetlb pages.
>
> a) With this patch series applied:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real     0m0.004s
> user     0m0.000s
> sys      0m0.002s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [16K, 32K)         10240 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> b) Without this patch series:
> # time echo 0 > /proc/sys/vm/nr_hugepages
>
> real     0m0.077s
> user     0m0.001s
> sys      0m0.075s
>
> # bpftrace -e 'kprobe:__free_hugepage { @start[tid] = nsecs; } kretprobe:__free_hugepage /@start[tid]/ { @latency = hist(nsecs - @start[tid]); delete(@start[tid]); }'
> Attaching 2 probes...
>
> @latency:
> [4K, 8K)            9950 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [8K, 16K)            287 |@                                                   |
> [16K, 32K)             3 |                                                    |
>
> Summary: the overhead of __free_hugepage is about ~2-4x what it was
> before, but judging from the allocation test above, I think the
> effective slowdown here is also about ~2x.
>
> Then why is the 'real' time of the patched kernel smaller than
> before? Because in this patch series the freeing of hugetlb pages is
> asynchronous (done through a kworker).
>
> Although the overhead has increased, it is not significant. As Mike
> said, "However, remember that the majority of use cases create
> hugetlb pages at or shortly after boot time and add them to the pool.
> So, additional overhead is at pool creation time. There is no change
> to 'normal run time' operations of getting a page from or returning a
> page to the pool (think page fault/unmap)".
>
> Todo:
> - Free all of the tail vmemmap pages
>   For a 2MB HugeTLB page we currently free only 6 vmemmap pages,
>   though we could really free 7. In that case, 8 of the 512 struct
>   page structures would have the PG_head flag set, so we would need
>   to adjust compound_head() slightly so that it returns the real head
>   struct page even when its parameter is a tail struct page with the
>   PG_head flag set (a rough sketch of one possible approach follows
>   this list).
>
>   To keep the code evolution route clear, this can be a separate
>   patch series after this one is solid.
>
> - Support for other architectures (e.g. aarch64).
> - Enable PMD/huge page mapping of the vmemmap even when this feature
>   is enabled.
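>
> The sketch below shows one conceivable shape for that compound_head()
> adjustment. It is not part of this series, only an illustration of
> the idea: once all 7 tail vmemmap pages alias the frame holding the
> head page's struct pages, a tail struct page can appear with PG_head
> set (a "fake head"). Peeking at the next struct page's compound_head
> field distinguishes that case, because for both a real head and a
> fake head the following struct page reads as a tail that records the
> real head's address.
>
>   static inline struct page *compound_head_fixed(struct page *page)
>   {
>           unsigned long head = READ_ONCE(page->compound_head);
>
>           /* An ordinary tail page: bit 0 encodes "this is a tail". */
>           if (head & 1)
>                   return (struct page *)(head - 1);
>
>           /*
>            * PG_head may be a stale copy from the aliased head frame.
>            * For a compound page of order >= 1, page[1] is a tail
>            * whose compound_head records the real head, so returning
>            * it is correct for both a real head and a fake head.
>            */
>           if (test_bit(PG_head, &page->flags)) {
>                   unsigned long next = READ_ONCE(page[1].compound_head);
>
>                   if (next & 1)
>                           return (struct page *)(next - 1);
>           }
>           return page;
>   }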
>
> Changelog in v12 -> v13:
> - Remove the VM_WARN_ON_PAGE macro.
> - Add more comments in vmemmap_pte_range() and vmemmap_remap_free().
>
> Thanks to Oscar and Mike for their suggestions and review.

Hi Oscar and Mike,

Any suggestions about this version? Looking forward to your review.
Thanks a lot.

> Changelog in v11 -> v12:
> - Move VM_WARN_ON_PAGE to a separate patch.
> - Call __free_hugepage() with hugetlb_lock held (see patch #5) to
>   serialize with dissolve_free_huge_page(). This prepares for patch #9.
> - Introduce PageHugeInflight. See patch #9.
>
> Changelog in v10 -> v11:
> - Fix a compiler error when !CONFIG_HUGETLB_PAGE_FREE_VMEMMAP.
> - Rework some comments and commit changes.
> - Rework vmemmap_remap_free() to take 3 parameters.
>
> Thanks to Oscar and Mike for their suggestions and review.
>
> Changelog in v9 -> v10:
> - Fix a bug in patch #11. Thanks to Oscar for pointing that out.
> - Rework some commit logs and comments. Thanks Mike and Oscar for the
>   suggestions.
> - Drop VMEMMAP_TAIL_PAGE_REUSE in patch #3.
>
> Thank you very much Mike and Oscar for reviewing the code.
>
> Changelog in v8 -> v9:
> - Rework some code. Many thanks to Oscar.
> - Put all the non-hugetlb vmemmap functions under sparse-vmemmap.c.
>
> Changelog in v7 -> v8:
> - Adjust the order of patches.
>
> Many thanks to David and Oscar. Your suggestions are very valuable.
>
> Changelog in v6 -> v7:
> - Rebase to linux-next 20201130.
> - Do not use basepage mapping for the vmemmap when this feature is
>   disabled.
> - Rework some patches:
>   [PATCH v6 08/16] mm/hugetlb: Free the vmemmap pages associated with each hugetlb page
>   [PATCH v6 10/16] mm/hugetlb: Allocate the vmemmap pages associated with each hugetlb page
>
> Thanks to Oscar and Barry.
>
> Changelog in v5 -> v6:
> - Disable PMD/huge page mapping of the vmemmap if this feature is
>   enabled.
> - Simplify the first-version code.
>
> Changelog in v4 -> v5:
> - Rework some comments and code in [PATCH v4 04/21] and
>   [PATCH v4 05/21].
>
> Thanks to Mike and Oscar for their suggestions.
>
> Changelog in v3 -> v4:
> - Move all the vmemmap functions to hugetlb_vmemmap.c.
> - Make CONFIG_HUGETLB_PAGE_FREE_VMEMMAP default to y; to disable this
>   feature, use a boot/kernel command line parameter.
> - Remove the vmemmap_pgtable_{init, deposit, withdraw}() helper
>   functions.
> - Initialize the page table lock for the vmemmap through the
>   core_initcall mechanism.
>
> Thanks to Mike and Oscar for their suggestions.
>
> Changelog in v2 -> v3:
> - Rename some helper functions. Thanks Mike.
> - Rework some code. Thanks Mike and Oscar.
> - Remap the tail vmemmap pages with PAGE_KERNEL_RO instead of
>   PAGE_KERNEL. Thanks Matthew.
> - Add some overhead analysis to the cover letter.
> - Use the vmemmap PMD table lock instead of a hugetlb-specific global
>   lock.
>
> Changelog in v1 -> v2:
> - Fix: do not call dissolve_compound_page() in
>   alloc_huge_page_vmemmap().
> - Fix some typos and code style problems.
> - Remove the unused handle_vmemmap_fault().
> - Merge some commits into one commit, as suggested by Mike.
>
> Muchun Song (12):
>   mm: memory_hotplug: factor out bootmem core functions to
>     bootmem_info.c
>   mm: hugetlb: introduce a new config HUGETLB_PAGE_FREE_VMEMMAP
>   mm: hugetlb: free the vmemmap pages associated with each HugeTLB page
>   mm: hugetlb: defer freeing of HugeTLB pages
>   mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB
>     page
>   mm: hugetlb: set the PageHWPoison to the raw error page
>   mm: hugetlb: flush work when dissolving a HugeTLB page
>   mm: hugetlb: introduce PageHugeInflight
>   mm: hugetlb: add a kernel parameter hugetlb_free_vmemmap
>   mm: hugetlb: introduce nr_free_vmemmap_pages in the struct hstate
>   mm: hugetlb: gather discrete indexes of tail page
>   mm: hugetlb: optimize the code with the help of the compiler
>
>  Documentation/admin-guide/kernel-parameters.txt |  14 ++
>  Documentation/admin-guide/mm/hugetlbpage.rst    |   3 +
>  arch/x86/mm/init_64.c                           |  13 +-
>  fs/Kconfig                                      |  18 ++
>  include/linux/bootmem_info.h                    |  65 ++++++
>  include/linux/hugetlb.h                         |  37 ++++
>  include/linux/hugetlb_cgroup.h                  |  15 +-
>  include/linux/memory_hotplug.h                  |  27 ---
>  include/linux/mm.h                              |   5 +
>  mm/Makefile                                     |   2 +
>  mm/bootmem_info.c                               | 124 +++++++++++
>  mm/hugetlb.c                                    | 218 +++++++++++++++++--
>  mm/hugetlb_vmemmap.c                            | 278 ++++++++++++++++++++++++
>  mm/hugetlb_vmemmap.h                            |  45 ++++
>  mm/memory_hotplug.c                             | 116 ----------
>  mm/sparse-vmemmap.c                             | 273 +++++++++++++++++++++++
>  mm/sparse.c                                     |   1 +
>  17 files changed, 1082 insertions(+), 172 deletions(-)
>  create mode 100644 include/linux/bootmem_info.h
>  create mode 100644 mm/bootmem_info.c
>  create mode 100644 mm/hugetlb_vmemmap.c
>  create mode 100644 mm/hugetlb_vmemmap.h
>
> --
> 2.11.0