From: Muchun Song
Date: Thu, 4 Mar 2021 11:36:44 +0800
Subject: Re: [External] Re: [PATCH v17 0/9] Free some vmemmap pages of HugeTLB page
To: "Singh, Balbir"
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, Ingo Molnar, bp@alien8.de,
    x86@kernel.org, hpa@zytor.com, dave.hansen@linux.intel.com, luto@kernel.org,
    Peter Zijlstra, Alexander Viro, Andrew Morton, paulmck@kernel.org,
    mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com, Randy Dunlap,
    oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de, Mina Almasry,
    David Rientjes, Matthew Wilcox, Oscar Salvador, Michal Hocko,
    "Song Bao Hua (Barry Song)", David Hildenbrand, HORIGUCHI NAOYA (堀口 直也),
    Joao Martins, Xiongchun duan, linux-doc@vger.kernel.org, LKML,
    Linux Memory Management List, linux-fsdevel
References: <20210225132130.26451-1-songmuchun@bytedance.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Mar 4, 2021 at 11:14 AM Singh, Balbir wrote:
>
> On 26/2/21 12:21 am, Muchun Song wrote:
> > Hi all,
> >
> > This patch series will free some vmemmap pages (struct page structures)
> > associated with each
> > HugeTLB page when it is preallocated, to save memory.
> >
> > In order to reduce the difficulty of the first round of code review,
> > from this version on we disable the PMD/huge page mapping of vmemmap when
> > this feature is enabled. This actually eliminates a bunch of complex code
> > doing page table manipulation. Once this patch series is solid, we can add
> > the vmemmap page table manipulation code in the future.
> >
> > The struct page structures (page structs) are used to describe a physical
> > page frame. By default, there is a one-to-one mapping from a page frame to
> > its corresponding page struct.
> >
> > HugeTLB pages consist of multiple base page size pages and are supported
> > by many architectures. See hugetlbpage.rst in the Documentation directory
> > for more details. On the x86 architecture, HugeTLB pages of size 2MB and
> > 1GB are currently supported. Since the base page size on x86 is 4KB, a 2MB
> > HugeTLB page consists of 512 base pages and a 1GB HugeTLB page consists of
> > 262144 base pages. For each base page, there is a corresponding page
> > struct.
> >
> > Within the HugeTLB subsystem, only the first 4 page structs are used to
> > contain unique information about a HugeTLB page. HUGETLB_CGROUP_MIN_ORDER
> > provides this upper limit. The only 'useful' information in the remaining
> > page structs is the compound_head field, and this field is the same for
> > all tail pages.
>
> HUGETLB_CGROUP_MIN_ORDER only applies when CGROUP_HUGETLB is enabled, but I
> guess that does not matter

Agree.

> >
> > By removing redundant page structs for HugeTLB pages, memory can be
> > returned to the buddy allocator for other uses.
> >
> > When the system boots up, every 2MB HugeTLB page has 512 struct page
> > structs, which amount to 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
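As a quick sanity check of the arithmetic above, here is a short sketch; the 64-byte sizeof(struct page) is an assumed (but common) x86-64 value, not something taken from the patches themselves:

```python
# Sketch: vmemmap (struct page) footprint of one 2MB HugeTLB page on x86-64.
# sizeof(struct page) == 64 is assumed; PAGE_SIZE is the 4KB x86 base page.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64

base_pages_per_2mb_hugepage = (2 * 1024 * 1024) // PAGE_SIZE    # 512 base pages
vmemmap_bytes = base_pages_per_2mb_hugepage * STRUCT_PAGE_SIZE  # bytes of struct pages
vmemmap_pages = vmemmap_bytes // PAGE_SIZE                      # pages of struct pages

print(base_pages_per_2mb_hugepage, vmemmap_pages)  # 512 8
```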
> >
> >  HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> > +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > |           |                     |     0     | -------------> |     0     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     1     | -------------> |     1     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     2     | -------------> |     2     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     3     | -------------> |     3     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     4     | -------------> |     4     |
> > |    2MB    |                     +-----------+                +-----------+
> > |           |                     |     5     | -------------> |     5     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     6     | -------------> |     6     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     7     | -------------> |     7     |
> > |           |                     +-----------+                +-----------+
> > |           |
> > |           |
> > |           |
> > +-----------+
> >
> > The value of page->compound_head is the same for all tail pages. The first
> > page of page structs (page 0) associated with the HugeTLB page contains
> > the 4 page structs necessary to describe the HugeTLB page. The only use of
> > the remaining pages of page structs (page 1 to page 7) is to point to
> > page->compound_head. Therefore, we can remap pages 2 to 7 to page 1. Only
> > 2 pages of page structs will be used for each HugeTLB page. This will
> > allow us to free the remaining 6 pages to the buddy allocator.
>
> What is page 1 used for? page 0 carries the 4 struct pages needed, does
> compound_head need a full page? IOW, why do we need two full pages -- maybe
> the patches have the answer to something I am missing?

Yeah, we really can free 7 pages, but that needs some extra work to support.
Why? Right now we only free 6 vmemmap pages for each 2MB HugeTLB page. If we
freed 7 by remapping all of the tail vmemmap pages to the first one, then 8 of
the 512 struct page structures would appear to have the PG_head flag set. To
handle that, compound_head() would need a slight adjustment so that it returns
the real head struct page even when its parameter is a tail struct page with
the PG_head flag set. In order to make the code evolution route clearer,
this feature can be a separate patch (sent out on its own) after this patchset
is solid and applied.

> >
> > Here is how things look after remapping.
> >
> >  HugeTLB                  struct pages(8 pages)         page frame(8 pages)
> > +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > |           |                     |     0     | -------------> |     0     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     1     | -------------> |     1     |
> > |           |                     +-----------+                +-----------+
> > |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> > |           |                     +-----------+                   | | | | |
> > |           |                     |     3     | ------------------+ | | | |
> > |           |                     +-----------+                     | | | |
> > |           |                     |     4     | --------------------+ | | |
> > |    2MB    |                     +-----------+                       | | |
> > |           |                     |     5     | ----------------------+ | |
> > |           |                     +-----------+                         | |
> > |           |                     |     6     | ------------------------+ |
> > |           |                     +-----------+                           |
> > |           |                     |     7     | --------------------------+
> > |           |                     +-----------+
> > |           |
> > |           |
> > |           |
> > +-----------+
> >
> > When a HugeTLB page is freed to the buddy system, we should allocate 6
> > pages for vmemmap pages and restore the previous mapping relationship.
>
> Can these 6 pages come from the HugeTLB page itself? When you say 6 pages,
> I presume you mean 6 pages of PAGE_SIZE.

There was a decent discussion about this in a previous version of the series,
starting here:

https://lore.kernel.org/linux-mm/20210126092942.GA10602@linux/

In that thread various other options were suggested and discussed.

Thanks.

> > Apart from the 2MB HugeTLB page, we also have the 1GB HugeTLB page. It is
> > similar to the 2MB HugeTLB page, and we can use the same approach to free
> > its vmemmap pages.
> >
> > In this case, for the 1GB HugeTLB page, we can save 4094 pages. This is a
> > very substantial gain. On our server, we run some SPDK/QEMU applications
> > which use 1024GB of HugeTLB pages. With this feature enabled, we can save
> > ~16GB (1GB hugepages) / ~12GB (2MB hugepages) of memory.
>
> Thanks,
> Balbir Singh
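For reference, the savings figures quoted above can be reproduced with a short sketch. The 64-byte sizeof(struct page) is an assumed x86-64 value; the "2 vmemmap pages retained per HugeTLB page" comes from the cover letter's remapping scheme:

```python
# Sketch: memory saved by HugeTLB vmemmap freeing, using the cover letter's
# numbers. Assumes 4KB base pages and a 64-byte struct page (typical x86-64).
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64
MB, GB = 1024 * 1024, 1024 * 1024 * 1024

def vmemmap_pages(hugepage_size):
    # Pages of struct page needed to describe one HugeTLB page.
    return hugepage_size // PAGE_SIZE * STRUCT_PAGE_SIZE // PAGE_SIZE

def pages_freed(hugepage_size, kept=2):
    # The scheme keeps 2 vmemmap pages per HugeTLB page and frees the rest.
    return vmemmap_pages(hugepage_size) - kept

# 2MB hugepage: 8 vmemmap pages, 6 freed; 1GB: 4096 vmemmap pages, 4094 freed.
assert pages_freed(2 * MB) == 6
assert pages_freed(1 * GB) == 4094

# 1024GB of HugeTLB pages, as in the SPDK/QEMU example from the cover letter:
saved_1g = (1024 * GB // GB) * pages_freed(1 * GB) * PAGE_SIZE
saved_2m = (1024 * GB // (2 * MB)) * pages_freed(2 * MB) * PAGE_SIZE
print(round(saved_1g / GB), saved_2m // GB)  # 16 12  (~16GB and 12GB saved)
```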