Message-ID: <4f89d7bf-2fe2-fa53-c7ca-e4f152ca0edf@arm.com>
Date: Mon, 17 Jul 2023 15:47:52 +0100
Subject: Re: [PATCH v3 3/4] mm: FLEXIBLE_THP for improved performance
To: David Hildenbrand, Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", Yin Fengwei,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, "Huang, Ying",
 Zi Yan, Luis Chamberlain, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20230714160407.4142030-1-ryan.roberts@arm.com>
 <20230714161733.4144503-3-ryan.roberts@arm.com>
 <82c934af-a777-3437-8d87-ff453ad94bfd@redhat.com>
 <2c4b2a41-1c98-0782-ac30-80e65bdb2b0c@arm.com>
 <2e7d5692-8ba7-1e56-a03f-449f1671b100@redhat.com>
From: Ryan Roberts
In-Reply-To: <2e7d5692-8ba7-1e56-a03f-449f1671b100@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 17/07/2023 14:56, David Hildenbrand wrote:
> On 17.07.23 15:20, Ryan Roberts wrote:
>> On 17/07/2023 14:06, David Hildenbrand wrote:
>>> On 14.07.23 19:17, Yu Zhao wrote:
>>>> On Fri, Jul 14, 2023 at 10:17 AM Ryan Roberts wrote:
>>>>>
>>>>> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
>>>>> allocated in large folios of a determined order. All pages of the large
>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>> counting, rmap management, lru list management) is also significantly
>>>>> reduced since those ops now become per-folio.
>>>>>
>>>>> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
>>>>> defaults to disabled for now; the long term aim is for this to default to
>>>>> enabled, but there are some risks around internal fragmentation that
>>>>> need to be better understood first.
>>>>>
>>>>> When enabled, the folio order is determined as such: For a vma, process
>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>> fragmentation so we honour that request.
>>>>>
>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE), then
>>>>> arch_wants_pte_order() is limited by the new cmdline parameter,
>>>>> `flexthp_unhinted_max`. This allows for a performance boost without
>>>>> requiring any explicit opt-in from the workload while allowing the
>>>>> sysadmin to tune between performance and internal fragmentation.
>>>>>
>>>>> arch_wants_pte_order() can be overridden by the architecture if desired.
>>>>> Some architectures (e.g. arm64) can coalesce TLB entries if a contiguous
>>>>> set of ptes map physically contiguous, naturally aligned memory, so this
>>>>> mechanism allows the architecture to optimize as required.
>>>>>
>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>> mapped) then we fall back to a suitable lower order; first
>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
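For anyone skimming the thread, the selection rules in the description above can be sketched roughly as below. This is only an illustration of my reading of the text, not the code in the patch: struct fault_ctx and both helper names are hypothetical, arch_order stands in for whatever arch_wants_pte_order() would return, and PAGE_ALLOC_COSTLY_ORDER is defined locally with the kernel's value of 3.

```c
#include <stdbool.h>

/* Same value as the kernel's definition; redefined so this builds standalone. */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical summary of the inputs the description talks about. */
struct fault_ctx {
	bool thp_disabled;       /* vma, process or system opted out of THP */
	bool thp_hinted;         /* vma explicitly opted in, e.g. MADV_HUGEPAGE */
	int arch_order;          /* stand-in for arch_wants_pte_order() */
	int unhinted_max_order;  /* ceiling for vmas without an explicit hint */
};

/* Order to try first for an anonymous fault. */
static int preferred_order(const struct fault_ctx *c)
{
	int order;

	if (c->thp_disabled)
		return 0;

	order = c->arch_order;
	if (!c->thp_hinted && order > c->unhinted_max_order)
		order = c->unhinted_max_order;

	return order;
}

/* Next order to try when 'order' can't be used (vma bounds, ptes present). */
static int fallback_order(int order)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return PAGE_ALLOC_COSTLY_ORDER;
	return 0;
}
```

So an unhinted vma with an arch preference above the ceiling gets the ceiling, and a failed attempt steps down to PAGE_ALLOC_COSTLY_ORDER and then to order-0.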
>>>>>
>>>>> Signed-off-by: Ryan Roberts
>>>>> ---
>>>>>    .../admin-guide/kernel-parameters.txt         |  10 +
>>>>>    mm/Kconfig                                    |  10 +
>>>>>    mm/memory.c                                   | 187 ++++++++++++++++--
>>>>>    3 files changed, 190 insertions(+), 17 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>>>> index a1457995fd41..405d624e2191 100644
>>>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>>>> @@ -1497,6 +1497,16 @@
>>>>>                           See Documentation/admin-guide/sysctl/net.rst for
>>>>>                           fb_tunnels_only_for_init_ns
>>>>>
>>>>> +       flexthp_unhinted_max=
>>>>> +                       [KNL] Requires CONFIG_FLEXIBLE_THP enabled. The maximum
>>>>> +                       folio size that will be allocated for an anonymous vma
>>>>> +                       that has neither explicitly opted in nor out of using
>>>>> +                       transparent hugepages. The size must be a power-of-2 in
>>>>> +                       the range [PAGE_SIZE, PMD_SIZE). A larger size improves
>>>>> +                       performance by reducing page faults, while a smaller
>>>>> +                       size reduces internal fragmentation. Default: max(64K,
>>>>> +                       PAGE_SIZE). Format: size[KMG].
>>>>> +
>>>>
>>>> Let's split this parameter into a separate patch.
>>>>
>>>
>>> Just a general comment after stumbling over patch #2, let's not start splitting
>>> patches into things that don't make any sense on their own; that just makes
>>> review a lot harder.
>>
>> ACK
>>
>>>
>>> For this case here, I'd suggest first adding the general infrastructure and then
>>> adding tunables we want to have on top.
>>
>> OK, so 1 patch for the main infrastructure, then a patch to disable for
>> MADV_NOHUGEPAGE and friends, then a further patch to set flexthp_unhinted_max
>> via a sysctl?
>
> MADV_NOHUGEPAGE handling for me falls under the category "required for
> correctness to not break existing workloads" and has to be there initially.
>
> Anything that is rather a performance tunable (e.g., a sysctl to optimize) can
> be added on top and discussed separately.
>
> At least IMHO :)
>
>>
>>>
>>> I agree that toggling that at runtime (for example via sysfs as raised by me
>>> previously) would be nicer.
>>
>> OK, I clearly misunderstood, I thought you were requesting a boot parameter.
>
> Oh, sorry about that. I wanted to actually express
> "/sys/kernel/mm/transparent_hugepage/" sysctls where we can toggle that later at
> runtime as well.
>
>> What's the ABI compat guarantee for sysctls? I assumed that for a boot
>> parameter it would be easier to remove in future if we wanted, but for sysctl,
>> it's there forever?
>
> sysctl are hard/impossible to remove, yes. So we better make sure what we add
> has clear semantics.
>
> If we ever want some real auto-tunable mode (and can actually implement it
> without harming performance; and I am skeptical), we might want to allow for
> setting such a parameter to "auto", for example.
>
>>
>> Also, how do you feel about the naming and behavior of the parameter?
>
> Very good question. "flexthp_unhinted_max" naming is a bit suboptimal.
>
> For example, I'm not so sure if we should expose the feature to user space as
> "flexthp" at all. I think we should find a clearer feature name to begin with.
>
> ... maybe we can initially get away with dropping that parameter and default to
> something reasonably small (i.e., 64k as you have above)?

That would certainly get my vote. But it was you who was arguing for a tunable
previously ;-).
I propose we use the following as the "unhinted ceiling" for now, then we can add
a tunable if/when we find a use case that doesn't work optimally with this value:

	static int flexthp_unhinted_max_order =
		ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT;

(Using PAGE_SIZE when it's greater than 64K to cover the ppc case that looks
like it can support 256K pages. Open coding the max because max() can't be used
outside a function).

>
> /sys/kernel/mm/transparent_hugepage/enabled=never and simply not get any thp.

Yes, that should work with the patch as it is today.

>
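
Going back to the proposed "unhinted ceiling" expression above, here is a small userspace sketch of what it evaluates to for a few page sizes. It assumes nothing beyond the expression itself: ilog2_ul() is a hypothetical stand-in for the kernel's ilog2(), SZ_64K is defined locally, and the page shift is passed as a parameter (rather than taken from PAGE_SHIFT) purely so that several configurations can be compared.

```c
#define SZ_64K (64UL * 1024)

/* Integer log2 of a power-of-two value; stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Mirrors: ilog2(SZ_64K > PAGE_SIZE ? SZ_64K : PAGE_SIZE) - PAGE_SHIFT */
static unsigned int unhinted_max_order(unsigned int page_shift)
{
	unsigned long page_size = 1UL << page_shift;
	unsigned long ceiling = SZ_64K > page_size ? SZ_64K : page_size;

	return ilog2_ul(ceiling) - page_shift;
}
```

With 4K pages (shift 12) this gives order 4, i.e. 16 pages per folio; with 16K pages it gives order 2; with 64K or 256K pages it gives order 0, so the ceiling never drops below a single page.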