Date: Fri, 4 Aug 2023 10:06:24 +0100
Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
From: Ryan Roberts
To: Yu Zhao, "Kirill A. Shutemov"
Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
 "Huang, Ying", Zi Yan, Luis Chamberlain, Itaru Kitayama,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>
 <20230803142154.nvgkavg33uyn6f72@box.shutemov.name>

On 04/08/2023 01:19, Yu Zhao wrote:
> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov wrote:
>>
>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>> + Kirill
>>>
>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>> Introduce the LARGE_ANON_FOLIO feature, which allows anonymous memory
>>>> to be allocated in large folios of a determined order. All pages of
>>>> the large folio are pte-mapped during the same page fault,
>>>> significantly reducing the number of page faults. The number of
>>>> per-page operations (e.g. ref counting, rmap management, lru list
>>>> management) is also significantly reduced, since those ops now become
>>>> per-folio.
>>>>
>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>> which defaults to disabled for now; the long-term aim is for this to
>>>> default to enabled, but there are some risks around internal
>>>> fragmentation that need to be better understood first.
>>>>
>>>> When enabled, the folio order is determined as follows: for a vma,
>>>> process or system that has explicitly disabled THP, we continue to
>>>> allocate order-0.
>>>> THP is most likely disabled to avoid any possible internal
>>>> fragmentation, so we honour that request.
>>>>
>>>> Otherwise, the return value of arch_wants_pte_order() is used. For
>>>> vmas that have not explicitly opted in to transparent hugepages (e.g.
>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE),
>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>> bigger). This allows for a performance boost without requiring any
>>>> explicit opt-in from the workload, while limiting internal
>>>> fragmentation.
>>>>
>>>> If the preferred order can't be used (e.g. because the folio would
>>>> breach the bounds of the vma, or because ptes in the region are
>>>> already mapped) then we fall back to a suitable lower order; first
>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>
>>>
>>> ...
>>>
>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>> +	(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>> +
>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>> +{
>>>> +	int order;
>>>> +
>>>> +	/*
>>>> +	 * If THP is explicitly disabled for either the vma, the process or the
>>>> +	 * system, then this is very likely intended to limit internal
>>>> +	 * fragmentation; in this case, don't attempt to allocate a large
>>>> +	 * anonymous folio.
>>>> +	 *
>>>> +	 * Else, if the vma is eligible for thp, allocate a large folio of the
>>>> +	 * size preferred by the arch. Or if the arch requested a very small
>>>> +	 * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>> +	 * which still meets the arch's requirements but means we still take
>>>> +	 * advantage of SW optimizations (e.g. fewer page faults).
>>>> +	 *
>>>> +	 * Finally if thp is enabled but the vma isn't eligible, take the
>>>> +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>> +	 * This ensures workloads that have not explicitly opted-in take benefit
>>>> +	 * while capping the potential for internal fragmentation.
>>>> +	 */
>>>> +
>>>> +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>> +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>> +	    !hugepage_flags_enabled())
>>>> +		order = 0;
>>>> +	else {
>>>> +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>> +
>>>> +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>> +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>> +	}
>>>> +
>>>> +	return order;
>>>> +}
>>>
>>>
>>> Hi All,
>>>
>>> I'm writing up the conclusions that we arrived at during discussion in
>>> the THP meeting yesterday, regarding linkage with existing THP ABIs.
>>> It would be great if I can get an explicit "agree" or "disagree +
>>> rationale" from at least David, Yu and Kirill.
>>>
>>> In summary, I think we are converging on the approach that is already
>>> coded, but I'd like confirmation.
>>>
>>>
>>> The THP situation today
>>> -----------------------
>>>
>>> - At system level: THP can be set to "never", "madvise" or "always"
>>> - At process level: THP can be "never" or "defer to system setting"
>>> - At VMA level: no hint, MADV_HUGEPAGE or MADV_NOHUGEPAGE
>>>
>>> That gives us this table to describe how a page fault is handled,
>>> according to process state (columns) and vma flags (rows):
>>>
>>>                 | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------
>>> no hint         | S         | S         | THP>S
>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>> MADV_NOHUGEPAGE | S         | S         | S
>>>
>>> Legend:
>>> S	allocate single page (PTE-mapped)
>>> LAF	allocate large anon folio (PTE-mapped)
>>> THP	allocate THP-sized folio (PMD-mapped)
>>> >	fallback (usually because vma size/alignment insufficient for folio)
>>>
>>>
>>> Principles for Large Anon Folios (LAF)
>>> --------------------------------------
>>>
>>> David tells us there are use cases today (e.g. qemu live migration)
>>> which use MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not
>>> explicitly faulted", and these use cases will break (i.e.
>>> functionally incorrect) if this request is not honoured.
>>>
>>> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking
>>> existing use cases. And once we do this, I think the least confusing
>>> thing is for it to also honour the "never" system/process state; so if
>>> either the system, the process or the vma has explicitly opted out of
>>> THP, then LAF should also be bypassed.
>>>
>>> Similarly, any case that would previously cause the allocation of a
>>> PMD-sized THP must continue to be honoured, else we risk performance
>>> regressions.
>>>
>>> That leaves the "madvise/no-hint" case, and all THP fallback paths due
>>> to the VMA not being correctly aligned or sized to hold a PMD-sized
>>> mapping. In these cases, we will attempt to use LAF first, and fall
>>> back to a single page if the vma size/alignment doesn't permit it.
>>>
>>>                 | never     | madvise   | always
>>> ----------------|-----------|-----------|-----------
>>> no hint         | S         | LAF>S     | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S         | S
>>>
>>> I think this (perhaps conservative) approach will be the least
>>> surprising to users. And it is the policy that is already implemented
>>> in this patch.
>>
>> This looks very reasonable.
>>
>> The only questionable field is no-hint/madvise. I can argue for both
>> LAF>S and S here. I think LAF>S is fine as long as we are not too
>> aggressive with the allocation order.
>>
>> I think we need to work on eliminating reasons for users to set
>> 'never'. If something behaves better with 'never', the kernel has
>> failed the user.
>>
>>> Downsides of this policy
>>> ------------------------
>>>
>>> As Yu and Yin have pointed out, there are some workloads which do not
>>> perform well with THP, due to large fault latency or memory wastage,
>>> etc., but which _may_ still benefit from LAF. By taking the
>>> conservative approach, we exclude these workloads from benefiting
>>> automatically.
>>
>> Hm. I don't buy it.
>> Why is THP with order-9 too much, but order-8 LAF fine?
> 
> No, it's not. And no one said order-8 LAF is fine :) The starting
> order for LAF that we have been discussing is at most 64KB (vs 2MB
> THP). For my taste, it's still too large. I'd go with 32KB/16KB.

It's currently influenced by the arch. If the arch doesn't have an
opinion then it's currently 32K in the code. The 64K size is my
aspiration for arm64 if/when I land the contpte mapping work.

> However, the same argument can be used to argue against the policy
> Ryan listed above: why is order-10 LAF ok for madvise but not order-11
> (which becomes "always")?

Sorry, I don't understand what you are saying here. Where has order-10
LAF come from?

> I'm strongly against this policy

Ugh, I thought we came to an agreement (or at least "disagree and
commit") on the THP call. Obviously I was wrong.

David is telling us that we will break user space if we don't take
MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses".
So at least this much must be cast in stone, no? Could you lay out any
alternative policy proposal you have that still follows this
requirement?

> for two practical reasons I learned
> from tuning THPs in our data centers:
> 
> 1. By doing the above, we are blurring the lines between those values
> and making real-world performance tuning extremely hard, if not
> impractical.
> 
> 2. As I previously pointed out: if we mix LAFs with THPs, we actually
> risk causing performance regressions, because giving smaller VMAs LAFs
> can deprive large VMAs of THPs.