Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
From: Ryan Roberts
To: Yu Zhao
Cc: "Kirill A. Shutemov", Andrew Morton, Matthew Wilcox, Yin Fengwei,
 David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
 Yang Shi, "Huang, Ying", Zi Yan, Luis Chamberlain, Itaru Kitayama,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
Date: Mon, 7 Aug 2023 20:00:25 +0100
Message-ID: <9cd7681c-a54e-c0f4-f97c-eba1ecb661d6@arm.com>
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>
 <20230803142154.nvgkavg33uyn6f72@box.shutemov.name>

On 04/08/2023 19:53, Yu Zhao wrote:
> On Fri, Aug 4, 2023 at 3:06 AM Ryan Roberts wrote:
>>
>> On 04/08/2023 01:19, Yu Zhao wrote:
>>> On Thu, Aug 3, 2023 at 8:27 AM Kirill A. Shutemov wrote:
>>>>
>>>> On Thu, Aug 03, 2023 at 01:43:31PM +0100, Ryan Roberts wrote:
>>>>> + Kirill
>>>>>
>>>>> On 26/07/2023 10:51, Ryan Roberts wrote:
>>>>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>>>>> allocated in large folios of a determined order. All pages of the large
>>>>>> folio are pte-mapped during the same page fault, significantly reducing
>>>>>> the number of page faults. The number of per-page operations (e.g. ref
>>>>>> counting, rmap management, lru list management) is also significantly
>>>>>> reduced since those ops now become per-folio.
>>>>>>
>>>>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>>>>> which defaults to disabled for now; the long-term aim is for this to
>>>>>> default to enabled, but there are some risks around internal
>>>>>> fragmentation that need to be better understood first.
>>>>>>
>>>>>> When enabled, the folio order is determined as follows: for a vma, process
>>>>>> or system that has explicitly disabled THP, we continue to allocate
>>>>>> order-0. THP is most likely disabled to avoid any possible internal
>>>>>> fragmentation, so we honour that request.
>>>>>>
>>>>>> Otherwise, the return value of arch_wants_pte_order() is used. For vmas
>>>>>> that have not explicitly opted-in to use transparent hugepages (e.g.
>>>>>> where thp=madvise and the vma does not have MADV_HUGEPAGE),
>>>>>> arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
>>>>>> bigger). This allows for a performance boost without requiring any
>>>>>> explicit opt-in from the workload while limiting internal
>>>>>> fragmentation.
>>>>>>
>>>>>> If the preferred order can't be used (e.g. because the folio would
>>>>>> breach the bounds of the vma, or because ptes in the region are already
>>>>>> mapped) then we fall back to a suitable lower order; first
>>>>>> PAGE_ALLOC_COSTLY_ORDER, then order-0.
>>>>>>
>>>>>
>>>>> ...
>>>>>
>>>>>> +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
>>>>>> +        (ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
>>>>>> +
>>>>>> +static int anon_folio_order(struct vm_area_struct *vma)
>>>>>> +{
>>>>>> +    int order;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * If THP is explicitly disabled for either the vma, the process or the
>>>>>> +     * system, then this is very likely intended to limit internal
>>>>>> +     * fragmentation; in this case, don't attempt to allocate a large
>>>>>> +     * anonymous folio.
>>>>>> +     *
>>>>>> +     * Else, if the vma is eligible for thp, allocate a large folio of the
>>>>>> +     * size preferred by the arch. Or if the arch requested a very small
>>>>>> +     * size or didn't request a size, then use PAGE_ALLOC_COSTLY_ORDER,
>>>>>> +     * which still meets the arch's requirements but means we still take
>>>>>> +     * advantage of SW optimizations (e.g. fewer page faults).
>>>>>> +     *
>>>>>> +     * Finally if thp is enabled but the vma isn't eligible, take the
>>>>>> +     * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>>>> +     * This ensures workloads that have not explicitly opted-in take benefit
>>>>>> +     * while capping the potential for internal fragmentation.
>>>>>> +     */
>>>>>> +
>>>>>> +    if ((vma->vm_flags & VM_NOHUGEPAGE) ||
>>>>>> +        test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
>>>>>> +        !hugepage_flags_enabled())
>>>>>> +        order = 0;
>>>>>> +    else {
>>>>>> +        order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>>>>>> +
>>>>>> +        if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>>>>>> +            order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>>>>> +    }
>>>>>> +
>>>>>> +    return order;
>>>>>> +}
>>>>>
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm writing up the conclusions that we arrived at during discussion in the THP
>>>>> meeting yesterday, regarding linkage with existing THP ABIs. It would be great
>>>>> if I could get an explicit "agree" or "disagree + rationale" from at least
>>>>> David, Yu and Kirill.
>>>>>
>>>>> In summary: I think we are converging on the approach that is already coded,
>>>>> but I'd like confirmation.
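
To make the ANON_FOLIO_MAX_ORDER_UNHINTED arithmetic above concrete, here is a
minimal, standalone userspace sketch (not the kernel code: ilog2() is re-implemented
locally and PAGE_SHIFT is passed in as a plain parameter) that prints the order the
macro evaluates to for 4K, 16K and 64K base pages:

/*
 * Illustrative userspace sketch of the ANON_FOLIO_MAX_ORDER_UNHINTED
 * arithmetic from the patch quoted above; not the kernel implementation.
 */
#include <stdio.h>

#define SZ_64K (64UL * 1024)

/* stand-in for the kernel's ilog2() on a power-of-two value */
static int ilog2_ul(unsigned long x)
{
        int n = -1;

        while (x) {
                x >>= 1;
                n++;
        }
        return n;
}

int main(void)
{
        int page_shifts[] = { 12, 14, 16 };     /* 4K, 16K, 64K base pages */

        for (int i = 0; i < 3; i++) {
                int shift = page_shifts[i];
                unsigned long page_size = 1UL << shift;
                unsigned long cap = SZ_64K > page_size ? SZ_64K : page_size;
                /* ilog2(max(SZ_64K, PAGE_SIZE)) - PAGE_SHIFT */
                int order = ilog2_ul(cap) - shift;

                printf("PAGE_SIZE %3luK -> ANON_FOLIO_MAX_ORDER_UNHINTED = %d (%luK folio)\n",
                       page_size / 1024, order, (page_size << order) / 1024);
        }
        return 0;
}

In all three cases the unhinted cap works out to a 64K folio: order 4 for 4K pages,
order 2 for 16K pages and order 0 for 64K pages.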
>>>>>
>>>>>
>>>>> The THP situation today
>>>>> -----------------------
>>>>>
>>>>> - At system level: THP can be set to "never", "madvise" or "always"
>>>>> - At process level: THP can be "never" or "defer to system setting"
>>>>> - At VMA level: no-hint, MADV_HUGEPAGE, MADV_NOHUGEPAGE
>>>>>
>>>>> That gives us this table to describe how a page fault is handled, according to
>>>>> process state (columns) and vma flags (rows):
>>>>>
>>>>>                 | never     | madvise   | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint         | S         | S         | THP>S
>>>>> MADV_HUGEPAGE   | S         | THP>S     | THP>S
>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>
>>>>> Legend:
>>>>> S    allocate single page (PTE-mapped)
>>>>> LAF  allocate large anon folio (PTE-mapped)
>>>>> THP  allocate THP-sized folio (PMD-mapped)
>>>>> >    fallback (usually because vma size/alignment insufficient for folio)
>>>>>
>>>>>
>>>>>
>>>>> Principles for Large Anon Folios (LAF)
>>>>> --------------------------------------
>>>>>
>>>>> David tells us there are use cases today (e.g. qemu live migration) which use
>>>>> MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not explicitly faulted",
>>>>> and these use cases will break (i.e. be functionally incorrect) if this request
>>>>> is not honoured.
>>>>>
>>>>> So LAF must at least honour MADV_NOHUGEPAGE to prevent breaking existing use
>>>>> cases. And once we do this, then I think the least confusing thing is for it to
>>>>> also honour the "never" system/process state; so if either the system, process
>>>>> or vma has explicitly opted out of THP, then LAF should also be bypassed.
>>>>>
>>>>> Similarly, any case that would previously cause the allocation of PMD-sized THP
>>>>> must continue to be honoured, else we risk performance regression.
>>>>>
>>>>> That leaves the "madvise/no-hint" case, and all THP fallback paths due to the
>>>>> VMA not being correctly aligned or sized to hold a PMD-sized mapping. In these
>>>>> cases, we will attempt to use LAF first, and fall back to a single page if the
>>>>> vma size/alignment doesn't permit it.
>>>>>
>>>>>                 | never     | madvise   | always
>>>>> ----------------|-----------|-----------|-----------
>>>>> no hint         | S         | LAF>S     | THP>LAF>S
>>>>> MADV_HUGEPAGE   | S         | THP>LAF>S | THP>LAF>S
>>>>> MADV_NOHUGEPAGE | S         | S         | S
>>>>>
>>>>> I think this (perhaps conservative) approach will be the least surprising to
>>>>> users, and it is the policy that is already implemented in this patch.
>>>>
>>>> This looks very reasonable.
>>>>
>>>> The only questionable field is no-hint/madvise. I can argue for both LAF>S
>>>> and S here. I think LAF>S is fine as long as we are not too aggressive
>>>> with allocation order.
>>>>
>>>> I think we need to work on eliminating reasons for users to set 'never'.
>>>> If something behaves better with 'never', the kernel has failed the user.
>>>>
>>>>> Downsides of this policy
>>>>> ------------------------
>>>>>
>>>>> As Yu and Yin have pointed out, there are some workloads which do not perform
>>>>> well with THP, due to large fault latency or memory wastage, etc., but which
>>>>> _may_ still benefit from LAF. By taking the conservative approach, we exclude
>>>>> these workloads from benefiting automatically.
>>>>
>>>> Hm. I don't buy it. Why is THP with order-9 too much, but order-8 LAF
>>>> fine?
>>>
>>> No, it's not. And no one said order-8 LAF is fine :) The starting
>>> order for LAF that we have been discussing is at most 64KB (vs 2MB
>>> THP). For my taste, it's still too large. I'd go with 32KB/16KB.
>>
>> It's currently influenced by the arch.
>> If the arch doesn't have an opinion, then it's currently 32K in the code. The 64K
>> size is my aspiration for arm64 if/when I land the contpte mapping work.
>
> Just to double check: this discussion covers the long-term/permanent
> solution/roadmap, correct? That's what Kirill and I were arguing
> about. Otherwise, the order-8/9 concern above is totally irrelevant,
> since we don't have them in this series.
>
> For the short term (this series), what you described above looks good
> to me: we may regress but will not break any existing use cases, and
> we are behind a Kconfig option.

OK, that's good to hear.

>
>>> However, the same argument can be used to argue against the policy
>>> Ryan listed above: why is order-10 LAF ok for madvise but not order-11
>>> (which becomes "always")?
>>
>> Sorry, I don't understand what you are saying here. Where has order-10 LAF
>> come from?
>
> I pushed that rhetoric a bit further: order-11 is the THP size (32MB)
> with 16KB base page size on ARM. Confusing, isn't it? And there is
> another complaint from Fengwei here [1].
>
> [1] https://lore.kernel.org/linux-mm/CAOUHufasZ6w32sHO+Lq33+tGy3+GiO0_dd6mNYwfS_5gqhzYbw@mail.gmail.com/
>
>>> I'm strongly against this policy
>
> Again, just to be clear: I'm strongly against this policy being
> exposed to userspace in any way and becoming a long-term/permanent thing
> we have to maintain/change in the future, since I'm assuming that's
> the context.

I'm still confused. The policy I described (and which I thought we were discussing)
does not expose any new tunables to user space. And you said above that what I
described "looks good to me". So is "this policy" which you are strongly against
referring to the policy I wrote down, or something else?

>
>> Ugh, I thought we came to an agreement (or at least "disagree and commit") on
>> the THP call. Obviously I was wrong.
>
> My impression is we only agreed on one thing: at the current stage, we
> should respect things we absolutely have to. We didn't agree on what
> "never" means ("never 2MB" or "never >4KB"), and we didn't touch on
> how "always" should behave at all.

I _think_ we have now agreed some of this in other threads, but please re-raise it
in the context of the other email thread I just sent out - it's probably cleaner to
continue the discussion there.

>
>> David is telling us that we will break user space if we don't consider
>> MADV_NOHUGEPAGE to mean "never allocate memory to unfaulted addresses". So at
>> least this much must be cast in stone, no? Could you lay out any policy
>> proposal you have as an alternative that still follows this requirement?
>
> If MADV_NOHUGEPAGE falls into the category of things we have to
> absolutely respect, then we will. But I don't think it does, because
> the UFFD check we have in this series already guarantees the KVM use
> case. I can explain how it works in detail if it's still not clear to
> you: long story short, the UFFD check precedes the MADV_NOHUGEPAGE
> check in alloc_anon_folio().

I think we have now concluded that it's not this simple; MADV_NOHUGEPAGE is applied
to the region and pages are faulted in before UFFD is registered, so checking for
UFFD is not sufficient.

>
> Here is what I recommend for the medium and long terms:
> https://lore.kernel.org/linux-mm/CAOUHufYm6Lkm4tLRbyKOc3-NYU-8d6ZDMNDWHo=e=E16oasN8A@mail.gmail.com/
>
> For the short term, hard-coding two orders (hw/sw preferred), putting
> them behind a Kconfig and not exposing this info to userspace is
> good enough for me.

I think that's pretty much what I have now, so perhaps I'm labouring the point a bit
too much here. Let's just get the prerequisites ticked off, then get this patch set
merged?
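
As a footnote, here is a purely illustrative, standalone sketch of the fallback
table from the write-up quoted above (system THP setting vs per-VMA hint). It is not
the kernel implementation - in the patch the policy falls out of the existing THP
path plus the anon_folio_order() helper quoted earlier - but it may help to
sanity-check the policy at a glance:

/* Illustrative-only encoding of the proposed fallback policy table. */
#include <stdio.h>

enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

static const char *fault_policy(enum thp_mode mode, enum vma_hint hint)
{
        /* MADV_NOHUGEPAGE, or "never" at system/process level, means single pages */
        if (hint == HINT_NOHUGEPAGE || mode == THP_NEVER)
                return "S";

        /* "always", or an explicit MADV_HUGEPAGE, tries PMD-sized THP first */
        if (mode == THP_ALWAYS || hint == HINT_HUGEPAGE)
                return "THP>LAF>S";

        /* madvise mode with no hint: large anon folio with order-0 fallback */
        return "LAF>S";
}

int main(void)
{
        static const char *modes[] = { "never", "madvise", "always" };
        static const char *hints[] = { "no hint", "MADV_HUGEPAGE", "MADV_NOHUGEPAGE" };

        for (int h = 0; h < 3; h++)
                for (int m = 0; m < 3; m++)
                        printf("%-16s / %-8s -> %s\n", hints[h], modes[m],
                               fault_policy((enum thp_mode)m, (enum vma_hint)h));
        return 0;
}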