Date: Wed, 5 Jul 2023 19:01:09 +0100
Subject: Re: [PATCH v2 3/5] mm: Default implementation of arch_wants_pte_order()
From: Ryan Roberts
To: Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", Yin Fengwei,
 David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual,
 Yang Shi, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <20230703135330.1865927-4-ryan.roberts@arm.com>
 <9c5f3515-ad39-e416-902e-96e9387a3b60@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/07/2023 18:24, Yu Zhao wrote:
> On Wed, Jul 5, 2023 at 3:11 AM Ryan Roberts wrote:
>>
>> On 05/07/2023 03:07, Yu Zhao wrote:
>>> On Tue, Jul 4, 2023 at 7:20 AM Ryan Roberts wrote:
>>>>
>>>> On 03/07/2023 20:50, Yu Zhao wrote:
>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts wrote:
>>>>>>
>>>>>> arch_wants_pte_order() can be overridden by the arch to return the
>>>>>> preferred folio order for pte-mapped memory. This is useful as some
>>>>>> architectures (e.g. arm64) can coalesce TLB entries when the physical
>>>>>> memory is suitably contiguous.
>>>>>>
>>>>>> The first user for this hint will be FLEXIBLE_THP, which aims to
>>>>>> allocate large folios for anonymous memory to reduce page faults and
>>>>>> other per-page operation costs.
>>>>>>
>>>>>> Here we add the default implementation of the function, used when the
>>>>>> architecture does not define it, which returns the order corresponding
>>>>>> to 64K.
>>>>>
>>>>> I don't really mind a non-zero default value. But people would ask why
>>>>> non-zero and why 64KB. Probably you could argue this is the largest size
>>>>> all known archs support if they have TLB coalescing. For x86, AMD CPUs
>>>>> would want to override this.
>>>>> I'll leave it to Fengwei to decide
>>>>> whether Intel wants a different default value.
>>>>>
>>>>> Also I don't like the vma parameter because it makes
>>>>> arch_wants_pte_order() a mix of hw preference and vma policy. From my
>>>>> POV, the function should be only about the former; the latter should
>>>>> be decided by arch-independent MM code. However, I can live with it if
>>>>> ARM MM people think this is really what you want. ATM, I'm skeptical
>>>>> they do.
>>>>
>>>> Here's the big picture for what I'm trying to achieve:
>>>>
>>>> - In the common case, I'd like all programs to get a performance bump by
>>>>   automatically and transparently using large anon folios - so no explicit
>>>>   requirement on the process to opt-in.
>>>
>>> We all agree on this :)
>>>
>>>> - On arm64, in the above case, I'd like the preferred folio size to be 64K;
>>>>   from the (admittedly limited) testing I've done that's about where the
>>>>   performance knee is, and it doesn't appear to increase the memory wastage
>>>>   very much. It also has the benefit that for 4K base pages this is the
>>>>   contpte size (order-4), so I can take full benefit of contpte mappings
>>>>   transparently to the process. And for 16K this is the HPA size (order-2).
>>>
>>> My highest priority is to get 16KB proven first because it would
>>> benefit both client and server devices. So it may be different from
>>> yours but I don't see any conflict.
>>
>> Do you mean 16K folios on a 4K base page system
>
> Yes.
>
>> or large folios on a 16K base
>> page system? I thought your focus was on speeding up 4K base page client
>> systems, but this statement has got me wondering?
>
> Sorry, I should have said 4x4KB.

OK. Be aware that a number of Arm CPUs that support HPA don't have it enabled
by default (or at least don't have it enabled in the mode you would want in
order to see the best performance with large anon folios). You would need EL3
access to reconfigure it.
>
>>>> - On arm64, when the process has marked the VMA for THP (or when
>>>>   transparent_hugepage=always) but the VMA does not meet the requirements
>>>>   for a PMD-sized mapping (or we failed to allocate, ...), I'd like to map
>>>>   using contpte. For 4K base pages this is 64K (order-4), for 16K this is
>>>>   2M (order-7), and for 64K this is 2M (order-5). The 64K base page case is
>>>>   very important since the PMD size for that base page is 512MB, which is
>>>>   almost impossible to allocate in practice.
>>>
>>> Which case (server or client) are you focusing on here? For our client
>>> devices, I can confidently say that 64KB has to come after 16KB, if it
>>> happens at all. For servers in general, I don't know of any major
>>> memory-intensive workloads that are not THP-aware, i.e., I don't think
>>> "VMA does not meet the requirements" is a concern.
>>
>> For the 64K base page case, the focus is server. The problem reported by our
>> partner is that the 512M huge page size is too big to reliably allocate, so
>> the faults always fall back to 64K base pages in practice. I would also
>> speculate (happy to be proved wrong) that there are many THP-aware workloads
>> that assume the THP size is 2M. In this case, their VMAs may well be too
>> small to fit a 512M huge page when running on a 64K base page system.
>
> Interesting. When you have something ready to share, I might be able
> to try it on our ARM servers as well.

That would be really helpful. I'm currently updating my branch that collates
everything to reflect the review comments on this patch set and the contpte
patch set. I'll share it in a couple of weeks.

>
>> But the TL;DR is that Arm has a partner for which enabling 2M THP on a 64K
>> base page system is a very real requirement. Our intent is that this will be
>> the mechanism we use to enable it.
>
> Yes, contpte makes more sense for what you described. It'd fit in a
> lot better in the hugetlb case, but I guess your partner uses anon.
arm64 already supports contpte for hugetlb, but they need it to work with anon memory using THP.