Date: Thu, 3 Aug 2023 11:27:43 +0100
Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
To: Yin Fengwei, Yu Zhao
Cc: Andrew Morton, Matthew Wilcox, David Hildenbrand, Catalin Marinas,
 Will Deacon, Anshuman Khandual, Yang Shi, "Huang, Ying", Zi Yan,
 Luis Chamberlain, Itaru Kitayama, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>
 <8c0710e0-a75a-b315-dae1-dd93092e4bd6@arm.com>
 <4ae53b2a-e069-f579-428d-ac6f744cd19a@intel.com>
 <49142e18-fd4e-6487-113a-3112b1c17dbe@arm.com>
 <2d947a72-c295-e4c5-4176-4c59cc250e39@intel.com>
 <07d060a8-9ffe-16c1-652b-7854730ea572@intel.com>
From: Ryan Roberts
In-Reply-To: <07d060a8-9ffe-16c1-652b-7854730ea572@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 03/08/2023 10:58, Yin Fengwei wrote:
>
>
> On 8/3/23 17:32, Ryan Roberts wrote:
>> On 03/08/2023 09:37, Yin Fengwei wrote:
>>>
>>>
>>> On 8/3/23 16:21, Ryan Roberts wrote:
>>>> On 03/08/2023 09:05, Yin Fengwei wrote:
>>>>
>>>> ...
>>>>
>>>>>> I've captured run time and peak memory usage, and taken the mean.
>>>>>> The stdev for the peak memory usage is big-ish, but I'm confident this
>>>>>> still captures the central tendency well:
>>>>>>
>>>>>> | MAX_ORDER_UNHINTED | real-time | kern-time | user-time | peak memory |
>>>>>> |:-------------------|----------:|----------:|----------:|:------------|
>>>>>> | 4k                 |      0.0% |      0.0% |      0.0% | 0.0%        |
>>>>>> | 16k                |     -3.6% |    -26.5% |     -0.5% | -0.1%       |
>>>>>> | 32k                |     -4.8% |    -37.4% |     -0.6% | -0.1%       |
>>>>>> | 64k                |     -5.7% |    -42.0% |     -0.6% | -1.1%       |
>>>>>> | 128k               |     -5.6% |    -42.1% |     -0.7% | 1.4%        |
>>>>>> | 256k               |     -4.9% |    -41.9% |     -0.4% | 1.9%        |
>>>>>
>>>>> Here is my test result:
>>>>>
>>>>>              real    user    sys
>>>>> hink-4k:     0%      0%      0%
>>>>> hink-16K:   -3%      0.1%   -18.3%
>>>>> hink-32K:   -4%      0.2%   -27.2%
>>>>> hink-64K:   -4%      0.5%   -31.0%
>>>>> hink-128K:  -4%      0.9%   -33.7%
>>>>> hink-256K:  -5%      1%     -34.6%
>>>>>
>>>>> I used the command:
>>>>>   /usr/bin/time -f "\t%E real,\t%U user,\t%S sys" make -skj96 allmodconfig all
>>>>> to build the kernel and collect the real/user/kernel times.
>>>>> /sys/kernel/mm/transparent_hugepage/enabled is "madvise".
>>>>> Let me know if you have any questions about the test.
>>>>
>>>> Thanks for doing this! I have a couple of questions:
>>>>
>>>> - how many times did you run each test?
>>> Three times for each ANON_FOLIO_MAX_ORDER_UNHINTED. The stddev is quite
>>> small, less than 1%.
>>
>> And out of interest, were you running on bare metal or in a VM? And did you
>> reboot between each run?
> I ran the test on a bare metal env. I didn't reboot for every run, but had to
> reboot for each different ANON_FOLIO_MAX_ORDER_UNHINTED size. I do
>   echo 3 > /proc/sys/vm/drop_caches
> for every run after "make mrproper", even after a fresh boot.
>
>
>>
>>>>
>>>> - how did you configure the large page size? (I sent an email out yesterday
>>>> saying that I was doing it wrong in my tests, so the 128k and 256k results
>>>> for my test set are not valid.
>>> I changed the ANON_FOLIO_MAX_ORDER_UNHINTED definition manually every time.
>>
>> In that case, I think your results are broken in a similar way to mine. This
>> code means that order will never be higher than 3 (32K) on x86:
>>
>> +	order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
>> +
>> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +		order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
>>
>> On x86, arch_wants_pte_order() is not implemented and the default returns -1,
>> so you end up with:
> I added arch_wants_pte_order() for x86 and gave it a very large number, so the
> order is decided by ANON_FOLIO_MAX_ORDER_UNHINTED. I suppose my data is valid.

Ahh great! OK, sorry for the noise. Given that part of the rationale for the
experiment was to plot perf against memory usage, did you collect any memory
numbers?

>
>>
>> order = min(PAGE_ALLOC_COSTLY_ORDER, ANON_FOLIO_MAX_ORDER_UNHINTED)
>>
>> So your 4k, 16k and 32k results should be valid, but the 64k, 128k and 256k
>> results are actually using 32k, I think? Which is odd, because you are
>> getting more stddev than the < 1% you quoted above? So perhaps this is down
>> to rebooting (kaslr, or something...?)
>>
>> (on arm64, arch_wants_pte_order() returns 4, so my 64k result is also valid).
>>
>> As a quick hack to work around this, would you be able to change the code to
>> this:
>>
>> +	if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
>> +		order = ANON_FOLIO_MAX_ORDER_UNHINTED;
>>
>>>
>>>>
>>>> - what does "hink" mean??
>>> Sorry for the typo. It should be ANON_FOLIO_MAX_ORDER_UNHINTED.
>>>
>>>>
>>>>>
>>>>> I also find one strange behavior with this version. It's related to why
>>>>> I need to set /sys/kernel/mm/transparent_hugepage/enabled to "madvise".
>>>>> If it's "never", large folios are disabled too.
>>>>> If it's "always", THP will be active before large folios, so the system
>>>>> is in a mixed mode; it's not suitable for this test.
>>>>
>>>> We had a discussion around this in the THP meeting yesterday. I'm going to
>>>> write this up properly so we can have a proper systematic discussion. The
>>>> tentative conclusion is that MADV_NOHUGEPAGE must continue to mean "do not
>>>> fault in more than is absolutely necessary". I would assume we need to
>>>> extend that thinking to the process-wide and system-wide knobs (as is done
>>>> in the patch), but we didn't explicitly say so in the meeting.
>>> There are cases where THP is not appreciated because of its latency or
>>> memory consumption. For those cases, large folios may fill the gap with
>>> less latency and memory consumption.
>>>
>>> So if disabling THP means large folios can't be used, we lose the chance
>>> to benefit those cases with large folios.
>>
>> Yes, I appreciate that. But there are also real use cases that expect
>> MADV_NOHUGEPAGE to mean "do not fault more than is absolutely necessary",
>> and those use cases break if that's not obeyed (e.g. live migration w/
>> qemu). So I think we need to be conservative to start. The apps that
>> explicitly forbid THP today should, in the long run, be updated to opt in
>> to large anon folios using some as-yet undefined control.
> Fair enough.
>
>
> Regards
> Yin, Fengwei
>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> My intention is that if you have requested THP and your vma is big enough
>>>> for PMD-size then you get that, else you fall back to large anon folios.
>>>> And if you have neither opted in nor out, then you get large anon folios.
>>>>
>>>> We talked about the idea of adding a new knob that lets you set the max
>>>> order, but that needs a lot more thought.
>>>>
>>>> Anyway, as I said, I'll write it up so we can all systematically discuss.
>>>>
>>>>>
>>>>> So if it's "never", large folios are disabled. But why does "madvise"
>>>>> enable large folios unconditionally? Suppose it were only enabled for the
>>>>> VMA ranges the user madvises for large folios (or THP)?
>>>>>
>>>>> Specific to the hink setting, my understanding is that we can't choose
>>>>> it only by this testing. Other workloads may behave differently with
>>>>> different hink settings.
>>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>
>>