Message-ID: <467afd30-c85a-8b9d-97b9-a9ef9d0983af@arm.com>
Date: Tue, 4 Jul 2023 16:36:16 +0100
Subject: Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
From: Ryan Roberts
To: Yu Zhao, "Yin, Fengwei"
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", David Hildenbrand,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <69aada71-0b3f-e928-6413-742fe7926576@intel.com>

On 04/07/2023 08:11, Yu Zhao wrote:
> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei wrote:
>>
>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts wrote:
>>>>
>>>> Hi All,
>>>>
>>>> This is v2 of a series to implement variable order, large folios for anonymous
>>>> memory. The objective of this is to improve performance by allocating larger
>>>> chunks of memory during anonymous page faults. See [1] for background.
>>>
>>> Thanks for the quick response!
>>>
>>>> I've significantly reworked and simplified the patch set based on comments from
>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
>>>> VARIABLE_THP, on Yu's advice.
>>>>
>>>> The last patch is for arm64 to explicitly override the default
>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
>>>> could be handled through the arm64 tree separately. Neither has any build
>>>> dependency on the other.
>>>>
>>>> The one area where I haven't followed Yu's advice is in the determination of the
>>>> size of folio to use.
>>>> It was suggested that I have a single preferred large
>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
>>>> being existing overlapping populated PTEs, etc) then fallback immediately to
>>>> order-0. It turned out that this approach caused a performance regression in the
>>>> Speedometer benchmark.
>>>
>>> I suppose it's regression against the v1, not the unpatched kernel.
>> From the performance data Ryan shared, it's against unpatched kernel:
>>
>> Speedometer 2.0:
>>
>> | kernel                         |   runs_per_min |
>> |:-------------------------------|---------------:|
>> | baseline-4k                    |           0.0% |
>> | anonfolio-lkml-v1              |           0.7% |
>> | anonfolio-lkml-v2-simple-order |          -0.9% |
>> | anonfolio-lkml-v2              |           0.5% |
>
> I see. Thanks.
>
> A couple of questions:
> 1. Do we have a stddev?

| kernel                    |   mean_abs |   std_abs |   mean_rel |   std_rel |
|:--------------------------|-----------:|----------:|-----------:|----------:|
| baseline-4k               |      117.4 |       0.8 |       0.0% |      0.7% |
| anonfolio-v1              |      118.2 |         1 |       0.7% |      0.9% |
| anonfolio-v2-simple-order |      116.4 |       1.1 |      -0.9% |      0.9% |
| anonfolio-v2              |        118 |       1.2 |       0.5% |      1.0% |

This is with 3 runs per reboot across 5 reboots, with the first run after each
reboot trimmed (it's always a bit slower, I assume due to cold page cache). So
10 data points per kernel in total.

I've rerun the test multiple times and see similar results each time.

I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I
see the same performance as baseline-4k.

> 2. Do we have a theory why it regressed?

I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
to order-0. I guess this is happening so often for this workload that the cost
of doing the checks and fallback is outweighing the benefit of the memory that
does end up with order-4 folios.
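
To make the two policies under discussion concrete, here is a minimal userspace
model in Python. It is a sketch and not the kernel code from the series: the
function names, the `populated` set standing in for existing PTEs, the 4K base
page size, and the choice to try every order below the preferred one are all
assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumption: 4K base pages, so order-4 is a 64K block


def aligned_block_fits(addr, order, vma_start, vma_end, populated):
    """True if the naturally aligned 2^order-page block around addr lies
    inside [vma_start, vma_end) and overlaps no already-populated page."""
    size = PAGE_SIZE << order
    start = addr & ~(size - 1)  # round the fault address down to the block
    if start < vma_start or start + size > vma_end:
        return False
    return not any(page in populated
                   for page in range(start, start + size, PAGE_SIZE))


def pick_order_simple(addr, vma_start, vma_end, populated, preferred=4):
    """Single preferred order; fall back straight to order-0 if it fails."""
    if aligned_block_fits(addr, preferred, vma_start, vma_end, populated):
        return preferred
    return 0


def pick_order_stepdown(addr, vma_start, vma_end, populated, preferred=4):
    """Try the preferred order, then each smaller order, then order-0."""
    for order in range(preferred, 0, -1):
        if aligned_block_fits(addr, order, vma_start, vma_end, populated):
            return order
    return 0


# A 32K VMA: too small for a 64K block, but an order-3 block fits exactly.
vma_start, vma_end = 0x10000, 0x18000
print(pick_order_simple(0x11000, vma_start, vma_end, set()))    # 0
print(pick_order_stepdown(0x11000, vma_start, vma_end, set()))  # 3
```

In this model the simple policy pays for the order-4 check and then gets a 4K
page, while the step-down policy recovers a 32K block from the same fault.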

I've sampled the memory in each bucket (once per second) while running and it's
roughly:

64K: 25%
32K: 15%
16K: 15%
4K: 45%

32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
But potentially, I suspect there is lots of mmap/munmap for the smaller sizes
and the 64K contents are more static - that's just a guess though.

> Assuming no bugs, I don't see how a real regression could happen --
> falling back to order-0 isn't different from the original behavior.
> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?

I can, but it will have to be a bit later in the week. I'll do some more test
runs overnight so we have a larger number of runs - hopefully that might tell us
that this is noise to a certain extent.

I'd still like to hear a clear technical argument for why the bin-packing
approach is not the correct one!

Thanks,
Ryan
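
One rough way to gauge the noise question is to compute a Welch t statistic
from the mean_abs and std_abs columns in the table earlier in this mail. The
sketch below assumes std_abs is the sample standard deviation and that each
kernel has exactly the 10 data points described; the helper names are made up
for illustration.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic for the difference (b - a) of two sample means."""
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_b - mean_a) / standard_error


def welch_df(std_a, n_a, std_b, n_b):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    return (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))


# baseline-4k versus anonfolio-v2-simple-order, values taken from the table
t_stat = welch_t(117.4, 0.8, 10, 116.4, 1.1, 10)
dof = welch_df(0.8, 10, 1.1, 10)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

With those assumptions t comes out at about -2.3, so more runs, as planned,
should shrink the standard error and make the picture clearer either way.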