Date: Wed, 11 Oct 2023 14:05:05 +0100
From: Mel Gorman
To: Andrew Morton
Cc: Huang Ying, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Vlastimil
  Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning
Message-ID: <20231011130505.356soszayes3vy2n@techsingularity.net>
References: <20230920061856.257597-1-ying.huang@intel.com> <20230920094118.8b8f739125c6aede17c627e0@linux-foundation.org>
In-Reply-To: <20230920094118.8b8f739125c6aede17c627e0@linux-foundation.org>

On Wed, Sep 20, 2023 at 09:41:18AM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying wrote:
>
> > The page allocation performance requirements of different workloads
> > are often different. So, we need to tune the PCP (Per-CPU Pageset)
> > high on each CPU automatically to optimize the page allocation
> > performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era). But these numbers
> make me think it's very important and we should have been paying more
> attention.
>

FWIW, it is because not only does it avoid lock contention issues, it
avoids excessive splitting/merging of buddies as well as the slower
paths of the allocator. It is not very satisfactory and frankly, the
whole page allocator needs a revisit to account for very large zones but
it is far from a trivial project.
PCP just masks the worst of the issues and replacing it is far
harder than tweaking it.

> > The list of patches in series is as follows,
> >
> > 1 mm, pcp: avoid to drain PCP when process exit
> > 2 cacheinfo: calculate per-CPU data cache size
> > 3 mm, pcp: reduce lock contention for draining high-order pages
> > 4 mm: restrict the pcp batch scale factor to avoid too long latency
> > 5 mm, page_alloc: scale the number of pages that are batch allocated
> > 6 mm: add framework for PCP high auto-tuning
> > 7 mm: tune PCP high automatically
> > 8 mm, pcp: decrease PCP high if free pages < high watermark
> > 9 mm, pcp: avoid to reduce PCP high unnecessarily
> > 10 mm, pcp: reduce detecting time of consecutive high order page freeing
> >
> > Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> > freeing.
> >
> > Patch 4/5 optimize batch freeing and allocating.
> >
> > Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
> >
> > Patch 10 optimize the PCP draining for consecutive high order page
> > freeing based on PCP high auto-tuning.
> >
> > The test results for patches with performance impact are as follows,
> >
> > kbuild
> > ======
> >
> > On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> > one socket with `make -j 112`.
> >
> >          build time  zone lock%  free_high  alloc_zone
> >          ----------  ----------  ---------  ----------
> > base          100.0        43.6      100.0       100.0
> > patch1         96.6        40.3       49.2        95.2
> > patch3         96.4        40.5       11.3        95.1
> > patch5         96.1        37.9       13.3        96.8
> > patch7         86.4         9.8        6.2        22.0
> > patch9         85.9         9.4        4.8        16.3
> > patch10        87.7        12.6       29.0        32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?
>
> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
>   monitor the operation of the autotuning algorithm?
>

Not that I've seen yet but I'm still partway through the series.
It could be monitored with tracepoints but it can also be inferred from
lock contention issues. It would probably only be meaningful to
developers to monitor this closely, at least that's what I think now.
Honestly, I'm more worried about potential changes in behaviour
depending on the exact CPU and cache implementation than I am about
being able to actively monitor it.

> - I'm not seeing any Documentation/ updates. Surely there are things
>   we can tell users?
>
> - This:
>
> : It's possible that PCP high auto-tuning doesn't work well for some
> : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> : the auto-tuning will be disabled. The PCP high set by hand will be
> : used instead.
>
>   Is it a bit hacky to disable autotuning when the user alters
>   pcp-high? Would it be cleaner to have a separate on/off knob for
>   autotuning?
>

It might be but tuning the allocator is very specific and once we
introduce that tunable, we're probably stuck with it. I would prefer to
see it introduced if and only if we have to.

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

Not easily. It may manifest as variable lock contention issues when the
workload is at a steady state but that would increase the pressure to
split the allocator away from being zone-based entirely instead of
tweaking PCP further.

-- 
Mel Gorman
SUSE Labs
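
For anyone checking the "12% faster" and "10% from [07/10]" figures against the quoted kbuild table, the arithmetic can be redone with a short script. This is only a sketch: the numbers are copied from the build time column quoted in this message, and the names `build_time` and `deltas` are made up for illustration, not taken from the series.

```python
# Build-time column (base = 100.0), copied from the quoted kbuild table.
# The names below are illustrative only; nothing here comes from the patches.
build_time = [
    ("base", 100.0),
    ("patch1", 96.6),
    ("patch3", 96.4),
    ("patch5", 96.1),
    ("patch7", 86.4),
    ("patch9", 85.9),
    ("patch10", 87.7),
]


def deltas(rows):
    """Return (name, % change vs. base, % change vs. previous row) per row."""
    base = rows[0][1]
    prev = base
    out = []
    for name, value in rows:
        vs_base = (value - base) / base * 100.0
        vs_prev = (value - prev) / prev * 100.0
        out.append((name, round(vs_base, 1), round(vs_prev, 1)))
        prev = value
    return out


if __name__ == "__main__":
    for name, vs_base, vs_prev in deltas(build_time):
        print(f"{name:8s} {vs_base:+6.1f}% vs base  {vs_prev:+6.1f}% vs previous row")
```

Read this way, the patch10 row is about 12% below base, and the step from the patch5 row to the patch7 row is about 10%, which appears to be where both figures in the quoted questions come from.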