From: "Huang, Ying"
To: Johannes Weiner
Cc: Michal Hocko, Mina Almasry, Tejun Heo, Zefan Li, Jonathan Corbet,
    Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Yang Shi,
    Yosry Ahmed, weixugc@google.com, fvdl@google.com, bagasdotme@gmail.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
References: <20221202223533.1785418-1-almasrymina@google.com>
Date: Wed, 14 Dec 2022 15:15:17 +0800
In-Reply-To: (Johannes Weiner's message of "Tue, 13 Dec 2022 16:58:50 +0100")
Message-ID: <87fsdifoca.fsf@yhuang6-desk2.ccr.corp.intel.com>
Johannes Weiner writes:

> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote:
>> I do recognize your need to control the demotion but I argue that it is
>> a bad idea to rely on an implicit behavior of the memory reclaim and an
>> interface which is _documented_ to primarily _reclaim_ memory.
>
> I think memory.reclaim should demote as part of page aging. What I'd
> like to avoid is *having* to manually control the aging component in
> the interface (e.g. making memory.reclaim *only* reclaim, and
> *requiring* a coordinated use of memory.demote to ensure progress.)
>
>> Really, consider that the current demotion implementation will change
>> in the future and based on a newly added heuristic memory reclaim or
>> compression would be preferred over migration to a different tier. This
>> might completely break your current assumptions and break your usecase
>> which relies on an implicit demotion behavior. Do you see that as a
>> potential problem at all? What shall we do in that case? Special case
>> memory.reclaim behavior?
>
> Shouldn't that be derived from the distance properties in the tier
> configuration?
>
> I.e. if local compression is faster than demoting to a slower node, we
> should maybe have a separate tier for that. Ignoring proactive reclaim
> or demotion commands for a second: on that node, global memory
> pressure should always compress first, while the oldest pages from the
> compression cache should demote to the other node(s) - until they
> eventually get swapped out.
>
> However fine-grained we make proactive reclaim control over these
> stages, it should at least be possible for the user to request the
> default behavior that global pressure follows, without jumping through
> hoops or requiring the coordinated use of multiple knobs. So IMO there
> is an argument for having a singular knob that requests comprehensive
> aging and reclaiming across the configured hierarchy.
>
> As far as explicit control over the individual stages goes - no idea
> if you would call the compression stage demotion or reclaim. The
> distinction still does not make much sense to me, since reclaim is
> just another form of demotion. Sure, page faults have a different
> access latency than dax to slower memory. But you could also have 3
> tiers of memory where the difference between tier 1 and 2 is much
> smaller than the difference between 2 and 3, and you might want to
> apply different demotion rates between them as well.
>
> The other argument is that demotion does not free cgroup memory,
> whereas reclaim does. But with multiple memory tiers of vastly
> different performance, isn't there also an argument for granting
> cgroups different shares of each memory? So that a higher priority
> group has access to a bigger share of the fastest memory, and lower
> prio cgroups are relegated to lower tiers. If we split those pools,
> then "demotion" will actually free memory in a cgroup.
>
> This is why I liked adding a nodes= argument to memory.reclaim the
> best. It doesn't encode a distinction that may not last for long.
>
> The problem comes from how to interpret the input argument and the
> return value, right? Could we solve this by requiring the passed
> nodes= to all be of the same memory tier? Then there is no confusion
> around what is requested and what the return value means.

Yes. The definition is clear if all the nodes in nodes= are from the
same memory tier.
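For illustration, a minimal userspace sketch of that usage model (the
"<size> nodes=<nodelist>" syntax is as proposed in this patch; the
cgroup path and the assumption that nodes 2 and 3 form one slower
memory tier are made up for the example):

/*
 * Request proactive reclaim of 1G, restricted to the nodes of a
 * single memory tier, so that both the request and the return
 * value are unambiguous.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical: nodes 2 and 3 form one (slower) memory tier. */
	const char *req = "1G nodes=2,3";
	int fd = open("/sys/fs/cgroup/workload/memory.reclaim", O_WRONLY);

	if (fd < 0) {
		perror("open memory.reclaim");
		return 1;
	}
	/* If less than the requested amount can be reclaimed from the
	 * given nodes, the write is expected to fail with EAGAIN. */
	if (write(fd, req, strlen(req)) < 0)
		perror("write memory.reclaim");
	close(fd);
	return 0;
}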
> And if no nodes are passed, it means reclaim (from the lowest memory
> tier) X pages and demote as needed, then return the reclaimed pages.

It appears that the definition isn't very clear here. How many pages
should be demoted? Is the target number the value echoed to
memory.reclaim, or requested_number - pages_in_lowest_tier? Should we
demote across as many tiers as possible, or as few as possible?

One possibility is to take advantage of top-tier memory as much as
possible. That is, try to reclaim pages from the lower tiers only.

>> Now to your specific usecase. If there is a need to do a memory
>> distribution balancing then fine but this should be a well defined
>> interface. E.g. is there a need to not only control demotion but
>> promotions as well? I haven't heard anybody requesting that so far
>> but I can easily imagine that like outsourcing the memory reclaim to
>> the userspace someone might want to do the same thing with the numa
>> balancing because $REASONS. Should that ever happen, I am pretty sure
>> hooking into memory.reclaim is not really a great idea.
>
> Should this ever happen, it would seem fair that that be a separate
> knob anyway, no? One knob to move the pipeline in one direction
> (aging), one knob to move it the other way.

Agree.

Best Regards,
Huang, Ying