From: Wei Xu
Date: Mon, 12 Dec 2022 23:48:57 -0800
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
To: "Huang, Ying"
Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Yang Shi, Yosry Ahmed, fvdl@google.com, bagasdotme@gmail.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20221202223533.1785418-1-almasrymina@google.com> <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying wrote:
>
> Mina Almasry writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first. This
> >> > is useful to direct proactive reclaim to specific nodes that are under
> >> > pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko
> >> > Signed-off-by: Mina Almasry
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this stage but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process. This means that demotion can happen outside of the mask so the
> >> user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that Meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim". That is,
> reconsider "memory.demote". In this way, we will always uncharge the
> cgroup for "memory.reclaim". This avoids the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

+1 on memory.demote.

> Best Regards,
> Huang, Ying
>
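
For anyone following the thread, the request strings from the quoted patch description can be sketched in Python. This is only an illustration: it assumes the "<size> nodes=<list>" syntax proposed in the quoted patch, which is the interface under discussion and not a settled API. The helper names format_reclaim_request and request_reclaim are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional


def format_reclaim_request(size: str, nodes: Optional[Iterable[int]] = None) -> str:
    """Build the string that would be written to memory.reclaim.

    Assumes the "<size> nodes=<comma-separated ids>" syntax from the quoted
    patch description (a proposal, not a confirmed kernel interface).
    """
    if nodes is None:
        # No node restriction: plain proactive reclaim request.
        return size
    node_list = ",".join(str(n) for n in sorted(set(nodes)))
    if not node_list:
        raise ValueError("nodes must not be empty when given")
    return f"{size} nodes={node_list}"


def request_reclaim(cgroup_dir: str, size: str,
                    nodes: Optional[Iterable[int]] = None) -> None:
    """Write the request to <cgroup_dir>/memory.reclaim (hypothetical helper)."""
    Path(cgroup_dir, "memory.reclaim").write_text(format_reclaim_request(size, nodes))
```

With the two-tier layout from the patch description, format_reclaim_request("1m", [0, 1]) targets the top tier (demotion attempted first under the proposal), while format_reclaim_request("1m", [2, 3]) targets the second tier, where there are no demotion targets. A call such as request_reclaim("/sys/fs/cgroup/example", "1m", [0]) would write the first form; the cgroup path there is made up for the example.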