Date: Tue, 13 Dec 2022 14:30:40 +0100
From: Johannes Weiner
To: "Huang, Ying"
Cc: Mina Almasry, Michal Hocko, Tejun Heo, Zefan Li, Jonathan Corbet,
    Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Yang Shi,
    Yosry Ahmed, weixugc@google.com, fvdl@google.com, bagasdotme@gmail.com,
    cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v3] mm: Add nodes= arg to memory.reclaim
References: <20221202223533.1785418-1-almasrymina@google.com>
    <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87k02volwe.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote:
> Mina Almasry writes:
>
> > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko wrote:
> >>
> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote:
> >> > The nodes= arg instructs the kernel to only scan the given nodes for
> >> > proactive reclaim. For example use cases, consider a 2 tier memory system:
> >> >
> >> > nodes 0,1 -> top tier
> >> > nodes 2,3 -> second tier
> >> >
> >> > $ echo "1m nodes=0" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory from node 0.
> >> > Since node 0 is a top tier node, demotion will be attempted first.
> >> > This is useful to direct proactive reclaim to specific nodes that are
> >> > under pressure.
> >> >
> >> > $ echo "1m nodes=2,3" > memory.reclaim
> >> >
> >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier,
> >> > since this tier of memory has no demotion targets the memory will be
> >> > reclaimed.
> >> >
> >> > $ echo "1m nodes=0,1" > memory.reclaim
> >> >
> >> > Instructs the kernel to reclaim memory from the top tier nodes, which can
> >> > be desirable according to the userspace policy if there is pressure on
> >> > the top tiers. Since these nodes have demotion targets, the kernel will
> >> > attempt demotion first.
> >> >
> >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg
> >> > reclaim""), the proactive reclaim interface memory.reclaim does both
> >> > reclaim and demotion. Reclaim and demotion incur different latency costs
> >> > to the jobs in the cgroup. Demoted memory would still be addressable
> >> > by the userspace at a higher latency, but reclaimed memory would need to
> >> > incur a pagefault.
> >> >
> >> > The 'nodes' arg is useful to allow the userspace to control demotion
> >> > and reclaim independently according to its policy: if the memory.reclaim
> >> > is called on a node with demotion targets, it will attempt demotion first;
> >> > if it is called on a node without demotion targets, it will only attempt
> >> > reclaim.
> >> >
> >> > Acked-by: Michal Hocko
> >> > Signed-off-by: Mina Almasry
> >>
> >> After discussion in [1] I have realized that I haven't really thought
> >> through all the consequences of this patch and therefore I am retracting
> >> my ack here. I am not nacking the patch at this stage but I also think
> >> this shouldn't be merged now and we should really consider all the
> >> consequences.
> >>
> >> Let me summarize my main concerns here as well. The proposed
> >> implementation doesn't apply the provided nodemask to the whole reclaim
> >> process.
> >> This means that demotion can happen outside of the mask so the
> >> user request cannot really control demotion targets and that limits
> >> the interface should there be any need for a finer grained control in
> >> the future (see an example in [2]).
> >> Another problem is that this can limit future reclaim extensions because
> >> of existing assumptions of the interface [3] - specify only top-tier
> >> node to force the aging without actually reclaiming any charges and
> >> (ab)use the interface only for aging on multi-tier system. A change to
> >> the reclaim to not demote in some cases could break this usecase.
> >>
> >
> > I think this is correct. My use case is to request from the kernel to
> > do demotion without reclaim in the cgroup, and the reason for that is
> > stated in the commit message:
> >
> > "Reclaim and demotion incur different latency costs to the jobs in the
> > cgroup. Demoted memory would still be addressable by the userspace at
> > a higher latency, but reclaimed memory would need to incur a
> > pagefault."
> >
> > For jobs of some latency tiers, we would like to trigger proactive
> > demotion (which incurs relatively low latency on the job), but not
> > trigger proactive reclaim (which incurs a pagefault). I initially had
> > proposed a separate interface for this, but Johannes directed me to
> > this interface instead in [1]. In the same email Johannes also tells
> > me that meta's reclaim stack relies on memory.reclaim triggering
> > demotion, so it seems that I'm not the first to take a dependency on
> > this. Additionally in [2] Johannes also says it would be great if in
> > the long term reclaim policy and demotion policy do not diverge.
> >
> > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/
> > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
>
> After these discussions, I think the solution may be to use different
> interfaces for "proactive demote" and "proactive reclaim".
> That is, reconsider "memory.demote". In this way, we will always uncharge the
> cgroup for "memory.reclaim". This avoids the possible confusion there.
> And, because demotion is considered aging, we don't need to disable
> demotion for "memory.reclaim", just don't count it.

Hm, so in summary:

1) memory.reclaim would demote and reclaim like today, but it would
   change to only count reclaimed pages against the goal.

2) memory.demote would only demote.

   a) What if the demotion targets are full? Would it reclaim or fail?

3) Would memory.reclaim and memory.demote still need nodemasks? Would
   they return -EINVAL if a) memory.reclaim gets passed only toptier
   nodes or b) memory.demote gets passed any lasttier nodes?
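
To make points 1) and 3) concrete, here is a rough userspace model of one
possible reading, assuming both -EINVAL questions are answered "yes". It is
not kernel code: the function names, the tier layout (taken from the example
in the quoted patch description), and the rejection of empty or unknown masks
are all hypothetical. Question 2a) is deliberately left out because it is
still open.

```python
import errno

# Hypothetical tier layout, borrowed from the example in the quoted patch
# description: nodes 0,1 are top tier, nodes 2,3 are the last tier.
TOPTIER = frozenset({0, 1})
LASTTIER = frozenset({2, 3})


def check_nodemask(interface, nodes):
    """Return 0 if the mask would be accepted, -EINVAL otherwise."""
    nodes = frozenset(nodes)
    # Assumption (not from the thread): empty masks and unknown nodes
    # are always rejected.
    if not nodes or not (nodes <= (TOPTIER | LASTTIER)):
        return -errno.EINVAL
    if interface == "memory.reclaim":
        # 3a) reject a mask made up only of toptier nodes.
        return -errno.EINVAL if nodes <= TOPTIER else 0
    if interface == "memory.demote":
        # 3b) reject a mask that contains any lasttier node.
        return -errno.EINVAL if nodes & LASTTIER else 0
    # Unknown interface name in this model.
    return -errno.EINVAL


def counted_toward_goal(reclaimed, demoted):
    """Point 1: demotion still happens, but only reclaimed pages count."""
    return reclaimed
```

Under this reading, "nodes=0,1" would be rejected by memory.reclaim but
accepted by memory.demote, and a mixed mask such as "nodes=0,2" would be
the other way around.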