Date: Wed, 18 Jan 2023 18:21:22 +0100
From: Michal Hocko
To: "Huang, Ying"
Cc: Mina Almasry, Johannes Weiner, Yang Shi, Yosry Ahmed,
 weixugc@google.com, Tim Chen, Andrew Morton, Tejun Heo, Zefan Li,
 Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
 fvdl@google.com, bagasdotme@gmail.com, cgroups@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
Subject: Re: Proactive reclaim/demote discussion (was Re: [PATCH] Revert
 "mm: add nodes= arg to memory.reclaim")
References: <20221202223533.1785418-1-almasrymina@google.com>
 <20221216101820.3f4a370af2c93d3c2e78ed8a@linux-foundation.org>
 <20221219144252.f3da256e75e176905346b4d1@linux-foundation.org>
 <87lemiitdd.fsf_-_@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87lemiitdd.fsf_-_@yhuang6-desk2.ccr.corp.intel.com>

On Wed 04-01-23 16:41:50, Huang, Ying wrote:
> Michal Hocko writes:
>
> [snip]
>
> > This really requires more discussion.
>
> Let's start the discussion with some summary.
>
> Requirements:
>
> - Proactive reclaim. The counting of current per-memcg proactive
>   reclaim (memory.reclaim) isn't correct. The demoted, but not
>   reclaimed pages will be counted as reclaimed. So "echo XXM >
>   memory.reclaim" may exit prematurely before the specified number of
>   memory is reclaimed.

This is reportedly a problem because the memory.reclaim interface cannot
be used for proper memcg sizing, IIRC.

> - Proactive demote. We need an interface to do per-memcg proactive
>   demote.

For the further discussion it would be useful to reference the usecase
that requires this functionality. I believe this has been mentioned
somewhere, but having it in this thread would help.

>   We may reuse memory.reclaim via extending the concept of
>   reclaiming to include demoting. Or, we can add a new interface for
>   that (for example, memory.demote). In addition to demote from fast
>   tier to slow tier, in theory, we may need to demote from a set of
>   nodes to another set of nodes for something like general node
>   balancing.
>
> - Proactive promote. In theory, this is possible, but there's no real
>   life requirements yet. And it should use a separate interface, so I
>   don't think we need to discuss that here.

Yes, proactive promotion is not backed by any real usecase at the
moment. We do not really have to focus on it, but we should be aware of
the possibility and allow future extensions towards that functionality.

There is one requirement missing here.
- Per NUMA node control - this is what makes the distinction between
  demotion and charge reclaim really semantically challenging - e.g.
  should demotions be constrained by the provided nodemask or should
  they be implicit?

> Open questions:
>
> - Use memory.reclaim or memory.demote for proactive demote. In current
>   memcg context, reclaiming and demoting is quite different, because
>   reclaiming will uncharge, while demoting will not.
>   But if we will add
>   per-memory-tier charging finally, the difference disappears. So the
>   question becomes whether will we add per-memory-tier charging.

The question is not whether but when, IMHO. We've had a similar
situation with swap accounting. Originally we considered swap a shared
resource, but cgroup v2 goes with separate swap limits because
contention for swap space is really something people do care about.

> - Whether should we demote from faster tier nodes to lower tier nodes
>   during the proactive reclaiming.

I thought we were aligned on that. Demotion is a part of aging, and
aging is an integral part of reclaim.

>   Choice A is to keep as much fast
>   memory as possible. That is, reclaim from the lowest tier nodes
>   firstly, then the secondary lowest tier nodes, and so on. Choice B is
>   to demote at the same time of reclaiming. In this way, if we
>   proactively reclaim XX MB memory, we may free XX MB memory on the
>   fastest memory nodes.
>
> - When we proactively demote some memory from a fast memory tier, should
>   we trigger memory competition in the slower memory tiers? That is,
>   whether to wake up kswapd of the slower memory tiers nodes?

Johannes made some very strong arguments that there is no other choice
than to involve kswapd
(https://lore.kernel.org/all/Y5nEQeXj6HQBEHEY@cmpxchg.org/).

>   If we
>   want to make per-memcg proactive demoting to be per-memcg strictly, we
>   should avoid to trigger the global behavior such as triggering memory
>   competition in the slower memory tiers. Instead, we can add a global
>   proactive demote interface for that (such as per-memory-tier or
>   per-node).

I suspect we are left with waiting for a real usecase and then following
the path we took for swap accounting.

Other open questions I do see are
- what to do when memory.reclaim is constrained by a nodemask, as
  mentioned above. Is the whole reclaim process (including aging) bound
  to the given nodemask, or does demotion escape from it?
- should demotion be specific to multi-tier systems, or should the
  interface be just NUMA based so that users could use the scheme to
  shuffle memory around and allow NUMA balancing from userspace that
  way? That would imply that demotion is a dedicated interface, of
  course.
- there are other usecases that would like to trigger aging from
  userspace
  (http://lkml.kernel.org/r/20221214225123.2770216-1-yuanchu@google.com).
  Isn't demotion just a special case of aging in general, or should we
  end up with 3 different interfaces?

-- 
Michal Hocko
SUSE Labs