Date: Thu, 19 May 2022 11:51:12 +0200
From: Michal Hocko
To: Johannes Weiner
Cc: Dave Hansen, "Huang, Ying", Yang Shi, Andrew Morton,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, kernel-team@fb.com, Zi Yan,
 Shakeel Butt, Roman Gushchin
Subject: Re: [PATCH] Revert "mm/vmscan: never demote for memcg reclaim"
In-Reply-To: <20220518190911.82400-1-hannes@cmpxchg.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 18-05-22 15:09:11, Johannes Weiner wrote:
> This reverts commit 3a235693d3930e1276c8d9cc0ca5807ef292cf0a.
>
> Its premise was that cgroup reclaim cares about freeing memory inside
> the cgroup, and demotion just moves them around within the cgroup
> limit. Hence, pages from toptier nodes should be reclaimed directly.
>
> However, with NUMA balancing now doing tier promotions, demotion is
> part of the page aging process. Global reclaim demotes the coldest
> toptier pages to secondary memory, where their life continues and from
> which they have a chance to get promoted back. Essentially, tiered
> memory systems have an LRU order that spans multiple nodes.
>
> When cgroup reclaims pages coming off the toptier directly, there can
> be colder pages on lower tier nodes that were demoted by global
> reclaim. This is an aging inversion, not unlike if cgroups were to
> reclaim directly from the active lists while there are inactive pages.
>
> Proactive reclaim is another factor. The goal of that is to offload
> colder pages from expensive RAM to cheaper storage. When lower tier
> memory is available as an intermediate layer, we want offloading to
> take advantage of it instead of bypassing to storage.
>
> Revert the patch so that cgroups respect the LRU order spanning the
> memory hierarchy.

I do agree with your reasoning.

> Of note is a specific undercommit scenario, where all cgroup limits in
> the system add up to <= available toptier memory. In that case,
> shuffling pages out to lower tiers first to reclaim them from there is
> inefficient. This is something that could be optimized/short-circuited
> later on (although care must be taken not to accidentally recreate the
> aging inversion). Let's ensure correctness first.

My slight concern with demotion is that there is no actual "guarantee"
that it reclaims any charges, which try_charge depends on to make
forward progress. I suspect this is a rather unlikely situation, though.
The last tier (without any fallback) should have some memory to reclaim
most of the time. Retries should push some pages out, but low-effort
allocation requests like GFP_NORETRY might fail; callers should be
prepared for that, though.

All that being said, the aging inversion is a much more real problem
than this.
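
To make the forward-progress concern concrete, here is a minimal userspace sketch. It is a toy model, not kernel code: struct toy_memcg, toy_reclaim_pass() and toy_try_charge() are hypothetical names, and the assumption that a pass frees last-tier pages first and only then demotes is a simplification of mine, not how vmscan orders its work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: charged pages of one cgroup on two tiers. */
struct toy_memcg {
	size_t toptier_pages;	/* charged pages on the top tier */
	size_t lasttier_pages;	/* charged pages on the last tier (no fallback) */
};

static size_t toy_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * One hypothetical reclaim pass with a budget of nr pages.
 * Pages already on the last tier are freed, which uncharges them.
 * Any budget left over demotes top-tier pages: they change tier but
 * stay charged. Returns the number of pages uncharged by this pass.
 */
static size_t toy_reclaim_pass(struct toy_memcg *cg, size_t nr)
{
	assert(cg != NULL);

	size_t freed = toy_min(cg->lasttier_pages, nr);
	cg->lasttier_pages -= freed;

	size_t demoted = toy_min(cg->toptier_pages, nr - freed);
	cg->toptier_pages -= demoted;
	cg->lasttier_pages += demoted;

	return freed;
}

/*
 * Hypothetical stand-in for the charge path: it needs nr pages
 * uncharged and retries reclaim up to max_retries extra times.
 * max_retries == 0 models a low-effort, GFP_NORETRY-like request.
 */
static bool toy_try_charge(struct toy_memcg *cg, size_t nr, int max_retries)
{
	size_t uncharged = 0;

	for (int attempt = 0; attempt <= max_retries; attempt++) {
		uncharged += toy_reclaim_pass(cg, nr - uncharged);
		if (uncharged >= nr)
			return true;
	}
	return false;
}
```

In this model, a cgroup with all of its pages on the top tier fails a no-retry request because the single pass only demotes, while one retry is enough to free the pages that were just demoted. A cgroup that already has pages on the last tier succeeds on the first pass.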

> Signed-off-by: Johannes Weiner
> Cc: Dave Hansen
> Cc: "Huang, Ying"
> Cc: Yang Shi
> Cc: Zi Yan
> Cc: Michal Hocko
> Cc: Shakeel Butt
> Cc: Roman Gushchin

Acked-by: Michal Hocko

> ---
>  mm/vmscan.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c6918fff06e1..7a4090712177 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -528,13 +528,8 @@ static bool can_demote(int nid, struct scan_control *sc)
>  {
>  	if (!numa_demotion_enabled)
>  		return false;
> -	if (sc) {
> -		if (sc->no_demotion)
> -			return false;
> -		/* It is pointless to do demotion in memcg reclaim */
> -		if (cgroup_reclaim(sc))
> -			return false;
> -	}
> +	if (sc && sc->no_demotion)
> +		return false;
>  	if (next_demotion_node(nid) == NUMA_NO_NODE)
>  		return false;
>
> --
> 2.36.1

-- 
Michal Hocko
SUSE Labs
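
For reference, the behavior change in the hunk above can be modeled in userspace roughly as follows. This is only a sketch: the toy_* types and functions are hypothetical stand-ins for the kernel's scan_control, numa_demotion_enabled and next_demotion_node(), and it assumes can_demote() returns true once the last check shown in the diff context passes (the hunk is cut off before the end of the function).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Hypothetical stand-in for the fields of scan_control used here. */
struct toy_scan_control {
	bool no_demotion;
	bool cgroup_reclaim;	/* models cgroup_reclaim(sc) */
};

/* Hypothetical stand-in for global state and the demotion target lookup. */
struct toy_sys {
	bool numa_demotion_enabled;
	int next_demotion_node;	/* TOY_NO_NODE when already on the last tier */
};

/* Model of the check before the revert: memcg reclaim never demotes. */
static bool toy_can_demote_before(const struct toy_sys *sys,
				  const struct toy_scan_control *sc)
{
	assert(sys != NULL);

	if (!sys->numa_demotion_enabled)
		return false;
	if (sc) {
		if (sc->no_demotion)
			return false;
		if (sc->cgroup_reclaim)
			return false;
	}
	return sys->next_demotion_node != TOY_NO_NODE;
}

/* Model of the check after the revert: only no_demotion opts out. */
static bool toy_can_demote_after(const struct toy_sys *sys,
				 const struct toy_scan_control *sc)
{
	assert(sys != NULL);

	if (!sys->numa_demotion_enabled)
		return false;
	if (sc && sc->no_demotion)
		return false;
	return sys->next_demotion_node != TOY_NO_NODE;
}
```

The two models differ only for memcg reclaim with demotion enabled and a lower tier available: the old check refuses to demote there and the new one allows it. Global reclaim, no_demotion and the last-tier case behave the same in both.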