Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp1793065rwb; Tue, 29 Nov 2022 20:01:22 -0800 (PST) X-Google-Smtp-Source: AA0mqf7Uj8i2qc64cBAQuTuDRge0yxBhTj6XISrCZSqb6lbtf1IsGQFtgaTajsh1da0LXfVR7LtZ X-Received: by 2002:a17:906:c0c:b0:78d:77b1:a433 with SMTP id s12-20020a1709060c0c00b0078d77b1a433mr50132840ejf.486.1669780881953; Tue, 29 Nov 2022 20:01:21 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1669780881; cv=none; d=google.com; s=arc-20160816; b=L2P+iqs0jiv4SICSFK3KOF46CKzqgGIFaL2c7hLD3GJ0ZjYwQ6HEhHnNyYnjkFhzxH IMLskDTtA8fTYmoTlo4SQin/bf1njc8OVfwODCXi/7X41dknoJXuH1+kfN7/ddrU7UET cwT0GcCLQrESvFtrp2U0PQC6GKBspLS8/MoIPvSHmE6XVWWklGOBR4US0k8ZOR879jLj /XiB+fCRMmB7A/oE2/UMkI4Ad7nEKFPwCLqVhRZ345MGEoqlSmZ4BaoEgzrsu/ZYtLOj CBwY+1PBvFRkHBSwSYoQgnxFrjYL3FHLR2vCj/IPZPJoQTic7VsnQZAcz+ift7pb6p3y cKNg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:user-agent:message-id:in-reply-to :date:references:subject:cc:to:from:dkim-signature; bh=P7R4r89eCSQA+3OXNm+GQHNuYGeQiD/0LwVfwJ8s9mk=; b=1Av28PzA/zW2zqCbL6DeMxCNJoLPNdp7HbdGxtNTzrcDjT+Q3KCD4CQx9LIF3RgkqF my2YriInd4F3bGX/yyGNFUGIFiMVS9wLiV6p64djKYB6wQPJOsiqa7UH7ykZ4/tVIZwv vTX4BJdzTU31NNH8sgQVqQ4lTSuE+gtY9KBOlbgOjTZrgALgJMow7QKhrfijlNIuCaI8 GFFWeY2b0xRsmwMGfKMTSRuXnh4/+sRv6T8EupOqSzjQkoXKzU3O2/GgrJHqLyaaZ+VO XwzjirfHoYMWSpyHphthuSaygihnkqpfF54L8sWprBiUt9OJfq+fgHk2u8LxoRMVNTtR h1YA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=PmZpwdXh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id sb33-20020a1709076da100b0077b45792af2si431803ejc.464.2022.11.29.20.01.01; Tue, 29 Nov 2022 20:01:21 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=PmZpwdXh; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232027AbiK3D4L (ORCPT + 86 others); Tue, 29 Nov 2022 22:56:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40152 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229604AbiK3D4J (ORCPT ); Tue, 29 Nov 2022 22:56:09 -0500 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 784F169DC3; Tue, 29 Nov 2022 19:56:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1669780567; x=1701316567; h=from:to:cc:subject:references:date:in-reply-to: message-id:mime-version; bh=wstP/Vj3b/bx7v6BDCKgttREqYEPOTLSqNXoVk7Grsg=; b=PmZpwdXhWg30AfIbwE+5BNZSK0jFr+kkFnao4N5hnVDCWeYeUYxQEfao 0nYDpfUU53fw3uc2vAj7O1TweyemLOBgA2PGz0iqoeQtyJnnZqw+fqGcM ArBcvpqZGyeTMS4S0f5PH4AFXwafrpA4YFERjPe5uBsj5V1KKHQyFtz1k 2A45rhDixaMqEJpdH6XJ7m7Ojz5U2QX3qPpRDW/iLpGWVGUQijdKistBY z2XsjUupmo8G3jpzWslRfSjAwBPtNCTM3WC0YjwVLYzAnmJY8KfpyMUI3 nuBFXTBgYO72I+VRZIxZG/cRd0lnBF+CSYLHelxauvHCgrVX9N4q07iGv A==; X-IronPort-AV: E=McAfee;i="6500,9779,10546"; a="312911108" X-IronPort-AV: E=Sophos;i="5.96,205,1665471600"; d="scan'208";a="312911108" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Nov 2022 19:56:06 -0800 X-IronPort-AV: E=McAfee;i="6500,9779,10546"; a="644055798" X-IronPort-AV: E=Sophos;i="5.96,205,1665471600"; d="scan'208";a="644055798" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orsmga002-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 29 Nov 2022 19:56:02 -0800 From: "Huang, Ying" To: Johannes Weiner Cc: Mina Almasry , Yang Shi , Yosry Ahmed , Tim Chen , weixugc@google.com, shakeelb@google.com, gthelen@google.com, fvdl@google.com, Michal Hocko , Roman Gushchin , Muchun Song , Andrew Morton , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org Subject: Re: [RFC PATCH V1] mm: Disable demotion from proactive reclaim References: <20221122203850.2765015-1-almasrymina@google.com> <874juonbmv.fsf@yhuang6-desk2.ccr.corp.intel.com> Date: Wed, 30 Nov 2022 11:55:12 +0800 In-Reply-To: (Johannes Weiner's message of "Tue, 29 Nov 2022 13:08:01 -0500") Message-ID: <87wn7dayfz.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Johannes Weiner writes: > Hello Ying, > > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: >> Johannes Weiner writes: >> > The fallback to reclaim actually strikes me as wrong. >> > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> > breaking the layering. Rather it should deflect that pressure to the >> > lower layers to make room. This makes sure we maintain an aging >> > pipeline that honors the memory tier hierarchy. >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> possible too. Now, when we allocate memory for demotion >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> to reclaim on current (higher tier) node. This may be not good enough, >> for example, the following patch from Hasan may help via waking up >> kswapd earlier. >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> Do you know what is the next step plan for this patch? >> >> Should we do even more? >> >> From another point of view, I still think that we can use falling back >> to reclaim as the last resort to avoid OOM in some special situations, >> for example, most pages in the lowest tier node are mlock() or too hot >> to be reclaimed. > > If they're hotter than reclaim candidates on the toptier, shouldn't > they get promoted instead and make room that way? We may have to tweak > the watermark logic a bit to facilitate that (allow promotions where > regular allocations already fail?). But this sort of resorting would > be preferable to age inversions. Now it's legal to enable demotion and disable promotion. Yes, this is wrong configuration in general. But should we trigger OOM for these users? And now promotion only works for default NUMA policy (and MPOL_BIND to both promotion source and target nodes with MPOL_F_NUMA_BALANCING). If we use some other NUMA policy, the pages cannot be promoted too. > The mlock scenario sounds possible. In that case, it wouldn't be an > aging inversion, since there is nothing colder on the CXL node. > > Maybe a bypass check should explicitly consult the demotion target > watermarks against its evictable pages (similar to the file_is_tiny > check in prepare_scan_count)? Yes. This sounds doable. > Because in any other scenario, if there is a bug in the promo/demo > coordination, I think we'd rather have the OOM than deal with age > inversions causing intermittent performance issues that are incredibly > hard to track down. Previously, I thought that people will always prefer performance regression than OOM. Apparently, I am wrong. Anyway, I think that we need to reduce the possibility of OOM or falling back to reclaim as much as possible firstly. Do you agree? One possibility, can we fall back to reclaim only if the sc->priority is small enough (even 0)? Best Regards, Huang, Ying