Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
From: "ying.huang@intel.com"
To: Wei Xu
Cc: Andrew Morton, Greg Thelen, "Aneesh Kumar K.V", Yang Shi,
    Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
    Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
    Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
    Linux MM, Brice Goglin, Hesham Almatary
Date: Fri, 13 May 2022 15:04:39 +0800
On Thu, 2022-05-12 at 23:36 -0700, Wei Xu wrote:
> On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com wrote:
> >
> > On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> > >
> > > Memory Allocation for Demotion
> > > ==============================
> > >
> > > To allocate a new page as the demotion target for a page, the
> > > kernel calls the allocation function (__alloc_pages_nodemask)
> > > with the source page node as the preferred node and the union of
> > > all lower tier nodes as the allowed nodemask.  The actual target
> > > node selection then follows the allocation fallback order that
> > > the kernel has already defined.
> > >
> > > The pseudo code looks like:
> > >
> > >     targets = NODE_MASK_NONE;
> > >     src_nid = page_to_nid(page);
> > >     src_tier = node_tier_map[src_nid];
> > >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> > >             nodes_or(targets, targets, memory_tiers[i]);
> > >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> > >
> > > The mempolicy of the cpuset, vma and owner task of the source
> > > page can be set to refine the demotion target nodemask, e.g. to
> > > prevent demotion or to select a particular allowed node as the
> > > demotion target.
> >
> > Consider a system with 3 tiers.  If we want to demote some pages
> > from tier 0, the desired behavior is:
> >
> > - Allocate pages from tier 1.
> > - If there are not enough free pages in tier 1, wake up the kswapd
> >   of tier 1 so that it demotes some pages from tier 1 to tier 2.
> > - If there are still not enough free pages in tier 1, allocate
> >   pages from tier 2.
> >
> > In this way, tier 0 will have the hottest pages, tier 1 the colder
> > pages, and tier 2 the coldest.
>
> When we are already in the allocation path for the demotion of a
> page from tier 0, I think we'd better not block this allocation to
> wait for kswapd to demote pages from tier 1 to tier 2.  Instead, we
> should directly allocate from tier 2.  Meanwhile, this demotion can
> wake up kswapd to demote from tier 1 to tier 2 in the background.

Yes.  That's what I want too.  My original words may have been
misleading.
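To make sure we are talking about the same thing, below is a rough,
untested sketch of the behavior I have in mind.  It reuses
node_tier_map[], memory_tiers[] and MAX_MEMORY_TIERS from your pseudo
code; wake_tier_kswapd() is a made-up helper that would wake the
kswapd of every node in a tier, not an existing kernel symbol.

	/*
	 * Illustrative only: try each lower tier in turn.  When a
	 * tier is short on free pages, wake its kswapd (so it can
	 * demote further down in the background) and fall through to
	 * the next tier immediately instead of blocking.
	 */
	static struct page *alloc_demote_page(struct page *page,
					      gfp_t gfp,
					      unsigned int order)
	{
		int src_nid = page_to_nid(page);
		int src_tier = node_tier_map[src_nid];
		struct page *new_page;
		int tier;

		for (tier = src_tier + 1; tier < MAX_MEMORY_TIERS; tier++) {
			nodemask_t targets = memory_tiers[tier];

			if (nodes_empty(targets))
				continue;

			/* Fail fast instead of entering direct reclaim. */
			new_page = __alloc_pages_nodemask(gfp | __GFP_NORETRY,
							  order, src_nid,
							  &targets);
			if (new_page)
				return new_page;

			/* Make room in this tier for future demotions. */
			wake_tier_kswapd(tier);
		}

		return NULL;
	}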
> > With your proposed method, the behavior when demoting from tier 0
> > is:
> >
> > - Allocate pages from tier 1.
> > - If there are not enough free pages in tier 1, allocate pages
> >   from tier 2.
> >
> > The kswapd of tier 1 will not be woken up until there are not
> > enough free pages in tier 2.  For quite a long time, there will
> > not be much hot/cold differentiation between tier 1 and tier 2.
>
> This is true with the current allocation code.  But I think we can
> make some changes for demotion allocations.  For example, we can add
> a GFP_DEMOTE flag and update the allocation function to wake up
> kswapd when this flag is set and we need to fall back to another
> node.
>
> > This isn't hard to fix: just call __alloc_pages_nodemask() for
> > each tier one by one, considering the page allocation fallback
> > order.
>
> That would have worked, except that there is an example earlier in
> which it is actually preferred for some nodes to demote to their
> tier + 2, not tier + 1.
>
> More specifically, the example is:
>
>                     20
>       Node 0 (DRAM) ---- Node 1 (DRAM)
>        |   \              /    |
>        |  30\        120 /     |
>        |     v           v     | 100
>   100  |    Node 2 (PMEM)      |
>        |          |            |
>        |          | 100        |
>         \         v            v
>          --> Node 3 (Large Mem)
>
> Node distances:
>
>   node   0    1    2    3
>      0  10   20   30  100
>      1  20   10  120  100
>      2  30  120   10  100
>      3 100  100  100   10
>
> 3 memory tiers are defined:
>
>   tier 0: nodes 0-1
>   tier 1: node 2
>   tier 2: node 3
>
> The demotion fallback order is:
>
>   node 0: 2, 3
>   node 1: 3, 2
>   node 2: 3
>   node 3: empty
>
> Note that even though node 3 is in tier 2 and node 2 is in tier 1,
> node 1 (in tier 0) still prefers node 3 as its first demotion
> target, not node 2.

Yes.  I understand that we need to support this use case.  We can use
the tier order from the allocation fallback list instead of always
going from the smallest tier number to the largest.  That is, for
node 1, the tier order for demotion would be tier 2, then tier 1.

Best Regards,
Huang, Ying
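PS: In code, "use the tier order from the allocation fallback list"
could look roughly like the sketch below.  Again, this is untested and
only illustrative: for_each_node_in_fallback_order() is a made-up
iterator over a node's distance-sorted allocation fallback list, not
an existing kernel macro.  The demotion loop in my earlier sketch
would then walk tier_order[] instead of going from src_tier + 1 up to
MAX_MEMORY_TIERS.  For node 1 in your example, the fallback list by
distance is 0, 3, 2, so the lower tiers come out as tier 2 first,
then tier 1, matching the desired demotion order 3, 2.

	/*
	 * Illustrative only: derive the per-node demotion tier order
	 * from the node's allocation fallback list (nearest node
	 * first) instead of from the smallest tier number to the
	 * largest.
	 */
	static int build_demotion_tier_order(int src_nid, int *tier_order)
	{
		int src_tier = node_tier_map[src_nid];
		bool seen[MAX_MEMORY_TIERS] = {};
		int n = 0;
		int nid;

		for_each_node_in_fallback_order(nid, src_nid) {
			int tier = node_tier_map[nid];

			/* Only lower tiers are demotion candidates. */
			if (tier <= src_tier || seen[tier])
				continue;

			seen[tier] = true;
			tier_order[n++] = tier;
		}

		return n;	/* candidate tiers, most preferred first */
	}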