From: Wei Xu
Date: Thu, 12 May 2022 23:36:56 -0700
Subject: Re: RFC: Memory Tiering Kernel Interfaces (v2)
To: "ying.huang@intel.com"
Cc: Andrew Morton, Greg Thelen, "Aneesh Kumar K.V", Yang Shi,
    Linux Kernel Mailing List, Jagdish Gediya, Michal Hocko,
    Tim C Chen, Dave Hansen, Alistair Popple, Baolin Wang, Feng Tang,
    Jonathan Cameron, Davidlohr Bueso, Dan Williams, David Rientjes,
    Linux MM, Brice Goglin, Hesham Almatary
In-Reply-To: <69f2d063a15f8c4afb4688af7b7890f32af55391.camel@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 12, 2022 at 8:25 PM ying.huang@intel.com wrote:
>
> On Wed, 2022-05-11 at 23:22 -0700, Wei Xu wrote:
> >
> > Memory Allocation for Demotion
> > ==============================
> >
> > To allocate a new page as the demotion target for a page, the kernel
> > calls the allocation function (__alloc_pages_nodemask) with the
> > source page node as the preferred node and the union of all lower
> > tier nodes as the allowed nodemask.  The actual target node selection
> > then follows the allocation fallback order that the kernel has
> > already defined.
> >
> > The pseudo code looks like:
> >
> >     targets = NODE_MASK_NONE;
> >     src_nid = page_to_nid(page);
> >     src_tier = node_tier_map[src_nid];
> >     for (i = src_tier + 1; i < MAX_MEMORY_TIERS; i++)
> >             nodes_or(targets, targets, memory_tiers[i]);
> >     new_page = __alloc_pages_nodemask(gfp, order, src_nid, targets);
> >
> > The mempolicy of cpuset, vma and owner task of the source page can
> > be set to refine the demotion target nodemask, e.g. to prevent
> > demotion or select a particular allowed node as the demotion target.
>
> Consider a system with 3 tiers. If we want to demote some pages from
> tier 0, the desired behavior is:
>
> - Allocate pages from tier 1
> - If there are not enough free pages in tier 1, wake up kswapd of
>   tier 1 to demote some pages from tier 1 to tier 2
> - If there are still not enough free pages in tier 1, allocate pages
>   from tier 2.
>
> In this way, tier 0 will have the hottest pages, while tier 1 will have
> the coldest pages.

When we are already in the allocation path for the demotion of a page
from tier 0, I think we'd better not block this allocation to wait for
kswapd to demote pages from tier 1 to tier 2. Instead, we should
directly allocate from tier 2.
Meanwhile, this demotion can wake up kswapd to demote from tier 1 to
tier 2 in the background.

> With your proposed method, the demoting from tier 0 behavior is:
>
> - Allocate pages from tier 1
> - If there are not enough free pages in tier 1, allocate pages in
>   tier 2
>
> The kswapd of tier 1 will not be woken up until there are not enough
> free pages in tier 2.  For quite a long time, there is not much
> hot/cold differentiation between tier 1 and tier 2.

This is true with the current allocation code. But I think we can make
some changes for demotion allocations. For example, we can add a
GFP_DEMOTE flag and update the allocation function to wake up kswapd
when this flag is set and we need to fall back to another node.

> This isn't hard to fix: just call __alloc_pages_nodemask() for each
> tier one by one, considering the page allocation fallback order.

That would have worked, except that there is an example earlier, in
which it is actually preferred for some nodes to demote to their
tier + 2, not tier + 1.

More specifically, the example is:

                  20
   Node 0 (DRAM) -- Node 1 (DRAM)
    |   |            |    |
    |   | 30     120 |    |
    |   v            v    | 100
100 |  Node 2 (PMEM)      |
    |    |                |
    |    | 100            |
     \   v                v
       -> Node 3 (Large Mem)

Node distances:
node    0    1    2    3
   0   10   20   30  100
   1   20   10  120  100
   2   30  120   10  100
   3  100  100  100   10

3 memory tiers are defined:
tier 0: 0-1
tier 1: 2
tier 2: 3

The demotion fallback order is:
node 0: 2, 3
node 1: 3, 2
node 2: 3
node 3: empty

Note that even though node 3 is in tier 2 and node 2 is in tier 1,
node 1 (tier 0) still prefers node 3 as its first demotion target, not
node 2.
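
To make the example concrete, here is a minimal userspace sketch (not
kernel code) that reproduces the fallback order above. It assumes the
order is obtained by taking all nodes in lower tiers than the source
node and sorting them by distance from the source, which is a
simplification of the kernel's allocation fallback order. The names
node_distance, node_tier and demotion_order() are hypothetical and
exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4

/* Node distances copied from the example in this mail. */
static const int node_distance[NR_NODES][NR_NODES] = {
	{  10,  20,  30, 100 },
	{  20,  10, 120, 100 },
	{  30, 120,  10, 100 },
	{ 100, 100, 100,  10 },
};

/* Tier of each node: tier 0 = {0, 1}, tier 1 = {2}, tier 2 = {3}. */
static const int node_tier[NR_NODES] = { 0, 0, 1, 2 };

/*
 * Hypothetical helper (an assumption for illustration only): fill
 * order[] with every node in a lower tier than src, nearest first,
 * and return the number of targets found.
 */
static size_t demotion_order(int src, int order[NR_NODES])
{
	size_t n = 0;

	assert(src >= 0 && src < NR_NODES);

	for (int nid = 0; nid < NR_NODES; nid++) {
		size_t pos = n;

		/* Only nodes in a strictly lower tier are candidates. */
		if (node_tier[nid] <= node_tier[src])
			continue;

		/* Insertion sort by distance from the source node. */
		while (pos > 0 &&
		       node_distance[src][order[pos - 1]] >
		       node_distance[src][nid]) {
			order[pos] = order[pos - 1];
			pos--;
		}
		order[pos] = nid;
		n++;
	}
	return n;
}
```

Calling demotion_order(1, order) places node 3 (distance 100) ahead of
node 2 (distance 120), which is why a per-tier allocation loop that
always tries tier 1 first would not honor this order for node 1.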