Message-Id: <20181226133351.644607371@intel.com>
User-Agent: quilt/0.65
Date: Wed, 26 Dec 2018 21:14:56 +0800
From: Fengguang Wu
To: Andrew Morton
Cc: Linux Memory Management List, Fan Du, Fengguang Wu
Cc: kvm@vger.kernel.org
Cc: LKML
Cc: Yao Yuan
Cc: Peng Dong
Cc: Huang Ying
Cc: Liu Jingqi
Cc: Dong Eddie
Cc: Dave Hansen
Cc: Zhang Yi
Cc: Dan Williams
Subject: [RFC][PATCH v2 10/21] mm: build separate zonelist for PMEM and DRAM node
References: <20181226131446.330864849@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Disposition: inline; filename=0016-page-alloc-Build-separate-zonelist-for-PMEM-and-RAM-.patch

From: Fan Du

When allocating pages, DRAM and PMEM nodes should not fall back to each
other. This allows the migration code to explicitly control which type
of node to allocate pages from.

With this patch, a PMEM NUMA node can only be used in two ways:

- migrate in and out
- numactl

That guarantees a PMEM NUMA node will only hold anon pages. We do not
detect hotness for other types of pages for now, so we need to prevent
a PMEM page from going hot while we are unable to detect it and move it
to DRAM.

Another implication is that new page allocations go to DRAM nodes by
default. This is normally a good choice: since DRAM writes are cheaper
than PMEM writes, it is often beneficial to keep new pages in DRAM for
some time and only move the likely cold pages to PMEM.

However, there can be exceptions. For example, if the PMEM:DRAM ratio
is very high, some page allocations may be better served directly from
PMEM nodes. In the long term, we may create more kinds of fallback
zonelists and make them configurable by NUMA policy.

Signed-off-by: Fan Du
Signed-off-by: Fengguang Wu
---
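A note on the "numactl" usage mentioned above: the same explicit placement
can be done programmatically with libnuma. The sketch below is illustrative
only and not part of the patch; the node id 2 and the allocation size are
assumptions, since actual PMEM node ids depend on the platform (see
"numactl -H").

/* Minimal libnuma sketch: place an allocation on an assumed PMEM node. */
#include <numa.h>
#include <stdio.h>
#include <string.h>

#define PMEM_NODE	2		/* assumed PMEM node id */
#define ALLOC_SIZE	(64UL << 20)	/* 64 MB */

int main(void)
{
	char *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not supported\n");
		return 1;
	}

	/* All pages for this range come from PMEM_NODE only. */
	buf = numa_alloc_onnode(ALLOC_SIZE, PMEM_NODE);
	if (!buf) {
		perror("numa_alloc_onnode");
		return 1;
	}

	memset(buf, 0, ALLOC_SIZE);	/* fault the pages in on that node */
	numa_free(buf, ALLOC_SIZE);
	return 0;
}

Build with "gcc -o pmem-alloc pmem-alloc.c -lnuma"; the equivalent command
line would be "numactl --membind=2 <workload>", again assuming node 2 is a
PMEM node.
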
 mm/mempolicy.c  | 14 ++++++++++++++
 mm/page_alloc.c | 42 ++++++++++++++++++++++++++++++-------------
 2 files changed, 43 insertions(+), 13 deletions(-)

--- linux.orig/mm/mempolicy.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/mempolicy.c	2018-12-26 20:29:24.597884301 +0800
@@ -1745,6 +1745,20 @@ static int policy_node(gfp_t gfp, struct
 		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
 	}
 
+	if (policy->mode == MPOL_BIND) {
+		nodemask_t nodes = policy->v.nodes;
+
+		/*
+		 * The rule is if we run on DRAM node and mbind to PMEM node,
+		 * preferred node id is the peer node, vice versa.
+		 * if we run on DRAM node and mbind to DRAM node, #PF node is
+		 * the preferred node, vice versa, so just fall back.
+		 */
+		if ((is_node_dram(nd) && nodes_subset(nodes, numa_nodes_pmem)) ||
+			(is_node_pmem(nd) && nodes_subset(nodes, numa_nodes_dram)))
+			nd = NODE_DATA(nd)->peer_node;
+	}
+
 	return nd;
 }
 
--- linux.orig/mm/page_alloc.c	2018-12-26 20:03:49.821417489 +0800
+++ linux/mm/page_alloc.c	2018-12-26 20:03:49.817417321 +0800
@@ -5153,6 +5153,10 @@ static int find_next_best_node(int node,
 		if (node_isset(n, *used_node_mask))
 			continue;
 
+		/* DRAM node doesn't fallback to pmem node */
+		if (is_node_pmem(n))
+			continue;
+
 		/* Use the distance array to find the distance */
 		val = node_distance(node, n);
 
@@ -5242,19 +5246,31 @@ static void build_zonelists(pg_data_t *p
 	nodes_clear(used_mask);
 	memset(node_order, 0, sizeof(node_order));
 
-	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
-		/*
-		 * We don't want to pressure a particular node.
-		 * So adding penalty to the first node in same
-		 * distance group to make it round-robin.
-		 */
-		if (node_distance(local_node, node) !=
-		    node_distance(local_node, prev_node))
-			node_load[node] = load;
-
-		node_order[nr_nodes++] = node;
-		prev_node = node;
-		load--;
+	/* Pmem node doesn't fallback to DRAM node */
+	if (is_node_pmem(local_node)) {
+		int n;
+
+		/* Pmem nodes should fallback to each other */
+		node_order[nr_nodes++] = local_node;
+		for_each_node_state(n, N_MEMORY) {
+			if ((n != local_node) && is_node_pmem(n))
+				node_order[nr_nodes++] = n;
+		}
+	} else {
+		while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
+			/*
+			 * We don't want to pressure a particular node.
+			 * So adding penalty to the first node in same
+			 * distance group to make it round-robin.
+			 */
+			if (node_distance(local_node, node) !=
+			    node_distance(local_node, prev_node))
+				node_load[node] = load;
+
+			node_order[nr_nodes++] = node;
+			prev_node = node;
+			load--;
+		}
 	}
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
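
One more illustrative note, on the policy_node() hunk above: the MPOL_BIND
redirect can be exercised from userspace by binding an anonymous range to a
PMEM node while the task runs on a DRAM node. The sketch below is not part
of the patch and makes assumptions; node id 2 is a placeholder for a real
PMEM node id on the test machine.

/* Minimal mbind() sketch: bind an anonymous range to an assumed PMEM node. */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

#define PMEM_NODE	2		/* assumed PMEM node id */
#define LEN		(16UL << 20)	/* 16 MB */

int main(void)
{
	unsigned long nodemask = 1UL << PMEM_NODE;
	char *buf;

	buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Bind before first touch; later faults allocate from the bound node. */
	if (mbind(buf, LEN, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(buf, 0, LEN);
	return 0;
}

Build with "gcc -o pmem-bind pmem-bind.c -lnuma". With the hunk above
applied, a fault from a DRAM node against such a PMEM-bound range picks the
faulting node's peer PMEM node as the preferred node, per the rule in the
comment.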