Subject: Re: [PATCH 01/10] mm: control memory placement by nodemask for two tier main memory
To: Dan Williams
Cc: Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, "Du, Fan", "Huang, Ying", Linux MM, Linux Kernel Mailing List
References: <1553316275-21985-1-git-send-email-yang.shi@linux.alibaba.com> <1553316275-21985-2-git-send-email-yang.shi@linux.alibaba.com>
From: Yang Shi
Message-ID: <688dffbc-2adc-005d-223e-fe488be8c5fc@linux.alibaba.com>
Date: Mon, 25 Mar 2019 12:28:13 -0700
X-Mailing-List: linux-kernel@vger.kernel.org

On 3/23/19 10:21 AM, Dan Williams wrote:
> On Fri, Mar 22, 2019 at 9:45 PM Yang Shi wrote:
>> When running applications on the machine with NVDIMM as NUMA node, the
>> memory allocation may end up on NVDIMM node. This may result in silent
>> performance degradation and regression due to the difference of hardware
>> property.
>>
>> DRAM first should be obeyed to prevent from surprising regression. Any
>> non-DRAM nodes should be excluded from default allocation. Use nodemask
>> to control the memory placement. Introduce def_alloc_nodemask which has
>> DRAM nodes set only. Any non-DRAM allocation should be specified by
>> NUMA policy explicitly.
>>
>> In the future we may be able to extract the memory charasteristics from
>> HMAT or other source to build up the default allocation nodemask.
>> However, just distinguish DRAM and PMEM (non-DRAM) nodes by SRAT flag
>> for the time being.
>>
>> Signed-off-by: Yang Shi
>> ---
>>  arch/x86/mm/numa.c     |  1 +
>>  drivers/acpi/numa.c    |  8 ++++++++
>>  include/linux/mmzone.h |  3 +++
>>  mm/page_alloc.c        | 18 ++++++++++++++++--
>>  4 files changed, 28 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
>> index dfb6c4d..d9e0ca4 100644
>> --- a/arch/x86/mm/numa.c
>> +++ b/arch/x86/mm/numa.c
>> @@ -626,6 +626,7 @@ static int __init numa_init(int (*init_func)(void))
>>         nodes_clear(numa_nodes_parsed);
>>         nodes_clear(node_possible_map);
>>         nodes_clear(node_online_map);
>> +       nodes_clear(def_alloc_nodemask);
>>         memset(&numa_meminfo, 0, sizeof(numa_meminfo));
>>         WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
>>                                   MAX_NUMNODES));
>> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
>> index 867f6e3..79dfedf 100644
>> --- a/drivers/acpi/numa.c
>> +++ b/drivers/acpi/numa.c
>> @@ -296,6 +296,14 @@ void __init acpi_numa_slit_init(struct acpi_table_slit *slit)
>>                 goto out_err_bad_srat;
>>         }
>>
>> +       /*
>> +        * Non volatile memory is excluded from zonelist by default.
>> +        * Only regular DRAM nodes are set in default allocation node
>> +        * mask.
>> +        */
>> +       if (!(ma->flags & ACPI_SRAT_MEM_NON_VOLATILE))
>> +               node_set(node, def_alloc_nodemask);
> Hmm, no, I don't think we should do this. Especially considering
> current generation NVDIMMs are energy backed DRAM there is no
> performance difference that should be assumed by the non-volatile
> flag.

Actually, what I would like to do here is initialize a node mask for
default allocation. Memory allocation should not end up on any node
excluded by this node mask unless that node is specified explicitly by
mempolicy.

There are several ways or criteria we could use to initialize the node
mask. For example, we could read it from HMAT (once HMAT is ready in
the future), and we definitely could have non-DRAM nodes set if they
have no performance difference (I suppose you mean NVDIMM-F or HBM).
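
To illustrate the intent, here is a minimal userspace sketch of the
idea, not the patch itself. Every sketch_* name is hypothetical and is
not a kernel API; the non_volatile field merely stands in for whatever
criterion (the SRAT flag today, perhaps HMAT later) decides whether a
node belongs in the default mask, and a zero policy mask is assumed to
mean "no explicit mempolicy".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define SKETCH_MAX_NODES 64

typedef uint64_t sketch_nodemask_t;	/* bit n set => node n is allowed */

struct sketch_node {
	int id;
	bool non_volatile;	/* stand-in for the SRAT non-volatile flag */
};

/* Build the default allocation mask from nodes not flagged non-volatile. */
sketch_nodemask_t sketch_build_default_mask(const struct sketch_node *nodes,
					    int count)
{
	sketch_nodemask_t mask = 0;

	for (int i = 0; i < count; i++) {
		assert(nodes[i].id >= 0 && nodes[i].id < SKETCH_MAX_NODES);
		if (!nodes[i].non_volatile)
			mask |= (sketch_nodemask_t)1 << nodes[i].id;
	}
	return mask;
}

/*
 * An explicit policy mask (non-zero) overrides the default mask;
 * otherwise allocation is confined to the default mask.
 */
sketch_nodemask_t sketch_effective_mask(sketch_nodemask_t def_mask,
					sketch_nodemask_t policy_mask)
{
	return policy_mask ? policy_mask : def_mask;
}

/* Check whether a node may be used under the given mask. */
bool sketch_node_allowed(sketch_nodemask_t mask, int node)
{
	assert(node >= 0 && node < SKETCH_MAX_NODES);
	return (mask >> node) & 1;
}
```

With nodes 0 and 1 as DRAM and node 2 flagged non-volatile, the default
mask in this sketch contains only nodes 0 and 1, so node 2 is reachable
only when a policy mask names it explicitly.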

As long as main memory has different tiers distinguished by
performance, IMHO there should be a defined default allocation node
mask to control memory placement, no matter where we get the
information from.

But we don't have such information ready for this use yet, so the SRAT
flag might be a choice for now.

>
> Why isn't default SLIT distance sufficient for ensuring a DRAM-first
> default policy?

"DRAM-first" may sound ambiguous; what I actually mean is "DRAM only by
default". SLIT can only tell us which node is local and which is
remote; it can't tell us the performance difference.

Thanks,
Yang