Subject: Re: [RFC PATCH 0/10] Another Approach to Use PMEM as NUMA Node
From: Dan Williams
Date: Mon, 25 Mar 2019 09:56:44 -0700
To: Brice Goglin
Cc: Yang Shi, Michal Hocko, Mel Gorman, Rik van Riel, Johannes Weiner, Andrew Morton, Dave Hansen, Keith Busch, Fengguang Wu, "Du, Fan", "Huang, Ying", Linux MM, Linux Kernel Mailing List

On Mon, Mar 25, 2019 at 9:15 AM Brice Goglin wrote:
>
> On 23/03/2019 at 05:44, Yang Shi wrote:
> > With Dave Hansen's patches merged into Linus's tree
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c221c0b0308fd01d9fb33a16f64d2fd95f8830a4
> >
> > PMEM can now be hot-plugged as a NUMA node. But how to use PMEM as a NUMA node effectively and efficiently is still an open question.
> >
> > There have been a couple of proposals posted on the mailing list [1] [2].
> >
> > This patchset tries a different approach from proposal [1] to using PMEM as NUMA nodes.
> >
> > The approach is designed to follow the principles below:
> >
> > 1. Use PMEM as a normal NUMA node: no special gfp flag, zone, zonelist, etc.
> >
> > 2. DRAM first/by default. No surprises for existing applications and default runs. PMEM will not be allocated unless its node is specified explicitly by NUMA policy. Some applications may not be very sensitive to memory latency, so they could be placed on PMEM nodes and have their hot pages promoted to DRAM gradually.
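
For illustration, explicit placement under that policy might look roughly like the sketch below from userspace, using libnuma; the PMEM node number (1) is purely an assumption for the example and would normally be discovered at runtime rather than hard-coded:

/* Minimal sketch: bind one allocation to a PMEM-backed NUMA node via
 * libnuma, while everything else keeps the default (local DRAM-first)
 * policy. Assumes the PMEM range was onlined as node 1.
 * Build with: gcc pmem_alloc.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	const int pmem_node = 1;	/* assumed node number, not a stable ID */
	const size_t sz = 1UL << 20;	/* 1 MiB */
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return EXIT_FAILURE;
	}

	/* Request pages from the PMEM node; the rest of the process is
	 * unaffected and still allocates DRAM by default. */
	buf = numa_alloc_onnode(sz, pmem_node);
	if (!buf) {
		perror("numa_alloc_onnode");
		return EXIT_FAILURE;
	}

	memset(buf, 0, sz);		/* fault the pages in */
	numa_free(buf, sz);
	return EXIT_SUCCESS;
}

For unmodified programs, running under numactl --membind=<pmem-node> would have a similar effect without any code changes.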
>
> I am not against the approach for some workloads. However, many HPC people would rather do this manually. But there's currently no easy way to find out from userspace whether a given NUMA node is DDR or PMEM*. We have to assume HMAT is available (and correct) and look at performance attributes. When talking to humans, it would be better to say "I allocated on the local DDR NUMA node" rather than "I allocated on the fastest node according to HMAT latency".
>
> Also, once we have HBM+DDR, some applications may want to use DDR by default, which means they want the *slowest* node according to HMAT (by the way, will your hybrid policy work if we ever have HBM+DDR+PMEM?). Performance attributes could help, but how does userspace know for sure that X>Y will still mean HBM>DDR and not DDR>PMEM in 5 years?
>
> It seems to me that exporting a flag in sysfs saying whether a node is PMEM could be convenient. Patch series [1] exported a "type" in sysfs node directories ("pmem" or "dram"). I don't know if there's an easy way to define what HBM is and expose that type too.

I'm generally against the concept that a "pmem" or "type" flag should indicate anything about the expected performance of the address range. The kernel should explicitly look to the HMAT for performance data and not otherwise make type-based performance assumptions.
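
For completeness, a rough sketch of what that HMAT-based lookup could look like from userspace, assuming the HMAT-derived sysfs attributes from Keith Busch's node-attributes series are exported under /sys/devices/system/node/nodeN/access0/initiators/ (the path and its availability are assumptions; systems without HMAT, or with older kernels, won't have these files):

/* Sketch only: rank NUMA nodes by HMAT-derived read latency as seen
 * from a CPU initiator, rather than relying on a "pmem"/"dram" flag.
 * Nodes without the attribute (no HMAT data, older kernel) are skipped.
 */
#include <stdio.h>

int main(void)
{
	char path[128];
	int node;

	for (node = 0; node < 8; node++) {	/* 8-node cap is arbitrary */
		unsigned long lat_ns;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/access0/initiators/read_latency",
			 node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node absent or no HMAT data */
		if (fscanf(f, "%lu", &lat_ns) == 1)
			printf("node%d: read latency %lu ns\n", node, lat_ns);
		fclose(f);
	}
	return 0;
}

An application (or a library such as hwloc) could then pick the lowest-latency or highest-bandwidth node relative to its CPUs, whatever the underlying media happens to be.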