Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp4423467imu; Mon, 12 Nov 2018 10:45:06 -0800 (PST) X-Google-Smtp-Source: AJdET5ew+B+sXBxw6nrtPsqUSC4e8tIfoiiTS3snd+5G8NK+tNtaECTKKuVF2F0lc0t2CcGZdDH0 X-Received: by 2002:a63:1d10:: with SMTP id d16-v6mr1742493pgd.228.1542048306453; Mon, 12 Nov 2018 10:45:06 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1542048306; cv=none; d=google.com; s=arc-20160816; b=eYG/ybfRRtL8TfflrlG66OWxmKBnxZyvHrQeMzQA9uYEGV2DU4EmUzI8iLnZvIkYen XEtH7PFATu3T7Aggt/5385azRnLiguVKkA6QvTiithTu7YVmThmmNh5L0yrAJt+VEdJE OHDMMvv164WEMnNMHV7IDFW/8WB5TvDIn3Zrz62/vdSQ/bVjihNhD2FDkSNyKlv6iFM5 U7Cniw02OCPY1r/+pa/c84CB4SfY7iRs0Q+RmREhjNgY+xGqXD7P6l3iMhUf4/Ui4eUs w6wThfL3qDCXWk47B5xaoS+YcYvOnfluiB/WjfISPRMPY3UyAsuS6Q+CHOwE/Kf7UNBb KizA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=e1DVwQEiQVZvUa1rYsGOrhj0Kcpt+WtrHl52JQz/N4U=; b=0r94b0V/OrEJpVH/sctZEgOszH5EV+O1MDXaNli6iQYjrLHT4y6Z95cgFlnr6npmDc KwZIeNlJirmUgeWt8hy8CMdoER6oM0aJm67ti2d7Xqmtpi3L9zsrFS40cAzlh/cGjL5P Poo3/sVWXY655MWZnXWDlLHcaZWppFF6hvqy6Df4UXjV2fD//6qgYRaL7iPQi2yRN4ma FhxC406tjHq58bnkBxiBHbVZ2loUdQwJJGchY6x5EIocww8ekurbN6FEO2Vu0r3/MM/G 5gj/eLut6q2CTU3k19LpLkTvwkVc2/JfiPygnFxq8+9PXaz6KT4gGGZ6xDK2DcjNgcd5 tExg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=othKJmlK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id m8-v6si16603312pgk.424.2018.11.12.10.44.50; Mon, 12 Nov 2018 10:45:06 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=othKJmlK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730106AbeKMEi5 (ORCPT + 99 others); Mon, 12 Nov 2018 23:38:57 -0500 Received: from mail-qk1-f193.google.com ([209.85.222.193]:45427 "EHLO mail-qk1-f193.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727530AbeKMEi4 (ORCPT ); Mon, 12 Nov 2018 23:38:56 -0500 Received: by mail-qk1-f193.google.com with SMTP id d135so15110935qkc.12; Mon, 12 Nov 2018 10:44:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc:content-transfer-encoding; bh=e1DVwQEiQVZvUa1rYsGOrhj0Kcpt+WtrHl52JQz/N4U=; b=othKJmlKte43454ldG7dz/VLqcnA1/AGGUwhQSqemhr6e29JIMOGay3XNBm2nCzqZf 7oKhyO99DbPZ+wQoUD93gwoQwiWqAigpsn0OR3G+cRof2eC81F10sFZyrZJpmKAqslOU QfDT2T+eFXdkdoPzQREAS5NmZDIEtcVcRYcals2YhfDeUqCC5VEHvUJGEE46j2L7deW+ 9CS9Av90d4J9ZjzKoBPNU49SxlP18bfMJHEWCbI32gK+3gYT6YDE/ubaAICGAm+UN7yJ GubhTQhVnsFdB8xn9KS5iwajgYLeeh9XRZQjuHZUX6ECbFAn3L8xZrbcd5/5/DTrKRWJ mssg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=e1DVwQEiQVZvUa1rYsGOrhj0Kcpt+WtrHl52JQz/N4U=; b=g4bUu+ZBKDPkLoSQ1LhRPfMs4qtkT4zqElz4ZsJ0VwQcIjq3RzuPE8W3fIffGFfPDL lAFkfOtIOdkU2ES1eS6onl8kSFhhDIbjBKaRHZ1jlKDK1H5j6m2faO+iaFTQiPllIpZo FTYQKLNqI9X6pQVEQF4rnzUkb/n1nDrd+BdnE6DS/yfT3lCEsG4y0NGswh7gETC2CdUz HteF9IHFKw5+s7xvFY/fqp5hw/zet07cJ74z/oAC8VFVeqbyOL3LYylelbV0vFKaWD+t m9TE+nYaqDD8TVmyNd2d/TDhi0htLIBhgkfEz3dWdGfgaYSbyRMNq/somwmyhmlAXBGu Uefg== X-Gm-Message-State: AGRZ1gLRTbKxAQL5zCmoJkV3sNjk9fZFgk6spyb4QfzPy3P37mmrg0h2 CDqZfxiQHfGaHIMpF3u28SjrWNq0s8ycCgPYJSI= X-Received: by 2002:a37:77c5:: with SMTP id s188mr1955614qkc.174.1542048267694; Mon, 12 Nov 2018 10:44:27 -0800 (PST) MIME-Version: 1.0 References: <154180517736.2047781.13819458479548681732.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <154180517736.2047781.13819458479548681732.stgit@dwillia2-desk3.amr.corp.intel.com> From: Yang Shi Date: Mon, 12 Nov 2018 10:44:09 -0800 Message-ID: Subject: Re: [PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node To: dan.j.williams@intel.com Cc: linux-nvdimm@lists.01.org, fan.du@intel.com, mpe@ellerman.id.au, oohall@gmail.com, dave.hansen@linux.intel.com, jglisse@redhat.com, linux-acpi@vger.kernel.org, Linux MM , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Nov 9, 2018 at 3:24 PM Dan Williams wrot= e: > > Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware > Interface Table), is the first known instance of a memory range > described by a unique "target" proximity domain. Where "initiator" and > "target" proximity domains is an approach that the ACPI HMAT > (Heterogeneous Memory Attributes Table) uses to described the unique > performance properties of a memory range relative to a given initiator > (e.g. CPU or DMA device). > > Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y > char-device follows the traditional notion of 'numa-node' where the > attribute conveys the closest online numa-node. That numa-node attribute > is useful for cpu-binding and memory-binding processes *near* the > device. However, when the memory range backing a 'pmem', or 'dax' device > is onlined (memory hot-add) the memory-only-numa-node representing that > address needs to be differentiated from the set of online nodes. In > other words, the numa-node association of the device depends on whether > you can bind processes *near* the cpu-numa-node in the offline > device-case, or bind process *on* the memory-range directly after the > backing address range is onlined. > > Allow for the case that platform firmware describes persistent memory > with a unique proximity domain, i.e. when it is distinct from the > proximity of DRAM and CPUs that are on the same socket. Plumb the Linux > numa-node translation of that proximity through the libnvdimm region > device to namespaces that are in device-dax mode. With this in place the > proposed kmem driver [1] can optionally discover a unique numa-node > number for the address range as it transitions the memory from an > offline state managed by a device-driver to an online memory range > managed by the core-mm. > > [1]: https://lkml.org/lkml/2018/10/23/9 > > Reported-by: Fan Du Thanks for coming up with the patch quickly. We reported the problem to Fan Du, and did a very similar, but preliminary, in-house implementation. This implementation looks good to me. Reviewed-by: Yang Shi > Cc: Michael Ellerman > Cc: "Oliver O'Halloran" > Cc: Dave Hansen > Cc: J=C3=A9r=C3=B4me Glisse > Signed-off-by: Dan Williams > --- > arch/powerpc/platforms/pseries/papr_scm.c | 1 + > drivers/acpi/nfit/core.c | 8 ++++++-- > drivers/acpi/numa.c | 1 + > drivers/dax/bus.c | 4 +++- > drivers/dax/bus.h | 3 ++- > drivers/dax/dax-private.h | 4 ++++ > drivers/dax/pmem/core.c | 4 +++- > drivers/nvdimm/e820.c | 1 + > drivers/nvdimm/nd.h | 2 +- > drivers/nvdimm/of_pmem.c | 1 + > drivers/nvdimm/region_devs.c | 1 + > include/linux/libnvdimm.h | 1 + > 12 files changed, 25 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/pla= tforms/pseries/papr_scm.c > index ee9372b65ca5..6a0a35b872d1 100644 > --- a/arch/powerpc/platforms/pseries/papr_scm.c > +++ b/arch/powerpc/platforms/pseries/papr_scm.c > @@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv = *p) > memset(&ndr_desc, 0, sizeof(ndr_desc)); > ndr_desc.attr_groups =3D region_attr_groups; > ndr_desc.numa_node =3D dev_to_node(&p->pdev->dev); > + ndr_desc.target_node =3D ndr_desc.numa_node; > ndr_desc.res =3D &p->res; > ndr_desc.of_node =3D p->dn; > ndr_desc.provider_data =3D p; > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index f8c638f3c946..2225e3de33ac 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct acpi_= nfit_desc *acpi_desc, > ndr_desc->res =3D &res; > ndr_desc->provider_data =3D nfit_spa; > ndr_desc->attr_groups =3D acpi_nfit_region_attribute_groups; > - if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) > + if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) { > ndr_desc->numa_node =3D acpi_map_pxm_to_online_node( > spa->proximity_domain); > - else > + ndr_desc->target_node =3D acpi_map_pxm_to_node( > + spa->proximity_domain); > + } else { > ndr_desc->numa_node =3D NUMA_NO_NODE; > + ndr_desc->target_node =3D NUMA_NO_NODE; > + } > > /* > * Persistence domain bits are hierarchical, if > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c > index 274699463b4f..b9d86babb13a 100644 > --- a/drivers/acpi/numa.c > +++ b/drivers/acpi/numa.c > @@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm) > > return node; > } > +EXPORT_SYMBOL(acpi_map_pxm_to_node); > > /** > * acpi_map_pxm_to_online_node - Map proximity ID to online node > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c > index 568168500217..c620ad52d7e5 100644 > --- a/drivers/dax/bus.c > +++ b/drivers/dax/bus.c > @@ -214,7 +214,7 @@ static void dax_region_unregister(void *region) > } > > struct dax_region *alloc_dax_region(struct device *parent, int region_id= , > - struct resource *res, unsigned int align, > + struct resource *res, int target_node, unsigned int align= , > unsigned long pfn_flags) > { > struct dax_region *dax_region; > @@ -244,6 +244,7 @@ struct dax_region *alloc_dax_region(struct device *pa= rent, int region_id, > dax_region->id =3D region_id; > dax_region->align =3D align; > dax_region->dev =3D parent; > + dax_region->target_node =3D target_node; > if (sysfs_create_groups(&parent->kobj, dax_region_attribute_group= s)) { > kfree(dax_region); > return NULL; > @@ -348,6 +349,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_regi= on *dax_region, int id, > > dev_dax->dax_dev =3D dax_dev; > dev_dax->region =3D dax_region; > + dev_dax->target_node =3D dax_region->target_node; > kref_get(&dax_region->kref); > > inode =3D dax_inode(dax_dev); > diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h > index ce977552ffb5..8619e3299943 100644 > --- a/drivers/dax/bus.h > +++ b/drivers/dax/bus.h > @@ -10,7 +10,8 @@ struct dax_device; > struct dax_region; > void dax_region_put(struct dax_region *dax_region); > struct dax_region *alloc_dax_region(struct device *parent, int region_id= , > - struct resource *res, unsigned int align, unsigned long f= lags); > + struct resource *res, int target_node, unsigned int align= , > + unsigned long flags); > > enum dev_dax_subsys { > DEV_DAX_BUS, > diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h > index a82ce48f5884..a45612148ca0 100644 > --- a/drivers/dax/dax-private.h > +++ b/drivers/dax/dax-private.h > @@ -26,6 +26,7 @@ void dax_bus_exit(void); > /** > * struct dax_region - mapping infrastructure for dax devices > * @id: kernel-wide unique region for a memory range > + * @target_node: effective numa node if this memory range is onlined > * @kref: to pin while other agents have a need to do lookups > * @dev: parent device backing this region > * @align: allocation and mapping alignment for child dax devices > @@ -34,6 +35,7 @@ void dax_bus_exit(void); > */ > struct dax_region { > int id; > + int target_node; > struct kref kref; > struct device *dev; > unsigned int align; > @@ -46,6 +48,7 @@ struct dax_region { > * data while the device is activated in the driver. > * @region - parent region > * @dax_dev - core dax functionality > + * @target_node: effective numa node if dev_dax memory range is onlined > * @dev - device core > * @pgmap - pgmap for memmap setup / lifetime (driver owned) > * @ref: pgmap reference count (driver owned) > @@ -54,6 +57,7 @@ struct dax_region { > struct dev_dax { > struct dax_region *region; > struct dax_device *dax_dev; > + int target_node; > struct device dev; > struct dev_pagemap pgmap; > struct percpu_ref ref; > diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c > index bdcff1b14e95..f71019ce0647 100644 > --- a/drivers/dax/pmem/core.c > +++ b/drivers/dax/pmem/core.c > @@ -20,6 +20,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, en= um dev_dax_subsys subsys) > struct nd_namespace_common *ndns; > struct nd_dax *nd_dax =3D to_nd_dax(dev); > struct nd_pfn *nd_pfn =3D &nd_dax->nd_pfn; > + struct nd_region *nd_region =3D to_nd_region(dev->parent); > > ndns =3D nvdimm_namespace_common_probe(dev); > if (IS_ERR(ndns)) > @@ -52,7 +53,8 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, en= um dev_dax_subsys subsys) > memcpy(&res, &pgmap.res, sizeof(res)); > res.start +=3D offset; > dax_region =3D alloc_dax_region(dev, region_id, &res, > - le32_to_cpu(pfn_sb->align), PFN_DEV|PFN_MAP); > + nd_region->target_node, le32_to_cpu(pfn_sb->align= ), > + PFN_DEV|PFN_MAP); > if (!dax_region) > return ERR_PTR(-ENOMEM); > > diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c > index 521eaf53a52a..36be9b619187 100644 > --- a/drivers/nvdimm/e820.c > +++ b/drivers/nvdimm/e820.c > @@ -47,6 +47,7 @@ static int e820_register_one(struct resource *res, void= *data) > ndr_desc.res =3D res; > ndr_desc.attr_groups =3D e820_pmem_region_attribute_groups; > ndr_desc.numa_node =3D e820_range_to_nid(res->start); > + ndr_desc.target_node =3D ndr_desc.numa_node; > set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); > if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) > return -ENXIO; > diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h > index e79cc8e5c114..623216545cc8 100644 > --- a/drivers/nvdimm/nd.h > +++ b/drivers/nvdimm/nd.h > @@ -153,7 +153,7 @@ struct nd_region { > u16 ndr_mappings; > u64 ndr_size; > u64 ndr_start; > - int id, num_lanes, ro, numa_node; > + int id, num_lanes, ro, numa_node, target_node; > void *provider_data; > struct kernfs_node *bb_state; > struct badblocks bb; > diff --git a/drivers/nvdimm/of_pmem.c b/drivers/nvdimm/of_pmem.c > index 0a701837dfc0..ecaaa27438e2 100644 > --- a/drivers/nvdimm/of_pmem.c > +++ b/drivers/nvdimm/of_pmem.c > @@ -68,6 +68,7 @@ static int of_pmem_region_probe(struct platform_device = *pdev) > memset(&ndr_desc, 0, sizeof(ndr_desc)); > ndr_desc.attr_groups =3D region_attr_groups; > ndr_desc.numa_node =3D dev_to_node(&pdev->dev); > + ndr_desc.target_node =3D ndr_desc.numa_node; > ndr_desc.res =3D &pdev->resource[i]; > ndr_desc.of_node =3D np; > set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); > diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c > index 174a418cb171..86cd425b786d 100644 > --- a/drivers/nvdimm/region_devs.c > +++ b/drivers/nvdimm/region_devs.c > @@ -1060,6 +1060,7 @@ static struct nd_region *nd_region_create(struct nv= dimm_bus *nvdimm_bus, > nd_region->flags =3D ndr_desc->flags; > nd_region->ro =3D ro; > nd_region->numa_node =3D ndr_desc->numa_node; > + nd_region->target_node =3D ndr_desc->target_node; > ida_init(&nd_region->ns_ida); > ida_init(&nd_region->btt_ida); > ida_init(&nd_region->pfn_ida); > diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h > index 097072c5a852..941102c0c81f 100644 > --- a/include/linux/libnvdimm.h > +++ b/include/linux/libnvdimm.h > @@ -124,6 +124,7 @@ struct nd_region_desc { > void *provider_data; > int num_lanes; > int numa_node; > + int target_node; > unsigned long flags; > struct device_node *of_node; > }; >