2015-06-03 00:46:37

by Toshi Kani

[permalink] [raw]
Subject: [PATCH 0/3] Add NUMA support for NVDIMM devices

Since NVDIMMs are installed on memory slots, they expose the NUMA
topology of a platform. This patchset adds support of sysfs
'numa_node' to the NVDIMM devices under /sys/bus/nd/devices.
With this change, numactl(8) accepts 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem0s --show

Here is a result of fio benchmark to ext4/dax on pmem0 using numactl
on an HP DL380 with 2 sockets; pmem0 on node 0, and pmem1 on node 1.

Local [1] : 4098.3MB/s
Remote [2]: 3718.4MB/s

[1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
[2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>

Patch 1/3-2/3 apply on top of the acpica branch of the pm tree.
Patch 3/3 applies on top of Dan Williams's v5 patch series of
"libnvdimm: non-volatile memory devices".

---
Toshi Kani (3):
1/3 acpi: Fix acpi_map_pxm_to_node() to handle numa_off
2/3 acpi: Add export to acpi_map_pxm_to_node()
3/3 libnvdimm: Add sysfs numa_node to NVDIMM devices

---
drivers/acpi/nfit.c | 2 ++
drivers/acpi/numa.c | 13 ++++++++++---
drivers/nvdimm/btt.c | 2 ++
drivers/nvdimm/bus.c | 12 ++++++++++++
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/region.c | 1 +
drivers/nvdimm/region_devs.c | 1 +
include/linux/acpi.h | 5 +++++
include/linux/libnvdimm.h | 1 +
9 files changed, 35 insertions(+), 3 deletions(-)


2015-06-03 00:47:09

by Toshi Kani

[permalink] [raw]
Subject: [PATCH 1/3] acpi: Fix acpi_map_pxm_to_node() to handle numa_off

When numa_off is set, NUMA is turned off and node 0 is the only
valid node on the system. The kernel skips parsing ACPI SRAT
table in this case.

Change acpi_map_pxm_to_node() to always return 0 when numa_off
is set. Also move the range check of a proximity ID from
acpi_get_node() to acpi_map_pxm_to_node() after the numa_off
check. This keeps the interfaces to return 0 regardless of
proximity ID values.

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/numa.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 1333cbdc..4898082 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -70,7 +70,15 @@ static void __acpi_map_pxm_to_node(int pxm, int node)

int acpi_map_pxm_to_node(int pxm)
{
- int node = pxm_to_node_map[pxm];
+ int node;
+
+ if (numa_off)
+ return 0;
+
+ if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
+ return NUMA_NO_NODE;
+
+ node = pxm_to_node_map[pxm];

if (node == NUMA_NO_NODE) {
if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
@@ -328,8 +336,6 @@ int acpi_get_node(acpi_handle handle)
int pxm;

pxm = acpi_get_pxm(handle);
- if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
- return NUMA_NO_NODE;

return acpi_map_pxm_to_node(pxm);
}

2015-06-03 00:46:49

by Toshi Kani

[permalink] [raw]
Subject: [PATCH 2/3] acpi: Add export to acpi_map_pxm_to_node()

acpi_get_node() allows ACPI kernel modules to map a proximity ID
to a node ID from ACPI _PXM method. Some ACPI static tables,
such as SRAT, DMAR, PMTT and NFIT, also contain proximity IDs.

This patch exports acpi_map_pxm_to_node() so that ACPI kernel
modules can call this function to map a proximity ID to a node ID
from an ACPI static table as well.

When CONFIG_ACPI_NUMA is not set, acpi_map_pxm_to_node() always
returns 0.

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/numa.c | 1 +
include/linux/acpi.h | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 4898082..8f50d5d 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -90,6 +90,7 @@ int acpi_map_pxm_to_node(int pxm)

return node;
}
+EXPORT_SYMBOL(acpi_map_pxm_to_node);

static void __init
acpi_table_print_srat_entry(struct acpi_subtable_header *header)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index e4da5e3..7a76e51 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -289,8 +289,13 @@ extern void acpi_dmi_osi_linux(int enable, const struct dmi_system_id *d);
extern void acpi_osi_setup(char *str);

#ifdef CONFIG_ACPI_NUMA
+int acpi_map_pxm_to_node(int pxm);
int acpi_get_node(acpi_handle handle);
#else
+static inline int acpi_map_pxm_to_node(int pxm)
+{
+ return 0;
+}
static inline int acpi_get_node(acpi_handle handle)
{
return 0;

2015-06-03 00:47:03

by Toshi Kani

[permalink] [raw]
Subject: [PATCH 3/3] libnvdimm: Add sysfs numa_node to NVDIMM devices

Since NVDIMMs are installed on memory slots, they expose the NUMA
topology of a platform.

This patch adds support of sysfs 'numa_node' to the NVDIMM devices
under /sys/bus/nd/devices, such as regionN, namespaceN.0, and bttN.
When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).
nmemN/numa_node always returns -1 for now since this device is for
dimm-ioctl message interface and has little use of NUMA. It can be
enhanced later to set a valid value if necessary.

Here is an example of numa_node values on a 2-socket system with
a single NVDIMM range on each socket.
/sys/bus/nd/devices
|-- btt0/numa_node:-1
|-- btt1/numa_node:0
|-- namespace0.0/numa_node:0
|-- namespace1.0/numa_node:1
|-- nmem0/numa_node:-1
|-- nmem1/numa_node:-1
|-- region0/numa_node:0
|-- region1/numa_node:1

With this change, numactl(8) accepts 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem0s --show

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/nfit.c | 2 ++
drivers/nvdimm/btt.c | 2 ++
drivers/nvdimm/bus.c | 12 ++++++++++++
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/region.c | 1 +
drivers/nvdimm/region_devs.c | 1 +
include/linux/libnvdimm.h | 1 +
7 files changed, 20 insertions(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 5731e4a..a255f3a 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1255,6 +1255,8 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
+ ndr_desc->numa_node = acpi_map_pxm_to_node(spa->proximity_domain);
+
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
struct nd_mapping *nd_mapping;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 2d7ce9e..3b3e115 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1369,6 +1369,8 @@ static int nd_btt_probe(struct device *dev)
rc = -ENOMEM;
goto err_btt;
}
+
+ set_dev_node(dev, nd_region->numa_node);
dev_set_drvdata(dev, btt);

return 0;
diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index d8a1794..5c34e68 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -339,9 +339,21 @@ static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(devtype);

+#ifdef CONFIG_NUMA
+static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
+ char *buf)
+{
+ return sprintf(buf, "%d\n", dev->numa_node);
+}
+DEVICE_ATTR_RO(numa_node);
+#endif
+
static struct attribute *nd_device_attributes[] = {
&dev_attr_modalias.attr,
&dev_attr_devtype.attr,
+#ifdef CONFIG_NUMA
+ &dev_attr_numa_node.attr,
+#endif
NULL,
};

diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index c807379..fefd8f6 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -108,6 +108,7 @@ struct nd_region {
u64 ndr_size;
u64 ndr_start;
int id, num_lanes;
+ int numa_node;
void *provider_data;
struct nd_interleave_set *nd_set;
struct nd_mapping mapping[0];
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 373eab4..783220e 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -123,6 +123,7 @@ static int nd_region_probe(struct device *dev)

num_ns->active = rc;
num_ns->count = rc + err;
+ set_dev_node(dev, nd_region->numa_node);
dev_set_drvdata(dev, num_ns);

if (err == 0)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 86adbd8..352bc80 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -627,6 +627,7 @@ static noinline struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus
nd_region->provider_data = ndr_desc->provider_data;
nd_region->nd_set = ndr_desc->nd_set;
nd_region->num_lanes = ndr_desc->num_lanes;
+ nd_region->numa_node = ndr_desc->numa_node;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 96b9507..5d0c75a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -78,6 +78,7 @@ struct nd_region_desc {
struct nd_interleave_set *nd_set;
void *provider_data;
int num_lanes;
+ int numa_node;
};

struct nvdimm_bus;

2015-06-03 01:01:09

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 3/3] libnvdimm: Add sysfs numa_node to NVDIMM devices

On Tue, Jun 2, 2015 at 5:26 PM, Toshi Kani <[email protected]> wrote:
> Since NVDIMMs are installed on memory slots, they expose the NUMA
> topology of a platform.
>
> This patch adds support of sysfs 'numa_node' to the NVDIMM devices
> under /sys/bus/nd/devices, such as regionN, namespaceN.0, and bttN.
> When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).
> nmemN/numa_node always returns -1 for now since this device is for
> dimm-ioctl message interface and has little use of NUMA. It can be
> enhanced later to set a valid value if necessary.
>
> Here is an example of numa_node values on a 2-socket system with
> a single NVDIMM range on each socket.
> /sys/bus/nd/devices
> |-- btt0/numa_node:-1
> |-- btt1/numa_node:0
> |-- namespace0.0/numa_node:0
> |-- namespace1.0/numa_node:1
> |-- nmem0/numa_node:-1
> |-- nmem1/numa_node:-1
> |-- region0/numa_node:0
> |-- region1/numa_node:1
>
> With this change, numactl(8) accepts 'block:' and 'file:' paths of
> pmem and btt devices as shown in the examples below.
> numactl --preferred block:pmem0 --show
> numactl --preferred file:/dev/pmem0s --show
>
> Signed-off-by: Toshi Kani <[email protected]>
> ---
> drivers/acpi/nfit.c | 2 ++
> drivers/nvdimm/btt.c | 2 ++
> drivers/nvdimm/bus.c | 12 ++++++++++++
> drivers/nvdimm/nd.h | 1 +
> drivers/nvdimm/region.c | 1 +
> drivers/nvdimm/region_devs.c | 1 +
> include/linux/libnvdimm.h | 1 +
> 7 files changed, 20 insertions(+)
>
> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
> index 5731e4a..a255f3a 100644
> --- a/drivers/acpi/nfit.c
> +++ b/drivers/acpi/nfit.c
> @@ -1255,6 +1255,8 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
> ndr_desc->res = &res;
> ndr_desc->provider_data = nfit_spa;
> ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
> + ndr_desc->numa_node = acpi_map_pxm_to_node(spa->proximity_domain);
> +
> list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
> struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
> struct nd_mapping *nd_mapping;
> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
> index 2d7ce9e..3b3e115 100644
> --- a/drivers/nvdimm/btt.c
> +++ b/drivers/nvdimm/btt.c
> @@ -1369,6 +1369,8 @@ static int nd_btt_probe(struct device *dev)
> rc = -ENOMEM;
> goto err_btt;
> }
> +
> + set_dev_node(dev, nd_region->numa_node);
> dev_set_drvdata(dev, btt);
>
> return 0;
> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
> index d8a1794..5c34e68 100644
> --- a/drivers/nvdimm/bus.c
> +++ b/drivers/nvdimm/bus.c
> @@ -339,9 +339,21 @@ static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
> }
> static DEVICE_ATTR_RO(devtype);
>
> +#ifdef CONFIG_NUMA
> +static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
> + char *buf)
> +{
> + return sprintf(buf, "%d\n", dev->numa_node);
> +}
> +DEVICE_ATTR_RO(numa_node);
> +#endif
> +
> static struct attribute *nd_device_attributes[] = {
> &dev_attr_modalias.attr,
> &dev_attr_devtype.attr,
> +#ifdef CONFIG_NUMA
> + &dev_attr_numa_node.attr,
> +#endif
> NULL,
> };

I'd prefer you define is_visible() in the nd_device_attribute_group
and gate showing this attribute on IS_ENABLED(CONFIG_NUMA) rather than
including these ifdef guards. The ifdef guards aren't necessary in
the CONFIG_NUMA=n case.

2015-06-03 02:51:16

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH 3/3] libnvdimm: Add sysfs numa_node to NVDIMM devices

On Tue, Jun 2, 2015 at 6:01 PM, Dan Williams <[email protected]> wrote:
> On Tue, Jun 2, 2015 at 5:26 PM, Toshi Kani <[email protected]> wrote:
>> Since NVDIMMs are installed on memory slots, they expose the NUMA
>> topology of a platform.
>>
>> This patch adds support of sysfs 'numa_node' to the NVDIMM devices
>> under /sys/bus/nd/devices, such as regionN, namespaceN.0, and bttN.
>> When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).
>> nmemN/numa_node always returns -1 for now since this device is for
>> dimm-ioctl message interface and has little use of NUMA. It can be
>> enhanced later to set a valid value if necessary.
>>
>> Here is an example of numa_node values on a 2-socket system with
>> a single NVDIMM range on each socket.
>> /sys/bus/nd/devices
>> |-- btt0/numa_node:-1
>> |-- btt1/numa_node:0
>> |-- namespace0.0/numa_node:0
>> |-- namespace1.0/numa_node:1
>> |-- nmem0/numa_node:-1
>> |-- nmem1/numa_node:-1
>> |-- region0/numa_node:0
>> |-- region1/numa_node:1
>>
>> With this change, numactl(8) accepts 'block:' and 'file:' paths of
>> pmem and btt devices as shown in the examples below.
>> numactl --preferred block:pmem0 --show
>> numactl --preferred file:/dev/pmem0s --show
>>
>> Signed-off-by: Toshi Kani <[email protected]>
>> ---
>> drivers/acpi/nfit.c | 2 ++
>> drivers/nvdimm/btt.c | 2 ++
>> drivers/nvdimm/bus.c | 12 ++++++++++++
>> drivers/nvdimm/nd.h | 1 +
>> drivers/nvdimm/region.c | 1 +
>> drivers/nvdimm/region_devs.c | 1 +
>> include/linux/libnvdimm.h | 1 +
>> 7 files changed, 20 insertions(+)
>>
>> diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
>> index 5731e4a..a255f3a 100644
>> --- a/drivers/acpi/nfit.c
>> +++ b/drivers/acpi/nfit.c
>> @@ -1255,6 +1255,8 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
>> ndr_desc->res = &res;
>> ndr_desc->provider_data = nfit_spa;
>> ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
>> + ndr_desc->numa_node = acpi_map_pxm_to_node(spa->proximity_domain);
>> +
>> list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
>> struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
>> struct nd_mapping *nd_mapping;
>> diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
>> index 2d7ce9e..3b3e115 100644
>> --- a/drivers/nvdimm/btt.c
>> +++ b/drivers/nvdimm/btt.c
>> @@ -1369,6 +1369,8 @@ static int nd_btt_probe(struct device *dev)
>> rc = -ENOMEM;
>> goto err_btt;
>> }
>> +
>> + set_dev_node(dev, nd_region->numa_node);
>> dev_set_drvdata(dev, btt);
>>
>> return 0;
>> diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
>> index d8a1794..5c34e68 100644
>> --- a/drivers/nvdimm/bus.c
>> +++ b/drivers/nvdimm/bus.c
>> @@ -339,9 +339,21 @@ static ssize_t devtype_show(struct device *dev, struct device_attribute *attr,
>> }
>> static DEVICE_ATTR_RO(devtype);
>>
>> +#ifdef CONFIG_NUMA
>> +static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
>> + char *buf)
>> +{
>> + return sprintf(buf, "%d\n", dev->numa_node);
>> +}
>> +DEVICE_ATTR_RO(numa_node);
>> +#endif
>> +
>> static struct attribute *nd_device_attributes[] = {
>> &dev_attr_modalias.attr,
>> &dev_attr_devtype.attr,
>> +#ifdef CONFIG_NUMA
>> + &dev_attr_numa_node.attr,
>> +#endif
>> NULL,
>> };
>
> I'd prefer you define is_visible() in the nd_device_attribute_group
> and gate showing this attribute on IS_ENABLED(CONFIG_NUMA) rather than
> including these ifdef guards. The ifdef guards aren't necessary in
> the CONFIG_NUMA=n case.

I also think is_visible() should hide this attribute on is_nvdimm() devices.

2015-06-03 15:43:05

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 3/3] libnvdimm: Add sysfs numa_node to NVDIMM devices

On Tue, 2015-06-02 at 19:51 -0700, Dan Williams wrote:
> On Tue, Jun 2, 2015 at 6:01 PM, Dan Williams <[email protected]> wrote:
> > On Tue, Jun 2, 2015 at 5:26 PM, Toshi Kani <[email protected]> wrote:
:
> >>
> >> +#ifdef CONFIG_NUMA
> >> +static ssize_t numa_node_show(struct device *dev, struct device_attribute *attr,
> >> + char *buf)
> >> +{
> >> + return sprintf(buf, "%d\n", dev->numa_node);
> >> +}
> >> +DEVICE_ATTR_RO(numa_node);
> >> +#endif
> >> +
> >> static struct attribute *nd_device_attributes[] = {
> >> &dev_attr_modalias.attr,
> >> &dev_attr_devtype.attr,
> >> +#ifdef CONFIG_NUMA
> >> + &dev_attr_numa_node.attr,
> >> +#endif
> >> NULL,
> >> };
> >
> > I'd prefer you define is_visible() in the nd_device_attribute_group
> > and gate showing this attribute on IS_ENABLED(CONFIG_NUMA) rather than
> > including these ifdef guards. The ifdef guards aren't necessary in
> > the CONFIG_NUMA=n case.
>
> I also think is_visible() should hide this attribute on is_nvdimm() devices.

Agreed. Will do.

Thanks,
-Toshi

2015-06-06 00:33:43

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 1/3] acpi: Fix acpi_map_pxm_to_node() to handle numa_off

On Tue, 2015-06-02 at 18:26 -0600, Toshi Kani wrote:
> When numa_off is set, NUMA is turned off and node 0 is the only
> valid node on the system. The kernel skips parsing ACPI SRAT
> table in this case.
>
> Change acpi_map_pxm_to_node() to always return 0 when numa_off
> is set. Also move the range check of a proximity ID from
> acpi_get_node() to acpi_map_pxm_to_node() after the numa_off
> check. This keeps the interfaces to return 0 regardless of
> proximity ID values.
>
> Signed-off-by: Toshi Kani <[email protected]>
> ---
> drivers/acpi/numa.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 1333cbdc..4898082 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -70,7 +70,15 @@ static void __acpi_map_pxm_to_node(int pxm, int node)
>
> int acpi_map_pxm_to_node(int pxm)
> {
> - int node = pxm_to_node_map[pxm];
> + int node;
> +
> + if (numa_off)
> + return 0;

I found an issue in this patch. numa_off is only defined in x86, but
this numa.c (ACPI_NUMA) can be enabled on x86 and IA64. I will fix or
drop it in the next version.

Thanks,
-Toshi