2015-06-09 23:10:55

by Toshi Kani

Subject: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

Since NVDIMMs are installed in memory slots, they expose the NUMA
topology of a platform. This patchset adds support for a sysfs
'numa_node' attribute on I/O-related NVDIMM devices under
/sys/bus/nd/devices. This enables numactl(8) to accept 'block:' and
'file:' paths of pmem and btt devices, as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem0s --show

numactl can be used to bind an application to the locality of
a target NVDIMM for better performance. Here is a result of an fio
benchmark against ext4/dax on a 2-socket HP DL380, comparing local
and remote settings.

Local [1] : 4098.3MB/s
Remote [2]: 3718.4MB/s

[1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
[2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>

Patch 1/3 applies on top of the acpica branch of the pm tree.
Patches 2/3 and 3/3 apply on top of Dan Williams's v5 patch series
"libnvdimm: non-volatile memory devices".

---
v2:
- Add acpi_map_pxm_to_online_node(), which returns an online node.
- Manage visibility of sysfs numa_node with is_visible. (Dan Williams)
- Check ACPI_NFIT_PROXIMITY_VALID in spa->flags.

---
Toshi Kani (3):
1/3 acpi: Add acpi_map_pxm_to_online_node()
2/3 libnvdimm: Set numa_node to NVDIMM devices
3/3 libnvdimm: Add sysfs numa_node to NVDIMM devices

---
drivers/acpi/nfit.c | 7 +++++++
drivers/acpi/numa.c | 40 +++++++++++++++++++++++++++++++++++++---
drivers/nvdimm/btt.c | 2 ++
drivers/nvdimm/btt_devs.c | 1 +
drivers/nvdimm/bus.c | 30 ++++++++++++++++++++++++++++++
drivers/nvdimm/namespace_devs.c | 1 +
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/region.c | 1 +
drivers/nvdimm/region_devs.c | 1 +
include/linux/acpi.h | 5 +++++
include/linux/libnvdimm.h | 2 ++
11 files changed, 88 insertions(+), 3 deletions(-)


2015-06-09 23:11:06

by Toshi Kani

Subject: [PATCH v2 1/3] acpi: Add acpi_map_pxm_to_online_node()

The kernel initializes the NUMA topology of CPUs and memory from
the ACPI SRAT table. Some other ACPI tables, such as NFIT and DMAR,
also contain proximity IDs that describe the NUMA topology of their
devices. This information can be used to improve the performance of
these devices.

This patch introduces acpi_map_pxm_to_online_node(), which maps
a given pxm to an online node. This allows ACPI device driver
modules to obtain a node from a device proximity ID. Unlike
acpi_map_pxm_to_node(), this interface is guaranteed to return
an online node, so the caller module can use the node without
dealing with the node status. A node may be offline when a device
proximity ID is unique, no SRAT memory entry exists for it, or
NUMA is disabled (ex. numa_off on x86).

This patch also moves the pxm range check from acpi_get_node()
to acpi_map_pxm_to_node().
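
As an illustration (not part of the patch; the helper name below is
made up), a driver module would typically use the new interface as
follows. Patch 2/3 applies the same pattern in
acpi_nfit_register_region():

#include <linux/acpi.h>
#include <linux/device.h>

/*
 * Sketch only: map a firmware proximity ID (e.g. from an NFIT or
 * DMAR entry) to a node that is guaranteed to be online, and attach
 * it to the device.  The caller needs no node_online() check.
 */
static void example_set_device_node(struct device *dev, int pxm)
{
        int node = acpi_map_pxm_to_online_node(pxm);

        set_dev_node(dev, node);
}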

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/numa.c | 40 +++++++++++++++++++++++++++++++++++++---
include/linux/acpi.h | 5 +++++
2 files changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 1333cbdc..a64947e 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -29,6 +29,8 @@
#include <linux/errno.h>
#include <linux/acpi.h>
#include <linux/numa.h>
+#include <linux/nodemask.h>
+#include <linux/topology.h>

#define PREFIX "ACPI: "

@@ -70,7 +72,12 @@ static void __acpi_map_pxm_to_node(int pxm, int node)

int acpi_map_pxm_to_node(int pxm)
{
- int node = pxm_to_node_map[pxm];
+ int node;
+
+ if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
+ return NUMA_NO_NODE;
+
+ node = pxm_to_node_map[pxm];

if (node == NUMA_NO_NODE) {
if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
@@ -83,6 +90,35 @@ int acpi_map_pxm_to_node(int pxm)
return node;
}

+/*
+ * Return an online node from a pxm. This interface is intended for ACPI
+ * device drivers that obtain device NUMA topology from ACPI table, but
+ * do not initialize the node status.
+ */
+int acpi_map_pxm_to_online_node(int pxm)
+{
+ int node, n, dist, min_dist;
+
+ node = acpi_map_pxm_to_node(pxm);
+
+ if (node == NUMA_NO_NODE)
+ node = 0;
+
+ if (!node_online(node)) {
+ min_dist = INT_MAX;
+ for_each_online_node(n) {
+ dist = node_distance(node, n);
+ if (dist < min_dist) {
+ min_dist = dist;
+ node = n;
+ }
+ }
+ }
+
+ return node;
+}
+EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
+
static void __init
acpi_table_print_srat_entry(struct acpi_subtable_header *header)
{
@@ -328,8 +364,6 @@ int acpi_get_node(acpi_handle handle)
int pxm;

pxm = acpi_get_pxm(handle);
- if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
- return NUMA_NO_NODE;

return acpi_map_pxm_to_node(pxm);
}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index e4da5e3..1b3bbb1 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -289,8 +289,13 @@ extern void acpi_dmi_osi_linux(int enable, const struct dmi_system_id *d);
extern void acpi_osi_setup(char *str);

#ifdef CONFIG_ACPI_NUMA
+int acpi_map_pxm_to_online_node(int pxm);
int acpi_get_node(acpi_handle handle);
#else
+static inline int acpi_map_pxm_to_online_node(int pxm)
+{
+ return 0;
+}
static inline int acpi_get_node(acpi_handle handle)
{
return 0;

2015-06-09 23:11:32

by Toshi Kani

Subject: [PATCH v2 2/3] libnvdimm: Set numa_node to NVDIMM devices

The ACPI NFIT table has System Physical Address Range Structure
entries that describe the proximity ID of each range when
ACPI_NFIT_PROXIMITY_VALID is set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its
node ID, and set it in a new numa_node field of nd_region_desc,
which is then conveyed to nd_region.

nd_region_probe() and nd_btt_probe() set the nd_region's numa_node
on the device object being probed. A namespace device inherits
the numa_node from its parent region device.
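
For context, the SPA entries referenced above carry the
proximity_domain and flags fields in the following layout
(paraphrased from ACPICA's actbl1.h; see that header for the
authoritative definition):

struct acpi_nfit_system_address {
        struct acpi_nfit_header header;
        u16 range_index;
        u16 flags;              /* ACPI_NFIT_PROXIMITY_VALID is checked here */
        u32 reserved;
        u32 proximity_domain;   /* mapped to a node ID by this patch */
        u8 range_guid[16];
        u64 address;
        u64 length;
        u64 memory_mapping;
};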

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/nfit.c | 6 ++++++
drivers/nvdimm/btt.c | 2 ++
drivers/nvdimm/nd.h | 1 +
drivers/nvdimm/region.c | 1 +
drivers/nvdimm/region_devs.c | 1 +
include/linux/libnvdimm.h | 1 +
6 files changed, 12 insertions(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 5731e4a..69dc6e0 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -1255,6 +1255,12 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
+ if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+ ndr_desc->numa_node = acpi_map_pxm_to_online_node(
+ spa->proximity_domain);
+ else
+ ndr_desc->numa_node = NUMA_NO_NODE;
+
list_for_each_entry(nfit_memdev, &acpi_desc->memdevs, list) {
struct acpi_nfit_memory_map *memdev = nfit_memdev->memdev;
struct nd_mapping *nd_mapping;
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 2d7ce9e..3b3e115 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1369,6 +1369,8 @@ static int nd_btt_probe(struct device *dev)
rc = -ENOMEM;
goto err_btt;
}
+
+ set_dev_node(dev, nd_region->numa_node);
dev_set_drvdata(dev, btt);

return 0;
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index c807379..fefd8f6 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -108,6 +108,7 @@ struct nd_region {
u64 ndr_size;
u64 ndr_start;
int id, num_lanes;
+ int numa_node;
void *provider_data;
struct nd_interleave_set *nd_set;
struct nd_mapping mapping[0];
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 373eab4..783220e 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -123,6 +123,7 @@ static int nd_region_probe(struct device *dev)

num_ns->active = rc;
num_ns->count = rc + err;
+ set_dev_node(dev, nd_region->numa_node);
dev_set_drvdata(dev, num_ns);

if (err == 0)
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 86adbd8..352bc80 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -627,6 +627,7 @@ static noinline struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus
nd_region->provider_data = ndr_desc->provider_data;
nd_region->nd_set = ndr_desc->nd_set;
nd_region->num_lanes = ndr_desc->num_lanes;
+ nd_region->numa_node = ndr_desc->numa_node;
ida_init(&nd_region->ns_ida);
dev = &nd_region->dev;
dev_set_name(dev, "region%d", nd_region->id);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 96b9507..5d0c75a 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -78,6 +78,7 @@ struct nd_region_desc {
struct nd_interleave_set *nd_set;
void *provider_data;
int num_lanes;
+ int numa_node;
};

struct nvdimm_bus;

2015-06-09 23:11:25

by Toshi Kani

Subject: [PATCH v2 3/3] libnvdimm: Add sysfs numa_node to NVDIMM devices

Add support for a sysfs 'numa_node' attribute on I/O-related NVDIMM
devices under /sys/bus/nd/devices: regionN, namespaceN.0, and bttN.
When bttN is not set up, its numa_node returns -1 (NUMA_NO_NODE).

Here is an example of numa_node values on a 2-socket system with
a single NVDIMM range on each socket.
/sys/bus/nd/devices
|-- btt0/numa_node:-1
|-- btt1/numa_node:0
|-- namespace0.0/numa_node:0
|-- namespace1.0/numa_node:1
|-- region0/numa_node:0
|-- region1/numa_node:1

These numa_node files are then linked under the block class entries
for their device names.
/sys/class/block/pmem0/device/numa_node:0
/sys/class/block/pmem0s/device/numa_node:0
/sys/class/block/pmem1/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
numactl --preferred block:pmem0 --show
numactl --preferred file:/dev/pmem0s --show
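
A minimal userspace check (illustration only; it assumes the region0
device from the example above) reads the attribute the same way
numactl's generic class lookup does:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/bus/nd/devices/region0/numa_node", "r");
        int node;

        if (!f)
                return 1;
        if (fscanf(f, "%d", &node) == 1)
                printf("region0 is on node %d\n", node);
        fclose(f);
        return 0;
}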

Signed-off-by: Toshi Kani <[email protected]>
---
drivers/acpi/nfit.c | 1 +
drivers/nvdimm/btt_devs.c | 1 +
drivers/nvdimm/bus.c | 30 ++++++++++++++++++++++++++++++
drivers/nvdimm/namespace_devs.c | 1 +
include/linux/libnvdimm.h | 1 +
5 files changed, 34 insertions(+)

diff --git a/drivers/acpi/nfit.c b/drivers/acpi/nfit.c
index 69dc6e0..ebcaf2a 100644
--- a/drivers/acpi/nfit.c
+++ b/drivers/acpi/nfit.c
@@ -789,6 +789,7 @@ static const struct attribute_group *acpi_nfit_region_attribute_groups[] = {
&nd_region_attribute_group,
&nd_mapping_attribute_group,
&nd_device_attribute_group,
+ &nd_numa_attribute_group,
&acpi_nfit_region_attribute_group,
NULL,
};
diff --git a/drivers/nvdimm/btt_devs.c b/drivers/nvdimm/btt_devs.c
index 740b560..4a053e9 100644
--- a/drivers/nvdimm/btt_devs.c
+++ b/drivers/nvdimm/btt_devs.c
@@ -295,6 +295,7 @@ static struct attribute_group nd_btt_attribute_group = {
static const struct attribute_group *nd_btt_attribute_groups[] = {
&nd_btt_attribute_group,
&nd_device_attribute_group,
+ &nd_numa_attribute_group,
NULL,
};

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index d8a1794..20ffacc 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -353,6 +353,36 @@ struct attribute_group nd_device_attribute_group = {
};
EXPORT_SYMBOL_GPL(nd_device_attribute_group);

+static ssize_t numa_node_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", dev_to_node(dev));
+}
+static DEVICE_ATTR_RO(numa_node);
+
+static struct attribute *nd_numa_attributes[] = {
+ &dev_attr_numa_node.attr,
+ NULL,
+};
+
+static umode_t nd_numa_attr_visible(struct kobject *kobj, struct attribute *a,
+ int n)
+{
+ if (!IS_ENABLED(CONFIG_NUMA))
+ return 0;
+
+ return a->mode;
+}
+
+/**
+ * nd_numa_attribute_group - NUMA attributes for all devices on an nd bus
+ */
+struct attribute_group nd_numa_attribute_group = {
+ .attrs = nd_numa_attributes,
+ .is_visible = nd_numa_attr_visible,
+};
+EXPORT_SYMBOL_GPL(nd_numa_attribute_group);
+
int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus)
{
dev_t devt = MKDEV(nvdimm_bus_major, nvdimm_bus->id);
diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
index e89b019..26f877f 100644
--- a/drivers/nvdimm/namespace_devs.c
+++ b/drivers/nvdimm/namespace_devs.c
@@ -1123,6 +1123,7 @@ static struct attribute_group nd_namespace_attribute_group = {
static const struct attribute_group *nd_namespace_attribute_groups[] = {
&nd_device_attribute_group,
&nd_namespace_attribute_group,
+ &nd_numa_attribute_group,
NULL,
};

diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 5d0c75a..a85566b 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -35,6 +35,7 @@ enum {
extern struct attribute_group nvdimm_bus_attribute_group;
extern struct attribute_group nvdimm_attribute_group;
extern struct attribute_group nd_device_attribute_group;
+extern struct attribute_group nd_numa_attribute_group;
extern struct attribute_group nd_region_attribute_group;
extern struct attribute_group nd_mapping_attribute_group;

2015-06-10 15:55:09

by Jeff Moyer

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

Toshi Kani <[email protected]> writes:

> Since NVDIMMs are installed on memory slots, they expose the NUMA
> topology of a platform. This patchset adds support of sysfs
> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> This enables numactl(8) to accept 'block:' and 'file:' paths of
> pmem and btt devices as shown in the examples below.
> numactl --preferred block:pmem0 --show
> numactl --preferred file:/dev/pmem0s --show
>
> numactl can be used to bind an application to the locality of
> a target NVDIMM for better performance. Here is a result of fio
> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> remote settings.
>
> Local [1] : 4098.3MB/s
> Remote [2]: 3718.4MB/s
>
> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>

Did you post the patches to numactl somewhere?

-Jeff

2015-06-10 16:01:03

by Dan Williams

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer <[email protected]> wrote:
> Toshi Kani <[email protected]> writes:
>
>> Since NVDIMMs are installed on memory slots, they expose the NUMA
>> topology of a platform. This patchset adds support of sysfs
>> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
>> This enables numactl(8) to accept 'block:' and 'file:' paths of
>> pmem and btt devices as shown in the examples below.
>> numactl --preferred block:pmem0 --show
>> numactl --preferred file:/dev/pmem0s --show
>>
>> numactl can be used to bind an application to the locality of
>> a target NVDIMM for better performance. Here is a result of fio
>> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
>> remote settings.
>>
>> Local [1] : 4098.3MB/s
>> Remote [2]: 3718.4MB/s
>>
>> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
>> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
>
> Did you post the patches to numactl somewhere?
>

numactl already supports this today.

2015-06-10 16:44:58

by Jeff Moyer

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

Dan Williams <[email protected]> writes:

> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer <[email protected]> wrote:
>> Toshi Kani <[email protected]> writes:
>>
>>> Since NVDIMMs are installed on memory slots, they expose the NUMA
>>> topology of a platform. This patchset adds support of sysfs
>>> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
>>> This enables numactl(8) to accept 'block:' and 'file:' paths of
>>> pmem and btt devices as shown in the examples below.
>>> numactl --preferred block:pmem0 --show
>>> numactl --preferred file:/dev/pmem0s --show
>>>
>>> numactl can be used to bind an application to the locality of
>>> a target NVDIMM for better performance. Here is a result of fio
>>> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
>>> remote settings.
>>>
>>> Local [1] : 4098.3MB/s
>>> Remote [2]: 3718.4MB/s
>>>
>>> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
>>> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
>>
>> Did you post the patches to numactl somewhere?
>>
>
> numactl already supports this today.

Ah, I did not know that. I guess I should have RTFM. :)

Cheers,
Jeff

by Robert Elliott

Subject: RE: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

> -----Original Message-----
> From: Linux-nvdimm [mailto:[email protected]] On Behalf Of
> Dan Williams
> Sent: Wednesday, June 10, 2015 9:58 AM
> To: Jeff Moyer
> Cc: linux-nvdimm; Rafael J. Wysocki; [email protected]; Linux
> ACPI
> Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices
>
> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer <[email protected]> wrote:
> > Toshi Kani <[email protected]> writes:
> >
> >> Since NVDIMMs are installed on memory slots, they expose the NUMA
> >> topology of a platform. This patchset adds support of sysfs
> >> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> >> This enables numactl(8) to accept 'block:' and 'file:' paths of
> >> pmem and btt devices as shown in the examples below.
> >> numactl --preferred block:pmem0 --show
> >> numactl --preferred file:/dev/pmem0s --show
> >>
> >> numactl can be used to bind an application to the locality of
> >> a target NVDIMM for better performance. Here is a result of fio
> >> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> >> remote settings.
> >>
> >> Local [1] : 4098.3MB/s
> >> Remote [2]: 3718.4MB/s
> >>
> >> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-
> on-pmem0>
> >> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-
> on-pmem0>
> >
> > Did you post the patches to numactl somewhere?
> >
>
> numactl already supports this today.

numactl does have a bug handling partitions under these devices,
because it assumes all storage devices have "/devices/pci"
in their path when it tries to find the parent device for a
partition. I think we'll propose a numactl patch for that;
I don't think the drivers can fool it.

Details (from an earlier version of the patch series
in which btt devices were named /dev/nd1, etc.):

strace shows that numactl is trying to find numa_node in very
different locations for /dev/nd1p1 vs. /dev/sda1.

strace for /dev/nd1p1
=====================
open("/sys/class/block/nd1p1/dev", O_RDONLY) = 4
read(4, "259:1\n", 4095) = 6
close(4) = 0
close(3) = 0
readlink("/sys/class/block/nd1p1", "../../devices/LNXSYSTM:00/LNXSYB"..., 1024) = 77
open("/sys/class/block/nd1p1/device/numa_node", O_RDONLY) = -1 ENOENT (No such file or directory)

strace for /dev/sda1
====================
open("/sys/class/block/sda1/dev", O_RDONLY) = 4
read(4, "8:1\n", 4095) = 4
close(4) = 0
close(3) = 0
readlink("/sys/class/block/sda1", "../../devices/pci0000:00/0000:00"..., 1024) = 91
open("/sys//devices/pci0000:00/0000:00:01.0//numa_node", O_RDONLY) = 3
read(3, "0\n", 4095) = 2
close(3) = 0

The "sys/class/block/xxx" paths link to:
lrwxrwxrwx. 1 root root 0 May 20 20:42 /sys/class/block/nd1p1 -> ../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/btt1/block/nd1/nd1p1
lrwxrwxrwx. 1 root root 0 May 20 20:41 /sys/class/block/sda1 -> ../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1


For /dev/sda1, numactl recognizes "/devices/pci" as
a special path, and strips off everything after the
numbers. Faced with:
../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1

it ends up with this (leaving a sloppy "//" in the path):
/sys/devices/pci0000:00/0000:00:01.0//numa_node

It would also succeed if it ended up with this:
/sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/numa_node

For /dev/nd1p1 it does not see that string, so it just
tries to open "/sys/class/block/nd1p1/device/numa_node".

There are no "device/" subdirectories in the tree for
partition devices (for either sda1 or nd1p1), so this
fails.


From http://oss.sgi.com/projects/libnuma/
numactl affinity.c:
        /* Somewhat hackish: extract device from symlink path.
           Better would be a direct backlink. This knows slightly too
           much about the actual sysfs layout. */
        char path[1024];
        char *fn = NULL;
        if (asprintf(&fn, "/sys/class/%s/%s", cls, dev) > 0 &&
            readlink(fn, path, sizeof path) > 0) {
                regex_t re;
                regmatch_t match[2];
                char *p;

                regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/",
                        REG_EXTENDED);
                ret = regexec(&re, path, 2, match, 0);
                regfree(&re);
                if (ret == 0) {
                        free(fn);
                        assert(match[0].rm_so > 0);
                        assert(match[0].rm_eo > 0);
                        path[match[1].rm_eo + 1] = 0;
                        p = path + match[0].rm_so;
                        ret = sysfs_node_read(mask, "/sys/%s/numa_node", p);
                        if (ret < 0)
                                return node_parse_failure(ret, NULL, p);
                        return ret;
                }
        }
        free(fn);

        ret = sysfs_node_read(mask, "/sys/class/%s/%s/device/numa_node",
                              cls, dev);
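
To make the failure concrete, here is a self-contained demonstration
(an added illustration; the two paths are the symlink targets shown
above) that the PCI regex matches the sda1 path but finds nothing in
the nd1p1 path:

#include <regex.h>
#include <stdio.h>

int main(void)
{
        /* symlink targets copied from the listings above */
        const char *sda1 = "../../devices/pci0000:00/0000:00:01.0/"
                "0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1";
        const char *nd1p1 = "../../devices/LNXSYSTM:00/LNXSYBUS:00/"
                "ACPI0012:00/ndbus0/btt1/block/nd1/nd1p1";
        regex_t re;
        regmatch_t match[2];

        /* the same pattern numactl compiles in affinity.c */
        regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/", REG_EXTENDED);
        printf("sda1:  %s\n",
               regexec(&re, sda1, 2, match, 0) ? "no match" : "match");
        printf("nd1p1: %s\n",
               regexec(&re, nd1p1, 2, match, 0) ? "no match" : "match");
        regfree(&re);
        return 0;
}

This prints "match" for sda1 and "no match" for nd1p1.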




2015-06-10 16:20:52

by Toshi Kani

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Wed, 2015-06-10 at 08:57 -0700, Dan Williams wrote:
> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer <[email protected]> wrote:
> > Toshi Kani <[email protected]> writes:
> >
> >> Since NVDIMMs are installed on memory slots, they expose the NUMA
> >> topology of a platform. This patchset adds support of sysfs
> >> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> >> This enables numactl(8) to accept 'block:' and 'file:' paths of
> >> pmem and btt devices as shown in the examples below.
> >> numactl --preferred block:pmem0 --show
> >> numactl --preferred file:/dev/pmem0s --show
> >>
> >> numactl can be used to bind an application to the locality of
> >> a target NVDIMM for better performance. Here is a result of fio
> >> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> >> remote settings.
> >>
> >> Local [1] : 4098.3MB/s
> >> Remote [2]: 3718.4MB/s
> >>
> >> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
> >> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
> >
> > Did you post the patches to numactl somewhere?
> >
>
> numactl already supports this today.

Yes, numactl supports the following sysfs class lookup for numa_node.
This patchset adds numa_node for NVDIMM devices in the same sysfs format
as described in patch 3/3.

/* Generic sysfs class lookup */
static int
affinity_class(struct bitmask *mask, char *cls, const char *dev)
{
        :
        ret = sysfs_node_read(mask, "/sys/class/%s/%s/device/numa_node",
                              cls, dev);

Thanks,
-Toshi

2015-06-10 16:38:20

by Dan Williams

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Wed, Jun 10, 2015 at 9:20 AM, Elliott, Robert (Server Storage)
<[email protected]> wrote:
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:[email protected]] On Behalf Of
>> Dan Williams
>> Sent: Wednesday, June 10, 2015 9:58 AM
>> To: Jeff Moyer
>> Cc: linux-nvdimm; Rafael J. Wysocki; [email protected]; Linux
>> ACPI
>> Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices
>>
>> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer <[email protected]> wrote:
>> > Toshi Kani <[email protected]> writes:
>> >
>> >> Since NVDIMMs are installed on memory slots, they expose the NUMA
>> >> topology of a platform. This patchset adds support of sysfs
>> >> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
>> >> This enables numactl(8) to accept 'block:' and 'file:' paths of
>> >> pmem and btt devices as shown in the examples below.
>> >> numactl --preferred block:pmem0 --show
>> >> numactl --preferred file:/dev/pmem0s --show
>> >>
>> >> numactl can be used to bind an application to the locality of
>> >> a target NVDIMM for better performance. Here is a result of fio
>> >> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
>> >> remote settings.
>> >>
>> >> Local [1] : 4098.3MB/s
>> >> Remote [2]: 3718.4MB/s
>> >>
>> >> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-
>> on-pmem0>
>> >> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-
>> on-pmem0>
>> >
>> > Did you post the patches to numactl somewhere?
>> >
>>
>> numactl already supports this today.
>
> numactl does have a bug handling partitions under these devices,
> because it assumes all storage devices have "/devices/pci"
> in their path as it tries to find the parent device for the
> partition. I think we'll propose a numactl patch for that;
> I don't think the drivers can fool it.
>
> Details (from an earlier version of the patch series
> in which btt devices were named /dev/nd1, etc.):
>
> strace shows that numactl is trying to find numa_node in very
> different locations for /dev/nd1p1 vs. /dev/sda1.
>
> strace for /dev/nd1p1
> =====================
> open("/sys/class/block/nd1p1/dev", O_RDONLY) = 4
> read(4, "259:1\n", 4095) = 6
> close(4) = 0
> close(3) = 0
> readlink("/sys/class/block/nd1p1", "../../devices/LNXSYSTM:00/LNXSYB"..., 1024) = 77
> open("/sys/class/block/nd1p1/device/numa_node", O_RDONLY) = -1 ENOENT (No such file or directory)
>
> strace for /dev/sda1
> ====================
> open("/sys/class/block/sda1/dev", O_RDONLY) = 4
> read(4, "8:1\n", 4095) = 4
> close(4) = 0
> close(3) = 0
> readlink("/sys/class/block/sda1", "../../devices/pci0000:00/0000:00"..., 1024) = 91
> open("/sys//devices/pci0000:00/0000:00:01.0//numa_node", O_RDONLY) = 3
> read(3, "0\n", 4095) = 2
> close(3) = 0
>
> The "sys/class/block/xxx" paths link to:
> lrwxrwxrwx. 1 root root 0 May 20 20:42 /sys/class/block/nd1p1 -> ../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/btt1/block/nd1/nd1p1
> lrwxrwxrwx. 1 root root 0 May 20 20:41 /sys/class/block/sda1 -> ../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1
>
>
> For /dev/sda1, numactl recognizes "/devices/pci" as
> a special path, and strips off everything after the
> numbers. Faced with:
> ../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1
>
> it ends up with this (leaving a sloppy "//" in the path):
> /sys/devices/pci0000:00/0000:00:01.0//numa_node
>
> It would also succeed if it ended up with this:
> /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/numa_node
>
> For /dev/nd1p1 it does not see that string, so just
> tries to open "/sys/class/block/nd1p1/device/numa_node"
>
> There are no "device/" subdirectories in the tree for
> partition devices (for either sda1 or nd1p1), so this
> fails.
>
>
> From http://oss.sgi.com/projects/libnuma/
> numactl affinity.c:
>         /* Somewhat hackish: extract device from symlink path.
>            Better would be a direct backlink. This knows slightly too
>            much about the actual sysfs layout. */
>         char path[1024];
>         char *fn = NULL;
>         if (asprintf(&fn, "/sys/class/%s/%s", cls, dev) > 0 &&
>             readlink(fn, path, sizeof path) > 0) {
>                 regex_t re;
>                 regmatch_t match[2];
>                 char *p;
>
>                 regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/",
>                         REG_EXTENDED);
>                 ret = regexec(&re, path, 2, match, 0);
>                 regfree(&re);
>                 if (ret == 0) {
>                         free(fn);
>                         assert(match[0].rm_so > 0);
>                         assert(match[0].rm_eo > 0);
>                         path[match[1].rm_eo + 1] = 0;
>                         p = path + match[0].rm_so;
>                         ret = sysfs_node_read(mask, "/sys/%s/numa_node", p);
>                         if (ret < 0)
>                                 return node_parse_failure(ret, NULL, p);
>                         return ret;
>                 }
>         }
>         free(fn);
>
>         ret = sysfs_node_read(mask, "/sys/class/%s/%s/device/numa_node",
>                               cls, dev);

I think it is broken to try to go from /sys/class down; it should go
from the device node up. I.e., start from the resolved path of
/sys/dev/block/<major>:<minor>, and then walk up the directory tree to
the parent of block.

$ readlink -f /sys/dev/block/8\:1/
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1
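
A rough sketch of that approach (an illustration of the suggestion
above, not a proposed numactl patch; the helper name is made up). It
resolves the dev_t under /sys/dev/block and tries each ancestor
directory until one exposes a numa_node attribute:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Resolve /sys/dev/block/<maj>:<min>, then walk up the sysfs tree,
 * trying each ancestor until one has a numa_node attribute.
 */
static int blockdev_numa_node(int major, int minor)
{
        char link[64], path[PATH_MAX], attr[PATH_MAX + 16];
        FILE *f;

        snprintf(link, sizeof(link), "/sys/dev/block/%d:%d", major, minor);
        if (!realpath(link, path))
                return -1;

        while (strstr(path, "/devices/")) {
                char *p;
                int node;

                snprintf(attr, sizeof(attr), "%s/numa_node", path);
                f = fopen(attr, "r");
                if (f) {
                        if (fscanf(f, "%d", &node) != 1)
                                node = -1;
                        fclose(f);
                        return node;
                }
                p = strrchr(path, '/');         /* chop one component */
                if (!p)
                        break;
                *p = '\0';
        }
        return -1;
}

int main(void)
{
        printf("node: %d\n", blockdev_numa_node(8, 1)); /* 8:1 = sda1 above */
        return 0;
}

For nd1p1 this walk would stop at btt1, which exposes numa_node once
patch 3/3 is applied, so no PCI-specific pattern matching is needed.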

2015-06-11 15:38:23

by Dan Williams

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Tue, Jun 9, 2015 at 4:10 PM, Toshi Kani <[email protected]> wrote:
> Since NVDIMMs are installed on memory slots, they expose the NUMA
> topology of a platform. This patchset adds support of sysfs
> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> This enables numactl(8) to accept 'block:' and 'file:' paths of
> pmem and btt devices as shown in the examples below.
> numactl --preferred block:pmem0 --show
> numactl --preferred file:/dev/pmem0s --show
>
> numactl can be used to bind an application to the locality of
> a target NVDIMM for better performance. Here is a result of fio
> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> remote settings.
>
> Local [1] : 4098.3MB/s
> Remote [2]: 3718.4MB/s
>
> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
>
> Patch 1/3 applies on top of the acpica branch of the pm tree.
> Patch 2/3-3/3 apply on top of Dan Williams's v5 patch series of
> "libnvdimm: non-volatile memory devices".
>
> ---
> v2:
> - Add acpi_map_pxm_to_online_node(), which returns an online node.
> - Manage visibility of sysfs numa_node with is_visible. (Dan Williams)
> - Check ACPI_NFIT_PROXIMITY_VALID in spa->flags.
>
> ---
> Toshi Kani (3):
> 1/3 acpi: Add acpi_map_pxm_to_online_node()
> 2/3 libnvdimm: Set numa_node to NVDIMM devices
> 3/3 libnvdimm: Add sysfs numa_node to NVDIMM devices

Looks good to me. Once Rafael acks the ACPI core changes I'll pull it
in to libnvdimm-for-next.

2015-06-11 15:46:03

by Toshi Kani

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Thu, 2015-06-11 at 08:38 -0700, Dan Williams wrote:
> On Tue, Jun 9, 2015 at 4:10 PM, Toshi Kani <[email protected]> wrote:
> > Since NVDIMMs are installed on memory slots, they expose the NUMA
> > topology of a platform. This patchset adds support of sysfs
> > 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> > This enables numactl(8) to accept 'block:' and 'file:' paths of
> > pmem and btt devices as shown in the examples below.
> > numactl --preferred block:pmem0 --show
> > numactl --preferred file:/dev/pmem0s --show
> >
> > numactl can be used to bind an application to the locality of
> > a target NVDIMM for better performance. Here is a result of fio
> > benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> > remote settings.
> >
> > Local [1] : 4098.3MB/s
> > Remote [2]: 3718.4MB/s
> >
> > [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
> > [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
> >
> > Patch 1/3 applies on top of the acpica branch of the pm tree.
> > Patch 2/3-3/3 apply on top of Dan Williams's v5 patch series of
> > "libnvdimm: non-volatile memory devices".
> >
> > ---
> > v2:
> > - Add acpi_map_pxm_to_online_node(), which returns an online node.
> > - Manage visibility of sysfs numa_node with is_visible. (Dan Williams)
> > - Check ACPI_NFIT_PROXIMITY_VALID in spa->flags.
> >
> > ---
> > Toshi Kani (3):
> > 1/3 acpi: Add acpi_map_pxm_to_online_node()
> > 2/3 libnvdimm: Set numa_node to NVDIMM devices
> > 3/3 libnvdimm: Add sysfs numa_node to NVDIMM devices
>
> Looks good to me. Once Rafael acks the ACPI core changes I'll pull it
> in to libnvdimm-for-next.

Great! Thanks Dan,
-Toshi

2015-06-18 20:24:11

by Dan Williams

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

Rafael, does patch1 look ok to you?

On Tue, Jun 9, 2015 at 4:10 PM, Toshi Kani <[email protected]> wrote:
> Since NVDIMMs are installed on memory slots, they expose the NUMA
> topology of a platform. This patchset adds support of sysfs
> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
> This enables numactl(8) to accept 'block:' and 'file:' paths of
> pmem and btt devices as shown in the examples below.
> numactl --preferred block:pmem0 --show
> numactl --preferred file:/dev/pmem0s --show
>
> numactl can be used to bind an application to the locality of
> a target NVDIMM for better performance. Here is a result of fio
> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
> remote settings.
>
> Local [1] : 4098.3MB/s
> Remote [2]: 3718.4MB/s
>
> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <fs-on-pmem0>
> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <fs-on-pmem0>
>
> Patch 1/3 applies on top of the acpica branch of the pm tree.
> Patch 2/3-3/3 apply on top of Dan Williams's v5 patch series of
> "libnvdimm: non-volatile memory devices".
>
> ---
> v2:
> - Add acpi_map_pxm_to_online_node(), which returns an online node.
> - Manage visibility of sysfs numa_node with is_visible. (Dan Williams)
> - Check ACPI_NFIT_PROXIMITY_VALID in spa->flags.
>
> ---
> Toshi Kani (3):
> 1/3 acpi: Add acpi_map_pxm_to_online_node()
> 2/3 libnvdimm: Set numa_node to NVDIMM devices
> 3/3 libnvdimm: Add sysfs numa_node to NVDIMM devices
>
> ---
> drivers/acpi/nfit.c | 7 +++++++
> drivers/acpi/numa.c | 40 +++++++++++++++++++++++++++++++++++++---
> drivers/nvdimm/btt.c | 2 ++
> drivers/nvdimm/btt_devs.c | 1 +
> drivers/nvdimm/bus.c | 30 ++++++++++++++++++++++++++++++
> drivers/nvdimm/namespace_devs.c | 1 +
> drivers/nvdimm/nd.h | 1 +
> drivers/nvdimm/region.c | 1 +
> drivers/nvdimm/region_devs.c | 1 +
> include/linux/acpi.h | 5 +++++
> include/linux/libnvdimm.h | 2 ++
> 11 files changed, 88 insertions(+), 3 deletions(-)

2015-06-19 00:16:34

by Rafael J. Wysocki

Subject: Re: [PATCH v2 1/3] acpi: Add acpi_map_pxm_to_online_node()

On Tuesday, June 09, 2015 05:10:38 PM Toshi Kani wrote:
> The kernel initializes CPU & memory's NUMA topology from ACPI
> SRAT table. Some other ACPI tables, such as NFIT and DMAR,
> also contain proximity IDs for their device's NUMA topology.
> This information can be used to improve performance of these
> devices.
>
> This patch introduces acpi_map_pxm_to_online_node(), which maps
> a given pxm to an online node. This allows ACPI device driver
> modules to obtain a node from a device proximity ID. Unlike
> acpi_map_pxm_to_node(), this interface is guaranteed to return
> an online node so that the caller module can use the node without
> dealing with the node status. A node may be offline when a device
> proximity ID is unique, SRAT memory entry does not exist, or
> NUMA is disabled (ex. numa_off on x86).
>
> This patch also moves the pxm range check from acpi_get_node()
> to acpi_map_pxm_to_node().
>
> Signed-off-by: Toshi Kani <[email protected]>
> ---
> drivers/acpi/numa.c | 40 +++++++++++++++++++++++++++++++++++++---
> include/linux/acpi.h | 5 +++++
> 2 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index 1333cbdc..a64947e 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -29,6 +29,8 @@
> #include <linux/errno.h>
> #include <linux/acpi.h>
> #include <linux/numa.h>
> +#include <linux/nodemask.h>
> +#include <linux/topology.h>
>
> #define PREFIX "ACPI: "
>
> @@ -70,7 +72,12 @@ static void __acpi_map_pxm_to_node(int pxm, int node)
>
> int acpi_map_pxm_to_node(int pxm)
> {
> - int node = pxm_to_node_map[pxm];
> + int node;
> +
> + if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
> + return NUMA_NO_NODE;
> +
> + node = pxm_to_node_map[pxm];
>
> if (node == NUMA_NO_NODE) {
> if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
> @@ -83,6 +90,35 @@ int acpi_map_pxm_to_node(int pxm)
> return node;
> }
>
> +/*
> + * Return an online node from a pxm. This interface is intended for ACPI
> + * device drivers that obtain device NUMA topology from ACPI table, but
> + * do not initialize the node status.
> + */

Can you make this a proper kerneldoc, please? *Especially* as it is an
exported function.

The description is a bit terse too, in my view.

> +int acpi_map_pxm_to_online_node(int pxm)
> +{
> + int node, n, dist, min_dist;
> +
> + node = acpi_map_pxm_to_node(pxm);
> +
> + if (node == NUMA_NO_NODE)
> + node = 0;
> +
> + if (!node_online(node)) {
> + min_dist = INT_MAX;
> + for_each_online_node(n) {
> + dist = node_distance(node, n);
> + if (dist < min_dist) {
> + min_dist = dist;
> + node = n;
> + }
> + }
> + }
> +
> + return node;
> +}
> +EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
> +
> static void __init
> acpi_table_print_srat_entry(struct acpi_subtable_header *header)
> {
> @@ -328,8 +364,6 @@ int acpi_get_node(acpi_handle handle)
> int pxm;
>
> pxm = acpi_get_pxm(handle);
> - if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
> - return NUMA_NO_NODE;
>
> return acpi_map_pxm_to_node(pxm);
> }
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index e4da5e3..1b3bbb1 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -289,8 +289,13 @@ extern void acpi_dmi_osi_linux(int enable, const struct dmi_system_id *d);
> extern void acpi_osi_setup(char *str);
>
> #ifdef CONFIG_ACPI_NUMA
> +int acpi_map_pxm_to_online_node(int pxm);
> int acpi_get_node(acpi_handle handle);
> #else
> +static inline int acpi_map_pxm_to_online_node(int pxm)
> +{
> + return 0;
> +}
> static inline int acpi_get_node(acpi_handle handle)
> {
> return 0;

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2015-06-19 00:18:01

by Rafael J. Wysocki

Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices

On Thursday, June 18, 2015 01:24:01 PM Dan Williams wrote:
> Rafael, does patch1 look ok to you?

Mostly. acpi_map_pxm_to_online_node() needs a proper kerneldoc comment
describing what it does.

Thanks,
Rafael

2015-06-19 01:17:07

by Toshi Kani

Subject: Re: [PATCH v2 1/3] acpi: Add acpi_map_pxm_to_online_node()

On Fri, 2015-06-19 at 02:42 +0200, Rafael J. Wysocki wrote:
> On Tuesday, June 09, 2015 05:10:38 PM Toshi Kani wrote:
> > The kernel initializes CPU & memory's NUMA topology from ACPI
> > SRAT table. Some other ACPI tables, such as NFIT and DMAR,
> > also contain proximity IDs for their device's NUMA topology.
> > This information can be used to improve performance of these
> > devices.
> >
> > This patch introduces acpi_map_pxm_to_online_node(), which maps
> > a given pxm to an online node. This allows ACPI device driver
> > modules to obtain a node from a device proximity ID. Unlike
> > acpi_map_pxm_to_node(), this interface is guaranteed to return
> > an online node so that the caller module can use the node without
> > dealing with the node status. A node may be offline when a device
> > proximity ID is unique, SRAT memory entry does not exist, or
> > NUMA is disabled (ex. numa_off on x86).
> >
> > This patch also moves the pxm range check from acpi_get_node()
> > to acpi_map_pxm_to_node().
:
> > +/*
> > + * Return an online node from a pxm. This interface is intended for ACPI
> > + * device drivers that obtain device NUMA topology from ACPI table, but
> > + * do not initialize the node status.
> > + */
>
> Can you make this a proper kerneldoc, please? *Especially* that it is an
> exported function.
>
> The description is a bit terse too in my view.

Agreed. I will update the comment as a proper kerneldoc.

Thanks!
-Toshi
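
For reference, the requested kerneldoc might read along these lines
(a sketch built from the existing comment and the function's
behavior, not the final wording):

/**
 * acpi_map_pxm_to_online_node - Map proximity ID to online node
 * @pxm: ACPI proximity ID
 *
 * This is similar to acpi_map_pxm_to_node(), but always returns an
 * online node.  It is intended for ACPI device drivers that obtain
 * device NUMA topology from ACPI tables, such as NFIT, but do not
 * initialize the node status themselves.  If the mapped node is
 * offline, the nearest online node (by node_distance()) is returned
 * instead.
 */
int acpi_map_pxm_to_online_node(int pxm)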