2017-07-06 21:52:46

by Ross Zwisler

Subject: [RFC v2 0/5] surface heterogeneous memory performance information

==== Quick Summary ====

Platforms in the very near future will have multiple types of memory
attached to a single CPU. These disparate memory ranges will have some
characteristics in common, such as CPU cache coherence, but they can vary
widely in performance in terms of both latency and bandwidth.

For example, consider a system that contains persistent memory, standard
DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
There could potentially be an order of magnitude or more difference in
performance between the slowest and fastest memory attached to that CPU.

With the current Linux code NUMA nodes are CPU-centric, so all the memory
attached to a given CPU will be lumped into the same NUMA node. This makes
it very difficult for userspace applications to understand the performance
of the different memory ranges attached to a given CPU.

We solve this issue by providing userspace with performance information on
individual memory ranges. This performance information is exposed via
sysfs:

# grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
mem_tgt2/firmware_id:1
mem_tgt2/is_cached:0
mem_tgt2/is_enabled:1
mem_tgt2/is_isolated:0
mem_tgt2/phys_addr_base:0x0
mem_tgt2/phys_length_bytes:0x800000000
mem_tgt2/local_init/read_bw_MBps:30720
mem_tgt2/local_init/read_lat_nsec:100
mem_tgt2/local_init/write_bw_MBps:30720
mem_tgt2/local_init/write_lat_nsec:100

This allows applications to easily find the memory that they want to use.
We expect that the existing NUMA APIs will be enhanced to use this new
information so that applications can continue to use them to select their
desired memory.
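
As an illustration, here is a minimal userspace sketch (example code only,
not part of this series) that reads the local performance attributes for a
memory target, assuming the sysfs layout shown above:

	/* example only: read a local_init performance attribute for a target */
	#include <stdio.h>

	static long read_tgt_attr(int tgt, const char *attr)
	{
		char path[128];
		long val = -1;
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/hmem/mem_tgt%d/local_init/%s",
			 tgt, attr);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%ld", &val) != 1)
			val = -1;
		fclose(f);
		return val;
	}

	int main(void)
	{
		printf("mem_tgt2 read latency: %ld nsec\n",
		       read_tgt_attr(2, "read_lat_nsec"));
		printf("mem_tgt2 read bandwidth: %ld MB/s\n",
		       read_tgt_attr(2, "read_bw_MBps"));
		return 0;
	}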

This series is built upon acpica-1705:

https://github.com/zetalog/linux/commits/acpica-1705

And you can find a working tree here:

https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs

==== Lots of Details ====

This patch set is only concerned with CPU-addressable memory types, not
on-device memory such as that covered by Jerome Glisse's HMM series:

https://lwn.net/Articles/726691/

This patch set works by enabling the new Heterogeneous Memory Attribute
Table (HMAT), defined in ACPI 6.2. One major conceptual change
in ACPI 6.2 related to this work is that proximity domains no longer need
to contain a processor. We can now have memory-only proximity domains,
which means that we can now have memory-only Linux NUMA nodes.

Here is an example configuration where we have a single processor, one
range of regular memory and one range of HBM:

 +---------------+   +----------------+
 | Processor     |   | Memory         |
 | prox domain 0 +---+ prox domain 1  |
 | NUMA node 1   |   | NUMA node 2    |
 +-------+-------+   +----------------+
         |
 +-------+----------+
 | HBM              |
 | prox domain 2    |
 | NUMA node 0      |
 +------------------+

This gives us one initiator (the processor) and two targets (the two memory
ranges). Each of these three has its own ACPI proximity domain and
associated Linux NUMA node. Note also that while there is a 1:1 mapping
from each proximity domain to each NUMA node, the numbers don't necessarily
match up. Additionally we can have extra NUMA nodes that don't map back to
ACPI proximity domains.

The above configuration could also have the processor and one of the two
memory ranges sharing a proximity domain and NUMA node, but for the
purposes of the HMAT the two memory ranges will always need to be
separated.

The overall goal of this series and of the HMAT is to allow users to
identify memory using its performance characteristics. This can broadly be
done in one of two ways:

Option 1: Provide the user with a way to map between proximity domains and
NUMA nodes and a way to access the HMAT directly (probably via
/sys/firmware/acpi/tables). Then, possibly through a library and a daemon,
provide an API so that applications can either request information about
memory ranges, or request memory allocations that meet a given set of
performance characteristics.

Option 2: Provide the user with HMAT performance data directly in sysfs,
allowing applications to directly access it without the need for the
library and daemon.

The kernel work for option 1 is started by patches 1-3. These just surface
the minimal amount of information in sysfs to allow userspace to map
between proximity domains and NUMA nodes so that the raw data in the HMAT
table can be understood.

Patches 4 and 5 enable option 2, adding performance information from the
HMAT to sysfs. The second option is complicated by the amount of HMAT data
that could be present in very large systems, so in this series we only
surface performance information for local (initiator,target) pairings. The
changelog for patch 5 discusses this in detail.

The naming collision between Jerome's "Heterogeneous Memory Management
(HMM)" and this "Heterogeneous Memory (HMEM)" series is unfortunate, but I
was trying to stick with the word "Heterogeneous" because of the name of
the ACPI 6.2 Heterogeneous Memory Attribute Table. Suggestions for
better naming are welcome.

==== Next steps ====

There is still a lot of work to be done on this series, but the overall
goal of this RFC is to gather feedback on which of the two options we
should pursue, or whether some third option is preferred. After that is
done and we have a solid direction we can add support for ACPI hot add,
test more complex configurations, etc.

So, for applications that need to differentiate between memory ranges based
on their performance, what option would work best for you? Is the local
(initiator,target) performance provided by patch 5 enough, or do you
require performance information for all possible (initiator,target)
pairings?

If option 1 looks best, do we have ideas on what the userspace API would
look like?

What other things should we consider, or what needs do you have that aren't
being addressed?

---
Changes from previous RFC (https://lwn.net/Articles/724562/):

- Allow multiple initiators to be local to a given memory target, as long
as they all have the same performance characteristics. (Dan Williams)

- A few small fixes to the ACPI parsing to allow for configurations I
hadn't previously considered.

Ross Zwisler (5):
acpi: add missing include in acpi_numa.h
acpi: HMAT support in acpi_parse_entries_array()
hmem: add heterogeneous memory sysfs support
sysfs: add sysfs_add_group_link()
hmem: add performance attributes

MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 835 ++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 64 +++
drivers/acpi/hmem/initiator.c | 61 +++
drivers/acpi/hmem/perf_attributes.c | 56 +++
drivers/acpi/hmem/target.c | 97 +++++
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++-
fs/sysfs/group.c | 30 +-
include/acpi/acpi_numa.h | 1 +
include/linux/sysfs.h | 2 +
15 files changed, 1197 insertions(+), 19 deletions(-)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/perf_attributes.c
create mode 100644 drivers/acpi/hmem/target.c

--
2.9.4


2017-07-06 21:52:49

by Ross Zwisler

Subject: [RFC v2 2/5] acpi: HMAT support in acpi_parse_entries_array()

The current implementation of acpi_parse_entries_array() assumes that each
subtable has a standard ACPI subtable header of type struct
acpi_subtable_header. This standard subtable header has a one byte type
followed by a one byte length.

The HMAT subtables have to allow for a longer length so they have subtable
headers of type struct acpi_hmat_structure which has a 2 byte type and a 4
byte length.
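
For reference, the two header layouts look roughly like this (abridged from
the ACPICA definitions):

	/* standard subtable header: 1 byte type, 1 byte length */
	struct acpi_subtable_header {
		u8 type;
		u8 length;
	};

	/* HMAT subtable header: 2 byte type, 4 byte length */
	struct acpi_hmat_structure {
		u16 type;
		u16 reserved;
		u32 length;
	};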

Enhance the subtable parsing in acpi_parse_entries_array() so that it can
handle these new HMAT subtables.

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index edb0c79..917f1cc 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -443,7 +443,7 @@ int __init acpi_numa_init(void)
* So go over all cpu entries in SRAT to get apicid to node mapping.
*/

- /* SRAT: Static Resource Affinity Table */
+ /* SRAT: System Resource Affinity Table */
if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
struct acpi_subtable_proc srat_proc[3];

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index ff42539..7979171 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
}

+static unsigned long __init
+acpi_get_entry_type(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->type;
+ else
+ return ((struct acpi_subtable_header *)entry)->type;
+}
+
+static unsigned long __init
+acpi_get_entry_length(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->length;
+ else
+ return ((struct acpi_subtable_header *)entry)->length;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(char *id)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return sizeof(struct acpi_hmat_structure);
+ else
+ return sizeof(struct acpi_subtable_header);
+}
+
/**
* acpi_parse_entries_array - for each proc_num find a suitable subtable
*
@@ -242,10 +269,10 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
struct acpi_subtable_proc *proc, int proc_num,
unsigned int max_entries)
{
- struct acpi_subtable_header *entry;
- unsigned long table_end;
+ unsigned long table_end, subtable_header_length;
int count = 0;
int errs = 0;
+ void *entry;
int i;

if (acpi_disabled)
@@ -263,19 +290,23 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
}

table_end = (unsigned long)table_header + table_header->length;
+ subtable_header_length = acpi_get_subtable_header_length(id);

/* Parse all entries looking for a match. */

- entry = (struct acpi_subtable_header *)
- ((unsigned long)table_header + table_size);
+ entry = (void *)table_header + table_size;
+
+ while (((unsigned long)entry) + subtable_header_length < table_end) {
+ unsigned long entry_type, entry_length;

- while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
- table_end) {
if (max_entries && count >= max_entries)
break;

+ entry_type = acpi_get_entry_type(id, entry);
+ entry_length = acpi_get_entry_length(id, entry);
+
for (i = 0; i < proc_num; i++) {
- if (entry->type != proc[i].id)
+ if (entry_type != proc[i].id)
continue;
if (!proc[i].handler ||
(!errs && proc[i].handler(entry, table_end))) {
@@ -290,16 +321,15 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
count++;

/*
- * If entry->length is 0, break from this loop to avoid
+ * If entry_length is 0, break from this loop to avoid
* infinite loop.
*/
- if (entry->length == 0) {
+ if (entry_length == 0) {
pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
return -EINVAL;
}

- entry = (struct acpi_subtable_header *)
- ((unsigned long)entry + entry->length);
+ entry += entry_length;
}

if (max_entries && count > max_entries) {
--
2.9.4

2017-07-06 21:52:54

by Ross Zwisler

Subject: [RFC v2 5/5] hmem: add performance attributes

Add performance information found in the HMAT to the sysfs representation.
This information lives as an attribute group named "local_init" in the
memory target:

# tree mem_tgt2/
mem_tgt2/
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── local_init
│   ├── mem_init0 -> ../../mem_init0
│   ├── mem_init1 -> ../../mem_init1
│   ├── mem_tgt2 -> ../../mem_tgt2
│   ├── read_bw_MBps
│   ├── read_lat_nsec
│   ├── write_bw_MBps
│   └── write_lat_nsec
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

This attribute group surfaces latency and bandwidth performance for a memory
target and its local initiators. For example:

# grep . mem_tgt2/local_init/* 2>/dev/null
mem_tgt2/local_init/read_bw_MBps:30720
mem_tgt2/local_init/read_lat_nsec:100
mem_tgt2/local_init/write_bw_MBps:30720
mem_tgt2/local_init/write_lat_nsec:100

The local initiators have a symlink to the performance information which lives
in the target's attribute group:

# ls -l mem_init0/mem_tgt2
lrwxrwxrwx. 1 root root 0 Jul 5 14:38 mem_init0/mem_tgt2 ->
../mem_tgt2/local_init

We create performance attribute groups only for local (initiator,target)
pairings, where the first local initiator for a given target is defined
by the "Processor Proximity Domain" field in the HMAT's Memory Subsystem
Address Range Structure. After we have that first local initiator we scan
the performance data and also link any other initiators with the same
local performance to the given memory target.

A given target only has one set of local performance values, so each target
will have at most one "local_init" attribute group, though that group can
contain links to multiple initiators that all have local performance. A
given memory initiator may have multiple local memory targets, so multiple
"mem_tgtX" links may exist for a given initiator.

If a given memory target is cached we give performance numbers only for the
media itself, and rely on the "is_cached" attribute to represent the
fact that there is a caching layer.

Exposing only a subset of the performance information present in the HMAT
via sysfs is a compromise, driven by the fact that these local pairings
will be the highest performing and that representing all possible paths
could cause an unmanageable explosion of sysfs entries.

If we dump everything from the HMAT into sysfs we end up with
O(num_targets * num_initiators * num_caching_levels) attributes. Each of
these attributes only takes up 2 bytes in a System Locality Latency and
Bandwidth Information Structure, but if we have to create a directory entry
for each it becomes much more expensive.

For example, very large systems today can have on the order of thousands of
NUMA nodes. Say we have a system that previously would have had 1,000 NUMA
nodes, each with both a CPU and local memory. The HMAT allows us to separate the
CPUs and memory into separate NUMA nodes, so we can end up with 1,000 CPU
initiator NUMA nodes and 1,000 memory target NUMA nodes. If we represented
the performance information for each possible CPU/memory pair in sysfs we
would end up with 1,000,000 attribute groups.

This amount of data is acceptable in a set of packed data tables, but I
think we'll break sysfs if we try to create millions of attributes,
regardless of how we nest them in a directory hierarchy.

By only representing performance information for local (initiator,target)
pairings, we reduce the number of sysfs entries to O(num_targets).

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/hmem/Makefile | 2 +-
drivers/acpi/hmem/core.c | 268 +++++++++++++++++++++++++++++++++++-
drivers/acpi/hmem/hmem.h | 17 +++
drivers/acpi/hmem/perf_attributes.c | 56 ++++++++
4 files changed, 341 insertions(+), 2 deletions(-)
create mode 100644 drivers/acpi/hmem/perf_attributes.c

diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
index d2aa546..44e8304 100644
--- a/drivers/acpi/hmem/Makefile
+++ b/drivers/acpi/hmem/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_ACPI_HMEM) := hmem.o
-hmem-y := core.o initiator.o target.o
+hmem-y := core.o initiator.o target.o perf_attributes.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
index f7638db..8f3ab35 100644
--- a/drivers/acpi/hmem/core.c
+++ b/drivers/acpi/hmem/core.c
@@ -23,11 +23,230 @@
#include <linux/slab.h>
#include "hmem.h"

+#define NO_VALUE -1
+#define LOCAL_INIT "local_init"
+
static LIST_HEAD(target_list);
static LIST_HEAD(initiator_list);
+LIST_HEAD(locality_list);

static bool bad_hmem;

+/* Performance attributes for an initiator/target pair. */
+static int get_performance_data(u32 init_pxm, u32 tgt_pxm,
+ struct acpi_hmat_locality *hmat_loc)
+{
+ int num_init = hmat_loc->number_of_initiator_Pds;
+ int num_tgt = hmat_loc->number_of_target_Pds;
+ int init_idx = NO_VALUE;
+ int tgt_idx = NO_VALUE;
+ u32 *initiators, *targets;
+ u16 *entries, val;
+ int i;
+
+ /* the initiator array is after the struct acpi_hmat_locality fields */
+ initiators = (u32 *)(hmat_loc + 1);
+ targets = &initiators[num_init];
+ entries = (u16 *)&targets[num_tgt];
+
+ for (i = 0; i < num_init; i++) {
+ if (initiators[i] == init_pxm) {
+ init_idx = i;
+ break;
+ }
+ }
+
+ if (init_idx == NO_VALUE)
+ return NO_VALUE;
+
+ for (i = 0; i < num_tgt; i++) {
+ if (targets[i] == tgt_pxm) {
+ tgt_idx = i;
+ break;
+ }
+ }
+
+ if (tgt_idx == NO_VALUE)
+ return NO_VALUE;
+
+ val = entries[init_idx*num_tgt + tgt_idx];
+ if (val < 10 || val == 0xFFFF)
+ return NO_VALUE;
+
+ return (val * hmat_loc->entry_base_unit) / 10;
+}
+
+/*
+ * 'direction' is either READ or WRITE
+ * Latency is reported in nanoseconds and bandwidth is reported in MB/s.
+ */
+static int hmem_get_attribute(int init_pxm, int tgt_pxm, int direction,
+ enum hmem_attr_type type)
+{
+ struct memory_locality *loc;
+ int value;
+
+ list_for_each_entry(loc, &locality_list, list) {
+ struct acpi_hmat_locality *hmat_loc = loc->hmat_loc;
+
+ if (direction == READ && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_READ_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == READ && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_READ_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+ }
+
+ return NO_VALUE;
+}
+
+/*
+ * 'direction' is either READ or WRITE
+ * Latency is reported in nanoseconds and bandwidth is reported in MB/s.
+ */
+int hmem_local_attribute(struct device *tgt_dev, int direction,
+ enum hmem_attr_type type)
+{
+ struct memory_target *tgt = to_memory_target(tgt_dev);
+ int tgt_pxm = tgt->ma->proximity_domain;
+ int init_pxm;
+
+ if (!tgt->local_init)
+ return NO_VALUE;
+
+ init_pxm = tgt->local_init->pxm;
+ return hmem_get_attribute(init_pxm, tgt_pxm, direction, type);
+}
+
+static bool is_local_init(int init_pxm, int tgt_pxm, int read_lat,
+ int write_lat, int read_bw, int write_bw)
+{
+ if (read_lat != hmem_get_attribute(init_pxm, tgt_pxm, READ, LATENCY))
+ return false;
+
+ if (write_lat != hmem_get_attribute(init_pxm, tgt_pxm, WRITE, LATENCY))
+ return false;
+
+ if (read_bw != hmem_get_attribute(init_pxm, tgt_pxm, READ, BANDWIDTH))
+ return false;
+
+ if (write_bw != hmem_get_attribute(init_pxm, tgt_pxm, WRITE, BANDWIDTH))
+ return false;
+
+ return true;
+}
+
+static const struct attribute_group performance_attribute_group = {
+ .attrs = performance_attributes,
+ .name = LOCAL_INIT,
+};
+
+static void remove_performance_attributes(struct memory_target *tgt)
+{
+ if (!tgt->local_init)
+ return;
+
+ /*
+ * FIXME: Need to enhance the core sysfs code to remove all the links
+ * in both the attribute group and in the device itself when those are
+ * removed.
+ */
+ sysfs_remove_group(&tgt->dev.kobj, &performance_attribute_group);
+}
+
+static int add_performance_attributes(struct memory_target *tgt)
+{
+ int read_lat, write_lat, read_bw, write_bw;
+ int tgt_pxm = tgt->ma->proximity_domain;
+ struct kobject *init_kobj, *tgt_kobj;
+ struct device *init_dev, *tgt_dev;
+ struct memory_initiator *init;
+ int ret;
+
+ if (!tgt->local_init)
+ return 0;
+
+ tgt_dev = &tgt->dev;
+ tgt_kobj = &tgt_dev->kobj;
+
+ /* Create entries for initiator/target pair in the target. */
+ ret = sysfs_create_group(tgt_kobj, &performance_attribute_group);
+ if (ret < 0)
+ return ret;
+
+ ret = sysfs_add_link_to_group(tgt_kobj, LOCAL_INIT, tgt_kobj,
+ dev_name(tgt_dev));
+ if (ret < 0)
+ goto err;
+
+ /*
+ * Iterate through initiators and find all the ones that have the same
+ * performance as the local initiator.
+ */
+ read_lat = hmem_local_attribute(tgt_dev, READ, LATENCY);
+ write_lat = hmem_local_attribute(tgt_dev, WRITE, LATENCY);
+ read_bw = hmem_local_attribute(tgt_dev, READ, BANDWIDTH);
+ write_bw = hmem_local_attribute(tgt_dev, WRITE, BANDWIDTH);
+
+ list_for_each_entry(init, &initiator_list, list) {
+ init_dev = &init->dev;
+ init_kobj = &init_dev->kobj;
+
+ if (init == tgt->local_init ||
+ is_local_init(init->pxm, tgt_pxm, read_lat,
+ write_lat, read_bw, write_bw)) {
+ ret = sysfs_add_link_to_group(tgt_kobj, LOCAL_INIT,
+ init_kobj, dev_name(init_dev));
+ if (ret < 0)
+ goto err;
+
+ /*
+ * Create a link in the initiator to the performance
+ * attributes.
+ */
+ ret = sysfs_add_group_link(init_kobj, tgt_kobj,
+ LOCAL_INIT, dev_name(tgt_dev));
+ if (ret < 0)
+ goto err;
+ }
+
+ }
+ tgt->has_perf_attributes = true;
+ return 0;
+err:
+ remove_performance_attributes(tgt);
+ return ret;
+}
+
static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
{
if (node_devices[node])
@@ -178,6 +397,9 @@ static void release_memory_target(struct device *dev)

static void __init remove_memory_target(struct memory_target *tgt)
{
+ if (tgt->has_perf_attributes)
+ remove_performance_attributes(tgt);
+
if (tgt->is_registered) {
remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
&tgt->dev.kobj);
@@ -309,6 +531,38 @@ hmat_parse_address_range(struct acpi_subtable_header *header,
return -EINVAL;
}

+static int __init hmat_parse_locality(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_locality *hmat_loc;
+ struct memory_locality *loc;
+
+ if (bad_hmem)
+ return 0;
+
+ hmat_loc = (struct acpi_hmat_locality *)header;
+ if (!hmat_loc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ /* We don't report cached performance information in sysfs. */
+ if (hmat_loc->flags == ACPI_HMAT_MEMORY ||
+ hmat_loc->flags == ACPI_HMAT_LAST_LEVEL_CACHE) {
+ loc = kzalloc(sizeof(*loc), GFP_KERNEL);
+ if (!loc) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ loc->hmat_loc = hmat_loc;
+ list_add_tail(&loc->list, &locality_list);
+ }
+
+ return 0;
+}
+
static int __init hmat_parse_cache(struct acpi_subtable_header *header,
const unsigned long end)
{
@@ -464,6 +718,7 @@ srat_parse_memory_affinity(struct acpi_subtable_header *header,
static void hmem_cleanup(void)
{
struct memory_initiator *init, *init_iter;
+ struct memory_locality *loc, *loc_iter;
struct memory_target *tgt, *tgt_iter;

list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
@@ -471,6 +726,11 @@ static void hmem_cleanup(void)

list_for_each_entry_safe(init, init_iter, &initiator_list, list)
remove_memory_initiator(init);
+
+ list_for_each_entry_safe(loc, loc_iter, &locality_list, list) {
+ list_del(&loc->list);
+ kfree(loc);
+ }
}

static int __init hmem_init(void)
@@ -521,13 +781,15 @@ static int __init hmem_init(void)
}

if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
- struct acpi_subtable_proc hmat_proc[2];
+ struct acpi_subtable_proc hmat_proc[3];

memset(hmat_proc, 0, sizeof(hmat_proc));
hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
hmat_proc[0].handler = hmat_parse_address_range;
hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
hmat_proc[1].handler = hmat_parse_cache;
+ hmat_proc[2].id = ACPI_HMAT_TYPE_LOCALITY;
+ hmat_proc[2].handler = hmat_parse_locality;

acpi_table_parse_entries_array(ACPI_SIG_HMAT,
sizeof(struct acpi_table_hmat),
@@ -549,6 +811,10 @@ static int __init hmem_init(void)
ret = register_memory_target(tgt);
if (ret)
goto err;
+
+ ret = add_performance_attributes(tgt);
+ if (ret)
+ goto err;
}

return 0;
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
index 38ff540..71b10f0 100644
--- a/drivers/acpi/hmem/hmem.h
+++ b/drivers/acpi/hmem/hmem.h
@@ -16,6 +16,11 @@
#ifndef _ACPI_HMEM_H_
#define _ACPI_HMEM_H_

+enum hmem_attr_type {
+ LATENCY,
+ BANDWIDTH,
+};
+
struct memory_initiator {
struct list_head list;
struct device dev;
@@ -39,9 +44,21 @@ struct memory_target {

bool is_cached;
bool is_registered;
+ bool has_perf_attributes;
};
#define to_memory_target(d) container_of((d), struct memory_target, dev)

+struct memory_locality {
+ struct list_head list;
+ struct acpi_hmat_locality *hmat_loc;
+};
+
extern const struct attribute_group *memory_initiator_attribute_groups[];
extern const struct attribute_group *memory_target_attribute_groups[];
+extern struct attribute *performance_attributes[];
+
+extern struct list_head locality_list;
+
+int hmem_local_attribute(struct device *tgt_dev, int direction,
+ enum hmem_attr_type type);
#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/perf_attributes.c b/drivers/acpi/hmem/perf_attributes.c
new file mode 100644
index 0000000..72a710a
--- /dev/null
+++ b/drivers/acpi/hmem/perf_attributes.c
@@ -0,0 +1,56 @@
+/*
+ * Heterogeneous memory performance attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+static ssize_t read_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", hmem_local_attribute(dev, READ, LATENCY));
+}
+static DEVICE_ATTR_RO(read_lat_nsec);
+
+static ssize_t write_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", hmem_local_attribute(dev, WRITE, LATENCY));
+}
+static DEVICE_ATTR_RO(write_lat_nsec);
+
+static ssize_t read_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", hmem_local_attribute(dev, READ, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(read_bw_MBps);
+
+static ssize_t write_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n",
+ hmem_local_attribute(dev, WRITE, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(write_bw_MBps);
+
+struct attribute *performance_attributes[] = {
+ &dev_attr_read_lat_nsec.attr,
+ &dev_attr_write_lat_nsec.attr,
+ &dev_attr_read_bw_MBps.attr,
+ &dev_attr_write_bw_MBps.attr,
+ NULL
+};
--
2.9.4

2017-07-06 21:53:16

by Ross Zwisler

Subject: [RFC v2 4/5] sysfs: add sysfs_add_group_link()

The current __compat_only_sysfs_link_entry_to_kobj() code allows us to
create symbolic links in sysfs to groups or attributes. Something like:

/sys/.../entry1/groupA -> /sys/.../entry2/groupA

This patch extends this functionality with a new sysfs_add_group_link()
call that allows the link to have a different name than the group or
attribute, so:

/sys/.../entry1/link_name -> /sys/.../entry2/groupA

__compat_only_sysfs_link_entry_to_kobj() now just calls
sysfs_add_group_link(), passing in the same name for both the
group/attribute and for the link name.
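
In other words (sketch only, with hypothetical kobjects):

	/* old call: the link gets the same name as the group or attribute */
	__compat_only_sysfs_link_entry_to_kobj(kobj, target_kobj, "groupA");
	/* is now equivalent to */
	sysfs_add_group_link(kobj, target_kobj, "groupA", "groupA");

	/* new: link to the "local_init" group, but name the link "mem_tgt2" */
	sysfs_add_group_link(init_kobj, tgt_kobj, "local_init", "mem_tgt2");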

This is needed by the ACPI HMAT enabling work because we want to have a
group of performance attributes that live in a memory target. This group
represents the performance between a memory target and its local
initiator. In the target the attribute group is named "local_init":

# tree mem_tgt2/local_init/
mem_tgt2/local_init/
├── mem_init0 -> ../../mem_init0
├── mem_tgt2 -> ../../mem_tgt2
├── read_bw_MBps
├── read_lat_nsec
├── write_bw_MBps
└── write_lat_nsec

We then want to link to this attribute group from the initiator, but change
the name of the link to "mem_tgtX" since we're now looking at it from the
initiator's perspective, and because a given initiator can have multiple
local memory targets:

# ls -l mem_init0/mem_tgt2
lrwxrwxrwx. 1 root root 0 Jul 5 14:38 mem_init0/mem_tgt2 ->
../mem_tgt2/local_init

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/sysfs/group.c | 30 +++++++++++++++++++++++-------
include/linux/sysfs.h | 2 ++
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index ac2de0e..19db57c8 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -367,15 +367,15 @@ void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
EXPORT_SYMBOL_GPL(sysfs_remove_link_from_group);

/**
- * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
- * to a group or an attribute
+ * sysfs_add_group_link - add a symlink to a kobject pointing to a group or
+ * an attribute
* @kobj: The kobject containing the group.
* @target_kobj: The target kobject.
* @target_name: The name of the target group or attribute.
+ * @link_name: The name of the link to the target group or attribute.
*/
-int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
- struct kobject *target_kobj,
- const char *target_name)
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name)
{
struct kernfs_node *target;
struct kernfs_node *entry;
@@ -400,12 +400,28 @@ int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
return -ENOENT;
}

- link = kernfs_create_link(kobj->sd, target_name, entry);
+ link = kernfs_create_link(kobj->sd, link_name, entry);
if (IS_ERR(link) && PTR_ERR(link) == -EEXIST)
- sysfs_warn_dup(kobj->sd, target_name);
+ sysfs_warn_dup(kobj->sd, link_name);

kernfs_put(entry);
kernfs_put(target);
return IS_ERR(link) ? PTR_ERR(link) : 0;
}
+EXPORT_SYMBOL_GPL(sysfs_add_group_link);
+
+/**
+ * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
+ * to a group or an attribute
+ * @kobj: The kobject containing the group.
+ * @target_kobj: The target kobject.
+ * @target_name: The name of the target group or attribute.
+ */
+int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+ struct kobject *target_kobj,
+ const char *target_name)
+{
+ return sysfs_add_group_link(kobj, target_kobj, target_name,
+ target_name);
+}
EXPORT_SYMBOL_GPL(__compat_only_sysfs_link_entry_to_kobj);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c6f0f0d..865f499 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -278,6 +278,8 @@ int sysfs_add_link_to_group(struct kobject *kobj, const char *group_name,
struct kobject *target, const char *link_name);
void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
const char *link_name);
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name);
int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
struct kobject *target_kobj,
const char *target_name);
--
2.9.4

2017-07-06 21:53:14

by Ross Zwisler

Subject: [RFC v2 1/5] acpi: add missing include in acpi_numa.h

Right now if a file includes acpi_numa.h and doesn't happen to include
linux/numa.h before it, it gets the following warning:

./include/acpi/acpi_numa.h:9:5: warning: "MAX_NUMNODES" is not defined [-Wundef]
#if MAX_NUMNODES > 256
^~~~~~~~~~~~

Signed-off-by: Ross Zwisler <[email protected]>
---
include/acpi/acpi_numa.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index d4b7294..1e3a74f 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -3,6 +3,7 @@

#ifdef CONFIG_ACPI_NUMA
#include <linux/kernel.h>
+#include <linux/numa.h>

/* Proximity bitmap length */
#if MAX_NUMNODES > 256
--
2.9.4

2017-07-06 21:53:53

by Ross Zwisler

Subject: [RFC v2 3/5] hmem: add heterogeneous memory sysfs support

Add a new sysfs subsystem, /sys/devices/system/hmem, which surfaces
information about memory initiators and memory targets to the user. These
initiators and targets are described by the ACPI SRAT and HMAT tables.

A "memory initiator" in this case is any device such as a CPU or a separate
memory I/O device that can initiate a memory request. A "memory target" is
a CPU-accessible physical address range.

The key piece of information surfaced by this patch is the mapping between
the ACPI table "proximity domain" numbers, held in the "firmware_id"
attribute, and Linux NUMA node numbers.

Initiators are found at /sys/devices/system/hmem/mem_initX, and the
attributes for a given initiator look like this:

# tree mem_init0/
mem_init0/
├── cpu0 -> ../../cpu/cpu0
├── firmware_id
├── is_enabled
├── node0 -> ../../node/node0
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

Where "mem_init0" on my system represents the CPU acting as a memory
initiator at NUMA node 0.

Targets are found at /sys/devices/system/hmem/mem_tgtX, and the attributes
for a given target look like this:

# tree mem_tgt2/
mem_tgt2/
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent
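
As an example of consuming this mapping (sketch only, not part of this
series), userspace can read "firmware_id" for the ACPI proximity domain and
find the "nodeX" symlink for the corresponding Linux NUMA node:

	/* example only: map a target's proximity domain to its NUMA node */
	#include <dirent.h>
	#include <stdio.h>

	int main(void)
	{
		const char *tgt = "/sys/devices/system/hmem/mem_tgt2";
		char path[256];
		int pxm = -1, node = -1;
		struct dirent *de;
		FILE *f;
		DIR *d;

		snprintf(path, sizeof(path), "%s/firmware_id", tgt);
		f = fopen(path, "r");
		if (f) {
			if (fscanf(f, "%d", &pxm) != 1)
				pxm = -1;
			fclose(f);
		}

		d = opendir(tgt);
		if (d) {
			while ((de = readdir(d)))
				if (sscanf(de->d_name, "node%d", &node) == 1)
					break;
			closedir(d);
		}

		printf("proximity domain %d -> NUMA node %d\n", pxm, node);
		return 0;
	}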

Signed-off-by: Ross Zwisler <[email protected]>
---
MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 569 ++++++++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 47 ++++
drivers/acpi/hmem/initiator.c | 61 +++++
drivers/acpi/hmem/target.c | 97 +++++++
9 files changed, 790 insertions(+)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/target.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 053c3bd..554b833 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6085,6 +6085,11 @@ S: Supported
F: drivers/scsi/hisi_sas/
F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt

+HMEM (ACPI HETEROGENEOUS MEMORY SUPPORT)
+M: Ross Zwisler <[email protected]>
+S: Supported
+F: drivers/acpi/hmem/
+
HOST AP DRIVER
M: Jouni Malinen <[email protected]>
L: [email protected]
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 1ce52f8..44dd97f 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -460,6 +460,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
If you are unsure what to do, do not enable this option.

source "drivers/acpi/nfit/Kconfig"
+source "drivers/acpi/hmem/Kconfig"

source "drivers/acpi/apei/Kconfig"
source "drivers/acpi/dptf/Kconfig"
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index b1aacfc..31e3f20 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
obj-$(CONFIG_ACPI) += container.o
obj-$(CONFIG_ACPI_THERMAL) += thermal.o
obj-$(CONFIG_ACPI_NFIT) += nfit/
+obj-$(CONFIG_ACPI_HMEM) += hmem/
obj-$(CONFIG_ACPI) += acpi_memhotplug.o
obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
obj-$(CONFIG_ACPI_BATTERY) += battery.o
diff --git a/drivers/acpi/hmem/Kconfig b/drivers/acpi/hmem/Kconfig
new file mode 100644
index 0000000..09282be
--- /dev/null
+++ b/drivers/acpi/hmem/Kconfig
@@ -0,0 +1,7 @@
+config ACPI_HMEM
+ bool "ACPI Heterogeneous Memory Support"
+ depends on ACPI_NUMA
+ depends on SYSFS
+ help
+ Exports a sysfs representation of the ACPI Heterogeneous Memory
+ Attributes Table (HMAT).
diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
new file mode 100644
index 0000000..d2aa546
--- /dev/null
+++ b/drivers/acpi/hmem/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_ACPI_HMEM) := hmem.o
+hmem-y := core.o initiator.o target.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
new file mode 100644
index 0000000..f7638db
--- /dev/null
+++ b/drivers/acpi/hmem/core.c
@@ -0,0 +1,569 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "hmem.h"
+
+static LIST_HEAD(target_list);
+static LIST_HEAD(initiator_list);
+
+static bool bad_hmem;
+
+static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ return sysfs_create_link(kobj, &node_devices[node]->dev.kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+
+ return 0;
+}
+
+static void remove_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ sysfs_remove_link(kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+}
+
+#define HMEM_CLASS_NAME "hmem"
+
+static struct bus_type hmem_subsys = {
+ /*
+ * .dev_name is set before device_register() based on the type of
+ * device we are registering.
+ */
+ .name = HMEM_CLASS_NAME,
+};
+
+/* memory initiators */
+static int link_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ return sysfs_create_link(&init->dev.kobj,
+ &cpu_dev->kobj,
+ kobject_name(&cpu_dev->kobj));
+ }
+
+ }
+ return 0;
+}
+
+static void remove_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ sysfs_remove_link(&init->dev.kobj,
+ kobject_name(&cpu_dev->kobj));
+ return;
+ }
+
+ }
+}
+
+static void release_memory_initiator(struct device *dev)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ list_del(&init->list);
+ kfree(init);
+}
+
+static void __init remove_memory_initiator(struct memory_initiator *init)
+{
+ if (init->is_registered) {
+ remove_cpu_under_mem_init(init);
+ remove_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+ device_unregister(&init->dev);
+ } else
+ release_memory_initiator(&init->dev);
+}
+
+static int __init register_memory_initiator(struct memory_initiator *init)
+{
+ int ret;
+
+ hmem_subsys.dev_name = "mem_init";
+ init->dev.bus = &hmem_subsys;
+ init->dev.id = pxm_to_node(init->pxm);
+ init->dev.release = release_memory_initiator;
+ init->dev.groups = memory_initiator_attribute_groups;
+
+ ret = device_register(&init->dev);
+ if (ret < 0)
+ return ret;
+
+ init->is_registered = true;
+
+ ret = link_cpu_under_mem_init(init);
+ if (ret < 0)
+ return ret;
+
+ return link_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+}
+
+static struct memory_initiator * __init add_memory_initiator(int pxm)
+{
+ struct memory_initiator *init;
+
+ if (pxm_to_node(pxm) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", pxm);
+ bad_hmem = true;
+ return ERR_PTR(-EINVAL);
+ }
+
+ /*
+ * Make sure we haven't already added an initiator for this proximity
+ * domain. We don't care about any other differences in the SRAT
+ * tables (apic_id, etc), so we just use the data from the first table
+ * we see for a given proximity domain.
+ */
+ list_for_each_entry(init, &initiator_list, list)
+ if (init->pxm == pxm)
+ return 0;
+
+ init = kzalloc(sizeof(*init), GFP_KERNEL);
+ if (!init) {
+ bad_hmem = true;
+ return ERR_PTR(-ENOMEM);
+ }
+
+ init->pxm = pxm;
+
+ list_add_tail(&init->list, &initiator_list);
+ return init;
+}
+
+/* memory targets */
+static void release_memory_target(struct device *dev)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ list_del(&tgt->list);
+ kfree(tgt);
+}
+
+static void __init remove_memory_target(struct memory_target *tgt)
+{
+ if (tgt->is_registered) {
+ remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+ device_unregister(&tgt->dev);
+ } else
+ release_memory_target(&tgt->dev);
+}
+
+static int __init register_memory_target(struct memory_target *tgt)
+{
+ int ret;
+
+ if (!tgt->ma || !tgt->spa) {
+ pr_err("HMEM: Incomplete memory target found\n");
+ return -EINVAL;
+ }
+
+ hmem_subsys.dev_name = "mem_tgt";
+ tgt->dev.bus = &hmem_subsys;
+ tgt->dev.id = pxm_to_node(tgt->ma->proximity_domain);
+ tgt->dev.release = release_memory_target;
+ tgt->dev.groups = memory_target_attribute_groups;
+
+ ret = device_register(&tgt->dev);
+ if (ret < 0)
+ return ret;
+
+ tgt->is_registered = true;
+
+ return link_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+}
+
+static int __init add_memory_target(struct acpi_srat_mem_affinity *ma)
+{
+ struct memory_target *tgt;
+
+ if (pxm_to_node(ma->proximity_domain) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", ma->proximity_domain);
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ tgt = kzalloc(sizeof(*tgt), GFP_KERNEL);
+ if (!tgt) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ tgt->ma = ma;
+
+ list_add_tail(&tgt->list, &target_list);
+ return 0;
+}
+
+/* ACPI parsing code, starting with the HMAT */
+static int __init hmem_noop_parse(struct acpi_table_header *table)
+{
+ /* real work done by the hmat_parse_* and srat_parse_* routines */
+ return 0;
+}
+
+static bool __init hmat_spa_matches_srat(struct acpi_hmat_address_range *spa,
+ struct acpi_srat_mem_affinity *ma)
+{
+ if (spa->physical_address_base != ma->base_address ||
+ spa->physical_address_length != ma->length)
+ return false;
+
+ return true;
+}
+
+static void find_local_initiator(struct memory_target *tgt)
+{
+ struct memory_initiator *init;
+
+ if (!(tgt->spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) ||
+ pxm_to_node(tgt->spa->processor_PD) == NUMA_NO_NODE)
+ return;
+
+ list_for_each_entry(init, &initiator_list, list) {
+ if (init->pxm == tgt->spa->processor_PD) {
+ tgt->local_init = init;
+ return;
+ }
+ }
+}
+
+/* ACPI HMAT parsing routines */
+static int __init
+hmat_parse_address_range(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_address_range *spa;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ spa = (struct acpi_hmat_address_range *)header;
+ if (!spa) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (spa->header.length != sizeof(*spa)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ spa->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if ((spa->flags & ACPI_HMAT_MEMORY_PD_VALID) &&
+ spa->memory_PD == tgt->ma->proximity_domain) {
+ if (!hmat_spa_matches_srat(spa, tgt->ma)) {
+ pr_err("HMEM: SRAT and HMAT disagree on "
+ "address range info\n");
+ goto err;
+ }
+ tgt->spa = spa;
+ find_local_initiator(tgt);
+ return 0;
+ }
+ }
+
+ return 0;
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+static int __init hmat_parse_cache(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_cache *cache;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ cache = (struct acpi_hmat_cache *)header;
+ if (!cache) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (cache->header.length < sizeof(*cache)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ cache->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if (cache->memory_PD == tgt->ma->proximity_domain) {
+ tgt->is_cached = true;
+ return 0;
+ }
+ }
+
+ pr_err("HMEM: Couldn't find cached target PXM %d\n", cache->memory_PD);
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+/*
+ * SRAT parsing. We use srat_disabled() and pxm_to_node() so we don't redo
+ * any of the SRAT sanity checking done in drivers/acpi/numa.c.
+ */
+static int __init
+srat_parse_processor_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_cpu_affinity *cpu;
+ struct memory_initiator *init;
+ u32 pxm;
+
+ if (bad_hmem)
+ return 0;
+
+ cpu = (struct acpi_srat_cpu_affinity *)header;
+ if (!cpu) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ pxm = cpu->proximity_domain_lo;
+ if (acpi_srat_revision >= 2)
+ pxm |= *((unsigned int *)cpu->proximity_domain_hi) << 8;
+
+ if (!(cpu->flags & ACPI_SRAT_CPU_ENABLED))
+ return 0;
+
+ init = add_memory_initiator(pxm);
+ if (IS_ERR_OR_NULL(init))
+ return PTR_ERR(init);
+
+ init->cpu = cpu;
+ return 0;
+}
+
+static int __init
+srat_parse_x2apic_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ x2apic = (struct acpi_srat_x2apic_cpu_affinity *)header;
+ if (!x2apic) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ if (!(x2apic->flags & ACPI_SRAT_CPU_ENABLED))
+ return 0;
+
+ init = add_memory_initiator(x2apic->proximity_domain);
+ if (IS_ERR_OR_NULL(init))
+ return PTR_ERR(init);
+
+ init->x2apic = x2apic;
+ return 0;
+}
+
+static int __init
+srat_parse_gicc_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_gicc_affinity *gicc;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ gicc = (struct acpi_srat_gicc_affinity *)header;
+ if (!gicc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ if (!(gicc->flags & ACPI_SRAT_GICC_ENABLED))
+ return 0;
+
+ init = add_memory_initiator(gicc->proximity_domain);
+ if (IS_ERR_OR_NULL(init))
+ return PTR_ERR(init);
+
+ init->gicc = gicc;
+ return 0;
+}
+
+static int __init
+srat_parse_memory_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_mem_affinity *ma;
+
+ if (bad_hmem)
+ return 0;
+
+ ma = (struct acpi_srat_mem_affinity *)header;
+ if (!ma) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+ return 0;
+
+ return add_memory_target(ma);
+}
+
+/*
+ * Remove our sysfs entries, unregister our devices and free allocated memory.
+ */
+static void hmem_cleanup(void)
+{
+ struct memory_initiator *init, *init_iter;
+ struct memory_target *tgt, *tgt_iter;
+
+ list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
+ remove_memory_target(tgt);
+
+ list_for_each_entry_safe(init, init_iter, &initiator_list, list)
+ remove_memory_initiator(init);
+}
+
+static int __init hmem_init(void)
+{
+ struct acpi_table_header *tbl;
+ struct memory_initiator *init;
+ struct memory_target *tgt;
+ acpi_status status = AE_OK;
+ int ret;
+
+ if (srat_disabled())
+ return 0;
+
+ /*
+ * We take a permanent reference to both the HMAT and SRAT in ACPI
+ * memory so we can keep pointers to their subtables. These tables
+ * already had references on them which would never be released, taken
+ * by acpi_sysfs_init(), so this shouldn't negatively impact anything.
+ */
+ status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ ret = subsys_system_register(&hmem_subsys, NULL);
+ if (ret)
+ return ret;
+
+ if (!acpi_table_parse(ACPI_SIG_SRAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc srat_proc[4];
+
+ memset(srat_proc, 0, sizeof(srat_proc));
+ srat_proc[0].id = ACPI_SRAT_TYPE_CPU_AFFINITY;
+ srat_proc[0].handler = srat_parse_processor_affinity;
+ srat_proc[1].id = ACPI_SRAT_TYPE_X2APIC_CPU_AFFINITY;
+ srat_proc[1].handler = srat_parse_x2apic_affinity;
+ srat_proc[2].id = ACPI_SRAT_TYPE_GICC_AFFINITY;
+ srat_proc[2].handler = srat_parse_gicc_affinity;
+ srat_proc[3].id = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
+ srat_proc[3].handler = srat_parse_memory_affinity;
+
+ acpi_table_parse_entries_array(ACPI_SIG_SRAT,
+ sizeof(struct acpi_table_srat),
+ srat_proc, ARRAY_SIZE(srat_proc), 0);
+ }
+
+ if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc hmat_proc[2];
+
+ memset(hmat_proc, 0, sizeof(hmat_proc));
+ hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
+ hmat_proc[0].handler = hmat_parse_address_range;
+ hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
+ hmat_proc[1].handler = hmat_parse_cache;
+
+ acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ hmat_proc, ARRAY_SIZE(hmat_proc), 0);
+ }
+
+ if (bad_hmem) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ list_for_each_entry(init, &initiator_list, list) {
+ ret = register_memory_initiator(init);
+ if (ret)
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ ret = register_memory_target(tgt);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ pr_err("HMEM: Error during initialization\n");
+ hmem_cleanup();
+ return ret;
+}
+
+static __exit void hmem_exit(void)
+{
+ hmem_cleanup();
+}
+
+module_init(hmem_init);
+module_exit(hmem_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
new file mode 100644
index 0000000..38ff540
--- /dev/null
+++ b/drivers/acpi/hmem/hmem.h
@@ -0,0 +1,47 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _ACPI_HMEM_H_
+#define _ACPI_HMEM_H_
+
+struct memory_initiator {
+ struct list_head list;
+ struct device dev;
+
+ /* only one of the following three will be set */
+ struct acpi_srat_cpu_affinity *cpu;
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct acpi_srat_gicc_affinity *gicc;
+
+ int pxm;
+ bool is_registered;
+};
+#define to_memory_initiator(d) container_of((d), struct memory_initiator, dev)
+
+struct memory_target {
+ struct list_head list;
+ struct device dev;
+ struct acpi_srat_mem_affinity *ma;
+ struct acpi_hmat_address_range *spa;
+ struct memory_initiator *local_init;
+
+ bool is_cached;
+ bool is_registered;
+};
+#define to_memory_target(d) container_of((d), struct memory_target, dev)
+
+extern const struct attribute_group *memory_initiator_attribute_groups[];
+extern const struct attribute_group *memory_target_attribute_groups[];
+#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/initiator.c b/drivers/acpi/hmem/initiator.c
new file mode 100644
index 0000000..905f030
--- /dev/null
+++ b/drivers/acpi/hmem/initiator.c
@@ -0,0 +1,61 @@
+/*
+ * Heterogeneous memory initiator sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ return sprintf(buf, "%d\n", init->pxm);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+ int is_enabled;
+
+ if (init->cpu)
+ is_enabled = !!(init->cpu->flags & ACPI_SRAT_CPU_ENABLED);
+ else if (init->x2apic)
+ is_enabled = !!(init->x2apic->flags & ACPI_SRAT_CPU_ENABLED);
+ else
+ is_enabled = !!(init->gicc->flags & ACPI_SRAT_GICC_ENABLED);
+
+ return sprintf(buf, "%d\n", is_enabled);
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_initiator_attributes[] = {
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_enabled.attr,
+ NULL,
+};
+
+static struct attribute_group memory_initiator_attribute_group = {
+ .attrs = memory_initiator_attributes,
+};
+
+const struct attribute_group *memory_initiator_attribute_groups[] = {
+ &memory_initiator_attribute_group,
+ NULL,
+};
diff --git a/drivers/acpi/hmem/target.c b/drivers/acpi/hmem/target.c
new file mode 100644
index 0000000..dd57437
--- /dev/null
+++ b/drivers/acpi/hmem/target.c
@@ -0,0 +1,97 @@
+/*
+ * Heterogeneous memory target sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+/* attributes for memory targets */
+static ssize_t phys_addr_base_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->base_address);
+}
+static DEVICE_ATTR_RO(phys_addr_base);
+
+static ssize_t phys_length_bytes_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->length);
+}
+static DEVICE_ATTR_RO(phys_length_bytes);
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->ma->proximity_domain);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_cached_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->is_cached);
+}
+static DEVICE_ATTR_RO(is_cached);
+
+static ssize_t is_isolated_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->spa->flags & ACPI_HMAT_RESERVATION_HINT));
+}
+static DEVICE_ATTR_RO(is_isolated);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->ma->flags & ACPI_SRAT_MEM_ENABLED));
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_target_attributes[] = {
+ &dev_attr_phys_addr_base.attr,
+ &dev_attr_phys_length_bytes.attr,
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_cached.attr,
+ &dev_attr_is_isolated.attr,
+ &dev_attr_is_enabled.attr,
+ NULL
+};
+
+/* attributes which are present for all memory targets */
+static struct attribute_group memory_target_attribute_group = {
+ .attrs = memory_target_attributes,
+};
+
+const struct attribute_group *memory_target_attribute_groups[] = {
+ &memory_target_attribute_group,
+ NULL,
+};
--
2.9.4
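
For readers following the sysfs wiring: the memory_initiator_attribute_groups[]
and memory_target_attribute_groups[] arrays above are consumed in core.c (part
of this same patch, not reproduced here) when the initiator and target devices
are registered. A minimal sketch of what that hookup typically looks like; the
helper name, the id argument and the error handling are assumptions, not code
from the posted series:

	#include <linux/device.h>
	#include "hmem.h"

	/* hypothetical helper: only the ->groups hookup and the mem_tgtN
	 * device naming are taken from the series, the rest is assumed */
	static int hmem_register_target(struct memory_target *tgt, int id)
	{
		int ret;

		tgt->dev.groups = memory_target_attribute_groups;
		dev_set_name(&tgt->dev, "mem_tgt%d", id);

		ret = device_register(&tgt->dev);
		if (ret)
			return ret;

		tgt->is_registered = true;
		return 0;
	}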

2017-07-06 22:08:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC v2 1/5] acpi: add missing include in acpi_numa.h

On Thu, Jul 6, 2017 at 11:52 PM, Ross Zwisler
<[email protected]> wrote:
> Right now if a file includes acpi_numa.h and they don't happen to include
> linux/numa.h before it, they get the following warning:
>
> ./include/acpi/acpi_numa.h:9:5: warning: "MAX_NUMNODES" is not defined [-Wundef]
> #if MAX_NUMNODES > 256
> ^~~~~~~~~~~~
>
> Signed-off-by: Ross Zwisler <[email protected]>

Acked-by: Rafael J. Wysocki <[email protected]>

> ---
> include/acpi/acpi_numa.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
> index d4b7294..1e3a74f 100644
> --- a/include/acpi/acpi_numa.h
> +++ b/include/acpi/acpi_numa.h
> @@ -3,6 +3,7 @@
>
> #ifdef CONFIG_ACPI_NUMA
> #include <linux/kernel.h>
> +#include <linux/numa.h>
>
> /* Proximity bitmap length */
> #if MAX_NUMNODES > 256
> --
> 2.9.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2017-07-06 22:13:57

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC v2 2/5] acpi: HMAT support in acpi_parse_entries_array()

On Thu, Jul 6, 2017 at 11:52 PM, Ross Zwisler
<[email protected]> wrote:
> The current implementation of acpi_parse_entries_array() assumes that each
> subtable has a standard ACPI subtable entry of type struct
> acpi_subtable_header. This standard subtable header has a one byte type
> followed by a one byte length.
>
> The HMAT subtables have to allow for a longer length so they have subtable
> headers of type struct acpi_hmat_structure which has a 2 byte type and a 4
> byte length.
>
> Enhance the subtable parsing in acpi_parse_entries_array() so that it can
> handle these new HMAT subtables.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> drivers/acpi/numa.c | 2 +-
> drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
> 2 files changed, 42 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index edb0c79..917f1cc 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -443,7 +443,7 @@ int __init acpi_numa_init(void)
> * So go over all cpu entries in SRAT to get apicid to node mapping.
> */
>
> - /* SRAT: Static Resource Affinity Table */
> + /* SRAT: System Resource Affinity Table */
> if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> struct acpi_subtable_proc srat_proc[3];
>

This change is unrelated to the rest of the patch.

Maybe send it separately?

> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index ff42539..7979171 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
> }
> }
>
> +static unsigned long __init
> +acpi_get_entry_type(char *id, void *entry)
> +{
> + if (!strncmp(id, ACPI_SIG_HMAT, 4))
> + return ((struct acpi_hmat_structure *)entry)->type;
> + else
> + return ((struct acpi_subtable_header *)entry)->type;
> +}

I slightly prefer to use ? : in similar situations.

> +
> +static unsigned long __init
> +acpi_get_entry_length(char *id, void *entry)
> +{
> + if (!strncmp(id, ACPI_SIG_HMAT, 4))
> + return ((struct acpi_hmat_structure *)entry)->length;
> + else
> + return ((struct acpi_subtable_header *)entry)->length;
> +}
> +
> +static unsigned long __init
> +acpi_get_subtable_header_length(char *id)
> +{
> + if (!strncmp(id, ACPI_SIG_HMAT, 4))
> + return sizeof(struct acpi_hmat_structure);
> + else
> + return sizeof(struct acpi_subtable_header);
> +}
> +
> /**
> * acpi_parse_entries_array - for each proc_num find a suitable subtable
> *
> @@ -242,10 +269,10 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
> struct acpi_subtable_proc *proc, int proc_num,
> unsigned int max_entries)
> {
> - struct acpi_subtable_header *entry;
> - unsigned long table_end;
> + unsigned long table_end, subtable_header_length;
> int count = 0;
> int errs = 0;
> + void *entry;
> int i;
>
> if (acpi_disabled)
> @@ -263,19 +290,23 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
> }
>
> table_end = (unsigned long)table_header + table_header->length;
> + subtable_header_length = acpi_get_subtable_header_length(id);
>
> /* Parse all entries looking for a match. */
>
> - entry = (struct acpi_subtable_header *)
> - ((unsigned long)table_header + table_size);
> + entry = (void *)table_header + table_size;
> +
> + while (((unsigned long)entry) + subtable_header_length < table_end) {
> + unsigned long entry_type, entry_length;
>
> - while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
> - table_end) {
> if (max_entries && count >= max_entries)
> break;
>
> + entry_type = acpi_get_entry_type(id, entry);
> + entry_length = acpi_get_entry_length(id, entry);
> +
> for (i = 0; i < proc_num; i++) {
> - if (entry->type != proc[i].id)
> + if (entry_type != proc[i].id)
> continue;
> if (!proc[i].handler ||
> (!errs && proc[i].handler(entry, table_end))) {
> @@ -290,16 +321,15 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
> count++;
>
> /*
> - * If entry->length is 0, break from this loop to avoid
> + * If entry_length is 0, break from this loop to avoid
> * infinite loop.
> */
> - if (entry->length == 0) {
> + if (entry_length == 0) {
> pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
> return -EINVAL;
> }
>
> - entry = (struct acpi_subtable_header *)
> - ((unsigned long)entry + entry->length);
> + entry += entry_length;
> }
>
> if (max_entries && count > max_entries) {
> --

The rest is fine by me.

Thanks,
Rafael

2017-07-06 22:22:05

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC v2 2/5] acpi: HMAT support in acpi_parse_entries_array()

On Fri, Jul 07, 2017 at 12:13:54AM +0200, Rafael J. Wysocki wrote:
> On Thu, Jul 6, 2017 at 11:52 PM, Ross Zwisler
> <[email protected]> wrote:
> > The current implementation of acpi_parse_entries_array() assumes that each
> > subtable has a standard ACPI subtable entry of type struct
> > acpi_subtable_header. This standard subtable header has a one byte type
> > followed by a one byte length.
> >
> > The HMAT subtables have to allow for a longer length so they have subtable
> > headers of type struct acpi_hmat_structure which has a 2 byte type and a 4
> > byte length.
> >
> > Enhance the subtable parsing in acpi_parse_entries_array() so that it can
> > handle these new HMAT subtables.
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
> > ---
> > drivers/acpi/numa.c | 2 +-
> > drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
> > 2 files changed, 42 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> > index edb0c79..917f1cc 100644
> > --- a/drivers/acpi/numa.c
> > +++ b/drivers/acpi/numa.c
> > @@ -443,7 +443,7 @@ int __init acpi_numa_init(void)
> > * So go over all cpu entries in SRAT to get apicid to node mapping.
> > */
> >
> > - /* SRAT: Static Resource Affinity Table */
> > + /* SRAT: System Resource Affinity Table */
> > if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
> > struct acpi_subtable_proc srat_proc[3];
> >
>
> This change is unrelated to the rest of the patch.
>
> Maybe send it separately?

Sure, will do.

> > diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> > index ff42539..7979171 100644
> > --- a/drivers/acpi/tables.c
> > +++ b/drivers/acpi/tables.c
> > @@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
> > }
> > }
> >
> > +static unsigned long __init
> > +acpi_get_entry_type(char *id, void *entry)
> > +{
> > + if (!strncmp(id, ACPI_SIG_HMAT, 4))
> > + return ((struct acpi_hmat_structure *)entry)->type;
> > + else
> > + return ((struct acpi_subtable_header *)entry)->type;
> > +}
>
> I slightly prefer to use ? : in similar situations.

Hmm... that becomes rather long, and seems complex for the already hard-to-read
?: operator. Let's see, this:

if (!strncmp(id, ACPI_SIG_HMAT, 4))
return ((struct acpi_hmat_structure *)entry)->type;
else
return ((struct acpi_subtable_header *)entry)->type;

becomes

return strncmp(id, ACPI_SIG_HMAT, 4) ?
((struct acpi_subtable_header *)entry)->type :
((struct acpi_hmat_structure *)entry)->type;

Hmm...we only save one line, and I personally find that a lot harder to read,
but that being said if you feel strongly about it I'll make the change.

2017-07-06 22:36:19

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC v2 2/5] acpi: HMAT support in acpi_parse_entries_array()

On Fri, Jul 7, 2017 at 12:22 AM, Ross Zwisler
<[email protected]> wrote:
> On Fri, Jul 07, 2017 at 12:13:54AM +0200, Rafael J. Wysocki wrote:
>> On Thu, Jul 6, 2017 at 11:52 PM, Ross Zwisler
>> <[email protected]> wrote:
>> > The current implementation of acpi_parse_entries_array() assumes that each
>> > subtable has a standard ACPI subtable entry of type struct
>> > acpi_subtable_header. This standard subtable header has a one byte type
>> > followed by a one byte length.
>> >
>> > The HMAT subtables have to allow for a longer length so they have subtable
>> > headers of type struct acpi_hmat_structure which has a 2 byte type and a 4
>> > byte length.
>> >
>> > Enhance the subtable parsing in acpi_parse_entries_array() so that it can
>> > handle these new HMAT subtables.
>> >
>> > Signed-off-by: Ross Zwisler <[email protected]>
>> > ---
>> > drivers/acpi/numa.c | 2 +-
>> > drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
>> > 2 files changed, 42 insertions(+), 12 deletions(-)

[cut]

>> > diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
>> > index ff42539..7979171 100644
>> > --- a/drivers/acpi/tables.c
>> > +++ b/drivers/acpi/tables.c
>> > @@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
>> > }
>> > }
>> >
>> > +static unsigned long __init
>> > +acpi_get_entry_type(char *id, void *entry)
>> > +{
>> > + if (!strncmp(id, ACPI_SIG_HMAT, 4))
>> > + return ((struct acpi_hmat_structure *)entry)->type;
>> > + else
>> > + return ((struct acpi_subtable_header *)entry)->type;
>> > +}
>>
>> I slightly prefer to use ? : in similar situations.
>
> Hmm... that becomes rather long, and seems complex for the already hard-to-read
> ?: operator. Let's see, this:
>
> if (!strncmp(id, ACPI_SIG_HMAT, 4))
> return ((struct acpi_hmat_structure *)entry)->type;
> else
> return ((struct acpi_subtable_header *)entry)->type;
>
> becomes
>
> return strncmp(id, ACPI_SIG_HMAT, 4) ?
> ((struct acpi_subtable_header *)entry)->type :
> ((struct acpi_hmat_structure *)entry)->type;
>
> Hmm...we only save one line, and I personally find that a lot harder to read,
> but that being said if you feel strongly about it I'll make the change.

Well, I said "slightly". :-)

Thanks,
Rafael

2017-07-06 23:08:10

by Jerome Glisse

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On Thu, Jul 06, 2017 at 03:52:28PM -0600, Ross Zwisler wrote:

[...]

>
> ==== Next steps ====
>
> There is still a lot of work to be done on this series, but the overall
> goal of this RFC is to gather feedback on which of the two options we
> should pursue, or whether some third option is preferred. After that is
> done and we have a solid direction we can add support for ACPI hot add,
> test more complex configurations, etc.
>
> So, for applications that need to differentiate between memory ranges based
> on their performance, what option would work best for you? Is the local
> (initiator,target) performance provided by patch 5 enough, or do you
> require performance information for all possible (initiator,target)
> pairings?

Am I right in assuming that HBM or any faster memory will be relatively
small (1GB - 8GB, maybe 16GB?) and of a fixed amount (i.e. the size will
depend on the exact CPU model you have)?

If so, I am wondering if we should not restrict NUMA placement policy for
such nodes to vma only. Forbid any policy that would prefer those nodes
globally at the thread/process level. This would avoid a process-wide policy
exhausting this smaller pool of memory.

The drawback of doing so would be that existing applications would not
benefit from it. So workloads where it is acceptable to exhaust such memory
wouldn't benefit until their applications are updated.


This is definitely not something impacting this patchset. I am just thinking
about this at large, and I believe that NUMA might need to evolve slightly
to better handle memory hierarchies.

Cheers,
Jérôme

2017-07-06 23:30:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 07/06/2017 04:08 PM, Jerome Glisse wrote:
>> So, for applications that need to differentiate between memory ranges based
>> on their performance, what option would work best for you? Is the local
>> (initiator,target) performance provided by patch 5 enough, or do you
>> require performance information for all possible (initiator,target)
>> pairings?
>
> Am I right in assuming that HBM or any faster memory will be relatively
> small (1GB - 8GB, maybe 16GB?) and of a fixed amount (i.e. the size will
> depend on the exact CPU model you have)?

For HBM, that's certainly consistent with the Xeon Phi MCDRAM.

But, please remember that this patch set is for fast memory *and* slow
memory (vs. plain DRAM).

> If so, I am wondering if we should not restrict NUMA placement policy for
> such nodes to vma only. Forbid any policy that would prefer those nodes
> globally at the thread/process level. This would avoid a process-wide policy
> exhausting this smaller pool of memory.

You would like to take the NUMA APIs and bifurcate them? Make some of
them able to work on this memory, and others not? So, set_mempolicy()
would work if you passed it one of these "special" nodes with
MPOL_F_ADDR, but would fail otherwise?
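
For reference, the per-VMA versus thread/process-wide distinction being
discussed maps roughly onto mbind() versus set_mempolicy() in today's API.
A minimal user-space sketch (link with -lnuma); the node number is purely
illustrative, and nothing below shows how such a restriction would actually
be implemented:

	#include <numaif.h>
	#include <stddef.h>

	/* per-VMA policy: bind only this one mapping to (illustrative) node 2 */
	static long bind_range_to_node2(void *buf, size_t len)
	{
		unsigned long mask = 1UL << 2;

		return mbind(buf, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);
	}

	/* thread-wide policy: all future allocations prefer node 2 */
	static long prefer_node2_globally(void)
	{
		unsigned long mask = 1UL << 2;

		return set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);
	}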


> The drawback of doing so would be that existing applications would not
> benefit from it. So workloads where it is acceptable to exhaust such memory
> wouldn't benefit until their applications are updated.

I think the guys running 40-year-old fortran binaries might not be so
keen on this restriction. I bet there are a pretty substantial number
of folks out there that would love to get new hardware and just do:

numactl --membind=fast-node ./old-binary

If I were working for a hardware company, I'd sure like to just be able
to sell somebody some fancy new hardware and have their existing
software "just work" with a minimal wrapper.

2017-07-07 05:31:09

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 07/06/2017 02:52 PM, Ross Zwisler wrote:
[...]
>
> The naming collision between Jerome's "Heterogeneous Memory Management
> (HMM)" and this "Heterogeneous Memory (HMEM)" series is unfortunate, but I
> was trying to stick with the word "Heterogeneous" because of the naming of
> the ACPI 6.2 Heterogeneous Memory Attribute Table table. Suggestions for
> better naming are welcome.
>

Hi Ross,

Say, most of the places (file names, function and variable names, and even
print statements) where this patchset uses hmem or HMEM, it really seems to
mean, the Heterogeneous Memory Attribute Table. That's not *always* true, but
given that it's a pretty severe naming conflict, how about just changing:

hmem --> hmat
HMEM --> HMAT

...everywhere? Then you still have Heterogeneous Memory in the name, but
there is enough lexical distance (is that a thing? haha) between HMM and HMAT
to keep us all sane. :)

With or without the above suggestion, there are a few places (Kconfig, comments,
prints) where we can more easily make it clear that HMM != HMEM (or HMAT),
so for those I can just comment on them separately in the individual patches.

thanks,
john h

2017-07-07 05:53:43

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC v2 3/5] hmem: add heterogeneous memory sysfs support

On 07/06/2017 02:52 PM, Ross Zwisler wrote:
[...]
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index b1aacfc..31e3f20 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -72,6 +72,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
> obj-$(CONFIG_ACPI) += container.o
> obj-$(CONFIG_ACPI_THERMAL) += thermal.o
> obj-$(CONFIG_ACPI_NFIT) += nfit/
> +obj-$(CONFIG_ACPI_HMEM) += hmem/
> obj-$(CONFIG_ACPI) += acpi_memhotplug.o
> obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
> obj-$(CONFIG_ACPI_BATTERY) += battery.o

Hi Ross,

Following are a series of suggestions, intended to clarify naming just
enough so that, when Jerome's HMM patchset lands, we'll be able to
tell the difference between the two types of Heterogeneous Memory.


> diff --git a/drivers/acpi/hmem/Kconfig b/drivers/acpi/hmem/Kconfig
> new file mode 100644
> index 0000000..09282be
> --- /dev/null
> +++ b/drivers/acpi/hmem/Kconfig
> @@ -0,0 +1,7 @@
> +config ACPI_HMEM
> + bool "ACPI Heterogeneous Memory Support"

How about:

bool "ACPI Heterogeneous Memory Attribute Table Support"

The idea here, and throughout, is that this type of
Heterogeneous Memory support is all about "the Heterogeneous Memory
that you found via ACPI's Heterogeneous Memory Attribute Table".

That's different from "the Heterogeneous Memory that you found
when you installed a PCIe device that supports HMM". Or, at least
it is different, until the day that someone decides to burn in
support for an HMM device, into the ACPI tables. Seems unlikely,
though. :) And even so, I think it would still work.


> + depends on ACPI_NUMA
> + depends on SYSFS
> + help
> + Exports a sysfs representation of the ACPI Heterogeneous Memory
> + Attributes Table (HMAT).
> diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
> new file mode 100644
> index 0000000..d2aa546
> --- /dev/null
> +++ b/drivers/acpi/hmem/Makefile
> @@ -0,0 +1,2 @@
> +obj-$(CONFIG_ACPI_HMEM) := hmem.o
> +hmem-y := core.o initiator.o target.o
> diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
> new file mode 100644
> index 0000000..f7638db
> --- /dev/null
> +++ b/drivers/acpi/hmem/core.c
> @@ -0,0 +1,569 @@
> +/*
> + * Heterogeneous memory representation in sysfs

Heterogeneous memory, as discovered via ACPI's Heterogeneous Memory
Attribute Table: representation in sysfs

> + *
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <acpi/acpi_numa.h>

[...]

> diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
> new file mode 100644
> index 0000000..38ff540
> --- /dev/null
> +++ b/drivers/acpi/hmem/hmem.h
> @@ -0,0 +1,47 @@
> +/*
> + * Heterogeneous memory representation in sysfs

Heterogeneous memory, as discovered via ACPI's Heterogeneous Memory
Attribute Table: representation in sysfs

> + *
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#ifndef _ACPI_HMEM_H_
> +#define _ACPI_HMEM_H_
> +
> +struct memory_initiator {
> + struct list_head list;
> + struct device dev;
> +
> + /* only one of the following three will be set */
> + struct acpi_srat_cpu_affinity *cpu;
> + struct acpi_srat_x2apic_cpu_affinity *x2apic;
> + struct acpi_srat_gicc_affinity *gicc;
> +
> + int pxm;
> + bool is_registered;
> +};
> +#define to_memory_initiator(d) container_of((d), struct memory_initiator, dev)
> +
> +struct memory_target {
> + struct list_head list;
> + struct device dev;
> + struct acpi_srat_mem_affinity *ma;
> + struct acpi_hmat_address_range *spa;
> + struct memory_initiator *local_init;
> +
> + bool is_cached;
> + bool is_registered;
> +};
> +#define to_memory_target(d) container_of((d), struct memory_target, dev)
> +
> +extern const struct attribute_group *memory_initiator_attribute_groups[];
> +extern const struct attribute_group *memory_target_attribute_groups[];
> +#endif /* _ACPI_HMEM_H_ */
> diff --git a/drivers/acpi/hmem/initiator.c b/drivers/acpi/hmem/initiator.c
> new file mode 100644
> index 0000000..905f030
> --- /dev/null
> +++ b/drivers/acpi/hmem/initiator.c
> @@ -0,0 +1,61 @@
> +/*
> + * Heterogeneous memory initiator sysfs attributes

HMAT (Heterogeneous Memory Attribute Table)-based memory: initiator sysfs attributes

> + *
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <acpi/acpi_numa.h>
> +#include <linux/acpi.h>
> +#include <linux/device.h>
> +#include <linux/sysfs.h>
> +#include "hmem.h"
> +
> +static ssize_t firmware_id_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct memory_initiator *init = to_memory_initiator(dev);
> +
> + return sprintf(buf, "%d\n", init->pxm);
> +}
> +static DEVICE_ATTR_RO(firmware_id);
> +
> +static ssize_t is_enabled_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct memory_initiator *init = to_memory_initiator(dev);
> + int is_enabled;
> +
> + if (init->cpu)
> + is_enabled = !!(init->cpu->flags & ACPI_SRAT_CPU_ENABLED);
> + else if (init->x2apic)
> + is_enabled = !!(init->x2apic->flags & ACPI_SRAT_CPU_ENABLED);
> + else
> + is_enabled = !!(init->gicc->flags & ACPI_SRAT_GICC_ENABLED);
> +
> + return sprintf(buf, "%d\n", is_enabled);
> +}
> +static DEVICE_ATTR_RO(is_enabled);
> +
> +static struct attribute *memory_initiator_attributes[] = {
> + &dev_attr_firmware_id.attr,
> + &dev_attr_is_enabled.attr,
> + NULL,
> +};
> +
> +static struct attribute_group memory_initiator_attribute_group = {
> + .attrs = memory_initiator_attributes,
> +};
> +
> +const struct attribute_group *memory_initiator_attribute_groups[] = {
> + &memory_initiator_attribute_group,
> + NULL,
> +};
> diff --git a/drivers/acpi/hmem/target.c b/drivers/acpi/hmem/target.c
> new file mode 100644
> index 0000000..dd57437
> --- /dev/null
> +++ b/drivers/acpi/hmem/target.c
> @@ -0,0 +1,97 @@
> +/*
> + * Heterogeneous memory target sysfs attributes

HMAT (Heterogeneous Memory Attribute Table)-based memory: target sysfs attributes

So, maybe those will help.

thanks
john h

2017-07-07 06:28:35

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
> ==== Quick Summary ====
>
> Platforms in the very near future will have multiple types of memory
> attached to a single CPU. These disparate memory ranges will have some
> characteristics in common, such as CPU cache coherence, but they can have
> wide ranges of performance both in terms of latency and bandwidth.
>
> For example, consider a system that contains persistent memory, standard
> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> There could potentially be an order of magnitude or more difference in
> performance between the slowest and fastest memory attached to that CPU.
>
> With the current Linux code NUMA nodes are CPU-centric, so all the memory
> attached to a given CPU will be lumped into the same NUMA node. This makes
> it very difficult for userspace applications to understand the performance
> of different memory ranges on a given CPU.
>
> We solve this issue by providing userspace with performance information on
> individual memory ranges. This performance information is exposed via
> sysfs:
>
> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> mem_tgt2/firmware_id:1
> mem_tgt2/is_cached:0
> mem_tgt2/is_enabled:1
> mem_tgt2/is_isolated:0

Could you please explain these characteristics? Are they in the patches
to follow?

> mem_tgt2/phys_addr_base:0x0
> mem_tgt2/phys_length_bytes:0x800000000
> mem_tgt2/local_init/read_bw_MBps:30720
> mem_tgt2/local_init/read_lat_nsec:100
> mem_tgt2/local_init/write_bw_MBps:30720
> mem_tgt2/local_init/write_lat_nsec:100
>

How do these numbers compare to normal system memory?

> This allows applications to easily find the memory that they want to use.
> We expect that the existing NUMA APIs will be enhanced to use this new
> information so that applications can continue to use them to select their
> desired memory.
>
> This series is built upon acpica-1705:
>
> https://github.com/zetalog/linux/commits/acpica-1705
>
> And you can find a working tree here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
>
> ==== Lots of Details ====
>
> This patch set is only concerned with CPU-addressable memory types, not
> on-device memory like what we have with Jerome Glisse's HMM series:
>
> https://lwn.net/Articles/726691/
>
> This patch set works by enabling the new Heterogeneous Memory Attribute
> Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change
> in ACPI 6.2 related to this work is that proximity domains no longer need
> to contain a processor. We can now have memory-only proximity domains,
> which means that we can now have memory-only Linux NUMA nodes.
>
> Here is an example configuration where we have a single processor, one
> range of regular memory and one range of HBM:
>
> +---------------+ +----------------+
> | Processor | | Memory |
> | prox domain 0 +---+ prox domain 1 |
> | NUMA node 1 | | NUMA node 2 |
> +-------+-------+ +----------------+
> |
> +-------+----------+
> | HBM |
> | prox domain 2 |
> | NUMA node 0 |
> +------------------+
>
> This gives us one initiator (the processor) and two targets (the two memory
> ranges). Each of these three has its own ACPI proximity domain and
> associated Linux NUMA node. Note also that while there is a 1:1 mapping
> from each proximity domain to each NUMA node, the numbers don't necessarily
> match up. Additionally we can have extra NUMA nodes that don't map back to
> ACPI proximity domains.

Could you expand on proximity domains? Are they the same as node distance,
or is this ACPI terminology for something more?

>
> The above configuration could also have the processor and one of the two
> memory ranges sharing a proximity domain and NUMA node, but for the
> purposes of the HMAT the two memory ranges will always need to be
> separated.
>
> The overall goal of this series and of the HMAT is to allow users to
> identify memory using its performance characteristics. This can broadly be
> done in one of two ways:
>
> Option 1: Provide the user with a way to map between proximity domains and
> NUMA nodes and a way to access the HMAT directly (probably via
> /sys/firmware/acpi/tables). Then, through possibly a library and a daemon,
> provide an API so that applications can either request information about
> memory ranges, or request memory allocations that meet a given set of
> performance characteristics.
>
> Option 2: Provide the user with HMAT performance data directly in sysfs,
> allowing applications to directly access it without the need for the
> library and daemon.
>
> The kernel work for option 1 is started by patches 1-3. These just surface
> the minimal amount of information in sysfs to allow userspace to map
> between proximity domains and NUMA nodes so that the raw data in the HMAT
> table can be understood.
>
> Patches 4 and 5 enable option 2, adding performance information from the
> HMAT to sysfs. The second option is complicated by the amount of HMAT data
> that could be present in very large systems, so in this series we only
> surface performance information for local (initiator,target) pairings. The
> changelog for patch 5 discusses this in detail.
>
> The naming collision between Jerome's "Heterogeneous Memory Management
> (HMM)" and this "Heterogeneous Memory (HMEM)" series is unfortunate, but I
> was trying to stick with the word "Heterogeneous" because of the naming of
> the ACPI 6.2 Heterogeneous Memory Attribute Table table. Suggestions for
> better naming are welcome.
>

Balbir Singh.

2017-07-07 16:20:01

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 07/06/2017 11:27 PM, Balbir Singh wrote:
> On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
>> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
>> mem_tgt2/firmware_id:1

This is here for folks that know their platform and know exactly the
firmware ID (PXM in ACPI parlance) of a given piece of memory. Without
this, we might be stuck with requiring the NUMA node ID that the kernel
uses to be bound 1:1 with the firmware ID.

>> mem_tgt2/is_cached:0

This tells whether the memory is cached by some other memory. MCDRAM is
an example of this. It can be used as a high-bandwidth cache in front
of the lower-bandwidth DRAM.

This is referred to as "Memory Side Cache Information Structure" in the
ACPI spec: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

>> mem_tgt2/is_enabled:1
>> mem_tgt2/is_isolated:0

This one is described in detail in the ACPI spec. It's called
"Reservation hint" in there.

>> mem_tgt2/phys_addr_base:0x0
>> mem_tgt2/phys_length_bytes:0x800000000
>> mem_tgt2/local_init/read_bw_MBps:30720
>> mem_tgt2/local_init/read_lat_nsec:100
>> mem_tgt2/local_init/write_bw_MBps:30720
>> mem_tgt2/local_init/write_lat_nsec:100
>
> How do these numbers compare to normal system memory?

They're made up in this instance. But, it's safe to expect 10x swings
in bandwidth and latency, both up and down.

2017-07-07 16:25:17

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On Fri, Jul 07, 2017 at 04:27:16PM +1000, Balbir Singh wrote:
> On Thu, 2017-07-06 at 15:52 -0600, Ross Zwisler wrote:
> > ==== Quick Summary ====
> >
> > Platforms in the very near future will have multiple types of memory
> > attached to a single CPU. These disparate memory ranges will have some
> > characteristics in common, such as CPU cache coherence, but they can have
> > wide ranges of performance both in terms of latency and bandwidth.
> >
> > For example, consider a system that contains persistent memory, standard
> > DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> > There could potentially be an order of magnitude or more difference in
> > performance between the slowest and fastest memory attached to that CPU.
> >
> > With the current Linux code NUMA nodes are CPU-centric, so all the memory
> > attached to a given CPU will be lumped into the same NUMA node. This makes
> > it very difficult for userspace applications to understand the performance
> > of different memory ranges on a given CPU.
> >
> > We solve this issue by providing userspace with performance information on
> > individual memory ranges. This performance information is exposed via
> > sysfs:
> >
> > # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> > mem_tgt2/firmware_id:1
> > mem_tgt2/is_cached:0
> > mem_tgt2/is_enabled:1
> > mem_tgt2/is_isolated:0
>
> Could you please explain these characteristics? Are they in the patches
> to follow?

Yea, sorry, these do need more explanation. These values are derived from the
ACPI SRAT/HMAT tables:

> > mem_tgt2/firmware_id:1

This is the proximity domain, as defined in the SRAT and HMAT. Basically
every ACPI proximity domain will end up being a unique NUMA node in Linux, but
the numbers may get reordered and Linux can create extra NUMA nodes that don't
map back to ACPI proximity domains. So, this value is needed if anyone ever
wants to look at the ACPI HMAT and SRAT tables directly and make sense of how
they map to NUMA nodes in Linux.

> > mem_tgt2/is_cached:0

The HMAT provides lots of detailed information when a memory region has
caching layers. For each layer of memory caching it has the ability to
provide latency and bandwidth information for both reads and writes,
information about the caching associativity (direct mapped, something more
complex), the writeback policy (WB, WT), the cache line size, etc.

For simplicity this sysfs interface doesn't expose that level of detail to the
user, and this flag just lets the user know whether the memory region they are
looking at has caching layers or not. Right now the additional details, if
desired, can be gathered by looking at the raw tables.

> > mem_tgt2/is_enabled:1

Tells whether the memory region is enabled, as defined by the flags in the
SRAT. Actually, though, in this version of the patch series we don't create
entries for CPUs or memory regions that aren't enabled, so this isn't needed.
I'll remove for v3.

> > mem_tgt2/is_isolated:0

This surfaces a flag in the HMAT's Memory Subsystem Address Range Structure:

Bit [2]: Reservation hint—if set to 1, it is recommended
that the operating system avoid placing allocations in
this region if it cannot relocate (e.g. OS core memory
management structures, OS core executable). Any
allocations placed here should be able to be relocated
(e.g. disk cache) if the memory is needed for another
purpose.

Adding kernel support for this hint (i.e. actually reserving the memory region
during boot so it isn't used by the kernel or userspace, and is fully
available for explicit allocation) is part of the future work that we'd do in
follow-on patch series.
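
As a very rough illustration of what honoring that hint might eventually look
like, here is a sketch with the structure names taken from patch 3 but the
placement and boot timing entirely assumed:

	/* hypothetical: keep kernel allocations out of ranges that carry the
	 * reservation hint; tgt->spa / tgt->ma are the HMAT address range and
	 * SRAT memory affinity structures from patch 3 */
	if (tgt->spa->flags & ACPI_HMAT_RESERVATION_HINT)
		memblock_reserve(tgt->ma->base_address, tgt->ma->length);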

> > mem_tgt2/phys_addr_base:0x0
> > mem_tgt2/phys_length_bytes:0x800000000
> > mem_tgt2/local_init/read_bw_MBps:30720
> > mem_tgt2/local_init/read_lat_nsec:100
> > mem_tgt2/local_init/write_bw_MBps:30720
> > mem_tgt2/local_init/write_lat_nsec:100
>
> How do these numbers compare to normal system memory?

These are garbage numbers that I made up in my hacked-up QEMU target. :)

> > This allows applications to easily find the memory that they want to use.
> > We expect that the existing NUMA APIs will be enhanced to use this new
> > information so that applications can continue to use them to select their
> > desired memory.
> >
> > This series is built upon acpica-1705:
> >
> > https://github.com/zetalog/linux/commits/acpica-1705
> >
> > And you can find a working tree here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
> >
> > ==== Lots of Details ====
> >
> > This patch set is only concerned with CPU-addressable memory types, not
> > on-device memory like what we have with Jerome Glisse's HMM series:
> >
> > https://lwn.net/Articles/726691/
> >
> > This patch set works by enabling the new Heterogeneous Memory Attribute
> > Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change
> > in ACPI 6.2 related to this work is that proximity domains no longer need
> > to contain a processor. We can now have memory-only proximity domains,
> > which means that we can now have memory-only Linux NUMA nodes.
> >
> > Here is an example configuration where we have a single processor, one
> > range of regular memory and one range of HBM:
> >
> > +---------------+ +----------------+
> > | Processor | | Memory |
> > | prox domain 0 +---+ prox domain 1 |
> > | NUMA node 1 | | NUMA node 2 |
> > +-------+-------+ +----------------+
> > |
> > +-------+----------+
> > | HBM |
> > | prox domain 2 |
> > | NUMA node 0 |
> > +------------------+
> >
> > This gives us one initiator (the processor) and two targets (the two memory
> > ranges). Each of these three has its own ACPI proximity domain and
> > associated Linux NUMA node. Note also that while there is a 1:1 mapping
> > from each proximity domain to each NUMA node, the numbers don't necessarily
> > match up. Additionally we can have extra NUMA nodes that don't map back to
> > ACPI proximity domains.
>
> Could you expand on proximity domains? Are they the same as node distance,
> or is this ACPI terminology for something more?

I think I answered this above in my explanation of the "firmware_id" field,
but please let me know if you have any more questions. Basically, a proximity
domain is an ACPI concept that is very similar to a Linux NUMA node, and every
ACPI proximity domain generates and can be mapped to a unique Linux NUMA node.

2017-07-07 16:31:03

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On Thu, Jul 06, 2017 at 10:30:46PM -0700, John Hubbard wrote:
> On 07/06/2017 02:52 PM, Ross Zwisler wrote:
> [...]
> >
> > The naming collision between Jerome's "Heterogeneous Memory Management
> > (HMM)" and this "Heterogeneous Memory (HMEM)" series is unfortunate, but I
> > was trying to stick with the word "Heterogeneous" because of the naming of
> > the ACPI 6.2 Heterogeneous Memory Attribute Table table. Suggestions for
> > better naming are welcome.
> >
>
> Hi Ross,
>
> Say, most of the places (file names, function and variable names, and even
> print statements) where this patchset uses hmem or HMEM, it really seems to
> mean, the Heterogeneous Memory Attribute Table. That's not *always* true, but
> given that it's a pretty severe naming conflict, how about just changing:
>
> hmem --> hmat
> HMEM --> HMAT
>
> ...everywhere? Then you still have Heterogeneous Memory in the name, but
> there is enough lexical distance (is that a thing? haha) between HMM and HMAT
> to keep us all sane. :)
>
> With or without the above suggestion, there are a few places (Kconfig, comments,
> prints) where we can more easily make it clear that HMM != HMEM (or HMAT),
> so for those I can just comment on them separately in the individual patches.
>
> thanks,
> john h

Hi John,

Sure, that change makes sense to me. I had initially tried to make this
enabling more generic so that other, non-ACPI systems could use the same sysfs
representation even if they got their performance numbers from some other
source, but while implementing it, the code pretty quickly became very tightly
tied to the ACPI HMAT.

- Ross

2017-07-07 16:32:17

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC v2 3/5] hmem: add heterogeneous memory sysfs support

On Thu, Jul 06, 2017 at 10:53:39PM -0700, John Hubbard wrote:
> On 07/06/2017 02:52 PM, Ross Zwisler wrote:
> [...]
> > diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> > index b1aacfc..31e3f20 100644
> > --- a/drivers/acpi/Makefile
> > +++ b/drivers/acpi/Makefile
> > @@ -72,6 +72,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
> > obj-$(CONFIG_ACPI) += container.o
> > obj-$(CONFIG_ACPI_THERMAL) += thermal.o
> > obj-$(CONFIG_ACPI_NFIT) += nfit/
> > +obj-$(CONFIG_ACPI_HMEM) += hmem/
> > obj-$(CONFIG_ACPI) += acpi_memhotplug.o
> > obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
> > obj-$(CONFIG_ACPI_BATTERY) += battery.o
>
> Hi Ross,
>
> Following are a series of suggestions, intended to clarify naming just
> enough so that, when Jerome's HMM patchset lands, we'll be able to
> tell the difference between the two types of Heterogeneous Memory.

Sure, these all seem sane to me, thanks. I'll fix this up in v3.

2017-07-19 09:49:39

by Liubo(OS Lab)

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 2017/7/7 5:52, Ross Zwisler wrote:
> ==== Quick Summary ====
>
> Platforms in the very near future will have multiple types of memory
> attached to a single CPU. These disparate memory ranges will have some
> characteristics in common, such as CPU cache coherence, but they can have
> wide ranges of performance both in terms of latency and bandwidth.
>
> For example, consider a system that contains persistent memory, standard
> DDR memory and High Bandwidth Memory (HBM), all attached to the same CPU.
> There could potentially be an order of magnitude or more difference in
> performance between the slowest and fastest memory attached to that CPU.
>
> With the current Linux code NUMA nodes are CPU-centric, so all the memory
> attached to a given CPU will be lumped into the same NUMA node. This makes
> it very difficult for userspace applications to understand the performance
> of different memory ranges on a given CPU.
>
> We solve this issue by providing userspace with performance information on
> individual memory ranges. This performance information is exposed via
> sysfs:
>
> # grep . mem_tgt2/* mem_tgt2/local_init/* 2>/dev/null
> mem_tgt2/firmware_id:1
> mem_tgt2/is_cached:0
> mem_tgt2/is_enabled:1
> mem_tgt2/is_isolated:0
> mem_tgt2/phys_addr_base:0x0
> mem_tgt2/phys_length_bytes:0x800000000
> mem_tgt2/local_init/read_bw_MBps:30720
> mem_tgt2/local_init/read_lat_nsec:100
> mem_tgt2/local_init/write_bw_MBps:30720
> mem_tgt2/local_init/write_lat_nsec:100
>
> This allows applications to easily find the memory that they want to use.
> We expect that the existing NUMA APIs will be enhanced to use this new
> information so that applications can continue to use them to select their
> desired memory.
>
> This series is built upon acpica-1705:
>
> https://github.com/zetalog/linux/commits/acpica-1705
>
> And you can find a working tree here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/zwisler/linux.git/log/?h=hmem_sysfs
>
> ==== Lots of Details ====
>
> This patch set is only concerned with CPU-addressable memory types, not
> on-device memory like what we have with Jerome Glisse's HMM series:
>
> https://lwn.net/Articles/726691/
>
> This patch set works by enabling the new Heterogeneous Memory Attribute
> Table (HMAT) table, newly defined in ACPI 6.2. One major conceptual change
> in ACPI 6.2 related to this work is that proximity domains no longer need
> to contain a processor. We can now have memory-only proximity domains,
> which means that we can now have memory-only Linux NUMA nodes.
>
> Here is an example configuration where we have a single processor, one
> range of regular memory and one range of HBM:
>
> +---------------+ +----------------+
> | Processor | | Memory |
> | prox domain 0 +---+ prox domain 1 |
> | NUMA node 1 | | NUMA node 2 |
> +-------+-------+ +----------------+
> |
> +-------+----------+
> | HBM |
> | prox domain 2 |
> | NUMA node 0 |
> +------------------+
>
> This gives us one initiator (the processor) and two targets (the two memory
> ranges). Each of these three has its own ACPI proximity domain and
> associated Linux NUMA node. Note also that while there is a 1:1 mapping
> from each proximity domain to each NUMA node, the numbers don't necessarily
> match up. Additionally we can have extra NUMA nodes that don't map back to
> ACPI proximity domains.
>
> The above configuration could also have the processor and one of the two
> memory ranges sharing a proximity domain and NUMA node, but for the
> purposes of the HMAT the two memory ranges will always need to be
> separated.
>
> The overall goal of this series and of the HMAT is to allow users to
> identify memory using its performance characteristics. This can broadly be
> done in one of two ways:
>
> Option 1: Provide the user with a way to map between proximity domains and
> NUMA nodes and a way to access the HMAT directly (probably via
> /sys/firmware/acpi/tables). Then, through possibly a library and a daemon,
> provide an API so that applications can either request information about
> memory ranges, or request memory allocations that meet a given set of
> performance characteristics.
>
> Option 2: Provide the user with HMAT performance data directly in sysfs,
> allowing applications to directly access it without the need for the
> library and daemon.
>

Is it possible to do the memory allocation automatically by the kernel, transparently to users?
It sounds unreasonable that most users should have to be aware of this detailed memory topology.

--
Thanks,
Bob Liu

> The kernel work for option 1 is started by patches 1-3. These just surface
> the minimal amount of information in sysfs to allow userspace to map
> between proximity domains and NUMA nodes so that the raw data in the HMAT
> table can be understood.

2017-07-19 15:25:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC v2 0/5] surface heterogeneous memory performance information

On 07/19/2017 02:48 AM, Bob Liu wrote:
>> Option 2: Provide the user with HMAT performance data directly in
>> sysfs, allowing applications to directly access it without the need
>> for the library and daemon.
>>
> Is it possible to do the memory allocation automatically by the
> kernel, transparently to users? It sounds unreasonable that most
> users should have to be aware of this detailed memory topology.

It's possible, but I'm not sure this is something we automatically want
to see added to the kernel.

I look at it like NUMA. We have lots of details available about how
things are connected. But, "most users" are totally unaware of this.
We give them decent default policies, and the ones that need more control
can get it with the NUMA APIs.

These patches provide the framework to help users/apps who *do* care and
want to make intelligent, topology-aware decisions.
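
As a small illustration of that last point: once an application has identified
a fast, memory-only node via the new sysfs information, it can keep using the
existing libnuma API to place its hot data there. A minimal sketch (link with
-lnuma), with node 0 standing in for the HBM node from the example topology
earlier in the thread:

	#include <numa.h>
	#include <stdio.h>

	int main(void)
	{
		size_t len = 1 << 21;	/* 2 MiB, arbitrary */
		void *buf;

		if (numa_available() < 0) {
			fprintf(stderr, "NUMA not supported\n");
			return 1;
		}

		/* node 0 is purely illustrative */
		buf = numa_alloc_onnode(len, 0);
		if (!buf)
			return 1;

		/* ... place hot data structures in buf ... */

		numa_free(buf, len);
		return 0;
	}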