2017-06-02 21:00:02

by Ross Zwisler

Subject: [RFC 0/6] Add support for Heterogeneous Memory Attribute Table

==== Quick summary ====

This series adds kernel support for the Heterogeneous Memory Attribute
Table (HMAT), newly defined in ACPI 6.2:

http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

The HMAT table, in concert with the existing System Resource Affinity Table
(SRAT), provides users with information about memory initiators and memory
targets in the system.

A "memory initiator" in this case is any device such as a CPU or a separate
memory I/O device that can initiate a memory request. A "memory target" is
a CPU-accessible physical address range.

The HMAT provides performance information (expected latency and bandwidth,
etc.) for various (initiator,target) pairs. This is mostly motivated by
the need to optimally use performance-differentiated DRAM, but it also
allows us to describe the performance characteristics of persistent memory.

The purpose of this RFC is to gather feedback on the different options for
enabling the HMAT in the kernel and in userspace.

==== Lots of details ====

The HMAT only covers CPU-addressable memory types, not on-device memory
like what we have with Jerome Glisse's HMM series:

https://lkml.org/lkml/2017/5/24/731

One major conceptual change in ACPI 6.2 related to this work is that
proximity domains no longer need to contain a processor. We can now have
memory-only proximity domains, which means that we can now have memory-only
Linux NUMA nodes.

Here is an example configuration where we have a single processor, one
range of regular memory and one range of High Bandwidth Memory (HBM):

+---------------+   +----------------+
| Processor     |   | Memory         |
| prox domain 0 +---+ prox domain 1  |
| NUMA node 1   |   | NUMA node 2    |
+-------+-------+   +----------------+
        |
+-------+----------+
| HBM              |
| prox domain 2    |
| NUMA node 0      |
+------------------+

This gives us one initiator (the processor) and two targets (the two memory
ranges). Each of these three has its own ACPI proximity domain and
associated Linux NUMA node. Note also that while there is a 1:1 mapping
from each proximity domain to each NUMA node, the numbers don't necessarily
match up. Additionally we can have extra NUMA nodes that don't map back to
ACPI proximity domains.

The above configuration could also have the processor and one of the two
memory ranges sharing a proximity domain and NUMA node, but for the
purposes of the HMAT the two memory ranges will always need to be
separated.

The overall goal of this series and of the HMAT is to allow users to
identify memory using its performance characteristics. This can broadly be
done in one of two ways:

Option 1: Provide the user with a way to map between proximity domains and
NUMA nodes and a way to access the HMAT directly (probably via
/sys/firmware/acpi/tables). Then, possibly through a library and a daemon,
provide an API so that applications can either request information about
memory ranges, or request memory allocations that meet a given set of
performance characteristics.
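To make option 1 concrete, here is a rough sketch of the kind of interface such a library might expose. Every name below (the class, the performance fields, the selection helper) is invented for illustration only; a real library would parse the raw HMAT from /sys/firmware/acpi/tables rather than take a hand-built table:

```python
# Hypothetical option-1 library API sketch. All names are invented;
# a real implementation would parse the raw HMAT rather than accept
# a pre-built dictionary of (initiator, target) performance pairs.

from dataclasses import dataclass

@dataclass(frozen=True)
class PairPerf:
    """Performance data for one (initiator, target) pairing."""
    read_lat_nsec: int
    read_bw_mbps: int

class HmatInfo:
    def __init__(self, pairs):
        # pairs: {(initiator_node, target_node): PairPerf}
        self.pairs = pairs

    def best_target(self, initiator_node):
        """Return the target NUMA node with the lowest read latency
        as seen from the given initiator."""
        candidates = {t: p for (i, t), p in self.pairs.items()
                      if i == initiator_node}
        return min(candidates, key=lambda t: candidates[t].read_lat_nsec)

# Example matching the diagram above: initiator node 1 sees regular
# DRAM at node 2 (lower latency) and HBM at node 0 (higher bandwidth).
hmat = HmatInfo({
    (1, 2): PairPerf(read_lat_nsec=80, read_bw_mbps=20000),
    (1, 0): PairPerf(read_lat_nsec=100, read_bw_mbps=90000),
})
print(hmat.best_target(1))
```

An allocation-oriented variant of the same API could take a latency or bandwidth constraint instead of returning raw numbers; which shape is more useful is exactly the feedback this RFC is after.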

Option 2: Provide the user with HMAT performance data directly in sysfs,
allowing applications to directly access it without the need for the
library and daemon.

The kernel work for option 1 is started by patches 1-4. These just surface
the minimal amount of information in sysfs to allow userspace to map
between proximity domains and NUMA nodes so that the raw data in the HMAT
table can be understood.
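As a sketch of how userspace could consume that minimal information, the following builds the proximity-domain-to-NUMA-node mapping from the firmware_id files proposed here, assuming the mem_tgtX layout shown in patch 4 (the sysfs tree is mocked with a temporary directory so the script is self-contained):

```python
# Build the ACPI proximity domain -> Linux NUMA node mapping from the
# firmware_id attributes that patches 1-4 surface. The mem_tgtX layout
# is the one proposed in patch 4; here it is mocked under a tempdir.

import os
import re
import tempfile

def read_pxm_to_node(hmem_root):
    """Map proximity domain (firmware_id) -> NUMA node for every
    mem_tgtX device under hmem_root."""
    mapping = {}
    for entry in os.listdir(hmem_root):
        m = re.fullmatch(r"mem_tgt(\d+)", entry)
        if not m:
            continue
        node = int(m.group(1))  # the device id is the NUMA node number
        with open(os.path.join(hmem_root, entry, "firmware_id")) as f:
            pxm = int(f.read())
        mapping[pxm] = node
    return mapping

# Mock sysfs matching the example diagram: prox domain 1 -> node 2,
# prox domain 2 -> node 0.
with tempfile.TemporaryDirectory() as root:
    for node, pxm in [(2, 1), (0, 2)]:
        d = os.path.join(root, "mem_tgt%d" % node)
        os.mkdir(d)
        with open(os.path.join(d, "firmware_id"), "w") as f:
            f.write("%d\n" % pxm)
    mapping = read_pxm_to_node(root)

print(sorted(mapping.items()))  # [(1, 2), (2, 0)]
```

With that mapping in hand, a library or daemon can translate the proximity domains found in the raw HMAT into NUMA nodes usable with mbind()/set_mempolicy().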

Patches 5 and 6 enable option 2, adding performance information from the
HMAT to sysfs. The second option is complicated by the amount of HMAT data
that could be present in very large systems, so in this series we only
surface performance information for local (initiator,target) pairings. The
changelog for patch 6 discusses this in detail.
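For option 2, consuming the data is simpler: userspace just reads the attribute files directly. A sketch, assuming the via_mem_initX group layout and file names shown in the patch 5 changelog (again mocked with a temporary directory):

```python
# Read the local (initiator, target) performance attributes that patch 6
# proposes to surface. File and directory names follow the tree listing
# in patch 5; the sysfs tree itself is mocked here.

import os
import tempfile

PERF_FILES = ("read_lat_nsec", "write_lat_nsec",
              "read_bw_MBps", "write_bw_MBps")

def read_local_perf(tgt_dir, init_id):
    """Return the performance attributes a target reports for its
    local initiator init_id."""
    group = os.path.join(tgt_dir, "via_mem_init%d" % init_id)
    perf = {}
    for name in PERF_FILES:
        with open(os.path.join(group, name)) as f:
            perf[name] = int(f.read())
    return perf

with tempfile.TemporaryDirectory() as root:
    group = os.path.join(root, "mem_tgt2", "via_mem_init0")
    os.makedirs(group)
    sample = {"read_lat_nsec": 100, "write_lat_nsec": 110,
              "read_bw_MBps": 90000, "write_bw_MBps": 60000}
    for name, val in sample.items():
        with open(os.path.join(group, name), "w") as f:
            f.write("%d\n" % val)
    perf = read_local_perf(os.path.join(root, "mem_tgt2"), 0)

print(perf["read_bw_MBps"])
```

The sample numbers are made up; the point is only that no library or daemon sits between the application and the kernel in this model.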

==== Next steps ====

There is still a lot of work to be done on this series, but the overall
goal of this RFC is to gather feedback on which of the two options we
should pursue, or whether some third option is preferred. After that is
done and we have a solid direction we can add support for ACPI hot add,
test more complex configurations, etc.

So, for applications that need to differentiate between memory ranges based
on their performance, what option would work best for you? Is the local
(initiator,target) performance provided by patch 6 enough, or do you
require performance information for all possible (initiator,target)
pairings?

If option 1 looks best, do we have ideas on what the userspace API would
look like?

For option 2 Dan Williams had suggested that it may be worthwhile to allow
for multiple memory initiators to be listed as "local" if they all have the
same performance, even if the HMAT's Memory Subsystem Address Range
Structure table only defines a single local initiator. Do others agree?

What other things should we consider, or what needs do you have that aren't
being addressed?

Ross Zwisler (6):
ACPICA: add HMAT table definitions
acpi: add missing include in acpi_numa.h
acpi: HMAT support in acpi_parse_entries_array()
hmem: add heterogeneous memory sysfs support
sysfs: add sysfs_add_group_link()
hmem: add performance attributes

MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 679 ++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 56 +++
drivers/acpi/hmem/initiator.c | 61 ++++
drivers/acpi/hmem/perf_attributes.c | 158 +++++++++
drivers/acpi/hmem/target.c | 97 ++++++
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++-
fs/sysfs/group.c | 30 +-
include/acpi/acpi_numa.h | 1 +
include/acpi/actbl1.h | 119 +++++++
include/linux/sysfs.h | 2 +
16 files changed, 1254 insertions(+), 19 deletions(-)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/perf_attributes.c
create mode 100644 drivers/acpi/hmem/target.c

--
2.9.4


2017-06-02 21:00:05

by Ross Zwisler

Subject: [RFC 2/6] acpi: add missing include in acpi_numa.h

Right now if a file includes acpi_numa.h without first including
linux/numa.h, it gets the following warning:

./include/acpi/acpi_numa.h:9:5: warning: "MAX_NUMNODES" is not defined [-Wundef]
 #if MAX_NUMNODES > 256
     ^~~~~~~~~~~~

Signed-off-by: Ross Zwisler <[email protected]>
---
include/acpi/acpi_numa.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index d4b7294..1e3a74f 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -3,6 +3,7 @@

#ifdef CONFIG_ACPI_NUMA
#include <linux/kernel.h>
+#include <linux/numa.h>

/* Proximity bitmap length */
#if MAX_NUMNODES > 256
--
2.9.4

2017-06-02 21:00:15

by Ross Zwisler

Subject: [RFC 5/6] sysfs: add sysfs_add_group_link()

The current __compat_only_sysfs_link_entry_to_kobj() code allows us to
create symbolic links in sysfs to groups or attributes. Something like:

/sys/.../entry1/groupA -> /sys/.../entry2/groupA

This patch extends this functionality with a new sysfs_add_group_link()
call that allows the link to have a different name than the group or
attribute, so:

/sys/.../entry1/link_name -> /sys/.../entry2/groupA

__compat_only_sysfs_link_entry_to_kobj() now just calls
sysfs_add_group_link(), passing in the same name for both the
group/attribute and for the link name.

This is needed by the ACPI HMAT enabling work because we want to have a
group of performance attributes that live in a memory target. This group
represents the performance between the (initiator,target) pair, and in the
target the attribute group is named "via_mem_initX" to represent this
pairing:

# tree mem_tgt2/via_mem_init0/
mem_tgt2/via_mem_init0/
├── mem_init0 -> ../../mem_init0
├── mem_tgt2 -> ../../mem_tgt2
├── read_bw_MBps
├── read_lat_nsec
├── write_bw_MBps
└── write_lat_nsec

We then want to link to this attribute group from the initiator, but change
the name to "via_mem_tgtX" since we're now looking at it from the
initiator's perspective:

# ls -l mem_init0/via_mem_tgt2
lrwxrwxrwx. 1 root root 0 Jun 1 10:00 mem_init0/via_mem_tgt2 ->
../mem_tgt2/via_mem_init0

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/sysfs/group.c | 30 +++++++++++++++++++++++-------
include/linux/sysfs.h | 2 ++
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index ac2de0e..19db57c8 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -367,15 +367,15 @@ void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
EXPORT_SYMBOL_GPL(sysfs_remove_link_from_group);

/**
- * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
- * to a group or an attribute
+ * sysfs_add_group_link - add a symlink to a kobject pointing to a group or
+ * an attribute
* @kobj: The kobject containing the group.
* @target_kobj: The target kobject.
* @target_name: The name of the target group or attribute.
+ * @link_name: The name of the link to the target group or attribute.
*/
-int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
- struct kobject *target_kobj,
- const char *target_name)
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name)
{
struct kernfs_node *target;
struct kernfs_node *entry;
@@ -400,12 +400,28 @@ int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
return -ENOENT;
}

- link = kernfs_create_link(kobj->sd, target_name, entry);
+ link = kernfs_create_link(kobj->sd, link_name, entry);
if (IS_ERR(link) && PTR_ERR(link) == -EEXIST)
- sysfs_warn_dup(kobj->sd, target_name);
+ sysfs_warn_dup(kobj->sd, link_name);

kernfs_put(entry);
kernfs_put(target);
return IS_ERR(link) ? PTR_ERR(link) : 0;
}
+EXPORT_SYMBOL_GPL(sysfs_add_group_link);
+
+/**
+ * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
+ * to a group or an attribute
+ * @kobj: The kobject containing the group.
+ * @target_kobj: The target kobject.
+ * @target_name: The name of the target group or attribute.
+ */
+int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+ struct kobject *target_kobj,
+ const char *target_name)
+{
+ return sysfs_add_group_link(kobj, target_kobj, target_name,
+ target_name);
+}
EXPORT_SYMBOL_GPL(__compat_only_sysfs_link_entry_to_kobj);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c6f0f0d..865f499 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -278,6 +278,8 @@ int sysfs_add_link_to_group(struct kobject *kobj, const char *group_name,
struct kobject *target, const char *link_name);
void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
const char *link_name);
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name);
int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
struct kobject *target_kobj,
const char *target_name);
--
2.9.4

2017-06-02 21:00:09

by Ross Zwisler

Subject: [RFC 4/6] hmem: add heterogeneous memory sysfs support

Add a new sysfs subsystem, /sys/devices/system/hmem, which surfaces
information about memory initiators and memory targets to the user. These
initiators and targets are described by the ACPI SRAT and HMAT tables.

A "memory initiator" in this case is any device such as a CPU or a separate
memory I/O device that can initiate a memory request. A "memory target" is
a CPU-accessible physical address range.

The key piece of information surfaced by this patch is the mapping between
the ACPI table "proximity domain" numbers, held in the "firmware_id"
attribute, and Linux NUMA node numbers.

Initiators are found at /sys/devices/system/hmem/mem_initX, and the
attributes for a given initiator look like this:

# tree mem_init0/
mem_init0/
├── cpu0 -> ../../cpu/cpu0
├── firmware_id
├── is_enabled
├── node0 -> ../../node/node0
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

Where "mem_init0" on my system represents the CPU acting as a memory
initiator at NUMA node 0.

Targets are found at /sys/devices/system/hmem/mem_tgtX, and the attributes
for a given target look like this:

# tree mem_tgt2/
mem_tgt2/
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

Signed-off-by: Ross Zwisler <[email protected]>
---
MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 547 ++++++++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 47 ++++
drivers/acpi/hmem/initiator.c | 61 +++++
drivers/acpi/hmem/target.c | 97 ++++++++
9 files changed, 768 insertions(+)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/target.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 053c3bd..554b833 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6085,6 +6085,11 @@ S: Supported
F: drivers/scsi/hisi_sas/
F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt

+HMEM (ACPI HETEROGENEOUS MEMORY SUPPORT)
+M: Ross Zwisler <[email protected]>
+S: Supported
+F: drivers/acpi/hmem/
+
HOST AP DRIVER
M: Jouni Malinen <[email protected]>
L: [email protected]
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 1ce52f8..44dd97f 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -460,6 +460,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
If you are unsure what to do, do not enable this option.

source "drivers/acpi/nfit/Kconfig"
+source "drivers/acpi/hmem/Kconfig"

source "drivers/acpi/apei/Kconfig"
source "drivers/acpi/dptf/Kconfig"
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index b1aacfc..31e3f20 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
obj-$(CONFIG_ACPI) += container.o
obj-$(CONFIG_ACPI_THERMAL) += thermal.o
obj-$(CONFIG_ACPI_NFIT) += nfit/
+obj-$(CONFIG_ACPI_HMEM) += hmem/
obj-$(CONFIG_ACPI) += acpi_memhotplug.o
obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
obj-$(CONFIG_ACPI_BATTERY) += battery.o
diff --git a/drivers/acpi/hmem/Kconfig b/drivers/acpi/hmem/Kconfig
new file mode 100644
index 0000000..09282be
--- /dev/null
+++ b/drivers/acpi/hmem/Kconfig
@@ -0,0 +1,7 @@
+config ACPI_HMEM
+ bool "ACPI Heterogeneous Memory Support"
+ depends on ACPI_NUMA
+ depends on SYSFS
+ help
+ Exports a sysfs representation of the ACPI Heterogeneous Memory
+ Attributes Table (HMAT).
diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
new file mode 100644
index 0000000..d2aa546
--- /dev/null
+++ b/drivers/acpi/hmem/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_ACPI_HMEM) := hmem.o
+hmem-y := core.o initiator.o target.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
new file mode 100644
index 0000000..2947fac
--- /dev/null
+++ b/drivers/acpi/hmem/core.c
@@ -0,0 +1,547 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "hmem.h"
+
+static LIST_HEAD(target_list);
+static LIST_HEAD(initiator_list);
+
+static bool bad_hmem;
+
+static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ return sysfs_create_link(kobj, &node_devices[node]->dev.kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+
+ return 0;
+}
+
+static void remove_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ sysfs_remove_link(kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+}
+
+#define HMEM_CLASS_NAME "hmem"
+
+static struct bus_type hmem_subsys = {
+ /*
+ * .dev_name is set before device_register() based on the type of
+ * device we are registering.
+ */
+ .name = HMEM_CLASS_NAME,
+};
+
+/* memory initiators */
+static int link_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ return sysfs_create_link(&init->dev.kobj,
+ &cpu_dev->kobj,
+ kobject_name(&cpu_dev->kobj));
+ }
+
+ }
+ return 0;
+}
+
+static void remove_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ sysfs_remove_link(&init->dev.kobj,
+ kobject_name(&cpu_dev->kobj));
+ return;
+ }
+
+ }
+}
+
+static void release_memory_initiator(struct device *dev)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ list_del(&init->list);
+ kfree(init);
+}
+
+static void __init remove_memory_initiator(struct memory_initiator *init)
+{
+ if (init->is_registered) {
+ remove_cpu_under_mem_init(init);
+ remove_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+ device_unregister(&init->dev);
+ } else
+ release_memory_initiator(&init->dev);
+}
+
+static int __init register_memory_initiator(struct memory_initiator *init)
+{
+ int ret;
+
+ hmem_subsys.dev_name = "mem_init";
+ init->dev.bus = &hmem_subsys;
+ init->dev.id = pxm_to_node(init->pxm);
+ init->dev.release = release_memory_initiator;
+ init->dev.groups = memory_initiator_attribute_groups;
+
+ ret = device_register(&init->dev);
+ if (ret < 0)
+ return ret;
+
+ init->is_registered = true;
+
+ ret = link_cpu_under_mem_init(init);
+ if (ret < 0)
+ return ret;
+
+ return link_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+}
+
+static struct memory_initiator * __init add_memory_initiator(int pxm)
+{
+ struct memory_initiator *init;
+
+ if (pxm_to_node(pxm) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", pxm);
+ bad_hmem = true;
+ return ERR_PTR(-EINVAL);
+ }
+
+ init = kzalloc(sizeof(*init), GFP_KERNEL);
+ if (!init) {
+ bad_hmem = true;
+ return ERR_PTR(-ENOMEM);
+ }
+
+ init->pxm = pxm;
+
+ list_add_tail(&init->list, &initiator_list);
+ return init;
+}
+
+/* memory targets */
+static void release_memory_target(struct device *dev)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ list_del(&tgt->list);
+ kfree(tgt);
+}
+
+static void __init remove_memory_target(struct memory_target *tgt)
+{
+ if (tgt->is_registered) {
+ remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+ device_unregister(&tgt->dev);
+ } else
+ release_memory_target(&tgt->dev);
+}
+
+static int __init register_memory_target(struct memory_target *tgt)
+{
+ int ret;
+
+ if (!tgt->ma || !tgt->spa) {
+ pr_err("HMEM: Incomplete memory target found\n");
+ return -EINVAL;
+ }
+
+ hmem_subsys.dev_name = "mem_tgt";
+ tgt->dev.bus = &hmem_subsys;
+ tgt->dev.id = pxm_to_node(tgt->ma->proximity_domain);
+ tgt->dev.release = release_memory_target;
+ tgt->dev.groups = memory_target_attribute_groups;
+
+ ret = device_register(&tgt->dev);
+ if (ret < 0)
+ return ret;
+
+ tgt->is_registered = true;
+
+ return link_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+}
+
+static int __init add_memory_target(struct acpi_srat_mem_affinity *ma)
+{
+ struct memory_target *tgt;
+
+ if (pxm_to_node(ma->proximity_domain) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", ma->proximity_domain);
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ tgt = kzalloc(sizeof(*tgt), GFP_KERNEL);
+ if (!tgt) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ tgt->ma = ma;
+
+ list_add_tail(&tgt->list, &target_list);
+ return 0;
+}
+
+/* ACPI parsing code, starting with the HMAT */
+static int __init hmem_noop_parse(struct acpi_table_header *table)
+{
+ /* real work done by the hmat_parse_* and srat_parse_* routines */
+ return 0;
+}
+
+static bool __init hmat_spa_matches_srat(struct acpi_hmat_address_range *spa,
+ struct acpi_srat_mem_affinity *ma)
+{
+ if (spa->physical_address_base != ma->base_address ||
+ spa->physical_address_length != ma->length)
+ return false;
+
+ return true;
+}
+
+static void find_local_initiator(struct memory_target *tgt)
+{
+ struct memory_initiator *init;
+
+ if (!(tgt->spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) ||
+ pxm_to_node(tgt->spa->processor_PD) == NUMA_NO_NODE)
+ return;
+
+ list_for_each_entry(init, &initiator_list, list) {
+ if (init->pxm == tgt->spa->processor_PD) {
+ tgt->local_init = init;
+ return;
+ }
+ }
+}
+
+/* ACPI HMAT parsing routines */
+static int __init
+hmat_parse_address_range(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_address_range *spa;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ spa = (struct acpi_hmat_address_range *)header;
+ if (!spa) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (spa->header.length != sizeof(*spa)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ spa->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if ((spa->flags & ACPI_HMAT_MEMORY_PD_VALID) &&
+ spa->memory_PD == tgt->ma->proximity_domain) {
+ if (!hmat_spa_matches_srat(spa, tgt->ma)) {
+ pr_err("HMEM: SRAT and HMAT disagree on "
+ "address range info\n");
+ goto err;
+ }
+ tgt->spa = spa;
+ find_local_initiator(tgt);
+ return 0;
+ }
+ }
+
+ return 0;
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+static int __init hmat_parse_cache(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_cache *cache;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ cache = (struct acpi_hmat_cache *)header;
+ if (!cache) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (cache->header.length < sizeof(*cache)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ cache->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if (cache->memory_PD == tgt->ma->proximity_domain) {
+ tgt->is_cached = true;
+ return 0;
+ }
+ }
+
+ pr_err("HMEM: Couldn't find cached target PXM %d\n", cache->memory_PD);
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+/*
+ * SRAT parsing. We use srat_disabled() and pxm_to_node() so we don't redo
+ * any of the SRAT sanity checking done in drivers/acpi/numa.c.
+ */
+static int __init
+srat_parse_processor_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_cpu_affinity *cpu;
+ struct memory_initiator *init;
+ u32 pxm;
+
+ if (bad_hmem)
+ return 0;
+
+ cpu = (struct acpi_srat_cpu_affinity *)header;
+ if (!cpu) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ pxm = cpu->proximity_domain_lo;
+ if (acpi_srat_revision >= 2)
+ pxm |= *((unsigned int *)cpu->proximity_domain_hi) << 8;
+
+ init = add_memory_initiator(pxm);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->cpu = cpu;
+ return 0;
+}
+
+static int __init
+srat_parse_x2apic_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ x2apic = (struct acpi_srat_x2apic_cpu_affinity *)header;
+ if (!x2apic) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ init = add_memory_initiator(x2apic->proximity_domain);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->x2apic = x2apic;
+ return 0;
+}
+
+static int __init
+srat_parse_gicc_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_gicc_affinity *gicc;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ gicc = (struct acpi_srat_gicc_affinity *)header;
+ if (!gicc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ init = add_memory_initiator(gicc->proximity_domain);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->gicc = gicc;
+ return 0;
+}
+
+static int __init
+srat_parse_memory_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_mem_affinity *ma;
+
+ if (bad_hmem)
+ return 0;
+
+ ma = (struct acpi_srat_mem_affinity *)header;
+ if (!ma) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ return add_memory_target(ma);
+}
+
+/*
+ * Remove our sysfs entries, unregister our devices and free allocated memory.
+ */
+static void hmem_cleanup(void)
+{
+ struct memory_initiator *init, *init_iter;
+ struct memory_target *tgt, *tgt_iter;
+
+ list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
+ remove_memory_target(tgt);
+
+ list_for_each_entry_safe(init, init_iter, &initiator_list, list)
+ remove_memory_initiator(init);
+}
+
+static int __init hmem_init(void)
+{
+ struct acpi_table_header *tbl;
+ struct memory_initiator *init;
+ struct memory_target *tgt;
+ acpi_status status = AE_OK;
+ int ret;
+
+ if (srat_disabled())
+ return 0;
+
+ /*
+ * We take a permanent reference to both the HMAT and SRAT in ACPI
+ * memory so we can keep pointers to their subtables. These tables
+ * already had references on them which would never be released, taken
+ * by acpi_sysfs_init(), so this shouldn't negatively impact anything.
+ */
+ status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ ret = subsys_system_register(&hmem_subsys, NULL);
+ if (ret)
+ return ret;
+
+ if (!acpi_table_parse(ACPI_SIG_SRAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc srat_proc[4];
+
+ memset(srat_proc, 0, sizeof(srat_proc));
+ srat_proc[0].id = ACPI_SRAT_TYPE_CPU_AFFINITY;
+ srat_proc[0].handler = srat_parse_processor_affinity;
+ srat_proc[1].id = ACPI_SRAT_TYPE_X2APIC_CPU_AFFINITY;
+ srat_proc[1].handler = srat_parse_x2apic_affinity;
+ srat_proc[2].id = ACPI_SRAT_TYPE_GICC_AFFINITY;
+ srat_proc[2].handler = srat_parse_gicc_affinity;
+ srat_proc[3].id = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
+ srat_proc[3].handler = srat_parse_memory_affinity;
+
+ acpi_table_parse_entries_array(ACPI_SIG_SRAT,
+ sizeof(struct acpi_table_srat),
+ srat_proc, ARRAY_SIZE(srat_proc), 0);
+ }
+
+ if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc hmat_proc[2];
+
+ memset(hmat_proc, 0, sizeof(hmat_proc));
+ hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
+ hmat_proc[0].handler = hmat_parse_address_range;
+ hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
+ hmat_proc[1].handler = hmat_parse_cache;
+
+ acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ hmat_proc, ARRAY_SIZE(hmat_proc), 0);
+ }
+
+ if (bad_hmem) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ list_for_each_entry(init, &initiator_list, list) {
+ ret = register_memory_initiator(init);
+ if (ret)
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ ret = register_memory_target(tgt);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ pr_err("HMEM: Error during initialization\n");
+ hmem_cleanup();
+ return ret;
+}
+
+static __exit void hmem_exit(void)
+{
+ hmem_cleanup();
+}
+
+module_init(hmem_init);
+module_exit(hmem_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
new file mode 100644
index 0000000..8ea42b6
--- /dev/null
+++ b/drivers/acpi/hmem/hmem.h
@@ -0,0 +1,47 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _ACPI_HMEM_H_
+#define _ACPI_HMEM_H_
+
+struct memory_initiator {
+ struct list_head list;
+ struct device dev;
+
+ /* only one of the following three will be set */
+ struct acpi_srat_cpu_affinity *cpu;
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct acpi_srat_gicc_affinity *gicc;
+
+ int pxm;
+ bool is_registered;
+};
+#define to_memory_initiator(dev) container_of(dev, struct memory_initiator, dev)
+
+struct memory_target {
+ struct list_head list;
+ struct device dev;
+ struct acpi_srat_mem_affinity *ma;
+ struct acpi_hmat_address_range *spa;
+ struct memory_initiator *local_init;
+
+ bool is_cached;
+ bool is_registered;
+};
+#define to_memory_target(dev) container_of(dev, struct memory_target, dev)
+
+extern const struct attribute_group *memory_initiator_attribute_groups[];
+extern const struct attribute_group *memory_target_attribute_groups[];
+#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/initiator.c b/drivers/acpi/hmem/initiator.c
new file mode 100644
index 0000000..905f030
--- /dev/null
+++ b/drivers/acpi/hmem/initiator.c
@@ -0,0 +1,61 @@
+/*
+ * Heterogeneous memory initiator sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ return sprintf(buf, "%d\n", init->pxm);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+ int is_enabled;
+
+ if (init->cpu)
+ is_enabled = !!(init->cpu->flags & ACPI_SRAT_CPU_ENABLED);
+ else if (init->x2apic)
+ is_enabled = !!(init->x2apic->flags & ACPI_SRAT_CPU_ENABLED);
+ else
+ is_enabled = !!(init->gicc->flags & ACPI_SRAT_GICC_ENABLED);
+
+ return sprintf(buf, "%d\n", is_enabled);
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_initiator_attributes[] = {
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_enabled.attr,
+ NULL,
+};
+
+static struct attribute_group memory_initiator_attribute_group = {
+ .attrs = memory_initiator_attributes,
+};
+
+const struct attribute_group *memory_initiator_attribute_groups[] = {
+ &memory_initiator_attribute_group,
+ NULL,
+};
diff --git a/drivers/acpi/hmem/target.c b/drivers/acpi/hmem/target.c
new file mode 100644
index 0000000..dd57437
--- /dev/null
+++ b/drivers/acpi/hmem/target.c
@@ -0,0 +1,97 @@
+/*
+ * Heterogeneous memory target sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+/* attributes for memory targets */
+static ssize_t phys_addr_base_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->base_address);
+}
+static DEVICE_ATTR_RO(phys_addr_base);
+
+static ssize_t phys_length_bytes_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->length);
+}
+static DEVICE_ATTR_RO(phys_length_bytes);
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->ma->proximity_domain);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_cached_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->is_cached);
+}
+static DEVICE_ATTR_RO(is_cached);
+
+static ssize_t is_isolated_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->spa->flags & ACPI_HMAT_RESERVATION_HINT));
+}
+static DEVICE_ATTR_RO(is_isolated);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->ma->flags & ACPI_SRAT_MEM_ENABLED));
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_target_attributes[] = {
+ &dev_attr_phys_addr_base.attr,
+ &dev_attr_phys_length_bytes.attr,
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_cached.attr,
+ &dev_attr_is_isolated.attr,
+ &dev_attr_is_enabled.attr,
+ NULL
+};
+
+/* attributes which are present for all memory targets */
+static struct attribute_group memory_target_attribute_group = {
+ .attrs = memory_target_attributes,
+};
+
+const struct attribute_group *memory_target_attribute_groups[] = {
+ &memory_target_attribute_group,
+ NULL,
+};
--
2.9.4

2017-06-02 21:00:37

by Ross Zwisler

[permalink] [raw]
Subject: [RFC 6/6] hmem: add performance attributes

Add performance information found in the HMAT to the sysfs representation.
This information lives as an attribute group named "via_mem_initX" in the
memory target:

# tree mem_tgt2
mem_tgt2
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
├── uevent
└── via_mem_init0
├── mem_init0 -> ../../mem_init0
├── mem_tgt2 -> ../../mem_tgt2
├── read_bw_MBps
├── read_lat_nsec
├── write_bw_MBps
└── write_lat_nsec

This attribute group surfaces latency and bandwidth performance for a given
(initiator,target) pairing. For example:

# grep . mem_tgt2/via_mem_init0/* 2>/dev/null
mem_tgt2/via_mem_init0/read_bw_MBps:40960
mem_tgt2/via_mem_init0/read_lat_nsec:50
mem_tgt2/via_mem_init0/write_bw_MBps:40960
mem_tgt2/via_mem_init0/write_lat_nsec:50

The initiator has a symlink to the performance information which lives in
the target's attribute group:

# ls -l mem_init0/via_mem_tgt2
lrwxrwxrwx. 1 root root 0 Jun 1 10:00 mem_init0/via_mem_tgt2 ->
../mem_tgt2/via_mem_init0

We create performance attribute groups only for local (initiator,target)
pairings, where the local initiator for a given target is defined by the
"Processor Proximity Domain" field in the HMAT's Memory Subsystem Address
Range Structure table.

A given target is only local to a single initiator, so each target will
have at most one "via_mem_initX" attribute group. A given memory initiator
may have multiple local memory targets, so multiple "via_mem_tgtX" links
may exist for a given initiator.

If a given memory target is cached, we give performance numbers only for
the media itself, and rely on the "is_cached" attribute to convey the
fact that there is a caching layer.

We expose only a subset of the HMAT's performance information via sysfs as
a compromise: the local pairings will be the highest performing, and
representing all possible paths could cause an unmanageable explosion of
sysfs entries.

If we dump everything from the HMAT into sysfs we end up with
O(num_targets * num_initiators * num_caching_levels) attributes. Each of
these attributes only takes up 2 bytes in a System Locality Latency and
Bandwidth Information Structure, but if we have to create a directory entry
for each it becomes much more expensive.

For example, very large systems today can have on the order of thousands of
NUMA nodes. Say we have a system which used to have 1,000 NUMA nodes that
each had both a CPU and local memory. The HMAT allows us to separate the
CPUs and memory into separate NUMA nodes, so we can end up with 1,000 CPU
initiator NUMA nodes and 1,000 memory target NUMA nodes. If we represented
the performance information for each possible CPU/memory pair in sysfs we
would end up with 1,000,000 attribute groups.

This is not much data to pass in a set of packed data tables, but I think
we'll break sysfs if we try to create millions of attributes, regardless of
how we nest them in a directory hierarchy.

By only representing performance information for local (initiator,target)
pairings, we reduce the number of sysfs entries to O(num_targets).

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/hmem/Makefile | 2 +-
drivers/acpi/hmem/core.c | 134 +++++++++++++++++++++++++++++-
drivers/acpi/hmem/hmem.h | 9 ++
drivers/acpi/hmem/perf_attributes.c | 158 ++++++++++++++++++++++++++++++++++++
4 files changed, 301 insertions(+), 2 deletions(-)
create mode 100644 drivers/acpi/hmem/perf_attributes.c

diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
index d2aa546..44e8304 100644
--- a/drivers/acpi/hmem/Makefile
+++ b/drivers/acpi/hmem/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_ACPI_HMEM) := hmem.o
-hmem-y := core.o initiator.o target.o
+hmem-y := core.o initiator.o target.o perf_attributes.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
index 2947fac..df93058 100644
--- a/drivers/acpi/hmem/core.c
+++ b/drivers/acpi/hmem/core.c
@@ -25,9 +25,94 @@

static LIST_HEAD(target_list);
static LIST_HEAD(initiator_list);
+LIST_HEAD(locality_list);

static bool bad_hmem;

+static int add_performance_attributes(struct memory_target *tgt)
+{
+ struct attribute_group performance_attribute_group = {
+ .attrs = performance_attributes,
+ };
+ struct kobject *init_kobj, *tgt_kobj;
+ struct device *init_dev, *tgt_dev;
+ char via_init[128], via_tgt[128];
+ int ret;
+
+ if (!tgt->local_init)
+ return 0;
+
+ init_dev = &tgt->local_init->dev;
+ tgt_dev = &tgt->dev;
+ init_kobj = &init_dev->kobj;
+ tgt_kobj = &tgt_dev->kobj;
+
+ snprintf(via_init, 128, "via_%s", dev_name(init_dev));
+ snprintf(via_tgt, 128, "via_%s", dev_name(tgt_dev));
+
+ /* Create entries for initiator/target pair in the target. */
+ performance_attribute_group.name = via_init;
+ ret = sysfs_create_group(tgt_kobj, &performance_attribute_group);
+ if (ret < 0)
+ return ret;
+
+ ret = sysfs_add_link_to_group(tgt_kobj, via_init, init_kobj,
+ dev_name(init_dev));
+ if (ret < 0)
+ goto err;
+
+ ret = sysfs_add_link_to_group(tgt_kobj, via_init, tgt_kobj,
+ dev_name(tgt_dev));
+ if (ret < 0)
+ goto err;
+
+ /* Create a link in the initiator to the performance attributes. */
+ ret = sysfs_add_group_link(init_kobj, tgt_kobj, via_init, via_tgt);
+ if (ret < 0)
+ goto err;
+
+ tgt->has_perf_attributes = true;
+ return 0;
+err:
+ /* Removals of links that haven't been added yet are harmless. */
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(init_dev));
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(tgt_dev));
+ sysfs_remove_group(tgt_kobj, &performance_attribute_group);
+ return ret;
+}
+
+static void remove_performance_attributes(struct memory_target *tgt)
+{
+ struct attribute_group performance_attribute_group = {
+ .attrs = performance_attributes,
+ };
+ struct kobject *init_kobj, *tgt_kobj;
+ struct device *init_dev, *tgt_dev;
+ char via_init[128], via_tgt[128];
+
+ if (!tgt->local_init)
+ return;
+
+ init_dev = &tgt->local_init->dev;
+ tgt_dev = &tgt->dev;
+ init_kobj = &init_dev->kobj;
+ tgt_kobj = &tgt_dev->kobj;
+
+ snprintf(via_init, 128, "via_%s", dev_name(init_dev));
+ snprintf(via_tgt, 128, "via_%s", dev_name(tgt_dev));
+
+ performance_attribute_group.name = via_init;
+
+ /* Remove entries for initiator/target pair in the target. */
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(init_dev));
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(tgt_dev));
+
+ /* Remove the initiator's link to the performance attributes. */
+ sysfs_remove_link(init_kobj, via_tgt);
+
+ sysfs_remove_group(tgt_kobj, &performance_attribute_group);
+}
+
static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
{
if (node_devices[node])
@@ -168,6 +253,9 @@ static void release_memory_target(struct device *dev)

static void __init remove_memory_target(struct memory_target *tgt)
{
+ if (tgt->has_perf_attributes)
+ remove_performance_attributes(tgt);
+
if (tgt->is_registered) {
remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
&tgt->dev.kobj);
@@ -299,6 +387,38 @@ hmat_parse_address_range(struct acpi_subtable_header *header,
return -EINVAL;
}

+static int __init hmat_parse_locality(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_locality *hmat_loc;
+ struct memory_locality *loc;
+
+ if (bad_hmem)
+ return 0;
+
+ hmat_loc = (struct acpi_hmat_locality *)header;
+ if (!hmat_loc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ /* We don't report cached performance information in sysfs. */
+ if (hmat_loc->flags == ACPI_HMAT_MEMORY ||
+ hmat_loc->flags == ACPI_HMAT_LAST_LEVEL_CACHE) {
+ loc = kzalloc(sizeof(*loc), GFP_KERNEL);
+ if (!loc) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ loc->hmat_loc = hmat_loc;
+ list_add_tail(&loc->list, &locality_list);
+ }
+
+ return 0;
+}
+
static int __init hmat_parse_cache(struct acpi_subtable_header *header,
const unsigned long end)
{
@@ -442,6 +562,7 @@ srat_parse_memory_affinity(struct acpi_subtable_header *header,
static void hmem_cleanup(void)
{
struct memory_initiator *init, *init_iter;
+ struct memory_locality *loc, *loc_iter;
struct memory_target *tgt, *tgt_iter;

list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
@@ -449,6 +570,11 @@ static void hmem_cleanup(void)

list_for_each_entry_safe(init, init_iter, &initiator_list, list)
remove_memory_initiator(init);
+
+ list_for_each_entry_safe(loc, loc_iter, &locality_list, list) {
+ list_del(&loc->list);
+ kfree(loc);
+ }
}

static int __init hmem_init(void)
@@ -499,13 +625,15 @@ static int __init hmem_init(void)
}

if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
- struct acpi_subtable_proc hmat_proc[2];
+ struct acpi_subtable_proc hmat_proc[3];

memset(hmat_proc, 0, sizeof(hmat_proc));
hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
hmat_proc[0].handler = hmat_parse_address_range;
hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
hmat_proc[1].handler = hmat_parse_cache;
+ hmat_proc[2].id = ACPI_HMAT_TYPE_LOCALITY;
+ hmat_proc[2].handler = hmat_parse_locality;

acpi_table_parse_entries_array(ACPI_SIG_HMAT,
sizeof(struct acpi_table_hmat),
@@ -527,6 +655,10 @@ static int __init hmem_init(void)
ret = register_memory_target(tgt);
if (ret)
goto err;
+
+ ret = add_performance_attributes(tgt);
+ if (ret)
+ goto err;
}

return 0;
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
index 8ea42b6..6073ec4 100644
--- a/drivers/acpi/hmem/hmem.h
+++ b/drivers/acpi/hmem/hmem.h
@@ -39,9 +39,18 @@ struct memory_target {

bool is_cached;
bool is_registered;
+ bool has_perf_attributes;
};
#define to_memory_target(dev) container_of(dev, struct memory_target, dev)

+struct memory_locality {
+ struct list_head list;
+ struct acpi_hmat_locality *hmat_loc;
+};
+
extern const struct attribute_group *memory_initiator_attribute_groups[];
extern const struct attribute_group *memory_target_attribute_groups[];
+extern struct attribute *performance_attributes[];
+
+extern struct list_head locality_list;
#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/perf_attributes.c b/drivers/acpi/hmem/perf_attributes.c
new file mode 100644
index 0000000..cb77b21
--- /dev/null
+++ b/drivers/acpi/hmem/perf_attributes.c
@@ -0,0 +1,158 @@
+/*
+ * Heterogeneous memory performance attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+#define NO_VALUE -1
+#define LATENCY 0
+#define BANDWIDTH 1
+
+/* Performance attributes for an initiator/target pair. */
+static int get_performance_data(u32 init_pxm, u32 tgt_pxm,
+ struct acpi_hmat_locality *hmat_loc)
+{
+ int num_init = hmat_loc->number_of_initiator_Pds;
+ int num_tgt = hmat_loc->number_of_target_Pds;
+ int init_idx = NO_VALUE;
+ int tgt_idx = NO_VALUE;
+ u32 *initiators, *targets;
+ u16 *entries, val;
+ int i;
+
+ initiators = hmat_loc->data;
+ targets = &initiators[num_init];
+ entries = (u16 *)&targets[num_tgt];
+
+ for (i = 0; i < num_init; i++) {
+ if (initiators[i] == init_pxm) {
+ init_idx = i;
+ break;
+ }
+ }
+
+ if (init_idx == NO_VALUE)
+ return NO_VALUE;
+
+ for (i = 0; i < num_tgt; i++) {
+ if (targets[i] == tgt_pxm) {
+ tgt_idx = i;
+ break;
+ }
+ }
+
+ if (tgt_idx == NO_VALUE)
+ return NO_VALUE;
+
+ val = entries[init_idx*num_tgt + tgt_idx];
+ if (val < 10 || val == 0xFFFF)
+ return NO_VALUE;
+
+ return (val * hmat_loc->entry_base_unit) / 10;
+}
+
+/*
+ * 'direction' is either READ or WRITE
+ * 'type' is either LATENCY or BANDWIDTH
+ * Latency is reported in nanoseconds and bandwidth is reported in MB/s.
+ */
+static int get_dev_attribute(struct device *dev, int direction, int type)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+ int tgt_pxm = tgt->ma->proximity_domain;
+ int init_pxm = tgt->local_init->pxm;
+ struct memory_locality *loc;
+ int value;
+
+ list_for_each_entry(loc, &locality_list, list) {
+ struct acpi_hmat_locality *hmat_loc = loc->hmat_loc;
+
+ if (direction == READ && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_READ_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == READ && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_READ_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+ }
+
+ return NO_VALUE;
+}
+
+static ssize_t read_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, READ, LATENCY));
+}
+static DEVICE_ATTR_RO(read_lat_nsec);
+
+static ssize_t write_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, WRITE, LATENCY));
+}
+static DEVICE_ATTR_RO(write_lat_nsec);
+
+static ssize_t read_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, READ, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(read_bw_MBps);
+
+static ssize_t write_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, WRITE, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(write_bw_MBps);
+
+struct attribute *performance_attributes[] = {
+ &dev_attr_read_lat_nsec.attr,
+ &dev_attr_write_lat_nsec.attr,
+ &dev_attr_read_bw_MBps.attr,
+ &dev_attr_write_bw_MBps.attr,
+ NULL
+};
--
2.9.4

2017-06-02 21:01:00

by Ross Zwisler

[permalink] [raw]
Subject: [RFC 3/6] acpi: HMAT support in acpi_parse_entries_array()

The current implementation of acpi_parse_entries_array() assumes that each
subtable has a standard ACPI subtable header of type struct
acpi_subtable_header. This standard subtable header has a one byte type
followed by a one byte length.

The HMAT subtables need to allow for longer lengths, so they have subtable
headers of type struct acpi_hmat_structure, which has a two byte type and
a four byte length.

Enhance the subtable parsing in acpi_parse_entries_array() so that it can
handle these new HMAT subtables.

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index edb0c79..917f1cc 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -443,7 +443,7 @@ int __init acpi_numa_init(void)
* So go over all cpu entries in SRAT to get apicid to node mapping.
*/

- /* SRAT: Static Resource Affinity Table */
+ /* SRAT: System Resource Affinity Table */
if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
struct acpi_subtable_proc srat_proc[3];

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index ff42539..7979171 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
}

+static unsigned long __init
+acpi_get_entry_type(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->type;
+ else
+ return ((struct acpi_subtable_header *)entry)->type;
+}
+
+static unsigned long __init
+acpi_get_entry_length(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->length;
+ else
+ return ((struct acpi_subtable_header *)entry)->length;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(char *id)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return sizeof(struct acpi_hmat_structure);
+ else
+ return sizeof(struct acpi_subtable_header);
+}
+
/**
* acpi_parse_entries_array - for each proc_num find a suitable subtable
*
@@ -242,10 +269,10 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
struct acpi_subtable_proc *proc, int proc_num,
unsigned int max_entries)
{
- struct acpi_subtable_header *entry;
- unsigned long table_end;
+ unsigned long table_end, subtable_header_length;
int count = 0;
int errs = 0;
+ void *entry;
int i;

if (acpi_disabled)
@@ -263,19 +290,23 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
}

table_end = (unsigned long)table_header + table_header->length;
+ subtable_header_length = acpi_get_subtable_header_length(id);

/* Parse all entries looking for a match. */

- entry = (struct acpi_subtable_header *)
- ((unsigned long)table_header + table_size);
+ entry = (void *)table_header + table_size;
+
+ while (((unsigned long)entry) + subtable_header_length < table_end) {
+ unsigned long entry_type, entry_length;

- while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
- table_end) {
if (max_entries && count >= max_entries)
break;

+ entry_type = acpi_get_entry_type(id, entry);
+ entry_length = acpi_get_entry_length(id, entry);
+
for (i = 0; i < proc_num; i++) {
- if (entry->type != proc[i].id)
+ if (entry_type != proc[i].id)
continue;
if (!proc[i].handler ||
(!errs && proc[i].handler(entry, table_end))) {
@@ -290,16 +321,15 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
count++;

/*
- * If entry->length is 0, break from this loop to avoid
+ * If entry_length is 0, break from this loop to avoid
* infinite loop.
*/
- if (entry->length == 0) {
+ if (entry_length == 0) {
pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
return -EINVAL;
}

- entry = (struct acpi_subtable_header *)
- ((unsigned long)entry + entry->length);
+ entry += entry_length;
}

if (max_entries && count > max_entries) {
--
2.9.4

2017-06-02 21:01:21

by Ross Zwisler

[permalink] [raw]
Subject: [RFC 1/6] ACPICA: add HMAT table definitions

Import HMAT table definitions from the ACPICA codebase.

This kernel patch was generated using an ACPICA patch from "Zheng, Lv"
<[email protected]>. The actual upstream patch that adds these table
definitions will come from the Intel ACPICA team as part of their greater
ACPI 6.2 update.

Signed-off-by: Ross Zwisler <[email protected]>
---
include/acpi/actbl1.h | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index b4ce55c..a5df3f3 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -65,6 +65,7 @@
#define ACPI_SIG_ECDT "ECDT" /* Embedded Controller Boot Resources Table */
#define ACPI_SIG_EINJ "EINJ" /* Error Injection table */
#define ACPI_SIG_ERST "ERST" /* Error Record Serialization Table */
+#define ACPI_SIG_HMAT "HMAT" /* Heterogeneous Memory Attributes Table */
#define ACPI_SIG_HEST "HEST" /* Hardware Error Source Table */
#define ACPI_SIG_MADT "APIC" /* Multiple APIC Description Table */
#define ACPI_SIG_MSCT "MSCT" /* Maximum System Characteristics Table */
@@ -688,6 +689,124 @@ struct acpi_hest_generic_data_v300 {

/*******************************************************************************
*
+ * HMAT - Heterogeneous Memory Attributes Table (ACPI 6.2)
+ * Version 1
+ *
+ ******************************************************************************/
+
+struct acpi_table_hmat {
+ struct acpi_table_header header; /* Common ACPI table header */
+ u32 reserved;
+};
+
+
+/* Values for HMAT structure types */
+
+enum acpi_hmat_type {
+ ACPI_HMAT_TYPE_ADDRESS_RANGE = 0, /* Memory subsystem address range */
+ ACPI_HMAT_TYPE_LOCALITY = 1, /* System locality latency and bandwidth information */
+ ACPI_HMAT_TYPE_CACHE = 2, /* Memory side cache information */
+ ACPI_HMAT_TYPE_RESERVED = 3 /* 3 and greater are reserved */
+};
+
+struct acpi_hmat_structure {
+ u16 type;
+ u16 reserved;
+ u32 length;
+};
+
+/*
+ * HMAT Structures, correspond to Type in struct acpi_hmat_structure
+ */
+
+/* 0: Memory subsystem address range */
+
+struct acpi_hmat_address_range {
+ struct acpi_hmat_structure header;
+ u16 flags;
+ u16 reserved1;
+ u32 processor_PD; /* Processor proximity domain */
+ u32 memory_PD; /* Memory proximity domain */
+ u32 reserved2;
+ u64 physical_address_base; /* Physical address range base */
+ u64 physical_address_length; /* Physical address range length */
+};
+
+/* Masks for Flags field above */
+
+#define ACPI_HMAT_PROCESSOR_PD_VALID (1) /* 1: processor_PD field is valid */
+#define ACPI_HMAT_MEMORY_PD_VALID (1<<1) /* 1: memory_PD field is valid */
+#define ACPI_HMAT_RESERVATION_HINT (1<<2) /* 1: Reservation hint */
+
+/* 1: System locality latency and bandwidth information */
+
+struct acpi_hmat_locality {
+ struct acpi_hmat_structure header;
+ u8 flags;
+ u8 data_type;
+ u16 reserved1;
+ u32 number_of_initiator_Pds;
+ u32 number_of_target_Pds;
+ u32 reserved2;
+ u64 entry_base_unit;
+ u32 data[1]; /* initiator/target lists followed by entry matrix */
+};
+
+/* Masks for Flags field above */
+
+#define ACPI_HMAT_MEMORY_HIERARCHY (0x0F)
+
+/* Values for Memory Hierarchy flag */
+
+#define ACPI_HMAT_MEMORY 0
+#define ACPI_HMAT_LAST_LEVEL_CACHE 1
+#define ACPI_HMAT_1ST_LEVEL_CACHE 2
+#define ACPI_HMAT_2ND_LEVEL_CACHE 3
+#define ACPI_HMAT_3RD_LEVEL_CACHE 4
+
+/* Values for data_type field above */
+
+#define ACPI_HMAT_ACCESS_LATENCY 0
+#define ACPI_HMAT_READ_LATENCY 1
+#define ACPI_HMAT_WRITE_LATENCY 2
+#define ACPI_HMAT_ACCESS_BANDWIDTH 3
+#define ACPI_HMAT_READ_BANDWIDTH 4
+#define ACPI_HMAT_WRITE_BANDWIDTH 5
+
+/* 2: Memory side cache information */
+
+struct acpi_hmat_cache {
+ struct acpi_hmat_structure header;
+ u32 memory_PD;
+ u32 reserved1;
+ u64 cache_size;
+ u32 cache_attributes;
+ u16 reserved2;
+ u16 number_of_SMBIOShandles;
+};
+
+/* Masks for cache_attributes field above */
+
+#define ACPI_HMAT_TOTAL_CACHE_LEVEL (0x0000000F)
+#define ACPI_HMAT_CACHE_LEVEL (0x000000F0)
+#define ACPI_HMAT_CACHE_ASSOCIATIVITY (0x00000F00)
+#define ACPI_HMAT_WRITE_POLICY (0x0000F000)
+#define ACPI_HMAT_CACHE_LINE_SIZE (0xFFFF0000)
+
+/* Values for cache associativity flag */
+
+#define ACPI_HMAT_CA_NONE (0)
+#define ACPI_HMAT_CA_DIRECT_MAPPED (1)
+#define ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING (2)
+
+/* Values for write policy flag */
+
+#define ACPI_HMAT_CP_NONE (0)
+#define ACPI_HMAT_CP_WB (1)
+#define ACPI_HMAT_CP_WT (2)
+
+/*******************************************************************************
+ *
* MADT - Multiple APIC Description Table
* Version 3
*
--
2.9.4

2017-06-02 21:23:20

by Moore, Robert

[permalink] [raw]
Subject: RE: [RFC 1/6] ACPICA: add HMAT table definitions

Full support for HMAT was just released in ACPICA version 20170531.


> -----Original Message-----
> From: Ross Zwisler [mailto:[email protected]]
> Sent: Friday, June 2, 2017 2:00 PM
> To: [email protected]
> Cc: Ross Zwisler <[email protected]>; Anaczkowski, Lukasz
> <[email protected]>; Box, David E <[email protected]>;
> Kogut, Jaroslaw <[email protected]>; Lahtinen, Joonas
> <[email protected]>; Moore, Robert <[email protected]>;
> Nachimuthu, Murugasamy <[email protected]>; Odzioba,
> Lukasz <[email protected]>; Wysocki, Rafael J
> <[email protected]>; Rafael J. Wysocki <[email protected]>;
> Schmauss, Erik <[email protected]>; Verma, Vishal L
> <[email protected]>; Zheng, Lv <[email protected]>; Williams,
> Dan J <[email protected]>; Hansen, Dave <[email protected]>;
> Dave Hansen <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; Len Brown <[email protected]>; Tim Chen
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]
> Subject: [RFC 1/6] ACPICA: add HMAT table definitions
>
> Import HMAT table definitions from the ACPICA codebase.
>
> This kernel patch was generated using an ACPICA patch from "Zheng, Lv"
> <[email protected]>. The actual upstream patch that adds these table
> definitions will come from the Intel ACPICA team as part of their
> greater ACPI 6.2 update.
>
> Signed-off-by: Ross Zwisler <[email protected]>
> ---
> include/acpi/actbl1.h | 119
> ++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 119 insertions(+)
>
> diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h index
> b4ce55c..a5df3f3 100644
> --- a/include/acpi/actbl1.h
> +++ b/include/acpi/actbl1.h
> @@ -65,6 +65,7 @@
> #define ACPI_SIG_ECDT "ECDT" /* Embedded Controller Boot
> Resources Table */
> #define ACPI_SIG_EINJ "EINJ" /* Error Injection table */
> #define ACPI_SIG_ERST "ERST" /* Error Record Serialization
> Table */
> +#define ACPI_SIG_HMAT "HMAT" /* Heterogeneous Memory
> Attributes Table */
> #define ACPI_SIG_HEST "HEST" /* Hardware Error Source
> Table */
> #define ACPI_SIG_MADT "APIC" /* Multiple APIC Description
> Table */
> #define ACPI_SIG_MSCT "MSCT" /* Maximum System
> Characteristics Table */
> @@ -688,6 +689,124 @@ struct acpi_hest_generic_data_v300 {
>
>
> /***********************************************************************
> ********
> *
> + * HMAT - Heterogeneous Memory Attributes Table (ACPI 6.2)
> + * Version 1
> + *
> + **********************************************************************
> + ********/
> +
> +struct acpi_table_hmat {
> + struct acpi_table_header header; /* Common ACPI table header */
> + u32 reserved;
> +};
> +
> +
> +/* Values for HMAT structure types */
> +
> +enum acpi_hmat_type {
> + ACPI_HMAT_TYPE_ADDRESS_RANGE = 0, /* Memory subystem address range
> */
> + ACPI_HMAT_TYPE_LOCALITY = 1, /* System locality latency and
> bandwidth information */
> + ACPI_HMAT_TYPE_CACHE = 2, /* Memory side cache information
> */
> + ACPI_HMAT_TYPE_RESERVED = 3 /* 3 and greater are reserved */
> +};
> +
> +struct acpi_hmat_structure {
> + u16 type;
> + u16 reserved;
> + u32 length;
> +};
> +
> +/*
> + * HMAT Structures, correspond to Type in struct acpi_hmat_structure
> +*/
> +
> +/* 0: Memory subystem address range */
> +
> +struct acpi_hmat_address_range {
> + struct acpi_hmat_structure header;
> + u16 flags;
> + u16 reserved1;
> + u32 processor_PD; /* Processor proximity domain */
> + u32 memory_PD; /* Memory proximity domain */
> + u32 reserved2;
> + u64 physical_address_base; /* Physical address range base */
> + u64 physical_address_length; /* Physical address range length */ };
> +
> +/* Masks for Flags field above */
> +
> +#define ACPI_HMAT_PROCESSOR_PD_VALID (1) /* 1: processor_PD field is
> valid */
> +#define ACPI_HMAT_MEMORY_PD_VALID (1<<1) /* 1: memory_PD field is
> valid */
> +#define ACPI_HMAT_RESERVATION_HINT (1<<2) /* 1: Reservation hint */
> +
> +/* 1: System locality latency and bandwidth information */
> +
> +struct acpi_hmat_locality {
> + struct acpi_hmat_structure header;
> + u8 flags;
> + u8 data_type;
> + u16 reserved1;
> + u32 number_of_initiator_Pds;
> + u32 number_of_target_Pds;
> + u32 reserved2;
> + u64 entry_base_unit;
> + u32 data[1]; /* initiator/target lists followed by entry matrix */
> };
> +
> +/* Masks for Flags field above */
> +
> +#define ACPI_HMAT_MEMORY_HIERARCHY (0x0F)
> +
> +/* Values for Memory Hierarchy flag */
> +
> +#define ACPI_HMAT_MEMORY 0
> +#define ACPI_HMAT_LAST_LEVEL_CACHE 1
> +#define ACPI_HMAT_1ST_LEVEL_CACHE 2
> +#define ACPI_HMAT_2ND_LEVEL_CACHE 3
> +#define ACPI_HMAT_3RD_LEVEL_CACHE 4
> +
> +/* Values for data_type field above */
> +
> +#define ACPI_HMAT_ACCESS_LATENCY 0
> +#define ACPI_HMAT_READ_LATENCY 1
> +#define ACPI_HMAT_WRITE_LATENCY 2
> +#define ACPI_HMAT_ACCESS_BANDWIDTH 3
> +#define ACPI_HMAT_READ_BANDWIDTH 4
> +#define ACPI_HMAT_WRITE_BANDWIDTH 5
> +
> +/* 2: Memory side cache information */
> +
> +struct acpi_hmat_cache {
> + struct acpi_hmat_structure header;
> + u32 memory_PD;
> + u32 reserved1;
> + u64 cache_size;
> + u32 cache_attributes;
> + u16 reserved2;
> + u16 number_of_SMBIOShandles;
> +};
> +
> +/* Masks for cache_attributes field above */
> +
> +#define ACPI_HMAT_TOTAL_CACHE_LEVEL (0x0000000F)
> +#define ACPI_HMAT_CACHE_LEVEL (0x000000F0)
> +#define ACPI_HMAT_CACHE_ASSOCIATIVITY (0x00000F00)
> +#define ACPI_HMAT_WRITE_POLICY (0x0000F000)
> +#define ACPI_HMAT_CACHE_LINE_SIZE (0xFFFF0000)
> +
> +/* Values for cache associativity flag */
> +
> +#define ACPI_HMAT_CA_NONE (0)
> +#define ACPI_HMAT_CA_DIRECT_MAPPED (1)
> +#define ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING (2)
> +
> +/* Values for write policy flag */
> +
> +#define ACPI_HMAT_CP_NONE (0)
> +#define ACPI_HMAT_CP_WB (1)
> +#define ACPI_HMAT_CP_WT (2)
> +
> +/*******************************************************************************
> + *
> * MADT - Multiple APIC Description Table
> * Version 3
> *
> --
> 2.9.4

2017-06-05 19:51:08

by Ross Zwisler

[permalink] [raw]
Subject: [resend RFC 0/6] Add support for Heterogeneous Memory Attribute Table

[ Apologies for the resend. The [email protected] list rejected my first
posting because I wasn't subscribed. ]

==== Quick summary ====

This series adds kernel support for the Heterogeneous Memory Attribute
Table (HMAT) table, newly defined in ACPI 6.2:

http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

The HMAT table, in concert with the existing System Resource Affinity Table
(SRAT), provides users with information about memory initiators and memory
targets in the system.

A "memory initiator" in this case is any device such as a CPU or a separate
memory I/O device that can initiate a memory request. A "memory target" is
a CPU-accessible physical address range.

The HMAT provides performance information (expected latency and bandwidth,
etc.) for various (initiator,target) pairs. This is mostly motivated by
the need to optimally use performance-differentiated DRAM, but it also
allows us to describe the performance characteristics of persistent memory.

The purpose of this RFC is to gather feedback on the different options for
enabling the HMAT in the kernel and in userspace.

==== Lots of details ====

The HMAT only covers CPU-addressable memory types, not on-device memory
like what we have with Jerome Glisse's HMM series:

https://lkml.org/lkml/2017/5/24/731

One major conceptual change in ACPI 6.2 related to this work is that
proximity domains no longer need to contain a processor. We can now have
memory-only proximity domains, which means that we can now have memory-only
Linux NUMA nodes.

Here is an example configuration where we have a single processor, one
range of regular memory and one range of High Bandwidth Memory (HBM):

+---------------+ +----------------+
| Processor | | Memory |
| prox domain 0 +---+ prox domain 1 |
| NUMA node 1 | | NUMA node 2 |
+-------+-------+ +----------------+
|
+-------+----------+
| HBM |
| prox domain 2 |
| NUMA node 0 |
+------------------+

This gives us one initiator (the processor) and two targets (the two memory
ranges). Each of these three has its own ACPI proximity domain and
associated Linux NUMA node. Note also that while there is a 1:1 mapping
from each proximity domain to each NUMA node, the numbers don't necessarily
match up. Additionally we can have extra NUMA nodes that don't map back to
ACPI proximity domains.

The above configuration could also have the processor and one of the two
memory ranges sharing a proximity domain and NUMA node, but for the
purposes of the HMAT the two memory ranges will always need to be
separated.

The overall goal of this series and of the HMAT is to allow users to
identify memory using its performance characteristics. This can broadly be
done in one of two ways:

Option 1: Provide the user with a way to map between proximity domains and
NUMA nodes and a way to access the HMAT directly (probably via
/sys/firmware/acpi/tables). Then, through possibly a library and a daemon,
provide an API so that applications can either request information about
memory ranges, or request memory allocations that meet a given set of
performance characteristics.

Option 2: Provide the user with HMAT performance data directly in sysfs,
allowing applications to directly access it without the need for the
library and daemon.

The kernel work for option 1 is started by patches 1-4. These just surface
the minimal amount of information in sysfs to allow userspace to map
between proximity domains and NUMA nodes so that the raw data in the HMAT
table can be understood.

Patches 5 and 6 enable option 2, adding performance information from the
HMAT to sysfs. The second option is complicated by the amount of HMAT data
that could be present in very large systems, so in this series we only
surface performance information for local (initiator,target) pairings. The
changelog for patch 6 discusses this in detail.

==== Next steps ====

There is still a lot of work to be done on this series, but the overall
goal of this RFC is to gather feedback on which of the two options we
should pursue, or whether some third option is preferred. After that is
done and we have a solid direction we can add support for ACPI hot add,
test more complex configurations, etc.

So, for applications that need to differentiate between memory ranges based
on their performance, what option would work best for you? Is the local
(initiator,target) performance provided by patch 6 enough, or do you
require performance information for all possible (initiator,target)
pairings?

If option 1 looks best, do we have ideas on what the userspace API would
look like?

For option 2 Dan Williams had suggested that it may be worthwhile to allow
for multiple memory initiators to be listed as "local" if they all have the
same performance, even if the HMAT's Memory Subsystem Address Range
Structure table only defines a single local initiator. Do others agree?

What other things should we consider, or what needs do you have that aren't
being addressed?

Ross Zwisler (6):
ACPICA: add HMAT table definitions
acpi: add missing include in acpi_numa.h
acpi: HMAT support in acpi_parse_entries_array()
hmem: add heterogeneous memory sysfs support
sysfs: add sysfs_add_group_link()
hmem: add performance attributes

MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 679 ++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 56 +++
drivers/acpi/hmem/initiator.c | 61 ++++
drivers/acpi/hmem/perf_attributes.c | 158 +++++++++
drivers/acpi/hmem/target.c | 97 ++++++
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++-
fs/sysfs/group.c | 30 +-
include/acpi/acpi_numa.h | 1 +
include/acpi/actbl1.h | 119 +++++++
include/linux/sysfs.h | 2 +
16 files changed, 1254 insertions(+), 19 deletions(-)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/perf_attributes.c
create mode 100644 drivers/acpi/hmem/target.c

--
2.9.4

2017-06-05 19:51:19

by Ross Zwisler

[permalink] [raw]
Subject: [resend RFC 2/6] acpi: add missing include in acpi_numa.h

Right now if a file includes acpi_numa.h and doesn't happen to include
linux/numa.h before it, it gets the following warning:

./include/acpi/acpi_numa.h:9:5: warning: "MAX_NUMNODES" is not defined [-Wundef]
#if MAX_NUMNODES > 256
^~~~~~~~~~~~

Signed-off-by: Ross Zwisler <[email protected]>
---
include/acpi/acpi_numa.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index d4b7294..1e3a74f 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -3,6 +3,7 @@

#ifdef CONFIG_ACPI_NUMA
#include <linux/kernel.h>
+#include <linux/numa.h>

/* Proximity bitmap length */
#if MAX_NUMNODES > 256
--
2.9.4

2017-06-05 19:51:24

by Ross Zwisler

[permalink] [raw]
Subject: [resend RFC 3/6] acpi: HMAT support in acpi_parse_entries_array()

The current implementation of acpi_parse_entries_array() assumes that each
subtable has a standard ACPI subtable entry of type struct
acpi_subtable_header. This standard subtable header has a one byte type
followed by a one byte length.

The HMAT subtables have to allow for a longer length so they have subtable
headers of type struct acpi_hmat_structure which has a 2 byte type and a 4
byte length.

Enhance the subtable parsing in acpi_parse_entries_array() so that it can
handle these new HMAT subtables.

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/numa.c | 2 +-
drivers/acpi/tables.c | 52 ++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 42 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index edb0c79..917f1cc 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -443,7 +443,7 @@ int __init acpi_numa_init(void)
* So go over all cpu entries in SRAT to get apicid to node mapping.
*/

- /* SRAT: Static Resource Affinity Table */
+ /* SRAT: System Resource Affinity Table */
if (!acpi_table_parse(ACPI_SIG_SRAT, acpi_parse_srat)) {
struct acpi_subtable_proc srat_proc[3];

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index ff42539..7979171 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -218,6 +218,33 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
}

+static unsigned long __init
+acpi_get_entry_type(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->type;
+ else
+ return ((struct acpi_subtable_header *)entry)->type;
+}
+
+static unsigned long __init
+acpi_get_entry_length(char *id, void *entry)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return ((struct acpi_hmat_structure *)entry)->length;
+ else
+ return ((struct acpi_subtable_header *)entry)->length;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(char *id)
+{
+ if (!strncmp(id, ACPI_SIG_HMAT, 4))
+ return sizeof(struct acpi_hmat_structure);
+ else
+ return sizeof(struct acpi_subtable_header);
+}
+
/**
* acpi_parse_entries_array - for each proc_num find a suitable subtable
*
@@ -242,10 +269,10 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
struct acpi_subtable_proc *proc, int proc_num,
unsigned int max_entries)
{
- struct acpi_subtable_header *entry;
- unsigned long table_end;
+ unsigned long table_end, subtable_header_length;
int count = 0;
int errs = 0;
+ void *entry;
int i;

if (acpi_disabled)
@@ -263,19 +290,23 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
}

table_end = (unsigned long)table_header + table_header->length;
+ subtable_header_length = acpi_get_subtable_header_length(id);

/* Parse all entries looking for a match. */

- entry = (struct acpi_subtable_header *)
- ((unsigned long)table_header + table_size);
+ entry = (void *)table_header + table_size;
+
+ while (((unsigned long)entry) + subtable_header_length < table_end) {
+ unsigned long entry_type, entry_length;

- while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
- table_end) {
if (max_entries && count >= max_entries)
break;

+ entry_type = acpi_get_entry_type(id, entry);
+ entry_length = acpi_get_entry_length(id, entry);
+
for (i = 0; i < proc_num; i++) {
- if (entry->type != proc[i].id)
+ if (entry_type != proc[i].id)
continue;
if (!proc[i].handler ||
(!errs && proc[i].handler(entry, table_end))) {
@@ -290,16 +321,15 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
count++;

/*
- * If entry->length is 0, break from this loop to avoid
+ * If entry_length is 0, break from this loop to avoid
* infinite loop.
*/
- if (entry->length == 0) {
+ if (entry_length == 0) {
pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
return -EINVAL;
}

- entry = (struct acpi_subtable_header *)
- ((unsigned long)entry + entry->length);
+ entry += entry_length;
}

if (max_entries && count > max_entries) {
--
2.9.4

2017-06-05 19:51:30

by Ross Zwisler

[permalink] [raw]
Subject: [resend RFC 6/6] hmem: add performance attributes

Add performance information found in the HMAT to the sysfs representation.
This information lives as an attribute group named "via_mem_initX" in the
memory target:

# tree mem_tgt2
mem_tgt2
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
├── uevent
└── via_mem_init0
├── mem_init0 -> ../../mem_init0
├── mem_tgt2 -> ../../mem_tgt2
├── read_bw_MBps
├── read_lat_nsec
├── write_bw_MBps
└── write_lat_nsec

This attribute group surfaces latency and bandwidth performance for a given
(initiator,target) pairing. For example:

# grep . mem_tgt2/via_mem_init0/* 2>/dev/null
mem_tgt2/via_mem_init0/read_bw_MBps:40960
mem_tgt2/via_mem_init0/read_lat_nsec:50
mem_tgt2/via_mem_init0/write_bw_MBps:40960
mem_tgt2/via_mem_init0/write_lat_nsec:50

The initiator has a symlink to the performance information which lives in
the target's attribute group:

# ls -l mem_init0/via_mem_tgt2
lrwxrwxrwx. 1 root root 0 Jun 1 10:00 mem_init0/via_mem_tgt2 -> ../mem_tgt2/via_mem_init0

We create performance attribute groups only for local (initiator,target)
pairings, where the local initiator for a given target is defined by the
"Processor Proximity Domain" field in the HMAT's Memory Subsystem Address
Range Structure table.

A given target is only local to a single initiator, so each target will
have at most one "via_mem_initX" attribute group. A given memory initiator
may have multiple local memory targets, so multiple "via_mem_tgtX" links
may exist for a given initiator.

If a given memory target is cached we give performance numbers only for the
media itself, and rely on the "is_cached" attribute to represent the
fact that there is a caching layer.

We only expose a subset of the performance information presented in the
HMAT via sysfs as a compromise, driven by the fact that those usages will
be the highest performing and that representing all possible paths could
cause an unmanageable explosion of sysfs entries.

If we dump everything from the HMAT into sysfs we end up with
O(num_targets * num_initiators * num_caching_levels) attributes. Each of
these attributes only takes up 2 bytes in a System Locality Latency and
Bandwidth Information Structure, but if we have to create a directory entry
for each it becomes much more expensive.

For example, very large systems today can have on the order of thousands of
NUMA nodes. Say we have a system that previously had 1,000 NUMA nodes,
each with both a CPU and local memory. The HMAT allows us to separate the
CPUs and memory into separate NUMA nodes, so we can end up with 1,000 CPU
initiator NUMA nodes and 1,000 memory target NUMA nodes. If we represented
the performance information for each possible CPU/memory pair in sysfs we
would end up with 1,000,000 attribute groups.

This is a lot to pass in a set of packed data tables, but I think we'll
break sysfs if we try to create millions of attributes, regardless of how
we nest them in a directory hierarchy.

By only representing performance information for local (initiator,target)
pairings, we reduce the number of sysfs entries to O(num_targets).

Signed-off-by: Ross Zwisler <[email protected]>
---
drivers/acpi/hmem/Makefile | 2 +-
drivers/acpi/hmem/core.c | 134 +++++++++++++++++++++++++++++-
drivers/acpi/hmem/hmem.h | 9 ++
drivers/acpi/hmem/perf_attributes.c | 158 ++++++++++++++++++++++++++++++++++++
4 files changed, 301 insertions(+), 2 deletions(-)
create mode 100644 drivers/acpi/hmem/perf_attributes.c

diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
index d2aa546..44e8304 100644
--- a/drivers/acpi/hmem/Makefile
+++ b/drivers/acpi/hmem/Makefile
@@ -1,2 +1,2 @@
obj-$(CONFIG_ACPI_HMEM) := hmem.o
-hmem-y := core.o initiator.o target.o
+hmem-y := core.o initiator.o target.o perf_attributes.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
index 2947fac..df93058 100644
--- a/drivers/acpi/hmem/core.c
+++ b/drivers/acpi/hmem/core.c
@@ -25,9 +25,94 @@

static LIST_HEAD(target_list);
static LIST_HEAD(initiator_list);
+LIST_HEAD(locality_list);

static bool bad_hmem;

+static int add_performance_attributes(struct memory_target *tgt)
+{
+ struct attribute_group performance_attribute_group = {
+ .attrs = performance_attributes,
+ };
+ struct kobject *init_kobj, *tgt_kobj;
+ struct device *init_dev, *tgt_dev;
+ char via_init[128], via_tgt[128];
+ int ret;
+
+ if (!tgt->local_init)
+ return 0;
+
+ init_dev = &tgt->local_init->dev;
+ tgt_dev = &tgt->dev;
+ init_kobj = &init_dev->kobj;
+ tgt_kobj = &tgt_dev->kobj;
+
+ snprintf(via_init, 128, "via_%s", dev_name(init_dev));
+ snprintf(via_tgt, 128, "via_%s", dev_name(tgt_dev));
+
+ /* Create entries for initiator/target pair in the target. */
+ performance_attribute_group.name = via_init;
+ ret = sysfs_create_group(tgt_kobj, &performance_attribute_group);
+ if (ret < 0)
+ return ret;
+
+ ret = sysfs_add_link_to_group(tgt_kobj, via_init, init_kobj,
+ dev_name(init_dev));
+ if (ret < 0)
+ goto err;
+
+ ret = sysfs_add_link_to_group(tgt_kobj, via_init, tgt_kobj,
+ dev_name(tgt_dev));
+ if (ret < 0)
+ goto err;
+
+ /* Create a link in the initiator to the performance attributes. */
+ ret = sysfs_add_group_link(init_kobj, tgt_kobj, via_init, via_tgt);
+ if (ret < 0)
+ goto err;
+
+ tgt->has_perf_attributes = true;
+ return 0;
+err:
+ /* Removals of links that haven't been added yet are harmless. */
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(init_dev));
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(tgt_dev));
+ sysfs_remove_group(tgt_kobj, &performance_attribute_group);
+ return ret;
+}
+
+static void remove_performance_attributes(struct memory_target *tgt)
+{
+ struct attribute_group performance_attribute_group = {
+ .attrs = performance_attributes,
+ };
+ struct kobject *init_kobj, *tgt_kobj;
+ struct device *init_dev, *tgt_dev;
+ char via_init[128], via_tgt[128];
+
+ if (!tgt->local_init)
+ return;
+
+ init_dev = &tgt->local_init->dev;
+ tgt_dev = &tgt->dev;
+ init_kobj = &init_dev->kobj;
+ tgt_kobj = &tgt_dev->kobj;
+
+ snprintf(via_init, 128, "via_%s", dev_name(init_dev));
+ snprintf(via_tgt, 128, "via_%s", dev_name(tgt_dev));
+
+ performance_attribute_group.name = via_init;
+
+ /* Remove entries for initiator/target pair in the target. */
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(init_dev));
+ sysfs_remove_link_from_group(tgt_kobj, via_init, dev_name(tgt_dev));
+
+ /* Remove the initiator's link to the performance attributes. */
+ sysfs_remove_link(init_kobj, via_tgt);
+
+ sysfs_remove_group(tgt_kobj, &performance_attribute_group);
+}
+
static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
{
if (node_devices[node])
@@ -168,6 +253,9 @@ static void release_memory_target(struct device *dev)

static void __init remove_memory_target(struct memory_target *tgt)
{
+ if (tgt->has_perf_attributes)
+ remove_performance_attributes(tgt);
+
if (tgt->is_registered) {
remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
&tgt->dev.kobj);
@@ -299,6 +387,38 @@ hmat_parse_address_range(struct acpi_subtable_header *header,
return -EINVAL;
}

+static int __init hmat_parse_locality(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_locality *hmat_loc;
+ struct memory_locality *loc;
+
+ if (bad_hmem)
+ return 0;
+
+ hmat_loc = (struct acpi_hmat_locality *)header;
+ if (!hmat_loc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ /* We don't report cached performance information in sysfs. */
+ if (hmat_loc->flags == ACPI_HMAT_MEMORY ||
+ hmat_loc->flags == ACPI_HMAT_LAST_LEVEL_CACHE) {
+ loc = kzalloc(sizeof(*loc), GFP_KERNEL);
+ if (!loc) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ loc->hmat_loc = hmat_loc;
+ list_add_tail(&loc->list, &locality_list);
+ }
+
+ return 0;
+}
+
static int __init hmat_parse_cache(struct acpi_subtable_header *header,
const unsigned long end)
{
@@ -442,6 +562,7 @@ srat_parse_memory_affinity(struct acpi_subtable_header *header,
static void hmem_cleanup(void)
{
struct memory_initiator *init, *init_iter;
+ struct memory_locality *loc, *loc_iter;
struct memory_target *tgt, *tgt_iter;

list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
@@ -449,6 +570,11 @@ static void hmem_cleanup(void)

list_for_each_entry_safe(init, init_iter, &initiator_list, list)
remove_memory_initiator(init);
+
+ list_for_each_entry_safe(loc, loc_iter, &locality_list, list) {
+ list_del(&loc->list);
+ kfree(loc);
+ }
}

static int __init hmem_init(void)
@@ -499,13 +625,15 @@ static int __init hmem_init(void)
}

if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
- struct acpi_subtable_proc hmat_proc[2];
+ struct acpi_subtable_proc hmat_proc[3];

memset(hmat_proc, 0, sizeof(hmat_proc));
hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
hmat_proc[0].handler = hmat_parse_address_range;
hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
hmat_proc[1].handler = hmat_parse_cache;
+ hmat_proc[2].id = ACPI_HMAT_TYPE_LOCALITY;
+ hmat_proc[2].handler = hmat_parse_locality;

acpi_table_parse_entries_array(ACPI_SIG_HMAT,
sizeof(struct acpi_table_hmat),
@@ -527,6 +655,10 @@ static int __init hmem_init(void)
ret = register_memory_target(tgt);
if (ret)
goto err;
+
+ ret = add_performance_attributes(tgt);
+ if (ret)
+ goto err;
}

return 0;
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
index 8ea42b6..6073ec4 100644
--- a/drivers/acpi/hmem/hmem.h
+++ b/drivers/acpi/hmem/hmem.h
@@ -39,9 +39,18 @@ struct memory_target {

bool is_cached;
bool is_registered;
+ bool has_perf_attributes;
};
#define to_memory_target(dev) container_of(dev, struct memory_target, dev)

+struct memory_locality {
+ struct list_head list;
+ struct acpi_hmat_locality *hmat_loc;
+};
+
extern const struct attribute_group *memory_initiator_attribute_groups[];
extern const struct attribute_group *memory_target_attribute_groups[];
+extern struct attribute *performance_attributes[];
+
+extern struct list_head locality_list;
#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/perf_attributes.c b/drivers/acpi/hmem/perf_attributes.c
new file mode 100644
index 0000000..cb77b21
--- /dev/null
+++ b/drivers/acpi/hmem/perf_attributes.c
@@ -0,0 +1,158 @@
+/*
+ * Heterogeneous memory performance attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+#define NO_VALUE -1
+#define LATENCY 0
+#define BANDWIDTH 1
+
+/* Performance attributes for an initiator/target pair. */
+static int get_performance_data(u32 init_pxm, u32 tgt_pxm,
+ struct acpi_hmat_locality *hmat_loc)
+{
+ int num_init = hmat_loc->number_of_initiator_Pds;
+ int num_tgt = hmat_loc->number_of_target_Pds;
+ int init_idx = NO_VALUE;
+ int tgt_idx = NO_VALUE;
+ u32 *initiators, *targets;
+ u16 *entries, val;
+ int i;
+
+ initiators = hmat_loc->data;
+ targets = &initiators[num_init];
+ entries = (u16 *)&targets[num_tgt];
+
+ for (i = 0; i < num_init; i++) {
+ if (initiators[i] == init_pxm) {
+ init_idx = i;
+ break;
+ }
+ }
+
+ if (init_idx == NO_VALUE)
+ return NO_VALUE;
+
+ for (i = 0; i < num_tgt; i++) {
+ if (targets[i] == tgt_pxm) {
+ tgt_idx = i;
+ break;
+ }
+ }
+
+ if (tgt_idx == NO_VALUE)
+ return NO_VALUE;
+
+ val = entries[init_idx*num_tgt + tgt_idx];
+ if (val < 10 || val == 0xFFFF)
+ return NO_VALUE;
+
+ return (val * hmat_loc->entry_base_unit) / 10;
+}
+
+/*
+ * 'direction' is either READ or WRITE
+ * 'type' is either LATENCY or BANDWIDTH
+ * Latency is reported in nanoseconds and bandwidth is reported in MB/s.
+ */
+static int get_dev_attribute(struct device *dev, int direction, int type)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+ int tgt_pxm = tgt->ma->proximity_domain;
+ int init_pxm = tgt->local_init->pxm;
+ struct memory_locality *loc;
+ int value;
+
+ list_for_each_entry(loc, &locality_list, list) {
+ struct acpi_hmat_locality *hmat_loc = loc->hmat_loc;
+
+ if (direction == READ && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_READ_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == LATENCY &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_LATENCY ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_LATENCY)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == READ && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_READ_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+
+ if (direction == WRITE && type == BANDWIDTH &&
+ (hmat_loc->data_type == ACPI_HMAT_ACCESS_BANDWIDTH ||
+ hmat_loc->data_type == ACPI_HMAT_WRITE_BANDWIDTH)) {
+ value = get_performance_data(init_pxm, tgt_pxm,
+ hmat_loc);
+ if (value != NO_VALUE)
+ return value;
+ }
+ }
+
+ return NO_VALUE;
+}
+
+static ssize_t read_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, READ, LATENCY));
+}
+static DEVICE_ATTR_RO(read_lat_nsec);
+
+static ssize_t write_lat_nsec_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, WRITE, LATENCY));
+}
+static DEVICE_ATTR_RO(write_lat_nsec);
+
+static ssize_t read_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, READ, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(read_bw_MBps);
+
+static ssize_t write_bw_MBps_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ return sprintf(buf, "%d\n", get_dev_attribute(dev, WRITE, BANDWIDTH));
+}
+static DEVICE_ATTR_RO(write_bw_MBps);
+
+struct attribute *performance_attributes[] = {
+ &dev_attr_read_lat_nsec.attr,
+ &dev_attr_write_lat_nsec.attr,
+ &dev_attr_read_bw_MBps.attr,
+ &dev_attr_write_bw_MBps.attr,
+ NULL
+};
--
2.9.4

2017-06-05 19:51:23

by Ross Zwisler

[permalink] [raw]
Subject: [resend RFC 4/6] hmem: add heterogeneous memory sysfs support

Add a new sysfs subsystem, /sys/devices/system/hmem, which surfaces
information about memory initiators and memory targets to the user. These
initiators and targets are described by the ACPI SRAT and HMAT tables.

A "memory initiator" in this case is any device such as a CPU or a separate
memory I/O device that can initiate a memory request. A "memory target" is
a CPU-accessible physical address range.

The key piece of information surfaced by this patch is the mapping between
the ACPI table "proximity domain" numbers, held in the "firmware_id"
attribute, and Linux NUMA node numbers.

Initiators are found at /sys/devices/system/hmem/mem_initX, and the
attributes for a given initiator look like this:

# tree mem_init0/
mem_init0/
├── cpu0 -> ../../cpu/cpu0
├── firmware_id
├── is_enabled
├── node0 -> ../../node/node0
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

Where "mem_init0" on my system represents the CPU acting as a memory
initiator at NUMA node 0.

Targets are found at /sys/devices/system/hmem/mem_tgtX, and the attributes
for a given target look like this:

# tree mem_tgt2/
mem_tgt2/
├── firmware_id
├── is_cached
├── is_enabled
├── is_isolated
├── node2 -> ../../node/node2
├── phys_addr_base
├── phys_length_bytes
├── power
│   ├── async
│   ...
├── subsystem -> ../../../../bus/hmem
└── uevent

Signed-off-by: Ross Zwisler <[email protected]>
---
MAINTAINERS | 5 +
drivers/acpi/Kconfig | 1 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmem/Kconfig | 7 +
drivers/acpi/hmem/Makefile | 2 +
drivers/acpi/hmem/core.c | 547 ++++++++++++++++++++++++++++++++++++++++++
drivers/acpi/hmem/hmem.h | 47 ++++
drivers/acpi/hmem/initiator.c | 61 +++++
drivers/acpi/hmem/target.c | 97 ++++++++
9 files changed, 768 insertions(+)
create mode 100644 drivers/acpi/hmem/Kconfig
create mode 100644 drivers/acpi/hmem/Makefile
create mode 100644 drivers/acpi/hmem/core.c
create mode 100644 drivers/acpi/hmem/hmem.h
create mode 100644 drivers/acpi/hmem/initiator.c
create mode 100644 drivers/acpi/hmem/target.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 053c3bd..554b833 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6085,6 +6085,11 @@ S: Supported
F: drivers/scsi/hisi_sas/
F: Documentation/devicetree/bindings/scsi/hisilicon-sas.txt

+HMEM (ACPI HETEROGENEOUS MEMORY SUPPORT)
+M: Ross Zwisler <[email protected]>
+S: Supported
+F: drivers/acpi/hmem/
+
HOST AP DRIVER
M: Jouni Malinen <[email protected]>
L: [email protected]
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 1ce52f8..44dd97f 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -460,6 +460,7 @@ config ACPI_REDUCED_HARDWARE_ONLY
If you are unsure what to do, do not enable this option.

source "drivers/acpi/nfit/Kconfig"
+source "drivers/acpi/hmem/Kconfig"

source "drivers/acpi/apei/Kconfig"
source "drivers/acpi/dptf/Kconfig"
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index b1aacfc..31e3f20 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -72,6 +72,7 @@ obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
obj-$(CONFIG_ACPI) += container.o
obj-$(CONFIG_ACPI_THERMAL) += thermal.o
obj-$(CONFIG_ACPI_NFIT) += nfit/
+obj-$(CONFIG_ACPI_HMEM) += hmem/
obj-$(CONFIG_ACPI) += acpi_memhotplug.o
obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
obj-$(CONFIG_ACPI_BATTERY) += battery.o
diff --git a/drivers/acpi/hmem/Kconfig b/drivers/acpi/hmem/Kconfig
new file mode 100644
index 0000000..09282be
--- /dev/null
+++ b/drivers/acpi/hmem/Kconfig
@@ -0,0 +1,7 @@
+config ACPI_HMEM
+ bool "ACPI Heterogeneous Memory Support"
+ depends on ACPI_NUMA
+ depends on SYSFS
+ help
+ Exports a sysfs representation of the ACPI Heterogeneous Memory
+ Attributes Table (HMAT).
diff --git a/drivers/acpi/hmem/Makefile b/drivers/acpi/hmem/Makefile
new file mode 100644
index 0000000..d2aa546
--- /dev/null
+++ b/drivers/acpi/hmem/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_ACPI_HMEM) := hmem.o
+hmem-y := core.o initiator.o target.o
diff --git a/drivers/acpi/hmem/core.c b/drivers/acpi/hmem/core.c
new file mode 100644
index 0000000..2947fac
--- /dev/null
+++ b/drivers/acpi/hmem/core.c
@@ -0,0 +1,547 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include "hmem.h"
+
+static LIST_HEAD(target_list);
+static LIST_HEAD(initiator_list);
+
+static bool bad_hmem;
+
+static int link_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ return sysfs_create_link(kobj, &node_devices[node]->dev.kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+
+ return 0;
+}
+
+static void remove_node_for_kobj(unsigned int node, struct kobject *kobj)
+{
+ if (node_devices[node])
+ sysfs_remove_link(kobj,
+ kobject_name(&node_devices[node]->dev.kobj));
+}
+
+#define HMEM_CLASS_NAME "hmem"
+
+static struct bus_type hmem_subsys = {
+ /*
+ * .dev_name is set before device_register() based on the type of
+ * device we are registering.
+ */
+ .name = HMEM_CLASS_NAME,
+};
+
+/* memory initiators */
+static int link_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ return sysfs_create_link(&init->dev.kobj,
+ &cpu_dev->kobj,
+ kobject_name(&cpu_dev->kobj));
+ }
+
+ }
+ return 0;
+}
+
+static void remove_cpu_under_mem_init(struct memory_initiator *init)
+{
+ struct device *cpu_dev;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ cpu_dev = get_cpu_device(cpu);
+ if (!cpu_dev)
+ continue;
+
+ if (pxm_to_node(init->pxm) == cpu_to_node(cpu)) {
+ sysfs_remove_link(&init->dev.kobj,
+ kobject_name(&cpu_dev->kobj));
+ return;
+ }
+
+ }
+}
+
+static void release_memory_initiator(struct device *dev)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ list_del(&init->list);
+ kfree(init);
+}
+
+static void __init remove_memory_initiator(struct memory_initiator *init)
+{
+ if (init->is_registered) {
+ remove_cpu_under_mem_init(init);
+ remove_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+ device_unregister(&init->dev);
+ } else
+ release_memory_initiator(&init->dev);
+}
+
+static int __init register_memory_initiator(struct memory_initiator *init)
+{
+ int ret;
+
+ hmem_subsys.dev_name = "mem_init";
+ init->dev.bus = &hmem_subsys;
+ init->dev.id = pxm_to_node(init->pxm);
+ init->dev.release = release_memory_initiator;
+ init->dev.groups = memory_initiator_attribute_groups;
+
+ ret = device_register(&init->dev);
+ if (ret < 0)
+ return ret;
+
+ init->is_registered = true;
+
+ ret = link_cpu_under_mem_init(init);
+ if (ret < 0)
+ return ret;
+
+ return link_node_for_kobj(pxm_to_node(init->pxm), &init->dev.kobj);
+}
+
+static struct memory_initiator * __init add_memory_initiator(int pxm)
+{
+ struct memory_initiator *init;
+
+ if (pxm_to_node(pxm) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", pxm);
+ bad_hmem = true;
+ return ERR_PTR(-EINVAL);
+ }
+
+ init = kzalloc(sizeof(*init), GFP_KERNEL);
+ if (!init) {
+ bad_hmem = true;
+ return ERR_PTR(-ENOMEM);
+ }
+
+ init->pxm = pxm;
+
+ list_add_tail(&init->list, &initiator_list);
+ return init;
+}
+
+/* memory targets */
+static void release_memory_target(struct device *dev)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ list_del(&tgt->list);
+ kfree(tgt);
+}
+
+static void __init remove_memory_target(struct memory_target *tgt)
+{
+ if (tgt->is_registered) {
+ remove_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+ device_unregister(&tgt->dev);
+ } else
+ release_memory_target(&tgt->dev);
+}
+
+static int __init register_memory_target(struct memory_target *tgt)
+{
+ int ret;
+
+ if (!tgt->ma || !tgt->spa) {
+ pr_err("HMEM: Incomplete memory target found\n");
+ return -EINVAL;
+ }
+
+ hmem_subsys.dev_name = "mem_tgt";
+ tgt->dev.bus = &hmem_subsys;
+ tgt->dev.id = pxm_to_node(tgt->ma->proximity_domain);
+ tgt->dev.release = release_memory_target;
+ tgt->dev.groups = memory_target_attribute_groups;
+
+ ret = device_register(&tgt->dev);
+ if (ret < 0)
+ return ret;
+
+ tgt->is_registered = true;
+
+ return link_node_for_kobj(pxm_to_node(tgt->ma->proximity_domain),
+ &tgt->dev.kobj);
+}
+
+static int __init add_memory_target(struct acpi_srat_mem_affinity *ma)
+{
+ struct memory_target *tgt;
+
+ if (pxm_to_node(ma->proximity_domain) == NUMA_NO_NODE) {
+ pr_err("HMEM: No NUMA node for PXM %d\n", ma->proximity_domain);
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ tgt = kzalloc(sizeof(*tgt), GFP_KERNEL);
+ if (!tgt) {
+ bad_hmem = true;
+ return -ENOMEM;
+ }
+
+ tgt->ma = ma;
+
+ list_add_tail(&tgt->list, &target_list);
+ return 0;
+}
+
+/* ACPI parsing code, starting with the HMAT */
+static int __init hmem_noop_parse(struct acpi_table_header *table)
+{
+ /* real work done by the hmat_parse_* and srat_parse_* routines */
+ return 0;
+}
+
+static bool __init hmat_spa_matches_srat(struct acpi_hmat_address_range *spa,
+ struct acpi_srat_mem_affinity *ma)
+{
+ if (spa->physical_address_base != ma->base_address ||
+ spa->physical_address_length != ma->length)
+ return false;
+
+ return true;
+}
+
+static void find_local_initiator(struct memory_target *tgt)
+{
+ struct memory_initiator *init;
+
+ if (!(tgt->spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) ||
+ pxm_to_node(tgt->spa->processor_PD) == NUMA_NO_NODE)
+ return;
+
+ list_for_each_entry(init, &initiator_list, list) {
+ if (init->pxm == tgt->spa->processor_PD) {
+ tgt->local_init = init;
+ return;
+ }
+ }
+}
+
+/* ACPI HMAT parsing routines */
+static int __init
+hmat_parse_address_range(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_address_range *spa;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ spa = (struct acpi_hmat_address_range *)header;
+ if (!spa) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (spa->header.length != sizeof(*spa)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ spa->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if ((spa->flags & ACPI_HMAT_MEMORY_PD_VALID) &&
+ spa->memory_PD == tgt->ma->proximity_domain) {
+ if (!hmat_spa_matches_srat(spa, tgt->ma)) {
+ pr_err("HMEM: SRAT and HMAT disagree on "
+ "address range info\n");
+ goto err;
+ }
+ tgt->spa = spa;
+ find_local_initiator(tgt);
+ return 0;
+ }
+ }
+
+ return 0;
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+static int __init hmat_parse_cache(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_cache *cache;
+ struct memory_target *tgt;
+
+ if (bad_hmem)
+ return 0;
+
+ cache = (struct acpi_hmat_cache *)header;
+ if (!cache) {
+ pr_err("HMEM: NULL table entry\n");
+ goto err;
+ }
+
+ if (cache->header.length < sizeof(*cache)) {
+ pr_err("HMEM: Unexpected header length: %d\n",
+ cache->header.length);
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ if (cache->memory_PD == tgt->ma->proximity_domain) {
+ tgt->is_cached = true;
+ return 0;
+ }
+ }
+
+ pr_err("HMEM: Couldn't find cached target PXM %d\n", cache->memory_PD);
+err:
+ bad_hmem = true;
+ return -EINVAL;
+}
+
+/*
+ * SRAT parsing. We use srat_disabled() and pxm_to_node() so we don't redo
+ * any of the SRAT sanity checking done in drivers/acpi/numa.c.
+ */
+static int __init
+srat_parse_processor_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_cpu_affinity *cpu;
+ struct memory_initiator *init;
+ u32 pxm;
+
+ if (bad_hmem)
+ return 0;
+
+ cpu = (struct acpi_srat_cpu_affinity *)header;
+ if (!cpu) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ pxm = cpu->proximity_domain_lo;
+ if (acpi_srat_revision >= 2)
+ pxm |= *((unsigned int *)cpu->proximity_domain_hi) << 8;
+
+ init = add_memory_initiator(pxm);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->cpu = cpu;
+ return 0;
+}
+
+static int __init
+srat_parse_x2apic_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ x2apic = (struct acpi_srat_x2apic_cpu_affinity *)header;
+ if (!x2apic) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ init = add_memory_initiator(x2apic->proximity_domain);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->x2apic = x2apic;
+ return 0;
+}
+
+static int __init
+srat_parse_gicc_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_gicc_affinity *gicc;
+ struct memory_initiator *init;
+
+ if (bad_hmem)
+ return 0;
+
+ gicc = (struct acpi_srat_gicc_affinity *)header;
+ if (!gicc) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ init = add_memory_initiator(gicc->proximity_domain);
+ if (IS_ERR(init))
+ return PTR_ERR(init);
+
+ init->gicc = gicc;
+ return 0;
+}
+
+static int __init
+srat_parse_memory_affinity(struct acpi_subtable_header *header,
+ const unsigned long end)
+{
+ struct acpi_srat_mem_affinity *ma;
+
+ if (bad_hmem)
+ return 0;
+
+ ma = (struct acpi_srat_mem_affinity *)header;
+ if (!ma) {
+ pr_err("HMEM: NULL table entry\n");
+ bad_hmem = true;
+ return -EINVAL;
+ }
+
+ return add_memory_target(ma);
+}
+
+/*
+ * Remove our sysfs entries, unregister our devices and free allocated memory.
+ */
+static void hmem_cleanup(void)
+{
+ struct memory_initiator *init, *init_iter;
+ struct memory_target *tgt, *tgt_iter;
+
+ list_for_each_entry_safe(tgt, tgt_iter, &target_list, list)
+ remove_memory_target(tgt);
+
+ list_for_each_entry_safe(init, init_iter, &initiator_list, list)
+ remove_memory_initiator(init);
+}
+
+static int __init hmem_init(void)
+{
+ struct acpi_table_header *tbl;
+ struct memory_initiator *init;
+ struct memory_target *tgt;
+ acpi_status status = AE_OK;
+ int ret;
+
+ if (srat_disabled())
+ return 0;
+
+ /*
+ * We take a permanent reference to both the HMAT and SRAT in ACPI
+ * memory so we can keep pointers to their subtables. These tables
+ * already had references on them which would never be released, taken
+ * by acpi_sysfs_init(), so this shouldn't negatively impact anything.
+ */
+ status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ ret = subsys_system_register(&hmem_subsys, NULL);
+ if (ret)
+ return ret;
+
+ if (!acpi_table_parse(ACPI_SIG_SRAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc srat_proc[4];
+
+ memset(srat_proc, 0, sizeof(srat_proc));
+ srat_proc[0].id = ACPI_SRAT_TYPE_CPU_AFFINITY;
+ srat_proc[0].handler = srat_parse_processor_affinity;
+ srat_proc[1].id = ACPI_SRAT_TYPE_X2APIC_CPU_AFFINITY;
+ srat_proc[1].handler = srat_parse_x2apic_affinity;
+ srat_proc[2].id = ACPI_SRAT_TYPE_GICC_AFFINITY;
+ srat_proc[2].handler = srat_parse_gicc_affinity;
+ srat_proc[3].id = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
+ srat_proc[3].handler = srat_parse_memory_affinity;
+
+ acpi_table_parse_entries_array(ACPI_SIG_SRAT,
+ sizeof(struct acpi_table_srat),
+ srat_proc, ARRAY_SIZE(srat_proc), 0);
+ }
+
+ if (!acpi_table_parse(ACPI_SIG_HMAT, hmem_noop_parse)) {
+ struct acpi_subtable_proc hmat_proc[2];
+
+ memset(hmat_proc, 0, sizeof(hmat_proc));
+ hmat_proc[0].id = ACPI_HMAT_TYPE_ADDRESS_RANGE;
+ hmat_proc[0].handler = hmat_parse_address_range;
+ hmat_proc[1].id = ACPI_HMAT_TYPE_CACHE;
+ hmat_proc[1].handler = hmat_parse_cache;
+
+ acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ hmat_proc, ARRAY_SIZE(hmat_proc), 0);
+ }
+
+ if (bad_hmem) {
+ ret = -EINVAL;
+ goto err;
+ }
+
+ list_for_each_entry(init, &initiator_list, list) {
+ ret = register_memory_initiator(init);
+ if (ret)
+ goto err;
+ }
+
+ list_for_each_entry(tgt, &target_list, list) {
+ ret = register_memory_target(tgt);
+ if (ret)
+ goto err;
+ }
+
+ return 0;
+err:
+ pr_err("HMEM: Error during initialization\n");
+ hmem_cleanup();
+ return ret;
+}
+
+static __exit void hmem_exit(void)
+{
+ hmem_cleanup();
+}
+
+module_init(hmem_init);
+module_exit(hmem_exit);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/acpi/hmem/hmem.h b/drivers/acpi/hmem/hmem.h
new file mode 100644
index 0000000..8ea42b6
--- /dev/null
+++ b/drivers/acpi/hmem/hmem.h
@@ -0,0 +1,47 @@
+/*
+ * Heterogeneous memory representation in sysfs
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _ACPI_HMEM_H_
+#define _ACPI_HMEM_H_
+
+struct memory_initiator {
+ struct list_head list;
+ struct device dev;
+
+ /* only one of the following three will be set */
+ struct acpi_srat_cpu_affinity *cpu;
+ struct acpi_srat_x2apic_cpu_affinity *x2apic;
+ struct acpi_srat_gicc_affinity *gicc;
+
+ int pxm;
+ bool is_registered;
+};
+#define to_memory_initiator(dev) container_of(dev, struct memory_initiator, dev)
+
+struct memory_target {
+ struct list_head list;
+ struct device dev;
+ struct acpi_srat_mem_affinity *ma;
+ struct acpi_hmat_address_range *spa;
+ struct memory_initiator *local_init;
+
+ bool is_cached;
+ bool is_registered;
+};
+#define to_memory_target(dev) container_of(dev, struct memory_target, dev)
+
+extern const struct attribute_group *memory_initiator_attribute_groups[];
+extern const struct attribute_group *memory_target_attribute_groups[];
+#endif /* _ACPI_HMEM_H_ */
diff --git a/drivers/acpi/hmem/initiator.c b/drivers/acpi/hmem/initiator.c
new file mode 100644
index 0000000..905f030
--- /dev/null
+++ b/drivers/acpi/hmem/initiator.c
@@ -0,0 +1,61 @@
+/*
+ * Heterogeneous memory initiator sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+
+ return sprintf(buf, "%d\n", init->pxm);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_initiator *init = to_memory_initiator(dev);
+ int is_enabled;
+
+ if (init->cpu)
+ is_enabled = !!(init->cpu->flags & ACPI_SRAT_CPU_ENABLED);
+ else if (init->x2apic)
+ is_enabled = !!(init->x2apic->flags & ACPI_SRAT_CPU_ENABLED);
+ else
+ is_enabled = !!(init->gicc->flags & ACPI_SRAT_GICC_ENABLED);
+
+ return sprintf(buf, "%d\n", is_enabled);
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_initiator_attributes[] = {
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_enabled.attr,
+ NULL,
+};
+
+static struct attribute_group memory_initiator_attribute_group = {
+ .attrs = memory_initiator_attributes,
+};
+
+const struct attribute_group *memory_initiator_attribute_groups[] = {
+ &memory_initiator_attribute_group,
+ NULL,
+};
diff --git a/drivers/acpi/hmem/target.c b/drivers/acpi/hmem/target.c
new file mode 100644
index 0000000..dd57437
--- /dev/null
+++ b/drivers/acpi/hmem/target.c
@@ -0,0 +1,97 @@
+/*
+ * Heterogeneous memory target sysfs attributes
+ *
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/device.h>
+#include <linux/sysfs.h>
+#include "hmem.h"
+
+/* attributes for memory targets */
+static ssize_t phys_addr_base_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->base_address);
+}
+static DEVICE_ATTR_RO(phys_addr_base);
+
+static ssize_t phys_length_bytes_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%#llx\n", tgt->ma->length);
+}
+static DEVICE_ATTR_RO(phys_length_bytes);
+
+static ssize_t firmware_id_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->ma->proximity_domain);
+}
+static DEVICE_ATTR_RO(firmware_id);
+
+static ssize_t is_cached_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n", tgt->is_cached);
+}
+static DEVICE_ATTR_RO(is_cached);
+
+static ssize_t is_isolated_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->spa->flags & ACPI_HMAT_RESERVATION_HINT));
+}
+static DEVICE_ATTR_RO(is_isolated);
+
+static ssize_t is_enabled_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct memory_target *tgt = to_memory_target(dev);
+
+ return sprintf(buf, "%d\n",
+ !!(tgt->ma->flags & ACPI_SRAT_MEM_ENABLED));
+}
+static DEVICE_ATTR_RO(is_enabled);
+
+static struct attribute *memory_target_attributes[] = {
+ &dev_attr_phys_addr_base.attr,
+ &dev_attr_phys_length_bytes.attr,
+ &dev_attr_firmware_id.attr,
+ &dev_attr_is_cached.attr,
+ &dev_attr_is_isolated.attr,
+ &dev_attr_is_enabled.attr,
+ NULL
+};
+
+/* attributes which are present for all memory targets */
+static struct attribute_group memory_target_attribute_group = {
+ .attrs = memory_target_attributes,
+};
+
+const struct attribute_group *memory_target_attribute_groups[] = {
+ &memory_target_attribute_group,
+ NULL,
+};
--
2.9.4

2017-06-05 19:51:18

by Ross Zwisler

Subject: [resend RFC 5/6] sysfs: add sysfs_add_group_link()

The current __compat_only_sysfs_link_entry_to_kobj() code allows us to
create symbolic links in sysfs to groups or attributes. Something like:

/sys/.../entry1/groupA -> /sys/.../entry2/groupA

This patch extends this functionality with a new sysfs_add_group_link()
call that allows the link to have a different name than the group or
attribute, so:

/sys/.../entry1/link_name -> /sys/.../entry2/groupA

__compat_only_sysfs_link_entry_to_kobj() now just calls
sysfs_add_group_link(), passing in the same name for both the
group/attribute and for the link name.

This is needed by the ACPI HMAT enabling work because we want to have a
group of performance attributes that live in a memory target. This group
represents the performance between the (initiator,target) pair, and in the
target the attribute group is named "via_mem_initX" to represent this
pairing:

# tree mem_tgt2/via_mem_init0/
mem_tgt2/via_mem_init0/
├── mem_init0 -> ../../mem_init0
├── mem_tgt2 -> ../../mem_tgt2
├── read_bw_MBps
├── read_lat_nsec
├── write_bw_MBps
└── write_lat_nsec

We then want to link to this attribute group from the initiator, but change
the name to "via_mem_tgtX" since we're now looking at it from the
initiator's perspective:

# ls -l mem_init0/via_mem_tgt2
lrwxrwxrwx. 1 root root 0 Jun 1 10:00 mem_init0/via_mem_tgt2 ->
../mem_tgt2/via_mem_init0

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/sysfs/group.c | 30 +++++++++++++++++++++++-------
include/linux/sysfs.h | 2 ++
2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/sysfs/group.c b/fs/sysfs/group.c
index ac2de0e..19db57c8 100644
--- a/fs/sysfs/group.c
+++ b/fs/sysfs/group.c
@@ -367,15 +367,15 @@ void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
EXPORT_SYMBOL_GPL(sysfs_remove_link_from_group);

/**
- * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
- * to a group or an attribute
+ * sysfs_add_group_link - add a symlink to a kobject pointing to a group or
+ * an attribute
* @kobj: The kobject containing the group.
* @target_kobj: The target kobject.
* @target_name: The name of the target group or attribute.
+ * @link_name: The name of the link to the target group or attribute.
*/
-int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
- struct kobject *target_kobj,
- const char *target_name)
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name)
{
struct kernfs_node *target;
struct kernfs_node *entry;
@@ -400,12 +400,28 @@ int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
return -ENOENT;
}

- link = kernfs_create_link(kobj->sd, target_name, entry);
+ link = kernfs_create_link(kobj->sd, link_name, entry);
if (IS_ERR(link) && PTR_ERR(link) == -EEXIST)
- sysfs_warn_dup(kobj->sd, target_name);
+ sysfs_warn_dup(kobj->sd, link_name);

kernfs_put(entry);
kernfs_put(target);
return IS_ERR(link) ? PTR_ERR(link) : 0;
}
+EXPORT_SYMBOL_GPL(sysfs_add_group_link);
+
+/**
+ * __compat_only_sysfs_link_entry_to_kobj - add a symlink to a kobject pointing
+ * to a group or an attribute
+ * @kobj: The kobject containing the group.
+ * @target_kobj: The target kobject.
+ * @target_name: The name of the target group or attribute.
+ */
+int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
+ struct kobject *target_kobj,
+ const char *target_name)
+{
+ return sysfs_add_group_link(kobj, target_kobj, target_name,
+ target_name);
+}
EXPORT_SYMBOL_GPL(__compat_only_sysfs_link_entry_to_kobj);
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index c6f0f0d..865f499 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -278,6 +278,8 @@ int sysfs_add_link_to_group(struct kobject *kobj, const char *group_name,
struct kobject *target, const char *link_name);
void sysfs_remove_link_from_group(struct kobject *kobj, const char *group_name,
const char *link_name);
+int sysfs_add_group_link(struct kobject *kobj, struct kobject *target_kobj,
+ const char *target_name, const char *link_name);
int __compat_only_sysfs_link_entry_to_kobj(struct kobject *kobj,
struct kobject *target_kobj,
const char *target_name);
--
2.9.4

2017-06-05 19:51:07

by Ross Zwisler

Subject: [resend RFC 1/6] ACPICA: add HMAT table definitions

Import HMAT table definitions from the ACPICA codebase.

This kernel patch was generated using an ACPICA patch from "Zheng, Lv"
<[email protected]>. The actual upstream patch that adds these table
definitions will come from the Intel ACPICA team as part of their greater
ACPI 6.2 update.

Signed-off-by: Ross Zwisler <[email protected]>
---
include/acpi/actbl1.h | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 119 insertions(+)

diff --git a/include/acpi/actbl1.h b/include/acpi/actbl1.h
index b4ce55c..a5df3f3 100644
--- a/include/acpi/actbl1.h
+++ b/include/acpi/actbl1.h
@@ -65,6 +65,7 @@
#define ACPI_SIG_ECDT "ECDT" /* Embedded Controller Boot Resources Table */
#define ACPI_SIG_EINJ "EINJ" /* Error Injection table */
#define ACPI_SIG_ERST "ERST" /* Error Record Serialization Table */
+#define ACPI_SIG_HMAT "HMAT" /* Heterogeneous Memory Attributes Table */
#define ACPI_SIG_HEST "HEST" /* Hardware Error Source Table */
#define ACPI_SIG_MADT "APIC" /* Multiple APIC Description Table */
#define ACPI_SIG_MSCT "MSCT" /* Maximum System Characteristics Table */
@@ -688,6 +689,124 @@ struct acpi_hest_generic_data_v300 {

/*******************************************************************************
*
+ * HMAT - Heterogeneous Memory Attributes Table (ACPI 6.2)
+ * Version 1
+ *
+ ******************************************************************************/
+
+struct acpi_table_hmat {
+ struct acpi_table_header header; /* Common ACPI table header */
+ u32 reserved;
+};
+
+
+/* Values for HMAT structure types */
+
+enum acpi_hmat_type {
+ ACPI_HMAT_TYPE_ADDRESS_RANGE = 0, /* Memory subsystem address range */
+ ACPI_HMAT_TYPE_LOCALITY = 1, /* System locality latency and bandwidth information */
+ ACPI_HMAT_TYPE_CACHE = 2, /* Memory side cache information */
+ ACPI_HMAT_TYPE_RESERVED = 3 /* 3 and greater are reserved */
+};
+
+struct acpi_hmat_structure {
+ u16 type;
+ u16 reserved;
+ u32 length;
+};
+
+/*
+ * HMAT Structures, correspond to Type in struct acpi_hmat_structure
+ */
+
+/* 0: Memory subsystem address range */
+
+struct acpi_hmat_address_range {
+ struct acpi_hmat_structure header;
+ u16 flags;
+ u16 reserved1;
+ u32 processor_PD; /* Processor proximity domain */
+ u32 memory_PD; /* Memory proximity domain */
+ u32 reserved2;
+ u64 physical_address_base; /* Physical address range base */
+ u64 physical_address_length; /* Physical address range length */
+};
+
+/* Masks for Flags field above */
+
+#define ACPI_HMAT_PROCESSOR_PD_VALID (1) /* 1: processor_PD field is valid */
+#define ACPI_HMAT_MEMORY_PD_VALID (1<<1) /* 1: memory_PD field is valid */
+#define ACPI_HMAT_RESERVATION_HINT (1<<2) /* 1: Reservation hint */
+
+/* 1: System locality latency and bandwidth information */
+
+struct acpi_hmat_locality {
+ struct acpi_hmat_structure header;
+ u8 flags;
+ u8 data_type;
+ u16 reserved1;
+ u32 number_of_initiator_Pds;
+ u32 number_of_target_Pds;
+ u32 reserved2;
+ u64 entry_base_unit;
+ u32 data[1]; /* initiator/target lists followed by entry matrix */
+};
+
+/* Masks for Flags field above */
+
+#define ACPI_HMAT_MEMORY_HIERARCHY (0x0F)
+
+/* Values for Memory Hierarchy flag */
+
+#define ACPI_HMAT_MEMORY 0
+#define ACPI_HMAT_LAST_LEVEL_CACHE 1
+#define ACPI_HMAT_1ST_LEVEL_CACHE 2
+#define ACPI_HMAT_2ND_LEVEL_CACHE 3
+#define ACPI_HMAT_3RD_LEVEL_CACHE 4
+
+/* Values for data_type field above */
+
+#define ACPI_HMAT_ACCESS_LATENCY 0
+#define ACPI_HMAT_READ_LATENCY 1
+#define ACPI_HMAT_WRITE_LATENCY 2
+#define ACPI_HMAT_ACCESS_BANDWIDTH 3
+#define ACPI_HMAT_READ_BANDWIDTH 4
+#define ACPI_HMAT_WRITE_BANDWIDTH 5
+
+/* 2: Memory side cache information */
+
+struct acpi_hmat_cache {
+ struct acpi_hmat_structure header;
+ u32 memory_PD;
+ u32 reserved1;
+ u64 cache_size;
+ u32 cache_attributes;
+ u16 reserved2;
+ u16 number_of_SMBIOShandles;
+};
+
+/* Masks for cache_attributes field above */
+
+#define ACPI_HMAT_TOTAL_CACHE_LEVEL (0x0000000F)
+#define ACPI_HMAT_CACHE_LEVEL (0x000000F0)
+#define ACPI_HMAT_CACHE_ASSOCIATIVITY (0x00000F00)
+#define ACPI_HMAT_WRITE_POLICY (0x0000F000)
+#define ACPI_HMAT_CACHE_LINE_SIZE (0xFFFF0000)
+
+/* Values for cache associativity flag */
+
+#define ACPI_HMAT_CA_NONE (0)
+#define ACPI_HMAT_CA_DIRECT_MAPPED (1)
+#define ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING (2)
+
+/* Values for write policy flag */
+
+#define ACPI_HMAT_CP_NONE (0)
+#define ACPI_HMAT_CP_WB (1)
+#define ACPI_HMAT_CP_WT (2)
+
+/*******************************************************************************
+ *
* MADT - Multiple APIC Description Table
* Version 3
*
--
2.9.4

2017-06-05 20:44:15

by Rafael J. Wysocki

Subject: Re: [resend RFC 1/6] ACPICA: add HMAT table definitions

On Mon, Jun 5, 2017 at 9:50 PM, Ross Zwisler
<[email protected]> wrote:
> Import HMAT table definitions from the ACPICA codebase.
>
> This kernel patch was generated using an ACPICA patch from "Zheng, Lv"
> <[email protected]>. The actual upstream patch that adds these table
> definitions will come from the Intel ACPICA team as part of their greater
> ACPI 6.2 update.
>
> Signed-off-by: Ross Zwisler <[email protected]>

Can you please hold on until we have integrated all of the pending
ACPICA changes?

Thanks,
Rafael

2017-06-06 00:30:41

by Ross Zwisler

Subject: Re: [resend RFC 1/6] ACPICA: add HMAT table definitions

On Mon, Jun 05, 2017 at 10:44:11PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 5, 2017 at 9:50 PM, Ross Zwisler
> <[email protected]> wrote:
> > Import HMAT table definitions from the ACPICA codebase.
> >
> > This kernel patch was generated using an ACPICA patch from "Zheng, Lv"
> > <[email protected]>. The actual upstream patch that adds these table
> > definitions will come from the Intel ACPICA team as part of their greater
> > ACPI 6.2 update.
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
>
> Can you please hold on until we have integrated all of the pending
> ACPICA changes?

Sure, this is really just meant to spur discussion on the correct course.
Comments welcome (encouraged!) on the series as a whole and especially the
APIs needed by userspace, but ultimately patch 1 will be dropped and I'll just
build on the ACPICA definitions.