2018-12-11 01:11:44

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 00/12] Heterogeneous memory node attributes

Here is the second version for adding heterogeneous memory attributes to
the existing node sysfs representation.

Background:

Platforms may provide multiple types of cpu attached system memory. The
memory ranges for each type may have different characteristics that
applications may wish to know about when considering what node they want
their memory allocated from.

It had previously been difficult to describe these setups as memory
rangers were generally lumped into the NUMA node of the CPUs. New
platform attributes have been created and in use today that describe
the more complex memory hierarchies that can be created.

This series' objective is to provide the attributes from such systems
that are useful for applications to know about, and readily usable with
existing tools and libraries.

Changes since v1:

Reordered the patches. The ACPI and bare-bones HMAT parsing come
first. The kernel interfaces, documentation and in-kernel users follow.

For correctness, have the new generic ACPI parsing and their callbacks
use the acpi union header type instead of the common header.

Added node masks in addition to the node symlinks for primary memory and
cpu nodes.

Altered the naming conventions to clearly indicate the attributes are
for primary access and capture the relationship for the new access
attributes to their primary nodes.

Added Documentation/ABI.

Used 'struct device' instead of kobject for memory side caches.

Initialize HMAT with subsys_initcall instead of device_init.

Combined the numa performance documentation into a single file and
moved it to admin-guide/mm/.

Changelogs updated with spelling/grammar/editorial fixes, and include
additional examples.

Keith Busch (12):
acpi: Create subtable parsing infrastructure
acpi/hmat: Parse and report heterogeneous memory
node: Link memory nodes to their compute nodes
Documentation/ABI: Add new node sysfs attributes
acpi/hmat: Register processor domain to its memory
node: Add heterogenous memory performance
Documentation/ABI: Add node performance attributes
acpi/hmat: Register performance attributes
node: Add memory caching attributes
Documentation/ABI: Add node cache attributes
acpi/hmat: Register memory side cache attributes
doc/mm: New documentation for memory performance

Documentation/ABI/stable/sysfs-devices-node | 96 ++++++-
Documentation/admin-guide/mm/numaperf.rst | 171 ++++++++++++
arch/x86/kernel/acpi/boot.c | 36 +--
drivers/acpi/Kconfig | 9 +
drivers/acpi/Makefile | 1 +
drivers/acpi/hmat.c | 393 ++++++++++++++++++++++++++++
drivers/acpi/numa.c | 16 +-
drivers/acpi/scan.c | 4 +-
drivers/acpi/tables.c | 76 +++++-
drivers/base/Kconfig | 8 +
drivers/base/node.c | 269 +++++++++++++++++++
drivers/irqchip/irq-gic-v3.c | 2 +-
drivers/mailbox/pcc.c | 2 +-
include/linux/acpi.h | 6 +-
include/linux/node.h | 49 ++++
15 files changed, 1096 insertions(+), 42 deletions(-)
create mode 100644 Documentation/admin-guide/mm/numaperf.rst
create mode 100644 drivers/acpi/hmat.c

--
2.14.4



2018-12-11 01:10:01

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

Parsing entries in an ACPI table had assumed a generic header structure
that is most common. There is no standard ACPI header, though, so less
common types would need custom parsers if they want go through their
sub-table entry list.

Create the infrastructure for adding different table types so parsing
the entries array may be more reused for all ACPI system tables.

Signed-off-by: Keith Busch <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 36 ++++++++++++------------
drivers/acpi/numa.c | 16 +++++------
drivers/acpi/scan.c | 4 +--
drivers/acpi/tables.c | 67 +++++++++++++++++++++++++++++++++++++-------
drivers/irqchip/irq-gic-v3.c | 2 +-
drivers/mailbox/pcc.c | 2 +-
include/linux/acpi.h | 5 +++-
7 files changed, 91 insertions(+), 41 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 06635fbca81c..58561b4df09d 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -197,7 +197,7 @@ static int acpi_register_lapic(int id, u32 acpiid, u8 enabled)
}

static int __init
-acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_x2apic(union acpi_subtable_headers *header, const unsigned long end)
{
struct acpi_madt_local_x2apic *processor = NULL;
#ifdef CONFIG_X86_X2APIC
@@ -210,7 +210,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

#ifdef CONFIG_X86_X2APIC
apic_id = processor->local_apic_id;
@@ -242,7 +242,7 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
}

static int __init
-acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_local_apic *processor = NULL;

@@ -251,7 +251,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

/* Ignore invalid ID */
if (processor->id == 0xff)
@@ -272,7 +272,7 @@ acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
}

static int __init
-acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
+acpi_parse_sapic(union acpi_subtable_headers *header, const unsigned long end)
{
struct acpi_madt_local_sapic *processor = NULL;

@@ -281,7 +281,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
if (BAD_MADT_ENTRY(processor, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

acpi_register_lapic((processor->id << 8) | processor->eid,/* APIC ID */
processor->processor_id, /* ACPI ID */
@@ -291,7 +291,7 @@ acpi_parse_sapic(struct acpi_subtable_header *header, const unsigned long end)
}

static int __init
-acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
+acpi_parse_lapic_addr_ovr(union acpi_subtable_headers * header,
const unsigned long end)
{
struct acpi_madt_local_apic_override *lapic_addr_ovr = NULL;
@@ -301,7 +301,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
if (BAD_MADT_ENTRY(lapic_addr_ovr, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

acpi_lapic_addr = lapic_addr_ovr->address;

@@ -309,7 +309,7 @@ acpi_parse_lapic_addr_ovr(struct acpi_subtable_header * header,
}

static int __init
-acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
+acpi_parse_x2apic_nmi(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_madt_local_x2apic_nmi *x2apic_nmi = NULL;
@@ -319,7 +319,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
if (BAD_MADT_ENTRY(x2apic_nmi, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

if (x2apic_nmi->lint != 1)
printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -328,7 +328,7 @@ acpi_parse_x2apic_nmi(struct acpi_subtable_header *header,
}

static int __init
-acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_lapic_nmi(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_local_apic_nmi *lapic_nmi = NULL;

@@ -337,7 +337,7 @@ acpi_parse_lapic_nmi(struct acpi_subtable_header * header, const unsigned long e
if (BAD_MADT_ENTRY(lapic_nmi, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

if (lapic_nmi->lint != 1)
printk(KERN_WARNING PREFIX "NMI not connected to LINT 1!\n");
@@ -449,7 +449,7 @@ static int __init mp_register_ioapic_irq(u8 bus_irq, u8 polarity,
}

static int __init
-acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_ioapic(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_io_apic *ioapic = NULL;
struct ioapic_domain_cfg cfg = {
@@ -462,7 +462,7 @@ acpi_parse_ioapic(struct acpi_subtable_header * header, const unsigned long end)
if (BAD_MADT_ENTRY(ioapic, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

/* Statically assign IRQ numbers for IOAPICs hosting legacy IRQs */
if (ioapic->global_irq_base < nr_legacy_irqs())
@@ -508,7 +508,7 @@ static void __init acpi_sci_ioapic_setup(u8 bus_irq, u16 polarity, u16 trigger,
}

static int __init
-acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
+acpi_parse_int_src_ovr(union acpi_subtable_headers * header,
const unsigned long end)
{
struct acpi_madt_interrupt_override *intsrc = NULL;
@@ -518,7 +518,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
if (BAD_MADT_ENTRY(intsrc, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

if (intsrc->source_irq == acpi_gbl_FADT.sci_interrupt) {
acpi_sci_ioapic_setup(intsrc->source_irq,
@@ -550,7 +550,7 @@ acpi_parse_int_src_ovr(struct acpi_subtable_header * header,
}

static int __init
-acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end)
+acpi_parse_nmi_src(union acpi_subtable_headers * header, const unsigned long end)
{
struct acpi_madt_nmi_source *nmi_src = NULL;

@@ -559,7 +559,7 @@ acpi_parse_nmi_src(struct acpi_subtable_header * header, const unsigned long end
if (BAD_MADT_ENTRY(nmi_src, end))
return -EINVAL;

- acpi_table_print_madt_entry(header);
+ acpi_table_print_madt_entry(&header->common);

/* TBD: Support nimsrc entries? */

diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 274699463b4f..f5e09c39ff22 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -338,7 +338,7 @@ acpi_numa_x2apic_affinity_init(struct acpi_srat_x2apic_cpu_affinity *pa)
}

static int __init
-acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
+acpi_parse_x2apic_affinity(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_srat_x2apic_cpu_affinity *processor_affinity;
@@ -347,7 +347,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
if (!processor_affinity)
return -EINVAL;

- acpi_table_print_srat_entry(header);
+ acpi_table_print_srat_entry(&header->common);

/* let architecture-dependent part to do it */
acpi_numa_x2apic_affinity_init(processor_affinity);
@@ -356,7 +356,7 @@ acpi_parse_x2apic_affinity(struct acpi_subtable_header *header,
}

static int __init
-acpi_parse_processor_affinity(struct acpi_subtable_header *header,
+acpi_parse_processor_affinity(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_srat_cpu_affinity *processor_affinity;
@@ -365,7 +365,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
if (!processor_affinity)
return -EINVAL;

- acpi_table_print_srat_entry(header);
+ acpi_table_print_srat_entry(&header->common);

/* let architecture-dependent part to do it */
acpi_numa_processor_affinity_init(processor_affinity);
@@ -374,7 +374,7 @@ acpi_parse_processor_affinity(struct acpi_subtable_header *header,
}

static int __init
-acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
+acpi_parse_gicc_affinity(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_srat_gicc_affinity *processor_affinity;
@@ -383,7 +383,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
if (!processor_affinity)
return -EINVAL;

- acpi_table_print_srat_entry(header);
+ acpi_table_print_srat_entry(&header->common);

/* let architecture-dependent part to do it */
acpi_numa_gicc_affinity_init(processor_affinity);
@@ -394,7 +394,7 @@ acpi_parse_gicc_affinity(struct acpi_subtable_header *header,
static int __initdata parsed_numa_memblks;

static int __init
-acpi_parse_memory_affinity(struct acpi_subtable_header * header,
+acpi_parse_memory_affinity(union acpi_subtable_headers * header,
const unsigned long end)
{
struct acpi_srat_mem_affinity *memory_affinity;
@@ -403,7 +403,7 @@ acpi_parse_memory_affinity(struct acpi_subtable_header * header,
if (!memory_affinity)
return -EINVAL;

- acpi_table_print_srat_entry(header);
+ acpi_table_print_srat_entry(&header->common);

/* let architecture-dependent part to do it */
if (!acpi_numa_memory_affinity_init(memory_affinity))
diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
index bd1c59fb0e17..d98d5da6a279 100644
--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -2234,10 +2234,10 @@ static struct acpi_probe_entry *ape;
static int acpi_probe_count;
static DEFINE_MUTEX(acpi_probe_mutex);

-static int __init acpi_match_madt(struct acpi_subtable_header *header,
+static int __init acpi_match_madt(union acpi_subtable_headers *header,
const unsigned long end)
{
- if (!ape->subtable_valid || ape->subtable_valid(header, ape))
+ if (!ape->subtable_valid || ape->subtable_valid(&header->common, ape))
if (!ape->probe_subtbl(header, end))
acpi_probe_count++;

diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 61203eebf3a1..e9643b4267c7 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -49,6 +49,15 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata;

static int acpi_apic_instance __initdata;

+enum acpi_subtable_type {
+ ACPI_SUBTABLE_COMMON,
+};
+
+struct acpi_subtable_entry {
+ union acpi_subtable_headers *hdr;
+ enum acpi_subtable_type type;
+};
+
/*
* Disable table checksum verification for the early stage due to the size
* limitation of the current x86 early mapping implementation.
@@ -217,6 +226,42 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header)
}
}

+static unsigned long __init
+acpi_get_entry_type(struct acpi_subtable_entry *entry)
+{
+ switch (entry->type) {
+ case ACPI_SUBTABLE_COMMON:
+ return entry->hdr->common.type;
+ }
+ return 0;
+}
+
+static unsigned long __init
+acpi_get_entry_length(struct acpi_subtable_entry *entry)
+{
+ switch (entry->type) {
+ case ACPI_SUBTABLE_COMMON:
+ return entry->hdr->common.length;
+ }
+ return 0;
+}
+
+static unsigned long __init
+acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
+{
+ switch (entry->type) {
+ case ACPI_SUBTABLE_COMMON:
+ return sizeof(entry->hdr->common);
+ }
+ return 0;
+}
+
+static enum acpi_subtable_type __init
+acpi_get_subtable_type(char *id)
+{
+ return ACPI_SUBTABLE_COMMON;
+}
+
/**
* acpi_parse_entries_array - for each proc_num find a suitable subtable
*
@@ -246,8 +291,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
struct acpi_subtable_proc *proc, int proc_num,
unsigned int max_entries)
{
- struct acpi_subtable_header *entry;
- unsigned long table_end;
+ struct acpi_subtable_entry entry;
+ unsigned long table_end, subtable_len, entry_len;
int count = 0;
int errs = 0;
int i;
@@ -270,19 +315,20 @@ acpi_parse_entries_array(char *id, unsigned long table_size,

/* Parse all entries looking for a match. */

- entry = (struct acpi_subtable_header *)
+ entry.type = acpi_get_subtable_type(id);
+ entry.hdr = (union acpi_subtable_headers *)
((unsigned long)table_header + table_size);
+ subtable_len = acpi_get_subtable_header_length(&entry);

- while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
- table_end) {
+ while (((unsigned long)entry.hdr) + subtable_len < table_end) {
if (max_entries && count >= max_entries)
break;

for (i = 0; i < proc_num; i++) {
- if (entry->type != proc[i].id)
+ if (acpi_get_entry_type(&entry) != proc[i].id)
continue;
if (!proc[i].handler ||
- (!errs && proc[i].handler(entry, table_end))) {
+ (!errs && proc[i].handler(entry.hdr, table_end))) {
errs++;
continue;
}
@@ -297,13 +343,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size,
* If entry->length is 0, break from this loop to avoid
* infinite loop.
*/
- if (entry->length == 0) {
+ entry_len = acpi_get_entry_length(&entry);
+ if (entry_len == 0) {
pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id);
return -EINVAL;
}

- entry = (struct acpi_subtable_header *)
- ((unsigned long)entry + entry->length);
+ entry.hdr = (union acpi_subtable_headers *)
+ ((unsigned long)entry.hdr + entry_len);
}

if (max_entries && count > max_entries) {
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 8f87f40c9460..ef3f72196ad6 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1383,7 +1383,7 @@ gic_acpi_parse_madt_redist(struct acpi_subtable_header *header,
}

static int __init
-gic_acpi_parse_madt_gicc(struct acpi_subtable_header *header,
+gic_acpi_parse_madt_gicc(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_madt_generic_interrupt *gicc =
diff --git a/drivers/mailbox/pcc.c b/drivers/mailbox/pcc.c
index 256f18b67e8a..08a0a3517138 100644
--- a/drivers/mailbox/pcc.c
+++ b/drivers/mailbox/pcc.c
@@ -382,7 +382,7 @@ static const struct mbox_chan_ops pcc_chan_ops = {
*
* This gets called for each entry in the PCC table.
*/
-static int parse_pcc_subspace(struct acpi_subtable_header *header,
+static int parse_pcc_subspace(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_pcct_subspace *ss = (struct acpi_pcct_subspace *) header;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index ed80f147bd50..18805a967c70 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -141,10 +141,13 @@ enum acpi_address_range_id {


/* Table Handlers */
+union acpi_subtable_headers {
+ struct acpi_subtable_header common;
+};

typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);

-typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
+typedef int (*acpi_tbl_entry_handler)(union acpi_subtable_headers *header,
const unsigned long end);

/* Debugger support */
--
2.14.4


2018-12-11 01:10:40

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 05/12] acpi/hmat: Register processor domain to its memory

If the HMAT Subsystem Address Range provides a valid processor proximity
domain for a memory domain, or a processor domain with the highest
performing access, register the node as one of the initiator's primary
memory targets so this relationship will be visible under the node's
sysfs directory.

Since HMAT requires valid address ranges have an equivalent SRAT entry,
verify each memory target satisfies this requirement.

Signed-off-by: Keith Busch <[email protected]>
---
drivers/acpi/hmat.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 142 insertions(+), 7 deletions(-)

diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
index ef3881f0f370..5d8747ad025f 100644
--- a/drivers/acpi/hmat.c
+++ b/drivers/acpi/hmat.c
@@ -17,6 +17,43 @@
#include <linux/slab.h>
#include <linux/sysfs.h>

+static LIST_HEAD(targets);
+
+struct memory_target {
+ struct list_head node;
+ unsigned int memory_pxm;
+ unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
+};
+
+static __init struct memory_target *find_mem_target(unsigned int m)
+{
+ struct memory_target *t;
+
+ list_for_each_entry(t, &targets, node)
+ if (t->memory_pxm == m)
+ return t;
+ return NULL;
+}
+
+static __init void alloc_memory_target(unsigned int mem_pxm)
+{
+ struct memory_target *t;
+
+ if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
+ return;
+
+ t = find_mem_target(mem_pxm);
+ if (t)
+ return;
+
+ t = kzalloc(sizeof(*t), GFP_KERNEL);
+ if (!t)
+ return;
+
+ t->memory_pxm = mem_pxm;
+ list_add_tail(&t->node, &targets);
+}
+
static __init const char *hmat_data_type(u8 type)
{
switch (type) {
@@ -53,11 +90,30 @@ static __init const char *hmat_data_type_suffix(u8 type)
};
}

+static __init void hmat_update_access(u8 type, u32 value, u32 *best)
+{
+ switch (type) {
+ case ACPI_HMAT_ACCESS_LATENCY:
+ case ACPI_HMAT_READ_LATENCY:
+ case ACPI_HMAT_WRITE_LATENCY:
+ if (!*best || *best > value)
+ *best = value;
+ break;
+ case ACPI_HMAT_ACCESS_BANDWIDTH:
+ case ACPI_HMAT_READ_BANDWIDTH:
+ case ACPI_HMAT_WRITE_BANDWIDTH:
+ if (!*best || *best < value)
+ *best = value;
+ break;
+ }
+}
+
static __init int hmat_parse_locality(union acpi_subtable_headers *header,
const unsigned long end)
{
+ struct memory_target *t;
struct acpi_hmat_locality *loc = (void *)header;
- unsigned int init, targ, total_size, ipds, tpds;
+ unsigned int init, targ, pass, p_node, total_size, ipds, tpds;
u32 *inits, *targs, value;
u16 *entries;
u8 type;
@@ -87,12 +143,28 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
targs = &inits[ipds];
entries = (u16 *)(&targs[tpds]);
for (targ = 0; targ < tpds; targ++) {
- for (init = 0; init < ipds; init++) {
- value = entries[init * tpds + targ];
- value = (value * loc->entry_base_unit) / 10;
- pr_info(" Initiator-Target[%d-%d]:%d%s\n",
- inits[init], targs[targ], value,
- hmat_data_type_suffix(type));
+ u32 best = 0;
+
+ t = find_mem_target(targs[targ]);
+ for (pass = 0; pass < 2; pass++) {
+ for (init = 0; init < ipds; init++) {
+ value = entries[init * tpds + targ];
+ value = (value * loc->entry_base_unit) / 10;
+
+ if (!pass) {
+ hmat_update_access(type, value, &best);
+ pr_info(" Initiator-Target[%d-%d]:%d%s\n",
+ inits[init], targs[targ], value,
+ hmat_data_type_suffix(type));
+ continue;
+ }
+
+ if (!t)
+ continue;
+ p_node = pxm_to_node(inits[init]);
+ if (p_node != NUMA_NO_NODE && value == best)
+ set_bit(p_node, t->p_nodes);
+ }
}
}
return 0;
@@ -122,6 +194,7 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_hmat_address_range *spa = (void *)header;
+ struct memory_target *t = NULL;

if (spa->header.length != sizeof(*spa)) {
pr_err("HMAT: Unexpected address range header length: %d\n",
@@ -131,6 +204,23 @@ static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
spa->physical_address_base, spa->physical_address_length,
spa->flags, spa->processor_PD, spa->memory_PD);
+
+ if (spa->flags & ACPI_HMAT_MEMORY_PD_VALID) {
+ t = find_mem_target(spa->memory_PD);
+ if (!t) {
+ pr_warn("HMAT: Memory Domain missing from SRAT\n");
+ return -EINVAL;
+ }
+ }
+ if (t && spa->flags & ACPI_HMAT_PROCESSOR_PD_VALID) {
+ int p_node = pxm_to_node(spa->processor_PD);
+
+ if (p_node == NUMA_NO_NODE) {
+ pr_warn("HMAT: Invalid Processor Domain\n");
+ return -EINVAL;
+ }
+ set_bit(p_node, t->p_nodes);
+ }
return 0;
}

@@ -154,6 +244,33 @@ static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
}
}

+static __init int srat_parse_mem_affinity(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_srat_mem_affinity *ma = (void *)header;
+
+ if (!ma)
+ return -EINVAL;
+ if (!(ma->flags & ACPI_SRAT_MEM_ENABLED))
+ return 0;
+ alloc_memory_target(ma->proximity_domain);
+ return 0;
+}
+
+static __init void hmat_register_targets(void)
+{
+ struct memory_target *t, *next;
+ unsigned m, p;
+
+ list_for_each_entry_safe(t, next, &targets, node) {
+ list_del(&t->node);
+ m = pxm_to_node(t->memory_pxm);
+ for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
+ register_memory_node_under_compute_node(m, p);
+ kfree(t);
+ }
+}
+
static __init int parse_noop(struct acpi_table_header *table)
{
return 0;
@@ -169,6 +286,23 @@ static __init int hmat_init(void)
if (srat_disabled())
return 0;

+ status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ if (acpi_table_parse(ACPI_SIG_SRAT, parse_noop))
+ goto out_put;
+
+ memset(&subtable_proc, 0, sizeof(subtable_proc));
+ subtable_proc.id = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
+ subtable_proc.handler = srat_parse_mem_affinity;
+
+ if (acpi_table_parse_entries_array(ACPI_SIG_SRAT,
+ sizeof(struct acpi_table_srat),
+ &subtable_proc, 1, 0) < 0)
+ goto out_put;
+ acpi_put_table(tbl);
+
status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
if (ACPI_FAILURE(status))
return 0;
@@ -185,6 +319,7 @@ static __init int hmat_init(void)
&subtable_proc, 1, 0) < 0)
goto out_put;
}
+ hmat_register_targets();
out_put:
acpi_put_table(tbl);
return 0;
--
2.14.4


2018-12-11 01:10:41

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 11/12] acpi/hmat: Register memory side cache attributes

Register memory side cache attributes with the memory's node if HMAT
provides the side cache information table.

Signed-off-by: Keith Busch <[email protected]>
---
drivers/acpi/hmat.c | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
index 40bc83f4b593..48d53ceb4778 100644
--- a/drivers/acpi/hmat.c
+++ b/drivers/acpi/hmat.c
@@ -206,6 +206,7 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
const unsigned long end)
{
struct acpi_hmat_cache *cache = (void *)header;
+ struct node_cache_attrs cache_attrs;
u32 attrs;

if (cache->header.length < sizeof(*cache)) {
@@ -219,6 +220,37 @@ static __init int hmat_parse_cache(union acpi_subtable_headers *header,
cache->memory_PD, cache->cache_size, attrs,
cache->number_of_SMBIOShandles);

+ cache_attrs.size = cache->cache_size;
+ cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4;
+ cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16;
+
+ switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) {
+ case ACPI_HMAT_CA_DIRECT_MAPPED:
+ cache_attrs.associativity = NODE_CACHE_DIRECT_MAP;
+ break;
+ case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING:
+ cache_attrs.associativity = NODE_CACHE_INDEXED;
+ break;
+ case ACPI_HMAT_CA_NONE:
+ default:
+ cache_attrs.associativity = NODE_CACHE_OTHER;
+ break;
+ }
+
+ switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) {
+ case ACPI_HMAT_CP_WB:
+ cache_attrs.write_policy = NODE_CACHE_WRITE_BACK;
+ break;
+ case ACPI_HMAT_CP_WT:
+ cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH;
+ break;
+ case ACPI_HMAT_CP_NONE:
+ default:
+ cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER;
+ break;
+ }
+
+ node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs);
return 0;
}

--
2.14.4


2018-12-11 01:10:46

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 08/12] acpi/hmat: Register performance attributes

Save the best performance access attributes and register these with the
memory's node if HMAT provides the locality table.

Signed-off-by: Keith Busch <[email protected]>
---
drivers/acpi/Kconfig | 1 +
drivers/acpi/hmat.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 35 insertions(+)

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 9a05af3a18cf..6b5f6ca690af 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -330,6 +330,7 @@ config ACPI_NUMA
config ACPI_HMAT
bool "ACPI Heterogeneous Memory Attribute Table Support"
depends on ACPI_NUMA
+ select HMEM_REPORTING
help
Parses representation of the ACPI Heterogeneous Memory Attributes
Table (HMAT) and set the memory node relationships and access
diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
index 5d8747ad025f..40bc83f4b593 100644
--- a/drivers/acpi/hmat.c
+++ b/drivers/acpi/hmat.c
@@ -23,6 +23,8 @@ struct memory_target {
struct list_head node;
unsigned int memory_pxm;
unsigned long p_nodes[BITS_TO_LONGS(MAX_NUMNODES)];
+ bool hmem_valid;
+ struct node_hmem_attrs hmem;
};

static __init struct memory_target *find_mem_target(unsigned int m)
@@ -108,6 +110,34 @@ static __init void hmat_update_access(u8 type, u32 value, u32 *best)
}
}

+static __init void hmat_update_target(struct memory_target *t, u8 type,
+ u32 value)
+{
+ switch (type) {
+ case ACPI_HMAT_ACCESS_LATENCY:
+ t->hmem.read_latency = value;
+ t->hmem.write_latency = value;
+ break;
+ case ACPI_HMAT_READ_LATENCY:
+ t->hmem.read_latency = value;
+ break;
+ case ACPI_HMAT_WRITE_LATENCY:
+ t->hmem.write_latency = value;
+ break;
+ case ACPI_HMAT_ACCESS_BANDWIDTH:
+ t->hmem.read_bandwidth = value;
+ t->hmem.write_bandwidth = value;
+ break;
+ case ACPI_HMAT_READ_BANDWIDTH:
+ t->hmem.read_bandwidth = value;
+ break;
+ case ACPI_HMAT_WRITE_BANDWIDTH:
+ t->hmem.write_bandwidth = value;
+ break;
+ }
+ t->hmem_valid = true;
+}
+
static __init int hmat_parse_locality(union acpi_subtable_headers *header,
const unsigned long end)
{
@@ -166,6 +196,8 @@ static __init int hmat_parse_locality(union acpi_subtable_headers *header,
set_bit(p_node, t->p_nodes);
}
}
+ if (t && best)
+ hmat_update_target(t, type, best);
}
return 0;
}
@@ -267,6 +299,8 @@ static __init void hmat_register_targets(void)
m = pxm_to_node(t->memory_pxm);
for_each_set_bit(p, t->p_nodes, MAX_NUMNODES)
register_memory_node_under_compute_node(m, p);
+ if (t->hmem_valid)
+ node_set_perf_attrs(m, &t->hmem);
kfree(t);
}
}
--
2.14.4


2018-12-11 01:10:57

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 09/12] node: Add memory caching attributes

System memory may have side caches to help improve access speed. While
the system provided cache is transparent to the software accessing
these memory ranges, applications can optimize their own access based
on cache attributes.

Provide a new API for the kernel to register these memory side caches
under the memory node that provides it.

The kernel's sysfs representation is modeled from the cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/cpuX/side_cache/.
Unlike CPU cacheinfo, though, the node cache level is reported from
the view of the memory. A higher number is nearer to the CPU, while
lower levels are closer to the backing memory. Also unlike CPU cache,
it is assumed the system will handle flushing any dirty cached memory to
the last level the memory on a power failure if the range is persistent
memory.

The attributes we export are the cache size, the line size, associativity,
and write back policy.

Signed-off-by: Keith Busch <[email protected]>
---
drivers/base/node.c | 140 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/node.h | 23 +++++++++
2 files changed, 163 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 768612c06c56..54184424ca7f 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -17,6 +17,7 @@
#include <linux/nodemask.h>
#include <linux/cpu.h>
#include <linux/device.h>
+#include <linux/pm_runtime.h>
#include <linux/swap.h>
#include <linux/slab.h>

@@ -141,6 +142,143 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
pr_info("failed to add performance attribute group to node %d\n",
nid);
}
+
+struct node_cache_info {
+ struct device dev;
+ struct list_head node;
+ struct node_cache_attrs cache_attrs;
+};
+#define to_cache_info(device) container_of(device, struct node_cache_info, dev)
+
+#define CACHE_ATTR(name, fmt) \
+static ssize_t name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name); \
+} \
+DEVICE_ATTR_RO(name);
+
+CACHE_ATTR(size, "%lld")
+CACHE_ATTR(level, "%d")
+CACHE_ATTR(line_size, "%d")
+CACHE_ATTR(associativity, "%d")
+CACHE_ATTR(write_policy, "%d")
+
+static struct attribute *cache_attrs[] = {
+ &dev_attr_level.attr,
+ &dev_attr_associativity.attr,
+ &dev_attr_size.attr,
+ &dev_attr_line_size.attr,
+ &dev_attr_write_policy.attr,
+ NULL,
+};
+
+const struct attribute_group node_cache_attrs_group = {
+ .attrs = cache_attrs,
+};
+
+const struct attribute_group *node_cache_attrs_groups[] = {
+ &node_cache_attrs_group,
+ NULL,
+};
+
+static void node_release(struct device *dev)
+{
+ kfree(dev);
+}
+
+static void node_cache_release(struct device *dev)
+{
+ struct node_cache_info *info = to_cache_info(dev);
+ kfree(info);
+}
+
+static void node_init_cache_dev(struct node *node)
+{
+ struct device *dev;
+
+ dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+ if (!dev)
+ return;
+
+ dev->parent = &node->dev;
+ dev->release = node_release;
+ dev_set_name(dev, "side_cache");
+
+ if (device_register(dev)) {
+ kfree(dev);
+ return;
+ }
+ pm_runtime_no_callbacks(dev);
+ node->cache_dev = dev;
+}
+
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs)
+{
+ struct node_cache_info *info;
+ struct device *dev;
+ struct node *node;
+
+ if (!node_online(nid) || !node_devices[nid])
+ return;
+
+ node = node_devices[nid];
+ list_for_each_entry(info, &node->cache_attrs, node) {
+ if (info->cache_attrs.level == cache_attrs->level) {
+ dev_warn(&node->dev,
+ "attempt to add duplicate cache level:%d\n",
+ cache_attrs->level);
+ return;
+ }
+ }
+
+ if (!node->cache_dev)
+ node_init_cache_dev(node);
+ if (!node->cache_dev)
+ return;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return;
+
+ dev = &info->dev;
+ dev->parent = node->cache_dev;
+ dev->release = node_cache_release;
+ dev->groups = node_cache_attrs_groups;
+ dev_set_name(dev, "index%d", cache_attrs->level);
+ info->cache_attrs = *cache_attrs;
+ if (device_register(dev)) {
+ dev_warn(&node->dev, "failed to add cache level:%d\n",
+ cache_attrs->level);
+ kfree(info);
+ return;
+ }
+ pm_runtime_no_callbacks(dev);
+ list_add_tail(&info->node, &node->cache_attrs);
+}
+
+static void node_remove_caches(struct node *node)
+{
+ struct node_cache_info *info, *next;
+
+ if (!node->cache_dev)
+ return;
+
+ list_for_each_entry_safe(info, next, &node->cache_attrs, node) {
+ list_del(&info->node);
+ device_unregister(&info->dev);
+ }
+ device_unregister(node->cache_dev);
+}
+
+static void node_init_caches(unsigned int nid)
+{
+ INIT_LIST_HEAD(&node_devices[nid]->cache_attrs);
+}
+#else
+static void node_init_caches(unsigned int nid) { }
+static void node_remove_caches(struct node *node) { }
#endif

#define K(x) ((x) << (PAGE_SHIFT - 10))
@@ -389,6 +527,7 @@ static void node_device_release(struct device *dev)
*/
flush_work(&node->node_work);
#endif
+ node_remove_caches(node);
kfree(node);
}

@@ -711,6 +850,7 @@ int __register_one_node(int nid)

/* initialize work queue for memory hot plug */
init_node_hugetlb_work(nid);
+ node_init_caches(nid);

return error;
}
diff --git a/include/linux/node.h b/include/linux/node.h
index 71abaf0d4f4b..897e04e99e80 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -36,6 +36,27 @@ struct node_hmem_attrs {
unsigned int write_latency;
};
void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+
+enum cache_associativity {
+ NODE_CACHE_DIRECT_MAP,
+ NODE_CACHE_INDEXED,
+ NODE_CACHE_OTHER,
+};
+
+enum cache_write_policy {
+ NODE_CACHE_WRITE_BACK,
+ NODE_CACHE_WRITE_THROUGH,
+ NODE_CACHE_WRITE_OTHER,
+};
+
+struct node_cache_attrs {
+ enum cache_associativity associativity;
+ enum cache_write_policy write_policy;
+ u64 size;
+ u16 line_size;
+ u8 level;
+};
+void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs);
#endif

struct node {
@@ -48,6 +69,8 @@ struct node {
#endif
#ifdef CONFIG_HMEM_REPORTING
struct node_hmem_attrs hmem_attrs;
+ struct list_head cache_attrs;
+ struct device *cache_dev;
#endif
};

--
2.14.4


2018-12-11 01:11:06

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 06/12] node: Add heterogenous memory performance

Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Create an interface for the kernel
to register the attributes for the primary memory initiators under the
memory node. If the system provides this information, applications can
then query the node attributes when deciding which node to request memory.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

# tree /sys/devices/system/node/nodeY/primary_initiator_access
/sys/devices/system/node/nodeY/primary_initiator_access
|-- read_bandwidth
|-- read_latency
|-- write_bandwidth
`-- write_latency

The bandwidth is exported as MB/s and latency is reported in nanoseconds.
Memory accesses from an initiator node that is not one of the memory's
primary compute nodes may encounter a performance penalty that does not
match the performance reported for primary memory initiators.

As an example of what you may be able to do with this, let's say we have
a PCIe storage device, /dev/nvme0n1, attached to a particular node, and
we want to run IO to it using the fastest memory with primary access
from the same node as that PCIe device. The following shell script is
such an example to achieve that goal:

#!/bin/bash
DEV_NODE=/sys/devices/system/node/node$(cat /sys/block/nvme0n1/device/device/numa_node)
BEST_WRITE_BW=0
BEST_MEM_NODE=0
for i in $(ls -d ${DEV_NODE}/primary_target*); do
tmp=$(cat ${i}/primary_initiator_access/write_bandwidth);
if ((${tmp} > ${BEST_WRITE_BW})); then
BEST_WRITE_BW=${tmp}
BEST_MEM_NODE=$(echo ${i} | sed s/^.*primary_target//g)
fi
done

numactl --membind=${BEST_MEM_NODE} \
--cpunodebind=$(cat ${DEV_NODE}/primary_cpu_nodelist) \
-- fio --filename=/dev/nvme0n1 --bs=4k --name=access-test

Signed-off-by: Keith Busch <[email protected]>
---
drivers/base/Kconfig | 8 ++++++++
drivers/base/node.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/node.h | 22 ++++++++++++++++++++++
3 files changed, 74 insertions(+)

diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
index 3e63a900b330..6014980238e8 100644
--- a/drivers/base/Kconfig
+++ b/drivers/base/Kconfig
@@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE
unusable. You should say N here unless you are explicitly looking to
test this functionality.

+config HMEM_REPORTING
+ bool
+ default y
+ depends on NUMA
+ help
+ Enable reporting for heterogenous memory access attributes under
+ their non-uniform memory nodes.
+
source "drivers/base/test/Kconfig"

config SYS_HYPERVISOR
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 50412ce3fd7d..768612c06c56 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -99,6 +99,50 @@ static DEVICE_ATTR_RO(primary_cpu_nodelist);
static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL);
static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);

+#ifdef CONFIG_HMEM_REPORTING
+const struct attribute_group node_access_attrs_group;
+
+#define ACCESS_ATTR(name) \
+static ssize_t name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%d\n", to_node(dev)->hmem_attrs.name); \
+} \
+static DEVICE_ATTR_RO(name);
+
+ACCESS_ATTR(read_bandwidth)
+ACCESS_ATTR(read_latency)
+ACCESS_ATTR(write_bandwidth)
+ACCESS_ATTR(write_latency)
+
+static struct attribute *access_attrs[] = {
+ &dev_attr_read_bandwidth.attr,
+ &dev_attr_read_latency.attr,
+ &dev_attr_write_bandwidth.attr,
+ &dev_attr_write_latency.attr,
+ NULL,
+};
+
+const struct attribute_group node_access_attrs_group = {
+ .name = "primary_initiator_access",
+ .attrs = access_attrs,
+};
+
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs)
+{
+ struct node *node;
+
+ if (WARN_ON_ONCE(!node_online(nid)))
+ return;
+ node = node_devices[nid];
+ node->hmem_attrs = *hmem_attrs;
+ if (sysfs_create_group(&node->dev.kobj, &node_access_attrs_group))
+ pr_info("failed to add performance attribute group to node %d\n",
+ nid);
+}
+#endif
+
#define K(x) ((x) << (PAGE_SHIFT - 10))
static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
diff --git a/include/linux/node.h b/include/linux/node.h
index 3d06de045cbf..71abaf0d4f4b 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -17,8 +17,27 @@

#include <linux/device.h>
#include <linux/cpumask.h>
+#include <linux/list.h>
#include <linux/workqueue.h>

+#ifdef CONFIG_HMEM_REPORTING
+/**
+ * struct node_hmem_attrs - heterogeneous memory performance attributes
+ *
+ * read_bandwidth: Read bandwidth in MB/s
+ * write_bandwidth: Write bandwidth in MB/s
+ * read_latency: Read latency in nanoseconds
+ * write_latency: Write latency in nanoseconds
+ */
+struct node_hmem_attrs {
+ unsigned int read_bandwidth;
+ unsigned int write_bandwidth;
+ unsigned int read_latency;
+ unsigned int write_latency;
+};
+void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs);
+#endif
+
struct node {
struct device dev;
nodemask_t primary_mem_nodes;
@@ -27,6 +46,9 @@ struct node {
#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
struct work_struct node_work;
#endif
+#ifdef CONFIG_HMEM_REPORTING
+ struct node_hmem_attrs hmem_attrs;
+#endif
};

struct memory_block;
--
2.14.4


2018-12-11 01:11:11

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 07/12] Documentation/ABI: Add node performance attributes

Add descriptions for primary memory initiator performance access
attributes.

Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 8430d5b261f6..6cdb0643f9fd 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -123,3 +123,31 @@ Description:
The map of memory nodes that this node has primary access.
Memory accesses from this node to nodes not in this map may
encounter a performance penalty.
+
+What: /sys/devices/system/node/nodeX/primary_initiator_access/read_bandwidth
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ Read bandwidth in MB/s available to memory initiators found in
+ primary_cpu_nodemask.
+
+What: /sys/devices/system/node/nodeX/primary_initiator_access/read_latency
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ Read latency in nanosecondss available to memory initiators
+ found in primary_cpu_nodemask.
+
+What: /sys/devices/system/node/nodeX/primary_initiator_access/write_bandwidth
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ Write bandwidth in MB/s available to memory initiators found in
+ primary_cpu_nodemask.
+
+What: /sys/devices/system/node/nodeX/primary_initiator_access/write_latency
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ Write latency in nanosecondss available to memory initiators
+ found in primary_cpu_nodemask.
--
2.14.4


2018-12-11 01:11:24

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 10/12] Documentation/ABI: Add node cache attributes

Add the attributes for the system memory side caches.

Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 34 +++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 6cdb0643f9fd..63a757b8a29e 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -151,3 +151,37 @@ Contact: Keith Busch <[email protected]>
Description:
Write latency in nanosecondss available to memory initiators
found in primary_cpu_nodemask.
+
+What: /sys/devices/system/node/nodeX/side_cache/indexY/associativity
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The caches associativity: 0 for direct mapped, non-zero if
+ indexed.
+
+What: /sys/devices/system/node/nodeX/side_cache/indexY/level
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ This cache's level in the memory hierarchy. Matches 'Y' in the
+ directory name.
+
+What: /sys/devices/system/node/nodeX/side_cache/indexY/line_size
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The number of bytes accessed from the next cache level on a
+ cache miss.
+
+What: /sys/devices/system/node/nodeX/side_cache/indexY/size
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The size of this memory side cache in bytes.
+
+What: /sys/devices/system/node/nodeX/side_cache/indexY/write_policy
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The cache write policy: 0 for write-back, 1 for write-through,
+ 2 for other or unknown.
--
2.14.4


2018-12-11 01:11:27

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 04/12] Documentation/ABI: Add new node sysfs attributes

Add the entries for primary cpu and memory node attributes.

Signed-off-by: Keith Busch <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 34 ++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 3e90e1f3bf0a..8430d5b261f6 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -90,4 +90,36 @@ Date: December 2009
Contact: Lee Schermerhorn <[email protected]>
Description:
The node's huge page size control/query attributes.
- See Documentation/admin-guide/mm/hugetlbpage.rst
\ No newline at end of file
+ See Documentation/admin-guide/mm/hugetlbpage.rst
+
+What: /sys/devices/system/node/nodeX/primary_cpu_nodelist
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The node list of CPUs that have primary access to this node's
+ memory. CPUs not in the list accessing this node's memory may
+ encounter a performance penalty.
+
+What: /sys/devices/system/node/nodeX/primary_cpu_nodemask
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The node map for CPUs that have primary access to this node's
+ memory. CPUs not in the list accessing this node's memory may
+ encounter a performance penalty.
+
+What: /sys/devices/system/node/nodeX/primary_mem_nodelist
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The list of memory nodes that this node has primary access.
+ Memory accesses from this node to nodes not in this list may
+ encounter a performance penalty.
+
+What: /sys/devices/system/node/nodeX/primary_mem_nodemask
+Date: December 2018
+Contact: Keith Busch <[email protected]>
+Description:
+ The map of memory nodes that this node has primary access.
+ Memory accesses from this node to nodes not in this map may
+ encounter a performance penalty.
--
2.14.4


2018-12-11 01:11:43

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 12/12] doc/mm: New documentation for memory performance

Platforms may provide system memory where some physical address ranges
perform differently than others, or is side cached by the system.

Add documentation describing a high level overview of such systems and the
performance and caching attributes the kernel provides for applications
wishing to query this information.

Signed-off-by: Keith Busch <[email protected]>
---
Documentation/admin-guide/mm/numaperf.rst | 171 ++++++++++++++++++++++++++++++
1 file changed, 171 insertions(+)
create mode 100644 Documentation/admin-guide/mm/numaperf.rst

diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
new file mode 100644
index 000000000000..846b3f991e7f
--- /dev/null
+++ b/Documentation/admin-guide/mm/numaperf.rst
@@ -0,0 +1,171 @@
+.. _numaperf:
+
+=============
+NUMA Locality
+=============
+
+Some platforms may have multiple types of memory attached to a single
+CPU. These disparate memory ranges share some characteristics, such as
+CPU cache coherence, but may have different performance. For example,
+different media types and buses affect bandwidth and latency.
+
+A system supporting such heterogeneous memory by grouping each memory
+type under different "nodes" based on similar CPU locality and performance
+characteristics. Some memory may share the same node as a CPU, and others
+are provided as memory only nodes. While memory only nodes do not provide
+CPUs, they may still be directly accessible, or local, to one or more
+compute nodes. The following diagram shows one such example of two compute
+noes with local memory and a memory only node for each of compute node:
+
+ +------------------+ +------------------+
+ | Compute Node 0 +-----+ Compute Node 1 |
+ | Local Node0 Mem | | Local Node1 Mem |
+ +--------+---------+ +--------+---------+
+ | |
+ +--------+---------+ +--------+---------+
+ | Slower Node2 Mem | | Slower Node3 Mem |
+ +------------------+ +--------+---------+
+
+A "memory initiator" is a node containing one or more devices such as
+CPUs or separate memory I/O devices that can initiate memory requests. A
+"memory target" is a node containing one or more accessible physical
+address ranges from one or more memory initiators.
+
+When multiple memory initiators exist, they may not all have the same
+performance when accessing a given memory target. The highest performing
+initiator to a given target is considered to be one of that target's
+local initiators. Any given target may have one or more local initiators,
+and any given initiator may have multiple local memory targets.
+
+To aid applications matching memory targets with their initiators,
+the kernel provide symlinks to each other like the following example::
+
+ # ls -l /sys/devices/system/node/nodeX/local_target*
+ /sys/devices/system/node/nodeX/local_targetY -> ../nodeY
+
+ # ls -l /sys/devices/system/node/nodeY/local_initiator*
+ /sys/devices/system/node/nodeY/local_initiatorX -> ../nodeX
+
+The linked nodes will also have their node number set in the local_mem
+and local_cpu node list and maps.
+
+An example showing how this may be used to run a particular task on CPUs
+and memory that are both local to a particular PCI device can be done
+using existing 'numactl' as follows::
+
+ # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
+ # numactl --membind=$(cat /sys/devices/node/node${NODE}/local_mem_nodelist) \
+ --cpunodebind=$(cat /sys/devices/node/node${NODE}/local_cpu_nodelist) \
+ -- <some-program-to-execute>
+
+================
+NUMA Performance
+================
+
+Applications may wish to consider which node they want their memory to
+be allocated from based on the node's performance characteristics. If the
+system provides these attributes, the kernel exports them under the node
+sysfs hierarchy by appending the local_initiator_access directory under
+the memory node as follows::
+
+ /sys/devices/system/node/nodeY/local_initiator_access/
+
+The kernel does not provide performance attributes for non-local memory
+initiators. These attributes apply only to the memory initiator nodes that
+have a local_initiatorX link, or are set in the local_cpu_nodelist. A
+memory initiator node is considered local to itself if it also is
+a memory target and will be set it its node list and map, but won't
+contain a symlink to itself.
+
+The performance characteristics the kernel provides for the local initiators
+are exported are as follows::
+
+ # tree /sys/devices/system/node/nodeY/local_initiator_access
+ /sys/devices/system/node/nodeY/local_initiator_access
+ |-- read_bandwidth
+ |-- read_latency
+ |-- write_bandwidth
+ `-- write_latency
+
+The bandwidth attributes are provided in MiB/second.
+
+The latency attributes are provided in nanoseconds.
+
+==========
+NUMA Cache
+==========
+
+System memory may be constructed in a hierarchy of elements with various
+performance characteristics in order to provide large address space
+of slower performing memory side-cached by a smaller higher performing
+memory. The system physical addresses that initiators are aware of is
+provided by the last memory level in the hierarchy, while the system uses
+higher performing memory to transparently cache access to progressively
+slower levels.
+
+The term "far memory" is used to denote the last level memory in the
+hierarchy. Each increasing cache level provides higher performing
+initiator access, and the term "near memory" represents the fastest
+cache provided by the system.
+
+This numbering is different than CPU caches where the cache level (ex:
+L1, L2, L3) uses a CPU centric view with each increased level is lower
+performing. In contrast, the memory cache level is centric to the last
+level memory, so the higher numbered cache level denotes memory nearer
+to the CPU, and further from far memory.
+
+The memory side caches are not directly addressable by software. When
+software accesses a system address, the system will return it from the
+near memory cache if it is present. If it is not present, the system
+accesses the next level of memory until there is either a hit in that
+cache level, or it reaches far memory.
+
+An application does not need to know about caching attributes in order
+to use the system, software may optionally query the memory cache
+attributes in order to maximize the performance out of such a setup.
+If the system provides a way for the kernel to discover this information,
+for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
+the kernel will append these attributes to the NUMA node memory target.
+
+When the kernel first registers a memory cache with a node, the kernel
+will create the following directory::
+
+ /sys/devices/system/node/nodeX/side_cache/
+
+If that directory is not present, the system either does not not provide
+a memory side cache, or that information is not accessible to the kernel.
+
+The attributes for each level of cache is provided under its cache
+level index::
+
+ /sys/devices/system/node/nodeX/side_cache/indexA/
+ /sys/devices/system/node/nodeX/side_cache/indexB/
+ /sys/devices/system/node/nodeX/side_cache/indexC/
+
+Each cache level's directory provides its attributes. For example,
+the following is a single cache level and the attributes available for
+software to query::
+
+ # tree sys/devices/system/node/node0/side_cache/
+ /sys/devices/system/node/node0/side_cache/
+ |-- index1
+ | |-- associativity
+ | |-- level
+ | |-- line_size
+ | |-- size
+ | `-- write_policy
+
+The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
+for any other indexed based, multi-way associativity.
+
+The "level" is the distance from the far memory, and matches the number
+appended to its "index" directory.
+
+The "line_size" is the number of bytes accessed on a cache miss.
+
+The "size" is the number of bytes provided by this cache level.
+
+The "write_policy" will be 0 for write-back, and non-zero for
+write-through caching.
+
+See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
--
2.14.4


2018-12-11 01:12:25

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

Systems may provide different memory types and export this information
in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
tables provided by the platform and report the memory access and caching
attributes.

Signed-off-by: Keith Busch <[email protected]>
---
drivers/acpi/Kconfig | 8 +++
drivers/acpi/Makefile | 1 +
drivers/acpi/hmat.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++
drivers/acpi/tables.c | 9 +++
include/linux/acpi.h | 1 +
5 files changed, 211 insertions(+)
create mode 100644 drivers/acpi/hmat.c

diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index 7cea769c37df..9a05af3a18cf 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -327,6 +327,14 @@ config ACPI_NUMA
depends on (X86 || IA64 || ARM64)
default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64

+config ACPI_HMAT
+ bool "ACPI Heterogeneous Memory Attribute Table Support"
+ depends on ACPI_NUMA
+ help
+ Parses representation of the ACPI Heterogeneous Memory Attributes
+ Table (HMAT) and set the memory node relationships and access
+ attributes.
+
config ACPI_CUSTOM_DSDT_FILE
string "Custom DSDT Table file to include"
default ""
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index edc039313cd6..b5e13499f88b 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -55,6 +55,7 @@ acpi-$(CONFIG_X86) += x86/apple.o
acpi-$(CONFIG_X86) += x86/utils.o
acpi-$(CONFIG_DEBUG_FS) += debugfs.o
acpi-$(CONFIG_ACPI_NUMA) += numa.o
+acpi-$(CONFIG_ACPI_HMAT) += hmat.o
acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o
acpi-y += acpi_lpat.o
acpi-$(CONFIG_ACPI_LPIT) += acpi_lpit.o
diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
new file mode 100644
index 000000000000..ef3881f0f370
--- /dev/null
+++ b/drivers/acpi/hmat.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Heterogeneous Memory Attributes Table (HMAT) representation
+ *
+ * Copyright (c) 2018, Intel Corporation.
+ */
+
+#include <acpi/acpi_numa.h>
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/cpu.h>
+#include <linux/device.h>
+#include <linux/init.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/node.h>
+#include <linux/slab.h>
+#include <linux/sysfs.h>
+
+static __init const char *hmat_data_type(u8 type)
+{
+ switch (type) {
+ case ACPI_HMAT_ACCESS_LATENCY:
+ return "Access Latency";
+ case ACPI_HMAT_READ_LATENCY:
+ return "Read Latency";
+ case ACPI_HMAT_WRITE_LATENCY:
+ return "Write Latency";
+ case ACPI_HMAT_ACCESS_BANDWIDTH:
+ return "Access Bandwidth";
+ case ACPI_HMAT_READ_BANDWIDTH:
+ return "Read Bandwidth";
+ case ACPI_HMAT_WRITE_BANDWIDTH:
+ return "Write Bandwidth";
+ default:
+ return "Reserved";
+ };
+}
+
+static __init const char *hmat_data_type_suffix(u8 type)
+{
+ switch (type) {
+ case ACPI_HMAT_ACCESS_LATENCY:
+ case ACPI_HMAT_READ_LATENCY:
+ case ACPI_HMAT_WRITE_LATENCY:
+ return " nsec";
+ case ACPI_HMAT_ACCESS_BANDWIDTH:
+ case ACPI_HMAT_READ_BANDWIDTH:
+ case ACPI_HMAT_WRITE_BANDWIDTH:
+ return " MB/s";
+ default:
+ return "";
+ };
+}
+
+static __init int hmat_parse_locality(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_locality *loc = (void *)header;
+ unsigned int init, targ, total_size, ipds, tpds;
+ u32 *inits, *targs, value;
+ u16 *entries;
+ u8 type;
+
+ if (loc->header.length < sizeof(*loc)) {
+ pr_err("HMAT: Unexpected locality header length: %d\n",
+ loc->header.length);
+ return -EINVAL;
+ }
+
+ type = loc->data_type;
+ ipds = loc->number_of_initiator_Pds;
+ tpds = loc->number_of_target_Pds;
+ total_size = sizeof(*loc) + sizeof(*entries) * ipds * tpds +
+ sizeof(*inits) * ipds + sizeof(*targs) * tpds;
+ if (loc->header.length < total_size) {
+ pr_err("HMAT: Unexpected locality header length:%d, minimum required:%d\n",
+ loc->header.length, total_size);
+ return -EINVAL;
+ }
+
+ pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n",
+ loc->flags, hmat_data_type(type), ipds, tpds,
+ loc->entry_base_unit);
+
+ inits = (u32 *)(loc + 1);
+ targs = &inits[ipds];
+ entries = (u16 *)(&targs[tpds]);
+ for (targ = 0; targ < tpds; targ++) {
+ for (init = 0; init < ipds; init++) {
+ value = entries[init * tpds + targ];
+ value = (value * loc->entry_base_unit) / 10;
+ pr_info(" Initiator-Target[%d-%d]:%d%s\n",
+ inits[init], targs[targ], value,
+ hmat_data_type_suffix(type));
+ }
+ }
+ return 0;
+}
+
+static __init int hmat_parse_cache(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_cache *cache = (void *)header;
+ u32 attrs;
+
+ if (cache->header.length < sizeof(*cache)) {
+ pr_err("HMAT: Unexpected cache header length: %d\n",
+ cache->header.length);
+ return -EINVAL;
+ }
+
+ attrs = cache->cache_attributes;
+ pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n",
+ cache->memory_PD, cache->cache_size, attrs,
+ cache->number_of_SMBIOShandles);
+
+ return 0;
+}
+
+static int __init hmat_parse_address_range(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_address_range *spa = (void *)header;
+
+ if (spa->header.length != sizeof(*spa)) {
+ pr_err("HMAT: Unexpected address range header length: %d\n",
+ spa->header.length);
+ return -EINVAL;
+ }
+ pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n",
+ spa->physical_address_base, spa->physical_address_length,
+ spa->flags, spa->processor_PD, spa->memory_PD);
+ return 0;
+}
+
+static int __init hmat_parse_subtable(union acpi_subtable_headers *header,
+ const unsigned long end)
+{
+ struct acpi_hmat_structure *hdr = (void *)header;
+
+ if (!hdr)
+ return -EINVAL;
+
+ switch (hdr->type) {
+ case ACPI_HMAT_TYPE_ADDRESS_RANGE:
+ return hmat_parse_address_range(header, end);
+ case ACPI_HMAT_TYPE_LOCALITY:
+ return hmat_parse_locality(header, end);
+ case ACPI_HMAT_TYPE_CACHE:
+ return hmat_parse_cache(header, end);
+ default:
+ return -EINVAL;
+ }
+}
+
+static __init int parse_noop(struct acpi_table_header *table)
+{
+ return 0;
+}
+
+static __init int hmat_init(void)
+{
+ struct acpi_subtable_proc subtable_proc;
+ struct acpi_table_header *tbl;
+ enum acpi_hmat_type i;
+ acpi_status status;
+
+ if (srat_disabled())
+ return 0;
+
+ status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
+ if (ACPI_FAILURE(status))
+ return 0;
+
+ if (acpi_table_parse(ACPI_SIG_HMAT, parse_noop))
+ goto out_put;
+
+ memset(&subtable_proc, 0, sizeof(subtable_proc));
+ subtable_proc.handler = hmat_parse_subtable;
+ for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
+ subtable_proc.id = i;
+ if (acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ &subtable_proc, 1, 0) < 0)
+ goto out_put;
+ }
+ out_put:
+ acpi_put_table(tbl);
+ return 0;
+}
+subsys_initcall(hmat_init);
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index e9643b4267c7..bc1addf715dc 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -51,6 +51,7 @@ static int acpi_apic_instance __initdata;

enum acpi_subtable_type {
ACPI_SUBTABLE_COMMON,
+ ACPI_SUBTABLE_HMAT,
};

struct acpi_subtable_entry {
@@ -232,6 +233,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry)
switch (entry->type) {
case ACPI_SUBTABLE_COMMON:
return entry->hdr->common.type;
+ case ACPI_SUBTABLE_HMAT:
+ return entry->hdr->hmat.type;
}
return 0;
}
@@ -242,6 +245,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry)
switch (entry->type) {
case ACPI_SUBTABLE_COMMON:
return entry->hdr->common.length;
+ case ACPI_SUBTABLE_HMAT:
+ return entry->hdr->hmat.length;
}
return 0;
}
@@ -252,6 +257,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
switch (entry->type) {
case ACPI_SUBTABLE_COMMON:
return sizeof(entry->hdr->common);
+ case ACPI_SUBTABLE_HMAT:
+ return sizeof(entry->hdr->hmat);
}
return 0;
}
@@ -259,6 +266,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry)
static enum acpi_subtable_type __init
acpi_get_subtable_type(char *id)
{
+ if (strncmp(id, ACPI_SIG_HMAT, 4) == 0)
+ return ACPI_SUBTABLE_HMAT;
return ACPI_SUBTABLE_COMMON;
}

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 18805a967c70..4373f5ba0f95 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -143,6 +143,7 @@ enum acpi_address_range_id {
/* Table Handlers */
union acpi_subtable_headers {
struct acpi_subtable_header common;
+ struct acpi_hmat_structure hmat;
};

typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
--
2.14.4


2018-12-11 01:12:41

by Keith Busch

[permalink] [raw]
Subject: [PATCHv2 03/12] node: Link memory nodes to their compute nodes

Memory-only nodes will often have affinity to a compute node that can
initiate memory access. Platforms may have various ways to express that
locality. The nodes that initiate memory requests closest to a memory
target node may be considered to have 'primary' access.

In preparation for these systems, provide a way for the kernel to link the
target memory node to its primary initiator compute nodes with symlinks
to each other. Also add memory target and initiator node masks showing
the same relationship.

The following example show the node's new sysfs hierarchy setup for
memory node 'Y' with primary access to commpute node 'X':

# ls -l /sys/devices/system/node/nodeX/primary_target*
/sys/devices/system/node/nodeX/primary_targetY -> ../nodeY

# ls -l /sys/devices/system/node/nodeY/primary_initiator*
/sys/devices/system/node/nodeY/primary_initiatorX -> ../nodeX

A single initiator may have primary access to multiple memory targets, and
the targets may also have primary access from multiple memory initiators.

The following demonstrates how this may look for a theoretical system
with 8 memory nodes and 2 compute nodes.

# cat /sys/devices/system/node/node0/primary_mem_nodelist
0,2,4,6

# cat /sys/devices/system/node/node1/primary_mem_nodelist
1,3,5,7

And then going the other way to identify the cpu lists of a node's
primary targets:

# cat /sys/devices/system/node/node0/primary_target*/primary_cpu_nodelist | tr "\n" ","
0,0,0,0

# cat /sys/devices/system/node/node1/primary_target*/primary_cpu_nodelist
1,1,1,1

As an example of what you may be able to do with this, let's say we have
a PCIe storage device, /dev/nvme0n1, attached to a particular node, and
we want to run IO to it using only CPUs and Memory that share primary
access. The following shell script is such an example to achieve
that goal:

#!/bin/bash
DEV_NODE=/sys/devices/system/node/node$(cat /sys/block/nvme0n1/device/device/numa_node)
numactl --membind=$(cat ${DEV_NODE}/primary_mem_nodelist) \
--cpunodebind=$(cat ${DEV_NODE}/primary_cpu_nodelist) \
-- fio --filename=/dev/nvme0n1 --bs=4k --name=access-test

Signed-off-by: Keith Busch <[email protected]>
---
drivers/base/node.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/node.h | 4 +++
2 files changed, 89 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 86d6cd92ce3d..50412ce3fd7d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -56,6 +56,46 @@ static inline ssize_t node_read_cpulist(struct device *dev,
return node_read_cpumap(dev, true, buf);
}

+static ssize_t node_read_nodemap(nodemask_t *mask, bool list, char *buf)
+{
+ return list ? scnprintf(buf, PAGE_SIZE - 1, "%*pbl\n",
+ nodemask_pr_args(mask)) :
+ scnprintf(buf, PAGE_SIZE - 1, "%*pb\n",
+ nodemask_pr_args(mask));
+}
+
+static ssize_t primary_mem_nodelist_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct node *n = to_node(dev);
+ return node_read_nodemap(&n->primary_mem_nodes, true, buf);
+}
+
+static ssize_t primary_mem_nodemask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct node *n = to_node(dev);
+ return node_read_nodemap(&n->primary_mem_nodes, false, buf);
+}
+
+static ssize_t primary_cpu_nodelist_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct node *n = to_node(dev);
+ return node_read_nodemap(&n->primary_cpu_nodes, true, buf);
+}
+
+static ssize_t primary_cpu_nodemask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct node *n = to_node(dev);
+ return node_read_nodemap(&n->primary_cpu_nodes, false, buf);
+}
+
+static DEVICE_ATTR_RO(primary_mem_nodelist);
+static DEVICE_ATTR_RO(primary_mem_nodemask);
+static DEVICE_ATTR_RO(primary_cpu_nodemask);
+static DEVICE_ATTR_RO(primary_cpu_nodelist);
static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL);
static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);

@@ -240,6 +280,10 @@ static struct attribute *node_dev_attrs[] = {
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
+ &dev_attr_primary_mem_nodelist.attr,
+ &dev_attr_primary_mem_nodemask.attr,
+ &dev_attr_primary_cpu_nodemask.attr,
+ &dev_attr_primary_cpu_nodelist.attr,
NULL
};
ATTRIBUTE_GROUPS(node_dev);
@@ -372,6 +416,42 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid)
kobject_name(&node_devices[nid]->dev.kobj));
}

+int register_memory_node_under_compute_node(unsigned int m, unsigned int p)
+{
+ struct node *init, *targ;
+ char initiator[28]; /* "primary_initiator4294967295\0" */
+ char target[25]; /* "primary_target4294967295\0" */
+ int ret;
+
+ if (!node_online(p) || !node_online(m))
+ return -ENODEV;
+ if (m == p)
+ return 0;
+
+ snprintf(initiator, sizeof(initiator), "primary_initiator%d", p);
+ snprintf(target, sizeof(target), "primary_target%d", m);
+
+ init = node_devices[p];
+ targ = node_devices[m];
+
+ ret = sysfs_create_link(&init->dev.kobj, &targ->dev.kobj, target);
+ if (ret)
+ return ret;
+
+ ret = sysfs_create_link(&targ->dev.kobj, &init->dev.kobj, initiator);
+ if (ret)
+ goto err;
+
+ node_set(m, init->primary_mem_nodes);
+ node_set(p, targ->primary_cpu_nodes);
+
+ return 0;
+ err:
+ sysfs_remove_link(&node_devices[p]->dev.kobj,
+ kobject_name(&node_devices[m]->dev.kobj));
+ return ret;
+}
+
int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
{
struct device *obj;
@@ -580,6 +660,11 @@ int __register_one_node(int nid)
register_cpu_under_node(cpu, nid);
}

+ if (node_state(nid, N_MEMORY))
+ node_set(nid, node_devices[nid]->primary_mem_nodes);
+ if (node_state(nid, N_CPU))
+ node_set(nid, node_devices[nid]->primary_cpu_nodes);
+
/* initialize work queue for memory hot plug */
init_node_hugetlb_work(nid);

diff --git a/include/linux/node.h b/include/linux/node.h
index 257bb3d6d014..3d06de045cbf 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -21,6 +21,8 @@

struct node {
struct device dev;
+ nodemask_t primary_mem_nodes;
+ nodemask_t primary_cpu_nodes;

#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
struct work_struct node_work;
@@ -75,6 +77,8 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk,
extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
unsigned long phys_index);

+extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p);
+
#ifdef CONFIG_HUGETLBFS
extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
node_registration_func_t unregister);
--
2.14.4


2018-12-11 06:06:17

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

On Mon, Dec 10, 2018 at 5:05 PM Keith Busch <[email protected]> wrote:
>
> Systems may provide different memory types and export this information
> in the ACPI Heterogeneous Memory Attribute Table (HMAT). Parse these
> tables provided by the platform and report the memory access and caching
> attributes.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> drivers/acpi/Kconfig | 8 +++
> drivers/acpi/Makefile | 1 +
> drivers/acpi/hmat.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++++
> drivers/acpi/tables.c | 9 +++
> include/linux/acpi.h | 1 +
> 5 files changed, 211 insertions(+)
> create mode 100644 drivers/acpi/hmat.c
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 7cea769c37df..9a05af3a18cf 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -327,6 +327,14 @@ config ACPI_NUMA
> depends on (X86 || IA64 || ARM64)
> default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64
>
> +config ACPI_HMAT
> + bool "ACPI Heterogeneous Memory Attribute Table Support"
> + depends on ACPI_NUMA
> + help
> + Parses representation of the ACPI Heterogeneous Memory Attributes
> + Table (HMAT) and set the memory node relationships and access
> + attributes.
> +
> config ACPI_CUSTOM_DSDT_FILE
> string "Custom DSDT Table file to include"
> default ""
> diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
> index edc039313cd6..b5e13499f88b 100644
> --- a/drivers/acpi/Makefile
> +++ b/drivers/acpi/Makefile
> @@ -55,6 +55,7 @@ acpi-$(CONFIG_X86) += x86/apple.o
> acpi-$(CONFIG_X86) += x86/utils.o
> acpi-$(CONFIG_DEBUG_FS) += debugfs.o
> acpi-$(CONFIG_ACPI_NUMA) += numa.o
> +acpi-$(CONFIG_ACPI_HMAT) += hmat.o
> acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o
> acpi-y += acpi_lpat.o
> acpi-$(CONFIG_ACPI_LPIT) += acpi_lpit.o
> diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c
> new file mode 100644
> index 000000000000..ef3881f0f370
> --- /dev/null
> +++ b/drivers/acpi/hmat.c
[..]
> +static __init int hmat_init(void)
> +{
> + struct acpi_subtable_proc subtable_proc;
> + struct acpi_table_header *tbl;
> + enum acpi_hmat_type i;
> + acpi_status status;
> +
> + if (srat_disabled())
> + return 0;
> +
> + status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl);
> + if (ACPI_FAILURE(status))
> + return 0;
> +
> + if (acpi_table_parse(ACPI_SIG_HMAT, parse_noop))
> + goto out_put;
> +
> + memset(&subtable_proc, 0, sizeof(subtable_proc));
> + subtable_proc.handler = hmat_parse_subtable;
> + for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) {
> + subtable_proc.id = i;
> + if (acpi_table_parse_entries_array(ACPI_SIG_HMAT,
> + sizeof(struct acpi_table_hmat),
> + &subtable_proc, 1, 0) < 0)
> + goto out_put;
> + }
> + out_put:
> + acpi_put_table(tbl);
> + return 0;
> +}
> +subsys_initcall(hmat_init);

I have a use case to detect the presence of a memory-side-cache early
at init time [1]. To me this means that hmat_init() needs to happen as
a part of acpi_numa_init(). Subsequently I think that also means that
the sysfs portion needs to be broken out to its own init path that can
probably run at module_init() priority.

Perhaps we should split this patch set into two? The table parsing
with an in-kernel user is a bit easier to reason about and can go in
first. Towards that end can I steal / refllow patches 1 & 2 into the
memory randomization series? Other ideas how to handle this?

[1]: https://lkml.org/lkml/2018/10/12/309

2018-12-11 06:49:39

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCHv2 12/12] doc/mm: New documentation for memory performance

Hi Keith,

Thanks for the docs! :)
Some nits below...

On Mon, Dec 10, 2018 at 06:03:10PM -0700, Keith Busch wrote:
> Platforms may provide system memory where some physical address ranges
> perform differently than others, or is side cached by the system.
>
> Add documentation describing a high level overview of such systems and the
> performance and caching attributes the kernel provides for applications
> wishing to query this information.
>
> Signed-off-by: Keith Busch <[email protected]>
> ---
> Documentation/admin-guide/mm/numaperf.rst | 171 ++++++++++++++++++++++++++++++
> 1 file changed, 171 insertions(+)
> create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>
> diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
> new file mode 100644
> index 000000000000..846b3f991e7f
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/numaperf.rst
> @@ -0,0 +1,171 @@
> +.. _numaperf:
> +
> +=============
> +NUMA Locality
> +=============
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.
> +
> +A system supporting such heterogeneous memory by grouping each memory

Maybe "A system supports ..."?

> +type under different "nodes" based on similar CPU locality and performance
> +characteristics. Some memory may share the same node as a CPU, and others
> +are provided as memory only nodes. While memory only nodes do not provide
> +CPUs, they may still be directly accessible, or local, to one or more
> +compute nodes. The following diagram shows one such example of two compute
> +noes with local memory and a memory only node for each of compute node:

^ attached to each ?
> +
> + +------------------+ +------------------+
> + | Compute Node 0 +-----+ Compute Node 1 |
> + | Local Node0 Mem | | Local Node1 Mem |
> + +--------+---------+ +--------+---------+
> + | |
> + +--------+---------+ +--------+---------+
> + | Slower Node2 Mem | | Slower Node3 Mem |
> + +------------------+ +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests. A
> +"memory target" is a node containing one or more accessible physical
> +address ranges from one or more memory initiators.

Maybe "... one or more address ranges accessible from one or more memory
initiators"

> +
> +When multiple memory initiators exist, they may not all have the same
> +performance when accessing a given memory target. The highest performing
> +initiator to a given target is considered to be one of that target's
> +local initiators. Any given target may have one or more local initiators,
> +and any given initiator may have multiple local memory targets.
> +
> +To aid applications matching memory targets with their initiators,
> +the kernel provide symlinks to each other like the following example::

^ provides
> +
> + # ls -l /sys/devices/system/node/nodeX/local_target*
> + /sys/devices/system/node/nodeX/local_targetY -> ../nodeY
> +
> + # ls -l /sys/devices/system/node/nodeY/local_initiator*
> + /sys/devices/system/node/nodeY/local_initiatorX -> ../nodeX
> +
> +The linked nodes will also have their node number set in the local_mem
> +and local_cpu node list and maps.
> +
> +An example showing how this may be used to run a particular task on CPUs
> +and memory that are both local to a particular PCI device can be done
> +using existing 'numactl' as follows::
> +
> + # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
> + # numactl --membind=$(cat /sys/devices/node/node${NODE}/local_mem_nodelist) \
> + --cpunodebind=$(cat /sys/devices/node/node${NODE}/local_cpu_nodelist) \
> + -- <some-program-to-execute>
> +
> +================
> +NUMA Performance
> +================
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If the
> +system provides these attributes, the kernel exports them under the node
> +sysfs hierarchy by appending the local_initiator_access directory under
> +the memory node as follows::
> +
> + /sys/devices/system/node/nodeY/local_initiator_access/
> +
> +The kernel does not provide performance attributes for non-local memory
> +initiators. These attributes apply only to the memory initiator nodes that
> +have a local_initiatorX link, or are set in the local_cpu_nodelist. A
> +memory initiator node is considered local to itself if it also is
> +a memory target and will be set it its node list and map, but won't
> +contain a symlink to itself.
> +
> +The performance characteristics the kernel provides for the local initiators
> +are exported are as follows::
> +
> + # tree /sys/devices/system/node/nodeY/local_initiator_access
> + /sys/devices/system/node/nodeY/local_initiator_access
> + |-- read_bandwidth
> + |-- read_latency
> + |-- write_bandwidth
> + `-- write_latency
> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of elements with various
> +performance characteristics in order to provide large address space
> +of slower performing memory side-cached by a smaller higher performing
> +memory. The system physical addresses that initiators are aware of is
> +provided by the last memory level in the hierarchy, while the system uses
> +higher performing memory to transparently cache access to progressively
> +slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing
> +initiator access, and the term "near memory" represents the fastest
> +cache provided by the system.
> +
> +This numbering is different than CPU caches where the cache level (ex:
> +L1, L2, L3) uses a CPU centric view with each increased level is lower
> +performing. In contrast, the memory cache level is centric to the last
> +level memory, so the higher numbered cache level denotes memory nearer
> +to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the

^ satisfy the request

> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +An application does not need to know about caching attributes in order
> +to use the system, software may optionally query the memory cache
> +attributes in order to maximize the performance out of such a setup.
> +If the system provides a way for the kernel to discover this information,
> +for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
> +the kernel will append these attributes to the NUMA node memory target.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> + /sys/devices/system/node/nodeX/side_cache/
> +
> +If that directory is not present, the system either does not not provide
> +a memory side cache, or that information is not accessible to the kernel.
> +
> +The attributes for each level of cache is provided under its cache
> +level index::
> +
> + /sys/devices/system/node/nodeX/side_cache/indexA/
> + /sys/devices/system/node/nodeX/side_cache/indexB/
> + /sys/devices/system/node/nodeX/side_cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example,
> +the following is a single cache level and the attributes available for
> +software to query::
> +
> + # tree sys/devices/system/node/node0/side_cache/
> + /sys/devices/system/node/node0/side_cache/
> + |-- index1
> + | |-- associativity
> + | |-- level
> + | |-- line_size
> + | |-- size
> + | `-- write_policy
> +
> +The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
> +for any other indexed based, multi-way associativity.
> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.
> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf

I'd suggest to reference relevant sections rather than entire 1K pages doc ;-)

> --
> 2.14.4
>

--
Sincerely yours,
Mike.


2018-12-11 09:47:38

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

On Tue, Dec 11, 2018 at 2:05 AM Keith Busch <[email protected]> wrote:
>
> Parsing entries in an ACPI table had assumed a generic header structure
> that is most common. There is no standard ACPI header, though, so less
> common types would need custom parsers if they want go through their
> sub-table entry list.

It looks like the problem at hand is that acpi_hmat_structure is
incompatible with acpi_subtable_header because of the different layout
and field sizes.

If so, please state that clearly here.

With that, please feel free to add

Reviewed-by: Rafael J. Wysocki <[email protected]>

to this patch.

2018-12-11 16:59:37

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

On Mon, Dec 10, 2018 at 10:03:40PM -0800, Dan Williams wrote:
> I have a use case to detect the presence of a memory-side-cache early
> at init time [1]. To me this means that hmat_init() needs to happen as
> a part of acpi_numa_init(). Subsequently I think that also means that
> the sysfs portion needs to be broken out to its own init path that can
> probably run at module_init() priority.
>
> Perhaps we should split this patch set into two? The table parsing
> with an in-kernel user is a bit easier to reason about and can go in
> first. Towards that end can I steal / refllow patches 1 & 2 into the
> memory randomization series? Other ideas how to handle this?
>
> [1]: https://lkml.org/lkml/2018/10/12/309

To that end, will something like the following work for you? This just
needs to happen after patch 1.

---
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index f5e09c39ff22..03ef3c8ba4ea 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -40,6 +40,8 @@ static int pxm_to_node_map[MAX_PXM_DOMAINS]
static int node_to_pxm_map[MAX_NUMNODES]
= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };

+static unsigned long node_side_cached[BITS_TO_LONGS(MAX_PXM_DOMAINS)];
+
unsigned char acpi_srat_revision __initdata;
int acpi_numa __initdata;

@@ -262,6 +264,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
u64 start, end;
u32 hotpluggable;
int node, pxm;
+ bool side_cached;

if (srat_disabled())
goto out_err;
@@ -308,6 +311,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
(unsigned long long)start, (unsigned long long)end - 1);

+ side_cached = test_bit(pxm, node_side_cached);
+ if (side_cached && memblock_mark_sidecached(start, ma->length))
+ pr_warn("SRAT: Failed to mark side cached range [mem %#010Lx-%#010Lx] in memblock\n",
+ (unsigned long long)start, (unsigned long long)end - 1);
+
max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));

return 0;
@@ -411,6 +419,19 @@ acpi_parse_memory_affinity(union acpi_subtable_headers * header,
return 0;
}

+static int __init
+acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
+{
+ struct acpi_hmat_cache *cache = (void *)header;
+ u32 attrs;
+
+ attrs = cache->cache_attributes;
+ if (((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) ==
+ ACPI_HMAT_CA_DIRECT_MAPPED)
+ set_bit(cache->memory_PD, node_side_cached);
+ return 0;
+}
+
static int __init acpi_parse_srat(struct acpi_table_header *table)
{
struct acpi_table_srat *srat = (struct acpi_table_srat *)table;
@@ -422,6 +443,11 @@ static int __init acpi_parse_srat(struct acpi_table_header *table)
return 0;
}

+static __init int acpi_parse_hmat(struct acpi_table_header *table)
+{
+ return 0;
+}
+
static int __init
acpi_table_parse_srat(enum acpi_srat_type id,
acpi_tbl_entry_handler handler, unsigned int max_entries)
@@ -460,6 +486,16 @@ int __init acpi_numa_init(void)
sizeof(struct acpi_table_srat),
srat_proc, ARRAY_SIZE(srat_proc), 0);

+ if (!acpi_table_parse(ACPI_SIG_HMAT, acpi_parse_hmat)) {
+ struct acpi_subtable_proc hmat_proc;
+
+ memset(&hmat_proc, 0, sizeof(hmat_proc));
+ hmat_proc.handler = acpi_parse_cache;
+ hmat_proc.id = ACPI_HMAT_TYPE_CACHE;
+ acpi_table_parse_entries_array(ACPI_SIG_HMAT,
+ sizeof(struct acpi_table_hmat),
+ &hmat_proc, 1, 0);
+ }
cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
acpi_parse_memory_affinity, 0);
}
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index aee299a6aa76..a24c918a4496 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -44,6 +44,7 @@ enum memblock_flags {
MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
MEMBLOCK_MIRROR = 0x2, /* mirrored region */
MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
+ MEMBLOCK_SIDECACHED = 0x8, /* System side caches memory access */
};

/**
@@ -130,6 +131,7 @@ int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_mark_mirror(phys_addr_t base, phys_addr_t size);
int memblock_mark_nomap(phys_addr_t base, phys_addr_t size);
int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
+int memblock_mark_sidecached(phys_addr_t base, phys_addr_t size);
enum memblock_flags choose_memblock_flags(void);

unsigned long memblock_free_all(void);
@@ -227,6 +229,11 @@ static inline bool memblock_is_nomap(struct memblock_region *m)
return m->flags & MEMBLOCK_NOMAP;
}

+static inline bool memblock_is_sidecached(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_SIDECACHED;
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
unsigned long *end_pfn);
diff --git a/mm/memblock.c b/mm/memblock.c
index 9a2d5ae81ae1..827b709afdcd 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -865,6 +865,11 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
return memblock_setclr_flag(base, size, 1, MEMBLOCK_HOTPLUG);
}

+int __init_memblock memblock_mark_sidecached(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_setclr_flag(base, size, 1, MEMBLOCK_SIDECACHED);
+}
+
/**
* memblock_clear_hotplug - Clear flag MEMBLOCK_HOTPLUG for a specified region.
* @base: the base phys addr of the region
--

2018-12-11 20:32:18

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

On Tue, Dec 11, 2018 at 8:58 AM Keith Busch <[email protected]> wrote:
>
> On Mon, Dec 10, 2018 at 10:03:40PM -0800, Dan Williams wrote:
> > I have a use case to detect the presence of a memory-side-cache early
> > at init time [1]. To me this means that hmat_init() needs to happen as
> > a part of acpi_numa_init(). Subsequently I think that also means that
> > the sysfs portion needs to be broken out to its own init path that can
> > probably run at module_init() priority.
> >
> > Perhaps we should split this patch set into two? The table parsing
> > with an in-kernel user is a bit easier to reason about and can go in
> > first. Towards that end can I steal / refllow patches 1 & 2 into the
> > memory randomization series? Other ideas how to handle this?
> >
> > [1]: https://lkml.org/lkml/2018/10/12/309
>
> To that end, will something like the following work for you? This just
> needs to happen after patch 1.
>
> ---
> diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
> index f5e09c39ff22..03ef3c8ba4ea 100644
> --- a/drivers/acpi/numa.c
> +++ b/drivers/acpi/numa.c
> @@ -40,6 +40,8 @@ static int pxm_to_node_map[MAX_PXM_DOMAINS]
> static int node_to_pxm_map[MAX_NUMNODES]
> = { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
>
> +static unsigned long node_side_cached[BITS_TO_LONGS(MAX_PXM_DOMAINS)];
> +
> unsigned char acpi_srat_revision __initdata;
> int acpi_numa __initdata;
>
> @@ -262,6 +264,7 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> u64 start, end;
> u32 hotpluggable;
> int node, pxm;
> + bool side_cached;
>
> if (srat_disabled())
> goto out_err;
> @@ -308,6 +311,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
> (unsigned long long)start, (unsigned long long)end - 1);
>
> + side_cached = test_bit(pxm, node_side_cached);
> + if (side_cached && memblock_mark_sidecached(start, ma->length))
> + pr_warn("SRAT: Failed to mark side cached range [mem %#010Lx-%#010Lx] in memblock\n",
> + (unsigned long long)start, (unsigned long long)end - 1);
> +
> max_possible_pfn = max(max_possible_pfn, PFN_UP(end - 1));
>
> return 0;
> @@ -411,6 +419,19 @@ acpi_parse_memory_affinity(union acpi_subtable_headers * header,
> return 0;
> }
>
> +static int __init
> +acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
> +{
> + struct acpi_hmat_cache *cache = (void *)header;
> + u32 attrs;
> +
> + attrs = cache->cache_attributes;
> + if (((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) ==
> + ACPI_HMAT_CA_DIRECT_MAPPED)
> + set_bit(cache->memory_PD, node_side_cached);

I'm not sure I see a use case for 'node_side_cached'. Instead I need
to know if a cache intercepts a "System RAM" resource, because a cache
in front of a reserved address range would not be impacted by page
allocator randomization. Or, are you saying have memblock generically
describes this capability and move the responsibility of acting on
that data to a higher level?

The other detail to consider is the cache ratio size, but that would
be a follow on feature. The use case is to automatically determine the
ratio to pass to numa_emulation:

cc9aec03e58f x86/numa_emulation: Introduce uniform split capability

> + return 0;
> +}
> +
> static int __init acpi_parse_srat(struct acpi_table_header *table)
> {
> struct acpi_table_srat *srat = (struct acpi_table_srat *)table;
> @@ -422,6 +443,11 @@ static int __init acpi_parse_srat(struct acpi_table_header *table)
> return 0;
> }
>
> +static __init int acpi_parse_hmat(struct acpi_table_header *table)
> +{
> + return 0;
> +}

What's this acpi_parse_hmat() stub for?

> +
> static int __init
> acpi_table_parse_srat(enum acpi_srat_type id,
> acpi_tbl_entry_handler handler, unsigned int max_entries)
> @@ -460,6 +486,16 @@ int __init acpi_numa_init(void)
> sizeof(struct acpi_table_srat),
> srat_proc, ARRAY_SIZE(srat_proc), 0);
>
> + if (!acpi_table_parse(ACPI_SIG_HMAT, acpi_parse_hmat)) {
> + struct acpi_subtable_proc hmat_proc;
> +
> + memset(&hmat_proc, 0, sizeof(hmat_proc));
> + hmat_proc.handler = acpi_parse_cache;
> + hmat_proc.id = ACPI_HMAT_TYPE_CACHE;
> + acpi_table_parse_entries_array(ACPI_SIG_HMAT,
> + sizeof(struct acpi_table_hmat),
> + &hmat_proc, 1, 0);
> + }
> cnt = acpi_table_parse_srat(ACPI_SRAT_TYPE_MEMORY_AFFINITY,
> acpi_parse_memory_affinity, 0);
> }
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index aee299a6aa76..a24c918a4496 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -44,6 +44,7 @@ enum memblock_flags {
> MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
> MEMBLOCK_MIRROR = 0x2, /* mirrored region */
> MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> + MEMBLOCK_SIDECACHED = 0x8, /* System side caches memory access */

I'm concerned that we may be stretching memblock past its intended use
case especially for just this randomization case. For example, I think
memblock_find_in_range() gets confused in the presence of
MEMBLOCK_SIDECACHED memblocks.

2018-12-11 20:49:44

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

On Tue, Dec 11, 2018 at 12:29:45PM -0800, Dan Williams wrote:
> On Tue, Dec 11, 2018 at 8:58 AM Keith Busch <[email protected]> wrote:
> > +static int __init
> > +acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
> > +{
> > + struct acpi_hmat_cache *cache = (void *)header;
> > + u32 attrs;
> > +
> > + attrs = cache->cache_attributes;
> > + if (((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) ==
> > + ACPI_HMAT_CA_DIRECT_MAPPED)
> > + set_bit(cache->memory_PD, node_side_cached);
>
> I'm not sure I see a use case for 'node_side_cached'. Instead I need
> to know if a cache intercepts a "System RAM" resource, because a cache
> in front of a reserved address range would not be impacted by page
> allocator randomization. Or, are you saying have memblock generically
> describes this capability and move the responsibility of acting on
> that data to a higher level?

The "node_side_cached" array isn't intended to be used directly. It's
just holding the PXM's that HMAT says have a side cache so we know which
PXM's have that attribute before parsing SRAT's memory affinity.

The intention was that this is just another attribute of a memory range
similiar to hotpluggable. Whoever needs to use it may query it from
the memblock, if that makes sense.

> The other detail to consider is the cache ratio size, but that would
> be a follow on feature. The use case is to automatically determine the
> ratio to pass to numa_emulation:
>
> cc9aec03e58f x86/numa_emulation: Introduce uniform split capability

Will look into that.

> > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > index aee299a6aa76..a24c918a4496 100644
> > --- a/include/linux/memblock.h
> > +++ b/include/linux/memblock.h
> > @@ -44,6 +44,7 @@ enum memblock_flags {
> > MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
> > MEMBLOCK_MIRROR = 0x2, /* mirrored region */
> > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> > + MEMBLOCK_SIDECACHED = 0x8, /* System side caches memory access */
>
> I'm concerned that we may be stretching memblock past its intended use
> case especially for just this randomization case. For example, I think
> memblock_find_in_range() gets confused in the presence of
> MEMBLOCK_SIDECACHED memblocks.

Ok, I see. Is there a better structure or interface that you may recommend
for your use case to identify which memory ranges contain this attribute?

2018-12-11 22:51:49

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHv2 02/12] acpi/hmat: Parse and report heterogeneous memory

On Tue, Dec 11, 2018 at 12:47 PM Keith Busch <[email protected]> wrote:
>
> On Tue, Dec 11, 2018 at 12:29:45PM -0800, Dan Williams wrote:
> > On Tue, Dec 11, 2018 at 8:58 AM Keith Busch <[email protected]> wrote:
> > > +static int __init
> > > +acpi_parse_cache(union acpi_subtable_headers *header, const unsigned long end)
> > > +{
> > > + struct acpi_hmat_cache *cache = (void *)header;
> > > + u32 attrs;
> > > +
> > > + attrs = cache->cache_attributes;
> > > + if (((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) ==
> > > + ACPI_HMAT_CA_DIRECT_MAPPED)
> > > + set_bit(cache->memory_PD, node_side_cached);
> >
> > I'm not sure I see a use case for 'node_side_cached'. Instead I need
> > to know if a cache intercepts a "System RAM" resource, because a cache
> > in front of a reserved address range would not be impacted by page
> > allocator randomization. Or, are you saying have memblock generically
> > describes this capability and move the responsibility of acting on
> > that data to a higher level?
>
> The "node_side_cached" array isn't intended to be used directly. It's
> just holding the PXM's that HMAT says have a side cache so we know which
> PXM's have that attribute before parsing SRAT's memory affinity.
>
> The intention was that this is just another attribute of a memory range
> similiar to hotpluggable. Whoever needs to use it may query it from
> the memblock, if that makes sense.
>
> > The other detail to consider is the cache ratio size, but that would
> > be a follow on feature. The use case is to automatically determine the
> > ratio to pass to numa_emulation:
> >
> > cc9aec03e58f x86/numa_emulation: Introduce uniform split capability
>
> Will look into that.
>
> > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> > > index aee299a6aa76..a24c918a4496 100644
> > > --- a/include/linux/memblock.h
> > > +++ b/include/linux/memblock.h
> > > @@ -44,6 +44,7 @@ enum memblock_flags {
> > > MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
> > > MEMBLOCK_MIRROR = 0x2, /* mirrored region */
> > > MEMBLOCK_NOMAP = 0x4, /* don't add to kernel direct mapping */
> > > + MEMBLOCK_SIDECACHED = 0x8, /* System side caches memory access */
> >
> > I'm concerned that we may be stretching memblock past its intended use
> > case especially for just this randomization case. For example, I think
> > memblock_find_in_range() gets confused in the presence of
> > MEMBLOCK_SIDECACHED memblocks.
>
> Ok, I see. Is there a better structure or interface that you may recommend
> for your use case to identify which memory ranges contain this attribute?

Well, no I don't think there is one. We just need to either audit
existing memblock users to make sure they are prepared for this new
type, or move the cache information somewhere else. I'd be inclined to
just move it to new fields, or a sub-struct in struct memblock_region.
Then we can put the cache set-way info and near memory size
information in there as well. Likely need a new
CONFIG_HAVE_MEMBLOCK_CACHE_INFO gated by the presence of HMAT support.

2018-12-12 04:57:45

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCHv2 12/12] doc/mm: New documentation for memory performance

On 12/11/18 6:33 AM, Keith Busch wrote:
> Platforms may provide system memory where some physical address ranges
> perform differently than others, or is side cached by the system.
>
> Add documentation describing a high level overview of such systems and the
> performance and caching attributes the kernel provides for applications
> wishing to query this information.
>

The series looks good with the examples in the commit messages

> Signed-off-by: Keith Busch <[email protected]>
> ---
> Documentation/admin-guide/mm/numaperf.rst | 171 ++++++++++++++++++++++++++++++
> 1 file changed, 171 insertions(+)
> create mode 100644 Documentation/admin-guide/mm/numaperf.rst
>
> diff --git a/Documentation/admin-guide/mm/numaperf.rst b/Documentation/admin-guide/mm/numaperf.rst
> new file mode 100644
> index 000000000000..846b3f991e7f
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/numaperf.rst
> @@ -0,0 +1,171 @@
> +.. _numaperf:
> +
> +=============
> +NUMA Locality
> +=============
> +
> +Some platforms may have multiple types of memory attached to a single
> +CPU. These disparate memory ranges share some characteristics, such as
> +CPU cache coherence, but may have different performance. For example,
> +different media types and buses affect bandwidth and latency.
> +
> +A system supporting such heterogeneous memory by grouping each memory
> +type under different "nodes" based on similar CPU locality and performance
> +characteristics. Some memory may share the same node as a CPU, and others
> +are provided as memory only nodes. While memory only nodes do not provide
> +CPUs, they may still be directly accessible, or local, to one or more
> +compute nodes. The following diagram shows one such example of two compute
> +noes with local memory and a memory only node for each of compute node:
> +
> + +------------------+ +------------------+
> + | Compute Node 0 +-----+ Compute Node 1 |
> + | Local Node0 Mem | | Local Node1 Mem |
> + +--------+---------+ +--------+---------+
> + | |
> + +--------+---------+ +--------+---------+
> + | Slower Node2 Mem | | Slower Node3 Mem |
> + +------------------+ +--------+---------+
> +
> +A "memory initiator" is a node containing one or more devices such as
> +CPUs or separate memory I/O devices that can initiate memory requests. A
> +"memory target" is a node containing one or more accessible physical
> +address ranges from one or more memory initiators.
> +
> +When multiple memory initiators exist, they may not all have the same
> +performance when accessing a given memory target. The highest performing
> +initiator to a given target is considered to be one of that target's
> +local initiators. Any given target may have one or more local initiators,
> +and any given initiator may have multiple local memory targets.
> +

Can you also add summary here suggesting node X is compute and Node y is
memory target

> +To aid applications matching memory targets with their initiators,
> +the kernel provide symlinks to each other like the following example::
> +
> + # ls -l /sys/devices/system/node/nodeX/local_target*
> + /sys/devices/system/node/nodeX/local_targetY -> ../nodeY
> +
> + # ls -l /sys/devices/system/node/nodeY/local_initiator*
> + /sys/devices/system/node/nodeY/local_initiatorX -> ../nodeX
> +

the patch series had primary_target and primary_initiator

> +The linked nodes will also have their node number set in the local_mem
> +and local_cpu node list and maps.
> +
> +An example showing how this may be used to run a particular task on CPUs
> +and memory that are both local to a particular PCI device can be done
> +using existing 'numactl' as follows::
> +
> + # NODE=$(cat /sys/devices/pci:0000:00/.../numa_node)
> + # numactl --membind=$(cat /sys/devices/node/node${NODE}/local_mem_nodelist) \
> + --cpunodebind=$(cat /sys/devices/node/node${NODE}/local_cpu_nodelist) \
> + -- <some-program-to-execute>
> +
> +================
> +NUMA Performance
> +================
> +
> +Applications may wish to consider which node they want their memory to
> +be allocated from based on the node's performance characteristics. If the
> +system provides these attributes, the kernel exports them under the node
> +sysfs hierarchy by appending the local_initiator_access directory under
> +the memory node as follows::
> +
> + /sys/devices/system/node/nodeY/local_initiator_access/
> +


Same here s/local/primary?

> +The kernel does not provide performance attributes for non-local memory
> +initiators. These attributes apply only to the memory initiator nodes that
> +have a local_initiatorX link, or are set in the local_cpu_nodelist. A
> +memory initiator node is considered local to itself if it also is
> +a memory target and will be set it its node list and map, but won't
> +contain a symlink to itself.
> +
> +The performance characteristics the kernel provides for the local initiators
> +are exported are as follows::
> +
> + # tree /sys/devices/system/node/nodeY/local_initiator_access
> + /sys/devices/system/node/nodeY/local_initiator_access
> + |-- read_bandwidth
> + |-- read_latency
> + |-- write_bandwidth
> + `-- write_latency
> +
> +The bandwidth attributes are provided in MiB/second.
> +
> +The latency attributes are provided in nanoseconds.
> +
> +==========
> +NUMA Cache
> +==========
> +
> +System memory may be constructed in a hierarchy of elements with various
> +performance characteristics in order to provide large address space
> +of slower performing memory side-cached by a smaller higher performing
> +memory. The system physical addresses that initiators are aware of is
> +provided by the last memory level in the hierarchy, while the system uses
> +higher performing memory to transparently cache access to progressively
> +slower levels.
> +
> +The term "far memory" is used to denote the last level memory in the
> +hierarchy. Each increasing cache level provides higher performing
> +initiator access, and the term "near memory" represents the fastest
> +cache provided by the system.
> +
> +This numbering is different than CPU caches where the cache level (ex:
> +L1, L2, L3) uses a CPU centric view with each increased level is lower
> +performing. In contrast, the memory cache level is centric to the last
> +level memory, so the higher numbered cache level denotes memory nearer
> +to the CPU, and further from far memory.
> +
> +The memory side caches are not directly addressable by software. When
> +software accesses a system address, the system will return it from the
> +near memory cache if it is present. If it is not present, the system
> +accesses the next level of memory until there is either a hit in that
> +cache level, or it reaches far memory.
> +
> +An application does not need to know about caching attributes in order
> +to use the system, software may optionally query the memory cache
> +attributes in order to maximize the performance out of such a setup.
> +If the system provides a way for the kernel to discover this information,
> +for example with ACPI HMAT (Heterogeneous Memory Attribute Table),
> +the kernel will append these attributes to the NUMA node memory target.
> +
> +When the kernel first registers a memory cache with a node, the kernel
> +will create the following directory::
> +
> + /sys/devices/system/node/nodeX/side_cache/
> +

This is something even the patch commit message didn't explain we create
side_cache directory in memory target nodes or initiator nodes? I assume
it is part of memory target nodes. If so to be consistent can you use nodeY?


> +If that directory is not present, the system either does not not provide
> +a memory side cache, or that information is not accessible to the kernel.
> +
> +The attributes for each level of cache is provided under its cache
> +level index::
> +
> + /sys/devices/system/node/nodeX/side_cache/indexA/
> + /sys/devices/system/node/nodeX/side_cache/indexB/
> + /sys/devices/system/node/nodeX/side_cache/indexC/
> +
> +Each cache level's directory provides its attributes. For example,
> +the following is a single cache level and the attributes available for
> +software to query::
> +
> + # tree sys/devices/system/node/node0/side_cache/
> + /sys/devices/system/node/node0/side_cache/
> + |-- index1
> + | |-- associativity
> + | |-- level
> + | |-- line_size
> + | |-- size
> + | `-- write_policy
> +
> +The "associativity" will be 0 if it is a direct-mapped cache, and non-zero
> +for any other indexed based, multi-way associativity.
> +
> +The "level" is the distance from the far memory, and matches the number
> +appended to its "index" directory.
> +
> +The "line_size" is the number of bytes accessed on a cache miss.
> +
> +The "size" is the number of bytes provided by this cache level.
> +
> +The "write_policy" will be 0 for write-back, and non-zero for
> +write-through caching.
> +
> +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>


2018-12-12 14:49:34

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCHv2 12/12] doc/mm: New documentation for memory performance

On Wed, Dec 12, 2018 at 10:23:24AM +0530, Aneesh Kumar K.V wrote:
> On 12/11/18 6:33 AM, Keith Busch wrote:
> > +When multiple memory initiators exist, they may not all have the same
> > +performance when accessing a given memory target. The highest performing
> > +initiator to a given target is considered to be one of that target's
> > +local initiators. Any given target may have one or more local initiators,
> > +and any given initiator may have multiple local memory targets.
> > +
>
> Can you also add summary here suggesting node X is compute and Node y is
> memory target

Sure thing.

> > +To aid applications matching memory targets with their initiators,
> > +the kernel provide symlinks to each other like the following example::
> > +
> > + # ls -l /sys/devices/system/node/nodeX/local_target*
> > + /sys/devices/system/node/nodeX/local_targetY -> ../nodeY
> > +
> > + # ls -l /sys/devices/system/node/nodeY/local_initiator*
> > + /sys/devices/system/node/nodeY/local_initiatorX -> ../nodeX
> > +
>
> the patch series had primary_target and primary_initiator

Yeah, I noticed that mistake too. I went through several iterations of
naming this, and I think it will yet be named something else in the
final revision to accomodate different access levels since it sounds
like some people may wish to show more than just the best.

> > +When the kernel first registers a memory cache with a node, the kernel
> > +will create the following directory::
> > +
> > + /sys/devices/system/node/nodeX/side_cache/
> > +
>
> This is something even the patch commit message didn't explain we create
> side_cache directory in memory target nodes or initiator nodes? I assume it
> is part of memory target nodes. If so to be consistent can you use nodeY?

Right, only memory targets may have memory side caches. Will use more
consistent symbols.

2018-12-13 09:09:06

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

Hi Keith,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on v4.20-rc6]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Keith-Busch/acpi-Create-subtable-parsing-infrastructure/20181211-154110
base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: ia64-allnoconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 8.1.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=ia64

All errors (new ones prefixed by >>):

arch/ia64/kernel/acpi.c: In function 'early_acpi_boot_init':
>> arch/ia64/kernel/acpi.c:675:3: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
acpi_parse_lsapic, NR_CPUS);
^~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c: In function 'acpi_boot_init':
arch/ia64/kernel/acpi.c:718:43: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
(ACPI_MADT_TYPE_LOCAL_APIC_OVERRIDE, acpi_parse_lapic_addr_ovr, 0) < 0)
^~~~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c:722:59: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
if (acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC_NMI, acpi_parse_lapic_nmi, 0)
^~~~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c:729:32: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
(ACPI_MADT_TYPE_IO_SAPIC, acpi_parse_iosapic, NR_IOSAPICS) < 1) {
^~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c:738:40: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
(ACPI_MADT_TYPE_INTERRUPT_SOURCE, acpi_parse_plat_int_src,
^~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c:744:42: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
(ACPI_MADT_TYPE_INTERRUPT_OVERRIDE, acpi_parse_int_src_ovr, 0) < 0)
^~~~~~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
arch/ia64/kernel/acpi.c:748:55: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
if (acpi_table_parse_madt(ACPI_MADT_TYPE_NMI_SOURCE, acpi_parse_nmi_src, 0) < 0)
^~~~~~~~~~~~~~~~~~
In file included from arch/ia64/kernel/acpi.c:43:
include/linux/acpi.h:250:29: note: expected 'acpi_tbl_entry_handler' {aka 'int (*)(union acpi_subtable_headers *, const long unsigned int)'} but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
acpi_tbl_entry_handler handler,
~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~
cc1: some warnings being treated as errors

vim +/acpi_table_parse_madt +675 arch/ia64/kernel/acpi.c

^1da177e4 Linus Torvalds 2005-04-16 660
62ee0540f Doug Chapman 2008-11-05 661 int __init early_acpi_boot_init(void)
62ee0540f Doug Chapman 2008-11-05 662 {
62ee0540f Doug Chapman 2008-11-05 663 int ret;
62ee0540f Doug Chapman 2008-11-05 664
62ee0540f Doug Chapman 2008-11-05 665 /*
62ee0540f Doug Chapman 2008-11-05 666 * do a partial walk of MADT to determine how many CPUs
62ee0540f Doug Chapman 2008-11-05 667 * we have including offline CPUs
62ee0540f Doug Chapman 2008-11-05 668 */
62ee0540f Doug Chapman 2008-11-05 669 if (acpi_table_parse(ACPI_SIG_MADT, acpi_parse_madt)) {
62ee0540f Doug Chapman 2008-11-05 670 printk(KERN_ERR PREFIX "Can't find MADT\n");
62ee0540f Doug Chapman 2008-11-05 671 return 0;
62ee0540f Doug Chapman 2008-11-05 672 }
62ee0540f Doug Chapman 2008-11-05 673
62ee0540f Doug Chapman 2008-11-05 674 ret = acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_SAPIC,
62ee0540f Doug Chapman 2008-11-05 @675 acpi_parse_lsapic, NR_CPUS);
62ee0540f Doug Chapman 2008-11-05 676 if (ret < 1)
62ee0540f Doug Chapman 2008-11-05 677 printk(KERN_ERR PREFIX
62ee0540f Doug Chapman 2008-11-05 678 "Error parsing MADT - no LAPIC entries\n");
247dba58a Baoquan He 2014-05-05 679 else
247dba58a Baoquan He 2014-05-05 680 acpi_lapic = 1;
62ee0540f Doug Chapman 2008-11-05 681

:::::: The code at line 675 was first introduced by commit
:::::: 62ee0540f5e5a804b79cae8b3c0185a85f02436b [IA64] fix boot panic caused by offline CPUs

:::::: TO: Doug Chapman <[email protected]>
:::::: CC: Tony Luck <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (7.41 kB)
.config.gz (5.52 kB)
Download all attachments

2018-12-20 01:05:12

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

Hi Keith,

I love your patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on v4.20-rc7 next-20181219]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Keith-Busch/acpi-Create-subtable-parsing-infrastructure/20181211-154110
base: https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm64

All errors (new ones prefixed by >>):

drivers//irqchip/irq-gic.c: In function 'acpi_gic_redist_is_present':
>> drivers//irqchip/irq-gic.c:1552:10: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
acpi_dummy_func, 0) > 0;
^~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic.c:36:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
drivers//irqchip/irq-gic.c: In function 'gic_v2_acpi_init':
drivers//irqchip/irq-gic.c:1614:11: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_parse_madt_cpu, 0);
^~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic.c:36:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
drivers//irqchip/irq-gic-v2m.c: In function 'gicv2m_acpi_init':
>> drivers//irqchip/irq-gic-v2m.c:495:11: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
acpi_parse_madt_msi, 0);
^~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v2m.c:18:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
drivers//irqchip/irq-gic-v3.c: In function 'gic_acpi_collect_gicr_base':
>> drivers//irqchip/irq-gic-v3.c:1417:17: error: assignment from incompatible pointer type [-Werror=incompatible-pointer-types]
redist_parser = gic_acpi_parse_madt_redist;
^
drivers//irqchip/irq-gic-v3.c: In function 'gic_acpi_count_gicr_regions':
drivers//irqchip/irq-gic-v3.c:1468:11: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_match_gicr, 0);
^~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3.c:20:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
drivers//irqchip/irq-gic-v3.c:1475:11: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_match_gicc, 0);
^~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3.c:20:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
drivers//irqchip/irq-gic-v3.c: In function 'gic_acpi_collect_virt_info':
drivers//irqchip/irq-gic-v3.c:1542:11: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_parse_virt_madt_gicc, 0);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3.c:20:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
drivers//irqchip/irq-gic-v3-its.c: In function 'acpi_table_parse_srat_its':
>> drivers//irqchip/irq-gic-v3-its.c:3812:4: error: passing argument 4 of 'acpi_table_parse_entries' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_match_srat_its, 0);
^~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3-its.c:18:0:
include/linux/acpi.h:242:12: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int __init acpi_table_parse_entries(char *id, unsigned long table_size,
^~~~~~~~~~~~~~~~~~~~~~~~
drivers//irqchip/irq-gic-v3-its.c:3826:4: error: passing argument 4 of 'acpi_table_parse_entries' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_parse_srat_its, 0);
^~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3-its.c:18:0:
include/linux/acpi.h:242:12: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int __init acpi_table_parse_entries(char *id, unsigned long table_size,
^~~~~~~~~~~~~~~~~~~~~~~~
drivers//irqchip/irq-gic-v3-its.c: In function 'its_acpi_probe':
>> drivers//irqchip/irq-gic-v3-its.c:3884:10: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
gic_acpi_parse_madt_its, 0);
^~~~~~~~~~~~~~~~~~~~~~~
In file included from drivers//irqchip/irq-gic-v3-its.c:18:0:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
drivers//irqchip/irq-gic-v3-its-platform-msi.c: In function 'its_pmsi_acpi_init':
>> drivers//irqchip/irq-gic-v3-its-platform-msi.c:147:10: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
its_pmsi_parse_madt, 0);
^~~~~~~~~~~~~~~~~~~
In file included from include/linux/acpi_iort.h:22:0,
from drivers//irqchip/irq-gic-v3-its-platform-msi.c:18:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
drivers//irqchip/irq-gic-v3-its-pci-msi.c: In function 'its_pci_acpi_msi_init':
>> drivers//irqchip/irq-gic-v3-its-pci-msi.c:191:10: error: passing argument 2 of 'acpi_table_parse_madt' from incompatible pointer type [-Werror=incompatible-pointer-types]
its_pci_msi_parse_madt, 0);
^~~~~~~~~~~~~~~~~~~~~~
In file included from include/linux/acpi_iort.h:22:0,
from drivers//irqchip/irq-gic-v3-its-pci-msi.c:18:
include/linux/acpi.h:249:5: note: expected 'acpi_tbl_entry_handler {aka int (*)(union acpi_subtable_headers *, const long unsigned int)}' but argument is of type 'int (*)(struct acpi_subtable_header *, const long unsigned int)'
int acpi_table_parse_madt(enum acpi_madt_type id,
^~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors

vim +/acpi_table_parse_madt +1552 drivers//irqchip/irq-gic.c

d60fc3892 Tomasz Nowicki 2015-03-24 1548
f26527b14 Marc Zyngier 2015-09-28 1549 static bool __init acpi_gic_redist_is_present(void)
f26527b14 Marc Zyngier 2015-09-28 1550 {
f26527b14 Marc Zyngier 2015-09-28 1551 return acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_REDISTRIBUTOR,
f26527b14 Marc Zyngier 2015-09-28 @1552 acpi_dummy_func, 0) > 0;
f26527b14 Marc Zyngier 2015-09-28 1553 }
d60fc3892 Tomasz Nowicki 2015-03-24 1554

:::::: The code at line 1552 was first introduced by commit
:::::: f26527b1428f379fbd7edf779854c3b41bc0b3e5 irqchip / GIC: Convert the GIC driver to ACPI probing

:::::: TO: Marc Zyngier <[email protected]>
:::::: CC: Rafael J. Wysocki <[email protected]>

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (9.95 kB)
.config.gz (39.97 kB)
Download all attachments

2018-12-20 01:21:48

by Schmauss, Erik

[permalink] [raw]
Subject: RE: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure



> -----Original Message-----
> From: [email protected] [mailto:linux-acpi-
> [email protected]] On Behalf Of Rafael J. Wysocki
> Sent: Tuesday, December 11, 2018 1:45 AM
> To: Busch, Keith <[email protected]>
> Cc: Linux Kernel Mailing List <[email protected]>; ACPI Devel
> Maling List <[email protected]>; Linux Memory Management List
> <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; Rafael J. Wysocki <[email protected]>;
> Hansen, Dave <[email protected]>; Williams, Dan J
> <[email protected]>
> Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure
>
> On Tue, Dec 11, 2018 at 2:05 AM Keith Busch <[email protected]>
> wrote:
> >

Hi Rafael and Bob,

> > Parsing entries in an ACPI table had assumed a generic header
> > structure that is most common. There is no standard ACPI header,
> > though, so less common types would need custom parsers if they want go
> > through their sub-table entry list.
>
> It looks like the problem at hand is that acpi_hmat_structure is incompatible
> with acpi_subtable_header because of the different layout and field sizes.

Just out of curiosity, why don't we use ACPICA code to parse static ACPI tables
in Linux?

We have a disassembler for static tables that parses all supported tables. This
seems like a duplication of code/effort...

Erik
>
> If so, please state that clearly here.
>
> With that, please feel free to add
>
> Reviewed-by: Rafael J. Wysocki <[email protected]>
>
> to this patch.

2018-12-20 01:50:07

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

On Wed, Dec 19, 2018 at 3:19 PM Schmauss, Erik <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-acpi-
> > [email protected]] On Behalf Of Rafael J. Wysocki
> > Sent: Tuesday, December 11, 2018 1:45 AM
> > To: Busch, Keith <[email protected]>
> > Cc: Linux Kernel Mailing List <[email protected]>; ACPI Devel
> > Maling List <[email protected]>; Linux Memory Management List
> > <[email protected]>; Greg Kroah-Hartman
> > <[email protected]>; Rafael J. Wysocki <[email protected]>;
> > Hansen, Dave <[email protected]>; Williams, Dan J
> > <[email protected]>
> > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure
> >
> > On Tue, Dec 11, 2018 at 2:05 AM Keith Busch <[email protected]>
> > wrote:
> > >
>
> Hi Rafael and Bob,
>
> > > Parsing entries in an ACPI table had assumed a generic header
> > > structure that is most common. There is no standard ACPI header,
> > > though, so less common types would need custom parsers if they want go
> > > through their sub-table entry list.
> >
> > It looks like the problem at hand is that acpi_hmat_structure is incompatible
> > with acpi_subtable_header because of the different layout and field sizes.
>
> Just out of curiosity, why don't we use ACPICA code to parse static ACPI tables
> in Linux?
>
> We have a disassembler for static tables that parses all supported tables. This
> seems like a duplication of code/effort...

Oh, I thought acpi_table_parse_entries() was the common code. What's
the ACPICA duplicate?

2018-12-20 02:31:55

by Schmauss, Erik

[permalink] [raw]
Subject: RE: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure



> -----Original Message-----
> From: [email protected] [mailto:linux-acpi-
> [email protected]] On Behalf Of Dan Williams
> Sent: Wednesday, December 19, 2018 4:00 PM
> To: Schmauss, Erik <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>; Busch, Keith
> <[email protected]>; Moore, Robert <[email protected]>;
> Linux Kernel Mailing List <[email protected]>; ACPI Devel
> Maling List <[email protected]>; Linux Memory Management
> List <[email protected]>; Greg Kroah-Hartman
> <[email protected]>; Hansen, Dave
> <[email protected]>
> Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> infrastructure
>
> On Wed, Dec 19, 2018 at 3:19 PM Schmauss, Erik
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-acpi-
> > > [email protected]] On Behalf Of Rafael J. Wysocki
> > > Sent: Tuesday, December 11, 2018 1:45 AM
> > > To: Busch, Keith <[email protected]>
> > > Cc: Linux Kernel Mailing List <[email protected]>; ACPI
> > > Devel Maling List <[email protected]>; Linux Memory
> > > Management List <[email protected]>; Greg Kroah-Hartman
> > > <[email protected]>; Rafael J. Wysocki
> <[email protected]>;
> > > Hansen, Dave <[email protected]>; Williams, Dan J
> > > <[email protected]>
> > > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> > > infrastructure
> > >
> > > On Tue, Dec 11, 2018 at 2:05 AM Keith Busch
> <[email protected]>
> > > wrote:
> > > >
> >
> > Hi Rafael and Bob,
> >
> > > > Parsing entries in an ACPI table had assumed a generic header
> > > > structure that is most common. There is no standard ACPI
> header,
> > > > though, so less common types would need custom parsers if they
> > > > want go through their sub-table entry list.
> > >
> > > It looks like the problem at hand is that acpi_hmat_structure is
> > > incompatible with acpi_subtable_header because of the different
> layout and field sizes.
> >
> > Just out of curiosity, why don't we use ACPICA code to parse static
> > ACPI tables in Linux?
> >
> > We have a disassembler for static tables that parses all supported
> > tables. This seems like a duplication of code/effort...
>
Hi Dan,

> Oh, I thought acpi_table_parse_entries() was the common code.
> What's the ACPICA duplicate?

I was thinking AcpiDmDumpTable(). After looking at this ACPICA code,
I realized that the this ACPICA doesn't actually build a parse tree or data structure.
It loops over the data structure to format the input ACPI table to a file.

To me, it seems like a good idea for Linux and ACPICA to share the same code when
parsing and analyzing these structures. I know that Linux may emit warnings
that are specific to Linux but there are structural analyses that should be the same (such as
checking lengths of tables and subtables so that we don't have out of bounds access).

Erik

2018-12-20 09:22:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure

On Thu, Dec 20, 2018 at 2:15 AM Schmauss, Erik <[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-acpi-
> > [email protected]] On Behalf Of Dan Williams
> > Sent: Wednesday, December 19, 2018 4:00 PM
> > To: Schmauss, Erik <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>; Busch, Keith
> > <[email protected]>; Moore, Robert <[email protected]>;
> > Linux Kernel Mailing List <[email protected]>; ACPI Devel
> > Maling List <[email protected]>; Linux Memory Management
> > List <[email protected]>; Greg Kroah-Hartman
> > <[email protected]>; Hansen, Dave
> > <[email protected]>
> > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> > infrastructure
> >
> > On Wed, Dec 19, 2018 at 3:19 PM Schmauss, Erik
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: [email protected] [mailto:linux-acpi-
> > > > [email protected]] On Behalf Of Rafael J. Wysocki
> > > > Sent: Tuesday, December 11, 2018 1:45 AM
> > > > To: Busch, Keith <[email protected]>
> > > > Cc: Linux Kernel Mailing List <[email protected]>; ACPI
> > > > Devel Maling List <[email protected]>; Linux Memory
> > > > Management List <[email protected]>; Greg Kroah-Hartman
> > > > <[email protected]>; Rafael J. Wysocki
> > <[email protected]>;
> > > > Hansen, Dave <[email protected]>; Williams, Dan J
> > > > <[email protected]>
> > > > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> > > > infrastructure
> > > >
> > > > On Tue, Dec 11, 2018 at 2:05 AM Keith Busch
> > <[email protected]>
> > > > wrote:
> > > > >
> > >
> > > Hi Rafael and Bob,
> > >
> > > > > Parsing entries in an ACPI table had assumed a generic header
> > > > > structure that is most common. There is no standard ACPI
> > header,
> > > > > though, so less common types would need custom parsers if they
> > > > > want go through their sub-table entry list.
> > > >
> > > > It looks like the problem at hand is that acpi_hmat_structure is
> > > > incompatible with acpi_subtable_header because of the different
> > layout and field sizes.
> > >
> > > Just out of curiosity, why don't we use ACPICA code to parse static
> > > ACPI tables in Linux?
> > >
> > > We have a disassembler for static tables that parses all supported
> > > tables. This seems like a duplication of code/effort...
> >
> Hi Dan,
>
> > Oh, I thought acpi_table_parse_entries() was the common code.
> > What's the ACPICA duplicate?
>
> I was thinking AcpiDmDumpTable(). After looking at this ACPICA code,
> I realized that the this ACPICA doesn't actually build a parse tree or data structure.
> It loops over the data structure to format the input ACPI table to a file.
>
> To me, it seems like a good idea for Linux and ACPICA to share the same code when
> parsing and analyzing these structures. I know that Linux may emit warnings
> that are specific to Linux but there are structural analyses that should be the same (such as
> checking lengths of tables and subtables so that we don't have out of bounds access).

I agree.

I guess the reason why it has not been done this way was because
nobody thought about it. :-)

So a project to consolidate these things might be a good one.

2018-12-21 06:13:48

by Schmauss, Erik

[permalink] [raw]
Subject: RE: [PATCHv2 01/12] acpi: Create subtable parsing infrastructure



> -----Original Message-----
> From: Rafael J. Wysocki [mailto:[email protected]]
> Sent: Thursday, December 20, 2018 12:57 AM
> To: Schmauss, Erik <[email protected]>
> Cc: Williams, Dan J <[email protected]>; Rafael J. Wysocki
> <[email protected]>; Busch, Keith <[email protected]>; Moore,
> Robert <[email protected]>; Linux Kernel Mailing List <linux-
> [email protected]>; ACPI Devel Maling List <linux-
> [email protected]>; Linux Memory Management List <linux-
> [email protected]>; Greg Kroah-Hartman
> <[email protected]>; Hansen, Dave
> <[email protected]>
> Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> infrastructure
>
> On Thu, Dec 20, 2018 at 2:15 AM Schmauss, Erik
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-acpi-
> > > [email protected]] On Behalf Of Dan Williams
> > > Sent: Wednesday, December 19, 2018 4:00 PM
> > > To: Schmauss, Erik <[email protected]>
> > > Cc: Rafael J. Wysocki <[email protected]>; Busch, Keith
> > > <[email protected]>; Moore, Robert
> <[email protected]>;
> > > Linux Kernel Mailing List <[email protected]>; ACPI
> Devel
> > > Maling List <[email protected]>; Linux Memory
> Management
> > > List <[email protected]>; Greg Kroah-Hartman
> > > <[email protected]>; Hansen, Dave
> <[email protected]>
> > > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> > > infrastructure
> > >
> > > On Wed, Dec 19, 2018 at 3:19 PM Schmauss, Erik
> > > <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: [email protected] [mailto:linux-acpi-
> > > > > [email protected]] On Behalf Of Rafael J. Wysocki
> > > > > Sent: Tuesday, December 11, 2018 1:45 AM
> > > > > To: Busch, Keith <[email protected]>
> > > > > Cc: Linux Kernel Mailing List <[email protected]>;
> > > > > ACPI Devel Maling List <[email protected]>; Linux
> > > > > Memory Management List <[email protected]>; Greg
> Kroah-Hartman
> > > > > <[email protected]>; Rafael J. Wysocki
> > > <[email protected]>;
> > > > > Hansen, Dave <[email protected]>; Williams, Dan J
> > > > > <[email protected]>
> > > > > Subject: Re: [PATCHv2 01/12] acpi: Create subtable parsing
> > > > > infrastructure
> > > > >
> > > > > On Tue, Dec 11, 2018 at 2:05 AM Keith Busch
> > > <[email protected]>
> > > > > wrote:
> > > > > >
> > > >
> > > > Hi Rafael and Bob,
> > > >
> > > > > > Parsing entries in an ACPI table had assumed a generic header
> > > > > > structure that is most common. There is no standard ACPI
> > > header,
> > > > > > though, so less common types would need custom parsers if
> they
> > > > > > want go through their sub-table entry list.
> > > > >
> > > > > It looks like the problem at hand is that acpi_hmat_structure is
> > > > > incompatible with acpi_subtable_header because of the
> different
> > > layout and field sizes.
> > > >
> > > > Just out of curiosity, why don't we use ACPICA code to parse
> > > > static ACPI tables in Linux?
> > > >
> > > > We have a disassembler for static tables that parses all supported
> > > > tables. This seems like a duplication of code/effort...
> > >
> > Hi Dan,
> >
> > > Oh, I thought acpi_table_parse_entries() was the common code.
> > > What's the ACPICA duplicate?
> >
> > I was thinking AcpiDmDumpTable(). After looking at this ACPICA
> code, I
> > realized that the this ACPICA doesn't actually build a parse tree or
> data structure.
> > It loops over the data structure to format the input ACPI table to a
> file.
> >
> > To me, it seems like a good idea for Linux and ACPICA to share the
> > same code when parsing and analyzing these structures. I know that
> > Linux may emit warnings that are specific to Linux but there are
> > structural analyses that should be the same (such as checking lengths
> of tables and subtables so that we don't have out of bounds access).
>
> I agree.
>
> I guess the reason why it has not been done this way was because
> nobody thought about it. :-)
>
> So a project to consolidate these things might be a good one.

Ok, I'll talk to Bob about it and see what we can do