2020-03-02 22:38:09

by Dan Williams

Subject: [PATCH 0/5] Manual definition of Soft Reserved memory devices

Given the current dearth of systems that supply an ACPI HMAT table, and
the utility of being able to manually define device-dax "hmem" instances
via the efi_fake_mem= option, relax the requirements for creating these
devices. Specifically, add an option (numa=nohmat) to optionally disable
consideration of the HMAT and update efi_fake_mem= to behave like
memmap=nn!ss in terms of delimiting device boundaries.

All review welcome of course, but the E820 changes want an x86
maintainer ack, the efi_fake_mem update needs Ard, and Rafael has
previously shepherded the HMAT changes. For the changes to
kernel/resource.c, where there is no clear maintainer, I just copied the
last few people to make thoughtful changes in that area. I am happy to
take these through the nvdimm tree along with these prerequisites
already in -next:

b2ca916ce392 ACPI: NUMA: Up-level "map to online node" functionality
4fcbe96e4d0b mm/numa: Skip NUMA_NO_NODE and online nodes in numa_map_to_online_node()
575e23b6e13c powerpc/papr_scm: Switch to numa_map_to_online_node()
1e5d8e1e47af x86/mm: Introduce CONFIG_NUMA_KEEP_MEMINFO
5d30f92e7631 x86/NUMA: Provide a range-to-target_node lookup facility
7b27a8622f80 libnvdimm/e820: Retrieve and populate correct 'target_node' info

Tested with:

numa=nohmat efi_fake_mem=4G@9G:0x40000,4G@13G:0x40000

...to create two device-dax instances:

# daxctl list -RDu
[
{
"path":"\/platform\/hmem.1",
"id":1,
"size":"4.00 GiB (4.29 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax1.0",
"size":"4.00 GiB (4.29 GB)",
"target_node":3,
"mode":"devdax"
}
]
},
{
"path":"\/platform\/hmem.0",
"id":0,
"size":"4.00 GiB (4.29 GB)",
"align":2097152,
"devices":[
{
"chardev":"dax0.0",
"size":"4.00 GiB (4.29 GB)",
"target_node":2,
"mode":"devdax"
}
]
}
]
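
For readers unfamiliar with the efi_fake_mem= syntax used in the test
above, here is a minimal user-space sketch of how one might decode it,
assuming the documented nn[KMG]@ss[KMG]:aa form. The struct and function
names (fake_mem_range, parse_size, parse_fake_mem) are hypothetical and
are not the kernel's parser; the suffix handling only imitates what
memparse() does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of one efi_fake_mem= entry: nn[KMG]@ss[KMG]:aa */
struct fake_mem_range {
	uint64_t start;	/* first byte of the range */
	uint64_t end;	/* last byte of the range, inclusive */
	uint64_t attr;	/* EFI memory attribute bits to apply */
};

/* Parse a number with an optional K/M/G suffix (imitates memparse()). */
static uint64_t parse_size(const char *s, char **end)
{
	uint64_t v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/*
 * Parse a comma-separated list of entries into out[].
 * Returns the number of entries parsed, or -1 on a syntax error.
 */
int parse_fake_mem(const char *arg, struct fake_mem_range *out, int max)
{
	char *p = (char *)arg;
	int n = 0;

	assert(arg && out);
	while (*p && n < max) {
		uint64_t size, start, attr;

		size = parse_size(p, &p);
		if (*p != '@' || size == 0)
			return -1;
		start = parse_size(p + 1, &p);
		if (*p != ':')
			return -1;
		attr = strtoull(p + 1, &p, 0);
		if (*p == ',')
			p++;
		else if (*p != '\0')
			return -1;
		out[n++] = (struct fake_mem_range){ start, start + size - 1, attr };
	}
	return n;
}
```

Fed the string from the test above, this yields two ranges,
0x240000000-0x33fffffff and 0x340000000-0x43fffffff, both carrying
attribute 0x40000 (EFI_MEMORY_SP). The two ranges are physically
adjacent, which is presumably why per-entry device boundaries matter:
without them the ranges could be treated as one region.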

---

Dan Williams (5):
ACPI: NUMA: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
ACPI: HMAT: Attach a device for each soft-reserved range


arch/x86/kernel/e820.c | 16 +++++-
arch/x86/mm/numa.c | 4 +
drivers/acpi/numa/hmat.c | 71 +++-----------------------
drivers/dax/Kconfig | 5 ++
drivers/dax/Makefile | 3 -
drivers/dax/hmem/Makefile | 6 ++
drivers/dax/hmem/device.c | 97 +++++++++++++++++++++++++++++++++++
drivers/dax/hmem/hmem.c | 2 -
drivers/firmware/efi/x86_fake_mem.c | 12 +++-
include/acpi/acpi_numa.h | 1
include/linux/dax.h | 8 +++
kernel/resource.c | 1
12 files changed, 156 insertions(+), 70 deletions(-)
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
rename drivers/dax/{hmem.c => hmem/hmem.c} (98%)

base-commit: 7b27a8622f802761d5c6abd6c37b22312a35343c


2020-03-02 22:38:09

by Dan Williams

Subject: [PATCH 3/5] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device

In preparation for exposing "Soft Reserved" memory ranges without an
HMAT, move the hmem device registration to its own compilation unit and
make the implementation generic.

The generic implementation drops the use of acpi_map_pxm_to_online_node(),
which was translating ACPI proximity domain values, and instead relies on
numa_map_to_online_node() to determine the NUMA node for the device.

Cc: "Rafael J. Wysocki" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/numa/hmat.c | 68 ++++-----------------------------------------
drivers/dax/Kconfig | 4 +++
drivers/dax/Makefile | 3 +-
drivers/dax/hmem/Makefile | 5 +++
drivers/dax/hmem/device.c | 64 ++++++++++++++++++++++++++++++++++++++++++
drivers/dax/hmem/hmem.c | 2 +
include/linux/dax.h | 8 +++++
7 files changed, 89 insertions(+), 65 deletions(-)
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
rename drivers/dax/{hmem.c => hmem/hmem.c} (98%)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index d3db121e393a..2379efcea570 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
#include <linux/mutex.h>
#include <linux/node.h>
#include <linux/sysfs.h>
+#include <linux/dax.h>

static u8 hmat_revision;
int hmat_disable __initdata;
@@ -635,66 +636,6 @@ static void hmat_register_target_perf(struct memory_target *target)
node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
}

-static void hmat_register_target_device(struct memory_target *target,
- struct resource *r)
-{
- /* define a clean / non-busy resource for the platform device */
- struct resource res = {
- .start = r->start,
- .end = r->end,
- .flags = IORESOURCE_MEM,
- };
- struct platform_device *pdev;
- struct memregion_info info;
- int rc, id;
-
- rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
- IORES_DESC_SOFT_RESERVED);
- if (rc != REGION_INTERSECTS)
- return;
-
- id = memregion_alloc(GFP_KERNEL);
- if (id < 0) {
- pr_err("memregion allocation failure for %pr\n", &res);
- return;
- }
-
- pdev = platform_device_alloc("hmem", id);
- if (!pdev) {
- pr_err("hmem device allocation failure for %pr\n", &res);
- goto out_pdev;
- }
-
- pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
- info = (struct memregion_info) {
- .target_node = acpi_map_pxm_to_node(target->memory_pxm),
- };
- rc = platform_device_add_data(pdev, &info, sizeof(info));
- if (rc < 0) {
- pr_err("hmem memregion_info allocation failure for %pr\n", &res);
- goto out_pdev;
- }
-
- rc = platform_device_add_resources(pdev, &res, 1);
- if (rc < 0) {
- pr_err("hmem resource allocation failure for %pr\n", &res);
- goto out_resource;
- }
-
- rc = platform_device_add(pdev);
- if (rc < 0) {
- dev_err(&pdev->dev, "device add failed for %pr\n", &res);
- goto out_resource;
- }
-
- return;
-
-out_resource:
- put_device(&pdev->dev);
-out_pdev:
- memregion_free(id);
-}
-
static void hmat_register_target_devices(struct memory_target *target)
{
struct resource *res;
@@ -706,8 +647,11 @@ static void hmat_register_target_devices(struct memory_target *target)
 	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
 		return;
 
-	for (res = target->memregions.child; res; res = res->sibling)
-		hmat_register_target_device(target, res);
+	for (res = target->memregions.child; res; res = res->sibling) {
+		int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
+
+		hmem_register_device(target_nid, res);
+	}
 }

static void hmat_register_target(struct memory_target *target)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3b6c06f07326..a229f45d34aa 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -48,6 +48,10 @@ config DEV_DAX_HMEM

Say M if unsure.

+config DEV_DAX_HMEM_DEVICES
+ depends on DEV_DAX_HMEM
+ def_bool y
+
config DEV_DAX_KMEM
tristate "KMEM DAX: volatile-use of persistent memory"
default DEV_DAX
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 80065b38b3c4..9d4ba672d305 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -2,11 +2,10 @@
obj-$(CONFIG_DAX) += dax.o
obj-$(CONFIG_DEV_DAX) += device_dax.o
obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
-obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o

dax-y := super.o
dax-y += bus.o
device_dax-y := device.o
-dax_hmem-y := hmem.o

obj-y += pmem/
+obj-y += hmem/
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
new file mode 100644
index 000000000000..a9d353d0c9ed
--- /dev/null
+++ b/drivers/dax/hmem/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+
+dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
new file mode 100644
index 000000000000..99bc15a8b031
--- /dev/null
+++ b/drivers/dax/hmem/device.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+
+void hmem_register_device(int target_nid, struct resource *r)
+{
+	/* define a clean / non-busy resource for the platform device */
+	struct resource res = {
+		.start = r->start,
+		.end = r->end,
+		.flags = IORESOURCE_MEM,
+	};
+	struct platform_device *pdev;
+	struct memregion_info info;
+	int rc, id;
+
+	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
+			IORES_DESC_SOFT_RESERVED);
+	if (rc != REGION_INTERSECTS)
+		return;
+
+	id = memregion_alloc(GFP_KERNEL);
+	if (id < 0) {
+		pr_err("memregion allocation failure for %pr\n", &res);
+		return;
+	}
+
+	pdev = platform_device_alloc("hmem", id);
+	if (!pdev) {
+		pr_err("hmem device allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
+	info = (struct memregion_info) {
+		.target_node = target_nid,
+	};
+	rc = platform_device_add_data(pdev, &info, sizeof(info));
+	if (rc < 0) {
+		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	rc = platform_device_add_resources(pdev, &res, 1);
+	if (rc < 0) {
+		pr_err("hmem resource allocation failure for %pr\n", &res);
+		goto out_resource;
+	}
+
+	rc = platform_device_add(pdev);
+	if (rc < 0) {
+		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
+		goto out_resource;
+	}
+
+	return;
+
+out_resource:
+	put_device(&pdev->dev);
+out_pdev:
+	memregion_free(id);
+}
diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem/hmem.c
similarity index 98%
rename from drivers/dax/hmem.c
rename to drivers/dax/hmem/hmem.c
index fe7214daf62e..29ceb5795297 100644
--- a/drivers/dax/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -3,7 +3,7 @@
#include <linux/memregion.h>
#include <linux/module.h>
#include <linux/pfn_t.h>
-#include "bus.h"
+#include "../bus.h"

static int dax_hmem_probe(struct platform_device *pdev)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9bd8528bd305..9f6c282e9140 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -239,4 +239,12 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}

+#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
+void hmem_register_device(int target_nid, struct resource *r);
+#else
+static inline void hmem_register_device(int target_nid, struct resource *r)
+{
+}
+#endif
+
#endif
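
The patch above records two node values per device: target_node keeps
the range's own node id, while dev.numa_node is whatever
numa_map_to_online_node() returns for it. As a rough illustration of
that lookup, here is a user-space model of a "nearest online node"
search. It assumes the behavior suggested by the prerequisite commit
titles (NUMA_NO_NODE and already-online nodes are returned unchanged);
the struct topology type, the NO_NODE macro, and map_to_online_node()
are hypothetical stand-ins, not kernel interfaces.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 8
#define NO_NODE (-1)	/* stands in for the kernel's NUMA_NO_NODE */

/* Hypothetical topology description; the kernel derives this from firmware. */
struct topology {
	int nr_nodes;
	bool online[MAX_NODES];
	int distance[MAX_NODES][MAX_NODES];
};

/*
 * Return @node if it needs no translation, otherwise the online node
 * with the smallest distance to it. If no node is online, @node is
 * returned unchanged.
 */
int map_to_online_node(const struct topology *t, int node)
{
	int best = node, best_dist = INT_MAX;

	assert(t->nr_nodes <= MAX_NODES);
	if (node == NO_NODE)
		return node;
	assert(node >= 0 && node < t->nr_nodes);
	if (t->online[node])
		return node;

	for (int n = 0; n < t->nr_nodes; n++) {
		if (!t->online[n])
			continue;
		if (t->distance[node][n] < best_dist) {
			best_dist = t->distance[node][n];
			best = n;
		}
	}
	return best;
}
```

With a four-node topology where only nodes 0 and 1 are online, an
offline memory-only node such as 2 or 3 would resolve to whichever of
the two online nodes is closer in the distance table, while its own id
would still be what gets stored as the target node.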

2020-03-06 20:08:39

by Jeff Moyer

Subject: Re: [PATCH 0/5] Manual definition of Soft Reserved memory devices

Dan Williams <[email protected]> writes:

> Given the current dearth of systems that supply an ACPI HMAT table, and
> the utility of being able to manually define device-dax "hmem" instances
> via the efi_fake_mem= option, relax the requirements for creating these
> devices. Specifically, add an option (numa=nohmat) to optionally disable
> consideration of the HMAT and update efi_fake_mem= to behave like
> memmap=nn!ss in terms of delimiting device boundaries.

So, am I correct in deducing that your primary motivation is testing
without hardware/firmware support? This looks like a bit of a hack to
me, and I think maybe it would be better to just emulate the HMAT using
qemu. I don't have a strong objection, though.

-Jeff


2020-03-06 21:06:15

by Dan Williams

Subject: Re: [PATCH 0/5] Manual definition of Soft Reserved memory devices

On Fri, Mar 6, 2020 at 12:07 PM Jeff Moyer <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
>
> > Given the current dearth of systems that supply an ACPI HMAT table, and
> > the utility of being able to manually define device-dax "hmem" instances
> > via the efi_fake_mem= option, relax the requirements for creating these
> > devices. Specifically, add an option (numa=nohmat) to optionally disable
> > consideration of the HMAT and update efi_fake_mem= to behave like
> > memmap=nn!ss in terms of delimiting device boundaries.
>
> So, am I correct in deducing that your primary motivation is testing
> without hardware/firmware support?

My primary motivation is making the dax_kmem facility useful to
shipping platforms that have performance differentiated memory, but
may not have EFI-defined soft-reservations / HMAT (or
non-EFI-ACPI-platform equivalent). I'm anticipating HMAT enabled
platforms where the platform firmware policy for what is
soft-reserved, or not, is not the policy the system owner would pick.
I'd also highlight Joao's work [1] (see the TODO section) as an
indication of the demand for custom carving of memory resources and
applying the device-dax memory management interface to them.

> This looks like a bit of a hack to
> me, and I think maybe it would be better to just emulate the HMAT using
> qemu. I don't have a strong objection, though.

Yeah, qemu emulation does not help when you, the system owner, have a
different use case than what the bare-metal platform-firmware
envisioned for "specific-purpose memory".

[1]: https://lore.kernel.org/lkml/[email protected]/