2020-08-03 05:19:23

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

Changes since v3 [1]:
- Update x86 boot options documentation for 'nohmat' (Randy)

- Fixup a handful of kbuild robot reports, the most significant being
moving usage of PUD_SIZE and PMD_SIZE under
#ifdef CONFIG_TRANSPARENT_HUGEPAGE protection.

[1]: http://lore.kernel.org/r/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com

---
Merge notes:

Well, no v5.8-rc8 to line this up for v5.9, so next best is early
integration into -mm before other collisions develop.

Chatted with Justin offline and it currently appears that the missing
numa information is the fault of the platform firmware to populate all
the necessary NUMA data in the NFIT.

---
Cover:

The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM. It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e. the
current Memory Tiering hot topic on linux-mm.

In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3]. This series provides
a sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.

The motivations for this facility are:

1/ Allow performance differentiated memory ranges to be split between
kernel-managed and directly-accessed use cases.

2/ Allow physical memory to be provisioned along performance relevant
address boundaries. For example, divide a memory-side cache [4] along
cache-color boundaries.

3/ Parcel out soft-reserved memory to VMs using device-dax as a security
/ permissions boundary [5]. Specifically I have seen people (ab)using
memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
device-dax interface on custom address ranges. A follow-on for the VM
use case is to teach device-dax to dynamically allocate 'struct page' at
runtime to reduce the duplication of 'struct page' space in both the
guest and the host kernel for the same physical pages.

[2]: http://lore.kernel.org/r/[email protected]
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
[5]: http://lore.kernel.org/r/[email protected]

---

Dan Williams (19):
x86/numa: Cleanup configuration dependent command-line options
x86/numa: Add 'nohmat' option
efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
resource: Report parent to walk_iomem_res_desc() callback
mm/memory_hotplug: Introduce default phys_to_target_node() implementation
ACPI: HMAT: Attach a device for each soft-reserved range
device-dax: Drop the dax_region.pfn_flags attribute
device-dax: Move instance creation parameters to 'struct dev_dax_data'
device-dax: Make pgmap optional for instance creation
device-dax: Kill dax_kmem_res
device-dax: Add an allocation interface for device-dax instances
device-dax: Introduce 'seed' devices
drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
device-dax: Add resize support
mm/memremap_pages: Convert to 'struct range'
mm/memremap_pages: Support multiple ranges per invocation
device-dax: Add dis-contiguous resource support
device-dax: Introduce 'mapping' devices

Joao Martins (4):
device-dax: Make align a per-device property
device-dax: Add an 'align' attribute
dax/hmem: Introduce dax_hmem.region_idle parameter
device-dax: Add a range mapping allocation attribute


Documentation/x86/x86_64/boot-options.rst | 4
arch/powerpc/kvm/book3s_hv_uvmem.c | 14
arch/x86/include/asm/numa.h | 8
arch/x86/kernel/e820.c | 16
arch/x86/mm/numa.c | 11
arch/x86/mm/numa_emulation.c | 3
arch/x86/xen/enlighten_pv.c | 2
drivers/acpi/numa/hmat.c | 76 --
drivers/acpi/numa/srat.c | 9
drivers/base/core.c | 2
drivers/dax/Kconfig | 4
drivers/dax/Makefile | 3
drivers/dax/bus.c | 1046 +++++++++++++++++++++++++++--
drivers/dax/bus.h | 28 -
drivers/dax/dax-private.h | 60 +-
drivers/dax/device.c | 134 ++--
drivers/dax/hmem.c | 56 --
drivers/dax/hmem/Makefile | 6
drivers/dax/hmem/device.c | 100 +++
drivers/dax/hmem/hmem.c | 65 ++
drivers/dax/kmem.c | 199 +++---
drivers/dax/pmem/compat.c | 2
drivers/dax/pmem/core.c | 22 -
drivers/firmware/efi/x86_fake_mem.c | 12
drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
drivers/nvdimm/badrange.c | 26 -
drivers/nvdimm/claim.c | 13
drivers/nvdimm/nd.h | 3
drivers/nvdimm/pfn_devs.c | 13
drivers/nvdimm/pmem.c | 27 -
drivers/nvdimm/region.c | 21 -
drivers/pci/p2pdma.c | 12
include/acpi/acpi_numa.h | 14
include/linux/dax.h | 8
include/linux/memory_hotplug.h | 5
include/linux/memremap.h | 11
include/linux/numa.h | 11
include/linux/range.h | 6
kernel/resource.c | 11
lib/test_hmm.c | 15
mm/memory_hotplug.c | 10
mm/memremap.c | 299 +++++---
tools/testing/nvdimm/dax-dev.c | 22 -
tools/testing/nvdimm/test/iomap.c | 2
44 files changed, 1825 insertions(+), 601 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c

base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26


2020-08-03 05:19:28

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 01/23] x86/numa: Cleanup configuration dependent command-line options

In preparation for adding a new numa= option clean up the existing ones
to avoid ifdefs in numa_setup(), and provide feedback when the option is
numa=fake= option is invalid due to kernel config. The same does not
need to be done for numa=noacpi, since the capability is already hard
disabled at compile-time.

Suggested-by: Rafael J. Wysocki <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/include/asm/numa.h | 8 +++++++-
arch/x86/mm/numa.c | 8 ++------
arch/x86/mm/numa_emulation.c | 3 ++-
arch/x86/xen/enlighten_pv.c | 2 +-
drivers/acpi/numa/srat.c | 9 +++++++--
include/acpi/acpi_numa.h | 6 +++++-
6 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index bbfde3d2662f..0aecc0b629e0 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -3,6 +3,7 @@
#define _ASM_X86_NUMA_H

#include <linux/nodemask.h>
+#include <linux/errno.h>

#include <asm/topology.h>
#include <asm/apicdef.h>
@@ -77,7 +78,12 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable);
#ifdef CONFIG_NUMA_EMU
#define FAKE_NODE_MIN_SIZE ((u64)32 << 20)
#define FAKE_NODE_MIN_HASH_MASK (~(FAKE_NODE_MIN_SIZE - 1UL))
-void numa_emu_cmdline(char *);
+int numa_emu_cmdline(char *str);
+#else /* CONFIG_NUMA_EMU */
+static inline int numa_emu_cmdline(char *str)
+{
+ return -EINVAL;
+}
#endif /* CONFIG_NUMA_EMU */

#endif /* _ASM_X86_NUMA_H */
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index aa76ec2d359b..87c52822cc44 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -37,14 +37,10 @@ static __init int numa_setup(char *opt)
return -EINVAL;
if (!strncmp(opt, "off", 3))
numa_off = 1;
-#ifdef CONFIG_NUMA_EMU
if (!strncmp(opt, "fake=", 5))
- numa_emu_cmdline(opt + 5);
-#endif
-#ifdef CONFIG_ACPI_NUMA
+ return numa_emu_cmdline(opt + 5);
if (!strncmp(opt, "noacpi", 6))
- acpi_numa = -1;
-#endif
+ disable_srat();
return 0;
}
early_param("numa", numa_setup);
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index c5174b4e318b..847c23196e57 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -13,9 +13,10 @@
static int emu_nid_to_phys[MAX_NUMNODES];
static char *emu_cmdline __initdata;

-void __init numa_emu_cmdline(char *str)
+int __init numa_emu_cmdline(char *str)
{
emu_cmdline = str;
+ return 0;
}

static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 2aab43a13a8c..64b81ba5a4d6 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -1350,7 +1350,7 @@ asmlinkage __visible void __init xen_start_kernel(void)
* any NUMA information the kernel tries to get from ACPI will
* be meaningless. Prevent it from trying.
*/
- acpi_numa = -1;
+ disable_srat();
#endif
WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 15bbaab8500b..1b0ae0a1959b 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -27,7 +27,12 @@ static int node_to_pxm_map[MAX_NUMNODES]
= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };

unsigned char acpi_srat_revision __initdata;
-int acpi_numa __initdata;
+static int acpi_numa __initdata;
+
+void __init disable_srat(void)
+{
+ acpi_numa = -1;
+}

int pxm_to_node(int pxm)
{
@@ -163,7 +168,7 @@ static int __init slit_valid(struct acpi_table_slit *slit)
void __init bad_srat(void)
{
pr_err("SRAT: SRAT not used.\n");
- acpi_numa = -1;
+ disable_srat();
}

int __init srat_disabled(void)
diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index fdebcfc6c8df..8784183b2204 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -17,10 +17,14 @@ extern int pxm_to_node(int);
extern int node_to_pxm(int);
extern int acpi_map_pxm_to_node(int);
extern unsigned char acpi_srat_revision;
-extern int acpi_numa __initdata;
+extern void disable_srat(void);

extern void bad_srat(void);
extern int srat_disabled(void);

+#else /* CONFIG_ACPI_NUMA */
+static inline void disable_srat(void)
+{
+}
#endif /* CONFIG_ACPI_NUMA */
#endif /* __ACP_NUMA_H */

2020-08-03 05:19:37

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 02/23] x86/numa: Add 'nohmat' option

Disable parsing of the HMAT for debug, to workaround broken platform
instances, or cases where it is otherwise not wanted.

Cc: [email protected]
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
Documentation/x86/x86_64/boot-options.rst | 4 ++++
arch/x86/mm/numa.c | 2 ++
drivers/acpi/numa/hmat.c | 8 +++++++-
include/acpi/acpi_numa.h | 8 ++++++++
4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/Documentation/x86/x86_64/boot-options.rst b/Documentation/x86/x86_64/boot-options.rst
index 2b98efb5ba7f..324cefff92e7 100644
--- a/Documentation/x86/x86_64/boot-options.rst
+++ b/Documentation/x86/x86_64/boot-options.rst
@@ -173,6 +173,10 @@ NUMA
numa=noacpi
Don't parse the SRAT table for NUMA setup

+ numa=nohmat
+ Don't parse the HMAT table for NUMA setup, or soft-reserved memory
+ partitioning.
+
numa=fake=<size>[MG]
If given as a memory unit, fills all system RAM with nodes of
size interleaved over physical nodes.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 87c52822cc44..f3805bbaa784 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -41,6 +41,8 @@ static __init int numa_setup(char *opt)
return numa_emu_cmdline(opt + 5);
if (!strncmp(opt, "noacpi", 6))
disable_srat();
+ if (!strncmp(opt, "nohmat", 6))
+ disable_hmat();
return 0;
}
early_param("numa", numa_setup);
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 2c32cfb72370..a12e36a12618 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -26,6 +26,12 @@
#include <linux/sysfs.h>

static u8 hmat_revision;
+static int hmat_disable __initdata;
+
+void __init disable_hmat(void)
+{
+ hmat_disable = 1;
+}

static LIST_HEAD(targets);
static LIST_HEAD(initiators);
@@ -814,7 +820,7 @@ static __init int hmat_init(void)
enum acpi_hmat_type i;
acpi_status status;

- if (srat_disabled())
+ if (srat_disabled() || hmat_disable)
return 0;

status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
diff --git a/include/acpi/acpi_numa.h b/include/acpi/acpi_numa.h
index 8784183b2204..0e9302285f14 100644
--- a/include/acpi/acpi_numa.h
+++ b/include/acpi/acpi_numa.h
@@ -27,4 +27,12 @@ static inline void disable_srat(void)
{
}
#endif /* CONFIG_ACPI_NUMA */
+
+#ifdef CONFIG_ACPI_HMAT
+extern void disable_hmat(void);
+#else /* CONFIG_ACPI_HMAT */
+static inline void disable_hmat(void)
+{
+}
+#endif /* CONFIG_ACPI_HMAT */
#endif /* __ACP_NUMA_H */

2020-08-03 05:19:54

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 03/23] efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance

In preparation for attaching a platform device per iomem resource teach
the efi_fake_mem code to create an e820 entry per instance. Similar to
E820_TYPE_PRAM, bypass merging resource when the e820 map is sanitized.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Acked-by: Ard Biesheuvel <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/kernel/e820.c | 16 +++++++++++++++-
drivers/firmware/efi/x86_fake_mem.c | 12 +++++++++---
2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 983cd53ed4c9..22aad412f965 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -305,6 +305,20 @@ static int __init cpcompare(const void *a, const void *b)
return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr);
}

+static bool e820_nomerge(enum e820_type type)
+{
+ /*
+ * These types may indicate distinct platform ranges aligned to
+ * numa node, protection domain, performance domain, or other
+ * boundaries. Do not merge them.
+ */
+ if (type == E820_TYPE_PRAM)
+ return true;
+ if (type == E820_TYPE_SOFT_RESERVED)
+ return true;
+ return false;
+}
+
int __init e820__update_table(struct e820_table *table)
{
struct e820_entry *entries = table->entries;
@@ -380,7 +394,7 @@ int __init e820__update_table(struct e820_table *table)
}

/* Continue building up new map based on this information: */
- if (current_type != last_type || current_type == E820_TYPE_PRAM) {
+ if (current_type != last_type || e820_nomerge(current_type)) {
if (last_type != 0) {
new_entries[new_nr_entries].size = change_point[chg_idx]->addr - last_addr;
/* Move forward only if the new size was non-zero: */
diff --git a/drivers/firmware/efi/x86_fake_mem.c b/drivers/firmware/efi/x86_fake_mem.c
index e5d6d5a1b240..0bafcc1bb0f6 100644
--- a/drivers/firmware/efi/x86_fake_mem.c
+++ b/drivers/firmware/efi/x86_fake_mem.c
@@ -38,7 +38,7 @@ void __init efi_fake_memmap_early(void)
m_start = mem->range.start;
m_end = mem->range.end;
for_each_efi_memory_desc(md) {
- u64 start, end;
+ u64 start, end, size;

if (md->type != EFI_CONVENTIONAL_MEMORY)
continue;
@@ -58,11 +58,17 @@ void __init efi_fake_memmap_early(void)
*/
start = max(start, m_start);
end = min(end, m_end);
+ size = end - start + 1;

if (end <= start)
continue;
- e820__range_update(start, end - start + 1, E820_TYPE_RAM,
- E820_TYPE_SOFT_RESERVED);
+
+ /*
+ * Ensure each efi_fake_mem instance results in
+ * a unique e820 resource
+ */
+ e820__range_remove(start, size, E820_TYPE_RAM, 1);
+ e820__range_add(start, size, E820_TYPE_SOFT_RESERVED);
e820__update_table(e820_table);
}
}

2020-08-03 05:20:07

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 04/23] ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device

In preparation for exposing "Soft Reserved" memory ranges without an
HMAT, move the hmem device registration to its own compilation unit and
make the implementation generic.

The generic implementation drops usage acpi_map_pxm_to_online_node()
that was translating ACPI proximity domain values and instead relies on
numa_map_to_online_node() to determine the numa node for the device.

Cc: "Rafael J. Wysocki" <[email protected]>
Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <[email protected]>
---
drivers/acpi/numa/hmat.c | 68 ++++-----------------------------------------
drivers/dax/Kconfig | 4 +++
drivers/dax/Makefile | 3 +-
drivers/dax/hmem.c | 56 -------------------------------------
drivers/dax/hmem/Makefile | 5 +++
drivers/dax/hmem/device.c | 65 +++++++++++++++++++++++++++++++++++++++++++
drivers/dax/hmem/hmem.c | 56 +++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 8 +++++
8 files changed, 145 insertions(+), 120 deletions(-)
delete mode 100644 drivers/dax/hmem.c
create mode 100644 drivers/dax/hmem/Makefile
create mode 100644 drivers/dax/hmem/device.c
create mode 100644 drivers/dax/hmem/hmem.c

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index a12e36a12618..134bcb40b2af 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
#include <linux/mutex.h>
#include <linux/node.h>
#include <linux/sysfs.h>
+#include <linux/dax.h>

static u8 hmat_revision;
static int hmat_disable __initdata;
@@ -640,66 +641,6 @@ static void hmat_register_target_perf(struct memory_target *target)
node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
}

-static void hmat_register_target_device(struct memory_target *target,
- struct resource *r)
-{
- /* define a clean / non-busy resource for the platform device */
- struct resource res = {
- .start = r->start,
- .end = r->end,
- .flags = IORESOURCE_MEM,
- };
- struct platform_device *pdev;
- struct memregion_info info;
- int rc, id;
-
- rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
- IORES_DESC_SOFT_RESERVED);
- if (rc != REGION_INTERSECTS)
- return;
-
- id = memregion_alloc(GFP_KERNEL);
- if (id < 0) {
- pr_err("memregion allocation failure for %pr\n", &res);
- return;
- }
-
- pdev = platform_device_alloc("hmem", id);
- if (!pdev) {
- pr_err("hmem device allocation failure for %pr\n", &res);
- goto out_pdev;
- }
-
- pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
- info = (struct memregion_info) {
- .target_node = acpi_map_pxm_to_node(target->memory_pxm),
- };
- rc = platform_device_add_data(pdev, &info, sizeof(info));
- if (rc < 0) {
- pr_err("hmem memregion_info allocation failure for %pr\n", &res);
- goto out_pdev;
- }
-
- rc = platform_device_add_resources(pdev, &res, 1);
- if (rc < 0) {
- pr_err("hmem resource allocation failure for %pr\n", &res);
- goto out_resource;
- }
-
- rc = platform_device_add(pdev);
- if (rc < 0) {
- dev_err(&pdev->dev, "device add failed for %pr\n", &res);
- goto out_resource;
- }
-
- return;
-
-out_resource:
- put_device(&pdev->dev);
-out_pdev:
- memregion_free(id);
-}
-
static void hmat_register_target_devices(struct memory_target *target)
{
struct resource *res;
@@ -711,8 +652,11 @@ static void hmat_register_target_devices(struct memory_target *target)
if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
return;

- for (res = target->memregions.child; res; res = res->sibling)
- hmat_register_target_device(target, res);
+ for (res = target->memregions.child; res; res = res->sibling) {
+ int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
+
+ hmem_register_device(target_nid, res);
+ }
}

static void hmat_register_target(struct memory_target *target)
diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index 3b6c06f07326..a229f45d34aa 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -48,6 +48,10 @@ config DEV_DAX_HMEM

Say M if unsure.

+config DEV_DAX_HMEM_DEVICES
+ depends on DEV_DAX_HMEM
+ def_bool y
+
config DEV_DAX_KMEM
tristate "KMEM DAX: volatile-use of persistent memory"
default DEV_DAX
diff --git a/drivers/dax/Makefile b/drivers/dax/Makefile
index 80065b38b3c4..9d4ba672d305 100644
--- a/drivers/dax/Makefile
+++ b/drivers/dax/Makefile
@@ -2,11 +2,10 @@
obj-$(CONFIG_DAX) += dax.o
obj-$(CONFIG_DEV_DAX) += device_dax.o
obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
-obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o

dax-y := super.o
dax-y += bus.o
device_dax-y := device.o
-dax_hmem-y := hmem.o

obj-y += pmem/
+obj-y += hmem/
diff --git a/drivers/dax/hmem.c b/drivers/dax/hmem.c
deleted file mode 100644
index fe7214daf62e..000000000000
--- a/drivers/dax/hmem.c
+++ /dev/null
@@ -1,56 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/platform_device.h>
-#include <linux/memregion.h>
-#include <linux/module.h>
-#include <linux/pfn_t.h>
-#include "bus.h"
-
-static int dax_hmem_probe(struct platform_device *pdev)
-{
- struct device *dev = &pdev->dev;
- struct dev_pagemap pgmap = { };
- struct dax_region *dax_region;
- struct memregion_info *mri;
- struct dev_dax *dev_dax;
- struct resource *res;
-
- res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
- if (!res)
- return -ENOMEM;
-
- mri = dev->platform_data;
- memcpy(&pgmap.res, res, sizeof(*res));
-
- dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
- PMD_SIZE, PFN_DEV|PFN_MAP);
- if (!dax_region)
- return -ENOMEM;
-
- dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
- if (IS_ERR(dev_dax))
- return PTR_ERR(dev_dax);
-
- /* child dev_dax instances now own the lifetime of the dax_region */
- dax_region_put(dax_region);
- return 0;
-}
-
-static int dax_hmem_remove(struct platform_device *pdev)
-{
- /* devm handles teardown */
- return 0;
-}
-
-static struct platform_driver dax_hmem_driver = {
- .probe = dax_hmem_probe,
- .remove = dax_hmem_remove,
- .driver = {
- .name = "hmem",
- },
-};
-
-module_platform_driver(dax_hmem_driver);
-
-MODULE_ALIAS("platform:hmem*");
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR("Intel Corporation");
diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
new file mode 100644
index 000000000000..a9d353d0c9ed
--- /dev/null
+++ b/drivers/dax/hmem/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+
+dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
new file mode 100644
index 000000000000..b9dd6b27745c
--- /dev/null
+++ b/drivers/dax/hmem/device.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/dax.h>
+#include <linux/mm.h>
+
+void hmem_register_device(int target_nid, struct resource *r)
+{
+ /* define a clean / non-busy resource for the platform device */
+ struct resource res = {
+ .start = r->start,
+ .end = r->end,
+ .flags = IORESOURCE_MEM,
+ };
+ struct platform_device *pdev;
+ struct memregion_info info;
+ int rc, id;
+
+ rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
+ IORES_DESC_SOFT_RESERVED);
+ if (rc != REGION_INTERSECTS)
+ return;
+
+ id = memregion_alloc(GFP_KERNEL);
+ if (id < 0) {
+ pr_err("memregion allocation failure for %pr\n", &res);
+ return;
+ }
+
+ pdev = platform_device_alloc("hmem", id);
+ if (!pdev) {
+ pr_err("hmem device allocation failure for %pr\n", &res);
+ goto out_pdev;
+ }
+
+ pdev->dev.numa_node = numa_map_to_online_node(target_nid);
+ info = (struct memregion_info) {
+ .target_node = target_nid,
+ };
+ rc = platform_device_add_data(pdev, &info, sizeof(info));
+ if (rc < 0) {
+ pr_err("hmem memregion_info allocation failure for %pr\n", &res);
+ goto out_pdev;
+ }
+
+ rc = platform_device_add_resources(pdev, &res, 1);
+ if (rc < 0) {
+ pr_err("hmem resource allocation failure for %pr\n", &res);
+ goto out_resource;
+ }
+
+ rc = platform_device_add(pdev);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "device add failed for %pr\n", &res);
+ goto out_resource;
+ }
+
+ return;
+
+out_resource:
+ put_device(&pdev->dev);
+out_pdev:
+ memregion_free(id);
+}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
new file mode 100644
index 000000000000..29ceb5795297
--- /dev/null
+++ b/drivers/dax/hmem/hmem.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/pfn_t.h>
+#include "../bus.h"
+
+static int dax_hmem_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct dev_pagemap pgmap = { };
+ struct dax_region *dax_region;
+ struct memregion_info *mri;
+ struct dev_dax *dev_dax;
+ struct resource *res;
+
+ res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ if (!res)
+ return -ENOMEM;
+
+ mri = dev->platform_data;
+ memcpy(&pgmap.res, res, sizeof(*res));
+
+ dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+ PMD_SIZE, PFN_DEV|PFN_MAP);
+ if (!dax_region)
+ return -ENOMEM;
+
+ dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+ if (IS_ERR(dev_dax))
+ return PTR_ERR(dev_dax);
+
+ /* child dev_dax instances now own the lifetime of the dax_region */
+ dax_region_put(dax_region);
+ return 0;
+}
+
+static int dax_hmem_remove(struct platform_device *pdev)
+{
+ /* devm handles teardown */
+ return 0;
+}
+
+static struct platform_driver dax_hmem_driver = {
+ .probe = dax_hmem_probe,
+ .remove = dax_hmem_remove,
+ .driver = {
+ .name = "hmem",
+ },
+};
+
+module_platform_driver(dax_hmem_driver);
+
+MODULE_ALIAS("platform:hmem*");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 6904d4e0b2e0..5d2c3d9aeb5e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -221,4 +221,12 @@ static inline bool dax_mapping(struct address_space *mapping)
return mapping->host && IS_DAX(mapping->host);
}

+#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
+void hmem_register_device(int target_nid, struct resource *r);
+#else
+static inline void hmem_register_device(int target_nid, struct resource *r)
+{
+}
+#endif
+
#endif

2020-08-03 05:20:13

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 05/23] resource: Report parent to walk_iomem_res_desc() callback

In support of detecting whether a resource might have been been claimed,
report the parent to the walk_iomem_res_desc() callback. For example,
the ACPI HMAT parser publishes "hmem" platform devices per target range.
However, if the HMAT is disabled / missing a fallback driver can attach
devices to the raw memory ranges as a fallback if it sees unclaimed /
orphan "Soft Reserved" resources in the resource tree.

Otherwise, find_next_iomem_res() returns a resource with garbage data
from the stack allocation in __walk_iomem_res_desc() for the res->parent
field.

There are currently no users that expect ->child and ->sibling to be
valid, and the resource_lock would be needed to traverse them. Use a
compound literal to implicitly zero initialize the fields that are not
being returned in addition to setting ->parent.

Cc: Jason Gunthorpe <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Tom Lendacky <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
kernel/resource.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 841737bbda9e..f1175ce93a1d 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -382,10 +382,13 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,

if (p) {
/* copy data */
- res->start = max(start, p->start);
- res->end = min(end, p->end);
- res->flags = p->flags;
- res->desc = p->desc;
+ *res = (struct resource) {
+ .start = max(start, p->start),
+ .end = min(end, p->end),
+ .flags = p->flags,
+ .desc = p->desc,
+ .parent = p->parent,
+ };
}

read_unlock(&resource_lock);

2020-08-03 05:20:21

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 06/23] mm/memory_hotplug: Introduce default phys_to_target_node() implementation

In preparation to set a fallback value for dev_dax->target_node,
introduce generic fallback helpers for phys_to_target_node()

A generic implementation based on node-data or memblock was proposed,
but as noted by Mike:

"Here again, I would prefer to add a weak default for
phys_to_target_node() because the "generic" implementation is not really
generic.

The fallback to reserved ranges is x86 specfic because on x86 most of
the reserved areas is not in memblock.memory. AFAIK, no other
architecture does this."

The info message in the generic memory_add_physaddr_to_nid()
implementation is fixed up to properly reflect that
memory_add_physaddr_to_nid() communicates "online" node info and
phys_to_target_node() indicates "target / to-be-onlined" node info.

Cc: David Hildenbrand <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Jia He <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/x86/mm/numa.c | 1 -
include/linux/memory_hotplug.h | 5 +++++
include/linux/numa.h | 11 -----------
mm/memory_hotplug.c | 10 +++++++++-
4 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index f3805bbaa784..c62e274d52d0 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -917,7 +917,6 @@ int phys_to_target_node(phys_addr_t start)

return meminfo_to_nid(&numa_reserved_meminfo, start);
}
-EXPORT_SYMBOL_GPL(phys_to_target_node);

int memory_add_physaddr_to_nid(u64 start)
{
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 375515803cd8..dcdc7d6206d5 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -151,11 +151,16 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,

#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
+extern int phys_to_target_node(u64 start);
#else
static inline int memory_add_physaddr_to_nid(u64 start)
{
return 0;
}
+static inline int phys_to_target_node(u64 start)
+{
+ return 0;
+}
#endif

#ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
diff --git a/include/linux/numa.h b/include/linux/numa.h
index a42df804679e..8cb33ccfb671 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -23,22 +23,11 @@
#ifdef CONFIG_NUMA
/* Generic implementation available */
int numa_map_to_online_node(int node);
-
-/*
- * Optional architecture specific implementation, users need a "depends
- * on $ARCH"
- */
-int phys_to_target_node(phys_addr_t addr);
#else
static inline int numa_map_to_online_node(int node)
{
return NUMA_NO_NODE;
}
-
-static inline int phys_to_target_node(phys_addr_t addr)
-{
- return NUMA_NO_NODE;
-}
#endif

#endif /* _LINUX_NUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index dcdf3271f87e..426b79adf529 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -353,11 +353,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
#ifdef CONFIG_NUMA
int __weak memory_add_physaddr_to_nid(u64 start)
{
- pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+ pr_info_once("Unknown online node for memory at 0x%llx, assuming node 0\n",
start);
return 0;
}
EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+int __weak phys_to_target_node(u64 start)
+{
+ pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+ start);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
#endif

/* find the smallest valid pfn in the range [start_pfn, end_pfn) */

2020-08-03 05:20:30

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 07/23] ACPI: HMAT: Attach a device for each soft-reserved range

The hmem enabling in commit 'cf8741ac57ed ("ACPI: NUMA: HMAT: Register
"soft reserved" memory as an "hmem" device")' only registered ranges to
the hmem driver for each soft-reservation that also appeared in the
HMAT. While this is meant to encourage platform firmware to "do the
right thing" and publish an HMAT, the corollary is that platforms that
fail to publish an accurate HMAT will strand memory from Linux usage.
Additionally, the "efi_fake_mem" kernel command line option enabling
will strand memory by default without an HMAT.

Arrange for "soft reserved" memory that goes unclaimed by HMAT entries
to be published as raw resource ranges for the hmem driver to consume.

Include a module parameter to disable either this fallback behavior, or
the hmat enabling from creating hmem devices. The module parameter
requires the hmem device enabling to have unique name in the module
namespace: "device_hmem".

The driver depends on the architecture providing phys_to_target_node()
which is only x86 via numa_meminfo() and arm64 via a generic memblock
implementation.

Cc: Jonathan Cameron <[email protected]>
Cc: Brice Goglin <[email protected]>
Cc: Ard Biesheuvel <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Jeff Moyer <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Joao Martins <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/hmem/Makefile | 3 ++-
drivers/dax/hmem/device.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/hmem/Makefile b/drivers/dax/hmem/Makefile
index a9d353d0c9ed..57377b4c3d47 100644
--- a/drivers/dax/hmem/Makefile
+++ b/drivers/dax/hmem/Makefile
@@ -1,5 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
-obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o

+device_hmem-y := device.o
dax_hmem-y := hmem.o
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index b9dd6b27745c..cb6401c9e9a4 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -5,6 +5,9 @@
#include <linux/dax.h>
#include <linux/mm.h>

+static bool nohmem;
+module_param_named(disable, nohmem, bool, 0444);
+
void hmem_register_device(int target_nid, struct resource *r)
{
/* define a clean / non-busy resource for the platform device */
@@ -17,6 +20,9 @@ void hmem_register_device(int target_nid, struct resource *r)
struct memregion_info info;
int rc, id;

+ if (nohmem)
+ return;
+
rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
IORES_DESC_SOFT_RESERVED);
if (rc != REGION_INTERSECTS)
@@ -63,3 +69,32 @@ void hmem_register_device(int target_nid, struct resource *r)
out_pdev:
memregion_free(id);
}
+
+static __init int hmem_register_one(struct resource *res, void *data)
+{
+ /*
+ * If the resource is not a top-level resource it was already
+ * assigned to a device by the HMAT parsing.
+ */
+ if (res->parent != &iomem_resource) {
+ pr_info("HMEM: skip %pr, already claimed\n", res);
+ return 0;
+ }
+
+ hmem_register_device(phys_to_target_node(res->start), res);
+
+ return 0;
+}
+
+static __init int hmem_init(void)
+{
+ walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
+ IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+ return 0;
+}
+
+/*
+ * As this is a fallback for address ranges unclaimed by the ACPI HMAT
+ * parsing it must be at an initcall level greater than hmat_init().
+ */
+late_initcall(hmem_init);

2020-08-03 05:20:46

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 08/23] device-dax: Drop the dax_region.pfn_flags attribute

All callers specify the same flags to alloc_dax_region(), so there is no
need to allow for anything other than PFN_DEV|PFN_MAP, or carry a
->pfn_flags around on the region. Device-dax instances are always page
backed.

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 4 +---
drivers/dax/bus.h | 3 +--
drivers/dax/dax-private.h | 2 --
drivers/dax/device.c | 26 +++-----------------------
drivers/dax/hmem/hmem.c | 2 +-
drivers/dax/pmem/core.c | 3 +--
6 files changed, 7 insertions(+), 33 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index df238c8b6ef2..f06ffa66cd78 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -226,8 +226,7 @@ static void dax_region_unregister(void *region)
}

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align,
- unsigned long long pfn_flags)
+ struct resource *res, int target_node, unsigned int align)
{
struct dax_region *dax_region;

@@ -251,7 +250,6 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,

dev_set_drvdata(parent, dax_region);
memcpy(&dax_region->res, res, sizeof(*res));
- dax_region->pfn_flags = pfn_flags;
kref_init(&dax_region->kref);
dax_region->id = region_id;
dax_region->align = align;
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 9e4eba67e8b9..55577e9791da 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -10,8 +10,7 @@ struct dax_device;
struct dax_region;
void dax_region_put(struct dax_region *dax_region);
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align,
- unsigned long long flags);
+ struct resource *res, int target_node, unsigned int align);

enum dev_dax_subsys {
DEV_DAX_BUS,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 16850d5388ab..8a4c40ccd2ef 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -23,7 +23,6 @@ void dax_bus_exit(void);
* @dev: parent device backing this region
* @align: allocation and mapping alignment for child dax devices
* @res: physical address range of the region
- * @pfn_flags: identify whether the pfns are paged back or not
*/
struct dax_region {
int id;
@@ -32,7 +31,6 @@ struct dax_region {
struct device *dev;
unsigned int align;
struct resource res;
- unsigned long long pfn_flags;
};

/**
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 4c0af2eb7e19..bffef1b21144 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -41,14 +41,6 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return -EINVAL;
}

- if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
- && (vma->vm_flags & VM_DONTCOPY) == 0) {
- dev_info_ratelimited(dev,
- "%s: %s: fail, dax range requires MADV_DONTFORK\n",
- current->comm, func);
- return -EINVAL;
- }
-
if (!vma_is_dax(vma)) {
dev_info_ratelimited(dev,
"%s: %s: fail, vma is not DAX capable\n",
@@ -102,7 +94,7 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}

- *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
}
@@ -127,12 +119,6 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}

- /* dax pmd mappings require pfn_t_devmap() */
- if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
- dev_dbg(dev, "region lacks devmap flags\n");
- return VM_FAULT_SIGBUS;
- }
-
if (fault_size < dax_region->align)
return VM_FAULT_SIGBUS;
else if (fault_size > dax_region->align)
@@ -150,7 +136,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}

- *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
}
@@ -177,12 +163,6 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}

- /* dax pud mappings require pfn_t_devmap() */
- if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
- dev_dbg(dev, "region lacks devmap flags\n");
- return VM_FAULT_SIGBUS;
- }
-
if (fault_size < dax_region->align)
return VM_FAULT_SIGBUS;
else if (fault_size > dax_region->align)
@@ -200,7 +180,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}

- *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+ *pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);

return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
}
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 29ceb5795297..506893861253 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -22,7 +22,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
memcpy(&pgmap.res, res, sizeof(*res));

dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
- PMD_SIZE, PFN_DEV|PFN_MAP);
+ PMD_SIZE);
if (!dax_region)
return -ENOMEM;

diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 2bedf8414fff..ea52bb77a294 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -53,8 +53,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
memcpy(&res, &pgmap.res, sizeof(res));
res.start += offset;
dax_region = alloc_dax_region(dev, region_id, &res,
- nd_region->target_node, le32_to_cpu(pfn_sb->align),
- PFN_DEV|PFN_MAP);
+ nd_region->target_node, le32_to_cpu(pfn_sb->align));
if (!dax_region)
return ERR_PTR(-ENOMEM);


2020-08-03 05:20:48

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 09/23] device-dax: Move instance creation parameters to 'struct dev_dax_data'

In preparation for adding more parameters to instance creation, move
existing parameters to a new struct.

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 14 +++++++-------
drivers/dax/bus.h | 16 ++++++++--------
drivers/dax/hmem/hmem.c | 8 +++++++-
drivers/dax/pmem/core.c | 9 ++++++++-
4 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index f06ffa66cd78..dffa4655e128 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -395,9 +395,9 @@ static void unregister_dev_dax(void *dev)
put_device(dev);
}

-struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
- struct dev_pagemap *pgmap, enum dev_dax_subsys subsys)
+struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
+ struct dax_region *dax_region = data->dax_region;
struct device *parent = dax_region->dev;
struct dax_device *dax_dev;
struct dev_dax *dev_dax;
@@ -405,14 +405,14 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
struct device *dev;
int rc = -ENOMEM;

- if (id < 0)
+ if (data->id < 0)
return ERR_PTR(-EINVAL);

dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
if (!dev_dax)
return ERR_PTR(-ENOMEM);

- memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
+ memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));

/*
* No 'host' or dax_operations since there is no access to this
@@ -438,13 +438,13 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,

inode = dax_inode(dax_dev);
dev->devt = inode->i_rdev;
- if (subsys == DEV_DAX_BUS)
+ if (data->subsys == DEV_DAX_BUS)
dev->bus = &dax_bus_type;
else
dev->class = dax_class;
dev->parent = parent;
dev->type = &dev_dax_type;
- dev_set_name(dev, "dax%d.%d", dax_region->id, id);
+ dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);

rc = device_add(dev);
if (rc) {
@@ -464,7 +464,7 @@ struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,

return ERR_PTR(rc);
}
-EXPORT_SYMBOL_GPL(__devm_create_dev_dax);
+EXPORT_SYMBOL_GPL(devm_create_dev_dax);

static int match_always_count;

diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 55577e9791da..299c2e7fac09 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,18 +13,18 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
struct resource *res, int target_node, unsigned int align);

enum dev_dax_subsys {
- DEV_DAX_BUS,
+ DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */
DEV_DAX_CLASS,
};

-struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
- struct dev_pagemap *pgmap, enum dev_dax_subsys subsys);
+struct dev_dax_data {
+ struct dax_region *dax_region;
+ struct dev_pagemap *pgmap;
+ enum dev_dax_subsys subsys;
+ int id;
+};

-static inline struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
- int id, struct dev_pagemap *pgmap)
-{
- return __devm_create_dev_dax(dax_region, id, pgmap, DEV_DAX_BUS);
-}
+struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);

/* to be deleted when DEV_DAX_CLASS is removed */
struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 506893861253..b84fe17178d8 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -11,6 +11,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
struct dev_pagemap pgmap = { };
struct dax_region *dax_region;
struct memregion_info *mri;
+ struct dev_dax_data data;
struct dev_dax *dev_dax;
struct resource *res;

@@ -26,7 +27,12 @@ static int dax_hmem_probe(struct platform_device *pdev)
if (!dax_region)
return -ENOMEM;

- dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+ data = (struct dev_dax_data) {
+ .dax_region = dax_region,
+ .id = 0,
+ .pgmap = &pgmap,
+ };
+ dev_dax = devm_create_dev_dax(&data);
if (IS_ERR(dev_dax))
return PTR_ERR(dev_dax);

diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index ea52bb77a294..08ee5947a49c 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -14,6 +14,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
resource_size_t offset;
struct nd_pfn_sb *pfn_sb;
struct dev_dax *dev_dax;
+ struct dev_dax_data data;
struct nd_namespace_io *nsio;
struct dax_region *dax_region;
struct dev_pagemap pgmap = { };
@@ -57,7 +58,13 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
if (!dax_region)
return ERR_PTR(-ENOMEM);

- dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys);
+ data = (struct dev_dax_data) {
+ .dax_region = dax_region,
+ .id = id,
+ .pgmap = &pgmap,
+ .subsys = subsys,
+ };
+ dev_dax = devm_create_dev_dax(&data);

/* child dev_dax instances now own the lifetime of the dax_region */
dax_region_put(dax_region);

2020-08-03 05:21:07

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

Several related issues around this unneeded attribute:

- The dax_kmem_res property allows the kmem driver to stash the adjusted
resource range that was used for the hotplug operation, but that can be
recalculated from the original base range.

- kmem is using an open coded release_resource() + kfree() when an
idiomatic release_mem_region() is sufficient.

- The driver managed resource need only manage the busy flag. Other flags
are of no concern to the kmem driver. In fact if kmem inherits some
memory range that add_memory_driver_managed() rejects that is a
memory-hotplug-core policy that the driver is in no position to
override.

- The implementation trusts that failed remove_memory() results in the
entire resource range remaining pinned busy. The driver need not make
that layering violation assumption and just maintain the busy state in
its local resource.

- The "Hot-remove not yet implemented." comment is stale since hotremove
support is now included.

Cc: David Hildenbrand <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/dax-private.h | 3 -
drivers/dax/kmem.c | 123 +++++++++++++++++++++------------------------
2 files changed, 58 insertions(+), 68 deletions(-)

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 6779f683671d..12a2dbc43b40 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -42,8 +42,6 @@ struct dax_region {
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @range: resource range for the instance
- * @dax_mem_res: physical address range of hotadded DAX memory
- * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed()
*/
struct dev_dax {
struct dax_region *region;
@@ -52,7 +50,6 @@ struct dev_dax {
struct device dev;
struct dev_pagemap *pgmap;
struct range range;
- struct resource *dax_kmem_res;
};

static inline u64 range_len(struct range *range)
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 5bb133df147d..77e25361fbeb 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -19,16 +19,24 @@ static const char *kmem_name;
/* Set if any memory will remain added when the driver will be unloaded. */
static bool any_hotremove_failed;

+static struct range dax_kmem_range(struct dev_dax *dev_dax)
+{
+ struct range range;
+
+ /* memory-block align the hotplug range */
+ range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes());
+ range.end = ALIGN_DOWN(dev_dax->range.end + 1,
+ memory_block_size_bytes()) - 1;
+ return range;
+}
+
int dev_dax_kmem_probe(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
- struct range *range = &dev_dax->range;
- resource_size_t kmem_start;
- resource_size_t kmem_size;
- resource_size_t kmem_end;
- struct resource *new_res;
- const char *new_res_name;
- int numa_node;
+ struct range range = dax_kmem_range(dev_dax);
+ int numa_node = dev_dax->target_node;
+ struct resource *res;
+ char *res_name;
int rc;

/*
@@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
* could be mixed in a node with faster memory, causing
* unavoidable performance issues.
*/
- numa_node = dev_dax->target_node;
if (numa_node < 0) {
dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
numa_node);
return -EINVAL;
}

- /* Hotplug starting at the beginning of the next block: */
- kmem_start = ALIGN(range->start, memory_block_size_bytes());
-
- kmem_size = range_len(range);
- /* Adjust the size down to compensate for moving up kmem_start: */
- kmem_size -= kmem_start - range->start;
- /* Align the size down to cover only complete blocks: */
- kmem_size &= ~(memory_block_size_bytes() - 1);
- kmem_end = kmem_start + kmem_size;
-
- new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
- if (!new_res_name)
+ res_name = kstrdup(dev_name(dev), GFP_KERNEL);
+ if (!res_name)
return -ENOMEM;

- /* Region is permanently reserved if hotremove fails. */
- new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
- if (!new_res) {
- dev_warn(dev, "could not reserve region [%pa-%pa]\n",
- &kmem_start, &kmem_end);
- kfree(new_res_name);
+ res = request_mem_region(range.start, range_len(&range), res_name);
+ if (!res) {
+ dev_warn(dev, "could not reserve region [%#llx-%#llx]\n",
+ range.start, range.end);
+ kfree(res_name);
return -EBUSY;
}

/*
- * Set flags appropriate for System RAM. Leave ..._BUSY clear
- * so that add_memory() can add a child resource. Do not
- * inherit flags from the parent since it may set new flags
- * unknown to us that will break add_memory() below.
+ * Temporarily clear busy to allow add_memory_driver_managed()
+ * to claim it.
*/
- new_res->flags = IORESOURCE_SYSTEM_RAM;
+ res->flags &= ~IORESOURCE_BUSY;

/*
* Ensure that future kexec'd kernels will not treat this as RAM
* automatically.
*/
- rc = add_memory_driver_managed(numa_node, new_res->start,
- resource_size(new_res), kmem_name);
+ rc = add_memory_driver_managed(numa_node, res->start,
+ resource_size(res), kmem_name);
+
+ res->flags |= IORESOURCE_BUSY;
if (rc) {
- release_resource(new_res);
- kfree(new_res);
- kfree(new_res_name);
+ release_mem_region(range.start, range_len(&range));
+ kfree(res_name);
return rc;
}
- dev_dax->dax_kmem_res = new_res;
+
+ dev_set_drvdata(dev, res_name);

return 0;
}

#ifdef CONFIG_MEMORY_HOTREMOVE
-static int dev_dax_kmem_remove(struct device *dev)
+static void dax_kmem_release(struct dev_dax *dev_dax)
{
- struct dev_dax *dev_dax = to_dev_dax(dev);
- struct resource *res = dev_dax->dax_kmem_res;
- resource_size_t kmem_start = res->start;
- resource_size_t kmem_size = resource_size(res);
- const char *res_name = res->name;
int rc;
+ struct device *dev = &dev_dax->dev;
+ const char *res_name = dev_get_drvdata(dev);
+ struct range range = dax_kmem_range(dev_dax);

/*
* We have one shot for removing memory, if some memory blocks were not
* offline prior to calling this function remove_memory() will fail, and
* there is no way to hotremove this memory until reboot because device
- * unbind will succeed even if we return failure.
+ * unbind will proceed regardless of the remove_memory result.
*/
- rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
- if (rc) {
- any_hotremove_failed = true;
- dev_err(dev,
- "DAX region %pR cannot be hotremoved until the next reboot\n",
- res);
- return rc;
+ rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
+ if (rc == 0) {
+ release_mem_region(range.start, range_len(&range));
+ dev_set_drvdata(dev, NULL);
+ kfree(res_name);
+ return;
}

- /* Release and free dax resources */
- release_resource(res);
- kfree(res);
- kfree(res_name);
- dev_dax->dax_kmem_res = NULL;
-
- return 0;
+ any_hotremove_failed = true;
+ dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
+ range.start, range.end);
}
#else
-static int dev_dax_kmem_remove(struct device *dev)
+static void dax_kmem_release(struct dev_dax *dev_dax)
{
/*
- * Without hotremove purposely leak the request_mem_region() for the
- * device-dax range and return '0' to ->remove() attempts. The removal
- * of the device from the driver always succeeds, but the region is
- * permanently pinned as reserved by the unreleased
- * request_mem_region().
+ * Without hotremove purposely leak the request_mem_region() for
+ * the device-dax range attempts. The removal of the device from
+ * the driver always succeeds, but the region is permanently
+ * pinned as reserved by the unreleased request_mem_region().
*/
any_hotremove_failed = true;
- return 0;
}
#endif /* CONFIG_MEMORY_HOTREMOVE */

+static int dev_dax_kmem_remove(struct device *dev)
+{
+ dax_kmem_release(to_dev_dax(dev));
+ return 0;
+}
+
static struct dax_device_driver device_dax_kmem_driver = {
.drv = {
.probe = dev_dax_kmem_probe,

2020-08-03 05:21:19

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 13/23] device-dax: Introduce 'seed' devices

Add a seed device concept for dynamic dax regions to be able to split
the region amongst multiple sub-instances. The seed device, similar to
libnvdimm seed devices, is a device that starts with zero capacity
allocated and unbound to a driver. In contrast to libnvdimm seed devices
explicit 'create' and 'delete' interfaces are added to the region to
trigger seeds to be created and unused devices to be reclaimed. The
explicit create and delete replaces implicit create as a side effect of
probe and implicit delete when writing 0 to the size that libnvdimm
implements.

Delete can be performed on any 0-sized and idle device. This avoids the
gymnastics of needing to move device_unregister() to its own async
context. Specifically, it avoids the deadlock of deleting a device via
one of its own attributes. It is also less surprising to userspace which
never sees an extra device it did not request.

For now just add the device creation, teardown, and ->probe()
prevention. A later patch will arrange for the 'dax/size' attribute to
be writable to allocate capacity from the region.

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 317 ++++++++++++++++++++++++++++++++++++++++-----
drivers/dax/bus.h | 4 -
drivers/dax/dax-private.h | 9 +
drivers/dax/device.c | 12 +-
drivers/dax/hmem/hmem.c | 2
drivers/dax/kmem.c | 14 +-
drivers/dax/pmem/compat.c | 2
7 files changed, 304 insertions(+), 56 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 0a48ce378686..dce9413a4394 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -135,10 +135,46 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}

+static int dax_bus_probe(struct device *dev)
+{
+ struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+ struct range *range = &dev_dax->range;
+ int rc;
+
+ if (range_len(range) == 0 || dev_dax->id < 0)
+ return -ENXIO;
+
+ rc = dax_drv->probe(dev_dax);
+
+ if (rc || is_static(dax_region))
+ return rc;
+
+ /*
+ * Track new seed creation only after successful probe of the
+ * previous seed.
+ */
+ if (dax_region->seed == dev)
+ dax_region->seed = NULL;
+
+ return 0;
+}
+
+static int dax_bus_remove(struct device *dev)
+{
+ struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ return dax_drv->remove(dev_dax);
+}
+
static struct bus_type dax_bus_type = {
.name = "dax",
.uevent = dax_bus_uevent,
.match = dax_bus_match,
+ .probe = dax_bus_probe,
+ .remove = dax_bus_remove,
.drv_groups = dax_drv_groups,
};

@@ -219,14 +255,216 @@ static ssize_t available_size_show(struct device *dev,
}
static DEVICE_ATTR_RO(available_size);

+static ssize_t seed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct device *seed;
+ ssize_t rc;
+
+ if (is_static(dax_region))
+ return -EINVAL;
+
+ device_lock(dev);
+ seed = dax_region->seed;
+ rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : "");
+ device_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RO(seed);
+
+static ssize_t create_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct device *youngest;
+ ssize_t rc;
+
+ if (is_static(dax_region))
+ return -EINVAL;
+
+ device_lock(dev);
+ youngest = dax_region->youngest;
+ rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");
+ device_unlock(dev);
+
+ return rc;
+}
+
+static ssize_t create_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ unsigned long long avail;
+ ssize_t rc;
+ int val;
+
+ if (is_static(dax_region))
+ return -EINVAL;
+
+ rc = kstrtoint(buf, 0, &val);
+ if (rc)
+ return rc;
+ if (val != 1)
+ return -EINVAL;
+
+ device_lock(dev);
+ avail = dax_region_avail_size(dax_region);
+ if (avail == 0)
+ rc = -ENOSPC;
+ else {
+ struct dev_dax_data data = {
+ .dax_region = dax_region,
+ .size = 0,
+ .id = -1,
+ };
+ struct dev_dax *dev_dax = devm_create_dev_dax(&data);
+
+ if (IS_ERR(dev_dax))
+ rc = PTR_ERR(dev_dax);
+ else {
+ /*
+ * In support of crafting multiple new devices
+ * simultaneously multiple seeds can be created,
+ * but only the first one that has not been
+ * successfully bound is tracked as the region
+ * seed.
+ */
+ if (!dax_region->seed)
+ dax_region->seed = &dev_dax->dev;
+ dax_region->youngest = &dev_dax->dev;
+ rc = len;
+ }
+ }
+ device_unlock(dev);
+
+ return rc;
+}
+static DEVICE_ATTR_RW(create);
+
+void kill_dev_dax(struct dev_dax *dev_dax)
+{
+ struct dax_device *dax_dev = dev_dax->dax_dev;
+ struct inode *inode = dax_inode(dax_dev);
+
+ kill_dax(dax_dev);
+ unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+}
+EXPORT_SYMBOL_GPL(kill_dev_dax);
+
+static void free_dev_dax_range(struct dev_dax *dev_dax)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct range *range = &dev_dax->range;
+
+ device_lock_assert(dax_region->dev);
+ if (range_len(range))
+ __release_region(&dax_region->res, range->start,
+ range_len(range));
+}
+
+static void unregister_dev_dax(void *dev)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ dev_dbg(dev, "%s\n", __func__);
+
+ kill_dev_dax(dev_dax);
+ free_dev_dax_range(dev_dax);
+ device_del(dev);
+ put_device(dev);
+}
+
+/* a return value >= 0 indicates this invocation invalidated the id */
+static int __free_dev_dax_id(struct dev_dax *dev_dax)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct device *dev = &dev_dax->dev;
+ int rc = dev_dax->id;
+
+ device_lock_assert(dev);
+
+ if (is_static(dax_region) || dev_dax->id < 0)
+ return -1;
+ ida_free(&dax_region->ida, dev_dax->id);
+ dev_dax->id = -1;
+ return rc;
+}
+
+static int free_dev_dax_id(struct dev_dax *dev_dax)
+{
+ struct device *dev = &dev_dax->dev;
+ int rc;
+
+ device_lock(dev);
+ rc = __free_dev_dax_id(dev_dax);
+ device_unlock(dev);
+ return rc;
+}
+
+static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ struct dev_dax *dev_dax;
+ struct device *victim;
+ bool do_del = false;
+ int rc;
+
+ if (is_static(dax_region))
+ return -EINVAL;
+
+ victim = device_find_child_by_name(dax_region->dev, buf);
+ if (!victim)
+ return -ENXIO;
+
+ device_lock(dev);
+ device_lock(victim);
+ dev_dax = to_dev_dax(victim);
+ if (victim->driver || range_len(&dev_dax->range))
+ rc = -EBUSY;
+ else {
+ /*
+ * Invalidate the device so it does not become active
+ * again, but always preserve device-id-0 so that
+ * /sys/bus/dax/ is guaranteed to be populated while any
+ * dax_region is registered.
+ */
+ if (dev_dax->id > 0) {
+ do_del = __free_dev_dax_id(dev_dax) >= 0;
+ rc = len;
+ if (dax_region->seed == victim)
+ dax_region->seed = NULL;
+ if (dax_region->youngest == victim)
+ dax_region->youngest = NULL;
+ } else
+ rc = -EBUSY;
+ }
+ device_unlock(victim);
+
+ /* won the race to invalidate the device, clean it up */
+ if (do_del)
+ devm_release_action(dev, unregister_dev_dax, victim);
+ device_unlock(dev);
+ put_device(victim);
+
+ return rc;
+}
+static DEVICE_ATTR_WO(delete);
+
static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
int n)
{
struct device *dev = container_of(kobj, struct device, kobj);
struct dax_region *dax_region = dev_get_drvdata(dev);

- if (is_static(dax_region) && a == &dev_attr_available_size.attr)
- return 0;
+ if (is_static(dax_region))
+ if (a == &dev_attr_available_size.attr
+ || a == &dev_attr_create.attr
+ || a == &dev_attr_seed.attr
+ || a == &dev_attr_delete.attr)
+ return 0;
return a->mode;
}

@@ -234,6 +472,9 @@ static struct attribute *dax_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_region_size.attr,
&dev_attr_align.attr,
+ &dev_attr_create.attr,
+ &dev_attr_seed.attr,
+ &dev_attr_delete.attr,
&dev_attr_id.attr,
NULL,
};
@@ -302,6 +543,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
dax_region->align = align;
dax_region->dev = parent;
dax_region->target_node = target_node;
+ ida_init(&dax_region->ida);
dax_region->res = (struct resource) {
.start = res->start,
.end = res->end,
@@ -329,6 +571,15 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)

device_lock_assert(dax_region->dev);

+ /* handle the seed alloc special case */
+ if (!size) {
+ dev_dax->range = (struct range) {
+ .start = res->start,
+ .end = res->start - 1,
+ };
+ return 0;
+ }
+
/* TODO: handle multiple allocations per region */
if (res->child)
return -ENOMEM;
@@ -430,33 +681,15 @@ static const struct attribute_group *dax_attribute_groups[] = {
NULL,
};

-void kill_dev_dax(struct dev_dax *dev_dax)
-{
- struct dax_device *dax_dev = dev_dax->dax_dev;
- struct inode *inode = dax_inode(dax_dev);
-
- kill_dax(dax_dev);
- unmap_mapping_range(inode->i_mapping, 0, 0, 1);
-}
-EXPORT_SYMBOL_GPL(kill_dev_dax);
-
-static void free_dev_dax_range(struct dev_dax *dev_dax)
-{
- struct dax_region *dax_region = dev_dax->region;
- struct range *range = &dev_dax->range;
-
- device_lock_assert(dax_region->dev);
- __release_region(&dax_region->res, range->start, range_len(range));
-}
-
static void dev_dax_release(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_region *dax_region = dev_dax->region;
struct dax_device *dax_dev = dev_dax->dax_dev;

- dax_region_put(dax_region);
put_dax(dax_dev);
+ free_dev_dax_id(dev_dax);
+ dax_region_put(dax_region);
kfree(dev_dax->pgmap);
kfree(dev_dax);
}
@@ -466,18 +699,6 @@ static const struct device_type dev_dax_type = {
.groups = dax_attribute_groups,
};

-static void unregister_dev_dax(void *dev)
-{
- struct dev_dax *dev_dax = to_dev_dax(dev);
-
- dev_dbg(dev, "%s\n", __func__);
-
- kill_dev_dax(dev_dax);
- free_dev_dax_range(dev_dax);
- device_del(dev);
- put_device(dev);
-}
-
struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
{
struct dax_region *dax_region = data->dax_region;
@@ -488,17 +709,35 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
struct device *dev;
int rc;

- if (data->id < 0)
- return ERR_PTR(-EINVAL);
-
dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
if (!dev_dax)
return ERR_PTR(-ENOMEM);

+ if (is_static(dax_region)) {
+ if (dev_WARN_ONCE(parent, data->id < 0,
+ "dynamic id specified to static region\n")) {
+ rc = -EINVAL;
+ goto err_id;
+ }
+
+ dev_dax->id = data->id;
+ } else {
+ if (dev_WARN_ONCE(parent, data->id >= 0,
+ "static id specified to dynamic region\n")) {
+ rc = -EINVAL;
+ goto err_id;
+ }
+
+ rc = ida_alloc(&dax_region->ida, GFP_KERNEL);
+ if (rc < 0)
+ goto err_id;
+ dev_dax->id = rc;
+ }
+
dev_dax->region = dax_region;
dev = &dev_dax->dev;
device_initialize(dev);
- dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
+ dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);

rc = alloc_dev_dax_range(dev_dax, data->size);
if (rc)
@@ -561,6 +800,8 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
err_pgmap:
free_dev_dax_range(dev_dax);
err_range:
+ free_dev_dax_id(dev_dax);
+err_id:
kfree(dev_dax);

return ERR_PTR(rc);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 44592a8cac0f..da27ea70a19a 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -38,6 +38,8 @@ struct dax_device_driver {
struct device_driver drv;
struct list_head ids;
int match_always;
+ int (*probe)(struct dev_dax *dev);
+ int (*remove)(struct dev_dax *dev);
};

int __dax_driver_register(struct dax_device_driver *dax_drv,
@@ -48,7 +50,7 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv);
void kill_dev_dax(struct dev_dax *dev_dax);

#if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
-int dev_dax_probe(struct device *dev);
+int dev_dax_probe(struct dev_dax *dev_dax);
#endif

/*
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 99b1273bb232..b81a1494d82b 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -7,6 +7,7 @@

#include <linux/device.h>
#include <linux/cdev.h>
+#include <linux/idr.h>

/* private routines between core files */
struct dax_device;
@@ -22,7 +23,10 @@ void dax_bus_exit(void);
* @kref: to pin while other agents have a need to do lookups
* @dev: parent device backing this region
* @align: allocation and mapping alignment for child dax devices
+ * @ida: instance id allocator
* @res: resource tree to track instance allocations
+ * @seed: allow userspace to find the first unbound seed device
+ * @youngest: allow userspace to find the most recently created device
*/
struct dax_region {
int id;
@@ -30,7 +34,10 @@ struct dax_region {
struct kref kref;
struct device *dev;
unsigned int align;
+ struct ida ida;
struct resource res;
+ struct device *seed;
+ struct device *youngest;
};

/**
@@ -39,6 +46,7 @@ struct dax_region {
* @region - parent region
* @dax_dev - core dax functionality
* @target_node: effective numa node if dev_dax memory range is onlined
+ * @id: ida allocated id
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @range: resource range for the instance
@@ -47,6 +55,7 @@ struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
int target_node;
+ int id;
struct device dev;
struct dev_pagemap *pgmap;
struct range range;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index a86b2c9788c5..5496331f6eb6 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -392,11 +392,11 @@ static void dev_dax_kill(void *dev_dax)
kill_dev_dax(dev_dax);
}

-int dev_dax_probe(struct device *dev)
+int dev_dax_probe(struct dev_dax *dev_dax)
{
- struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_device *dax_dev = dev_dax->dax_dev;
struct range *range = &dev_dax->range;
+ struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
struct inode *inode;
struct cdev *cdev;
@@ -446,17 +446,15 @@ int dev_dax_probe(struct device *dev)
}
EXPORT_SYMBOL_GPL(dev_dax_probe);

-static int dev_dax_remove(struct device *dev)
+static int dev_dax_remove(struct dev_dax *dev_dax)
{
/* all probe actions are unwound by devm */
return 0;
}

static struct dax_device_driver device_dax_driver = {
- .drv = {
- .probe = dev_dax_probe,
- .remove = dev_dax_remove,
- },
+ .probe = dev_dax_probe,
+ .remove = dev_dax_remove,
.match_always = 1,
};

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index e7b64539e23e..aa260009dfc7 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platform_device *pdev)

data = (struct dev_dax_data) {
.dax_region = dax_region,
- .id = 0,
+ .id = -1,
.size = resource_size(res),
};
dev_dax = devm_create_dev_dax(&data);
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 77e25361fbeb..b1be8a53b26a 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -30,11 +30,11 @@ static struct range dax_kmem_range(struct dev_dax *dev_dax)
return range;
}

-int dev_dax_kmem_probe(struct device *dev)
+int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
- struct dev_dax *dev_dax = to_dev_dax(dev);
struct range range = dax_kmem_range(dev_dax);
int numa_node = dev_dax->target_node;
+ struct device *dev = &dev_dax->dev;
struct resource *res;
char *res_name;
int rc;
@@ -127,17 +127,15 @@ static void dax_kmem_release(struct dev_dax *dev_dax)
}
#endif /* CONFIG_MEMORY_HOTREMOVE */

-static int dev_dax_kmem_remove(struct device *dev)
+static int dev_dax_kmem_remove(struct dev_dax *dev_dax)
{
- dax_kmem_release(to_dev_dax(dev));
+ dax_kmem_release(dev_dax);
return 0;
}

static struct dax_device_driver device_dax_kmem_driver = {
- .drv = {
- .probe = dev_dax_kmem_probe,
- .remove = dev_dax_kmem_remove,
- },
+ .probe = dev_dax_kmem_probe,
+ .remove = dev_dax_kmem_remove,
};

static int __init dax_kmem_init(void)
diff --git a/drivers/dax/pmem/compat.c b/drivers/dax/pmem/compat.c
index d7b15e6f30c5..863c114fd88c 100644
--- a/drivers/dax/pmem/compat.c
+++ b/drivers/dax/pmem/compat.c
@@ -22,7 +22,7 @@ static int dax_pmem_compat_probe(struct device *dev)
return -ENOMEM;

device_lock(&dev_dax->dev);
- rc = dev_dax_probe(&dev_dax->dev);
+ rc = dev_dax_probe(dev_dax);
device_unlock(&dev_dax->dev);

devres_close_group(&dev_dax->dev, dev_dax);

2020-08-03 05:21:27

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 14/23] drivers/base: Make device_find_child_by_name() compatible with sysfs inputs

Use sysfs_streq() in device_find_child_by_name() to allow it to use a
sysfs input string that might contain a trailing newline.

The other "device by name" interfaces,
{bus,driver,class}_find_device_by_name(), already account for sysfs
strings.

Cc: "Rafael J. Wysocki" <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/base/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2169c5132558..231189dd6599 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -3328,7 +3328,7 @@ struct device *device_find_child_by_name(struct device *parent,

klist_iter_init(&parent->p->klist_children, &i);
while ((child = next_device(&i)))
- if (!strcmp(dev_name(child), name) && get_device(child))
+ if (sysfs_streq(dev_name(child), name) && get_device(child))
break;
klist_iter_exit(&i);
return child;

2020-08-03 05:21:34

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 15/23] device-dax: Add resize support

Make the device-dax 'size' attribute writable to allow capacity to be
split between multiple instances in a region. The intended consumers of
this capability are users that want to split a scarce memory resource
between device-dax and System-RAM access, or users that want to have
multiple security domains for a large region.

By default the hmem instance provider allocates an entire region to the
first instance. The process of creating a new instance (assuming a
region-id of 0) is find the region and trigger the 'create' attribute
which yields an empty instance to configure. For example:

cd /sys/bus/dax/devices
echo dax0.0 > dax0.0/driver/unbind
echo $new_size > dax0.0/size
echo 1 > $(readlink -f dax0.0)../dax_region/create
seed=$(cat $(readlink -f dax0.0)../dax_region/seed)
echo $new_size > $seed/size
echo dax0.0 > ../drivers/{device_dax,kmem}/bind
echo dax0.1 > ../drivers/{device_dax,kmem}/bind

Instances can be destroyed by:

echo $device > $(readlink -f $device)../dax_region/delete

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 152 insertions(+), 9 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index dce9413a4394..53d07f2f1285 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -6,6 +6,7 @@
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/dax.h>
+#include <linux/io.h>
#include "dax-private.h"
#include "bus.h"

@@ -562,7 +563,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
}
EXPORT_SYMBOL_GPL(alloc_dax_region);

-static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
+static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
+ resource_size_t size)
{
struct dax_region *dax_region = dev_dax->region;
struct resource *res = &dax_region->res;
@@ -580,12 +582,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
return 0;
}

- /* TODO: handle multiple allocations per region */
- if (res->child)
- return -ENOMEM;
-
- alloc = __request_region(res, res->start, size, dev_name(dev), 0);
-
+ alloc = __request_region(res, start, size, dev_name(dev), 0);
if (!alloc)
return -ENOMEM;

@@ -597,6 +594,29 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
return 0;
}

+static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct range *range = &dev_dax->range;
+ int rc = 0;
+
+ device_lock_assert(dax_region->dev);
+
+ if (size)
+ rc = adjust_resource(res, range->start, size);
+ else
+ __release_region(&dax_region->res, range->start, range_len(range));
+ if (rc)
+ return rc;
+
+ dev_dax->range = (struct range) {
+ .start = range->start,
+ .end = range->start + size - 1,
+ };
+
+ return 0;
+}
+
static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -605,7 +625,127 @@ static ssize_t size_show(struct device *dev,

return sprintf(buf, "%llu\n", size);
}
-static DEVICE_ATTR_RO(size);
+
+static bool alloc_is_aligned(struct dax_region *dax_region,
+ resource_size_t size)
+{
+ /*
+ * The minimum mapping granularity for a device instance is a
+ * single subsection, unless the arch says otherwise.
+ */
+ return IS_ALIGNED(size, max_t(unsigned long, dax_region->align,
+ memremap_compat_align()));
+}
+
+static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct range *range = &dev_dax->range;
+ struct resource *res, *adjust = NULL;
+ struct device *dev = &dev_dax->dev;
+
+ for_each_dax_region_resource(dax_region, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+
+ if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n"))
+ return -ENXIO;
+ return adjust_dev_dax_range(dev_dax, adjust, size);
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+ struct dev_dax *dev_dax, resource_size_t size)
+{
+ resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+ resource_size_t dev_size = range_len(&dev_dax->range);
+ struct resource *region_res = &dax_region->res;
+ struct device *dev = &dev_dax->dev;
+ const char *name = dev_name(dev);
+ struct resource *res, *first;
+
+ if (dev->driver)
+ return -EBUSY;
+ if (size == dev_size)
+ return 0;
+ if (size > dev_size && size - dev_size > avail)
+ return -ENOSPC;
+ if (size < dev_size)
+ return dev_dax_shrink(dev_dax, size);
+
+ to_alloc = size - dev_size;
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc),
+ "resize of %pa misaligned\n", &to_alloc))
+ return -ENXIO;
+
+ /*
+ * Expand the device into the unused portion of the region. This
+ * may involve adjusting the end of an existing resource, or
+ * allocating a new resource.
+ */
+ first = region_res->child;
+ if (!first)
+ return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ for (res = first; to_alloc && res; res = res->sibling) {
+ struct resource *next = res->sibling;
+ resource_size_t free;
+
+ /* space at the beginning of the region */
+ free = 0;
+ if (res == first && res->start > dax_region->res.start)
+ free = res->start - dax_region->res.start;
+ if (free >= to_alloc && dev_size == 0)
+ return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+
+ free = 0;
+ /* space between allocations */
+ if (next && next->start > res->end + 1)
+ free = next->start - res->end + 1;
+
+ /* space at the end of the region */
+ if (free < to_alloc && !next && res->end < region_res->end)
+ free = region_res->end - res->end;
+
+ if (free >= to_alloc && strcmp(name, res->name) == 0)
+ return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc);
+ else if (free >= to_alloc && dev_size == 0)
+ return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc);
+ }
+ return -ENOSPC;
+}
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ ssize_t rc;
+ unsigned long long val;
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+
+ rc = kstrtoull(buf, 0, &val);
+ if (rc)
+ return rc;
+
+ if (!alloc_is_aligned(dax_region, val)) {
+ dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val);
+ return -EINVAL;
+ }
+
+ device_lock(dax_region->dev);
+ if (!dax_region->dev->driver) {
+ device_unlock(dax_region->dev);
+ return -ENXIO;
+ }
+ device_lock(dev);
+ rc = dev_dax_resize(dax_region, dev_dax, val);
+ device_unlock(dev);
+ device_unlock(dax_region->dev);
+
+ return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(size);

static int dev_dax_target_node(struct dev_dax *dev_dax)
{
@@ -654,11 +794,14 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
{
struct device *dev = container_of(kobj, struct device, kobj);
struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;

if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0)
return 0;
if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
return 0;
+ if (a == &dev_attr_size.attr && is_static(dax_region))
+ return 0444;
return a->mode;
}

@@ -739,7 +882,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
device_initialize(dev);
dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);

- rc = alloc_dev_dax_range(dev_dax, data->size);
+ rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
if (rc)
goto err_range;


2020-08-03 05:21:52

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 21/23] device-dax: Add an 'align' attribute

From: Joao Martins <[email protected]>

Introduce a device align attribute. While doing so,
rename the region align attribute to be more explicitly
named as so, but keep it named as @align to retain the API
for tools like daxctl.

Changes on align may not always be valid, when say certain
mappings were created with 2M and then we switch to 1G. So, we
validate all ranges against the new value being attempted,
post resizing.

Signed-off-by: Joao Martins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 93 ++++++++++++++++++++++++++++++++++++++++-----
drivers/dax/dax-private.h | 18 +++++++++
2 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 9edfdf83408e..b984213c315f 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -230,14 +230,15 @@ static ssize_t region_size_show(struct device *dev,
static struct device_attribute dev_attr_region_size = __ATTR(size, 0444,
region_size_show, NULL);

-static ssize_t align_show(struct device *dev,
+static ssize_t region_align_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct dax_region *dax_region = dev_get_drvdata(dev);

return sprintf(buf, "%u\n", dax_region->align);
}
-static DEVICE_ATTR_RO(align);
+static struct device_attribute dev_attr_region_align =
+ __ATTR(align, 0400, region_align_show, NULL);

#define for_each_dax_region_resource(dax_region, res) \
for (res = (dax_region)->res.child; res; res = res->sibling)
@@ -488,7 +489,7 @@ static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
static struct attribute *dax_region_attributes[] = {
&dev_attr_available_size.attr,
&dev_attr_region_size.attr,
- &dev_attr_align.attr,
+ &dev_attr_region_align.attr,
&dev_attr_create.attr,
&dev_attr_seed.attr,
&dev_attr_delete.attr,
@@ -855,15 +856,13 @@ static ssize_t size_show(struct device *dev,
return sprintf(buf, "%llu\n", size);
}

-static bool alloc_is_aligned(struct dax_region *dax_region,
- resource_size_t size)
+static bool alloc_is_aligned(struct dev_dax *dev_dax, resource_size_t size)
{
/*
* The minimum mapping granularity for a device instance is a
* single subsection, unless the arch says otherwise.
*/
- return IS_ALIGNED(size, max_t(unsigned long, dax_region->align,
- memremap_compat_align()));
+ return IS_ALIGNED(size, max_t(unsigned long, dev_dax->align, memremap_compat_align()));
}

static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
@@ -958,7 +957,7 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
return dev_dax_shrink(dev_dax, size);

to_alloc = size - dev_size;
- if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc),
+ if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
"resize of %pa misaligned\n", &to_alloc))
return -ENXIO;

@@ -1022,7 +1021,7 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr,
if (rc)
return rc;

- if (!alloc_is_aligned(dax_region, val)) {
+ if (!alloc_is_aligned(dev_dax, val)) {
dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val);
return -EINVAL;
}
@@ -1041,6 +1040,78 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(size);

+static ssize_t align_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+
+ return sprintf(buf, "%d\n", dev_dax->align);
+}
+
+static ssize_t dev_dax_validate_align(struct dev_dax *dev_dax)
+{
+ resource_size_t dev_size = dev_dax_size(dev_dax);
+ struct device *dev = &dev_dax->dev;
+ int i;
+
+ if (dev_size > 0 && !alloc_is_aligned(dev_dax, dev_size)) {
+ dev_dbg(dev, "%s: align %u invalid for size %pa\n",
+ __func__, dev_dax->align, &dev_size);
+ return -EINVAL;
+ }
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ size_t len = range_len(&dev_dax->ranges[i].range);
+
+ if (!alloc_is_aligned(dev_dax, len)) {
+ dev_dbg(dev, "%s: align %u invalid for range %d\n",
+ __func__, dev_dax->align, i);
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static ssize_t align_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+ unsigned long val, align_save;
+ ssize_t rc;
+
+ rc = kstrtoul(buf, 0, &val);
+ if (rc)
+ return -ENXIO;
+
+ if (!dax_align_valid(val))
+ return -EINVAL;
+
+ device_lock(dax_region->dev);
+ if (!dax_region->dev->driver) {
+ device_unlock(dax_region->dev);
+ return -ENXIO;
+ }
+
+ device_lock(dev);
+ if (dev->driver) {
+ rc = -EBUSY;
+ goto out_unlock;
+ }
+
+ align_save = dev_dax->align;
+ dev_dax->align = val;
+ rc = dev_dax_validate_align(dev_dax);
+ if (rc)
+ dev_dax->align = align_save;
+out_unlock:
+ device_unlock(dev);
+ device_unlock(dax_region->dev);
+ return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(align);
+
static int dev_dax_target_node(struct dev_dax *dev_dax)
{
struct dax_region *dax_region = dev_dax->region;
@@ -1101,7 +1172,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
return 0;
- if (a == &dev_attr_size.attr && is_static(dax_region))
+ if ((a == &dev_attr_align.attr ||
+ a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
return a->mode;
}
@@ -1110,6 +1182,7 @@ static struct attribute *dev_dax_attributes[] = {
&dev_attr_modalias.attr,
&dev_attr_size.attr,
&dev_attr_target_node.attr,
+ &dev_attr_align.attr,
&dev_attr_resource.attr,
&dev_attr_numa_node.attr,
NULL,
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 5fd3a26cfcea..fe337436d7f5 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -87,4 +87,22 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev)
}

phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool dax_align_valid(unsigned int align)
+{
+ if (align == PUD_SIZE && IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+ return true;
+ if (align == PMD_SIZE && has_transparent_hugepage())
+ return true;
+ if (align == PAGE_SIZE)
+ return true;
+ return false;
+}
+#else
+static inline bool dax_align_valid(unsigned int align)
+{
+ return align == PAGE_SIZE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif

2020-08-03 05:22:40

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 10/23] device-dax: Make pgmap optional for instance creation

The passed in dev_pagemap is only required in the pmem case as the
libnvdimm core may have reserved a vmem_altmap for dev_memremap_pages()
to place the memmap in pmem directly. In the hmem case there is no
agent reserving an altmap so it can all be handled by a core internal
default.

Pass the resource range via a new @range property of 'struct
dev_dax_data'.

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 29 +++++++++++++++--------------
drivers/dax/bus.h | 2 ++
drivers/dax/dax-private.h | 9 ++++++++-
drivers/dax/device.c | 28 +++++++++++++++++++---------
drivers/dax/hmem/hmem.c | 8 ++++----
drivers/dax/kmem.c | 12 ++++++------
drivers/dax/pmem/core.c | 4 ++++
tools/testing/nvdimm/dax-dev.c | 8 ++++----
8 files changed, 62 insertions(+), 38 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index dffa4655e128..96bd64ba95a5 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -271,7 +271,7 @@ static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
- unsigned long long size = resource_size(&dev_dax->region->res);
+ unsigned long long size = range_len(&dev_dax->range);

return sprintf(buf, "%llu\n", size);
}
@@ -293,19 +293,12 @@ static ssize_t target_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(target_node);

-static unsigned long long dev_dax_resource(struct dev_dax *dev_dax)
-{
- struct dax_region *dax_region = dev_dax->region;
-
- return dax_region->res.start;
-}
-
static ssize_t resource_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct dev_dax *dev_dax = to_dev_dax(dev);

- return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax));
+ return sprintf(buf, "%#llx\n", dev_dax->range.start);
}
static DEVICE_ATTR(resource, 0400, resource_show, NULL);

@@ -376,6 +369,7 @@ static void dev_dax_release(struct device *dev)

dax_region_put(dax_region);
put_dax(dax_dev);
+ kfree(dev_dax->pgmap);
kfree(dev_dax);
}

@@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
if (!dev_dax)
return ERR_PTR(-ENOMEM);

- memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));
+ if (data->pgmap) {
+ dev_dax->pgmap = kmemdup(data->pgmap,
+ sizeof(struct dev_pagemap), GFP_KERNEL);
+ if (!dev_dax->pgmap)
+ goto err_pgmap;
+ }

/*
* No 'host' or dax_operations since there is no access to this
@@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
if (IS_ERR(dax_dev)) {
rc = PTR_ERR(dax_dev);
- goto err;
+ goto err_alloc_dax;
}

/* a device_dax instance is dead while the driver is not attached */
kill_dax(dax_dev);

- /* from here on we're committed to teardown via dax_dev_release() */
+ /* from here on we're committed to teardown via dev_dax_release() */
dev = &dev_dax->dev;
device_initialize(dev);

dev_dax->dax_dev = dax_dev;
dev_dax->region = dax_region;
+ dev_dax->range = data->range;
dev_dax->target_node = dax_region->target_node;
kref_get(&dax_region->kref);

@@ -458,8 +458,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
return ERR_PTR(rc);

return dev_dax;
-
- err:
+err_alloc_dax:
+ kfree(dev_dax->pgmap);
+err_pgmap:
kfree(dev_dax);

return ERR_PTR(rc);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 299c2e7fac09..4aeb36da83a4 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -3,6 +3,7 @@
#ifndef __DAX_BUS_H__
#define __DAX_BUS_H__
#include <linux/device.h>
+#include <linux/range.h>

struct dev_dax;
struct resource;
@@ -21,6 +22,7 @@ struct dev_dax_data {
struct dax_region *dax_region;
struct dev_pagemap *pgmap;
enum dev_dax_subsys subsys;
+ struct range range;
int id;
};

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 8a4c40ccd2ef..6779f683671d 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -41,6 +41,7 @@ struct dax_region {
* @target_node: effective numa node if dev_dax memory range is onlined
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
+ * @range: resource range for the instance
* @dax_mem_res: physical address range of hotadded DAX memory
* @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed()
*/
@@ -49,10 +50,16 @@ struct dev_dax {
struct dax_device *dax_dev;
int target_node;
struct device dev;
- struct dev_pagemap pgmap;
+ struct dev_pagemap *pgmap;
+ struct range range;
struct resource *dax_kmem_res;
};

+static inline u64 range_len(struct range *range)
+{
+ return range->end - range->start + 1;
+}
+
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index bffef1b21144..a86b2c9788c5 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -55,12 +55,12 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
{
- struct resource *res = &dev_dax->region->res;
+ struct range *range = &dev_dax->range;
phys_addr_t phys;

- phys = pgoff * PAGE_SIZE + res->start;
- if (phys >= res->start && phys <= res->end) {
- if (phys + size - 1 <= res->end)
+ phys = pgoff * PAGE_SIZE + range->start;
+ if (phys >= range->start && phys <= range->end) {
+ if (phys + size - 1 <= range->end)
return phys;
}

@@ -396,21 +396,31 @@ int dev_dax_probe(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_device *dax_dev = dev_dax->dax_dev;
- struct resource *res = &dev_dax->region->res;
+ struct range *range = &dev_dax->range;
+ struct dev_pagemap *pgmap;
struct inode *inode;
struct cdev *cdev;
void *addr;
int rc;

/* 1:1 map region resource range to device-dax instance range */
- if (!devm_request_mem_region(dev, res->start, resource_size(res),
+ if (!devm_request_mem_region(dev, range->start, range_len(range),
dev_name(dev))) {
- dev_warn(dev, "could not reserve region %pR\n", res);
+ dev_warn(dev, "could not reserve range: %#llx - %#llx\n",
+ range->start, range->end);
return -EBUSY;
}

- dev_dax->pgmap.type = MEMORY_DEVICE_DEVDAX;
- addr = devm_memremap_pages(dev, &dev_dax->pgmap);
+ pgmap = dev_dax->pgmap;
+ if (!pgmap) {
+ pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
+ if (!pgmap)
+ return -ENOMEM;
+ pgmap->res.start = range->start;
+ pgmap->res.end = range->end;
+ }
+ pgmap->type = MEMORY_DEVICE_DEVDAX;
+ addr = devm_memremap_pages(dev, pgmap);
if (IS_ERR(addr))
return PTR_ERR(addr);

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index b84fe17178d8..af82d6ba820a 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -8,7 +8,6 @@
static int dax_hmem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
- struct dev_pagemap pgmap = { };
struct dax_region *dax_region;
struct memregion_info *mri;
struct dev_dax_data data;
@@ -20,8 +19,6 @@ static int dax_hmem_probe(struct platform_device *pdev)
return -ENOMEM;

mri = dev->platform_data;
- memcpy(&pgmap.res, res, sizeof(*res));
-
dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
PMD_SIZE);
if (!dax_region)
@@ -30,7 +27,10 @@ static int dax_hmem_probe(struct platform_device *pdev)
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = 0,
- .pgmap = &pgmap,
+ .range = {
+ .start = res->start,
+ .end = res->end,
+ },
};
dev_dax = devm_create_dev_dax(&data);
if (IS_ERR(dev_dax))
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 275aa5f87399..5bb133df147d 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -22,7 +22,7 @@ static bool any_hotremove_failed;
int dev_dax_kmem_probe(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
- struct resource *res = &dev_dax->region->res;
+ struct range *range = &dev_dax->range;
resource_size_t kmem_start;
resource_size_t kmem_size;
resource_size_t kmem_end;
@@ -39,17 +39,17 @@ int dev_dax_kmem_probe(struct device *dev)
*/
numa_node = dev_dax->target_node;
if (numa_node < 0) {
- dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
- res, numa_node);
+ dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
+ numa_node);
return -EINVAL;
}

/* Hotplug starting at the beginning of the next block: */
- kmem_start = ALIGN(res->start, memory_block_size_bytes());
+ kmem_start = ALIGN(range->start, memory_block_size_bytes());

- kmem_size = resource_size(res);
+ kmem_size = range_len(range);
/* Adjust the size down to compensate for moving up kmem_start: */
- kmem_size -= kmem_start - res->start;
+ kmem_size -= kmem_start - range->start;
/* Align the size down to cover only complete blocks: */
kmem_size &= ~(memory_block_size_bytes() - 1);
kmem_end = kmem_start + kmem_size;
diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 08ee5947a49c..4fa81d3d2f65 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -63,6 +63,10 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
.id = id,
.pgmap = &pgmap,
.subsys = subsys,
+ .range = {
+ .start = res.start,
+ .end = res.end,
+ },
};
dev_dax = devm_create_dev_dax(&data);

diff --git a/tools/testing/nvdimm/dax-dev.c b/tools/testing/nvdimm/dax-dev.c
index 7e5d979e73cb..38d8e55c4a0d 100644
--- a/tools/testing/nvdimm/dax-dev.c
+++ b/tools/testing/nvdimm/dax-dev.c
@@ -9,12 +9,12 @@
phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
{
- struct resource *res = &dev_dax->region->res;
+ struct range *range = &dev_dax->range;
phys_addr_t addr;

- addr = pgoff * PAGE_SIZE + res->start;
- if (addr >= res->start && addr <= res->end) {
- if (addr + size - 1 <= res->end) {
+ addr = pgoff * PAGE_SIZE + range->start;
+ if (addr >= range->start && addr <= range->end) {
+ if (addr + size - 1 <= range->end) {
if (get_nfit_res(addr)) {
struct page *page;


2020-08-03 05:22:55

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 12/23] device-dax: Add an allocation interface for device-dax instances

In preparation for a facility that enables dax regions to be
sub-divided, introduce infrastructure to track and allocate region
capacity.

The new dax_region/available_size attribute is only enabled for volatile
hmem devices, not pmem devices that are defined by nvdimm namespace
boundaries. This is per Jeff's feedback the last time dynamic device-dax
capacity allocation support was discussed.

Link: https://lore.kernel.org/linux-nvdimm/[email protected]
Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 120 +++++++++++++++++++++++++++++++++++++++++----
drivers/dax/bus.h | 7 ++-
drivers/dax/dax-private.h | 2 -
drivers/dax/hmem/hmem.c | 7 +--
drivers/dax/pmem/core.c | 8 +--
5 files changed, 121 insertions(+), 23 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 96bd64ba95a5..0a48ce378686 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -130,6 +130,11 @@ ATTRIBUTE_GROUPS(dax_drv);

static int dax_bus_match(struct device *dev, struct device_driver *drv);

+static bool is_static(struct dax_region *dax_region)
+{
+ return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
+}
+
static struct bus_type dax_bus_type = {
.name = "dax",
.uevent = dax_bus_uevent,
@@ -185,7 +190,48 @@ static ssize_t align_show(struct device *dev,
}
static DEVICE_ATTR_RO(align);

+#define for_each_dax_region_resource(dax_region, res) \
+ for (res = (dax_region)->res.child; res; res = res->sibling)
+
+static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
+{
+ resource_size_t size = resource_size(&dax_region->res);
+ struct resource *res;
+
+ device_lock_assert(dax_region->dev);
+
+ for_each_dax_region_resource(dax_region, res)
+ size -= resource_size(res);
+ return size;
+}
+
+static ssize_t available_size_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+ unsigned long long size;
+
+ device_lock(dev);
+ size = dax_region_avail_size(dax_region);
+ device_unlock(dev);
+
+ return sprintf(buf, "%llu\n", size);
+}
+static DEVICE_ATTR_RO(available_size);
+
+static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
+ int n)
+{
+ struct device *dev = container_of(kobj, struct device, kobj);
+ struct dax_region *dax_region = dev_get_drvdata(dev);
+
+ if (is_static(dax_region) && a == &dev_attr_available_size.attr)
+ return 0;
+ return a->mode;
+}
+
static struct attribute *dax_region_attributes[] = {
+ &dev_attr_available_size.attr,
&dev_attr_region_size.attr,
&dev_attr_align.attr,
&dev_attr_id.attr,
@@ -195,6 +241,7 @@ static struct attribute *dax_region_attributes[] = {
static const struct attribute_group dax_region_attribute_group = {
.name = "dax_region",
.attrs = dax_region_attributes,
+ .is_visible = dax_region_visible,
};

static const struct attribute_group *dax_region_attribute_groups[] = {
@@ -226,7 +273,8 @@ static void dax_region_unregister(void *region)
}

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align)
+ struct resource *res, int target_node, unsigned int align,
+ unsigned long flags)
{
struct dax_region *dax_region;

@@ -249,12 +297,17 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
return NULL;

dev_set_drvdata(parent, dax_region);
- memcpy(&dax_region->res, res, sizeof(*res));
kref_init(&dax_region->kref);
dax_region->id = region_id;
dax_region->align = align;
dax_region->dev = parent;
dax_region->target_node = target_node;
+ dax_region->res = (struct resource) {
+ .start = res->start,
+ .end = res->end,
+ .flags = IORESOURCE_MEM | flags,
+ };
+
if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
kfree(dax_region);
return NULL;
@@ -267,6 +320,32 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
}
EXPORT_SYMBOL_GPL(alloc_dax_region);

+static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct resource *res = &dax_region->res;
+ struct device *dev = &dev_dax->dev;
+ struct resource *alloc;
+
+ device_lock_assert(dax_region->dev);
+
+ /* TODO: handle multiple allocations per region */
+ if (res->child)
+ return -ENOMEM;
+
+ alloc = __request_region(res, res->start, size, dev_name(dev), 0);
+
+ if (!alloc)
+ return -ENOMEM;
+
+ dev_dax->range = (struct range) {
+ .start = alloc->start,
+ .end = alloc->end,
+ };
+
+ return 0;
+}
+
static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -361,6 +440,15 @@ void kill_dev_dax(struct dev_dax *dev_dax)
}
EXPORT_SYMBOL_GPL(kill_dev_dax);

+static void free_dev_dax_range(struct dev_dax *dev_dax)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct range *range = &dev_dax->range;
+
+ device_lock_assert(dax_region->dev);
+ __release_region(&dax_region->res, range->start, range_len(range));
+}
+
static void dev_dax_release(struct device *dev)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
@@ -385,6 +473,7 @@ static void unregister_dev_dax(void *dev)
dev_dbg(dev, "%s\n", __func__);

kill_dev_dax(dev_dax);
+ free_dev_dax_range(dev_dax);
device_del(dev);
put_device(dev);
}
@@ -397,7 +486,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
struct dev_dax *dev_dax;
struct inode *inode;
struct device *dev;
- int rc = -ENOMEM;
+ int rc;

if (data->id < 0)
return ERR_PTR(-EINVAL);
@@ -406,11 +495,25 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
if (!dev_dax)
return ERR_PTR(-ENOMEM);

+ dev_dax->region = dax_region;
+ dev = &dev_dax->dev;
+ device_initialize(dev);
+ dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
+
+ rc = alloc_dev_dax_range(dev_dax, data->size);
+ if (rc)
+ goto err_range;
+
if (data->pgmap) {
+ dev_WARN_ONCE(parent, !is_static(dax_region),
+ "custom dev_pagemap requires a static dax_region\n");
+
dev_dax->pgmap = kmemdup(data->pgmap,
sizeof(struct dev_pagemap), GFP_KERNEL);
- if (!dev_dax->pgmap)
+ if (!dev_dax->pgmap) {
+ rc = -ENOMEM;
goto err_pgmap;
+ }
}

/*
@@ -427,12 +530,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
kill_dax(dax_dev);

/* from here on we're committed to teardown via dev_dax_release() */
- dev = &dev_dax->dev;
- device_initialize(dev);
-
dev_dax->dax_dev = dax_dev;
- dev_dax->region = dax_region;
- dev_dax->range = data->range;
dev_dax->target_node = dax_region->target_node;
kref_get(&dax_region->kref);

@@ -444,7 +542,6 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
dev->class = dax_class;
dev->parent = parent;
dev->type = &dev_dax_type;
- dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);

rc = device_add(dev);
if (rc) {
@@ -458,9 +555,12 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
return ERR_PTR(rc);

return dev_dax;
+
err_alloc_dax:
kfree(dev_dax->pgmap);
err_pgmap:
+ free_dev_dax_range(dev_dax);
+err_range:
kfree(dev_dax);

return ERR_PTR(rc);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index 4aeb36da83a4..44592a8cac0f 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -10,8 +10,11 @@ struct resource;
struct dax_device;
struct dax_region;
void dax_region_put(struct dax_region *dax_region);
+
+#define IORESOURCE_DAX_STATIC (1UL << 0)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align);
+ struct resource *res, int target_node, unsigned int align,
+ unsigned long flags);

enum dev_dax_subsys {
DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */
@@ -22,7 +25,7 @@ struct dev_dax_data {
struct dax_region *dax_region;
struct dev_pagemap *pgmap;
enum dev_dax_subsys subsys;
- struct range range;
+ resource_size_t size;
int id;
};

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 12a2dbc43b40..99b1273bb232 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -22,7 +22,7 @@ void dax_bus_exit(void);
* @kref: to pin while other agents have a need to do lookups
* @dev: parent device backing this region
* @align: allocation and mapping alignment for child dax devices
- * @res: physical address range of the region
+ * @res: resource tree to track instance allocations
*/
struct dax_region {
int id;
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index af82d6ba820a..e7b64539e23e 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -20,17 +20,14 @@ static int dax_hmem_probe(struct platform_device *pdev)

mri = dev->platform_data;
dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
- PMD_SIZE);
+ PMD_SIZE, 0);
if (!dax_region)
return -ENOMEM;

data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = 0,
- .range = {
- .start = res->start,
- .end = res->end,
- },
+ .size = resource_size(res),
};
dev_dax = devm_create_dev_dax(&data);
if (IS_ERR(dev_dax))
diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 4fa81d3d2f65..4fe700884338 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -54,7 +54,8 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
memcpy(&res, &pgmap.res, sizeof(res));
res.start += offset;
dax_region = alloc_dax_region(dev, region_id, &res,
- nd_region->target_node, le32_to_cpu(pfn_sb->align));
+ nd_region->target_node, le32_to_cpu(pfn_sb->align),
+ IORESOURCE_DAX_STATIC);
if (!dax_region)
return ERR_PTR(-ENOMEM);

@@ -63,10 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
.id = id,
.pgmap = &pgmap,
.subsys = subsys,
- .range = {
- .start = res.start,
- .end = res.end,
- },
+ .size = resource_size(&res),
};
dev_dax = devm_create_dev_dax(&data);


2020-08-03 05:23:07

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 17/23] mm/memremap_pages: Support multiple ranges per invocation

In support of device-dax growing the ability to front physically
dis-contiguous ranges of memory, update devm_memremap_pages() to track
multiple ranges with a single reference counter and devm instance.

Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 1
drivers/dax/device.c | 1
drivers/gpu/drm/nouveau/nouveau_dmem.c | 1
drivers/nvdimm/pfn_devs.c | 1
drivers/nvdimm/pmem.c | 1
drivers/pci/p2pdma.c | 1
include/linux/memremap.h | 10 +
lib/test_hmm.c | 1
mm/memremap.c | 258 +++++++++++++++++++-------------
9 files changed, 165 insertions(+), 110 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 29ec555055c2..84e5a2dc8be5 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1172,6 +1172,7 @@ int kvmppc_uvmem_init(void)
kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
kvmppc_uvmem_pgmap.range.start = res->start;
kvmppc_uvmem_pgmap.range.end = res->end;
+ kvmppc_uvmem_pgmap.nr_range = 1;
kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
/* just one global instance: */
kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index fffc54ce0911..f3755df4ae29 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -417,6 +417,7 @@ int dev_dax_probe(struct dev_dax *dev_dax)
if (!pgmap)
return -ENOMEM;
pgmap->range = *range;
+ pgmap->nr_range = 1;
}
pgmap->type = MEMORY_DEVICE_DEVDAX;
addr = devm_memremap_pages(dev, pgmap);
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 25811ed7e274..a13c6215bba8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -251,6 +251,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
chunk->pagemap.type = MEMORY_DEVICE_PRIVATE;
chunk->pagemap.range.start = res->start;
chunk->pagemap.range.end = res->end;
+ chunk->pagemap.nr_range = 1;
chunk->pagemap.ops = &nouveau_dmem_pagemap_ops;
chunk->pagemap.owner = drm->dev;

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3c4787b92a6a..b499df630d4d 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -693,6 +693,7 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
.start = nsio->res.start + start_pad,
.end = nsio->res.end - end_trunc,
};
+ pgmap->nr_range = 1;
if (nd_pfn->mode == PFN_MODE_RAM) {
if (offset < reserve)
return -EINVAL;
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 69cc0e783709..1f45af363a94 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -442,6 +442,7 @@ static int pmem_attach_disk(struct device *dev,
} else if (pmem_should_map_pages(dev)) {
pmem->pgmap.range.start = res->start;
pmem->pgmap.range.end = res->end;
+ pmem->pgmap.nr_range = 1;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index dd6b0d51a50c..403304785561 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -187,6 +187,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap = &p2p_pgmap->pgmap;
pgmap->range.start = pci_resource_start(pdev, bar) + offset;
pgmap->range.end = pgmap->range.start + size - 1;
+ pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;

p2p_pgmap->provider = pdev;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6c21951bdb16..4e9c738f4b31 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -95,7 +95,6 @@ struct dev_pagemap_ops {
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
* @altmap: pre-allocated/reserved memory for vmemmap allocations
- * @range: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
* @internal_ref: internal reference if @ref is not provided by the caller
* @done: completion for @internal_ref
@@ -105,10 +104,12 @@ struct dev_pagemap_ops {
* @owner: an opaque pointer identifying the entity that manages this
* instance. Used by various helpers to make sure that no
* foreign ZONE_DEVICE memory is accessed.
+ * @nr_range: number of ranges to be mapped
+ * @range: range to be mapped when nr_range == 1
+ * @ranges: array of ranges to be mapped when nr_range > 1
*/
struct dev_pagemap {
struct vmem_altmap altmap;
- struct range range;
struct percpu_ref *ref;
struct percpu_ref internal_ref;
struct completion done;
@@ -116,6 +117,11 @@ struct dev_pagemap {
unsigned int flags;
const struct dev_pagemap_ops *ops;
void *owner;
+ int nr_range;
+ union {
+ struct range range;
+ struct range ranges[0];
+ };
};

static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 5b4521991621..e3065d6123f0 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -489,6 +489,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
devmem->pagemap.range.start = res->start;
devmem->pagemap.range.end = res->end;
+ devmem->pagemap.nr_range = 1;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;

diff --git a/mm/memremap.c b/mm/memremap.c
index 9979891fec78..a7346638a09f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -77,15 +77,19 @@ static void pgmap_array_delete(struct range *range)
synchronize_rcu();
}

-static unsigned long pfn_first(struct dev_pagemap *pgmap)
+static unsigned long pfn_first(struct dev_pagemap *pgmap, int range_id)
{
- return PHYS_PFN(pgmap->range.start) +
- vmem_altmap_offset(pgmap_altmap(pgmap));
+ struct range *range = &pgmap->ranges[range_id];
+ unsigned long pfn = PHYS_PFN(range->start);
+
+ if (range_id)
+ return pfn;
+ return pfn + vmem_altmap_offset(pgmap_altmap(pgmap));
}

-static unsigned long pfn_end(struct dev_pagemap *pgmap)
+static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
{
- const struct range *range = &pgmap->range;
+ const struct range *range = &pgmap->ranges[range_id];

return (range->start + range_len(range)) >> PAGE_SHIFT;
}
@@ -117,8 +121,8 @@ bool pfn_zone_device_reserved(unsigned long pfn)
return ret;
}

-#define for_each_device_pfn(pfn, map) \
- for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))
+#define for_each_device_pfn(pfn, map, i) \
+ for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))

static void dev_pagemap_kill(struct dev_pagemap *pgmap)
{
@@ -144,20 +148,14 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)
pgmap->ref = NULL;
}

-void memunmap_pages(struct dev_pagemap *pgmap)
+static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
{
- struct range *range = &pgmap->range;
+ struct range *range = &pgmap->ranges[range_id];
struct page *first_page;
- unsigned long pfn;
int nid;

- dev_pagemap_kill(pgmap);
- for_each_device_pfn(pfn, pgmap)
- put_page(pfn_to_page(pfn));
- dev_pagemap_cleanup(pgmap);
-
/* make sure to access a memmap that was actually initialized */
- first_page = pfn_to_page(pfn_first(pgmap));
+ first_page = pfn_to_page(pfn_first(pgmap, range_id));

/* pages are dead and unused, undo the arch mapping */
nid = page_to_nid(first_page);
@@ -177,6 +175,22 @@ void memunmap_pages(struct dev_pagemap *pgmap)

untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
pgmap_array_delete(range);
+}
+
+void memunmap_pages(struct dev_pagemap *pgmap)
+{
+ unsigned long pfn;
+ int i;
+
+ dev_pagemap_kill(pgmap);
+ for (i = 0; i < pgmap->nr_range; i++)
+ for_each_device_pfn(pfn, pgmap, i)
+ put_page(pfn_to_page(pfn));
+ dev_pagemap_cleanup(pgmap);
+
+ for (i = 0; i < pgmap->nr_range; i++)
+ pageunmap_range(pgmap, i);
+
WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
devmap_managed_enable_put();
}
@@ -195,96 +209,29 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref)
complete(&pgmap->done);
}

-/*
- * Not device managed version of dev_memremap_pages, undone by
- * memunmap_pages(). Please use dev_memremap_pages if you have a struct
- * device available.
- */
-void *memremap_pages(struct dev_pagemap *pgmap, int nid)
+static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
+ int range_id, int nid)
{
- struct range *range = &pgmap->range;
+ struct range *range = &pgmap->ranges[range_id];
struct dev_pagemap *conflict_pgmap;
- struct mhp_params params = {
- /*
- * We do not want any optional features only our own memmap
- */
- .altmap = pgmap_altmap(pgmap),
- .pgprot = PAGE_KERNEL,
- };
int error, is_ram;
- bool need_devmap_managed = true;

- switch (pgmap->type) {
- case MEMORY_DEVICE_PRIVATE:
- if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
- WARN(1, "Device private memory not supported\n");
- return ERR_PTR(-EINVAL);
- }
- if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
- WARN(1, "Missing migrate_to_ram method\n");
- return ERR_PTR(-EINVAL);
- }
- if (!pgmap->owner) {
- WARN(1, "Missing owner\n");
- return ERR_PTR(-EINVAL);
- }
- break;
- case MEMORY_DEVICE_FS_DAX:
- if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
- IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
- WARN(1, "File system DAX not supported\n");
- return ERR_PTR(-EINVAL);
- }
- break;
- case MEMORY_DEVICE_DEVDAX:
- need_devmap_managed = false;
- break;
- case MEMORY_DEVICE_PCI_P2PDMA:
- params.pgprot = pgprot_noncached(params.pgprot);
- need_devmap_managed = false;
- break;
- default:
- WARN(1, "Invalid pgmap type %d\n", pgmap->type);
- break;
- }
-
- if (!pgmap->ref) {
- if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
- return ERR_PTR(-EINVAL);
-
- init_completion(&pgmap->done);
- error = percpu_ref_init(&pgmap->internal_ref,
- dev_pagemap_percpu_release, 0, GFP_KERNEL);
- if (error)
- return ERR_PTR(error);
- pgmap->ref = &pgmap->internal_ref;
- } else {
- if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
- WARN(1, "Missing reference count teardown definition\n");
- return ERR_PTR(-EINVAL);
- }
- }
-
- if (need_devmap_managed) {
- error = devmap_managed_enable_get(pgmap);
- if (error)
- return ERR_PTR(error);
- }
+ if (WARN_ONCE(pgmap_altmap(pgmap) && range_id > 0,
+ "altmap not supported for multiple ranges\n"))
+ return -EINVAL;

conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
- error = -ENOMEM;
- goto err_array;
+ return -ENOMEM;
}

conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
- error = -ENOMEM;
- goto err_array;
+ return -ENOMEM;
}

is_ram = region_intersects(range->start, range_len(range),
@@ -294,19 +241,18 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n",
is_ram == REGION_MIXED ? "mixed" : "ram",
range->start, range->end);
- error = -ENXIO;
- goto err_array;
+ return -ENXIO;
}

error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start),
PHYS_PFN(range->end), pgmap, GFP_KERNEL));
if (error)
- goto err_array;
+ return error;

if (nid < 0)
nid = numa_mem_id();

- error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(range->start), 0,
+ error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
range_len(range));
if (error)
goto err_pfn_remap;
@@ -326,7 +272,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
error = add_pages(nid, PHYS_PFN(range->start),
- PHYS_PFN(range_len(range)), &params);
+ PHYS_PFN(range_len(range)), params);
} else {
error = kasan_add_zero_shadow(__va(range->start), range_len(range));
if (error) {
@@ -335,7 +281,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}

error = arch_add_memory(nid, range->start, range_len(range),
- &params);
+ params);
}

if (!error) {
@@ -343,7 +289,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)

zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
move_pfn_range_to_zone(zone, PHYS_PFN(range->start),
- PHYS_PFN(range_len(range)), params.altmap);
+ PHYS_PFN(range_len(range)), params->altmap);
}

mem_hotplug_done();
@@ -357,20 +303,116 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
PHYS_PFN(range->start),
PHYS_PFN(range_len(range)), pgmap);
- percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
- return __va(range->start);
+ percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+ - pfn_first(pgmap, range_id));
+ return 0;

- err_add_memory:
+err_add_memory:
kasan_remove_zero_shadow(__va(range->start), range_len(range));
- err_kasan:
+err_kasan:
untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
- err_pfn_remap:
+err_pfn_remap:
pgmap_array_delete(range);
- err_array:
- dev_pagemap_kill(pgmap);
- dev_pagemap_cleanup(pgmap);
- devmap_managed_enable_put();
- return ERR_PTR(error);
+ return error;
+}
+
+
+/*
+ * Not device managed version of dev_memremap_pages, undone by
+ * memunmap_pages(). Please use dev_memremap_pages if you have a struct
+ * device available.
+ */
+void *memremap_pages(struct dev_pagemap *pgmap, int nid)
+{
+ struct mhp_params params = {
+ .altmap = pgmap_altmap(pgmap),
+ .pgprot = PAGE_KERNEL,
+ };
+ const int nr_range = pgmap->nr_range;
+ bool need_devmap_managed = true;
+ int error, i;
+
+ if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
+ return ERR_PTR(-EINVAL);
+
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
+ WARN(1, "Device private memory not supported\n");
+ return ERR_PTR(-EINVAL);
+ }
+ if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
+ WARN(1, "Missing migrate_to_ram method\n");
+ return ERR_PTR(-EINVAL);
+ }
+ if (!pgmap->owner) {
+ WARN(1, "Missing owner\n");
+ return ERR_PTR(-EINVAL);
+ }
+ break;
+ case MEMORY_DEVICE_FS_DAX:
+ if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
+ IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+ WARN(1, "File system DAX not supported\n");
+ return ERR_PTR(-EINVAL);
+ }
+ break;
+ case MEMORY_DEVICE_DEVDAX:
+ need_devmap_managed = false;
+ break;
+ case MEMORY_DEVICE_PCI_P2PDMA:
+ params.pgprot = pgprot_noncached(params.pgprot);
+ need_devmap_managed = false;
+ break;
+ default:
+ WARN(1, "Invalid pgmap type %d\n", pgmap->type);
+ break;
+ }
+
+ if (!pgmap->ref) {
+ if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
+ return ERR_PTR(-EINVAL);
+
+ init_completion(&pgmap->done);
+ error = percpu_ref_init(&pgmap->internal_ref,
+ dev_pagemap_percpu_release, 0, GFP_KERNEL);
+ if (error)
+ return ERR_PTR(error);
+ pgmap->ref = &pgmap->internal_ref;
+ } else {
+ if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
+ WARN(1, "Missing reference count teardown definition\n");
+ return ERR_PTR(-EINVAL);
+ }
+ }
+
+ if (need_devmap_managed) {
+ error = devmap_managed_enable_get(pgmap);
+ if (error)
+ return ERR_PTR(error);
+ }
+
+ /*
+ * Clear the pgmap nr_range as it will be incremented for each
+ * successfully processed range. This communicates how many
+ * regions to unwind in the abort case.
+ */
+ pgmap->nr_range = 0;
+ error = 0;
+ for (i = 0; i < nr_range; i++) {
+ error = pagemap_range(pgmap, &params, i, nid);
+ if (error)
+ break;
+ pgmap->nr_range++;
+ }
+
+ if (i < nr_range) {
+ memunmap_pages(pgmap);
+ pgmap->nr_range = nr_range;
+ return ERR_PTR(error);
+ }
+
+ return __va(pgmap->ranges[0].start);
}
EXPORT_SYMBOL_GPL(memremap_pages);


2020-08-03 05:23:11

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 22/23] dax/hmem: Introduce dax_hmem.region_idle parameter

From: Joao Martins <[email protected]>

Introduce a new module parameter for dax_hmem which
initializes all region devices as free, rather than allocating
a pagemap for the region by default.

All hmem devices created with dax_hmem.region_idle=1 will have full
available size for creating dynamic dax devices.

Signed-off-by: Joao Martins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/hmem/hmem.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1a3347bb6143..1bf040dbc834 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -5,6 +5,9 @@
#include <linux/pfn_t.h>
#include "../bus.h"

+static bool region_idle;
+module_param_named(region_idle, region_idle, bool, 0644);
+
static int dax_hmem_probe(struct platform_device *pdev)
{
struct device *dev = &pdev->dev;
@@ -30,7 +33,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
data = (struct dev_dax_data) {
.dax_region = dax_region,
.id = -1,
- .size = resource_size(res),
+ .size = region_idle ? 0 : resource_size(res),
};
dev_dax = devm_create_dev_dax(&data);
if (IS_ERR(dev_dax))

2020-08-03 05:23:34

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 20/23] device-dax: Make align a per-device property

From: Joao Martins <[email protected]>

Introduce @align to struct dev_dax.

When creating a new device, we still initialize to the default
dax_region @align. Child devices belonging to a region may wish
to keep a different alignment property instead of a global
region-defined one.

Signed-off-by: Joao Martins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 1 +
drivers/dax/dax-private.h | 3 +++
drivers/dax/device.c | 37 +++++++++++++++----------------------
3 files changed, 19 insertions(+), 22 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 2779c65dc7c0..9edfdf83408e 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1215,6 +1215,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)

dev_dax->dax_dev = dax_dev;
dev_dax->target_node = dax_region->target_node;
+ dev_dax->align = dax_region->align;
ida_init(&dev_dax->ida);
kref_get(&dax_region->kref);

diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 13780f62b95e..5fd3a26cfcea 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -62,6 +62,7 @@ struct dax_mapping {
struct dev_dax {
struct dax_region *region;
struct dax_device *dax_dev;
+ unsigned int align;
int target_node;
int id;
struct ida ida;
@@ -84,4 +85,6 @@ static inline struct dax_mapping *to_dax_mapping(struct device *dev)
{
return container_of(dev, struct dax_mapping, dev);
}
+
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
#endif
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bfc5c83e3b0..d2b1892cb1b2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -17,7 +17,6 @@
static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
const char *func)
{
- struct dax_region *dax_region = dev_dax->region;
struct device *dev = &dev_dax->dev;
unsigned long mask;

@@ -32,7 +31,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
return -EINVAL;
}

- mask = dax_region->align - 1;
+ mask = dev_dax->align - 1;
if (vma->vm_start & mask || vma->vm_end & mask) {
dev_info_ratelimited(dev,
"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
@@ -78,21 +77,19 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
struct vm_fault *vmf, pfn_t *pfn)
{
struct device *dev = &dev_dax->dev;
- struct dax_region *dax_region;
phys_addr_t phys;
unsigned int fault_size = PAGE_SIZE;

if (check_vma(dev_dax, vmf->vma, __func__))
return VM_FAULT_SIGBUS;

- dax_region = dev_dax->region;
- if (dax_region->align > PAGE_SIZE) {
+ if (dev_dax->align > PAGE_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
- dax_region->align, fault_size);
+ dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}

- if (fault_size != dax_region->align)
+ if (fault_size != dev_dax->align)
return VM_FAULT_SIGBUS;

phys = dax_pgoff_to_phys(dev_dax, vmf->pgoff, PAGE_SIZE);
@@ -120,15 +117,15 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;

dax_region = dev_dax->region;
- if (dax_region->align > PMD_SIZE) {
+ if (dev_dax->align > PMD_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
- dax_region->align, fault_size);
+ dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}

- if (fault_size < dax_region->align)
+ if (fault_size < dev_dax->align)
return VM_FAULT_SIGBUS;
- else if (fault_size > dax_region->align)
+ else if (fault_size > dev_dax->align)
return VM_FAULT_FALLBACK;

/* if we are outside of the VMA */
@@ -164,15 +161,15 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;

dax_region = dev_dax->region;
- if (dax_region->align > PUD_SIZE) {
+ if (dev_dax->align > PUD_SIZE) {
dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
- dax_region->align, fault_size);
+ dev_dax->align, fault_size);
return VM_FAULT_SIGBUS;
}

- if (fault_size < dax_region->align)
+ if (fault_size < dev_dax->align)
return VM_FAULT_SIGBUS;
- else if (fault_size > dax_region->align)
+ else if (fault_size > dev_dax->align)
return VM_FAULT_FALLBACK;

/* if we are outside of the VMA */
@@ -267,9 +264,8 @@ static int dev_dax_split(struct vm_area_struct *vma, unsigned long addr)
{
struct file *filp = vma->vm_file;
struct dev_dax *dev_dax = filp->private_data;
- struct dax_region *dax_region = dev_dax->region;

- if (!IS_ALIGNED(addr, dax_region->align))
+ if (!IS_ALIGNED(addr, dev_dax->align))
return -EINVAL;
return 0;
}
@@ -278,9 +274,8 @@ static unsigned long dev_dax_pagesize(struct vm_area_struct *vma)
{
struct file *filp = vma->vm_file;
struct dev_dax *dev_dax = filp->private_data;
- struct dax_region *dax_region = dev_dax->region;

- return dax_region->align;
+ return dev_dax->align;
}

static const struct vm_operations_struct dax_vm_ops = {
@@ -319,13 +314,11 @@ static unsigned long dax_get_unmapped_area(struct file *filp,
{
unsigned long off, off_end, off_align, len_align, addr_align, align;
struct dev_dax *dev_dax = filp ? filp->private_data : NULL;
- struct dax_region *dax_region;

if (!dev_dax || addr)
goto out;

- dax_region = dev_dax->region;
- align = dax_region->align;
+ align = dev_dax->align;
off = pgoff << PAGE_SHIFT;
off_end = off + len;
off_align = round_up(off, align);

2020-08-03 05:23:48

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 19/23] device-dax: Introduce 'mapping' devices

In support of interrogating the physical address layout of a device with
dis-contiguous ranges, introduce a sysfs directory with 'start', 'end',
and 'page_offset' attributes. The alternative is trying to parse
/proc/iomem, and that file will not reflect the extent layout until the
device is enabled.

Cc: Vishal Verma <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 191 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dax/dax-private.h | 14 +++
2 files changed, 203 insertions(+), 2 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 8dd82ea9d53d..2779c65dc7c0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -579,6 +579,167 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
}
EXPORT_SYMBOL_GPL(alloc_dax_region);

+static void dax_mapping_release(struct device *dev)
+{
+ struct dax_mapping *mapping = to_dax_mapping(dev);
+ struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+
+ ida_free(&dev_dax->ida, mapping->id);
+ kfree(mapping);
+}
+
+static void unregister_dax_mapping(void *data)
+{
+ struct device *dev = data;
+ struct dax_mapping *mapping = to_dax_mapping(dev);
+ struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+ struct dax_region *dax_region = dev_dax->region;
+
+ dev_dbg(dev, "%s\n", __func__);
+
+ device_lock_assert(dax_region->dev);
+
+ dev_dax->ranges[mapping->range_id].mapping = NULL;
+ mapping->range_id = -1;
+
+ device_del(dev);
+ put_device(dev);
+}
+
+static struct dev_dax_range *get_dax_range(struct device *dev)
+{
+ struct dax_mapping *mapping = to_dax_mapping(dev);
+ struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+ struct dax_region *dax_region = dev_dax->region;
+
+ device_lock(dax_region->dev);
+ if (mapping->range_id < 0) {
+ device_unlock(dax_region->dev);
+ return NULL;
+ }
+
+ return &dev_dax->ranges[mapping->range_id];
+}
+
+static void put_dax_range(struct dev_dax_range *dax_range)
+{
+ struct dax_mapping *mapping = dax_range->mapping;
+ struct dev_dax *dev_dax = to_dev_dax(mapping->dev.parent);
+ struct dax_region *dax_region = dev_dax->region;
+
+ device_unlock(dax_region->dev);
+}
+
+static ssize_t start_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax_range *dax_range;
+ ssize_t rc;
+
+ dax_range = get_dax_range(dev);
+ if (!dax_range)
+ return -ENXIO;
+ rc = sprintf(buf, "%#llx\n", dax_range->range.start);
+ put_dax_range(dax_range);
+
+ return rc;
+}
+static DEVICE_ATTR(start, 0400, start_show, NULL);
+
+static ssize_t end_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax_range *dax_range;
+ ssize_t rc;
+
+ dax_range = get_dax_range(dev);
+ if (!dax_range)
+ return -ENXIO;
+ rc = sprintf(buf, "%#llx\n", dax_range->range.end);
+ put_dax_range(dax_range);
+
+ return rc;
+}
+static DEVICE_ATTR(end, 0400, end_show, NULL);
+
+static ssize_t pgoff_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_dax_range *dax_range;
+ ssize_t rc;
+
+ dax_range = get_dax_range(dev);
+ if (!dax_range)
+ return -ENXIO;
+ rc = sprintf(buf, "%#lx\n", dax_range->pgoff);
+ put_dax_range(dax_range);
+
+ return rc;
+}
+static DEVICE_ATTR(page_offset, 0400, pgoff_show, NULL);
+
+static struct attribute *dax_mapping_attributes[] = {
+ &dev_attr_start.attr,
+ &dev_attr_end.attr,
+ &dev_attr_page_offset.attr,
+ NULL,
+};
+
+static const struct attribute_group dax_mapping_attribute_group = {
+ .attrs = dax_mapping_attributes,
+};
+
+static const struct attribute_group *dax_mapping_attribute_groups[] = {
+ &dax_mapping_attribute_group,
+ NULL,
+};
+
+static struct device_type dax_mapping_type = {
+ .release = dax_mapping_release,
+ .groups = dax_mapping_attribute_groups,
+};
+
+static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
+{
+ struct dax_region *dax_region = dev_dax->region;
+ struct dax_mapping *mapping;
+ struct device *dev;
+ int rc;
+
+ device_lock_assert(dax_region->dev);
+
+ if (dev_WARN_ONCE(&dev_dax->dev, !dax_region->dev->driver,
+ "region disabled\n"))
+ return -ENXIO;
+
+ mapping = kzalloc(sizeof(*mapping), GFP_KERNEL);
+ if (!mapping)
+ return -ENOMEM;
+ mapping->range_id = range_id;
+ mapping->id = ida_alloc(&dev_dax->ida, GFP_KERNEL);
+ if (mapping->id < 0) {
+ kfree(mapping);
+ return -ENOMEM;
+ }
+ dev_dax->ranges[range_id].mapping = mapping;
+ dev = &mapping->dev;
+ device_initialize(dev);
+ dev->parent = &dev_dax->dev;
+ dev->type = &dax_mapping_type;
+ dev_set_name(dev, "mapping%d", mapping->id);
+ rc = device_add(dev);
+ if (rc) {
+ put_device(dev);
+ return rc;
+ }
+
+ rc = devm_add_action_or_reset(dax_region->dev, unregister_dax_mapping,
+ dev);
+ if (rc)
+ return rc;
+ return 0;
+}
+
static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
resource_size_t size)
{
@@ -588,7 +749,7 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
struct dev_dax_range *ranges;
unsigned long pgoff = 0;
struct resource *alloc;
- int i;
+ int i, rc;

device_lock_assert(dax_region->dev);

@@ -630,6 +791,22 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,

dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
&alloc->start, &alloc->end);
+ /*
+ * A dev_dax instance must be registered before mapping device
+ * children can be added. Defer to devm_create_dev_dax() to add
+ * the initial mapping device.
+ */
+ if (!device_is_registered(&dev_dax->dev))
+ return 0;
+
+ rc = devm_register_dax_mapping(dev_dax, dev_dax->nr_range - 1);
+ if (rc) {
+ dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
+ &alloc->start, &alloc->end);
+ dev_dax->nr_range--;
+ __release_region(res, alloc->start, resource_size(alloc));
+ return rc;
+ }

return 0;
}
@@ -698,11 +875,14 @@ static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)

for (i = dev_dax->nr_range - 1; i >= 0; i--) {
struct range *range = &dev_dax->ranges[i].range;
+ struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
struct resource *adjust = NULL, *res;
resource_size_t shrink;

shrink = min_t(u64, to_shrink, range_len(range));
if (shrink >= range_len(range)) {
+ devm_release_action(dax_region->dev,
+ unregister_dax_mapping, &mapping->dev);
__release_region(&dax_region->res, range->start,
range_len(range));
dev_dax->nr_range--;
@@ -1033,9 +1213,9 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
/* a device_dax instance is dead while the driver is not attached */
kill_dax(dax_dev);

- /* from here on we're committed to teardown via dev_dax_release() */
dev_dax->dax_dev = dax_dev;
dev_dax->target_node = dax_region->target_node;
+ ida_init(&dev_dax->ida);
kref_get(&dax_region->kref);

inode = dax_inode(dax_dev);
@@ -1058,6 +1238,13 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
if (rc)
return ERR_PTR(rc);

+ /* register mapping device for the initial allocation range */
+ if (dev_dax->nr_range && range_len(&dev_dax->ranges[0].range)) {
+ rc = devm_register_dax_mapping(dev_dax, 0);
+ if (rc)
+ return ERR_PTR(rc);
+ }
+
return dev_dax;

err_alloc_dax:
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index f863287107fd..13780f62b95e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -40,6 +40,12 @@ struct dax_region {
struct device *youngest;
};

+struct dax_mapping {
+ struct device dev;
+ int range_id;
+ int id;
+};
+
/**
* struct dev_dax - instance data for a subdivision of a dax region, and
* data while the device is activated in the driver.
@@ -47,6 +53,7 @@ struct dax_region {
* @dax_dev - core dax functionality
* @target_node: effective numa node if dev_dax memory range is onlined
* @id: ida allocated id
+ * @ida: mapping id allocator
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
* @nr_range: size of @ranges
@@ -57,12 +64,14 @@ struct dev_dax {
struct dax_device *dax_dev;
int target_node;
int id;
+ struct ida ida;
struct device dev;
struct dev_pagemap *pgmap;
int nr_range;
struct dev_dax_range {
unsigned long pgoff;
struct range range;
+ struct dax_mapping *mapping;
} *ranges;
};

@@ -70,4 +79,9 @@ static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
}
+
+static inline struct dax_mapping *to_dax_mapping(struct device *dev)
+{
+ return container_of(dev, struct dax_mapping, dev);
+}
#endif

2020-08-03 05:24:04

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 23/23] device-dax: Add a range mapping allocation attribute

From: Joao Martins <[email protected]>

Add a sysfs attribute which denotes a range from the dax region
to be allocated. It's an write only @mapping sysfs attribute in
the format of '<start>-<end>' to allocate a range. @start and
@end use hexadecimal values and the @pgoff is implicitly ordered
wrt to previous writes to @mapping sysfs e.g. a write of a range
of length 1G the pgoff is 0..1G(-4K), a second write will use
@pgoff for 1G+4K..<size>.

This range mapping interface is useful for:

1) Application which want to implement its own allocation logic,
and thus pick the desired ranges from dax_region.

2) For use cases like VMM fast restart[0] where after kexec we
want to the same gpa<->phys mappings (as originally created
before kexec).

[0] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Joao Martins <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index b984213c315f..092112bba6ed 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -1040,6 +1040,67 @@ static ssize_t size_store(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RW(size);

+static ssize_t range_parse(const char *opt, size_t len, struct range *range)
+{
+ unsigned long long addr = 0;
+ char *start, *end, *str;
+ ssize_t rc = EINVAL;
+
+ str = kstrdup(opt, GFP_KERNEL);
+ if (!str)
+ return rc;
+
+ end = str;
+ start = strsep(&end, "-");
+ if (!start || !end)
+ goto err;
+
+ rc = kstrtoull(start, 16, &addr);
+ if (rc)
+ goto err;
+ range->start = addr;
+
+ rc = kstrtoull(end, 16, &addr);
+ if (rc)
+ goto err;
+ range->end = addr;
+
+err:
+ kfree(str);
+ return rc;
+}
+
+static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
+ const char *buf, size_t len)
+{
+ struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+ size_t to_alloc;
+ struct range r;
+ ssize_t rc;
+
+ rc = range_parse(buf, len, &r);
+ if (rc)
+ return rc;
+
+ rc = -ENXIO;
+ device_lock(dax_region->dev);
+ if (!dax_region->dev->driver) {
+ device_unlock(dax_region->dev);
+ return rc;
+ }
+ device_lock(dev);
+
+ to_alloc = range_len(&r);
+ if (alloc_is_aligned(dev_dax, to_alloc))
+ rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+ device_unlock(dev);
+ device_unlock(dax_region->dev);
+
+ return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_WO(mapping);
+
static ssize_t align_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1172,6 +1233,8 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
return 0;
if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
return 0;
+ if (a == &dev_attr_mapping.attr && is_static(dax_region))
+ return 0;
if ((a == &dev_attr_align.attr ||
a == &dev_attr_size.attr) && is_static(dax_region))
return 0444;
@@ -1181,6 +1244,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
static struct attribute *dev_dax_attributes[] = {
&dev_attr_modalias.attr,
&dev_attr_size.attr,
+ &dev_attr_mapping.attr,
&dev_attr_target_node.attr,
&dev_attr_align.attr,
&dev_attr_resource.attr,

2020-08-03 05:33:25

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 18/23] device-dax: Add dis-contiguous resource support

Break the requirement that device-dax instances are physically
contiguous. With this constraint removed it allows fragmented available
capacity to be fully allocated.

This capability is useful to mitigate the "noisy neighbor" problem with
memory-side-cache management for virtual machines, or any other scenario
where a platform address boundary also designates a performance
boundary. For example a direct mapped memory side cache might rotate
cache colors at 1GB boundaries. With dis-contiguous allocations a
device-dax instance could be configured to contain only 1 cache color.

It also satisfies Joao's use case (see link) for partitioning memory for
exclusive guest access. It allows for a future potential mode where the
host kernel need not allocate 'struct page' capacity up-front.

Link: https://lore.kernel.org/lkml/[email protected]/
Reported-by: Joao Martins <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
drivers/dax/bus.c | 230 +++++++++++++++++++++++++++++++---------
drivers/dax/dax-private.h | 9 +-
drivers/dax/device.c | 55 ++++++----
drivers/dax/kmem.c | 132 +++++++++++++++--------
tools/testing/nvdimm/dax-dev.c | 20 ++-
5 files changed, 319 insertions(+), 127 deletions(-)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 00fa73a8dfb4..8dd82ea9d53d 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -136,15 +136,27 @@ static bool is_static(struct dax_region *dax_region)
return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
}

+static u64 dev_dax_size(struct dev_dax *dev_dax)
+{
+ u64 size = 0;
+ int i;
+
+ device_lock_assert(&dev_dax->dev);
+
+ for (i = 0; i < dev_dax->nr_range; i++)
+ size += range_len(&dev_dax->ranges[i].range);
+
+ return size;
+}
+
static int dax_bus_probe(struct device *dev)
{
struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
struct dev_dax *dev_dax = to_dev_dax(dev);
struct dax_region *dax_region = dev_dax->region;
- struct range *range = &dev_dax->range;
int rc;

- if (range_len(range) == 0 || dev_dax->id < 0)
+ if (dev_dax_size(dev_dax) == 0 || dev_dax->id < 0)
return -ENXIO;

rc = dax_drv->probe(dev_dax);
@@ -354,15 +366,19 @@ void kill_dev_dax(struct dev_dax *dev_dax)
}
EXPORT_SYMBOL_GPL(kill_dev_dax);

-static void free_dev_dax_range(struct dev_dax *dev_dax)
+static void free_dev_dax_ranges(struct dev_dax *dev_dax)
{
struct dax_region *dax_region = dev_dax->region;
- struct range *range = &dev_dax->range;
+ int i;

device_lock_assert(dax_region->dev);
- if (range_len(range))
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
__release_region(&dax_region->res, range->start,
range_len(range));
+ }
+ dev_dax->nr_range = 0;
}

static void unregister_dev_dax(void *dev)
@@ -372,7 +388,7 @@ static void unregister_dev_dax(void *dev)
dev_dbg(dev, "%s\n", __func__);

kill_dev_dax(dev_dax);
- free_dev_dax_range(dev_dax);
+ free_dev_dax_ranges(dev_dax);
device_del(dev);
put_device(dev);
}
@@ -423,7 +439,7 @@ static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
device_lock(dev);
device_lock(victim);
dev_dax = to_dev_dax(victim);
- if (victim->driver || range_len(&dev_dax->range))
+ if (victim->driver || dev_dax_size(dev_dax))
rc = -EBUSY;
else {
/*
@@ -569,51 +585,83 @@ static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
struct dax_region *dax_region = dev_dax->region;
struct resource *res = &dax_region->res;
struct device *dev = &dev_dax->dev;
+ struct dev_dax_range *ranges;
+ unsigned long pgoff = 0;
struct resource *alloc;
+ int i;

device_lock_assert(dax_region->dev);

/* handle the seed alloc special case */
if (!size) {
- dev_dax->range = (struct range) {
- .start = res->start,
- .end = res->start - 1,
- };
+ if (dev_WARN_ONCE(dev, dev_dax->nr_range,
+ "0-size allocation must be first\n"))
+ return -EBUSY;
+ /* nr_range == 0 is elsewhere special cased as 0-size device */
return 0;
}

+ ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
+ * (dev_dax->nr_range + 1), GFP_KERNEL);
+ if (!ranges)
+ return -ENOMEM;
+
alloc = __request_region(res, start, size, dev_name(dev), 0);
- if (!alloc)
+ if (!alloc && !dev_dax->nr_range) {
+ /*
+ * If we adjusted an existing @ranges leave it alone,
+ * but if this was an empty set of ranges nothing else
+ * will release @ranges, so do it now.
+ */
+ kfree(ranges);
return -ENOMEM;
+ }

- dev_dax->range = (struct range) {
- .start = alloc->start,
- .end = alloc->end,
+ for (i = 0; i < dev_dax->nr_range; i++)
+ pgoff += PHYS_PFN(range_len(&ranges[i].range));
+ dev_dax->ranges = ranges;
+ ranges[dev_dax->nr_range++] = (struct dev_dax_range) {
+ .pgoff = pgoff,
+ .range = {
+ .start = alloc->start,
+ .end = alloc->end,
+ },
};

+ dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
+ &alloc->start, &alloc->end);
+
return 0;
}

static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
{
+ int last_range = dev_dax->nr_range - 1;
+ struct dev_dax_range *dax_range = &dev_dax->ranges[last_range];
struct dax_region *dax_region = dev_dax->region;
- struct range *range = &dev_dax->range;
- int rc = 0;
+ bool is_shrink = resource_size(res) > size;
+ struct range *range = &dax_range->range;
+ struct device *dev = &dev_dax->dev;
+ int rc;

device_lock_assert(dax_region->dev);

- if (size)
- rc = adjust_resource(res, range->start, size);
- else
- __release_region(&dax_region->res, range->start, range_len(range));
+ if (dev_WARN_ONCE(dev, !size, "deletion is handled by dev_dax_shrink\n"))
+ return -EINVAL;
+
+ rc = adjust_resource(res, range->start, size);
if (rc)
return rc;

- dev_dax->range = (struct range) {
+ *range = (struct range) {
.start = range->start,
.end = range->start + size - 1,
};

+ dev_dbg(dev, "%s range[%d]: %#llx:%#llx\n", is_shrink ? "shrink" : "extend",
+ last_range, (unsigned long long) range->start,
+ (unsigned long long) range->end);
+
return 0;
}

@@ -621,7 +669,11 @@ static ssize_t size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
- unsigned long long size = range_len(&dev_dax->range);
+ unsigned long long size;
+
+ device_lock(dev);
+ size = dev_dax_size(dev_dax);
+ device_unlock(dev);

return sprintf(buf, "%llu\n", size);
}
@@ -639,32 +691,82 @@ static bool alloc_is_aligned(struct dax_region *dax_region,

static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
{
+ resource_size_t to_shrink = dev_dax_size(dev_dax) - size;
struct dax_region *dax_region = dev_dax->region;
- struct range *range = &dev_dax->range;
- struct resource *res, *adjust = NULL;
struct device *dev = &dev_dax->dev;
-
- for_each_dax_region_resource(dax_region, res)
- if (strcmp(res->name, dev_name(dev)) == 0
- && res->start == range->start) {
- adjust = res;
- break;
+ int i;
+
+ for (i = dev_dax->nr_range - 1; i >= 0; i--) {
+ struct range *range = &dev_dax->ranges[i].range;
+ struct resource *adjust = NULL, *res;
+ resource_size_t shrink;
+
+ shrink = min_t(u64, to_shrink, range_len(range));
+ if (shrink >= range_len(range)) {
+ __release_region(&dax_region->res, range->start,
+ range_len(range));
+ dev_dax->nr_range--;
+ dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
+ (unsigned long long) range->start,
+ (unsigned long long) range->end);
+ to_shrink -= shrink;
+ if (!to_shrink)
+ break;
+ continue;
}

- if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n"))
- return -ENXIO;
- return adjust_dev_dax_range(dev_dax, adjust, size);
+ for_each_dax_region_resource(dax_region, res)
+ if (strcmp(res->name, dev_name(dev)) == 0
+ && res->start == range->start) {
+ adjust = res;
+ break;
+ }
+
+ if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
+ "failed to find matching resource\n"))
+ return -ENXIO;
+ return adjust_dev_dax_range(dev_dax, adjust, range_len(range)
+ - shrink);
+ }
+ return 0;
+}
+
+/*
+ * Only allow adjustments that preserve the relative pgoff of existing
+ * allocations. I.e. the dev_dax->ranges array is ordered by increasing pgoff.
+ */
+static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
+{
+ struct dev_dax_range *last;
+ int i;
+
+ if (dev_dax->nr_range == 0)
+ return false;
+ if (strcmp(res->name, dev_name(&dev_dax->dev)) != 0)
+ return false;
+ last = &dev_dax->ranges[dev_dax->nr_range - 1];
+ if (last->range.start != res->start || last->range.end != res->end)
+ return false;
+ for (i = 0; i < dev_dax->nr_range - 1; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+
+ if (dax_range->pgoff > last->pgoff)
+ return false;
+ }
+
+ return true;
}

static ssize_t dev_dax_resize(struct dax_region *dax_region,
struct dev_dax *dev_dax, resource_size_t size)
{
resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
- resource_size_t dev_size = range_len(&dev_dax->range);
+ resource_size_t dev_size = dev_dax_size(dev_dax);
struct resource *region_res = &dax_region->res;
struct device *dev = &dev_dax->dev;
- const char *name = dev_name(dev);
struct resource *res, *first;
+ resource_size_t alloc = 0;
+ int rc;

if (dev->driver)
return -EBUSY;
@@ -685,35 +787,47 @@ static ssize_t dev_dax_resize(struct dax_region *dax_region,
* may involve adjusting the end of an existing resource, or
* allocating a new resource.
*/
+retry:
first = region_res->child;
if (!first)
return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
- for (res = first; to_alloc && res; res = res->sibling) {
+
+ rc = -ENOSPC;
+ for (res = first; res; res = res->sibling) {
struct resource *next = res->sibling;
- resource_size_t free;

/* space at the beginning of the region */
- free = 0;
- if (res == first && res->start > dax_region->res.start)
- free = res->start - dax_region->res.start;
- if (free >= to_alloc && dev_size == 0)
- return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+ if (res == first && res->start > dax_region->res.start) {
+ alloc = min(res->start - dax_region->res.start, to_alloc);
+ rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
+ break;
+ }

- free = 0;
+ alloc = 0;
/* space between allocations */
if (next && next->start > res->end + 1)
- free = next->start - res->end + 1;
+ alloc = min(next->start - (res->end + 1), to_alloc);

/* space at the end of the region */
- if (free < to_alloc && !next && res->end < region_res->end)
- free = region_res->end - res->end;
+ if (!alloc && !next && res->end < region_res->end)
+ alloc = min(region_res->end - res->end, to_alloc);

- if (free >= to_alloc && strcmp(name, res->name) == 0)
- return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc);
- else if (free >= to_alloc && dev_size == 0)
- return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc);
+ if (!alloc)
+ continue;
+
+ if (adjust_ok(dev_dax, res)) {
+ rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
+ break;
+ }
+ rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
+ break;
}
- return -ENOSPC;
+ if (rc)
+ return rc;
+ to_alloc -= alloc;
+ if (to_alloc)
+ goto retry;
+ return 0;
}

static ssize_t size_store(struct device *dev, struct device_attribute *attr,
@@ -767,8 +881,15 @@ static ssize_t resource_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct dev_dax *dev_dax = to_dev_dax(dev);
+ struct dax_region *dax_region = dev_dax->region;
+ unsigned long long start;
+
+ if (dev_dax->nr_range < 1)
+ start = dax_region->res.start;
+ else
+ start = dev_dax->ranges[0].range.start;

- return sprintf(buf, "%#llx\n", dev_dax->range.start);
+ return sprintf(buf, "%#llx\n", start);
}
static DEVICE_ATTR(resource, 0400, resource_show, NULL);

@@ -833,6 +954,7 @@ static void dev_dax_release(struct device *dev)
put_dax(dax_dev);
free_dev_dax_id(dev_dax);
dax_region_put(dax_region);
+ kfree(dev_dax->ranges);
kfree(dev_dax->pgmap);
kfree(dev_dax);
}
@@ -941,7 +1063,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
err_alloc_dax:
kfree(dev_dax->pgmap);
err_pgmap:
- free_dev_dax_range(dev_dax);
+ free_dev_dax_ranges(dev_dax);
err_range:
free_dev_dax_id(dev_dax);
err_id:
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 0cbb2ec81ca7..f863287107fd 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -49,7 +49,8 @@ struct dax_region {
* @id: ida allocated id
* @dev - device core
* @pgmap - pgmap for memmap setup / lifetime (driver owned)
- * @range: resource range for the instance
+ * @nr_range: size of @ranges
+ * @ranges: resource-span + pgoff tuples for the instance
*/
struct dev_dax {
struct dax_region *region;
@@ -58,7 +59,11 @@ struct dev_dax {
int id;
struct device dev;
struct dev_pagemap *pgmap;
- struct range range;
+ int nr_range;
+ struct dev_dax_range {
+ unsigned long pgoff;
+ struct range range;
+ } *ranges;
};

static inline struct dev_dax *to_dev_dax(struct device *dev)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index f3755df4ae29..2bfc5c83e3b0 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -55,15 +55,22 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
__weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
{
- struct range *range = &dev_dax->range;
- phys_addr_t phys;
-
- phys = pgoff * PAGE_SIZE + range->start;
- if (phys >= range->start && phys <= range->end) {
+ int i;
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t phys;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
if (phys + size - 1 <= range->end)
return phys;
+ break;
}
-
return -1;
}

@@ -395,30 +402,40 @@ static void dev_dax_kill(void *dev_dax)
int dev_dax_probe(struct dev_dax *dev_dax)
{
struct dax_device *dax_dev = dev_dax->dax_dev;
- struct range *range = &dev_dax->range;
struct device *dev = &dev_dax->dev;
struct dev_pagemap *pgmap;
struct inode *inode;
struct cdev *cdev;
void *addr;
- int rc;
-
- /* 1:1 map region resource range to device-dax instance range */
- if (!devm_request_mem_region(dev, range->start, range_len(range),
- dev_name(dev))) {
- dev_warn(dev, "could not reserve range: %#llx - %#llx\n",
- range->start, range->end);
- return -EBUSY;
- }
+ int rc, i;

pgmap = dev_dax->pgmap;
+ if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
+ "static pgmap / multi-range device conflict\n"))
+ return -EINVAL;
+
if (!pgmap) {
- pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
+ pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
+ * (dev_dax->nr_range - 1), GFP_KERNEL);
if (!pgmap)
return -ENOMEM;
- pgmap->range = *range;
- pgmap->nr_range = 1;
+ pgmap->nr_range = dev_dax->nr_range;
+ }
+
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range *range = &dev_dax->ranges[i].range;
+
+ if (!devm_request_mem_region(dev, range->start,
+ range_len(range), dev_name(dev))) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
+ i, range->start, range->end);
+ return -EBUSY;
+ }
+ /* don't update the range for static pgmap */
+ if (!dev_dax->pgmap)
+ pgmap->ranges[i] = *range;
}
+
pgmap->type = MEMORY_DEVICE_DEVDAX;
addr = devm_memremap_pages(dev, pgmap);
if (IS_ERR(addr))
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index b1be8a53b26a..7dcb2902e9b1 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -19,25 +19,28 @@ static const char *kmem_name;
/* Set if any memory will remain added when the driver will be unloaded. */
static bool any_hotremove_failed;

-static struct range dax_kmem_range(struct dev_dax *dev_dax)
+static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
{
- struct range range;
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;

/* memory-block align the hotplug range */
- range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes());
- range.end = ALIGN_DOWN(dev_dax->range.end + 1,
- memory_block_size_bytes()) - 1;
- return range;
+ r->start = ALIGN(range->start, memory_block_size_bytes());
+ r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+ if (r->start >= r->end) {
+ r->start = range->start;
+ r->end = range->end;
+ return -ENOSPC;
+ }
+ return 0;
}

int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
- struct range range = dax_kmem_range(dev_dax);
int numa_node = dev_dax->target_node;
struct device *dev = &dev_dax->dev;
- struct resource *res;
+ int i, mapped = 0;
char *res_name;
- int rc;

/*
* Ensure good NUMA information for the persistent memory.
@@ -55,32 +58,56 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
if (!res_name)
return -ENOMEM;

- res = request_mem_region(range.start, range_len(&range), res_name);
- if (!res) {
- dev_warn(dev, "could not reserve region [%#llx-%#llx]\n",
- range.start, range.end);
- kfree(res_name);
- return -EBUSY;
- }
-
- /*
- * Temporarily clear busy to allow add_memory_driver_managed()
- * to claim it.
- */
- res->flags &= ~IORESOURCE_BUSY;
-
- /*
- * Ensure that future kexec'd kernels will not treat this as RAM
- * automatically.
- */
- rc = add_memory_driver_managed(numa_node, res->start,
- resource_size(res), kmem_name);
-
- res->flags |= IORESOURCE_BUSY;
- if (rc) {
- release_mem_region(range.start, range_len(&range));
- kfree(res_name);
- return rc;
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct resource *res;
+ struct range range;
+ int rc;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc) {
+ dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
+ i, range.start, range.end);
+ continue;
+ }
+
+ res = request_mem_region(range.start, range_len(&range), res_name);
+ if (!res) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
+ i, range.start, range.end);
+ /*
+ * Once some memory has been onlined we can't
+ * assume that it can be un-onlined safely.
+ */
+ if (mapped)
+ continue;
+ kfree(res_name);
+ return -EBUSY;
+ }
+
+ /*
+ * Temporarily clear busy to allow
+ * add_memory_driver_managed() to claim it.
+ */
+ res->flags &= ~IORESOURCE_BUSY;
+
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = add_memory_driver_managed(numa_node, range.start,
+ range_len(&range), kmem_name);
+
+ res->flags |= IORESOURCE_BUSY;
+ if (rc) {
+ dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
+ i, range.start, range.end);
+ release_mem_region(range.start, range_len(&range));
+ if (mapped)
+ continue;
+ kfree(res_name);
+ return rc;
+ }
+ mapped++;
}

dev_set_drvdata(dev, res_name);
@@ -91,10 +118,9 @@ int dev_dax_kmem_probe(struct dev_dax *dev_dax)
#ifdef CONFIG_MEMORY_HOTREMOVE
static void dax_kmem_release(struct dev_dax *dev_dax)
{
- int rc;
+ int i, success = 0;
struct device *dev = &dev_dax->dev;
const char *res_name = dev_get_drvdata(dev);
- struct range range = dax_kmem_range(dev_dax);

/*
* We have one shot for removing memory, if some memory blocks were not
@@ -102,17 +128,31 @@ static void dax_kmem_release(struct dev_dax *dev_dax)
* there is no way to hotremove this memory until reboot because device
* unbind will proceed regardless of the remove_memory result.
*/
- rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
- if (rc == 0) {
- release_mem_region(range.start, range_len(&range));
- dev_set_drvdata(dev, NULL);
- kfree(res_name);
- return;
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+ int rc;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc)
+ continue;
+
+ rc = remove_memory(dev_dax->target_node, range.start,
+ range_len(&range));
+ if (rc == 0) {
+ release_mem_region(range.start, range_len(&range));
+ success++;
+ continue;
+ }
+ any_hotremove_failed = true;
+ dev_err(dev,
+ "mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n",
+ i, range.start, range.end);
}

- any_hotremove_failed = true;
- dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
- range.start, range.end);
+ if (success >= dev_dax->nr_range) {
+ kfree(res_name);
+ dev_set_drvdata(dev, NULL);
+ }
}
#else
static void dax_kmem_release(struct dev_dax *dev_dax)
diff --git a/tools/testing/nvdimm/dax-dev.c b/tools/testing/nvdimm/dax-dev.c
index 38d8e55c4a0d..fb342a8c98d3 100644
--- a/tools/testing/nvdimm/dax-dev.c
+++ b/tools/testing/nvdimm/dax-dev.c
@@ -9,11 +9,18 @@
phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
unsigned long size)
{
- struct range *range = &dev_dax->range;
- phys_addr_t addr;
+ int i;

- addr = pgoff * PAGE_SIZE + range->start;
- if (addr >= range->start && addr <= range->end) {
+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+ struct range *range = &dax_range->range;
+ unsigned long long pgoff_end;
+ phys_addr_t addr;
+
+ pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+ if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+ continue;
+ addr = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
if (addr + size - 1 <= range->end) {
if (get_nfit_res(addr)) {
struct page *page;
@@ -23,9 +30,10 @@ phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,

page = vmalloc_to_page((void *)addr);
return PFN_PHYS(page_to_pfn(page));
- } else
- return addr;
+ }
+ return addr;
}
+ break;
}
return -1;
}

2020-08-03 05:40:40

by Dan Williams

[permalink] [raw]
Subject: [PATCH v4 16/23] mm/memremap_pages: Convert to 'struct range'

The 'struct resource' in 'struct dev_pagemap' is only used for holding
resource span information. The other fields, 'name', 'flags', 'desc',
'parent', 'sibling', and 'child' are all unused wasted space.

This is in preparation for introducing a multi-range extension of
devm_memremap_pages().

The bulk of this change is unwinding all the places internal to
libnvdimm that used 'struct resource' unnecessarily.

P2PDMA had a minor usage of the flags field, but only to report failures
with "%pR". That is replaced with an open coded print of the range.

Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Vishal Verma <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Ben Skeggs <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Ira Weiny <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
---
arch/powerpc/kvm/book3s_hv_uvmem.c | 13 +++--
drivers/dax/bus.c | 10 ++--
drivers/dax/bus.h | 2 -
drivers/dax/dax-private.h | 5 --
drivers/dax/device.c | 3 -
drivers/dax/hmem/hmem.c | 5 ++
drivers/dax/pmem/core.c | 12 ++---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 14 +++---
drivers/nvdimm/badrange.c | 26 +++++------
drivers/nvdimm/claim.c | 13 +++--
drivers/nvdimm/nd.h | 3 +
drivers/nvdimm/pfn_devs.c | 12 ++---
drivers/nvdimm/pmem.c | 26 ++++++-----
drivers/nvdimm/region.c | 21 +++++----
drivers/pci/p2pdma.c | 11 ++---
include/linux/memremap.h | 5 +-
include/linux/range.h | 6 ++
lib/test_hmm.c | 14 +++---
mm/memremap.c | 77 ++++++++++++++++----------------
tools/testing/nvdimm/test/iomap.c | 2 -
20 files changed, 147 insertions(+), 133 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 7705d5557239..29ec555055c2 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -687,9 +687,9 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
struct kvmppc_uvmem_page_pvt *pvt;
unsigned long pfn_last, pfn_first;

- pfn_first = kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT;
+ pfn_first = kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT;
pfn_last = pfn_first +
- (resource_size(&kvmppc_uvmem_pgmap.res) >> PAGE_SHIFT);
+ (range_len(&kvmppc_uvmem_pgmap.range) >> PAGE_SHIFT);

spin_lock(&kvmppc_uvmem_bitmap_lock);
bit = find_first_zero_bit(kvmppc_uvmem_bitmap,
@@ -1007,7 +1007,7 @@ static vm_fault_t kvmppc_uvmem_migrate_to_ram(struct vm_fault *vmf)
static void kvmppc_uvmem_page_free(struct page *page)
{
unsigned long pfn = page_to_pfn(page) -
- (kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
+ (kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT);
struct kvmppc_uvmem_page_pvt *pvt;

spin_lock(&kvmppc_uvmem_bitmap_lock);
@@ -1170,7 +1170,8 @@ int kvmppc_uvmem_init(void)
}

kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
- kvmppc_uvmem_pgmap.res = *res;
+ kvmppc_uvmem_pgmap.range.start = res->start;
+ kvmppc_uvmem_pgmap.range.end = res->end;
kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
/* just one global instance: */
kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
@@ -1205,7 +1206,7 @@ void kvmppc_uvmem_free(void)
return;

memunmap_pages(&kvmppc_uvmem_pgmap);
- release_mem_region(kvmppc_uvmem_pgmap.res.start,
- resource_size(&kvmppc_uvmem_pgmap.res));
+ release_mem_region(kvmppc_uvmem_pgmap.range.start,
+ range_len(&kvmppc_uvmem_pgmap.range));
kfree(kvmppc_uvmem_bitmap);
}
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 53d07f2f1285..00fa73a8dfb4 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -515,7 +515,7 @@ static void dax_region_unregister(void *region)
}

struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align,
+ struct range *range, int target_node, unsigned int align,
unsigned long flags)
{
struct dax_region *dax_region;
@@ -530,8 +530,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
return NULL;
}

- if (!IS_ALIGNED(res->start, align)
- || !IS_ALIGNED(resource_size(res), align))
+ if (!IS_ALIGNED(range->start, align)
+ || !IS_ALIGNED(range_len(range), align))
return NULL;

dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
@@ -546,8 +546,8 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
dax_region->target_node = target_node;
ida_init(&dax_region->ida);
dax_region->res = (struct resource) {
- .start = res->start,
- .end = res->end,
+ .start = range->start,
+ .end = range->end,
.flags = IORESOURCE_MEM | flags,
};

diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index da27ea70a19a..72b92f95509f 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,7 +13,7 @@ void dax_region_put(struct dax_region *dax_region);

#define IORESOURCE_DAX_STATIC (1UL << 0)
struct dax_region *alloc_dax_region(struct device *parent, int region_id,
- struct resource *res, int target_node, unsigned int align,
+ struct range *range, int target_node, unsigned int align,
unsigned long flags);

enum dev_dax_subsys {
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index b81a1494d82b..0cbb2ec81ca7 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -61,11 +61,6 @@ struct dev_dax {
struct range range;
};

-static inline u64 range_len(struct range *range)
-{
- return range->end - range->start + 1;
-}
-
static inline struct dev_dax *to_dev_dax(struct device *dev)
{
return container_of(dev, struct dev_dax, dev);
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5496331f6eb6..fffc54ce0911 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -416,8 +416,7 @@ int dev_dax_probe(struct dev_dax *dev_dax)
pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
if (!pgmap)
return -ENOMEM;
- pgmap->res.start = range->start;
- pgmap->res.end = range->end;
+ pgmap->range = *range;
}
pgmap->type = MEMORY_DEVICE_DEVDAX;
addr = devm_memremap_pages(dev, pgmap);
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index aa260009dfc7..1a3347bb6143 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -13,13 +13,16 @@ static int dax_hmem_probe(struct platform_device *pdev)
struct dev_dax_data data;
struct dev_dax *dev_dax;
struct resource *res;
+ struct range range;

res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!res)
return -ENOMEM;

mri = dev->platform_data;
- dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+ range.start = res->start;
+ range.end = res->end;
+ dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
PMD_SIZE, 0);
if (!dax_region)
return -ENOMEM;
diff --git a/drivers/dax/pmem/core.c b/drivers/dax/pmem/core.c
index 4fe700884338..62b26bfceab1 100644
--- a/drivers/dax/pmem/core.c
+++ b/drivers/dax/pmem/core.c
@@ -9,7 +9,7 @@

struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
{
- struct resource res;
+ struct range range;
int rc, id, region_id;
resource_size_t offset;
struct nd_pfn_sb *pfn_sb;
@@ -50,10 +50,10 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
if (rc != 2)
return ERR_PTR(-EINVAL);

- /* adjust the dax_region resource to the start of data */
- memcpy(&res, &pgmap.res, sizeof(res));
- res.start += offset;
- dax_region = alloc_dax_region(dev, region_id, &res,
+ /* adjust the dax_region range to the start of data */
+ range = pgmap.range;
+ range.start += offset,
+ dax_region = alloc_dax_region(dev, region_id, &range,
nd_region->target_node, le32_to_cpu(pfn_sb->align),
IORESOURCE_DAX_STATIC);
if (!dax_region)
@@ -64,7 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
.id = id,
.pgmap = &pgmap,
.subsys = subsys,
- .size = resource_size(&res),
+ .size = range_len(&range),
};
dev_dax = devm_create_dev_dax(&data);

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 4e8112fde3e6..25811ed7e274 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -101,7 +101,7 @@ unsigned long nouveau_dmem_page_addr(struct page *page)
{
struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
unsigned long off = (page_to_pfn(page) << PAGE_SHIFT) -
- chunk->pagemap.res.start;
+ chunk->pagemap.range.start;

return chunk->bo->offset + off;
}
@@ -249,7 +249,8 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)

chunk->drm = drm;
chunk->pagemap.type = MEMORY_DEVICE_PRIVATE;
- chunk->pagemap.res = *res;
+ chunk->pagemap.range.start = res->start;
+ chunk->pagemap.range.end = res->end;
chunk->pagemap.ops = &nouveau_dmem_pagemap_ops;
chunk->pagemap.owner = drm->dev;

@@ -273,7 +274,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
list_add(&chunk->list, &drm->dmem->chunks);
mutex_unlock(&drm->dmem->mutex);

- pfn_first = chunk->pagemap.res.start >> PAGE_SHIFT;
+ pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
page = pfn_to_page(pfn_first);
spin_lock(&drm->dmem->lock);
for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
@@ -294,8 +295,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
out_bo_free:
nouveau_bo_ref(NULL, &chunk->bo);
out_release:
- release_mem_region(chunk->pagemap.res.start,
- resource_size(&chunk->pagemap.res));
+ release_mem_region(chunk->pagemap.range.start, range_len(&chunk->pagemap.range));
out_free:
kfree(chunk);
out:
@@ -382,8 +382,8 @@ nouveau_dmem_fini(struct nouveau_drm *drm)
nouveau_bo_ref(NULL, &chunk->bo);
list_del(&chunk->list);
memunmap_pages(&chunk->pagemap);
- release_mem_region(chunk->pagemap.res.start,
- resource_size(&chunk->pagemap.res));
+ release_mem_region(chunk->pagemap.range.start,
+ range_len(&chunk->pagemap.range));
kfree(chunk);
}

diff --git a/drivers/nvdimm/badrange.c b/drivers/nvdimm/badrange.c
index b9eeefa27e3a..aaf6e215a8c6 100644
--- a/drivers/nvdimm/badrange.c
+++ b/drivers/nvdimm/badrange.c
@@ -211,7 +211,7 @@ static void __add_badblock_range(struct badblocks *bb, u64 ns_offset, u64 len)
}

static void badblocks_populate(struct badrange *badrange,
- struct badblocks *bb, const struct resource *res)
+ struct badblocks *bb, const struct range *range)
{
struct badrange_entry *bre;

@@ -222,34 +222,34 @@ static void badblocks_populate(struct badrange *badrange,
u64 bre_end = bre->start + bre->length - 1;

/* Discard intervals with no intersection */
- if (bre_end < res->start)
+ if (bre_end < range->start)
continue;
- if (bre->start > res->end)
+ if (bre->start > range->end)
continue;
/* Deal with any overlap after start of the namespace */
- if (bre->start >= res->start) {
+ if (bre->start >= range->start) {
u64 start = bre->start;
u64 len;

- if (bre_end <= res->end)
+ if (bre_end <= range->end)
len = bre->length;
else
- len = res->start + resource_size(res)
+ len = range->start + range_len(range)
- bre->start;
- __add_badblock_range(bb, start - res->start, len);
+ __add_badblock_range(bb, start - range->start, len);
continue;
}
/*
* Deal with overlap for badrange starting before
* the namespace.
*/
- if (bre->start < res->start) {
+ if (bre->start < range->start) {
u64 len;

- if (bre_end < res->end)
- len = bre->start + bre->length - res->start;
+ if (bre_end < range->end)
+ len = bre->start + bre->length - range->start;
else
- len = resource_size(res);
+ len = range_len(range);
__add_badblock_range(bb, 0, len);
}
}
@@ -267,7 +267,7 @@ static void badblocks_populate(struct badrange *badrange,
* and add badblocks entries for all matching sub-ranges
*/
void nvdimm_badblocks_populate(struct nd_region *nd_region,
- struct badblocks *bb, const struct resource *res)
+ struct badblocks *bb, const struct range *range)
{
struct nvdimm_bus *nvdimm_bus;

@@ -279,7 +279,7 @@ void nvdimm_badblocks_populate(struct nd_region *nd_region,
nvdimm_bus = walk_to_nvdimm_bus(&nd_region->dev);

nvdimm_bus_lock(&nvdimm_bus->dev);
- badblocks_populate(&nvdimm_bus->badrange, bb, res);
+ badblocks_populate(&nvdimm_bus->badrange, bb, range);
nvdimm_bus_unlock(&nvdimm_bus->dev);
}
EXPORT_SYMBOL_GPL(nvdimm_badblocks_populate);
diff --git a/drivers/nvdimm/claim.c b/drivers/nvdimm/claim.c
index 45964acba944..290267e1ff9f 100644
--- a/drivers/nvdimm/claim.c
+++ b/drivers/nvdimm/claim.c
@@ -303,13 +303,16 @@ static int nsio_rw_bytes(struct nd_namespace_common *ndns,
int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio,
resource_size_t size)
{
- struct resource *res = &nsio->res;
struct nd_namespace_common *ndns = &nsio->common;
+ struct range range = {
+ .start = nsio->res.start,
+ .end = nsio->res.end,
+ };

nsio->size = size;
- if (!devm_request_mem_region(dev, res->start, size,
+ if (!devm_request_mem_region(dev, range.start, size,
dev_name(&ndns->dev))) {
- dev_warn(dev, "could not reserve region %pR\n", res);
+ dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
return -EBUSY;
}

@@ -317,9 +320,9 @@ int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio,
if (devm_init_badblocks(dev, &nsio->bb))
return -ENOMEM;
nvdimm_badblocks_populate(to_nd_region(ndns->dev.parent), &nsio->bb,
- &nsio->res);
+ &range);

- nsio->addr = devm_memremap(dev, res->start, size, ARCH_MEMREMAP_PMEM);
+ nsio->addr = devm_memremap(dev, range.start, size, ARCH_MEMREMAP_PMEM);

return PTR_ERR_OR_ZERO(nsio->addr);
}
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index 85c1ae813ea3..bac90afa4604 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -377,8 +377,9 @@ int nvdimm_namespace_detach_btt(struct nd_btt *nd_btt);
const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
char *name);
unsigned int pmem_sector_size(struct nd_namespace_common *ndns);
+struct range;
void nvdimm_badblocks_populate(struct nd_region *nd_region,
- struct badblocks *bb, const struct resource *res);
+ struct badblocks *bb, const struct range *range);
int devm_namespace_enable(struct device *dev, struct nd_namespace_common *ndns,
resource_size_t size);
void devm_namespace_disable(struct device *dev,
diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index 3e11ef8d3f5b..3c4787b92a6a 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -672,7 +672,7 @@ static unsigned long init_altmap_reserve(resource_size_t base)

static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
{
- struct resource *res = &pgmap->res;
+ struct range *range = &pgmap->range;
struct vmem_altmap *altmap = &pgmap->altmap;
struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
u64 offset = le64_to_cpu(pfn_sb->dataoff);
@@ -689,16 +689,16 @@ static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
.end_pfn = PHYS_PFN(end),
};

- memcpy(res, &nsio->res, sizeof(*res));
- res->start += start_pad;
- res->end -= end_trunc;
-
+ *range = (struct range) {
+ .start = nsio->res.start + start_pad,
+ .end = nsio->res.end - end_trunc,
+ };
if (nd_pfn->mode == PFN_MODE_RAM) {
if (offset < reserve)
return -EINVAL;
nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
} else if (nd_pfn->mode == PFN_MODE_PMEM) {
- nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset));
+ nd_pfn->npfns = PHYS_PFN((range_len(range) - offset));
if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
dev_info(&nd_pfn->dev,
"number of pfns truncated from %lld to %ld\n",
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index fab29b514372..69cc0e783709 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -376,7 +376,7 @@ static int pmem_attach_disk(struct device *dev,
struct nd_region *nd_region = to_nd_region(dev->parent);
int nid = dev_to_node(dev), fua;
struct resource *res = &nsio->res;
- struct resource bb_res;
+ struct range bb_range;
struct nd_pfn *nd_pfn = NULL;
struct dax_device *dax_dev;
struct nd_pfn_sb *pfn_sb;
@@ -435,24 +435,26 @@ static int pmem_attach_disk(struct device *dev,
pfn_sb = nd_pfn->pfn_sb;
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
- resource_size(&pmem->pgmap.res);
+ range_len(&pmem->pgmap.range);
pmem->pfn_flags |= PFN_MAP;
- memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
- bb_res.start += pmem->data_offset;
+ bb_range = pmem->pgmap.range;
+ bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
- memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res));
+ pmem->pgmap.range.start = res->start;
+ pmem->pgmap.range.end = res->end;
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
pmem->pfn_flags |= PFN_MAP;
- memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
+ bb_range = pmem->pgmap.range;
} else {
if (devm_add_action_or_reset(dev, pmem_release_queue,
&pmem->pgmap))
return -ENOMEM;
addr = devm_memremap(dev, pmem->phys_addr,
pmem->size, ARCH_MEMREMAP_PMEM);
- memcpy(&bb_res, &nsio->res, sizeof(bb_res));
+ bb_range.start = res->start;
+ bb_range.end = res->end;
}

if (IS_ERR(addr))
@@ -482,7 +484,7 @@ static int pmem_attach_disk(struct device *dev,
/ 512);
if (devm_init_badblocks(dev, &pmem->bb))
return -ENOMEM;
- nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
+ nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
disk->bb = &pmem->bb;

if (is_nvdimm_sync(nd_region))
@@ -593,8 +595,8 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
resource_size_t offset = 0, end_trunc = 0;
struct nd_namespace_common *ndns;
struct nd_namespace_io *nsio;
- struct resource res;
struct badblocks *bb;
+ struct range range;
struct kernfs_node *bb_state;

if (event != NVDIMM_REVALIDATE_POISON)
@@ -630,9 +632,9 @@ static void nd_pmem_notify(struct device *dev, enum nvdimm_event event)
nsio = to_nd_namespace_io(&ndns->dev);
}

- res.start = nsio->res.start + offset;
- res.end = nsio->res.end - end_trunc;
- nvdimm_badblocks_populate(nd_region, bb, &res);
+ range.start = nsio->res.start + offset;
+ range.end = nsio->res.end - end_trunc;
+ nvdimm_badblocks_populate(nd_region, bb, &range);
if (bb_state)
sysfs_notify_dirent(bb_state);
}
diff --git a/drivers/nvdimm/region.c b/drivers/nvdimm/region.c
index 0f6978e72e7c..bfce87ed72ab 100644
--- a/drivers/nvdimm/region.c
+++ b/drivers/nvdimm/region.c
@@ -35,7 +35,10 @@ static int nd_region_probe(struct device *dev)
return rc;

if (is_memory(&nd_region->dev)) {
- struct resource ndr_res;
+ struct range range = {
+ .start = nd_region->ndr_start,
+ .end = nd_region->ndr_start + nd_region->ndr_size - 1,
+ };

if (devm_init_badblocks(dev, &nd_region->bb))
return -ENODEV;
@@ -44,9 +47,7 @@ static int nd_region_probe(struct device *dev)
if (!nd_region->bb_state)
dev_warn(&nd_region->dev,
"'badblocks' notification disabled\n");
- ndr_res.start = nd_region->ndr_start;
- ndr_res.end = nd_region->ndr_start + nd_region->ndr_size - 1;
- nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res);
+ nvdimm_badblocks_populate(nd_region, &nd_region->bb, &range);
}

rc = nd_region_register_namespaces(nd_region, &err);
@@ -121,14 +122,16 @@ static void nd_region_notify(struct device *dev, enum nvdimm_event event)
{
if (event == NVDIMM_REVALIDATE_POISON) {
struct nd_region *nd_region = to_nd_region(dev);
- struct resource res;

if (is_memory(&nd_region->dev)) {
- res.start = nd_region->ndr_start;
- res.end = nd_region->ndr_start +
- nd_region->ndr_size - 1;
+ struct range range = {
+ .start = nd_region->ndr_start,
+ .end = nd_region->ndr_start +
+ nd_region->ndr_size - 1,
+ };
+
nvdimm_badblocks_populate(nd_region,
- &nd_region->bb, &res);
+ &nd_region->bb, &range);
if (nd_region->bb_state)
sysfs_notify_dirent(nd_region->bb_state);
}
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index f29a48f8fa59..dd6b0d51a50c 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -185,9 +185,8 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
return -ENOMEM;

pgmap = &p2p_pgmap->pgmap;
- pgmap->res.start = pci_resource_start(pdev, bar) + offset;
- pgmap->res.end = pgmap->res.start + size - 1;
- pgmap->res.flags = pci_resource_flags(pdev, bar);
+ pgmap->range.start = pci_resource_start(pdev, bar) + offset;
+ pgmap->range.end = pgmap->range.start + size - 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;

p2p_pgmap->provider = pdev;
@@ -202,13 +201,13 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,

error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
pci_bus_address(pdev, bar) + offset,
- resource_size(&pgmap->res), dev_to_node(&pdev->dev),
+ range_len(&pgmap->range), dev_to_node(&pdev->dev),
pgmap->ref);
if (error)
goto pages_free;

- pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
- &pgmap->res);
+ pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
+ pgmap->range.start, pgmap->range.end);

return 0;

diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index fac75e5a723b..6c21951bdb16 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,6 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_MEMREMAP_H_
#define _LINUX_MEMREMAP_H_
+#include <linux/range.h>
#include <linux/ioport.h>
#include <linux/percpu-refcount.h>

@@ -94,7 +95,7 @@ struct dev_pagemap_ops {
/**
* struct dev_pagemap - metadata for ZONE_DEVICE mappings
* @altmap: pre-allocated/reserved memory for vmemmap allocations
- * @res: physical address range covered by @ref
+ * @range: physical address range covered by @ref
* @ref: reference count that pins the devm_memremap_pages() mapping
* @internal_ref: internal reference if @ref is not provided by the caller
* @done: completion for @internal_ref
@@ -107,7 +108,7 @@ struct dev_pagemap_ops {
*/
struct dev_pagemap {
struct vmem_altmap altmap;
- struct resource res;
+ struct range range;
struct percpu_ref *ref;
struct percpu_ref internal_ref;
struct completion done;
diff --git a/include/linux/range.h b/include/linux/range.h
index d1fbeb664012..274681cc3154 100644
--- a/include/linux/range.h
+++ b/include/linux/range.h
@@ -1,12 +1,18 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_RANGE_H
#define _LINUX_RANGE_H
+#include <linux/types.h>

struct range {
u64 start;
u64 end;
};

+static inline u64 range_len(const struct range *range)
+{
+ return range->end - range->start + 1;
+}
+
int add_range(struct range *range, int az, int nr_range,
u64 start, u64 end);

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e7dc3de355b7..5b4521991621 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -487,7 +487,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
goto err_release;

devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
- devmem->pagemap.res = *res;
+ devmem->pagemap.range.start = res->start;
+ devmem->pagemap.range.end = res->end;
devmem->pagemap.ops = &dmirror_devmem_ops;
devmem->pagemap.owner = mdevice;

@@ -496,9 +497,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
goto err_free;

devmem->mdevice = mdevice;
- pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
- pfn_last = pfn_first +
- (resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
+ pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
+ pfn_last = pfn_first + (range_len(&devmem->pagemap.range) >> PAGE_SHIFT);
mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;

mutex_unlock(&mdevice->devmem_lock);
@@ -528,7 +528,7 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
err_free:
kfree(devmem);
err_release:
- release_mem_region(res->start, resource_size(res));
+ release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
err:
mutex_unlock(&mdevice->devmem_lock);
return false;
@@ -1100,8 +1100,8 @@ static void dmirror_device_remove(struct dmirror_device *mdevice)
mdevice->devmem_chunks[i];

memunmap_pages(&devmem->pagemap);
- release_mem_region(devmem->pagemap.res.start,
- resource_size(&devmem->pagemap.res));
+ release_mem_region(devmem->pagemap.range.start,
+ range_len(&devmem->pagemap.range));
kfree(devmem);
}
kfree(mdevice->devmem_chunks);
diff --git a/mm/memremap.c b/mm/memremap.c
index 8afcc54c8928..9979891fec78 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -70,24 +70,24 @@ static void devmap_managed_enable_put(void)
}
#endif /* CONFIG_DEV_PAGEMAP_OPS */

-static void pgmap_array_delete(struct resource *res)
+static void pgmap_array_delete(struct range *range)
{
- xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
+ xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
NULL, GFP_KERNEL);
synchronize_rcu();
}

static unsigned long pfn_first(struct dev_pagemap *pgmap)
{
- return PHYS_PFN(pgmap->res.start) +
+ return PHYS_PFN(pgmap->range.start) +
vmem_altmap_offset(pgmap_altmap(pgmap));
}

static unsigned long pfn_end(struct dev_pagemap *pgmap)
{
- const struct resource *res = &pgmap->res;
+ const struct range *range = &pgmap->range;

- return (res->start + resource_size(res)) >> PAGE_SHIFT;
+ return (range->start + range_len(range)) >> PAGE_SHIFT;
}

static unsigned long pfn_next(unsigned long pfn)
@@ -146,7 +146,7 @@ static void dev_pagemap_cleanup(struct dev_pagemap *pgmap)

void memunmap_pages(struct dev_pagemap *pgmap)
{
- struct resource *res = &pgmap->res;
+ struct range *range = &pgmap->range;
struct page *first_page;
unsigned long pfn;
int nid;
@@ -163,20 +163,20 @@ void memunmap_pages(struct dev_pagemap *pgmap)
nid = page_to_nid(first_page);

mem_hotplug_begin();
- remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)));
+ remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start),
+ PHYS_PFN(range_len(range)));
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- __remove_pages(PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), NULL);
+ __remove_pages(PHYS_PFN(range->start),
+ PHYS_PFN(range_len(range)), NULL);
} else {
- arch_remove_memory(nid, res->start, resource_size(res),
+ arch_remove_memory(nid, range->start, range_len(range),
pgmap_altmap(pgmap));
- kasan_remove_zero_shadow(__va(res->start), resource_size(res));
+ kasan_remove_zero_shadow(__va(range->start), range_len(range));
}
mem_hotplug_done();

- untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
- pgmap_array_delete(res);
+ untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+ pgmap_array_delete(range);
WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
devmap_managed_enable_put();
}
@@ -202,7 +202,7 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref)
*/
void *memremap_pages(struct dev_pagemap *pgmap, int nid)
{
- struct resource *res = &pgmap->res;
+ struct range *range = &pgmap->range;
struct dev_pagemap *conflict_pgmap;
struct mhp_params params = {
/*
@@ -271,7 +271,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
return ERR_PTR(error);
}

- conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
@@ -279,7 +279,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
goto err_array;
}

- conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL);
+ conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
if (conflict_pgmap) {
WARN(1, "Conflicting mapping in same section\n");
put_dev_pagemap(conflict_pgmap);
@@ -287,26 +287,27 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
goto err_array;
}

- is_ram = region_intersects(res->start, resource_size(res),
+ is_ram = region_intersects(range->start, range_len(range),
IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);

if (is_ram != REGION_DISJOINT) {
- WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
- is_ram == REGION_MIXED ? "mixed" : "ram", res);
+ WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n",
+ is_ram == REGION_MIXED ? "mixed" : "ram",
+ range->start, range->end);
error = -ENXIO;
goto err_array;
}

- error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start),
- PHYS_PFN(res->end), pgmap, GFP_KERNEL));
+ error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start),
+ PHYS_PFN(range->end), pgmap, GFP_KERNEL));
if (error)
goto err_array;

if (nid < 0)
nid = numa_mem_id();

- error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(res->start),
- 0, resource_size(res));
+ error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(range->start), 0,
+ range_len(range));
if (error)
goto err_pfn_remap;

@@ -324,16 +325,16 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
* arch_add_memory().
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
- error = add_pages(nid, PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), &params);
+ error = add_pages(nid, PHYS_PFN(range->start),
+ PHYS_PFN(range_len(range)), &params);
} else {
- error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
+ error = kasan_add_zero_shadow(__va(range->start), range_len(range));
if (error) {
mem_hotplug_done();
goto err_kasan;
}

- error = arch_add_memory(nid, res->start, resource_size(res),
+ error = arch_add_memory(nid, range->start, range_len(range),
&params);
}

@@ -341,8 +342,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
struct zone *zone;

zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
- move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), params.altmap);
+ move_pfn_range_to_zone(zone, PHYS_PFN(range->start),
+ PHYS_PFN(range_len(range)), params.altmap);
}

mem_hotplug_done();
@@ -354,17 +355,17 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
* to allow us to do the work while not holding the hotplug lock.
*/
memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
- PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), pgmap);
+ PHYS_PFN(range->start),
+ PHYS_PFN(range_len(range)), pgmap);
percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
- return __va(res->start);
+ return __va(range->start);

err_add_memory:
- kasan_remove_zero_shadow(__va(res->start), resource_size(res));
+ kasan_remove_zero_shadow(__va(range->start), range_len(range));
err_kasan:
- untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
+ untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
err_pfn_remap:
- pgmap_array_delete(res);
+ pgmap_array_delete(range);
err_array:
dev_pagemap_kill(pgmap);
dev_pagemap_cleanup(pgmap);
@@ -389,7 +390,7 @@ EXPORT_SYMBOL_GPL(memremap_pages);
* 'live' on entry and will be killed and reaped at
* devm_memremap_pages_release() time, or if this routine fails.
*
- * 4/ res is expected to be a host memory range that could feasibly be
+ * 4/ range is expected to be a host memory range that could feasibly be
* treated as a "System RAM" range, i.e. not a device mmio range, but
* this is not enforced.
*/
@@ -446,7 +447,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
* In the cached case we're already holding a live reference.
*/
if (pgmap) {
- if (phys >= pgmap->res.start && phys <= pgmap->res.end)
+ if (phys >= pgmap->range.start && phys <= pgmap->range.end)
return pgmap;
put_dev_pagemap(pgmap);
}
diff --git a/tools/testing/nvdimm/test/iomap.c b/tools/testing/nvdimm/test/iomap.c
index 03e40b3b0106..c62d372d426f 100644
--- a/tools/testing/nvdimm/test/iomap.c
+++ b/tools/testing/nvdimm/test/iomap.c
@@ -126,7 +126,7 @@ static void dev_pagemap_percpu_release(struct percpu_ref *ref)
void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
{
int error;
- resource_size_t offset = pgmap->res.start;
+ resource_size_t offset = pgmap->range.start;
struct nfit_test_resource *nfit_res = get_nfit_res(offset);

if (!nfit_res)

2020-08-03 07:49:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

[...]

> Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> integration into -mm before other collisions develop.
>
> Chatted with Justin offline and it currently appears that the missing
> numa information is the fault of the platform firmware to populate all
> the necessary NUMA data in the NFIT.

I'm planning on looking at some bits of this series this week, but some
questions upfront ...

>
> ---
> Cover:
>
> The device-dax facility allows an address range to be directly mapped
> through a chardev, or optionally hotplugged to the core kernel page
> allocator as System-RAM. It is the mechanism for converting persistent
> memory (pmem) to be used as another volatile memory pool i.e. the
> current Memory Tiering hot topic on linux-mm.
>
> In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> it, but that labeling mechanism is not available / applicable to
> soft-reserved ("EFI specific purpose") memory [3]. This series provides
> a sysfs-mechanism for the daxctl utility to enable provisioning of
> volatile-soft-reserved memory ranges.
>
> The motivations for this facility are:
>
> 1/ Allow performance differentiated memory ranges to be split between
> kernel-managed and directly-accessed use cases.
>
> 2/ Allow physical memory to be provisioned along performance relevant
> address boundaries. For example, divide a memory-side cache [4] along
> cache-color boundaries.
>
> 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> / permissions boundary [5]. Specifically I have seen people (ab)using
> memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> device-dax interface on custom address ranges. A follow-on for the VM
> use case is to teach device-dax to dynamically allocate 'struct page' at
> runtime to reduce the duplication of 'struct page' space in both the
> guest and the host kernel for the same physical pages.


I think I am missing some important pieces. Bear with me.

1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
automatically used in the buddy during boot, but remains untouched
(similar to pmem). But as it involves ACPI as well, it could also be
used on arm64 (-e820), correct?

2. Soft-reserved memory is volatile RAM with differing performance
characteristics ("performance differentiated memory"). What would be
examples of such memory? Like, memory that is faster than RAM (scratch
pad), or slower (pmem)? Or both? :) Is it a valid use case to use pmem
in a hypervisor to back this memory?

3. There seem to be use cases where "soft-reserved" memory is used via
DAX. What is an example use case? I assume it's *not* to treat it like
PMEM but instead e.g., use it as a fast buffer inside applications or
similar.

4. There seem to be use cases where some part of "soft-reserved" memory
is used via DAX, some other is given to the buddy. What is an example
use case? Is this really necessary or only some theoretical use case?

5. The "provisioned along performance relevant address boundaries." part
is unclear to me. Can you give an example of how this would look like
from user space? Like, split that memory in blocks of size X with
alignment Y and give them to separate applications?

6. If you add such memory to the buddy, is there any way the system can
differentiate it from other memory? E.g., via fake/other NUMA nodes?


Also, can you give examples of how kmem-added memory is represented in
/proc/iomem for a) pmem and b) soft-resered memory after this series
(skimming over the patches, I think there is a change for pmem, right?)?

I am really wondering if it's the right approach to squeeze this into
our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
"ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
Memory)" speaks explicitly about non-volatile memory.


>
> [2]: http://lore.kernel.org/r/[email protected]
> [3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
> [4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
> [5]: http://lore.kernel.org/r/[email protected]
>
> ---
>
> Dan Williams (19):
> x86/numa: Cleanup configuration dependent command-line options
> x86/numa: Add 'nohmat' option
> efi/fake_mem: Arrange for a resource entry per efi_fake_mem instance
> ACPI: HMAT: Refactor hmat_register_target_device to hmem_register_device
> resource: Report parent to walk_iomem_res_desc() callback
> mm/memory_hotplug: Introduce default phys_to_target_node() implementation
> ACPI: HMAT: Attach a device for each soft-reserved range
> device-dax: Drop the dax_region.pfn_flags attribute
> device-dax: Move instance creation parameters to 'struct dev_dax_data'
> device-dax: Make pgmap optional for instance creation
> device-dax: Kill dax_kmem_res
> device-dax: Add an allocation interface for device-dax instances
> device-dax: Introduce 'seed' devices
> drivers/base: Make device_find_child_by_name() compatible with sysfs inputs
> device-dax: Add resize support
> mm/memremap_pages: Convert to 'struct range'
> mm/memremap_pages: Support multiple ranges per invocation
> device-dax: Add dis-contiguous resource support
> device-dax: Introduce 'mapping' devices
>
> Joao Martins (4):
> device-dax: Make align a per-device property
> device-dax: Add an 'align' attribute
> dax/hmem: Introduce dax_hmem.region_idle parameter
> device-dax: Add a range mapping allocation attribute
>
>
> Documentation/x86/x86_64/boot-options.rst | 4
> arch/powerpc/kvm/book3s_hv_uvmem.c | 14
> arch/x86/include/asm/numa.h | 8
> arch/x86/kernel/e820.c | 16
> arch/x86/mm/numa.c | 11
> arch/x86/mm/numa_emulation.c | 3
> arch/x86/xen/enlighten_pv.c | 2
> drivers/acpi/numa/hmat.c | 76 --
> drivers/acpi/numa/srat.c | 9
> drivers/base/core.c | 2
> drivers/dax/Kconfig | 4
> drivers/dax/Makefile | 3
> drivers/dax/bus.c | 1046 +++++++++++++++++++++++++++--
> drivers/dax/bus.h | 28 -
> drivers/dax/dax-private.h | 60 +-
> drivers/dax/device.c | 134 ++--
> drivers/dax/hmem.c | 56 --
> drivers/dax/hmem/Makefile | 6
> drivers/dax/hmem/device.c | 100 +++
> drivers/dax/hmem/hmem.c | 65 ++
> drivers/dax/kmem.c | 199 +++---
> drivers/dax/pmem/compat.c | 2
> drivers/dax/pmem/core.c | 22 -
> drivers/firmware/efi/x86_fake_mem.c | 12
> drivers/gpu/drm/nouveau/nouveau_dmem.c | 15
> drivers/nvdimm/badrange.c | 26 -
> drivers/nvdimm/claim.c | 13
> drivers/nvdimm/nd.h | 3
> drivers/nvdimm/pfn_devs.c | 13
> drivers/nvdimm/pmem.c | 27 -
> drivers/nvdimm/region.c | 21 -
> drivers/pci/p2pdma.c | 12
> include/acpi/acpi_numa.h | 14
> include/linux/dax.h | 8
> include/linux/memory_hotplug.h | 5
> include/linux/memremap.h | 11
> include/linux/numa.h | 11
> include/linux/range.h | 6
> kernel/resource.c | 11
> lib/test_hmm.c | 15
> mm/memory_hotplug.c | 10
> mm/memremap.c | 299 +++++---
> tools/testing/nvdimm/dax-dev.c | 22 -
> tools/testing/nvdimm/test/iomap.c | 2
> 44 files changed, 1825 insertions(+), 601 deletions(-)
> delete mode 100644 drivers/dax/hmem.c
> create mode 100644 drivers/dax/hmem/Makefile
> create mode 100644 drivers/dax/hmem/device.c
> create mode 100644 drivers/dax/hmem/hmem.c
>
> base-commit: 01830e6c042e8eb6eb202e05d7df8057135b4c26
>


--
Thanks,

David / dhildenb

2020-08-20 01:55:22

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On Mon, Aug 3, 2020 at 12:48 AM David Hildenbrand <[email protected]> wrote:
>
> [...]
>
> > Well, no v5.8-rc8 to line this up for v5.9, so next best is early
> > integration into -mm before other collisions develop.
> >
> > Chatted with Justin offline and it currently appears that the missing
> > numa information is the fault of the platform firmware to populate all
> > the necessary NUMA data in the NFIT.
>
> I'm planning on looking at some bits of this series this week, but some
> questions upfront ...
>
> >
> > ---
> > Cover:
> >
> > The device-dax facility allows an address range to be directly mapped
> > through a chardev, or optionally hotplugged to the core kernel page
> > allocator as System-RAM. It is the mechanism for converting persistent
> > memory (pmem) to be used as another volatile memory pool i.e. the
> > current Memory Tiering hot topic on linux-mm.
> >
> > In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
> > it, but that labeling mechanism is not available / applicable to
> > soft-reserved ("EFI specific purpose") memory [3]. This series provides
> > a sysfs-mechanism for the daxctl utility to enable provisioning of
> > volatile-soft-reserved memory ranges.
> >
> > The motivations for this facility are:
> >
> > 1/ Allow performance differentiated memory ranges to be split between
> > kernel-managed and directly-accessed use cases.
> >
> > 2/ Allow physical memory to be provisioned along performance relevant
> > address boundaries. For example, divide a memory-side cache [4] along
> > cache-color boundaries.
> >
> > 3/ Parcel out soft-reserved memory to VMs using device-dax as a security
> > / permissions boundary [5]. Specifically I have seen people (ab)using
> > memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
> > device-dax interface on custom address ranges. A follow-on for the VM
> > use case is to teach device-dax to dynamically allocate 'struct page' at
> > runtime to reduce the duplication of 'struct page' space in both the
> > guest and the host kernel for the same physical pages.
>
>
> I think I am missing some important pieces. Bear with me.

No worries, also bear with me, I'm going to be offline intermittently
until at least mid-September. Hopefully Joao and/or Vishal can jump in
on this discussion.

>
> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> automatically used in the buddy during boot, but remains untouched
> (similar to pmem). But as it involves ACPI as well, it could also be
> used on arm64 (-e820), correct?

Correct, arm64 also gets the EFI support for enumerating memory this
way. However, I would clarify that whether soft-reserved is given to
the buddy allocator by default or not is the kernel's policy choice,
"buddy-by-default" is ok and is what will happen anyways with older
kernels on platforms that enumerate a memory range this way.

> 2. Soft-reserved memory is volatile RAM with differing performance
> characteristics ("performance differentiated memory"). What would be
> examples of such memory?

Likely the most prominent one that drove the creation of the "EFI
Specific Purpose" attribute bit is high-bandwidth memory. One concrete
example of that was a platform called Knights Landing [1] that ended
up shipping firmware that lied to the OS about the latency
characteristics of the memory to try to reverse engineer OS behavior
to not allocate from that memory range by default. With the EFI
attribute firmware performance tables can tell the truth about the
performance characteristics of the memory range *and* indicate that
the OS not use it for general purpose allocations by default.

[1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html

> Like, memory that is faster than RAM (scratch
> pad), or slower (pmem)? Or both? :)

Both, but note that PMEM is already hard-reserved by default.
Soft-reserved is about a memory range that, for example, an
administrator may want to reserve 100% for a weather simulation where
if even a small amount of memory was stolen for the page cache the
application may not meet its performance targets. It could also be a
memory range that is so slow that only applications with higher
latency tolerances would be prepared to consume it.

In other words the soft-reserved memory can be used to indicate memory
that is either too precious, or too slow for general purpose OS
allocations.

> Is it a valid use case to use pmem
> in a hypervisor to back this memory?

Depends on the pmem. That performance capability is indicated by the
ACPI HMAT, not the EFI soft-reserved designation.

> 3. There seem to be use cases where "soft-reserved" memory is used via
> DAX. What is an example use case? I assume it's *not* to treat it like
> PMEM but instead e.g., use it as a fast buffer inside applications or
> similar.

Right, in that weather-simulation example that application could just
mmap /dev/daxX.Y and never worry about contending for the "fast
memory" resource on the platform. Alternatively if that resource needs
to be shared and/or over-commited then kernel memory-management
services are needed and that dax-device can be assigned to kmem.

> 4. There seem to be use cases where some part of "soft-reserved" memory
> is used via DAX, some other is given to the buddy. What is an example
> use case? Is this really necessary or only some theoretical use case?

It's as necessary as pmem namespace partitioning, or the inclusion of
dax-kmem upstream in the first place. In that kmem case the motivation
was that some users want a portion of pmem provisioned for storage and
some for volatile usage. The motivation is similar here, platform
firmware can only identify memory attributes on coarse boundaries,
finer grained provisioning decisions are up to the administrator /
platform-owner and the kernel is a just a facilitator of that policy.

>
> 5. The "provisioned along performance relevant address boundaries." part
> is unclear to me. Can you give an example of how this would look like
> from user space? Like, split that memory in blocks of size X with
> alignment Y and give them to separate applications?

One example of platform address boundaries are the memory address
ranges that alias in a direct-mapped memory-side-cache. In the
direct-map-cache aliasing may repeat every N GBs where N is the ratio
of far-to-near memory. ("Near memory" == cache "Far memory" ==
backing memory). Also refer back to the background in the page
allocator shuffling patches [2]. With this partitioning mechanism you
could, for one example use case, assign different VMs to exclusive
colors in the memory side cache.

[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098

> 6. If you add such memory to the buddy, is there any way the system can
> differentiate it from other memory? E.g., via fake/other NUMA nodes?

Numa node numbers / are how performance differentiated memory ranges
are enumerated. The expectation is that all distinct performance
memory targets have unique ACPI proximity domains and Linux numa node
numbers as a result.

> Also, can you give examples of how kmem-added memory is represented in
> /proc/iomem for a) pmem and b) soft-resered memory after this series
> (skimming over the patches, I think there is a change for pmem, right?)?

I don't expect a change. The only difference is the parent resource
will be marked "Soft Reserved" instead of "Persistent Memory".

> I am really wondering if it's the right approach to squeeze this into
> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
> Memory)" speaks explicitly about non-volatile memory.

In fact it's not squeezed into PMEM infrastructure. dax-kmem and
device-dax are independent of PMEM. PMEM is one source of potential
device-dax instances, soft-reserved memory is another orthogonal
source. This is why device-dax needs its own userspace policy directed
partitioning mechanism because there is no PMEM to store the
configuration for partitioned higph-bandwidth memory. The userspace
tooling for this mechanism is targeted for a tool called daxctl that
has no PMEM dependencies. Look to Joao's use case that is using this
infrastructure independent of PMEM with manual soft-reservations
specified on the kernel command-line.

2020-08-21 10:10:13

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On 03.08.20 07:03, Dan Williams wrote:
> Several related issues around this unneeded attribute:
>
> - The dax_kmem_res property allows the kmem driver to stash the adjusted
> resource range that was used for the hotplug operation, but that can be
> recalculated from the original base range.
>
> - kmem is using an open coded release_resource() + kfree() when an
> idiomatic release_mem_region() is sufficient.
>
> - The driver managed resource need only manage the busy flag. Other flags
> are of no concern to the kmem driver. In fact if kmem inherits some
> memory range that add_memory_driver_managed() rejects that is a
> memory-hotplug-core policy that the driver is in no position to
> override.
>
> - The implementation trusts that failed remove_memory() results in the
> entire resource range remaining pinned busy. The driver need not make
> that layering violation assumption and just maintain the busy state in
> its local resource.
>
> - The "Hot-remove not yet implemented." comment is stale since hotremove
> support is now included.

I think some of these changes could have been nicely split out to
simplify reviewing. E.g., comment update, release_mem_region(), &=
~IORESOURCE_BUSY ...

[...]

> +
> int dev_dax_kmem_probe(struct device *dev)
> {
> struct dev_dax *dev_dax = to_dev_dax(dev);
> - struct range *range = &dev_dax->range;
> - resource_size_t kmem_start;
> - resource_size_t kmem_size;
> - resource_size_t kmem_end;
> - struct resource *new_res;
> - const char *new_res_name;
> - int numa_node;
> + struct range range = dax_kmem_range(dev_dax);
> + int numa_node = dev_dax->target_node;
> + struct resource *res;
> + char *res_name;
> int rc;
>
> /*
> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
> * could be mixed in a node with faster memory, causing
> * unavoidable performance issues.
> */
> - numa_node = dev_dax->target_node;
> if (numa_node < 0) {
> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> numa_node);
> return -EINVAL;
> }
>
> - /* Hotplug starting at the beginning of the next block: */
> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
> -
> - kmem_size = range_len(range);
> - /* Adjust the size down to compensate for moving up kmem_start: */
> - kmem_size -= kmem_start - range->start;
> - /* Align the size down to cover only complete blocks: */
> - kmem_size &= ~(memory_block_size_bytes() - 1);
> - kmem_end = kmem_start + kmem_size;
> -
> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> - if (!new_res_name)
> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> + if (!res_name)
> return -ENOMEM;
>
> - /* Region is permanently reserved if hotremove fails. */
> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
> - if (!new_res) {
> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> - &kmem_start, &kmem_end);
> - kfree(new_res_name);
> + res = request_mem_region(range.start, range_len(&range), res_name);

I think our range could be empty after aligning. I assume
request_mem_region() would check that, but maybe we could report a
better error/warning in that case.

> + if (!res) {
> + dev_warn(dev, "could not reserve region [%#llx-%#llx]\n",
> + range.start, range.end);
> + kfree(res_name);
> return -EBUSY;
> }
>
> /*
> - * Set flags appropriate for System RAM. Leave ..._BUSY clear
> - * so that add_memory() can add a child resource. Do not
> - * inherit flags from the parent since it may set new flags
> - * unknown to us that will break add_memory() below.
> + * Temporarily clear busy to allow add_memory_driver_managed()
> + * to claim it.
> */
> - new_res->flags = IORESOURCE_SYSTEM_RAM;
> + res->flags &= ~IORESOURCE_BUSY;

Right, same approach is taken by virtio-mem.

>
> /*
> * Ensure that future kexec'd kernels will not treat this as RAM
> * automatically.
> */
> - rc = add_memory_driver_managed(numa_node, new_res->start,
> - resource_size(new_res), kmem_name);
> + rc = add_memory_driver_managed(numa_node, res->start,
> + resource_size(res), kmem_name);
> +
> + res->flags |= IORESOURCE_BUSY;

Hm, I don't think that's correct. Any specific reason why to mark the
not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
suddenly stumble over it - and e.g., similarly kexec code when trying to
find memory for placing kexec images. I think we should leave this
!BUSY, just as it is right now.

> if (rc) {
> - release_resource(new_res);
> - kfree(new_res);
> - kfree(new_res_name);
> + release_mem_region(range.start, range_len(&range));
> + kfree(res_name);
> return rc;
> }
> - dev_dax->dax_kmem_res = new_res;
> +
> + dev_set_drvdata(dev, res_name);
>
> return 0;
> }
>
> #ifdef CONFIG_MEMORY_HOTREMOVE
> -static int dev_dax_kmem_remove(struct device *dev)
> +static void dax_kmem_release(struct dev_dax *dev_dax)
> {
> - struct dev_dax *dev_dax = to_dev_dax(dev);
> - struct resource *res = dev_dax->dax_kmem_res;
> - resource_size_t kmem_start = res->start;
> - resource_size_t kmem_size = resource_size(res);
> - const char *res_name = res->name;
> int rc;
> + struct device *dev = &dev_dax->dev;
> + const char *res_name = dev_get_drvdata(dev);
> + struct range range = dax_kmem_range(dev_dax);
>
> /*
> * We have one shot for removing memory, if some memory blocks were not
> * offline prior to calling this function remove_memory() will fail, and
> * there is no way to hotremove this memory until reboot because device
> - * unbind will succeed even if we return failure.
> + * unbind will proceed regardless of the remove_memory result.
> */
> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> - if (rc) {
> - any_hotremove_failed = true;
> - dev_err(dev,
> - "DAX region %pR cannot be hotremoved until the next reboot\n",
> - res);
> - return rc;
> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
> + if (rc == 0) {

if (!rc) ?

> + release_mem_region(range.start, range_len(&range));

remove_memory() does a release_mem_region_adjustable(). Don't you
actually want to release the *unaligned* region you requested?

> + dev_set_drvdata(dev, NULL);
> + kfree(res_name);
> + return;
> }

Not sure if inverting the error handling improved the code / review here.
>
> - /* Release and free dax resources */
> - release_resource(res);
> - kfree(res);
> - kfree(res_name);
> - dev_dax->dax_kmem_res = NULL;
> -
> - return 0;
> + any_hotremove_failed = true;
> + dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
> + range.start, range.end);
> }
> #else
> -static int dev_dax_kmem_remove(struct device *dev)
> +static void dax_kmem_release(struct dev_dax *dev_dax)
> {
> /*
> - * Without hotremove purposely leak the request_mem_region() for the
> - * device-dax range and return '0' to ->remove() attempts. The removal
> - * of the device from the driver always succeeds, but the region is
> - * permanently pinned as reserved by the unreleased
> - * request_mem_region().
> + * Without hotremove purposely leak the request_mem_region() for
> + * the device-dax range attempts. The removal of the device from
> + * the driver always succeeds, but the region is permanently
> + * pinned as reserved by the unreleased request_mem_region().
> */
> any_hotremove_failed = true;
> - return 0;
> }
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> +static int dev_dax_kmem_remove(struct device *dev)
> +{
> + dax_kmem_release(to_dev_dax(dev));
> + return 0;
> +}
> +
> static struct dax_device_driver device_dax_kmem_driver = {
> .drv = {
> .probe = dev_dax_kmem_probe,
>

Maybe split some of these changes out. Would at least help me to review ;)

--
Thanks,

David / dhildenb

2020-08-21 10:16:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

>>
>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>> automatically used in the buddy during boot, but remains untouched
>> (similar to pmem). But as it involves ACPI as well, it could also be
>> used on arm64 (-e820), correct?
>
> Correct, arm64 also gets the EFI support for enumerating memory this
> way. However, I would clarify that whether soft-reserved is given to
> the buddy allocator by default or not is the kernel's policy choice,
> "buddy-by-default" is ok and is what will happen anyways with older
> kernels on platforms that enumerate a memory range this way.

Is "soft-reserved" then the right terminology for that? It sounds very
x86-64/e820 specific. Maybe a compressed for of "performance
differentiated memory" might be a better fit to expose to user space, no?

>
>> 2. Soft-reserved memory is volatile RAM with differing performance
>> characteristics ("performance differentiated memory"). What would be
>> examples of such memory?
>
> Likely the most prominent one that drove the creation of the "EFI
> Specific Purpose" attribute bit is high-bandwidth memory. One concrete
> example of that was a platform called Knights Landing [1] that ended
> up shipping firmware that lied to the OS about the latency
> characteristics of the memory to try to reverse engineer OS behavior
> to not allocate from that memory range by default. With the EFI
> attribute firmware performance tables can tell the truth about the
> performance characteristics of the memory range *and* indicate that
> the OS not use it for general purpose allocations by default.

Thanks for clarifying!

>
> [1]: https://software.intel.com/content/www/us/en/develop/blogs/an-intro-to-mcdram-high-bandwidth-memory-on-knights-landing.html
>
>> Like, memory that is faster than RAM (scratch
>> pad), or slower (pmem)? Or both? :)
>
> Both, but note that PMEM is already hard-reserved by default.
> Soft-reserved is about a memory range that, for example, an
> administrator may want to reserve 100% for a weather simulation where
> if even a small amount of memory was stolen for the page cache the
> application may not meet its performance targets. It could also be a
> memory range that is so slow that only applications with higher
> latency tolerances would be prepared to consume it.
>
> In other words the soft-reserved memory can be used to indicate memory
> that is either too precious, or too slow for general purpose OS
> allocations.

Right, so actually performance-differentiated in any way :)

>
>> Is it a valid use case to use pmem
>> in a hypervisor to back this memory?
>
> Depends on the pmem. That performance capability is indicated by the
> ACPI HMAT, not the EFI soft-reserved designation.
>
>> 3. There seem to be use cases where "soft-reserved" memory is used via
>> DAX. What is an example use case? I assume it's *not* to treat it like
>> PMEM but instead e.g., use it as a fast buffer inside applications or
>> similar.
>
> Right, in that weather-simulation example that application could just
> mmap /dev/daxX.Y and never worry about contending for the "fast
> memory" resource on the platform. Alternatively if that resource needs
> to be shared and/or over-commited then kernel memory-management
> services are needed and that dax-device can be assigned to kmem.
>
>> 4. There seem to be use cases where some part of "soft-reserved" memory
>> is used via DAX, some other is given to the buddy. What is an example
>> use case? Is this really necessary or only some theoretical use case?
>
> It's as necessary as pmem namespace partitioning, or the inclusion of
> dax-kmem upstream in the first place. In that kmem case the motivation
> was that some users want a portion of pmem provisioned for storage and
> some for volatile usage. The motivation is similar here, platform
> firmware can only identify memory attributes on coarse boundaries,
> finer grained provisioning decisions are up to the administrator /
> platform-owner and the kernel is a just a facilitator of that policy.
>
>>
>> 5. The "provisioned along performance relevant address boundaries." part
>> is unclear to me. Can you give an example of how this would look like
>> from user space? Like, split that memory in blocks of size X with
>> alignment Y and give them to separate applications?
>
> One example of platform address boundaries are the memory address
> ranges that alias in a direct-mapped memory-side-cache. In the
> direct-map-cache aliasing may repeat every N GBs where N is the ratio
> of far-to-near memory. ("Near memory" == cache "Far memory" ==
> backing memory). Also refer back to the background in the page
> allocator shuffling patches [2]. With this partitioning mechanism you
> could, for one example use case, assign different VMs to exclusive
> colors in the memory side cache.

Interesting, thanks!

>
> [2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e900a918b098
>
>> 6. If you add such memory to the buddy, is there any way the system can
>> differentiate it from other memory? E.g., via fake/other NUMA nodes?
>
> Numa node numbers / are how performance differentiated memory ranges
> are enumerated. The expectation is that all distinct performance
> memory targets have unique ACPI proximity domains and Linux numa node
> numbers as a result.

Makes sense to me (although it's somehow weird, because memory of the
same socket/node would be represented via different NUMA nodes), thanks!

>
>> Also, can you give examples of how kmem-added memory is represented in
>> /proc/iomem for a) pmem and b) soft-resered memory after this series
>> (skimming over the patches, I think there is a change for pmem, right?)?
>
> I don't expect a change. The only difference is the parent resource
> will be marked "Soft Reserved" instead of "Persistent Memory".

Right, I misread patch #11 while skimming - I thought the device
resource would be dropped.

>
>> I am really wondering if it's the right approach to squeeze this into
>> our pmem/nvdimm infrastructure just because it's easy to do. E.g., man
>> "ndctl" - "ndctl - Manage "libnvdimm" subsystem devices (Non-volatile
>> Memory)" speaks explicitly about non-volatile memory.
>
> In fact it's not squeezed into PMEM infrastructure. dax-kmem and
> device-dax are independent of PMEM. PMEM is one source of potential
> device-dax instances, soft-reserved memory is another orthogonal
> source. This is why device-dax needs its own userspace policy directed
> partitioning mechanism because there is no PMEM to store the
> configuration for partitioned higph-bandwidth memory. The userspace
> tooling for this mechanism is targeted for a tool called daxctl that
> has no PMEM dependencies. Look to Joao's use case that is using this
> infrastructure independent of PMEM with manual soft-reservations
> specified on the kernel command-line.

Thanks for clarifying, I was under the impression we would be reusing
libnvdimm to manage that memory.

--
Thanks,

David / dhildenb

2020-08-21 18:32:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On 21.08.20 20:27, Dan Williams wrote:
> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>>
>>>>
>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>> automatically used in the buddy during boot, but remains untouched
>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>> used on arm64 (-e820), correct?
>>>
>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>> way. However, I would clarify that whether soft-reserved is given to
>>> the buddy allocator by default or not is the kernel's policy choice,
>>> "buddy-by-default" is ok and is what will happen anyways with older
>>> kernels on platforms that enumerate a memory range this way.
>>
>> Is "soft-reserved" then the right terminology for that? It sounds very
>> x86-64/e820 specific. Maybe a compressed for of "performance
>> differentiated memory" might be a better fit to expose to user space, no?
>
> No. The EFI "Specific Purpose" bit is an attribute independent of
> e820, it's x86-Linux that entangles those together. There is no
> requirement for platform firmware to use that designation even for
> drastic performance differentiation between ranges, and conversely
> there is no requirement that memory *with* that designation has any
> performance difference compared to the default memory pool. So it
> really is a reservation policy about a memory range to keep out of the
> buddy allocator by default.

Okay, still "soft-reserved" is x86-64 specific, no? (AFAIK,
"soft-reserved" will be visible in /proc/iomem, or am I confusing
stuff?) IOW, it "performance differentiated" is not universally
applicable, maybe "specific purpose memory" is ?

--
Thanks,

David / dhildenb

2020-08-21 18:40:12

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>
> >>
> >> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> >> automatically used in the buddy during boot, but remains untouched
> >> (similar to pmem). But as it involves ACPI as well, it could also be
> >> used on arm64 (-e820), correct?
> >
> > Correct, arm64 also gets the EFI support for enumerating memory this
> > way. However, I would clarify that whether soft-reserved is given to
> > the buddy allocator by default or not is the kernel's policy choice,
> > "buddy-by-default" is ok and is what will happen anyways with older
> > kernels on platforms that enumerate a memory range this way.
>
> Is "soft-reserved" then the right terminology for that? It sounds very
> x86-64/e820 specific. Maybe a compressed for of "performance
> differentiated memory" might be a better fit to expose to user space, no?

No. The EFI "Specific Purpose" bit is an attribute independent of
e820, it's x86-Linux that entangles those together. There is no
requirement for platform firmware to use that designation even for
drastic performance differentiation between ranges, and conversely
there is no requirement that memory *with* that designation has any
performance difference compared to the default memory pool. So it
really is a reservation policy about a memory range to keep out of the
buddy allocator by default.

[..]
> > Both, but note that PMEM is already hard-reserved by default.
> > Soft-reserved is about a memory range that, for example, an
> > administrator may want to reserve 100% for a weather simulation where
> > if even a small amount of memory was stolen for the page cache the
> > application may not meet its performance targets. It could also be a
> > memory range that is so slow that only applications with higher
> > latency tolerances would be prepared to consume it.
> >
> > In other words the soft-reserved memory can be used to indicate memory
> > that is either too precious, or too slow for general purpose OS
> > allocations.
>
> Right, so actually performance-differentiated in any way :)

... or not differentiated at all which is Joao's use case for example.

[..]
> > Numa node numbers / are how performance differentiated memory ranges
> > are enumerated. The expectation is that all distinct performance
> > memory targets have unique ACPI proximity domains and Linux numa node
> > numbers as a result.
>
> Makes sense to me (although it's somehow weird, because memory of the
> same socket/node would be represented via different NUMA nodes), thanks!

Yes, numa ids as only physical socket identifiers is no longer a
reliable assumption since the introduction of the ACPI HMAT.

2020-08-21 21:18:33

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <[email protected]> wrote:
>
> On 21.08.20 20:27, Dan Williams wrote:
> > On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
> >>
> >>>>
> >>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
> >>>> automatically used in the buddy during boot, but remains untouched
> >>>> (similar to pmem). But as it involves ACPI as well, it could also be
> >>>> used on arm64 (-e820), correct?
> >>>
> >>> Correct, arm64 also gets the EFI support for enumerating memory this
> >>> way. However, I would clarify that whether soft-reserved is given to
> >>> the buddy allocator by default or not is the kernel's policy choice,
> >>> "buddy-by-default" is ok and is what will happen anyways with older
> >>> kernels on platforms that enumerate a memory range this way.
> >>
> >> Is "soft-reserved" then the right terminology for that? It sounds very
> >> x86-64/e820 specific. Maybe a compressed for of "performance
> >> differentiated memory" might be a better fit to expose to user space, no?
> >
> > No. The EFI "Specific Purpose" bit is an attribute independent of
> > e820, it's x86-Linux that entangles those together. There is no
> > requirement for platform firmware to use that designation even for
> > drastic performance differentiation between ranges, and conversely
> > there is no requirement that memory *with* that designation has any
> > performance difference compared to the default memory pool. So it
> > really is a reservation policy about a memory range to keep out of the
> > buddy allocator by default.
>
> Okay, still "soft-reserved" is x86-64 specific, no?

There's nothing preventing other EFI archs, or a similar designation
in another firmware spec, picking up this policy.

> (AFAIK,
> "soft-reserved" will be visible in /proc/iomem, or am I confusing
> stuff?)

No, you're correct.

> IOW, it "performance differentiated" is not universally
> applicable, maybe "specific purpose memory" is ?

Those bikeshed colors don't seem an improvement to me.

"Soft-reserved" actually tells you something about the kernel policy
for the memory. The criticism of "specific purpose" that led to
calling it "soft-reserved" in Linux is the fact that "specific" is
undefined as far as the firmware knows, and "specific" may have
different applications based on the platform user. "Soft-reserved"
like "Reserved" tells you that a driver policy might be in play for
that memory.

Also note that the current color of the bikeshed has already shipped since v5.5:

262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration

2020-08-21 21:35:12

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges



> Am 21.08.2020 um 23:17 schrieb Dan Williams <[email protected]>:
>
> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <[email protected]> wrote:
>>
>>> On 21.08.20 20:27, Dan Williams wrote:
>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>>>>
>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>> used on arm64 (-e820), correct?
>>>>>
>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>> kernels on platforms that enumerate a memory range this way.
>>>>
>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>> x86-64/e820 specific. Maybe a compressed for of "performance
>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>
>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>> e820, it's x86-Linux that entangles those together. There is no
>>> requirement for platform firmware to use that designation even for
>>> drastic performance differentiation between ranges, and conversely
>>> there is no requirement that memory *with* that designation has any
>>> performance difference compared to the default memory pool. So it
>>> really is a reservation policy about a memory range to keep out of the
>>> buddy allocator by default.
>>
>> Okay, still "soft-reserved" is x86-64 specific, no?
>
> There's nothing preventing other EFI archs, or a similar designation
> in another firmware spec, picking up this policy.
>
>> (AFAIK,
>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>> stuff?)
>
> No, you're correct.
>
>> IOW, it "performance differentiated" is not universally
>> applicable, maybe "specific purpose memory" is ?
>
> Those bikeshed colors don't seem an improvement to me.
>
> "Soft-reserved" actually tells you something about the kernel policy
> for the memory. The criticism of "specific purpose" that led to
> calling it "soft-reserved" in Linux is the fact that "specific" is
> undefined as far as the firmware knows, and "specific" may have
> different applications based on the platform user. "Soft-reserved"
> like "Reserved" tells you that a driver policy might be in play for
> that memory.
>
> Also note that the current color of the bikeshed has already shipped since v5.5:
>
> 262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>

I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.

In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

But anyhow, just details, and you‘re telling me that that ship already sailed. So no further comments from my side.

Thanks for all the info!

2020-08-21 21:51:30

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On 21.08.20 23:33, David Hildenbrand wrote:
>
>
>> Am 21.08.2020 um 23:17 schrieb Dan Williams <[email protected]>:
>>
>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <[email protected]> wrote:
>>>
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>>>>
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>>
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>>
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed for of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>>
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>>
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>>
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>>
>>> (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>>
>> No, you're correct.
>>
>>> IOW, it "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is ?
>>
>> Those bikeshed colors don't seem an improvement to me.
>>
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>>
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>>
>> 262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>>
>
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
>
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

s/normal/most/ , shouldn't be writing emails from my smartphone. Cheers!


--
Thanks,

David / dhildenb

2020-08-21 22:09:01

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges



> Am 21.08.2020 um 23:34 schrieb David Hildenbrand <[email protected]>:
>
> 
>
>>> Am 21.08.2020 um 23:17 schrieb Dan Williams <[email protected]>:
>>>
>>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <[email protected]> wrote:
>>>
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>>>>
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>>
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>>
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed for of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>>
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>>
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>>
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>>
>>> (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>>
>> No, you're correct.
>>
>>> IOW, it "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is ?
>>
>> Those bikeshed colors don't seem an improvement to me.
>>
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>>
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>>
>> 262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>>
>
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
>
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

Obviously s/normal/most/

Cheers!

2020-08-21 22:11:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges



> Am 21.08.2020 um 23:34 schrieb David Hildenbrand <[email protected]>:
>
> 
>
>>> Am 21.08.2020 um 23:17 schrieb Dan Williams <[email protected]>:
>>>
>>> On Fri, Aug 21, 2020 at 11:30 AM David Hildenbrand <[email protected]> wrote:
>>>
>>>> On 21.08.20 20:27, Dan Williams wrote:
>>>> On Fri, Aug 21, 2020 at 3:15 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>>>>
>>>>>>> 1. On x86-64, e820 indicates "soft-reserved" memory. This memory is not
>>>>>>> automatically used in the buddy during boot, but remains untouched
>>>>>>> (similar to pmem). But as it involves ACPI as well, it could also be
>>>>>>> used on arm64 (-e820), correct?
>>>>>>
>>>>>> Correct, arm64 also gets the EFI support for enumerating memory this
>>>>>> way. However, I would clarify that whether soft-reserved is given to
>>>>>> the buddy allocator by default or not is the kernel's policy choice,
>>>>>> "buddy-by-default" is ok and is what will happen anyways with older
>>>>>> kernels on platforms that enumerate a memory range this way.
>>>>>
>>>>> Is "soft-reserved" then the right terminology for that? It sounds very
>>>>> x86-64/e820 specific. Maybe a compressed for of "performance
>>>>> differentiated memory" might be a better fit to expose to user space, no?
>>>>
>>>> No. The EFI "Specific Purpose" bit is an attribute independent of
>>>> e820, it's x86-Linux that entangles those together. There is no
>>>> requirement for platform firmware to use that designation even for
>>>> drastic performance differentiation between ranges, and conversely
>>>> there is no requirement that memory *with* that designation has any
>>>> performance difference compared to the default memory pool. So it
>>>> really is a reservation policy about a memory range to keep out of the
>>>> buddy allocator by default.
>>>
>>> Okay, still "soft-reserved" is x86-64 specific, no?
>>
>> There's nothing preventing other EFI archs, or a similar designation
>> in another firmware spec, picking up this policy.
>>
>>> (AFAIK,
>>> "soft-reserved" will be visible in /proc/iomem, or am I confusing
>>> stuff?)
>>
>> No, you're correct.
>>
>>> IOW, it "performance differentiated" is not universally
>>> applicable, maybe "specific purpose memory" is ?
>>
>> Those bikeshed colors don't seem an improvement to me.
>>
>> "Soft-reserved" actually tells you something about the kernel policy
>> for the memory. The criticism of "specific purpose" that led to
>> calling it "soft-reserved" in Linux is the fact that "specific" is
>> undefined as far as the firmware knows, and "specific" may have
>> different applications based on the platform user. "Soft-reserved"
>> like "Reserved" tells you that a driver policy might be in play for
>> that memory.
>>
>> Also note that the current color of the bikeshed has already shipped since v5.5:
>>
>> 262b45ae3ab4 x86/efi: EFI soft reservation to E820 enumeration
>>
>
> I was asking because I was struggling to even understand what „soft-reserved“ is and I could bet most people have no clue what that is supposed to be.
>
> In contrast „persistent memory“ or „special purpose memory“ in /proc/iomem is something normal (Linux using) human beings can understand.

s/normal/most/ of course :)

2020-08-21 22:58:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 15/23] device-dax: Add resize support

On Sun, 02 Aug 2020 22:03:46 -0700 Dan Williams <[email protected]> wrote:

> Make the device-dax 'size' attribute writable to allow capacity to be
> split between multiple instances in a region. The intended consumers of
> this capability are users that want to split a scarce memory resource
> between device-dax and System-RAM access, or users that want to have
> multiple security domains for a large region.
>
> By default the hmem instance provider allocates an entire region to the
> first instance. The process of creating a new instance (assuming a
> region-id of 0) is find the region and trigger the 'create' attribute
> which yields an empty instance to configure. For example:
>
> cd /sys/bus/dax/devices
> echo dax0.0 > dax0.0/driver/unbind
> echo $new_size > dax0.0/size
> echo 1 > $(readlink -f dax0.0)../dax_region/create
> seed=$(cat $(readlink -f dax0.0)../dax_region/seed)
> echo $new_size > $seed/size
> echo dax0.0 > ../drivers/{device_dax,kmem}/bind
> echo dax0.1 > ../drivers/{device_dax,kmem}/bind
>
> Instances can be destroyed by:
>
> echo $device > $(readlink -f $device)../dax_region/delete

This userspace interface doesn't seem to be documented anywhere, so
there's nothing to update for this patch :(

2020-08-21 23:23:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <[email protected]> wrote:

> > I think I am missing some important pieces. Bear with me.
>
> No worries, also bear with me, I'm going to be offline intermittently
> until at least mid-September. Hopefully Joao and/or Vishal can jump in
> on this discussion.

Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
this.

But given that v4 all applies OK and that Dan has pending outages, I'll
scoop up this version, even though at least one change has been suggested.

Also, this series has killed Zhen Lei's little cleanup
(http://lkml.kernel.org/r/[email protected]).
I don't think the affected code was moved elsewhere, so I'll drop that
patch.

2020-08-22 02:35:17

by Zhen Lei

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges



On 8/22/2020 7:21 AM, Andrew Morton wrote:
> On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <[email protected]> wrote:
>
>>> I think I am missing some important pieces. Bear with me.
>>
>> No worries, also bear with me, I'm going to be offline intermittently
>> until at least mid-September. Hopefully Joao and/or Vishal can jump in
>> on this discussion.
>
> Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> this.
>
> But given that v4 all applies OK and that Dan has pending outages, I'll
> scoop up this version, even though at least one change has been suggested.
>
> Also, this series has killed Zhen Lei's little cleanup
> (http://lkml.kernel.org/r/[email protected]).
> I don't think the affected code was moved elsewhere, so I'll drop that
> patch.

OK, this patch is really optional.

>
>
> .
>

2020-09-08 10:48:50

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On 22.08.20 01:21, Andrew Morton wrote:
> On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <[email protected]> wrote:
>
>>> I think I am missing some important pieces. Bear with me.
>>
>> No worries, also bear with me, I'm going to be offline intermittently
>> until at least mid-September. Hopefully Joao and/or Vishal can jump in
>> on this discussion.
>
> Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> this.
>
> But given that v4 all applies OK and that Dan has pending outages, I'll
> scoop up this version, even though at least one change has been suggested.
>

Should I try to fix patch #11 while Dan is away? Because I think at
least two things in there are wrong (and it would have been better to
split that patch into reviewable pieces).


--
Thanks,

David / dhildenb

2020-09-08 18:05:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

>>> + release_mem_region(range.start, range_len(&range));
>>
>> remove_memory() does a release_mem_region_adjustable(). Don't you
>> actually want to release the *unaligned* region you requested?
>>
> Isn't it what we're doing here?
> (The release_mem_region_adjustable() is using the same
> dax_kmem-aligned range and there's no split/adjust)

Oh, I think I was messing up things (there is just too much going on in
this patch).

Right, request_mem_region() and add_memory_driver_managed() are - and
were - called with the exact same range. That would have been clearer if
the patch would simply use range.start and range_len(&range) for both
calls (similar in the original code).

So, also the release calls have to use the same range. Agreed.

>
> Meaning right now (+ parent marked as !BUSY), and if I am understanding
> this correctly:
>
> request_mem_region(range.start, range_len)
> __request_region(iomem_res, range.start, range_len) -> alloc @parent
> add_memory_driver_managed(parent.start, resource_size(parent))
> __request_region(parent.start, resource_size(parent)) -> alloc @child
>
> [...]
>
> remove_memory(range.start, range_len)
> request_mem_region_adjustable(range.start, range_len)
> __release_region(range.start, range_len) -> remove @child
>
> release_mem_region(range.start, range_len)
> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
>
> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.

Interesting, I can only tell that virtio-mem expects that
remove_memory() won't remove the parent resource (which is !BUSY). So it
relies on the existing functionality.

I do wonder how walk_system_ram_range() behaves if both the parent and
the child are BUSY. Looking at it, I think it will detect the parent and
skip to the next range (without visiting the child) - which is not what
we want.

We could set the parent to BUSY just before doing the
release_mem_region() call, but that feels like a hack.

Maybe it's just easier to keep dax_kmem_res around ...

--
Thanks,

David / dhildenb

2020-09-08 19:57:53

by Joao Martins

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

[Sorry for the late response]

On 8/21/20 11:06 AM, David Hildenbrand wrote:
> On 03.08.20 07:03, Dan Williams wrote:
>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>> * could be mixed in a node with faster memory, causing
>> * unavoidable performance issues.
>> */
>> - numa_node = dev_dax->target_node;
>> if (numa_node < 0) {
>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>> numa_node);
>> return -EINVAL;
>> }
>>
>> - /* Hotplug starting at the beginning of the next block: */
>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
>> -
>> - kmem_size = range_len(range);
>> - /* Adjust the size down to compensate for moving up kmem_start: */
>> - kmem_size -= kmem_start - range->start;
>> - /* Align the size down to cover only complete blocks: */
>> - kmem_size &= ~(memory_block_size_bytes() - 1);
>> - kmem_end = kmem_start + kmem_size;
>> -
>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>> - if (!new_res_name)
>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>> + if (!res_name)
>> return -ENOMEM;
>>
>> - /* Region is permanently reserved if hotremove fails. */
>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>> - if (!new_res) {
>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>> - &kmem_start, &kmem_end);
>> - kfree(new_res_name);
>> + res = request_mem_region(range.start, range_len(&range), res_name);
>
> I think our range could be empty after aligning. I assume
> request_mem_region() would check that, but maybe we could report a
> better error/warning in that case.
>
dax_kmem_range() already returns a memory-block-aligned @range but
IIUC request_mem_region() isn't checking for that. Having said that
the returned @res wouldn't be different from the passed range.start.

>> /*
>> * Ensure that future kexec'd kernels will not treat this as RAM
>> * automatically.
>> */
>> - rc = add_memory_driver_managed(numa_node, new_res->start,
>> - resource_size(new_res), kmem_name);
>> + rc = add_memory_driver_managed(numa_node, res->start,
>> + resource_size(res), kmem_name);
>> +
>> + res->flags |= IORESOURCE_BUSY;
>
> Hm, I don't think that's correct. Any specific reason why to mark the
> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
> suddenly stumble over it - and e.g., similarly kexec code when trying to
> find memory for placing kexec images. I think we should leave this
> !BUSY, just as it is right now.
>
Agreed.

>> if (rc) {
>> - release_resource(new_res);
>> - kfree(new_res);
>> - kfree(new_res_name);
>> + release_mem_region(range.start, range_len(&range));
>> + kfree(res_name);
>> return rc;
>> }
>> - dev_dax->dax_kmem_res = new_res;
>> +
>> + dev_set_drvdata(dev, res_name);
>>
>> return 0;
>> }
>>
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> -static int dev_dax_kmem_remove(struct device *dev)
>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>> {
>> - struct dev_dax *dev_dax = to_dev_dax(dev);
>> - struct resource *res = dev_dax->dax_kmem_res;
>> - resource_size_t kmem_start = res->start;
>> - resource_size_t kmem_size = resource_size(res);
>> - const char *res_name = res->name;
>> int rc;
>> + struct device *dev = &dev_dax->dev;
>> + const char *res_name = dev_get_drvdata(dev);
>> + struct range range = dax_kmem_range(dev_dax);
>>
>> /*
>> * We have one shot for removing memory, if some memory blocks were not
>> * offline prior to calling this function remove_memory() will fail, and
>> * there is no way to hotremove this memory until reboot because device
>> - * unbind will succeed even if we return failure.
>> + * unbind will proceed regardless of the remove_memory result.
>> */
>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
>> - if (rc) {
>> - any_hotremove_failed = true;
>> - dev_err(dev,
>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
>> - res);
>> - return rc;
>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
>> + if (rc == 0) {
>
> if (!rc) ?
>
Better off would be to keep the old order:

if (rc) {
any_hotremove_failed = true;
dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
range.start, range.end);
return;
}

release_mem_region(range.start, range_len(&range));
dev_set_drvdata(dev, NULL);
kfree(res_name);
return;


>> + release_mem_region(range.start, range_len(&range));
>
> remove_memory() does a release_mem_region_adjustable(). Don't you
> actually want to release the *unaligned* region you requested?
>
Isn't it what we're doing here?
(The release_mem_region_adjustable() is using the same
dax_kmem-aligned range and there's no split/adjust)

Meaning right now (+ parent marked as !BUSY), and if I am understanding
this correctly:

request_mem_region(range.start, range_len)
__request_region(iomem_res, range.start, range_len) -> alloc @parent
add_memory_driver_managed(parent.start, resource_size(parent))
__request_region(parent.start, resource_size(parent)) -> alloc @child

[...]

remove_memory(range.start, range_len)
request_mem_region_adjustable(range.start, range_len)
__release_region(range.start, range_len) -> remove @child

release_mem_region(range.start, range_len)
__release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?

The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.

Joao

2020-09-23 01:05:31

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 00/23] device-dax: Support sub-dividing soft-reserved ranges

On Tue, Sep 8, 2020 at 3:46 AM David Hildenbrand <[email protected]> wrote:
>
> On 22.08.20 01:21, Andrew Morton wrote:
> > On Wed, 19 Aug 2020 18:53:57 -0700 Dan Williams <[email protected]> wrote:
> >
> >>> I think I am missing some important pieces. Bear with me.
> >>
> >> No worries, also bear with me, I'm going to be offline intermittently
> >> until at least mid-September. Hopefully Joao and/or Vishal can jump in
> >> on this discussion.
> >
> > Ordinarily I'd prefer a refresh&resend for 2+ week-old series such as
> > this.
> >
> > But given that v4 all applies OK and that Dan has pending outages, I'll
> > scoop up this version, even though at least one change has been suggested.
> >
>
> Should I try to fix patch #11 while Dan is away? Because I think at
> least two things in there are wrong (and it would have been better to
> split that patch into reviewable pieces).

Hey David,

Back now, I'll take a look. I didn't immediately see a way to
meaningfully break up that patch without some dead-code steps in the
conversion, but I'll take another run at it.

2020-09-23 08:06:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On 08.09.20 17:33, Joao Martins wrote:
> [Sorry for the late response]
>
> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>> On 03.08.20 07:03, Dan Williams wrote:
>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>> * could be mixed in a node with faster memory, causing
>>> * unavoidable performance issues.
>>> */
>>> - numa_node = dev_dax->target_node;
>>> if (numa_node < 0) {
>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>> numa_node);
>>> return -EINVAL;
>>> }
>>>
>>> - /* Hotplug starting at the beginning of the next block: */
>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>> -
>>> - kmem_size = range_len(range);
>>> - /* Adjust the size down to compensate for moving up kmem_start: */
>>> - kmem_size -= kmem_start - range->start;
>>> - /* Align the size down to cover only complete blocks: */
>>> - kmem_size &= ~(memory_block_size_bytes() - 1);
>>> - kmem_end = kmem_start + kmem_size;
>>> -
>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> - if (!new_res_name)
>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>> + if (!res_name)
>>> return -ENOMEM;
>>>
>>> - /* Region is permanently reserved if hotremove fails. */
>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>> - if (!new_res) {
>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>> - &kmem_start, &kmem_end);
>>> - kfree(new_res_name);
>>> + res = request_mem_region(range.start, range_len(&range), res_name);
>>
>> I think our range could be empty after aligning. I assume
>> request_mem_region() would check that, but maybe we could report a
>> better error/warning in that case.
>>
> dax_kmem_range() already returns a memory-block-aligned @range but
> IIUC request_mem_region() isn't checking for that. Having said that
> the returned @res wouldn't be different from the passed range.start.
>
>>> /*
>>> * Ensure that future kexec'd kernels will not treat this as RAM
>>> * automatically.
>>> */
>>> - rc = add_memory_driver_managed(numa_node, new_res->start,
>>> - resource_size(new_res), kmem_name);
>>> + rc = add_memory_driver_managed(numa_node, res->start,
>>> + resource_size(res), kmem_name);
>>> +
>>> + res->flags |= IORESOURCE_BUSY;
>>
>> Hm, I don't think that's correct. Any specific reason why to mark the
>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>> find memory for placing kexec images. I think we should leave this
>> !BUSY, just as it is right now.
>>
> Agreed.
>
>>> if (rc) {
>>> - release_resource(new_res);
>>> - kfree(new_res);
>>> - kfree(new_res_name);
>>> + release_mem_region(range.start, range_len(&range));
>>> + kfree(res_name);
>>> return rc;
>>> }
>>> - dev_dax->dax_kmem_res = new_res;
>>> +
>>> + dev_set_drvdata(dev, res_name);
>>>
>>> return 0;
>>> }
>>>
>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>> -static int dev_dax_kmem_remove(struct device *dev)
>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>> {
>>> - struct dev_dax *dev_dax = to_dev_dax(dev);
>>> - struct resource *res = dev_dax->dax_kmem_res;
>>> - resource_size_t kmem_start = res->start;
>>> - resource_size_t kmem_size = resource_size(res);
>>> - const char *res_name = res->name;
>>> int rc;
>>> + struct device *dev = &dev_dax->dev;
>>> + const char *res_name = dev_get_drvdata(dev);
>>> + struct range range = dax_kmem_range(dev_dax);
>>>
>>> /*
>>> * We have one shot for removing memory, if some memory blocks were not
>>> * offline prior to calling this function remove_memory() will fail, and
>>> * there is no way to hotremove this memory until reboot because device
>>> - * unbind will succeed even if we return failure.
>>> + * unbind will proceed regardless of the remove_memory result.
>>> */
>>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
>>> - if (rc) {
>>> - any_hotremove_failed = true;
>>> - dev_err(dev,
>>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
>>> - res);
>>> - return rc;
>>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
>>> + if (rc == 0) {
>>
>> if (!rc) ?
>>
> Better off would be to keep the old order:
>
> if (rc) {
> any_hotremove_failed = true;
> dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
> range.start, range.end);
> return;
> }
>
> release_mem_region(range.start, range_len(&range));
> dev_set_drvdata(dev, NULL);
> kfree(res_name);
> return;
>
>
>>> + release_mem_region(range.start, range_len(&range));
>>
>> remove_memory() does a release_mem_region_adjustable(). Don't you
>> actually want to release the *unaligned* region you requested?
>>
> Isn't it what we're doing here?
> (The release_mem_region_adjustable() is using the same
> dax_kmem-aligned range and there's no split/adjust)
>
> Meaning right now (+ parent marked as !BUSY), and if I am understanding
> this correctly:
>
> request_mem_region(range.start, range_len)
> __request_region(iomem_res, range.start, range_len) -> alloc @parent
> add_memory_driver_managed(parent.start, resource_size(parent))
> __request_region(parent.start, resource_size(parent)) -> alloc @child
>
> [...]
>
> remove_memory(range.start, range_len)
> request_mem_region_adjustable(range.start, range_len)
> __release_region(range.start, range_len) -> remove @child
>
> release_mem_region(range.start, range_len)
> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
>
> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
>
> Joao
>

Thinking about it, if we don't set the parent resource BUSY (which is
what I think is the right way of doing things), and don't want to store
the parent resource pointer, we could add something like
lookup_resource() - e.g., lookup_mem_resource() - , however, searching
properly in the whole hierarchy (instead of only the first level), and
traversing down to the last hierarchy. Then it would be as simple as

remove_memory(range.start, range_len)
res = lookup_mem_resource(range.start);
release_resource(res);

--
Thanks,

David / dhildenb

2020-09-23 21:45:49

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand <[email protected]> wrote:
>
> On 08.09.20 17:33, Joao Martins wrote:
> > [Sorry for the late response]
> >
> > On 8/21/20 11:06 AM, David Hildenbrand wrote:
> >> On 03.08.20 07:03, Dan Williams wrote:
> >>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
> >>> * could be mixed in a node with faster memory, causing
> >>> * unavoidable performance issues.
> >>> */
> >>> - numa_node = dev_dax->target_node;
> >>> if (numa_node < 0) {
> >>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> >>> numa_node);
> >>> return -EINVAL;
> >>> }
> >>>
> >>> - /* Hotplug starting at the beginning of the next block: */
> >>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
> >>> -
> >>> - kmem_size = range_len(range);
> >>> - /* Adjust the size down to compensate for moving up kmem_start: */
> >>> - kmem_size -= kmem_start - range->start;
> >>> - /* Align the size down to cover only complete blocks: */
> >>> - kmem_size &= ~(memory_block_size_bytes() - 1);
> >>> - kmem_end = kmem_start + kmem_size;
> >>> -
> >>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>> - if (!new_res_name)
> >>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>> + if (!res_name)
> >>> return -ENOMEM;
> >>>
> >>> - /* Region is permanently reserved if hotremove fails. */
> >>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
> >>> - if (!new_res) {
> >>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> >>> - &kmem_start, &kmem_end);
> >>> - kfree(new_res_name);
> >>> + res = request_mem_region(range.start, range_len(&range), res_name);
> >>
> >> I think our range could be empty after aligning. I assume
> >> request_mem_region() would check that, but maybe we could report a
> >> better error/warning in that case.
> >>
> > dax_kmem_range() already returns a memory-block-aligned @range but
> > IIUC request_mem_region() isn't checking for that. Having said that
> > the returned @res wouldn't be different from the passed range.start.
> >
> >>> /*
> >>> * Ensure that future kexec'd kernels will not treat this as RAM
> >>> * automatically.
> >>> */
> >>> - rc = add_memory_driver_managed(numa_node, new_res->start,
> >>> - resource_size(new_res), kmem_name);
> >>> + rc = add_memory_driver_managed(numa_node, res->start,
> >>> + resource_size(res), kmem_name);
> >>> +
> >>> + res->flags |= IORESOURCE_BUSY;
> >>
> >> Hm, I don't think that's correct. Any specific reason why to mark the
> >> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
> >> suddenly stumble over it - and e.g., similarly kexec code when trying to
> >> find memory for placing kexec images. I think we should leave this
> >> !BUSY, just as it is right now.
> >>
> > Agreed.
> >
> >>> if (rc) {
> >>> - release_resource(new_res);
> >>> - kfree(new_res);
> >>> - kfree(new_res_name);
> >>> + release_mem_region(range.start, range_len(&range));
> >>> + kfree(res_name);
> >>> return rc;
> >>> }
> >>> - dev_dax->dax_kmem_res = new_res;
> >>> +
> >>> + dev_set_drvdata(dev, res_name);
> >>>
> >>> return 0;
> >>> }
> >>>
> >>> #ifdef CONFIG_MEMORY_HOTREMOVE
> >>> -static int dev_dax_kmem_remove(struct device *dev)
> >>> +static void dax_kmem_release(struct dev_dax *dev_dax)
> >>> {
> >>> - struct dev_dax *dev_dax = to_dev_dax(dev);
> >>> - struct resource *res = dev_dax->dax_kmem_res;
> >>> - resource_size_t kmem_start = res->start;
> >>> - resource_size_t kmem_size = resource_size(res);
> >>> - const char *res_name = res->name;
> >>> int rc;
> >>> + struct device *dev = &dev_dax->dev;
> >>> + const char *res_name = dev_get_drvdata(dev);
> >>> + struct range range = dax_kmem_range(dev_dax);
> >>>
> >>> /*
> >>> * We have one shot for removing memory, if some memory blocks were not
> >>> * offline prior to calling this function remove_memory() will fail, and
> >>> * there is no way to hotremove this memory until reboot because device
> >>> - * unbind will succeed even if we return failure.
> >>> + * unbind will proceed regardless of the remove_memory result.
> >>> */
> >>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> >>> - if (rc) {
> >>> - any_hotremove_failed = true;
> >>> - dev_err(dev,
> >>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
> >>> - res);
> >>> - return rc;
> >>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
> >>> + if (rc == 0) {
> >>
> >> if (!rc) ?
> >>
> > Better off would be to keep the old order:
> >
> > if (rc) {
> > any_hotremove_failed = true;
> > dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
> > range.start, range.end);
> > return;
> > }
> >
> > release_mem_region(range.start, range_len(&range));
> > dev_set_drvdata(dev, NULL);
> > kfree(res_name);
> > return;
> >
> >
> >>> + release_mem_region(range.start, range_len(&range));
> >>
> >> remove_memory() does a release_mem_region_adjustable(). Don't you
> >> actually want to release the *unaligned* region you requested?
> >>
> > Isn't it what we're doing here?
> > (The release_mem_region_adjustable() is using the same
> > dax_kmem-aligned range and there's no split/adjust)
> >
> > Meaning right now (+ parent marked as !BUSY), and if I am understanding
> > this correctly:
> >
> > request_mem_region(range.start, range_len)
> > __request_region(iomem_res, range.start, range_len) -> alloc @parent
> > add_memory_driver_managed(parent.start, resource_size(parent))
> > __request_region(parent.start, resource_size(parent)) -> alloc @child
> >
> > [...]
> >
> > remove_memory(range.start, range_len)
> > request_mem_region_adjustable(range.start, range_len)
> > __release_region(range.start, range_len) -> remove @child
> >
> > release_mem_region(range.start, range_len)
> > __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
> >
> > The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
> > unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
> >
> > Joao
> >
>
> Thinking about it, if we don't set the parent resource BUSY (which is
> what I think is the right way of doing things), and don't want to store
> the parent resource pointer, we could add something like
> lookup_resource() - e.g., lookup_mem_resource() - , however, searching
> properly in the whole hierarchy (instead of only the first level), and
> traversing down to the last hierarchy. Then it would be as simple as
>
> remove_memory(range.start, range_len)
> res = lookup_mem_resource(range.start);
> release_resource(res);

Another thought... I notice that you've taught
register_memory_resource() a IORESOURCE_MEM_DRIVER_MANAGED special
case. Lets just make the assumption of add_memory_driver_managed()
that it is the driver's responsibility to mark the range busy before
calling, and the driver's responsibility to release the region. I.e.
validate (rather than request) that the range is busy in
register_memory_resource(), and teach release_memory_resource() to
skip releasing the region when the memory is marked driver managed.
That would let dax_kmem drop its manipulation of the 'busy' flag which
is a layering violation no matter how many comments we put around it.

2020-09-24 07:27:45

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On 23.09.20 23:41, Dan Williams wrote:
> On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 08.09.20 17:33, Joao Martins wrote:
>>> [Sorry for the late response]
>>>
>>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>>>> On 03.08.20 07:03, Dan Williams wrote:
>>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>>> * could be mixed in a node with faster memory, causing
>>>>> * unavoidable performance issues.
>>>>> */
>>>>> - numa_node = dev_dax->target_node;
>>>>> if (numa_node < 0) {
>>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>>>> numa_node);
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> - /* Hotplug starting at the beginning of the next block: */
>>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>>>> -
>>>>> - kmem_size = range_len(range);
>>>>> - /* Adjust the size down to compensate for moving up kmem_start: */
>>>>> - kmem_size -= kmem_start - range->start;
>>>>> - /* Align the size down to cover only complete blocks: */
>>>>> - kmem_size &= ~(memory_block_size_bytes() - 1);
>>>>> - kmem_end = kmem_start + kmem_size;
>>>>> -
>>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>> - if (!new_res_name)
>>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>> + if (!res_name)
>>>>> return -ENOMEM;
>>>>>
>>>>> - /* Region is permanently reserved if hotremove fails. */
>>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>>>> - if (!new_res) {
>>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>>>> - &kmem_start, &kmem_end);
>>>>> - kfree(new_res_name);
>>>>> + res = request_mem_region(range.start, range_len(&range), res_name);
>>>>
>>>> I think our range could be empty after aligning. I assume
>>>> request_mem_region() would check that, but maybe we could report a
>>>> better error/warning in that case.
>>>>
>>> dax_kmem_range() already returns a memory-block-aligned @range but
>>> IIUC request_mem_region() isn't checking for that. Having said that
>>> the returned @res wouldn't be different from the passed range.start.
>>>
>>>>> /*
>>>>> * Ensure that future kexec'd kernels will not treat this as RAM
>>>>> * automatically.
>>>>> */
>>>>> - rc = add_memory_driver_managed(numa_node, new_res->start,
>>>>> - resource_size(new_res), kmem_name);
>>>>> + rc = add_memory_driver_managed(numa_node, res->start,
>>>>> + resource_size(res), kmem_name);
>>>>> +
>>>>> + res->flags |= IORESOURCE_BUSY;
>>>>
>>>> Hm, I don't think that's correct. Any specific reason why to mark the
>>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>>>> find memory for placing kexec images. I think we should leave this
>>>> !BUSY, just as it is right now.
>>>>
>>> Agreed.
>>>
>>>>> if (rc) {
>>>>> - release_resource(new_res);
>>>>> - kfree(new_res);
>>>>> - kfree(new_res_name);
>>>>> + release_mem_region(range.start, range_len(&range));
>>>>> + kfree(res_name);
>>>>> return rc;
>>>>> }
>>>>> - dev_dax->dax_kmem_res = new_res;
>>>>> +
>>>>> + dev_set_drvdata(dev, res_name);
>>>>>
>>>>> return 0;
>>>>> }
>>>>>
>>>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>> -static int dev_dax_kmem_remove(struct device *dev)
>>>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>>>> {
>>>>> - struct dev_dax *dev_dax = to_dev_dax(dev);
>>>>> - struct resource *res = dev_dax->dax_kmem_res;
>>>>> - resource_size_t kmem_start = res->start;
>>>>> - resource_size_t kmem_size = resource_size(res);
>>>>> - const char *res_name = res->name;
>>>>> int rc;
>>>>> + struct device *dev = &dev_dax->dev;
>>>>> + const char *res_name = dev_get_drvdata(dev);
>>>>> + struct range range = dax_kmem_range(dev_dax);
>>>>>
>>>>> /*
>>>>> * We have one shot for removing memory, if some memory blocks were not
>>>>> * offline prior to calling this function remove_memory() will fail, and
>>>>> * there is no way to hotremove this memory until reboot because device
>>>>> - * unbind will succeed even if we return failure.
>>>>> + * unbind will proceed regardless of the remove_memory result.
>>>>> */
>>>>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
>>>>> - if (rc) {
>>>>> - any_hotremove_failed = true;
>>>>> - dev_err(dev,
>>>>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
>>>>> - res);
>>>>> - return rc;
>>>>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
>>>>> + if (rc == 0) {
>>>>
>>>> if (!rc) ?
>>>>
>>> Better off would be to keep the old order:
>>>
>>> if (rc) {
>>> any_hotremove_failed = true;
>>> dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
>>> range.start, range.end);
>>> return;
>>> }
>>>
>>> release_mem_region(range.start, range_len(&range));
>>> dev_set_drvdata(dev, NULL);
>>> kfree(res_name);
>>> return;
>>>
>>>
>>>>> + release_mem_region(range.start, range_len(&range));
>>>>
>>>> remove_memory() does a release_mem_region_adjustable(). Don't you
>>>> actually want to release the *unaligned* region you requested?
>>>>
>>> Isn't it what we're doing here?
>>> (The release_mem_region_adjustable() is using the same
>>> dax_kmem-aligned range and there's no split/adjust)
>>>
>>> Meaning right now (+ parent marked as !BUSY), and if I am understanding
>>> this correctly:
>>>
>>> request_mem_region(range.start, range_len)
>>> __request_region(iomem_res, range.start, range_len) -> alloc @parent
>>> add_memory_driver_managed(parent.start, resource_size(parent))
>>> __request_region(parent.start, resource_size(parent)) -> alloc @child
>>>
>>> [...]
>>>
>>> remove_memory(range.start, range_len)
>>> request_mem_region_adjustable(range.start, range_len)
>>> __release_region(range.start, range_len) -> remove @child
>>>
>>> release_mem_region(range.start, range_len)
>>> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
>>>
>>> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
>>> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
>>>
>>> Joao
>>>
>>
>> Thinking about it, if we don't set the parent resource BUSY (which is
>> what I think is the right way of doing things), and don't want to store
>> the parent resource pointer, we could add something like
>> lookup_resource() - e.g., lookup_mem_resource() - , however, searching
>> properly in the whole hierarchy (instead of only the first level), and
>> traversing down to the last hierarchy. Then it would be as simple as
>>
>> remove_memory(range.start, range_len)
>> res = lookup_mem_resource(range.start);
>> release_resource(res);
>
> Another thought... I notice that you've taught
> register_memory_resource() a IORESOURCE_MEM_DRIVER_MANAGED special
> case. Lets just make the assumption of add_memory_driver_managed()
> that it is the driver's responsibility to mark the range busy before
> calling, and the driver's responsibility to release the region. I.e.
> validate (rather than request) that the range is busy in
> register_memory_resource(), and teach release_memory_resource() to
> skip releasing the region when the memory is marked driver managed.
> That would let dax_kmem drop its manipulation of the 'busy' flag which
> is a layering violation no matter how many comments we put around it.

IIUC, that won't work for virtio-mem, whereby the parent resource spans
multiple possible (future) add_memory_driver_managed() calls and is
(just like for kmem) a pure indication to which device memory ranges belong.

For example, when exposing 2GB via a virtio-mem device with max 4GB:

(/proc/iomem)
240000000-33fffffff : virtio0
240000000-2bfffffff : System RAM (virtio_mem)

And after hotplugging additional 2GB:

240000000-33fffffff : virtio0
240000000-33fffffff : System RAM (virtio_mem)

So marking "virtio0" always BUSY (especially right from the start) would
be wrong. The assumption is that anything that's IORESOURCE_SYSTEM_RAM
and IORESOUCE_BUSY is currently added to the system as system RAM (e.g.,
after add_memory() and friends, or during boot).

I do agree that manually clearing the busy flag is ugly. What we most
probably want is request_mem_region() that performs similar checks (no
overlaps with existing BUSY resources), but doesn't set the region busy.

--
Thanks,

David / dhildenb

2020-09-24 13:56:36

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand <[email protected]> wrote:
>
> On 23.09.20 23:41, Dan Williams wrote:
> > On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 08.09.20 17:33, Joao Martins wrote:
> >>> [Sorry for the late response]
> >>>
> >>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
> >>>> On 03.08.20 07:03, Dan Williams wrote:
> >>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
> >>>>> * could be mixed in a node with faster memory, causing
> >>>>> * unavoidable performance issues.
> >>>>> */
> >>>>> - numa_node = dev_dax->target_node;
> >>>>> if (numa_node < 0) {
> >>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> >>>>> numa_node);
> >>>>> return -EINVAL;
> >>>>> }
> >>>>>
> >>>>> - /* Hotplug starting at the beginning of the next block: */
> >>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
> >>>>> -
> >>>>> - kmem_size = range_len(range);
> >>>>> - /* Adjust the size down to compensate for moving up kmem_start: */
> >>>>> - kmem_size -= kmem_start - range->start;
> >>>>> - /* Align the size down to cover only complete blocks: */
> >>>>> - kmem_size &= ~(memory_block_size_bytes() - 1);
> >>>>> - kmem_end = kmem_start + kmem_size;
> >>>>> -
> >>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>>>> - if (!new_res_name)
> >>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
> >>>>> + if (!res_name)
> >>>>> return -ENOMEM;
> >>>>>
> >>>>> - /* Region is permanently reserved if hotremove fails. */
> >>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
> >>>>> - if (!new_res) {
> >>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
> >>>>> - &kmem_start, &kmem_end);
> >>>>> - kfree(new_res_name);
> >>>>> + res = request_mem_region(range.start, range_len(&range), res_name);
> >>>>
> >>>> I think our range could be empty after aligning. I assume
> >>>> request_mem_region() would check that, but maybe we could report a
> >>>> better error/warning in that case.
> >>>>
> >>> dax_kmem_range() already returns a memory-block-aligned @range but
> >>> IIUC request_mem_region() isn't checking for that. Having said that
> >>> the returned @res wouldn't be different from the passed range.start.
> >>>
> >>>>> /*
> >>>>> * Ensure that future kexec'd kernels will not treat this as RAM
> >>>>> * automatically.
> >>>>> */
> >>>>> - rc = add_memory_driver_managed(numa_node, new_res->start,
> >>>>> - resource_size(new_res), kmem_name);
> >>>>> + rc = add_memory_driver_managed(numa_node, res->start,
> >>>>> + resource_size(res), kmem_name);
> >>>>> +
> >>>>> + res->flags |= IORESOURCE_BUSY;
> >>>>
> >>>> Hm, I don't think that's correct. Any specific reason why to mark the
> >>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
> >>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
> >>>> find memory for placing kexec images. I think we should leave this
> >>>> !BUSY, just as it is right now.
> >>>>
> >>> Agreed.
> >>>
> >>>>> if (rc) {
> >>>>> - release_resource(new_res);
> >>>>> - kfree(new_res);
> >>>>> - kfree(new_res_name);
> >>>>> + release_mem_region(range.start, range_len(&range));
> >>>>> + kfree(res_name);
> >>>>> return rc;
> >>>>> }
> >>>>> - dev_dax->dax_kmem_res = new_res;
> >>>>> +
> >>>>> + dev_set_drvdata(dev, res_name);
> >>>>>
> >>>>> return 0;
> >>>>> }
> >>>>>
> >>>>> #ifdef CONFIG_MEMORY_HOTREMOVE
> >>>>> -static int dev_dax_kmem_remove(struct device *dev)
> >>>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
> >>>>> {
> >>>>> - struct dev_dax *dev_dax = to_dev_dax(dev);
> >>>>> - struct resource *res = dev_dax->dax_kmem_res;
> >>>>> - resource_size_t kmem_start = res->start;
> >>>>> - resource_size_t kmem_size = resource_size(res);
> >>>>> - const char *res_name = res->name;
> >>>>> int rc;
> >>>>> + struct device *dev = &dev_dax->dev;
> >>>>> + const char *res_name = dev_get_drvdata(dev);
> >>>>> + struct range range = dax_kmem_range(dev_dax);
> >>>>>
> >>>>> /*
> >>>>> * We have one shot for removing memory, if some memory blocks were not
> >>>>> * offline prior to calling this function remove_memory() will fail, and
> >>>>> * there is no way to hotremove this memory until reboot because device
> >>>>> - * unbind will succeed even if we return failure.
> >>>>> + * unbind will proceed regardless of the remove_memory result.
> >>>>> */
> >>>>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> >>>>> - if (rc) {
> >>>>> - any_hotremove_failed = true;
> >>>>> - dev_err(dev,
> >>>>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
> >>>>> - res);
> >>>>> - return rc;
> >>>>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
> >>>>> + if (rc == 0) {
> >>>>
> >>>> if (!rc) ?
> >>>>
> >>> Better off would be to keep the old order:
> >>>
> >>> if (rc) {
> >>> any_hotremove_failed = true;
> >>> dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
> >>> range.start, range.end);
> >>> return;
> >>> }
> >>>
> >>> release_mem_region(range.start, range_len(&range));
> >>> dev_set_drvdata(dev, NULL);
> >>> kfree(res_name);
> >>> return;
> >>>
> >>>
> >>>>> + release_mem_region(range.start, range_len(&range));
> >>>>
> >>>> remove_memory() does a release_mem_region_adjustable(). Don't you
> >>>> actually want to release the *unaligned* region you requested?
> >>>>
> >>> Isn't it what we're doing here?
> >>> (The release_mem_region_adjustable() is using the same
> >>> dax_kmem-aligned range and there's no split/adjust)
> >>>
> >>> Meaning right now (+ parent marked as !BUSY), and if I am understanding
> >>> this correctly:
> >>>
> >>> request_mem_region(range.start, range_len)
> >>> __request_region(iomem_res, range.start, range_len) -> alloc @parent
> >>> add_memory_driver_managed(parent.start, resource_size(parent))
> >>> __request_region(parent.start, resource_size(parent)) -> alloc @child
> >>>
> >>> [...]
> >>>
> >>> remove_memory(range.start, range_len)
> >>> request_mem_region_adjustable(range.start, range_len)
> >>> __release_region(range.start, range_len) -> remove @child
> >>>
> >>> release_mem_region(range.start, range_len)
> >>> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
> >>>
> >>> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
> >>> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
> >>>
> >>> Joao
> >>>
> >>
> >> Thinking about it, if we don't set the parent resource BUSY (which is
> >> what I think is the right way of doing things), and don't want to store
> >> the parent resource pointer, we could add something like
> >> lookup_resource() - e.g., lookup_mem_resource() - , however, searching
> >> properly in the whole hierarchy (instead of only the first level), and
> >> traversing down to the last hierarchy. Then it would be as simple as
> >>
> >> remove_memory(range.start, range_len)
> >> res = lookup_mem_resource(range.start);
> >> release_resource(res);
> >
> > Another thought... I notice that you've taught
> > register_memory_resource() a IORESOURCE_MEM_DRIVER_MANAGED special
> > case. Lets just make the assumption of add_memory_driver_managed()
> > that it is the driver's responsibility to mark the range busy before
> > calling, and the driver's responsibility to release the region. I.e.
> > validate (rather than request) that the range is busy in
> > register_memory_resource(), and teach release_memory_resource() to
> > skip releasing the region when the memory is marked driver managed.
> > That would let dax_kmem drop its manipulation of the 'busy' flag which
> > is a layering violation no matter how many comments we put around it.
>
> IIUC, that won't work for virtio-mem, whereby the parent resource spans
> multiple possible (future) add_memory_driver_managed() calls and is
> (just like for kmem) a pure indication to which device memory ranges belong.
>
> For example, when exposing 2GB via a virtio-mem device with max 4GB:
>
> (/proc/iomem)
> 240000000-33fffffff : virtio0
> 240000000-2bfffffff : System RAM (virtio_mem)
>
> And after hotplugging additional 2GB:
>
> 240000000-33fffffff : virtio0
> 240000000-33fffffff : System RAM (virtio_mem)
>
> So marking "virtio0" always BUSY (especially right from the start) would
> be wrong.

I'm not suggesting to busy the whole "virtio" range, just the portion
that's about to be passed to add_memory_driver_managed().

> The assumption is that anything that's IORESOURCE_SYSTEM_RAM
> and IORESOUCE_BUSY is currently added to the system as system RAM (e.g.,
> after add_memory() and friends, or during boot).
>
> I do agree that manually clearing the busy flag is ugly. What we most
> probably want is request_mem_region() that performs similar checks (no
> overlaps with existing BUSY resources), but doesn't set the region busy.
>

I can't see that working without some way to export and hold the
resource lock until some agent can atomically claim the range.

2020-09-24 18:17:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On 24.09.20 15:54, Dan Williams wrote:
> On Thu, Sep 24, 2020 at 12:26 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 23.09.20 23:41, Dan Williams wrote:
>>> On Wed, Sep 23, 2020 at 1:04 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 08.09.20 17:33, Joao Martins wrote:
>>>>> [Sorry for the late response]
>>>>>
>>>>> On 8/21/20 11:06 AM, David Hildenbrand wrote:
>>>>>> On 03.08.20 07:03, Dan Williams wrote:
>>>>>>> @@ -37,109 +45,94 @@ int dev_dax_kmem_probe(struct device *dev)
>>>>>>> * could be mixed in a node with faster memory, causing
>>>>>>> * unavoidable performance issues.
>>>>>>> */
>>>>>>> - numa_node = dev_dax->target_node;
>>>>>>> if (numa_node < 0) {
>>>>>>> dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
>>>>>>> numa_node);
>>>>>>> return -EINVAL;
>>>>>>> }
>>>>>>>
>>>>>>> - /* Hotplug starting at the beginning of the next block: */
>>>>>>> - kmem_start = ALIGN(range->start, memory_block_size_bytes());
>>>>>>> -
>>>>>>> - kmem_size = range_len(range);
>>>>>>> - /* Adjust the size down to compensate for moving up kmem_start: */
>>>>>>> - kmem_size -= kmem_start - range->start;
>>>>>>> - /* Align the size down to cover only complete blocks: */
>>>>>>> - kmem_size &= ~(memory_block_size_bytes() - 1);
>>>>>>> - kmem_end = kmem_start + kmem_size;
>>>>>>> -
>>>>>>> - new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>>>> - if (!new_res_name)
>>>>>>> + res_name = kstrdup(dev_name(dev), GFP_KERNEL);
>>>>>>> + if (!res_name)
>>>>>>> return -ENOMEM;
>>>>>>>
>>>>>>> - /* Region is permanently reserved if hotremove fails. */
>>>>>>> - new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
>>>>>>> - if (!new_res) {
>>>>>>> - dev_warn(dev, "could not reserve region [%pa-%pa]\n",
>>>>>>> - &kmem_start, &kmem_end);
>>>>>>> - kfree(new_res_name);
>>>>>>> + res = request_mem_region(range.start, range_len(&range), res_name);
>>>>>>
>>>>>> I think our range could be empty after aligning. I assume
>>>>>> request_mem_region() would check that, but maybe we could report a
>>>>>> better error/warning in that case.
>>>>>>
>>>>> dax_kmem_range() already returns a memory-block-aligned @range but
>>>>> IIUC request_mem_region() isn't checking for that. Having said that
>>>>> the returned @res wouldn't be different from the passed range.start.
>>>>>
>>>>>>> /*
>>>>>>> * Ensure that future kexec'd kernels will not treat this as RAM
>>>>>>> * automatically.
>>>>>>> */
>>>>>>> - rc = add_memory_driver_managed(numa_node, new_res->start,
>>>>>>> - resource_size(new_res), kmem_name);
>>>>>>> + rc = add_memory_driver_managed(numa_node, res->start,
>>>>>>> + resource_size(res), kmem_name);
>>>>>>> +
>>>>>>> + res->flags |= IORESOURCE_BUSY;
>>>>>>
>>>>>> Hm, I don't think that's correct. Any specific reason why to mark the
>>>>>> not-added, unaligned parts BUSY? E.g., walk_system_ram_range() could
>>>>>> suddenly stumble over it - and e.g., similarly kexec code when trying to
>>>>>> find memory for placing kexec images. I think we should leave this
>>>>>> !BUSY, just as it is right now.
>>>>>>
>>>>> Agreed.
>>>>>
>>>>>>> if (rc) {
>>>>>>> - release_resource(new_res);
>>>>>>> - kfree(new_res);
>>>>>>> - kfree(new_res_name);
>>>>>>> + release_mem_region(range.start, range_len(&range));
>>>>>>> + kfree(res_name);
>>>>>>> return rc;
>>>>>>> }
>>>>>>> - dev_dax->dax_kmem_res = new_res;
>>>>>>> +
>>>>>>> + dev_set_drvdata(dev, res_name);
>>>>>>>
>>>>>>> return 0;
>>>>>>> }
>>>>>>>
>>>>>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>>>> -static int dev_dax_kmem_remove(struct device *dev)
>>>>>>> +static void dax_kmem_release(struct dev_dax *dev_dax)
>>>>>>> {
>>>>>>> - struct dev_dax *dev_dax = to_dev_dax(dev);
>>>>>>> - struct resource *res = dev_dax->dax_kmem_res;
>>>>>>> - resource_size_t kmem_start = res->start;
>>>>>>> - resource_size_t kmem_size = resource_size(res);
>>>>>>> - const char *res_name = res->name;
>>>>>>> int rc;
>>>>>>> + struct device *dev = &dev_dax->dev;
>>>>>>> + const char *res_name = dev_get_drvdata(dev);
>>>>>>> + struct range range = dax_kmem_range(dev_dax);
>>>>>>>
>>>>>>> /*
>>>>>>> * We have one shot for removing memory, if some memory blocks were not
>>>>>>> * offline prior to calling this function remove_memory() will fail, and
>>>>>>> * there is no way to hotremove this memory until reboot because device
>>>>>>> - * unbind will succeed even if we return failure.
>>>>>>> + * unbind will proceed regardless of the remove_memory result.
>>>>>>> */
>>>>>>> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
>>>>>>> - if (rc) {
>>>>>>> - any_hotremove_failed = true;
>>>>>>> - dev_err(dev,
>>>>>>> - "DAX region %pR cannot be hotremoved until the next reboot\n",
>>>>>>> - res);
>>>>>>> - return rc;
>>>>>>> + rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
>>>>>>> + if (rc == 0) {
>>>>>>
>>>>>> if (!rc) ?
>>>>>>
>>>>> Better off would be to keep the old order:
>>>>>
>>>>> if (rc) {
>>>>> any_hotremove_failed = true;
>>>>> dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
>>>>> range.start, range.end);
>>>>> return;
>>>>> }
>>>>>
>>>>> release_mem_region(range.start, range_len(&range));
>>>>> dev_set_drvdata(dev, NULL);
>>>>> kfree(res_name);
>>>>> return;
>>>>>
>>>>>
>>>>>>> + release_mem_region(range.start, range_len(&range));
>>>>>>
>>>>>> remove_memory() does a release_mem_region_adjustable(). Don't you
>>>>>> actually want to release the *unaligned* region you requested?
>>>>>>
>>>>> Isn't it what we're doing here?
>>>>> (The release_mem_region_adjustable() is using the same
>>>>> dax_kmem-aligned range and there's no split/adjust)
>>>>>
>>>>> Meaning right now (+ parent marked as !BUSY), and if I am understanding
>>>>> this correctly:
>>>>>
>>>>> request_mem_region(range.start, range_len)
>>>>> __request_region(iomem_res, range.start, range_len) -> alloc @parent
>>>>> add_memory_driver_managed(parent.start, resource_size(parent))
>>>>> __request_region(parent.start, resource_size(parent)) -> alloc @child
>>>>>
>>>>> [...]
>>>>>
>>>>> remove_memory(range.start, range_len)
>>>>> request_mem_region_adjustable(range.start, range_len)
>>>>> __release_region(range.start, range_len) -> remove @child
>>>>>
>>>>> release_mem_region(range.start, range_len)
>>>>> __release_region(range.start, range_len) -> doesn't remove @parent because !BUSY?
>>>>>
>>>>> The add/removal of this relies on !BUSY. But now I am wondering if the parent remaining
>>>>> unreleased is deliberate even on CONFIG_MEMORY_HOTREMOVE=y.
>>>>>
>>>>> Joao
>>>>>
>>>>
>>>> Thinking about it, if we don't set the parent resource BUSY (which is
>>>> what I think is the right way of doing things), and don't want to store
>>>> the parent resource pointer, we could add something like
>>>> lookup_resource() - e.g., lookup_mem_resource() - , however, searching
>>>> properly in the whole hierarchy (instead of only the first level), and
>>>> traversing down to the last hierarchy. Then it would be as simple as
>>>>
>>>> remove_memory(range.start, range_len)
>>>> res = lookup_mem_resource(range.start);
>>>> release_resource(res);
>>>
>>> Another thought... I notice that you've taught
>>> register_memory_resource() a IORESOURCE_MEM_DRIVER_MANAGED special
>>> case. Lets just make the assumption of add_memory_driver_managed()
>>> that it is the driver's responsibility to mark the range busy before
>>> calling, and the driver's responsibility to release the region. I.e.
>>> validate (rather than request) that the range is busy in
>>> register_memory_resource(), and teach release_memory_resource() to
>>> skip releasing the region when the memory is marked driver managed.
>>> That would let dax_kmem drop its manipulation of the 'busy' flag which
>>> is a layering violation no matter how many comments we put around it.
>>
>> IIUC, that won't work for virtio-mem, whereby the parent resource spans
>> multiple possible (future) add_memory_driver_managed() calls and is
>> (just like for kmem) a pure indication to which device memory ranges belong.
>>
>> For example, when exposing 2GB via a virtio-mem device with max 4GB:
>>
>> (/proc/iomem)
>> 240000000-33fffffff : virtio0
>> 240000000-2bfffffff : System RAM (virtio_mem)
>>
>> And after hotplugging additional 2GB:
>>
>> 240000000-33fffffff : virtio0
>> 240000000-33fffffff : System RAM (virtio_mem)
>>
>> So marking "virtio0" always BUSY (especially right from the start) would
>> be wrong.
>
> I'm not suggesting to busy the whole "virtio" range, just the portion
> that's about to be passed to add_memory_driver_managed().

I'm afraid I don't get your point. For virtio-mem:

Before:

1. Create virtio0 container resource

2. (somewhen in the future) add_memory_driver_managed()
- Create resource (System RAM (virtio_mem)), marking it busy/driver
managed

After:

1. Create virtio0 container resource

2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
marking it busy/driver managed
3. add_memory_driver_managed()

Not helpful or simpler IMHO.

>
>> The assumption is that anything that's IORESOURCE_SYSTEM_RAM
>> and IORESOUCE_BUSY is currently added to the system as system RAM (e.g.,
>> after add_memory() and friends, or during boot).
>>
>> I do agree that manually clearing the busy flag is ugly. What we most
>> probably want is request_mem_region() that performs similar checks (no
>> overlaps with existing BUSY resources), but doesn't set the region busy.
>>
>
> I can't see that working without some way to export and hold the
> resource lock until some agent can atomically claim the range.

I don't think we have to care about races here. The "BUSY" checks is
really just a check for leftovers, e.g., after kexec or after driver
reloading. If somebody else would try to concurrently add System RAM
/something else within the range of your device, something else, very
weird, would be going on (let's call it a BUG, just like if somebody
would be removing system RAM in your device range ...).

For example, in case of virtio-mem, when you unload the driver, it
cannot remove the "virtio0" resource in case some system ram in the
range is still plugged (busy).

So when reloading the driver, it would try to re-create the virtio0
resource, detect that some system ram in the range is still BUSY, and
fail gracefully. This is how it works and how it's expected to work - at
least for virtio-mem.

I assume something similar can be observed with kmem, when trying to
reload the driver or similar - but races shouldn't be relevant here.

--
Thanks,

David / dhildenb

2020-09-24 21:29:30

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

[..]
> > I'm not suggesting to busy the whole "virtio" range, just the portion
> > that's about to be passed to add_memory_driver_managed().
>
> I'm afraid I don't get your point. For virtio-mem:
>
> Before:
>
> 1. Create virtio0 container resource
>
> 2. (somewhen in the future) add_memory_driver_managed()
> - Create resource (System RAM (virtio_mem)), marking it busy/driver
> managed
>
> After:
>
> 1. Create virtio0 container resource
>
> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
> marking it busy/driver managed
> 3. add_memory_driver_managed()
>
> Not helpful or simpler IMHO.

The concern I'm trying to address is the theoretical race window and
layering violation in this sequence in the kmem driver:

1/ res = request_mem_region(...);
2/ res->flags = IORESOURCE_MEM;
3/ add_memory_driver_managed();

Between 2/ and 3/ something can race and think that it owns the
region. Do I think it will happen in practice, no, but it's still a
pattern that deserves come cleanup.

2020-09-24 21:46:09

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res



> Am 24.09.2020 um 23:26 schrieb Dan Williams <[email protected]>:
>
> [..]
>>> I'm not suggesting to busy the whole "virtio" range, just the portion
>>> that's about to be passed to add_memory_driver_managed().
>>
>> I'm afraid I don't get your point. For virtio-mem:
>>
>> Before:
>>
>> 1. Create virtio0 container resource
>>
>> 2. (somewhen in the future) add_memory_driver_managed()
>> - Create resource (System RAM (virtio_mem)), marking it busy/driver
>> managed
>>
>> After:
>>
>> 1. Create virtio0 container resource
>>
>> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
>> marking it busy/driver managed
>> 3. add_memory_driver_managed()
>>
>> Not helpful or simpler IMHO.
>
> The concern I'm trying to address is the theoretical race window and
> layering violation in this sequence in the kmem driver:
>
> 1/ res = request_mem_region(...);
> 2/ res->flags = IORESOURCE_MEM;
> 3/ add_memory_driver_managed();
>
> Between 2/ and 3/ something can race and think that it owns the
> region. Do I think it will happen in practice, no, but it's still a
> pattern that deserves come cleanup.

I think in that unlikely event (rather impossible), add_memory_driver_managed() should fail, detecting a conflicting (busy) resource. Not sure what will happen next ( and did not double-check).

But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting other bits. And it would be even better if we could avoid manually messing with flags here.
>

2020-09-24 21:52:16

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On Thu, Sep 24, 2020 at 2:42 PM David Hildenbrand <[email protected]> wrote:
>
>
>
> > Am 24.09.2020 um 23:26 schrieb Dan Williams <[email protected]>:
> >
> > [..]
> >>> I'm not suggesting to busy the whole "virtio" range, just the portion
> >>> that's about to be passed to add_memory_driver_managed().
> >>
> >> I'm afraid I don't get your point. For virtio-mem:
> >>
> >> Before:
> >>
> >> 1. Create virtio0 container resource
> >>
> >> 2. (somewhen in the future) add_memory_driver_managed()
> >> - Create resource (System RAM (virtio_mem)), marking it busy/driver
> >> managed
> >>
> >> After:
> >>
> >> 1. Create virtio0 container resource
> >>
> >> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
> >> marking it busy/driver managed
> >> 3. add_memory_driver_managed()
> >>
> >> Not helpful or simpler IMHO.
> >
> > The concern I'm trying to address is the theoretical race window and
> > layering violation in this sequence in the kmem driver:
> >
> > 1/ res = request_mem_region(...);
> > 2/ res->flags = IORESOURCE_MEM;
> > 3/ add_memory_driver_managed();
> >
> > Between 2/ and 3/ something can race and think that it owns the
> > region. Do I think it will happen in practice, no, but it's still a
> > pattern that deserves come cleanup.
>
> I think in that unlikely event (rather impossible), add_memory_driver_managed() should fail, detecting a conflicting (busy) resource. Not sure what will happen next ( and did not double-check).

add_memory_driver_managed() will fail, but the release_mem_region() in
kmem to unwind on the error path will do the wrong thing because that
other driver thinks it got ownership of the region.

> But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting other bits. And it would be even better if we could avoid manually messing with flags here.

I'm ok to leave it alone for now (hasn't been and likely never will be
a problem in practice), but I think it was still worth grumbling
about. I'll leave that part of kmem alone in the upcoming split of
dax_kmem_res removal.

2020-09-25 08:58:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 11/23] device-dax: Kill dax_kmem_res

On 24.09.20 23:50, Dan Williams wrote:
> On Thu, Sep 24, 2020 at 2:42 PM David Hildenbrand <[email protected]> wrote:
>>
>>
>>
>>> Am 24.09.2020 um 23:26 schrieb Dan Williams <[email protected]>:
>>>
>>> [..]
>>>>> I'm not suggesting to busy the whole "virtio" range, just the portion
>>>>> that's about to be passed to add_memory_driver_managed().
>>>>
>>>> I'm afraid I don't get your point. For virtio-mem:
>>>>
>>>> Before:
>>>>
>>>> 1. Create virtio0 container resource
>>>>
>>>> 2. (somewhen in the future) add_memory_driver_managed()
>>>> - Create resource (System RAM (virtio_mem)), marking it busy/driver
>>>> managed
>>>>
>>>> After:
>>>>
>>>> 1. Create virtio0 container resource
>>>>
>>>> 2. (somewhen in the future) Create resource (System RAM (virtio_mem)),
>>>> marking it busy/driver managed
>>>> 3. add_memory_driver_managed()
>>>>
>>>> Not helpful or simpler IMHO.
>>>
>>> The concern I'm trying to address is the theoretical race window and
>>> layering violation in this sequence in the kmem driver:
>>>
>>> 1/ res = request_mem_region(...);
>>> 2/ res->flags = IORESOURCE_MEM;
>>> 3/ add_memory_driver_managed();
>>>
>>> Between 2/ and 3/ something can race and think that it owns the
>>> region. Do I think it will happen in practice, no, but it's still a
>>> pattern that deserves come cleanup.
>>
>> I think in that unlikely event (rather impossible), add_memory_driver_managed() should fail, detecting a conflicting (busy) resource. Not sure what will happen next ( and did not double-check).
>
> add_memory_driver_managed() will fail, but the release_mem_region() in
> kmem to unwind on the error path will do the wrong thing because that
> other driver thinks it got ownership of the region.
>

I think if somebody would race and claim the region for itself (after we
unchecked the BUSY flag), there would be another memory resource below
our resource container (e.g., via __request_region()).

So, interestingly, the current code will do a

release_resource->__release_resource(old, true);

which will remove whatever somebody added below the resource.

If we were to do a

remove_resource->__release_resource(old, false);

we would only remove what we temporarily added, relocating anychildren
(someone nasty added).

But yeah, I don't think we have to worry about this case.

>> But yeah, the way the BUSY bit is cleared here is wrong - simply overwriting other bits. And it would be even better if we could avoid manually messing with flags here.
>
> I'm ok to leave it alone for now (hasn't been and likely never will be
> a problem in practice), but I think it was still worth grumbling

Definitely, it gives us a better understanding.

> about. I'll leave that part of kmem alone in the upcoming split of
> dax_kmem_res removal.

Yeah, stuff is more complicated than I would wished, so I guess it's
better to leave it alone for now until we actually see issues with
somebody else regarding *our* device-owned region (or we're able to come
up with a cleanup that keeps all corner cases working for kmem and
virtio-mem).

--
Thanks,

David / dhildenb