The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit the memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.
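As a rough illustration of why the memmap can be a problem, the arithmetic can be sketched in plain user-space C. The 4 KiB page size and the 64-byte struct page are assumptions chosen for the example (both depend on architecture and kernel configuration), and example_memmap_bytes() is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Assumed values, for illustration only: 4 KiB base pages and a
 * 64-byte struct page. Real values depend on arch and config.
 */
#define EXAMPLE_PAGE_SIZE        4096ULL
#define EXAMPLE_STRUCT_PAGE_SIZE 64ULL

/* Bytes of memmap needed to describe mem_bytes of hot-added memory. */
static uint64_t example_memmap_bytes(uint64_t mem_bytes)
{
	return (mem_bytes / EXAMPLE_PAGE_SIZE) * EXAMPLE_STRUCT_PAGE_SIZE;
}
```

Under these assumptions, hot-adding 1 TiB needs 16 GiB of memmap, while a single 128 MiB memory block needs 2 MiB, which is what memmap_on_memory carves out of the block itself.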
The main hurdle for kmem is that memmap_on_memory is only possible when
the memory being added is exactly the size of one memblock. To overcome
this, allow the hotplug code to split an add_memory()
request into memblock-sized chunks, and try_remove_memory() to also
expect and handle such a scenario.
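The splitting described above can be sketched in user-space C as follows. This is only a model of the loop shape: example_split_range() and its parameters are hypothetical, and it assumes the range is already aligned to the block size, as the hotplug code is expected to require.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Split [start, start + size) into block_size chunks and record the
 * start address of each chunk. Returns the number of chunks, or 0 if
 * the range is misaligned or does not fit in chunk_starts[].
 */
static size_t example_split_range(uint64_t start, uint64_t size,
				  uint64_t block_size,
				  uint64_t *chunk_starts, size_t max_chunks)
{
	size_t n = 0;

	if (block_size == 0 || start % block_size || size % block_size)
		return 0;

	for (uint64_t cur = start; cur < start + size; cur += block_size) {
		if (n == max_chunks)
			return 0;
		chunk_starts[n++] = cur;
	}

	return n;
}
```

Each recorded chunk start corresponds to one memory block that would get its own altmap on add, and that would be torn down individually on remove.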
Patch 1 replaces an open-coded kmemdup() in add_memory_resource().
Patch 2 teaches the memory_hotplug code to allow for splitting
add_memory() and remove_memory() requests over memblock sized chunks.
Patch 3 allows the dax region drivers to request memmap_on_memory
semantics. CXL dax regions default this to 'on', all others default to
off to keep existing behavior unchanged.
Signed-off-by: Vishal Verma <[email protected]>
---
Changes in v8:
- Fix unwinding in create_altmaps_and_memory_blocks() to remove
partially added blocks and altmaps. (David, Ying)
- Simplify remove_memory_blocks_and_altmaps() since an altmap is
assured. (David)
- Since we remove per memory-block altmaps, the walk through memory
blocks for a larger range isn't needed. Instead we can lookup the
memory block directly from the pfn, allowing the test_has_altmap_cb()
callback to be dropped and removed. (David)
- Link to v7: https://lore.kernel.org/r/[email protected]
Changes in v7:
- Make the add_memory_resource() flow symmetrical w.r.t. try_remove_memory()
in terms of how the altmap path is taken (David Hildenbrand)
- Move a comment, clean up usage of 'memblock' vs. 'memory_block'
(David Hildenbrand)
- Don't use the altmap path for the mhp_supports_memmap_on_memory(memblock_size) == false
case (Huang Ying)
- Link to v6: https://lore.kernel.org/r/[email protected]
Changes in v6:
- Add a prep patch to replace an open coded kmemdup in
add_memory_resource() (Dan Williams)
- Fix ordering of firmware_map_remove w.r.t. taking the hotplug lock
(David Hildenbrand)
- Remove unused 'nid' variable, and a stray whitespace (David Hildenbrand)
- Clean up and simplify the altmap vs non-altmap paths for
try_remove_memory (David Hildenbrand)
- Add a note to the changelog in patch 1 linking to the PUD mappings
proposal (David Hildenbrand)
- Remove the new sysfs ABI from the kmem/dax drivers until ABI
documentation for /sys/bus/dax can be established (will split this out
into a separate patchset) (Dan Williams)
- Link to v5: https://lore.kernel.org/r/[email protected]
Changes in v5:
- Separate out per-memblock operations from per memory block operations
in try_remove_memory(), and rename the inner function appropriately.
This does expand the scope of the memory hotplug lock to include
remove_memory_block_devices(), but the alternative was to drop the
lock in the inner function separately for each iteration, and then
re-acquire it in try_remove_memory() creating a small window where
the lock isn't held. (David Hildenbrand)
- Remove unnecessary rc check from the memmap_on_memory_store sysfs
helper in patch 2 (Dan Carpenter)
- Link to v4: https://lore.kernel.org/r/[email protected]
Changes in v4:
- Rebase to Aneesh's PPC64 memmap_on_memory series v8 [2].
- Tweak a goto / error path in add_memory_create_devices() (Jonathan)
- Retain the old behavior for dax devices, only default to
memmap_on_memory for CXL (Jonathan)
- Link to v3: https://lore.kernel.org/r/[email protected]
[2]: https://lore.kernel.org/linux-mm/[email protected]
Changes in v3:
- Rebase on Aneesh's patches [1]
- Drop Patch 1 - it is not needed since [1] allows for dynamic setting
of the memmap_on_memory param (David)
- Link to v2: https://lore.kernel.org/r/[email protected]
[1]: https://lore.kernel.org/r/[email protected]
Changes in v2:
- Drop the patch to create an override path for the memmap_on_memory
module param (David)
- Move the chunking into memory_hotplug.c so that any caller of
add_memory() can request this behavior. (David)
- Handle remove_memory() too. (David, Ying)
- Add a sysfs control in the kmem driver for memmap_on_memory semantics
(David, Jonathan)
- Add a #else case to define mhp_supports_memmap_on_memory() if
CONFIG_MEMORY_HOTPLUG is unset. (0day report)
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Vishal Verma (3):
mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
mm/memory_hotplug: split memmap_on_memory requests across memblocks
dax/kmem: allow kmem to add memory with memmap_on_memory
drivers/dax/bus.h | 1 +
drivers/dax/dax-private.h | 1 +
drivers/dax/bus.c | 3 +
drivers/dax/cxl.c | 1 +
drivers/dax/hmem/hmem.c | 1 +
drivers/dax/kmem.c | 8 +-
drivers/dax/pmem.c | 1 +
mm/memory_hotplug.c | 211 ++++++++++++++++++++++++++++++----------------
8 files changed, 152 insertions(+), 75 deletions(-)
---
base-commit: 25b5b1a0646c3d39e1d885e27c10be1c9e202bf2
change-id: 20230613-vv-kmem_memmap-5483c8d04279
Best regards,
--
Vishal Verma <[email protected]>
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
'memblock_size' chunks of memory being added. Adding a larger span of
memory precludes memmap_on_memory semantics.
For users of hotplug such as kmem, large amounts of memory might get
added from the CXL subsystem. In some cases, this amount may exceed the
available 'main memory' to store the memmap for the memory being added.
In this case, it is useful to have a way to place the memmap on the
memory being added, even if it means splitting the addition into
memblock-sized chunks.
Change add_memory_resource() to loop over memblock-sized chunks of
memory if caller requested memmap_on_memory, and if other conditions for
it are met. Teach try_remove_memory() to also expect that a memory
range being removed might have been split up into memblock sized chunks,
and to loop through those as needed.
This does preclude being able to use PUD mappings in the direct map; a
proposal to how this could be optimized in the future is laid out
here[1].
[1]: https://lore.kernel.org/linux-mm/[email protected]/
Cc: Andrew Morton <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huang Ying <[email protected]>
Suggested-by: David Hildenbrand <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
mm/memory_hotplug.c | 213 ++++++++++++++++++++++++++++++++++------------------
1 file changed, 138 insertions(+), 75 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 6be7de9efa55..d242e49d7f7b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
return arch_supports_memmap_on_memory(vmemmap_size);
}
+static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
+{
+ unsigned long memblock_size = memory_block_size_bytes();
+ u64 cur_start;
+
+ /*
+ * For memmap_on_memory, the altmaps were added on a per-memblock
+ * basis; we have to process each individual memory block.
+ */
+ for (cur_start = start; cur_start < start + size;
+ cur_start += memblock_size) {
+ struct vmem_altmap *altmap = NULL;
+ struct memory_block *mem;
+
+ mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
+ WARN_ON_ONCE(!mem);
+ if (!mem)
+ continue;
+
+ altmap = mem->altmap;
+ mem->altmap = NULL;
+
+ remove_memory_block_devices(cur_start, memblock_size);
+
+ arch_remove_memory(cur_start, memblock_size, altmap);
+
+ /* Verify that all vmemmap pages have actually been freed. */
+ WARN(altmap->alloc, "Altmap not fully unmapped");
+ kfree(altmap);
+ }
+}
+
+static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
+ u64 start, u64 size)
+{
+ unsigned long memblock_size = memory_block_size_bytes();
+ u64 cur_start;
+ int ret;
+
+ for (cur_start = start; cur_start < start + size;
+ cur_start += memblock_size) {
+ struct mhp_params params = { .pgprot =
+ pgprot_mhp(PAGE_KERNEL) };
+ struct vmem_altmap mhp_altmap = {
+ .base_pfn = PHYS_PFN(cur_start),
+ .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
+ };
+
+ mhp_altmap.free = memory_block_memmap_on_memory_pages();
+ params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
+ GFP_KERNEL);
+ if (!params.altmap)
+ return -ENOMEM;
+
+ /* call arch's memory hotadd */
+ ret = arch_add_memory(nid, cur_start, memblock_size, &params);
+ if (ret < 0) {
+ kfree(params.altmap);
+ goto out;
+ }
+
+ /* create memory block devices after memory was added */
+ ret = create_memory_block_devices(cur_start, memblock_size,
+ params.altmap, group);
+ if (ret) {
+ arch_remove_memory(cur_start, memblock_size, NULL);
+ kfree(params.altmap);
+ goto out;
+ }
+ }
+
+ return 0;
+out:
+ if (ret && (cur_start != start))
+ remove_memory_blocks_and_altmaps(start, cur_start - start);
+ return ret;
+}
+
/*
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
* and online/offline operations (triggered e.g. by sysfs).
@@ -1390,10 +1468,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
{
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
- struct vmem_altmap mhp_altmap = {
- .base_pfn = PHYS_PFN(res->start),
- .end_pfn = PHYS_PFN(res->end),
- };
struct memory_group *group = NULL;
u64 start, size;
bool new_node = false;
@@ -1436,28 +1510,22 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
/*
* Self hosted memmap array
*/
- if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
- if (mhp_supports_memmap_on_memory(size)) {
- mhp_altmap.free = memory_block_memmap_on_memory_pages();
- params.altmap = kmemdup(&mhp_altmap,
- sizeof(struct vmem_altmap),
- GFP_KERNEL);
- if (!params.altmap)
- goto error;
+ if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
+ mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
+ ret = create_altmaps_and_memory_blocks(nid, group, start, size);
+ if (ret)
+ goto error;
+ } else {
+ ret = arch_add_memory(nid, start, size, &params);
+ if (ret < 0)
+ goto error;
+
+ /* create memory block devices after memory was added */
+ ret = create_memory_block_devices(start, size, NULL, group);
+ if (ret) {
+ arch_remove_memory(start, size, NULL);
+ goto error;
}
- /* fallback to not using altmap */
- }
-
- /* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, &params);
- if (ret < 0)
- goto error_free;
-
- /* create memory block devices after memory was added */
- ret = create_memory_block_devices(start, size, params.altmap, group);
- if (ret) {
- arch_remove_memory(start, size, NULL);
- goto error_free;
}
if (new_node) {
@@ -1494,8 +1562,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
walk_memory_blocks(start, size, NULL, online_memory_block);
return ret;
-error_free:
- kfree(params.altmap);
error:
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
memblock_remove(start, size);
@@ -2062,17 +2128,13 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
return 0;
}
-static int test_has_altmap_cb(struct memory_block *mem, void *arg)
+static int count_memory_range_altmaps_cb(struct memory_block *mem, void *arg)
{
- struct memory_block **mem_ptr = (struct memory_block **)arg;
- /*
- * return the memblock if we have altmap
- * and break callback.
- */
- if (mem->altmap) {
- *mem_ptr = mem;
- return 1;
- }
+ u64 *num_altmaps = (u64 *)arg;
+
+ if (mem->altmap)
+ *num_altmaps += 1;
+
return 0;
}
@@ -2146,11 +2208,31 @@ void try_offline_node(int nid)
}
EXPORT_SYMBOL(try_offline_node);
+static int memory_blocks_have_altmaps(u64 start, u64 size)
+{
+ u64 num_memblocks = size / memory_block_size_bytes();
+ u64 num_altmaps = 0;
+
+ if (!mhp_memmap_on_memory())
+ return 0;
+
+ walk_memory_blocks(start, size, &num_altmaps,
+ count_memory_range_altmaps_cb);
+
+ if (num_altmaps == 0)
+ return 0;
+
+ if (num_memblocks != num_altmaps) {
+ WARN_ONCE(1, "Not all memblocks in range have altmaps");
+ return -EINVAL;
+ }
+
+ return 1;
+}
+
static int __ref try_remove_memory(u64 start, u64 size)
{
- struct memory_block *mem;
- int rc = 0, nid = NUMA_NO_NODE;
- struct vmem_altmap *altmap = NULL;
+ int rc, nid = NUMA_NO_NODE;
BUG_ON(check_hotplug_memory_range(start, size));
@@ -2167,45 +2249,25 @@ static int __ref try_remove_memory(u64 start, u64 size)
if (rc)
return rc;
- /*
- * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
- * the same granularity it was added - a single memory block.
- */
- if (mhp_memmap_on_memory()) {
- rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
- if (rc) {
- if (size != memory_block_size_bytes()) {
- pr_warn("Refuse to remove %#llx - %#llx,"
- "wrong granularity\n",
- start, start + size);
- return -EINVAL;
- }
- altmap = mem->altmap;
- /*
- * Mark altmap NULL so that we can add a debug
- * check on memblock free.
- */
- mem->altmap = NULL;
- }
- }
-
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
- /*
- * Memory block device removal under the device_hotplug_lock is
- * a barrier against racing online attempts.
- */
- remove_memory_block_devices(start, size);
-
mem_hotplug_begin();
- arch_remove_memory(start, size, altmap);
-
- /* Verify that all vmemmap pages have actually been freed. */
- if (altmap) {
- WARN(altmap->alloc, "Altmap not fully unmapped");
- kfree(altmap);
+ rc = memory_blocks_have_altmaps(start, size);
+ if (rc < 0) {
+ goto err;
+ } else if (rc == 0) {
+ /*
+ * Memory block device removal under the device_hotplug_lock is
+ * a barrier against racing online attempts.
+ * No altmaps present, do the removal directly
+ */
+ remove_memory_block_devices(start, size);
+ arch_remove_memory(start, size, NULL);
+ } else {
+ /* all memblocks in the range have altmaps */
+ remove_memory_blocks_and_altmaps(start, size);
}
if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
@@ -2218,8 +2280,9 @@ static int __ref try_remove_memory(u64 start, u64 size)
if (nid != NUMA_NO_NODE)
try_offline_node(nid);
+err:
mem_hotplug_done();
- return 0;
+ return (rc < 0 ? rc : 0);
}
/**
--
2.41.0
Vishal Verma <[email protected]> writes:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: https://lore.kernel.org/linux-mm/[email protected]/
>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> mm/memory_hotplug.c | 213 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 138 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..d242e49d7f7b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
> return arch_supports_memmap_on_memory(vmemmap_size);
> }
>
> +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> +
> + /*
> + * For memmap_on_memory, the altmaps were added on a per-memblock
> + * basis; we have to process each individual memory block.
> + */
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct vmem_altmap *altmap = NULL;
> + struct memory_block *mem;
> +
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> + WARN_ON_ONCE(!mem);
> + if (!mem)
> + continue;
> +
> + altmap = mem->altmap;
> + mem->altmap = NULL;
> +
> + remove_memory_block_devices(cur_start, memblock_size);
> +
> + arch_remove_memory(cur_start, memblock_size, altmap);
> +
> + /* Verify that all vmemmap pages have actually been freed. */
> + WARN(altmap->alloc, "Altmap not fully unmapped");
> + kfree(altmap);
> + }
> +}
> +
> +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> + u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> + int ret;
> +
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> + pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
Use "goto out" here too?
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);
> + goto out;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> + params.altmap, group);
> + if (ret) {
> + arch_remove_memory(cur_start, memblock_size, NULL);
> + kfree(params.altmap);
How about moving arch_remove_memory() and kfree() to the error path and
using a different label?
--
Best Regards,
Huang, Ying
> + goto out;
> + }
> + }
> +
> + return 0;
> +out:
> + if (ret && (cur_start != start))
> + remove_memory_blocks_and_altmaps(start, cur_start - start);
> + return ret;
> +}
> +
> /*
> * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> * and online/offline operations (triggered e.g. by sysfs).
> @@ -1390,10 +1468,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> {
> struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {
> - .base_pfn = PHYS_PFN(res->start),
> - .end_pfn = PHYS_PFN(res->end),
> - };
> struct memory_group *group = NULL;
> u64 start, size;
> bool new_node = false;
> @@ -1436,28 +1510,22 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> /*
> * Self hosted memmap array
> */
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if (mhp_supports_memmap_on_memory(size)) {
> - mhp_altmap.free = memory_block_memmap_on_memory_pages();
> - params.altmap = kmemdup(&mhp_altmap,
> - sizeof(struct vmem_altmap),
> - GFP_KERNEL);
> - if (!params.altmap)
> - goto error;
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> + mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
> + ret = create_altmaps_and_memory_blocks(nid, group, start, size);
> + if (ret)
> + goto error;
> + } else {
> + ret = arch_add_memory(nid, start, size, &params);
> + if (ret < 0)
> + goto error;
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(start, size, NULL, group);
> + if (ret) {
> + arch_remove_memory(start, size, NULL);
> + goto error;
> }
> - /* fallback to not using altmap */
> - }
> -
> - /* call arch's memory hotadd */
> - ret = arch_add_memory(nid, start, size, &params);
> - if (ret < 0)
> - goto error_free;
> -
> - /* create memory block devices after memory was added */
> - ret = create_memory_block_devices(start, size, params.altmap, group);
> - if (ret) {
> - arch_remove_memory(start, size, NULL);
> - goto error_free;
> }
>
> if (new_node) {
> @@ -1494,8 +1562,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> walk_memory_blocks(start, size, NULL, online_memory_block);
>
> return ret;
> -error_free:
> - kfree(params.altmap);
> error:
> if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
> memblock_remove(start, size);
> @@ -2062,17 +2128,13 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> return 0;
> }
>
> -static int test_has_altmap_cb(struct memory_block *mem, void *arg)
> +static int count_memory_range_altmaps_cb(struct memory_block *mem, void *arg)
> {
> - struct memory_block **mem_ptr = (struct memory_block **)arg;
> - /*
> - * return the memblock if we have altmap
> - * and break callback.
> - */
> - if (mem->altmap) {
> - *mem_ptr = mem;
> - return 1;
> - }
> + u64 *num_altmaps = (u64 *)arg;
> +
> + if (mem->altmap)
> + *num_altmaps += 1;
> +
> return 0;
> }
>
> @@ -2146,11 +2208,31 @@ void try_offline_node(int nid)
> }
> EXPORT_SYMBOL(try_offline_node);
>
> +static int memory_blocks_have_altmaps(u64 start, u64 size)
> +{
> + u64 num_memblocks = size / memory_block_size_bytes();
> + u64 num_altmaps = 0;
> +
> + if (!mhp_memmap_on_memory())
> + return 0;
> +
> + walk_memory_blocks(start, size, &num_altmaps,
> + count_memory_range_altmaps_cb);
> +
> + if (num_altmaps == 0)
> + return 0;
> +
> + if (num_memblocks != num_altmaps) {
> + WARN_ONCE(1, "Not all memblocks in range have altmaps");
> + return -EINVAL;
> + }
> +
> + return 1;
> +}
> +
> static int __ref try_remove_memory(u64 start, u64 size)
> {
> - struct memory_block *mem;
> - int rc = 0, nid = NUMA_NO_NODE;
> - struct vmem_altmap *altmap = NULL;
> + int rc, nid = NUMA_NO_NODE;
>
> BUG_ON(check_hotplug_memory_range(start, size));
>
> @@ -2167,45 +2249,25 @@ static int __ref try_remove_memory(u64 start, u64 size)
> if (rc)
> return rc;
>
> - /*
> - * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> - * the same granularity it was added - a single memory block.
> - */
> - if (mhp_memmap_on_memory()) {
> - rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> - if (rc) {
> - if (size != memory_block_size_bytes()) {
> - pr_warn("Refuse to remove %#llx - %#llx,"
> - "wrong granularity\n",
> - start, start + size);
> - return -EINVAL;
> - }
> - altmap = mem->altmap;
> - /*
> - * Mark altmap NULL so that we can add a debug
> - * check on memblock free.
> - */
> - mem->altmap = NULL;
> - }
> - }
> -
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
>
> - /*
> - * Memory block device removal under the device_hotplug_lock is
> - * a barrier against racing online attempts.
> - */
> - remove_memory_block_devices(start, size);
> -
> mem_hotplug_begin();
>
> - arch_remove_memory(start, size, altmap);
> -
> - /* Verify that all vmemmap pages have actually been freed. */
> - if (altmap) {
> - WARN(altmap->alloc, "Altmap not fully unmapped");
> - kfree(altmap);
> + rc = memory_blocks_have_altmaps(start, size);
> + if (rc < 0) {
> + goto err;
> + } else if (rc == 0) {
> + /*
> + * Memory block device removal under the device_hotplug_lock is
> + * a barrier against racing online attempts.
> + * No altmaps present, do the removal directly
> + */
> + remove_memory_block_devices(start, size);
> + arch_remove_memory(start, size, NULL);
> + } else {
> + /* all memblocks in the range have altmaps */
> + remove_memory_blocks_and_altmaps(start, size);
> }
>
> if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> @@ -2218,8 +2280,9 @@ static int __ref try_remove_memory(u64 start, u64 size)
> if (nid != NUMA_NO_NODE)
> try_offline_node(nid);
>
> +err:
> mem_hotplug_done();
> - return 0;
> + return (rc < 0 ? rc : 0);
> }
>
> /**
On Thu, 2023-11-02 at 09:16 +0800, Huang, Ying wrote:
> Vishal Verma <[email protected]> writes:
>
[..]
> > +
> > +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> > + u64 start, u64 size)
> > +{
> > + unsigned long memblock_size = memory_block_size_bytes();
> > + u64 cur_start;
> > + int ret;
> > +
> > + for (cur_start = start; cur_start < start + size;
> > + cur_start += memblock_size) {
> > + struct mhp_params params = { .pgprot =
> > + pgprot_mhp(PAGE_KERNEL) };
> > + struct vmem_altmap mhp_altmap = {
> > + .base_pfn = PHYS_PFN(cur_start),
> > + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> > + };
> > +
> > + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> > + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> > + GFP_KERNEL);
> > + if (!params.altmap)
> > + return -ENOMEM;
>
> Use "goto out" here too?
Hm, yes I suppose we want to clean up previous iterations of the loop -
I'll make this change.
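A minimal user-space sketch of the unwinding being discussed, assuming a simplified model where each loop iteration only allocates an "altmap". The names example_state, example_add_chunks() and fail_alloc_at are hypothetical and exist only to make the failure injectable. The point it illustrates is that an early return on allocation failure would skip the cleanup of earlier iterations, whereas setting the error code and jumping to the common exit path undoes them.

```c
#include <errno.h>
#include <stdlib.h>

#define EXAMPLE_NBLOCKS 4

struct example_state {
	int *altmap[EXAMPLE_NBLOCKS];	/* one allocation per added chunk */
	int added;			/* number of chunks currently added */
	int fail_alloc_at;		/* iteration to fail at, -1 for none */
};

static int example_add_chunks(struct example_state *st)
{
	int i, ret = 0;

	for (i = 0; i < EXAMPLE_NBLOCKS; i++) {
		int *altmap = (i == st->fail_alloc_at) ?
				NULL : malloc(sizeof(*altmap));

		if (!altmap) {
			/* Set the error before jumping so the exit path sees it. */
			ret = -ENOMEM;
			goto out;
		}
		st->altmap[i] = altmap;
		st->added++;
	}

	return 0;
out:
	/* Undo every chunk added by earlier iterations. */
	while (i-- > 0) {
		free(st->altmap[i]);
		st->altmap[i] = NULL;
		st->added--;
	}
	return ret;
}
```

In this model a failure at the third iteration leaves no chunks behind, which is the behavior the review asks for.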
>
> > +
> > + /* call arch's memory hotadd */
> > + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> > + if (ret < 0) {
> > + kfree(params.altmap);
> > + goto out;
> > + }
> > +
> > + /* create memory block devices after memory was added */
> > + ret = create_memory_block_devices(cur_start, memblock_size,
> > + params.altmap, group);
> > + if (ret) {
> > + arch_remove_memory(cur_start, memblock_size, NULL);
> > + kfree(params.altmap);
>
> How about moving arch_remove_memory() and kfree() to the error path and
> using a different label?
I thought of this, but it got slightly awkward because of the scope of
'params' (declared/allocated within the loop), just kfree'ing in that
scope looked cleaner..
On 01.11.23 23:51, Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: https://lore.kernel.org/linux-mm/[email protected]/
>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> mm/memory_hotplug.c | 213 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 138 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..d242e49d7f7b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
> return arch_supports_memmap_on_memory(vmemmap_size);
> }
>
> +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> +
> + /*
> + * For memmap_on_memory, the altmaps were added on a per-memblock
> + * basis; we have to process each individual memory block.
> + */
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct vmem_altmap *altmap = NULL;
> + struct memory_block *mem;
> +
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> + WARN_ON_ONCE(!mem);
> + if (!mem)
> + continue;
Nit:
if (WARN_ON_ONCE(!mem))
continue;
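The nit relies on the warning helper evaluating to the truth value of its condition, so the check and the warning collapse into one statement. A user-space sketch of that idiom follows; example_warn_once() is a hypothetical stand-in written for this example and is not the kernel macro, which keeps its once-only state per call site rather than through a caller-supplied flag.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Warn the first time cond is true and always return cond, so the
 * call can be used directly as the condition of an if statement.
 */
static int example_warn_once(int cond, int *warned, const char *expr)
{
	if (cond && !*warned) {
		*warned = 1;
		fprintf(stderr, "WARNING: %s\n", expr);
	}
	return cond;
}

/* Count the non-NULL entries, skipping NULL ones with a single warning. */
static size_t example_process_blocks(int *const blocks[], size_t n,
				     int *warned)
{
	size_t processed = 0;

	for (size_t i = 0; i < n; i++) {
		if (example_warn_once(blocks[i] == NULL, warned, "!mem"))
			continue;
		processed++;
	}

	return processed;
}
```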
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> + pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
As already spotted, we have to clean up.
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);
> + goto out;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> + params.altmap, group);
> + if (ret) {
> + arch_remove_memory(cur_start, memblock_size, NULL);
> + kfree(params.altmap);
> + goto out;
> + }
> + }
> +
> + return 0;
> +out:
> + if (ret && (cur_start != start))
Nit: I think you can drop the inner parentheses.
> @@ -2146,11 +2208,31 @@ void try_offline_node(int nid)
> }
> EXPORT_SYMBOL(try_offline_node);
>
> +static int memory_blocks_have_altmaps(u64 start, u64 size)
> +{
> + u64 num_memblocks = size / memory_block_size_bytes();
> + u64 num_altmaps = 0;
> +
> + if (!mhp_memmap_on_memory())
> + return 0;
> +
> + walk_memory_blocks(start, size, &num_altmaps,
> + count_memory_range_altmaps_cb);
> +
> + if (num_altmaps == 0)
> + return 0;
> +
> + if (num_memblocks != num_altmaps) {
> + WARN_ONCE(1, "Not all memblocks in range have altmaps");
Nit:
if (WARN_ON_ONCE(num_memblocks != num_altmaps))
return -EINVAL;
Should be sufficient.
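The three-way result of the helper under review (no altmaps, all blocks have altmaps, or an inconsistent mix) can be modeled in user-space C. This is a sketch under the assumption that the per-block altmap state is given as a simple boolean array; example_blocks_have_altmaps() is a hypothetical name and does not walk real memory blocks.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns 0 if no block has an altmap, 1 if every block has one, and
 * -EINVAL if only some of them do.
 */
static int example_blocks_have_altmaps(const bool has_altmap[],
				       size_t num_blocks)
{
	size_t num_altmaps = 0;

	for (size_t i = 0; i < num_blocks; i++) {
		if (has_altmap[i])
			num_altmaps++;
	}

	if (num_altmaps == 0)
		return 0;

	if (num_altmaps != num_blocks)
		return -EINVAL;

	return 1;
}
```

A caller can then pick the plain removal path for 0, the per-block path for 1, and bail out on a negative value.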
[...]
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
>
> - /*
> - * Memory block device removal under the device_hotplug_lock is
> - * a barrier against racing online attempts.
> - */
> - remove_memory_block_devices(start, size);
> -
> mem_hotplug_begin();
>
> - arch_remove_memory(start, size, altmap);
> -
> - /* Verify that all vmemmap pages have actually been freed. */
> - if (altmap) {
> - WARN(altmap->alloc, "Altmap not fully unmapped");
> - kfree(altmap);
> + rc = memory_blocks_have_altmaps(start, size);
> + if (rc < 0) {
> + goto err;
Nit: Maybe better to just
if (rc < 0) {
mem_hotplug_done();
return rc;
} else ...
And avoid the error label below. Makes the code easier to read.
> + } else if (rc == 0) {
Nit: else if (!rc)
With the cleanup fixed,
Acked-by: David Hildenbrand <[email protected]>
--
Cheers,
David / dhildenb
On Wed, Nov 01, 2023 at 04:51:52PM -0600, Vishal Verma wrote:
> The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to
> 'memblock_size' chunks of memory being added. Adding a larger span of
> memory precludes memmap_on_memory semantics.
>
> For users of hotplug such as kmem, large amounts of memory might get
> added from the CXL subsystem. In some cases, this amount may exceed the
> available 'main memory' to store the memmap for the memory being added.
> In this case, it is useful to have a way to place the memmap on the
> memory being added, even if it means splitting the addition into
> memblock-sized chunks.
>
> Change add_memory_resource() to loop over memblock-sized chunks of
> memory if caller requested memmap_on_memory, and if other conditions for
> it are met. Teach try_remove_memory() to also expect that a memory
> range being removed might have been split up into memblock sized chunks,
> and to loop through those as needed.
>
> This does preclude being able to use PUD mappings in the direct map; a
> proposal to how this could be optimized in the future is laid out
> here[1].
>
> [1]: https://lore.kernel.org/linux-mm/[email protected]/
>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Suggested-by: David Hildenbrand <[email protected]>
> Reviewed-by: Dan Williams <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> mm/memory_hotplug.c | 213 ++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 138 insertions(+), 75 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 6be7de9efa55..d242e49d7f7b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1380,6 +1380,84 @@ static bool mhp_supports_memmap_on_memory(unsigned long size)
> return arch_supports_memmap_on_memory(vmemmap_size);
> }
>
> +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> +
> + /*
> + * For memmap_on_memory, the altmaps were added on a per-memblock
> + * basis; we have to process each individual memory block.
> + */
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct vmem_altmap *altmap = NULL;
> + struct memory_block *mem;
> +
> + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> + WARN_ON_ONCE(!mem);
> + if (!mem)
> + continue;
> +
> + altmap = mem->altmap;
> + mem->altmap = NULL;
> +
> + remove_memory_block_devices(cur_start, memblock_size);
Is cur_start always aligned to memory_block_size_bytes()? If not, the
above function will return directly. Is that an issue?
Fan
> +
> + arch_remove_memory(cur_start, memblock_size, altmap);
> +
> + /* Verify that all vmemmap pages have actually been freed. */
> + WARN(altmap->alloc, "Altmap not fully unmapped");
> + kfree(altmap);
> + }
> +}
> +
> +static int create_altmaps_and_memory_blocks(int nid, struct memory_group *group,
> + u64 start, u64 size)
> +{
> + unsigned long memblock_size = memory_block_size_bytes();
> + u64 cur_start;
> + int ret;
> +
> + for (cur_start = start; cur_start < start + size;
> + cur_start += memblock_size) {
> + struct mhp_params params = { .pgprot =
> + pgprot_mhp(PAGE_KERNEL) };
> + struct vmem_altmap mhp_altmap = {
> + .base_pfn = PHYS_PFN(cur_start),
> + .end_pfn = PHYS_PFN(cur_start + memblock_size - 1),
> + };
> +
> + mhp_altmap.free = memory_block_memmap_on_memory_pages();
> + params.altmap = kmemdup(&mhp_altmap, sizeof(struct vmem_altmap),
> + GFP_KERNEL);
> + if (!params.altmap)
> + return -ENOMEM;
> +
> + /* call arch's memory hotadd */
> + ret = arch_add_memory(nid, cur_start, memblock_size, &params);
> + if (ret < 0) {
> + kfree(params.altmap);
> + goto out;
> + }
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(cur_start, memblock_size,
> + params.altmap, group);
> + if (ret) {
> + arch_remove_memory(cur_start, memblock_size, NULL);
> + kfree(params.altmap);
> + goto out;
> + }
> + }
> +
> + return 0;
> +out:
> + if (ret && (cur_start != start))
> + remove_memory_blocks_and_altmaps(start, cur_start - start);
> + return ret;
> +}
> +
> /*
> * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> * and online/offline operations (triggered e.g. by sysfs).
> @@ -1390,10 +1468,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> {
> struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
> enum memblock_flags memblock_flags = MEMBLOCK_NONE;
> - struct vmem_altmap mhp_altmap = {
> - .base_pfn = PHYS_PFN(res->start),
> - .end_pfn = PHYS_PFN(res->end),
> - };
> struct memory_group *group = NULL;
> u64 start, size;
> bool new_node = false;
> @@ -1436,28 +1510,22 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> /*
> * Self hosted memmap array
> */
> - if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> - if (mhp_supports_memmap_on_memory(size)) {
> - mhp_altmap.free = memory_block_memmap_on_memory_pages();
> - params.altmap = kmemdup(&mhp_altmap,
> - sizeof(struct vmem_altmap),
> - GFP_KERNEL);
> - if (!params.altmap)
> - goto error;
> + if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) &&
> + mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
> + ret = create_altmaps_and_memory_blocks(nid, group, start, size);
> + if (ret)
> + goto error;
> + } else {
> + ret = arch_add_memory(nid, start, size, &params);
> + if (ret < 0)
> + goto error;
> +
> + /* create memory block devices after memory was added */
> + ret = create_memory_block_devices(start, size, NULL, group);
> + if (ret) {
> + arch_remove_memory(start, size, NULL);
> + goto error;
> }
> - /* fallback to not using altmap */
> - }
> -
> - /* call arch's memory hotadd */
> - ret = arch_add_memory(nid, start, size, &params);
> - if (ret < 0)
> - goto error_free;
> -
> - /* create memory block devices after memory was added */
> - ret = create_memory_block_devices(start, size, params.altmap, group);
> - if (ret) {
> - arch_remove_memory(start, size, NULL);
> - goto error_free;
> }
>
> if (new_node) {
> @@ -1494,8 +1562,6 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
> walk_memory_blocks(start, size, NULL, online_memory_block);
>
> return ret;
> -error_free:
> - kfree(params.altmap);
> error:
> if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
> memblock_remove(start, size);
> @@ -2062,17 +2128,13 @@ static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
> return 0;
> }
>
> -static int test_has_altmap_cb(struct memory_block *mem, void *arg)
> +static int count_memory_range_altmaps_cb(struct memory_block *mem, void *arg)
> {
> - struct memory_block **mem_ptr = (struct memory_block **)arg;
> - /*
> - * return the memblock if we have altmap
> - * and break callback.
> - */
> - if (mem->altmap) {
> - *mem_ptr = mem;
> - return 1;
> - }
> + u64 *num_altmaps = (u64 *)arg;
> +
> + if (mem->altmap)
> + *num_altmaps += 1;
> +
> return 0;
> }
>
> @@ -2146,11 +2208,31 @@ void try_offline_node(int nid)
> }
> EXPORT_SYMBOL(try_offline_node);
>
> +static int memory_blocks_have_altmaps(u64 start, u64 size)
> +{
> + u64 num_memblocks = size / memory_block_size_bytes();
> + u64 num_altmaps = 0;
> +
> + if (!mhp_memmap_on_memory())
> + return 0;
> +
> + walk_memory_blocks(start, size, &num_altmaps,
> + count_memory_range_altmaps_cb);
> +
> + if (num_altmaps == 0)
> + return 0;
> +
> + if (num_memblocks != num_altmaps) {
> + WARN_ONCE(1, "Not all memblocks in range have altmaps");
> + return -EINVAL;
> + }
> +
> + return 1;
> +}
> +
> static int __ref try_remove_memory(u64 start, u64 size)
> {
> - struct memory_block *mem;
> - int rc = 0, nid = NUMA_NO_NODE;
> - struct vmem_altmap *altmap = NULL;
> + int rc, nid = NUMA_NO_NODE;
>
> BUG_ON(check_hotplug_memory_range(start, size));
>
> @@ -2167,45 +2249,25 @@ static int __ref try_remove_memory(u64 start, u64 size)
> if (rc)
> return rc;
>
> - /*
> - * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in
> - * the same granularity it was added - a single memory block.
> - */
> - if (mhp_memmap_on_memory()) {
> - rc = walk_memory_blocks(start, size, &mem, test_has_altmap_cb);
> - if (rc) {
> - if (size != memory_block_size_bytes()) {
> - pr_warn("Refuse to remove %#llx - %#llx,"
> - "wrong granularity\n",
> - start, start + size);
> - return -EINVAL;
> - }
> - altmap = mem->altmap;
> - /*
> - * Mark altmap NULL so that we can add a debug
> - * check on memblock free.
> - */
> - mem->altmap = NULL;
> - }
> - }
> -
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
>
> - /*
> - * Memory block device removal under the device_hotplug_lock is
> - * a barrier against racing online attempts.
> - */
> - remove_memory_block_devices(start, size);
> -
> mem_hotplug_begin();
>
> - arch_remove_memory(start, size, altmap);
> -
> - /* Verify that all vmemmap pages have actually been freed. */
> - if (altmap) {
> - WARN(altmap->alloc, "Altmap not fully unmapped");
> - kfree(altmap);
> + rc = memory_blocks_have_altmaps(start, size);
> + if (rc < 0) {
> + goto err;
> + } else if (rc == 0) {
> + /*
> + * Memory block device removal under the device_hotplug_lock is
> + * a barrier against racing online attempts.
> + * No altmaps present, do the removal directly
> + */
> + remove_memory_block_devices(start, size);
> + arch_remove_memory(start, size, NULL);
> + } else {
> + /* all memblocks in the range have altmaps */
> + remove_memory_blocks_and_altmaps(start, size);
> }
>
> if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
> @@ -2218,8 +2280,9 @@ static int __ref try_remove_memory(u64 start, u64 size)
> if (nid != NUMA_NO_NODE)
> try_offline_node(nid);
>
> +err:
> mem_hotplug_done();
> - return 0;
> + return (rc < 0 ? rc : 0);
> }
>
> /**
>
> --
> 2.41.0
>
On Fri, 2023-11-03 at 09:43 -0700, fan wrote:
> On Wed, Nov 01, 2023 at 04:51:52PM -0600, Vishal Verma wrote:
> >
[..]
> >
> > +static void __ref remove_memory_blocks_and_altmaps(u64 start, u64 size)
> > +{
> > + unsigned long memblock_size = memory_block_size_bytes();
> > + u64 cur_start;
> > +
> > + /*
> > + * For memmap_on_memory, the altmaps were added on a per-memblock
> > + * basis; we have to process each individual memory block.
> > + */
> > + for (cur_start = start; cur_start < start + size;
> > + cur_start += memblock_size) {
> > + struct vmem_altmap *altmap = NULL;
> > + struct memory_block *mem;
> > +
> > + mem = find_memory_block(pfn_to_section_nr(PFN_DOWN(cur_start)));
> > + WARN_ON_ONCE(!mem);
> > + if (!mem)
> > + continue;
> > +
> > + altmap = mem->altmap;
> > + mem->altmap = NULL;
> > +
> > + remove_memory_block_devices(cur_start, memblock_size);
>
> Is cur_start always aligned to memory_block_size_bytes? If not, the
> above function will return directly, is that a issue?
>
Hi Fan,
Thanks for taking a look and the review (btw v9 is the latest revision
of these).
I think we're okay: if the range weren't aligned, the create side would
have refused to add this memory in the first place, since it too does an
alignment check against memory_block_size_bytes().
Thanks
Vishal