2023-06-15 22:15:47

by Verma, Vishal L

[permalink] [raw]
Subject: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

With DAX memory regions originating from CXL memory expanders or
NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
on a system without enough 'regular' main memory to support the memmap
for it. To avoid this, ensure that all kmem managed hotplugged memory is
added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
new memory region being hot added.

To do this, call add_memory() in chunks of memory_block_size_bytes() as
that is a requirement for memmap_on_memory. Additionally, Use the
mhp_flag to force the memmap_on_memory checks regardless of the
respective module parameter setting.

Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Jiang <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Huang Ying <[email protected]>
Signed-off-by: Vishal Verma <[email protected]>
---
drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7b36db6f1cbd..0751346193ef 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -12,6 +12,7 @@
#include <linux/mm.h>
#include <linux/mman.h>
#include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
#include "dax-private.h"
#include "bus.h"

@@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
data->mgid = rc;

for (i = 0; i < dev_dax->nr_range; i++) {
+ u64 cur_start, cur_len, remaining;
struct resource *res;
struct range range;

@@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
res->flags = IORESOURCE_SYSTEM_RAM;

/*
- * Ensure that future kexec'd kernels will not treat
- * this as RAM automatically.
+ * Add memory in chunks of memory_block_size_bytes() so that
+ * it is considered for MHP_MEMMAP_ON_MEMORY
+ * @range has already been aligned to memory_block_size_bytes(),
+ * so the following loop will always break it down cleanly.
*/
- rc = add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, MHP_NID_IS_MGID);
+ cur_start = range.start;
+ cur_len = memory_block_size_bytes();
+ remaining = range_len(&range);
+ while (remaining) {
+ mhp_t mhp_flags = MHP_NID_IS_MGID;

- if (rc) {
- dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
- i, range.start, range.end);
- remove_resource(res);
- kfree(res);
- data->res[i] = NULL;
- if (mapped)
- continue;
- goto err_request_mem;
+ if (mhp_supports_memmap_on_memory(cur_len,
+ MHP_MEMMAP_ON_MEMORY))
+ mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+ /*
+ * Ensure that future kexec'd kernels will not treat
+ * this as RAM automatically.
+ */
+ rc = add_memory_driver_managed(data->mgid, cur_start,
+ cur_len, kmem_name,
+ mhp_flags);
+
+ if (rc) {
+ dev_warn(dev,
+ "mapping%d: %#llx-%#llx memory add failed\n",
+ i, cur_start, cur_start + cur_len - 1);
+ remove_resource(res);
+ kfree(res);
+ data->res[i] = NULL;
+ if (mapped)
+ continue;
+ goto err_request_mem;
+ }
+
+ cur_start += cur_len;
+ remaining -= cur_len;
}
mapped++;
}

--
2.40.1



2023-06-16 07:14:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

Vishal Verma <[email protected]> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
> #include <linux/mm.h>
> #include <linux/mman.h>
> #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
> #include "dax-private.h"
> #include "bus.h"
>
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> data->mgid = rc;
>
> for (i = 0; i < dev_dax->nr_range; i++) {
> + u64 cur_start, cur_len, remaining;
> struct resource *res;
> struct range range;
>
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> res->flags = IORESOURCE_SYSTEM_RAM;
>
> /*
> - * Ensure that future kexec'd kernels will not treat
> - * this as RAM automatically.
> + * Add memory in chunks of memory_block_size_bytes() so that
> + * it is considered for MHP_MEMMAP_ON_MEMORY
> + * @range has already been aligned to memory_block_size_bytes(),
> + * so the following loop will always break it down cleanly.
> */
> - rc = add_memory_driver_managed(data->mgid, range.start,
> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
> + cur_start = range.start;
> + cur_len = memory_block_size_bytes();
> + remaining = range_len(&range);
> + while (remaining) {
> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>
> - if (rc) {
> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> - i, range.start, range.end);
> - remove_resource(res);
> - kfree(res);
> - data->res[i] = NULL;
> - if (mapped)
> - continue;
> - goto err_request_mem;
> + if (mhp_supports_memmap_on_memory(cur_len,
> + MHP_MEMMAP_ON_MEMORY))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> + /*
> + * Ensure that future kexec'd kernels will not treat
> + * this as RAM automatically.
> + */
> + rc = add_memory_driver_managed(data->mgid, cur_start,
> + cur_len, kmem_name,
> + mhp_flags);
> +
> + if (rc) {
> + dev_warn(dev,
> + "mapping%d: %#llx-%#llx memory add failed\n",
> + i, cur_start, cur_start + cur_len - 1);
> + remove_resource(res);
> + kfree(res);
> + data->res[i] = NULL;
> + if (mapped)
> + continue;
> + goto err_request_mem;
> + }
> +
> + cur_start += cur_len;
> + remaining -= cur_len;
> }
> mapped++;
> }

It appears that we need to hot-remove memory in the granularity of
memory_block_size_bytes() too, according to try_remove_memory(). If so,
it seems better to allocate one dax_kmem_data.res[] element for each
memory block instead of dax region?

Best Regards,
Huang, Ying

2023-06-16 08:29:18

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On 16.06.23 00:00, Vishal Verma wrote:
> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
> #include <linux/mm.h>
> #include <linux/mman.h>
> #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
> #include "dax-private.h"
> #include "bus.h"
>
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> data->mgid = rc;
>
> for (i = 0; i < dev_dax->nr_range; i++) {
> + u64 cur_start, cur_len, remaining;
> struct resource *res;
> struct range range;
>
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> res->flags = IORESOURCE_SYSTEM_RAM;
>
> /*
> - * Ensure that future kexec'd kernels will not treat
> - * this as RAM automatically.
> + * Add memory in chunks of memory_block_size_bytes() so that
> + * it is considered for MHP_MEMMAP_ON_MEMORY
> + * @range has already been aligned to memory_block_size_bytes(),
> + * so the following loop will always break it down cleanly.
> */
> - rc = add_memory_driver_managed(data->mgid, range.start,
> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
> + cur_start = range.start;
> + cur_len = memory_block_size_bytes();
> + remaining = range_len(&range);
> + while (remaining) {
> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>
> - if (rc) {
> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> - i, range.start, range.end);
> - remove_resource(res);
> - kfree(res);
> - data->res[i] = NULL;
> - if (mapped)
> - continue;
> - goto err_request_mem;
> + if (mhp_supports_memmap_on_memory(cur_len,
> + MHP_MEMMAP_ON_MEMORY))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> + /*
> + * Ensure that future kexec'd kernels will not treat
> + * this as RAM automatically.
> + */
> + rc = add_memory_driver_managed(data->mgid, cur_start,
> + cur_len, kmem_name,
> + mhp_flags);
> +
> + if (rc) {
> + dev_warn(dev,
> + "mapping%d: %#llx-%#llx memory add failed\n",
> + i, cur_start, cur_start + cur_len - 1);
> + remove_resource(res);
> + kfree(res);
> + data->res[i] = NULL;
> + if (mapped)
> + continue;
> + goto err_request_mem;
> + }
> +
> + cur_start += cur_len;
> + remaining -= cur_len;
> }
> mapped++;
> }
>

Maybe the better alternative is teach
add_memory_resource()/try_remove_memory() to do that internally.

In the add_memory_resource() case, it might be a loop around that
memmap_on_memory + arch_add_memory code path (well, and the error path
also needs adjustment):

/*
* Self hosted memmap array
*/
if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
if (!mhp_supports_memmap_on_memory(size)) {
ret = -EINVAL;
goto error;
}
mhp_altmap.free = PHYS_PFN(size);
mhp_altmap.base_pfn = PHYS_PFN(start);
params.altmap = &mhp_altmap;
}

/* call arch's memory hotadd */
ret = arch_add_memory(nid, start, size, &params);
if (ret < 0)
goto error;


Note that we want to handle that on a per memory-block basis, because we
don't want the vmemmap of memory block #2 to end up on memory block #1.
It all gets messy with memory onlining/offlining etc otherwise ...

--
Cheers,

David / dhildenb


2023-06-20 13:24:13

by Tarun Sahu

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory


Hi Vishal,

Vishal Verma <[email protected]> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>

Some architectures doesn't have support for MEMMAP_ON_MEMORY, bypassing
the check mhp_memmap_on_memory() might cause problems on such
architectures (for e.g PPC64).

> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Jiang <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Huang Ying <[email protected]>
> Signed-off-by: Vishal Verma <[email protected]>
> ---
> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
> #include <linux/mm.h>
> #include <linux/mman.h>
> #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
> #include "dax-private.h"
> #include "bus.h"
>
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> data->mgid = rc;
>
> for (i = 0; i < dev_dax->nr_range; i++) {
> + u64 cur_start, cur_len, remaining;
> struct resource *res;
> struct range range;
>
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> res->flags = IORESOURCE_SYSTEM_RAM;
>
> /*
> - * Ensure that future kexec'd kernels will not treat
> - * this as RAM automatically.
> + * Add memory in chunks of memory_block_size_bytes() so that
> + * it is considered for MHP_MEMMAP_ON_MEMORY
> + * @range has already been aligned to memory_block_size_bytes(),
> + * so the following loop will always break it down cleanly.
> */
> - rc = add_memory_driver_managed(data->mgid, range.start,
> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
> + cur_start = range.start;
> + cur_len = memory_block_size_bytes();
> + remaining = range_len(&range);
> + while (remaining) {
> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>
> - if (rc) {
> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> - i, range.start, range.end);
> - remove_resource(res);
> - kfree(res);
> - data->res[i] = NULL;
> - if (mapped)
> - continue;
> - goto err_request_mem;
> + if (mhp_supports_memmap_on_memory(cur_len,
> + MHP_MEMMAP_ON_MEMORY))
> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> + /*
> + * Ensure that future kexec'd kernels will not treat
> + * this as RAM automatically.
> + */
> + rc = add_memory_driver_managed(data->mgid, cur_start,
> + cur_len, kmem_name,
> + mhp_flags);
> +
> + if (rc) {
> + dev_warn(dev,
> + "mapping%d: %#llx-%#llx memory add failed\n",
> + i, cur_start, cur_start + cur_len - 1);
> + remove_resource(res);
> + kfree(res);
> + data->res[i] = NULL;
> + if (mapped)
> + continue;
> + goto err_request_mem;
> + }
> +
> + cur_start += cur_len;
> + remaining -= cur_len;
> }
> mapped++;
> }
>
> --
> 2.40.1

2023-07-11 14:52:17

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

David Hildenbrand <[email protected]> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> With DAX memory regions originating from CXL memory expanders or
>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>> on a system without enough 'regular' main memory to support the memmap
>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>> new memory region being hot added.
>>
>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>> that is a requirement for memmap_on_memory. Additionally, Use the
>> mhp_flag to force the memmap_on_memory checks regardless of the
>> respective module parameter setting.
>>
>> Cc: "Rafael J. Wysocki" <[email protected]>
>> Cc: Len Brown <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Dan Williams <[email protected]>
>> Cc: Dave Jiang <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Huang Ying <[email protected]>
>> Signed-off-by: Vishal Verma <[email protected]>
>> ---
>> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7b36db6f1cbd..0751346193ef 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -12,6 +12,7 @@
>> #include <linux/mm.h>
>> #include <linux/mman.h>
>> #include <linux/memory-tiers.h>
>> +#include <linux/memory_hotplug.h>
>> #include "dax-private.h"
>> #include "bus.h"
>>
>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>> data->mgid = rc;
>>
>> for (i = 0; i < dev_dax->nr_range; i++) {
>> + u64 cur_start, cur_len, remaining;
>> struct resource *res;
>> struct range range;
>>
>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>> res->flags = IORESOURCE_SYSTEM_RAM;
>>
>> /*
>> - * Ensure that future kexec'd kernels will not treat
>> - * this as RAM automatically.
>> + * Add memory in chunks of memory_block_size_bytes() so that
>> + * it is considered for MHP_MEMMAP_ON_MEMORY
>> + * @range has already been aligned to memory_block_size_bytes(),
>> + * so the following loop will always break it down cleanly.
>> */
>> - rc = add_memory_driver_managed(data->mgid, range.start,
>> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
>> + cur_start = range.start;
>> + cur_len = memory_block_size_bytes();
>> + remaining = range_len(&range);
>> + while (remaining) {
>> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>>
>> - if (rc) {
>> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>> - i, range.start, range.end);
>> - remove_resource(res);
>> - kfree(res);
>> - data->res[i] = NULL;
>> - if (mapped)
>> - continue;
>> - goto err_request_mem;
>> + if (mhp_supports_memmap_on_memory(cur_len,
>> + MHP_MEMMAP_ON_MEMORY))
>> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>> + /*
>> + * Ensure that future kexec'd kernels will not treat
>> + * this as RAM automatically.
>> + */
>> + rc = add_memory_driver_managed(data->mgid, cur_start,
>> + cur_len, kmem_name,
>> + mhp_flags);
>> +
>> + if (rc) {
>> + dev_warn(dev,
>> + "mapping%d: %#llx-%#llx memory add failed\n",
>> + i, cur_start, cur_start + cur_len - 1);
>> + remove_resource(res);
>> + kfree(res);
>> + data->res[i] = NULL;
>> + if (mapped)
>> + continue;
>> + goto err_request_mem;
>> + }
>> +
>> + cur_start += cur_len;
>> + remaining -= cur_len;
>> }
>> mapped++;
>> }
>>
>
> Maybe the better alternative is teach
> add_memory_resource()/try_remove_memory() to do that internally.
>
> In the add_memory_resource() case, it might be a loop around that
> memmap_on_memory + arch_add_memory code path (well, and the error path
> also needs adjustment):
>
> /*
> * Self hosted memmap array
> */
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> if (!mhp_supports_memmap_on_memory(size)) {
> ret = -EINVAL;
> goto error;
> }
> mhp_altmap.free = PHYS_PFN(size);
> mhp_altmap.base_pfn = PHYS_PFN(start);
> params.altmap = &mhp_altmap;
> }
>
> /* call arch's memory hotadd */
> ret = arch_add_memory(nid, start, size, &params);
> if (ret < 0)
> goto error;
>
>
> Note that we want to handle that on a per memory-block basis, because we
> don't want the vmemmap of memory block #2 to end up on memory block #1.
> It all gets messy with memory onlining/offlining etc otherwise ...
>

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
2. For each succesful request_mem_regions if any blocks got added, we
keep the resource. If none got added, we will kfree the resource



for (i = 0; i < dev_dax->nr_range; i++) {
u64 cur_start, cur_len, remaining;
struct resource *res;
struct range range;
bool block_added;

rc = dax_kmem_range(dev_dax, i, &range);
if (rc)
continue;

/* Region is permanently reserved if hotremove fails. */
res = request_mem_region(range.start, range_len(&range), data->res_name);
if (!res) {
dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
i, range.start, range.end);
/*
* Once some memory has been onlined we can't
* assume that it can be un-onlined safely.
*/
if (mapped)
continue;
rc = -EBUSY;
goto err_request_mem;
}
data->res[i] = res;

/*
* Set flags appropriate for System RAM. Leave ..._BUSY clear
* so that add_memory() can add a child resource. Do not
* inherit flags from the parent since it may set new flags
* unknown to us that will break add_memory() below.
*/
res->flags = IORESOURCE_SYSTEM_RAM;

/*
* Add memory in chunks of memory_block_size_bytes() so that
* it is considered for MHP_MEMMAP_ON_MEMORY
* @range has already been aligned to memory_block_size_bytes(),
* so the following loop will always break it down cleanly.
*/
cur_start = range.start;
cur_len = memory_block_size_bytes();
remaining = range_len(&range);
block_added = false;
while (remaining) {
/*
* If alignment rules are not satisified we will
* fallback normal memmap allocation.
*/
mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY;
/*
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
*/
rc = add_memory_driver_managed(data->mgid, cur_start,
cur_len, kmem_name,
mhp_flags);

if (rc)
dev_warn(dev,
"mapping%d: %#llx-%#llx memory add failed\n",
i, cur_start, cur_start + cur_len - 1);
else
block_added = true;

cur_start += cur_len;
remaining -= cur_len;
}
if (!block_added) {
/*
* None of the blocks got added, remove the resource.
*/
remove_resource(res);
kfree(res);
data->res[i] = NULL;
} else
mapped++;
}
if (mapped) {
dev_set_drvdata(dev, data);
return 0;
}

err_request_mem:
/*
* If none of the resources got mapped.
* unregister the group.
*/
memory_group_unregister(data->mgid);
err_reg_mgid:
kfree(data->res_name);

2023-07-11 15:29:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> David Hildenbrand <[email protected]> writes:
>
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> With DAX memory regions originating from CXL memory expanders or
>>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>>> on a system without enough 'regular' main memory to support the memmap
>>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>>> new memory region being hot added.
>>>
>>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>>> that is a requirement for memmap_on_memory. Additionally, Use the
>>> mhp_flag to force the memmap_on_memory checks regardless of the
>>> respective module parameter setting.
>>>
>>> Cc: "Rafael J. Wysocki" <[email protected]>
>>> Cc: Len Brown <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: David Hildenbrand <[email protected]>
>>> Cc: Oscar Salvador <[email protected]>
>>> Cc: Dan Williams <[email protected]>
>>> Cc: Dave Jiang <[email protected]>
>>> Cc: Dave Hansen <[email protected]>
>>> Cc: Huang Ying <[email protected]>
>>> Signed-off-by: Vishal Verma <[email protected]>
>>> ---
>>> drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>> 1 file changed, 36 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>> index 7b36db6f1cbd..0751346193ef 100644
>>> --- a/drivers/dax/kmem.c
>>> +++ b/drivers/dax/kmem.c
>>> @@ -12,6 +12,7 @@
>>> #include <linux/mm.h>
>>> #include <linux/mman.h>
>>> #include <linux/memory-tiers.h>
>>> +#include <linux/memory_hotplug.h>
>>> #include "dax-private.h"
>>> #include "bus.h"
>>>
>>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>> data->mgid = rc;
>>>
>>> for (i = 0; i < dev_dax->nr_range; i++) {
>>> + u64 cur_start, cur_len, remaining;
>>> struct resource *res;
>>> struct range range;
>>>
>>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>> res->flags = IORESOURCE_SYSTEM_RAM;
>>>
>>> /*
>>> - * Ensure that future kexec'd kernels will not treat
>>> - * this as RAM automatically.
>>> + * Add memory in chunks of memory_block_size_bytes() so that
>>> + * it is considered for MHP_MEMMAP_ON_MEMORY
>>> + * @range has already been aligned to memory_block_size_bytes(),
>>> + * so the following loop will always break it down cleanly.
>>> */
>>> - rc = add_memory_driver_managed(data->mgid, range.start,
>>> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>> + cur_start = range.start;
>>> + cur_len = memory_block_size_bytes();
>>> + remaining = range_len(&range);
>>> + while (remaining) {
>>> + mhp_t mhp_flags = MHP_NID_IS_MGID;
>>>
>>> - if (rc) {
>>> - dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>> - i, range.start, range.end);
>>> - remove_resource(res);
>>> - kfree(res);
>>> - data->res[i] = NULL;
>>> - if (mapped)
>>> - continue;
>>> - goto err_request_mem;
>>> + if (mhp_supports_memmap_on_memory(cur_len,
>>> + MHP_MEMMAP_ON_MEMORY))
>>> + mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>> + /*
>>> + * Ensure that future kexec'd kernels will not treat
>>> + * this as RAM automatically.
>>> + */
>>> + rc = add_memory_driver_managed(data->mgid, cur_start,
>>> + cur_len, kmem_name,
>>> + mhp_flags);
>>> +
>>> + if (rc) {
>>> + dev_warn(dev,
>>> + "mapping%d: %#llx-%#llx memory add failed\n",
>>> + i, cur_start, cur_start + cur_len - 1);
>>> + remove_resource(res);
>>> + kfree(res);
>>> + data->res[i] = NULL;
>>> + if (mapped)
>>> + continue;
>>> + goto err_request_mem;
>>> + }
>>> +
>>> + cur_start += cur_len;
>>> + remaining -= cur_len;
>>> }
>>> mapped++;
>>> }
>>>
>>
>> Maybe the better alternative is teach
>> add_memory_resource()/try_remove_memory() to do that internally.
>>
>> In the add_memory_resource() case, it might be a loop around that
>> memmap_on_memory + arch_add_memory code path (well, and the error path
>> also needs adjustment):
>>
>> /*
>> * Self hosted memmap array
>> */
>> if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> if (!mhp_supports_memmap_on_memory(size)) {
>> ret = -EINVAL;
>> goto error;
>> }
>> mhp_altmap.free = PHYS_PFN(size);
>> mhp_altmap.base_pfn = PHYS_PFN(start);
>> params.altmap = &mhp_altmap;
>> }
>>
>> /* call arch's memory hotadd */
>> ret = arch_add_memory(nid, start, size, &params);
>> if (ret < 0)
>> goto error;
>>
>>
>> Note that we want to handle that on a per memory-block basis, because we
>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>> It all gets messy with memory onlining/offlining etc otherwise ...
>>
>
> I tried to implement this inside add_memory_driver_managed() and also
> within dax/kmem. IMHO doing the error handling inside dax/kmem is
> better. Here is how it looks:
>
> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> 2. For each succesful request_mem_regions if any blocks got added, we
> keep the resource. If none got added, we will kfree the resource
>

Doing this unconditional splitting outside of
add_memory_driver_managed() is undesirable for at least two reasons

1) You end up always creating individual entries in the resource tree
(/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() in memory block granularity (e.g., 128
MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.

While you could sense for support and do the split based on that, it
will be beneficial for other users (especially DIMMs) if we do that
internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
effective or not.

In general, we avoid placing important kernel data-structures on slow
memory. That's one of the reasons why PMEM decided to mostly always use
ZONE_MOVABLE such that exactly what this patch does would not happen. So
I'm wondering if there would be demand for an additional toggle.

Because even with memmap_on_memory enabled in general, you might not
want to do that for dax/kmem.

IMHO, this patch should be dropped from your ppc64 series, as it's an
independent change that might be valuable for other architectures as well.

--
Cheers,

David / dhildenb


2023-07-13 07:26:09

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> > David Hildenbrand <[email protected]> writes:
> > >
> > > Maybe the better alternative is teach
> > > add_memory_resource()/try_remove_memory() to do that internally.
> > >
> > > In the add_memory_resource() case, it might be a loop around that
> > > memmap_on_memory + arch_add_memory code path (well, and the error path
> > > also needs adjustment):
> > >
> > >         /*
> > >          * Self hosted memmap array
> > >          */
> > >         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > >                 if (!mhp_supports_memmap_on_memory(size)) {
> > >                         ret = -EINVAL;
> > >                         goto error;
> > >                 }
> > >                 mhp_altmap.free = PHYS_PFN(size);
> > >                 mhp_altmap.base_pfn = PHYS_PFN(start);
> > >                 params.altmap = &mhp_altmap;
> > >         }
> > >
> > >         /* call arch's memory hotadd */
> > >         ret = arch_add_memory(nid, start, size, &params);
> > >         if (ret < 0)
> > >                 goto error;
> > >
> > >
> > > Note that we want to handle that on a per memory-block basis, because we
> > > don't want the vmemmap of memory block #2 to end up on memory block #1.
> > > It all gets messy with memory onlining/offlining etc otherwise ...
> > >
> >
> > I tried to implement this inside add_memory_driver_managed() and also
> > within dax/kmem. IMHO doing the error handling inside dax/kmem is
> > better. Here is how it looks:
> >
> > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> > 2. For each succesful request_mem_regions if any blocks got added, we
> > keep the resource. If none got added, we will kfree the resource
> >
>
> Doing this unconditional splitting outside of
> add_memory_driver_managed() is undesirable for at least two reasons
>
> 1) You end up always creating individual entries in the resource tree
>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
>
> While you could sense for support and do the split based on that, it
> will be beneficial for other users (especially DIMMs) if we do that
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 898ca9505754..8be932f63f90 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
data->mgid = rc;

for (i = 0; i < dev_dax->nr_range; i++) {
+ mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
+ MHP_SPLIT_MEMBLOCKS;
struct resource *res;
struct range range;

@@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
* this as RAM automatically.
*/
rc = add_memory_driver_managed(data->mgid, range.start,
- range_len(&range), kmem_name, MHP_NID_IS_MGID);
+ range_len(&range), kmem_name, mhp_flags);

if (rc) {
dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",


However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check for
each memblock sized chunk, and that feels like a leak of the
implementation details into the caller.

Any suggestions on how to go about this?
>
> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use
> ZONE_MOVABLE such that exactly what this patch does would not happen. So
> I'm wondering if there would be demand for an additional toggle.
>
> Because even with memmap_on_memory enabled in general, you might not
> want to do that for dax/kmem.
>
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as well.
>
Sure thing, I can pick this back up and Aneesh can drop this from his set.

2023-07-13 07:34:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On 13.07.23 08:45, Verma, Vishal L wrote:
> On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
>> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
>>> David Hildenbrand <[email protected]> writes:
>>>>
>>>> Maybe the better alternative is teach
>>>> add_memory_resource()/try_remove_memory() to do that internally.
>>>>
>>>> In the add_memory_resource() case, it might be a loop around that
>>>> memmap_on_memory + arch_add_memory code path (well, and the error path
>>>> also needs adjustment):
>>>>
>>>>         /*
>>>>          * Self hosted memmap array
>>>>          */
>>>>         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>>>>                 if (!mhp_supports_memmap_on_memory(size)) {
>>>>                         ret = -EINVAL;
>>>>                         goto error;
>>>>                 }
>>>>                 mhp_altmap.free = PHYS_PFN(size);
>>>>                 mhp_altmap.base_pfn = PHYS_PFN(start);
>>>>                 params.altmap = &mhp_altmap;
>>>>         }
>>>>
>>>>         /* call arch's memory hotadd */
>>>>         ret = arch_add_memory(nid, start, size, &params);
>>>>         if (ret < 0)
>>>>                 goto error;
>>>>
>>>>
>>>> Note that we want to handle that on a per memory-block basis, because we
>>>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>>>> It all gets messy with memory onlining/offlining etc otherwise ...
>>>>
>>>
>>> I tried to implement this inside add_memory_driver_managed() and also
>>> within dax/kmem. IMHO doing the error handling inside dax/kmem is
>>> better. Here is how it looks:
>>>
>>> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
>>> 2. For each succesful request_mem_regions if any blocks got added, we
>>> keep the resource. If none got added, we will kfree the resource
>>>
>>
>> Doing this unconditional splitting outside of
>> add_memory_driver_managed() is undesirable for at least two reasons
>>
>> 1) You end up always creating individual entries in the resource tree
>>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
>> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
>>
>> While you could sense for support and do the split based on that, it
>> will be beneficial for other users (especially DIMMs) if we do that
>> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
>> effective or not.
>
> I'm taking a shot at implementing the splitting internally in
> memory_hotplug.c. The caller (kmem) side does become trivial with this
> approach, but there's a slight complication if I don't have the module
> param override (patch 1 of this series).
>
> The kmem diff now looks like:
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 898ca9505754..8be932f63f90 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> data->mgid = rc;
>
> for (i = 0; i < dev_dax->nr_range; i++) {
> + mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> + MHP_SPLIT_MEMBLOCKS;
> struct resource *res;
> struct range range;
>
> @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> * this as RAM automatically.
> */
> rc = add_memory_driver_managed(data->mgid, range.start,
> - range_len(&range), kmem_name, MHP_NID_IS_MGID);
> + range_len(&range), kmem_name, mhp_flags);
>
> if (rc) {
> dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>
>

Why do we need the MHP_SPLIT_MEMBLOCKS?

In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
single memory block, you can simply split up internally, no?

Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
for (cur_start = start, cur_start < start + size;
cur_start += memory_block_size_bytes()) {
mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
mhp_altmap.base_pfn = PHYS_PFN(start);
params.altmap = &mhp_altmap;

ret = arch_add_memory(nid, start,
memory_block_size_bytes(), &params);
if (ret < 0) ...

ret = create_memory_block_devices(start, memory_block_size_bytes(),
mhp_altmap.alloc, group);
if (ret) ...

}
} else {
/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
...

If any adding of memory failed, we remove what we already added. That works, because as
long as we're holding the relevant locks, memory cannot get onlined in the meantime.

Then we only have to teach remove_memory() to deal with individual blocks when finding
blocks that have an altmap.

--
Cheers,

David / dhildenb


2023-07-13 15:26:49

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> On 13.07.23 08:45, Verma, Vishal L wrote:
> >
> > I'm taking a shot at implementing the splitting internally in
> > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > approach, but there's a slight complication if I don't have the module
> > param override (patch 1 of this series).
> >
> > The kmem diff now looks like:
> >
> >     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >     index 898ca9505754..8be932f63f90 100644
> >     --- a/drivers/dax/kmem.c
> >     +++ b/drivers/dax/kmem.c
> >     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >             data->mgid = rc;
> >     
> >             for (i = 0; i < dev_dax->nr_range; i++) {
> >     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> >     +                                 MHP_SPLIT_MEMBLOCKS;
> >                     struct resource *res;
> >                     struct range range;
> >     
> >     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >                      * this as RAM automatically.
> >                      */
> >                     rc = add_memory_driver_managed(data->mgid, range.start,
> >     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> >     +                               range_len(&range), kmem_name, mhp_flags);
> >     
> >                     if (rc) {
> >                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> >    
> >
>
> Why do we need the MHP_SPLIT_MEMBLOCKS?

I thought we still wanted either an opt-in or opt-out for the kmem
driver to be able to do memmap_on_memory, in case there were
performance implications or the lack of 1GiB PUDs. I haven't
implemented that yet, but I was thinking along the lines of a sysfs
knob exposed by kmem, that controls setting of this new
MHP_SPLIT_MEMBLOCKS flag.

>
> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
> single memory block, you can simply split up internally, no?
>
> Essentially in add_memory_resource() something like
>
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>      mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>         for (cur_start = start, cur_start < start + size;
>              cur_start += memory_block_size_bytes()) {
>                 mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>                 mhp_altmap.base_pfn = PHYS_PFN(start);
>                 params.altmap = &mhp_altmap;
>
>                 ret = arch_add_memory(nid, start,
>                                       memory_block_size_bytes(), &params);
>                 if (ret < 0) ...
>
>                 ret = create_memory_block_devices(start, memory_block_size_bytes(),
>                                                   mhp_altmap.alloc, group);
>                 if (ret) ...
>                 
>         }
> } else {
>         /* old boring stuff */
> }
>
> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
> ...

My current approach was looping a level higher, on the call to
add_memory_resource, but this looks reasonable too, and I can switch to
this. In fact it is better as in my case I had to loop twice, once for
the regular add_memory() path and once for the _driver_managed() path.
Yours should avoid that.

>
> If any adding of memory failed, we remove what we already added. That works, because as
> long as we're holding the relevant locks, memory cannot get onlined in the meantime.
>
> Then we only have to teach remove_memory() to deal with individual blocks when finding
> blocks that have an altmap.
>

2023-07-13 15:32:21

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On 13.07.23 17:15, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>
>>> I'm taking a shot at implementing the splitting internally in
>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>> approach, but there's a slight complication if I don't have the module
>>> param override (patch 1 of this series).
>>>
>>> The kmem diff now looks like:
>>>
>>>     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>     index 898ca9505754..8be932f63f90 100644
>>>     --- a/drivers/dax/kmem.c
>>>     +++ b/drivers/dax/kmem.c
>>>     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>             data->mgid = rc;
>>>
>>>             for (i = 0; i < dev_dax->nr_range; i++) {
>>>     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>     +                                 MHP_SPLIT_MEMBLOCKS;
>>>                     struct resource *res;
>>>                     struct range range;
>>>
>>>     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>                      * this as RAM automatically.
>>>                      */
>>>                     rc = add_memory_driver_managed(data->mgid, range.start,
>>>     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>     +                               range_len(&range), kmem_name, mhp_flags);
>>>
>>>                     if (rc) {
>>>                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>
>>>
>>
>> Why do we need the MHP_SPLIT_MEMBLOCKS?
>
> I thought we still wanted either an opt-in or opt-out for the kmem
> driver to be able to do memmap_on_memory, in case there were
> performance implications or the lack of 1GiB PUDs. I haven't
> implemented that yet, but I was thinking along the lines of a sysfs
> knob exposed by kmem, that controls setting of this new
> MHP_SPLIT_MEMBLOCKS flag.

Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?

>
>>
>> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
>> single memory block, you can simply split up internally, no?
>>
>> Essentially in add_memory_resource() something like
>>
>> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>>      mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>>         for (cur_start = start, cur_start < start + size;
>>              cur_start += memory_block_size_bytes()) {
>>                 mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>>                 mhp_altmap.base_pfn = PHYS_PFN(start);
>>                 params.altmap = &mhp_altmap;
>>
>>                 ret = arch_add_memory(nid, start,
>>                                       memory_block_size_bytes(), &params);
>>                 if (ret < 0) ...
>>
>>                 ret = create_memory_block_devices(start, memory_block_size_bytes(),
>>                                                   mhp_altmap.alloc, group);
>>                 if (ret) ...
>>
>>         }
>> } else {
>>         /* old boring stuff */
>> }
>>
>> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
>> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
>> ...
>
> My current approach was looping a level higher, on the call to
> add_memory_resource, but this looks reasonable too, and I can switch to
> this. In fact it is better as in my case I had to loop twice, once for
> the regular add_memory() path and once for the _driver_managed() path.
> Yours should avoid that.

As we only care about the altmap here, handling it for arch_add_memory()
+ create_memory_block_devices() should be sufficient.

--
Cheers,

David / dhildenb


2023-07-13 16:10:36

by Verma, Vishal L

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
> On 13.07.23 17:15, Verma, Vishal L wrote:
> > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> > > On 13.07.23 08:45, Verma, Vishal L wrote:
> > > >
> > > > I'm taking a shot at implementing the splitting internally in
> > > > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > > > approach, but there's a slight complication if I don't have the module
> > > > param override (patch 1 of this series).
> > > >
> > > > The kmem diff now looks like:
> > > >
> > > >      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > >      index 898ca9505754..8be932f63f90 100644
> > > >      --- a/drivers/dax/kmem.c
> > > >      +++ b/drivers/dax/kmem.c
> > > >      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >              data->mgid = rc;
> > > >      
> > > >              for (i = 0; i < dev_dax->nr_range; i++) {
> > > >      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> > > >      +                                 MHP_SPLIT_MEMBLOCKS;
> > > >                      struct resource *res;
> > > >                      struct range range;
> > > >      
> > > >      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >                       * this as RAM automatically.
> > > >                       */
> > > >                      rc = add_memory_driver_managed(data->mgid, range.start,
> > > >      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> > > >      +                               range_len(&range), kmem_name, mhp_flags);
> > > >      
> > > >                      if (rc) {
> > > >                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> > > >     
> > > >
> > >
> > > Why do we need the MHP_SPLIT_MEMBLOCKS?
> >
> > I thought we still wanted either an opt-in or opt-out for the kmem
> > driver to be able to do memmap_on_memory, in case there were
> > performance implications or the lack of 1GiB PUDs. I haven't
> > implemented that yet, but I was thinking along the lines of a sysfs
> > knob exposed by kmem, that controls setting of this new
> > MHP_SPLIT_MEMBLOCKS flag.
>
> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
>
>
Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
and memory_hotplug is free to split to memblocks if it needs to to
satisfy that.

That sounds reasonable. Let me give this a try and see if I run into
anything else. Thanks David!

2023-07-13 16:11:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

On 13.07.23 17:40, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 17:15, Verma, Vishal L wrote:
>>> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>>>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>>>
>>>>> I'm taking a shot at implementing the splitting internally in
>>>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>>>> approach, but there's a slight complication if I don't have the module
>>>>> param override (patch 1 of this series).
>>>>>
>>>>> The kmem diff now looks like:
>>>>>
>>>>>      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>      index 898ca9505754..8be932f63f90 100644
>>>>>      --- a/drivers/dax/kmem.c
>>>>>      +++ b/drivers/dax/kmem.c
>>>>>      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>              data->mgid = rc;
>>>>>
>>>>>              for (i = 0; i < dev_dax->nr_range; i++) {
>>>>>      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>>>      +                                 MHP_SPLIT_MEMBLOCKS;
>>>>>                      struct resource *res;
>>>>>                      struct range range;
>>>>>
>>>>>      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>                       * this as RAM automatically.
>>>>>                       */
>>>>>                      rc = add_memory_driver_managed(data->mgid, range.start,
>>>>>      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>>>      +                               range_len(&range), kmem_name, mhp_flags);
>>>>>
>>>>>                      if (rc) {
>>>>>                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>>>
>>>>>
>>>>
>>>> Why do we need the MHP_SPLIT_MEMBLOCKS?
>>>
>>> I thought we still wanted either an opt-in or opt-out for the kmem
>>> driver to be able to do memmap_on_memory, in case there were
>>> performance implications or the lack of 1GiB PUDs. I haven't
>>> implemented that yet, but I was thinking along the lines of a sysfs
>>> knob exposed by kmem, that controls setting of this new
>>> MHP_SPLIT_MEMBLOCKS flag.
>>
>> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
>>
>>
> Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
> and memory_hotplug is free to split to memblocks if it needs to to
> satisfy that.

And if you don't want memmap holes in a larger area you're adding (for
example to runtime-allocate 1 GiB pages), simply check the size your
adding, and if it's, say, less than 1 G, don't set the flag.

But that's probably a corner case use case not worth considering for now.

>
> That sounds reasonable. Let me give this a try and see if I run into
> anything else. Thanks David!

Sure!

--
Cheers,

David / dhildenb