2021-07-28 08:24:13

by Justin He

[permalink] [raw]
Subject: [PATCH 0/1] fix pmem RAM device when nid is NUMA_NO_NODE

Background
==========
I once sent out the preparatory patches [1] but I dropped the last patch
of using fallback nid with memory_add_physaddr_to_nid() due to some
dependency.

After phys_addr_to_target_node() and memory_add_physaddr_to_nid() are
stable now, it's time to fix the original bug on arm64 now.

Compared with the last version [2], this version rebases the patch to
latest v5.14-rc3 (s/kmem_start/range.start)

Test
====
Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The
memblocks from the dax pmem device can be either hot-added or hot-removed
on arm64 guest. Also passed the compilation test on x86.

[1] https://www.mail-archive.com/[email protected]/msg2228771.html
[2] https://lkml.org/lkml/2020/7/8/1546

Jia He (1):
device-dax: use fallback nid when numa_node is invalid

drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)

--
2.17.1



2021-07-28 08:25:10

by Justin He

[permalink] [raw]
Subject: [PATCH] device-dax: use fallback nid when numa_node is invalid

Previously, numa_off was set unconditionally in dummy_numa_init()
even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
after acpi_map_pxm_to_node() because it regards numa_off as turning
off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
arm64 with fake numa.

Without this patch, pmem can't be probed as a RAM device on arm64 if
SRAT table isn't present:
$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by using fallback memory_add_physaddr_to_nid() as nid.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Jia He <[email protected]>
---
drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index ac231cc36359..749674909e51 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
struct dax_kmem_data *data;
int rc = -ENOMEM;
int i, mapped = 0;
- int numa_node;
-
- /*
- * Ensure good NUMA information for the persistent memory.
- * Without this check, there is a risk that slow memory
- * could be mixed in a node with faster memory, causing
- * unavoidable performance issues.
- */
- numa_node = dev_dax->target_node;
- if (numa_node < 0) {
- dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
- numa_node);
- return -EINVAL;
- }
+ int numa_node = dev_dax->target_node, new_node;

data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
if (!data)
@@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
*/
res->flags = IORESOURCE_SYSTEM_RAM;

+ /*
+ * Ensure good NUMA information for the persistent memory.
+ * Without this check, there is a risk but not fatal that slow
+ * memory could be mixed in a node with faster memory, causing
+ * unavoidable performance issues. Furthermore, fallback node
+ * id can be used when numa_node is invalid.
+ */
+ if (numa_node < 0) {
+ new_node = memory_add_physaddr_to_nid(range.start);
+ dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
+ numa_node, new_node, res);
+ numa_node = new_node;
+ }
+
/*
* Ensure that future kexec'd kernels will not treat
* this as RAM automatically.
@@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
int i, success = 0;
struct device *dev = &dev_dax->dev;
struct dax_kmem_data *data = dev_get_drvdata(dev);
+ int numa_node = dev_dax->target_node;

/*
* We have one shot for removing memory, if some memory blocks were not
@@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
if (rc)
continue;

- rc = remove_memory(dev_dax->target_node, range.start,
- range_len(&range));
+ if (numa_node < 0)
+ numa_node = memory_add_physaddr_to_nid(range.start);
+
+ rc = remove_memory(numa_node, range.start, range_len(&range));
if (rc == 0) {
release_resource(data->res[i]);
kfree(data->res[i]);
--
2.17.1


2021-07-28 20:19:09

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is invalid

On 28.07.21 10:22, Jia He wrote:
> Previously, numa_off was set unconditionally in dummy_numa_init()
> even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
> after acpi_map_pxm_to_node() because it regards numa_off as turning
> off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> arm64 with fake numa.
>
> Without this patch, pmem can't be probed as a RAM device on arm64 if
> SRAT table isn't present:
> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> kmem: probe of dax0.0 failed with error -22
>
> This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
> 1 file changed, 20 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index ac231cc36359..749674909e51 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> struct dax_kmem_data *data;
> int rc = -ENOMEM;
> int i, mapped = 0;
> - int numa_node;
> -
> - /*
> - * Ensure good NUMA information for the persistent memory.
> - * Without this check, there is a risk that slow memory
> - * could be mixed in a node with faster memory, causing
> - * unavoidable performance issues.
> - */
> - numa_node = dev_dax->target_node;
> - if (numa_node < 0) {
> - dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> - numa_node);
> - return -EINVAL;
> - }
> + int numa_node = dev_dax->target_node, new_node;
>
> data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
> if (!data)
> @@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> */
> res->flags = IORESOURCE_SYSTEM_RAM;
>
> + /*
> + * Ensure good NUMA information for the persistent memory.
> + * Without this check, there is a risk but not fatal that slow
> + * memory could be mixed in a node with faster memory, causing
> + * unavoidable performance issues. Furthermore, fallback node
> + * id can be used when numa_node is invalid.
> + */
> + if (numa_node < 0) {
> + new_node = memory_add_physaddr_to_nid(range.start);
> + dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
> + numa_node, new_node, res);
> + numa_node = new_node;
> + }
> +
> /*
> * Ensure that future kexec'd kernels will not treat
> * this as RAM automatically.
> @@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
> int i, success = 0;
> struct device *dev = &dev_dax->dev;
> struct dax_kmem_data *data = dev_get_drvdata(dev);
> + int numa_node = dev_dax->target_node;
>
> /*
> * We have one shot for removing memory, if some memory blocks were not
> @@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
> if (rc)
> continue;
>
> - rc = remove_memory(dev_dax->target_node, range.start,
> - range_len(&range));
> + if (numa_node < 0)
> + numa_node = memory_add_physaddr_to_nid(range.start);
> +
> + rc = remove_memory(numa_node, range.start, range_len(&range));
> if (rc == 0) {
> release_resource(data->res[i]);
> kfree(data->res[i]);
>

Note that this patch conflicts with:

https://lkml.kernel.org/r/[email protected]

But nothing fundamental. Determining a single NID is similar to how I'm
handling it for ACPI:

https://lkml.kernel.org/r/[email protected]

--
Thanks,

David / dhildenb


2021-07-29 00:22:51

by Justin He

[permalink] [raw]
Subject: RE: [PATCH] device-dax: use fallback nid when numa_node is invalid

Hi David

> -----Original Message-----
> From: David Hildenbrand <[email protected]>
> Sent: Thursday, July 29, 2021 4:17 AM
> To: Justin He <[email protected]>; Dan Williams <[email protected]>;
> Vishal Verma <[email protected]>; Dave Jiang <[email protected]>
> Cc: [email protected]; [email protected]; nd <[email protected]>
> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is invalid
>
> On 28.07.21 10:22, Jia He wrote:
> > Previously, numa_off was set unconditionally in dummy_numa_init()
> > even with a fake numa node. Then ACPI set node id as NUMA_NO_NODE(-1)
> > after acpi_map_pxm_to_node() because it regards numa_off as turning
> > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on
> > arm64 with fake numa.
> >
> > Without this patch, pmem can't be probed as a RAM device on arm64 if
> > SRAT table isn't present:
> > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g
> -a 64K
> > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with
> invalid node: -1
> > kmem: probe of dax0.0 failed with error -22
> >
> > This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > drivers/dax/kmem.c | 36 ++++++++++++++++++++----------------
> > 1 file changed, 20 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index ac231cc36359..749674909e51 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -46,20 +46,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > struct dax_kmem_data *data;
> > int rc = -ENOMEM;
> > int i, mapped = 0;
> > - int numa_node;
> > -
> > - /*
> > - * Ensure good NUMA information for the persistent memory.
> > - * Without this check, there is a risk that slow memory
> > - * could be mixed in a node with faster memory, causing
> > - * unavoidable performance issues.
> > - */
> > - numa_node = dev_dax->target_node;
> > - if (numa_node < 0) {
> > - dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
> > - numa_node);
> > - return -EINVAL;
> > - }
> > + int numa_node = dev_dax->target_node, new_node;
> >
> > data = kzalloc(struct_size(data, res, dev_dax->nr_range),
> GFP_KERNEL);
> > if (!data)
> > @@ -104,6 +91,20 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > */
> > res->flags = IORESOURCE_SYSTEM_RAM;
> >
> > + /*
> > + * Ensure good NUMA information for the persistent memory.
> > + * Without this check, there is a risk but not fatal that slow
> > + * memory could be mixed in a node with faster memory, causing
> > + * unavoidable performance issues. Furthermore, fallback node
> > + * id can be used when numa_node is invalid.
> > + */
> > + if (numa_node < 0) {
> > + new_node = memory_add_physaddr_to_nid(range.start);
> > + dev_info(dev, "changing nid from %d to %d for DAX
> region %pR\n",
> > + numa_node, new_node, res);
> > + numa_node = new_node;
> > + }
> > +
> > /*
> > * Ensure that future kexec'd kernels will not treat
> > * this as RAM automatically.
> > @@ -141,6 +142,7 @@ static void dev_dax_kmem_remove(struct dev_dax
> *dev_dax)
> > int i, success = 0;
> > struct device *dev = &dev_dax->dev;
> > struct dax_kmem_data *data = dev_get_drvdata(dev);
> > + int numa_node = dev_dax->target_node;
> >
> > /*
> > * We have one shot for removing memory, if some memory blocks were
> not
> > @@ -156,8 +158,10 @@ static void dev_dax_kmem_remove(struct dev_dax
> *dev_dax)
> > if (rc)
> > continue;
> >
> > - rc = remove_memory(dev_dax->target_node, range.start,
> > - range_len(&range));
> > + if (numa_node < 0)
> > + numa_node = memory_add_physaddr_to_nid(range.start);
> > +
> > + rc = remove_memory(numa_node, range.start, range_len(&range));
> > if (rc == 0) {
> > release_resource(data->res[i]);
> > kfree(data->res[i]);
> >
>
> Note that this patch conflicts with:
>
> https://lkml.kernel.org/r/[email protected]
>
> But nothing fundamental. Determining a single NID is similar to how I'm
> handling it for ACPI:
>
> https://lkml.kernel.org/r/[email protected]
>

Okay, got it. Thanks for the reminder.
Seems my patch is not useful after your patch.


--
Cheers,
Justin (Jia He)


2021-07-29 08:02:22

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is invalid

Hi Justin,

>>>
>>
>> Note that this patch conflicts with:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> But nothing fundamental. Determining a single NID is similar to how I'm
>> handling it for ACPI:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>
> Okay, got it. Thanks for the reminder.
> Seems my patch is not useful after your patch.
>

I think your patch still makes sense. With

https://lore.kernel.org/linux-acpi/[email protected]/

We'd have to detect the node id in the first loop instead.

--
Thanks,

David / dhildenb


2021-07-29 14:48:56

by Justin He

[permalink] [raw]
Subject: RE: [PATCH] device-dax: use fallback nid when numa_node is invalid

Hi David

> -----Original Message-----
> From: David Hildenbrand <[email protected]>
> Sent: Thursday, July 29, 2021 3:59 PM
> To: Justin He <[email protected]>; Dan Williams <[email protected]>;
> Vishal Verma <[email protected]>; Dave Jiang <[email protected]>
> Cc: [email protected]; [email protected]; nd <[email protected]>
> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is
> invalid
>
> Hi Justin,
>
> >>>
> >>
> >> Note that this patch conflicts with:
> >>
> >> https://lkml.kernel.org/r/[email protected]
> >>
> >> But nothing fundamental. Determining a single NID is similar to how I'm
> >> handling it for ACPI:
> >>
> >> https://lkml.kernel.org/r/[email protected]
> >>
> >
> > Okay, got it. Thanks for the reminder.
> > Seems my patch is not useful after your patch.
> >
>
> I think your patch still makes sense. With
>
> https://lore.kernel.org/linux-acpi/20210723125210.29987-7-
> [email protected]/
>
> We'd have to detect the node id in the first loop instead.

Ok, I got your point. I will do that in v2.

Btw, sorry for commenting there about your patch 06 since I didn't
subscribe lkml via this mailbox.

+ for (i = 0; i < dev_dax->nr_range; i++) {
+ struct range range;
+
+ rc = dax_kmem_range(dev_dax, i, &range);
+ if (rc) {
+ dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
+ i, range.start, range.end);
+ continue;
+ }
+ total_len += range_len(&range);
+ }
You add an additional loop to get the total_len.
I wonder is it independent on 2nd loop?
If yes, why not merge the 2 loops into one?
Sorry if this question is too simple, I don't know too much
about the background of your patch.


--
Cheers,
Justin (Jia He)


2021-08-02 15:50:32

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is invalid

On 29.07.21 16:44, Justin He wrote:
> Hi David
>
>> -----Original Message-----
>> From: David Hildenbrand <[email protected]>
>> Sent: Thursday, July 29, 2021 3:59 PM
>> To: Justin He <[email protected]>; Dan Williams <[email protected]>;
>> Vishal Verma <[email protected]>; Dave Jiang <[email protected]>
>> Cc: [email protected]; [email protected]; nd <[email protected]>
>> Subject: Re: [PATCH] device-dax: use fallback nid when numa_node is
>> invalid
>>
>> Hi Justin,
>>
>>>>>
>>>>
>>>> Note that this patch conflicts with:
>>>>
>>>> https://lkml.kernel.org/r/[email protected]
>>>>
>>>> But nothing fundamental. Determining a single NID is similar to how I'm
>>>> handling it for ACPI:
>>>>
>>>> https://lkml.kernel.org/r/[email protected]
>>>>
>>>
>>> Okay, got it. Thanks for the reminder.
>>> Seems my patch is not useful after your patch.
>>>
>>
>> I think your patch still makes sense. With
>>
>> https://lore.kernel.org/linux-acpi/20210723125210.29987-7-
>> [email protected]/
>>
>> We'd have to detect the node id in the first loop instead.
>
> Ok, I got your point. I will do that in v2.
>
> Btw, sorry for commenting there about your patch 06 since I didn't
> subscribe lkml via this mailbox.

Sure, you really should subscribe :)

> > + for (i = 0; i < dev_dax->nr_range; i++) {
> + struct range range;
> +
> + rc = dax_kmem_range(dev_dax, i, &range);
> + if (rc) {
> + dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
> + i, range.start, range.end);
> + continue;
> + }
> + total_len += range_len(&range);
> + }
> You add an additional loop to get the total_len.
> I wonder is it independent on 2nd loop?
> If yes, why not merge the 2 loops into one?
> Sorry if this question is too simple, I don't know too much
> about the background of your patch.

We need total_len to register the memory group. We need the memory group
to add memory. Therefore, we need a second loop to calculate total_len
upfront.


--
Thanks,

David / dhildenb