2020-07-07 06:00:04

by Justin He

[permalink] [raw]
Subject: [PATCH v2 0/3] Fix and enable pmem as RAM device on arm64

This fix a few issues when I tried to enable pmem as RAM device on arm64.

Tested on ThunderX2 host/qemu "-M virt" guest with a nvdimm device. The
memblocks from the dax pmem device can be either hot-added or hot-removed
on arm64 guest.

Changes:
v2: - Drop unneccessary patch to harden try_offline_node
- Use new solution(by David) to fix dev->target_node=-1 during probing
- Refine the mem_hotplug_begin/done patch

v1: https://lkml.org/lkml/2020/7/5/381

Jia He (3):
arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
device-dax: use fallback nid when numa_node is invalid
mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

arch/arm64/mm/numa.c | 5 +++--
drivers/dax/kmem.c | 22 ++++++++++++++--------
mm/memory_hotplug.c | 5 ++---
3 files changed, 19 insertions(+), 13 deletions(-)

--
2.17.1


2020-07-07 06:00:15

by Justin He

[permalink] [raw]
Subject: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

This exports memory_add_physaddr_to_nid() for module driver to use.

memory_add_physaddr_to_nid() is a fallback option to get the nid in case
NUMA_NO_NID is detected.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Jia He <[email protected]>
---
arch/arm64/mm/numa.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index aafcee3e3f7e..7eeb31740248 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -464,10 +464,11 @@ void __init arm64_numa_init(void)

/*
* We hope that we will be hotplugging memory on nodes we already know about,
- * such that acpi_get_node() succeeds and we never fall back to this...
+ * such that acpi_get_node() succeeds. But when SRAT is not present, the node
+ * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
*/
int memory_add_physaddr_to_nid(u64 addr)
{
- pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
return 0;
}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
--
2.17.1

2020-07-07 06:01:58

by Justin He

[permalink] [raw]
Subject: [PATCH v2 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
online at that time), mem_hotplug_begin/done is unpaired in such case.

Therefore a warning:
Call Trace:
percpu_up_write+0x33/0x40
try_remove_memory+0x66/0x120
? _cond_resched+0x19/0x30
remove_memory+0x2b/0x40
dev_dax_kmem_remove+0x36/0x72 [kmem]
device_release_driver_internal+0xf0/0x1c0
device_release_driver+0x12/0x20
bus_remove_device+0xe1/0x150
device_del+0x17b/0x3e0
unregister_dev_dax+0x29/0x60
devm_action_release+0x15/0x20
release_nodes+0x19a/0x1e0
devres_release_all+0x3f/0x50
device_release_driver_internal+0x100/0x1c0
driver_detach+0x4c/0x8f
bus_remove_driver+0x5c/0xd0
driver_unregister+0x31/0x50
dax_pmem_exit+0x10/0xfe0 [dax_pmem]

Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
Cc: [email protected] # v5.6+
Signed-off-by: Jia He <[email protected]>
---
mm/memory_hotplug.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index da374cd3d45b..76c75a599da3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1742,7 +1742,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
*/
rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
if (rc)
- goto done;
+ return rc;

/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");
@@ -1766,9 +1766,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)

try_offline_node(nid);

-done:
mem_hotplug_done();
- return rc;
+ return 0;
}

/**
--
2.17.1

2020-07-07 06:03:23

by Justin He

[permalink] [raw]
Subject: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid

Previously, numa_off is set unconditionally at the end of dummy_numa_init(),
even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1) in
acpi_map_pxm_to_node() because it regards numa_off as turning off the numa
node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.

Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
isn't present:
$ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
kmem: probe of dax0.0 failed with error -22

This fixes it by using fallback memory_add_physaddr_to_nid() as nid.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Jia He <[email protected]>
---
I noticed that on powerpc memory_add_physaddr_to_nid is not exported for module
driver. Set it to RFC due to this concern.

drivers/dax/kmem.c | 22 ++++++++++++++--------
1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 275aa5f87399..68e693ca6d59 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev)
resource_size_t kmem_end;
struct resource *new_res;
const char *new_res_name;
- int numa_node;
+ int numa_node, new_node;
int rc;

/*
* Ensure good NUMA information for the persistent memory.
- * Without this check, there is a risk that slow memory
- * could be mixed in a node with faster memory, causing
- * unavoidable performance issues.
+ * Without this check, there is a risk but not fatal that slow
+ * memory could be mixed in a node with faster memory, causing
+ * unavoidable performance issues. Furthermore, fallback node
+ * id can be used when numa_node is invalid.
*/
numa_node = dev_dax->target_node;
if (numa_node < 0) {
- dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
- res, numa_node);
- return -EINVAL;
+ new_node = memory_add_physaddr_to_nid(kmem_start);
+ dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
+ numa_node, new_node, res);
+ numa_node = new_node;
}

/* Hotplug starting at the beginning of the next block: */
@@ -100,6 +102,7 @@ static int dev_dax_kmem_remove(struct device *dev)
resource_size_t kmem_start = res->start;
resource_size_t kmem_size = resource_size(res);
const char *res_name = res->name;
+ int numa_node = dev_dax->target_node;
int rc;

/*
@@ -108,7 +111,10 @@ static int dev_dax_kmem_remove(struct device *dev)
* there is no way to hotremove this memory until reboot because device
* unbind will succeed even if we return failure.
*/
- rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+ if (numa_node < 0)
+ numa_node = memory_add_physaddr_to_nid(kmem_start);
+
+ rc = remove_memory(numa_node, kmem_start, kmem_size);
if (rc) {
any_hotremove_failed = true;
dev_err(dev,
--
2.17.1

2020-07-07 06:11:30

by Justin He

[permalink] [raw]
Subject: RE: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid

[+] add Powerpc maintainers to check my concern about memory_add_physaddr_to_nid
Exporting


--
Cheers,
Justin (Jia He)



> -----Original Message-----
> From: Jia He <[email protected]>
> Sent: Tuesday, July 7, 2020 1:59 PM
> To: Catalin Marinas <[email protected]>; Will Deacon
> <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> <[email protected]>; Dave Jiang <[email protected]>
> Cc: Michal Hocko <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; Mike Rapoport <[email protected]>; Baoquan He
> <[email protected]>; Chuhong Yuan <[email protected]>; linux-arm-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kaly Xin <[email protected]>;
> Justin He <[email protected]>
> Subject: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is
> invalid
>
> Previously, numa_off is set unconditionally at the end of
> dummy_numa_init(),
> even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1)
> in
> acpi_map_pxm_to_node() because it regards numa_off as turning off the numa
> node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.
>
> Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT
> table
> isn't present:
> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a
> 64K
> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with
> invalid node: -1
> kmem: probe of dax0.0 failed with error -22
>
> This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> I noticed that on powerpc memory_add_physaddr_to_nid is not exported for
> module
> driver. Set it to RFC due to this concern.
>
> drivers/dax/kmem.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 275aa5f87399..68e693ca6d59 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev)
> resource_size_t kmem_end;
> struct resource *new_res;
> const char *new_res_name;
> - int numa_node;
> + int numa_node, new_node;
> int rc;
>
> /*
> * Ensure good NUMA information for the persistent memory.
> - * Without this check, there is a risk that slow memory
> - * could be mixed in a node with faster memory, causing
> - * unavoidable performance issues.
> + * Without this check, there is a risk but not fatal that slow
> + * memory could be mixed in a node with faster memory, causing
> + * unavoidable performance issues. Furthermore, fallback node
> + * id can be used when numa_node is invalid.
> */
> numa_node = dev_dax->target_node;
> if (numa_node < 0) {
> - dev_warn(dev, "rejecting DAX region %pR with invalid
> node: %d\n",
> - res, numa_node);
> - return -EINVAL;
> + new_node = memory_add_physaddr_to_nid(kmem_start);
> + dev_info(dev, "changing nid from %d to %d for DAX
> region %pR\n",
> + numa_node, new_node, res);
> + numa_node = new_node;
> }
>
> /* Hotplug starting at the beginning of the next block: */
> @@ -100,6 +102,7 @@ static int dev_dax_kmem_remove(struct device *dev)
> resource_size_t kmem_start = res->start;
> resource_size_t kmem_size = resource_size(res);
> const char *res_name = res->name;
> + int numa_node = dev_dax->target_node;
> int rc;
>
> /*
> @@ -108,7 +111,10 @@ static int dev_dax_kmem_remove(struct device *dev)
> * there is no way to hotremove this memory until reboot because
> device
> * unbind will succeed even if we return failure.
> */
> - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
> + if (numa_node < 0)
> + numa_node = memory_add_physaddr_to_nid(kmem_start);
> +
> + rc = remove_memory(numa_node, kmem_start, kmem_size);
> if (rc) {
> any_hotremove_failed = true;
> dev_err(dev,
> --
> 2.17.1

2020-07-07 10:09:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

On Tue 07-07-20 13:59:17, Jia He wrote:
> When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
> online at that time), mem_hotplug_begin/done is unpaired in such case.
>
> Therefore a warning:
> Call Trace:
> percpu_up_write+0x33/0x40
> try_remove_memory+0x66/0x120
> ? _cond_resched+0x19/0x30
> remove_memory+0x2b/0x40
> dev_dax_kmem_remove+0x36/0x72 [kmem]
> device_release_driver_internal+0xf0/0x1c0
> device_release_driver+0x12/0x20
> bus_remove_device+0xe1/0x150
> device_del+0x17b/0x3e0
> unregister_dev_dax+0x29/0x60
> devm_action_release+0x15/0x20
> release_nodes+0x19a/0x1e0
> devres_release_all+0x3f/0x50
> device_release_driver_internal+0x100/0x1c0
> driver_detach+0x4c/0x8f
> bus_remove_driver+0x5c/0xd0
> driver_unregister+0x31/0x50
> dax_pmem_exit+0x10/0xfe0 [dax_pmem]
>
> Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
> Cc: [email protected] # v5.6+
> Signed-off-by: Jia He <[email protected]>

Ups, I have missed that when reviewing that commit. Thanks for catching
this up!

Acked-by: Michal Hocko <[email protected]>

> ---
> mm/memory_hotplug.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index da374cd3d45b..76c75a599da3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1742,7 +1742,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
> */
> rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
> if (rc)
> - goto done;
> + return rc;
>
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
> @@ -1766,9 +1766,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>
> try_offline_node(nid);
>
> -done:
> mem_hotplug_done();
> - return rc;
> + return 0;
> }
>
> /**
> --
> 2.17.1

--
Michal Hocko
SUSE Labs

2020-07-07 11:33:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 3/3] mm/memory_hotplug: fix unpaired mem_hotplug_begin/done

On 07.07.20 07:59, Jia He wrote:
> When check_memblock_offlined_cb() returns failed rc(e.g. the memblock is
> online at that time), mem_hotplug_begin/done is unpaired in such case.
>
> Therefore a warning:
> Call Trace:
> percpu_up_write+0x33/0x40
> try_remove_memory+0x66/0x120
> ? _cond_resched+0x19/0x30
> remove_memory+0x2b/0x40
> dev_dax_kmem_remove+0x36/0x72 [kmem]
> device_release_driver_internal+0xf0/0x1c0
> device_release_driver+0x12/0x20
> bus_remove_device+0xe1/0x150
> device_del+0x17b/0x3e0
> unregister_dev_dax+0x29/0x60
> devm_action_release+0x15/0x20
> release_nodes+0x19a/0x1e0
> devres_release_all+0x3f/0x50
> device_release_driver_internal+0x100/0x1c0
> driver_detach+0x4c/0x8f
> bus_remove_driver+0x5c/0xd0
> driver_unregister+0x31/0x50
> dax_pmem_exit+0x10/0xfe0 [dax_pmem]
>
> Fixes: f1037ec0cc8a ("mm/memory_hotplug: fix remove_memory() lockdep splat")
> Cc: [email protected] # v5.6+
> Signed-off-by: Jia He <[email protected]>
> ---
> mm/memory_hotplug.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index da374cd3d45b..76c75a599da3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1742,7 +1742,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
> */
> rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
> if (rc)
> - goto done;
> + return rc;
>
> /* remove memmap entry */
> firmware_map_remove(start, start + size, "System RAM");
> @@ -1766,9 +1766,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size)
>
> try_offline_node(nid);
>
> -done:
> mem_hotplug_done();
> - return rc;
> + return 0;
> }
>
> /**
>

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2020-07-07 11:35:09

by David Hildenbrand

[permalink] [raw]
Subject: Re: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid

On 07.07.20 07:59, Jia He wrote:
> Previously, numa_off is set unconditionally at the end of dummy_numa_init(),
> even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1) in
> acpi_map_pxm_to_node() because it regards numa_off as turning off the numa
> node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.
>
> Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT table
> isn't present:
> $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a 64K
> kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with invalid node: -1
> kmem: probe of dax0.0 failed with error -22
>
> This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> I noticed that on powerpc memory_add_physaddr_to_nid is not exported for module
> driver. Set it to RFC due to this concern.
>
> drivers/dax/kmem.c | 22 ++++++++++++++--------
> 1 file changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 275aa5f87399..68e693ca6d59 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev)
> resource_size_t kmem_end;
> struct resource *new_res;
> const char *new_res_name;
> - int numa_node;
> + int numa_node, new_node;
> int rc;
>
> /*
> * Ensure good NUMA information for the persistent memory.
> - * Without this check, there is a risk that slow memory
> - * could be mixed in a node with faster memory, causing
> - * unavoidable performance issues.
> + * Without this check, there is a risk but not fatal that slow
> + * memory could be mixed in a node with faster memory, causing
> + * unavoidable performance issues. Furthermore, fallback node
> + * id can be used when numa_node is invalid.
> */
> numa_node = dev_dax->target_node;
> if (numa_node < 0) {
> - dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
> - res, numa_node);
> - return -EINVAL;
> + new_node = memory_add_physaddr_to_nid(kmem_start);
> + dev_info(dev, "changing nid from %d to %d for DAX region %pR\n",
> + numa_node, new_node, res);
> + numa_node = new_node;

Now, the warning does not really make sense. We have NUMA_NO_NODE (< 0),
that is not a change in the nid, but a selection of a nid. Printing
NUMA_NO_NODE does not make too much sense. I suggest just getting rid of
new_node and turning the dev_info() into something like

dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n",
numa_node, res);


--
Thanks,

David / dhildenb

2020-07-07 11:36:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 07.07.20 07:59, Jia He wrote:
> This exports memory_add_physaddr_to_nid() for module driver to use.
>
> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> NUMA_NO_NID is detected.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> arch/arm64/mm/numa.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7eeb31740248 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>
> /*
> * We hope that we will be hotplugging memory on nodes we already know about,
> - * such that acpi_get_node() succeeds and we never fall back to this...
> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> */
> int memory_add_physaddr_to_nid(u64 addr)
> {
> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> return 0;
> }
> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>
We could turn that into a pr_info() instead, but the effect is visible
to user space (e.g., which memory blocks belong to which node in sysfs),
so this can be debugged easily on demand.

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb

2020-07-07 11:55:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue 07-07-20 13:59:15, Jia He wrote:
> This exports memory_add_physaddr_to_nid() for module driver to use.
>
> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> NUMA_NO_NID is detected.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> arch/arm64/mm/numa.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index aafcee3e3f7e..7eeb31740248 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>
> /*
> * We hope that we will be hotplugging memory on nodes we already know about,
> - * such that acpi_get_node() succeeds and we never fall back to this...
> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> */
> int memory_add_physaddr_to_nid(u64 addr)
> {
> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> return 0;
> }
> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);

Does it make sense to export a noop function? Wouldn't make more sense
to simply make it static inline somewhere in a header? I haven't checked
whether there is an easy way to do that sanely bu this just hit my eyes.
--
Michal Hocko
SUSE Labs

2020-07-07 12:14:20

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 13:59:15, Jia He wrote:
> > This exports memory_add_physaddr_to_nid() for module driver to use.
> >
> > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > NUMA_NO_NID is detected.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > arch/arm64/mm/numa.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7eeb31740248 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >
> > /*
> > * We hope that we will be hotplugging memory on nodes we already know about,
> > - * such that acpi_get_node() succeeds and we never fall back to this...
> > + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> > */
> > int memory_add_physaddr_to_nid(u64 addr)
> > {
> > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> > return 0;
> > }
> > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>
> Does it make sense to export a noop function? Wouldn't make more sense
> to simply make it static inline somewhere in a header? I haven't checked
> whether there is an easy way to do that sanely bu this just hit my eyes.

We'll need to either add a CONFIG_ option or arch specific callback to
make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
implementations coexist ...

> --
> Michal Hocko
> SUSE Labs

--
Sincerely yours,
Mike.

2020-07-07 12:27:25

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 07.07.20 14:13, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>
>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>> NUMA_NO_NID is detected.
>>>
>>> Suggested-by: David Hildenbrand <[email protected]>
>>> Signed-off-by: Jia He <[email protected]>
>>> ---
>>> arch/arm64/mm/numa.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>> index aafcee3e3f7e..7eeb31740248 100644
>>> --- a/arch/arm64/mm/numa.c
>>> +++ b/arch/arm64/mm/numa.c
>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>
>>> /*
>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>> */
>>> int memory_add_physaddr_to_nid(u64 addr)
>>> {
>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>> return 0;
>>> }
>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>
>> Does it make sense to export a noop function? Wouldn't make more sense
>> to simply make it static inline somewhere in a header? I haven't checked
>> whether there is an easy way to do that sanely bu this just hit my eyes.
>
> We'll need to either add a CONFIG_ option or arch specific callback to
> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> implementations coexist ...

Note: I have a similar dummy (return 0) patch for s390x lying around here.


--
Thanks,

David / dhildenb

2020-07-07 18:04:09

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> On 07.07.20 14:13, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>
> >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>> NUMA_NO_NID is detected.
> >>>
> >>> Suggested-by: David Hildenbrand <[email protected]>
> >>> Signed-off-by: Jia He <[email protected]>
> >>> ---
> >>> arch/arm64/mm/numa.c | 5 +++--
> >>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>> index aafcee3e3f7e..7eeb31740248 100644
> >>> --- a/arch/arm64/mm/numa.c
> >>> +++ b/arch/arm64/mm/numa.c
> >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>
> >>> /*
> >>> * We hope that we will be hotplugging memory on nodes we already know about,
> >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>> */
> >>> int memory_add_physaddr_to_nid(u64 addr)
> >>> {
> >>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>> return 0;
> >>> }
> >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>
> >> Does it make sense to export a noop function? Wouldn't make more sense
> >> to simply make it static inline somewhere in a header? I haven't checked
> >> whether there is an easy way to do that sanely bu this just hit my eyes.
> >
> > We'll need to either add a CONFIG_ option or arch specific callback to
> > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > implementations coexist ...
>
> Note: I have a similar dummy (return 0) patch for s390x lying around here.

Then we'll call it a tie - 3:3 ;-)

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-07 22:51:59

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
>
> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> > On 07.07.20 14:13, Mike Rapoport wrote:
> > > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> > >> On Tue 07-07-20 13:59:15, Jia He wrote:
> > >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> > >>>
> > >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > >>> NUMA_NO_NID is detected.
> > >>>
> > >>> Suggested-by: David Hildenbrand <[email protected]>
> > >>> Signed-off-by: Jia He <[email protected]>
> > >>> ---
> > >>> arch/arm64/mm/numa.c | 5 +++--
> > >>> 1 file changed, 3 insertions(+), 2 deletions(-)
> > >>>
> > >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > >>> index aafcee3e3f7e..7eeb31740248 100644
> > >>> --- a/arch/arm64/mm/numa.c
> > >>> +++ b/arch/arm64/mm/numa.c
> > >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > >>>
> > >>> /*
> > >>> * We hope that we will be hotplugging memory on nodes we already know about,
> > >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> > >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> > >>> */
> > >>> int memory_add_physaddr_to_nid(u64 addr)
> > >>> {
> > >>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> > >>> return 0;
> > >>> }
> > >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >>
> > >> Does it make sense to export a noop function? Wouldn't make more sense
> > >> to simply make it static inline somewhere in a header? I haven't checked
> > >> whether there is an easy way to do that sanely bu this just hit my eyes.
> > >
> > > We'll need to either add a CONFIG_ option or arch specific callback to
> > > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > > implementations coexist ...
> >
> > Note: I have a similar dummy (return 0) patch for s390x lying around here.
>
> Then we'll call it a tie - 3:3 ;-)

So I'd be happy to jump on the train of people wanting to export the
ARM stub for this (and add a new ARM stub for phys_to_target_node()),
but Will did have a plausibly better idea that I have been meaning to
circle back to:

http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck

...i.e. iterate over node data to do the lookup. This would seem to
work generically for multiple archs unless I am missing something?

2020-07-08 01:43:13

by Justin He

[permalink] [raw]
Subject: RE: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid

Hi David

> -----Original Message-----
> From: David Hildenbrand <[email protected]>
> Sent: Tuesday, July 7, 2020 7:34 PM
> To: Justin He <[email protected]>; Catalin Marinas
> <[email protected]>; Will Deacon <[email protected]>; Dan Williams
> <[email protected]>; Vishal Verma <[email protected]>; Dave
> Jiang <[email protected]>
> Cc: Michal Hocko <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; Mike Rapoport <[email protected]>; Baoquan He
> <[email protected]>; Chuhong Yuan <[email protected]>; linux-arm-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kaly Xin <[email protected]>
> Subject: Re: [RFC PATCH v2 2/3] device-dax: use fallback nid when
> numa_node is invalid
>
> On 07.07.20 07:59, Jia He wrote:
> > Previously, numa_off is set unconditionally at the end of
> dummy_numa_init(),
> > even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1)
> in
> > acpi_map_pxm_to_node() because it regards numa_off as turning off the
> numa
> > node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa.
> >
> > Without this patch, pmem can't be probed as a RAM device on arm64 if
> SRAT table
> > isn't present:
> > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -
> a 64K
> > kmem dax0.0: rejecting DAX region [mem 0x240400000-0x2bfffffff] with
> invalid node: -1
> > kmem: probe of dax0.0 failed with error -22
> >
> > This fixes it by using fallback memory_add_physaddr_to_nid() as nid.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > I noticed that on powerpc memory_add_physaddr_to_nid is not exported for
> module
> > driver. Set it to RFC due to this concern.
> >
> > drivers/dax/kmem.c | 22 ++++++++++++++--------
> > 1 file changed, 14 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > index 275aa5f87399..68e693ca6d59 100644
> > --- a/drivers/dax/kmem.c
> > +++ b/drivers/dax/kmem.c
> > @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev)
> > resource_size_t kmem_end;
> > struct resource *new_res;
> > const char *new_res_name;
> > - int numa_node;
> > + int numa_node, new_node;
> > int rc;
> >
> > /*
> > * Ensure good NUMA information for the persistent memory.
> > - * Without this check, there is a risk that slow memory
> > - * could be mixed in a node with faster memory, causing
> > - * unavoidable performance issues.
> > + * Without this check, there is a risk but not fatal that slow
> > + * memory could be mixed in a node with faster memory, causing
> > + * unavoidable performance issues. Furthermore, fallback node
> > + * id can be used when numa_node is invalid.
> > */
> > numa_node = dev_dax->target_node;
> > if (numa_node < 0) {
> > - dev_warn(dev, "rejecting DAX region %pR with invalid
> node: %d\n",
> > - res, numa_node);
> > - return -EINVAL;
> > + new_node = memory_add_physaddr_to_nid(kmem_start);
> > + dev_info(dev, "changing nid from %d to %d for DAX
> region %pR\n",
> > + numa_node, new_node, res);
> > + numa_node = new_node;
>
> Now, the warning does not really make sense. We have NUMA_NO_NODE (< 0),
> that is not a change in the nid, but a selection of a nid. Printing
> NUMA_NO_NODE does not make too much sense. I suggest just getting rid of
> new_node and turning the dev_info() into something like
>
> dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n",
> numa_node, res);
>
Okay, I will update it as your sugguestion. Thanks

--
Cheers,
Justin (Jia He)


2020-07-08 02:21:50

by Justin He

[permalink] [raw]
Subject: RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

Hi Michal and David

> -----Original Message-----
> From: Michal Hocko <[email protected]>
> Sent: Tuesday, July 7, 2020 7:55 PM
> To: Justin He <[email protected]>
> Cc: Catalin Marinas <[email protected]>; Will Deacon
> <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> <[email protected]>; Dave Jiang <[email protected]>; Andrew
> Morton <[email protected]>; Mike Rapoport <[email protected]>;
> Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Kaly Xin <[email protected]>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
>
> On Tue 07-07-20 13:59:15, Jia He wrote:
> > This exports memory_add_physaddr_to_nid() for module driver to use.
> >
> > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > NUMA_NO_NID is detected.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > arch/arm64/mm/numa.c | 5 +++--
> > 1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > index aafcee3e3f7e..7eeb31740248 100644
> > --- a/arch/arm64/mm/numa.c
> > +++ b/arch/arm64/mm/numa.c
> > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >
> > /*
> > * We hope that we will be hotplugging memory on nodes we already know
> about,
> > - * such that acpi_get_node() succeeds and we never fall back to this...
> > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> the node
> > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> option.
> > */
> > int memory_add_physaddr_to_nid(u64 addr)
> > {
> > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> addr);
> > return 0;
> > }
> > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>
> Does it make sense to export a noop function? Wouldn't make more sense
> to simply make it static inline somewhere in a header? I haven't checked
> whether there is an easy way to do that sanely bu this just hit my eyes.

Okay, I can make a change in memory_hotplug.h, sth like:
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
struct mhp_params *params);
#endif /* ARCH_HAS_ADD_PAGES */

-#ifdef CONFIG_NUMA
-extern int memory_add_physaddr_to_nid(u64 start);
-#else
+#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
static inline int memory_add_physaddr_to_nid(u64 start)
{
return 0;
}
+#else
+extern int memory_add_physaddr_to_nid(u64 start);
#endif

And then check the memory_add_physaddr_to_nid() helper on all arches,
if it is noop(return 0), I can simply remove it.
if it is not noop, after the helper,
#define memory_add_physaddr_to_nid

What do you think of this proposal?

--
Cheers,
Justin (Jia He)


2020-07-08 04:12:54

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
>
> Hi Michal and David
>
> > -----Original Message-----
> > From: Michal Hocko <[email protected]>
> > Sent: Tuesday, July 7, 2020 7:55 PM
> > To: Justin He <[email protected]>
> > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> > <[email protected]>; Dave Jiang <[email protected]>; Andrew
> > Morton <[email protected]>; Mike Rapoport <[email protected]>;
> > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Kaly Xin <[email protected]>
> > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > as EXPORT_SYMBOL_GPL
> >
> > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > >
> > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > NUMA_NO_NID is detected.
> > >
> > > Suggested-by: David Hildenbrand <[email protected]>
> > > Signed-off-by: Jia He <[email protected]>
> > > ---
> > > arch/arm64/mm/numa.c | 5 +++--
> > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > index aafcee3e3f7e..7eeb31740248 100644
> > > --- a/arch/arm64/mm/numa.c
> > > +++ b/arch/arm64/mm/numa.c
> > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > >
> > > /*
> > > * We hope that we will be hotplugging memory on nodes we already know
> > about,
> > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > the node
> > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > option.
> > > */
> > > int memory_add_physaddr_to_nid(u64 addr)
> > > {
> > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > addr);
> > > return 0;
> > > }
> > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >
> > Does it make sense to export a noop function? Wouldn't make more sense
> > to simply make it static inline somewhere in a header? I haven't checked
> > whether there is an easy way to do that sanely bu this just hit my eyes.
>
> Okay, I can make a change in memory_hotplug.h, sth like:
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> struct mhp_params *params);
> #endif /* ARCH_HAS_ADD_PAGES */
>
> -#ifdef CONFIG_NUMA
> -extern int memory_add_physaddr_to_nid(u64 start);
> -#else
> +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> static inline int memory_add_physaddr_to_nid(u64 start)
> {
> return 0;
> }
> +#else
> +extern int memory_add_physaddr_to_nid(u64 start);
> #endif
>
> And then check the memory_add_physaddr_to_nid() helper on all arches,
> if it is noop(return 0), I can simply remove it.
> if it is not noop, after the helper,
> #define memory_add_physaddr_to_nid
>
> What do you think of this proposal?

Especially for architectures that use memblock info for numa info
(which seems to be everyone except x86) why not implement a generic
memory_add_physaddr_to_nid() that does:

int memory_add_physaddr_to_nid(u64 addr)
{
unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
int nid;

for_each_online_node(nid) {
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
if (pfn >= start_pfn && pfn <= end_pfn)
return nid;
}
return NUMA_NO_NODE;
}

2020-07-08 04:19:22

by Justin He

[permalink] [raw]
Subject: RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

Hi Dan

> -----Original Message-----
> From: Dan Williams <[email protected]>
> Sent: Wednesday, July 8, 2020 11:57 AM
> To: Justin He <[email protected]>
> Cc: Michal Hocko <[email protected]>; David Hildenbrand <[email protected]>;
> Catalin Marinas <[email protected]>; Will Deacon <[email protected]>;
> Vishal Verma <[email protected]>; Dave Jiang <[email protected]>;
> Andrew Morton <[email protected]>; Mike Rapoport
> <[email protected]>; Baoquan He <[email protected]>; Chuhong Yuan
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> Kaly Xin <[email protected]>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
>
> On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
> >
> > Hi Michal and David
> >
> > > -----Original Message-----
> > > From: Michal Hocko <[email protected]>
> > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > To: Justin He <[email protected]>
> > > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > > <[email protected]>; Dan Williams <[email protected]>; Vishal
> Verma
> > > <[email protected]>; Dave Jiang <[email protected]>; Andrew
> > > Morton <[email protected]>; Mike Rapoport <[email protected]>;
> > > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>;
> linux-
> > > [email protected]; [email protected]; linux-
> > > [email protected]; [email protected]; Kaly Xin <[email protected]>
> > > Subject: Re: [PATCH v2 1/3] arm64/numa: export
> memory_add_physaddr_to_nid
> > > as EXPORT_SYMBOL_GPL
> > >
> > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >
> > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in
> case
> > > > NUMA_NO_NID is detected.
> > > >
> > > > Suggested-by: David Hildenbrand <[email protected]>
> > > > Signed-off-by: Jia He <[email protected]>
> > > > ---
> > > > arch/arm64/mm/numa.c | 5 +++--
> > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > --- a/arch/arm64/mm/numa.c
> > > > +++ b/arch/arm64/mm/numa.c
> > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >
> > > > /*
> > > > * We hope that we will be hotplugging memory on nodes we already
> know
> > > about,
> > > > - * such that acpi_get_node() succeeds and we never fall back to
> this...
> > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > the node
> > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
> fallback
> > > option.
> > > > */
> > > > int memory_add_physaddr_to_nid(u64 addr)
> > > > {
> > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > addr);
> > > > return 0;
> > > > }
> > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >
> > > Does it make sense to export a noop function? Wouldn't make more sense
> > > to simply make it static inline somewhere in a header? I haven't
> checked
> > > whether there is an easy way to do that sanely bu this just hit my
> eyes.
> >
> > Okay, I can make a change in memory_hotplug.h, sth like:
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
> unsigned long nr_pages,
> > struct mhp_params *params);
> > #endif /* ARCH_HAS_ADD_PAGES */
> >
> > -#ifdef CONFIG_NUMA
> > -extern int memory_add_physaddr_to_nid(u64 start);
> > -#else
> > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > static inline int memory_add_physaddr_to_nid(u64 start)
> > {
> > return 0;
> > }
> > +#else
> > +extern int memory_add_physaddr_to_nid(u64 start);
> > #endif
> >
> > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > if it is noop(return 0), I can simply remove it.
> > if it is not noop, after the helper,
> > #define memory_add_physaddr_to_nid
> >
> > What do you think of this proposal?
>
> Especially for architectures that use memblock info for numa info
> (which seems to be everyone except x86) why not implement a generic
> memory_add_physaddr_to_nid() that does:
>
> int memory_add_physaddr_to_nid(u64 addr)
> {
> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> int nid;
>
> for_each_online_node(nid) {
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> if (pfn >= start_pfn && pfn <= end_pfn)
> return nid;
> }
> return NUMA_NO_NODE;
> }

Thanks for your suggestion,
Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
phys_to_target_node()?


--
Cheers,
Justin (Jia He)


2020-07-08 04:42:33

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
[..]
> > Especially for architectures that use memblock info for numa info
> > (which seems to be everyone except x86) why not implement a generic
> > memory_add_physaddr_to_nid() that does:
> >
> > int memory_add_physaddr_to_nid(u64 addr)
> > {
> > unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> > int nid;
> >
> > for_each_online_node(nid) {
> > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > if (pfn >= start_pfn && pfn <= end_pfn)
> > return nid;
> > }
> > return NUMA_NO_NODE;
> > }
>
> Thanks for your suggestion,
> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> phys_to_target_node()?

I think it needs to be the reverse. phys_to_target_node() should call
memory_add_physaddr_to_nid() by default, but fall back to searching
reserved memory address ranges in memblock. See phys_to_target_node()
in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
but the principle is the same i.e. that a target node may not be
represented in memblock.memory, but memblock.reserved. I'm working on
a patch to provide a function similar to get_pfn_range_for_nid() that
operates on reserved memory.

2020-07-08 05:39:40

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
> >
> > On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> > > On 07.07.20 14:13, Mike Rapoport wrote:
> > > > On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> > > >> On Tue 07-07-20 13:59:15, Jia He wrote:
> > > >>> This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >>>
> > > >>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > >>> NUMA_NO_NID is detected.
> > > >>>
> > > >>> Suggested-by: David Hildenbrand <[email protected]>
> > > >>> Signed-off-by: Jia He <[email protected]>
> > > >>> ---
> > > >>> arch/arm64/mm/numa.c | 5 +++--
> > > >>> 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >>>
> > > >>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > >>> index aafcee3e3f7e..7eeb31740248 100644
> > > >>> --- a/arch/arm64/mm/numa.c
> > > >>> +++ b/arch/arm64/mm/numa.c
> > > >>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >>>
> > > >>> /*
> > > >>> * We hope that we will be hotplugging memory on nodes we already know about,
> > > >>> - * such that acpi_get_node() succeeds and we never fall back to this...
> > > >>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> > > >>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> > > >>> */
> > > >>> int memory_add_physaddr_to_nid(u64 addr)
> > > >>> {
> > > >>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> > > >>> return 0;
> > > >>> }
> > > >>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > >>
> > > >> Does it make sense to export a noop function? Wouldn't make more sense
> > > >> to simply make it static inline somewhere in a header? I haven't checked
> > > >> whether there is an easy way to do that sanely bu this just hit my eyes.
> > > >
> > > > We'll need to either add a CONFIG_ option or arch specific callback to
> > > > make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> > > > implementations coexist ...
> > >
> > > Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >
> > Then we'll call it a tie - 3:3 ;-)
>
> So I'd be happy to jump on the train of people wanting to export the
> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> but Will did have a plausibly better idea that I have been meaning to
> circle back to:
>
> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>
> ...i.e. iterate over node data to do the lookup. This would seem to
> work generically for multiple archs unless I am missing something?

I think it would work on arm64, power and, most propbably on s390
(David?), but not on x86. x86 does not have reserved memory in pgdat,
it's never memblock_add()'ed (see e820__memblock_setup()).

I've suggested to add E820_*_RESERVED to memblock.memory a while ago
[1], but apparently there are systems that cannot tolerate OS mappings
of the BIOS reserved areas.

[1] https://lore.kernel.org/lkml/[email protected]/

--
Sincerely yours,
Mike.

2020-07-08 05:40:17

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
> >
> > Hi Michal and David
> >
> > > -----Original Message-----
> > > From: Michal Hocko <[email protected]>
> > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > To: Justin He <[email protected]>
> > > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > > <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> > > <[email protected]>; Dave Jiang <[email protected]>; Andrew
> > > Morton <[email protected]>; Mike Rapoport <[email protected]>;
> > > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>; linux-
> > > [email protected]; [email protected]; linux-
> > > [email protected]; [email protected]; Kaly Xin <[email protected]>
> > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > as EXPORT_SYMBOL_GPL
> > >
> > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > >
> > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > NUMA_NO_NID is detected.
> > > >
> > > > Suggested-by: David Hildenbrand <[email protected]>
> > > > Signed-off-by: Jia He <[email protected]>
> > > > ---
> > > > arch/arm64/mm/numa.c | 5 +++--
> > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > --- a/arch/arm64/mm/numa.c
> > > > +++ b/arch/arm64/mm/numa.c
> > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > >
> > > > /*
> > > > * We hope that we will be hotplugging memory on nodes we already know
> > > about,
> > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > the node
> > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > option.
> > > > */
> > > > int memory_add_physaddr_to_nid(u64 addr)
> > > > {
> > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > addr);
> > > > return 0;
> > > > }
> > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >
> > > Does it make sense to export a noop function? Wouldn't make more sense
> > > to simply make it static inline somewhere in a header? I haven't checked
> > > whether there is an easy way to do that sanely bu this just hit my eyes.
> >
> > Okay, I can make a change in memory_hotplug.h, sth like:
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> > struct mhp_params *params);
> > #endif /* ARCH_HAS_ADD_PAGES */
> >
> > -#ifdef CONFIG_NUMA
> > -extern int memory_add_physaddr_to_nid(u64 start);
> > -#else
> > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > static inline int memory_add_physaddr_to_nid(u64 start)
> > {
> > return 0;
> > }
> > +#else
> > +extern int memory_add_physaddr_to_nid(u64 start);
> > #endif
> >
> > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > if it is noop(return 0), I can simply remove it.
> > if it is not noop, after the helper,
> > #define memory_add_physaddr_to_nid
> >
> > What do you think of this proposal?
>
> Especially for architectures that use memblock info for numa info
> (which seems to be everyone except x86) why not implement a generic
> memory_add_physaddr_to_nid() that does:

That would be only arm64.

> int memory_add_physaddr_to_nid(u64 addr)
> {
> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> int nid;
>
> for_each_online_node(nid) {
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> if (pfn >= start_pfn && pfn <= end_pfn)
> return nid;
> }
> return NUMA_NO_NODE;
> }

--
Sincerely yours,
Mike.

2020-07-08 05:52:30

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <[email protected]> wrote:
>
> On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
> > >
> > > Hi Michal and David
> > >
> > > > -----Original Message-----
> > > > From: Michal Hocko <[email protected]>
> > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > To: Justin He <[email protected]>
> > > > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > > > <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> > > > <[email protected]>; Dave Jiang <[email protected]>; Andrew
> > > > Morton <[email protected]>; Mike Rapoport <[email protected]>;
> > > > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>; linux-
> > > > [email protected]; [email protected]; linux-
> > > > [email protected]; [email protected]; Kaly Xin <[email protected]>
> > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > > as EXPORT_SYMBOL_GPL
> > > >
> > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > > >
> > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > > NUMA_NO_NID is detected.
> > > > >
> > > > > Suggested-by: David Hildenbrand <[email protected]>
> > > > > Signed-off-by: Jia He <[email protected]>
> > > > > ---
> > > > > arch/arm64/mm/numa.c | 5 +++--
> > > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > --- a/arch/arm64/mm/numa.c
> > > > > +++ b/arch/arm64/mm/numa.c
> > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > >
> > > > > /*
> > > > > * We hope that we will be hotplugging memory on nodes we already know
> > > > about,
> > > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > > the node
> > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > > option.
> > > > > */
> > > > > int memory_add_physaddr_to_nid(u64 addr)
> > > > > {
> > > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > > addr);
> > > > > return 0;
> > > > > }
> > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > >
> > > > Does it make sense to export a noop function? Wouldn't make more sense
> > > > to simply make it static inline somewhere in a header? I haven't checked
> > > > whether there is an easy way to do that sanely bu this just hit my eyes.
> > >
> > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > --- a/include/linux/memory_hotplug.h
> > > +++ b/include/linux/memory_hotplug.h
> > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> > > struct mhp_params *params);
> > > #endif /* ARCH_HAS_ADD_PAGES */
> > >
> > > -#ifdef CONFIG_NUMA
> > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > -#else
> > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > > static inline int memory_add_physaddr_to_nid(u64 start)
> > > {
> > > return 0;
> > > }
> > > +#else
> > > +extern int memory_add_physaddr_to_nid(u64 start);
> > > #endif
> > >
> > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > if it is noop(return 0), I can simply remove it.
> > > if it is not noop, after the helper,
> > > #define memory_add_physaddr_to_nid
> > >
> > > What do you think of this proposal?
> >
> > Especially for architectures that use memblock info for numa info
> > (which seems to be everyone except x86) why not implement a generic
> > memory_add_physaddr_to_nid() that does:
>
> That would be only arm64.
>

Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
could solve my numa api woes. At least for x86 the problem is already
solved with reserved numa_meminfo, but now I'm trying to write generic
drivers that use those apis and finding these gaps on other archs.

2020-07-08 06:29:28

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 10:48:19PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > > On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
> > > >
> > > > Hi Michal and David
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko <[email protected]>
> > > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > > To: Justin He <[email protected]>
> > > > > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > > > > <[email protected]>; Dan Williams <[email protected]>; Vishal Verma
> > > > > <[email protected]>; Dave Jiang <[email protected]>; Andrew
> > > > > Morton <[email protected]>; Mike Rapoport <[email protected]>;
> > > > > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>; linux-
> > > > > [email protected]; [email protected]; linux-
> > > > > [email protected]; [email protected]; Kaly Xin <[email protected]>
> > > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> > > > > as EXPORT_SYMBOL_GPL
> > > > >
> > > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > > This exports memory_add_physaddr_to_nid() for module driver to use.
> > > > > >
> > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> > > > > > NUMA_NO_NID is detected.
> > > > > >
> > > > > > Suggested-by: David Hildenbrand <[email protected]>
> > > > > > Signed-off-by: Jia He <[email protected]>
> > > > > > ---
> > > > > > arch/arm64/mm/numa.c | 5 +++--
> > > > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > > --- a/arch/arm64/mm/numa.c
> > > > > > +++ b/arch/arm64/mm/numa.c
> > > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > > >
> > > > > > /*
> > > > > > * We hope that we will be hotplugging memory on nodes we already know
> > > > > about,
> > > > > > - * such that acpi_get_node() succeeds and we never fall back to this...
> > > > > > + * such that acpi_get_node() succeeds. But when SRAT is not present,
> > > > > the node
> > > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback
> > > > > option.
> > > > > > */
> > > > > > int memory_add_physaddr_to_nid(u64 addr)
> > > > > > {
> > > > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n",
> > > > > addr);
> > > > > > return 0;
> > > > > > }
> > > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > > >
> > > > > Does it make sense to export a noop function? Wouldn't make more sense
> > > > > to simply make it static inline somewhere in a header? I haven't checked
> > > > > whether there is an easy way to do that sanely bu this just hit my eyes.
> > > >
> > > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> > > > struct mhp_params *params);
> > > > #endif /* ARCH_HAS_ADD_PAGES */
> > > >
> > > > -#ifdef CONFIG_NUMA
> > > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > > -#else
> > > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > > > static inline int memory_add_physaddr_to_nid(u64 start)
> > > > {
> > > > return 0;
> > > > }
> > > > +#else
> > > > +extern int memory_add_physaddr_to_nid(u64 start);
> > > > #endif
> > > >
> > > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > > if it is noop(return 0), I can simply remove it.
> > > > if it is not noop, after the helper,
> > > > #define memory_add_physaddr_to_nid
> > > >
> > > > What do you think of this proposal?
> > >
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> >
> > That would be only arm64.
> >
>
> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> could solve my numa api woes. At least for x86 the problem is already
> solved with reserved numa_meminfo, but now I'm trying to write generic
> drivers that use those apis and finding these gaps on other archs.

I'm not sure if x86's numa_meminfo is a part of the solution or a part
of the problem ;-)
Anyway, this all indeed messy and there is a lot to improve there.

--
Sincerely yours,
Mike.

2020-07-08 06:31:20

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
> [..]
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> > >
> > > int memory_add_physaddr_to_nid(u64 addr)
> > > {
> > > unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> > > int nid;
> > >
> > > for_each_online_node(nid) {
> > > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> > > if (pfn >= start_pfn && pfn <= end_pfn)
> > > return nid;
> > > }
> > > return NUMA_NO_NODE;
> > > }
> >
> > Thanks for your suggestion,
> > Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> > phys_to_target_node()?
>
> I think it needs to be the reverse. phys_to_target_node() should call
> memory_add_physaddr_to_nid() by default, but fall back to searching
> reserved memory address ranges in memblock. See phys_to_target_node()
> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> but the principle is the same i.e. that a target node may not be
> represented in memblock.memory, but memblock.reserved. I'm working on
> a patch to provide a function similar to get_pfn_range_for_nid() that
> operates on reserved memory.

Do we really need yet another memblock iterator?
I think only x86 has memory that is not in memblock.memory but only in
memblock.reserved.

--
Sincerely yours,
Mike.

2020-07-08 06:45:19

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 11:20 PM Mike Rapoport <[email protected]> wrote:
[..]
> > Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> > could solve my numa api woes. At least for x86 the problem is already
> > solved with reserved numa_meminfo, but now I'm trying to write generic
> > drivers that use those apis and finding these gaps on other archs.
>
> I'm not sure if x86's numa_meminfo is a part of the solution or a part
> of the problem ;-)

More the latter, but hopefully it can remain an exception and not the rule.

2020-07-08 06:55:44

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 11:22 PM Mike Rapoport <[email protected]> wrote:
[..]
> > > Thanks for your suggestion,
> > > Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> > > phys_to_target_node()?
> >
> > I think it needs to be the reverse. phys_to_target_node() should call
> > memory_add_physaddr_to_nid() by default, but fall back to searching
> > reserved memory address ranges in memblock. See phys_to_target_node()
> > in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> > but the principle is the same i.e. that a target node may not be
> > represented in memblock.memory, but memblock.reserved. I'm working on
> > a patch to provide a function similar to get_pfn_range_for_nid() that
> > operates on reserved memory.
>
> Do we really need yet another memblock iterator?
> I think only x86 has memory that is not in memblock.memory but only in
> memblock.reserved.

Well, that's what led me here. EFI has introduced a memory attribute
called "EFI Special Purpose Memory". I mapped it to a new Linux
concept called Soft Reserved memory (commit b617c5266eed "efi: Common
enable/disable infrastructure for EFI soft reservation"). The driver I
want to claim that memory, device-dax, wants to be able to look up
numa information for an address range that is marked reserved in
memblock. The device-dax facility has the ability to either let
userspace map a device, or assign the memory backing that device to
the page allocator. In both scenarios the driver needs numa info to
either populate the 'numa_node' property of the device in sysfs, or to
pass an node-id to add_memory_resource() when it is hot-plugged.

I was thwarted by the lack of phys_to_target_node() on arm64, and
rather than add another stub like memory_add_physaddr_to_nid() I
wanted to see if it could be solved properly / generically with
memblock data.

2020-07-08 06:57:27

by Justin He

[permalink] [raw]
Subject: RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

Hi Dan

> -----Original Message-----
> From: Dan Williams <[email protected]>
> Sent: Wednesday, July 8, 2020 1:48 PM
> To: Mike Rapoport <[email protected]>
> Cc: Justin He <[email protected]>; Michal Hocko <[email protected]>; David
> Hildenbrand <[email protected]>; Catalin Marinas <[email protected]>;
> Will Deacon <[email protected]>; Vishal Verma <[email protected]>;
> Dave Jiang <[email protected]>; Andrew Morton <akpm@linux-
> foundation.org>; Baoquan He <[email protected]>; Chuhong Yuan
> <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> Kaly Xin <[email protected]>
> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
> as EXPORT_SYMBOL_GPL
>
> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <[email protected]> wrote:
> >
> > On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
> > > On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
> > > >
> > > > Hi Michal and David
> > > >
> > > > > -----Original Message-----
> > > > > From: Michal Hocko <[email protected]>
> > > > > Sent: Tuesday, July 7, 2020 7:55 PM
> > > > > To: Justin He <[email protected]>
> > > > > Cc: Catalin Marinas <[email protected]>; Will Deacon
> > > > > <[email protected]>; Dan Williams <[email protected]>; Vishal
> Verma
> > > > > <[email protected]>; Dave Jiang <[email protected]>;
> Andrew
> > > > > Morton <[email protected]>; Mike Rapoport
> <[email protected]>;
> > > > > Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>;
> linux-
> > > > > [email protected]; [email protected];
> linux-
> > > > > [email protected]; [email protected]; Kaly Xin
> <[email protected]>
> > > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export
> memory_add_physaddr_to_nid
> > > > > as EXPORT_SYMBOL_GPL
> > > > >
> > > > > On Tue 07-07-20 13:59:15, Jia He wrote:
> > > > > > This exports memory_add_physaddr_to_nid() for module driver to
> use.
> > > > > >
> > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid
> in case
> > > > > > NUMA_NO_NID is detected.
> > > > > >
> > > > > > Suggested-by: David Hildenbrand <[email protected]>
> > > > > > Signed-off-by: Jia He <[email protected]>
> > > > > > ---
> > > > > > arch/arm64/mm/numa.c | 5 +++--
> > > > > > 1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> > > > > > index aafcee3e3f7e..7eeb31740248 100644
> > > > > > --- a/arch/arm64/mm/numa.c
> > > > > > +++ b/arch/arm64/mm/numa.c
> > > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> > > > > >
> > > > > > /*
> > > > > > * We hope that we will be hotplugging memory on nodes we
> already know
> > > > > about,
> > > > > > - * such that acpi_get_node() succeeds and we never fall back to
> this...
> > > > > > + * such that acpi_get_node() succeeds. But when SRAT is not
> present,
> > > > > the node
> > > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
> fallback
> > > > > option.
> > > > > > */
> > > > > > int memory_add_physaddr_to_nid(u64 addr)
> > > > > > {
> > > > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node
> 0\n",
> > > > > addr);
> > > > > > return 0;
> > > > > > }
> > > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > > > >
> > > > > Does it make sense to export a noop function? Wouldn't make more
> sense
> > > > > to simply make it static inline somewhere in a header? I haven't
> checked
> > > > > whether there is an easy way to do that sanely bu this just hit my
> eyes.
> > > >
> > > > Okay, I can make a change in memory_hotplug.h, sth like:
> > > > --- a/include/linux/memory_hotplug.h
> > > > +++ b/include/linux/memory_hotplug.h
> > > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
> unsigned long nr_pages,
> > > > struct mhp_params *params);
> > > > #endif /* ARCH_HAS_ADD_PAGES */
> > > >
> > > > -#ifdef CONFIG_NUMA
> > > > -extern int memory_add_physaddr_to_nid(u64 start);
> > > > -#else
> > > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
> > > > static inline int memory_add_physaddr_to_nid(u64 start)
> > > > {
> > > > return 0;
> > > > }
> > > > +#else
> > > > +extern int memory_add_physaddr_to_nid(u64 start);
> > > > #endif
> > > >
> > > > And then check the memory_add_physaddr_to_nid() helper on all arches,
> > > > if it is noop(return 0), I can simply remove it.
> > > > if it is not noop, after the helper,
> > > > #define memory_add_physaddr_to_nid
> > > >
> > > > What do you think of this proposal?
> > >
> > > Especially for architectures that use memblock info for numa info
> > > (which seems to be everyone except x86) why not implement a generic
> > > memory_add_physaddr_to_nid() that does:
> >
> > That would be only arm64.
> >
>
> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
> could solve my numa api woes. At least for x86 the problem is already
> solved with reserved numa_meminfo, but now I'm trying to write generic
> drivers that use those apis and finding these gaps on other archs.

Even on arm64, there is a dependency issue in dax_pmem kmem case.
If dax pmem uses memory_add_physaddr_to_nid() to decide which node that
memblock should add into, get_pfn_range_for_nid() might not have
the correct memblock info at that time. That is, get_pfn_range_for_nid()
can't get the correct memblock info before add_memory()

So IMO, memory_add_physaddr_to_nid() still have to implement as noop on
arm64 (return 0) together with sh,s390x? Powerpc, x86,ia64 can use their
own implementation. And phys_to_target_node() can use your suggested(
for_each_online_node() ...)

What do you think of it? Thanks

--
Cheers,
Justin (Jia He)


2020-07-08 07:00:15

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 08:22, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
>> [..]
>>>> Especially for architectures that use memblock info for numa info
>>>> (which seems to be everyone except x86) why not implement a generic
>>>> memory_add_physaddr_to_nid() that does:
>>>>
>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>> {
>>>> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>>>> int nid;
>>>>
>>>> for_each_online_node(nid) {
>>>> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>>>> if (pfn >= start_pfn && pfn <= end_pfn)
>>>> return nid;
>>>> }
>>>> return NUMA_NO_NODE;
>>>> }
>>>
>>> Thanks for your suggestion,
>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
>>> phys_to_target_node()?
>>
>> I think it needs to be the reverse. phys_to_target_node() should call
>> memory_add_physaddr_to_nid() by default, but fall back to searching
>> reserved memory address ranges in memblock. See phys_to_target_node()
>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
>> but the principle is the same i.e. that a target node may not be
>> represented in memblock.memory, but memblock.reserved. I'm working on
>> a patch to provide a function similar to get_pfn_range_for_nid() that
>> operates on reserved memory.
>
> Do we really need yet another memblock iterator?
> I think only x86 has memory that is not in memblock.memory but only in
> memblock.reserved.

Reading about abusing the memblock allcoator once again in memory
hotplug paths makes me shiver.

--
Thanks,

David / dhildenb

2020-07-08 07:04:35

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 08:56, Justin He wrote:
> Hi Dan
>
>> -----Original Message-----
>> From: Dan Williams <[email protected]>
>> Sent: Wednesday, July 8, 2020 1:48 PM
>> To: Mike Rapoport <[email protected]>
>> Cc: Justin He <[email protected]>; Michal Hocko <[email protected]>; David
>> Hildenbrand <[email protected]>; Catalin Marinas <[email protected]>;
>> Will Deacon <[email protected]>; Vishal Verma <[email protected]>;
>> Dave Jiang <[email protected]>; Andrew Morton <akpm@linux-
>> foundation.org>; Baoquan He <[email protected]>; Chuhong Yuan
>> <[email protected]>; [email protected]; linux-
>> [email protected]; [email protected]; [email protected];
>> Kaly Xin <[email protected]>
>> Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid
>> as EXPORT_SYMBOL_GPL
>>
>> On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport <[email protected]> wrote:
>>>
>>> On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 7:20 PM Justin He <[email protected]> wrote:
>>>>>
>>>>> Hi Michal and David
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Michal Hocko <[email protected]>
>>>>>> Sent: Tuesday, July 7, 2020 7:55 PM
>>>>>> To: Justin He <[email protected]>
>>>>>> Cc: Catalin Marinas <[email protected]>; Will Deacon
>>>>>> <[email protected]>; Dan Williams <[email protected]>; Vishal
>> Verma
>>>>>> <[email protected]>; Dave Jiang <[email protected]>;
>> Andrew
>>>>>> Morton <[email protected]>; Mike Rapoport
>> <[email protected]>;
>>>>>> Baoquan He <[email protected]>; Chuhong Yuan <[email protected]>;
>> linux-
>>>>>> [email protected]; [email protected];
>> linux-
>>>>>> [email protected]; [email protected]; Kaly Xin
>> <[email protected]>
>>>>>> Subject: Re: [PATCH v2 1/3] arm64/numa: export
>> memory_add_physaddr_to_nid
>>>>>> as EXPORT_SYMBOL_GPL
>>>>>>
>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to
>> use.
>>>>>>>
>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid
>> in case
>>>>>>> NUMA_NO_NID is detected.
>>>>>>>
>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>> ---
>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>
>>>>>>> /*
>>>>>>> * We hope that we will be hotplugging memory on nodes we
>> already know
>>>>>> about,
>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to
>> this...
>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not
>> present,
>>>>>> the node
>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a
>> fallback
>>>>>> option.
>>>>>>> */
>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>> {
>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node
>> 0\n",
>>>>>> addr);
>>>>>>> return 0;
>>>>>>> }
>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>
>>>>>> Does it make sense to export a noop function? Wouldn't make more
>> sense
>>>>>> to simply make it static inline somewhere in a header? I haven't
>> checked
>>>>>> whether there is an easy way to do that sanely bu this just hit my
>> eyes.
>>>>>
>>>>> Okay, I can make a change in memory_hotplug.h, sth like:
>>>>> --- a/include/linux/memory_hotplug.h
>>>>> +++ b/include/linux/memory_hotplug.h
>>>>> @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn,
>> unsigned long nr_pages,
>>>>> struct mhp_params *params);
>>>>> #endif /* ARCH_HAS_ADD_PAGES */
>>>>>
>>>>> -#ifdef CONFIG_NUMA
>>>>> -extern int memory_add_physaddr_to_nid(u64 start);
>>>>> -#else
>>>>> +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid)
>>>>> static inline int memory_add_physaddr_to_nid(u64 start)
>>>>> {
>>>>> return 0;
>>>>> }
>>>>> +#else
>>>>> +extern int memory_add_physaddr_to_nid(u64 start);
>>>>> #endif
>>>>>
>>>>> And then check the memory_add_physaddr_to_nid() helper on all arches,
>>>>> if it is noop(return 0), I can simply remove it.
>>>>> if it is not noop, after the helper,
>>>>> #define memory_add_physaddr_to_nid
>>>>>
>>>>> What do you think of this proposal?
>>>>
>>>> Especially for architectures that use memblock info for numa info
>>>> (which seems to be everyone except x86) why not implement a generic
>>>> memory_add_physaddr_to_nid() that does:
>>>
>>> That would be only arm64.
>>>
>>
>> Darn, I saw ARCH_KEEP_MEMBLOCK and had delusions of grandeur that it
>> could solve my numa api woes. At least for x86 the problem is already
>> solved with reserved numa_meminfo, but now I'm trying to write generic
>> drivers that use those apis and finding these gaps on other archs.
>
> Even on arm64, there is a dependency issue in dax_pmem kmem case.
> If dax pmem uses memory_add_physaddr_to_nid() to decide which node that
> memblock should add into, get_pfn_range_for_nid() might not have
> the correct memblock info at that time. That is, get_pfn_range_for_nid()
> can't get the correct memblock info before add_memory()
>
> So IMO, memory_add_physaddr_to_nid() still have to implement as noop on
> arm64 (return 0) together with sh,s390x? Powerpc, x86,ia64 can use their
> own implementation. And phys_to_target_node() can use your suggested(
> for_each_online_node() ...)
>
> What do you think of it? Thanks

You are trying to fix the "we only have one dummy node" AFAIU, so what
you propose here is certainly good enough for now.

--
Thanks,

David / dhildenb

2020-07-08 07:07:35

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <[email protected]> wrote:
>
> On 08.07.20 08:22, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
> >> [..]
> >>>> Especially for architectures that use memblock info for numa info
> >>>> (which seems to be everyone except x86) why not implement a generic
> >>>> memory_add_physaddr_to_nid() that does:
> >>>>
> >>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>> {
> >>>> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> >>>> int nid;
> >>>>
> >>>> for_each_online_node(nid) {
> >>>> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >>>> if (pfn >= start_pfn && pfn <= end_pfn)
> >>>> return nid;
> >>>> }
> >>>> return NUMA_NO_NODE;
> >>>> }
> >>>
> >>> Thanks for your suggestion,
> >>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> >>> phys_to_target_node()?
> >>
> >> I think it needs to be the reverse. phys_to_target_node() should call
> >> memory_add_physaddr_to_nid() by default, but fall back to searching
> >> reserved memory address ranges in memblock. See phys_to_target_node()
> >> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> >> but the principle is the same i.e. that a target node may not be
> >> represented in memblock.memory, but memblock.reserved. I'm working on
> >> a patch to provide a function similar to get_pfn_range_for_nid() that
> >> operates on reserved memory.
> >
> > Do we really need yet another memblock iterator?
> > I think only x86 has memory that is not in memblock.memory but only in
> > memblock.reserved.
>
> Reading about abusing the memblock allcoator once again in memory
> hotplug paths makes me shiver.

Technical reasoning please?

arm64 numa information is established from memblock data. It seems
counterproductive to ignore that fact if we're already touching
memory_add_physaddr_to_nid() and have a use case for a driver to call
it.

2020-07-08 07:18:58

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 09:04, Dan Williams wrote:
> On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 08.07.20 08:22, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
>>>> [..]
>>>>>> Especially for architectures that use memblock info for numa info
>>>>>> (which seems to be everyone except x86) why not implement a generic
>>>>>> memory_add_physaddr_to_nid() that does:
>>>>>>
>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>> {
>>>>>> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
>>>>>> int nid;
>>>>>>
>>>>>> for_each_online_node(nid) {
>>>>>> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
>>>>>> if (pfn >= start_pfn && pfn <= end_pfn)
>>>>>> return nid;
>>>>>> }
>>>>>> return NUMA_NO_NODE;
>>>>>> }
>>>>>
>>>>> Thanks for your suggestion,
>>>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
>>>>> phys_to_target_node()?
>>>>
>>>> I think it needs to be the reverse. phys_to_target_node() should call
>>>> memory_add_physaddr_to_nid() by default, but fall back to searching
>>>> reserved memory address ranges in memblock. See phys_to_target_node()
>>>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
>>>> but the principle is the same i.e. that a target node may not be
>>>> represented in memblock.memory, but memblock.reserved. I'm working on
>>>> a patch to provide a function similar to get_pfn_range_for_nid() that
>>>> operates on reserved memory.
>>>
>>> Do we really need yet another memblock iterator?
>>> I think only x86 has memory that is not in memblock.memory but only in
>>> memblock.reserved.
>>
>> Reading about abusing the memblock allcoator once again in memory
>> hotplug paths makes me shiver.
>
> Technical reasoning please?

ARCH_KEEP_MEMBLOCK is (AFAIK) only a hack for arm64 to implement
pfn_valid(), because they zap out individual pages corresponding to
memory holes of full sections.

I am not a friend of adding more post-init code to rely on memblock
data. It just makes it harder to eventually get rid of ARCH_KEEP_MEMBLOCK.

>
> arm64 numa information is established from memblock data. It seems
> counterproductive to ignore that fact if we're already touching
> memory_add_physaddr_to_nid() and have a use case for a driver to call
> it.

... and we are trying to handle the "only a single dummy node" case
(patch #2), or what am I missing? What is there to optimize currently?

--
Thanks,

David / dhildenb

2020-07-08 07:24:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 07:27, Mike Rapoport wrote:
> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
>>>
>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>
>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>> NUMA_NO_NID is detected.
>>>>>>>
>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>> ---
>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>
>>>>>>> /*
>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>> */
>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>> {
>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>> return 0;
>>>>>>> }
>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>
>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>
>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>> implementations coexist ...
>>>>
>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>
>>> Then we'll call it a tie - 3:3 ;-)
>>
>> So I'd be happy to jump on the train of people wanting to export the
>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>> but Will did have a plausibly better idea that I have been meaning to
>> circle back to:
>>
>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>
>> ...i.e. iterate over node data to do the lookup. This would seem to
>> work generically for multiple archs unless I am missing something?

IIRC, only memory assigned to/onlined to a ZONE is represented in the
pgdat node span. E.g., not offline memory blocks.

Esp., when hotplugging + onlining consecutive memory, there won't really
be any intersections in most cases if I am not wrong. It would not be
"intersection" but rather "closest fit".

With overlapping nodes it's even more unclear. Which one to pick?

>
> I think it would work on arm64, power and, most propbably on s390

With only a single dummy node I guess it should work (searching when
there is only a single node does not make too much sense).

> (David?), but not on x86. x86 does not have reserved memory in pgdat,
> it's never memblock_add()'ed (see e820__memblock_setup()).

Can you enlighten me why that is relevant for the memory hotplug path?
(or is it just a general comment to make the function as accurate as
possible for all addresses?)

--
Thanks,

David / dhildenb

2020-07-08 07:50:46

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 09:21:25AM +0200, David Hildenbrand wrote:
> On 08.07.20 07:27, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
> >>>
> >>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> >>>> On 07.07.20 14:13, Mike Rapoport wrote:
> >>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>
> >>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>> NUMA_NO_NID is detected.
> >>>>>>>
> >>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>> Signed-off-by: Jia He <[email protected]>
> >>>>>>> ---
> >>>>>>> arch/arm64/mm/numa.c | 5 +++--
> >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>
> >>>>>>> /*
> >>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>> */
> >>>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>> {
> >>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>> return 0;
> >>>>>>> }
> >>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>
> >>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> >>>>>
> >>>>> We'll need to either add a CONFIG_ option or arch specific callback to
> >>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> >>>>> implementations coexist ...
> >>>>
> >>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >>>
> >>> Then we'll call it a tie - 3:3 ;-)
> >>
> >> So I'd be happy to jump on the train of people wanting to export the
> >> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> >> but Will did have a plausibly better idea that I have been meaning to
> >> circle back to:
> >>
> >> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
> >>
> >> ...i.e. iterate over node data to do the lookup. This would seem to
> >> work generically for multiple archs unless I am missing something?
>
> IIRC, only memory assigned to/onlined to a ZONE is represented in the
> pgdat node span. E.g., not offline memory blocks.
>
> Esp., when hotplugging + onlining consecutive memory, there won't really
> be any intersections in most cases if I am not wrong. It would not be
> "intersection" but rather "closest fit".
>
> With overlapping nodes it's even more unclear. Which one to pick?
>
> >
> > I think it would work on arm64, power and, most propbably on s390
>
> With only a single dummy node I guess it should work (searching when
> there is only a single node does not make too much sense).
>
> > (David?), but not on x86. x86 does not have reserved memory in pgdat,
> > it's never memblock_add()'ed (see e820__memblock_setup()).
>
> Can you enlighten me why that is relevant for the memory hotplug path?
> (or is it just a general comment to make the function as accurate as
> possible for all addresses?)

phys_to_target_node() on x86 falls back to numa_reserved_meminfo which
holds memory that is never listed in a node.

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-08 07:51:29

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 09:38, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 09:21:25AM +0200, David Hildenbrand wrote:
>> On 08.07.20 07:27, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
>>>>>
>>>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>
>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>
>>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>>>> ---
>>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>> */
>>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>> {
>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>
>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>>>
>>>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>>>> implementations coexist ...
>>>>>>
>>>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>>>
>>>>> Then we'll call it a tie - 3:3 ;-)
>>>>
>>>> So I'd be happy to jump on the train of people wanting to export the
>>>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>>>> but Will did have a plausibly better idea that I have been meaning to
>>>> circle back to:
>>>>
>>>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>>>
>>>> ...i.e. iterate over node data to do the lookup. This would seem to
>>>> work generically for multiple archs unless I am missing something?
>>
>> IIRC, only memory assigned to/onlined to a ZONE is represented in the
>> pgdat node span. E.g., not offline memory blocks.
>>
>> Esp., when hotplugging + onlining consecutive memory, there won't really
>> be any intersections in most cases if I am not wrong. It would not be
>> "intersection" but rather "closest fit".
>>
>> With overlapping nodes it's even more unclear. Which one to pick?
>>
>>>
>>> I think it would work on arm64, power and, most propbably on s390
>>
>> With only a single dummy node I guess it should work (searching when
>> there is only a single node does not make too much sense).
>>
>>> (David?), but not on x86. x86 does not have reserved memory in pgdat,
>>> it's never memblock_add()'ed (see e820__memblock_setup()).
>>
>> Can you enlighten me why that is relevant for the memory hotplug path?
>> (or is it just a general comment to make the function as accurate as
>> possible for all addresses?)
>
> phys_to_target_node() on x86 falls back to numa_reserved_meminfo which
> holds memory that is never listed in a node.
>

Ah, I see - thanks.


--
Thanks,

David / dhildenb

2020-07-08 07:55:26

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 09:16:01AM +0200, David Hildenbrand wrote:
> On 08.07.20 09:04, Dan Williams wrote:
> > On Tue, Jul 7, 2020 at 11:59 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 08.07.20 08:22, Mike Rapoport wrote:
> >>> On Tue, Jul 07, 2020 at 09:27:43PM -0700, Dan Williams wrote:
> >>>> On Tue, Jul 7, 2020 at 9:08 PM Justin He <[email protected]> wrote:
> >>>> [..]
> >>>>>> Especially for architectures that use memblock info for numa info
> >>>>>> (which seems to be everyone except x86) why not implement a generic
> >>>>>> memory_add_physaddr_to_nid() that does:
> >>>>>>
> >>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>> {
> >>>>>> unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr);
> >>>>>> int nid;
> >>>>>>
> >>>>>> for_each_online_node(nid) {
> >>>>>> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> >>>>>> if (pfn >= start_pfn && pfn <= end_pfn)
> >>>>>> return nid;
> >>>>>> }
> >>>>>> return NUMA_NO_NODE;
> >>>>>> }
> >>>>>
> >>>>> Thanks for your suggestion,
> >>>>> Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke
> >>>>> phys_to_target_node()?
> >>>>
> >>>> I think it needs to be the reverse. phys_to_target_node() should call
> >>>> memory_add_physaddr_to_nid() by default, but fall back to searching
> >>>> reserved memory address ranges in memblock. See phys_to_target_node()
> >>>> in arch/x86/mm/numa.c. That one uses numa_meminfo instead of memblock,
> >>>> but the principle is the same i.e. that a target node may not be
> >>>> represented in memblock.memory, but memblock.reserved. I'm working on
> >>>> a patch to provide a function similar to get_pfn_range_for_nid() that
> >>>> operates on reserved memory.
> >>>
> >>> Do we really need yet another memblock iterator?
> >>> I think only x86 has memory that is not in memblock.memory but only in
> >>> memblock.reserved.
> >>
> >> Reading about abusing the memblock allcoator once again in memory
> >> hotplug paths makes me shiver.
> >
> > Technical reasoning please?
>
> ARCH_KEEP_MEMBLOCK is (AFAIK) only a hack for arm64 to implement
> pfn_valid(), because they zap out individual pages corresponding to
> memory holes of full sections.
>
> I am not a friend of adding more post-init code to rely on memblock
> data. It just makes it harder to eventually get rid of ARCH_KEEP_MEMBLOCK.

The most heavy user of memblock in post-init code is powerpc. It won't
be easy to get rid of it there.

> > arm64 numa information is established from memblock data. It seems
> > counterproductive to ignore that fact if we're already touching
> > memory_add_physaddr_to_nid() and have a use case for a driver to call
> > it.
>
> ... and we are trying to handle the "only a single dummy node" case
> (patch #2), or what am I missing? What is there to optimize currently?
>
> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-08 08:01:14

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
>
> On 08.07.20 07:27, Mike Rapoport wrote:
> > On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
> >> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
> >>>
> >>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
> >>>> On 07.07.20 14:13, Mike Rapoport wrote:
> >>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
> >>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>
> >>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>> NUMA_NO_NID is detected.
> >>>>>>>
> >>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>> Signed-off-by: Jia He <[email protected]>
> >>>>>>> ---
> >>>>>>> arch/arm64/mm/numa.c | 5 +++--
> >>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>
> >>>>>>> /*
> >>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>> */
> >>>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>> {
> >>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>> return 0;
> >>>>>>> }
> >>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>
> >>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> >>>>>
> >>>>> We'll need to either add a CONFIG_ option or arch specific callback to
> >>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
> >>>>> implementations coexist ...
> >>>>
> >>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
> >>>
> >>> Then we'll call it a tie - 3:3 ;-)
> >>
> >> So I'd be happy to jump on the train of people wanting to export the
> >> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
> >> but Will did have a plausibly better idea that I have been meaning to
> >> circle back to:
> >>
> >> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
> >>
> >> ...i.e. iterate over node data to do the lookup. This would seem to
> >> work generically for multiple archs unless I am missing something?
>
> IIRC, only memory assigned to/onlined to a ZONE is represented in the
> pgdat node span. E.g., not offline memory blocks.

So this dovetails somewhat with Will's idea. What if we populated
node_data for "offline" ranges? I started there, but then saw
ARCH_KEEP_MEMBLOCK and thought it would be safer to just teach
phys_to_target_node() to use that rather than update other code paths
to expect node_data might not always reflect online data.

> Esp., when hotplugging + onlining consecutive memory, there won't really
> be any intersections in most cases if I am not wrong. It would not be
> "intersection" but rather "closest fit".
>
> With overlapping nodes it's even more unclear. Which one to pick?

In the overlap case you get what you get. Some signal is better than
the noise of a dummy function. The consequences of picking the wrong
node might be that the kernel can't properly associate a memory range
to its performance data tables in firmware, but then again firmware
messed up with an overlapping node definition in the first instance.

2020-07-08 08:28:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 09:50, Dan Williams wrote:
> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 08.07.20 07:27, Mike Rapoport wrote:
>>> On Tue, Jul 07, 2020 at 03:05:48PM -0700, Dan Williams wrote:
>>>> On Tue, Jul 7, 2020 at 11:01 AM Mike Rapoport <[email protected]> wrote:
>>>>>
>>>>> On Tue, Jul 07, 2020 at 02:26:08PM +0200, David Hildenbrand wrote:
>>>>>> On 07.07.20 14:13, Mike Rapoport wrote:
>>>>>>> On Tue, Jul 07, 2020 at 01:54:54PM +0200, Michal Hocko wrote:
>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>
>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>
>>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>>>> ---
>>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>
>>>>>>>>> /*
>>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>> */
>>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>> {
>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>> return 0;
>>>>>>>>> }
>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>
>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>>>>>
>>>>>>> We'll need to either add a CONFIG_ option or arch specific callback to
>>>>>>> make both non-empty (x86, powerpc, ia64) and empty (arm64, sh)
>>>>>>> implementations coexist ...
>>>>>>
>>>>>> Note: I have a similar dummy (return 0) patch for s390x lying around here.
>>>>>
>>>>> Then we'll call it a tie - 3:3 ;-)
>>>>
>>>> So I'd be happy to jump on the train of people wanting to export the
>>>> ARM stub for this (and add a new ARM stub for phys_to_target_node()),
>>>> but Will did have a plausibly better idea that I have been meaning to
>>>> circle back to:
>>>>
>>>> http://lore.kernel.org/r/20200325111039.GA32109@willie-the-truck
>>>>
>>>> ...i.e. iterate over node data to do the lookup. This would seem to
>>>> work generically for multiple archs unless I am missing something?
>>
>> IIRC, only memory assigned to/onlined to a ZONE is represented in the
>> pgdat node span. E.g., not offline memory blocks.
>
> So this dovetails somewhat with Will's idea. What if we populated
> node_data for "offline" ranges? I started there, but then saw
> ARCH_KEEP_MEMBLOCK and thought it would be safer to just teach
> phys_to_target_node() to use that rather than update other code paths
> to expect node_data might not always reflect online data.

We currently need a somewhat-accurate pgdat node span to detect when to
offline a node. See try_offline_node(). This works fairly reliable.

Shrinking the node span is currently fairly easy for !ZONE_DEVICE
memory, because we can rely on pfn_to_online_page() + pfn_to_nid(pfn).
See e.g., find_biggest_section_pfn().

If we glue growing/shrinking the node span to adding/removing of memory
(instead of e.g., onlining/offlining), we can no longer base shrinking
on memmap data. We would have to get the information ("how far can I
shrink the node span, is it empty?") from somewhere else. E.g.,
for_each_memory_block() - but that one does not cover ZONE_DEVICE. And
there are memory blocks which cover multiple nodes, in which case we
only store one of them ... unreliable.

This certainly needs more thought :/

>
>> Esp., when hotplugging + onlining consecutive memory, there won't really
>> be any intersections in most cases if I am not wrong. It would not be
>> "intersection" but rather "closest fit".
>>
>> With overlapping nodes it's even more unclear. Which one to pick?
>
> In the overlap case you get what you get. Some signal is better than
> the noise of a dummy function. The consequences of picking the wrong
> node might be that the kernel can't properly associate a memory range
> to its performance data tables in firmware, but then again firmware
> messed up with an overlapping node definition in the first instance.

I'd be curious if what we are trying to optimize here is actually worth
optimizing. IOW, is there a well-known scenario where the dummy value on
arm64 would be problematic and is worth the effort?

I mean, in all performance relevant setups (ignoring
hv_balloon/xen-balloon/prove_store(), which also use
memory_add_physaddr_to_nid()), we should have a proper PXM/node
specified by the hardware on memory hotadd. The fallback of
memory_add_physaddr_to_nid() is not relevant in these scenarios.

--
Thanks,

David / dhildenb

2020-07-08 08:41:23

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
> On 08.07.20 09:50, Dan Williams wrote:
> > On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
> >>
> >>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>>>
> >>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>>>> NUMA_NO_NID is detected.
> >>>>>>>>>
> >>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>>>> Signed-off-by: Jia He <[email protected]>
> >>>>>>>>> ---
> >>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>>>
> >>>>>>>>> /*
> >>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>>> */
> >>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>>> {
> >>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>>> return 0;
> >>>>>>>>> }
> >>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>>>
> >>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.

> I'd be curious if what we are trying to optimize here is actually worth
> optimizing. IOW, is there a well-known scenario where the dummy value on
> arm64 would be problematic and is worth the effort?

Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
for a stub might be an overkill.

I think Jia's suggestion [1] with addition of a comment that explains
why and when the stub will be used, can work for both
memory_add_physaddr_to_nid() and phys_to_target_node().

But on more theoretical/fundmanetal level, I think we lack a generic
abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
translaton of firmware supplied information into data that can be used
by the generic mm without need to reimplement it for each and every
arch.

[1] https://lore.kernel.org/lkml/AM6PR08MB406907F9F2B13DA6DC893AD9F7670@AM6PR08MB4069.eurprd08.prod.outlook.com

> I mean, in all performance relevant setups (ignoring
> hv_balloon/xen-balloon/prove_store(), which also use
> memory_add_physaddr_to_nid()), we should have a proper PXM/node
> specified by the hardware on memory hotadd. The fallback of
> memory_add_physaddr_to_nid() is not relevant in these scenarios.
>
> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-08 08:46:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 10:39, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
>> On 08.07.20 09:50, Dan Williams wrote:
>>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>>>
>>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>>>
>>>>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>>>>>> ---
>>>>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>>>
>>>>>>>>>>> /*
>>>>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>>> */
>>>>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>>> {
>>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>>> return 0;
>>>>>>>>>>> }
>>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>>>
>>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>
>> I'd be curious if what we are trying to optimize here is actually worth
>> optimizing. IOW, is there a well-known scenario where the dummy value on
>> arm64 would be problematic and is worth the effort?
>
> Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
> for a stub might be an overkill.
>
> I think Jia's suggestion [1] with addition of a comment that explains
> why and when the stub will be used, can work for both
> memory_add_physaddr_to_nid() and phys_to_target_node().

Agreed.

>
> But on more theoretical/fundmanetal level, I think we lack a generic
> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> translaton of firmware supplied information into data that can be used
> by the generic mm without need to reimplement it for each and every
> arch.

Right. As I expressed, I am not a friend of using memblock for that, and
the pgdat node span is tricky.

Maybe abstracting that x86 concept is possible in some way (and we could
restrict the information to boot-time properties, so we don't have to
mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).

--
Thanks,

David / dhildenb

2020-07-08 09:16:57

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 10:45:17AM +0200, David Hildenbrand wrote:
> On 08.07.20 10:39, Mike Rapoport wrote:
> > On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
> >> On 08.07.20 09:50, Dan Williams wrote:
> >>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
> >>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
> >>>>>>>>>>>
> >>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
> >>>>>>>>>>> NUMA_NO_NID is detected.
> >>>>>>>>>>>
> >>>>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
> >>>>>>>>>>> Signed-off-by: Jia He <[email protected]>
> >>>>>>>>>>> ---
> >>>>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
> >>>>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> >>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
> >>>>>>>>>>> --- a/arch/arm64/mm/numa.c
> >>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
> >>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
> >>>>>>>>>>>
> >>>>>>>>>>> /*
> >>>>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
> >>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
> >>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
> >>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
> >>>>>>>>>>> */
> >>>>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
> >>>>>>>>>>> {
> >>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
> >>>>>>>>>>> return 0;
> >>>>>>>>>>> }
> >>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >>>>>>>>>>
> >>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
> >>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
> >>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
> >
> >> I'd be curious if what we are trying to optimize here is actually worth
> >> optimizing. IOW, is there a well-known scenario where the dummy value on
> >> arm64 would be problematic and is worth the effort?
> >
> > Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
> > for a stub might be an overkill.
> >
> > I think Jia's suggestion [1] with addition of a comment that explains
> > why and when the stub will be used, can work for both
> > memory_add_physaddr_to_nid() and phys_to_target_node().
>
> Agreed.
>
> >
> > But on more theoretical/fundmanetal level, I think we lack a generic
> > abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> > translaton of firmware supplied information into data that can be used
> > by the generic mm without need to reimplement it for each and every
> > arch.
>
> Right. As I expressed, I am not a friend of using memblock for that, and
> the pgdat node span is tricky.
>
> Maybe abstracting that x86 concept is possible in some way (and we could
> restrict the information to boot-time properties, so we don't have to
> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).

I agree with pgdat part and disagree about memblock. It already has
non-init physmap, why won't we add memblock.memory to the mix? ;-)

Now, seriously, memblock already has all the necessary information about
the coldplug memory for several architectures. x86 being an exception
because for some reason the reserved memory is not considered memory
there. The infrastructure for quiering and iterating memory regions is
already there. We just need to leave out the irrelevant parts, like
memblock.reserved and allocation funcions.

Otherwise we'll add yet another 'struct { start, end }', a horde of
covnersion and re-initialization functions that will do more or less the
same things as current memblock APIs.

> --
> Thanks,
>
> David / dhildenb
>
>

--
Sincerely yours,
Mike.

2020-07-08 09:29:23

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 11:15, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 10:45:17AM +0200, David Hildenbrand wrote:
>> On 08.07.20 10:39, Mike Rapoport wrote:
>>> On Wed, Jul 08, 2020 at 10:26:41AM +0200, David Hildenbrand wrote:
>>>> On 08.07.20 09:50, Dan Williams wrote:
>>>>> On Wed, Jul 8, 2020 at 12:22 AM David Hildenbrand <[email protected]> wrote:
>>>>>>
>>>>>>>>>>>> On Tue 07-07-20 13:59:15, Jia He wrote:
>>>>>>>>>>>>> This exports memory_add_physaddr_to_nid() for module driver to use.
>>>>>>>>>>>>>
>>>>>>>>>>>>> memory_add_physaddr_to_nid() is a fallback option to get the nid in case
>>>>>>>>>>>>> NUMA_NO_NID is detected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Suggested-by: David Hildenbrand <[email protected]>
>>>>>>>>>>>>> Signed-off-by: Jia He <[email protected]>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>> arch/arm64/mm/numa.c | 5 +++--
>>>>>>>>>>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
>>>>>>>>>>>>> index aafcee3e3f7e..7eeb31740248 100644
>>>>>>>>>>>>> --- a/arch/arm64/mm/numa.c
>>>>>>>>>>>>> +++ b/arch/arm64/mm/numa.c
>>>>>>>>>>>>> @@ -464,10 +464,11 @@ void __init arm64_numa_init(void)
>>>>>>>>>>>>>
>>>>>>>>>>>>> /*
>>>>>>>>>>>>> * We hope that we will be hotplugging memory on nodes we already know about,
>>>>>>>>>>>>> - * such that acpi_get_node() succeeds and we never fall back to this...
>>>>>>>>>>>>> + * such that acpi_get_node() succeeds. But when SRAT is not present, the node
>>>>>>>>>>>>> + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback option.
>>>>>>>>>>>>> */
>>>>>>>>>>>>> int memory_add_physaddr_to_nid(u64 addr)
>>>>>>>>>>>>> {
>>>>>>>>>>>>> - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
>>>>>>>>>>>>> return 0;
>>>>>>>>>>>>> }
>>>>>>>>>>>>> +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>>>>>>>>>>>>
>>>>>>>>>>>> Does it make sense to export a noop function? Wouldn't make more sense
>>>>>>>>>>>> to simply make it static inline somewhere in a header? I haven't checked
>>>>>>>>>>>> whether there is an easy way to do that sanely bu this just hit my eyes.
>>>
>>>> I'd be curious if what we are trying to optimize here is actually worth
>>>> optimizing. IOW, is there a well-known scenario where the dummy value on
>>>> arm64 would be problematic and is worth the effort?
>>>
>>> Well, it started with Michal's comment above that EXPORT_SYMBOL_GPL()
>>> for a stub might be an overkill.
>>>
>>> I think Jia's suggestion [1] with addition of a comment that explains
>>> why and when the stub will be used, can work for both
>>> memory_add_physaddr_to_nid() and phys_to_target_node().
>>
>> Agreed.
>>
>>>
>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>> translaton of firmware supplied information into data that can be used
>>> by the generic mm without need to reimplement it for each and every
>>> arch.
>>
>> Right. As I expressed, I am not a friend of using memblock for that, and
>> the pgdat node span is tricky.
>>
>> Maybe abstracting that x86 concept is possible in some way (and we could
>> restrict the information to boot-time properties, so we don't have to
>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
>
> I agree with pgdat part and disagree about memblock. It already has
> non-init physmap, why won't we add memblock.memory to the mix? ;-)

Can we generalize and tweak physmap to contain node info? That's all we
need, no? (the special mem= parameter handling should not matter for our
use case, where "physmap" and "memory" would differ)

>
> Now, seriously, memblock already has all the necessary information about
> the coldplug memory for several architectures. x86 being an exception
> because for some reason the reserved memory is not considered memory
> there. The infrastructure for quiering and iterating memory regions is
> already there. We just need to leave out the irrelevant parts, like
> memblock.reserved and allocation funcions.

I *really* don't want to mess with memblocks on memory hot(un)plug on
x86 and s390x (+other architectures in the future). I also thought about
stopping to create memblocks for hotplugged memory on arm64, by tweaking
pfn_valid() to query memblocks only for early sections.

If "physmem" is not an option, can we at least introduce something like
ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
now (and later maybe for others)?

--
Thanks,

David / dhildenb

2020-07-08 09:47:03

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
> On 08.07.20 11:15, Mike Rapoport wrote:
> >>>>>>
> >>> But on more theoretical/fundmanetal level, I think we lack a generic
> >>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> >>> translaton of firmware supplied information into data that can be used
> >>> by the generic mm without need to reimplement it for each and every
> >>> arch.
> >>
> >> Right. As I expressed, I am not a friend of using memblock for that, and
> >> the pgdat node span is tricky.
> >>
> >> Maybe abstracting that x86 concept is possible in some way (and we could
> >> restrict the information to boot-time properties, so we don't have to
> >> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
> >
> > I agree with pgdat part and disagree about memblock. It already has
> > non-init physmap, why won't we add memblock.memory to the mix? ;-)
>
> Can we generalize and tweak physmap to contain node info? That's all we
> need, no? (the special mem= parameter handling should not matter for our
> use case, where "physmap" and "memory" would differ)

TBH, I have only random vague thoughts at the moment. This might be an
option. But then we need to enable physmap on !s390, right?

> > Now, seriously, memblock already has all the necessary information about
> > the coldplug memory for several architectures. x86 being an exception
> > because for some reason the reserved memory is not considered memory
> > there. The infrastructure for quiering and iterating memory regions is
> > already there. We just need to leave out the irrelevant parts, like
> > memblock.reserved and allocation funcions.
>
> I *really* don't want to mess with memblocks on memory hot(un)plug on
> x86 and s390x (+other architectures in the future). I also thought about
> stopping to create memblocks for hotplugged memory on arm64, by tweaking
> pfn_valid() to query memblocks only for early sections.
>
> If "physmem" is not an option, can we at least introduce something like
> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
> now (and later maybe for others)?

I have to do more memory hotplug howework to answer that ;-)

My general point is that we don't have to reinvent the wheel to have
coldplug memory representation, it's already there. We just need a way
to use it properly.

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.

2020-07-08 10:05:55

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 11:45, Mike Rapoport wrote:
> On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
>> On 08.07.20 11:15, Mike Rapoport wrote:
>>>>>>>>
>>>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>>>> translaton of firmware supplied information into data that can be used
>>>>> by the generic mm without need to reimplement it for each and every
>>>>> arch.
>>>>
>>>> Right. As I expressed, I am not a friend of using memblock for that, and
>>>> the pgdat node span is tricky.
>>>>
>>>> Maybe abstracting that x86 concept is possible in some way (and we could
>>>> restrict the information to boot-time properties, so we don't have to
>>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
>>>
>>> I agree with pgdat part and disagree about memblock. It already has
>>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
>>
>> Can we generalize and tweak physmap to contain node info? That's all we
>> need, no? (the special mem= parameter handling should not matter for our
>> use case, where "physmap" and "memory" would differ)
>
> TBH, I have only random vague thoughts at the moment. This might be an
> option. But then we need to enable physmap on !s390, right?

Yes, looks like it.

>
>>> Now, seriously, memblock already has all the necessary information about
>>> the coldplug memory for several architectures. x86 being an exception
>>> because for some reason the reserved memory is not considered memory
>>> there. The infrastructure for quiering and iterating memory regions is
>>> already there. We just need to leave out the irrelevant parts, like
>>> memblock.reserved and allocation funcions.
>>
>> I *really* don't want to mess with memblocks on memory hot(un)plug on
>> x86 and s390x (+other architectures in the future). I also thought about
>> stopping to create memblocks for hotplugged memory on arm64, by tweaking
>> pfn_valid() to query memblocks only for early sections.
>>
>> If "physmem" is not an option, can we at least introduce something like
>> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
>> now (and later maybe for others)?
>
> I have to do more memory hotplug howework to answer that ;-)
>
> My general point is that we don't have to reinvent the wheel to have
> coldplug memory representation, it's already there. We just need a way
> to use it properly.

Yes, I tend to agree. Details to be clarified :)

--
Thanks,

David / dhildenb

2020-07-08 15:50:57

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <[email protected]> wrote:
>
> On 08.07.20 11:45, Mike Rapoport wrote:
> > On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
> >> On 08.07.20 11:15, Mike Rapoport wrote:
> >>>>>>>>
> >>>>> But on more theoretical/fundmanetal level, I think we lack a generic
> >>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
> >>>>> translaton of firmware supplied information into data that can be used
> >>>>> by the generic mm without need to reimplement it for each and every
> >>>>> arch.
> >>>>
> >>>> Right. As I expressed, I am not a friend of using memblock for that, and
> >>>> the pgdat node span is tricky.
> >>>>
> >>>> Maybe abstracting that x86 concept is possible in some way (and we could
> >>>> restrict the information to boot-time properties, so we don't have to
> >>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
> >>>
> >>> I agree with pgdat part and disagree about memblock. It already has
> >>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
> >>
> >> Can we generalize and tweak physmap to contain node info? That's all we
> >> need, no? (the special mem= parameter handling should not matter for our
> >> use case, where "physmap" and "memory" would differ)
> >
> > TBH, I have only random vague thoughts at the moment. This might be an
> > option. But then we need to enable physmap on !s390, right?
>
> Yes, looks like it.
>
> >
> >>> Now, seriously, memblock already has all the necessary information about
> >>> the coldplug memory for several architectures. x86 being an exception
> >>> because for some reason the reserved memory is not considered memory
> >>> there. The infrastructure for quiering and iterating memory regions is
> >>> already there. We just need to leave out the irrelevant parts, like
> >>> memblock.reserved and allocation funcions.
> >>
> >> I *really* don't want to mess with memblocks on memory hot(un)plug on
> >> x86 and s390x (+other architectures in the future). I also thought about
> >> stopping to create memblocks for hotplugged memory on arm64, by tweaking
> >> pfn_valid() to query memblocks only for early sections.
> >>
> >> If "physmem" is not an option, can we at least introduce something like
> >> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
> >> now (and later maybe for others)?
> >
> > I have to do more memory hotplug howework to answer that ;-)
> >
> > My general point is that we don't have to reinvent the wheel to have
> > coldplug memory representation, it's already there. We just need a way
> > to use it properly.
>
> Yes, I tend to agree. Details to be clarified :)

I'm not quite understanding the concern, or requirement about
"updating memblock" in the hotplug path. The routines
memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
interrogate platform-firmware numa info through a common abstraction.
They place no burden on the memory hotplug code they're just used to
see if a hot-added range lies within an existing node span when
platform-firmware otherwise fails to communicate a node. x86 can
continue to back those helpers with numa_meminfo, arm64 can use a
generic memblock implementation and other archs can follow the arm64
example if they want better numa answers for drivers.

2020-07-08 16:11:37

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On 08.07.20 17:50, Dan Williams wrote:
> On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 08.07.20 11:45, Mike Rapoport wrote:
>>> On Wed, Jul 08, 2020 at 11:25:36AM +0200, David Hildenbrand wrote:
>>>> On 08.07.20 11:15, Mike Rapoport wrote:
>>>>>>>>>>
>>>>>>> But on more theoretical/fundmanetal level, I think we lack a generic
>>>>>>> abstraction similar to e.g. x86 'struct numa_meminfo' that serves as
>>>>>>> translaton of firmware supplied information into data that can be used
>>>>>>> by the generic mm without need to reimplement it for each and every
>>>>>>> arch.
>>>>>>
>>>>>> Right. As I expressed, I am not a friend of using memblock for that, and
>>>>>> the pgdat node span is tricky.
>>>>>>
>>>>>> Maybe abstracting that x86 concept is possible in some way (and we could
>>>>>> restrict the information to boot-time properties, so we don't have to
>>>>>> mess with memory hot(un)plug - just as done for numa_meminfo AFAIKS).
>>>>>
>>>>> I agree with pgdat part and disagree about memblock. It already has
>>>>> non-init physmap, why won't we add memblock.memory to the mix? ;-)
>>>>
>>>> Can we generalize and tweak physmap to contain node info? That's all we
>>>> need, no? (the special mem= parameter handling should not matter for our
>>>> use case, where "physmap" and "memory" would differ)
>>>
>>> TBH, I have only random vague thoughts at the moment. This might be an
>>> option. But then we need to enable physmap on !s390, right?
>>
>> Yes, looks like it.
>>
>>>
>>>>> Now, seriously, memblock already has all the necessary information about
>>>>> the coldplug memory for several architectures. x86 being an exception
>>>>> because for some reason the reserved memory is not considered memory
>>>>> there. The infrastructure for quiering and iterating memory regions is
>>>>> already there. We just need to leave out the irrelevant parts, like
>>>>> memblock.reserved and allocation funcions.
>>>>
>>>> I *really* don't want to mess with memblocks on memory hot(un)plug on
>>>> x86 and s390x (+other architectures in the future). I also thought about
>>>> stopping to create memblocks for hotplugged memory on arm64, by tweaking
>>>> pfn_valid() to query memblocks only for early sections.
>>>>
>>>> If "physmem" is not an option, can we at least introduce something like
>>>> ARCH_UPDTAE_MEMBLOCK_ON_HOTPLUG to avoid doing that on x86 and s390x for
>>>> now (and later maybe for others)?
>>>
>>> I have to do more memory hotplug howework to answer that ;-)
>>>
>>> My general point is that we don't have to reinvent the wheel to have
>>> coldplug memory representation, it's already there. We just need a way
>>> to use it properly.
>>
>> Yes, I tend to agree. Details to be clarified :)
>
> I'm not quite understanding the concern, or requirement about
> "updating memblock" in the hotplug path. The routines
> memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
> interrogate platform-firmware numa info through a common abstraction.
> They place no burden on the memory hotplug code they're just used to
> see if a hot-added range lies within an existing node span when
> platform-firmware otherwise fails to communicate a node. x86 can
> continue to back those helpers with numa_meminfo, arm64 can use a
> generic memblock implementation and other archs can follow the arm64
> example if they want better numa answers for drivers.
>

See memblock_add_node()/memblock_remove() in mm/memory_hotplug.c. I
don't want that code be reactivated for x86/s390x. That's all I am saying.

--
Thanks,

David / dhildenb

2020-07-08 16:51:44

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL

On Wed, Jul 08, 2020 at 06:10:19PM +0200, David Hildenbrand wrote:
> On 08.07.20 17:50, Dan Williams wrote:
> > On Wed, Jul 8, 2020 at 3:04 AM David Hildenbrand <[email protected]> wrote:
> >>
> >
> > I'm not quite understanding the concern, or requirement about
> > "updating memblock" in the hotplug path. The routines
> > memory_add_physaddr_to_nid() and phys_to_target_node() are helpers to
> > interrogate platform-firmware numa info through a common abstraction.
> > They place no burden on the memory hotplug code they're just used to
> > see if a hot-added range lies within an existing node span when
> > platform-firmware otherwise fails to communicate a node. x86 can
> > continue to back those helpers with numa_meminfo, arm64 can use a
> > generic memblock implementation and other archs can follow the arm64
> > example if they want better numa answers for drivers.
> >
>
> See memblock_add_node()/memblock_remove() in mm/memory_hotplug.c. I
> don't want that code be reactivated for x86/s390x. That's all I am saying.

And these have actual meaning only on arm64 because powerpc does not
rely on memblock for memory hot(un)plug, AFAIU.

Anyway, at the moment we can use memblock on hotplug path only on arm64
and I don't think its the path worth exploring.

> --
> Thanks,
>
> David / dhildenb
>

--
Sincerely yours,
Mike.