2023-06-14 04:52:40

by Alison Schofield

[permalink] [raw]
Subject: [PATCH v2 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window

From: Alison Schofield <[email protected]>

Along with the changes in v2 listed below, Dan questioned the maintenance
burden of x86 not switching to use the memblock API. See Dan Williams &
Mike Rapoport discuss the issue in the v1 link. [1]

IIUC, switching the existing x86 meminfo usage to memblock is pre-existing
outstanding work; per Mike, 'that's quite some work needed to make
that happen', and since the memblock API doesn't support something like
numa_fill_memblks(), that work would have to be added on top.

So, with that open awaiting feedback from x86 maintainers, here's v2.


Changes in v2:

Patch 1/2: x86/numa: Introduce numa_fill_memblks()
- Update commit log with policy description. (Dan)
- Collect memblks with any HPA range intersect. (Dan)
- Adjust head or tail memblk to include, not align to, HPA range.
- Let the case of a single memblk simply fall through.
- Simplify the sort compare function to use start only.
- Rewrite and simplify the fill loop.
- Add code comment for exclusive memblk->end. (Dan)
- Add code comment for memblks being adjusted in place. (Dan)
- Add Tags: Reported-by, Suggested-by, Tested-by

Patch 2/2: ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
- Add Tags: Reported-by, Suggested-by, Tested-by
- No changes in patch body.

[1] v1: https://lore.kernel.org/linux-cxl/[email protected]/

Cover Letter:

The CXL subsystem requires the creation of NUMA nodes for CFMWS
Windows not described in the SRAT. The existing implementation
only addresses windows that the SRAT describes completely or
not at all. This work addresses the case of partially described
CFMWS Windows by extending proximity domains in a portion of
a CFMWS window to the entire window.

Introduce a NUMA helper, numa_fill_memblks(), to fill gaps in
a numa_meminfo memblk address range. Update the CFMWS parsing
in the ACPI driver to use numa_fill_memblks() to extend SRAT
defined proximity domains to entire CXL windows.

An RFC of this patchset was previously posted for CXL folks'
review [2]. The RFC feedback led to the implementation here,
extending existing memblks (Dan). Also, both Jonathan and
Dan influenced the changelog comments in the ACPI patch, with
regards to setting expectations on this evolving heuristic.

Repeating here to set reviewer expectations:
*Note that this heuristic will evolve when CFMWS Windows present
a wider range of characteristics. The extension of the proximity
domain, implemented here, is likely a step in developing a more
sophisticated performance profile in the future.

[2] https://lore.kernel.org/linux-cxl/[email protected]/

Alison Schofield (2):
x86/numa: Introduce numa_fill_memblks()
ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window

arch/x86/include/asm/sparsemem.h | 2 +
arch/x86/mm/numa.c | 87 ++++++++++++++++++++++++++++++++
drivers/acpi/numa/srat.c | 11 ++--
include/linux/numa.h | 7 +++
4 files changed, 104 insertions(+), 3 deletions(-)


base-commit: 6e2e1e779912345f0b5f86ef01facc2802bd97cc
--
2.37.3



2023-06-14 04:59:39

by Alison Schofield

[permalink] [raw]
Subject: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

From: Alison Schofield <[email protected]>

numa_fill_memblks() fills in the gaps in numa_meminfo memblks
over an HPA address range.

The ACPI driver will use numa_fill_memblks() to implement a new Linux
policy that prescribes extending proximity domains in a portion of a
CFMWS window to the entire window.

Dan Williams offered this explanation of the policy:
A CFMWS is an ACPI data structure that indicates *potential* locations
where CXL memory can be placed. It is the playground where the CXL
driver has free rein to establish regions. That space can be populated
by BIOS created regions, or driver created regions, after hotplug or
other reconfiguration.

When BIOS creates a region in a CXL Window it additionally describes
that subset of the Window range in the other typical ACPI tables SRAT,
SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
CXL Window in SRAT, SLIT, and HMAT is that it cannot predict the
future. I.e. there is nothing stopping higher or lower performance
devices being placed in the same Window. Compare that to ACPI memory
hotplug that just onlines additional capacity in the proximity domain
with little freedom for dynamic performance differentiation.

That leaves the OS with a choice, should unpopulated window capacity
match the proximity domain of an existing region, or should it allocate
a new one? This patch takes the simple position of minimizing proximity
domain proliferation by reusing any proximity domain intersection for
the entire Window. If the Window has no intersections then allocate a
new proximity domain. Note that SRAT, SLIT and HMAT information can be
enumerated dynamically in a standard way from device provided data.
Think of CXL as the end of ACPI needing to describe memory attributes,
CXL offers a standard discovery model for performance attributes, but
Linux still needs to interoperate with the old regime.

Reported-by: Derick Marks <[email protected]>
Suggested-by: Dan Williams <[email protected]>
Signed-off-by: Alison Schofield <[email protected]>
Tested-by: Derick Marks <[email protected]>
---
arch/x86/include/asm/sparsemem.h | 2 +
arch/x86/mm/numa.c | 87 ++++++++++++++++++++++++++++++++
include/linux/numa.h | 7 +++
3 files changed, 96 insertions(+)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 64df897c0ee3..1be13b2dfe8b 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
#define phys_to_target_node phys_to_target_node
extern int memory_add_physaddr_to_nid(u64 start);
#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
+extern int numa_fill_memblks(u64 start, u64 end);
+#define numa_fill_memblks numa_fill_memblks
#endif
#endif /* __ASSEMBLY__ */

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..fa82141d1a04 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
#include <linux/nodemask.h>
#include <linux/sched.h>
#include <linux/topology.h>
+#include <linux/sort.h>

#include <asm/e820/api.h>
#include <asm/proto.h>
@@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
return nid;
}
EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+static int __init cmp_memblk(const void *a, const void *b)
+{
+ const struct numa_memblk *ma = *(const struct numa_memblk **)a;
+ const struct numa_memblk *mb = *(const struct numa_memblk **)b;
+
+ if (ma->start != mb->start)
+ return (ma->start < mb->start) ? -1 : 1;
+
+ /* Caller handles duplicate start addresses */
+ return 0;
+}
+
+static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
+
+/**
+ * numa_fill_memblks - Fill gaps in numa_meminfo memblks
+ * @start: address to begin fill
+ * @end: address to end fill
+ *
+ * Find and extend numa_meminfo memblks to cover the @start-@end
+ * HPA address range, such that the first memblk includes @start,
+ * the last memblk includes @end, and any gaps in between are
+ * filled.
+ *
+ * RETURNS:
+ * 0 : Success
+ * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
+ */
+
+int __init numa_fill_memblks(u64 start, u64 end)
+{
+ struct numa_memblk **blk = &numa_memblk_list[0];
+ struct numa_meminfo *mi = &numa_meminfo;
+ int count = 0;
+ u64 prev_end;
+
+ /*
+ * Create a list of pointers to numa_meminfo memblks that
+ * overlap start, end. Exclude (start == bi->end) since
+ * end addresses in both a CFMWS range and a memblk range
+ * are exclusive.
+ *
+ * This list of pointers is used to make in-place changes
+ * that fill out the numa_meminfo memblks.
+ */
+ for (int i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *bi = &mi->blk[i];
+
+ if (start < bi->end && end >= bi->start) {
+ blk[count] = &mi->blk[i];
+ count++;
+ }
+ }
+ if (!count)
+ return NUMA_NO_MEMBLK;
+
+ /* Sort the list of pointers in memblk->start order */
+ sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
+
+ /* Make sure the first/last memblks include start/end */
+ blk[0]->start = min(blk[0]->start, start);
+ blk[count - 1]->end = max(blk[count - 1]->end, end);
+
+ /*
+ * Fill any gaps by tracking the previous memblks end address,
+ * prev_end, and backfilling to it if needed. Avoid filling
+ * overlapping memblks by making prev_end monotonically non-
+ * decreasing.
+ */
+ prev_end = blk[0]->end;
+ for (int i = 1; i < count; i++) {
+ struct numa_memblk *curr = blk[i];
+
+ if (prev_end >= curr->start) {
+ if (prev_end < curr->end)
+ prev_end = curr->end;
+ } else {
+ curr->start = prev_end;
+ prev_end = curr->end;
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(numa_fill_memblks);
+
#endif
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 59df211d051f..0f512c0aba54 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -12,6 +12,7 @@
#define MAX_NUMNODES (1 << NODES_SHIFT)

#define NUMA_NO_NODE (-1)
+#define NUMA_NO_MEMBLK (-1)

/* optionally keep NUMA memory info available post init */
#ifdef CONFIG_NUMA_KEEP_MEMINFO
@@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
return 0;
}
#endif
+#ifndef numa_fill_memblks
+static inline int __init numa_fill_memblks(u64 start, u64 end)
+{
+ return NUMA_NO_MEMBLK;
+}
+#endif
#else /* !CONFIG_NUMA */
static inline int numa_map_to_online_node(int node)
{
--
2.37.3


2023-06-14 08:00:51

by Wilczynski, Michal

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()



On 6/14/2023 6:35 AM, [email protected] wrote:
> From: Alison Schofield <[email protected]>
>
> numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> over an HPA address range.
>
> The ACPI driver will use numa_fill_memblks() to implement a new Linux
> policy that prescribes extending proximity domains in a portion of a
> CFMWS window to the entire window.
>
> Dan Williams offered this explanation of the policy:
> A CFWMS is an ACPI data structure that indicates *potential* locations
> where CXL memory can be placed. It is the playground where the CXL
> driver has free reign to establish regions. That space can be populated
> by BIOS created regions, or driver created regions, after hotplug or
> other reconfiguration.
>
> When BIOS creates a region in a CXL Window it additionally describes
> that subset of the Window range in the other typical ACPI tables SRAT,
> SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> future. I.e. there is nothing stopping higher or lower performance
> devices being placed in the same Window. Compare that to ACPI memory
> hotplug that just onlines additional capacity in the proximity domain
> with little freedom for dynamic performance differentiation.
>
> That leaves the OS with a choice, should unpopulated window capacity
> match the proximity domain of an existing region, or should it allocate
> a new one? This patch takes the simple position of minimizing proximity
> domain proliferation by reusing any proximity domain intersection for
> the entire Window. If the Window has no intersections then allocate a
> new proximity domain. Note that SRAT, SLIT and HMAT information can be
> enumerated dynamically in a standard way from device provided data.
> Think of CXL as the end of ACPI needing to describe memory attributes,
> CXL offers a standard discovery model for performance attributes, but
> Linux still needs to interoperate with the old regime.
>
> Reported-by: Derick Marks <[email protected]>
> Suggested-by: Dan Williams <[email protected]>
> Signed-off-by: Alison Schofield <[email protected]>
> Tested-by: Derick Marks <[email protected]>
> ---
> arch/x86/include/asm/sparsemem.h | 2 +
> arch/x86/mm/numa.c | 87 ++++++++++++++++++++++++++++++++
> include/linux/numa.h | 7 +++
> 3 files changed, 96 insertions(+)
>
> diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> index 64df897c0ee3..1be13b2dfe8b 100644
> --- a/arch/x86/include/asm/sparsemem.h
> +++ b/arch/x86/include/asm/sparsemem.h
> @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> #define phys_to_target_node phys_to_target_node
> extern int memory_add_physaddr_to_nid(u64 start);
> #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> +extern int numa_fill_memblks(u64 start, u64 end);
> +#define numa_fill_memblks numa_fill_memblks
> #endif
> #endif /* __ASSEMBLY__ */
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..fa82141d1a04 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -11,6 +11,7 @@
> #include <linux/nodemask.h>
> #include <linux/sched.h>
> #include <linux/topology.h>
> +#include <linux/sort.h>
>
> #include <asm/e820/api.h>
> #include <asm/proto.h>
> @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
> return nid;
> }
> EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> +
> +static int __init cmp_memblk(const void *a, const void *b)
> +{
> + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> + const struct numa_memblk *mb = *(const struct numa_memblk **)b;

Is this casting necessary?

> +
> + if (ma->start != mb->start)
> + return (ma->start < mb->start) ? -1 : 1;
> +
> + /* Caller handles duplicate start addresses */
> + return 0;
> +}
> +
> +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> +
> +/**
> + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> + * @start: address to begin fill
> + * @end: address to end fill
> + *
> + * Find and extend numa_meminfo memblks to cover the @start-@end
> + * HPA address range, such that the first memblk includes @start,
> + * the last memblk includes @end, and any gaps in between are
> + * filled.
> + *
> + * RETURNS:
> + * 0 : Success
> + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> + */
> +
> +int __init numa_fill_memblks(u64 start, u64 end)
> +{
> + struct numa_memblk **blk = &numa_memblk_list[0];
> + struct numa_meminfo *mi = &numa_meminfo;
> + int count = 0;
> + u64 prev_end;
> +
> + /*
> + * Create a list of pointers to numa_meminfo memblks that
> + * overlap start, end. Exclude (start == bi->end) since
> + * end addresses in both a CFMWS range and a memblk range
> + * are exclusive.
> + *
> + * This list of pointers is used to make in-place changes
> + * that fill out the numa_meminfo memblks.
> + */
> + for (int i = 0; i < mi->nr_blks; i++) {
> + struct numa_memblk *bi = &mi->blk[i];
> +
> + if (start < bi->end && end >= bi->start) {
> + blk[count] = &mi->blk[i];
> + count++;
> + }
> + }
> + if (!count)
> + return NUMA_NO_MEMBLK;
> +
> + /* Sort the list of pointers in memblk->start order */
> + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> +
> + /* Make sure the first/last memblks include start/end */
> + blk[0]->start = min(blk[0]->start, start);
> + blk[count - 1]->end = max(blk[count - 1]->end, end);
> +
> + /*
> + * Fill any gaps by tracking the previous memblks end address,
> + * prev_end, and backfilling to it if needed. Avoid filling
> + * overlapping memblks by making prev_end monotonically non-
> + * decreasing.
> + */
> + prev_end = blk[0]->end;
> + for (int i = 1; i < count; i++) {
> + struct numa_memblk *curr = blk[i];
> +
> + if (prev_end >= curr->start) {
> + if (prev_end < curr->end)
> + prev_end = curr->end;
> + } else {
> + curr->start = prev_end;
> + prev_end = curr->end;
> + }
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(numa_fill_memblks);
> +
> #endif
> diff --git a/include/linux/numa.h b/include/linux/numa.h
> index 59df211d051f..0f512c0aba54 100644
> --- a/include/linux/numa.h
> +++ b/include/linux/numa.h
> @@ -12,6 +12,7 @@
> #define MAX_NUMNODES (1 << NODES_SHIFT)
>
> #define NUMA_NO_NODE (-1)
> +#define NUMA_NO_MEMBLK (-1)

Same error code as NUMA_NO_NODE ?

>
> /* optionally keep NUMA memory info available post init */
> #ifdef CONFIG_NUMA_KEEP_MEMINFO
> @@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
> return 0;
> }
> #endif
> +#ifndef numa_fill_memblks

Why not just #ifndef CONFIG_NUMA_KEEP_MEMINFO ?

> +static inline int __init numa_fill_memblks(u64 start, u64 end)
> +{
> + return NUMA_NO_MEMBLK;
> +}
> +#endif
> #else /* !CONFIG_NUMA */
> static inline int numa_map_to_online_node(int node)
> {


2023-06-14 09:21:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window

On Tue, Jun 13, 2023 at 09:35:23PM -0700, [email protected] wrote:
> The CXL subsystem requires the creation of NUMA nodes for CFMWS

The thing is CXL some persistent memory thing, right? But what is this
CFMWS thing? I don't think I've ever seen that particular combination of
letters together.

2023-06-14 09:29:09

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window

On Wed, 14 Jun 2023 10:32:40 +0200
Peter Zijlstra <[email protected]> wrote:

> On Tue, Jun 13, 2023 at 09:35:23PM -0700, [email protected] wrote:
> > The CXL subsystem requires the creation of NUMA nodes for CFMWS
>
> The thing is CXL some persistent memory thing, right? But what is this
> CFMWS thing? I don't think I've ever seen that particular combination of
> letters together.
>
Hi Peter,

To save time before the US-based folk wake up:

Both persistent and volatile memory found on CXL devices (mostly volatile on
early devices).

CXL Fixed Memory Window Structure (CFMWS) - defined in section 9.17.1.3 of CXL r3.0
and conveyed via an ACPI table (CEDT). CFMWS, as a term, is sometimes abused in the kernel
(and here) to mean the window itself rather than the structure describing the window
(hence the S on the end).

A CFMWS window is a region of Host Physical Address (HPA) space which routes accesses to CXL
host bridges. A CFMWS describes interleaving as well (so multiple target host bridges).
If multiple interleave setups are available, then you'll see multiple CFMWS entries
- so different static regions of HPA can route to the same host bridges with different
interleave setups (decoding via the configurable parts to hit different actual memory
on the downstream devices).
Where accesses are routed after that depends on the configurable parts
of the CXL topology (Host-Managed Device Memory (HDM) decoders in host bridges,
switches etc). Note that a CFMWS address may route to nowhere if downstream
devices aren't available / configured yet.

CFMWS is how the CXL specification avoids defining interfaces for controlling
the host-address-space-to-CXL-host-bridge mapping, as those vary so much across
host implementations and are not always configurable at runtime anyway. It also
includes a bunch of other details about the region (too many details perhaps!)

Who does the configuration (BIOS / kernel) varies across implementations,
and we have OS-managed hotplug, so the OS always has to do some of it
(personally I prefer the kernel doing everything :)
It's made messier by CXL 1.1 hosts, where a lot less was discoverable, so
generally the BIOS has to do the heavy lifting. From CXL 2.0 onwards the OS
'might' do everything except whatever is needed on the host to configure
the CXL Fixed Memory Windows it is advertising.

Note there is no requirement that the access characteristics of memory mapped
into a given CFMWS should be remotely consistent across the whole window
- some of the window may route through switches, and some to directly connected
devices.
That's a simplifying assumption made today as we don't yet know the full
scope of what people are building.

Hope that helps (rather than causing confusion!)

Jonathan

2023-06-14 13:08:59

by Dan Williams

[permalink] [raw]
Subject: RE: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

alison.schofield@ wrote:
> From: Alison Schofield <[email protected]>
>
> numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> over an HPA address range.
>
> The ACPI driver will use numa_fill_memblks() to implement a new Linux
> policy that prescribes extending proximity domains in a portion of a
> CFMWS window to the entire window.
>
> Dan Williams offered this explanation of the policy:
> A CFWMS is an ACPI data structure that indicates *potential* locations
> where CXL memory can be placed. It is the playground where the CXL
> driver has free reign to establish regions. That space can be populated
> by BIOS created regions, or driver created regions, after hotplug or
> other reconfiguration.
>
> When BIOS creates a region in a CXL Window it additionally describes
> that subset of the Window range in the other typical ACPI tables SRAT,
> SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> future. I.e. there is nothing stopping higher or lower performance
> devices being placed in the same Window. Compare that to ACPI memory
> hotplug that just onlines additional capacity in the proximity domain
> with little freedom for dynamic performance differentiation.
>
> That leaves the OS with a choice, should unpopulated window capacity
> match the proximity domain of an existing region, or should it allocate
> a new one? This patch takes the simple position of minimizing proximity
> domain proliferation by reusing any proximity domain intersection for
> the entire Window. If the Window has no intersections then allocate a
> new proximity domain. Note that SRAT, SLIT and HMAT information can be
> enumerated dynamically in a standard way from device provided data.
> Think of CXL as the end of ACPI needing to describe memory attributes,
> CXL offers a standard discovery model for performance attributes, but
> Linux still needs to interoperate with the old regime.
>
> Reported-by: Derick Marks <[email protected]>
> Suggested-by: Dan Williams <[email protected]>
> Signed-off-by: Alison Schofield <[email protected]>
> Tested-by: Derick Marks <[email protected]>
[..]
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..fa82141d1a04 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
[..]
> @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
> return nid;
> }
> EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> +
> +static int __init cmp_memblk(const void *a, const void *b)
> +{
> + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> + const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> +
> + if (ma->start != mb->start)
> + return (ma->start < mb->start) ? -1 : 1;
> +
> + /* Caller handles duplicate start addresses */
> + return 0;

This can be simplified to:

static int __init cmp_memblk(const void *a, const void *b)
{
const struct numa_memblk *ma = *(const struct numa_memblk **)a;
const struct numa_memblk *mb = *(const struct numa_memblk **)b;

return ma->start - mb->start;
}

> +}
> +
> +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> +
> +/**
> + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> + * @start: address to begin fill
> + * @end: address to end fill
> + *
> + * Find and extend numa_meminfo memblks to cover the @start-@end
> + * HPA address range, such that the first memblk includes @start,
> + * the last memblk includes @end, and any gaps in between are
> + * filled.
> + *
> + * RETURNS:
> + * 0 : Success
> + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> + */
> +
> +int __init numa_fill_memblks(u64 start, u64 end)
> +{
> + struct numa_memblk **blk = &numa_memblk_list[0];
> + struct numa_meminfo *mi = &numa_meminfo;
> + int count = 0;
> + u64 prev_end;
> +
> + /*
> + * Create a list of pointers to numa_meminfo memblks that
> + * overlap start, end. Exclude (start == bi->end) since
> + * end addresses in both a CFMWS range and a memblk range
> + * are exclusive.
> + *
> + * This list of pointers is used to make in-place changes
> + * that fill out the numa_meminfo memblks.
> + */

Thanks for this comment, looks good.

> + for (int i = 0; i < mi->nr_blks; i++) {
> + struct numa_memblk *bi = &mi->blk[i];
> +
> + if (start < bi->end && end >= bi->start) {
> + blk[count] = &mi->blk[i];
> + count++;
> + }
> + }
> + if (!count)
> + return NUMA_NO_MEMBLK;
> +
> + /* Sort the list of pointers in memblk->start order */
> + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> +
> + /* Make sure the first/last memblks include start/end */
> + blk[0]->start = min(blk[0]->start, start);
> + blk[count - 1]->end = max(blk[count - 1]->end, end);
> +
> + /*
> + * Fill any gaps by tracking the previous memblks end address,
> + * prev_end, and backfilling to it if needed. Avoid filling
> + * overlapping memblks by making prev_end monotonically non-
> + * decreasing.

I am not immediately understanding the use of the term monotonically
non-decreasing here? I think the first sentence of this comment is
enough, or am I missing a nuance?

> + */
> + prev_end = blk[0]->end;
> + for (int i = 1; i < count; i++) {
> + struct numa_memblk *curr = blk[i];
> +
> + if (prev_end >= curr->start) {
> + if (prev_end < curr->end)
> + prev_end = curr->end;
> + } else {
> + curr->start = prev_end;
> + prev_end = curr->end;
> + }
> + }
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(numa_fill_memblks);

This export is not needed. The only caller of this is
drivers/acpi/numa/srat.c which is only ever built-in, not a module.

2023-06-14 13:11:39

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

Wilczynski, Michal wrote:
>
>
> On 6/14/2023 6:35 AM, [email protected] wrote:
> > From: Alison Schofield <[email protected]>
> >
> > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > over an HPA address range.
> >
> > The ACPI driver will use numa_fill_memblks() to implement a new Linux
> > policy that prescribes extending proximity domains in a portion of a
> > CFMWS window to the entire window.
> >
> > Dan Williams offered this explanation of the policy:
> > A CFWMS is an ACPI data structure that indicates *potential* locations
> > where CXL memory can be placed. It is the playground where the CXL
> > driver has free reign to establish regions. That space can be populated
> > by BIOS created regions, or driver created regions, after hotplug or
> > other reconfiguration.
> >
> > When BIOS creates a region in a CXL Window it additionally describes
> > that subset of the Window range in the other typical ACPI tables SRAT,
> > SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> > CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> > future. I.e. there is nothing stopping higher or lower performance
> > devices being placed in the same Window. Compare that to ACPI memory
> > hotplug that just onlines additional capacity in the proximity domain
> > with little freedom for dynamic performance differentiation.
> >
> > That leaves the OS with a choice, should unpopulated window capacity
> > match the proximity domain of an existing region, or should it allocate
> > a new one? This patch takes the simple position of minimizing proximity
> > domain proliferation by reusing any proximity domain intersection for
> > the entire Window. If the Window has no intersections then allocate a
> > new proximity domain. Note that SRAT, SLIT and HMAT information can be
> > enumerated dynamically in a standard way from device provided data.
> > Think of CXL as the end of ACPI needing to describe memory attributes,
> > CXL offers a standard discovery model for performance attributes, but
> > Linux still needs to interoperate with the old regime.
> >
> > Reported-by: Derick Marks <[email protected]>
> > Suggested-by: Dan Williams <[email protected]>
> > Signed-off-by: Alison Schofield <[email protected]>
> > Tested-by: Derick Marks <[email protected]>
> > ---
> > arch/x86/include/asm/sparsemem.h | 2 +
> > arch/x86/mm/numa.c | 87 ++++++++++++++++++++++++++++++++
> > include/linux/numa.h | 7 +++
> > 3 files changed, 96 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> > index 64df897c0ee3..1be13b2dfe8b 100644
> > --- a/arch/x86/include/asm/sparsemem.h
> > +++ b/arch/x86/include/asm/sparsemem.h
> > @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> > #define phys_to_target_node phys_to_target_node
> > extern int memory_add_physaddr_to_nid(u64 start);
> > #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > +extern int numa_fill_memblks(u64 start, u64 end);
> > +#define numa_fill_memblks numa_fill_memblks
> > #endif
> > #endif /* __ASSEMBLY__ */
> >
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..fa82141d1a04 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -11,6 +11,7 @@
> > #include <linux/nodemask.h>
> > #include <linux/sched.h>
> > #include <linux/topology.h>
> > +#include <linux/sort.h>
> >
> > #include <asm/e820/api.h>
> > #include <asm/proto.h>
> > @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
> > return nid;
> > }
> > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > +
> > +static int __init cmp_memblk(const void *a, const void *b)
> > +{
> > + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> > + const struct numa_memblk *mb = *(const struct numa_memblk **)b;
>
> Is this casting necessary? ?

This is idiomatic for sort() comparison handlers.

> > +
> > + if (ma->start != mb->start)
> > + return (ma->start < mb->start) ? -1 : 1;
> > +
> > + /* Caller handles duplicate start addresses */
> > + return 0;
> > +}
> > +
> > +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> > +
> > +/**
> > + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> > + * @start: address to begin fill
> > + * @end: address to end fill
> > + *
> > + * Find and extend numa_meminfo memblks to cover the @start-@end
> > + * HPA address range, such that the first memblk includes @start,
> > + * the last memblk includes @end, and any gaps in between are
> > + * filled.
> > + *
> > + * RETURNS:
> > + * 0 : Success
> > + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> > + */
> > +
> > +int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > + struct numa_memblk **blk = &numa_memblk_list[0];
> > + struct numa_meminfo *mi = &numa_meminfo;
> > + int count = 0;
> > + u64 prev_end;
> > +
> > + /*
> > + * Create a list of pointers to numa_meminfo memblks that
> > + * overlap start, end. Exclude (start == bi->end) since
> > + * end addresses in both a CFMWS range and a memblk range
> > + * are exclusive.
> > + *
> > + * This list of pointers is used to make in-place changes
> > + * that fill out the numa_meminfo memblks.
> > + */
> > + for (int i = 0; i < mi->nr_blks; i++) {
> > + struct numa_memblk *bi = &mi->blk[i];
> > +
> > + if (start < bi->end && end >= bi->start) {
> > + blk[count] = &mi->blk[i];
> > + count++;
> > + }
> > + }
> > + if (!count)
> > + return NUMA_NO_MEMBLK;
> > +
> > + /* Sort the list of pointers in memblk->start order */
> > + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> > +
> > + /* Make sure the first/last memblks include start/end */
> > + blk[0]->start = min(blk[0]->start, start);
> > + blk[count - 1]->end = max(blk[count - 1]->end, end);
> > +
> > + /*
> > + * Fill any gaps by tracking the previous memblks end address,
> > + * prev_end, and backfilling to it if needed. Avoid filling
> > + * overlapping memblks by making prev_end monotonically non-
> > + * decreasing.
> > + */
> > + prev_end = blk[0]->end;
> > + for (int i = 1; i < count; i++) {
> > + struct numa_memblk *curr = blk[i];
> > +
> > + if (prev_end >= curr->start) {
> > + if (prev_end < curr->end)
> > + prev_end = curr->end;
> > + } else {
> > + curr->start = prev_end;
> > + prev_end = curr->end;
> > + }
> > + }
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(numa_fill_memblks);
> > +
> > #endif
> > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > index 59df211d051f..0f512c0aba54 100644
> > --- a/include/linux/numa.h
> > +++ b/include/linux/numa.h
> > @@ -12,6 +12,7 @@
> > #define MAX_NUMNODES (1 << NODES_SHIFT)
> >
> > #define NUMA_NO_NODE (-1)
> > +#define NUMA_NO_MEMBLK (-1)
>
> Same error code as NUMA_NO_NODE ?
>
> >
> > /* optionally keep NUMA memory info available post init */
> > #ifdef CONFIG_NUMA_KEEP_MEMINFO
> > @@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
> > return 0;
> > }
> > #endif
> > +#ifndef numa_fill_memblks
>
> Why not just #ifndef CONFIG_NUMA_KEEP_MEMINFO ?

This is due to the fact that multiple archs use
CONFIG_NUMA_KEEP_MEMINFO (x86, ARM64, LOONGARCH), but only one supplies
a numa_fill_memblks() implementation (x86).

2023-06-14 14:45:52

by Dan Williams

Subject: Re: [PATCH v2 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window

Jonathan Cameron wrote:
> On Wed, 14 Jun 2023 10:32:40 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Tue, Jun 13, 2023 at 09:35:23PM -0700, [email protected] wrote:
> > > The CXL subsystem requires the creation of NUMA nodes for CFMWS
> >
> > The thing is, CXL is some persistent memory thing, right? But what is this
> > CFMWS thing? I don't think I've ever seen that particular combination of
> > letters together.
> >
> Hi Peter,
>
> To save time before the US based folk wake up.
>
[..]
> Note there is no requirement that the access characteristics of memory mapped
> into a given CFMWS should be remotely consistent across the whole window
> - some of the window may route through switches, and to directly connected
> devices.
> That's a simplifying assumption made today as we don't yet know the full
> scope of what people are building.
>
> Hope that helps (rather than causing confusion!)

Thanks Jonathan! Patch 1 changelog also goes into more detail.

2023-06-14 16:34:25

by Alison Schofield

Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

On Wed, Jun 14, 2023 at 05:44:19AM -0700, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <[email protected]>
> >
> > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > over an HPA address range.
> >
> > The ACPI driver will use numa_fill_memblks() to implement a new Linux
> > policy that prescribes extending proximity domains in a portion of a
> > CFMWS window to the entire window.
> >
> > Dan Williams offered this explanation of the policy:
> > A CFMWS is an ACPI data structure that indicates *potential* locations
> > where CXL memory can be placed. It is the playground where the CXL
> > driver has free rein to establish regions. That space can be populated
> > by BIOS created regions, or driver created regions, after hotplug or
> > other reconfiguration.
> >
> > When BIOS creates a region in a CXL Window it additionally describes
> > that subset of the Window range in the other typical ACPI tables SRAT,
> > SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> > CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> > future. I.e. there is nothing stopping higher or lower performance
> > devices being placed in the same Window. Compare that to ACPI memory
> > hotplug that just onlines additional capacity in the proximity domain
> > with little freedom for dynamic performance differentiation.
> >
> > That leaves the OS with a choice, should unpopulated window capacity
> > match the proximity domain of an existing region, or should it allocate
> > a new one? This patch takes the simple position of minimizing proximity
> > domain proliferation by reusing any proximity domain intersection for
> > the entire Window. If the Window has no intersections then allocate a
> > new proximity domain. Note that SRAT, SLIT and HMAT information can be
> > enumerated dynamically in a standard way from device provided data.
> > Think of CXL as the end of ACPI needing to describe memory attributes,
> > CXL offers a standard discovery model for performance attributes, but
> > Linux still needs to interoperate with the old regime.
> >
> > Reported-by: Derick Marks <[email protected]>
> > Suggested-by: Dan Williams <[email protected]>
> > Signed-off-by: Alison Schofield <[email protected]>
> > Tested-by: Derick Marks <[email protected]>
> [..]
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..fa82141d1a04 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> [..]
> > @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
> > return nid;
> > }
> > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > +
> > +static int __init cmp_memblk(const void *a, const void *b)
> > +{
> > + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> > + const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> > +
> > + if (ma->start != mb->start)
> > + return (ma->start < mb->start) ? -1 : 1;
> > +
> > + /* Caller handles duplicate start addresses */
> > + return 0;
>
> This can be simplified to:
>
> static int __init cmp_memblk(const void *a, const void *b)
> {
> const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> const struct numa_memblk *mb = *(const struct numa_memblk **)b;
>
> return ma->start - mb->start;
> }

Got it.

>
> > +}
> > +
> > +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> > +
> > +/**
> > + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> > + * @start: address to begin fill
> > + * @end: address to end fill
> > + *
> > + * Find and extend numa_meminfo memblks to cover the @start-@end
> > + * HPA address range, such that the first memblk includes @start,
> > + * the last memblk includes @end, and any gaps in between are
> > + * filled.
> > + *
> > + * RETURNS:
> > + * 0 : Success
> > + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> > + */
> > +
> > +int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > + struct numa_memblk **blk = &numa_memblk_list[0];
> > + struct numa_meminfo *mi = &numa_meminfo;
> > + int count = 0;
> > + u64 prev_end;
> > +
> > + /*
> > + * Create a list of pointers to numa_meminfo memblks that
> > + * overlap start, end. Exclude (start == bi->end) since
> > + * end addresses in both a CFMWS range and a memblk range
> > + * are exclusive.
> > + *
> > + * This list of pointers is used to make in-place changes
> > + * that fill out the numa_meminfo memblks.
> > + */
>
> Thanks for this comment, looks good.
>
> > + for (int i = 0; i < mi->nr_blks; i++) {
> > + struct numa_memblk *bi = &mi->blk[i];
> > +
> > + if (start < bi->end && end >= bi->start) {
> > + blk[count] = &mi->blk[i];
> > + count++;
> > + }
> > + }
> > + if (!count)
> > + return NUMA_NO_MEMBLK;
> > +
> > + /* Sort the list of pointers in memblk->start order */
> > + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> > +
> > + /* Make sure the first/last memblks include start/end */
> > + blk[0]->start = min(blk[0]->start, start);
> > + blk[count - 1]->end = max(blk[count - 1]->end, end);
> > +
> > + /*
> > + * Fill any gaps by tracking the previous memblk's end address,
> > + * prev_end, and backfilling to it if needed. Avoid filling
> > + * overlapping memblks by making prev_end monotonically non-
> > + * decreasing.
>
> I am not immediately understanding the use of the term monotonically
> non-decreasing here? I think the first sentence of this comment is
> enough, or am I missing a nuance?

Not sure it's a nuance, but if we advanced prev_end to curr->end at
each iteration, gaps would get needlessly filled when curr is wholly
contained in prev. So the 'monotonically non-decreasing' comment is
emphasizing that prev_end will either stay the same or increase at
each iteration.

>
> > + */
> > + prev_end = blk[0]->end;
> > + for (int i = 1; i < count; i++) {
> > + struct numa_memblk *curr = blk[i];
> > +
> > + if (prev_end >= curr->start) {
> > + if (prev_end < curr->end)
> > + prev_end = curr->end;
> > + } else {
> > + curr->start = prev_end;
> > + prev_end = curr->end;
> > + }
> > + }
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(numa_fill_memblks);
>
> This export is not needed. The only caller of this is
> drivers/acpi/numa/srat.c which is only ever built-in, not a module.

Got it.
Thanks for the review Dan and for helping address other reviewer
comments.

Alison



2023-06-14 17:17:46

by Alison Schofield

Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

On Wed, Jun 14, 2023 at 09:35:22AM +0200, Wilczynski, Michal wrote:
>
>
> On 6/14/2023 6:35 AM, [email protected] wrote:
> > From: Alison Schofield <[email protected]>
> >
> > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > over an HPA address range.
> >
> > The ACPI driver will use numa_fill_memblks() to implement a new Linux
> > policy that prescribes extending proximity domains in a portion of a
> > CFMWS window to the entire window.
> >
> > Dan Williams offered this explanation of the policy:
> > A CFMWS is an ACPI data structure that indicates *potential* locations
> > where CXL memory can be placed. It is the playground where the CXL
> > driver has free rein to establish regions. That space can be populated
> > by BIOS created regions, or driver created regions, after hotplug or
> > other reconfiguration.
> >
> > When BIOS creates a region in a CXL Window it additionally describes
> > that subset of the Window range in the other typical ACPI tables SRAT,
> > SLIT, and HMAT. The rationale for BIOS not pre-describing the entire
> > CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> > future. I.e. there is nothing stopping higher or lower performance
> > devices being placed in the same Window. Compare that to ACPI memory
> > hotplug that just onlines additional capacity in the proximity domain
> > with little freedom for dynamic performance differentiation.
> >
> > That leaves the OS with a choice, should unpopulated window capacity
> > match the proximity domain of an existing region, or should it allocate
> > a new one? This patch takes the simple position of minimizing proximity
> > domain proliferation by reusing any proximity domain intersection for
> > the entire Window. If the Window has no intersections then allocate a
> > new proximity domain. Note that SRAT, SLIT and HMAT information can be
> > enumerated dynamically in a standard way from device provided data.
> > Think of CXL as the end of ACPI needing to describe memory attributes,
> > CXL offers a standard discovery model for performance attributes, but
> > Linux still needs to interoperate with the old regime.
> >
> > Reported-by: Derick Marks <[email protected]>
> > Suggested-by: Dan Williams <[email protected]>
> > Signed-off-by: Alison Schofield <[email protected]>
> > Tested-by: Derick Marks <[email protected]>
> > ---
> > arch/x86/include/asm/sparsemem.h | 2 +
> > arch/x86/mm/numa.c | 87 ++++++++++++++++++++++++++++++++
> > include/linux/numa.h | 7 +++
> > 3 files changed, 96 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> > index 64df897c0ee3..1be13b2dfe8b 100644
> > --- a/arch/x86/include/asm/sparsemem.h
> > +++ b/arch/x86/include/asm/sparsemem.h
> > @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> > #define phys_to_target_node phys_to_target_node
> > extern int memory_add_physaddr_to_nid(u64 start);
> > #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > +extern int numa_fill_memblks(u64 start, u64 end);
> > +#define numa_fill_memblks numa_fill_memblks
> > #endif
> > #endif /* __ASSEMBLY__ */
> >
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..fa82141d1a04 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -11,6 +11,7 @@
> > #include <linux/nodemask.h>
> > #include <linux/sched.h>
> > #include <linux/topology.h>
> > +#include <linux/sort.h>
> >
> > #include <asm/e820/api.h>
> > #include <asm/proto.h>
> > @@ -961,4 +962,90 @@ int memory_add_physaddr_to_nid(u64 start)
> > return nid;
> > }
> > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > +
> > +static int __init cmp_memblk(const void *a, const void *b)
> > +{
> > + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> > + const struct numa_memblk *mb = *(const struct numa_memblk **)b;
>
> Is this casting necessary? ?

Thanks for the review, Michal.

I found the cast to be necessary.
Sort passes pointers to the array elements to compare, even if they
are already pointers, so cmp_ gets a double pointer.

>
> > +
> > + if (ma->start != mb->start)
> > + return (ma->start < mb->start) ? -1 : 1;
> > +
> > + /* Caller handles duplicate start addresses */
> > + return 0;
> > +}
> > +
> > +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> > +
> > +/**
> > + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> > + * @start: address to begin fill
> > + * @end: address to end fill
> > + *
> > + * Find and extend numa_meminfo memblks to cover the @start-@end
> > + * HPA address range, such that the first memblk includes @start,
> > + * the last memblk includes @end, and any gaps in between are
> > + * filled.
> > + *
> > + * RETURNS:
> > + * 0 : Success
> > + * NUMA_NO_MEMBLK : No memblk exists in @start-@end range
> > + */
> > +
> > +int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > + struct numa_memblk **blk = &numa_memblk_list[0];
> > + struct numa_meminfo *mi = &numa_meminfo;
> > + int count = 0;
> > + u64 prev_end;
> > +
> > + /*
> > + * Create a list of pointers to numa_meminfo memblks that
> > + * overlap start, end. Exclude (start == bi->end) since
> > + * end addresses in both a CFMWS range and a memblk range
> > + * are exclusive.
> > + *
> > + * This list of pointers is used to make in-place changes
> > + * that fill out the numa_meminfo memblks.
> > + */
> > + for (int i = 0; i < mi->nr_blks; i++) {
> > + struct numa_memblk *bi = &mi->blk[i];
> > +
> > + if (start < bi->end && end >= bi->start) {
> > + blk[count] = &mi->blk[i];
> > + count++;
> > + }
> > + }
> > + if (!count)
> > + return NUMA_NO_MEMBLK;
> > +
> > + /* Sort the list of pointers in memblk->start order */
> > + sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> > +
> > + /* Make sure the first/last memblks include start/end */
> > + blk[0]->start = min(blk[0]->start, start);
> > + blk[count - 1]->end = max(blk[count - 1]->end, end);
> > +
> > + /*
> > + * Fill any gaps by tracking the previous memblk's end address,
> > + * prev_end, and backfilling to it if needed. Avoid filling
> > + * overlapping memblks by making prev_end monotonically non-
> > + * decreasing.
> > + */
> > + prev_end = blk[0]->end;
> > + for (int i = 1; i < count; i++) {
> > + struct numa_memblk *curr = blk[i];
> > +
> > + if (prev_end >= curr->start) {
> > + if (prev_end < curr->end)
> > + prev_end = curr->end;
> > + } else {
> > + curr->start = prev_end;
> > + prev_end = curr->end;
> > + }
> > + }
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(numa_fill_memblks);
> > +
> > #endif
> > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > index 59df211d051f..0f512c0aba54 100644
> > --- a/include/linux/numa.h
> > +++ b/include/linux/numa.h
> > @@ -12,6 +12,7 @@
> > #define MAX_NUMNODES (1 << NODES_SHIFT)
> >
> > #define NUMA_NO_NODE (-1)
> > +#define NUMA_NO_MEMBLK (-1)
>
> Same error code as NUMA_NO_NODE ?
>

Yes. It's a define for convenience/clarity, rather than just using
(-1). I could have just used NUMA_NO_NODE, since no memblk also
means no node, but in a function whose job is to fill memblks, that
seemed wrong.

> >
> > /* optionally keep NUMA memory info available post init */
> > #ifdef CONFIG_NUMA_KEEP_MEMINFO
> > @@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
> > return 0;
> > }
> > #endif
> > +#ifndef numa_fill_memblks
>
> Why not just #ifndef CONFIG_NUMA_KEEP_MEMINFO ?

Dan responded to this, nothing to add to that:
This is due to the fact that multiple archs use
CONFIG_NUMA_KEEP_MEMINFO (x86, ARM64, LOONGARCH), but only one supplies
a numa_fill_memblks() implementation (x86).

>
> > +static inline int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > + return NUMA_NO_MEMBLK;
> > +}
> > +#endif
> > #else /* !CONFIG_NUMA */
> > static inline int numa_map_to_online_node(int node)
> > {
>

2023-06-15 08:46:30

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

On Tue, Jun 13, 2023 at 09:35:24PM -0700, [email protected] wrote:
> From: Alison Schofield <[email protected]>
>
> numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> over an HPA address range.

What's with the Host part, should that not simply be PA ?

2023-06-15 08:52:47

by Peter Zijlstra

Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

On Wed, Jun 14, 2023 at 05:27:39AM -0700, Dan Williams wrote:
> Wilczynski, Michal wrote:
> > On 6/14/2023 6:35 AM, [email protected] wrote:

> > > +static int __init cmp_memblk(const void *a, const void *b)
> > > +{
> > > + const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> > > + const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> >
> > Is this casting necessary? ?
>
> This is idiomatic for sort() comparison handlers.

Aside from that, it *is* actually required: since sort() does indirect
calls to its cmp_func_t argument, the Control Flow Integrity (CFI, not
to be confused with Call-Frame-Information) machinery has a hard
requirement that function signatures match.

At the very least clang builds should warn if you do indirect calls with
non-matching signatures these days. And kCFI enabled builds will get you
a runtime error if you manage to ignore that warning.

> > > +
> > > + if (ma->start != mb->start)
> > > + return (ma->start < mb->start) ? -1 : 1;
> > > +
> > > + /* Caller handles duplicate start addresses */
> > > + return 0;
> > > +}



2023-06-15 15:49:09

by Alison Schofield

Subject: Re: [PATCH v2 1/2] x86/numa: Introduce numa_fill_memblks()

On Thu, Jun 15, 2023 at 10:34:07AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 09:35:24PM -0700, [email protected] wrote:
> > From: Alison Schofield <[email protected]>
> >
> > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > over an HPA address range.
>
> What's with the Host part, should that not simply be PA ?

Yes, it should be PA.

The HPA acronym usage is CXL-land language, where we qualify
physical addresses as either HPAs (Host) or DPAs (Device),
needlessly seeping in here.