2024-02-20 20:26:03

by Gregory Price

[permalink] [raw]
Subject: [RFC 0/1] mm/mempolicy: weighted interleave system default weights

Weighted interleave added a sysfs interface for users to change
the interleave weights based on user input - with a default value
of `1` until reasonable system default code could be agreed upon.

This RFC series will suggest and solicit ideas for how to generate
these system defaults, and lay out some challenges in generating them.

Future work on the CXL driver (drivers/cxl) will introduce additional
code which registers HMAT information for hotplug memory provided
by CXL devices. This RFC does not presently provide that integration,
but will after it is upstream.


Interfaces introduced:
- mempolicy_set_node_perf
Called when HMAT data for a node is reported to the system

Integration points:
- node_set_perf_attrs - for reporting bandwidth info to mempolicy
- get_il_weight and weighted interleave allocation interfaces to
provide system defaults when applying weighted interleave.

New data in mempolicy:
- node_bw_table - cached bandwidth information about each node
- default_iw_table - the system default interleave weights


Note that because there are now multiple tables (default and sysfs),
the allocators fetch each weight individually, rather than via memcpy.
This means if weights change at runtime (extremely unlikely), the
allocators may temporarily see an "incorrect distribution" while the
system is being reweighted. This is not harmful (simply inaccurate)
and a result of providing a clean way to revert to the system default.


v1: Simple GCD reduction of basic bandwidth distribution.

Approach:
- whenever new coordinates are reported, recalculate all weights
- cache each node's min(read, write) bandwidth
- calculate the percentage each node's bandwidth is of the whole
- use GCD to reduce all percentages down to the minimum possible

The approach is simple and fast, and operates reasonably well if the
numbers reported by HMAT for each node happen to land on easily
reducible percentages. For example, a system presenting 88% of its
bandwidth on DRAM and 11% of its bandwidth on CXL (floored for
simplicity) will end up with default weights of (8:1), keeping each
assigned weight preferably small.
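
To make the arithmetic concrete, here is a minimal userspace model of
the reduction (a sketch only, not the kernel code; the bandwidth
numbers are invented to match the example above):

#include <stdio.h>

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = b;

		b = a % b;
		a = t;
	}
	return a;
}

int main(void)
{
	unsigned long node_bw[2] = { 800000, 100000 }; /* DRAM, CXL (MB/s) */
	unsigned long weights[2], ttl = 0, g;
	int i;

	for (i = 0; i < 2; i++)
		ttl += node_bw[i];
	/* percentage of total bandwidth, floored, minimum of 1 */
	for (i = 0; i < 2; i++) {
		weights[i] = 100 * node_bw[i] / ttl;
		if (!weights[i])
			weights[i] = 1;
	}
	/* reduce by GCD until no further reduction is possible */
	while ((g = gcd(weights[0], weights[1])) > 1) {
		weights[0] /= g;
		weights[1] /= g;
	}
	printf("weights: %lu:%lu\n", weights[0], weights[1]); /* 8:1 */
	return 0;
}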

The downside of this approach is that it is susceptible to prime and
co-prime numbers keeping interleave weights large (e.g. 89:11 vs 8:1).
We prefer finer grained interleaves to prevent large swaths of
contiguous memory from landing on the same device.

Additionally, this hides the fact that multi-socket systems
experience chokepoints across sockets. For example a 2-socket
system with 200GB/s on each socket from DDR does not mean a given
socket has an aggregate of 400GB/s of bandwidth. Interconnects between
sockets provide less aggregate bandwidth than the DDR they provide
access to (e.g. 3 UPI lanes vs 8 DDR channels).

So this approach will reduce multi-socket interleave weights to (1:1)
by default if all sockets provide the same bandwidth.

Signed-off-by: Gregory Price <[email protected]>

Gregory Price (1):
mm/mempolicy: introduce system default interleave weights

drivers/acpi/numa/hmat.c | 1 +
drivers/base/node.c | 7 +++
include/linux/mempolicy.h | 4 ++
mm/mempolicy.c | 129 ++++++++++++++++++++++++++++++--------
4 files changed, 116 insertions(+), 25 deletions(-)

--
2.39.1



2024-02-20 20:26:08

by Gregory Price

[permalink] [raw]
Subject: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Startup and hotplug code may register HMAT data for memory
devices. Utilize this data to generate reasonable default
weighted interleave values.

Introduce `mempolicy_set_node_perf()`, a function which can
be invoked from node and CXL code to have mempolicy rebalance
the system default interleave weights.

mempolicy_set_node_perf() caches each node's bandwidth (in
this patch: min(read_bw, write_bw)) and recalculates the weight
associated with each node. After weights are calculated, we use
gcd() to reduce these weights to the smallest values possible in
an effort to more aggressively interleave on smaller intervals.

For example, a 1-socket system whose DRAM and CXL memory expander
expose 224GB/s and 64GB/s of bandwidth respectively will end
up with a weight array of [7,2].
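
Concretely, with the percentage formula used below:

  224 + 64 = 288 GB/s total
  floor(100 * 224 / 288) = 77, floor(100 * 64 / 288) = 22
  gcd(77, 22) = 11, so [77, 22] reduces to [7, 2]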

The downside of this approach is that some systems may
experience large default values if they happen to have a bandwidth
distribution that includes an unfortunate prime number, or if
any two values are co-prime.

Signed-off-by: Gregory Price <[email protected]>
---
drivers/acpi/numa/hmat.c | 1 +
drivers/base/node.c | 7 +++
include/linux/mempolicy.h | 4 ++
mm/mempolicy.c | 129 ++++++++++++++++++++++++++++++--------
4 files changed, 116 insertions(+), 25 deletions(-)

diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index d6b85f0f6082..7935d387e001 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@
#include <linux/list_sort.h>
#include <linux/memregion.h>
#include <linux/memory.h>
+#include <linux/mempolicy.h>
#include <linux/mutex.h>
#include <linux/node.h>
#include <linux/sysfs.h>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1c05640461dd..30458df504b4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/memory.h>
+#include <linux/mempolicy.h>
#include <linux/vmstat.h>
#include <linux/notifier.h>
#include <linux/node.h>
@@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
break;
}
}
+
+ /* When setting CPU access coordinates, update mempolicy */
+ if (access == ACCESS_COORDINATE_CPU) {
+ if (mempolicy_set_node_perf(nid, coord))
+ pr_info("failed to set node%d mempolicy attrs\n", nid);
+ }
}

/**
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..d564e9e893ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
+#include <linux/node.h>
#include <linux/nodemask.h>
#include <linux/pagemap.h>
#include <uapi/linux/mempolicy.h>
@@ -177,6 +178,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)

extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);

+extern int mempolicy_set_node_perf(unsigned int node,
+ struct access_coordinate *coords);
+
#else

struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ba0b2b81bd08..0a82aa51e497 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
#include <linux/mmu_notifier.h>
#include <linux/printk.h>
#include <linux/swapops.h>
+#include <linux/gcd.h>

#include <asm/tlbflush.h>
#include <asm/tlb.h>
@@ -139,31 +140,114 @@ static struct mempolicy default_policy = {
static struct mempolicy preferred_node_policy[MAX_NUMNODES];

/*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
+ * The interleave weight tables denote what weights should be used with
+ * the weighted interleave policy. There are two tables:
+ * - iw_table : the sysfs-set interleave weight table
+ * - default_iw_table : the system default interleave weight table.
*
- * iw_table is RCU protected
+ * If the iw_table is NULL, default_iw_table values are used.
+ * If both tables are NULL, a minimum weight of 1 is always used.
+ * A value of 0 in the iw_table means the system default value will be used.
+ *
+ * iw_table and default_iw_table are RCU protected
+ * node_bw_table is protected by default_iwt_lock
+ *
+ * system startup and hotplug code may register node performance information
+ * via mempolicy_set_node_perf()
*/
+static unsigned long *node_bw_table;
+static u8 __rcu *default_iw_table;
+static DEFINE_MUTEX(default_iwt_lock);
+
static u8 __rcu *iw_table;
static DEFINE_MUTEX(iw_table_lock);

static u8 get_il_weight(int node)
{
- u8 *table;
+ u8 *table, *default_table;
u8 weight;

rcu_read_lock();
table = rcu_dereference(iw_table);
- /* if no iw_table, use system default */
- weight = table ? table[node] : 1;
- /* if value in iw_table is 0, use system default */
- weight = weight ? weight : 1;
+ default_table = rcu_dereference(default_iw_table);
+ /* if no table pointers or value is 0, use system default or 1 */
+ weight = table ? table[node] : 0;
+ weight = weight ? weight : (default_table ? default_table[node] : 1);
rcu_read_unlock();
return weight;
}

+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+ unsigned long *old_bw, *new_bw;
+ unsigned long gcd_val;
+ u8 *old_iw, *new_iw;
+ uint64_t ttl_bw = 0;
+ int i;
+
+ new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
+ if (!new_bw)
+ return -ENOMEM;
+
+ new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+ if (!new_iw) {
+ kfree(new_bw);
+ return -ENOMEM;
+ }
+
+ mutex_lock(&default_iwt_lock);
+ old_bw = node_bw_table;
+ old_iw = rcu_dereference_protected(default_iw_table,
+ lockdep_is_held(&default_iwt_lock));
+
+ if (old_bw)
+ memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
+ new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
+
+ /* Now recalculate the bandwidth distribution given the new info */
+ for (i = 0; i < nr_node_ids; i++)
+ ttl_bw += new_bw[i];
+
+ /* If node is not set or has < 1% of total bw, use minimum value of 1 */
+ for (i = 0; i < nr_node_ids; i++) {
+ if (new_bw[i])
+ new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
+ else
+ new_iw[i] = 1;
+ }
+ /*
+ * Now attempt to aggressively reduce the interleave weights by GCD
+ * We want smaller interleave intervals to have a better distribution
+ * of memory, even on smaller memory regions. If weights are divisible
+ * by each other, we can do some quick math to aggressively squash them.
+ */
+reduce:
+ gcd_val = new_iw[i];
+ for (i = 0; i < nr_node_ids; i++) {
+ /* Skip nodes that haven't been set */
+ if (!new_bw[i])
+ continue;
+ gcd_val = gcd(gcd_val, new_iw[i]);
+ if (gcd_val == 1)
+ goto leave;
+ }
+ for (i = 0; i < nr_node_ids; i++) {
+ if (!new_bw[i])
+ continue;
+ new_iw[i] /= gcd_val;
+ }
+ /* repeat until we get a gcd of 1 */
+ goto reduce;
+leave:
+ node_bw_table = new_bw;
+ rcu_assign_pointer(default_iw_table, new_iw);
+ mutex_unlock(&default_iwt_lock);
+ synchronize_rcu();
+ kfree(old_bw);
+ kfree(old_iw);
+ return 0;
+}
+
/**
* numa_nearest_node - Find nearest node by state
* @node: Node id to start the search
@@ -1983,7 +2067,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
{
nodemask_t nodemask;
unsigned int target, nr_nodes;
- u8 *table;
+ u8 *table, *default_table;
unsigned int weight_total = 0;
u8 weight;
int nid;
@@ -1994,11 +2078,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)

rcu_read_lock();
table = rcu_dereference(iw_table);
+ default_table = rcu_dereference(default_iw_table);
/* calculate the total weight */
for_each_node_mask(nid, nodemask) {
/* detect system default usage */
- weight = table ? table[nid] : 1;
- weight = weight ? weight : 1;
+ weight = table ? table[nid] : 0;
+ weight = weight ? weight :
+ (default_table ? default_table[nid] : 1);
weight_total += weight;
}

@@ -2007,8 +2093,9 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
nid = first_node(nodemask);
while (target) {
/* detect system default usage */
- weight = table ? table[nid] : 1;
- weight = weight ? weight : 1;
+ weight = table ? table[nid] : 0;
+ weight = weight ? weight :
+ (default_table ? default_table[nid] : 1);
if (target < weight)
break;
target -= weight;
@@ -2391,7 +2478,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
unsigned long nr_allocated = 0;
unsigned long rounds;
unsigned long node_pages, delta;
- u8 *table, *weights, weight;
+ u8 *weights, weight;
unsigned int weight_total = 0;
unsigned long rem_pages = nr_pages;
nodemask_t nodes;
@@ -2440,16 +2527,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
if (!weights)
return total_allocated;

- rcu_read_lock();
- table = rcu_dereference(iw_table);
- if (table)
- memcpy(weights, table, nr_node_ids);
- rcu_read_unlock();
-
- /* calculate total, detect system default usage */
for_each_node_mask(node, nodes) {
- if (!weights[node])
- weights[node] = 1;
+ weights[node] = get_il_weight(node);
weight_total += weights[node];
}

--
2.39.1


2024-02-22 07:12:20

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Hi, Gregory,

Thanks a lot for working on this!

Gregory Price <[email protected]> writes:

> Startup and hotplug code may register HMAT data for memory
> devices. Utilize this data to generate reasonable default
> weighted interleave values.
>
> Introduce `mempolicy_set_node_perf()`, a function which can
> be invoked from node and CXL code to have mempolicy rebalance
> the system default interleave weights.
>
> mempolicy_set_node_perf() caches each node's bandwidth (in
> this patch: min(read_bw, write_bw)) and recalculates the weight
> associated with each node. After weights are calculated, we use
> gcd() to reduce these weights to the smallest values possible in
> an effort to more aggressively interleave on smaller intervals.
>
> For example, a 1-socket system whose DRAM and CXL memory expander
> expose 224GB/s and 64GB/s of bandwidth respectively will end
> up with a weight array of [7,2].
>
> The downside of this approach is that some systems may
> experience large default values if they happen to have a bandwidth
> distribution that includes an unfortunate prime number, or if
> any two values are co-prime.
>
> Signed-off-by: Gregory Price <[email protected]>
> ---
> drivers/acpi/numa/hmat.c | 1 +
> drivers/base/node.c | 7 +++
> include/linux/mempolicy.h | 4 ++
> mm/mempolicy.c | 129 ++++++++++++++++++++++++++++++--------
> 4 files changed, 116 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index d6b85f0f6082..7935d387e001 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -20,6 +20,7 @@
> #include <linux/list_sort.h>
> #include <linux/memregion.h>
> #include <linux/memory.h>
> +#include <linux/mempolicy.h>
> #include <linux/mutex.h>
> #include <linux/node.h>
> #include <linux/sysfs.h>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 1c05640461dd..30458df504b4 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
> #include <linux/init.h>
> #include <linux/mm.h>
> #include <linux/memory.h>
> +#include <linux/mempolicy.h>
> #include <linux/vmstat.h>
> #include <linux/notifier.h>
> #include <linux/node.h>
> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> break;
> }
> }
> +
> + /* When setting CPU access coordinates, update mempolicy */
> + if (access == ACCESS_COORDINATE_CPU) {
> + if (mempolicy_set_node_perf(nid, coord))
> + pr_info("failed to set node%d mempolicy attrs\n", nid);
> + }
> }
>
> /**
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..d564e9e893ea 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
> #include <linux/slab.h>
> #include <linux/rbtree.h>
> #include <linux/spinlock.h>
> +#include <linux/node.h>
> #include <linux/nodemask.h>
> #include <linux/pagemap.h>
> #include <uapi/linux/mempolicy.h>
> @@ -177,6 +178,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>
> extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>
> +extern int mempolicy_set_node_perf(unsigned int node,
> + struct access_coordinate *coords);
> +
> #else
>
> struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ba0b2b81bd08..0a82aa51e497 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/printk.h>
> #include <linux/swapops.h>
> +#include <linux/gcd.h>
>
> #include <asm/tlbflush.h>
> #include <asm/tlb.h>
> @@ -139,31 +140,114 @@ static struct mempolicy default_policy = {
> static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>
> /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> + * The interleave weight tables denote what weights should be used with
> + * the weighted interleave policy. There are two tables:
> + * - iw_table : the sysfs-set interleave weight table
> + * - default_iw_table : the system default interleave weight table.
> *
> - * iw_table is RCU protected
> + * If the iw_table is NULL, default_iw_table values are used.
> + * If both tables are NULL, a minimum weight of 1 is always used.
> + * A value of 0 in the iw_table means the system default value will be used.
> + *
> + * iw_table and default_iw_table are RCU protected
> + * node_bw_table is protected by default_iwt_lock
> + *
> + * system startup and hotplug code may register node performance information
> + * via mempolicy_set_node_perf()
> */
> +static unsigned long *node_bw_table;
> +static u8 __rcu *default_iw_table;
> +static DEFINE_MUTEX(default_iwt_lock);
> +
> static u8 __rcu *iw_table;
> static DEFINE_MUTEX(iw_table_lock);
>
> static u8 get_il_weight(int node)
> {
> - u8 *table;
> + u8 *table, *default_table;
> u8 weight;
>
> rcu_read_lock();
> table = rcu_dereference(iw_table);
> - /* if no iw_table, use system default */
> - weight = table ? table[node] : 1;
> - /* if value in iw_table is 0, use system default */
> - weight = weight ? weight : 1;
> + default_table = rcu_dereference(default_iw_table);
> + /* if no table pointers or value is 0, use system default or 1 */
> + weight = table ? table[node] : 0;
> + weight = weight ? weight : (default_table ? default_table[node] : 1);
> rcu_read_unlock();
> return weight;
> }
>
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> + unsigned long *old_bw, *new_bw;
> + unsigned long gcd_val;
> + u8 *old_iw, *new_iw;
> + uint64_t ttl_bw = 0;
> + int i;
> +
> + new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> + if (!new_bw)
> + return -ENOMEM;

We only use "node_bw_table" in this function with "default_iwt_lock"
held. So, we don't need to copy-on-write? Just change in place?

> + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> + if (!new_iw) {
> + kfree(new_bw);
> + return -ENOMEM;
> + }
> +
> + mutex_lock(&default_iwt_lock);
> + old_bw = node_bw_table;
> + old_iw = rcu_dereference_protected(default_iw_table,
> + lockdep_is_held(&default_iwt_lock));
> +
> + if (old_bw)
> + memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
> + new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);

We need to compress two bandwidth data into one. The possible choice
could be,

- min(coords->read_bandwidth, coords->write_bandwidth), that is, your code

- coords->read_bandwidth + coords->write_bandwidth

I don't know which one is better. Do you have some use cases to help to
determine which one is better?
> +
> + /* Now recalculate the bandwidth distribution given the new info */
> + for (i = 0; i < nr_node_ids; i++)
> + ttl_bw += new_bw[i];
> +
> + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
> + for (i = 0; i < nr_node_ids; i++) {
> + if (new_bw[i])
> + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
> + else
> + new_iw[i] = 1;

If we lack performance data for some node, it will use "1" as the
default weight. It doesn't look like the best solution to me. How about
using the average available bandwidth to calculate the default weight?
Or use the memory bandwidth of node 0 if performance data is missing?

> + }
> + /*
> + * Now attempt to aggressively reduce the interleave weights by GCD
> + * We want smaller interleave intervals to have a better distribution
> + * of memory, even on smaller memory regions. If weights are divisible
> + * by each other, we can do some quick math to aggressively squash them.
> + */
> +reduce:
> + gcd_val = new_iw[i];

"i" will be "nr_node_ids" in the first loop. Right?

> + for (i = 0; i < nr_node_ids; i++) {
> + /* Skip nodes that haven't been set */
> + if (!new_bw[i])
> + continue;
> + gcd_val = gcd(gcd_val, new_iw[i]);
> + if (gcd_val == 1)
> + goto leave;
> + }
> + for (i = 0; i < nr_node_ids; i++) {
> + if (!new_bw[i])
> + continue;
> + new_iw[i] /= gcd_val;
> + }
> + /* repeat until we get a gcd of 1 */
> + goto reduce;
> +leave:
> + node_bw_table = new_bw;
> + rcu_assign_pointer(default_iw_table, new_iw);
> + mutex_unlock(&default_iwt_lock);
> + synchronize_rcu();
> + kfree(old_bw);
> + kfree(old_iw);
> + return 0;
> +}
> +
> /**
> * numa_nearest_node - Find nearest node by state
> * @node: Node id to start the search
> @@ -1983,7 +2067,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> {
> nodemask_t nodemask;
> unsigned int target, nr_nodes;
> - u8 *table;
> + u8 *table, *default_table;
> unsigned int weight_total = 0;
> u8 weight;
> int nid;
> @@ -1994,11 +2078,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>
> rcu_read_lock();
> table = rcu_dereference(iw_table);
> + default_table = rcu_dereference(default_iw_table);
> /* calculate the total weight */
> for_each_node_mask(nid, nodemask) {
> /* detect system default usage */
> - weight = table ? table[nid] : 1;
> - weight = weight ? weight : 1;
> + weight = table ? table[nid] : 0;
> + weight = weight ? weight :
> + (default_table ? default_table[nid] : 1);

This becomes long. I think that we should define a helper function for this.

> weight_total += weight;
> }
>
> @@ -2007,8 +2093,9 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> nid = first_node(nodemask);
> while (target) {
> /* detect system default usage */
> - weight = table ? table[nid] : 1;
> - weight = weight ? weight : 1;
> + weight = table ? table[nid] : 0;
> + weight = weight ? weight :
> + (default_table ? default_table[nid] : 1);
> if (target < weight)
> break;
> target -= weight;
> @@ -2391,7 +2478,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> unsigned long nr_allocated = 0;
> unsigned long rounds;
> unsigned long node_pages, delta;
> - u8 *table, *weights, weight;
> + u8 *weights, weight;
> unsigned int weight_total = 0;
> unsigned long rem_pages = nr_pages;
> nodemask_t nodes;
> @@ -2440,16 +2527,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> if (!weights)
> return total_allocated;
>
> - rcu_read_lock();
> - table = rcu_dereference(iw_table);
> - if (table)
> - memcpy(weights, table, nr_node_ids);
> - rcu_read_unlock();
> -
> - /* calculate total, detect system default usage */
> for_each_node_mask(node, nodes) {
> - if (!weights[node])
> - weights[node] = 1;
> + weights[node] = get_il_weight(node);
> weight_total += weights[node];
> }

--
Best Regards,
Huang, Ying

2024-02-23 05:47:42

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

On Thu, Feb 22, 2024 at 03:10:11PM +0800, Huang, Ying wrote:
> Hi, Gregory,
>
> Thanks a lot for working on this!
>

It's worth doing :]

> > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> > + if (!new_bw)
> > + return -ENOMEM;
>
> We only use "node_bw_table" in this function with "default_iwt_lock"
> held. So, we don't need to copy-on-write? Just change in place?
>

I'd originally planned to add a sysfs entry for the data, which would
have added RCU to this, but I realized it's just duplicating the
node/accessX/initiator information, so I'll rip this out and just do
in-place changes.

> > + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> > + if (!new_iw) {
> > + kfree(new_bw);
> > + return -ENOMEM;
> > + }
> > +
> > + mutex_lock(&default_iwt_lock);
> > + old_bw = node_bw_table;
> > + old_iw = rcu_dereference_protected(default_iw_table,
> > + lockdep_is_held(&default_iwt_lock));
> > +
> > + if (old_bw)
> > + memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
> > + new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
>
> We need to compress two bandwidth data into one. The possible choice
> could be,
>
> - min(coords->read_bandwidth, coords->write_bandwidth), that is, your code
>
> - coords->read_bandwidth + coords->write_bandwidth
>
> I don't know which one is better. Do you have some use cases to help to
> determine which one is better?

More generally: Are either read_bandwidth or write_bandwidth values
even worth trusting as-is? Should they be combined? Averaged?
Minimumed? Maximumed? Should they be floored to some reasonably round
number? These are the comments I'm hoping to garner :].

I've also considered maybe adding a weighted_interleave/read_write_ratio
sysfs entry that informs the system on how to treat the incoming
numbers. This would require us to cache more information, obviously.

I have limited access to hardware, but here is one datum from an Intel
platform w/ a sample CXL memory expander.

# DRAM on node0
cat /sys/bus/node/devices/node0/access0/initiators/*bandwidth
262100 < read
176100 < write

Notice the 90GB/s difference between read and write, and the trailing
100! That doesn't look to be good for a clean reduction :[

# CXL 1.1 device on node2
cat /sys/bus/node/devices/node2/access0/initiators/*bandwidth
60000 < read
60000 < write

These are pretty un-even distributions, and we may want to consider
forcing numbers to be a little more round - otherwise we're doomed to
just end up with whatever the ~/100 value is. Or we need to come up with
some reduction that gets us down to reasonable small interleave values.

In this scenario, we end up with:
>>> 60000+176100
236100
>>> 60000/236100
0.25412960609911056
>>> 176100/236100
0.7458703939008895

Which turns into 25:74 if you just round down, or 25:75 if you round up.

The problem is that any heuristic you come up with for rounding out the
bandwidth values is bound to have degenerate results. What happens if I
add another CXL memory expander? What happens with 2DPC? etc.

I wanted to collect some thoughts on this. I'm not sure what the best
"General" approach would be, and we may need some more data from people
with access to more varied hardware.

Maybe worth submitting to LSF/MM for a quick discussion, but I think
we'll need some help figuring this one out.

> > +
> > + /* Now recalculate the bandwidth distribution given the new info */
> > + for (i = 0; i < nr_node_ids; i++)
> > + ttl_bw += new_bw[i];
> > +
> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
> > + for (i = 0; i < nr_node_ids; i++) {
> > + if (new_bw[i])
> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
> > + else
> > + new_iw[i] = 1;
>
> If we lack performance data for some node, it will use "1" as the
> default weight. It doesn't look like the best solution to me. How about
> using the average available bandwidth to calculate the default weight?
> Or use the memory bandwidth of node 0 if performance data is missing?
>

If we lack performance data for a node, it's one of 3 cases

1) The device did not provide HMAT information
2) We did not integrate that driver into the system yet.
3) The node is not online yet (therefore the data hasn't been reported)

#2 and #3 are not issues, the only real issue is #1.

In this scenario, I'm not sure what to do. We must have a non-0 value
for that device (to avoid div-by-0), but setting an arbitrarily large
value also seems bad.

My thought was that if you are using weighted interleave, you're
probably already pretty confident that your environment is reasonably
sane - i.e. HMAT is providing the values.

> > + }
> > + /*
> > + * Now attempt to aggressively reduce the interleave weights by GCD
> > + * We want smaller interleave intervals to have a better distribution
> > + * of memory, even on smaller memory regions. If weights are divisible
> > + * by each other, we can do some quick math to aggressively squash them.
> > + */
> > +reduce:
> > + gcd_val = new_iw[i];
>
> "i" will be "nr_node_ids" in the first loop. Right?
>

ah good catch, this should be new_iw[node_being_updated], will update

> > - weight = table ? table[nid] : 1;
> > - weight = weight ? weight : 1;
> > + weight = table ? table[nid] : 0;
> > + weight = weight ? weight :
> > + (default_table ? default_table[nid] : 1);
>
> This becomes long. I think that we should define a helper function for this.

Maybe? I didn't bother since it's not replicated elsewhere. It does
look similar to the helper function which fetches the node weight, but
that function itself calls rcu_read_lock() since it is called from the
bulk allocator.

I think probably some more thought could be put into this interaction,
this was just a first pass. Certainly could be improved.
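
Something like this, perhaps (untested sketch; assumes the caller is
already inside rcu_read_lock() and passes in both dereferenced table
pointers):

static u8 __get_il_weight(u8 *table, u8 *default_table, int nid)
{
	u8 weight;

	weight = table ? table[nid] : 0;
	return weight ? weight : (default_table ? default_table[nid] : 1);
}

That would at least keep the three-way fallback in one place.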

Thanks for the feedback, will chew on it a bit. Let me know your
thoughts on the bandwidth examples above if you have any.

~Gregory

2024-02-23 09:20:10

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Gregory Price <[email protected]> writes:

> On Thu, Feb 22, 2024 at 03:10:11PM +0800, Huang, Ying wrote:
>> Hi, Gregory,
>>
>> Thanks a lot for working on this!
>>
>
> It's worth doing :]
>
>> > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
>> > + if (!new_bw)
>> > + return -ENOMEM;
>>
>> We only use "node_bw_table" in this function with "default_iwt_lock"
>> held. So, we don't need to copy-on-write? Just change in place?
>>
>
> I'd originally planned to add a sysfs entry for the data, which would
> have added RCU to this, but I realized it's just duplicating the
> node/accessX/initiator information, so I'll rip this out and just do
> in-place changes.
>
>> > + new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
>> > + if (!new_iw) {
>> > + kfree(new_bw);
>> > + return -ENOMEM;
>> > + }
>> > +
>> > + mutex_lock(&default_iwt_lock);
>> > + old_bw = node_bw_table;
>> > + old_iw = rcu_dereference_protected(default_iw_table,
>> > + lockdep_is_held(&default_iwt_lock));
>> > +
>> > + if (old_bw)
>> > + memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
>> > + new_bw[node] = min(coords->read_bandwidth, coords->write_bandwidth);
>>
>> We need to compress two bandwidth data into one. The possible choice
>> could be,
>>
>> - min(coords->read_bandwidth, coords->write_bandwidth), that is, your code
>>
>> - coords->read_bandwidth + coords->write_bandwidth
>>
>> I don't know which one is better. Do you have some use cases to help to
>> determine which one is better?
>
> More generally: Are either read_bandwidth or write_bandwidth values
> even worth trusting as-is? Should they be combined? Averaged?
> Minimumed? Maximumed? Should they be floored to some reasonably round
> number? These are the comments I'm hoping to garner :].
>
> I've also considered maybe adding a weighted_interleave/read_write_ratio
> sysfs entry that informs the system on how to treat the incoming
> numbers. This would require us to cache more information, obviously.
>
> I have limited access to hardware, but here is one datum from an Intel
> platform w/ a sample CXL memory expander.
>
> # DRAM on node0
> cat /sys/bus/node/devices/node0/access0/initiators/*bandwidth
> 262100 < read
> 176100 < write
>
> Notice the 90GB/s difference between read and write, and the trailing
> 100! That doesn't look to be good for a clean reduction :[
>
> # CXL 1.1 device on node2
> cat /sys/bus/node/devices/node2/access0/initiators/*bandwidth
> 60000 < read
> 60000 < write
>
> These are pretty un-even distributions, and we may want to consider
> forcing numbers to be a little more round - otherwise we're doomed to
> just end up with whatever the ~/100 value is. Or we need to come up with
> some reduction that gets us down to reasonable small interleave values.
>
> In this scenario, we end up with:
>>>> 60000+176100
> 236100
>>>> 60000/236100
> 0.25412960609911056
>>>> 176100/236100
> 0.7458703939008895
>
> Which turns into 25:74 if you just round down, or 25:75 if you round up.
>
> The problem is that any heuristic you come up with for rounding out the
> bandwidth values is bound to have degenerate results. What happens if I
> add another CXL memory expander? What happens with 2DPC? etc.
>
> I wanted to collect some thoughts on this. I'm not sure what the best
> "General" approach would be, and we may need some more data from people
> with access to more varied hardware.
>
> Maybe worth submitting to LSF/MM for a quick discussion, but I think
> we'll need some help figuring this one out.
>
>> > +
>> > + /* Now recalculate the bandwidth distribution given the new info */
>> > + for (i = 0; i < nr_node_ids; i++)
>> > + ttl_bw += new_bw[i];
>> > +
>> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
>> > + for (i = 0; i < nr_node_ids; i++) {
>> > + if (new_bw[i])
>> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);

IIUC, the sum of interleave weights of all nodes will be 100. If there
are more than 100 nodes in the system, this doesn't work properly. How
about using some fixed number like "16" for DRAM nodes?

>> > + else
>> > + new_iw[i] = 1;
>>
>> If we lack performance data for some node, it will use "1" as the
>> default weight. It doesn't look like the best solution to me. How about
>> using the average available bandwidth to calculate the default weight?
>> Or use the memory bandwidth of node 0 if performance data is missing?
>>
>
> If we lack performance data for a node, it's one of 3 cases
>
> 1) The device did not provide HMAT information
> 2) We did not integrate that driver into the system yet.
> 3) The node is not online yet (therefore the data hasn't been reported)
>
> #2 and #3 are not issues, the only real issue is #1.
>
> In this scenario, I'm not sure what to do. We must have a non-0 value
> for that device (to avoid div-by-0), but setting an arbitrarily large
> value also seems bad.

I think that it's kind of reasonable to use DRAM bandwidth for devices
without data. If there are only DRAM nodes and nodes without data, this
will make the interleave weight "1".

> My thought was that if you are using weighted interleave, you're
> probably already pretty confident that your environment is reasonably
> sane - i.e. HMAT is providing the values.
>
>> > + }
>> > + /*
>> > + * Now attempt to aggressively reduce the interleave weights by GCD
>> > + * We want smaller interleave intervals to have a better distribution
>> > + * of memory, even on smaller memory regions. If weights are divisible
>> > + * by each other, we can do some quick math to aggressively squash them.
>> > + */
>> > +reduce:
>> > + gcd_val = new_iw[i];
>>
>> "i" will be "nr_node_ids" in the first loop. Right?
>>
>
> ah good catch, this should be new_iw[node_being_updated], will update
>
>> > - weight = table ? table[nid] : 1;
>> > - weight = weight ? weight : 1;
>> > + weight = table ? table[nid] : 0;
>> > + weight = weight ? weight :
>> > + (default_table ? default_table[nid] : 1);
>>
>> This becomes long. I think that we should define a helper function for this.
>
> Maybe? I didn't bother since it's not replicated elsewhere. It does
> look similar to the helper function which fetches the node weight, but
> that function itself calls rcu_read_lock() since it is called from the
> bulk allocator.
>
> I think probably some more thought could be put into this interaction,
> this was just a first pass. Certainly could be improved.
>
> Thanks for the feedback, will chew on it a bit. Let me know your
> thoughts on the bandwidth examples above if you have any.

--
Best Regards,
Huang, Ying

2024-02-26 14:32:36

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights


On Fri, Feb 23, 2024 at 05:11:23PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>

(sorry for the re-send, error replying to list)

> >> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
> >> > + for (i = 0; i < nr_node_ids; i++) {
> >> > + if (new_bw[i])
> >> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
>
> IIUC, the sum of interleave weights of all nodes will be 100. If there
> are more than 100 nodes in the system, this doesn't work properly. How
> about using some fixed number like "16" for DRAM nodes?
>

I suppose we could add a "type" value into the interface that says
what approximate "tier" a node is in, or we could ask the tiering
component for that information. But what does this actually change?

You still calculate the percentage of bandwidth provided by each node,
and then just apply that to the larger default number. I don't see the
point in that - if each node provides less than 1% of the overall system
bandwidth, then larger numbers won't do much. In fact, we want smaller
numbers to spread spatially local data out more aggressively.

More important question: In what world is a large numa system liable
to use this interface to any real benefit?


I'd briefly considered this, but I strayed away from supporting that
case. Probably worth documenting, at the very least.

We had the cross-socket interleave discussion previously in the prior
series. The question above simplifies (complicates?) to: How useful
is interleave (weighted or not) in cross-socket workloads.

Consider the following configuration:


--------- A -------- C -------- D ---------
| DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
--------- -------- -------- ---------
| B | E
-------- --------
| cxl0 | | cxl1 |
-------- --------

Theoretical throughputs

A&D: 512GB/s (8 channel DDR5)
B&E: 64GB/s (1 CXL/PCIe5 link)
C : 62.4GB/s (3x UPI links)

Where are the 100 nodes coming from?

If it's across interconnects (UPI), then the throughput to remote
DRAM is better described by C, not A or D. However, we don't have
that information (maybe we should?). More importantly... is
interleaving across these links even useful? I suppose if you did
sub-numa clustering stuff and had an ultra-super-numa-aware piece
of software capable of keeping certain chunks of memory in certain
cores that might be useful.... but then you probably actually want
task-local weights as opposed to using the system default.

Otherwise, does a UPI link actually get the full throughput? Probably
only if the remote memory bus is unloaded. If the remote bus is
loaded, then link C performance information is basically a lie.

I've been convinced (so far) that cross-socket interconnect
interleaving is not a real use-case unless you intend to only run
your software on a single socket and use the remote socket for
whatever you can swipe over the interconnect. In that case, you're
probably smart enough to set the interleave weights manually.


So what if the nodes are coming from many memory sources down one
or more local CXL links (link B from cpu0).

--------- A --------
| DRAM0 | ---- | cpu0 |
--------- --------
| B
----------------------------
| |
-------- --------
| cxl0 | ...... | cxlN |
-------- --------

In that case it would be better for many reasons to reconfigure the
system to combine those nodes into fewer nodes via a hardware interleave
set. This can be done in hardware (at a switch), in BIOS (at the root
complex), or by the CXL Driver. The result is fewer nodes, and the real
performance of that node can be calculated by the drivers and reported
accordingly.



So coming back to this code: Then why am I doing GCD across all
nodes, rather than taking the full topology into account? Mostly
because the topological information is not easily available, would
be complex to communicate across components, and the full reduction
is a decent approximation anyway.

Example from above using real HMAT reported numbers

A&D: 176100
B&E: 60000
C: Not a node, no information available.

Produces Node Weights

Calculating total system weighted average
A:37 D:37 B:12 E:12 (37 is prime so no reductions possible)

Calculating local-node relationships only
A:74--B:25 D:74--E:25 (GCD is 1, so no reductions possible)

Notice that 12+37 = 49, and 12/49 = ~24%

So the ratios end up working out basically the same anyway, but
the smaller numbers produced by averaging over the entire system
are preferable to the "topologically aware" numbers anyway.
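
(Worked out with the same percentage formula as the patch:

  Global: ttl = 2*176100 + 2*60000 = 472200
          floor(100 * 176100 / 472200) = 37, floor(100 * 60000 / 472200) = 12
  Local:  ttl = 176100 + 60000 = 236100
          floor(100 * 176100 / 236100) = 74, floor(100 * 60000 / 236100) = 25)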


Obviously this breaks in a "large numa system" - but again...
is this even useful for those systems anyway? I contend: No.


This is still reasonably accurate in non-homogeneous systems

--------- A -------- C -------- D ---------
| DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
--------- -------- -------- ---------
| B
--------
| cxl0 |
--------

In this system the numbers work out to:

Global: A:42 B:14 D: 42 (GCD: 14)
Reduce: A:3 B:1 D: 3

A user doing `-w --interleave=A,B` will get a ratio of 3:1, which
is pretty much spot on.


So, long winded winded way of saying:
- Could we use a larger default number? Yes.
- Does that actually help us? Not really, we want smaller numbers.
- Does this reduce to normal-interleave under large-numa systems? Yes.
- Does that matter? Probably not. It doesn't seem like a real use case.
- What if it is? The workloads probably want task-local weights anyway.

> >
> > In this scenario, I'm not sure what to do. We must have a non-0 value
> > for that device (to avoid div-by-0), but setting an arbitrarily large
> > value also seems bad.
>
> I think that it's kind of reasonable to use DRAM bandwidth for devices
> without data. If there are only DRAM nodes and nodes without data, this
> will make the interleave weight "1".
>

Yes, those nodes would reduce to 1. Which is pretty much the best we can
do without accounting for interconnects - which as discussed above is not
really useful anyway.



I think I'll draft up an LSF/MM chat to see if we can garner more input.
If large-numa systems are a real issue, then yes we need to address it.

~Gregory

2024-02-27 00:40:28

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Gregory Price <[email protected]> writes:

> On Fri, Feb 23, 2024 at 05:11:23PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>
> (sorry for the re-send, error replying to list)
>
>> >> > + /* If node is not set or has < 1% of total bw, use minimum value of 1 */
>> >> > + for (i = 0; i < nr_node_ids; i++) {
>> >> > + if (new_bw[i])
>> >> > + new_iw[i] = max((100 * new_bw[i] / ttl_bw), 1);
>>
>> IIUC, the sum of interleave weights of all nodes will be 100. If there
>> are more than 100 nodes in the system, this doesn't work properly. How
>> about using some fixed number like "16" for DRAM nodes?
>>
>
> I suppose we could add a "type" value into the interface that says
> what approximate "tier" a node is in, or we could ask the tiering
> component for that information. But what does this actually change?
>
> You still calculate the percentage of bandwidth provided by each node,
> and then just apply that to the larger default number. I don't see the
> point in that - if each node provides less than 1% of the overall system
> bandwidth, then larger numbers won't do much. In fact, we want smaller
> numbers to spread spatially local data out more aggressively.
>
> More important question: In what world is a large numa system liable
> to use this interface to any real benefit?
>
>
> I'd briefly considered this, but I stayed away from supporting that
> case. Probably worth documenting, at the very least.
>
> We had the cross-socket interleave discussion previously in the prior
> series. The question above simplifies (complicates?) to: How useful
> is interleave (weighted or not) in cross-socket workloads.
>
> Consider the following configuration:
>
>
> --------- A -------- C -------- D ---------
> | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
> --------- -------- -------- ---------
> | B | E
> -------- --------
> | cxl0 | | cxl1 |
> -------- --------
>
> Theoretical throughputs
>
> A&D: 512GB/s (8 channel DDR5)
> B&E: 64GB/s (1 CXL/PCIe5 link)
> C : 62.4GB/s (3x UPI links)
>
> Where are the 100 nodes coming from?

If you have a real large machine with more than 100 nodes, and some of
them are CXL memory nodes, then it's possible that most nodes will have
interleave weight "1" because the sum of all interleave weights is
"100". Then, even if you use only one socket, the interleave weight of
DRAM and CXL MEM could be all "1", lead to useless default value. So, I
suggest don't cap the sum of interleave weights.

> If it's across interconnects (UPI), then the throughput to remote
> DRAM is better described by C, not A or D. However, we don't have
> that information (maybe we should?). More importantly... is
> interleaving across these links even useful? I suppose if you did
> sub-numa clustering stuff and had an ultra-super-numa-aware piece
> of software capable of keeping certain chunks of memory in certain
> cores that might be useful.... but then you probably actually want
> task-local weights as opposed to using the system default.
>
> Otherwise, does a UPI link actually get the full throughput? Probably
> only if the remote memory bus is unloaded. If the remote bus is
> loaded, then link C performance information is basically a lie.
>
> I've been convinced (so far) that cross-socket interconnect
> interleaving is not a real use-case unless you intend to only run
> your software on a single socket and use the remote socket for
> whatever you can swipe over the interconnect. In that case, you're
> probably smart enough to set the interleave weights manually.
>
>
> So what if the nodes are coming from many memory sources down one
> or more local CXL links (link B from cpu0).
>
> --------- A --------
> | DRAM0 | ---- | cpu0 |
> --------- --------
> | B
> ----------------------------
> | |
> -------- --------
> | cxl0 | ...... | cxlN |
> -------- --------
>
> In that case it would be better for many reasons to reconfigure the
> system to combine those nodes into fewer nodes via a hardware interleave
> set. This can be done in hardware (at a switch), in BIOS (at the root
> complex), or by the CXL Driver. The result is fewer nodes, and the real
> performance of that node can be calculated by the drivers and reported
> accordingly.
>
>
>
> So coming back to this code: Then why am I doing GCD across all
> nodes, rather than taking the full topology into account? Mostly
> because the topological information is not easily available, would
> be complex to communicate across components, and the full reduction
> is a decent approximation anyway.
>
> Example from above using real HMAT reported numbers
>
> A&D: 176100
> B&E: 60000
> C: Not a node, no information available.
>
> Produces Node Weights
>
> Calculating total system weighted average
> A:37 D:37 B:12 E:12 (37 is prime so no reductions possible)
>
> Calculating local-node relationships only
> A:74--B:25 D:74--E:25 (GCD is 1, so no reductions possible)
>
> Notice that 12+37 = 49, and 12/49 = ~24%
>
> So the ratios end up working out basically the same anyway, but
> the smaller numbers produced by averaging over the entire system
> are preferable to the "topologically aware" numbers anyway.
>
>
> Obviously this breaks in a "large numa system" - but again...
> is this even useful for those systems anyway? I contend: No.
>
>
> This is still reasonably accurate in non-homogeneous systems
>
> --------- A -------- C -------- D ---------
> | DRAM0 | ---- | cpu0 |---UPI---| cpu1 |----| DRAM1 |
> --------- -------- -------- ---------
> | B
> --------
> | cxl0 |
> --------
>
> In this system the numbers work out to:
>
> Global: A:42 B:14 D: 42 (GCD: 14)
> Reduce: A:3 B:1 D: 3
>
> A user doing `-w --interleave=A,B` will get a ratio of 3:1, which
> is pretty much spot on.
>
>
> So, long winded winded way of saying:
> - Could we use a larger default number? Yes.
> - Does that actually help us? Not really, we want smaller numbers.

The larger number will be reduced after GCD.

> - Does this reduce to normal-interleave under large-numa systems? Yes.
> - Does that matter? Probably not. It doesn't seem like a real use case.
> - What if it is? The workloads probably want task-local weights anyway.
>
>> >
>> > In this scenario, I'm not sure what to do. We must have a non-0 value
>> > for that device (to avoid div-by-0), but setting an arbitrarily large
>> > value also seems bad.
>>
>> I think that it's kind of reasonable to use DRAM bandwidth for devices
>> without data. If there are only DRAM nodes and nodes without data, this
>> will make the interleave weight "1".
>>
>
> Yes, those nodes would reduce to 1. Which is pretty much the best we can
> do without accounting for interconnects - which as discussed above is not
> really useful anyway.
>
>
>
> I think I'll draft up an LSF/MM chat to see if we can garner more input.
> If large-numa systems are a real issue, then yes we need to address it.

Sounds good to me!

--
Best Regards,
Huang, Ying

> ~Gregory

2024-02-27 05:38:20

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

On Tue, Feb 27, 2024 at 08:38:19AM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
> > Where are the 100 nodes coming from?
>
> If you have a real large machine with more than 100 nodes, and some of
> them are CXL memory nodes, then it's possible that most nodes will have
> interleave weight "1" because the sum of all interleave weights is
> "100". Then, even if you use only one socket, the interleave weight of
> DRAM and CXL MEM could all be "1", leading to a useless default value. So, I
> suggest we don't cap the sum of interleave weights.

I have to press this issue: Is this an actual, practical, concern?

It seems to me in this type of scenario, there are larger, more complex
numa topology issues that make the use of the general, global weighted
mempolicy system entirely impractical. This is a bit outside the scope
of this work.

> > So, long winded winded way of saying:
> > - Could we use a larger default number? Yes.
> > - Does that actually help us? Not really, we want smaller numbers.
>
> The larger number will be reduced after GCD.
>

I suppose another strategy is to calculate the interleave weights
un-bounded from the raw bandwidth - but continuously force reductions
(through some yet-undefined algorithm) until at least one node reaches a
weight of `1`. This suffers from the opposite problem: what if the top
node has a value greater than 255? Do we just cap it at 255? That seems
the opposite form of problematic.

(Large numbers are quite pointless, as it is essentially the antithesis
of interleave)

> > I think I'll draft up an LSF/MM chat to see if we can garner more input.
> > If large-numa systems are a real issue, then yes we need to address it.
>
> Sounds good to me!

Working on it.

~Gregory

2024-02-27 06:01:37

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Gregory Price <[email protected]> writes:

> On Tue, Feb 27, 2024 at 08:38:19AM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>> > Where are the 100 nodes coming from?
>>
>> If you have a real large machine with more than 100 nodes, and some of
>> them are CXL memory nodes, then it's possible that most nodes will have
>> interleave weight "1" because the sum of all interleave weights is
>> "100". Then, even if you use only one socket, the interleave weight of
>> DRAM and CXL MEM could all be "1", leading to a useless default value. So, I
>> suggest we don't cap the sum of interleave weights.
>
> I have to press this issue: Is this an actual, practical, concern?

I don't know who has a large machine like that. But I guess that it's
possible in the long run.

> It seems to me in this type of scenario, there are larger, more complex
> numa topology issues that make the use of the general, global weighted
> mempolicy system entirely impractical. This is a bit outside the scope
> of this work.

It's possible to solve the problem step by step. For example, add
per-task interleave weights at some point.

>> > So, long winded winded way of saying:
>> > - Could we use a larger default number? Yes.
>> > - Does that actually help us? Not really, we want smaller numbers.
>>
>> The larger number will be reduced after GCD.
>>
>
> I suppose another strategy is to calculate the interleave weights
> un-bounded from the raw bandwidth - but continuously force reductions
> (through some yet-undefined algorithm) until at least one node reaches a
> weight of `1`. This suffers from the opposite problem: what if the top
> node has a value greater than 255? Do we just cap it at 255? That seems
> the opposite form of problematic.
>
> (Large numbers are quite pointless, as it is essentially the antithesis
> of interleave)

Yes. So I suggest using a relatively small number as the default weight
to start with for normal DRAM. We will have to floor/ceiling the weight
value.

--
Best Regards,
Huang, Ying

2024-02-27 06:13:00

by Gregory Price

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

On Tue, Feb 27, 2024 at 01:59:26PM +0800, Huang, Ying wrote:
> Gregory Price <[email protected]> writes:
>
> > I have to press this issue: Is this an actual, practical, concern?
>
> I don't know who has a large machine like that. But I guess that it's
> possible in the long run.
>

Certainly possible, although that seems like a hyper-specialized case of
a supercomputer. I suppose still worth considering for a bit.

> > I suppose another strategy is to calculate the interleave weights
> > un-bounded from the raw bandwidth - but continuously force reductions
> > (through some yet-undefined algorithm) until at least one node reaches a
> > weight of `1`. This suffers from the opposite problem: what if the top
> > node has a value greater than 255? Do we just cap it at 255? That seems
> > the opposite form of problematic.
> >
> > (Large numbers are quite pointless, as it is essentially the antithesis
> > of interleave)
>
> Yes. So I suggest using a relatively small number as the default weight
> to start with for normal DRAM. We will have to floor/ceiling the weight
> value.

Yeah more concretely, I was thinking something like

unsigned int *temp_weights; /* sizeof nr_node_ids */

memcpy(temp_weights, node_bandwidth);
while min(temp_weights) > 1:
- attempt GCD reduction
- if failed (GCD=1), adjust all odd numbers to be even (+1), try again

for weight in temp_weights:
iw_table[N] = (weight > 255) ? 255 : (unsigned char)weight;

Something like this. Of course this breaks if you have two nodes with a
massively different bandwidth ratio (> 255:1), but that seems
unrealistic given the intent of the devices.
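
As a rough runnable model of the above (userspace sketch, names
hypothetical; the bandwidths reuse the min(read, write) numbers from
earlier in the thread):

#include <stdio.h>

#define NR_NODES 3

static unsigned long gcd(unsigned long a, unsigned long b)
{
	while (b) {
		unsigned long t = b;

		b = a % b;
		a = t;
	}
	return a;
}

int main(void)
{
	/* min(read, write) bandwidth per node: DRAM0, cxl0, DRAM1 */
	unsigned long temp_weights[NR_NODES] = { 176100, 60000, 176100 };
	unsigned char iw_table[NR_NODES];
	unsigned long min_w, g;
	int i;

	for (;;) {
		min_w = g = temp_weights[0];
		for (i = 1; i < NR_NODES; i++) {
			if (temp_weights[i] < min_w)
				min_w = temp_weights[i];
			g = gcd(g, temp_weights[i]);
		}
		if (min_w == 1)
			break;
		if (g > 1) {
			/* GCD reduction succeeded */
			for (i = 0; i < NR_NODES; i++)
				temp_weights[i] /= g;
		} else {
			/* GCD is 1: round odd values up to even, retry */
			for (i = 0; i < NR_NODES; i++)
				temp_weights[i] += temp_weights[i] & 1;
		}
	}
	for (i = 0; i < NR_NODES; i++) {
		iw_table[i] = temp_weights[i] > 255 ? 255 : temp_weights[i];
		printf("node%d: %u\n", i, (unsigned)iw_table[i]);
	}
	return 0;
}

With those inputs this converges on 3:1:3, which matches the reduced
global weights from the 2-socket example earlier in the thread.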

~Gregory

2024-02-27 08:26:54

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC 1/1] mm/mempolicy: introduce system default interleave weights

Gregory Price <[email protected]> writes:

> On Tue, Feb 27, 2024 at 01:59:26PM +0800, Huang, Ying wrote:
>> Gregory Price <[email protected]> writes:
>>
>> > I have to press this issue: Is this an actual, practical, concern?
>>
>> I don't know who has a large machine like that. But I guess that it's
>> possible in the long run.
>>
>
> Certainly possible, although that seems like a hyper-specialized case of
> a supercomputer. I suppose still worth considering for a bit.
>
>> > I suppose another strategy is to calculate the interleave weights
>> > un-bounded from the raw bandwidth - but continuously force reductions
>> > (through some yet-undefined algorithm) until at least one node reaches a
>> > weight of `1`. This suffers from the opposite problem: what if the top
>> > node has a value greater than 255? Do we just cap it at 255? That seems
>> > the opposite form of problematic.
>> >
>> > (Large numbers are quite pointless, as it is essentially the antithesis
>> > of interleave)
>>
>> Yes. So I suggest using a relatively small number as the default weight
>> to start with for normal DRAM. We will have to floor/ceiling the weight
>> value.
>
> Yeah more concretely, I was thinking something like
>
> unsigned int *temp_weights; /* sizeof nr_node_ids */
>
> memcpy(temp_weights, node_bandwidth);
> while min(temp_weights) > 1:
> - attempt GCD reduction
> - if failed (GCD=1), adjust all odd numbers to be even (+1), try again
>
> for weight in temp_weights:
> iw_table[N] = (weight > 255) ? 255 : (unsigned char)weight;
>
> Something like this. Of course this breaks if you have two nodes with a
> massively different bandwidth ratio (> 255:1), but that seems
> unrealistic given the intent of the devices.

It would be better to evaluate the maximum error introduced. For example,
for 3:2 bandwidth, the result could be 2:1. That does not appear necessary.

--
Best Regards,
Huang, Ying