2022-04-16 05:58:05

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH RFC lsfmm 0/6] mm: proactive reclaim and memory tiering topics

Hello,

With the increasing popularity of memory tiering, the idea of this series is to trigger
some discussion around David's[1] system-wide proactive reclaim beyond memcg[2] as well
as sysfs as the interface for exporting system-wide tiering information[2]. I am
hoping this can be discussed at LSFMM, and while I know many are interested in tiering
subjects in general, I have not seen anyone bring this up in the list.

There has been some initial discussion towards the need to expose system-wide tiering
information to userspace. I thought I'd start with two sysfs files as a node attribute
that exports the demotion_node as well as whether or not the node is fast memory. This
was considered (and I agree) better than a new /sys/devices/system/tier/tierN/ interface.
So, are we going to go this route? If so, what further information is useful for users?
Does having instead a /sys/devices/system/node/nodeN/reclaim/ make sense?

Applies against Linus' current tree and has only been _gently_ tested.

Thanks!

Davidlohr Bueso (6):
drivers/base/node: cleanup register_node()
mm/vmscan: use node_is_toptier helper in node_reclaim
mm: make __node_reclaim() more flexible
mm: introduce per-node proactive reclaim interface
mm/migration: export demotion_path of a node via sysfs
mm/migrate: export whether or not tier is toptier in sysfs

Documentation/ABI/stable/sysfs-devices-node | 22 ++++
drivers/base/node.c | 68 ++++++++++--
include/linux/migrate.h | 15 +++
include/linux/swap.h | 16 +++
include/trace/events/vmscan.h | 12 +--
mm/migrate.c | 15 +--
mm/vmscan.c | 108 +++++++++++++++-----
7 files changed, 206 insertions(+), 50 deletions(-)

--
2.26.2


2022-04-16 06:33:45

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 4/6] mm: introduce per-node proactive reclaim interface

This patch introduces a mechanism to trigger memory reclaim
as a per-node sysfs interface, inspired by compaction's
equivalent; ie:

echo 1G > /sys/devices/system/node/nodeX/reclaim

It is based on the discussions from David's thread[1] as
well as the current upstreaming of the memcg[2] interface
(which has nice explanations for the benefits of userspace
reclaim overall). In both cases conclusions were that either
way of inducing proactive reclaim should be KISS, and can be
later extended. So this patch does not allow the user much
fine tuning beyond the size of the reclaim, such as anon/file
or whether or semantics of demotion.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Documentation/ABI/stable/sysfs-devices-node | 10 ++++
drivers/base/node.c | 2 +
include/linux/swap.h | 16 ++++++
mm/vmscan.c | 59 +++++++++++++++++++++
4 files changed, 87 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index 8db67aa472f1..3c935e1334f7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -182,3 +182,13 @@ Date: November 2021
Contact: Jarkko Sakkinen <[email protected]>
Description:
The total amount of SGX physical memory in bytes.
+
+What: /sys/devices/system/node/nodeX/reclaim
+Date: April 2022
+Contact: Davidlohr Bueso <[email protected]>
+Description:
+ Write the amount of bytes to induce memory reclaim in this node.
+ This file accepts a single key, the number of bytes to reclaim.
+ When it completes successfully, the specified amount or more memory
+ will have been reclaimed, and -EAGAIN if less bytes are reclaimed
+ than the specified amount.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6cdf25fd26c3..d80c478e2a6e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -670,6 +670,7 @@ static int register_node(struct node *node, int num)

hugetlb_register_node(node);
compaction_register_node(node);
+ reclaim_register_node(node);
return 0;
}

@@ -685,6 +686,7 @@ void unregister_node(struct node *node)
hugetlb_unregister_node(node); /* no-op, if memoryless node */
node_remove_accesses(node);
node_remove_caches(node);
+ reclaim_unregister_node(node);
device_unregister(&node->dev);
}

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 27093b477c5f..cca43ae6d770 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -398,6 +398,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
long remove_mapping(struct address_space *mapping, struct folio *folio);

+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
extern unsigned long reclaim_pages(struct list_head *page_list);
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1735c302831c..3539f8a0f0ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4819,3 +4819,62 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ int err, nid = dev->id;
+ gfp_t gfp_mask = GFP_KERNEL;
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ unsigned long nr_to_reclaim, nr_reclaimed = 0;
+ unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+ struct scan_control sc = {
+ .gfp_mask = current_gfp_context(gfp_mask),
+ .reclaim_idx = gfp_zone(gfp_mask),
+ .priority = NODE_RECLAIM_PRIORITY,
+ .may_writepage = !laptop_mode,
+ .may_unmap = 1,
+ .may_swap = 1,
+ };
+
+ buf = strstrip((char *)buf);
+ err = page_counter_memparse(buf, "", &nr_to_reclaim);
+ if (err)
+ return err;
+
+ sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
+
+ while (nr_reclaimed < nr_to_reclaim) {
+ unsigned long reclaimed;
+
+ if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
+ return -EAGAIN;
+
+ /* does cond_resched() */
+ reclaimed = __node_reclaim(pgdat, gfp_mask,
+ nr_to_reclaim - nr_reclaimed, &sc);
+
+ clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
+
+ if (!reclaimed && !nr_retries--)
+ break;
+
+ nr_reclaimed += reclaimed;
+ }
+
+ return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+ return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+ return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
--
2.26.2

2022-04-16 06:58:37

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

We have helpers for a reason.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1678802e03e7..cb583fcbf5bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
* over remote processors and spread off node memory allocations
* as wide as possible.
*/
- if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
+ if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
return NODE_RECLAIM_NOSCAN;

if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
--
2.26.2

2022-04-16 07:49:51

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 1/6] drivers/base/node: cleanup register_node()

Trivially get rid of some unnecessary indentation.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
drivers/base/node.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index ec8bb24a5a22..6cdf25fd26c3 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
node->dev.bus = &node_subsys;
node->dev.release = node_device_release;
node->dev.groups = node_dev_groups;
- error = device_register(&node->dev);

- if (error)
+ error = device_register(&node->dev);
+ if (error) {
put_device(&node->dev);
- else {
- hugetlb_register_node(node);
-
- compaction_register_node(node);
+ return error;
}
- return error;
+
+ hugetlb_register_node(node);
+ compaction_register_node(node);
+ return 0;
}

/**
--
2.26.2

2022-04-18 12:28:09

by Davidlohr Bueso

[permalink] [raw]
Subject: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf



This allows userspace to know if the node is considered fast
memory (with CPUs attached to it). While this can be already
derived without a new file, this helps further encapsulate the
concept.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Resending, just noticed this oatch was never posted.

Documentation/ABI/stable/sysfs-devices-node | 6 ++++++
drivers/base/node.c | 13 +++++++++++++
2 files changed, 19 insertions(+)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index f620c6ae013c..1c21c3985535 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -198,3 +198,9 @@ Date: April 2022
Contact: Davidlohr Bueso <[email protected]>
Description:
Shows nodes within the next tier of slower memory below this node.
+
+What: /sys/devices/system/node/nodeX/memory_toptier
+Date: April 2022
+Contact: Davidlohr Bueso <[email protected]>
+Description:
+ Node is attached to fast memory or not.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index ab4bae777535..b9de5b0360f2 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -598,12 +598,25 @@ static ssize_t node_read_demotion_path(struct device *dev,
}
static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);

+static ssize_t node_read_memory_toptier(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int nid = dev->id;
+ int len = 0;
+
+ len += sysfs_emit_at(buf, len, "%d\n", !!node_is_toptier(nid));
+
+ return len;
+}
+static DEVICE_ATTR(memory_toptier, 0444, node_read_memory_toptier, NULL);
+
static struct attribute *node_dev_attrs[] = {
&dev_attr_meminfo.attr,
&dev_attr_numastat.attr,
&dev_attr_distance.attr,
&dev_attr_vmstat.attr,
&dev_attr_demotion_path.attr,
+ &dev_attr_memory_toptier.attr,
NULL
};

--
2.26.2

2022-04-19 00:01:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf

On 4/16/22 20:49, Davidlohr Bueso wrote:
> This allows userspace to know if the node is considered fast
> memory (with CPUs attached to it). While this can be already
> derived without a new file, this helps further encapsulate the
> concept.

What is userspace supposed to *do* with this, though?

What does "attached" mean?

Isn't it just asking for trouble to add (known) redundancy to the ABI?
It seems like a recipe for future inconsistency.

2022-04-19 06:57:56

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf

On Mon, 18 Apr 2022, Dave Hansen wrote:

>I think we should try to stick to cold, hard facts as must as possible
>rather than trying to have the *kernel* dictate as a policy what is fast
>versus slow.

That's a very good point and I agree.

Thanks,
Davidlohr

2022-04-19 14:21:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf

On 4/18/22 09:45, Davidlohr Bueso wrote:
> On Mon, 18 Apr 2022, Dave Hansen wrote:
>> On 4/16/22 20:49, Davidlohr Bueso wrote:
>>> This allows userspace to know if the node is considered fast
>>> memory (with CPUs attached to it). While this can be already
>>> derived without a new file, this helps further encapsulate the
>>> concept.
>>
>> What is userspace supposed to *do* with this, though?
>
> This came as a scratch to my own itch. I wanted to start testing
> more tiering patches overall that I see pop up, and wanted a way
> to differentiate the slow vs the fast memories in order to better
> configure workload(s) working set sizes beyond what is your typical
> grep MemTotal /proc/meminfo. If there is a better way I'm all
> for it.

But how does this help you? Does it save you a few lines in a shell
script to find the nodes that have memory and CPUs?

>> Isn't it just asking for trouble to add (known) redundancy to the ABI?
>> It seems like a recipe for future inconsistency.
>
> Perhaps. It was mostly about the fact that the notion of top tier
> could also change as technology evolves.

It seems like something arbitrary that everyone will just disagree on.
I think we should try to stick to cold, hard facts as must as possible
rather than trying to have the *kernel* dictate as a policy what is fast
versus slow.

2022-04-22 20:15:54

by Tim Chen

[permalink] [raw]
Subject: Re: [PATCH 4/6] mm: introduce per-node proactive reclaim interface

On Fri, 2022-04-15 at 22:39 -0700, Davidlohr Bueso wrote:
> This patch introduces a mechanism to trigger memory reclaim
> as a per-node sysfs interface, inspired by compaction's
> equivalent; ie:
>
> echo 1G > /sys/devices/system/node/nodeX/reclaim
>

I think it will be more flexible to specify a node mask
as a parameter along with amount of memory with the
memory.reclaim memcg interface proposed by Yosry. Doing it node
by node is more cumbersome. It is just a special case
of reclaiming from root cgroup for a specific node.

Wei Gu, YIng and I have some discssions on this
https://lore.kernel.org/all/[email protected]/


Tim

> It is based on the discussions from David's thread[1] as
> well as the current upstreaming of the memcg[2] interface
> (which has nice explanations for the benefits of userspace
> reclaim overall). In both cases conclusions were that either
> way of inducing proactive reclaim should be KISS, and can be
> later extended. So this patch does not allow the user much
> fine tuning beyond the size of the reclaim, such as anon/file
> or whether or semantics of demotion.
>
> [1] https://lore.kernel.org/all/[email protected]/
> [2] https://lore.kernel.org/all/[email protected]/
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> Documentation/ABI/stable/sysfs-devices-node | 10 ++++
> drivers/base/node.c | 2 +
> include/linux/swap.h | 16 ++++++
> mm/vmscan.c | 59 +++++++++++++++++++++
> 4 files changed, 87 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index 8db67aa472f1..3c935e1334f7 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -182,3 +182,13 @@ Date: November 2021
> Contact: Jarkko Sakkinen <[email protected]>
> Description:
> The total amount of SGX physical memory in bytes.
> +
> +What: /sys/devices/system/node/nodeX/reclaim
> +Date: April 2022
> +Contact: Davidlohr Bueso <[email protected]>
> +Description:
> + Write the amount of bytes to induce memory reclaim in this node.
> + This file accepts a single key, the number of bytes to reclaim.
> + When it completes successfully, the specified amount or more memory
> + will have been reclaimed, and -EAGAIN if less bytes are reclaimed
> + than the specified amount.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 6cdf25fd26c3..d80c478e2a6e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -670,6 +670,7 @@ static int register_node(struct node *node, int num)
>
> hugetlb_register_node(node);
> compaction_register_node(node);
> + reclaim_register_node(node);
> return 0;
> }
>
> @@ -685,6 +686,7 @@ void unregister_node(struct node *node)
> hugetlb_unregister_node(node); /* no-op, if memoryless node */
> node_remove_accesses(node);
> node_remove_caches(node);
> + reclaim_unregister_node(node);
> device_unregister(&node->dev);
> }
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 27093b477c5f..cca43ae6d770 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -398,6 +398,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> long remove_mapping(struct address_space *mapping, struct folio *folio);
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +extern int reclaim_register_node(struct node *node);
> +extern void reclaim_unregister_node(struct node *node);
> +
> +#else
> +
> +static inline int reclaim_register_node(struct node *node)
> +{
> + return 0;
> +}
> +
> +static inline void reclaim_unregister_node(struct node *node)
> +{
> +}
> +#endif /* CONFIG_SYSFS && CONFIG_NUMA */
> +
> extern unsigned long reclaim_pages(struct list_head *page_list);
> #ifdef CONFIG_NUMA
> extern int node_reclaim_mode;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1735c302831c..3539f8a0f0ea 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4819,3 +4819,62 @@ void check_move_unevictable_pages(struct pagevec *pvec)
> }
> }
> EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
> +
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +static ssize_t reclaim_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + int err, nid = dev->id;
> + gfp_t gfp_mask = GFP_KERNEL;
> + struct pglist_data *pgdat = NODE_DATA(nid);
> + unsigned long nr_to_reclaim, nr_reclaimed = 0;
> + unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> + struct scan_control sc = {
> + .gfp_mask = current_gfp_context(gfp_mask),
> + .reclaim_idx = gfp_zone(gfp_mask),
> + .priority = NODE_RECLAIM_PRIORITY,
> + .may_writepage = !laptop_mode,
> + .may_unmap = 1,
> + .may_swap = 1,
> + };
> +
> + buf = strstrip((char *)buf);
> + err = page_counter_memparse(buf, "", &nr_to_reclaim);
> + if (err)
> + return err;
> +
> + sc.nr_to_reclaim = max(nr_to_reclaim, SWAP_CLUSTER_MAX);
> +
> + while (nr_reclaimed < nr_to_reclaim) {
> + unsigned long reclaimed;
> +
> + if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> + return -EAGAIN;
> +
> + /* does cond_resched() */
> + reclaimed = __node_reclaim(pgdat, gfp_mask,
> + nr_to_reclaim - nr_reclaimed, &sc);
> +
> + clear_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
> +
> + if (!reclaimed && !nr_retries--)
> + break;
> +
> + nr_reclaimed += reclaimed;
> + }
> +
> + return nr_reclaimed < nr_to_reclaim ? -EAGAIN : count;
> +}
> +
> +static DEVICE_ATTR_WO(reclaim);
> +int reclaim_register_node(struct node *node)
> +{
> + return device_create_file(&node->dev, &dev_attr_reclaim);
> +}
> +
> +void reclaim_unregister_node(struct node *node)
> +{
> + return device_remove_file(&node->dev, &dev_attr_reclaim);
> +}
> +#endif

2022-04-22 22:46:59

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf

On Sat, Apr 16, 2022 at 8:49 PM Davidlohr Bueso <[email protected]> wrote:
>
>
>
> This allows userspace to know if the node is considered fast
> memory (with CPUs attached to it). While this can be already
> derived without a new file, this helps further encapsulate the
> concept.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> Resending, just noticed this oatch was never posted.
>
> Documentation/ABI/stable/sysfs-devices-node | 6 ++++++
> drivers/base/node.c | 13 +++++++++++++
> 2 files changed, 19 insertions(+)
>
> diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
> index f620c6ae013c..1c21c3985535 100644
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -198,3 +198,9 @@ Date: April 2022
> Contact: Davidlohr Bueso <[email protected]>
> Description:
> Shows nodes within the next tier of slower memory below this node.
> +
> +What: /sys/devices/system/node/nodeX/memory_toptier
> +Date: April 2022
> +Contact: Davidlohr Bueso <[email protected]>
> +Description:
> + Node is attached to fast memory or not.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ab4bae777535..b9de5b0360f2 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -598,12 +598,25 @@ static ssize_t node_read_demotion_path(struct device *dev,
> }
> static DEVICE_ATTR(demotion_path, 0444, node_read_demotion_path, NULL);
>
> +static ssize_t node_read_memory_toptier(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + int nid = dev->id;
> + int len = 0;
> +
> + len += sysfs_emit_at(buf, len, "%d\n", !!node_is_toptier(nid));

It is not guaranteed. Some hardware configurations have cpuless DRAM
nodes, but they should be treated as top tier nodes IMHO. Please see
https://lore.kernel.org/linux-mm/[email protected]/

> +
> + return len;
> +}
> +static DEVICE_ATTR(memory_toptier, 0444, node_read_memory_toptier, NULL);
> +
> static struct attribute *node_dev_attrs[] = {
> &dev_attr_meminfo.attr,
> &dev_attr_numastat.attr,
> &dev_attr_distance.attr,
> &dev_attr_vmstat.attr,
> &dev_attr_demotion_path.attr,
> + &dev_attr_memory_toptier.attr,
> NULL
> };
>
> --
> 2.26.2
>

2022-04-22 22:52:23

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/migrate: export whether or not node is toptier in sysf

On Mon, 18 Apr 2022, Dave Hansen wrote:

>On 4/16/22 20:49, Davidlohr Bueso wrote:
>> This allows userspace to know if the node is considered fast
>> memory (with CPUs attached to it). While this can be already
>> derived without a new file, this helps further encapsulate the
>> concept.
>
>What is userspace supposed to *do* with this, though?

This came as a scratch to my own itch. I wanted to start testing
more tiering patches overall that I see pop up, and wanted a way
to differentiate the slow vs the fast memories in order to better
configure workload(s) working set sizes beyond what is your typical
grep MemTotal /proc/meminfo. If there is a better way I'm all
for it.

>
>What does "attached" mean?

I'll rephrase.

>Isn't it just asking for trouble to add (known) redundancy to the ABI?
>It seems like a recipe for future inconsistency.

Perhaps. It was mostly about the fact that the notion of top tier
could also change as technology evolves.

Thanks,
Davidlohr

2022-04-26 09:04:10

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 1/6] drivers/base/node: cleanup register_node()

On Fri, Apr 15, 2022 at 10:38:57PM -0700, Davidlohr Bueso wrote:
> Trivially get rid of some unnecessary indentation.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> drivers/base/node.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..6cdf25fd26c3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
> node->dev.bus = &node_subsys;
> node->dev.release = node_device_release;
> node->dev.groups = node_dev_groups;
> - error = device_register(&node->dev);
>
> - if (error)
> + error = device_register(&node->dev);
> + if (error) {
> put_device(&node->dev);
> - else {
> - hugetlb_register_node(node);
> -
> - compaction_register_node(node);
> + return error;
> }
> - return error;
> +
> + hugetlb_register_node(node);
> + compaction_register_node(node);
> + return 0;
> }
>
> /**
> --
> 2.26.2
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-04-26 18:15:38

by Adam Manzanares

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

On Fri, Apr 15, 2022 at 10:38:58PM -0700, Davidlohr Bueso wrote:
> We have helpers for a reason.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> * over remote processors and spread off node memory allocations
> * as wide as possible.
> */
> - if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> + if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
> return NODE_RECLAIM_NOSCAN;
>
> if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> --
> 2.26.2
>


Looks good.

Reviewed by: Adam Manzanares <[email protected]>

2022-05-04 07:38:57

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH 1/6] drivers/base/node: cleanup register_node()

On 16.04.22 07:38, Davidlohr Bueso wrote:
> Trivially get rid of some unnecessary indentation.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> drivers/base/node.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index ec8bb24a5a22..6cdf25fd26c3 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -661,16 +661,16 @@ static int register_node(struct node *node, int num)
> node->dev.bus = &node_subsys;
> node->dev.release = node_device_release;
> node->dev.groups = node_dev_groups;
> - error = device_register(&node->dev);
>
> - if (error)
> + error = device_register(&node->dev);
> + if (error) {
> put_device(&node->dev);
> - else {
> - hugetlb_register_node(node);
> -
> - compaction_register_node(node);
> + return error;
> }
> - return error;
> +
> + hugetlb_register_node(node);
> + compaction_register_node(node);
> + return 0;
> }
>
> /**

Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2022-05-04 15:47:17

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

On Fri, 15 Apr 2022, Davidlohr Bueso wrote:

> We have helpers for a reason.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>

Acked-by: David Rientjes <[email protected]>

2022-05-04 16:57:42

by Jagdish Gediya

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

On Fri, Apr 15, 2022 at 10:38:58PM -0700, Davidlohr Bueso wrote:
> We have helpers for a reason.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> * over remote processors and spread off node memory allocations
> * as wide as possible.
> */
> - if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> + if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
Currently node_is_toptier returns all N_CPU node as toptier but N_CPU
node will not stay necessarily top tier as per discussions on below
thread.

https://lore.kernel.org/linux-mm/CAAPL-u9sVx94ACSuCVN8V0tKp+AMxiY89cro0japtyB=xNfNBw@mail.gmail.com/

node_is_toptier() definition can change based on the discussion in above
thread.
> return NODE_RECLAIM_NOSCAN;
>
> if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
> --
> 2.26.2
>
>

2022-06-01 20:05:28

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

On Wed, 01 Jun 2022, Ying Huang wrote:

>On Tue, 2022-05-31 at 17:20 +0530, Aneesh Kumar K.V wrote:
>> Davidlohr Bueso <[email protected]> writes:
>>
>> > We have helpers for a reason.
>> >
>> > Signed-off-by: Davidlohr Bueso <[email protected]>
>> > ---
>> > ?mm/vmscan.c | 2 +-
>> > ?1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 1678802e03e7..cb583fcbf5bf 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
>> > ? * over remote processors and spread off node memory allocations
>> > ? * as wide as possible.
>> > ? */
>> > - if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
>> > + if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
>> > ? return NODE_RECLAIM_NOSCAN;
>> >
>> >
>> > ? if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>>
>>
>> Are we really looking at the top tier in a tiered memory hierarchy here?
>> The comment seems to suggest we are looking at local NUMA node?
>
>The code change itself is correct. But it is an implementation details
>that node_is_toptier() == node_state(, N_CPU). And after we supporting
>more memory tiers (like GPU, HBM), we will change the implementation of
>node_is_toptier() soon. So I think that it's better to keep the
>original code.

Agreed.

Thanks,
Davidlohr

2022-06-01 20:45:51

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

On Tue, 2022-05-31 at 17:20 +0530, Aneesh Kumar K.V wrote:
> Davidlohr Bueso <[email protected]> writes:
>
> > We have helpers for a reason.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1678802e03e7..cb583fcbf5bf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> >   * over remote processors and spread off node memory allocations
> >   * as wide as possible.
> >   */
> > - if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> > + if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
> >   return NODE_RECLAIM_NOSCAN;
> >  
> >
> >   if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
>
>
> Are we really looking at the top tier in a tiered memory hierarchy here?
> The comment seems to suggest we are looking at local NUMA node?

The code change itself is correct. But it is an implementation details
that node_is_toptier() == node_state(, N_CPU). And after we supporting
more memory tiers (like GPU, HBM), we will change the implementation of
node_is_toptier() soon. So I think that it's better to keep the
original code.

Best Regards,
Huang, Ying



2022-06-01 21:29:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 2/6] mm/vmscan: use node_is_toptier helper in node_reclaim

Davidlohr Bueso <[email protected]> writes:

> We have helpers for a reason.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1678802e03e7..cb583fcbf5bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4750,7 +4750,7 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
> * over remote processors and spread off node memory allocations
> * as wide as possible.
> */
> - if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
> + if (node_is_toptier(pgdat->node_id) && pgdat->node_id != numa_node_id())
> return NODE_RECLAIM_NOSCAN;
>
> if (test_and_set_bit(PGDAT_RECLAIM_LOCKED, &pgdat->flags))


Are we really looking at the top tier in a tiered memory hierarchy here?
The comment seems to suggest we are looking at local NUMA node?


-aneesh