2022-10-26 08:17:17

by Feng Tang

Subject: [PATCH] mm/vmscan: respect cpuset policy during page demotion

In the page reclaim path, memory can be demoted from a faster memory
tier to a slower memory tier. Currently there is no check against
cpuset's memory policy, so even if the target demotion node is not
allowed by the cpuset, the demotion still happens, which breaks the
cpuset semantics.

So add a cpuset policy check in the demotion path and skip demotion
if the demotion targets are not allowed by the cpuset.

Signed-off-by: Feng Tang <[email protected]>
---
Hi reviewers,

To keep the change easily bisectable, I combined the cpuset change and
the mm change in one patch; if you prefer to separate them, I can split
it into 2 patches.

Thanks,
Feng

include/linux/cpuset.h | 6 ++++++
kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
3 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..6fcce2bd2631 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}

+extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+ nodemask_t *nmask);
#else /* !CONFIG_CPUSETS */

static inline bool cpusets_enabled(void) { return false; }
@@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}

+static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
+ nodemask_t *nmask)
+{
+}
#endif /* !CONFIG_CPUSETS */

#endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3ea2e836e93e..cbb118c0502f 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
return mask;
}

+/*
+ * Retrieve the allowed memory nodemask for a cgroup.
+ *
+ * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
+ * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
+ * is no guaranteed association from a cgroup to a cpuset.
+ */
+void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
+
+ if (!is_in_v2_mode()) {
+ *nmask = NODE_MASK_ALL;
+ return;
+ }
+
+ rcu_read_lock();
+ css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
+ if (css) {
+ css_get(css);
+ cs = css_cs(css);
+ *nmask = cs->effective_mems;
+ css_put(css);
+ }
+
+ rcu_read_unlock();
+}
+
/**
* cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
* @nodemask: the nodemask to be checked
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 18f6497994ec..c205d98283bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
{
struct page *target_page;
nodemask_t *allowed_mask;
- struct migration_target_control *mtc;
+ struct migration_target_control *mtc = (void *)private;

- mtc = (struct migration_target_control *)private;
+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+ struct mem_cgroup *memcg;
+ nodemask_t cpuset_nmask;
+
+ memcg = page_memcg(page);
+ cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
+
+ if (!node_isset(mtc->nid, cpuset_nmask)) {
+ if (mtc->nmask)
+ nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
+ return alloc_migration_target(page, (unsigned long)mtc);
+ }
+#endif

allowed_mask = mtc->nmask;
/*
@@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
enum folio_references references = FOLIOREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
+ bool skip_this_demotion = false;

cond_resched();

@@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
if (!folio_trylock(folio))
goto keep;

+#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
+ if (do_demote_pass) {
+ struct mem_cgroup *memcg;
+ nodemask_t nmask, nmask1;
+
+ node_get_allowed_targets(pgdat, &nmask);
+ memcg = folio_memcg(folio);
+ if (memcg)
+ cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
+ &nmask1);
+
+ if (!nodes_intersects(nmask, nmask1))
+ skip_this_demotion = true;
+ }
+#endif
+
VM_BUG_ON_FOLIO(folio_test_active(folio), folio);

nr_pages = folio_nr_pages(folio);
@@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
* Before reclaiming the folio, try to relocate
* its contents to another node.
*/
- if (do_demote_pass &&
+ if (do_demote_pass && !skip_this_demotion &&
(thp_migration_supported() || !folio_test_large(folio))) {
list_add(&folio->lru, &demote_folios);
folio_unlock(folio);
--
2.27.0



2022-10-26 08:41:50

by Yin, Fengwei

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Hi Feng,

On 10/26/2022 3:43 PM, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> Hi reviewers,
>
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
>
> Thanks,
> Feng
>
> include/linux/cpuset.h | 6 ++++++
> kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
> 3 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask)
> +{
> +}
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> return mask;
> }
>
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *cs;
> +
> + if (!is_in_v2_mode()) {
> + *nmask = NODE_MASK_ALL;
> + return;
> + }
> +
> + rcu_read_lock();
> + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> + if (css) {
> + css_get(css);
> + cs = css_cs(css);
> + *nmask = cs->effective_mems;
> + css_put(css);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> /**
> * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
> * @nodemask: the nodemask to be checked
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18f6497994ec..c205d98283bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> {
> struct page *target_page;
> nodemask_t *allowed_mask;
> - struct migration_target_control *mtc;
> + struct migration_target_control *mtc = (void *)private;
>
> - mtc = (struct migration_target_control *)private;
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> + struct mem_cgroup *memcg;
> + nodemask_t cpuset_nmask;
> +
> + memcg = page_memcg(page);
> + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> +
> + if (!node_isset(mtc->nid, cpuset_nmask)) {
> + if (mtc->nmask)
> + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> + return alloc_migration_target(page, (unsigned long)mtc);
> + }
> +#endif
>
> allowed_mask = mtc->nmask;
> /*
> @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> enum folio_references references = FOLIOREF_RECLAIM;
> bool dirty, writeback;
> unsigned int nr_pages;
> + bool skip_this_demotion = false;
>
> cond_resched();
>
> @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (!folio_trylock(folio))
> goto keep;
>
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> + if (do_demote_pass) {
> + struct mem_cgroup *memcg;
> + nodemask_t nmask, nmask1;
> +
> + node_get_allowed_targets(pgdat, &nmask);
> + memcg = folio_memcg(folio);
> + if (memcg)
The change to alloc_demote_page() doesn't check memcg for NULL before dereferencing it. What's the difference here?

> + cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> + &nmask1);
> +
> + if (!nodes_intersects(nmask, nmask1))
If memcg is NULL, nmask1 will have an uninitialized value. Thanks.

Regards
Yin, Fengwei

> + skip_this_demotion = true;
> + }
> +#endif
> +
> VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>
> nr_pages = folio_nr_pages(folio);
> @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> * Before reclaiming the folio, try to relocate
> * its contents to another node.
> */
> - if (do_demote_pass &&
> + if (do_demote_pass && !skip_this_demotion &&
> (thp_migration_supported() || !folio_test_large(folio))) {
> list_add(&folio->lru, &demote_folios);
> folio_unlock(folio);

2022-10-26 08:42:52

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed, Oct 26, 2022 at 04:26:28PM +0800, Yin, Fengwei wrote:
> Hi Feng,
>
> On 10/26/2022 3:43 PM, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > Hi reviewers,
> >
> > For easy bisectable, I combined the cpuset change and mm change
> > in one patch, if you prefer to separate them, I can turn it into
> > 2 patches.
> >
> > Thanks,
> > Feng
> >
> > include/linux/cpuset.h | 6 ++++++
> > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
> > 3 files changed, 67 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index d58e0476ee8e..6fcce2bd2631 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> > task_unlock(current);
> > }
> >
> > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > + nodemask_t *nmask);
> > #else /* !CONFIG_CPUSETS */
> >
> > static inline bool cpusets_enabled(void) { return false; }
> > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> > return false;
> > }
> >
> > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > + nodemask_t *nmask)
> > +{
> > +}
> > #endif /* !CONFIG_CPUSETS */
> >
> > #endif /* _LINUX_CPUSET_H */
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 3ea2e836e93e..cbb118c0502f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> > return mask;
> > }
> >
> > +/*
> > + * Retrieve the allowed memory nodemask for a cgroup.
> > + *
> > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> > + * is no guaranteed association from a cgroup to a cpuset.
> > + */
> > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> > +{
> > + struct cgroup_subsys_state *css;
> > + struct cpuset *cs;
> > +
> > + if (!is_in_v2_mode()) {
> > + *nmask = NODE_MASK_ALL;
> > + return;
> > + }
> > +
> > + rcu_read_lock();
> > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> > + if (css) {
> > + css_get(css);
> > + cs = css_cs(css);
> > + *nmask = cs->effective_mems;
> > + css_put(css);
> > + }
> > +
> > + rcu_read_unlock();
> > +}
> > +
> > /**
> > * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
> > * @nodemask: the nodemask to be checked
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 18f6497994ec..c205d98283bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> > {
> > struct page *target_page;
> > nodemask_t *allowed_mask;
> > - struct migration_target_control *mtc;
> > + struct migration_target_control *mtc = (void *)private;
> >
> > - mtc = (struct migration_target_control *)private;
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > + struct mem_cgroup *memcg;
> > + nodemask_t cpuset_nmask;
> > +
> > + memcg = page_memcg(page);
> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> > +
> > + if (!node_isset(mtc->nid, cpuset_nmask)) {
> > + if (mtc->nmask)
> > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> > + return alloc_migration_target(page, (unsigned long)mtc);
> > + }
> > +#endif
> >
> > allowed_mask = mtc->nmask;
> > /*
> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > enum folio_references references = FOLIOREF_RECLAIM;
> > bool dirty, writeback;
> > unsigned int nr_pages;
> > + bool skip_this_demotion = false;
> >
> > cond_resched();
> >
> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > if (!folio_trylock(folio))
> > goto keep;
> >
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > + if (do_demote_pass) {
> > + struct mem_cgroup *memcg;
> > + nodemask_t nmask, nmask1;
> > +
> > + node_get_allowed_targets(pgdat, &nmask);
> > + memcg = folio_memcg(folio);
> > + if (memcg)
> Doesn't check memcg in the change to alloc_demote_page(). What's the difference here?
>
> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> > + &nmask1);
> > +
> > + if (!nodes_intersects(nmask, nmask1))
> If memcg is NULL, nmask1 will have an uninitialized value. Thanks.

Good catch! Yes, it should be initialized to NODE_MASK_ALL.

Actually I was not sure if I needed to check the 'memcg == NULL' case,
as I was under the impression that for a page on an LRU list, its memcg
is either a specific memcg or the 'root_mem_cgroup'. I will double
check that.
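
A minimal sketch of the fix I have in mind (untested; only the mask
initialization changes, the rest is as in the patch):

	if (do_demote_pass) {
		struct mem_cgroup *memcg;
		nodemask_t nmask;
		nodemask_t nmask1 = NODE_MASK_ALL; /* no cpuset constraint by default */

		node_get_allowed_targets(pgdat, &nmask);
		memcg = folio_memcg(folio);
		if (memcg)
			cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
						     &nmask1);

		/* skip demotion if cpuset allows none of the target nodes */
		if (!nodes_intersects(nmask, nmask1))
			skip_this_demotion = true;
	}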

Thanks,
Feng

> Regards
> Yin, Fengwei
>
> > + skip_this_demotion = true;
> > + }
> > +#endif
> > +
> > VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> >
> > nr_pages = folio_nr_pages(folio);
> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > * Before reclaiming the folio, try to relocate
> > * its contents to another node.
> > */
> > - if (do_demote_pass &&
> > + if (do_demote_pass && !skip_this_demotion &&
> > (thp_migration_supported() || !folio_test_large(folio))) {
> > list_add(&folio->lru, &demote_folios);
> > folio_unlock(folio);

2022-10-26 12:42:25

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/26/22 5:51 PM, Michal Hocko wrote:
> On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
>> On 10/26/22 4:32 PM, Michal Hocko wrote:
>>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
>>>> On 10/26/22 2:49 PM, Michal Hocko wrote:
>>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>>>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>>>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>>>>>> memory policy, that even if the target demotion node is not allowd
>>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>>>>>> semantics.
>>>>>>>>
>>>>>>>> So add cpuset policy check in the demotion path and skip demotion
>>>>>>>> if the demotion targets are not allowed by cpuset.
>>>>>>>>
>>>>>>>
>>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>>>>>> those memory policy restrictions while demoting the page?
>>>>>>
>>>>>> Good question! We have some basic patches to consider memory policy
>>>>>> in demotion path too, which are still under test, and will be posted
>>>>>> soon. And the basic idea is similar to this patch.
>>>>>
>>>>> For that you need to consult each vma and it's owning task(s) and that
>>>>> to me sounds like something to be done in folio_check_references.
>>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
>>>>> 100% correct. Memory controller might be disabled and then you do not
>>>>> have your association anymore.
>>>>>
>>>>
>>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
>>>> vmas.
>>>>
>>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
>>>
>>> How would that help for private mappings shared between parent/child?
>>
>>
>> this is MAP_PRIVATE | MAP_SHARED?
>

Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.

> This is not a valid combination IIRC. What I meant is a simple
> MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child.
>
> [...]


-aneesh

2022-10-26 15:13:20

by Waiman Long

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/26/22 03:43, Feng Tang wrote:
> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> Hi reviewers,
>
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
>
> Thanks,
> Feng
>
> include/linux/cpuset.h | 6 ++++++
> kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
> 3 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask)
> +{
> +}
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> return mask;
> }
>
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *cs;
> +
> + if (!is_in_v2_mode()) {
> + *nmask = NODE_MASK_ALL;
> + return;
> + }

You are allowing all nodes to be used for cgroup v1. Is there a reason
why you ignore v1?

> +
> + rcu_read_lock();
> + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> + if (css) {
> + css_get(css);
> + cs = css_cs(css);
> + *nmask = cs->effective_mems;
> + css_put(css);
> + }
Since you are holding an RCU read lock and copying out the whole
nodemask, you probably don't need to do a css_get/css_put pair.
> +
> + rcu_read_unlock();
> +}
> +
Cheers,

Longman


2022-10-27 05:31:24

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Feng Tang <[email protected]> writes:

> In page reclaim path, memory could be demoted from faster memory tier
> to slower memory tier. Currently, there is no check about cpuset's
> memory policy, that even if the target demotion node is not allowd
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> Hi reviewers,
>
> For easy bisectable, I combined the cpuset change and mm change
> in one patch, if you prefer to separate them, I can turn it into
> 2 patches.
>
> Thanks,
> Feng
>
> include/linux/cpuset.h | 6 ++++++
> kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
> 3 files changed, 67 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..6fcce2bd2631 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> task_unlock(current);
> }
>
> +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask);
> #else /* !CONFIG_CPUSETS */
>
> static inline bool cpusets_enabled(void) { return false; }
> @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> return false;
> }
>
> +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> + nodemask_t *nmask)
> +{
> +}
> #endif /* !CONFIG_CPUSETS */
>
> #endif /* _LINUX_CPUSET_H */
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 3ea2e836e93e..cbb118c0502f 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> return mask;
> }
>
> +/*
> + * Retrieve the allowed memory nodemask for a cgroup.
> + *
> + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> + * is no guaranteed association from a cgroup to a cpuset.
> + */
> +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *cs;
> +
> + if (!is_in_v2_mode()) {
> + *nmask = NODE_MASK_ALL;
> + return;
> + }
> +
> + rcu_read_lock();
> + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> + if (css) {
> + css_get(css);
> + cs = css_cs(css);
> + *nmask = cs->effective_mems;
> + css_put(css);
> + }
> +
> + rcu_read_unlock();
> +}
> +
> /**
> * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
> * @nodemask: the nodemask to be checked
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 18f6497994ec..c205d98283bc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> {
> struct page *target_page;
> nodemask_t *allowed_mask;
> - struct migration_target_control *mtc;
> + struct migration_target_control *mtc = (void *)private;
>
> - mtc = (struct migration_target_control *)private;

I think we should avoid (void *) conversion here.

> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> + struct mem_cgroup *memcg;
> + nodemask_t cpuset_nmask;
> +
> + memcg = page_memcg(page);
> + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> +
> + if (!node_isset(mtc->nid, cpuset_nmask)) {
> + if (mtc->nmask)
> + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> + return alloc_migration_target(page, (unsigned long)mtc);
> + }

If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
original 2-step allocation and apply nodes_and() on the node mask.

> +#endif
>
> allowed_mask = mtc->nmask;
> /*
> @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> enum folio_references references = FOLIOREF_RECLAIM;
> bool dirty, writeback;
> unsigned int nr_pages;
> + bool skip_this_demotion = false;
>
> cond_resched();
>
> @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> if (!folio_trylock(folio))
> goto keep;
>
> +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> + if (do_demote_pass) {
> + struct mem_cgroup *memcg;
> + nodemask_t nmask, nmask1;
> +
> + node_get_allowed_targets(pgdat, &nmask);

pgdat will not change in the loop, so we can move this out of the loop?

> + memcg = folio_memcg(folio);
> + if (memcg)
> + cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> + &nmask1);
> +
> + if (!nodes_intersects(nmask, nmask1))
> + skip_this_demotion = true;
> + }

If nodes_intersects() == true, we will call
cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting
mask to demote_folio_list()?

> +#endif
> +
> VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>
> nr_pages = folio_nr_pages(folio);
> @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> * Before reclaiming the folio, try to relocate
> * its contents to another node.
> */
> - if (do_demote_pass &&
> + if (do_demote_pass && !skip_this_demotion &&
> (thp_migration_supported() || !folio_test_large(folio))) {
> list_add(&folio->lru, &demote_folios);
> folio_unlock(folio);

Best Regards,
Huang, Ying

2022-10-27 06:17:56

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
> Feng Tang <[email protected]> writes:
>
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
[...]
> > index 18f6497994ec..c205d98283bc 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
> > {
> > struct page *target_page;
> > nodemask_t *allowed_mask;
> > - struct migration_target_control *mtc;
> > + struct migration_target_control *mtc = (void *)private;
> >
> > - mtc = (struct migration_target_control *)private;
>
> I think we should avoid (void *) conversion here.

OK, will change back.

> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > + struct mem_cgroup *memcg;
> > + nodemask_t cpuset_nmask;
> > +
> > + memcg = page_memcg(page);
> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
> > +
> > + if (!node_isset(mtc->nid, cpuset_nmask)) {
> > + if (mtc->nmask)
> > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
> > + return alloc_migration_target(page, (unsigned long)mtc);
> > + }
>
> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
> original 2 steps allocation and apply nodes_and() on node mask.

Good catch! Yes, the nodes_and() call should be taken out of this
check and done before calling node_isset().
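
A rough sketch of that reordering (untested, just to confirm I read the
suggestion correctly):

	memcg = page_memcg(page);
	cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);

	/* always restrict the fallback nodemask by the cpuset's nodes */
	if (mtc->nmask)
		nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);

	/* preferred node not allowed by cpuset: use the fallback mask only */
	if (!node_isset(mtc->nid, cpuset_nmask))
		return alloc_migration_target(page, (unsigned long)mtc);

	/* otherwise fall through to the original two-step allocation */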

> > +#endif
> >
> > allowed_mask = mtc->nmask;
> > /*
> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > enum folio_references references = FOLIOREF_RECLAIM;
> > bool dirty, writeback;
> > unsigned int nr_pages;
> > + bool skip_this_demotion = false;
> >
> > cond_resched();
> >
> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > if (!folio_trylock(folio))
> > goto keep;
> >
> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
> > + if (do_demote_pass) {
> > + struct mem_cgroup *memcg;
> > + nodemask_t nmask, nmask1;
> > +
> > + node_get_allowed_targets(pgdat, &nmask);
>
> pgdat will not change in the loop, so we can move this out of the loop?

Yes
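
Something like the following, hoisting the call above the folio scan
loop in shrink_folio_list() (the local name is only for illustration):

	nodemask_t demote_target_nodes;

	if (do_demote_pass)
		node_get_allowed_targets(pgdat, &demote_target_nodes);

	while (!list_empty(folio_list)) {
		/* per-folio work only intersects with the folio's cpuset mask */
		...
	}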

> > + memcg = folio_memcg(folio);
> > + if (memcg)
> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
> > + &nmask1);
> > +
> > + if (!nodes_intersects(nmask, nmask1))
> > + skip_this_demotion = true;
> > + }
>
> If nodes_intersects() == true, we will call
> cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting
> mask to demote_folio_list()?

The pages in the loop may come from different memory control groups,
whose cpuset nodemasks could be different, and I don't know how to save
this per-page info to be used later in demote_folio_list().

Thanks,
Feng

> > +#endif
> > +
> > VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> >
> > nr_pages = folio_nr_pages(folio);
> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > * Before reclaiming the folio, try to relocate
> > * its contents to another node.
> > */
> > - if (do_demote_pass &&
> > + if (do_demote_pass && !skip_this_demotion &&
> > (thp_migration_supported() || !folio_test_large(folio))) {
> > list_add(&folio->lru, &demote_folios);
> > folio_unlock(folio);
>
> Best Regards,
> Huang, Ying

2022-10-27 06:19:43

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Feng Tang <[email protected]> writes:

> On Thu, Oct 27, 2022 at 01:13:30PM +0800, Huang, Ying wrote:
>> Feng Tang <[email protected]> writes:
>>
>> > In page reclaim path, memory could be demoted from faster memory tier
>> > to slower memory tier. Currently, there is no check about cpuset's
>> > memory policy, that even if the target demotion node is not allowd
>> > by cpuset, the demotion will still happen, which breaks the cpuset
>> > semantics.
>> >
>> > So add cpuset policy check in the demotion path and skip demotion
>> > if the demotion targets are not allowed by cpuset.
>> >
>> > Signed-off-by: Feng Tang <[email protected]>
>> > ---
> [...]
>> > index 18f6497994ec..c205d98283bc 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1537,9 +1537,21 @@ static struct page *alloc_demote_page(struct page *page, unsigned long private)
>> > {
>> > struct page *target_page;
>> > nodemask_t *allowed_mask;
>> > - struct migration_target_control *mtc;
>> > + struct migration_target_control *mtc = (void *)private;
>> >
>> > - mtc = (struct migration_target_control *)private;
>>
>> I think we should avoid (void *) conversion here.
>
> OK, will change back.
>
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > + struct mem_cgroup *memcg;
>> > + nodemask_t cpuset_nmask;
>> > +
>> > + memcg = page_memcg(page);
>> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup, &cpuset_nmask);
>> > +
>> > + if (!node_isset(mtc->nid, cpuset_nmask)) {
>> > + if (mtc->nmask)
>> > + nodes_and(*mtc->nmask, *mtc->nmask, cpuset_nmask);
>> > + return alloc_migration_target(page, (unsigned long)mtc);
>> > + }
>>
>> If node_isset(mtc->nid, cpuset_nmask) == true, we should keep the
>> original 2 steps allocation and apply nodes_and() on node mask.
>
> Good catch! Yes, the nodes_and() call should be taken out from this
> check and done before calling node_isset().
>
>> > +#endif
>> >
>> > allowed_mask = mtc->nmask;
>> > /*
>> > @@ -1649,6 +1661,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> > enum folio_references references = FOLIOREF_RECLAIM;
>> > bool dirty, writeback;
>> > unsigned int nr_pages;
>> > + bool skip_this_demotion = false;
>> >
>> > cond_resched();
>> >
>> > @@ -1658,6 +1671,22 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> > if (!folio_trylock(folio))
>> > goto keep;
>> >
>> > +#if IS_ENABLED(CONFIG_MEMCG) && IS_ENABLED(CONFIG_CPUSETS)
>> > + if (do_demote_pass) {
>> > + struct mem_cgroup *memcg;
>> > + nodemask_t nmask, nmask1;
>> > +
>> > + node_get_allowed_targets(pgdat, &nmask);
>>
>> pgdat will not change in the loop, so we can move this out of the loop?
>
> Yes
>
>> > + memcg = folio_memcg(folio);
>> > + if (memcg)
>> > + cpuset_get_allowed_mem_nodes(memcg->css.cgroup,
>> > + &nmask1);
>> > +
>> > + if (!nodes_intersects(nmask, nmask1))
>> > + skip_this_demotion = true;
>> > + }
>>
>> If nodes_intersects() == true, we will call
>> cpuset_get_allowed_mem_nodes() twice. Better to pass the intersecting
>> mask to demote_folio_list()?
>
> The pages in the loop may come from different mem control group, and
> the cpuset's nodemask could be different, I don't know how to save
> this per-page info to be used later in demote_folio_list.

Yes. You are right. We cannot do that.

Best Regards,
Huang, Ying

>
>> > +#endif
>> > +
>> > VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
>> >
>> > nr_pages = folio_nr_pages(folio);
>> > @@ -1799,7 +1828,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> > * Before reclaiming the folio, try to relocate
>> > * its contents to another node.
>> > */
>> > - if (do_demote_pass &&
>> > + if (do_demote_pass && !skip_this_demotion &&
>> > (thp_migration_supported() || !folio_test_large(folio))) {
>> > list_add(&folio->lru, &demote_folios);
>> > folio_unlock(folio);
>>
>> Best Regards,
>> Huang, Ying

2022-10-27 06:30:59

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed, Oct 26, 2022 at 10:36:32PM +0800, Waiman Long wrote:
> On 10/26/22 03:43, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, that even if the target demotion node is not allowd
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > Hi reviewers,
> >
> > For easy bisectable, I combined the cpuset change and mm change
> > in one patch, if you prefer to separate them, I can turn it into
> > 2 patches.
> >
> > Thanks,
> > Feng
> >
> > include/linux/cpuset.h | 6 ++++++
> > kernel/cgroup/cpuset.c | 29 +++++++++++++++++++++++++++++
> > mm/vmscan.c | 35 ++++++++++++++++++++++++++++++++---
> > 3 files changed, 67 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> > index d58e0476ee8e..6fcce2bd2631 100644
> > --- a/include/linux/cpuset.h
> > +++ b/include/linux/cpuset.h
> > @@ -178,6 +178,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> > task_unlock(current);
> > }
> >
> > +extern void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > + nodemask_t *nmask);
> > #else /* !CONFIG_CPUSETS */
> >
> > static inline bool cpusets_enabled(void) { return false; }
> > @@ -299,6 +301,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
> > return false;
> > }
> >
> > +static inline void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup,
> > + nodemask_t *nmask)
> > +{
> > +}
> > #endif /* !CONFIG_CPUSETS */
> >
> > #endif /* _LINUX_CPUSET_H */
> > diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> > index 3ea2e836e93e..cbb118c0502f 100644
> > --- a/kernel/cgroup/cpuset.c
> > +++ b/kernel/cgroup/cpuset.c
> > @@ -3750,6 +3750,35 @@ nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
> > return mask;
> > }
> >
> > +/*
> > + * Retrieve the allowed memory nodemask for a cgroup.
> > + *
> > + * Set *nmask to cpuset's effective allowed nodemask for cgroup v2,
> > + * and NODE_MASK_ALL (means no constraint) for cgroup v1 where there
> > + * is no guaranteed association from a cgroup to a cpuset.
> > + */
> > +void cpuset_get_allowed_mem_nodes(struct cgroup *cgroup, nodemask_t *nmask)
> > +{
> > + struct cgroup_subsys_state *css;
> > + struct cpuset *cs;
> > +
> > + if (!is_in_v2_mode()) {
> > + *nmask = NODE_MASK_ALL;
> > + return;
> > + }
>
> You are allowing all nodes to be used for cgroup v1. Is there a reason
> why you ignore v1?

The use case for the API is that, for a memory control group, the user
wants to get its associated cpuset controller's memory policy, so it
walks the memcg --> cgroup --> cpuset chain. IIUC, there is no reliable
such chain for cgroup v1, plus cgroup v2 is the default option for many
distros, so cgroup v1 is bypassed here.

> > +
> > + rcu_read_lock();
> > + css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
> > + if (css) {
> > + css_get(css);
> > + cs = css_cs(css);
> > + *nmask = cs->effective_mems;
> > + css_put(css);
> > + }
> Since you are holding an RCU read lock and copying out the whole
> nodemask, you probably don't need to do a css_get/css_put pair.

Thanks for the note!
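
With that, the lookup can be simplified to roughly this, copying the
mask out while still under the RCU read lock:

	rcu_read_lock();
	css = cgroup_e_css(cgroup, &cpuset_cgrp_subsys);
	if (css)
		*nmask = css_cs(css)->effective_mems;
	rcu_read_unlock();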

Thanks,
Feng

> > +
> > + rcu_read_unlock();
> > +}
> > +
> Cheers,
>
> Longman
>

2022-10-27 08:10:53

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu 27-10-22 15:39:00, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> >> Michal Hocko <[email protected]> writes:
> > [...]
> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> > for some reason but wouldn't it be more practical to tell that
> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > policies explicitly?
> >>
> >> If my understanding were correct, prctl() configures the process or
> >> thread.
> >
> > Not necessarily. There are properties which are per adddress space like
> > PR_[GS]ET_THP_DISABLE. This could be very similar.
> >
> >> How can we get process/thread configuration at demotion time?
> >
> > As already pointed out in previous emails. You could hook into
> > folio_check_references path, more specifically folio_referenced_one
> > where you have all that you need already - all vmas mapping the page and
> > then it is trivial to get the corresponding vm_mm. If at least one of
> > them has the flag set then the demotion is not allowed (essentially the
> > same model as VM_LOCKED).
>
> Got it! Thanks for detailed explanation.
>
> One bit may be not sufficient. For example, if we want to avoid or
> control cross-socket demotion and still allow demoting to slow memory
> nodes in local socket, we need to specify a node mask to exclude some
> NUMA nodes from demotion targets.

Isn't this something to be configured on the demotion topology side? Or
do you expect there will be per process/address space usecases? I mean
different processes running on the same topology, one requesting local
demotion while another is OK with the whole demotion topology?

> From overhead point of view, this appears similar as that of VMA/task
> memory policy? We can make mm->owner available for memory tiers
> (CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need
> to introduce new ABI. I guess users may prefer to use `numactl` than a
> new ABI?

mm->owner is a wrong direction. It doesn't have a strong meaning because
there is no one task explicitly responsible for the mm so there is no
real owner (our clone() semantic is just too permissive for that). The
memcg::owner is a crude and ugly hack and it should go away over time
rather than building new uses on top of it.

Besides that, and as I have already tried to explain, per-task demotion
policy is what makes this whole thing expensive. So this had better be a
per-mm or per-vma property. Whether it is an on/off knob like
PR_[GS]ET_THP_DISABLE or there are explicit requirements for fine-grained
control at the vma level I dunno. I haven't seen those usecases yet and
it is really easy to overengineer this.

To be completely honest I would much rather wait for those usecases
before adding a more complex APIs. PR_[GS]_DEMOTION_DISABLED sounds
like a reasonable first step. Should we have more fine grained
requirements wrt address space I would follow the MADV_{NO}HUGEPAGE
lead.

If we really need/want to give a fine grained control over demotion
nodemask then we would have to go with vma->mempolicy interface. In
any case a per process on/off knob sounds like a reasonable first step
before we learn more about real usecases.
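
Purely as an illustration of that VM_LOCKED-like model (the flag and
field names below are hypothetical, nothing like them exists today),
the check could sit in folio_referenced_one(), which already walks every
vma mapping the folio:

	/* in folio_referenced_one(), with vma->vm_mm at hand */
	if (test_bit(MMF_DEMOTION_DISABLED, &vma->vm_mm->flags))
		pra->no_demotion = true; /* hypothetical folio_referenced_arg field */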
--
Michal Hocko
SUSE Labs

2022-10-27 09:33:27

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed 26-10-22 18:05:46, Aneesh Kumar K V wrote:
> On 10/26/22 5:51 PM, Michal Hocko wrote:
> > On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
> >> On 10/26/22 4:32 PM, Michal Hocko wrote:
> >>> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> >>>> On 10/26/22 2:49 PM, Michal Hocko wrote:
> >>>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >>>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>>>>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>>>>>> to slower memory tier. Currently, there is no check about cpuset's
> >>>>>>>> memory policy, that even if the target demotion node is not allowd
> >>>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>>>>>> semantics.
> >>>>>>>>
> >>>>>>>> So add cpuset policy check in the demotion path and skip demotion
> >>>>>>>> if the demotion targets are not allowed by cpuset.
> >>>>>>>>
> >>>>>>>
> >>>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>>>>>> those memory policy restrictions while demoting the page?
> >>>>>>
> >>>>>> Good question! We have some basic patches to consider memory policy
> >>>>>> in demotion path too, which are still under test, and will be posted
> >>>>>> soon. And the basic idea is similar to this patch.
> >>>>>
> >>>>> For that you need to consult each vma and it's owning task(s) and that
> >>>>> to me sounds like something to be done in folio_check_references.
> >>>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
> >>>>> 100% correct. Memory controller might be disabled and then you do not
> >>>>> have your association anymore.
> >>>>>
> >>>>
> >>>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> >>>> vmas.
> >>>>
> >>>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
> >>>
> >>> How would that help for private mappings shared between parent/child?
> >>
> >>
> >> this is MAP_PRIVATE | MAP_SHARED?
> >
>
> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.

I am still not sure what you are targeting, to be honest. MAP_SHARED and
MAP_PRIVATE mappings can both have pages shared between several vmas.
--
Michal Hocko
SUSE Labs

2022-10-27 09:56:31

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Michal Hocko <[email protected]> writes:

> On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> Michal Hocko <[email protected]> writes:
>> > [...]
>> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> > for some reason but wouldn't it be more practical to tell that
>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> > policies explicitly?
>> >>
>> >> If my understanding were correct, prctl() configures the process or
>> >> thread.
>> >
>> > Not necessarily. There are properties which are per adddress space like
>> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >
>> >> How can we get process/thread configuration at demotion time?
>> >
>> > As already pointed out in previous emails. You could hook into
>> > folio_check_references path, more specifically folio_referenced_one
>> > where you have all that you need already - all vmas mapping the page and
>> > then it is trivial to get the corresponding vm_mm. If at least one of
>> > them has the flag set then the demotion is not allowed (essentially the
>> > same model as VM_LOCKED).
>>
>> Got it! Thanks for detailed explanation.
>>
>> One bit may be not sufficient. For example, if we want to avoid or
>> control cross-socket demotion and still allow demoting to slow memory
>> nodes in local socket, we need to specify a node mask to exclude some
>> NUMA nodes from demotion targets.
>
> Isn't this something to be configured on the demotion topology side? Or
> do you expect there will be per process/address space usecases? I mean
> different processes running on the same topology, one requesting local
> demotion while other ok with the whole demotion topology?

I think that it's possible for different processes to have different
requirements.

- Some processes don't care about where the memory is placed, prefer
local, then fall back to remote if no free space.

- Some processes want to avoid cross-socket traffic, bind to nodes of
local socket.

- Some processes want to avoid to use slow memory, bind to fast memory
node only.

>> From overhead point of view, this appears similar as that of VMA/task
>> memory policy? We can make mm->owner available for memory tiers
>> (CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need
>> to introduce new ABI. I guess users may prefer to use `numactl` than a
>> new ABI?
>
> mm->owner is a wrong direction. It doesn't have a strong meaning because
> there is no one task explicitly responsible for the mm so there is no
> real owner (our clone() semantic is just to permissive for that). The
> memcg::owner is a crude and ugly hack and it should go away over time
> rather than build new uses.
>
> Besides that, and as I have already tried to explain, per task demotion
> policy is what makes this whole thing expensive. So this better be a per
> mm or per vma property. Whether it is a on/off knob like PR_[GS]ET_THP_DISABLE
> or there are explicit requirements for fine grain control on the vma
> level I dunno. I haven't seen those usecases yet and it is really easy
> to overengineer this.
>
> To be completely honest I would much rather wait for those usecases
> before adding a more complex APIs. PR_[GS]_DEMOTION_DISABLED sounds
> like a reasonable first step. Should we have more fine grained
> requirements wrt address space I would follow the MADV_{NO}HUGEPAGE
> lead.
>
> If we really need/want to give a fine grained control over demotion
> nodemask then we would have to go with vma->mempolicy interface. In
> any case a per process on/off knob sounds like a reasonable first step
> before we learn more about real usecases.

Yes. A per-mm or per-vma property is much better than a per-task
property. Another possibility: how about adding a new flag to the
set_mempolicy() system call to set a per-mm mempolicy? `numactl` can
use that by default.
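
For example (MPOL_F_WHOLE_MM is a made-up name purely to illustrate the
idea, no such flag exists today):

	/* hypothetical: apply the policy to the whole mm rather than the calling task */
	set_mempolicy(MPOL_BIND | MPOL_F_WHOLE_MM, nodemask, maxnode);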

Best Regards,
Huang, Ying

2022-10-27 10:23:48

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/27/22 2:32 PM, Michal Hocko wrote:

>>
>> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.
>
> I am still not sure where you are targeting to be honest. MAP_SHARED or
> MAP_PRIVATE both can have page shared between several vmas.


What I was checking was: w.r.t. demotion and shared pages, do we need to
cross-check all the related memory policies? On the page fault side, we
don't do that, i.e. if the vma policy or the faulting task's policy
allows pages to be allocated from the demotion node, then we allocate
the page even if there is a conflicting policy from another thread.

For example, in the case of MAP_ANON | MAP_PRIVATE CoW-shared pages, if
the parent did allow allocation from the demotion node, we can have
pages on the demotion node even though the child's memory policy doesn't
have that node in its node mask.

-aneesh

2022-10-27 12:33:03

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
> >> Michal Hocko <[email protected]> writes:
> >>
> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> >> >> Michal Hocko <[email protected]> writes:
> >> > [...]
> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> >> > for some reason but wouldn't it be more practical to tell that
> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> >> > policies explicitly?
> >> >>
> >> >> If my understanding were correct, prctl() configures the process or
> >> >> thread.
> >> >
> >> > Not necessarily. There are properties which are per adddress space like
> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
> >> >
> >> >> How can we get process/thread configuration at demotion time?
> >> >
> >> > As already pointed out in previous emails. You could hook into
> >> > folio_check_references path, more specifically folio_referenced_one
> >> > where you have all that you need already - all vmas mapping the page and
> >> > then it is trivial to get the corresponding vm_mm. If at least one of
> >> > them has the flag set then the demotion is not allowed (essentially the
> >> > same model as VM_LOCKED).
> >>
> >> Got it! Thanks for detailed explanation.
> >>
> >> One bit may be not sufficient. For example, if we want to avoid or
> >> control cross-socket demotion and still allow demoting to slow memory
> >> nodes in local socket, we need to specify a node mask to exclude some
> >> NUMA nodes from demotion targets.
> >
> > Isn't this something to be configured on the demotion topology side? Or
> > do you expect there will be per process/address space usecases? I mean
> > different processes running on the same topology, one requesting local
> > demotion while other ok with the whole demotion topology?
>
> I think that it's possible for different processes have different
> requirements.
>
> - Some processes don't care about where the memory is placed, prefer
> local, then fall back to remote if no free space.
>
> - Some processes want to avoid cross-socket traffic, bind to nodes of
> local socket.
>
> - Some processes want to avoid to use slow memory, bind to fast memory
> node only.

Yes, I do understand that. Do you have any specific examples in mind?
[...]
> > If we really need/want to give a fine grained control over demotion
> > nodemask then we would have to go with vma->mempolicy interface. In
> > any case a per process on/off knob sounds like a reasonable first step
> > before we learn more about real usecases.
>
> Yes. Per-mm or per-vma property is much better than per-task property.
> Another possibility, how about add a new flag to set_mempolicy() system
> call to set the per-mm mempolicy? `numactl` can use that by default.

Do you mean a flag to control whether the given policy is applied to a
task or mm?
--
Michal Hocko
SUSE Labs

2022-10-27 14:04:55

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu 27-10-22 15:46:07, Aneesh Kumar K V wrote:
> On 10/27/22 2:32 PM, Michal Hocko wrote:
>
> >>
> >> Sorry, I meant MAP_ANONYMOUS | MAP_SHARED.
> >
> > I am still not sure where you are targeting to be honest. MAP_SHARED or
> > MAP_PRIVATE both can have page shared between several vmas.
>
>
> What I was checking was w.r.t demotion and shared pages do we need to
> cross-check all the related memory policies? On the page fault side, we don't do that.

Yes, because on the page fault side we do have an originator and so we
can apply some reasonable semantic. For memory reclaim there is no such
originator for a specific page. A completely unrelated context might be
reclaiming a page with some mempolicy constraint and you do not have any
record of who faulted it in. The fact that we also have a memory policy
at the task level makes a completely consistent semantic rather hard, if
possible at all (e.g. what if different threads are simply bound to
different nodes because shared memory is prefaulted and local thread
mappings will always be local).

I do not think shared mappings are very special in that regard. It is
our mempolicy API that allows specifying a policy for vmas as well as
tasks, and the latter makes the semantics for reclaim really awkward.

--
Michal Hocko
SUSE Labs

2022-10-27 23:59:22

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Michal Hocko <[email protected]> writes:

> On Thu 27-10-22 17:31:35, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>> >> Michal Hocko <[email protected]> writes:
>> >>
>> >> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> >> >> Michal Hocko <[email protected]> writes:
>> >> > [...]
>> >> >> > I can imagine workloads which wouldn't like to get their memory demoted
>> >> >> > for some reason but wouldn't it be more practical to tell that
>> >> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> >> > policies explicitly?
>> >> >>
>> >> >> If my understanding were correct, prctl() configures the process or
>> >> >> thread.
>> >> >
>> >> > Not necessarily. There are properties which are per adddress space like
>> >> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>> >> >
>> >> >> How can we get process/thread configuration at demotion time?
>> >> >
>> >> > As already pointed out in previous emails. You could hook into
>> >> > folio_check_references path, more specifically folio_referenced_one
>> >> > where you have all that you need already - all vmas mapping the page and
>> >> > then it is trivial to get the corresponding vm_mm. If at least one of
>> >> > them has the flag set then the demotion is not allowed (essentially the
>> >> > same model as VM_LOCKED).
>> >>
>> >> Got it! Thanks for detailed explanation.
>> >>
>> >> One bit may be not sufficient. For example, if we want to avoid or
>> >> control cross-socket demotion and still allow demoting to slow memory
>> >> nodes in local socket, we need to specify a node mask to exclude some
>> >> NUMA nodes from demotion targets.
>> >
>> > Isn't this something to be configured on the demotion topology side? Or
>> > do you expect there will be per process/address space usecases? I mean
>> > different processes running on the same topology, one requesting local
>> > demotion while other ok with the whole demotion topology?
>>
>> I think that it's possible for different processes have different
>> requirements.
>>
>> - Some processes don't care about where the memory is placed, prefer
>> local, then fall back to remote if no free space.
>>
>> - Some processes want to avoid cross-socket traffic, bind to nodes of
>> local socket.
>>
>> - Some processes want to avoid to use slow memory, bind to fast memory
>> node only.
>
> Yes, I do understand that. Do you have any specific examples in mind?
> [...]

Sorry, I don't have specific examples.

>> > If we really need/want to give a fine grained control over demotion
>> > nodemask then we would have to go with vma->mempolicy interface. In
>> > any case a per process on/off knob sounds like a reasonable first step
>> > before we learn more about real usecases.
>>
>> Yes. Per-mm or per-vma property is much better than per-task property.
>> Another possibility, how about add a new flag to set_mempolicy() system
>> call to set the per-mm mempolicy? `numactl` can use that by default.
>
> Do you mean a flag to control whether the given policy is applied to a
> task or mm?

Yes. That is the idea.

Best Regards,
Huang, Ying

2022-10-31 09:52:13

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Michal Hocko <[email protected]> writes:

> On Fri 28-10-22 07:22:27, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> [...]
>> >> I think that it's possible for different processes have different
>> >> requirements.
>> >>
>> >> - Some processes don't care about where the memory is placed, prefer
>> >> local, then fall back to remote if no free space.
>> >>
>> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
>> >> local socket.
>> >>
>> >> - Some processes want to avoid to use slow memory, bind to fast memory
>> >> node only.
>> >
>> > Yes, I do understand that. Do you have any specific examples in mind?
>> > [...]
>>
>> Sorry, I don't have specific examples.
>
> OK, then let's stop any complicated solution right here then. Let's
> start simple with a per-mm flag to disable demotion of an address
> space.

I'm not a big fan of a per-mm flag, because we don't have users for
that either and it would add new ABI as well.

> Should there ever be a real demand for a more fine grained solution
> let's go further but I do not think we want a half baked solution
> without real usecases.

I'm OK with ignoring per-task (and the missing per-process) memory
policy support for now.

Best Regards,
Huang, Ying
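
For reference, the ABI being debated here would presumably look much
like PR_SET_THP_DISABLE on the userspace side. A sketch with a made-up
prctl request number (no such option exists in current kernels, so this
call would simply fail today):

#include <sys/prctl.h>
#include <stdio.h>

/* Hypothetical request number, chosen only for illustration. */
#define PR_SET_DEMOTION_DISABLE	0x4445

int main(void)
{
	/* Opt the whole address space out of demotion, THP_DISABLE-style. */
	if (prctl(PR_SET_DEMOTION_DISABLE, 1, 0, 0, 0) < 0)
		perror("prctl(PR_SET_DEMOTION_DISABLE)");
	return 0;
}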

2022-10-31 09:55:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Mon 31-10-22 16:51:11, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> >> Michal Hocko <[email protected]> writes:
> >>
> >> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> > [...]
> >> >> I think that it's possible for different processes have different
> >> >> requirements.
> >> >>
> >> >> - Some processes don't care about where the memory is placed, prefer
> >> >> local, then fall back to remote if no free space.
> >> >>
> >> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> >> >> local socket.
> >> >>
> >> >> - Some processes want to avoid to use slow memory, bind to fast memory
> >> >> node only.
> >> >
> >> > Yes, I do understand that. Do you have any specific examples in mind?
> >> > [...]
> >>
> >> Sorry, I don't have specific examples.
> >
> > OK, then let's stop any complicated solution right here then. Let's
> > start simple with a per-mm flag to disable demotion of an address
> > space.
>
> I'm not a big fan of per-mm flag. Because we don't have users for that
> too and it needs to add ABI too.

OK, if there are no users for an opt-out then let's just document the
current limitations and be done with it.

> > Should there ever be a real demand for a more fine grained solution
> > let's go further but I do not think we want a half baked solution
> > without real usecases.
>
> I'm OK to ignore per-task (and missing per-process) memory policy
> support for now.

I am against such a half baked solution.
--
Michal Hocko
SUSE Labs

2022-10-31 09:56:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
[...]
> >> I think that it's possible for different processes have different
> >> requirements.
> >>
> >> - Some processes don't care about where the memory is placed, prefer
> >> local, then fall back to remote if no free space.
> >>
> >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> >> local socket.
> >>
> >> - Some processes want to avoid to use slow memory, bind to fast memory
> >> node only.
> >
> > Yes, I do understand that. Do you have any specific examples in mind?
> > [...]
>
> Sorry, I don't have specific examples.

OK, then let's stop any complicated solution right here then. Let's
start simple with a per-mm flag to disable demotion of an address space.
Should there ever be a real demand for a more fine grained solution
let's go further but I do not think we want a half baked solution
without real usecases.

--
Michal Hocko
SUSE Labs

2022-10-31 14:23:05

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote:
> On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> > Michal Hocko <[email protected]> writes:
> >
> > > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> [...]
> > >> I think that it's possible for different processes have different
> > >> requirements.
> > >>
> > >> - Some processes don't care about where the memory is placed, prefer
> > >> local, then fall back to remote if no free space.
> > >>
> > >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> > >> local socket.
> > >>
> > >> - Some processes want to avoid to use slow memory, bind to fast memory
> > >> node only.
> > >
> > > Yes, I do understand that. Do you have any specific examples in mind?
> > > [...]
> >
> > Sorry, I don't have specific examples.
>
> OK, then let's stop any complicated solution right here then. Let's
> start simple with a per-mm flag to disable demotion of an address space.
> Should there ever be a real demand for a more fine grained solution
> let's go further but I do not think we want a half baked solution
> without real usecases.

Yes, the concern from you and Yang about the high cost of the mempolicy
approach is valid.

How about the cpuset part? We've got bug reports from different channels
about using cpuset+docker to control memory placement on memory tiering
systems, which led to 2 commits solving them:

2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
cpuset_init_smp()")
https://lore.kernel.org/all/[email protected]/

8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
bail out early")
https://lore.kernel.org/all/[email protected]/

From these bug reports, I think it's reasonable to say there are quite
a few real-world users combining cpuset, docker and memory tiering
systems. So I plan to refine the original cpuset patch with some of the
optimizations discussed (like checking only once in kswapd-based
shrink_folio_list()).

Thanks,
Feng

> --
> Michal Hocko
> SUSE Labs
>

2022-10-31 14:45:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Mon 31-10-22 22:09:15, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 04:40:15PM +0800, Michal Hocko wrote:
> > On Fri 28-10-22 07:22:27, Huang, Ying wrote:
> > > Michal Hocko <[email protected]> writes:
> > >
> > > > On Thu 27-10-22 17:31:35, Huang, Ying wrote:
> > [...]
> > > >> I think that it's possible for different processes have different
> > > >> requirements.
> > > >>
> > > >> - Some processes don't care about where the memory is placed, prefer
> > > >> local, then fall back to remote if no free space.
> > > >>
> > > >> - Some processes want to avoid cross-socket traffic, bind to nodes of
> > > >> local socket.
> > > >>
> > > >> - Some processes want to avoid to use slow memory, bind to fast memory
> > > >> node only.
> > > >
> > > > Yes, I do understand that. Do you have any specific examples in mind?
> > > > [...]
> > >
> > > Sorry, I don't have specific examples.
> >
> > OK, then let's stop any complicated solution right here then. Let's
> > start simple with a per-mm flag to disable demotion of an address space.
> > Should there ever be a real demand for a more fine grained solution
> > let's go further but I do not think we want a half baked solution
> > without real usecases.
>
> Yes, the concern about the high cost for mempolicy from you and Yang is
> valid.
>
> How about the cpuset part?

Cpusets fall into the same bucket as per-task mempolicies wrt costs. Getting a
cpuset requires knowing all tasks associated with a page. Or am I just
missing some magic? And no, the memcg->cpuset association is not a proper
solution at all.

> We've got bug reports from different channels
> about using cpuset+docker to control memory placement in memory tiering
> system, leading to 2 commits solving them:
>
> 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> cpuset_init_smp()")
> https://lore.kernel.org/all/[email protected]/
>
> 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> bail out early")
> https://lore.kernel.org/all/[email protected]/
>
> From these bug reports, I think it's reasonable to say there are quite
> some real world users using cpuset+docker+memory-tiering-system.

I don't think anybody is questioning the existence of those usecases. The
primary question is whether any of them really requires any non-trivial
(read: nodemask-aware) demotion policies. In other words, do we know of
cpuset policy setups where demotion fallbacks are (partially) excluded?
--
Michal Hocko
SUSE Labs

2022-11-01 03:36:33

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

"Huang, Ying" <[email protected]> writes:

> Michal Hocko <[email protected]> writes:
>
>> On Thu 27-10-22 15:39:00, Huang, Ying wrote:
>>> Michal Hocko <[email protected]> writes:
>>>
>>> > On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>>> >> Michal Hocko <[email protected]> writes:
>>> > [...]
>>> >> > I can imagine workloads which wouldn't like to get their memory demoted
>>> >> > for some reason but wouldn't it be more practical to tell that
>>> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>>> >> > policies explicitly?
>>> >>
>>> >> If my understanding were correct, prctl() configures the process or
>>> >> thread.
>>> >
>>> > Not necessarily. There are properties which are per address space like
>>> > PR_[GS]ET_THP_DISABLE. This could be very similar.
>>> >
>>> >> How can we get process/thread configuration at demotion time?
>>> >
>>> > As already pointed out in previous emails. You could hook into
>>> > folio_check_references path, more specifically folio_referenced_one
>>> > where you have all that you need already - all vmas mapping the page and
>>> > then it is trivial to get the corresponding vm_mm. If at least one of
>>> > them has the flag set then the demotion is not allowed (essentially the
>>> > same model as VM_LOCKED).
>>>
>>> Got it! Thanks for detailed explanation.
>>>
>>> One bit may be not sufficient. For example, if we want to avoid or
>>> control cross-socket demotion and still allow demoting to slow memory
>>> nodes in local socket, we need to specify a node mask to exclude some
>>> NUMA nodes from demotion targets.
>>
>> Isn't this something to be configured on the demotion topology side? Or
>> do you expect there will be per process/address space usecases? I mean
>> different processes running on the same topology, one requesting local
>> demotion while other ok with the whole demotion topology?
>
> I think that it's possible for different processes have different
> requirements.
>
> - Some processes don't care about where the memory is placed, prefer
> local, then fall back to remote if no free space.
>
> - Some processes want to avoid cross-socket traffic, bind to nodes of
> local socket.
>
> - Some processes want to avoid to use slow memory, bind to fast memory
> node only.

Hi, Johannes, Wei, Jonathan, Yang, Aneesh,

We need your help. Do you or your organization have requirements to
restrict the page demotion target nodes? If so, can you share some
details of the requirements? For example, to avoid cross-socket
traffic, or to avoid using slow memory. And would you want to restrict
that with cpusets, memory policy, or some other interface?

Best Regards,
Huang, Ying

2022-11-07 08:14:54

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > OK, then let's stop any complicated solution right here then. Let's
> > > start simple with a per-mm flag to disable demotion of an address space.
> > > Should there ever be a real demand for a more fine grained solution
> > > let's go further but I do not think we want a half baked solution
> > > without real usecases.
> >
> > Yes, the concern about the high cost for mempolicy from you and Yang is
> > valid.
> >
> > How about the cpuset part?
>
> > Cpusets fall into the same bucket as per task mempolicies wrt costs. Getting a
> cpuset requires knowing all tasks associated with a page. Or am I just
> missing any magic? And no memcg->cpuset association is not a proper
> solution at all.

No, you are not missing anything. It's really difficult to find a
solution that covers all the holes. The patch is actually a best-effort
approach, trying to cover the cgroup v2 + memory controller enabled
case, which we think is a common use case for newer platforms with
tiered memory.

> > We've got bug reports from different channels
> > about using cpuset+docker to control memory placement in memory tiering
> > system, leading to 2 commits solving them:
> >
> > 2685027fca38 ("cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in
> > cpuset_init_smp()")
> > https://lore.kernel.org/all/[email protected]/
> >
> > 8ca1b5a49885 ("mm/page_alloc: detect allocation forbidden by cpuset and
> > bail out early")
> > https://lore.kernel.org/all/[email protected]/
> >
> > From these bug reports, I think it's reasonable to say there are quite
> > some real world users using cpuset+docker+memory-tiering-system.
>
> I don't think anybody is questioning existence of those usecases. The
> primary question is whether any of them really require any non-trivial
> (read nodemask aware) demotion policies. In other words do we know of
> cpuset policy setups where demotion fallbacks are (partially) excluded?

For cpuset NUMA memory binding, there are a couple of possible usecases:

* The user wants cpuset to bind some important containers to faster
memory tiers for better latency/performance (where simply disabling
demotion should work, like your per-mm flag solution).

* The user wants to bind to a set of physically close nodes (like a
faster CPU+DRAM node and its slower PMEM node). With the initial
demotion code, our HW has a 1:1 demotion/promotion pair for each DRAM
node and its closest PMEM node, so the user's binding works fine. But
there are many other types of memory tiering systems from other
vendors, for example with many CPU-less DRAM nodes in the system, and
Aneesh's patchset[1] created a more general tiering interface, where
IIUC each tier has a nodemask and an upper tier can demote to the
whole lower tier, making the demotion path an N:N mapping. For that
case, fine-tuning the cpuset node binding needs this handling (a
rough sketch follows this message).

[1]. https://lore.kernel.org/lkml/[email protected]/

Thanks,
Feng

> --
> Michal Hocko
> SUSE Labs
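
As a rough illustration of the second case above: with an N:N tier
mapping, the usable demotion targets for a cpuset-bound workload become
the intersection of the lower tier's nodes and the cpuset's allowed
nodes. A simplified sketch using the kernel's nodemask helpers (the
masks are example values, not code taken from the posted patch):

#include <linux/nodemask.h>

/*
 * Example: nodes 0-1 form the fast (DRAM) tier, nodes 2-3 the slow
 * tier, and the cpuset allows {0, 2}.  Demotion from the fast tier may
 * then only target node 2.
 */
static nodemask_t allowed_demotion_targets(const nodemask_t *lower_tier,
					   const nodemask_t *cpuset_mems)
{
	nodemask_t allowed;

	nodes_and(allowed, *lower_tier, *cpuset_mems);
	return allowed;
}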

2022-11-07 09:03:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Mon 07-11-22 16:05:37, Feng Tang wrote:
> On Mon, Oct 31, 2022 at 03:32:34PM +0100, Michal Hocko wrote:
> > > > OK, then let's stop any complicated solution right here then. Let's
> > > > start simple with a per-mm flag to disable demotion of an address space.
> > > > Should there ever be a real demand for a more fine grained solution
> > > > let's go further but I do not think we want a half baked solution
> > > > without real usecases.
> > >
> > > Yes, the concern about the high cost for mempolicy from you and Yang is
> > > valid.
> > >
> > > How about the cpuset part?
> >
> > Cpusets fall into the same bucket as per task mempolicies wrt costs. Getting a
> > cpuset requires knowing all tasks associated with a page. Or am I just
> > missing any magic? And no memcg->cpuset association is not a proper
> > solution at all.
>
> No, you are not missing anything. It's really difficult to find a
> solution for all holes. And the patch is actually a best-efforts
> approach, trying to cover cgroup v2 + memory controller enabled case,
> which we think is a common user case for newer platforms with tiering
> memory.

Best effort is OK, but it shouldn't create unexpected behavior, and this
approach does exactly that.

I thought I had already explained that, but let me be more explicit this
time. Please have a look at how controllers can be enabled/disabled at
different levels of the hierarchy. Unless memcg grows a hard dependency
on another controller (as it does with the blk io controller), this
approach can point to the wrong cpuset. See my point?

Really, a solution for this is not going to be cheap, and I am also not
sure all the hassle is really worth it until there is a clear usecase
in sight.
--
Michal Hocko
SUSE Labs