2021-04-26 07:01:01

by Abel Wu

Subject: [PATCH 0/3] cgroup2: introduce cpuset.mems.migration

Some of our services are quite performance-sensitive and are
designed to be NUMA-aware, aka numa-services. Their SLOs can
easily be violated when numa-services are co-located with
other workloads. Thus they are granted whole NUMA nodes, and
when such an assignment takes effect, the workload already on
those nodes needs to be moved away quickly and completely.

This new cgroup v2 interface is an enhancement of the cgroup
v1 interface cpuset.memory_migrate, adding a new mode called
"lazy". With the help of "lazy" mode migration, we solved the
aforementioned fast-eviction problem.

Patch 1 applies cpuset limits to tasks using the default
memory policy, so that pages inside mems_allowed are preferred
when autoNUMA is enabled. This is also necessary for the
"lazy" mode of cpuset.mems.migration.

Patches 2 & 3 introduce cpuset.mems.migration; see the
patches for detailed information.
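
For illustration, intended usage on a cgroup2 hierarchy would look
roughly like the following sketch (the cgroup path is hypothetical,
and the file only exists with this series applied):

```shell
# Enable lazy migration for a cgroup, then shrink its allowed nodes;
# pages on the removed nodes are marked protnone and migrated on the
# next touch instead of being moved synchronously.
echo lazy > /sys/fs/cgroup/mygroup/cpuset.mems.migration
echo 0 > /sys/fs/cgroup/mygroup/cpuset.mems
```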

Abel Wu (3):
mm/mempolicy: apply cpuset limits to tasks using default policy
cgroup/cpuset: introduce cpuset.mems.migration
docs/admin-guide/cgroup-v2: add cpuset.mems.migration

Documentation/admin-guide/cgroup-v2.rst | 36 ++++++++
kernel/cgroup/cpuset.c | 104 +++++++++++++++++++-----
mm/mempolicy.c | 7 +-
3 files changed, 124 insertions(+), 23 deletions(-)

--
2.31.1


2021-04-26 07:01:06

by Abel Wu

Subject: [PATCH 1/3] mm/mempolicy: apply cpuset limits to tasks using default policy

The nodemasks of non-default policies (pol->v) are calculated within
the restriction of task->mems_allowed, while default policies are not.
This may lead to improper results from mpol_misplaced(), since it can
return a target node outside of current->mems_allowed for tasks using
default policies. Although this is not a bug, because migrating pages
to such an out-of-cpuset node will eventually fail due to sanity
checks in page allocation, it would still be better to avoid such
useless effort.

This patch also changes the behavior of autoNUMA a bit: for tasks
using default policies, it now tends to move pages inside
mems_allowed, which is good for memory isolation.
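
The intended fallback can be sketched in plain userspace C, with an
unsigned long standing in for the kernel nodemask and a simplified
stand-in for node_random() (names and types here are illustrative,
not the kernel's):

```c
#include <stdlib.h>

/* Userspace sketch of the MPOL_F_MORON target-node selection after
 * this patch: prefer the CPU-local node, but fall back to a random
 * node from mems_allowed when the local node is outside the cpuset. */
static int pick_target_node(int thisnid, unsigned long mems_allowed)
{
	if (mems_allowed & (1UL << thisnid))
		return thisnid;		/* local node is allowed: use it */

	/* node_random()-like fallback: pick any set bit in mems_allowed */
	int nodes[64], n = 0;
	for (int nid = 0; nid < 64; nid++)
		if (mems_allowed & (1UL << nid))
			nodes[n++] = nid;
	return n ? nodes[rand() % n] : -1;
}
```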

Signed-off-by: Abel Wu <[email protected]>
---
mm/mempolicy.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d79fa299b70c..e0ae6997bbfb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2516,7 +2516,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- polnid = thisnid;
+ if (node_isset(thisnid, cpuset_current_mems_allowed))
+ polnid = thisnid;
+ else
+ polnid = node_random(&cpuset_current_mems_allowed);

if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
goto out;
--
2.31.1

2021-04-26 07:01:21

by Abel Wu

Subject: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

Some of our services are quite performance-sensitive and are
designed to be NUMA-aware, aka numa-services. Their SLOs can
easily be violated when these services are co-located with
other workloads. Thus they are granted one or several whole
NUMA nodes according to their quota.

When a NUMA node is assigned to a numa-service, the workload
on that node needs to be moved away quickly and completely.
The main aspects we care about during the eviction are:

a) it should complete quickly, so that numa-services
   won't wait so long that user experience is hurt
b) the workloads to be evicted may use a massive amount of
   memory, and migrating that much memory can lead to a
   sudden, severe performance drop lasting tens of seconds
   that certain workloads cannot afford
c) the impact of the eviction should be limited to the
   source and destination nodes
d) a cgroup interface is preferred

So our idea is:

1) fire up numa-services without waiting for memory migration
2) do the memory migration asynchronously, using spare
   memory bandwidth

AutoNUMA seems to be a solution, but its scope is global, which
violates c & d. And cpuset.memory_migrate performs in a synchronous
fashion, which breaks a & b. So a mixture of the two, the new cgroup2
interface cpuset.mems.migration, is introduced.

The new cpuset.mems.migration supports three modes:

- "none" mode, meaning migration disabled
- "sync" mode, which is exactly the same as the cgroup v1
interface cpuset.memory_migrate
- "lazy" mode, which, unlike cpuset.memory_migrate, only
marks pages protnone while walking through them; NUMA
faults triggered by a later touch then handle the actual
movement.

See next patch for detailed information.
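
As a rough userspace sketch of the mode handling this patch adds
(flag values here are illustrative placeholders, not the kernel's):

```c
#include <string.h>

/* Illustrative placeholders for the mempolicy flags used by the patch. */
#define MPOL_MF_MOVE_ALL	0x1	/* "sync": migrate pages immediately */
#define MPOL_MF_LAZY		0x2	/* "lazy": mark protnone, move on touch */

/* Mirrors the cpuset_mm_migration_write() parsing: "none" selects no
 * flag, "sync" and "lazy" select one of them, anything else is EINVAL. */
static int migration_mode_to_flags(const char *buf, int *flags)
{
	if (!strcmp(buf, "none"))
		*flags = 0;
	else if (!strcmp(buf, "sync"))
		*flags = MPOL_MF_MOVE_ALL;
	else if (!strcmp(buf, "lazy"))
		*flags = MPOL_MF_LAZY;
	else
		return -1;	/* -EINVAL in the kernel */
	return 0;
}
```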

Signed-off-by: Abel Wu <[email protected]>
---
kernel/cgroup/cpuset.c | 104 ++++++++++++++++++++++++++++++++---------
mm/mempolicy.c | 2 +-
2 files changed, 84 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a945504c0ae7..ee84f168eea8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -212,6 +212,7 @@ typedef enum {
CS_MEM_EXCLUSIVE,
CS_MEM_HARDWALL,
CS_MEMORY_MIGRATE,
+ CS_MEMORY_MIGRATE_LAZY,
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
@@ -248,6 +249,11 @@ static inline int is_memory_migrate(const struct cpuset *cs)
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}

+static inline int is_memory_migrate_lazy(const struct cpuset *cs)
+{
+ return test_bit(CS_MEMORY_MIGRATE_LAZY, &cs->flags);
+}
+
static inline int is_spread_page(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_PAGE, &cs->flags);
@@ -1594,6 +1600,7 @@ struct cpuset_migrate_mm_work {
struct mm_struct *mm;
nodemask_t from;
nodemask_t to;
+ int flags;
};

static void cpuset_migrate_mm_workfn(struct work_struct *work)
@@ -1602,21 +1609,29 @@ static void cpuset_migrate_mm_workfn(struct work_struct *work)
container_of(work, struct cpuset_migrate_mm_work, work);

/* on a wq worker, no need to worry about %current's mems_allowed */
- do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
+ do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, mwork->flags);
mmput(mwork->mm);
kfree(mwork);
}

-static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
- const nodemask_t *to)
+static void cpuset_migrate_mm(struct cpuset *cs, struct mm_struct *mm,
+ const nodemask_t *from, const nodemask_t *to)
{
- struct cpuset_migrate_mm_work *mwork;
+ struct cpuset_migrate_mm_work *mwork = NULL;
+ int flags = 0;

- mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
+ if (is_memory_migrate_lazy(cs))
+ flags = MPOL_MF_LAZY;
+ else if (is_memory_migrate(cs))
+ flags = MPOL_MF_MOVE_ALL;
+
+ if (flags)
+ mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);
if (mwork) {
mwork->mm = mm;
mwork->from = *from;
mwork->to = *to;
+ mwork->flags = flags;
INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
queue_work(cpuset_migrate_mm_wq, &mwork->work);
} else {
@@ -1690,7 +1705,6 @@ static void update_tasks_nodemask(struct cpuset *cs)
css_task_iter_start(&cs->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
struct mm_struct *mm;
- bool migrate;

cpuset_change_task_nodemask(task, &newmems);

@@ -1698,13 +1712,8 @@ static void update_tasks_nodemask(struct cpuset *cs)
if (!mm)
continue;

- migrate = is_memory_migrate(cs);
-
mpol_rebind_mm(mm, &cs->mems_allowed);
- if (migrate)
- cpuset_migrate_mm(mm, &cs->old_mems_allowed, &newmems);
- else
- mmput(mm);
+ cpuset_migrate_mm(cs, mm, &cs->old_mems_allowed, &newmems);
}
css_task_iter_end(&it);

@@ -1911,6 +1920,11 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
else
clear_bit(bit, &trialcs->flags);

+ if (bit == CS_MEMORY_MIGRATE)
+ clear_bit(CS_MEMORY_MIGRATE_LAZY, &trialcs->flags);
+ if (bit == CS_MEMORY_MIGRATE_LAZY)
+ clear_bit(CS_MEMORY_MIGRATE, &trialcs->flags);
+
err = validate_change(cs, trialcs);
if (err < 0)
goto out;
@@ -2237,11 +2251,8 @@ static void cpuset_attach(struct cgroup_taskset *tset)
* @old_mems_allowed is the right nodesets that we
* migrate mm from.
*/
- if (is_memory_migrate(cs))
- cpuset_migrate_mm(mm, &oldcs->old_mems_allowed,
- &cpuset_attach_nodemask_to);
- else
- mmput(mm);
+ cpuset_migrate_mm(cs, mm, &oldcs->old_mems_allowed,
+ &cpuset_attach_nodemask_to);
}
}

@@ -2258,6 +2269,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)

typedef enum {
FILE_MEMORY_MIGRATE,
+ FILE_MEMORY_MIGRATE_LAZY,
FILE_CPULIST,
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
@@ -2275,11 +2287,8 @@ typedef enum {
FILE_SPREAD_SLAB,
} cpuset_filetype_t;

-static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 val)
+static int __cpuset_write_u64(struct cpuset *cs, cpuset_filetype_t type, u64 val)
{
- struct cpuset *cs = css_cs(css);
- cpuset_filetype_t type = cft->private;
int retval = 0;

get_online_cpus();
@@ -2305,6 +2314,9 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
case FILE_MEMORY_MIGRATE:
retval = update_flag(CS_MEMORY_MIGRATE, cs, val);
break;
+ case FILE_MEMORY_MIGRATE_LAZY:
+ retval = update_flag(CS_MEMORY_MIGRATE_LAZY, cs, val);
+ break;
case FILE_MEMORY_PRESSURE_ENABLED:
cpuset_memory_pressure_enabled = !!val;
break;
@@ -2324,6 +2336,12 @@ static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
return retval;
}

+static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype *cft,
+ u64 val)
+{
+ return __cpuset_write_u64(css_cs(css), cft->private, val);
+}
+
static int cpuset_write_s64(struct cgroup_subsys_state *css, struct cftype *cft,
s64 val)
{
@@ -2473,6 +2491,8 @@ static u64 cpuset_read_u64(struct cgroup_subsys_state *css, struct cftype *cft)
return is_sched_load_balance(cs);
case FILE_MEMORY_MIGRATE:
return is_memory_migrate(cs);
+ case FILE_MEMORY_MIGRATE_LAZY:
+ return is_memory_migrate_lazy(cs);
case FILE_MEMORY_PRESSURE_ENABLED:
return cpuset_memory_pressure_enabled;
case FILE_MEMORY_PRESSURE:
@@ -2555,6 +2575,40 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
return retval ?: nbytes;
}

+static int cpuset_mm_migration_show(struct seq_file *seq, void *v)
+{
+ struct cpuset *cs = css_cs(seq_css(seq));
+
+ if (is_memory_migrate_lazy(cs))
+ seq_puts(seq, "lazy\n");
+ else if (is_memory_migrate(cs))
+ seq_puts(seq, "sync\n");
+ else
+ seq_puts(seq, "none\n");
+ return 0;
+}
+
+static ssize_t cpuset_mm_migration_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct cpuset *cs = css_cs(of_css(of));
+ cpuset_filetype_t type = FILE_MEMORY_MIGRATE;
+ int turning_on = 1;
+ int retval;
+
+ buf = strstrip(buf);
+
+ if (!strcmp(buf, "none"))
+ turning_on = 0;
+ else if (!strcmp(buf, "lazy"))
+ type = FILE_MEMORY_MIGRATE_LAZY;
+ else if (strcmp(buf, "sync"))
+ return -EINVAL;
+
+ retval = __cpuset_write_u64(cs, type, turning_on);
+ return retval ?: nbytes;
+}
+
/*
* for the common functions, 'private' gives the type of file
*/
@@ -2711,6 +2765,14 @@ static struct cftype dfl_files[] = {
.flags = CFTYPE_DEBUG,
},

+ {
+ .name = "mems.migration",
+ .seq_show = cpuset_mm_migration_show,
+ .write = cpuset_mm_migration_write,
+ .private = FILE_MEMORY_MIGRATE,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+
{ } /* terminate */
};

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e0ae6997bbfb..f816b2ac5f52 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1097,7 +1097,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
* need migration. Between passing in the full user address
* space range and MPOL_MF_DISCONTIG_OK, this call can not fail.
*/
- VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
+ VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)));
queue_pages_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
flags | MPOL_MF_DISCONTIG_OK, &pagelist);

--
2.31.1

2021-04-26 07:02:59

by Abel Wu

Subject: [PATCH 3/3] docs/admin-guide/cgroup-v2: add cpuset.mems.migration

Add docs for the new interface cpuset.mems.migration, most of
which are stolen from the cpuset(7) manpage.

Signed-off-by: Abel Wu <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 36 +++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..abf6589a390d 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2079,6 +2079,42 @@ Cpuset Interface Files
Changing the partition state of an invalid partition root to
"member" is always allowed even if child cpusets are present.

+ cpuset.mems.migration
+ A read-write single value file which exists on non-root
+ cpuset-enabled cgroups.
+
+ Only the following migration modes are defined.
+
+ ======== ==========================================
+ "none" migration disabled [default]
+ "sync" move pages to cpuset nodes synchronously
+ "lazy" move pages to cpuset nodes on second touch
+ ======== ==========================================
+
+ By default, "none" mode is enabled. In this mode, once a page
+ is allocated (given a physical page of main memory), that
+ page stays on whatever node it was allocated on, as long as
+ it remains allocated, even if the cpuset's memory placement
+ policy 'cpuset.mems' subsequently changes.
+
+ If "sync" mode is enabled in a cpuset, then when the 'cpuset.mems'
+ setting is changed, any memory page in use by any process in
+ the cpuset that is on a memory node no longer allowed will be
+ migrated synchronously to a memory node that is allowed. The
+ relative placement of a migrated page within the cpuset is
+ preserved during these migration operations if possible.
+
+ The "lazy" mode is almost the same as "sync" mode, except that
+ it doesn't move the pages right away. Instead it marks these
+ pages protnone, and the NUMA faults triggered when they are
+ next touched handle the actual movement.
+
+ Furthermore, if a process is moved into a cpuset with migration
+ enabled ("sync" or "lazy"), any memory pages it uses that are
+ on memory nodes allowed in its previous cpuset, but which are
+ not allowed in its new cpuset, will be migrated to a memory
+ node allowed in the new cpuset.
+

Device controller
-----------------
--
2.31.1

2021-04-27 14:46:15

by Tejun Heo

Subject: Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

Hello,

On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
> When a NUMA node is assigned to numa-service, the workload
> on that node needs to be moved away fast and complete. The
> main aspects we cared about on the eviction are as follows:
>
> a) it should complete soon enough so that numa-services
> won’t wait too long to hurt user experience
> b) the workloads to be evicted could have massive usage on
> memory, and migrating such amount of memory may lead to
> a sudden severe performance drop lasting tens of seconds
> that some certain workloads may not afford
> c) the impact of the eviction should be limited within the
> source and destination nodes
> d) cgroup interface is preferred
>
> So we come to a thought that:
>
> 1) fire up numa-services without waiting for memory migration
> 2) memory migration can be done asynchronously by using spare
> memory bandwidth
>
> AutoNUMA seems to be a solution, but its scope is global which
> violates c&d. And cpuset.memory_migrate performs in a synchronous

I don't think d) in itself is a valid requirement. How does it violate c)?

> fashion which breaks a&b. So a mixture of them, the new cgroup2
> interface cpuset.mems.migration, is introduced.
>
> The new cpuset.mems.migration supports three modes:
>
> - "none" mode, meaning migration disabled
> - "sync" mode, which is exactly the same as the cgroup v1
> interface cpuset.memory_migrate
> - "lazy" mode, when walking through all the pages, unlike
> cpuset.memory_migrate, it only sets pages to protnone,
> and numa faults triggered by later touch will handle the
> movement.

cpuset is already involved in NUMA allocation but it always felt like
something bolted on - it's weird to have cpu to NUMA node settings at global
level and then to have possibly conflicting direct NUMA configuration via
cpuset. My preference would be putting as much configuration as possible on
the mm / autonuma side and let cpuset's node confinements further restrict
their operations rather than cpuset having its own set of policy
configurations.

Johannes, what are your thoughts?

Thanks.

--
tejun

2021-04-28 07:26:21

by Abel Wu

Subject: Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

Hello Tejun, thanks for your review,

On 4/27/21 10:43 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
>> When a NUMA node is assigned to numa-service, the workload
>> on that node needs to be moved away fast and complete. The
>> main aspects we cared about on the eviction are as follows:
>>
>> a) it should complete soon enough so that numa-services
>> won’t wait too long to hurt user experience
>> b) the workloads to be evicted could have massive usage on
>> memory, and migrating such amount of memory may lead to
>> a sudden severe performance drop lasting tens of seconds
>> that some certain workloads may not afford
>> c) the impact of the eviction should be limited within the
>> source and destination nodes
>> d) cgroup interface is preferred
>>
>> So we come to a thought that:
>>
>> 1) fire up numa-services without waiting for memory migration
>> 2) memory migration can be done asynchronously by using spare
>> memory bandwidth
>>
>> AutoNUMA seems to be a solution, but its scope is global which
>> violates c&d. And cpuset.memory_migrate performs in a synchronous
>
> I don't think d) in itself is a valid requirement. How does it violate c)?

Yes, d) is more of a preference, since we operate at the cgroup
level; process/thread-level interfaces are also acceptable.

AutoNUMA violates c) through its global effect: not only the
source and destination nodes, but also processes running on
other nodes would suffer unwanted overhead due to NUMA faults.
Besides that global effect, one-shot migration, as with
cpuset.memory_migrate, is what this scenario expects, rather
than autoNUMA's periodic behavior.
>
>> fashion which breaks a&b. So a mixture of them, the new cgroup2
>> interface cpuset.mems.migration, is introduced.
>>
>> The new cpuset.mems.migration supports three modes:
>>
>> - "none" mode, meaning migration disabled
>> - "sync" mode, which is exactly the same as the cgroup v1
>> interface cpuset.memory_migrate
>> - "lazy" mode, when walking through all the pages, unlike
>> cpuset.memory_migrate, it only sets pages to protnone,
>> and numa faults triggered by later touch will handle the
>> movement.
>
> cpuset is already involved in NUMA allocation but it always felt like
> something bolted on - it's weird to have cpu to NUMA node settings at global
> level and then to have possibly conflicting direct NUMA configuration via
> cpuset. My preference would be putting as much configuration as possible on
> the mm / autonuma side and let cpuset's node confinements further restrict
> their operations rather than cpuset having its own set of policy
> configurations.

Such conflicting configuration exists in our system in order to
reduce TCO, and we haven't yet found a proper way to get rid of
it. Say a numa-service occupies the whole memory of a node but
still leaves some CPUs free. Those free CPUs may be assigned to
other services, ones that are not memory-latency sensitive and
are forbidden to use local memory. Thoughts?

Thanks,
Abel
>
> Johannes, what are your thoughts?
>
> Thanks.
>

2021-05-05 06:01:40

by Abel Wu

Subject: Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

ping :)

On 4/27/21 10:43 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
>> When a NUMA node is assigned to numa-service, the workload
>> on that node needs to be moved away fast and complete. The
>> main aspects we cared about on the eviction are as follows:
>>
>> a) it should complete soon enough so that numa-services
>> won’t wait too long to hurt user experience
>> b) the workloads to be evicted could have massive usage on
>> memory, and migrating such amount of memory may lead to
>> a sudden severe performance drop lasting tens of seconds
>> that some certain workloads may not afford
>> c) the impact of the eviction should be limited within the
>> source and destination nodes
>> d) cgroup interface is preferred
>>
>> So we come to a thought that:
>>
>> 1) fire up numa-services without waiting for memory migration
>> 2) memory migration can be done asynchronously by using spare
>> memory bandwidth
>>
>> AutoNUMA seems to be a solution, but its scope is global which
>> violates c&d. And cpuset.memory_migrate performs in a synchronous
>
> I don't think d) in itself is a valid requirement. How does it violate c)?
>
>> fashion which breaks a&b. So a mixture of them, the new cgroup2
>> interface cpuset.mems.migration, is introduced.
>>
>> The new cpuset.mems.migration supports three modes:
>>
>> - "none" mode, meaning migration disabled
>> - "sync" mode, which is exactly the same as the cgroup v1
>> interface cpuset.memory_migrate
>> - "lazy" mode, when walking through all the pages, unlike
>> cpuset.memory_migrate, it only sets pages to protnone,
>> and numa faults triggered by later touch will handle the
>> movement.
>
> cpuset is already involved in NUMA allocation but it always felt like
> something bolted on - it's weird to have cpu to NUMA node settings at global
> level and then to have possibly conflicting direct NUMA configuration via
> cpuset. My preference would be putting as much configuration as possible on
> the mm / autonuma side and let cpuset's node confinements further restrict
> their operations rather than cpuset having its own set of policy
> configurations.
>
> Johannes, what are your thoughts?
>
> Thanks.
>

2021-05-05 22:44:35

by Johannes Weiner

Subject: Re: [PATCH 2/3] cgroup/cpuset: introduce cpuset.mems.migration

On Tue, Apr 27, 2021 at 10:43:31AM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Apr 26, 2021 at 02:59:45PM +0800, Abel Wu wrote:
> > When a NUMA node is assigned to numa-service, the workload
> > on that node needs to be moved away fast and complete. The
> > main aspects we cared about on the eviction are as follows:
> >
> > a) it should complete soon enough so that numa-services
> > won’t wait too long to hurt user experience
> > b) the workloads to be evicted could have massive usage on
> > memory, and migrating such amount of memory may lead to
> > a sudden severe performance drop lasting tens of seconds
> > that some certain workloads may not afford
> > c) the impact of the eviction should be limited within the
> > source and destination nodes
> > d) cgroup interface is preferred
> >
> > So we come to a thought that:
> >
> > 1) fire up numa-services without waiting for memory migration
> > 2) memory migration can be done asynchronously by using spare
> > memory bandwidth
> >
> > AutoNUMA seems to be a solution, but its scope is global which
> > violates c&d. And cpuset.memory_migrate performs in a synchronous
>
> I don't think d) in itself is a valid requirement. How does it violate c)?
>
> > fashion which breaks a&b. So a mixture of them, the new cgroup2
> > interface cpuset.mems.migration, is introduced.
> >
> > The new cpuset.mems.migration supports three modes:
> >
> > - "none" mode, meaning migration disabled
> > - "sync" mode, which is exactly the same as the cgroup v1
> > interface cpuset.memory_migrate
> > - "lazy" mode, when walking through all the pages, unlike
> > cpuset.memory_migrate, it only sets pages to protnone,
> > and numa faults triggered by later touch will handle the
> > movement.
>
> cpuset is already involved in NUMA allocation but it always felt like
> something bolted on - it's weird to have cpu to NUMA node settings at global
> level and then to have possibly conflicting direct NUMA configuration via
> cpuset. My preference would be putting as much configuration as possible on
> the mm / autonuma side and let cpuset's node confinements further restrict
> their operations rather than cpuset having its own set of policy
> configurations.
>
> Johannes, what are your thoughts?

This is basically a cgroup interface for the existing MPOL_MF_LAZY /
MPOL_F_MOF flags, which are per task (set_mempolicy()) and per-vma
(mbind()) scope respectively. They're not per-node, so cannot be
cgroupified through cpuset's node restrictions alone, and I understand
why a cgroup interface could be convenient.

On the other hand, this is not really about configuring a shared
resource. Rather it's using cgroups to set an arbitrary task parameter
on a bunch of tasks simultaneously. It's the SIMD type usecase of
cgroup1 that we tried to get away from in cgroup2, simply because it's
so unbounded in scope. There are *a lot* of possible task parameters,
and we could add a lot of kernel interfaces that boil down to
css_task_iter and setting or clearing a task flag.

So I'm also thinking this cgroup interface isn't desirable.

If you want to control numa policies of tasks from the outside, it's
probably best to extend the numa syscall interface to work on pids.
And then use cgroup.procs to cgroupify the operation from userspace.
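
For illustration, that userspace approach might look like the
following sketch, using the existing migratepages(8) utility from the
numactl package (the cgroup path and node numbers are hypothetical):

```shell
# Hypothetical sketch: migrate every task in a cgroup away from node 2
# to nodes 0-1, driven entirely from userspace via cgroup.procs.
while read -r pid; do
    migratepages "$pid" 2 0-1
done < /sys/fs/cgroup/mygroup/cgroup.procs
```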

Or extend the NUMA interface to make the system-wide default behavior
configurable, so that you can set MPOL_F_MOF in there (without having
to enable autonuma).

But yeah, cgroups doesn't seem like the right place to do this.

Thanks