Cpusets constrain the CPU and Memory placement of tasks.
The `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
has never been utilized.
When a process in a cpuset that constrains memory placement triggers
oom, it may kill a completely irrelevant process on other NUMA nodes,
which will not release any memory for this cpuset.
We can easily achieve node-aware oom by using `CONSTRAINT_CPUSET` and
selecting the victim from cpusets with the same mems_allowed as the
current one.
Example:
Create two processes named mem_on_node0 and mem_on_node1, each
constrained by its own cpuset. These two processes allocate memory on
their own node. Once node0 has run out of memory, oom will be invoked
by mem_on_node0.
Before this patch:
Since `CONSTRAINT_CPUSET` does nothing, the victim will be selected
from the entire system. The OOM killer is therefore highly likely to
kill mem_on_node1, which will not free any memory for mem_on_node0.
This is a useless kill.
```
[ 2786.519080] mem_on_node0 invoked oom-killer
[ 2786.885738] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 2787.181724] [ 13432] 0 13432 787016 786745 6344704 0 0 mem_on_node1
[ 2787.189115] [ 13457] 0 13457 787002 785504 6340608 0 0 mem_on_node0
[ 2787.216534] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 2787.229991] Out of memory: Killed process 13432 (mem_on_node1)
```
After this patch:
The victim will be selected only from cpusets that have the same
mems_allowed as the cpuset that invoked oom. This prevents useless
kills and protects innocent victims.
```
[ 395.922444] mem_on_node0 invoked oom-killer
[ 396.239777] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 396.246128] [ 2614] 0 2614 1311294 1144192 9224192 0 0 mem_on_node0
[ 396.252655] oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=test,mems_allowed=0
[ 396.264068] Out of memory: Killed process 2614 (mem_on_node0)
```
Suggested-by: Michal Hocko <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Zefan Li <[email protected]>
Signed-off-by: Gang Li <[email protected]>
---
Changes in v4:
- Modify comments and documentation.
Changes in v3:
- https://lore.kernel.org/all/[email protected]/
- Provide more details about the use case, testing, implementation.
- Document the userspace visible change in Documentation.
- Rename cpuset_cgroup_scan_tasks() to cpuset_scan_tasks() and add
a doctext comment about its purpose and how it should be used.
- Take cpuset_rwsem to ensure that cpusets are stable.
Changes in v2:
- https://lore.kernel.org/all/[email protected]/
- Select victim from all cpusets with the same mems_allowed as the current cpuset.
v1:
- https://lore.kernel.org/all/[email protected]/
- Introduce cpuset oom.
---
 .../admin-guide/cgroup-v1/cpusets.rst   | 16 ++++++-
 Documentation/admin-guide/cgroup-v2.rst |  4 ++
 include/linux/cpuset.h                  |  6 +++
 kernel/cgroup/cpuset.c                  | 43 +++++++++++++++++++
 mm/oom_kill.c                           |  4 ++
5 files changed, 71 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 5d844ed4df69..51ffdc0eb167 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -25,7 +25,8 @@ Written by [email protected]
1.6 What is memory spread ?
1.7 What is sched_load_balance ?
1.8 What is sched_relax_domain_level ?
- 1.9 How do I use cpusets ?
+ 1.9 What is cpuset oom ?
+ 1.10 How do I use cpusets ?
2. Usage Examples and Syntax
2.1 Basic Usage
2.2 Adding/removing cpus
@@ -607,8 +608,19 @@ If your situation is:
- The latency is required even it sacrifices cache hit rate etc.
then increasing 'sched_relax_domain_level' would benefit you.
+1.9 What is cpuset oom ?
+--------------------------
+If there is no memory available to allocate on the nodes specified by
+cpuset.mems, then the OOM (Out-Of-Memory) killer will be invoked.
+
+Since victim selection is a heuristic algorithm, we cannot select the
+"perfect" victim. Therefore, the victim is currently selected from all
+cpusets that have the same mems_allowed as the cpuset which invoked
+the OOM.
+
+Cpuset oom works in both cgroup v1 and v2.
-1.9 How do I use cpusets ?
+1.10 How do I use cpusets ?
--------------------------
In order to minimize the impact of cpusets on critical kernel
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index f67c0829350b..594aa71cf441 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2199,6 +2199,10 @@ Cpuset Interface Files
a need to change "cpuset.mems" with active tasks, it shouldn't
be done frequently.
+	When a process invokes oom due to the constraint of cpuset.mems,
+	the victim will be selected from cpusets with the same
+	mems_allowed as the current one.
+
cpuset.mems.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 980b76a1237e..75465bf58f74 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg);
+
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
@@ -287,6 +289,10 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
return false;
}
+static inline int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	return 0;
+}
#endif /* !CONFIG_CPUSETS */
#endif /* _LINUX_CPUSET_H */
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bc4dcfd7bee5..cb6b49245e18 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4013,6 +4013,49 @@ void cpuset_print_current_mems_allowed(void)
rcu_read_unlock();
}
+/**
+ * cpuset_scan_tasks - specify the oom scan range
+ * @fn: callback function to select oom victim
+ * @arg: argument for callback function, usually a pointer to struct oom_control
+ *
+ * Description: This function specifies the oom scan range. It returns 0 if
+ * no task is selected, otherwise 1. The selected task, if any, is stored by
+ * @fn in the oom_control's chosen field. This function may only be called
+ * in the cpuset oom context.
+ *
+ * The selection algorithm is heuristic, therefore requires constant iteration
+ * based on user feedback. Currently, we just iterate through all cpusets with
+ * the same mems_allowed as the current cpuset.
+ */
+int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
+{
+	int ret = 0;
+	struct css_task_iter it;
+	struct task_struct *task;
+	struct cpuset *cs;
+	struct cgroup_subsys_state *pos_css;
+
+	/*
+	 * Situation gets complex with overlapping nodemasks in different cpusets.
+	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
+	 *
+	 * But for now, let's make it simple. Just iterate through all cpusets
+	 * with the same mems_allowed as the current cpuset.
+	 */
+	cpuset_read_lock();
+	rcu_read_lock();
+	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
+		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
+			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
+			while (!ret && (task = css_task_iter_next(&it)))
+				ret = fn(task, arg);
+			css_task_iter_end(&it);
+		}
+	}
+	rcu_read_unlock();
+	cpuset_read_unlock();
+	return ret;
+}
+
/*
* Collection of memory_pressure is suppressed unless
* this flag is enabled by writing "1" to the special
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..228257788d9e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -367,6 +367,8 @@ static void select_bad_process(struct oom_control *oc)
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
@@ -427,6 +429,8 @@ static void dump_tasks(struct oom_control *oc)
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
+	else if (oc->constraint == CONSTRAINT_CPUSET)
+		cpuset_scan_tasks(dump_task, oc);
 	else {
 		struct task_struct *p;
--
2.20.1
Hello.
On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <[email protected]> wrote:
> +int cpuset_scan_tasks(int (*fn)(struct task_struct *, void *), void *arg)
> +{
> +	int ret = 0;
> +	struct css_task_iter it;
> +	struct task_struct *task;
> +	struct cpuset *cs;
> +	struct cgroup_subsys_state *pos_css;
> +
> +	/*
> +	 * Situation gets complex with overlapping nodemasks in different cpusets.
> +	 * TODO: Maybe we should calculate the "distance" between different mems_allowed.
> +	 *
> +	 * But for now, let's make it simple. Just iterate through all cpusets
> +	 * with the same mems_allowed as the current cpuset.
> +	 */
> +	cpuset_read_lock();
> +	rcu_read_lock();
> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> +			while (!ret && (task = css_task_iter_next(&it)))
> +				ret = fn(task, arg);
> +			css_task_iter_end(&it);
> +		}
> +	}
> +	rcu_read_unlock();
> +	cpuset_read_unlock();
> +	return ret;
> +}
I see this traverses all cpusets without the hierarchy actually
mattering that much. Wouldn't the CONSTRAINT_CPUSET be better achieved
by globally (or per-memcg) scanning all processes and filtering with:
nodes_intersect(current->mems_allowed, p->mems_allowed)
(`current` triggers the OOM, `p` is the iterated task)
?
Thanks,
Michal
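For concreteness, the global scan described above might look like the
following. This is a hypothetical sketch, not part of the posted patch:
oom_scan_mems_overlap() is an invented name, and the actual kernel helper
is spelled nodes_intersects().
```
/* Hypothetical global victim scan filtered by overlapping mems_allowed. */
static int oom_scan_mems_overlap(int (*fn)(struct task_struct *, void *),
				 void *arg)
{
	struct task_struct *p;
	int ret = 0;

	rcu_read_lock();
	for_each_process(p) {
		/* `current` triggers the OOM, `p` is the iterated task. */
		if (!nodes_intersects(p->mems_allowed, current->mems_allowed))
			continue;
		ret = fn(p, arg);
		if (ret)
			break;
	}
	rcu_read_unlock();
	return ret;
}
```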
On 2023/4/11 20:23, Michal Koutný wrote:
> Hello.
>
> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <[email protected]> wrote:
>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>> +			while (!ret && (task = css_task_iter_next(&it)))
>> +				ret = fn(task, arg);
>> +			css_task_iter_end(&it);
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +	cpuset_read_unlock();
>> +	return ret;
>> +}
>
> I see this traverses all cpusets without the hierarchy actually
> mattering that much. Wouldn't the CONSTRAINT_CPUSET be better achieved
> by globally (or per-memcg) scanning all processes and filtering with:
Oh I see, you mean scanning all processes in all cpusets and scanning
all processes globally are equivalent.
> nodes_intersect(current->mems_allowed, p->mems_allowed)
Perhaps it would be better to use nodes_equal first, and if no suitable
victim is found, then downgrade to nodes_intersect?
The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
if the selected victim's memory happens to be on a node that does not
intersect with the current process's node, we still won't be able to
free up any memory.
In this example:
A->mems_allowed: 0,1
B->mems_allowed: 1,2
nodes_intersect(A->mems_allowed, B->mems_allowed) == true
Memory Distribution:
+=======+=======+=======+
| Node0 | Node1 | Node2 |
+=======+=======+=======+
| A     |       |       |
+-------+-------+-------+
|       |       | B     |
+-------+-------+-------+
Process A invokes oom, then kills B.
But A still can't get any free memory on Node0 and Node1.
> (`current` triggers the OOM, `p` is the iterated task)
> ?
>
> Thanks,
> Michal
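A hypothetical sketch of the two-pass idea from the reply above (untested;
oom_scan_two_pass() is an invented name, and a real version would also need
to check whether the first pass actually chose a victim, e.g. via
oom_control->chosen, instead of relying on the callback's return value
alone):
```
/*
 * Hypothetical two-pass scan: pass 0 considers only tasks whose
 * mems_allowed exactly matches the OOMing task's; pass 1 downgrades
 * to any overlap. Tasks seen in pass 0 are rescanned in pass 1 for
 * simplicity.
 */
static int oom_scan_two_pass(int (*fn)(struct task_struct *, void *),
			     void *arg)
{
	struct task_struct *p;
	int ret = 0;
	int pass;

	rcu_read_lock();
	for (pass = 0; pass < 2 && !ret; pass++) {
		for_each_process(p) {
			bool match = pass == 0 ?
				nodes_equal(p->mems_allowed, current->mems_allowed) :
				nodes_intersects(p->mems_allowed, current->mems_allowed);

			if (!match)
				continue;
			ret = fn(p, arg);
			if (ret)
				break;
		}
	}
	rcu_read_unlock();
	return ret;
}
```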
On Tue 11-04-23 21:04:18, Gang Li wrote:
>
>
On 2023/4/11 20:23, Michal Koutný wrote:
> > Hello.
> >
> > On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <[email protected]> wrote:
> > > +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
> > > +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
> > > +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
> > > +			while (!ret && (task = css_task_iter_next(&it)))
> > > +				ret = fn(task, arg);
> > > +			css_task_iter_end(&it);
> > > +		}
> > > +	}
> > > +	rcu_read_unlock();
> > > +	cpuset_read_unlock();
> > > +	return ret;
> > > +}
> >
> > I see this traverses all cpusets without the hierarchy actually
> > mattering that much. Wouldn't the CONSTRAINT_CPUSET be better achieved
> > by globally (or per-memcg) scanning all processes and filtering with:
>
> Oh I see, you mean scanning all processes in all cpusets and scanning
> all processes globally are equivalent.
Why cannot you simply select a process from the cpuset the allocating
process belongs to? I thought the whole idea was to handle
well-partitioned workloads.
> > nodes_intersect(current->mems_allowed, p->mems_allowed)
>
> Perhaps it would be better to use nodes_equal first, and if no suitable
> victim is found, then downgrade to nodes_intersect?
How can this happen?
> The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
> if the selected victim's memory happens to be on a node that does not
> intersect with the current process's node, we still won't be able to
> free up any memory.
AFAIR NUMA balancing doesn't touch processes with memory policies.
--
Michal Hocko
SUSE Labs
On 2023/4/11 21:12, Michal Hocko wrote:
> On Tue 11-04-23 21:04:18, Gang Li wrote:
>>
>>
>> On 2023/4/11 20:23, Michal Koutný wrote:
>>> Hello.
>>>
>>> On Tue, Apr 11, 2023 at 02:58:15PM +0800, Gang Li <[email protected]> wrote:
>>>> +	cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
>>>> +		if (nodes_equal(cs->mems_allowed, task_cs(current)->mems_allowed)) {
>>>> +			css_task_iter_start(&(cs->css), CSS_TASK_ITER_PROCS, &it);
>>>> +			while (!ret && (task = css_task_iter_next(&it)))
>>>> +				ret = fn(task, arg);
>>>> +			css_task_iter_end(&it);
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +	cpuset_read_unlock();
>>>> +	return ret;
>>>> +}
>>>
>>> I see this traverses all cpusets without the hierarchy actually
>>> mattering that much. Wouldn't the CONSTRAINT_CPUSET be better achieved
>>> by globally (or per-memcg) scanning all processes and filtering with:
>>
>> Oh I see, you mean scanning all processes in all cpusets and scanning
>> all processes globally are equivalent.
>
> Why cannot you simply select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle
> well-partitioned workloads.
>
Yes I can :) It's much easier.
>>> nodes_intersect(current->mems_allowed, p->mems_allowed)
>>
>> Perhaps it would be better to use nodes_equal first, and if no suitable
>> victim is found, then downgrade to nodes_intersect?
>
> How can this happen?
>
>> The NUMA balancing mechanism tends to keep memory on the same NUMA node, and
>> if the selected victim's memory happens to be on a node that does not
>> intersect with the current process's node, we still won't be able to
>> free up any memory.
>
> AFAIR NUMA balancing doesn't touch processes with memory policies.
On Tue 11-04-23 14:58:15, Gang Li wrote:
> Cpusets constrain the CPU and Memory placement of tasks.
> The `CONSTRAINT_CPUSET` type in oom has existed for a long time, but
> has never been utilized.
>
> When a process in a cpuset that constrains memory placement triggers
> oom, it may kill a completely irrelevant process on other NUMA nodes,
> which will not release any memory for this cpuset.
>
> We can easily achieve node-aware oom by using `CONSTRAINT_CPUSET` and
> selecting the victim from cpusets with the same mems_allowed as the
> current one.
I believe it still wouldn't hurt to be more specific here.
CONSTRAINT_CPUSET is rather obscure. Looking at this just makes my head
spin.
	/* Check this allocation failure is caused by cpuset's wall function */
	for_each_zone_zonelist_nodemask(zone, z, oc->zonelist,
			highest_zoneidx, oc->nodemask)
		if (!cpuset_zone_allowed(zone, oc->gfp_mask))
			cpuset_limited = true;
Does this even work properly and why? prepare_alloc_pages sets
oc->nodemask to current->mems_allowed, but the above gives us
cpuset_limited only if there is at least one zone/node that is not
oc->nodemask compatible. So it seems like this wouldn't ever get set
unless oc->nodemask got reset somewhere. This is a maze indeed. Is there
any reason why we cannot rely on __GFP_HARDWALL here? Or should we
instead rely on the fact that the nodemask should be the same as
current->mems_allowed?
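For reference, the kind of check being hinted at might look like this
(a hypothetical, untested sketch for constrained_alloc(); whether
__GFP_HARDWALL is a reliable signal here is exactly the open question):
```
	/*
	 * Hypothetical alternative detection: treat the oom as
	 * cpuset-constrained when the failing allocation was hardwalled
	 * and the allowed nodes do not cover all memory nodes.
	 */
	if (cpusets_enabled() && (oc->gfp_mask & __GFP_HARDWALL) &&
	    !nodes_subset(node_states[N_MEMORY], cpuset_current_mems_allowed))
		cpuset_limited = true;
```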
I do realize that this is not directly related to your patch, but
considering this has been mostly doing nothing, maybe we want to
document it better or even rework it on this occasion.
> Example:
>
> Create two processes named mem_on_node0 and mem_on_node1, each
> constrained by its own cpuset. These two processes allocate memory on
> their own node. Once node0 has run out of memory, oom will be invoked
> by mem_on_node0.
Don't you have an actual real-life example with a properly partitioned
system that clearly misbehaves and that this patch addresses?
--
Michal Hocko
SUSE Labs
On Tue, Apr 11, 2023 at 03:12:34PM +0200, Michal Hocko <[email protected]> wrote:
> > Oh I see, you mean scanning all processes in all cpusets and scanning
> > all processes globally are equivalent.
>
> Why cannot you simple select a process from the cpuset the allocating
> process belongs to? I thought the whole idea was to handle well
> partitioned workloads.
Ah, I was confused by the top_cpuset implementation.
The iteration should then start at
nearest_hardwall_ancestor(task_cs(current)) (as a first approximation).
(in the 1st approximation).
The nodes_intersect/nodes_subset/nodes_equal/whatnot heuristic is
secondary.
HTH,
Michal
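A hypothetical sketch of that suggestion (untested;
cpuset_scan_tasks_hardwall() is an invented name, and
nearest_hardwall_ancestor() is assumed to be in scope as the static helper
defined in kernel/cgroup/cpuset.c):
```
/* Hypothetical variant: confine the walk to the hardwalled subtree. */
static int cpuset_scan_tasks_hardwall(int (*fn)(struct task_struct *, void *),
				      void *arg)
{
	struct cgroup_subsys_state *pos_css;
	struct cpuset *cs, *root;
	int ret = 0;

	cpuset_read_lock();
	rcu_read_lock();
	/* Start at the OOMing task's nearest hardwalled ancestor. */
	root = nearest_hardwall_ancestor(task_cs(current));
	cpuset_for_each_descendant_pre(cs, pos_css, root) {
		struct css_task_iter it;
		struct task_struct *task;

		css_task_iter_start(&cs->css, CSS_TASK_ITER_PROCS, &it);
		while (!ret && (task = css_task_iter_next(&it)))
			ret = fn(task, arg);
		css_task_iter_end(&it);
		if (ret)
			break;
	}
	rcu_read_unlock();
	cpuset_read_unlock();
	return ret;
}
```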