2022-07-08 09:16:58

by Gang Li

Subject: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

TLDR
----
If a mempolicy or cpuset is in effect, out_of_memory() will select a
victim on the specific node to kill, so that the kernel can avoid
accidental killings on NUMA systems.

Problem
-------
Before this patch series, the OOM killer only kills the process with the
highest memory usage, by selecting the process with the highest
oom_badness on the entire system.

This works fine on UMA systems, but may cause accidental killings on
NUMA systems.

As shown below, if process c.out is bound to Node1 and keeps allocating
pages from Node1, a.out will be killed first. But killing a.out doesn't
free any memory on Node1, so c.out will be killed next.

A lot of AMD machines have 8 NUMA nodes. On these systems, there is a
greater chance of triggering this problem.

OOM before patches:
```
Per-node process memory usage (in MBs)
PID          Node 0      Node 1          Total
-----------  ----------  -------------   --------
3095 a.out   3073.34     0.11            3073.45   (Killed first. Max mem usage)
3199 b.out   501.35      1500.00         2001.35
3805 c.out   1.52        (grow)2248.00   2249.52   (Killed then. Node1 is full)
-----------  ----------  -------------   --------
Total        3576.21     3748.11         7324.31
```

Solution
--------
We store per-node rss counters in mm_rss_stat for each process.

If a page allocation with a mempolicy or cpuset in effect triggers OOM,
we calculate oom_badness using the rss counter of the corresponding node,
then select the process with the highest oom_badness on that node to
kill.

OOM after patches:
```
Per-node process memory usage (in MBs)
PID          Node 0      Node 1          Total
-----------  ----------  -------------   ----------
3095 a.out   3073.34     0.11            3073.45
3199 b.out   501.35      1500.00         2001.35
3805 c.out   1.52        (grow)2248.00   2249.52   (killed)
-----------  ----------  -------------   ----------
Total        3576.21     3748.11         7324.31
```
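
To make the accounting concrete, here is a rough sketch of the counter
plumbing. The numa_count field name follows the title of patch 2; the
helper signature is only a guess at the `node` parameter added by patch 1,
and the real patches also cache updates per task, so exact names and
details may differ.

```c
/*
 * Sketch only (see patches 1-4 for the real interface): alongside the
 * existing global rss counters, each mm keeps one counter per possible
 * NUMA node, updated with the node the page actually lives on.
 */
static inline void add_mm_counter(struct mm_struct *mm, int member,
				  long value, int node)
{
	atomic_long_add(value, &mm->rss_stat.count[member]);
	if (node != NUMA_NO_NODE)
		atomic_long_add(value, &mm->rss_stat.numa_count[node]);
}
```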

Overhead
--------
CPU:

According to UnixBench results, there is less than one percent
performance loss in most cases.

On a 40c512g machine:

40 parallel copies of tests:
+----------+----------+-----+----------+---------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+----------+---------+---------+---------+
| off | 2920.24 | ... | 35926.58 | 6980.14 | 2617.18 | 8484.52 |
| on | 2919.15 | ... | 36066.07 | 6835.01 | 2724.82 | 8461.24 |
| overhead | 0.04% | ... | -0.39% | 2.12% | -3.95% | 0.28% |
+----------+----------+-----+----------+---------+---------+---------+

1 parallel copy of tests:
+----------+----------+-----+---------+--------+---------+---------+
| numastat | FileCopy | ... | Pipe | Fork | syscall | total |
+----------+----------+-----+---------+--------+---------+---------+
| off | 1515.37 | ... | 1473.97 | 546.88 | 1152.37 | 1671.2 |
| on | 1508.09 | ... | 1473.75 | 532.61 | 1148.83 | 1662.72 |
| overhead | 0.48% | ... | 0.01% | 2.68% | 0.31% | 0.51% |
+----------+----------+-----+---------+--------+---------+---------+

MEM:

per task_struct:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes

per mm_struct:
sizeof(atomic_long_t) * num_possible_nodes() + sizeof(atomic_long_t*)
typically 8 * 2 + 8 bytes

zap_pte_range:
sizeof(int) * num_possible_nodes() + sizeof(int*)
typically 4 * 2 + 8 bytes
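
For reference, the per-mm_struct cost above is simply one array of
per-node counters plus the pointer tracking it, allocated when the mm is
set up (the real hook is in kernel/fork.c per the diffstat below; this
snippet is only illustrative):

```c
/* One atomic counter per possible node, plus the tracking pointer. */
mm->rss_stat.numa_count = kcalloc(num_possible_nodes(),
				  sizeof(atomic_long_t), GFP_KERNEL);
if (!mm->rss_stat.numa_count)
	return -ENOMEM;
```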

Changelog
---------
v2:
- enable per numa node oom for `CONSTRAINT_CPUSET`.
- add benchmark result in cover letter.

Gang Li (5):
mm: add a new parameter `node` to `get/add/inc/dec_mm_counter`
mm: add numa_count field for rss_stat
mm: add numa fields for tracepoint rss_stat
mm: enable per numa node rss_stat count
mm, oom: enable per numa node oom for
CONSTRAINT_{MEMORY_POLICY,CPUSET}

 arch/s390/mm/pgtable.c        |   4 +-
 fs/exec.c                     |   2 +-
 fs/proc/base.c                |   6 +-
 fs/proc/task_mmu.c            |  14 ++--
 include/linux/mm.h            |  59 ++++++++++++-----
 include/linux/mm_types_task.h |  16 +++++
 include/linux/oom.h           |   2 +-
 include/trace/events/kmem.h   |  27 ++++++--
 kernel/events/uprobes.c       |   6 +-
 kernel/fork.c                 |  70 +++++++++++++++++++-
 mm/huge_memory.c              |  13 ++--
 mm/khugepaged.c               |   4 +-
 mm/ksm.c                      |   2 +-
 mm/madvise.c                  |   2 +-
 mm/memory.c                   | 119 ++++++++++++++++++++++++----------
 mm/migrate.c                  |   4 ++
 mm/migrate_device.c           |   2 +-
 mm/oom_kill.c                 |  69 +++++++++++++++-----
 mm/rmap.c                     |  19 +++---
 mm/swapfile.c                 |   6 +-
 mm/userfaultfd.c              |   2 +-
 21 files changed, 335 insertions(+), 113 deletions(-)

--
2.20.1


2022-07-08 09:22:35

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

On Fri 08-07-22 16:21:24, Gang Li wrote:
> TLDR
> ----
> If a mempolicy or cpuset is in effect, out_of_memory() will select a
> victim on the specific node to kill, so that the kernel can avoid
> accidental killings on NUMA systems.

We have discussed this in your previous posting and an alternative
proposal was to use cpusets to partition NUMA aware workloads and
enhance the oom killer to be cpuset aware instead which should be a much
easier solution.

> Problem
> -------
> Before this patch series, the OOM killer only kills the process with the
> highest memory usage, by selecting the process with the highest
> oom_badness on the entire system.
>
> This works fine on UMA systems, but may cause accidental killings on
> NUMA systems.
>
> As shown below, if process c.out is bound to Node1 and keeps allocating
> pages from Node1, a.out will be killed first. But killing a.out doesn't
> free any memory on Node1, so c.out will be killed next.
>
> A lot of AMD machines have 8 NUMA nodes. On these systems, there is a
> greater chance of triggering this problem.

Please be more specific about existing usecases which suffer from the
current OOM handling limitations.
--
Michal Hocko
SUSE Labs

2022-07-08 09:24:09

by Gang Li

Subject: [PATCH v2 5/5] mm, oom: enable per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

The page allocator will only allocate pages on the nodes indicated by the
mempolicy or cpuset, but the OOM killer still selects its victim by total
rss usage.

This patch lets the OOM killer calculate rss only on the given node when
oc->constraint equals CONSTRAINT_{MEMORY_POLICY,CPUSET}.

With those constraints, the process with the highest memory consumption
on the specific node will be killed. The oom_kill dmesg output now has a
new column `(%d)nrss`.

It looks like this:
```
[ 1471.436027] Tasks state (memory values in pages):
[ 1471.438518] [ pid ] uid tgid total_vm rss (01)nrss pgtables_bytes swapents oom_score_adj name
[ 1471.554703] [ 1011] 0 1011 220005 8589 1872 823296 0 0 node
[ 1471.707912] [ 12399] 0 12399 1311306 1311056 262170 10534912 0 0 a.out
[ 1471.712429] [ 13135] 0 13135 787018 674666 674300 5439488 0 0 a.out
[ 1471.721506] [ 13295] 0 13295 597 188 0 24576 0 0 sh
[ 1471.734600] oom-kill:constraint=CONSTRAINT_MEMORY_POLICY,nodemask=1,cpuset=/,mems_allowed=0-2,global_oom,task_memcg=/user.slice/user-0.slice/session-3.scope,task=a.out,pid=13135,uid=0
[ 1471.742583] Out of memory: Killed process 13135 (a.out) total-vm:3148072kB, anon-rss:2697304kB, file-rss:1360kB, shmem-rss:0kB, UID:0 pgtables:5312kB oom_score_adj:0
[ 1471.849615] oom_reaper: reaped process 13135 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
```

Signed-off-by: Gang Li <[email protected]>
---
 fs/proc/base.c      |  6 ++++-
 include/linux/oom.h |  2 +-
 mm/oom_kill.c       | 55 ++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 617816168748..92075e9dca06 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -552,8 +552,12 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
unsigned long totalpages = totalram_pages() + total_swap_pages;
unsigned long points = 0;
long badness;
+ struct oom_control oc = {
+ .totalpages = totalpages,
+ .gfp_mask = 0,
+ };

- badness = oom_badness(task, totalpages);
+ badness = oom_badness(task, &oc);
/*
* Special case OOM_SCORE_ADJ_MIN for all others scale the
* badness value into [0, 2000] range which we have been
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..19eaa447ac57 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -98,7 +98,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
}

long oom_badness(struct task_struct *p,
- unsigned long totalpages);
+ struct oom_control *oc);

extern bool out_of_memory(struct oom_control *oc);

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e25c37e2e90d..921539e29ae9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -189,6 +189,18 @@ static bool should_dump_unreclaim_slab(void)
return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
}

+static inline int get_nid_from_oom_control(struct oom_control *oc)
+{
+ nodemask_t *nodemask;
+ struct zoneref *zoneref;
+
+ nodemask = oc->constraint == CONSTRAINT_MEMORY_POLICY
+ ? oc->nodemask : &cpuset_current_mems_allowed;
+
+ zoneref = first_zones_zonelist(oc->zonelist, gfp_zone(oc->gfp_mask), nodemask);
+ return zone_to_nid(zoneref->zone);
+}
+
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
@@ -198,7 +210,7 @@ static bool should_dump_unreclaim_slab(void)
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, struct oom_control *oc)
{
long points;
long adj;
@@ -227,12 +239,21 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) +
- mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+ if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY ||
+ oc->constraint == CONSTRAINT_CPUSET)) {
+ int nid_to_find_victim = get_nid_from_oom_control(oc);
+
+ points = get_mm_counter(p->mm, -1, nid_to_find_victim) +
+ get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) +
+ mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+ } else {
+ points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS, NUMA_NO_NODE) +
+ mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+ }
task_unlock(p);

/* Normalize to oom_score_adj units */
- adj *= totalpages / 1000;
+ adj *= oc->totalpages / 1000;
points += adj;

return points;
@@ -338,7 +359,7 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
goto select;
}

- points = oom_badness(task, oc->totalpages);
+ points = oom_badness(task, oc);
if (points == LONG_MIN || points < oc->chosen_points)
goto next;

@@ -382,6 +403,7 @@ static int dump_task(struct task_struct *p, void *arg)
{
struct oom_control *oc = arg;
struct task_struct *task;
+ unsigned long node_mm_rss;

if (oom_unkillable_task(p))
return 0;
@@ -399,9 +421,17 @@ static int dump_task(struct task_struct *p, void *arg)
return 0;
}

- pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
+ if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY ||
+ oc->constraint == CONSTRAINT_CPUSET)) {
+ int nid_to_find_victim = get_nid_from_oom_control(oc);
+
+ node_mm_rss = get_mm_counter(p->mm, -1, nid_to_find_victim);
+ } else {
+ node_mm_rss = 0;
+ }
+ pr_info("[%7d] %5d %5d %8lu %8lu %8lu %8ld %8lu %5hd %s\n",
task->pid, from_kuid(&init_user_ns, task_uid(task)),
- task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
+ task->tgid, task->mm->total_vm, get_mm_rss(task->mm), node_mm_rss,
mm_pgtables_bytes(task->mm),
get_mm_counter(task->mm, MM_SWAPENTS, NUMA_NO_NODE),
task->signal->oom_score_adj, task->comm);
@@ -422,8 +452,17 @@ static int dump_task(struct task_struct *p, void *arg)
*/
static void dump_tasks(struct oom_control *oc)
{
+ int nid_to_find_victim;
+
+ if (unlikely(oc->constraint == CONSTRAINT_MEMORY_POLICY ||
+ oc->constraint == CONSTRAINT_CPUSET)) {
+ nid_to_find_victim = get_nid_from_oom_control(oc);
+ } else {
+ nid_to_find_victim = -1;
+ }
pr_info("Tasks state (memory values in pages):\n");
- pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
+ pr_info("[ pid ] uid tgid total_vm rss (%02d)nrss pgtables_bytes swapents"
+ " oom_score_adj name\n", nid_to_find_victim);

if (is_memcg_oom(oc))
mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
--
2.20.1

2022-07-08 09:26:55

by Gang Li

Subject: Re: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

Oh, apologies. I just realized what you mean.

I should try a "cpuset cgroup oom killer" that selects victims from a
specific cpuset cgroup.

On 2022/7/8 16:54, Michal Hocko wrote:
> On Fri 08-07-22 16:21:24, Gang Li wrote:
>
> We have discussed this in your previous posting and an alternative
> proposal was to use cpusets to partition NUMA aware workloads and
> enhance the oom killer to be cpuset aware instead which should be a much
> easier solution.

2022-07-08 09:59:21

by Michal Hocko

Subject: Re: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

On Fri 08-07-22 17:25:31, Gang Li wrote:
> Oh, apologies. I just realized what you mean.
>
> I should try a "cpuset cgroup oom killer" that selects victims from a
> specific cpuset cgroup.

Yes, that was the idea. Many workloads which really do care about
partitioning the NUMA system tend to use cpusets. In those cases you
have reasonably defined boundaries and the current OOM killer
implementation is not really aware of that. The OOM selection process
could be enhanced/fixed to select victims from those cpusets, similar to
how memcg OOM killer victim selection is done.
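
For illustration only (not a patch), such a selection loop could look
roughly like the sketch below. The function name is made up; the helpers
it calls (cpuset_mems_allowed_intersects(), oom_badness(),
for_each_process()) are existing kernel interfaces, and task reference
counting, oom_score_adj handling and further locking are omitted.

```c
/*
 * Sketch: consider only tasks whose cpuset-allowed nodes intersect the
 * allocating task's, similar in spirit to how mem_cgroup_scan_tasks()
 * confines memcg OOM victim selection to a single cgroup.
 */
static struct task_struct *select_cpuset_victim(unsigned long totalpages)
{
	struct task_struct *p, *chosen = NULL;
	long points, chosen_points = LONG_MIN;

	rcu_read_lock();
	for_each_process(p) {
		/* Skip tasks that cannot free memory on our allowed nodes. */
		if (!cpuset_mems_allowed_intersects(current, p))
			continue;

		points = oom_badness(p, totalpages);
		if (points > chosen_points) {
			chosen_points = points;
			chosen = p;
		}
	}
	rcu_read_unlock();

	return chosen;
}
```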

There is no additional accounting required for this approach because the
workload is partitioned on the cgroup level already. Maybe this is not
really the best fit for all workloads but it should be reasonably simple
to implement without intrusive or runtime visible changes.

I am not saying per-numa accounting is wrong or a bad idea. I would just
like to see a stronger justification for that and also some arguments
why a simpler approach via cpusets is not viable.

Does this make sense to you?

--
Michal Hocko
SUSE Labs

2022-07-12 11:31:07

by Abel Wu

Subject: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

Hi Michal,

On 7/8/22 4:54 PM, Michal Hocko Wrote:
> On Fri 08-07-22 16:21:24, Gang Li wrote:
>> TLDR
>> ----
>> If a mempolicy or cpuset is in effect, out_of_memory() will select a
>> victim on the specific node to kill, so that the kernel can avoid
>> accidental killings on NUMA systems.
>
> We have discussed this in your previous posting and an alternative
> proposal was to use cpusets to partition NUMA aware workloads and
> enhance the oom killer to be cpuset aware instead which should be a much
> easier solution.
>
>> Problem
>> -------
>> Before this patch series, the OOM killer only kills the process with the
>> highest memory usage, by selecting the process with the highest
>> oom_badness on the entire system.
>>
>> This works fine on UMA systems, but may cause accidental killings on
>> NUMA systems.
>>
>> As shown below, if process c.out is bound to Node1 and keeps allocating
>> pages from Node1, a.out will be killed first. But killing a.out doesn't
>> free any memory on Node1, so c.out will be killed next.
>>
>> A lot of AMD machines have 8 NUMA nodes. On these systems, there is a
>> greater chance of triggering this problem.
>
> Please be more specific about existing usecases which suffer from the
> current OOM handling limitations.

I was just going through the mailing list and happened to see this. There
is another use case for us regarding per-NUMA memory usage.

Say we have several important latency-critical services sitting inside
different NUMA nodes without intersection. The need for memory of these
LC services varies, so the free memory of each node is also different.
Then we launch several background containers without cpuset constraints
to consume the leftover resources. Now the problem is that there doesn't
seem to be a proper memory policy available to balance the usage between
the nodes, which could lead to memory-heavy LC services suffering from
high memory pressure and failing to meet their SLOs.

It would be much appreciated if you could shed some light on this!

Thanks & BR,
Abel

2022-07-12 13:39:12

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

On Tue 12-07-22 19:12:18, Abel Wu wrote:
[...]
> I was just going through the mailing list and happened to see this. There
> is another use case for us regarding per-NUMA memory usage.
>
> Say we have several important latency-critical services sitting inside
> different NUMA nodes without intersection. The need for memory of these
> LC services varies, so the free memory of each node is also different.
> Then we launch several background containers without cpuset constraints
> to consume the leftover resources. Now the problem is that there doesn't
> seem to be a proper memory policy available to balance the usage between
> the nodes, which could lead to memory-heavy LC services suffering from
> high memory pressure and failing to meet their SLOs.

I do agree that cpusets would be rather clumsy, if usable at all, in a
scenario where you are trying to mix NUMA-bound workloads with those
that do not have any NUMA preferences. Could you be more specific about
the requirements here though?

Let's say you run those latency critical services with "simple" memory
policies and mix them with the other workload without any policies in
place, so they compete over memory. It is not really clear to me how you
can achieve any reasonable QoS in such an environment. Your latency
critical services will be more constrained than the non-critical ones,
yet they are more demanding AFAIU.
--
Michal Hocko
SUSE Labs

2022-07-12 15:22:10

by Abel Wu

Subject: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}


On 7/12/22 9:35 PM, Michal Hocko Wrote:
> On Tue 12-07-22 19:12:18, Abel Wu wrote:
> [...]
>> I was just going through the mailing list and happened to see this. There
>> is another use case for us regarding per-NUMA memory usage.
>>
>> Say we have several important latency-critical services sitting inside
>> different NUMA nodes without intersection. The need for memory of these
>> LC services varies, so the free memory of each node is also different.
>> Then we launch several background containers without cpuset constraints
>> to consume the leftover resources. Now the problem is that there doesn't
>> seem to be a proper memory policy available to balance the usage between
>> the nodes, which could lead to memory-heavy LC services suffering from
>> high memory pressure and failing to meet their SLOs.
>
> I do agree that cpusets would be rather clumsy, if usable at all, in a
> scenario where you are trying to mix NUMA-bound workloads with those
> that do not have any NUMA preferences. Could you be more specific about
> the requirements here though?

Yes, these LC services are highly sensitive to memory access latency
and bandwidth, so they are provisioned at NUMA node granularity to meet
their performance requirements. On the other hand, they usually do not
make full use of cpu/mem resources, which increases the TCO of our IDCs,
so we have to co-locate them with background tasks.

Some of these LC services are memory-bound but leave much of the CPU's
capacity unused. In this case we hope the co-located background tasks
can consume some of the leftover without introducing obvious mm overhead
to the LC services.

>
> Let's say you run those latency critical services with "simple" memory
> policies and mix them with the other workload without any policies in
> place, so they compete over memory. It is not really clear to me how you
> can achieve any reasonable QoS in such an environment. Your latency
> critical services will be more constrained than the non-critical ones,
> yet they are more demanding AFAIU.

Yes, the QoS over memory is the biggest obstacle in the way (the other
resources are relatively easier). For now, we hacked a new mpol to
achieve weighted-interleave behavior to balance the memory usage across
NUMA nodes, and only set memcg protections on the LC services. If the
memory pressure is still high, the background tasks will be killed.
Ideas? Thanks!
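
For reference, the node pick in our hacked mpol is conceptually like the
sketch below. It is simplified and the names are made up, not the actual
implementation; the real hook sits in the mempolicy allocation path.

```c
/*
 * Simplified sketch of weighted interleave: allocations cycle through
 * the allowed nodes in proportion to per-node weights, so a node with
 * weight 3 receives three times the pages of a node with weight 1.
 */
static unsigned int weighted_interleave_next_node(unsigned int *counter,
						  const unsigned int *weights,
						  unsigned int nr_nodes)
{
	unsigned int total = 0, pos, nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];

	pos = (*counter)++ % total;
	for (nid = 0; nid < nr_nodes; nid++) {
		if (pos < weights[nid])
			return nid;	/* allocate this page from node nid */
		pos -= weights[nid];
	}
	return 0;
}
```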

Abel

2022-07-18 12:32:06

by Michal Hocko

Subject: Re: [PATCH v2 0/5] mm, oom: Introduce per numa node oom for CONSTRAINT_{MEMORY_POLICY,CPUSET}

On Tue 12-07-22 23:00:55, Abel Wu wrote:
>
> On 7/12/22 9:35 PM, Michal Hocko Wrote:
> > On Tue 12-07-22 19:12:18, Abel Wu wrote:
> > [...]
> > > I was just going through the mailing list and happened to see this. There
> > > is another use case for us regarding per-NUMA memory usage.
> > >
> > > Say we have several important latency-critical services sitting inside
> > > different NUMA nodes without intersection. The need for memory of these
> > > LC services varies, so the free memory of each node is also different.
> > > Then we launch several background containers without cpuset constraints
> > > to consume the leftover resources. Now the problem is that there doesn't
> > > seem to be a proper memory policy available to balance the usage between
> > > the nodes, which could lead to memory-heavy LC services suffering from
> > > high memory pressure and failing to meet their SLOs.
> >
> > I do agree that cpusets would be rather clumsy, if usable at all, in a
> > scenario where you are trying to mix NUMA-bound workloads with those
> > that do not have any NUMA preferences. Could you be more specific about
> > the requirements here though?
>
> Yes, these LC services are highly sensitive to memory access latency
> and bandwidth, so they are provisioned at NUMA node granularity to meet
> their performance requirements. On the other hand, they usually do not
> make full use of cpu/mem resources, which increases the TCO of our IDCs,
> so we have to co-locate them with background tasks.
>
> Some of these LC services are memory-bound but leave much of the CPU's
> capacity unused. In this case we hope the co-located background tasks
> can consume some of the leftover without introducing obvious mm overhead
> to the LC services.

These are some tough requirements and, I am afraid, far from any typical
usage. So I believe that you need careful tuning much more than a policy
whose semantics I really have a hard time imagining, TBH.

> > Let's say you run those latency critical services with "simple" memory
> > policies and mix them with the other workload without any policies in
> > place, so they compete over memory. It is not really clear to me how you
> > can achieve any reasonable QoS in such an environment. Your latency
> > critical services will be more constrained than the non-critical ones,
> > yet they are more demanding AFAIU.
>
> Yes, the QoS over memory is the biggest obstacle in the way (the other
> resources are relatively easier). For now, we hacked a new mpol to
> achieve weighted-interleave behavior to balance the memory usage across
> NUMA nodes, and only set memcg protections on the LC services. If the
> memory pressure is still high, the background tasks will be killed.
> Ideas? Thanks!

From your description it is not really clear what the new memory policy
does and what its semantics are. Memory protection (via memcg) of your
sensitive workload makes sense, but it would require a proper setup of
the background jobs as well. As soon as you hit global direct reclaim,
the memory protection won't save your sensitive workload.

--
Michal Hocko
SUSE Labs