2022-02-11 08:31:19

by Shakeel Butt

Subject: [PATCH v2 0/4] memcg: robust enforcement of memory.high

Due to the semantics of memory.high enforcement, i.e. throttling the
workload without oom-killing it, we are trying to use it for right-sizing
the workloads in our production environment. However, we observed that the
mechanism fails for some specific applications which do a big chunk of
allocations in a single syscall. The reason behind this failure is a
limitation of memory.high enforcement's current implementation. This patch
series solves the issue by enforcing memory.high synchronously if the
current process has accumulated a large amount of high overcharge.
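
For illustration, here is a minimal userspace sketch of the problematic
pattern (sizes are hypothetical; memory.high is assumed to be set well
below the allocation size) -- essentially what the selftest in patch 3
exercises:

	#include <sys/mman.h>

	int main(void)
	{
		size_t size = 200UL << 20;	/* 200MiB in one go */
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		/*
		 * A single mlock() charges all the pages within one kernel
		 * entry, so the return-to-userspace high reclaim only runs
		 * after the whole charge has already happened.
		 */
		mlock(buf, size);
		munmap(buf, size);
		return 0;
	}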

Changes since v1:
- Based on Roman's comment, simplify the sync enforcement and only
target the extreme cases.

Shakeel Butt (4):
memcg: refactor mem_cgroup_oom
memcg: unify force charging conditions
selftests: memcg: test high limit for single entry allocation
memcg: synchronously enforce memory.high for large overcharges

mm/memcontrol.c | 66 +++++++---------
tools/testing/selftests/cgroup/cgroup_util.c | 15 +++-
tools/testing/selftests/cgroup/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 78 +++++++++++++++++++
4 files changed, 120 insertions(+), 40 deletions(-)

--
2.35.1.265.g69c8d7142f-goog



2022-02-11 08:45:24

by Shakeel Butt

Subject: [PATCH v2 3/4] selftests: memcg: test high limit for single entry allocation

Test the enforcement of the memory.high limit for a large amount of
memory allocated within a single kernel entry. There are valid use-cases
where an application triggers a large amount of memory allocation within
a single syscall, e.g. mlock() or mmap(MAP_POPULATE). Make sure
memory.high limit enforcement works for such use-cases.
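
For illustration, the mmap(MAP_POPULATE) flavor of the same single-entry
pattern could look like this (a hypothetical helper, not part of this
patch):

	static int alloc_anon_populate(size_t size)
	{
		/* all of 'size' is faulted in and charged in one mmap() */
		void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
				 -1, 0);

		if (buf == MAP_FAILED)
			return -1;
		munmap(buf, size);
		return 0;
	}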

Signed-off-by: Shakeel Butt <[email protected]>
---
Changes since v1:
- None

tools/testing/selftests/cgroup/cgroup_util.c | 15 +++-
tools/testing/selftests/cgroup/cgroup_util.h | 1 +
.../selftests/cgroup/test_memcontrol.c | 78 +++++++++++++++++++
3 files changed, 91 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
index 0cf7e90c0052..dbaa7aabbb4a 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.c
+++ b/tools/testing/selftests/cgroup/cgroup_util.c
@@ -583,7 +583,7 @@ int clone_into_cgroup_run_wait(const char *cgroup)
return 0;
}

-int cg_prepare_for_wait(const char *cgroup)
+static int __prepare_for_wait(const char *cgroup, const char *filename)
{
int fd, ret = -1;

@@ -591,8 +591,7 @@ int cg_prepare_for_wait(const char *cgroup)
if (fd == -1)
return fd;

- ret = inotify_add_watch(fd, cg_control(cgroup, "cgroup.events"),
- IN_MODIFY);
+ ret = inotify_add_watch(fd, cg_control(cgroup, filename), IN_MODIFY);
if (ret == -1) {
close(fd);
fd = -1;
@@ -601,6 +600,16 @@ int cg_prepare_for_wait(const char *cgroup)
return fd;
}

+int cg_prepare_for_wait(const char *cgroup)
+{
+ return __prepare_for_wait(cgroup, "cgroup.events");
+}
+
+int memcg_prepare_for_wait(const char *cgroup)
+{
+ return __prepare_for_wait(cgroup, "memory.events");
+}
+
int cg_wait_for(int fd)
{
int ret = -1;
diff --git a/tools/testing/selftests/cgroup/cgroup_util.h b/tools/testing/selftests/cgroup/cgroup_util.h
index 4f66d10626d2..628738532ac9 100644
--- a/tools/testing/selftests/cgroup/cgroup_util.h
+++ b/tools/testing/selftests/cgroup/cgroup_util.h
@@ -55,4 +55,5 @@ extern int clone_reap(pid_t pid, int options);
extern int clone_into_cgroup_run_wait(const char *cgroup);
extern int dirfd_open_opath(const char *dir);
extern int cg_prepare_for_wait(const char *cgroup);
+extern int memcg_prepare_for_wait(const char *cgroup);
extern int cg_wait_for(int fd);
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index c19a97dd02d4..36ccf2322e21 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -16,6 +16,7 @@
#include <netinet/in.h>
#include <netdb.h>
#include <errno.h>
+#include <sys/mman.h>

#include "../kselftest.h"
#include "cgroup_util.h"
@@ -628,6 +629,82 @@ static int test_memcg_high(const char *root)
return ret;
}

+static int alloc_anon_mlock(const char *cgroup, void *arg)
+{
+ size_t size = (size_t)arg;
+ void *buf;
+
+ buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON,
+ 0, 0);
+ if (buf == MAP_FAILED)
+ return -1;
+
+ mlock(buf, size);
+ munmap(buf, size);
+ return 0;
+}
+
+/*
+ * This test checks that memory.high is able to throttle a big single-shot
+ * allocation, i.e. a large allocation within one kernel entry.
+ */
+static int test_memcg_high_sync(const char *root)
+{
+ int ret = KSFT_FAIL, pid, fd = -1;
+ char *memcg;
+ long pre_high, pre_max;
+ long post_high, post_max;
+
+ memcg = cg_name(root, "memcg_test");
+ if (!memcg)
+ goto cleanup;
+
+ if (cg_create(memcg))
+ goto cleanup;
+
+ pre_high = cg_read_key_long(memcg, "memory.events", "high ");
+ pre_max = cg_read_key_long(memcg, "memory.events", "max ");
+ if (pre_high < 0 || pre_max < 0)
+ goto cleanup;
+
+ if (cg_write(memcg, "memory.swap.max", "0"))
+ goto cleanup;
+
+ if (cg_write(memcg, "memory.high", "30M"))
+ goto cleanup;
+
+ if (cg_write(memcg, "memory.max", "140M"))
+ goto cleanup;
+
+ fd = memcg_prepare_for_wait(memcg);
+ if (fd < 0)
+ goto cleanup;
+
+ pid = cg_run_nowait(memcg, alloc_anon_mlock, (void *)MB(200));
+ if (pid < 0)
+ goto cleanup;
+
+ cg_wait_for(fd);
+
+ post_high = cg_read_key_long(memcg, "memory.events", "high ");
+ post_max = cg_read_key_long(memcg, "memory.events", "max ");
+ if (post_high < 0 || post_max < 0)
+ goto cleanup;
+
+ if (pre_high == post_high || pre_max != post_max)
+ goto cleanup;
+
+ ret = KSFT_PASS;
+
+cleanup:
+ if (fd >= 0)
+ close(fd);
+ cg_destroy(memcg);
+ free(memcg);
+
+ return ret;
+}
+
/*
* This test checks that memory.max limits the amount of
* memory which can be consumed by either anonymous memory
@@ -1180,6 +1257,7 @@ struct memcg_test {
T(test_memcg_min),
T(test_memcg_low),
T(test_memcg_high),
+ T(test_memcg_high_sync),
T(test_memcg_max),
T(test_memcg_oom_events),
T(test_memcg_swap_max),
--
2.35.1.265.g69c8d7142f-goog


2022-02-11 09:56:48

by Shakeel Butt

Subject: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

The high limit is used to throttle the workload without invoking the
oom-killer. Recently we tried to use the high limit to right-size our
internal workloads, more specifically to dynamically adjust the limits of
a workload without letting it get oom-killed. However, due to a limitation
in the implementation of high limit enforcement, we observed that the
mechanism fails for some real workloads.

The high limit is enforced on return-to-userspace, i.e. the kernel lets
the usage go over the limit, and when execution returns to userspace, the
high reclaim is triggered and the process can get throttled as well.
However, this mechanism fails for workloads which do large allocations in
a single kernel entry, e.g. applications that mlock() a large chunk of
memory in a single syscall. Such applications bypass the high limit and
can trigger the oom-killer.
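
As a reminder, the deferred enforcement is driven from the
return-to-userspace hook (a heavily simplified sketch; unrelated resume
work is elided):

	/* include/linux/tracehook.h (simplified) */
	static inline void tracehook_notify_resume(struct pt_regs *regs)
	{
		/* ... other resume work elided ... */
		mem_cgroup_handle_over_high();	/* high reclaim + throttling */
	}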

To make high limit enforcement more robust, this patch makes the limit
enforcement synchronous, but only if the accumulated overcharge becomes
larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
throttled on the return-to-userspace path, and only the extreme
allocations which accumulate a large amount of overcharge without
returning to userspace will be throttled synchronously. The value
MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the memcg
codebase use this constant, so for now use the same one here.
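
For reference, the constant's value at the time of writing:

	/* include/linux/memcontrol.h */
	#define MEMCG_CHARGE_BATCH 32U

i.e. the synchronous path only kicks in once a task has accumulated more
than 32 pages (128KiB with 4K pages) of overcharge within a single kernel
entry.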

Signed-off-by: Shakeel Butt <[email protected]>
---
Changes since v1:
- Based on Roman's comment, simplify the sync enforcement and only
target the extreme cases.

mm/memcontrol.c | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 292b0b99a2c7..0da4be4798e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
} while ((memcg = parent_mem_cgroup(memcg)));

+ if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
+ !(current->flags & PF_MEMALLOC) &&
+ gfpflags_allow_blocking(gfp_mask)) {
+ mem_cgroup_handle_over_high();
+ }
return 0;
}

--
2.35.1.265.g69c8d7142f-goog


2022-02-11 17:21:06

by Shakeel Butt

Subject: [PATCH v2 1/4] memcg: refactor mem_cgroup_oom

The function mem_cgroup_oom returns an enum which has four possible
values, but the caller does not care about those values and only cares
whether the return value is OOM_SUCCESS or not. So, remove the enum
altogether and make mem_cgroup_oom return a simple bool.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
---
Changes since v1:
- Added comment for mem_cgroup_oom as suggested by Roman

mm/memcontrol.c | 44 +++++++++++++++++---------------------------
1 file changed, 17 insertions(+), 27 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a0e9d9f12cf5..f12e489ba9b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1795,20 +1795,16 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}

-enum oom_status {
- OOM_SUCCESS,
- OOM_FAILED,
- OOM_ASYNC,
- OOM_SKIPPED
-};
-
-static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+/*
+ * Returns true if successfully killed one or more processes. Though in some
+ * corner cases it can return true even without killing any process.
+ */
+static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- enum oom_status ret;
- bool locked;
+ bool locked, ret;

if (order > PAGE_ALLOC_COSTLY_ORDER)
- return OOM_SKIPPED;
+ return false;

memcg_memory_event(memcg, MEMCG_OOM);

@@ -1831,14 +1827,13 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
* victim and then we have to bail out from the charge path.
*/
if (memcg->oom_kill_disable) {
- if (!current->in_user_fault)
- return OOM_SKIPPED;
- css_get(&memcg->css);
- current->memcg_in_oom = memcg;
- current->memcg_oom_gfp_mask = mask;
- current->memcg_oom_order = order;
-
- return OOM_ASYNC;
+ if (current->in_user_fault) {
+ css_get(&memcg->css);
+ current->memcg_in_oom = memcg;
+ current->memcg_oom_gfp_mask = mask;
+ current->memcg_oom_order = order;
+ }
+ return false;
}

mem_cgroup_mark_under_oom(memcg);
@@ -1849,10 +1844,7 @@ static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int
mem_cgroup_oom_notify(memcg);

mem_cgroup_unmark_under_oom(memcg);
- if (mem_cgroup_out_of_memory(memcg, mask, order))
- ret = OOM_SUCCESS;
- else
- ret = OOM_FAILED;
+ ret = mem_cgroup_out_of_memory(memcg, mask, order);

if (locked)
mem_cgroup_oom_unlock(memcg);
@@ -2545,7 +2537,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
int nr_retries = MAX_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
struct page_counter *counter;
- enum oom_status oom_status;
unsigned long nr_reclaimed;
bool passed_oom = false;
bool may_swap = true;
@@ -2648,9 +2639,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
* a forward progress or bypass the charge if the oom killer
* couldn't make any progress.
*/
- oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
- get_order(nr_pages * PAGE_SIZE));
- if (oom_status == OOM_SUCCESS) {
+ if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+ get_order(nr_pages * PAGE_SIZE))) {
passed_oom = true;
nr_retries = MAX_RECLAIM_RETRIES;
goto retry;
--
2.35.1.265.g69c8d7142f-goog

2022-02-11 18:00:51

by Shakeel Butt

Subject: [PATCH v2 2/4] memcg: unify force charging conditions

Currently the kernel force charges allocations which have the __GFP_HIGH
flag, without triggering memory reclaim. __GFP_HIGH indicates that the
caller is high priority, and since commit 869712fd3de5 ("mm: memcontrol:
fix network errors from failing __GFP_ATOMIC charges") the kernel lets
such allocations do force charging. Please note that __GFP_ATOMIC has
been replaced by __GFP_HIGH.

__GFP_HIGH does not tell whether the caller can block or trigger reclaim;
there are separate checks to determine that. So, there is no need to skip
reclaim for __GFP_HIGH allocations. Handle __GFP_HIGH together with
__GFP_NOFAIL, which also does force charging.

Please note that this is a noop change, as there are currently no
__GFP_HIGH allocators in the kernel which also have __GFP_ACCOUNT (or
SLAB_ACCOUNT) and do not allow reclaim.

Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
---
Changes since v1:
- None

mm/memcontrol.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f12e489ba9b8..292b0b99a2c7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2564,15 +2564,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
goto retry;
}

- /*
- * Memcg doesn't have a dedicated reserve for atomic
- * allocations. But like the global atomic pool, we need to
- * put the burden of reclaim on regular allocation requests
- * and let these go through as privileged allocations.
- */
- if (gfp_mask & __GFP_HIGH)
- goto force;
-
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
@@ -2646,7 +2637,13 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
goto retry;
}
nomem:
- if (!(gfp_mask & __GFP_NOFAIL))
+ /*
+ * Memcg doesn't have a dedicated reserve for atomic
+ * allocations. But like the global atomic pool, we need to
+ * put the burden of reclaim on regular allocation requests
+ * and let these go through as privileged allocations.
+ */
+ if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
return -ENOMEM;
force:
/*
--
2.35.1.265.g69c8d7142f-goog

2022-02-12 16:21:05

by Shakeel Butt

Subject: Re: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

On Fri, Feb 11, 2022 at 4:13 AM Chris Down <[email protected]> wrote:
>
[...]
> >To make high limit enforcement more robust, this patch makes the limit
> >enforcement synchronous, but only if the accumulated overcharge becomes
> >larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> >throttled on the return-to-userspace path, and only the extreme
> >allocations which accumulate a large amount of overcharge without
> >returning to userspace will be throttled synchronously. The value
> >MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the
> >memcg codebase use this constant, so for now use the same one here.
>
> Note that mem_cgroup_handle_over_high() has its own allocator throttling grace
> period, where it bails out if the penalty to apply is less than 10ms. The
> reclaim will still happen, though. So throttling might not happen even for
> roughly MEMCG_CHARGE_BATCH-sized allocations, depending on the overall size of
> the cgroup and its protection.
>

Here by throttling, I meant both the reclaim and the
schedule_timeout_killable(). I don't want to spell out low-level details
which might change in the future.

[...]
>
> Thanks, I was going to comment on v1 that I prefer to keep the implementation
> of mem_cgroup_handle_over_high if possible since we know that the mechanism has
> been safe in production over the past few years.
>
> One question I have is about throttling. It looks like this new
> mem_cgroup_handle_over_high callsite may mean that throttling is invoked more
> than once on a misbehaving workload that's failing to reclaim since the
> throttling could be invoked both here and in return to userspace, right? That
> might not be a problem, but we should think about the implications of that,
> especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.
>

Please note that mem_cgroup_handle_over_high() clears
memcg_nr_pages_over_high, so if on the return-to-userspace path
mem_cgroup_handle_over_high() finds that memcg_nr_pages_over_high is
non-zero, it means the task has accumulated further charges over the
high limit after a possibly synchronous mem_cgroup_handle_over_high()
call.
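
Roughly, a simplified sketch of the relevant bits (not the exact code):

	/* charge path: over-high pages are only accumulated */
	current->memcg_nr_pages_over_high += nr_pages;

	/* mem_cgroup_handle_over_high(), from either callsite: */
	nr_pages = current->memcg_nr_pages_over_high;
	current->memcg_nr_pages_over_high = 0;	/* cleared up front */
	/* ... then reclaim, and possibly sleep, based on nr_pages ... */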

> Maybe we should record if throttling happened previously and avoid doing it
> again for this entry into kernelspace? Not certain that's the right answer, but
> we should think about what the new semantics should be.

For now, I will keep this as is and will add a comment in the code and a
mention in the commit message about it. I will wait for others to comment
before sending the next version. Thanks for taking a look.

2022-02-14 10:25:11

by Chris Down

Subject: Re: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

Shakeel Butt writes:
>The high limit is used to throttle the workload without invoking the
>oom-killer. Recently we tried to use the high limit to right-size our
>internal workloads, more specifically to dynamically adjust the limits of
>a workload without letting it get oom-killed. However, due to a limitation
>in the implementation of high limit enforcement, we observed that the
>mechanism fails for some real workloads.
>
>The high limit is enforced on return-to-userspace, i.e. the kernel lets
>the usage go over the limit, and when execution returns to userspace, the
>high reclaim is triggered and the process can get throttled as well.
>However, this mechanism fails for workloads which do large allocations in
>a single kernel entry, e.g. applications that mlock() a large chunk of
>memory in a single syscall. Such applications bypass the high limit and
>can trigger the oom-killer.
>
>To make high limit enforcement more robust, this patch makes the limit
>enforcement synchronous, but only if the accumulated overcharge becomes
>larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
>throttled on the return-to-userspace path, and only the extreme
>allocations which accumulate a large amount of overcharge without
>returning to userspace will be throttled synchronously. The value
>MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the memcg
>codebase use this constant, so for now use the same one here.

Note that mem_cgroup_handle_over_high() has its own allocator throttling grace
period, where it bails out if the penalty to apply is less than 10ms. The
reclaim will still happen, though. So throttling might not happen even for
roughly MEMCG_CHARGE_BATCH-sized allocations, depending on the overall size of
the cgroup and its protection.
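
For reference, a simplified sketch of that bail-out in
mem_cgroup_handle_over_high() (names taken from the kernel at the time of
writing; the other overage terms are elided):

	/* reclaim has already run above; only the sleep can be skipped */
	penalty_jiffies = calculate_high_delay(memcg, nr_pages,
					       mem_find_max_overage(memcg));
	if (penalty_jiffies <= HZ / 100)
		goto out;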

>Signed-off-by: Shakeel Butt <[email protected]>
>---
>Changes since v1:
>- Based on Roman's comment, simplify the sync enforcement and only
> target the extreme cases.
>
> mm/memcontrol.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index 292b0b99a2c7..0da4be4798e7 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
>
>+ if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
>+ !(current->flags & PF_MEMALLOC) &&
>+ gfpflags_allow_blocking(gfp_mask)) {
>+ mem_cgroup_handle_over_high();

Thanks, I was going to comment on v1 that I prefer to keep the implementation
of mem_cgroup_handle_over_high if possible since we know that the mechanism has
been safe in production over the past few years.

One question I have is about throttling. It looks like this new
mem_cgroup_handle_over_high callsite may mean that throttling is invoked more
than once on a misbehaving workload that's failing to reclaim since the
throttling could be invoked both here and in return to userspace, right? That
might not be a problem, but we should think about the implications of that,
especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.

Maybe we should record if throttling happened previously and avoid doing it
again for this entry into kernelspace? Not certain that's the right answer, but
we should think about what the new semantics should be.

>+ }
> return 0;
> }
>
>--
>2.35.1.265.g69c8d7142f-goog
>

2022-02-16 04:21:35

by Roman Gushchin

Subject: Re: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

On Thu, Feb 10, 2022 at 10:49:17PM -0800, Shakeel Butt wrote:
> The high limit is used to throttle the workload without invoking the
> oom-killer. Recently we tried to use the high limit to right-size our
> internal workloads, more specifically to dynamically adjust the limits of
> a workload without letting it get oom-killed. However, due to a limitation
> in the implementation of high limit enforcement, we observed that the
> mechanism fails for some real workloads.
>
> The high limit is enforced on return-to-userspace, i.e. the kernel lets
> the usage go over the limit, and when execution returns to userspace, the
> high reclaim is triggered and the process can get throttled as well.
> However, this mechanism fails for workloads which do large allocations in
> a single kernel entry, e.g. applications that mlock() a large chunk of
> memory in a single syscall. Such applications bypass the high limit and
> can trigger the oom-killer.
>
> To make high limit enforcement more robust, this patch makes the limit
> enforcement synchronous, but only if the accumulated overcharge becomes
> larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> throttled on the return-to-userspace path, and only the extreme
> allocations which accumulate a large amount of overcharge without
> returning to userspace will be throttled synchronously. The value
> MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the memcg
> codebase use this constant, so for now use the same one here.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> Changes since v1:
> - Based on Roman's comment, simplify the sync enforcement and only
> target the extreme cases.

Reviewed-by: Roman Gushchin <[email protected]>

This version indeed looks safer to me.

Thanks!

2022-02-16 07:28:00

by Shakeel Butt

Subject: Re: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

On Thu, Feb 10, 2022 at 10:49 PM Shakeel Butt <[email protected]> wrote:
>
> The high limit is used to throttle the workload without invoking the
> oom-killer. Recently we tried to use the high limit to right-size our
> internal workloads, more specifically to dynamically adjust the limits of
> a workload without letting it get oom-killed. However, due to a limitation
> in the implementation of high limit enforcement, we observed that the
> mechanism fails for some real workloads.
>
> The high limit is enforced on return-to-userspace, i.e. the kernel lets
> the usage go over the limit, and when execution returns to userspace, the
> high reclaim is triggered and the process can get throttled as well.
> However, this mechanism fails for workloads which do large allocations in
> a single kernel entry, e.g. applications that mlock() a large chunk of
> memory in a single syscall. Such applications bypass the high limit and
> can trigger the oom-killer.
>
> To make high limit enforcement more robust, this patch makes the limit
> enforcement synchronous, but only if the accumulated overcharge becomes
> larger than MEMCG_CHARGE_BATCH. So, most allocations would still be
> throttled on the return-to-userspace path, and only the extreme
> allocations which accumulate a large amount of overcharge without
> returning to userspace will be throttled synchronously. The value
> MEMCG_CHARGE_BATCH is a bit arbitrary, but most other places in the memcg
> codebase use this constant, so for now use the same one here.
>
> Signed-off-by: Shakeel Butt <[email protected]>

Any comments or concerns on this patch? Otherwise I will ask Andrew to
add this series to the mm tree.

> ---
> Changes since v1:
> - Based on Roman's comment, simplify the sync enforcement and only
> target the extreme cases.
>
> mm/memcontrol.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 292b0b99a2c7..0da4be4798e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2703,6 +2703,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
>
> + if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
> + !(current->flags & PF_MEMALLOC) &&
> + gfpflags_allow_blocking(gfp_mask)) {
> + mem_cgroup_handle_over_high();
> + }
> return 0;
> }
>
> --
> 2.35.1.265.g69c8d7142f-goog
>

2022-02-16 07:29:59

by Roman Gushchin

Subject: Re: [PATCH v2 3/4] selftests: memcg: test high limit for single entry allocation

On Thu, Feb 10, 2022 at 10:49:16PM -0800, Shakeel Butt wrote:
> Test the enforcement of the memory.high limit for a large amount of
> memory allocated within a single kernel entry. There are valid use-cases
> where an application triggers a large amount of memory allocation within
> a single syscall, e.g. mlock() or mmap(MAP_POPULATE). Make sure
> memory.high limit enforcement works for such use-cases.
>
> Signed-off-by: Shakeel Butt <[email protected]>
> ---
> Changes since v1:
> - None

Reviewed-by: Roman Gushchin <[email protected]>

Thanks!

2022-02-16 13:26:59

by Chris Down

Subject: Re: [PATCH v2 4/4] memcg: synchronously enforce memory.high for large overcharges

Shakeel Butt writes:
>> Thanks, I was going to comment on v1 that I prefer to keep the implementation
>> of mem_cgroup_handle_over_high if possible since we know that the mechanism has
>> been safe in production over the past few years.
>>
>> One question I have is about throttling. It looks like this new
>> mem_cgroup_handle_over_high callsite may mean that throttling is invoked more
>> than once on a misbehaving workload that's failing to reclaim since the
>> throttling could be invoked both here and in return to userspace, right? That
>> might not be a problem, but we should think about the implications of that,
>> especially in relation to MEMCG_MAX_HIGH_DELAY_JIFFIES.
>>
>
>Please note that mem_cgroup_handle_over_high() clears
>memcg_nr_pages_over_high, so if on the return-to-userspace path
>mem_cgroup_handle_over_high() finds that memcg_nr_pages_over_high is
>non-zero, it means the task has accumulated further charges over the
>high limit after a possibly synchronous mem_cgroup_handle_over_high()
>call.

Oh sure, my point was only that MEMCG_MAX_HIGH_DELAY_JIFFIES exists to more
reliably ensure we return to userspace at some point in the near future,
allowing the task another chance at good behaviour instead of being
immediately whacked by whatever is monitoring PSI -- for example, a daemon
which monitors its own PSI contributions and makes a proactive attempt to
free structures in userspace.

That said, the throttling here still isn't unbounded, and it's not likely that
anyone doing such large allocations after already exceeding memory.high is
being a good citizen, so I think the patch makes sense as long as the change is
understood and documented internally.

Thanks!

Acked-by: Chris Down <[email protected]>