2019-02-01 01:15:27

by Chris Down

Subject: [PATCH] mm: Throttle allocators when failing reclaim over memory.high

We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also
don't have enough warning to allow one to react, whether for a human or
for a daemon monitoring PSI.

Here is output from a simple program showing how long it takes in μsec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode,
but it's nowhere near enough to contain growth. It also doesn't get
worse the more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads
as they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with
no swap, or where swap has become exhausted, we can end up with
memory.high being breached and us having no weapons left in our arsenal
to combat runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad
the situation is, as "high" events for the memcg may in some cases be
benign, and in others be catastrophic. The current status quo is that we
fail containment in a way that doesn't provide any advance warning that
things are about to go horribly wrong (for example, we are about to
invoke the kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to
keep memcg size contained at the memory.high setting. It does so by
applying an exponential delay curve derived from the memcg's overage
compared to memory.high. In the normal case where the memcg is either
below or only marginally over its memory.high setting, no throttling
will be performed.

This composes well with system health monitoring and remediation, as
these allocator delays are factored into PSI's memory pressure
calculations. This both creates a mechanism for system administrators or
applications consuming the PSI interface to trivially see that the memcg
in question is struggling and use that to make more reasonable
decisions, and permits them enough time to act. Either of these can act
with significantly more nuance than we can provide using the system
OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would
put the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and
also doesn't schedule indefinitely, which previously made tracing and
other introspection difficult.

Contrast the previous results with a kernel with this patch:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104 1016265
105 1503969
106 2391692
107 2872061
108 3248003
109 4791904
110 5759832
111 6912509
112 8127818
113 9472203
114 12287622
115 12480079
116 14144008
117 15808029
118 16384500
119 16383242
120 16384979
[...]

As you can see, in the normal case, memory allocation takes around 1000
μsec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.08 seconds, then the next is 0.33 seconds, then
the next is 0.61 seconds, and the one after that takes just over a
second. This gets worse until we reach our eventual 2*HZ clamp per
batch, resulting in 16 seconds per megabyte. However, this is still
making forward progress, so it permits tracing or further analysis with
programs like GDB.

This patch expands on earlier work by Johannes Weiner. Thanks!

Signed-off-by: Chris Down <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 117 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 18f4aefbe0bf..1844a88f1f68 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
#include <linux/lockdep.h>
#include <linux/file.h>
#include <linux/tracehook.h>
+#include <linux/psi.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -2161,12 +2162,68 @@ static void high_work_func(struct work_struct *work)
reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}

+/*
+ * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
+ * enough to still cause a significant slowdown in most cases, while still
+ * allowing diagnostics and tracing to proceed without becoming stuck.
+ */
+#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
+
+/*
+ * When calculating the delay, we use these either side of the exponentiation to
+ * maintain precision and scale to a reasonable number of jiffies (see the table
+ * below).
+ *
+ * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
+ * overage ratio to a delay.
+ * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the
+ * proposed penalty in order to reduce to a reasonable number of jiffies, and
+ * to produce a reasonable delay curve.
+ *
+ * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
+ * reasonable delay curve compared to precision-adjusted overage, not
+ * penalising heavily at first, but still making sure that growth beyond the
+ * limit penalises misbehaving cgroups by slowing them down exponentially. For
+ * example, with a high of 100 megabytes:
+ *
+ * +-------+------------------------+
+ * | usage | time to allocate in ms |
+ * +-------+------------------------+
+ * | 100M | 0 |
+ * | 101M | 6 |
+ * | 102M | 25 |
+ * | 103M | 57 |
+ * | 104M | 102 |
+ * | 105M | 159 |
+ * | 106M | 230 |
+ * | 107M | 313 |
+ * | 108M | 409 |
+ * | 109M | 518 |
+ * | 110M | 639 |
+ * | 111M | 774 |
+ * | 112M | 921 |
+ * | 113M | 1081 |
+ * | 114M | 1254 |
+ * | 115M | 1439 |
+ * | 116M | 1638 |
+ * | 117M | 1849 |
+ * | 118M | 2000 |
+ * | 119M | 2000 |
+ * | 120M | 2000 |
+ * +-------+------------------------+
+ */
+#define MEMCG_DELAY_PRECISION_SHIFT 20
+#define MEMCG_DELAY_SCALING_SHIFT 14
+
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
*/
void mem_cgroup_handle_over_high(void)
{
+ unsigned long usage, high;
+ unsigned long pflags;
+ unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg = current->memcg_high_reclaim;

@@ -2177,9 +2234,68 @@ void mem_cgroup_handle_over_high(void)
memcg = get_mem_cgroup_from_mm(current->mm);

reclaim_high(memcg, nr_pages, GFP_KERNEL);
- css_put(&memcg->css);
current->memcg_high_reclaim = NULL;
current->memcg_nr_pages_over_high = 0;
+
+ /*
+ * memory.high is breached and reclaim is unable to keep up. Throttle
+ * allocators proactively to slow down excessive growth.
+ *
+ * We use overage compared to memory.high to calculate the number of
+ * jiffies to sleep (penalty_jiffies). Ideally this value should be
+ * fairly lenient on small overages, and increasingly harsh when the
+ * memcg in question makes it clear that it has no intention of stopping
+ * its crazy behaviour, so we exponentially increase the delay based on
+ * overage amount.
+ */
+
+ usage = page_counter_read(&memcg->memory);
+ high = READ_ONCE(memcg->high);
+
+ if (usage <= high)
+ goto out;
+
+ overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
+ penalty_jiffies = ((u64)overage * overage * HZ)
+ >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+
+ /*
+ * Factor in the task's own contribution to the overage, such that four
+ * N-sized allocations are throttled approximately the same as one
+ * 4N-sized allocation.
+ *
+ * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
+ * larger the current charge batch is than that.
+ */
+ penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
+
+ /*
+ * Clamp the max delay per usermode return so as to still keep the
+ * application moving forwards and also permit diagnostics, albeit
+ * extremely slowly.
+ */
+ penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+ /*
+ * Don't sleep if the amount of jiffies this memcg owes us is so low
+ * that it's not even worth doing, in an attempt to be nice to those who
+ * go only a small amount over their memory.high value and maybe haven't
+ * been aggressively reclaimed enough yet.
+ */
+ if (penalty_jiffies <= HZ / 100)
+ goto out;
+
+ /*
+ * If we exit early, we're guaranteed to die (since
+ * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
+ * need to account for any ill-begotten jiffies to pay them off later.
+ */
+ psi_memstall_enter(&pflags);
+ schedule_timeout_killable(penalty_jiffies);
+ psi_memstall_leave(&pflags);
+
+out:
+ css_put(&memcg->css);
}

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
2.20.1
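
The delay table in the patch comment can be reproduced in userspace with a
minimal sketch like the one below. It copies the three constants and the
arithmetic from the patch, but the rest is assumed for illustration: HZ is
taken to be 1000 (so one jiffy is one millisecond), pages are taken to be
4 KiB, nr_pages is taken to equal MEMCG_CHARGE_BATCH (so the per-task scaling
factor is 1), and the SKETCH_* names and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions of this sketch, not part of the patch. */
#define SKETCH_HZ 1000UL          /* CONFIG_HZ=1000: 1 jiffy == 1 ms */
#define SKETCH_PAGE_SIZE 4096UL   /* 4 KiB pages */

/* Constants copied from the patch. */
#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL * SKETCH_HZ)
#define MEMCG_DELAY_PRECISION_SHIFT 20
#define MEMCG_DELAY_SCALING_SHIFT 14

/* Convert mebibytes to pages under the page size assumed above. */
unsigned long mib_to_pages(unsigned long mib)
{
	return mib * 1024UL * 1024UL / SKETCH_PAGE_SIZE;
}

/*
 * Clamped penalty in jiffies for one nominal charge batch, with usage and
 * memory.high given in pages. This mirrors the overage and penalty
 * arithmetic in mem_cgroup_handle_over_high().
 */
unsigned long penalty_jiffies(unsigned long usage, unsigned long high)
{
	uint64_t overage, penalty;

	assert(high > 0);
	if (usage <= high)
		return 0;

	/* Overage as a fixed-point fraction of memory.high. */
	overage = ((uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;

	/* Square the overage, convert to jiffies, then scale back down. */
	penalty = (overage * overage * SKETCH_HZ) >>
		  (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);

	if (penalty > MEMCG_MAX_HIGH_DELAY_JIFFIES)
		penalty = MEMCG_MAX_HIGH_DELAY_JIFFIES;

	return (unsigned long)penalty;
}

/* The patch skips the sleep when the penalty is HZ/100 jiffies or less. */
int would_sleep(unsigned long penalty)
{
	return penalty > SKETCH_HZ / 100;
}
```

Under these assumptions, penalty_jiffies(mib_to_pages(101), mib_to_pages(100))
gives the 6 ms shown in the table. That value is below the HZ/100 cutoff, so
the patch would not actually sleep for it. The delay grows with the square of
the overage ratio until the 2*HZ clamp takes over at 118M.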



2019-02-01 07:20:31

by Michal Hocko

Subject: Re: [PATCH] mm: Throttle allocators when failing reclaim over memory.high

On Thu 31-01-19 20:13:52, Chris Down wrote:
[...]
> The current situation goes against both the expectations of users of
> memory.high, and our intentions as cgroup v2 developers. In
> cgroup-v2.txt, we claim that we will throttle and only under "extreme
> conditions" will memory.high protection be breached. Likewise, cgroup v2
> users generally also expect that memory.high should throttle workloads
> as they exceed their high threshold. However, as seen above, this isn't
> always how it works in practice -- even on banal setups like those with
> no swap, or where swap has become exhausted, we can end up with
> memory.high being breached and us having no weapons left in our arsenal
> to combat runaway growth with, since reclaim is futile.
>
> It's also hard for system monitoring software or users to tell how bad
> the situation is, as "high" events for the memcg may in some cases be
> benign, and in others be catastrophic. The current status quo is that we
> fail containment in a way that doesn't provide any advance warning that
> things are about to go horribly wrong (for example, we are about to
> invoke the kernel OOM killer).
>
> This patch introduces explicit throttling when reclaim is failing to
> keep memcg size contained at the memory.high setting. It does so by
> applying an exponential delay curve derived from the memcg's overage
> compared to memory.high. In the normal case where the memcg is either
> below or only marginally over its memory.high setting, no throttling
> will be performed.

How does this play with the actual OOM when the user expects oom to
resolve the situation because the reclaim is futile and there is nothing
reclaimable except for killing a process?
--
Michal Hocko
SUSE Labs

2019-02-01 16:13:35

by Johannes Weiner

Subject: Re: [PATCH] mm: Throttle allocators when failing reclaim over memory.high

On Fri, Feb 01, 2019 at 08:17:57AM +0100, Michal Hocko wrote:
> On Thu 31-01-19 20:13:52, Chris Down wrote:
> [...]
> > The current situation goes against both the expectations of users of
> > memory.high, and our intentions as cgroup v2 developers. In
> > cgroup-v2.txt, we claim that we will throttle and only under "extreme
> > conditions" will memory.high protection be breached. Likewise, cgroup v2
> > users generally also expect that memory.high should throttle workloads
> > as they exceed their high threshold. However, as seen above, this isn't
> > always how it works in practice -- even on banal setups like those with
> > no swap, or where swap has become exhausted, we can end up with
> > memory.high being breached and us having no weapons left in our arsenal
> > to combat runaway growth with, since reclaim is futile.
> >
> > It's also hard for system monitoring software or users to tell how bad
> > the situation is, as "high" events for the memcg may in some cases be
> > benign, and in others be catastrophic. The current status quo is that we
> > fail containment in a way that doesn't provide any advance warning that
> > things are about to go horribly wrong (for example, we are about to
> > invoke the kernel OOM killer).
> >
> > This patch introduces explicit throttling when reclaim is failing to
> > keep memcg size contained at the memory.high setting. It does so by
> > applying an exponential delay curve derived from the memcg's overage
> > compared to memory.high. In the normal case where the memcg is either
> > below or only marginally over its memory.high setting, no throttling
> > will be performed.
>
> How does this play with the actual OOM when the user expects oom to
> resolve the situation because the reclaim is futile and there is nothing
> reclaimable except for killing a process?

Hm, can you elaborate on your question a bit?

The idea behind memory.high is to throttle allocations long enough for
the admin or a management daemon to intervene, but not to trigger the
kernel oom killer. It was designed as a replacement for the cgroup1
oom_control, but without the deadlock potential, ptrace problems etc.

What we specifically do is to set memory.high and have a daemon (oomd)
watch memory.pressure, io.pressure etc. in the group. If pressure
exceeds a certain threshold, the daemon kills something.

As you know, the kernel OOM killer does not kick in reliably when
e.g. page cache is thrashing heavily, since from a kernel POV it's
still successfully allocating and reclaiming - meanwhile the workload
is spending most its time in page faults. And when the kernel OOM
killer does kick in, its selection policy is not very workload-aware.

This daemon on the other hand can be configured to 1) kick in reliably
when the workload-specific tolerances for slowdowns and latencies are
violated (which tends to be way earlier than the kernel oom killer
usually kicks in) and 2) know about the workload and all its
components to make an informed kill decision.

Right now, that throttling mechanism works okay with swap enabled, but
we cannot enable swap everywhere, or sometimes run out of swap, and
then it breaks down and we run into system OOMs.

This patch makes sure memory.high *always* implements the throttling
semantics described in cgroup-v2.txt, not just most of the time.
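
As a rough illustration of the monitoring side described above, here is a
minimal sketch of how a daemon might parse one line of a cgroup's
memory.pressure file and compare it against a tolerance. The line format
("some avg10=... avg60=... avg300=... total=...") is the one PSI exposes. The
struct, the function names and the threshold are hypothetical, and this is not
how oomd itself is implemented.

```c
#include <assert.h>
#include <stdio.h>

/* One parsed line of a PSI pressure file such as memory.pressure. */
struct psi_line {
	char kind[8];                /* "some" or "full" */
	double avg10;                /* % of time stalled, 10 s window */
	double avg60;                /* % of time stalled, 60 s window */
	double avg300;               /* % of time stalled, 300 s window */
	unsigned long long total_us; /* cumulative stall time in microseconds */
};

/* Parse one line; returns 1 on success and 0 on malformed input. */
int psi_parse_line(const char *line, struct psi_line *out)
{
	assert(line && out);
	return sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		      out->kind, &out->avg10, &out->avg60, &out->avg300,
		      &out->total_us) == 5;
}

/*
 * Hypothetical policy check: has the 10 second average exceeded the
 * workload-specific tolerance (given in percent)?
 */
int psi_over_threshold(const struct psi_line *p, double threshold_pct)
{
	return p->avg10 > threshold_pct;
}
```

A real daemon would read the file periodically (or use PSI triggers) and apply
its own kill policy once the check fires. The point of the patch is that the
throttling delays show up in these numbers early enough for such a check to
be useful.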

2019-02-01 20:53:20

by Chris Down

Subject: Re: [PATCH] mm: Throttle allocators when failing reclaim over memory.high

Michal Hocko writes:
>How does this play with the actual OOM when the user expects oom to
>resolve the situation because the reclaim is futile and there is nothing
>reclaimable except for killing a process?

In addition to what Johannes said, this doesn't impede OOM in the case of
global system starvation (e.g. in the case that all major consumers of memory
are being throttled in the allocator). In that case nothing unusual will
happen, since the task's state is TASK_KILLABLE rather than
TASK_UNINTERRUPTIBLE, and we will exit out of mem_cgroup_handle_over_high as
quickly as possible.

2019-02-28 09:53:59

by Michal Hocko

Subject: Re: [PATCH] mm: Throttle allocators when failing reclaim over memory.high

[Sorry for a late reply]

On Fri 01-02-19 11:12:33, Johannes Weiner wrote:
> On Fri, Feb 01, 2019 at 08:17:57AM +0100, Michal Hocko wrote:
> > On Thu 31-01-19 20:13:52, Chris Down wrote:
> > [...]
> > > The current situation goes against both the expectations of users of
> > > memory.high, and our intentions as cgroup v2 developers. In
> > > cgroup-v2.txt, we claim that we will throttle and only under "extreme
> > > conditions" will memory.high protection be breached. Likewise, cgroup v2
> > > users generally also expect that memory.high should throttle workloads
> > > as they exceed their high threshold. However, as seen above, this isn't
> > > always how it works in practice -- even on banal setups like those with
> > > no swap, or where swap has become exhausted, we can end up with
> > > memory.high being breached and us having no weapons left in our arsenal
> > > to combat runaway growth with, since reclaim is futile.
> > >
> > > It's also hard for system monitoring software or users to tell how bad
> > > the situation is, as "high" events for the memcg may in some cases be
> > > benign, and in others be catastrophic. The current status quo is that we
> > > fail containment in a way that doesn't provide any advance warning that
> > > things are about to go horribly wrong (for example, we are about to
> > > invoke the kernel OOM killer).
> > >
> > > This patch introduces explicit throttling when reclaim is failing to
> > > keep memcg size contained at the memory.high setting. It does so by
> > > applying an exponential delay curve derived from the memcg's overage
> > > compared to memory.high. In the normal case where the memcg is either
> > > below or only marginally over its memory.high setting, no throttling
> > > will be performed.
> >
> > How does this play with the actual OOM when the user expects oom to
> > resolve the situation because the reclaim is futile and there is nothing
> > reclaimable except for killing a process?
>
> Hm, can you elaborate on your question a bit?
>
> The idea behind memory.high is to throttle allocations long enough for
> the admin or a management daemon to intervene, but not to trigger the
> kernel oom killer. It was designed as a replacement for the cgroup1
> oom_control, but without the deadlock potential, ptrace problems etc.

Yes, this makes sense. The high limit reclaim is also documented as a
best effort resource guarantee. My understanding is that if the workload
cannot be contained within the high limit then the system cannot do much
and eventually gives up. Having the full memory unreclaimable is such an
example. And there is either the global OOM killer or hard limit OOM
killer to trigger to resolve such a situation.

[...]
Thanks for describing the usecase.

> Right now, that throttling mechanism works okay with swap enabled, but
> we cannot enable swap everywhere, or sometimes run out of swap, and
> then it breaks down and we run into system OOMs.
>
> This patch makes sure memory.high *always* implements the throttling
> semantics described in cgroup-v2.txt, not just most of the time.

I am not really opposed to throttling in the absence of
reclaimable memory. We do that for the regular allocation paths already
(should_reclaim_retry). A swapless system with anon memory is very
likely to oom too quickly and this sounds like a real problem. But I do
not think that we should throttle the allocation to freeze it
completely. We should eventually OOM. And that was essentially what my
question was about: how much can/should we throttle to give a high limit
events consumer enough time to intervene? I am sorry to still not have
time to study the patch more closely but this should be explained in the
changelog. Are we talking about seconds/minutes, or do we simply freeze
each allocator to death?
--
Michal Hocko
SUSE Labs

2019-04-10 18:55:52

by Chris Down

Subject: [PATCH REBASED] mm: Throttle allocators when failing reclaim over memory.high

We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also
don't have enough warning to allow one to react, whether for a human or
for a daemon monitoring PSI.

Here is output from a simple program showing how long it takes in μsec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode,
but it's nowhere near enough to contain growth. It also doesn't get
worse the more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads
as they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with
no swap, or where swap has become exhausted, we can end up with
memory.high being breached and us having no weapons left in our arsenal
to combat runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad
the situation is, as "high" events for the memcg may in some cases be
benign, and in others be catastrophic. The current status quo is that we
fail containment in a way that doesn't provide any advance warning that
things are about to go horribly wrong (for example, we are about to
invoke the kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to
keep memcg size contained at the memory.high setting. It does so by
applying an exponential delay curve derived from the memcg's overage
compared to memory.high. In the normal case where the memcg is either
below or only marginally over its memory.high setting, no throttling
will be performed.

This composes well with system health monitoring and remediation, as
these allocator delays are factored into PSI's memory pressure
calculations. This both creates a mechanism for system administrators or
applications consuming the PSI interface to trivially see that the memcg
in question is struggling and use that to make more reasonable
decisions, and permits them enough time to act. Either of these can act
with significantly more nuance than we can provide using the system
OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would
put the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and
also doesn't schedule indefinitely, which previously made tracing and
other introspection difficult.

Contrast the previous results with a kernel with this patch:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104 1016265
105 1503969
106 2391692
107 2872061
108 3248003
109 4791904
110 5759832
111 6912509
112 8127818
113 9472203
114 12287622
115 12480079
116 14144008
117 15808029
118 16384500
119 16383242
120 16384979
[...]

As you can see, in the normal case, memory allocation takes around 1000
μsec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.08 seconds, then the next is 0.33 seconds, then
the next is 0.61 seconds, and the one after that takes just over a
second. This gets worse until we reach our eventual 2*HZ clamp per
batch, resulting in 16 seconds per megabyte. However, this is still
making forward progress, so it permits tracing or further analysis with
programs like GDB.

This patch expands on earlier work by Johannes Weiner. Thanks!

Signed-off-by: Chris Down <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 117 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cd03b1181f7f..fa3e6cce7843 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,6 +66,7 @@
#include <linux/lockdep.h>
#include <linux/file.h>
#include <linux/tracehook.h>
+#include <linux/psi.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -2142,12 +2143,68 @@ static void high_work_func(struct work_struct *work)
reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}

+/*
+ * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
+ * enough to still cause a significant slowdown in most cases, while still
+ * allowing diagnostics and tracing to proceed without becoming stuck.
+ */
+#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
+
+/*
+ * When calculating the delay, we use these either side of the exponentiation to
+ * maintain precision and scale to a reasonable number of jiffies (see the table
+ * below).
+ *
+ * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
+ * overage ratio to a delay.
+ * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the
+ * proposed penalty in order to reduce to a reasonable number of jiffies, and
+ * to produce a reasonable delay curve.
+ *
+ * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
+ * reasonable delay curve compared to precision-adjusted overage, not
+ * penalising heavily at first, but still making sure that growth beyond the
+ * limit penalises misbehaving cgroups by slowing them down exponentially. For
+ * example, with a high of 100 megabytes:
+ *
+ * +-------+------------------------+
+ * | usage | time to allocate in ms |
+ * +-------+------------------------+
+ * | 100M | 0 |
+ * | 101M | 6 |
+ * | 102M | 25 |
+ * | 103M | 57 |
+ * | 104M | 102 |
+ * | 105M | 159 |
+ * | 106M | 230 |
+ * | 107M | 313 |
+ * | 108M | 409 |
+ * | 109M | 518 |
+ * | 110M | 639 |
+ * | 111M | 774 |
+ * | 112M | 921 |
+ * | 113M | 1081 |
+ * | 114M | 1254 |
+ * | 115M | 1439 |
+ * | 116M | 1638 |
+ * | 117M | 1849 |
+ * | 118M | 2000 |
+ * | 119M | 2000 |
+ * | 120M | 2000 |
+ * +-------+------------------------+
+ */
+#define MEMCG_DELAY_PRECISION_SHIFT 20
+#define MEMCG_DELAY_SCALING_SHIFT 14
+
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
*/
void mem_cgroup_handle_over_high(void)
{
+ unsigned long usage, high;
+ unsigned long pflags;
+ unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg = current->memcg_high_reclaim;

@@ -2158,9 +2215,68 @@ void mem_cgroup_handle_over_high(void)
memcg = get_mem_cgroup_from_mm(current->mm);

reclaim_high(memcg, nr_pages, GFP_KERNEL);
- css_put(&memcg->css);
current->memcg_high_reclaim = NULL;
current->memcg_nr_pages_over_high = 0;
+
+ /*
+ * memory.high is breached and reclaim is unable to keep up. Throttle
+ * allocators proactively to slow down excessive growth.
+ *
+ * We use overage compared to memory.high to calculate the number of
+ * jiffies to sleep (penalty_jiffies). Ideally this value should be
+ * fairly lenient on small overages, and increasingly harsh when the
+ * memcg in question makes it clear that it has no intention of stopping
+ * its crazy behaviour, so we exponentially increase the delay based on
+ * overage amount.
+ */
+
+ usage = page_counter_read(&memcg->memory);
+ high = READ_ONCE(memcg->high);
+
+ if (usage <= high)
+ goto out;
+
+ overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
+ penalty_jiffies = ((u64)overage * overage * HZ)
+ >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+
+ /*
+ * Factor in the task's own contribution to the overage, such that four
+ * N-sized allocations are throttled approximately the same as one
+ * 4N-sized allocation.
+ *
+ * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
+ * larger the current charge batch is than that.
+ */
+ penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
+
+ /*
+ * Clamp the max delay per usermode return so as to still keep the
+ * application moving forwards and also permit diagnostics, albeit
+ * extremely slowly.
+ */
+ penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+ /*
+ * Don't sleep if the amount of jiffies this memcg owes us is so low
+ * that it's not even worth doing, in an attempt to be nice to those who
+ * go only a small amount over their memory.high value and maybe haven't
+ * been aggressively reclaimed enough yet.
+ */
+ if (penalty_jiffies <= HZ / 100)
+ goto out;
+
+ /*
+ * If we exit early, we're guaranteed to die (since
+ * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
+ * need to account for any ill-begotten jiffies to pay them off later.
+ */
+ psi_memstall_enter(&pflags);
+ schedule_timeout_killable(penalty_jiffies);
+ psi_memstall_leave(&pflags);
+
+out:
+ css_put(&memcg->css);
}

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
2.21.0

2019-04-10 18:56:37

by Chris Down

Subject: Re: [PATCH REBASED] mm: Throttle allocators when failing reclaim over memory.high

Hey Michal,

Just to come back to your last e-mail about how this interacts with OOM.

Michal Hocko writes:
> I am not really opposed to throttling in the absence of reclaimable
> memory. We do that for the regular allocation paths already
> (should_reclaim_retry). A swapless system with anon memory is very likely to
> oom too quickly and this sounds like a real problem. But I do not think that
> we should throttle the allocation to freeze it completely. We should
> eventually OOM. And that was essentially what my question was about: how
> much can/should we throttle to give a high limit events consumer enough time
> to intervene? I am sorry to still not have time to study the patch more
> closely but this should be explained in the changelog. Are we talking about
> seconds/minutes, or do we simply freeze each allocator to death?

Per-allocation, the maximum is 2 seconds (MEMCG_MAX_HIGH_DELAY_JIFFIES), so we
don't freeze things to death -- they can recover if they are amenable to it.
The idea here is that primarily you handle it, just like memory.oom_control in
v1 (as mentioned in the commit message), or as a last resort, the kernel will
still OOM if our userspace daemon has kicked the bucket or is otherwise
ineffective.

If you're setting memory.high and memory.max together, then setting memory.high
always has to come with a.) tolerance of heavy throttling by your application,
and b.) userspace intervention in the case of the resulting high memory
pressure. This patch doesn't really change those semantics.
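
To put a number on the worst case, the 2 second clamp can be turned into time
per megabyte with a small sketch like the one below. It assumes 4 KiB pages, a
MEMCG_CHARGE_BATCH of 32 pages, and that every charge batch pays the full
clamped delay. None of these values appear in the patch hunk itself, and the
SKETCH_* names and function names are hypothetical.

```c
#include <assert.h>

/* Assumptions of this sketch. */
#define SKETCH_PAGE_SIZE 4096UL      /* 4 KiB pages */
#define SKETCH_CHARGE_BATCH 32UL     /* assumed value of MEMCG_CHARGE_BATCH */
#define SKETCH_MAX_DELAY_SECONDS 2UL /* MEMCG_MAX_HIGH_DELAY_JIFFIES is 2*HZ */

/* Number of charge batches needed to allocate one mebibyte. */
unsigned long batches_per_mib(void)
{
	return (1024UL * 1024UL / SKETCH_PAGE_SIZE) / SKETCH_CHARGE_BATCH;
}

/* Upper bound on throttling time for allocating `mib` mebibytes. */
unsigned long worst_case_seconds(unsigned long mib)
{
	assert(mib > 0);
	return mib * batches_per_mib() * SKETCH_MAX_DELAY_SECONDS;
}
```

With these assumptions there are 8 batches per megabyte, so the bound is 16
seconds per megabyte, which lines up with the roughly 16.38 seconds measured
for the last rows of the patched run. The workload keeps making progress, just
slowly enough for userspace to step in.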

2019-05-01 18:42:25

by Chris Down

Subject: [PATCH v3] mm: Throttle allocators when failing reclaim over memory.high

We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also
don't have enough warning to allow one to react, whether for a human or
for a daemon monitoring PSI.

Here is output from a simple program showing how long it takes in μsec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode,
but it's nowhere near enough to contain growth. It also doesn't get
worse the more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads
as they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with
no swap, or where swap has become exhausted, we can end up with
memory.high being breached and us having no weapons left in our arsenal
to combat runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad
the situation is, as "high" events for the memcg may in some cases be
benign, and in others be catastrophic. The current status quo is that we
fail containment in a way that doesn't provide any advance warning that
things are about to go horribly wrong (for example, we are about to
invoke the kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to
keep memcg size contained at the memory.high setting. It does so by
applying an exponential delay curve derived from the memcg's overage
compared to memory.high. In the normal case where the memcg is either
below or only marginally over its memory.high setting, no throttling
will be performed.

This composes well with system health monitoring and remediation, as
these allocator delays are factored into PSI's memory pressure
calculations. This both creates a mechanism for system administrators or
applications consuming the PSI interface to trivially see that the memcg
in question is struggling and use that to make more reasonable
decisions, and permits them enough time to act. Either of these can act
with significantly more nuance than we can provide using the system
OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would
put the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and
also doesn't schedule indefinitely, which previously made tracing and
other introspection difficult (i.e. it's clamped at 2*HZ per allocation
through MEMCG_MAX_HIGH_DELAY_JIFFIES).

Contrast the previous results with a kernel with this patch:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104 1016265
105 1503969
106 2391692
107 2872061
108 3248003
109 4791904
110 5759832
111 6912509
112 8127818
113 9472203
114 12287622
115 12480079
116 14144008
117 15808029
118 16384500
119 16383242
120 16384979
[...]

As you can see, in the normal case, memory allocation takes around 1000
μsec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.08 seconds, then the next is 0.33 seconds, then
the next is 0.61 seconds. This gets worse until we reach our
eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
However, this is still making forward progress, so it permits tracing or
further analysis with programs like GDB.

We use an exponential curve for our delay penalty for a few reasons:

1. We run mem_cgroup_handle_over_high to potentially do reclaim after
we've already performed allocations, which means that temporarily
going over memory.high by a small amount may be perfectly legitimate,
even for compliant workloads. We don't want to unduly penalise such
cases.
2. An exponential curve (as opposed to a static or linear delay) allows
ramping up memory pressure stats more gradually, which can be useful
to work out that you have set memory.high too low, without destroying
application performance entirely.

This patch expands on earlier work by Johannes Weiner. Thanks!

Signed-off-by: Chris Down <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 117 insertions(+), 1 deletion(-)

[v3: updated the changelog post discussion in person with Michal.]
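
For anyone who wants to sanity-check the delay table in the code comment
without booting a kernel, here is a rough userspace model of the penalty
calculation in mem_cgroup_handle_over_high(). It is only a sketch:
ASSUMED_HZ (CONFIG_HZ=1000) and ASSUMED_CHARGE_BATCH (32 pages) are
assumptions, the model_* names are made up for illustration, and the guard
against high == 0 is not part of this version of the patch. Note that the
squared term means the curve as implemented is quadratic in the overage
ratio.

```c
#include <stdint.h>

#define MEMCG_DELAY_PRECISION_SHIFT 20
#define MEMCG_DELAY_SCALING_SHIFT   14
#define ASSUMED_HZ            1000UL	/* assumption: CONFIG_HZ=1000 */
#define ASSUMED_CHARGE_BATCH  32UL	/* assumption: MEMCG_CHARGE_BATCH */

/* Userspace model of the penalty calculation; all sizes are in pages. */
static unsigned long model_penalty_jiffies(unsigned long usage,
					   unsigned long high,
					   unsigned long nr_pages)
{
	uint64_t overage, penalty;
	uint64_t max_delay = 2 * ASSUMED_HZ;

	if (usage <= high)
		return 0;

	/* Fixed-point overage ratio; the zero guard is the model's own. */
	overage = ((uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT)
		  / (high ? high : 1UL);
	penalty = (overage * overage * ASSUMED_HZ)
		  >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);

	/* Scale by this task's charge relative to the nominal batch. */
	penalty = penalty * nr_pages / ASSUMED_CHARGE_BATCH;

	return penalty < max_delay ? (unsigned long)penalty
				   : (unsigned long)max_delay;
}

/* The patch skips the sleep entirely for penalties of HZ/100 or less. */
static int model_would_sleep(unsigned long penalty_jiffies)
{
	return penalty_jiffies > ASSUMED_HZ / 100;
}
```

With 4 KiB pages (256 pages per megabyte), a high of 100M and a full batch,
this model gives 6, 25 and 57 jiffies at 101M, 102M and 103M, and hits the
2000 jiffy clamp at 118M. The 6 jiffy penalty at 101M falls under the
HZ/100 cut-off, so in this model no sleep would happen there.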

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2535e54e7989..e12fec0d4b58 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,6 +66,7 @@
#include <linux/lockdep.h>
#include <linux/file.h>
#include <linux/tracehook.h>
+#include <linux/psi.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -2263,12 +2264,68 @@ static void high_work_func(struct work_struct *work)
reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}

+/*
+ * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
+ * enough to still cause a significant slowdown in most cases, while still
+ * allowing diagnostics and tracing to proceed without becoming stuck.
+ */
+#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
+
+/*
+ * When calculating the delay, we use these either side of the exponentiation to
+ * maintain precision and scale to a reasonable number of jiffies (see the table
+ * below).
+ *
+ * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
+ * overage ratio to a delay.
+ * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the
+ * proposed penalty in order to reduce to a reasonable number of jiffies, and
+ * to produce a reasonable delay curve.
+ *
+ * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
+ * reasonable delay curve compared to precision-adjusted overage, not
+ * penalising heavily at first, but still making sure that growth beyond the
+ * limit penalises misbehaving cgroups by slowing them down exponentially. For
+ * example, with a high of 100 megabytes:
+ *
+ * +-------+------------------------+
+ * | usage | time to allocate in ms |
+ * +-------+------------------------+
+ * | 100M | 0 |
+ * | 101M | 6 |
+ * | 102M | 25 |
+ * | 103M | 57 |
+ * | 104M | 102 |
+ * | 105M | 159 |
+ * | 106M | 230 |
+ * | 107M | 313 |
+ * | 108M | 409 |
+ * | 109M | 518 |
+ * | 110M | 639 |
+ * | 111M | 774 |
+ * | 112M | 921 |
+ * | 113M | 1081 |
+ * | 114M | 1254 |
+ * | 115M | 1439 |
+ * | 116M | 1638 |
+ * | 117M | 1849 |
+ * | 118M | 2000 |
+ * | 119M | 2000 |
+ * | 120M | 2000 |
+ * +-------+------------------------+
+ */
+ #define MEMCG_DELAY_PRECISION_SHIFT 20
+ #define MEMCG_DELAY_SCALING_SHIFT 14
+
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
*/
void mem_cgroup_handle_over_high(void)
{
+ unsigned long usage, high;
+ unsigned long pflags;
+ unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg = current->memcg_high_reclaim;

@@ -2279,9 +2336,68 @@ void mem_cgroup_handle_over_high(void)
memcg = get_mem_cgroup_from_mm(current->mm);

reclaim_high(memcg, nr_pages, GFP_KERNEL);
- css_put(&memcg->css);
current->memcg_high_reclaim = NULL;
current->memcg_nr_pages_over_high = 0;
+
+ /*
+ * memory.high is breached and reclaim is unable to keep up. Throttle
+ * allocators proactively to slow down excessive growth.
+ *
+ * We use overage compared to memory.high to calculate the number of
+ * jiffies to sleep (penalty_jiffies). Ideally this value should be
+ * fairly lenient on small overages, and increasingly harsh when the
+ * memcg in question makes it clear that it has no intention of stopping
+ * its crazy behaviour, so we exponentially increase the delay based on
+ * overage amount.
+ */
+
+ usage = page_counter_read(&memcg->memory);
+ high = READ_ONCE(memcg->high);
+
+ if (usage <= high)
+ goto out;
+
+ overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
+ penalty_jiffies = ((u64)overage * overage * HZ)
+ >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+
+ /*
+ * Factor in the task's own contribution to the overage, such that four
+ * N-sized allocations are throttled approximately the same as one
+ * 4N-sized allocation.
+ *
+ * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
+ * larger the current charge batch is than that.
+ */
+ penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
+
+ /*
+ * Clamp the max delay per usermode return so as to still keep the
+ * application moving forwards and also permit diagnostics, albeit
+ * extremely slowly.
+ */
+ penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+ /*
+ * Don't sleep if the amount of jiffies this memcg owes us is so low
+ * that it's not even worth doing, in an attempt to be nice to those who
+ * go only a small amount over their memory.high value and maybe haven't
+ * been aggressively reclaimed enough yet.
+ */
+ if (penalty_jiffies <= HZ / 100)
+ goto out;
+
+ /*
+ * If we exit early, we're guaranteed to die (since
+ * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
+ * need to account for any ill-begotten jiffies to pay them off later.
+ */
+ psi_memstall_enter(&pflags);
+ schedule_timeout_killable(penalty_jiffies);
+ psi_memstall_leave(&pflags);
+
+out:
+ css_put(&memcg->css);
}

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
2.21.0

2019-05-07 08:45:32

by Michal Hocko

Subject: Re: [PATCH v3] mm: Throttle allocators when failing reclaim over memory.high

Sorry I was traveling last week and will be off for a large part of
this week.

On Wed 01-05-19 14:41:04, Chris Down wrote:
> We're trying to use memory.high to limit workloads, but have found that
> containment can frequently fail completely and cause OOM situations
> outside of the cgroup. This happens especially with swap space -- either
> when none is configured, or swap is full. These failures often also
> don't have enough warning to allow one to react, whether for a human or
> for a daemon monitoring PSI.
>
> Here is output from a simple program showing how long it takes in μsec
> (column 2) to allocate a megabyte of anonymous memory (column 1) when a
> cgroup is already beyond its memory high setting, and no swap is
> available:
>
> [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> > --wait -t timeout 300 /root/mdf
> [...]
> 95 1035
> 96 1038
> 97 1000
> 98 1036
> 99 1048
> 100 1590
> 101 1968
> 102 1776
> 103 1863
> 104 1757
> 105 1921
> 106 1893
> 107 1760
> 108 1748
> 109 1843
> 110 1716
> 111 1924
> 112 1776
> 113 1831
> 114 1766
> 115 1836
> 116 1588
> 117 1912
> 118 1802
> 119 1857
> 120 1731
> [...]
> [System OOM in 2-3 seconds]
>
> The delay does go up extremely marginally past the 100MB memory.high
> threshold, as now we spend time scanning before returning to usermode,
> but it's nowhere near enough to contain growth. It also doesn't get
> worse the more pages you have, since it only considers nr_pages.
>
> The current situation goes against both the expectations of users of
> memory.high, and our intentions as cgroup v2 developers. In
> cgroup-v2.txt, we claim that we will throttle and only under "extreme
> conditions" will memory.high protection be breached. Likewise, cgroup v2
> users generally also expect that memory.high should throttle workloads
> as they exceed their high threshold. However, as seen above, this isn't
> always how it works in practice -- even on banal setups like those with
> no swap, or where swap has become exhausted, we can end up with
> memory.high being breached and us having no weapons left in our arsenal
> to combat runaway growth with, since reclaim is futile.

Well, having only non-reclaimable memory essentially means that the
workload has to be terminated to resolve the situation, be it by the
in-kernel oom killer or any other pro-active measure doing the same. On
the other hand, I do understand that we shouldn't run into the system oom
situation prematurely because the userspace doesn't have enough time to
react. In your example above the workload grows by 1M per 1ms, and that
indeed doesn't give much room for a potential userspace intervention. The
global case already tries to throttle on no progress (albeit only for
pending writers). I do not see a fundamental reason to not throttle on no
progress here as well.

> It's also hard for system monitoring software or users to tell how bad
> the situation is, as "high" events for the memcg may in some cases be
> benign, and in others be catastrophic. The current status quo is that we
> fail containment in a way that doesn't provide any advance warning that
> things are about to go horribly wrong (for example, we are about to
> invoke the kernel OOM killer).
>
> This patch introduces explicit throttling when reclaim is failing to
> keep memcg size contained at the memory.high setting. It does so by
> applying an exponential delay curve derived from the memcg's overage
> compared to memory.high. In the normal case where the memcg is either
> below or only marginally over its memory.high setting, no throttling
> will be performed.
>
> This composes well with system health monitoring and remediation, as
> these allocator delays are factored into PSI's memory pressure
> calculations. This both creates a mechanism system administrators or
> applications consuming the PSI interface to trivially see that the memcg
> in question is struggling and use that to make more reasonable
> decisions, and permits them enough time to act. Either of these can act
> with significantly more nuance than that we can provide using the system
> OOM killer.
>
> This is a similar idea to memory.oom_control in cgroup v1 which would
> put the cgroup to sleep if the threshold was violated, but it's also
> significantly improved as it results in visible memory pressure, and
> also doesn't schedule indefinitely, which previously made tracing and
> other introspection difficult (ie. it's clamped at 2*HZ per allocation
> through MEMCG_MAX_HIGH_DELAY_JIFFIES).

OK, fair enough. One thing that I really do not want to see is to have
this explicitly documented in the user documentation because this is an
implementation detail that userspace shouldn't depend on.

> Contrast the previous results with a kernel with this patch:
>
> [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> > --wait -t timeout 300 /root/mdf
> [...]
> 95 1002
> 96 1000
> 97 1002
> 98 1003
> 99 1000
> 100 1043
> 101 84724
> 102 330628
> 103 610511
> 104 1016265
> 105 1503969
> 106 2391692
> 107 2872061
> 108 3248003
> 109 4791904
> 110 5759832
> 111 6912509
> 112 8127818
> 113 9472203
> 114 12287622
> 115 12480079
> 116 14144008
> 117 15808029
> 118 16384500
> 119 16383242
> 120 16384979
> [...]
>
> As you can see, in the normal case, memory allocation takes around 1000
> μsec. However, as we exceed our memory.high, things start to increase
> exponentially, but fairly leniently at first. Our first megabyte over
> memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then
> the next is almost an entire second. This gets worse until we reach our
> eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
> However, this is still making forward progress, so permits tracing or
> further analysis with programs like GDB.
>
> We use an exponential curve for our delay penalty for a few reasons:
>
> 1. We run mem_cgroup_handle_over_high to potentially do reclaim after
> we've already performed allocations, which means that temporarily
> going over memory.high by a small amount may be perfectly legitimate,
> even for compliant workloads. We don't want to unduly penalise such
> cases.
> 2. An exponential curve (as opposed to a static or linear delay) allows
> ramping up memory pressure stats more gradually, which can be useful
> to work out that you have set memory.high too low, without destroying
> application performance entirely.

Thanks for extending the changelog!

> This patch expands on earlier work by Johannes Weiner. Thanks!
>
> Signed-off-by: Chris Down <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]


I didn't get to read the code carefully so I cannot really give my ack
but the idea seems plausible and if the current parameters turn out to
hurt some workloads we can tune from there. The important part is that
throttling doesn't push hard limit triggering to infinity because we
need the OOM killer to trigger and resolve the situation eventually as
not everybody is doing userspace oom killing.
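
To illustrate that the clamp keeps the total delay finite, here is a rough
userspace simulation that sums the per-batch penalty while a cgroup grows
from its high limit towards a hard limit. It is a sketch under several
assumptions: SIM_HZ (CONFIG_HZ=1000) and SIM_BATCH (32 pages) are assumed
values, the sim_* names are hypothetical, reclaim is assumed to free
nothing, and every batch is assumed to be throttled exactly once.

```c
#include <stdint.h>

#define SIM_HZ     1000UL	/* assumption: CONFIG_HZ=1000 */
#define SIM_BATCH  32UL		/* assumption: MEMCG_CHARGE_BATCH, in pages */

/* Per-batch penalty in jiffies, modelled on the patch; sizes in pages. */
static unsigned long sim_penalty(unsigned long usage, unsigned long high)
{
	uint64_t overage, penalty;

	if (usage <= high)
		return 0;
	overage = ((uint64_t)(usage - high) << 20) / (high ? high : 1UL);
	penalty = (overage * overage * SIM_HZ) >> (20 + 14);

	return penalty < 2 * SIM_HZ ? (unsigned long)penalty : 2 * SIM_HZ;
}

/* Sum the penalties paid while growing from `high` to `max` pages. */
static uint64_t sim_total_jiffies(unsigned long high, unsigned long max)
{
	uint64_t total = 0;
	unsigned long usage;

	for (usage = high + SIM_BATCH; usage <= max; usage += SIM_BATCH) {
		unsigned long p = sim_penalty(usage, high);

		if (p > SIM_HZ / 100)	/* tiny penalties are skipped */
			total += p;
	}
	return total;
}
```

Because each batch costs at most 2*HZ jiffies, the total is bounded by the
number of batches between the two limits times 2*HZ, so under these
assumptions a runaway workload still reaches the hard limit in finite
time.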

The other important thing is that this throttling is completely
transparent to the userspace and it is a property of the current
implementation rather than any form of contract (unlike the oom_control
example or any tunable to control the decay/max timeout etc). No
userspace should depend on it.

Thanks

> ---
> mm/memcontrol.c | 118 +++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 117 insertions(+), 1 deletion(-)
>
> [v3: updated the changelog post discussion in person with Michal.]
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2535e54e7989..e12fec0d4b58 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -66,6 +66,7 @@
> #include <linux/lockdep.h>
> #include <linux/file.h>
> #include <linux/tracehook.h>
> +#include <linux/psi.h>
> #include "internal.h"
> #include <net/sock.h>
> #include <net/ip.h>
> @@ -2263,12 +2264,68 @@ static void high_work_func(struct work_struct *work)
> reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
> }
>
> +/*
> + * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
> + * enough to still cause a significant slowdown in most cases, while still
> + * allowing diagnostics and tracing to proceed without becoming stuck.
> + */
> +#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
> +
> +/*
> + * When calculating the delay, we use these either side of the exponentiation to
> + * maintain precision and scale to a reasonable number of jiffies (see the table
> + * below.
> + *
> + * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
> + * overage ratio to a delay.
> + * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down down the
> + * proposed penalty in order to reduce to a reasonable number of jiffies, and
> + * to produce a reasonable delay curve.
> + *
> + * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
> + * reasonable delay curve compared to precision-adjusted overage, not
> + * penalising heavily at first, but still making sure that growth beyond the
> + * limit penalises misbehaviour cgroups by slowing them down exponentially. For
> + * example, with a high of 100 megabytes:
> + *
> + * +-------+------------------------+
> + * | usage | time to allocate in ms |
> + * +-------+------------------------+
> + * | 100M | 0 |
> + * | 101M | 6 |
> + * | 102M | 25 |
> + * | 103M | 57 |
> + * | 104M | 102 |
> + * | 105M | 159 |
> + * | 106M | 230 |
> + * | 107M | 313 |
> + * | 108M | 409 |
> + * | 109M | 518 |
> + * | 110M | 639 |
> + * | 111M | 774 |
> + * | 112M | 921 |
> + * | 113M | 1081 |
> + * | 114M | 1254 |
> + * | 115M | 1439 |
> + * | 116M | 1638 |
> + * | 117M | 1849 |
> + * | 118M | 2000 |
> + * | 119M | 2000 |
> + * | 120M | 2000 |
> + * +-------+------------------------+
> + */
> + #define MEMCG_DELAY_PRECISION_SHIFT 20
> + #define MEMCG_DELAY_SCALING_SHIFT 14
> +
> /*
> * Scheduled by try_charge() to be executed from the userland return path
> * and reclaims memory over the high limit.
> */
> void mem_cgroup_handle_over_high(void)
> {
> + unsigned long usage, high;
> + unsigned long pflags;
> + unsigned long penalty_jiffies, overage;
> unsigned int nr_pages = current->memcg_nr_pages_over_high;
> struct mem_cgroup *memcg = current->memcg_high_reclaim;
>
> @@ -2279,9 +2336,68 @@ void mem_cgroup_handle_over_high(void)
> memcg = get_mem_cgroup_from_mm(current->mm);
>
> reclaim_high(memcg, nr_pages, GFP_KERNEL);
> - css_put(&memcg->css);
> current->memcg_high_reclaim = NULL;
> current->memcg_nr_pages_over_high = 0;
> +
> + /*
> + * memory.high is breached and reclaim is unable to keep up. Throttle
> + * allocators proactively to slow down excessive growth.
> + *
> + * We use overage compared to memory.high to calculate the number of
> + * jiffies to sleep (penalty_jiffies). Ideally this value should be
> + * fairly lenient on small overages, and increasingly harsh when the
> + * memcg in question makes it clear that it has no intention of stopping
> + * its crazy behaviour, so we exponentially increase the delay based on
> + * overage amount.
> + */
> +
> + usage = page_counter_read(&memcg->memory);
> + high = READ_ONCE(memcg->high);
> +
> + if (usage <= high)
> + goto out;
> +
> + overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT) / high;
> + penalty_jiffies = ((u64)overage * overage * HZ)
> + >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
> +
> + /*
> + * Factor in the task's own contribution to the overage, such that four
> + * N-sized allocations are throttled approximately the same as one
> + * 4N-sized allocation.
> + *
> + * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
> + * larger the current charge patch is than that.
> + */
> + penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
> +
> + /*
> + * Clamp the max delay per usermode return so as to still keep the
> + * application moving forwards and also permit diagnostics, albeit
> + * extremely slowly.
> + */
> + penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
> +
> + /*
> + * Don't sleep if the amount of jiffies this memcg owes us is so low
> + * that it's not even worth doing, in an attempt to be nice to those who
> + * go only a small amount over their memory.high value and maybe haven't
> + * been aggressively reclaimed enough yet.
> + */
> + if (penalty_jiffies <= HZ / 100)
> + goto out;
> +
> + /*
> + * If we exit early, we're guaranteed to die (since
> + * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> + * need to account for any ill-begotten jiffies to pay them off later.
> + */
> + psi_memstall_enter(&pflags);
> + schedule_timeout_killable(penalty_jiffies);
> + psi_memstall_leave(&pflags);
> +
> +out:
> + css_put(&memcg->css);
> }
>
> static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> --
> 2.21.0

--
Michal Hocko
SUSE Labs

2019-07-24 01:57:12

by Chris Down

Subject: [PATCH v4] mm: Throttle allocators when failing reclaim over memory.high

We're trying to use memory.high to limit workloads, but have found that
containment can frequently fail completely and cause OOM situations
outside of the cgroup. This happens especially with swap space -- either
when none is configured, or swap is full. These failures often also
don't have enough warning to allow one to react, whether for a human or
for a daemon monitoring PSI.

Here is output from a simple program showing how long it takes in μsec
(column 2) to allocate a megabyte of anonymous memory (column 1) when a
cgroup is already beyond its memory high setting, and no swap is
available:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1035
96 1038
97 1000
98 1036
99 1048
100 1590
101 1968
102 1776
103 1863
104 1757
105 1921
106 1893
107 1760
108 1748
109 1843
110 1716
111 1924
112 1776
113 1831
114 1766
115 1836
116 1588
117 1912
118 1802
119 1857
120 1731
[...]
[System OOM in 2-3 seconds]

The delay does go up extremely marginally past the 100MB memory.high
threshold, as now we spend time scanning before returning to usermode,
but it's nowhere near enough to contain growth. It also doesn't get
worse the more pages you have, since it only considers nr_pages.

The current situation goes against both the expectations of users of
memory.high, and our intentions as cgroup v2 developers. In
cgroup-v2.txt, we claim that we will throttle and only under "extreme
conditions" will memory.high protection be breached. Likewise, cgroup v2
users generally also expect that memory.high should throttle workloads
as they exceed their high threshold. However, as seen above, this isn't
always how it works in practice -- even on banal setups like those with
no swap, or where swap has become exhausted, we can end up with
memory.high being breached and us having no weapons left in our arsenal
to combat runaway growth with, since reclaim is futile.

It's also hard for system monitoring software or users to tell how bad
the situation is, as "high" events for the memcg may in some cases be
benign, and in others be catastrophic. The current status quo is that we
fail containment in a way that doesn't provide any advance warning that
things are about to go horribly wrong (for example, we are about to
invoke the kernel OOM killer).

This patch introduces explicit throttling when reclaim is failing to
keep memcg size contained at the memory.high setting. It does so by
applying an exponential delay curve derived from the memcg's overage
compared to memory.high. In the normal case where the memcg is either
below or only marginally over its memory.high setting, no throttling
will be performed.

This composes well with system health monitoring and remediation, as
these allocator delays are factored into PSI's memory pressure
calculations. This both creates a mechanism for system administrators or
applications consuming the PSI interface to trivially see that the memcg
in question is struggling and use that to make more reasonable
decisions, and permits them enough time to act. Either of these can act
with significantly more nuance than we can provide using the system
OOM killer.

This is a similar idea to memory.oom_control in cgroup v1 which would
put the cgroup to sleep if the threshold was violated, but it's also
significantly improved as it results in visible memory pressure, and
also doesn't schedule indefinitely, which previously made tracing and
other introspection difficult (i.e. it's clamped at 2*HZ per allocation
through MEMCG_MAX_HIGH_DELAY_JIFFIES).

Contrast the previous results with a kernel with this patch:

[root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> --wait -t timeout 300 /root/mdf
[...]
95 1002
96 1000
97 1002
98 1003
99 1000
100 1043
101 84724
102 330628
103 610511
104 1016265
105 1503969
106 2391692
107 2872061
108 3248003
109 4791904
110 5759832
111 6912509
112 8127818
113 9472203
114 12287622
115 12480079
116 14144008
117 15808029
118 16384500
119 16383242
120 16384979
[...]

As you can see, in the normal case, memory allocation takes around 1000
μsec. However, as we exceed our memory.high, things start to increase
exponentially, but fairly leniently at first. Our first megabyte over
memory.high takes us 0.08 seconds, then the next is 0.33 seconds, then
the next is 0.61 seconds. This gets worse until we reach our
eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
However, this is still making forward progress, so it permits tracing or
further analysis with programs like GDB.
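
The 16 seconds per megabyte figure follows from simple arithmetic, which
can be sketched as below. The page size (4 KiB) and batch size (32 pages)
are assumptions about the test machine, and the function name is
hypothetical; the 2 second clamp is the one from this patch.

```c
#define ASSUMED_PAGE_SIZE     4096UL	/* assumption: 4 KiB pages */
#define ASSUMED_CHARGE_BATCH  32UL	/* assumption: MEMCG_CHARGE_BATCH */
#define MAX_DELAY_SECONDS     2UL	/* 2*HZ jiffies, from the patch */

/* Worst-case seconds spent throttled while allocating `bytes` at the clamp. */
static unsigned long clamped_seconds_for(unsigned long bytes)
{
	unsigned long batch_bytes = ASSUMED_CHARGE_BATCH * ASSUMED_PAGE_SIZE;
	unsigned long batches = (bytes + batch_bytes - 1) / batch_bytes;

	return batches * MAX_DELAY_SECONDS;
}
```

One megabyte is eight 128 KiB batches, and each batch sleeps for at most 2
seconds, which gives the 16 seconds seen at the bottom of the output.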

We use an exponential curve for our delay penalty for a few reasons:

1. We run mem_cgroup_handle_over_high to potentially do reclaim after
we've already performed allocations, which means that temporarily
going over memory.high by a small amount may be perfectly legitimate,
even for compliant workloads. We don't want to unduly penalise such
cases.
2. An exponential curve (as opposed to a static or linear delay) allows
ramping up memory pressure stats more gradually, which can be useful
to work out that you have set memory.high too low, without destroying
application performance entirely.

This patch expands on earlier work by Johannes Weiner. Thanks!

Signed-off-by: Chris Down <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/memcontrol.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 124 insertions(+), 1 deletion(-)

[v4: Rebased and fixed theoretical (but somewhat unlikely) divide by zero]
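
The divide by zero case is a cgroup with memory.high set to 0 and non-zero
usage. A minimal userspace sketch of the guarded overage calculation,
assuming the same fixed-point shift as the patch (the function name is
made up for illustration):

```c
#include <stdint.h>

#define MEMCG_DELAY_PRECISION_SHIFT 20

/* Fixed-point overage ratio; a `high` of 0 is treated as 1 page. */
static uint64_t model_overage(unsigned long usage, unsigned long high)
{
	unsigned long clamped_high = high > 1UL ? high : 1UL;

	if (usage <= high)
		return 0;

	return ((uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT)
	       / clamped_high;
}
```

With high == 0 and a single page of usage, this yields 1 << 20 (a ratio of
1.0 in fixed point) rather than faulting.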

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d969cf5598ce..8a46496822e3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -57,6 +57,7 @@
#include <linux/lockdep.h>
#include <linux/file.h>
#include <linux/tracehook.h>
+#include <linux/psi.h>
#include <linux/seq_buf.h>
#include "internal.h"
#include <net/sock.h>
@@ -2314,12 +2315,68 @@ static void high_work_func(struct work_struct *work)
reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
}

+/*
+ * Clamp the maximum sleep time per allocation batch to 2 seconds. This is
+ * enough to still cause a significant slowdown in most cases, while still
+ * allowing diagnostics and tracing to proceed without becoming stuck.
+ */
+#define MEMCG_MAX_HIGH_DELAY_JIFFIES (2UL*HZ)
+
+/*
+ * When calculating the delay, we use these either side of the exponentiation to
+ * maintain precision and scale to a reasonable number of jiffies (see the table
+ * below).
+ *
+ * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the
+ * overage ratio to a delay.
+ * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the
+ * proposed penalty in order to reduce to a reasonable number of jiffies, and
+ * to produce a reasonable delay curve.
+ *
+ * MEMCG_DELAY_SCALING_SHIFT just happens to be a number that produces a
+ * reasonable delay curve compared to precision-adjusted overage, not
+ * penalising heavily at first, but still making sure that growth beyond the
+ * limit penalises misbehaving cgroups by slowing them down exponentially. For
+ * example, with a high of 100 megabytes:
+ *
+ * +-------+------------------------+
+ * | usage | time to allocate in ms |
+ * +-------+------------------------+
+ * | 100M | 0 |
+ * | 101M | 6 |
+ * | 102M | 25 |
+ * | 103M | 57 |
+ * | 104M | 102 |
+ * | 105M | 159 |
+ * | 106M | 230 |
+ * | 107M | 313 |
+ * | 108M | 409 |
+ * | 109M | 518 |
+ * | 110M | 639 |
+ * | 111M | 774 |
+ * | 112M | 921 |
+ * | 113M | 1081 |
+ * | 114M | 1254 |
+ * | 115M | 1439 |
+ * | 116M | 1638 |
+ * | 117M | 1849 |
+ * | 118M | 2000 |
+ * | 119M | 2000 |
+ * | 120M | 2000 |
+ * +-------+------------------------+
+ */
+#define MEMCG_DELAY_PRECISION_SHIFT 20
+#define MEMCG_DELAY_SCALING_SHIFT 14
+
/*
* Scheduled by try_charge() to be executed from the userland return path
* and reclaims memory over the high limit.
*/
void mem_cgroup_handle_over_high(void)
{
+ unsigned long usage, high, clamped_high;
+ unsigned long pflags;
+ unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
struct mem_cgroup *memcg;

@@ -2328,8 +2385,74 @@ void mem_cgroup_handle_over_high(void)

memcg = get_mem_cgroup_from_mm(current->mm);
reclaim_high(memcg, nr_pages, GFP_KERNEL);
- css_put(&memcg->css);
current->memcg_nr_pages_over_high = 0;
+
+ /*
+ * memory.high is breached and reclaim is unable to keep up. Throttle
+ * allocators proactively to slow down excessive growth.
+ *
+ * We use overage compared to memory.high to calculate the number of
+ * jiffies to sleep (penalty_jiffies). Ideally this value should be
+ * fairly lenient on small overages, and increasingly harsh when the
+ * memcg in question makes it clear that it has no intention of stopping
+ * its crazy behaviour, so we exponentially increase the delay based on
+ * overage amount.
+ */
+
+ usage = page_counter_read(&memcg->memory);
+ high = READ_ONCE(memcg->high);
+
+ if (usage <= high)
+ goto out;
+
+ /*
+ * Prevent division by 0 in overage calculation by acting as if it was a
+ * threshold of 1 page.
+ */
+ clamped_high = max(high, 1UL);
+
+ overage = ((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT)
+ / clamped_high;
+ penalty_jiffies = ((u64)overage * overage * HZ)
+ >> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);
+
+ /*
+ * Factor in the task's own contribution to the overage, such that four
+ * N-sized allocations are throttled approximately the same as one
+ * 4N-sized allocation.
+ *
+ * MEMCG_CHARGE_BATCH pages is nominal, so work out how much smaller or
+ * larger the current charge batch is than that.
+ */
+ penalty_jiffies = penalty_jiffies * nr_pages / MEMCG_CHARGE_BATCH;
+
+ /*
+ * Clamp the max delay per usermode return so as to still keep the
+ * application moving forwards and also permit diagnostics, albeit
+ * extremely slowly.
+ */
+ penalty_jiffies = min(penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);
+
+ /*
+ * Don't sleep if the number of jiffies this memcg owes us is so low
+ * that it's not even worth doing, in an attempt to be nice to those who
+ * go only a small amount over their memory.high value and maybe haven't
+ * been aggressively reclaimed enough yet.
+ */
+ if (penalty_jiffies <= HZ / 100)
+ goto out;
+
+ /*
+ * If we exit early, we're guaranteed to die (since
+ * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
+ * need to account for any ill-begotten jiffies to pay them off later.
+ */
+ psi_memstall_enter(&pflags);
+ schedule_timeout_killable(penalty_jiffies);
+ psi_memstall_leave(&pflags);
+
+out:
+ css_put(&memcg->css);
}

static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
2.22.0
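
For readers who want to check the delay table in the patch comment, here is a minimal userspace sketch of the same penalty arithmetic. It is not kernel code. The SHIFT constants are copied from the patch; everything prefixed with sketch_ or SKETCH_ is a name made up for this illustration, and HZ=1000, a charge batch of 32 pages and 4 KiB pages are assumptions. Note that the penalty grows with the square of the overage ratio.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from the patch. */
#define MEMCG_DELAY_PRECISION_SHIFT 20
#define MEMCG_DELAY_SCALING_SHIFT 14

/* Assumptions for this sketch only, not stated in the patch. */
#define SKETCH_HZ 1000ULL                  /* 1 jiffy == 1 ms */
#define SKETCH_CHARGE_BATCH 32ULL          /* stand-in for MEMCG_CHARGE_BATCH */
#define SKETCH_MAX_DELAY (2ULL * SKETCH_HZ) /* mirrors the 2*HZ clamp */

/* Convert mebibytes to pages, assuming 4 KiB pages. */
static unsigned long sketch_mib_to_pages(unsigned long mib)
{
	return mib * 256UL;
}

/*
 * Mirror of the penalty calculation in mem_cgroup_handle_over_high().
 * usage and high are in pages. Returns the clamped penalty in jiffies,
 * before the "too small to be worth sleeping" check is applied.
 */
static unsigned long sketch_penalty_jiffies(unsigned long usage,
					    unsigned long high,
					    unsigned int nr_pages)
{
	uint64_t clamped_high, overage, penalty;

	if (usage <= high)
		return 0;

	/* The precision shift below must not overflow 64 bits. */
	assert((uint64_t)(usage - high) <=
	       (UINT64_MAX >> MEMCG_DELAY_PRECISION_SHIFT));

	/* Avoid division by zero when high is 0. */
	clamped_high = high ? high : 1;

	/* Overage ratio in fixed point with 20 fractional bits. */
	overage = ((uint64_t)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT)
		/ clamped_high;

	/* Square the ratio, scale by HZ, then drop the extra bits. */
	penalty = (overage * overage * SKETCH_HZ)
		>> (MEMCG_DELAY_PRECISION_SHIFT + MEMCG_DELAY_SCALING_SHIFT);

	/* Scale by the size of this charge relative to a nominal batch. */
	penalty = penalty * nr_pages / SKETCH_CHARGE_BATCH;

	if (penalty > SKETCH_MAX_DELAY)
		penalty = SKETCH_MAX_DELAY;

	return (unsigned long)penalty;
}
```

Under these assumptions, calling sketch_penalty_jiffies() with a high of 100M and a usage of 101M, 102M and 110M gives 6, 25 and 639 jiffies, which agrees with the table, and the clamp of 2000 is reached at 118M. With 4 KiB pages, one megabyte is 256 pages, or 8 batches of 32 pages, so at the clamp a megabyte costs about 8 * 2 = 16 seconds, consistent with the measurements quoted in the changelog.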

2019-07-24 02:32:14

by Johannes Weiner

Subject: Re: [PATCH v4] mm: Throttle allocators when failing reclaim over memory.high

On Tue, Jul 23, 2019 at 02:07:00PM -0400, Chris Down wrote:
> We're trying to use memory.high to limit workloads, but have found that
> containment can frequently fail completely and cause OOM situations
> outside of the cgroup. This happens especially with swap space -- either
> when none is configured, or swap is full. These failures often also
> don't have enough warning to allow one to react, whether for a human or
> for a daemon monitoring PSI.
>
> Here is output from a simple program showing how long it takes in μsec
> (column 2) to allocate a megabyte of anonymous memory (column 1) when a
> cgroup is already beyond its memory high setting, and no swap is
> available:
>
> [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> > --wait -t timeout 300 /root/mdf
> [...]
> 95 1035
> 96 1038
> 97 1000
> 98 1036
> 99 1048
> 100 1590
> 101 1968
> 102 1776
> 103 1863
> 104 1757
> 105 1921
> 106 1893
> 107 1760
> 108 1748
> 109 1843
> 110 1716
> 111 1924
> 112 1776
> 113 1831
> 114 1766
> 115 1836
> 116 1588
> 117 1912
> 118 1802
> 119 1857
> 120 1731
> [...]
> [System OOM in 2-3 seconds]
>
> The delay does go up extremely marginally past the 100MB memory.high
> threshold, as now we spend time scanning before returning to usermode,
> but it's nowhere near enough to contain growth. It also doesn't get
> worse the more pages you have, since it only considers nr_pages.
>
> The current situation goes against both the expectations of users of
> memory.high, and our intentions as cgroup v2 developers. In
> cgroup-v2.txt, we claim that we will throttle and only under "extreme
> conditions" will memory.high protection be breached. Likewise, cgroup v2
> users generally also expect that memory.high should throttle workloads
> as they exceed their high threshold. However, as seen above, this isn't
> always how it works in practice -- even on banal setups like those with
> no swap, or where swap has become exhausted, we can end up with
> memory.high being breached and us having no weapons left in our arsenal
> to combat runaway growth with, since reclaim is futile.
>
> It's also hard for system monitoring software or users to tell how bad
> the situation is, as "high" events for the memcg may in some cases be
> benign, and in others be catastrophic. The current status quo is that we
> fail containment in a way that doesn't provide any advance warning that
> things are about to go horribly wrong (for example, we are about to
> invoke the kernel OOM killer).
>
> This patch introduces explicit throttling when reclaim is failing to
> keep memcg size contained at the memory.high setting. It does so by
> applying an exponential delay curve derived from the memcg's overage
> compared to memory.high. In the normal case where the memcg is either
> below or only marginally over its memory.high setting, no throttling
> will be performed.
>
> This composes well with system health monitoring and remediation, as
> these allocator delays are factored into PSI's memory pressure
> calculations. This both creates a mechanism for system administrators or
> applications consuming the PSI interface to trivially see that the memcg
> in question is struggling and use that to make more reasonable
> decisions, and permits them enough time to act. Either of these can act
> with significantly more nuance than we can provide using the system
> OOM killer.
>
> This is a similar idea to memory.oom_control in cgroup v1 which would
> put the cgroup to sleep if the threshold was violated, but it's also
> significantly improved as it results in visible memory pressure, and
> also doesn't schedule indefinitely, which previously made tracing and
> other introspection difficult (i.e. it's clamped at 2*HZ per allocation
> through MEMCG_MAX_HIGH_DELAY_JIFFIES).
>
> Contrast the previous results with a kernel with this patch:
>
> [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
> > --wait -t timeout 300 /root/mdf
> [...]
> 95 1002
> 96 1000
> 97 1002
> 98 1003
> 99 1000
> 100 1043
> 101 84724
> 102 330628
> 103 610511
> 104 1016265
> 105 1503969
> 106 2391692
> 107 2872061
> 108 3248003
> 109 4791904
> 110 5759832
> 111 6912509
> 112 8127818
> 113 9472203
> 114 12287622
> 115 12480079
> 116 14144008
> 117 15808029
> 118 16384500
> 119 16383242
> 120 16384979
> [...]
>
> As you can see, in the normal case, memory allocation takes around 1000
> μsec. However, as we exceed our memory.high, things start to increase
> exponentially, but fairly leniently at first. Our first megabyte over
> memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then
> the next is almost an entire second. This gets worse until we reach our
> eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
> However, this is still making forward progress, so permits tracing or
> further analysis with programs like GDB.
>
> We use an exponential curve for our delay penalty for a few reasons:
>
> 1. We run mem_cgroup_handle_over_high to potentially do reclaim after
> we've already performed allocations, which means that temporarily
> going over memory.high by a small amount may be perfectly legitimate,
> even for compliant workloads. We don't want to unduly penalise such
> cases.
> 2. An exponential curve (as opposed to a static or linear delay) allows
> ramping up memory pressure stats more gradually, which can be useful
> to work out that you have set memory.high too low, without destroying
> application performance entirely.
>
> This patch expands on earlier work by Johannes Weiner. Thanks!
>
> Signed-off-by: Chris Down <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---

Acked-by: Johannes Weiner <[email protected]>
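
Since the changelog says these delays are factored into PSI's memory pressure figures, a monitoring daemon could watch the cgroup's memory.pressure file for them. The following is a rough userspace sketch of parsing that file, assuming the usual PSI line format ("some avg10=... avg60=... avg300=... total=..."). The struct and function names are invented for this illustration, and the path passed in (for example a memory.pressure file under the cgroup v2 mount) is up to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed PSI line; hypothetical type for this sketch. */
struct psi_line {
	double avg10, avg60, avg300;	/* stall percentages */
	unsigned long long total;	/* cumulative stall time in usec */
};

/*
 * Parse a single PSI line of the given kind ("some" or "full").
 * Returns 0 on success, -1 if the line is of another kind or malformed.
 */
static int psi_parse_line(const char *line, const char *kind,
			  struct psi_line *out)
{
	size_t n;

	assert(line && kind && out);

	n = strlen(kind);
	if (strncmp(line, kind, n) != 0 || line[n] != ' ')
		return -1;

	if (sscanf(line + n, " avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   &out->avg10, &out->avg60, &out->avg300, &out->total) != 4)
		return -1;

	return 0;
}

/*
 * Read the "some" line from a PSI file such as a cgroup's
 * memory.pressure. Returns 0 on success, -1 on any failure.
 */
static int psi_read_some(const char *path, struct psi_line *out)
{
	char buf[256];
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	while (fgets(buf, sizeof(buf), f)) {
		if (psi_parse_line(buf, "some", out) == 0) {
			ret = 0;
			break;
		}
	}

	fclose(f);
	return ret;
}
```

A daemon could call psi_read_some() periodically and act when avg10 rises above a threshold of its choosing, which is the kind of remediation the changelog has in mind. The threshold and the action taken are policy decisions left out of this sketch.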