Setting the original memory.limit_in_bytes hardlimit is subject to a
race condition when the desired value is below the current usage. The
code tries a few times to first reclaim and then see if the usage has
dropped to where we would like it to be, but there is no locking, and
the workload is free to continue making new charges up to the old
limit. Thus, attempting to shrink a workload relies on pure luck and
hope that the workload happens to cooperate.
To fix this in the cgroup2 memory.max knob, do it the other way round:
set the limit first, then try enforcement. And if reclaim is not able
to succeed, trigger OOM kills in the group. Keep going until the new
limit is met, we run out of OOM victims and there's only unreclaimable
memory left, or the task writing to memory.max is killed. This allows
users to shrink groups reliably, and the behavior is consistent with
what happens when new charges are attempted in excess of memory.max.
Signed-off-by: Johannes Weiner <[email protected]>
---
Documentation/cgroup-v2.txt | 6 ++++++
mm/memcontrol.c | 38 ++++++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index c709d7ebc092..2fd4cdb78445 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -1540,6 +1540,12 @@ system than killing the group. Otherwise, memory.max is there to
limit this type of spillover and ultimately contain buggy or even
malicious applications.
+Setting the original memory.limit_in_bytes below the current usage was
+subject to a race condition, where concurrent charges could cause the
+limit setting to fail. memory.max on the other hand will first set the
+limit to prevent new charges, and then reclaim and OOM kill until the
+new limit is met - or the task writing to memory.max is killed.
+
The combined memory+swap accounting and limiting is replaced by real
control over swap space.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f7c9b4cbdf01..8614e0d750e5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1236,7 +1236,7 @@ static unsigned long mem_cgroup_get_limit(struct mem_cgroup *memcg)
return limit;
}
-static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
+static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
int order)
{
struct oom_control oc = {
@@ -1314,6 +1314,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
unlock:
mutex_unlock(&oom_lock);
+ return chosen;
}
#if MAX_NUMNODES > 1
@@ -5029,6 +5030,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ unsigned int nr_reclaims = MEM_CGROUP_RECLAIM_RETRIES;
+ bool drained = false;
unsigned long max;
int err;
@@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (err)
return err;
- err = mem_cgroup_resize_limit(memcg, max);
- if (err)
- return err;
+ xchg(&memcg->memory.limit, max);
+
+ for (;;) {
+ unsigned long nr_pages = page_counter_read(&memcg->memory);
+
+ if (nr_pages <= max)
+ break;
+
+ if (signal_pending(current)) {
+ err = -EINTR;
+ break;
+ }
+
+ if (!drained) {
+ drain_all_stock(memcg);
+ drained = true;
+ continue;
+ }
+
+ if (nr_reclaims) {
+ if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
+ GFP_KERNEL, true))
+ nr_reclaims--;
+ continue;
+ }
+
+ mem_cgroup_events(memcg, MEMCG_OOM, 1);
+ if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
+ break;
+ }
memcg_wb_domain_size_changed(memcg);
return nbytes;
--
2.7.2
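For readers following the patch, the control flow of the new memory_max_write() loop can be modeled in userspace. This is a minimal sketch, not kernel code. struct toy_memcg and the toy_* helpers are hypothetical, RECLAIM_RETRIES stands in for MEM_CGROUP_RECLAIM_RETRIES with an assumed value, and the drain_all_stock() and signal_pending() steps are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RECLAIM_RETRIES 5	/* assumed stand-in for MEM_CGROUP_RECLAIM_RETRIES */

/* Toy model of a memory cgroup; every field here is hypothetical. */
struct toy_memcg {
	unsigned long limit;		/* hard limit, in pages */
	unsigned long usage;		/* pages currently charged */
	unsigned long reclaimable;	/* pages that reclaim can still free */
	unsigned long victim_pages;	/* pages released by each OOM kill */
	unsigned int victims;		/* tasks still eligible for an OOM kill */
};

/* Free up to 'want' reclaimable pages; returns the number actually freed. */
static unsigned long toy_reclaim(struct toy_memcg *cg, unsigned long want)
{
	unsigned long freed = want < cg->reclaimable ? want : cg->reclaimable;

	cg->reclaimable -= freed;
	cg->usage -= freed;
	return freed;
}

/* Kill one victim; returns false once no victim is left. */
static bool toy_oom_kill(struct toy_memcg *cg)
{
	unsigned long freed;

	if (cg->victims == 0)
		return false;
	cg->victims--;
	freed = cg->victim_pages < cg->usage ? cg->victim_pages : cg->usage;
	cg->usage -= freed;
	return true;
}

/*
 * Set the limit first, then enforce it. Returns true if usage ended up
 * at or below 'max'. The new limit stays in place either way.
 */
static bool toy_set_max(struct toy_memcg *cg, unsigned long max)
{
	unsigned int nr_reclaims = RECLAIM_RETRIES;

	assert(cg != NULL);
	cg->limit = max;	/* from here on, new charges are capped at 'max' */

	while (cg->usage > max) {
		if (nr_reclaims) {
			/* Only a reclaim pass that frees nothing uses up a retry. */
			if (!toy_reclaim(cg, cg->usage - max))
				nr_reclaims--;
			continue;
		}
		/* Reclaim is exhausted: fall back to OOM kills. */
		if (!toy_oom_kill(cg))
			break;	/* no victims left, only unreclaimable memory */
	}
	return cg->usage <= max;
}
```

In this model, a group with enough reclaimable pages is shrunk without any kill, while a group holding mostly unreclaimable memory loses its victims and may still end above the new limit, with the limit itself already installed.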
On Thu 10-03-16 15:50:14, Johannes Weiner wrote:
> Setting the original memory.limit_in_bytes hardlimit is subject to a
> race condition when the desired value is below the current usage. The
> code tries a few times to first reclaim and then see if the usage has
> dropped to where we would like it to be, but there is no locking, and
> the workload is free to continue making new charges up to the old
> limit. Thus, attempting to shrink a workload relies on pure luck and
> hope that the workload happens to cooperate.
OK, this would indeed be a problem when you want to stop a runaway load.
> To fix this in the cgroup2 memory.max knob, do it the other way round:
> set the limit first, then try enforcement. And if reclaim is not able
> to succeed, trigger OOM kills in the group. Keep going until the new
> limit is met, we run out of OOM victims and there's only unreclaimable
> memory left, or the task writing to memory.max is killed. This allows
> users to shrink groups reliably, and the behavior is consistent with
> what happens when new charges are attempted in excess of memory.max.
Agreed here as well. I think this should go into 4.5 final, or into
stable afterwards, so that the knob does not behave differently across
releases.
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
One nit below
[...]
> @@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> if (err)
> return err;
>
> - err = mem_cgroup_resize_limit(memcg, max);
> - if (err)
> - return err;
> + xchg(&memcg->memory.limit, max);
> +
> + for (;;) {
> + unsigned long nr_pages = page_counter_read(&memcg->memory);
> +
> + if (nr_pages <= max)
> + break;
> +
> + if (signal_pending(current)) {
Didn't you want fatal_signal_pending here? At least the changelog
suggests that.
> + err = -EINTR;
> + break;
> + }
--
Michal Hocko
SUSE Labs
On Fri, Mar 11, 2016 at 09:18:25AM +0100, Michal Hocko wrote:
> On Thu 10-03-16 15:50:14, Johannes Weiner wrote:
...
> > @@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> > if (err)
> > return err;
> >
> > - err = mem_cgroup_resize_limit(memcg, max);
> > - if (err)
> > - return err;
> > + xchg(&memcg->memory.limit, max);
> > +
> > + for (;;) {
> > + unsigned long nr_pages = page_counter_read(&memcg->memory);
> > +
> > + if (nr_pages <= max)
> > + break;
> > +
> > + if (signal_pending(current)) {
>
> Didn't you want fatal_signal_pending here? At least the changelog
> suggests that.
I suppose the user might want to interrupt the write by hitting CTRL-C.
Come to think of it, shouldn't we restore the old limit and return EBUSY
if we failed to reclaim enough memory?
>
> > + err = -EINTR;
> > + break;
> > + }
On Fri, Mar 11, 2016 at 12:19:31PM +0300, Vladimir Davydov wrote:
> On Fri, Mar 11, 2016 at 09:18:25AM +0100, Michal Hocko wrote:
> > On Thu 10-03-16 15:50:14, Johannes Weiner wrote:
> ...
> > > @@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> > > if (err)
> > > return err;
> > >
> > > - err = mem_cgroup_resize_limit(memcg, max);
> > > - if (err)
> > > - return err;
> > > + xchg(&memcg->memory.limit, max);
> > > +
> > > + for (;;) {
> > > + unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > +
> > > + if (nr_pages <= max)
> > > + break;
> > > +
> > > + if (signal_pending(current)) {
> >
> > Didn't you want fatal_signal_pending here? At least the changelog
> > suggests that.
>
> I suppose the user might want to interrupt the write by hitting CTRL-C.
Yeah. This is the same thing we do for the current limit setting loop.
> Come to think of it, shouldn't we restore the old limit and return EBUSY
> if we failed to reclaim enough memory?
I suspect it's very rare that it would fail. But even in that case
it's probably better to at least not allow new charges past what the
user requested, even if we can't push the level back far enough.
On Tue 15-03-16 22:18:48, Johannes Weiner wrote:
> On Fri, Mar 11, 2016 at 12:19:31PM +0300, Vladimir Davydov wrote:
> > On Fri, Mar 11, 2016 at 09:18:25AM +0100, Michal Hocko wrote:
> > > On Thu 10-03-16 15:50:14, Johannes Weiner wrote:
> > ...
> > > > @@ -5037,9 +5040,36 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> > > > if (err)
> > > > return err;
> > > >
> > > > - err = mem_cgroup_resize_limit(memcg, max);
> > > > - if (err)
> > > > - return err;
> > > > + xchg(&memcg->memory.limit, max);
> > > > +
> > > > + for (;;) {
> > > > + unsigned long nr_pages = page_counter_read(&memcg->memory);
> > > > +
> > > > + if (nr_pages <= max)
> > > > + break;
> > > > +
> > > > + if (signal_pending(current)) {
> > >
> > > Didn't you want fatal_signal_pending here? At least the changelog
> > > suggests that.
> >
> > I suppose the user might want to interrupt the write by hitting CTRL-C.
>
> Yeah. This is the same thing we do for the current limit setting loop.
Yes, we do, but there the operation is canceled without any change.
Re-reading the changelog, I realize I misread the "we run out of OOM
victims and there's only unreclaimable memory left, or the task
writing to memory.max is killed" part as saying that the task writing
to memory.max is OOM killed.
> > Come to think of it, shouldn't we restore the old limit and return EBUSY
> > if we failed to reclaim enough memory?
>
> I suspect it's very rare that it would fail. But even in that case
> it's probably better to at least not allow new charges past what the
> user requested, even if we can't push the level back far enough.
I guess you are right. This guarantee is indeed useful.
--
Michal Hocko
SUSE Labs
On Tue, Mar 15, 2016 at 10:18:48PM -0700, Johannes Weiner wrote:
> On Fri, Mar 11, 2016 at 12:19:31PM +0300, Vladimir Davydov wrote:
...
> > Come to think of it, shouldn't we restore the old limit and return EBUSY
> > if we failed to reclaim enough memory?
>
> I suspect it's very rare that it would fail. But even in that case
> it's probably better to at least not allow new charges past what the
> user requested, even if we can't push the level back far enough.
It's of course good to set the limit before trying to reclaim memory,
but isn't it strange that even if the cgroup's memory can't be reclaimed
to meet the new limit (tmpfs files or tasks protected from oom), the
write will still succeed? It's a rare use case, but still.
I've one more concern regarding this patch. It's about calling OOM while
reclaiming cgroup memory. AFAIU OOM killer can be quite disruptive for a
workload, so is it really good to call it when normal reclaim fails?
W/o OOM killer you can optimistically try to adjust memory.max and if it
fails you can manually kill some processes in the container or restart
it or cancel the limit update. With your patch adjusting memory.max
never fails, but OOM might kill vital processes rendering the whole
container useless. Wouldn't it be better to let the user decide if
processes should be killed or not rather than calling OOM forcefully?
Thanks,
Vladimir
On Wed, Mar 16, 2016 at 06:15:09PM +0300, Vladimir Davydov wrote:
> On Tue, Mar 15, 2016 at 10:18:48PM -0700, Johannes Weiner wrote:
> > On Fri, Mar 11, 2016 at 12:19:31PM +0300, Vladimir Davydov wrote:
> ...
> > > Come to think of it, shouldn't we restore the old limit and return EBUSY
> > > if we failed to reclaim enough memory?
> >
> > I suspect it's very rare that it would fail. But even in that case
> > it's probably better to at least not allow new charges past what the
> > user requested, even if we can't push the level back far enough.
>
> It's of course good to set the limit before trying to reclaim memory,
> but isn't it strange that even if the cgroup's memory can't be reclaimed
> to meet the new limit (tmpfs files or tasks protected from oom), the
> write will still succeed? It's a rare use case, but still.
It's not optimal, but there is nothing we can do about it, is there? I
don't want to go back to the racy semantics that allow the application
to balloon up again after the limit restriction fails.
> I've one more concern regarding this patch. It's about calling OOM while
> reclaiming cgroup memory. AFAIU OOM killer can be quite disruptive for a
> workload, so is it really good to call it when normal reclaim fails?
>
> W/o OOM killer you can optimistically try to adjust memory.max and if it
> fails you can manually kill some processes in the container or restart
> it or cancel the limit update. With your patch adjusting memory.max
> never fails, but OOM might kill vital processes rendering the whole
> container useless. Wouldn't it be better to let the user decide if
> processes should be killed or not rather than calling OOM forcefully?
Those are the memory.max semantics, though. Why should there be a
difference between the container growing beyond the limit and the
limit cutting into the container?
If you don't want OOM kills, set memory.high instead. This way you get
the memory pressure *and* the chance to do your own killing.
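A userspace manager following this advice might look roughly like the sketch below. It is an illustration under assumptions, not a reference tool: cg_write(), cg_read_ul() and cg_throttle() are hypothetical helper names, the caller supplies the cgroup directory, and the process needs write permission on the cgroup's files. Only the memory.high and memory.current file names come from cgroup v2 itself.

```c
#include <assert.h>
#include <stdio.h>

/* Write 'value' to <cgroup_dir>/<knob>; returns 0 on success, -1 on error. */
static int cg_write(const char *cgroup_dir, const char *knob, const char *value)
{
	char path[4096];
	FILE *f;
	int n, ok;

	assert(cgroup_dir && knob && value);
	n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, knob);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)	/* write errors may only surface on close */
		ok = 0;
	return ok ? 0 : -1;
}

/* Read <cgroup_dir>/<knob> as an unsigned long; returns 0 or -1. */
static int cg_read_ul(const char *cgroup_dir, const char *knob,
		      unsigned long *out)
{
	char path[4096];
	FILE *f;
	int n, ok;

	assert(cgroup_dir && knob && out);
	n = snprintf(path, sizeof(path), "%s/%s", cgroup_dir, knob);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%lu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Set memory.high to 'bytes' and report whether the group fits.
 * Returns 1 if memory.current <= bytes, 0 if it is still above,
 * and -1 on I/O error. Nothing is killed here; what to do about a
 * group that stays above the target is left to the caller.
 */
static int cg_throttle(const char *cgroup_dir, unsigned long bytes)
{
	char buf[32];
	unsigned long usage;

	snprintf(buf, sizeof(buf), "%lu", bytes);
	if (cg_write(cgroup_dir, "memory.high", buf) < 0)
		return -1;
	if (cg_read_ul(cgroup_dir, "memory.current", &usage) < 0)
		return -1;
	return usage <= bytes;
}
```

A caller could poll cg_throttle() for a while and, if the group stays above the target, pick its own processes to kill or restart the container, writing memory.max only once it is prepared for OOM kills.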
On Wed, Mar 16, 2016 at 01:13:29PM -0700, Johannes Weiner wrote:
> On Wed, Mar 16, 2016 at 06:15:09PM +0300, Vladimir Davydov wrote:
> > On Tue, Mar 15, 2016 at 10:18:48PM -0700, Johannes Weiner wrote:
> > > On Fri, Mar 11, 2016 at 12:19:31PM +0300, Vladimir Davydov wrote:
> > ...
> > > > Come to think of it, shouldn't we restore the old limit and return EBUSY
> > > > if we failed to reclaim enough memory?
> > >
> > > I suspect it's very rare that it would fail. But even in that case
> > > it's probably better to at least not allow new charges past what the
> > > user requested, even if we can't push the level back far enough.
> >
> > It's of course good to set the limit before trying to reclaim memory,
> > but isn't it strange that even if the cgroup's memory can't be reclaimed
> > to meet the new limit (tmpfs files or tasks protected from oom), the
> > write will still succeed? It's a rare use case, but still.
>
> It's not optimal, but there is nothing we can do about it, is there? I
> don't want to go back to the racy semantics that allow the application
> to balloon up again after the limit restriction fails.
>
> > I've one more concern regarding this patch. It's about calling OOM while
> > reclaiming cgroup memory. AFAIU OOM killer can be quite disruptive for a
> > workload, so is it really good to call it when normal reclaim fails?
> >
> > W/o OOM killer you can optimistically try to adjust memory.max and if it
> > fails you can manually kill some processes in the container or restart
> > it or cancel the limit update. With your patch adjusting memory.max
> > never fails, but OOM might kill vital processes rendering the whole
> > container useless. Wouldn't it be better to let the user decide if
> > processes should be killed or not rather than calling OOM forcefully?
>
> Those are the memory.max semantics, though. Why should there be a
> difference between the container growing beyond the limit and the
> limit cutting into the container?
>
> If you don't want OOM kills, set memory.high instead. This way you get
> the memory pressure *and* the chance to do your own killing.
Fair enough.
Thanks,
Vladimir