2019-10-23 00:21:28

by Johannes Weiner

Subject: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

While upgrading from 4.16 to 5.2, we noticed these allocation errors
in the log of the new kernel:

[ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
[ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
[ 8642.293009] node 0: slabs: 5, objs: 170, free: 0

slab_out_of_memory+1
___slab_alloc+969
__slab_alloc+14
kmem_cache_alloc+346
inet_twsk_alloc+60
tcp_time_wait+46
tcp_fin+206
tcp_data_queue+2034
tcp_rcv_state_process+784
tcp_v6_do_rcv+405
__release_sock+118
tcp_close+385
inet_release+46
__sock_release+55
sock_close+17
__fput+170
task_work_run+127
exit_to_usermode_loop+191
do_syscall_64+212
entry_SYSCALL_64_after_hwframe+68

accompanied by an increase in machines going completely radio silent
under memory pressure.

One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
sock objects to kmemcg"), which made these slab caches subject to
cgroup memory accounting and control.
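
For context, that commit boils down to tagging the socket slab caches
with SLAB_ACCOUNT, so that their objects are charged to the memcg of
the allocating task. A rough sketch of the idea - not the literal hunk
from e699e2c6a654, and with the variable names simplified:

	/*
	 * With SLAB_ACCOUNT set, objects allocated from this cache
	 * (e.g. tw_sock_TCPv6 above) are charged to the allocating
	 * task's memcg and subject to its limit.
	 */
	twsk_slab = kmem_cache_create(twsk_slab_name, twsk_obj_size, 0,
				      SLAB_ACCOUNT | prot->slab_flags,
				      NULL);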

The problem with that is that cgroups, unlike the page allocator, do
not maintain dedicated atomic reserves. As a cgroup's usage hovers at
its limit, atomic allocations - such as those done during network rx -
can fail consistently for extended periods of time. The kernel is not
able to operate under these conditions.
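
To illustrate the failure mode: at the limit, the charge path cannot
reclaim on behalf of a request that must not block, so the charge -
and with it the slab allocation - fails outright. Simplified from the
logic in try_charge(), not the exact code:

	/*
	 * A request that cannot enter direct reclaim has no way of
	 * making room under the limit; GFP_ATOMIC ends up here.
	 */
	if (!gfpflags_allow_blocking(gfp_mask))
		goto nomem;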

We don't want to revert the culprit patch, because it indeed tracks a
potentially substantial amount of memory used by a cgroup.

We also don't want to implement dedicated atomic reserves for cgroups.
There is no point in keeping a fixed margin of unused bytes in the
cgroup's memory budget to accommodate a consumer that is impossible to
predict - we'd be wasting memory and getting into configuration
headaches, not unlike what we have going with min_free_kbytes. We do
this for physical memory because we have to, but cgroups are an
accounting game.

Instead, account these privileged allocations to the cgroup, but let
them bypass the configured limit if they have to. This way, we get the
benefits of accounting the consumed memory and have it exert pressure
on the rest of the cgroup, but like with the page allocator, we shift
the burden of reclaiming on behalf of atomic allocations onto the
regular allocations that can block.

Cc: [email protected] # 4.18+
Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8090b4c99ac7..c7e3e758c165 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2528,6 +2528,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		goto retry;
 	}
 
+	/*
+	 * Memcg doesn't have a dedicated reserve for atomic
+	 * allocations. But like the global atomic pool, we need to
+	 * put the burden of reclaim on regular allocation requests
+	 * and let these go through as privileged allocations.
+	 */
+	if (gfp_mask & __GFP_ATOMIC)
+		goto force;
+
 	/*
 	 * Unlike in global OOM situations, memcg is not in a physical
 	 * memory shortage. Allow dying and OOM-killed tasks to
--
2.23.0


2019-10-23 00:32:05

by Shakeel Butt

Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

+Suleiman Souhlal

On Tue, Oct 22, 2019 at 4:37 PM Johannes Weiner <[email protected]> wrote:
>
> While upgrading from 4.16 to 5.2, we noticed these allocation errors
> in the log of the new kernel:
>
> [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> [ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
> [ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
>
> slab_out_of_memory+1
> ___slab_alloc+969
> __slab_alloc+14
> kmem_cache_alloc+346
> inet_twsk_alloc+60
> tcp_time_wait+46
> tcp_fin+206
> tcp_data_queue+2034
> tcp_rcv_state_process+784
> tcp_v6_do_rcv+405
> __release_sock+118
> tcp_close+385
> inet_release+46
> __sock_release+55
> sock_close+17
> __fput+170
> task_work_run+127
> exit_to_usermode_loop+191
> do_syscall_64+212
> entry_SYSCALL_64_after_hwframe+68
>
> accompanied by an increase in machines going completely radio silent
> under memory pressure.
>
> One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> sock objects to kmemcg"), which made these slab caches subject to
> cgroup memory accounting and control.
>
> The problem with that is that cgroups, unlike the page allocator, do
> not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> its limit, atomic allocations - such as those done during network rx -
> can fail consistently for extended periods of time. The kernel is not
> able to operate under these conditions.
>
> We don't want to revert the culprit patch, because it indeed tracks a
> potentially substantial amount of memory used by a cgroup.
>
> We also don't want to implement dedicated atomic reserves for cgroups.
> There is no point in keeping a fixed margin of unused bytes in the
> cgroup's memory budget to accommodate a consumer that is impossible to
> predict - we'd be wasting memory and getting into configuration
> headaches, not unlike what we have going with min_free_kbytes. We do
> this for physical memory because we have to, but cgroups are an
> accounting game.
>
> Instead, account these privileged allocations to the cgroup, but let
> them bypass the configured limit if they have to. This way, we get the
> benefits of accounting the consumed memory and have it exert pressure
> on the rest of the cgroup, but like with the page allocator, we shift
> the burden of reclaiming on behalf of atomic allocations onto the
> regular allocations that can block.
>
> Cc: [email protected] # 4.18+
> Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

> ---
> mm/memcontrol.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8090b4c99ac7..c7e3e758c165 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2528,6 +2528,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>
> +	/*
> +	 * Memcg doesn't have a dedicated reserve for atomic
> +	 * allocations. But like the global atomic pool, we need to
> +	 * put the burden of reclaim on regular allocation requests
> +	 * and let these go through as privileged allocations.
> +	 */
> +	if (gfp_mask & __GFP_ATOMIC)
> +		goto force;
> +

Actually we (Google) already have a similar internal patch that checks
for __GFP_HIGH and then goes for force charging, with similar
reasoning.
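
Presumably something along these lines - a sketch only, since that
internal patch isn't shown here. Note that GFP_ATOMIC includes
__GFP_HIGH, so such a check would also cover the atomic case above:

	/* Sketch: let high-priority requests overcharge rather than fail. */
	if (gfp_mask & __GFP_HIGH)
		goto force;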

>  	/*
>  	 * Unlike in global OOM situations, memcg is not in a physical
>  	 * memory shortage. Allow dying and OOM-killed tasks to
> --
> 2.23.0
>

2019-10-23 06:44:04

by Michal Hocko

Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> While upgrading from 4.16 to 5.2, we noticed these allocation errors
> in the log of the new kernel:
>
> [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> [ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
> [ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
>
> slab_out_of_memory+1
> ___slab_alloc+969
> __slab_alloc+14
> kmem_cache_alloc+346
> inet_twsk_alloc+60
> tcp_time_wait+46
> tcp_fin+206
> tcp_data_queue+2034
> tcp_rcv_state_process+784
> tcp_v6_do_rcv+405
> __release_sock+118
> tcp_close+385
> inet_release+46
> __sock_release+55
> sock_close+17
> __fput+170
> task_work_run+127
> exit_to_usermode_loop+191
> do_syscall_64+212
> entry_SYSCALL_64_after_hwframe+68
>
> accompanied by an increase in machines going completely radio silent
> under memory pressure.

This is really worrying, because it suggests that something depends on
GFP_ATOMIC allocations succeeding, which is fragile and broken.

> One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> sock objects to kmemcg"), which made these slab caches subject to
> cgroup memory accounting and control.
>
> The problem with that is that cgroups, unlike the page allocator, do
> not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> its limit, atomic allocations - such as those done during network rx -
> can fail consistently for extended periods of time. The kernel is not
> able to operate under these conditions.
>
> We don't want to revert the culprit patch, because it indeed tracks a
> potentially substantial amount of memory used by a cgroup.
>
> We also don't want to implement dedicated atomic reserves for cgroups.
> There is no point in keeping a fixed margin of unused bytes in the
> cgroup's memory budget to accommodate a consumer that is impossible to
> predict - we'd be wasting memory and getting into configuration
> headaches, not unlike what we have going with min_free_kbytes. We do
> this for physical memory because we have to, but cgroups are an
> accounting game.
>
> Instead, account these privileged allocations to the cgroup, but let
> them bypass the configured limit if they have to. This way, we get the
> benefits of accounting the consumed memory and have it exert pressure
> on the rest of the cgroup, but like with the page allocator, we shift
> the burden of reclaiming on behalf of atomic allocations onto the
> regular allocations that can block.

On the other hand, this would allow breaking the isolation by an
unpredictable amount. Should we put a simple cap on how much we can go
over the limit? If the memcg limit reclaim is not able to keep up with
those overflows, then even __GFP_ATOMIC allocations have to fail. What
do you think?
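
One possible shape for such a cap - a sketch only, with the margin and
its exact placement in try_charge() made up for illustration:

#define ATOMIC_OVERCHARGE_MAX	1024	/* pages; arbitrary margin */

	/*
	 * Hypothetical: let atomic requests overcharge up to a fixed
	 * margin above the limit; past that, fail even __GFP_ATOMIC
	 * so a runaway cgroup stays contained.
	 */
	if (gfp_mask & __GFP_ATOMIC) {
		if (page_counter_read(&memcg->memory) <
		    memcg->memory.max + ATOMIC_OVERCHARGE_MAX)
			goto force;
		goto nomem;
	}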

> Cc: [email protected] # 4.18+
> Fixes: e699e2c6a654 ("net, mm: account sock objects to kmemcg")
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8090b4c99ac7..c7e3e758c165 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2528,6 +2528,15 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		goto retry;
>  	}
>
> +	/*
> +	 * Memcg doesn't have a dedicated reserve for atomic
> +	 * allocations. But like the global atomic pool, we need to
> +	 * put the burden of reclaim on regular allocation requests
> +	 * and let these go through as privileged allocations.
> +	 */
> +	if (gfp_mask & __GFP_ATOMIC)
> +		goto force;
> +
>  	/*
>  	 * Unlike in global OOM situations, memcg is not in a physical
>  	 * memory shortage. Allow dying and OOM-killed tasks to
> --
> 2.23.0
>

--
Michal Hocko
SUSE Labs

2019-10-24 06:02:56

by Johannes Weiner

Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:
> On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> > While upgrading from 4.16 to 5.2, we noticed these allocation errors
> > in the log of the new kernel:
> >
> > [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> > [ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
> > [ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
> >
> > slab_out_of_memory+1
> > ___slab_alloc+969
> > __slab_alloc+14
> > kmem_cache_alloc+346
> > inet_twsk_alloc+60
> > tcp_time_wait+46
> > tcp_fin+206
> > tcp_data_queue+2034
> > tcp_rcv_state_process+784
> > tcp_v6_do_rcv+405
> > __release_sock+118
> > tcp_close+385
> > inet_release+46
> > __sock_release+55
> > sock_close+17
> > __fput+170
> > task_work_run+127
> > exit_to_usermode_loop+191
> > do_syscall_64+212
> > entry_SYSCALL_64_after_hwframe+68
> >
> > accompanied by an increase in machines going completely radio silent
> > under memory pressure.
>
> This is really worrying, because it suggests that something depends on
> GFP_ATOMIC allocations succeeding, which is fragile and broken.

I don't think that is true. You cannot rely on a *single instance* of
an atomic allocation succeeding. But you have to be able to rely on
failure being temporary, and on there being a chance of succeeding
eventually.

Network is a good example. It retries transmits, but within reason. If
you aren't able to process incoming packets for minutes, you might as
well be dead.

> > One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> > sock objects to kmemcg"), which made these slab caches subject to
> > cgroup memory accounting and control.
> >
> > The problem with that is that cgroups, unlike the page allocator, do
> > not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> > its limit, atomic allocations - such as those done during network rx -
> > can fail consistently for extended periods of time. The kernel is not
> > able to operate under these conditions.
> >
> > We don't want to revert the culprit patch, because it indeed tracks a
> > potentially substantial amount of memory used by a cgroup.
> >
> > We also don't want to implement dedicated atomic reserves for cgroups.
> > There is no point in keeping a fixed margin of unused bytes in the
> > cgroup's memory budget to accommodate a consumer that is impossible to
> > predict - we'd be wasting memory and getting into configuration
> > headaches, not unlike what we have going with min_free_kbytes. We do
> > this for physical memory because we have to, but cgroups are an
> > accounting game.
> >
> > Instead, account these privileged allocations to the cgroup, but let
> > them bypass the configured limit if they have to. This way, we get the
> > benefits of accounting the consumed memory and have it exert pressure
> > on the rest of the cgroup, but like with the page allocator, we shift
> > the burden of reclaiming on behalf of atomic allocations onto the
> > regular allocations that can block.
>
> On the other hand, this would allow breaking the isolation by an
> unpredictable amount. Should we put a simple cap on how much we can go
> over the limit? If the memcg limit reclaim is not able to keep up with
> those overflows, then even __GFP_ATOMIC allocations have to fail. What
> do you think?

I don't expect a big overrun in practice, and it appears that Google
has been letting even NOWAIT allocations pass through without
isolation issues. Likewise, we have been force-charging the skmem for
a while now and it hasn't been an issue for reclaim to keep up.
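
The skmem force-charging works by charging first and, on failure,
overcharging anyway while reporting the failure, so the socket layer
can back off - roughly, from the cgroup2 path of
mem_cgroup_charge_skmem():

	if (try_charge(memcg, gfp_mask, nr_pages) == 0)
		return true;	/* fit under the limit */

	/* overcharge, but tell the caller we went over the limit */
	try_charge(memcg, gfp_mask | __GFP_NOFAIL, nr_pages);
	return false;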

My experience from production is that it's a whole lot easier to debug
something like a memory.max overrun than it is to debug a machine that
won't respond to networking. So that's the side I would err on.

2019-10-24 08:51:03

by Shakeel Butt

Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

On Wed, Oct 23, 2019 at 8:46 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:
> > On Tue 22-10-19 19:37:08, Johannes Weiner wrote:
> > > While upgrading from 4.16 to 5.2, we noticed these allocation errors
> > > in the log of the new kernel:
> > >
> > > [ 8642.253395] SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
> > > [ 8642.269170] cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
> > > [ 8642.293009] node 0: slabs: 5, objs: 170, free: 0
> > >
> > > slab_out_of_memory+1
> > > ___slab_alloc+969
> > > __slab_alloc+14
> > > kmem_cache_alloc+346
> > > inet_twsk_alloc+60
> > > tcp_time_wait+46
> > > tcp_fin+206
> > > tcp_data_queue+2034
> > > tcp_rcv_state_process+784
> > > tcp_v6_do_rcv+405
> > > __release_sock+118
> > > tcp_close+385
> > > inet_release+46
> > > __sock_release+55
> > > sock_close+17
> > > __fput+170
> > > task_work_run+127
> > > exit_to_usermode_loop+191
> > > do_syscall_64+212
> > > entry_SYSCALL_64_after_hwframe+68
> > >
> > > accompanied by an increase in machines going completely radio silent
> > > under memory pressure.
> >
> > This is really worrying, because it suggests that something depends on
> > GFP_ATOMIC allocations succeeding, which is fragile and broken.
>
> I don't think that is true. You cannot rely on a *single instance* of
> an atomic allocation succeeding. But you have to be able to rely on
> failure being temporary, and on there being a chance of succeeding
> eventually.
>
> Network is a good example. It retries transmits, but within reason. If
> you aren't able to process incoming packets for minutes, you might as
> well be dead.
>
> > > One thing that changed since 4.16 is e699e2c6a654 ("net, mm: account
> > > sock objects to kmemcg"), which made these slab caches subject to
> > > cgroup memory accounting and control.
> > >
> > > The problem with that is that cgroups, unlike the page allocator, do
> > > not maintain dedicated atomic reserves. As a cgroup's usage hovers at
> > > its limit, atomic allocations - such as those done during network rx -
> > > can fail consistently for extended periods of time. The kernel is not
> > > able to operate under these conditions.
> > >
> > > We don't want to revert the culprit patch, because it indeed tracks a
> > > potentially substantial amount of memory used by a cgroup.
> > >
> > > We also don't want to implement dedicated atomic reserves for cgroups.
> > > There is no point in keeping a fixed margin of unused bytes in the
> > > cgroup's memory budget to accommodate a consumer that is impossible to
> > > predict - we'd be wasting memory and getting into configuration
> > > headaches, not unlike what we have going with min_free_kbytes. We do
> > > this for physical memory because we have to, but cgroups are an
> > > accounting game.
> > >
> > > Instead, account these privileged allocations to the cgroup, but let
> > > them bypass the configured limit if they have to. This way, we get the
> > > benefits of accounting the consumed memory and have it exert pressure
> > > on the rest of the cgroup, but like with the page allocator, we shift
> > > the burden of reclaiming on behalf of atomic allocations onto the
> > > regular allocations that can block.
> >
> > On the other hand, this would allow breaking the isolation by an
> > unpredictable amount. Should we put a simple cap on how much we can go
> > over the limit? If the memcg limit reclaim is not able to keep up with
> > those overflows, then even __GFP_ATOMIC allocations have to fail. What
> > do you think?
>
> I don't expect a big overrun in practice, and it appears that Google
> has been letting even NOWAIT allocations pass through without
> isolation issues.

We have been overcharging for __GFP_HIGH allocations for a couple of
years and have seen no isolation issues in production.

> Likewise, we have been force-charging the skmem for
> a while now and it hasn't been an issue for reclaim to keep up.
>
> My experience from production is that it's a whole lot easier to debug
> something like a memory.max overrun than it is to debug a machine that
> won't respond to networking. So that's the side I would err on.

2019-10-25 01:31:03

by Michal Hocko

Subject: Re: [PATCH] mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges

On Wed 23-10-19 10:38:36, Shakeel Butt wrote:
> On Wed, Oct 23, 2019 at 8:46 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Wed, Oct 23, 2019 at 08:40:12AM +0200, Michal Hocko wrote:
[...]
> > > On the other hand, this would allow breaking the isolation by an
> > > unpredictable amount. Should we put a simple cap on how much we can go
> > > over the limit? If the memcg limit reclaim is not able to keep up with
> > > those overflows, then even __GFP_ATOMIC allocations have to fail. What
> > > do you think?
> >
> > I don't expect a big overrun in practice, and it appears that Google
> > has been letting even NOWAIT allocations pass through without
> > isolation issues.
>
> We have been overcharging for __GFP_HIGH allocations for a couple of
> years and have seen no isolation issues in production.
>
> > Likewise, we have been force-charging the skmem for
> > a while now and it hasn't been an issue for reclaim to keep up.
> >
> > My experience from production is that it's a whole lot easier to debug
> > something like a memory.max overrun than it is to debug a machine that
> > won't respond to networking. So that's the side I would err on.

It is definitely good to hear that your production systems are working
well. I was not really worried about normal workloads but rather about
malicious kinds of (work)loads, where memcg is used to contain
potentially untrusted entities. That's where an unbounded escape of
atomic charges would be a much bigger deal. Maybe this is not the case
now, because we do not have that many accounted __GFP_ATOMIC requests
(I have tried to audit this, but gave up very shortly afterwards,
because few callers use __GFP_ACCOUNT directly and the rest are likely
hidden behind SLAB_ACCOUNT). But I do not really like that uncertainty.
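
To spell out what the audit involves: an allocation becomes
memcg-accounted either explicitly at the call site or implicitly
through its slab cache - the latter is what makes grepping for
__GFP_ACCOUNT insufficient. The cache name here is illustrative:

	/* explicit, per call site: */
	ptr = kmalloc(size, GFP_ATOMIC | __GFP_ACCOUNT);

	/* implicit, for every object of an accounted cache: */
	cache = kmem_cache_create("foo", size, 0, SLAB_ACCOUNT, NULL);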

If you have a really strong opinion against an explicit limit, then I
would like to see at least some warning in the kernel log, so that we
learn when a workload hits this pathological path and can act upon
that. Does that sound like something you would agree to?

E.g. something like

diff --git a/mm/page_counter.c b/mm/page_counter.c
index de31470655f6..e6999f6cf79e 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -62,6 +62,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
 	WARN_ON_ONCE(new < 0);
 }
 
+#define SAFE_OVERFLOW 1024
+
 /**
  * page_counter_charge - hierarchically charge pages
  * @counter: counter
@@ -82,8 +84,14 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
 		 */
-		if (new > c->watermark)
+		if (new > c->watermark) {
 			c->watermark = new;
+			if (new > c->max + SAFE_OVERFLOW) {
+				pr_warn("Max limit %lu breached, usage:%lu. Please report.\n",
+					c->max, atomic_long_read(&c->usage));
+				dump_stack();
+			}
+		}
 	}
 }

--
Michal Hocko
SUSE Labs