2019-04-10 10:29:28

by Juha-Matti Tilli

Subject: [PATCH] net: add big honking pfmemalloc OOM warning

A system administrator is not notified (except via an obscure SNMP
counter that most sysadmins don't know to look for) if packets are
dropped due to an out-of-memory condition when SKBs use pfmemalloc
reserves. This can, for example, lead to NFS connections hanging on
high-volume systems.

Implement a ratelimited big honking out-of-memory warning that directs
the sysadmin to bump up vm.min_free_kbytes in case this problem happens.
Our experience shows that with the default vm.min_free_kbytes (90112, or
about 90 megabytes), NFS connections hang approximately once per day,
whereas with 901120 (the default multiplied by 10) NFS connections never
hang.

Signed-off-by: Juha-Matti Tilli <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Rafael Aquini <[email protected]>
Cc: Murphy Zhou <[email protected]>
Cc: Yongcheng Yang <[email protected]>
Cc: Jianhong Yin <[email protected]>
---
net/core/filter.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index fc92ebc4e200..7d8ef239b9af 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -99,6 +99,8 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
* helping free memory
*/
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
+ printk_ratelimited(KERN_WARNING
+ "dropped packet due to out-of-memory condition, please bump up vm.min_free_kbytes\n");
NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
return -ENOMEM;
}
--
2.17.1


2019-04-10 15:02:39

by Juha-Matti Tilli

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <[email protected]> wrote:
> If NFS sessions hang, then there is a bug to eventually root cause and fix.
>
> Just telling the user "increase the limit" is the same thing as admitting:
>
> Our limit system or TCP or NFS stacks are broken and unable to
> recover, so let's disable the limit system and work around a more
> serious bug.
>
> Maybe the bug is in a NIC driver; please share more details before
> adding yet another noisy signal in syslog.
>
> SNMP counters are per netns, and more useful in the modern computing
> era, where a host is shared by many different containers.

Any idea where the bug might be?

It can't be in NFS, because I have observed the issue to be a TCP-level
issue: NFS would work just fine if TCP worked, but the underlying TCP
connection does not work unless we bump up vm.min_free_kbytes.

It could be in ixgbe, because the incoming SKBs get pfmemalloc pages
for some reason, and that happens repeatedly, for a duration of 5-10
minutes, on every single retransmit until the condition clears. Ping
works just fine while the NFS connection is stuck; I think these
63-queue NICs use a different queue for ping than for the TCP NFS
connection. There is some code in ixgbe for not reusing pfmemalloc
pages, but every packet in the queue used for TCP NFS nevertheless
gets a pfmemalloc page. Might the cause be that ixgbe allocates pages
in large batches, so it picks up multiple pfmemalloc pages at once,
and every packet is then dropped until all of those pfmemalloc pages
are consumed (since they are not reused)?

It could also be in the default value of vm.min_free_kbytes, but I'm
not experienced enough in Linux kernel internals to adjust the complex
calculations. Just saying that 90 MB sounds ridiculously low on a 256
GB NUMA machine.

Are you of the opinion that Intel, as the developer of ixgbe, should be informed?

Anyway, I posted more details to the mailing lists about a week ago,
search for "NFS hang, sk_drops increasing, segs_in not, pfmemalloc
suspected" in the mailing lists, or click this direct link:
https://lkml.org/lkml/2019/4/3/682

The current situation is that we've been running the production system
for 2 weeks with a bumped-up vm.min_free_kbytes and no NFS hangs,
whereas before the bump we had approximately one hang per day, so
without the bump that 2-week period would have seen approximately 14
NFS hangs.

To me, this OOM condition seems to be global, so having the counter
per-netns offers no clear benefit. Or is vm.min_free_kbytes tunable
per container?

BR, Juha-Matti

2019-04-10 17:04:18

by Eric Dumazet

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 8:01 AM Juha-Matti Tilli
<[email protected]> wrote:
>
> On Wed, Apr 10, 2019 at 5:16 PM Eric Dumazet <[email protected]> wrote:
> > If NFS sessions hang, then there is a bug to eventually root cause and fix.
> >
> > Just telling the user "increase the limit" is the same thing as admitting:
> >
> > Our limit system or TCP or NFS stacks are broken and unable to
> > recover, so let's disable the limit system and work around a more
> > serious bug.
> >
> > Maybe the bug is in a NIC driver; please share more details before
> > adding yet another noisy signal in syslog.
> >
> > SNMP counters are per netns, and more useful in the modern computing
> > era, where a host is shared by many different containers.
>
> Any idea where the bug might be?
>

Before diving into the details, can we first double-check which exact
kernel version you are using?

In the past some pfmemalloc bugs have been fixed; I do not want to
spend time finding them a second time.




2019-04-10 17:38:49

by Eric Dumazet

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 3:20 AM Juha-Matti Tilli
<[email protected]> wrote:
>
> A system administrator is not notified (except via an obscure SNMP
> counter that most sysadmins don't know to look for) if packets are
> dropped due to an out-of-memory condition when SKBs use pfmemalloc
> reserves. This can, for example, lead to NFS connections hanging on
> high-volume systems.
>
> Implement a ratelimited big honking out-of-memory warning that directs
> the sysadmin to bump up vm.min_free_kbytes in case this problem happens.
> Our experience shows that with the default vm.min_free_kbytes (90112, or
> about 90 megabytes), NFS connections hang approximately once per day,
> whereas with 901120 (the default multiplied by 10) NFS connections never
> hang.
>

If NFS sessions hang, then there is a bug to eventually root cause and fix.

Just telling the user "increase the limit" is the same thing as admitting:

Our limit system or TCP or NFS stacks are broken and unable to
recover, so let's disable the limit system and work around a more
serious bug.

Maybe the bug is in a NIC driver; please share more details before
adding yet another noisy signal in syslog.

SNMP counters are per netns, and more useful in the modern computing
era, where a host is shared by many different containers.




2019-04-10 19:12:41

by David Miller

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

From: Eric Dumazet <[email protected]>
Date: Wed, 10 Apr 2019 07:16:21 -0700

> SNMP counters are per netns, and more useful in the modern computing
> era, where a host is shared by many different containers.

+1 There is no way I am applying this patch.

The kernel should not "big honking" anything in the logs.

2019-04-11 06:29:15

by Juha-Matti Tilli

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 6:36 PM Eric Dumazet <[email protected]> wrote:
> Before diving into the details, can we first double-check which exact
> kernel version you are using?
>
> In the past some pfmemalloc bugs have been fixed; I do not want to
> spend time finding them a second time.

Of course I can tell the kernel version.

The NFS server is 3.10.0-862.14.4.el7.foreca.x86_64, where "foreca"
means we've applied a custom Samsung SSD TRIM patch; it is otherwise
identical to the regular CentOS / RHEL kernel.

The NFS client is 3.10.0-957.5.1.el7.x86_64.

The pfmemalloc problem happens at the server, not at the client.

The exact patches in this kernel may be a bit complicated to
determine, because RHEL apparently likes to maintain their own kernel
version. They do not even have the SNMP counter, which is why I opened
a bug report in RHEL Bugzilla. I have extracted the source code, so if
you can give me a list of git commit SHA-1 hashes that could fix the
issue, I can check manually whether they are present, and open RHEL
Bugzilla reports to have them backported if they are not.

I at least see this in the exact kernel version I'm using:

static inline bool ixgbe_page_is_reserved(struct page *page)
{
	return (page_to_nid(page) != numa_mem_id()) ||
	       page_is_pfmemalloc(page);
}

...so ixgbe shouldn't be reusing pfmemalloc pages.

BR, Juha-Matti

2019-04-11 06:52:53

by Juha-Matti Tilli

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 10:11 PM David Miller <[email protected]> wrote:
> > SNMP counters are per netns, and more useful in the modern computing
> > era, where a host is shared by many different containers.
>
> +1 There is no way I am applying this patch.
>
> The kernel should not "big honking" anything in the logs.

Just to check: is the opposition to the patch related to the
expectation that it would log the condition too often despite the rate
limit if many packets are dropped? If so, that might be possible to
fix.

I think it might be possible to check the SNMP counter value and, if
it is zero, log the first instance of a pfmemalloc drop, then omit
logging afterwards. There could be race conditions, so in the absolute
worst case you could have, say, 2 or 3 of these log lines instead of
1, but I don't see that as an issue: 99% of the time there would be
just one, and 2 or 3 lines won't fill the logs.

In our case, the existence of such a log message and the helpful
suggestion to bump up vm.min_free_kbytes would have saved us
approximately one month of debugging (or 2-3 weeks if the SNMP counter
had been present in this kernel version). Even one such log message
would have been enough. Our production systems were hanging daily
while this debugging was going on.

In my opinion, the ideal count of pfmemalloc drops is exactly 0, and
the interesting event is the first instance of a pfmemalloc drop
occurring.

If there's a bug in the kernel, I think the user should be notified,
so I see this as similar to a WARN_ON line -- which is an even more
"big honking" log event because it's associated with a backtrace.

BR, Juha-Matti

2019-04-11 10:29:25

by Eric Dumazet

Subject: Re: [PATCH] net: add big honking pfmemalloc OOM warning

On Wed, Apr 10, 2019 at 11:26 PM Juha-Matti Tilli
<[email protected]> wrote:
>
> On Wed, Apr 10, 2019 at 6:36 PM Eric Dumazet <[email protected]> wrote:
> > Before diving into the details, can we first double-check which exact
> > kernel version you are using?
> >
> > In the past some pfmemalloc bugs have been fixed; I do not want to
> > spend time finding them a second time.
>
> Of course I can tell the kernel version.
>
> The NFS server is 3.10.0-862.14.4.el7.foreca.x86_64, where "foreca"
> means we've applied a custom Samsung SSD TRIM patch; it is otherwise
> identical to the regular CentOS / RHEL kernel.
>
> The NFS client is 3.10.0-957.5.1.el7.x86_64.
>
> The pfmemalloc problem happens at the server, not at the client.
>
> The exact patches in this kernel may be a bit complicated to
> determine, because RHEL apparently likes to maintain their own kernel
> version. They do not even have the SNMP counter, which is why I opened
> a bug report in RHEL Bugzilla. I have extracted the source code, so if
> you can give me a list of git commit SHA-1 hashes that could fix the
> issue, I can check manually whether they are present, and open RHEL
> Bugzilla reports to have them backported if they are not.
>
> I at least see this in the exact kernel version I'm using:

Here (netdev/lkml) we deal with current kernels.

Chances are very high that the bug(s) you are fighting have already
been fixed.

Please reproduce the bug using the latest David Miller net tree;
otherwise we will _not_ accept any patch.

Thank you
