2018-10-26 19:27:30

by Oleksandr Natalenko

Subject: CAKE and r8169 cause panic on upload in v4.19

Hello.

I was excited that v4.19 introduced CAKE, so I deployed it on my home
router.

I used this script of mine [1]:

# bufferbloat enp3s0.100 20 20

to do its job on the VLAN interface, where the 20/20 ISP link is switched
from the home switch. Basically, it just follows [2] with a simple
bandwidth restriction and egress mirroring using ifb.

Then I thought it would be nice to run speedtest-cli on one of the
computers in the home LAN connected to this router. The download stage
went fine, but immediately after the upload started I got a panic on the
router: [3] (sorry, it is a photo; netconsole didn't work because, I
assume, the panic happened in the networking code). I rebooted the
router and tried once more, and got the same result, again during the
upload stage. Then I rebooted again, replaced the CAKE script with my
former HTB script, and after running speedtest-cli a couple of times
there was no panic.

Before running speedtest-cli I had been using CAKE for a couple of days
just fine, without generating much traffic. It seems to crash only when
lots of traffic is generated with tools like this.

My sysctl: [4] and ethtool -k: [5]

So far, I've found something similar only here: [6] [7]. The common
factor is the r8169 driver, so maybe it is a driver issue, and CAKE
just happens to reveal it.

If it is something known, please point me to a possible fix. If it is
something new, I'm happy to provide more info on request, try patches,
etc. (as usual).

Thanks.

--
Oleksandr Natalenko (post-factum)

[1] https://gist.github.com/4b27c49a7f9b4d775e2e38ba23d3f13c
[2] https://www.bufferbloat.net/projects/codel/wiki/Cake
[3] https://bit.ly/2SlUl7R
[4] https://gist.github.com/pfactum/bdad2594b151578f460857cacd94c689
[5] https://gist.github.com/pfactum/cad2cc5d1512b31fbc76d821b3e63dbf
[6] https://boards.4chan.org/g/thread/68171835#p68188019
[7] https://i.4cdn.org/g/1540307271879.jpg


2018-10-26 20:22:44

by Heiner Kallweit

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

On 26.10.2018 21:26, Oleksandr Natalenko wrote:
> Hello.
>
> I was excited regarding the fact that v4.19 introduced CAKE, so I've deployed it on my home router.
>
> I used this script of mine [1]:
>
> # bufferbloat enp3s0.100 20 20
>
> to do its job on the VLAN interface, where 20/20 ISP link is switched from the home switch. Basically, it just follows [2] with simple bandwidth restriction and egress mirroring using ifb.
>
> Then I thought it would be nice to run speedtest-cli on one of the computer in the home LAN, connected to this router. Download stage went fine, but immediately after upload started I've got a panic on the router: [3] (sorry, it is a photo, netconsole didn't work because, I assume, the panic happened in the networking code). I rebooted the router and tried once more, and got the same result, again during upload stage. Then I rebooted again, replaced CAKE script with my former HTB script, and after running speedtest-cli a couple of times there's no panic.
>
> Before running speedtest-cli I was using CAKE for a couple of days without generating much traffic just fine. It seems it crashes only if lots of traffic is generated with tools like this.
>
> My sysctl: [4] and ethtool -k: [5]
>
> So far, I've found something similar only here: [6] [7]. The common thing is r8169 driver in use, so, maybe, it is a driver issue, and CAKE is just happy to reveal it.
>
> If it is something known, please point me to a possible fix. If it is something new, I'm open to provide more info on your request, try patches etc (as usual).
>
It seems to be the same problem as described here: https://bugzilla.kernel.org/show_bug.cgi?id=201063
As I commented in bugzilla, the GPF in dev_hard_start_xmit and the values of R12/R15 make me think
that a poisoned list pointer is being accessed. It's so deep in the network stack that I cannot really
imagine the network driver is to blame. One screenshot attached to the bug report shows that the
GPF also happened with the igb driver. Most likely we will only find out once somebody puts effort
into bisecting the issue.
d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") and some subsequent changes deal with
skb list processing, so maybe the issue is related to one of these changes.
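
For readers unfamiliar with list poisoning, here is a minimal sketch of how a register value from an oops could be checked against the list poison constants. The constants are an assumption: they are the values expected from include/linux/poison.h around v4.19 with the x86_64 default for CONFIG_ILLEGAL_POINTER_VALUE, and should be verified against the tree actually in use. classify_list_poison() is a hypothetical helper, not a kernel function. Such addresses are non-canonical on x86_64, which is why dereferencing one shows up as a general protection fault rather than an ordinary page fault.

```c
#include <stdint.h>

/* ASSUMPTION: poison constants as expected around v4.19 on x86_64
 * (include/linux/poison.h, default CONFIG_ILLEGAL_POINTER_VALUE).
 * Check them against your own kernel tree. */
#define POISON_POINTER_DELTA 0xdead000000000000ULL
#define LIST_POISON1 (POISON_POINTER_DELTA + 0x100ULL) /* written to ->next by list_del() */
#define LIST_POISON2 (POISON_POINTER_DELTA + 0x200ULL) /* written to ->prev by list_del() */

/* Hypothetical helper: classify a register value copied from an oops.
 * Returns 1 for LIST_POISON1, 2 for LIST_POISON2, 0 for anything else. */
static int classify_list_poison(uint64_t reg)
{
	if (reg == LIST_POISON1)
		return 1;
	if (reg == LIST_POISON2)
		return 2;
	return 0;
}
```

A value classified as 1 or 2 suggests that some code followed a pointer left behind by list_del() on an entry that had already been unlinked.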

> Thanks.
>

2018-10-26 20:26:32

by Dave Taht

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

On Fri, Oct 26, 2018 at 1:21 PM Heiner Kallweit <[email protected]> wrote:
>
> On 26.10.2018 21:26, Oleksandr Natalenko wrote:
> > Hello.
> >
> > I was excited regarding the fact that v4.19 introduced CAKE, so I've deployed it on my home router.
> >
> > I used this script of mine [1]:
> >
> > # bufferbloat enp3s0.100 20 20
> >
> > to do its job on the VLAN interface, where 20/20 ISP link is switched from the home switch. Basically, it just follows [2] with simple bandwidth restriction and egress mirroring using ifb.
> >
> > Then I thought it would be nice to run speedtest-cli on one of the computer in the home LAN, connected to this router. Download stage went fine, but immediately after upload started I've got a panic on the router: [3] (sorry, it is a photo, netconsole didn't work because, I assume, the panic happened in the networking code). I rebooted the router and tried once more, and got the same result, again during upload stage. Then I rebooted again, replaced CAKE script with my former HTB script, and after running speedtest-cli a couple of times there's no panic.
> >
> > Before running speedtest-cli I was using CAKE for a couple of days without generating much traffic just fine. It seems it crashes only if lots of traffic is generated with tools like this.
> >
> > My sysctl: [4] and ethtool -k: [5]
> >
> > So far, I've found something similar only here: [6] [7]. The common thing is r8169 driver in use, so, maybe, it is a driver issue, and CAKE is just happy to reveal it.
> >
> > If it is something known, please point me to a possible fix. If it is something new, I'm open to provide more info on your request, try patches etc (as usual).
> >
> It seems to be the same problem as described here: https://bugzilla.kernel.org/show_bug.cgi?id=201063
> As I commented in bugzilla, the GPF in dev_hard_start_xmit and the values of R12/R15 make me think
> that a poisoned list pointer is accessed. It's so deep in the network stack that I can not really
> imagine the network driver is to blame. One screenshot attached to the bug report shows that the
> GPF also happened with the igb driver. Most likely we find out only once somebody spends effort
> on bisecting the issue.
> d4546c2509b1 ("net: Convert GRO SKB handling to list_head.") and some subsequent changes deal with
> skb list processing, maybe the issue is related to one of these changes.

Can you repeat your test, disabling gro splitting in cake?

the option is "no-split-gso"

>
> > Thanks.
> >



--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

2018-10-26 20:54:57

by Oleksandr Natalenko

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

Hi.

On 26.10.2018 22:25, Dave Taht wrote:
> Can you repeat your test, disabling gro splitting in cake?
>
> the option is "no-split-gso"

Still panics. Takes a couple of rounds, but panics.

Moreover, I've stressed my HTB setup like this too for a longer time,
and it panics as well. So, at least, now I have proof that this is not a
CAKE-specific thing.

Also, I've stressed it even with noqueue, and the panic is still there.
So, this thing is not even sch-specific.

Next, I've seen GRO bits in the call trace and decided to disable GRO on
this NIC. So far, I cannot trigger a panic with GRO disabled even after
20 rounds of speedtest.

So, it must be some generic thing indeed.

--
Oleksandr Natalenko (post-factum)

2018-10-26 23:11:54

by Dave Taht

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

On Fri, Oct 26, 2018 at 1:54 PM Oleksandr Natalenko
<[email protected]> wrote:
>
> Hi.
>
> On 26.10.2018 22:25, Dave Taht wrote:
> > Can you repeat your test, disabling gro splitting in cake?
> >
> > the option is "no-split-gso"
>
> Still panics. Takes a couple of rounds, but panics.
>
> Moreover, I've stressed my HTB setup like this too for a longer time,
> and it panics as well. So, at least, now I have a proof this is not a
> CAKE-specific thing.

Groovy. :whew:

I do look forward to more cake test results, particularly on different
network cards such as these, at speeds higher than 10Gbit on high-end
hardware, and in the 100Mbit-1Gbit range on low to mid-range hardware.
After the last round of features added to cake before it went into
Linux, we now run out of CPU on inbound shaping at those speeds on
low-end apu2 (x86) hardware (atom and a15 chips are not so hot now
either), and I wish I knew what we could do to speed it up. The new
"list skb" and mirred code looked promising, but we haven't got around
to exploring it yet.

Thank you for trying and I hope this gets sorted out on your chipset.

>
> Also, I've stressed it even with noqueue, and the panic is still there.
> So, this thing is not even sch-specific.
>
> Next, I've seen GRO bits in the call trace and decided to disable GRO on
> this NIC. So far, I cannot trigger a panic with GRO disabled even after
> 20 rounds of speedtest.

We tend to use flent's rrul test to *really* abuse things. :)

So cake's ok with gro disabled in hw?

>
> So, must be some generic thing indeed.
>
> --
> Oleksandr Natalenko (post-factum)



--

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

2018-10-27 11:05:23

by Oleksandr Natalenko

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

Hi.

On 27.10.2018 01:08, Dave Taht wrote:
> Groovy. :whew:
>
> I do look forward to more cake test results, particularly on different
> network cards such as these, and at speeds higher than 10Gbit on high
> end hardware, and in the 100-1Gbit range on low to mid-range. After
> the last round of features added to cake before it went into linux, we
> run now out of cpu on inbound shaping at those speeds on low end apu2
> (x86) hardware, (atom and a15 chips are not so hot now either) and I
> wish I knew what we could do to speed it up. The new "list skb" and
> mirred code looked promising but we haven't got around to exploring it
> yet.
>
> Thank you for trying and I hope this gets sorted out on your chipset.

Yeah, but this is still strange. Both the LAN computer and the router
run 4.19, but only the router panics. The LAN computer uses the alx
driver, the router uses r8169. Both had GRO enabled at the moment of
the panic. But [1] reports that this happens with an Intel NIC too, so
it must not be limited to Realtek.

> We tend to use flent's rrul test to *really* abuse things. :)
>
> So cake's ok with gro disabled in hw?

Yes, I've gone back to CAKE but with GRO disabled on the NIC, and it is
stable now. I've also asked the bug reporter in [1] to do the same, so
we will see.

Thanks.

--
Oleksandr Natalenko (post-factum)

[1] https://bugzilla.kernel.org/show_bug.cgi?id=201063

2018-10-27 14:11:12

by Heiner Kallweit

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

On 26.10.2018 22:54, Oleksandr Natalenko wrote:
> Hi.
>
> On 26.10.2018 22:25, Dave Taht wrote:
>> Can you repeat your test, disabling gro splitting in cake?
>>
>> the option is "no-split-gso"
>
> Still panics. Takes a couple of rounds, but panics.
>
> Moreover, I've stressed my HTB setup like this too for a longer time, and it panics as well. So, at least, now I have a proof this is not a CAKE-specific thing.
>
> Also, I've stressed it even with noqueue, and the panic is still there. So, this thing is not even sch-specific.
>
> Next, I've seen GRO bits in the call trace and decided to disable GRO on this NIC. So far, I cannot trigger a panic with GRO disabled even after 20 rounds of speedtest.
>
> So, must be some generic thing indeed.
>
In net-next there's the following patch which mentions in the
description that it "eliminates spurious list pointer poisoning":
992cba7e276d ("net: Add and use skb_list_del_init().")
And spurious list pointer poisoning is what we see here (IMO).

As an idea this patch and a8305bff6852 ("net: Add and use skb_mark_not_on_list().")
from net-next could be applied on top of 4.19. Would be curious
whether it fixes the issue.

2018-10-27 15:40:51

by Oleksandr Natalenko

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

Hi.

On 27.10.2018 15:43, Heiner Kallweit wrote:
> In net-next there's the following patch which mentions in the
> description that it "eliminates spurious list pointer poisoning":
> 992cba7e276d ("net: Add and use skb_list_del_init().")
> And spurious list pointer poisoning is what we see here (IMO).
>
> As an idea this patch and a8305bff6852 ("net: Add and use
> skb_mark_not_on_list().")
> from net-next could be applied on top of 4.19. Would be curious
> whether it fixes the issue.

Applied both, still panics. %r12/%r15 are still poisoned too.

--
Oleksandr Natalenko (post-factum)

2018-10-28 04:45:07

by David Miller

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

From: Oleksandr Natalenko <[email protected]>
Date: Fri, 26 Oct 2018 22:54:12 +0200

> Next, I've seen GRO bits in the call trace and decided to disable GRO
> on this NIC. So far, I cannot trigger a panic with GRO disabled even
> after 20 rounds of speedtest.
>
> So, must be some generic thing indeed.

Yeah something is out-of-whack with GRO.

Does this fix it?

diff --git a/net/core/dev.c b/net/core/dev.c
index 022ad73d6253..77d43ae2a7bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5457,7 +5457,7 @@ static void gro_flush_oldest(struct list_head *head)
 	/* Do not adjust napi->gro_hash[].count, caller is adding a new
 	 * SKB to the chain.
 	 */
-	list_del(&oldest->list);
+	skb_list_del_init(oldest);
 	napi_gro_complete(oldest);
 }


2018-10-28 12:37:09

by Oleksandr Natalenko

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

Hi.

On 28.10.2018 05:44, David Miller wrote:
> Does this fix it?
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 022ad73d6253..77d43ae2a7bb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5457,7 +5457,7 @@ static void gro_flush_oldest(struct list_head
> *head)
> /* Do not adjust napi->gro_hash[].count, caller is adding a new
> * SKB to the chain.
> */
> - list_del(&oldest->list);
> + skb_list_del_init(oldest);
> napi_gro_complete(oldest);
> }

Yes, but I had to apply both a8305bff6852 and 992cba7e276d too to get it
to compile. With these 3 patches the panic is no longer triggered with
GRO enabled.

Thanks!

--
Oleksandr Natalenko (post-factum)

2018-10-28 17:46:13

by David Miller

Subject: Re: CAKE and r8169 cause panic on upload in v4.19

From: Oleksandr Natalenko <[email protected]>
Date: Sun, 28 Oct 2018 13:22:09 +0100

> Hi.
>
> On 28.10.2018 05:44, David Miller wrote:
>> Does this fix it?
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 022ad73d6253..77d43ae2a7bb 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -5457,7 +5457,7 @@ static void gro_flush_oldest(struct list_head
>> *head)
>> /* Do not adjust napi->gro_hash[].count, caller is adding a new
>> * SKB to the chain.
>> */
>> - list_del(&oldest->list);
>> + skb_list_del_init(oldest);
>> napi_gro_complete(oldest);
>> }
>
> Yes, but I had to apply both a8305bff6852 and 992cba7e276d too to get
> it compiled. With these 3 patches the panic is not triggered any more
> while having GRO enabled.
>
> Thanks!

Thanks for testing, I'll queue this up for -stable too:

From ece23711dd956cd5053c9cb03e9fe0668f9c8894 Mon Sep 17 00:00:00 2001
From: "David S. Miller" <[email protected]>
Date: Sun, 28 Oct 2018 10:35:12 -0700
Subject: [PATCH] net: Properly unlink GRO packets on overflow.

Just like with normal GRO processing, we have to initialize
skb->next to NULL when we unlink overflow packets from the
GRO hash lists.

Fixes: d4546c2509b1 ("net: Convert GRO SKB handling to list_head.")
Reported-by: Oleksandr Natalenko <[email protected]>
Tested-by: Oleksandr Natalenko <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
---
 net/core/dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 022ad73d6253..77d43ae2a7bb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5457,7 +5457,7 @@ static void gro_flush_oldest(struct list_head *head)
 	/* Do not adjust napi->gro_hash[].count, caller is adding a new
 	 * SKB to the chain.
 	 */
-	list_del(&oldest->list);
+	skb_list_del_init(oldest);
 	napi_gro_complete(oldest);
 }

--
2.17.2
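
The commit message above says that skb->next has to be set to NULL when a packet is unlinked from a GRO hash list. The following userspace sketch illustrates why, under stated assumptions: the structures are simplified stand-ins for the kernel's struct list_head and struct sk_buff (where, as far as the patch suggests, the list member shares storage with next/prev), the poison values omit the architecture-specific offset, and every function with a _sim suffix, as well as chain_len(), is hypothetical and only mimics what the corresponding kernel helper is believed to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* ASSUMPTION: small fake poison values; the kernel adds an
 * architecture-specific offset on top of these. */
#define LIST_POISON1 ((struct list_head *)(uintptr_t)0x100)
#define LIST_POISON2 ((struct list_head *)(uintptr_t)0x200)

/* Simplified stand-in for struct sk_buff: next/prev overlay the list node. */
struct sk_buff {
	union {
		struct {
			struct sk_buff *next, *prev;
		};
		struct list_head list;
	};
	int id;
};

static void list_add_tail_sim(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del_entry_sim(struct list_head *entry)
{
	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
}

/* Mimics list_del(): unlink, then poison both pointers. */
static void list_del_sim(struct list_head *entry)
{
	list_del_entry_sim(entry);
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}

/* Mimics skb_list_del_init(): unlink, then mark the skb as not on a list. */
static void skb_list_del_init_sim(struct sk_buff *skb)
{
	list_del_entry_sim(&skb->list);
	skb->next = NULL;
}

/* Walks an skb chain via ->next until NULL, as a transmit loop would.
 * Returns the chain length, or -1 where real code would dereference
 * a poisoned pointer and fault. */
static int chain_len(const struct sk_buff *skb)
{
	int n = 0;

	while (skb) {
		if ((const void *)skb == (const void *)LIST_POISON1)
			return -1;
		n++;
		skb = skb->next;
	}
	return n;
}
```

In this model, an skb removed with list_del_sim() still looks like the head of a chain to any code that follows ->next, so chain_len() reports the poisoned pointer, while an skb removed with skb_list_del_init_sim() is a clean single-packet chain.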