2019-02-07 22:39:46

by Sander Eikelenboom

[permalink] [raw]
Subject: Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

L.S.,

While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression with NAT.
(using an nftables firewall with NAT and connection tracking).

Unfortunately it isn't too obvious since no errors are logged, but on clients it
causes symptoms like firefox intermittently not being able to load pages with:
Network Protocol Error
An error occurred during a connection to http://www.example.com
The page you are trying to view cannot be shown because an error in the network protocol was detected.
Please contact the website owners to inform them of this problem.

But it's only intermittently, so i can still visit some webpages with clients,
could be that packet size and or fragments are at play ?

So I tried testing with git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with
e8c32c32b48c2e889704d8ca0872f92eb027838e as last commit, to be sure to have the latest netdev has to offer,
but to no avail.

After that I tried to git bisect and ended up with:

faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
commit faec18dbb0405c7d4dda025054511dc3a6696918
Author: Florian Westphal <[email protected]>
Date: Thu Dec 13 16:01:33 2018 +0100

netfilter: nat: remove l4proto->manip_pkt

This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.

nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.

Next patch will remove the empty files and the nf_nat_l4proto
struct.

nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

:040000 040000 22d8706921e03cbd6d78a6ebcc5f253ccfd2bf0c b6f8ab2779215b4495dfe641f50e798da73859ac M include
:040000 040000 af212a756f1acf00cbe45c3be5b71f38f01f1d34 165c440f9e6f2e05738628a19b51f7603f95752a M net

Any ideas or debugging hints ?

--
Sander


2019-02-08 07:07:44

by Florian Westphal

[permalink] [raw]
Subject: Re: Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

Sander Eikelenboom <[email protected]> wrote:
> L.S.,
>
> While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression with NAT.
> (using an nftables firewall with NAT and connection tracking).
>
> Unfortunately it isn't too obvious since no errors are logged, but on clients it
> causes symptoms like firefox intermittently not being able to load pages with:
> Network Protocol Error
> An error occurred during a connection to http://www.example.com
> The page you are trying to view cannot be shown because an error in the network protocol was detected.
> Please contact the website owners to inform them of this problem.
>
> But it's only intermittently, so i can still visit some webpages with clients,
> could be that packet size and or fragments are at play ?
>
> So I tried testing with git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with
> e8c32c32b48c2e889704d8ca0872f92eb027838e as last commit, to be sure to have the latest netdev has to offer,
> but to no avail.
>
> After that I tried to git bisect and ended up with:
>
> faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
> commit faec18dbb0405c7d4dda025054511dc3a6696918
> Author: Florian Westphal <[email protected]>
> Date: Thu Dec 13 16:01:33 2018 +0100
>
> netfilter: nat: remove l4proto->manip_pkt

Thanks, this is immensely helpful.

I think I see the bug, we can't use target->dst.protonum in
nf_nat_l4proto_manip_pkt(), it will be TCP in case we're dealing
with a related icmp packet.

I will send a patch in a few hours when I get back.

2019-02-08 11:56:36

by Florian Westphal

[permalink] [raw]
Subject: Re: Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

Florian Westphal <[email protected]> wrote:
> Sander Eikelenboom <[email protected]> wrote:
> > L.S.,
> >
> > While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression with NAT.
> > (using an nftables firewall with NAT and connection tracking).
> >
> > Unfortunately it isn't too obvious since no errors are logged, but on clients it
> > causes symptoms like firefox intermittently not being able to load pages with:
> > Network Protocol Error
> > An error occurred during a connection to http://www.example.com
> > The page you are trying to view cannot be shown because an error in the network protocol was detected.
> > Please contact the website owners to inform them of this problem.
> >
> > But it's only intermittently, so i can still visit some webpages with clients,
> > could be that packet size and or fragments are at play ?
> >
> > So I tried testing with git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with
> > e8c32c32b48c2e889704d8ca0872f92eb027838e as last commit, to be sure to have the latest netdev has to offer,
> > but to no avail.
> >
> > After that I tried to git bisect and ended up with:
> >
> > faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
> > commit faec18dbb0405c7d4dda025054511dc3a6696918
> > Author: Florian Westphal <[email protected]>
> > Date: Thu Dec 13 16:01:33 2018 +0100
> >
> > netfilter: nat: remove l4proto->manip_pkt
>
> Thanks, this is immensely helpful.
>
> I think I see the bug, we can't use target->dst.protonum in
> nf_nat_l4proto_manip_pkt(), it will be TCP in case we're dealing
> with a related icmp packet.
>
> I will send a patch in a few hours when I get back.

Sander, does this patch fix things for you?

Thanks!

diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
--- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
@@ -215,6 +215,7 @@ int nf_nat_icmp_reply_translation(struct sk_buff *skb,

/* Change outer to look like the reply to an incoming packet */
nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
+ target.dst.protonum = IPPROTO_ICMP;
if (!nf_nat_ipv4_manip_pkt(skb, 0, &target, manip))
return 0;

diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
--- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
+++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
@@ -226,6 +226,7 @@ int nf_nat_icmpv6_reply_translation(struct sk_buff *skb,
}

nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
+ target.dst.protonum = IPPROTO_ICMPV6;
if (!nf_nat_ipv6_manip_pkt(skb, 0, &target, manip))
return 0;


2019-02-08 15:34:08

by Sander Eikelenboom

[permalink] [raw]
Subject: Re: Kernel 5.0-rc5 regression with NAT, bisected to: netfilter: nat: remove l4proto->manip_pkt

On 08/02/2019 12:54, Florian Westphal wrote:
> Florian Westphal <[email protected]> wrote:
>> Sander Eikelenboom <[email protected]> wrote:
>>> L.S.,
>>>
>>> While trying out a 5.0-RC5 kernel I seem to have stumbled over a regression with NAT.
>>> (using an nftables firewall with NAT and connection tracking).
>>>
>>> Unfortunately it isn't too obvious since no errors are logged, but on clients it
>>> causes symptoms like firefox intermittently not being able to load pages with:
>>> Network Protocol Error
>>> An error occurred during a connection to http://www.example.com
>>> The page you are trying to view cannot be shown because an error in the network protocol was detected.
>>> Please contact the website owners to inform them of this problem.
>>>
>>> But it's only intermittently, so i can still visit some webpages with clients,
>>> could be that packet size and or fragments are at play ?
>>>
>>> So I tried testing with git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git with
>>> e8c32c32b48c2e889704d8ca0872f92eb027838e as last commit, to be sure to have the latest netdev has to offer,
>>> but to no avail.
>>>
>>> After that I tried to git bisect and ended up with:
>>>
>>> faec18dbb0405c7d4dda025054511dc3a6696918 is the first bad commit
>>> commit faec18dbb0405c7d4dda025054511dc3a6696918
>>> Author: Florian Westphal <[email protected]>
>>> Date: Thu Dec 13 16:01:33 2018 +0100
>>>
>>> netfilter: nat: remove l4proto->manip_pkt
>>
>> Thanks, this is immensely helpful.
>>
>> I think I see the bug, we can't use target->dst.protonum in
>> nf_nat_l4proto_manip_pkt(), it will be TCP in case we're dealing
>> with a related icmp packet.
>>
>> I will send a patch in a few hours when I get back.
>
> Sander, does this patch fix things for you?

Hi Florian,

You may stick on a reported/tested-by if you like.
Thanks for the swift fix !

--
Sander

>
> Thanks!
>
> diff --git a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> --- a/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> +++ b/net/ipv4/netfilter/nf_nat_l3proto_ipv4.c
> @@ -215,6 +215,7 @@ int nf_nat_icmp_reply_translation(struct sk_buff *skb,
>
> /* Change outer to look like the reply to an incoming packet */
> nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> + target.dst.protonum = IPPROTO_ICMP;
> if (!nf_nat_ipv4_manip_pkt(skb, 0, &target, manip))
> return 0;
>
> diff --git a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> --- a/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> +++ b/net/ipv6/netfilter/nf_nat_l3proto_ipv6.c
> @@ -226,6 +226,7 @@ int nf_nat_icmpv6_reply_translation(struct sk_buff *skb,
> }
>
> nf_ct_invert_tuplepr(&target, &ct->tuplehash[!dir].tuple);
> + target.dst.protonum = IPPROTO_ICMPV6;
> if (!nf_nat_ipv6_manip_pkt(skb, 0, &target, manip))
> return 0;
>
>