Hi,
The following commit breaks ipt_REJECT on my machine. Tested with latest
2.6.19rc*, found with git-bisect. i386, gcc-4.1.1, the usual stuff.
All details available on request, of course.
commit 9d02002d2dc2c7423e5891b97727fde4d667adf1
Author: Patrick McHardy <[email protected]>
Date: Mon Oct 2 16:12:20 2006 -0700
[NETFILTER]: ipt_REJECT: remove largely duplicate route_reverse function
Use ip_route_me_harder instead, which now allows to specify how we wish
the packet to be routed.
Based on patch by Simon Horman <[email protected]>.
Signed-off-by: Patrick McHardy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
The results are:
Last user: [<c015f82b>](alloc_pipe_info+0x1b/0x50)
000: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: ad 4e ad de ff ff ff ff ff ff ff ff dc c4 45 c0
Next obj: start=c476fb58, len=512
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02b5998>](kfree_skbmem+0x8/0x80)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=c476fd64, len=512
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02b5998>](kfree_skbmem+0x8/0x80)
040: 6b 6b 6b 6b f5 1b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Prev obj: start=c476fb58, len=512
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c01f76c7>](acpi_ps_parse_aml+0x1b7/0x1f2)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
Slab corruption: start=c476fd64, len=512
Redzone: 0x5a2cf071/0x5a2cf071.
Last user: [<c02b5998>](kfree_skbmem+0x8/0x80)
and so on.
--
Krzysztof Halasa
Krzysztof Halasa wrote:
> Hi,
>
> The following commit breaks ipt_REJECT on my machine. Tested with latest
> 2.6.19rc*, found with git-bisect. i386, gcc-4.1.1, the usual stuff.
> All details available on request, of course.
>
> commit 9d02002d2dc2c7423e5891b97727fde4d667adf1
How sure are you about this? I can see nothing wrong with that
commit and can't reproduce the slab corruption. Please post
the rule that triggers this.
Patrick McHardy <[email protected]> writes:
>> The following commit breaks ipt_REJECT on my machine. Tested with latest
>> 2.6.19rc*, found with git-bisect. i386, gcc-4.1.1, the usual stuff.
>> All details available on request, of course.
>>
>> commit 9d02002d2dc2c7423e5891b97727fde4d667adf1
>
> How sure are you about this? I can see nothing wrong with that
> commit and can't reproduce the slab corruption. Please post
> the rule that triggers this.
99% sure. Past this commit I get corruptions after 5 minutes at most
(that's ADSL with USB Thomson/Alcatel Speedtouch -> PPP over ATM,
with a GRE tunnel over that PPP).
I'm now running 901eaf6c8f997f18ebc8fcbb85411c79161ab3b2 (i.e. the
last commit before the one in question) for 4 hours and nothing like
that.
Not sure about the exact rule, but the most probable candidates are:
-A INPUT -p tcp --tcp-flags SYN,RST,ACK SYN -j REJECT --reject-with tcp-reset
-A INPUT -p udp -j REJECT --reject-with icmp-port-unreachable
Other "REJECT" rules haven't fired yet.
Could be some obscure problem with GRE/Speedtouch/PPP over ATM,
triggered by this patch, though.
Perhaps I can do some experiments - just say a word.
--
Krzysztof Halasa
diff --git a/net/ipv4/netfilter/ipt_REJECT.c b/net/ipv4/netfilter/ipt_REJECT.c
index ad0312d..264763a 100644
--- a/net/ipv4/netfilter/ipt_REJECT.c
+++ b/net/ipv4/netfilter/ipt_REJECT.c
@@ -114,6 +114,14 @@ static void send_reset(struct sk_buff *o
tcph->window = 0;
tcph->urg_ptr = 0;
+ /* Adjust TCP checksum */
+ tcph->check = 0;
+ tcph->check = tcp_v4_check(tcph, sizeof(struct tcphdr),
+ nskb->nh.iph->saddr,
+ nskb->nh.iph->daddr,
+ csum_partial((char *)tcph,
+ sizeof(struct tcphdr), 0));
+
/* Set DF, id = 0 */
nskb->nh.iph->frag_off = htons(IP_DF);
nskb->nh.iph->id = 0;
@@ -129,14 +137,8 @@ #endif
if (ip_route_me_harder(&nskb, addr_type))
goto free_nskb;
- /* Adjust TCP checksum */
nskb->ip_summed = CHECKSUM_NONE;
- tcph->check = 0;
- tcph->check = tcp_v4_check(tcph, sizeof(struct tcphdr),
- nskb->nh.iph->saddr,
- nskb->nh.iph->daddr,
- csum_partial((char *)tcph,
- sizeof(struct tcphdr), 0));
+
/* Adjust IP TTL */
nskb->nh.iph->ttl = dst_metric(nskb->dst, RTAX_HOPLIMIT);
Patrick McHardy <[email protected]> writes:
> It might be the case that your network device has a
> hard_header_len > LL_MAX_HEADER, which could trigger
> a corruption.
Hmm... GRE tunnels add 24 bytes... I just noticed the following code in
include/linux/netdevice.h:
/*
* Compute the worst case header length according to the protocols
* used.
*/
#if !defined(CONFIG_AX25) && !defined(CONFIG_AX25_MODULE) && !defined(CONFIG_TR)
#define LL_MAX_HEADER 32
#else
#if defined(CONFIG_AX25) || defined(CONFIG_AX25_MODULE)
#define LL_MAX_HEADER 96
#else
#define LL_MAX_HEADER 48
#endif
#endif
#if !defined(CONFIG_NET_IPIP) && \
!defined(CONFIG_IPV6) && !defined(CONFIG_IPV6_MODULE)
#define MAX_HEADER LL_MAX_HEADER
#else
#define MAX_HEADER (LL_MAX_HEADER + 48)
#endif
I don't use AX25, Token Ring, the old IPIP tunnels nor IPv6 here, but
I wonder if GRE tunnel (which is basically another, more compatible
form of IPIP) need the same treatment as IPIP.
I've confirmed that REJECTs over GRE tunnel caused that corruption.
> Please try this patch on top of the REJECT patch (ideally after
> verifying that the REJECT patch is really introducing the
> corruption).
That was certain. The patch fixed the problem, confirmed with current
git tree as well. Thanks for looking at it.
I'm not sure about LL_MAX_HEADER (and/or MAX_HEADER) though. Should it
be changed as well?
There are many devices adding data to header space, perhaps tacking
devices doesn't count as the skb is being linearized in
dev->hard_start_xmit() or equivalent path?
--
Krzysztof Halasa
diff --git a/net/ipv4/netfilter/ipt_REJECT.c b/net/ipv4/netfilter/ipt_REJECT.c
index ad0312d..264763a 100644
--- a/net/ipv4/netfilter/ipt_REJECT.c
+++ b/net/ipv4/netfilter/ipt_REJECT.c
@@ -114,6 +114,14 @@ static void send_reset(struct sk_buff *o
tcph->window = 0;
tcph->urg_ptr = 0;
+ /* Adjust TCP checksum */
+ tcph->check = 0;
+ tcph->check = tcp_v4_check(tcph, sizeof(struct tcphdr),
+ nskb->nh.iph->saddr,
+ nskb->nh.iph->daddr,
+ csum_partial((char *)tcph,
+ sizeof(struct tcphdr), 0));
+
/* Set DF, id = 0 */
nskb->nh.iph->frag_off = htons(IP_DF);
nskb->nh.iph->id = 0;
@@ -129,14 +137,8 @@ #endif
if (ip_route_me_harder(&nskb, addr_type))
goto free_nskb;
- /* Adjust TCP checksum */
nskb->ip_summed = CHECKSUM_NONE;
- tcph->check = 0;
- tcph->check = tcp_v4_check(tcph, sizeof(struct tcphdr),
- nskb->nh.iph->saddr,
- nskb->nh.iph->daddr,
- csum_partial((char *)tcph,
- sizeof(struct tcphdr), 0));
+
/* Adjust IP TTL */
nskb->nh.iph->ttl = dst_metric(nskb->dst, RTAX_HOPLIMIT);
From: Patrick McHardy <[email protected]>
Date: Wed, 29 Nov 2006 03:28:25 +0100
> [NETFILTER]: ipt_REJECT: fix memory corruption
>
> On devices with hard_header_len > LL_MAX_HEADER ip_route_me_harder()
> reallocates the skb, leading to memory corruption when using the stale
> tcph pointer to update the checksum.
>
> Signed-off-by: Patrick McHardy <[email protected]>
Applied, thanks Patrick.
And based upon your discovery wrt. MAX_HEADER I'm also
applying the following.
commit 93e3a20d6c67a09b867431e7d5b3e7bc97154fab
Author: David S. Miller <[email protected]>
Date: Tue Nov 28 20:24:10 2006 -0800
[NET]: Fix MAX_HEADER setting.
MAX_HEADER is either set to LL_MAX_HEADER or LL_MAX_HEADER + 48, and
this is controlled by a set of CONFIG_* ifdef tests.
It is trying to use LL_MAX_HEADER + 48 when any of the tunnels are
enabled which set hard_header_len like this:
dev->hard_header_len = LL_MAX_HEADER + sizeof(struct xxx);
The correct set of tunnel drivers which do this are:
ipip
ip_gre
ip6_tunnel
sit
so make the ifdef test match.
Noticed by Patrick McHardy.
Signed-off-by: David S. Miller <[email protected]>
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9264139..95e86ac 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -94,7 +94,9 @@ #endif
#endif
#if !defined(CONFIG_NET_IPIP) && \
- !defined(CONFIG_IPV6) && !defined(CONFIG_IPV6_MODULE)
+ !defined(CONFIG_NET_IPGRE) && \
+ !defined(CONFIG_IPV6_SIT) && \
+ !defined(CONFIG_IPV6_TUNNEL)
#define MAX_HEADER LL_MAX_HEADER
#else
#define MAX_HEADER (LL_MAX_HEADER + 48)
David Miller <[email protected]> wrote:
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 9264139..95e86ac 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -94,7 +94,9 @@ #endif
> #endif
>
> #if !defined(CONFIG_NET_IPIP) && \
> - !defined(CONFIG_IPV6) && !defined(CONFIG_IPV6_MODULE)
> + !defined(CONFIG_NET_IPGRE) && \
> + !defined(CONFIG_IPV6_SIT) && \
> + !defined(CONFIG_IPV6_TUNNEL)
> #define MAX_HEADER LL_MAX_HEADER
> #else
> #define MAX_HEADER (LL_MAX_HEADER + 48)
What if ipip/gre are modules?
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Herbert Xu <[email protected]>
Date: Wed, 29 Nov 2006 15:38:29 +1100
> David Miller <[email protected]> wrote:
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 9264139..95e86ac 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -94,7 +94,9 @@ #endif
> > #endif
> >
> > #if !defined(CONFIG_NET_IPIP) && \
> > - !defined(CONFIG_IPV6) && !defined(CONFIG_IPV6_MODULE)
> > + !defined(CONFIG_NET_IPGRE) && \
> > + !defined(CONFIG_IPV6_SIT) && \
> > + !defined(CONFIG_IPV6_TUNNEL)
> > #define MAX_HEADER LL_MAX_HEADER
> > #else
> > #define MAX_HEADER (LL_MAX_HEADER + 48)
>
> What if ipip/gre are modules?
Good catch, I'll fix that up by adding the missing CONFIG_*_MODULE
cases.
Longer term this is really messy, we should handle this some
other way.
David Miller <[email protected]> wrote:
>
> Longer term this is really messy, we should handle this some
> other way.
Definitely. I'm not sure whether 48 is enough even for recursive
tunnels. This should really just be a hint. It's OK to spend a
bit of time reallocating skb's if it's too small, but it's not OK
to die.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Herbert Xu <[email protected]>
Date: Wed, 29 Nov 2006 15:56:57 +1100
> David Miller <[email protected]> wrote:
> >
> > Longer term this is really messy, we should handle this some
> > other way.
>
> Definitely. I'm not sure whether 48 is enough even for recursive
> tunnels. This should really just be a hint. It's OK to spend a
> bit of time reallocating skb's if it's too small, but it's not OK
> to die.
The recursive tunnel case is handled by the PMTU reductions
in the route, isn't it?
On Tue, Nov 28, 2006 at 09:04:16PM -0800, David Miller wrote:
>
> > Definitely. I'm not sure whether 48 is enough even for recursive
> > tunnels. This should really just be a hint. It's OK to spend a
> > bit of time reallocating skb's if it's too small, but it's not OK
> > to die.
>
> The recursive tunnel case is handled by the PMTU reductions
> in the route, isn't it?
Oh I wasn't suggesting that the current code is broken.
I'm just emphasising that LL_MAX_HEADER is by no means the *maximum*
header size in a Linux system. Anybody should be able to load a
new NIC module with a hard header size exceeding what LL_MAX_HEADER
is and the system should still function (albeit slower since every
packet sent down that device has to be reallocated).
In particular, nested tunnels is one such device which anybody can
construct without writing a kernel module.
As to getting rid of those ifdefs, here is one idea. We keep a
read-mostly global variable that represents the actual current
maximum LL header size. Everytime a new device appears (or if
its hard header size changes) we update this variable if needed.
Hmm, we don't actually update the hard header size should the
underlying device change for tunnels. Good thing the tunnels
only use that as a hint and reallocate if necessary :)
This is not optimal in that it never decreases, but it's certainly
better than a compile-time constant (e.g., people using distribution
kernels don't necessarily use tunnels).
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 29-11-2006 05:25, David Miller wrote:
...
> commit 93e3a20d6c67a09b867431e7d5b3e7bc97154fab
> Author: David S. Miller <[email protected]>
> Date: Tue Nov 28 20:24:10 2006 -0800
>
> [NET]: Fix MAX_HEADER setting.
>
> MAX_HEADER is either set to LL_MAX_HEADER or LL_MAX_HEADER + 48, and
> this is controlled by a set of CONFIG_* ifdef tests.
...
> Noticed by Patrick McHardy.
And if we talk about names:
+ Spotted by Krzysztof Halasa.
probably wouldn't be too exaggerated...
> Signed-off-by: David S. Miller <[email protected]>
Jarek P.
Jarek Poplawski <[email protected]> writes:
> And if we talk about names:
>
> + Spotted by Krzysztof Halasa.
I wound't care less btw.
--
Krzysztof Halasa
Krzysztof Halasa <[email protected]> writes:
> I wound't care less btw.
s/wound/couldn/, eh those foreign languages...
--
Krzysztof Halasa
On Wed, Nov 29, 2006 at 04:16:06PM +0100, Krzysztof Halasa wrote:
> Krzysztof Halasa <[email protected]> writes:
>
> > I wound't care less btw.
>
> s/wound/couldn/, eh those foreign languages...
So, you say, you don't care about David Miller's credits?
It isn't nice. He could be very disappointed...
Jarek P.
From: Herbert Xu <[email protected]>
Date: Wed, 29 Nov 2006 17:51:46 +1100
> I'm just emphasising that LL_MAX_HEADER is by no means the *maximum*
> header size in a Linux system.
But it is the maximum "link level" singular header size.
It is MAX_HEADER which is the hack and the main issue.
What MAX_HEADER's setting is trying to do is optimistically allocate
enough for a single level of tunnelling. It does not handle nested
tunneling at all, of course.
> As to getting rid of those ifdefs, here is one idea. We keep a
> read-mostly global variable that represents the actual current
> maximum LL header size. Everytime a new device appears (or if
> its hard header size changes) we update this variable if needed.
>
> Hmm, we don't actually update the hard header size should the
> underlying device change for tunnels. Good thing the tunnels
> only use that as a hint and reallocate if necessary :)
>
> This is not optimal in that it never decreases, but it's certainly
> better than a compile-time constant (e.g., people using distribution
> kernels don't necessarily use tunnels).
I like this idea for the most part. It also deals nicely with, as you
alude to, how the MAX_HEADER scheme uses the space even if you don't
configure any tunnels at all.
Actually, I wonder how antiquated this all is. I bet we could get rid
of MAX_HEADER, then if we have to realloc headroom, we adjust some
per-device header thing which will behave like your global value idea
does. On the next allocation, we'll do the right thing. Although I
cannot come up with a scheme that works without reintroducing another
net_device pointer to sk_buff, which seems necessary to handle arbitrary
nesting. :-/
On Thu, Nov 30, 2006 at 08:22:06PM -0800, David Miller wrote:
>
> What MAX_HEADER's setting is trying to do is optimistically allocate
> enough for a single level of tunnelling. It does not handle nested
> tunneling at all, of course.
Agreed, I should've said MAX_HEADER.
> Actually, I wonder how antiquated this all is. I bet we could get rid
> of MAX_HEADER, then if we have to realloc headroom, we adjust some
> per-device header thing which will behave like your global value idea
> does. On the next allocation, we'll do the right thing. Although I
> cannot come up with a scheme that works without reintroducing another
> net_device pointer to sk_buff, which seems necessary to handle arbitrary
> nesting. :-/
Actually the scarier part is that TCP as well as ip_route_me_harder
doesn't guarantee enough headroom for IPsec. Fortunately TCP reserves
enough room (128 bytes) by default that it's unlikely to break with
non-nested IPsec. But it's still pretty nasty.
So in general when allocating packets we have two scenarios:
1) The dst is known and fixed, i.e., all datagram protocols. This is
the easy case where the headroom is known exactly beforehand.
2) The dst is unknown or may vary, this includes TCP, SCTP and DCCP.
This is where we currently use MAX_HEADER plus some protocol-specific
headroom.
Right now the normal (non-IPsec) dst output path always checks for
sufficient headroom and reallocates if necessary (ip_finish_output2).
I propose that we make IPsec do the same thing.
This change will make the stack safe from underflow crashes in IPsec.
There is also the ip_route_me_harder path where the dst varies. It
also tries to reallocate the packet if there isn't enough headroom for
the new dst. As long both IPsec and the normal path does the headroom
check, this can in fact be removed.
We can then make it more optimal because in the cases of TCP/SCTP/DCCP
we usually have a dst object. The only problem of course is that it
may vary. However, the common case by far is that the dst stays
constant. So we can optimise for it by getting the headroom from the
current dst and rely on the last-ditch reallocation to fix things up
if needed.
For standard MTU-sized packets this discussion is moot since we have
2K of memory in each chunk. However, for ACKs it could save a bit of
memory.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
From: Herbert Xu <[email protected]>
Date: Fri, 1 Dec 2006 15:37:55 +1100
> So in general when allocating packets we have two scenarios:
>
> 1) The dst is known and fixed, i.e., all datagram protocols. This is
> the easy case where the headroom is known exactly beforehand.
>
> 2) The dst is unknown or may vary, this includes TCP, SCTP and DCCP.
> This is where we currently use MAX_HEADER plus some protocol-specific
> headroom.
>
> Right now the normal (non-IPsec) dst output path always checks for
> sufficient headroom and reallocates if necessary (ip_finish_output2).
> I propose that we make IPsec do the same thing.
Agreed.
> For standard MTU-sized packets this discussion is moot since we have
> 2K of memory in each chunk. However, for ACKs it could save a bit of
> memory.
For linear MTU-sized SKBs yes, but TCP data packets are going out %99
of the time with paged data these days and thus suffers from the same
set of issues and potential savings.