RFC3168 section 6.1.1.1 says this:
A host that receives no reply to an ECN-setup SYN within the normal
SYN retransmission timeout interval MAY resend the SYN and any
subsequent SYN retransmissions with CWR and ECE cleared. To overcome
normal packet loss that results in the original SYN being lost, the
originating host may retransmit one or more ECN-setup SYN packets
before giving up and retransmitting the SYN with the CWR and ECE bits
cleared.
Supporting this would make using ECN a lot less painful - currently, if
I want to use ECN by default, I get to turn it off anytime I find an ECN-hostile
site that I'd like to communicate with.
Looking at the 2.5.56 version of net/ipv4/tcp_output.c, it doesn't look like
the tcp_connect() function has a good way to connect a special callback to clear
the ECN bits on a retransmit. Similarly, net/ipv4/netfilter/* doesn't seem
to have a good way to flag a *retransmitted* SYN for packet mangling.
It would be nice, but not required, if the solution included a printk() so
I could grep the logs and find sites to send a nastygram to if they are in
fact ECN-hostile..
Any pointers/suggestions/etc?
/Valdis (who has hit 5 ECN-hostile servers already today... argh)
> RFC3168 section 6.1.1.1 says this:
>
> A host that receives no reply to an ECN-setup SYN within the normal
> SYN retransmission timeout interval MAY resend the SYN and any
> subsequent SYN retransmissions with CWR and ECE cleared. To overcome
> normal packet loss that results in the original SYN being lost, the
> originating host may retransmit one or more ECN-setup SYN packets
> before giving up and retransmitting the SYN with the CWR and ECE bits
> cleared.
>
> Supporting this would make using ECN a lot less painful - currently, if
> I want to use ECN by default, I get to turn it off anytime I find an
> ECN-hostile site that I'd like to communicate with.
Linux shouldn't encourage the use of equipment that violates RFCs, in
this case, RFC 793.
The correct way to deal with it, is to contact the maintainers of the
site, and ask them to fix the non conforming equipment.
If the problem is caused upstream, by equipment out of the
site's maintainers' control, it is their responsibility to contact the
relevant maintainers, or change their upstream provider.
John.
</lurk>
On Fri, Feb 21, 2003 at 08:40:45PM +0000, John Bradford wrote:
> > RFC3168 section 6.1.1.1 says this:
> >
> > A host that receives no reply to an ECN-setup SYN within the normal
> > SYN retransmission timeout interval MAY resend the SYN and any
> > subsequent SYN retransmissions with CWR and ECE cleared. To overcome
> > normal packet loss that results in the original SYN being lost, the
> > originating host may retransmit one or more ECN-setup SYN packets
> > before giving up and retransmitting the SYN with the CWR and ECE bits
> > cleared.
> >
> > Supporting this would make using ECN a lot less painful - currently, if
> > I want to use ECN by default, I get to turn it off anytime I find an
> > ECN-hostile site that I'd like to communicate with.
>
> Linux shouldn't encourage the use of equipment that violates RFCs, in
> this case, RFC 793.
Linux shouldn't encourage the use of equipment that attempts to emulate
<insert thing here> but screws it up.
>
> The correct way to deal with it, is to contact the maintainers of the
> site, and ask them to fix the non conforming equipment.
The correct way to deal with it, is to contact the manufacturers of the
equipment.
>
> If the problem is caused upstream, by equipment out of the
> site's maintainers' control, it is their responsibility to contact the
> relevant maintainers, or change their upstream provider.
If the hardware is provided by people upstream, and is out of the
sysadmin's control, it is their responsibility to contact
the relevant people, or change hardware providers.
Oh, look, does that mean that we can now remove all the work arounds in
the various network, ide, etc drivers?
No, I believe Linus has stated many times that Linux is not a research
project, it is meant to actually be USED.
<lurk>
--
1024D/E65A7801 Zephaniah E. Hull <[email protected]>
92ED 94E4 B1E6 3624 226D 5727 4453 008B E65A 7801
CCs of replies from mailing lists are requested.
I am an "expert". Fear me, for I will wreak untold damage upon anything
I can get my grubby hands on.
-- Matt McLeod on ASR.
> </lurk>
>
> On Fri, Feb 21, 2003 at 08:40:45PM +0000, John Bradford wrote:
> > > RFC3168 section 6.1.1.1 says this:
> > >
> > > A host that receives no reply to an ECN-setup SYN within the normal
> > > SYN retransmission timeout interval MAY resend the SYN and any
> > > subsequent SYN retransmissions with CWR and ECE cleared. To overcome
> > > normal packet loss that results in the original SYN being lost, the
> > > originating host may retransmit one or more ECN-setup SYN packets
> > > before giving up and retransmitting the SYN with the CWR and ECE bits
> > > cleared.
> > >
> > > Supporting this would make using ECN a lot less painful - currently, if
> > > I want to use ECN by default, I get to turn it off anytime I find an
> > > ECN-hostile site that I'd like to communicate with.
> >
> > Linux shouldn't encourage the use of equipment that violates RFCs, in
> > this case, RFC 793.
>
> Linux shouldn't encourage the use of equipment that attempts to emulate
> <insert thing here> but screws it up.
> >
> > The correct way to deal with it, is to contact the maintainers of the
> > site, and ask them to fix the non conforming equipment.
>
> The correct way to deal with it, is to contact the manufacturers of the
> equipment.
> >
> > If the problem is caused upstream, by equipment out of the
> > site's maintainers' control, it is their responsibility to contact the
> > relevant maintainers, or change their upstream provider.
>
> If the hardware is provided by people upstream, and is out of the
> sysadmin's control, it is their responsibility to contact
> the relevant people, or change hardware providers.
>
> Oh, look, does that mean that we can now remove all the work arounds in
> the various network, ide, etc drivers?
No, I'm not suggesting that at all.
> No, I believe Linus has stated many times that Linux is not a research
> project, it is meant to actually be USED.
As far as I can see, though, implementing this gains less than we
stand to lose.
What if the first SYN packet, or the response to it, is lost (which is
more likely on congested links, which is when ECN would be most
useful), and we disable ECN? Then we lose out on functionality we
could have, and the workaround is actually detrimental to
performance. Once 99% of internet hosts support ECN, we could be
losing more than we gain.
If a site is unreachable, ECN can be disabled, and the RFC violating
equipment is easily identified. Automatically disabling ECN just
hides the problem from the user, who might then not be benefiting from
ECN, and will quite possibly accept the degraded performance as
normal.
John.
>
> As far as I can see, though, implementing this gains less than we
> stand to lose.
>
> What if the first SYN packet, or the response to it, is lost (which is
> more likely on congested links, which is when ECN would be most
> useful), and we disable ECN? Then we lose out on functionality we
> could have, and the workaround is actually detrimental to
> performance. Once 99% of internet hosts support ECN, we could be
> losing more than we gain.
>
> If a site is unreachable, ECN can be disabled, and the RFC violating
> equipment is easily identified. Automatically disabling ECN just
> hides the problem from the user, who might then not be benefiting from
> ECN, and will quite possibly accept the degraded performance as
> normal.
>
> John.
I think they may have been talking about disabling ECN only for the
SYNs that never got a response. What is the loss if 1% of your overall
traffic has to be retransmitted to work, but the other 99% just works
and you never have to turn ECN off with the sysctl at all? I think they
might have been going for something like this:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102926352817528&w=2 which was brought on by this:
http://marc.theaimsgroup.com/?l=linux-kernel&m=102919814321938&w=2
Jordan
On Fri, 2003-02-21 at 22:40, John Bradford wrote:
> > Supporting this would make using ECN a lot less painful - currently, if
> > I want to use ECN by default, I get to turn it off anytime I find an
> > ECN-hostile site that I'd like to communicate with.
>
> Linux shouldn't encourage the use of equipment that violates RFCs, in
> this case, RFC 793.
>
> The correct way to deal with it, is to contact the maintainers of the
> site, and ask them to fix the non conforming equipment.
That's right. Unfortunately, the way most people *will* deal with it is
by turning ECN off permanently and forgetting about it. That won't help
ECN become widely adopted.
MikaL
On Fri, 21 Feb 2003 23:43:58 +0200, Mika Liljeberg said:
> That's right. Unfortunately, the way most people *will* deal with it is
> by turning ECN off permanently and forgetting about it. That won't help
> ECN become widely adopted.
That's what I'm trying to avoid doing. ;)
(As an aside, yes, the URL to the previous marc.theaimsgroup thread *is*
what I'm talking about).
It turns out that I *CAN* do it all with iptables *IF* the following
untested code actually works (this assumes that mangle is re-called on
a retransmit)
# If we've already marked this packet, strip/log/send...
iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 --ecn-tcp-remove
iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 -j LOG
iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 -j ACCEPT
# Else tag it - if it makes it on the first try, good. If not, re-enter above
iptables -t mangle -A OUTPUT -p tcp --syn -m mark --set-mark 99
Does the mangle/output chain get called again for a retransmitted
packet, or only once?
/Valdis
> It turns out that I *CAN* do it all with iptables *IF* the following
> untested code actually works (this assumes that mangle is re-called on
> a retransmit)
>
> # If we've already marked this packet, strip/log/send...
> iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 --ecn-tcp-remove
iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 -j ECN \
--ecn-tcp-remove
> iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 -j LOG
> iptables -t mangle -A OUTPUT -p tcp --syn -m mark --mark 99 -j ACCEPT
> # Else tag it - if it makes it on the first try, good. If not, re-enter above
> iptables -t mangle -A OUTPUT -p tcp --syn -m mark --set-mark 99
>
> Does the mangle/output chain get called again for a retransmitted
> packet, or only once?
For every retransmitted packet.
> /Valdis
Maciej
On Fri, 2003-02-21 at 13:25, John Bradford wrote:
> What if the first SYN packet, or the response to it, is lost (which is
> more likely on congested links, which is when ECN would be most
> useful), and we disable ECN? Then we lose out on functionality we
> could have, and the workaround is actually detrimental to
> performance. Once 99% of internet hosts support ECN, we could be
> losing more than we gain.
How do you know this is the reason for the lost SYN? What if other
things caused the SYN to be dropped by some intermediate site?
All the workarounds for ECN blackholes violate the protocol and
cause more problems than they solve.
That is why we refuse to implement them, and this is why the ECN
RFCs mark the "suggested workarounds" as optional not required to
implement.
On Fri, 21 Feb 2003 16:47:02 PST, "David S. Miller" said:
> How do you know this is the reason for the lost SYN? What if other
> things caused the SYN to be dropped by some intermediate site?
To be honest, we don't know. On the other hand, there's 3 basic
classes of failure modes:
1) Somebody's routing table or dead link says you can't get here. So
it doesn't matter if you retry without ECN, you *still* won't get
there.
2) Temporary queueing congestion causes your *first* SYN to be dropped
on the floor. So if you send a second without ECN, you really can't
tell if it worked because of the second SYN working Just Because, or
because ECN was turned off. On the other hand, you get the same
connection as if you had done ECN-off to begin with (just 1 transmit
later).
3) You can improve things by looking at the value of tcp_syn_retries,
and only turning off ECN for the N'th attempt. This way, you're
looking at these major cases:
3A) The N-1 packets were all dropped because of an ECN problem, and
you have a good chance of the Nth packet actually working. You win
(since you get at least a "standard" TCP connection w/o ECN).
3B) The Nth packet gets munched by congestion even though it WOULD
have worked without ECN. You would have lost anyhow.
3C) You *could* have the case where ECN was actually OK but the first
N-1 got lost by congestion/etc. You probably deserved to lose, but got
lucky instead. You don't get ECN (which would help at THAT high
congestion rate), but hopefully packet loss rates will keep the window
WAY down anyhow so you can't make things much worse.
> All the workarounds for ECN blackholes violate the protocol and
> cause more problems than they solve.
At least the workarounds for this aren't as painful as trying to
do PMTU Discovery through a router that refuses to pass ICMP Frag Needed. ;)
> That is why we refuse to implement them, and this is why the ECN
> RFCs mark the "suggested workarounds" as optional not required to
> implement.
Well.. I really didn't want it to be a mandatory change - I was
looking for an optional way to do it.
I'll cook up some shell ad-crockery that does the iptables thing and
maybe looks at tcp_syn_retries, and will post back with the outcome...
/Valdis
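The attempt-based fallback from case 3 above can be sketched roughly as follows. This is an illustration only, not kernel code: it assumes that tcp_syn_retries counts retransmissions (so 1 + tcp_syn_retries SYNs are sent in total) and that only the final attempt drops ECN. The function name and that reading of "the N'th attempt" are assumptions made for the example.

```python
def ecn_flags_for_attempts(syn_retries):
    """Return one boolean per SYN attempt: True if it carries ECE/CWR.

    Assumption: syn_retries is the number of retransmissions, so
    syn_retries + 1 SYNs are sent, and only the last one clears ECN.
    """
    total = syn_retries + 1
    if total == 1:
        # A single attempt has no retry to fall back on; keep ECN.
        return [True]
    # Every attempt except the final one keeps the ECN-setup flags.
    return [attempt < total - 1 for attempt in range(total)]


if __name__ == "__main__":
    # One flag per SYN, first attempt first.
    print(ecn_flags_for_attempts(5))
```

Keeping ECN on all but the last attempt is what limits the damage in cases 3B and 3C: a connection only loses ECN after every earlier ECN-setup SYN has gone unanswered.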
> From: [email protected]
> Date: Fri, 21 Feb 2003 19:48:45 -0500
>
> 2) Temporary queueing congestion causes your *first* SYN to be dropped
> on the floor. So if you send a second without ECN, you really can't
> tell if it worked because of the second SYN working Just Because, or
> because ECN was turned off. On the other hand, you get the same
> connection as if you had done ECN-off to begin with (just 1 transmit
> later).
This is totally broken behavior. Features don't get turned off
just because of a temporary queue overflow at some intermediate
router.
This is why the workarounds are broken by design. This kind of
behavior is totally anti- the most basic principles of how the
internet works.
> > What if the first SYN packet, or the response to it, is lost (which is
> > more likely on congested links, which is when ECN would be most
> > useful), and we disable ECN? Then we lose out on functionality we
> > could have, and the workaround is actually detrimental to
> > performance. Once 99% of internet hosts support ECN, we could be
> > losing more than we gain.
>
> How do you know this is the reason for the lost SYN?
We don't.
> What if other things caused the SYN to be dropped by some
> intermediate site?
Then we would be assuming the host didn't support ECN, when in fact,
it may well do.
> All the workarounds for ECN blackholes violate the protocol and
> cause more problems than they solve.
Which is exactly what I was providing an example of.
> That is why we refuse to implement them, and this is why the ECN
> RFCs mark the "suggested workarounds" as optional not required to
> implement.
Errr, so we agree then. Cool.
John.
> From: John Bradford <[email protected]>
> Date: Sat, 22 Feb 2003 10:56:57 +0000 (GMT)
>
> > That is why we refuse to implement them, and this is why the ECN
> > RFCs mark the "suggested workarounds" as optional not required to
> > implement.
>
> Errr, so we agree then. Cool.
Awesome :)
[email protected] wrote:
> To be honest, we don't know. On the other hand, there's 3 basic
> classes of failure modes:
Another idea:
4) Back off quickly (i.e. disable ECN on first retry), but keep track
of whom you had to do this for. Then use some clever user-mode
strategy module to act on this information. (E.g. send a list of ECN
offenders to root, or raise the threshold value for turning off ECN
for destinations that seem to accept ECN in general, but suffer high
losses.)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
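Idea 4 above leaves the policy to a user-mode component. A minimal sketch of the bookkeeping such a component might do is shown here. The class and method names are hypothetical, and how the kernel would report fallbacks to user space is not specified in the thread, so the example simply takes (destination, fell_back) events as input.

```python
from collections import defaultdict


class EcnFallbackTracker:
    """Hypothetical user-mode bookkeeping for ECN fallbacks per destination."""

    def __init__(self):
        # destination -> [connections that kept ECN, connections that fell back]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, destination, fell_back):
        """Note one connection attempt and whether ECN had to be dropped."""
        self._counts[destination][1 if fell_back else 0] += 1

    def offenders(self):
        """Destinations that never completed a connection with ECN on."""
        return sorted(d for d, (ok, bad) in self._counts.items()
                      if ok == 0 and bad > 0)

    def lossy_but_capable(self):
        """Destinations that accept ECN in general but sometimes fell back."""
        return sorted(d for d, (ok, bad) in self._counts.items()
                      if ok > 0 and bad > 0)
```

The first list is the candidate set for a report to root; the second is where raising the fallback threshold might make sense, since those hosts have shown they can negotiate ECN.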
> From: Werner Almesberger <[email protected]>
> Date: Sat, 22 Feb 2003 15:45:39 -0300
>
> 4) Back off quickly (i.e. disable ECN on first retry), but keep track
> of whom you had to do this for. Then use some clever user-mode
> strategy module to act on this information. (E.g. send a list of ECN
> offenders to root, or raise the threshold value for turning off ECN
> for destinations that seem to accept ECN in general, but suffer high
> losses.)
Time to write an ipt_ECNLOG.c netfilter module :-)
[email protected] writes:
>
> Supporting this would make using ECN a lot less painful - currently, if
> I want to use ECN by default, I get to turn it off anytime I find an ECN-hostile
> site that I'd like to communicate with.
You'll never get anyone to put anything "official" in the kernel, but
this is the patch I've been using against 2.4.{18,19,20} for a while.
It just adds a sysctl value that gives the number of SYNs to try
before clearing the ECN flags. It doesn't memorize which hosts are
screwed up, so every connection to such a host results in a noticeable
delay. Note that it will *not* work for those extremely braindead
firewalls that send back a RST in response to an ECN-enabled SYN
packet.
I combine this patch with an ECN blacklist of known bad hosts (using
straightforward netfilter mangling), and it works well for my
purposes.
The patch is against 2.4.20, and you'll need to echo a nonzero number
to tcp_ecn_retries, like so:
echo 3 >/proc/sys/net/ipv4/tcp_ecn_retries
The 3 indicates that three ECN SYNs will be tried before the ECN flags
are cleared for the fourth SYN. That gives a startup delay of about
30 seconds for every TCP connection to a screwed up host.
If your netfilter mark-and-mangle technique works, though, you may
find that more flexible.
Kevin Buhr <[email protected]>
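To make the effect of tcp_ecn_retries concrete, here is a small model of which SYNs carry the ECN flags and roughly when they are sent. It mirrors the test used in the patch below (retransmits + 1 >= tcp_ecn_retries, with 0 meaning the fallback is off), but the timing is an assumption: a 3-second initial retransmission timeout that doubles after every attempt, with no upper cap. The function name is made up for the example.

```python
def syn_schedule(ecn_retries, total_syns, initial_rto=3.0):
    """List (send_time_seconds, ecn_enabled) for each SYN sent.

    Mirrors the patch's test: a retransmission clears ECN once
    retransmits + 1 >= ecn_retries, and 0 disables the fallback.
    Assumption: the timeout starts at initial_rto and doubles each time.
    """
    schedule = [(0.0, True)]          # the original SYN always tries ECN
    now, rto = 0.0, initial_rto
    ecn = True
    for retransmits in range(total_syns - 1):
        now += rto                    # the retransmit timer fires
        if ecn_retries and retransmits + 1 >= ecn_retries:
            ecn = False               # flags stay cleared once removed
        schedule.append((now, ecn))
        rto *= 2                      # exponential backoff
    return schedule


if __name__ == "__main__":
    for when, ecn in syn_schedule(ecn_retries=3, total_syns=5):
        print(f"t={when:5.1f}s  ECN={'on' if ecn else 'off'}")
```

Under the assumed 3-second initial timeout, a setting of 3 puts the first ECN-free SYN 21 seconds after the original, the same order of magnitude as the delay quoted above; the exact figure depends on the kernel's timer settings.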
* * *
diff -ruN --exclude=*~ --exclude=*.orig linux-2.4.20-local/include/linux/sysctl.h linux-2.4.20-localx/include/linux/sysctl.h
--- linux-2.4.20-local/include/linux/sysctl.h Fri Feb 21 16:19:41 2003
+++ linux-2.4.20-localx/include/linux/sysctl.h Fri Feb 21 16:29:01 2003
@@ -292,7 +292,8 @@
NET_IPV4_NONLOCAL_BIND=88,
NET_IPV4_ICMP_RATELIMIT=89,
NET_IPV4_ICMP_RATEMASK=90,
- NET_TCP_TW_REUSE=91
+ NET_TCP_TW_REUSE=91,
+ NET_IPV4_TCP_ECN_RETRIES=92
};
enum {
diff -ruN --exclude=*~ --exclude=*.orig linux-2.4.20-local/include/net/tcp.h linux-2.4.20-localx/include/net/tcp.h
--- linux-2.4.20-local/include/net/tcp.h Fri Feb 21 16:19:41 2003
+++ linux-2.4.20-localx/include/net/tcp.h Fri Feb 21 16:29:01 2003
@@ -453,6 +453,7 @@
extern int sysctl_tcp_fack;
extern int sysctl_tcp_reordering;
extern int sysctl_tcp_ecn;
+extern int sysctl_tcp_ecn_retries;
extern int sysctl_tcp_dsack;
extern int sysctl_tcp_mem[3];
extern int sysctl_tcp_wmem[3];
diff -ruN --exclude=*~ --exclude=*.orig linux-2.4.20-local/include/net/tcp_ecn.h linux-2.4.20-localx/include/net/tcp_ecn.h
--- linux-2.4.20-local/include/net/tcp_ecn.h Fri Nov 2 17:43:26 2001
+++ linux-2.4.20-localx/include/net/tcp_ecn.h Fri Feb 21 16:29:01 2003
@@ -38,6 +38,12 @@
}
static __inline__ void
+TCP_ECN_noecn_syn(struct sk_buff *skb)
+{
+ TCP_SKB_CB(skb)->flags &= ~(TCPCB_FLAG_ECE|TCPCB_FLAG_CWR);
+}
+
+static __inline__ void
TCP_ECN_make_synack(struct open_request *req, struct tcphdr *th)
{
if (req->ecn_ok)
diff -ruN --exclude=*~ --exclude=*.orig linux-2.4.20-local/net/ipv4/sysctl_net_ipv4.c linux-2.4.20-localx/net/ipv4/sysctl_net_ipv4.c
--- linux-2.4.20-local/net/ipv4/sysctl_net_ipv4.c Thu Sep 12 12:19:11 2002
+++ linux-2.4.20-localx/net/ipv4/sysctl_net_ipv4.c Fri Feb 21 16:29:01 2003
@@ -203,6 +203,8 @@
&sysctl_tcp_reordering, sizeof(int), 0644, NULL, &proc_dointvec},
{NET_TCP_ECN, "tcp_ecn",
&sysctl_tcp_ecn, sizeof(int), 0644, NULL, &proc_dointvec},
+ {NET_IPV4_TCP_ECN_RETRIES, "tcp_ecn_retries",
+ &sysctl_tcp_ecn_retries, sizeof(int), 0644, NULL, &proc_dointvec},
{NET_TCP_DSACK, "tcp_dsack",
&sysctl_tcp_dsack, sizeof(int), 0644, NULL, &proc_dointvec},
{NET_TCP_MEM, "tcp_mem",
diff -ruN --exclude=*~ --exclude=*.orig linux-2.4.20-local/net/ipv4/tcp_timer.c linux-2.4.20-localx/net/ipv4/tcp_timer.c
--- linux-2.4.20-local/net/ipv4/tcp_timer.c Mon Oct 1 09:19:57 2001
+++ linux-2.4.20-localx/net/ipv4/tcp_timer.c Fri Feb 21 16:29:01 2003
@@ -30,6 +30,7 @@
int sysctl_tcp_retries1 = TCP_RETR1;
int sysctl_tcp_retries2 = TCP_RETR2;
int sysctl_tcp_orphan_retries;
+int sysctl_tcp_ecn_retries;
static void tcp_write_timer(unsigned long);
static void tcp_delack_timer(unsigned long);
@@ -373,6 +374,11 @@
}
tcp_enter_loss(sk, 0);
+
+ /* If this is a SYN packet, retry with ECN disabled */
+ if (sk->state == TCP_SYN_SENT
+ && sysctl_tcp_ecn_retries && tp->retransmits+1 >= sysctl_tcp_ecn_retries)
+ TCP_ECN_noecn_syn(skb_peek(&sk->write_queue));
if (tcp_retransmit_skb(sk, skb_peek(&sk->write_queue)) > 0) {
/* Retransmission failed because of local congestion,