2006-09-15 17:28:31

by Stuart MacDonald

Subject: TCP stack behaviour question

I'm having some trouble with a network application I've written. I've
done a lot of research the last few days; man 7 ip, man 7 tcp, kernel
2.4.31 source code, Stevens' Illustrated TCP/IP Vol 1 & 3 (for some
reason we don't have Vol 2), Usenet, websites. I'm hoping someone here
can help me out, or point me in the correct direction.

Distro: Debian 3.0 r2
Kernel: Stock 2.4.24

tcp_retries1 == 3
tcp_retries2 == 15
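
For reference, both limits are exposed as files under /proc/sys/net/ipv4/ on Linux. A minimal Python sketch for reading them follows. The helper name read_sysctl and its base parameter are illustrative only, not part of any library; the function returns None where the file is absent.

```python
from pathlib import Path

def read_sysctl(name, base="/proc/sys/net/ipv4"):
    """Return the integer value of a sysctl file, or None if it is missing."""
    path = Path(base) / name
    try:
        # Sysctl files hold the value as text followed by a newline.
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None

for name in ("tcp_retries1", "tcp_retries2"):
    print(name, "=", read_sysctl(name))
```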

I have an application that sets up a TCP connection to a server. If
the server has a power failure, TCP starts retransmitting the packet
that wasn't ACKed. I see the exponential backoff.

Question 1: There's the original packet, plus 7 retransmitted packets
for a total of 8, then TCP gives up. How is 7 (or 8) derived from the
tcp_retries[12] settings?

Question 1a: The time between last and second-last retransmit packets
is only about 27 seconds. I've read there's a maximum time, but also
that it's usually 100 or 120 seconds. Where can I find that setting in
/proc?

Question 1b: If RTO is that high, why is retransmit stopping?

Question 2: After the retransmit has given up, the app is still
making an occasional write(), which succeeds! However, tearing down
and attempting a new connection results in an immediate EHOSTUNREACH
error. Why is the write() succeeding?

Question 2a: How can my app find out the EHOSTUNREACH error
immediately? IP_RECVERR is not implemented on TCP, and SO_ERROR always
reports no error (0).

..Stu


2006-09-18 08:29:18

by Andi Kleen

Subject: Re: TCP stack behaviour question

"Stuart MacDonald" <[email protected]> writes:

> I'm having some trouble with a network application I've written. I've
> done a lot of research the last few days; man 7 ip, man 7 tcp, kernel
> 2.4.31 source code, Stevens' Illustrated TCP/IP Vol 1 & 3 (for some
> reason we don't have Vol 2), Usenet, websites. I'm hoping someone here
> can help me out, or point me in the correct direction.

Stevens is a good reference to BSD networking, but not necessarily to Linux.

> Question 1: There's the original packet, plus 7 retransmitted packets
> for a total of 8, then TCP gives up. How is 7 (or 8) derived from the
> tcp_retries[12] settings?

Documented in tcp(7)

> Question 1a: The time between last and second-last retransmit packets
> is only about 27 seconds. I've read there's a maximum time, but also
> that it's usually 100 or 120 seconds. Where can I find that setting in
> /proc?

It's fixed by the TCP specification.

> Question 2: After the retransmit has given up, the app is still
> making an occasional write(), which succeeds! However, tearing down
> and attempting a new connection results in an immediate EHOSTUNREACH
> error. Why is the write() succeeding?

It all goes into the local send buffer until that fills up; then
write() blocks (or returns EAGAIN on non-blocking sockets).
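
The buffering effect can be reproduced without any network at all. The sketch below is only an illustration: it uses a local socketpair() as a stand-in for the TCP connection (an assumption made so it runs anywhere), never reads from the peer, and keeps writing on a non-blocking socket until the kernel refuses more data. The helper name fill_send_buffer is made up for this example.

```python
import socket

def fill_send_buffer(sock, chunk=b"x" * 4096):
    """Write until the kernel buffer is full; return the bytes accepted."""
    sock.setblocking(False)
    accepted = 0
    while True:
        try:
            accepted += sock.send(chunk)
        except BlockingIOError:
            # EAGAIN/EWOULDBLOCK: the buffer is full, nothing more is queued.
            return accepted

# The reader end never calls recv(), like a peer that has lost power.
writer, reader = socket.socketpair()
accepted = fill_send_buffer(writer)
print(f"send() accepted {accepted} bytes although the peer read none")
writer.close()
reader.close()
```

Every send() before the buffer filled returned success, which is the same reason an occasional write() on a dead connection can appear to work.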

>
> Question 2a: How can my app find out the EHOSTUNREACH error
> immediately? IP_RECVERR is not implemented on TCP, and SO_ERROR always
> reports no error (0).

Did you really read the manpages? It is implemented and it's documented.

-Andi (your new temporary manpage proxy)

2006-09-18 13:21:00

by Stuart MacDonald

Subject: RE: TCP stack behaviour question

From: On Behalf Of Andi Kleen

Thanks for replying, I appreciate it.

> Stevens is a good reference to BSD networking, but not
> necessarily to Linux.

Which is why I posted here. I was wondering about the specific
implementation.

What happened was this: I had a run where I captured output with
tcpdump. My original post was based on that, and the results of the
debug output from my app. For whatever reason, it appears the stack
didn't generate all of the packets it should have. When the log showed
a second-last to last retransmit time of about 27 seconds, and then a
gap of about 400 to the very next packet of any kind, I assumed that
meant the stack had given up on the retransmits when it appears
something else was going on.

I did some digging into the kernel and on the next run found that all
the expected retransmit packets were being generated, and that once
the stack gives up on the retransmits then system calls return errors.

> > Question 2a: How can my app find out the EHOSTUNREACH error
> > immediately? IP_RECVERR is not implemented on TCP, and SO_ERROR always
> > reports no error (0).
>
> Did you really read the manpages? It is implemented and it's
> documented.

Yes I did, and no it's not, according to the man page. I quote:
# man 7 ip
..
Note that TCP has no error queue; MSG_ERRQUEUE is
illegal on SOCK_STREAM sockets. Thus all errors are returned by
socket function return or SO_ERROR only.

Maybe the man page is wrong? That's from my FC 3 install.

..Stu

2006-09-18 13:54:22

by Andi Kleen

Subject: Re: TCP stack behaviour question


> # man 7 ip
> ..
> Note that TCP has no error queue; MSG_ERRQUEUE is
> illegal on SOCK_STREAM sockets. Thus all errors are returned by
> socket function return or SO_ERROR only.
>
> Maybe the man page is wrong? That's from my FC 3 install.

The sentence is correct, but TCP has an IP_RECVERR that works
differently without a queue. Basically it doesn't delay the error
reporting for incoming ICMPs to the last retransmit, but reports
them immediately. This is documented in tcp(7)
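
A minimal sketch of turning the option on for a TCP socket, in Python on Linux. The fallback value 11 is IP_RECVERR as defined in <linux/in.h> and is used only if the socket module does not export the constant; that value is an assumption about the platform. The sketch shows only how to set the option and poll SO_ERROR, not how or when the kernel delivers an ICMP error.

```python
import socket

# Assumption: Linux. 11 is IP_RECVERR from <linux/in.h>.
IP_RECVERR = getattr(socket, "IP_RECVERR", 11)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)

# Read the option back, then fetch (and clear) any pending socket error.
enabled = sock.getsockopt(socket.IPPROTO_IP, IP_RECVERR)
pending = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("IP_RECVERR enabled:", bool(enabled), "pending error:", pending)
sock.close()
```

On a connected socket the same SO_ERROR query, or the return value of the next send()/recv(), is where such an error would surface.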

-Andi

2006-09-18 14:19:34

by Stuart MacDonald

Subject: RE: TCP stack behaviour question

From: Andi Kleen [mailto:[email protected]]
> > # man 7 ip
> > ..
> > Note that TCP has no error queue; MSG_ERRQUEUE is
> > illegal on SOCK_STREAM sockets. Thus all errors are returned by
> > socket function return or SO_ERROR only.
> >
> > Maybe the man page is wrong? That's from my FC 3 install.
>
> The sentence is correct, but TCP has an IP_RECVERR that works
> differently without a queue. Basically it doesn't delay the error
> reporting for incoming ICMPs to the last retransmit, but reports
> them immediately. This is documented in tcp(7)

I read that too, but didn't know which one was correct, so I erred on
the side of caution and believed ip(7).

Good to know, I'll look into that. Thanks.

..Stu

2006-09-18 14:31:16

by Andi Kleen

Subject: Re: TCP stack behaviour question

On Monday 18 September 2006 16:19, Stuart MacDonald wrote:
> From: Andi Kleen [mailto:[email protected]]
> > > # man 7 ip
> > > ..
> > > Note that TCP has no error queue; MSG_ERRQUEUE is
> > > illegal on SOCK_STREAM sockets. Thus all errors are returned by
> > > socket function return or SO_ERROR only.
> > >
> > > Maybe the man page is wrong? That's from my FC 3 install.
> >
> > The sentence is correct, but TCP has an IP_RECVERR that works
> > differently without a queue. Basically it doesn't delay the error
> > reporting for incoming ICMPs to the last retransmit, but reports
> > them immediately. This is documented in tcp(7)
>
> I read that too, but didn't know which one was correct, so I erred on
> the side of caution and believed ip(7).

Ok maybe it's a bit misleading. Michael, you might want to clarify.

-Andi

2006-09-18 15:38:44

by Michael Kerrisk

Subject: Re: TCP stack behaviour question

From: Andi Kleen <[email protected]>
> On Monday 18 September 2006 16:19, Stuart MacDonald wrote:
> > From: Andi Kleen [mailto:[email protected]]
> > > > # man 7 ip
> > > > ..
> > > > Note that TCP has no error queue; MSG_ERRQUEUE is
> > > > illegal on SOCK_STREAM sockets. Thus all errors are returned by
> > > > socket function return or SO_ERROR only.
> > > >
> > > > Maybe the man page is wrong? That's from my FC 3 install.
> > >
> > > The sentence is correct, but TCP has an IP_RECVERR that works
> > > differently without a queue. Basically it doesn't delay the error
> > > reporting for incoming ICMPs to the last retransmit, but reports
> > > them immediately. This is documented in tcp(7)
> >
> > I read that too, but didn't know which one was correct, so I erred on
> > the side of caution and believed ip(7).
>
> Ok maybe it's a bit misleading. Michael, you might want to clarify.

Can some one of you propose a better text please?

Cheers,

Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7

Want to help with man page maintenance?
Grab the latest tarball at
ftp://ftp.win.tue.nl/pub/linux-local/manpages/,
read the HOWTOHELP file and grep the source
files for 'FIXME'.

2006-09-18 17:01:40

by Stuart MacDonald

Subject: RE: TCP stack behaviour question

From: Michael Kerrisk [mailto:[email protected]]
> From: Andi Kleen <[email protected]>
> > On Monday 18 September 2006 16:19, Stuart MacDonald wrote:
> > > From: Andi Kleen [mailto:[email protected]]
> > > > > # man 7 ip
> > > > > ..
> > > > > Note that TCP has no error queue; MSG_ERRQUEUE is
> > > > > illegal on SOCK_STREAM sockets. Thus all errors are returned by
> > > > > socket function return or SO_ERROR only.
> > > > >
> > > > > Maybe the man page is wrong? That's from my FC 3 install.
> > > >
> > > > The sentence is correct, but TCP has an IP_RECVERR that works
> > > > differently without a queue. Basically it doesn't delay the error
> > > > reporting for incoming ICMPs to the last retransmit, but reports
> > > > them immediately. This is documented in tcp(7)
> > >
> > > I read that too, but didn't know which one was correct, so I erred on
> > > the side of caution and believed ip(7).
> >
> > Ok maybe it's a bit misleading. Michael, you might want to clarify.
>
> Can some one of you propose a better text please?

Perhaps

Note that TCP has no error queue; MSG_ERRQUEUE is illegal on
SOCK_STREAM sockets. IP_RECVERR is valid for TCP, but all errors are
returned by socket function return or SO_ERROR only.

?

..Stu

2006-09-18 18:29:40

by Stuart MacDonald

Subject: RE: TCP stack behaviour question

From: Stuart MacDonald [mailto:[email protected]]
> What happened was this: I had a run where I captured output with
> tcpdump. My original post was based on that, and the results of the
> debug output from my app. For whatever reason, it appears the stack
> didn't generate all of the packets it should have. When the log showed
> a second-last to last retransmit time of about 27 seconds, and then a
> gap of about 400 to the very next packet of any kind, I assumed that
> meant the stack had given up on the retransmits when it appears
> something else was going on.

I did another run and confirmed this. The tcpdump capture shows that
seven retransmits are sent, obeying the exponential backoff. Then
something odd happens. Instead of the 8th retransmit at 7th + 26.88
seconds, there is an arp at 7th + 4.159722 seconds. There are three
arps in fact, each one second apart and directed to the MAC of the
powered-off machine. After this there are further arps (in groups of
three one second apart), but they are broadcast and have a backoff
schedule.

The kernel debugging shows that tcp_write_timeout() and
tcp_retransmit_timer() are still being called though, right up to what
would be the 16th retransmit.

I suppose that the TCP retransmits aren't being sent because the
ethernet and/or IP layers don't know what's going on, which is what's
producing the arps. Is that correct? Is that documented anywhere?

I was expecting to see all 15 retransmits, and was confused when I
didn't see them.
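
As a rough sanity check on the expected schedule, the sketch below doubles the timeout after every retransmit and caps it at a maximum. Both inputs are assumptions for illustration rather than values read from this kernel: the 0.21 s initial RTO is inferred from the 26.88 s figure above (26.88 / 2^7), and the 120 s cap is the larger of the two maximums mentioned in Question 1a.

```python
def backoff_intervals(initial_rto, retries, rto_max):
    """Gap in seconds between each retransmit and the packet before it."""
    return [min(initial_rto * 2 ** n, rto_max) for n in range(retries)]

# Assumed inputs: 0.21 s initial RTO, tcp_retries2 == 15, 120 s cap.
intervals = backoff_intervals(initial_rto=0.21, retries=15, rto_max=120.0)

for n, gap in enumerate(intervals, start=1):
    print(f"retransmit {n:2d} follows the previous packet by {gap:7.2f} s")
print(f"original packet to retransmit 15: {sum(intervals):.2f} s")
```

Under these assumptions the gap before the 8th retransmit is the 26.88 s quoted above, and the doubling stops once it reaches the cap.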

..Stu

2006-09-19 06:13:38

by Michael Kerrisk

Subject: Re: RE: TCP stack behaviour question

From: "Stuart MacDonald" <[email protected]>

> > > Ok maybe it's a bit misleading. Michael, you might want to clarify.
> >
> > Can some one of you propose a better text please?
>
> Perhaps
>
> Note that TCP has no error queue; MSG_ERRQUEUE is illegal on
> SOCK_STREAM sockets. IP_RECVERR is valid for TCP, but all errors are
> returned by socket function return or SO_ERROR only.
>
> ?

Sound okay to you Andi?

Cheers,

Michael

2006-09-19 06:47:16

by Andi Kleen

Subject: Re: TCP stack behaviour question

On Tuesday 19 September 2006 08:13, Michael Kerrisk wrote:
> From: "Stuart MacDonald" <[email protected]>
>
> > > > Ok maybe it's a bit misleading. Michael, you might want to clarify.
> > >
> > > Can some one of you propose a better text please?
> >
> > Perhaps
> >
> > Note that TCP has no error queue; MSG_ERRQUEUE is illegal on
> > SOCK_STREAM sockets. IP_RECVERR is valid for TCP, but all errors are
> > returned by socket function return or SO_ERROR only.
> >
> > ?
>
> Sound okay to you Andi?

Yes.
-Andi

2006-09-19 12:06:07

by Samuel Tardieu

Subject: Re: TCP stack behaviour question

>>>>> "Stuart" == Stuart MacDonald <[email protected]> writes:

Stuart> I suppose that the TCP retransmits aren't being sent because
Stuart> the ethernet and/or IP layers don't know what's going on,
Stuart> which is what's producing the arps. Is that correct?

It seems correct. You cannot expect TCP packets to be sent if the
target is supposedly on a directly connected network and ARP cannot
get its MAC address. What should the IP layer put as the MAC address
if it is unknown?

You may want to run another test with another unreachable target
located after a router, so that the MAC address of the router is used
on the wire. You should see all the TCP retransmits you expect to see.

Sam
--
Samuel Tardieu -- [email protected] -- http://www.rfc1149.net/

2006-09-19 14:01:08

by Stuart MacDonald

Subject: RE: TCP stack behaviour question

From: On Behalf Of Samuel Tardieu
> >>>>> "Stuart" == Stuart MacDonald <[email protected]> writes:
> Stuart> I suppose that the TCP retransmits aren't being sent because
> Stuart> the ethernet and/or IP layers don't know what's going on,
> Stuart> which is what's producing the arps. Is that correct?
>
> It seems correct. You cannot expect TCP packets to be sent if the
> target is supposedly on a directly connected network and ARP cannot
> get its MAC address. What should the IP layer put as the MAC address
> if it is unknown?

I'd agree, but the MAC is/was known. Only part way through the
sequence of retransmits (with MAC filled in) do the arps start. Also,
tellingly maybe, the first three arps are sent directly to the
unreachable MAC. Only after those are unanswered are broadcast arps
attempted.

Ah, interesting, see below. ***

> You may want to run another test with another unreachable target
> located after a router, so that the MAC address of the router is used
> on the wire. You should see all the TCP retransmits you expect to see.

I'll try that if I have time, but I agree, I expect to see all the
retransmits in that case.


*** I found arp(7) and read up on it while I was typing. And now I see
something interesting in the tcpdump; my app is actually talking on
two TCP connections at the same time. Both are in retransmit phase,
and the first arp is 5 seconds (delay_first_probe_time) after an
_aggregate total_ of 15 retransmits (being the two original unanswered
packets and 7 and 6 retransmits of each).

My reading of tcp(7)'s documentation of tcp_retries2 is that
tcp_retries2 is a per-TCP packet count. My tcpdump seems to show that
it is in fact a global count. Which is correct?

..Stu

2006-09-19 14:50:26

by Michael Kerrisk

Subject: Re: TCP stack behaviour question

> Perhaps
>
> Note that TCP has no error queue; MSG_ERRQUEUE is illegal on
> SOCK_STREAM sockets. IP_RECVERR is valid for TCP, but all errors are
> returned by socket function return or SO_ERROR only.

Stuart -- thanks, added for man-pages-2.41.

Interestingly, at this point in the man pages source there
is the following commented out text:

.\" FIXME . Is it a good idea to document that? It is a dubious feature.
.\" On
.\" .B SOCK_STREAM
.\" sockets,
.\" .I IP_RECVERR
.\" has slightly different semantics. Instead of
.\" saving the errors for the next timeout, it passes all incoming
.\" errors immediately to the user.
.\" This might be useful for very short-lived TCP connections which
.\" need fast error handling. Use this option with care:
.\" it makes TCP unreliable
.\" by not allowing it to recover properly from routing
.\" shifts and other normal
.\" conditions and breaks the protocol specification.

Cheers,

Michael

2006-09-20 09:54:51

by Andi Kleen

Subject: Re: TCP stack behaviour question

"Stuart MacDonald" <[email protected]> writes:
>
>
> *** I found arp(7) and read up on it while I was typing. And now I see
> something interesting in the tcpdump; my app is actually talking on
> two TCP connections at the same time. Both are in retransmit phase,
> and the first arp is 5 seconds (delay_first_probe_time) after an
> _aggregate total_ of 15 retransmits (being the two original unanswered
> packets and 7 and 6 retransmits of each).
>
> My reading of tcp(7)'s documentation of tcp_retries2 is that
> tcp_retries2 is a per-TCP packet count. My tcpdump seems to show that
> it is in fact a global count. Which is correct?

The ARP layer keeps track of which neighbours are reachable and doesn't
transmit packets to unreachable ones until they answer a unicast or
broadcast ARP request. This is a state machine borrowed from IPv6.

There is nothing global.
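
A much-simplified sketch of the probe sequence that matches the capture described earlier in the thread: a delay, then unicast probes to the last known MAC, then broadcast probes. The parameter names mirror the neighbour sysctls described in arp(7), and the default values are as I understand them from that page; the function itself is hypothetical and leaves out most of the kernel's real neighbour states and timers.

```python
def arp_probe_timeline(delay_first_probe_time=5.0, ucast_solicit=3,
                       mcast_solicit=3, retrans_time=1.0):
    """Return (seconds, kind) pairs for probes sent to a stale neighbour."""
    events = []
    t = delay_first_probe_time          # wait before the first probe
    for _ in range(ucast_solicit):      # probes aimed at the cached MAC
        events.append((t, "unicast"))
        t += retrans_time
    for _ in range(mcast_solicit):      # then fall back to broadcast
        events.append((t, "broadcast"))
        t += retrans_time
    return events

timeline = arp_probe_timeline()
for when, kind in timeline:
    print(f"t+{when:4.1f} s  {kind} ARP request")
```

Each neighbour entry runs this sequence independently of how many TCP connections are using it.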

-Andi

2006-09-20 09:56:06

by Andi Kleen

Subject: Re: TCP stack behaviour question


> Interestingly, at this point in the man pages source there
> is the following commented out text:

Yes, that was me. On second thought, I suppose I was right back then: this
feature is too dubious to be documented. Better to keep it
undocumented and drop the change.

-Andi

>
> .\" FIXME . Is it a good idea to document that? It is a dubious feature.
> .\" On
> .\" .B SOCK_STREAM
> .\" sockets,
> .\" .I IP_RECVERR
> .\" has slightly different semantics. Instead of
> .\" saving the errors for the next timeout, it passes all incoming
> .\" errors immediately to the user.
> .\" This might be useful for very short-lived TCP connections which
> .\" need fast error handling. Use this option with care:
> .\" it makes TCP unreliable
> .\" by not allowing it to recover properly from routing
> .\" shifts and other normal
> .\" conditions and breaks the protocol specification.
>
> Cheers,
>
> Michael