2008-06-03 09:42:33

by Ingo Molnar

[permalink] [raw]
Subject: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


* Ingo Molnar <[email protected]> wrote:

> > ...setsockopt(listenfd, SOL_TCP, TCP_DEFER_ACCEPT, &val,
> > sizeof(val)) seems to be the magic trick that is interestion here.
>
> seems to be used:
>
> 22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
> 22003 listen(4, 10) = 0
> 22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
>
> i'll queue up your reverts for testing in -tip.

update: your 3 reverts in tip/out-of-tree [commit dad98991c] definitely
fixed the hangs!

Here is the testing i did:

first i ran about 500+ successful iterations on the affected testboxes
with your revert patch applied, on multiple systems. Then today, without
changing anything else on one of the testsystems i reverted your revert
on that single system. After about an hour of testing, in 20 iterations
i got a hang again over localhost:

titan:~> netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 174592 10.0.1.14:34710 10.0.1.14:3632 ESTABLISHED
tcp 72145 0 10.0.1.14:3632 10.0.1.14:34710 ESTABLISHED

so i hereby conclude that your revert works :) I've repeated the commit
below that resolves this nasty regression.

phew ...

Ingo

---------------->
commit dad98991c137c089d64a3057dddc3a12bd6866b0
Author: Ilpo J?rvinen <[email protected]>
Date: Sat May 31 15:18:56 2008 +0300

tcp: revert DEFER_ACCEPT modifications

On Sat, 31 May 2008, Ilpo J?rvinen wrote:

> On Sat, 31 May 2008, Ingo Molnar wrote:
>
> > ok, once i added back the localhost distcc component and the hung kernel
> > build + stuck TCP socket bug happened again overnight:
> >
> > Active Internet connections (w/o servers)
> > Proto Recv-Q Send-Q Local Address Foreign Address State
> > tcp 72187 0 10.0.1.14:3632 10.0.1.14:47910 ESTABLISHED
> > tcp 0 174464 10.0.1.14:47910 10.0.1.14:3632 ESTABLISHED
> >
> > so it seems distcc over localhost was the aspect that made it fail.
> >
> > _Perhaps_ what matters is to have the new post-rc3 TCP code on _both_
> > sides of the connection. But that is just a theory - it could be timing,
> > etc.
>
> Btw, does your distcc perhaps happen enable TCP_DEFER_ACCEPT (there were
> some post 2.6.25 changes into it)?

Hmm, if so, please try this revert below (I hope I got everything that is
related there, gitk is currently broken for me so that viewing many
commits quickly is a pain, so I might have accidently missed something)...

Reverts these commits:
[TCP]: TCP_DEFER_ACCEPT updates - defer timeout conflicts with max_t
[TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack
[TCP]: TCP_DEFER_ACCEPT updates - process as established
tcp: Fix slab corruption with ipv6 and tcp6fuzz

..., ie.,
539fae89bebd16ebeafd57a87169bc56eb530d76,
e4c78840284f3f51b1896cf3936d60a6033c4d2c,
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 and
9ae27e0adbf471c7a6b80102e38e1d5a346b3b38.

Signed-off-by: Ilpo J?rvinen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 18e62e3..b31b6b7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -239,11 +239,6 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}

-struct tcp_deferred_accept_info {
- struct sock *listen_sk;
- struct request_sock *request;
-};
-
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -379,8 +374,6 @@ struct tcp_sock {
unsigned int keepalive_intvl; /* time interval between keep alive probes */
int linger2;

- struct tcp_deferred_accept_info defer_tcp_accept;
-
unsigned long last_synq_overflow;

u32 tso_deferred;
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b220b5f..0c96e7b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -115,8 +115,8 @@ struct request_sock_queue {
struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
rwlock_t syn_wait_lock;
- u16 rskq_defer_accept;
- /* 2 bytes hole, try to pack */
+ u8 rskq_defer_accept;
+ /* 3 bytes hole, try to pack */
struct listen_sock *listen_opt;
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 633147c..981d5ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -139,7 +139,6 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define MAX_TCP_KEEPINTVL 32767
#define MAX_TCP_KEEPCNT 127
#define MAX_TCP_SYNCNT 127
-#define MAX_TCP_ACCEPT_DEFERRED 65535

#define TCP_SYNQ_INTERVAL (HZ/5) /* Period of SYNACK timer */

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 828ea21..ec83448 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -419,7 +419,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
struct inet_connection_sock *icsk = inet_csk(parent);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
struct listen_sock *lopt = queue->listen_opt;
- int thresh = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int thresh = max_retries;
unsigned long now = jiffies;
struct request_sock **reqp, *req;
int i, budget;
@@ -455,6 +456,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}

+ if (queue->rskq_defer_accept)
+ max_retries = queue->rskq_defer_accept;
+
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;

@@ -462,8 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if (req->retrans < thresh &&
- !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ if ((req->retrans < thresh ||
+ (inet_rsk(req)->acked && req->retrans < max_retries))
+ && !req->rsk_ops->rtx_syn_ack(parent, req)) {
unsigned long timeo;

if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f886531..a0deead 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2105,12 +2105,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;

case TCP_DEFER_ACCEPT:
- if (val < 0) {
- err = -EINVAL;
- } else {
- if (val > MAX_TCP_ACCEPT_DEFERRED)
- val = MAX_TCP_ACCEPT_DEFERRED;
- icsk->icsk_accept_queue.rskq_defer_accept = val;
+ icsk->icsk_accept_queue.rskq_defer_accept = 0;
+ if (val > 0) {
+ /* Translate value in seconds to number of
+ * retransmits */
+ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
+ val > ((TCP_TIMEOUT_INIT / HZ) <<
+ icsk->icsk_accept_queue.rskq_defer_accept))
+ icsk->icsk_accept_queue.rskq_defer_accept++;
+ icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;

@@ -2292,7 +2295,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
val = (val ? : sysctl_tcp_fin_timeout) / HZ;
break;
case TCP_DEFER_ACCEPT:
- val = icsk->icsk_accept_queue.rskq_defer_accept;
+ val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
+ ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b54d9d3..ddc2e89 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4532,49 +4532,6 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
}
}

-static int tcp_defer_accept_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (tp->defer_tcp_accept.request) {
- int queued_data = tp->rcv_nxt - tp->copied_seq;
- int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
- tcp_hdr((struct sk_buff *)
- sk->sk_receive_queue.prev)->fin : 0;
-
- if (queued_data && hasfin)
- queued_data--;
-
- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
- if (sock_flag(sk, SOCK_KEEPOPEN)) {
- inet_csk_reset_keepalive_timer(sk,
- keepalive_time_when(tp));
- } else {
- inet_csk_delete_keepalive_timer(sk);
- }
-
- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
-
- tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
-
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
- tcp_reset(sk);
- return -1;
- }
- }
- return 0;
-}
-
static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -4935,8 +4892,6 @@ step5:

tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
-
- tcp_defer_accept_check(sk);
return 0;

csum_error:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cd601a8..b698b5b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1918,14 +1918,6 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- if (tp->defer_tcp_accept.request) {
- reqsk_free(tp->defer_tcp_accept.request);
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- }
-
atomic_dec(&tcp_sockets_allocated);

return 0;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 019c8c1..8245247 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -571,8 +571,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
does sequence test, SYN is truncated, and thus we consider
it a bare ACK.

- Both ends (listening sockets) accept the new incoming
- connection and try to talk to each other. 8-)
+ If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
+ bare ACK. Otherwise, we create an established connection. Both
+ ends (listening sockets) accept the new incoming connection and try
+ to talk to each other. 8-)

Note: This case is both harmless, and rare. Possibility is about the
same as us discovering intelligent life on another plant tomorrow.
@@ -640,6 +642,13 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
if (!(flg & TCP_FLAG_ACK))
return NULL;

+ /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
+ if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+ TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
+ inet_rsk(req)->acked = 1;
+ return NULL;
+ }
+
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
@@ -678,24 +687,7 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
inet_csk_reqsk_queue_unlink(sk, req, prev);
inet_csk_reqsk_queue_removed(sk, req);

- if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
- TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-
- /* the accept queue handling is done is est recv slow
- * path so lets make sure to start there
- */
- tcp_sk(child)->pred_flags = 0;
- sock_hold(sk);
- sock_hold(child);
- tcp_sk(child)->defer_tcp_accept.listen_sk = sk;
- tcp_sk(child)->defer_tcp_accept.request = req;
-
- inet_csk_reset_keepalive_timer(child,
- inet_csk(sk)->icsk_accept_queue.rskq_defer_accept * HZ);
- } else {
- inet_csk_reqsk_queue_add(sk, req, child);
- }
-
+ inet_csk_reqsk_queue_add(sk, req, child);
return child;

listen_overflow:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4de68cf..63ed9d6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -489,11 +489,6 @@ static void tcp_keepalive_timer (unsigned long data)
goto death;
}

- if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
- tcp_send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
goto out;


2008-06-03 14:41:06

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 2008-06-03 at 11:40 +0200, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > > ...setsockopt(listenfd, SOL_TCP, TCP_DEFER_ACCEPT, &val,
> > > sizeof(val)) seems to be the magic trick that is interestion here.
> >
> > seems to be used:
> >
> > 22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
> > 22003 listen(4, 10) = 0
> > 22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
> >
> > i'll queue up your reverts for testing in -tip.
>
> update: your 3 reverts in tip/out-of-tree [commit dad98991c] definitely
> fixed the hangs!
>
> Here is the testing i did:


hm.

Here's a theory - the DA code put the socket in the accept queue after
data has been queued on the socket, instead of the other way around
which is normal.

Could this be confusing the select() mechanism?

-Pat







2008-06-03 21:47:14

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 3 Jun 2008, Ingo Molnar wrote:

> * Ingo Molnar <[email protected]> wrote:
>
> > > ...setsockopt(listenfd, SOL_TCP, TCP_DEFER_ACCEPT, &val,
> > > sizeof(val)) seems to be the magic trick that is interestion here.
> >
> > seems to be used:
> >
> > 22003 write(3, "distccd[22003] (dcc_listen_by_ad"..., 62) = 62
> > 22003 listen(4, 10) = 0
> > 22003 setsockopt(4, SOL_TCP, TCP_DEFER_ACCEPT, [1], 4) = 0
> >
> > i'll queue up your reverts for testing in -tip.
>
> update: your 3 reverts in tip/out-of-tree [commit dad98991c] definitely
> fixed the hangs!

...It wasn't exactly out-of-tree, Evgeniy fixed a problem that was found
in "TCP_DEFER_ACCEPT updates - process as established", perhaps it just
wasn't in your testing tree yet.

$ git-log -n 1 --pretty=full 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38 |
grep "Commit:"
Commit: David S. Miller <[email protected]>

> Here is the testing i did:
>
> first i ran about 500+ successful iterations on the affected testboxes
> with your revert patch applied, on multiple systems.

Are you sure this is enough to conclude the results? Seems quite small
number to me to rule out luck. Especially considering that it was some
amount of time in the tree already until you noticed it for the first
time.

Anyway, nice that it seems to be helping. It was almost the only
possibility on TCP side, I don't think there were any other state machine
related changes. So it wasn't just "random revert" in that sense like you
were implying :-), I just didn't have any theory how it would cause the
problem... ...I even first disregarded DA that because of timeline
in-exactness and because I wrongly assumed that distcc probably won't use
it anyway, but then I checked later on and found out that it was present
at least in the source I had lying around.

Anyway, it might be that the revert was a bit overkill, I'm not fully sure
if 539fae and e4c7884 need to be reverted to fix it since main changes
are in ec3c098. I just didn't want to take chances at first and put them
all to the revert list.

> Then today, without
> changing anything else on one of the testsystems i reverted your revert
> on that single system. After about an hour of testing, in 20 iterations
> i got a hang again over localhost:
>
> titan:~> netstat -nt
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 174592 10.0.1.14:34710 10.0.1.14:3632 ESTABLISHED
> tcp 72145 0 10.0.1.14:3632 10.0.1.14:34710 ESTABLISHED
>
> so i hereby conclude that your revert works :) I've repeated the commit
> below that resolves this nasty regression.

...I couldn't immediately find anything obviously wrong with those changes
but the patch below might be worth of a try (without the revert of
course). If it ever spits out that WARN_ON for you, we were playing with
fire too much and it's better to return on the safe side there...

--
i.

[PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on

If header prediction is turned on under some circumstances,
DA can deadlock though I have great trouble in figuring out
how it could ever happen while ending up into that else
branch (but I've been wrong before as well :-)).

Signed-off-by: Ilpo J?rvinen <[email protected]>
---
net/ipv4/tcp_input.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c9454f0..0d9a3fe 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4595,6 +4595,9 @@ static int tcp_defer_accept_check(struct sock *sk)
tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
tcp_reset(sk);
return -1;
+ } else {
+ WARN_ON(tp->pred_flags);
+ tp->pred_flags = 0;
}
}
return 0;
--
1.5.2.2

2008-06-03 22:01:41

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Wed, 4 Jun 2008, Ilpo J?rvinen wrote:

> ...I couldn't immediately find anything obviously wrong with those changes
> but the patch below might be worth of a try (without the revert of
> course). If it ever spits out that WARN_ON for you, we were playing with
> fire too much and it's better to return on the safe side there...


> [PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on
>
> If header prediction is turned on under some circumstances,
> DA can deadlock though I have great trouble in figuring out

...Nah, keepalive timer would then eventually kill it then, so no
deadlock seems possible through that one.

--
i.

2008-06-03 22:03:58

by David Miller

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: "Ilpo_J?rvinen" <[email protected]>
Date: Wed, 4 Jun 2008 01:01:25 +0300 (EEST)

> On Wed, 4 Jun 2008, Ilpo J?rvinen wrote:
>
> > ...I couldn't immediately find anything obviously wrong with those changes
> > but the patch below might be worth of a try (without the revert of
> > course). If it ever spits out that WARN_ON for you, we were playing with
> > fire too much and it's better to return on the safe side there...
>
>
> > [PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on
> >
> > If header prediction is turned on under some circumstances,
> > DA can deadlock though I have great trouble in figuring out
>
> ...Nah, keepalive timer would then eventually kill it then, so no
> deadlock seems possible through that one.

Keepalive is very long, it might still "seem" like a deadlock for
someone without much patience :-)

2008-06-03 22:10:23

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 3 Jun 2008, David Miller wrote:

> From: "Ilpo_J?rvinen" <[email protected]>
> Date: Wed, 4 Jun 2008 01:01:25 +0300 (EEST)
>
> > On Wed, 4 Jun 2008, Ilpo J?rvinen wrote:
> >
> > > ...I couldn't immediately find anything obviously wrong with those changes
> > > but the patch below might be worth of a try (without the revert of
> > > course). If it ever spits out that WARN_ON for you, we were playing with
> > > fire too much and it's better to return on the safe side there...
> >
> >
> > > [PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on
> > >
> > > If header prediction is turned on under some circumstances,
> > > DA can deadlock though I have great trouble in figuring out
> >
> > ...Nah, keepalive timer would then eventually kill it then, so no
> > deadlock seems possible through that one.
>
> Keepalive is very long, it might still "seem" like a deadlock for
> someone without much patience :-)

Sure, but I think Ingo kept it once hang for hours, ~8h or so...
...Though I'm not sure if that qualifies as a proof of Ingo's patience
but rather shows his willingness to help solving the matter... ;-)

--
i.

2008-06-03 23:22:39

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 3 Jun 2008, David Miller wrote:

> From: "Ilpo_J?rvinen" <[email protected]>
> Date: Wed, 4 Jun 2008 01:01:25 +0300 (EEST)
>
> > On Wed, 4 Jun 2008, Ilpo J?rvinen wrote:
> >
> > > ...I couldn't immediately find anything obviously wrong with those changes
> > > but the patch below might be worth of a try (without the revert of
> > > course). If it ever spits out that WARN_ON for you, we were playing with
> > > fire too much and it's better to return on the safe side there...
> >
> >
> > > [PATCH] tcp DEFER_ACCEPT: see if header prediction got turned on
> > >
> > > If header prediction is turned on under some circumstances,
> > > DA can deadlock though I have great trouble in figuring out
> >
> > ...Nah, keepalive timer would then eventually kill it then, so no
> > deadlock seems possible through that one.
>
> Keepalive is very long, it might still "seem" like a deadlock for
> someone without much patience :-)

I think we want that clearing there, it's better to be safe than sorry
there and to not put any trust on the keepalive thingie which tears down
rather than results in a connection.

But here's somewhat more likely explanation... Only compile tested...
It probably needs some commenting from people who understand locking
variants & details (I don't).


--
i.


--
[PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk

It seems that replacement of DA code also moved parts outside
of appropriate locking. The Ingo's problem seems to come from
the fact that two flows could now race in
(inet_csk_)reqsk_queue_add corrupting the queue. ...This can
leave dangling socks around which won't resolve themselves
without stimuli from outside (e.g., external RST would help
I think).

Then some details I'm not too sure of:
I guess we want to put listen_sk->sk_state checking under the
lock as well. I've not evaluated if ->sk_data_ready too
requires locking but assumed it does.

I'm by no means familiar with all locking variants, requirements,
etc.

Signed-off-by: Ilpo J?rvinen <[email protected]>
---
net/ipv4/tcp_input.c | 23 +++++++++++++----------
1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c9454f0..d21d2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4562,6 +4562,7 @@ static int tcp_defer_accept_check(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);

if (tp->defer_tcp_accept.request) {
+ struct sock *listen_sk = tp->defer_tcp_accept.listen_sk;
int queued_data = tp->rcv_nxt - tp->copied_seq;
int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
tcp_hdr((struct sk_buff *)
@@ -4570,8 +4571,9 @@ static int tcp_defer_accept_check(struct sock *sk)
if (queued_data && hasfin)
queued_data--;

- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
+ bh_lock_sock(listen_sk);
+
+ if (queued_data && listen_sk->sk_state == TCP_LISTEN) {
if (sock_flag(sk, SOCK_KEEPOPEN)) {
inet_csk_reset_keepalive_timer(sk,
keepalive_time_when(tp));
@@ -4579,23 +4581,24 @@ static int tcp_defer_accept_check(struct sock *sk)
inet_csk_delete_keepalive_timer(sk);
}

- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
+ inet_csk_reqsk_queue_add(listen_sk,
+ tp->defer_tcp_accept.request,
+ sk);

tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
+ listen_sk, 0);

- sock_put(tp->defer_tcp_accept.listen_sk);
+ sock_put(listen_sk);
sock_put(sk);
tp->defer_tcp_accept.listen_sk = NULL;
tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
+ } else if (hasfin || listen_sk->sk_state != TCP_LISTEN) {
+ bh_unlock_sock(listen_sk);
tcp_reset(sk);
return -1;
}
+
+ bh_unlock_sock(listen_sk);
}
return 0;
}
--
1.5.2.2

2008-06-03 23:57:25

by Joe Perches

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Wed, 2008-06-04 at 02:22 +0300, Ilpo Järvinen wrote:
> But here's somewhat more likely explanation... Only compile tested...
> It probably needs some commenting from people who understand locking
> variants & details (I don't).
> Signed-off-by: Ilpo Järvinen <[email protected]>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c9454f0..d21d2b9 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4562,6 +4562,7 @@ static int tcp_defer_accept_check(struct sock *sk)
> struct tcp_sock *tp = tcp_sk(sk);
>
> if (tp->defer_tcp_accept.request) {
> + struct sock *listen_sk = tp->defer_tcp_accept.listen_sk;

Not commenting on the locking, but I think the code
would be clearer if tp->defer_tcp_accept was used in
a temporary instead.

Perhaps:

net/ipv4/tcp_input.c | 33 ++++++++++++++++-----------------
1 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b54d9d3..f846e11 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4535,18 +4535,18 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
static int tcp_defer_accept_check(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
-
- if (tp->defer_tcp_accept.request) {
- int queued_data = tp->rcv_nxt - tp->copied_seq;
- int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
+ struct tcp_deferred_accept_info *tdai = &tp->defer_tcp_accept;
+ if (tdai->request) {
+ int queued_data = tp->rcv_nxt - tp->copied_seq;
+ int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
tcp_hdr((struct sk_buff *)
sk->sk_receive_queue.prev)->fin : 0;

if (queued_data && hasfin)
queued_data--;

- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
+ bh_lock_sock(tdai->listen_sk);
+ if (queued_data && tdai->listen_sk->sk_state == TCP_LISTEN) {
if (sock_flag(sk, SOCK_KEEPOPEN)) {
inet_csk_reset_keepalive_timer(sk,
keepalive_time_when(tp));
@@ -4554,23 +4554,22 @@ static int tcp_defer_accept_check(struct sock *sk)
inet_csk_delete_keepalive_timer(sk);
}

- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
+ inet_csk_reqsk_queue_add(tdai->listen_sk,
+ tdai->request,
+ sk);

- tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
+ tdai->listen_sk->sk_data_ready(tdai->listen_sk, 0);

- sock_put(tp->defer_tcp_accept.listen_sk);
+ sock_put(tdai->listen_sk);
sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
+ tdai->listen_sk = NULL;
+ tdai->request = NULL;
+ } else if (hasfin || tdai->listen_sk->sk_state != TCP_LISTEN) {
+ bh_unlock_sock(tdai->listen_sk);
tcp_reset(sk);
return -1;
}
+ bh_unlock_sock(tdai->listen_sk);
}
return 0;
}

2008-06-04 02:54:32

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Wed, 2008-06-04 at 02:22 +0300, Ilpo Järvinen wrote:
==
> --
> [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
>
> It seems that replacement of DA code also moved parts outside
> of appropriate locking. The Ingo's problem seems to come from
> the fact that two flows could now race in
> (inet_csk_)reqsk_queue_add corrupting the queue. ...This can
> leave dangling socks around which won't resolve themselves
> without stimuli from outside (e.g., external RST would help
> I think).


Ilpo, has anyone told you today that you rock? Well allow me - you
rock.
>
do_rcv() clearly has the listening socket locked in the non-DA case, and
in the DA case it is the 'child' ESTABLISHED socket that is locked -
leaving the accept queue unprotected. So simple.

-Pat

2008-06-04 06:25:52

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 3 Jun 2008, Joe Perches wrote:

> On Wed, 2008-06-04 at 02:22 +0300, Ilpo J?rvinen wrote:
> > But here's somewhat more likely explanation... Only compile tested...
> > It probably needs some commenting from people who understand locking
> > variants & details (I don't).
> > Signed-off-by: Ilpo J?rvinen <[email protected]>
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index c9454f0..d21d2b9 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -4562,6 +4562,7 @@ static int tcp_defer_accept_check(struct sock *sk)
> > struct tcp_sock *tp = tcp_sk(sk);
> >
> > if (tp->defer_tcp_accept.request) {
> > + struct sock *listen_sk = tp->defer_tcp_accept.listen_sk;
>
> Not commenting on the locking, but I think the code
> would be clearer if tp->defer_tcp_accept was used in
> a temporary instead.

Only fixes to net-2.6 now, as agreed. ...So I try to leave syntax cleanup
to net-next.

Besides, there's a trap which makes your patch a lot worse than
mine... ;-) I too once tried a version without temporary until
noticed that trap.

> Perhaps:
>
> net/ipv4/tcp_input.c | 33 ++++++++++++++++-----------------
> 1 files changed, 16 insertions(+), 17 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b54d9d3..f846e11 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4535,18 +4535,18 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
> static int tcp_defer_accept_check(struct sock *sk)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> -
> - if (tp->defer_tcp_accept.request) {
> - int queued_data = tp->rcv_nxt - tp->copied_seq;
> - int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
> + struct tcp_deferred_accept_info *tdai = &tp->defer_tcp_accept;
> + if (tdai->request) {
> + int queued_data = tp->rcv_nxt - tp->copied_seq;
> + int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
> tcp_hdr((struct sk_buff *)
> sk->sk_receive_queue.prev)->fin : 0;
>
> if (queued_data && hasfin)
> queued_data--;
>
> - if (queued_data &&
> - tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
> + bh_lock_sock(tdai->listen_sk);
> + if (queued_data && tdai->listen_sk->sk_state == TCP_LISTEN) {
> if (sock_flag(sk, SOCK_KEEPOPEN)) {
> inet_csk_reset_keepalive_timer(sk,
> keepalive_time_when(tp));
> @@ -4554,23 +4554,22 @@ static int tcp_defer_accept_check(struct sock *sk)
> inet_csk_delete_keepalive_timer(sk);
> }
>
> - inet_csk_reqsk_queue_add(
> - tp->defer_tcp_accept.listen_sk,
> - tp->defer_tcp_accept.request,
> - sk);
> + inet_csk_reqsk_queue_add(tdai->listen_sk,
> + tdai->request,
> + sk);
>
> - tp->defer_tcp_accept.listen_sk->sk_data_ready(
> - tp->defer_tcp_accept.listen_sk, 0);
> + tdai->listen_sk->sk_data_ready(tdai->listen_sk, 0);
>
> - sock_put(tp->defer_tcp_accept.listen_sk);
> + sock_put(tdai->listen_sk);
> sock_put(sk);
> - tp->defer_tcp_accept.listen_sk = NULL;
> - tp->defer_tcp_accept.request = NULL;
> - } else if (hasfin ||
> - tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
> + tdai->listen_sk = NULL;
> + tdai->request = NULL;

uh-oh...

> + } else if (hasfin || tdai->listen_sk->sk_state != TCP_LISTEN) {
> + bh_unlock_sock(tdai->listen_sk);
> tcp_reset(sk);
> return -1;
> }
> + bh_unlock_sock(tdai->listen_sk);

...which will crash in here.

NAK.

> }
> return 0;
> }
>
>
>

--
i.

2008-06-04 06:43:19

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 3 Jun 2008, Patrick McManus wrote:

> On Wed, 2008-06-04 at 02:22 +0300, Ilpo J?rvinen wrote:
> ==
> > --
> > [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
> >
> > It seems that replacement of DA code also moved parts outside
> > of appropriate locking. The Ingo's problem seems to come from
> > the fact that two flows could now race in
> > (inet_csk_)reqsk_queue_add corrupting the queue. ...This can
> > leave dangling socks around which won't resolve themselves
> > without stimuli from outside (e.g., external RST would help
> > I think).

> do_rcv() clearly has the listening socket locked in the non-DA case, and
> in the DA case it is the 'child' ESTABLISHED socket that is locked -
> leaving the accept queue unprotected. So simple.

It also well explains why Ingo finally got the KERNEL: assertion
(!sk->sk_ack_backlog) which is adjusted in the very same reqsk_queue_add
making it and the actual queue come out of sync.

...too bad this has no relevance to H?kon's case, so more digging
is necessary... :-)

--
i.

2008-06-04 07:23:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


* Ilpo J?rvinen <[email protected]> wrote:

> > > i'll queue up your reverts for testing in -tip.
> >
> > update: your 3 reverts in tip/out-of-tree [commit dad98991c] definitely
> > fixed the hangs!
>
> ...It wasn't exactly out-of-tree, Evgeniy fixed a problem that was
> found in "TCP_DEFER_ACCEPT updates - process as established", perhaps
> it just wasn't in your testing tree yet.

out of the -tip tree :-) The -tip tree has 75+ topic branches at the
moment, but TCP topics are not in its scope - so any TCP change is "out
of tree" for the -tip tree.

People got confused in the past when they saw similar test patches show
up in sched.git and x86.git before, so we wanted to make it very clear
in -tip (with is the successor of sched.git, x86.git and a couple of
other git trees) that these are commits we dont want to push anywhere.

Commits in tip/out-of-tree dont get propagated into the tip/auto-*-next
topic branches that linux-next and -mm picks up, they are purely a
courtesy to help the testing/fixing of bugs in subsystems that are
maintained in other git trees.

See attached below the current shortlog of the tip/out-of-tree topic
branch - it contains changes all around the tree for various things that
we triggered in -tip and are not yet upstream or are in flight somewhere
in another git tree.

> > Here is the testing i did:
> >
> > first i ran about 500+ successful iterations on the affected
> > testboxes with your revert patch applied, on multiple systems.
>
> Are you sure this is enough to conclude the results? Seems quite small
> number to me to rule out luck. Especially considering that it was some
> amount of time in the tree already until you noticed it for the first
> time.

a full day of testing on a testsystem with 500 random kernel builds and
bootups (the kernel build done on the testsystem utilizing distcc and
make -j100, so it's rather heavy and parallel TCP traffic per iteration)
with no hang, compared to the same system with your reverts not applied
that hung after an hour with 20-30 iterations.

And that count increased to 1000 successful test iterations since
yesterday.

So i think yes, it seems rather conclusive, given the circumstances ;-)

These random kernel boots found many 'impossible to trigger' bugs and
races in the past. The reason for its race finding capability is the
timing randomness of the resulting random kernel image: the delays
caused by random combination of debugging facilities, build variants,
kernel subsystem variants we have. This -tip qa method - as a
side-effect of its coverage testing - simulates timing variantions that
are otherwise only observable via hardware variations.

I.e. this is not the same kernel booted up a 1000 times - that would be
a very narrow test. This is 1000 _different_ kernels built and booted
up. Each kernel having subtly different timings and ordering. And it's
more than just externally injected random kernel: the test-system itself
builds its "next version" (and uses the network for that as well), so
it's a self-hosting recursive random test in essence.

This method is also amazingly good at finding compiler/linker trouble:
it found 3-4 real gcc bugs so far. (For example i triggered an ancient
bug in gcc 4.0.2 just yesterday. For the record, the testsystem with the
TCP hang utilizes gcc-4.2.2.)

> > so i hereby conclude that your revert works :) I've repeated the
> > commit below that resolves this nasty regression.
>
> ...I couldn't immediately find anything obviously wrong with those
> changes but the patch below might be worth of a try (without the
> revert of course). If it ever spits out that WARN_ON for you, we were
> playing with fire too much and it's better to return on the safe side
> there...

i'll queue it up for testing, but no promises about speedy action here -
the test cycle is really long with this bug.

Ingo

------{ tip/out-of-tree shortlog: }----------->

Alexander van Heukelum (1):
uml: cleanup: use def_bool in Kconfig files

Bjorn Helgaas (1):
PNPACPI: use _CRS IRQ descriptor length for _SRS

Ilpo J?rvinen (1):
tcp: revert DEFER_ACCEPT modifications

Ingo Molnar (7):
video/dvb: fix MEDIA_TUNER && FW_LOADER build error
dvb: input layer dependencies fixes
drivers/media/video build fix for modular builds
drivers/watchdog/geodewdt.c: build fix
USB: fix build bug in USB_ISIGHTFW
acpi-acpi_numa_init-build-fix
acpi: fix drivers/acpi/glue.c build error

Michael Krufky (1):
dib7000p: fix dib7000p_attach when !CONFIG_DVB_DIB7000P

Russ Anderson (1):
acpi: fix boot breakage on Altix

Yinghai Lu (2):
net: use numa_node in net_devcice->dev instead of parent
ide: use dev_to_node instead of pcibus_to_node

2008-06-04 18:24:38

by David Miller

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Ingo Molnar <[email protected]>
Date: Wed, 4 Jun 2008 09:23:11 +0200

> * Ilpo J?rvinen <[email protected]> wrote:
>
> > ...I couldn't immediately find anything obviously wrong with those
> > changes but the patch below might be worth of a try (without the
> > revert of course). If it ever spits out that WARN_ON for you, we were
> > playing with fire too much and it's better to return on the safe side
> > there...
>
> i'll queue it up for testing, but no promises about speedy action here -
> the test cycle is really long with this bug.

Ilpo posted another patch which fixes a locking bug in the
code, please test with that patch. I include it below so
that you know exactly which one I am referring to.

The quicker you test this, the faster I can merge it to
Linus and get the bug fixed for good.

[PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk

It seems that replacement of DA code also moved parts outside
of appropriate locking. The Ingo's problem seems to come from
the fact that two flows could now race in
(inet_csk_)reqsk_queue_add corrupting the queue. ...This can
leave dangling socks around which won't resolve themselves
without stimuli from outside (e.g., external RST would help
I think).

Then some details I'm not too sure of:
I guess we want to put listen_sk->sk_state checking under the
lock as well. I've not evaluated if ->sk_data_ready too
requires locking but assumed it does.

I'm by no means familiar with all locking variants, requirements,
etc.

Signed-off-by: Ilpo J?rvinen <[email protected]>
---
net/ipv4/tcp_input.c | 23 +++++++++++++----------
1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c9454f0..d21d2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4562,6 +4562,7 @@ static int tcp_defer_accept_check(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);

if (tp->defer_tcp_accept.request) {
+ struct sock *listen_sk = tp->defer_tcp_accept.listen_sk;
int queued_data = tp->rcv_nxt - tp->copied_seq;
int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
tcp_hdr((struct sk_buff *)
@@ -4570,8 +4571,9 @@ static int tcp_defer_accept_check(struct sock *sk)
if (queued_data && hasfin)
queued_data--;

- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
+ bh_lock_sock(listen_sk);
+
+ if (queued_data && listen_sk->sk_state == TCP_LISTEN) {
if (sock_flag(sk, SOCK_KEEPOPEN)) {
inet_csk_reset_keepalive_timer(sk,
keepalive_time_when(tp));
@@ -4579,23 +4581,24 @@ static int tcp_defer_accept_check(struct sock *sk)
inet_csk_delete_keepalive_timer(sk);
}

- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
+ inet_csk_reqsk_queue_add(listen_sk,
+ tp->defer_tcp_accept.request,
+ sk);

tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
+ listen_sk, 0);

- sock_put(tp->defer_tcp_accept.listen_sk);
+ sock_put(listen_sk);
sock_put(sk);
tp->defer_tcp_accept.listen_sk = NULL;
tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
+ } else if (hasfin || listen_sk->sk_state != TCP_LISTEN) {
+ bh_unlock_sock(listen_sk);
tcp_reset(sk);
return -1;
}
+
+ bh_unlock_sock(listen_sk);
}
return 0;
}

2008-06-04 20:56:52

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Wed, 4 Jun 2008, David Miller wrote:

> From: Ingo Molnar <[email protected]>
> Date: Wed, 4 Jun 2008 09:23:11 +0200
>
> > * Ilpo J?rvinen <[email protected]> wrote:
> >
> > > ...I couldn't immediately find anything obviously wrong with those
> > > changes but the patch below might be worth of a try (without the
> > > revert of course). If it ever spits out that WARN_ON for you, we were
> > > playing with fire too much and it's better to return on the safe side
> > > there...
> >
> > i'll queue it up for testing, but no promises about speedy action here -
> > the test cycle is really long with this bug.
>
> Ilpo posted another patch which fixes a locking bug in the
> code, please test with that patch. I include it below so
> that you know exactly which one I am referring to.

Do you know if TCP ever needs to lock the child sk again (after creating
it and it has started to live on its own) from a context that has already
locked the listen_sk (it could deadlock after this change)?


--
i.

2008-06-04 21:55:34

by David Miller

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: "Ilpo_J?rvinen" <[email protected]>
Date: Wed, 4 Jun 2008 23:56:34 +0300 (EEST)

> Do you know if TCP ever needs to lock the child sk again (after creating
> it and it has started to live on its own) from a context that has already
> locked the listen_sk (it could deadlock after this change)?

I don't know, and since I'm getting back onto an airplane again tomorrow
morning I'm the last person who is likely do be able to do an audit
of these cases for you :-)

2008-06-05 14:23:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


* Ilpo J?rvinen <[email protected]> wrote:

> [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk

i unapplied the reverts then applied your fix, and got this hang after
1.5 hours of testing:

titan:~> netstat -nt
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 65032 10.0.1.14:39454 10.0.1.19:3632 ESTABLISHED

10.0.1.14 is this box, Core2Duo dual-core. 10.0.1.19 is another box with
16 CPUs.

Ingo

2008-06-05 18:01:07

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Thu, 5 Jun 2008, Ingo Molnar wrote:

>
> * Ilpo J?rvinen <[email protected]> wrote:
>
> > [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
>
> i unapplied the reverts then applied your fix, and got this hang after
> 1.5 hours of testing:
>
> titan:~> netstat -nt
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State
> tcp 0 65032 10.0.1.14:39454 10.0.1.19:3632 ESTABLISHED
>
> 10.0.1.14 is this box, Core2Duo dual-core. 10.0.1.19 is another box with
> 16 CPUs.

Did 10.0.1.19 too have the fix (the receiving end is the important
one to have this fix)?

--
i.

2008-06-05 21:14:21

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Thu, 5 Jun 2008, Ilpo J?rvinen wrote:

> On Thu, 5 Jun 2008, Ingo Molnar wrote:
>
> >
> > * Ilpo J?rvinen <[email protected]> wrote:
> >
> > > [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
> >
> > i unapplied the reverts then applied your fix, and got this hang after
> > 1.5 hours of testing:
> >
> > titan:~> netstat -nt
> > Active Internet connections (w/o servers)
> > Proto Recv-Q Send-Q Local Address Foreign Address State
> > tcp 0 65032 10.0.1.14:39454 10.0.1.19:3632 ESTABLISHED
> >
> > 10.0.1.14 is this box, Core2Duo dual-core. 10.0.1.19 is another box with
> > 16 CPUs.
>
> Did 10.0.1.19 too have the fix (the receiving end is the important
> one to have this fix)?

Since you're probably going to tell that it was, here's a try which just
reverts the most questionable commit out of those three DA related
changes that were included in the first revert you found to be working
(the other two are quite useful fixes).

I'm out of new ideas what could be still wrong (I got confused and lost
track number of times while I tried to verify socket locking today and
probably don't have more time for that now)... Unless somebody else
(Patrick?) comes up with something quickly, I propose you Ingo try if
this smaller revert is enough to solve it and we leave those two fixes
there and postpone the rest. If it isn't enough then things would again
get quite interesting but that's unlikely to happen.

--
i.

--
[PATCH] tcp: revert DEFER_ACCEPT updates - process as established

Reverts "[TCP]: TCP_DEFER_ACCEPT updates - process as established"
(ec3c0982a2dd1e6) and a fix made into it ("tcp: Fix slab corruption
with ipv6 and tcp6fuzz"; 9ae27e0adbf471c7).

Ingo Molnar found out that TCP flows got stuck into ESTABLISHED
state without receiving process. I tried to fix the locking for
listen_sk:
http://marc.info/?l=linux-kernel&m=121253537018216&w=2
...but Ingo's tests still got stuck TCP flow with that fix.

Two other DA related fixes which were among the reverted commits
in the first revert attempt are not reverted here.

Signed-off-by: Ilpo J?rvinen <[email protected]>
---
include/linux/tcp.h | 7 ------
include/net/request_sock.h | 4 +-
include/net/tcp.h | 1 -
net/ipv4/inet_connection_sock.c | 11 +++++++--
net/ipv4/tcp.c | 18 +++++++++------
net/ipv4/tcp_input.c | 45 ---------------------------------------
net/ipv4/tcp_ipv4.c | 8 -------
net/ipv4/tcp_minisocks.c | 32 ++++++++++-----------------
net/ipv4/tcp_timer.c | 5 ----
9 files changed, 33 insertions(+), 98 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9881295..07e79bd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -239,11 +239,6 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}

-struct tcp_deferred_accept_info {
- struct sock *listen_sk;
- struct request_sock *request;
-};
-
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -379,8 +374,6 @@ struct tcp_sock {
unsigned int keepalive_time; /* time before keep alive takes place */
unsigned int keepalive_intvl; /* time interval between keep alive probes */

- struct tcp_deferred_accept_info defer_tcp_accept;
-
unsigned long last_synq_overflow;

/* Receiver side RTT estimation */
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b220b5f..0c96e7b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -115,8 +115,8 @@ struct request_sock_queue {
struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
rwlock_t syn_wait_lock;
- u16 rskq_defer_accept;
- /* 2 bytes hole, try to pack */
+ u8 rskq_defer_accept;
+ /* 3 bytes hole, try to pack */
struct listen_sock *listen_opt;
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 633147c..981d5ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -139,7 +139,6 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define MAX_TCP_KEEPINTVL 32767
#define MAX_TCP_KEEPCNT 127
#define MAX_TCP_SYNCNT 127
-#define MAX_TCP_ACCEPT_DEFERRED 65535

#define TCP_SYNQ_INTERVAL (HZ/5) /* Period of SYNACK timer */

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 828ea21..045e799 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -419,7 +419,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
struct inet_connection_sock *icsk = inet_csk(parent);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
struct listen_sock *lopt = queue->listen_opt;
- int thresh = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int thresh = max_retries;
unsigned long now = jiffies;
struct request_sock **reqp, *req;
int i, budget;
@@ -455,6 +456,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}

+ if (queue->rskq_defer_accept)
+ max_retries = queue->rskq_defer_accept;
+
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;

@@ -462,8 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if (req->retrans < thresh &&
- !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ if ((req->retrans < (inet_rsk(req)->acked ? max_retries : thresh)) &&
+ (inet_rsk(req)->acked ||
+ !req->rsk_ops->rtx_syn_ack(parent, req))) {
unsigned long timeo;

if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab66683..fc54a48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2112,12 +2112,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;

case TCP_DEFER_ACCEPT:
- if (val < 0) {
- err = -EINVAL;
- } else {
- if (val > MAX_TCP_ACCEPT_DEFERRED)
- val = MAX_TCP_ACCEPT_DEFERRED;
- icsk->icsk_accept_queue.rskq_defer_accept = val;
+ icsk->icsk_accept_queue.rskq_defer_accept = 0;
+ if (val > 0) {
+ /* Translate value in seconds to number of
+ * retransmits */
+ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
+ val > ((TCP_TIMEOUT_INIT / HZ) <<
+ icsk->icsk_accept_queue.rskq_defer_accept))
+ icsk->icsk_accept_queue.rskq_defer_accept++;
+ icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;

@@ -2299,7 +2302,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
val = (val ? : sysctl_tcp_fin_timeout) / HZ;
break;
case TCP_DEFER_ACCEPT:
- val = icsk->icsk_accept_queue.rskq_defer_accept;
+ val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
+ ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1322fa1..fb1ab3f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4557,49 +4557,6 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
}
}

-static int tcp_defer_accept_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (tp->defer_tcp_accept.request) {
- int queued_data = tp->rcv_nxt - tp->copied_seq;
- int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
- tcp_hdr((struct sk_buff *)
- sk->sk_receive_queue.prev)->fin : 0;
-
- if (queued_data && hasfin)
- queued_data--;
-
- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
- if (sock_flag(sk, SOCK_KEEPOPEN)) {
- inet_csk_reset_keepalive_timer(sk,
- keepalive_time_when(tp));
- } else {
- inet_csk_delete_keepalive_timer(sk);
- }
-
- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
-
- tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
-
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
- tcp_reset(sk);
- return -1;
- }
- }
- return 0;
-}
-
static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -4974,8 +4931,6 @@ step5:

tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
-
- tcp_defer_accept_check(sk);
return 0;

csum_error:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cd601a8..b698b5b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1918,14 +1918,6 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- if (tp->defer_tcp_accept.request) {
- reqsk_free(tp->defer_tcp_accept.request);
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- }
-
atomic_dec(&tcp_sockets_allocated);

return 0;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 019c8c1..8245247 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -571,8 +571,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
does sequence test, SYN is truncated, and thus we consider
it a bare ACK.

- Both ends (listening sockets) accept the new incoming
- connection and try to talk to each other. 8-)
+ If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
+ bare ACK. Otherwise, we create an established connection. Both
+ ends (listening sockets) accept the new incoming connection and try
+ to talk to each other. 8-)

Note: This case is both harmless, and rare. Possibility is about the
same as us discovering intelligent life on another plant tomorrow.
@@ -640,6 +642,13 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
if (!(flg & TCP_FLAG_ACK))
return NULL;

+ /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
+ if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+ TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
+ inet_rsk(req)->acked = 1;
+ return NULL;
+ }
+
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
@@ -678,24 +687,7 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
inet_csk_reqsk_queue_unlink(sk, req, prev);
inet_csk_reqsk_queue_removed(sk, req);

- if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
- TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-
- /* the accept queue handling is done is est recv slow
- * path so lets make sure to start there
- */
- tcp_sk(child)->pred_flags = 0;
- sock_hold(sk);
- sock_hold(child);
- tcp_sk(child)->defer_tcp_accept.listen_sk = sk;
- tcp_sk(child)->defer_tcp_accept.request = req;
-
- inet_csk_reset_keepalive_timer(child,
- inet_csk(sk)->icsk_accept_queue.rskq_defer_accept * HZ);
- } else {
- inet_csk_reqsk_queue_add(sk, req, child);
- }
-
+ inet_csk_reqsk_queue_add(sk, req, child);
return child;

listen_overflow:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4de68cf..63ed9d6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -489,11 +489,6 @@ static void tcp_keepalive_timer (unsigned long data)
goto death;
}

- if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
- tcp_send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
goto out;

--
1.5.2.2

2008-06-05 23:29:43

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Fri, 2008-06-06 at 00:13 +0300, Ilpo J?rvinen wrote:

>
> I'm out of new ideas what could be still wrong (I got confused and
> lost
> track number of times while I tried to verify socket locking today and
> probably don't have more time for that now)... Unless somebody else
> (Patrick?) comes up with something quickly,

Sorry, I don't see anything - it seems to boil down to the same code in
the DA and non-DA case as far as I can tell, but after a while all the
twisty passages seem to look alike.

If Ingo confirms that the recv end was running the locking patch code,
it would be interesting to just confirm the sysreq+t looks the same as
before - it is possible the patch turned the race into a non-obvious
deadlock.

I'm sure your smaller revert will make the problem go away just as the
larger one did, fwiw.

The other odd thing is that Ingo did a lot of experimentation and was
only making this happen on localhost before (though I agree there is
nothing inherent about that lock and localhost) - isn't it odd that the
first trigger of it now is between two hosts? What do you make of that?



2008-06-06 10:03:24

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Thu, 5 Jun 2008, Patrick McManus wrote:

> On Fri, 2008-06-06 at 00:13 +0300, Ilpo J?rvinen wrote:
>
> >
> > I'm out of new ideas what could be still wrong (I got confused and
> > lost
> > track number of times while I tried to verify socket locking today and
> > probably don't have more time for that now)... Unless somebody else
> > (Patrick?) comes up with something quickly,
>
> Sorry, I don't see anything - it seems to boil down to the same code in
> the DA and non-DA case as far as I can tell, but after a while all the
> twisty passages seem to look alike.

:-)

This Ingo's testcase should anyway be quite "simple", I mean that distcc
shouldn't do anything unexpected in a sense it shouldn't abort the flows
by not sending data, close the listening socket or other things like that.

> If Ingo confirms that the recv end was running the locking patch code,
> it would be interesting to just confirm the sysreq+t looks the same as
> before - it is possible the patch turned the race into a non-obvious
> deadlock.

...Yes, but we want that from the receiver's host rather than from the
sender end. Also checking that sender is still doing window probes once
per 2min is probably worth of it though a change in that is quite
unlikely.

> I'm sure your smaller revert will make the problem go away just as the
> larger one did, fwiw.

I'd very much except it to.

> The other odd thing is that Ingo did a lot of experimentation and was
> only making this happen on localhost before (though I agree there is
> nothing inherent about that lock and localhost) - isn't it odd that the
> first trigger of it now is between two hosts? What do you make of that?

...No, it occurred couple of times in the past between host as well,
nothing new in that.

--
i.

2008-06-06 17:11:37

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


> This Ingo's testcase should anyway be quite "simple", I mean that distcc
> shouldn't do anything unexpected in a sense it shouldn't abort the flows
> by not sending data, close the listening socket or other things like that.
>

maybe - I've noted that I can get the distcc server to crash with just a
little fuzz (telnet to it and close the telnet) - but it is true I
haven't seen anything odd using the distcc client.

Anyhow, my news is that using rc5 I have managed to reproduce it on
localhost - so it isn't just ingo anymore ! ;)

I didn't have a 16 cpu machine at my disposal, so I was concerned I
wouldn't be able to make it happen - but I setup a 64 bit kvm image with
-smp 4, running on my core2-duo and created a makefile with 3 src files
that I just compiled as "while true; do make -j3 all; done" - the
makefile uses distcc to localhost and has intentionally broken
dependencies so it just keeps recompiling stuff. The input files are
approximately 135k, 98k, and 16k after running gcc -E on them (which I
what I assume distcc does before putting them down the socket).

On rc5 I could get the lockup in under 20 minutes.. usually 10. I think
I did it 4 times. My compile test is probably a better trigger than the
kernel compile because the distcc connects are never staggered like they
would be in a large directory of files. (3 files, -j4).

When I apply the locking patch you (Ilpo) wrote, I cannot reproduce the
error at all in the first 90 minutes of testing. I'll let the test run
and update the list.

I'm holding out hope that Ingo's report did not have the locking patch
on the distcc server end - because it certainly makes a difference for
me.

-Patrick

2008-06-06 17:34:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


* Patrick McManus <[email protected]> wrote:

> When I apply the locking patch you (Ilpo) wrote, I cannot reproduce
> the error at all in the first 90 minutes of testing. I'll let the test
> run and update the list.
>
> I'm holding out hope that Ingo's report did not have the locking patch
> on the distcc server end - because it certainly makes a difference for
> me.

Hm, the distcc server had the full 3-patch-revert from Ilpo, was that
supposed to fix the problem too, indirectly?

The box is running that 3-patch revert right now as well:

phoenix:~> uptime
19:20:28 up 9:58, 2 users, load average: 7.75, 13.88, 30.95
phoenix:~> uname -a
Linux phoenix 2.6.26-rc4 #2352 SMP Fri Jun 6 09:18:07 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux

... and i never saw a single hang today in the 10 hours of uptime this
box has. (and it built a good 500 kernel today) Nor any hang yesterday,
and that was a good 500 kernels too.

You can see it that the box built more than two thousand kernels in the
past few days alone, so it's a rather busy little bee. The other
testboxes built even more kernels - a quad box built and booted 2500
kernels:

#define UTS_VERSION "#2524 SMP PREEMPT Fri Jun 6 19:22:21 CEST 2008"

and i never saw a hang on that box either.

a third box has:

titan:~> uname -a
Linux titan 2.6.26-rc5-00002-g737697d-dirty #2557 SMP PREEMPT Fri Jun 6
19:24:00 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux

(this is the one that showed the hang for the first time)

The total count of kernel bootups i did this week for -tip QA was
somewhere between five and ten thousand random build+bootups - and the
only time i got a hang was when i removed the 3-patch-revert
intentionally, on one of the boxes.

Maybe that 3-patch-revert just makes this locking bug a bit less likely
to trigger, by accident? Out tip test-setup is specialized to find
arch/x86 and scheduler bugs, not primarily to find networking bugs. (but
at this test volume, and given that it makes use of distcc, it will
trigger them too.)

i have a rather accurate timeline of when the hang first occured, do we
know the timeline of the introduction of the locking bug by any chance?
Which commit introduced it? (Ilpo's commit log does not say it)

Your test results are compelling nevertheless so i'll do a retest in any
case, with all boxes either running an older kernel or a kernel with the
locking fix.

Ingo

2008-06-06 18:19:27

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

...I kind of fail to follow in general in this mail which patch have been
tested and where and when... But I understand that it's just due to number
of tests & hosts & kernels & what-else you use and know by heart (and we
don't do all that well :-)). But I'll try to still sort it out below...

On Fri, 6 Jun 2008, Ingo Molnar wrote:

>
> * Patrick McManus <[email protected]> wrote:
>
> > When I apply the locking patch you (Ilpo) wrote, I cannot reproduce
> > the error at all in the first 90 minutes of testing. I'll let the test
> > run and update the list.
> >
> > I'm holding out hope that Ingo's report did not have the locking patch
> > on the distcc server end - because it certainly makes a difference for
> > me.
>
> Hm, the distcc server had the full 3-patch-revert from Ilpo, was that
> supposed to fix the problem too, indirectly?

Yes, the problematic outside of locking portion shouldn't be there
without those DA changes.

> The box is running that 3-patch revert right now as well:
>
> phoenix:~> uptime
> 19:20:28 up 9:58, 2 users, load average: 7.75, 13.88, 30.95
> phoenix:~> uname -a
> Linux phoenix 2.6.26-rc4 #2352 SMP Fri Jun 6 09:18:07 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> ... and i never saw a single hang today in the 10 hours of uptime this
> box has. (and it built a good 500 kernel today) Nor any hang yesterday,
> and that was a good 500 kernels too.
>
> You can see it that the box built more than two thousand kernels in the
> past few days alone, so it's a rather busy little bee. The other
> testboxes built even more kernels - a quad box built and booted 2500
> kernels:
>
> #define UTS_VERSION "#2524 SMP PREEMPT Fri Jun 6 19:22:21 CEST 2008"
>
> and i never saw a hang on that box either.
>
> a third box has:
>
> titan:~> uname -a
> Linux titan 2.6.26-rc5-00002-g737697d-dirty #2557 SMP PREEMPT Fri Jun 6
> 19:24:00 CEST 2008 x86_64 x86_64 x86_64 GNU/Linux
>
> (this is the one that showed the hang for the first time)
>
> The total count of kernel bootups i did this week for -tip QA was
> somewhere between five and ten thousand random build+bootups - and the
> only time i got a hang was when i removed the 3-patch-revert
> intentionally, on one of the boxes.

...and you added the locking fix there instead? Or was this a removal?

> Maybe that 3-patch-revert just makes this locking bug a bit less likely
> to trigger, by accident?

No, part of the DEFER_ACCEPT stuff was postponed in 2.6.25..2.6.26-rc1
timeframe (ec3c0982a2dd1e671bad8e9d26c28dcba0039d87) so that one portion
of it ended up being added outside of the socket lock of the listening
socket, while touching its datastructures. Without
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 the deferred accept related
things happen earlier, ie., while we still are under the lock of the
listening socket. So that particular locking bug was _introduced_ by that
ec3c change, not made more likely or so.

...Of course software is known to have bugs, so we might always be
(un?)lucky and hit another one and confuse... :-)

> Out tip test-setup is specialized to find
> arch/x86 and scheduler bugs, not primarily to find networking bugs. (but
> at this test volume, and given that it makes use of distcc, it will
> trigger them too.)

It seems to work quite well actually for this kind of networking related
bugs too which hardly depend on network at all :-).

> i have a rather accurate timeline of when the hang first occured, do we
> know the timeline of the introduction of the locking bug by any chance?
> Which commit introduced it? (Ilpo's commit log does not say it)

Ah, sorry I forgot to add that one there, it was sent quite late in the
night and I just couldn't get sleep until sending the fix... :-) It was
one of the reverted ones that did it:
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87.

> Your test results are compelling nevertheless so i'll do a retest in any
> case, with all boxes either running an older kernel or a kernel with the
> locking fix.

If you want an older kernel, you would have to go basically to 2.6.25 or
so.

To summarize. Both 3changes+1fix revert (you refer to it only as 3-patch
revert) _and_ the locking fix I made should fix the problem (obviously
they exclude each other). ...And end which is significant is the one which
has LISTENing sockets (please keep this in mind if you still get the hang
and provide some info).

--
i.

2008-06-06 18:26:00

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Fri, 6 Jun 2008, Patrick McManus wrote:

> > This Ingo's testcase should anyway be quite "simple", I mean that distcc
> > shouldn't do anything unexpected in a sense it shouldn't abort the flows
> > by not sending data, close the listening socket or other things like that.
>
> maybe - I've noted that I can get the distcc server to crash with just a
> little fuzz (telnet to it and close the telnet) - but it is true I
> haven't seen anything odd using the distcc client.

In addition I think I've also seen some bits floating around that
occassionally distcc does something weird in a correct setup too.

I briefly looked how distcc behaved while doing the stress_accept. Distcc
basically seems to have n processes each accept()ing and some kind of
memleak killer by limiting number of successive accepts then exit, while
the parent who did the listen is only periodically (had some sleep(1)s)
collecting dead ones & respawning them.

> Anyhow, my news is that using rc5 I have managed to reproduce it on
> localhost - so it isn't just ingo anymore ! ;)

Also Peter Z has reported it earlier, it was distcc+localhost for him as
well.

> and has intentionally broken dependencies so it just keeps recompiling
> stuff.

...Trying to invent perpetual motion machine? :-/

> The input files are
> approximately 135k, 98k, and 16k after running gcc -E on them (which I
> what I assume distcc does before putting them down the socket).
>
> On rc5 I could get the lockup in under 20 minutes.. usually 10. I think
> I did it 4 times. My compile test is probably a better trigger than the
> kernel compile because the distcc connects are never staggered like they
> would be in a large directory of files. (3 files, -j4).

It could be even easier if you make next in path gcc to play with
nice, trying a number of different values might reveal some really fast
to reproduce scenario.

> When I apply the locking patch you (Ilpo) wrote, I cannot reproduce the
> error at all in the first 90 minutes of testing. I'll let the test run
> and update the list.

At least it helps some :-), like it should.

> I'm holding out hope that Ingo's report did not have the locking patch
> on the distcc server end - because it certainly makes a difference for
> me.

...He had some issue with different versions being deployed at least in
the past, and I failed to follow his latest answer :-).


--
i.

2008-06-06 18:40:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+


* Ilpo J?rvinen <[email protected]> wrote:

> If you want an older kernel, you would have to go basically to 2.6.25
> or so.

correct, that's what i use as fallback, some distro kernel which is
2.6.25 or older.

but i'm confused a bit, you say v2.6.25-rc6-475-gec3c098 introduced the
locking problem - so 2.6.25 is affected as well? This is a significant
question because the fallback kernel is kernel-2.6.25.3-18.fc9.x86_64 on
the 16-way box. (all other build-boxes have 2.6.24 or older as a
fallback kernel)

> To summarize. Both 3changes+1fix revert (you refer to it only as
> 3-patch revert) _and_ the locking fix I made should fix the problem
> (obviously they exclude each other). ...And end which is significant
> is the one which has LISTENing sockets (please keep this in mind if
> you still get the hang and provide some info).

ok.

For completeness, let me repeat the patch i referred to as the
'3-patch-revert' below. (which indeed is 3+1 as you note)

this is the patch that appears to be working empirically. (Disclaimer:
it might just hide the problem, change timings, have a lucky code
layout, etc.)

Ingo

----------------->
commit 9e5b6ca8fa8e07aceaddbfb0f2a324449e5ef014
Author: Ilpo J?rvinen <[email protected]>
Date: Sat May 31 15:18:56 2008 +0300

tcp: revert DEFER_ACCEPT modifications

On Sat, 31 May 2008, Ilpo J?rvinen wrote:

> On Sat, 31 May 2008, Ingo Molnar wrote:
>
> > ok, once i added back the localhost distcc component and the hung kernel
> > build + stuck TCP socket bug happened again overnight:
> >
> > Active Internet connections (w/o servers)
> > Proto Recv-Q Send-Q Local Address Foreign Address State
> > tcp 72187 0 10.0.1.14:3632 10.0.1.14:47910 ESTABLISHED
> > tcp 0 174464 10.0.1.14:47910 10.0.1.14:3632 ESTABLISHED
> >
> > so it seems distcc over localhost was the aspect that made it fail.
> >
> > _Perhaps_ what matters is to have the new post-rc3 TCP code on _both_
> > sides of the connection. But that is just a theory - it could be timing,
> > etc.
>
> Btw, does your distcc perhaps happen enable TCP_DEFER_ACCEPT (there were
> some post 2.6.25 changes into it)?

Hmm, if so, please try this revert below (I hope I got everything that is
related there, gitk is currently broken for me so that viewing many
commits quickly is a pain, so I might have accidently missed something)...

Reverts these commits:
[TCP]: TCP_DEFER_ACCEPT updates - defer timeout conflicts with max_t
[TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack
[TCP]: TCP_DEFER_ACCEPT updates - process as established
tcp: Fix slab corruption with ipv6 and tcp6fuzz

..., ie.,
539fae89bebd16ebeafd57a87169bc56eb530d76,
e4c78840284f3f51b1896cf3936d60a6033c4d2c,
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 and
9ae27e0adbf471c7a6b80102e38e1d5a346b3b38.

Signed-off-by: Ilpo J?rvinen <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 18e62e3..b31b6b7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -239,11 +239,6 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}

-struct tcp_deferred_accept_info {
- struct sock *listen_sk;
- struct request_sock *request;
-};
-
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -379,8 +374,6 @@ struct tcp_sock {
unsigned int keepalive_intvl; /* time interval between keep alive probes */
int linger2;

- struct tcp_deferred_accept_info defer_tcp_accept;
-
unsigned long last_synq_overflow;

u32 tso_deferred;
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b220b5f..0c96e7b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -115,8 +115,8 @@ struct request_sock_queue {
struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
rwlock_t syn_wait_lock;
- u16 rskq_defer_accept;
- /* 2 bytes hole, try to pack */
+ u8 rskq_defer_accept;
+ /* 3 bytes hole, try to pack */
struct listen_sock *listen_opt;
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 633147c..981d5ba 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -139,7 +139,6 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define MAX_TCP_KEEPINTVL 32767
#define MAX_TCP_KEEPCNT 127
#define MAX_TCP_SYNCNT 127
-#define MAX_TCP_ACCEPT_DEFERRED 65535

#define TCP_SYNQ_INTERVAL (HZ/5) /* Period of SYNACK timer */

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 828ea21..ec83448 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -419,7 +419,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
struct inet_connection_sock *icsk = inet_csk(parent);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
struct listen_sock *lopt = queue->listen_opt;
- int thresh = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int thresh = max_retries;
unsigned long now = jiffies;
struct request_sock **reqp, *req;
int i, budget;
@@ -455,6 +456,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}

+ if (queue->rskq_defer_accept)
+ max_retries = queue->rskq_defer_accept;
+
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;

@@ -462,8 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if (req->retrans < thresh &&
- !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ if ((req->retrans < thresh ||
+ (inet_rsk(req)->acked && req->retrans < max_retries))
+ && !req->rsk_ops->rtx_syn_ack(parent, req)) {
unsigned long timeo;

if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab66683..fc54a48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2112,12 +2112,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;

case TCP_DEFER_ACCEPT:
- if (val < 0) {
- err = -EINVAL;
- } else {
- if (val > MAX_TCP_ACCEPT_DEFERRED)
- val = MAX_TCP_ACCEPT_DEFERRED;
- icsk->icsk_accept_queue.rskq_defer_accept = val;
+ icsk->icsk_accept_queue.rskq_defer_accept = 0;
+ if (val > 0) {
+ /* Translate value in seconds to number of
+ * retransmits */
+ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
+ val > ((TCP_TIMEOUT_INIT / HZ) <<
+ icsk->icsk_accept_queue.rskq_defer_accept))
+ icsk->icsk_accept_queue.rskq_defer_accept++;
+ icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;

@@ -2299,7 +2302,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
val = (val ? : sysctl_tcp_fin_timeout) / HZ;
break;
case TCP_DEFER_ACCEPT:
- val = icsk->icsk_accept_queue.rskq_defer_accept;
+ val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
+ ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eba873e..cad73b7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4541,49 +4541,6 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
}
}

-static int tcp_defer_accept_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (tp->defer_tcp_accept.request) {
- int queued_data = tp->rcv_nxt - tp->copied_seq;
- int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
- tcp_hdr((struct sk_buff *)
- sk->sk_receive_queue.prev)->fin : 0;
-
- if (queued_data && hasfin)
- queued_data--;
-
- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
- if (sock_flag(sk, SOCK_KEEPOPEN)) {
- inet_csk_reset_keepalive_timer(sk,
- keepalive_time_when(tp));
- } else {
- inet_csk_delete_keepalive_timer(sk);
- }
-
- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
-
- tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
-
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
- tcp_reset(sk);
- return -1;
- }
- }
- return 0;
-}
-
static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -4944,8 +4901,6 @@ step5:

tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
-
- tcp_defer_accept_check(sk);
return 0;

csum_error:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cd601a8..b698b5b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1918,14 +1918,6 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- if (tp->defer_tcp_accept.request) {
- reqsk_free(tp->defer_tcp_accept.request);
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- }
-
atomic_dec(&tcp_sockets_allocated);

return 0;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 019c8c1..8245247 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -571,8 +571,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
does sequence test, SYN is truncated, and thus we consider
it a bare ACK.

- Both ends (listening sockets) accept the new incoming
- connection and try to talk to each other. 8-)
+ If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
+ bare ACK. Otherwise, we create an established connection. Both
+ ends (listening sockets) accept the new incoming connection and try
+ to talk to each other. 8-)

Note: This case is both harmless, and rare. Possibility is about the
same as us discovering intelligent life on another plant tomorrow.
@@ -640,6 +642,13 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
if (!(flg & TCP_FLAG_ACK))
return NULL;

+ /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
+ if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+ TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
+ inet_rsk(req)->acked = 1;
+ return NULL;
+ }
+
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
@@ -678,24 +687,7 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
inet_csk_reqsk_queue_unlink(sk, req, prev);
inet_csk_reqsk_queue_removed(sk, req);

- if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
- TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-
- /* the accept queue handling is done is est recv slow
- * path so lets make sure to start there
- */
- tcp_sk(child)->pred_flags = 0;
- sock_hold(sk);
- sock_hold(child);
- tcp_sk(child)->defer_tcp_accept.listen_sk = sk;
- tcp_sk(child)->defer_tcp_accept.request = req;
-
- inet_csk_reset_keepalive_timer(child,
- inet_csk(sk)->icsk_accept_queue.rskq_defer_accept * HZ);
- } else {
- inet_csk_reqsk_queue_add(sk, req, child);
- }
-
+ inet_csk_reqsk_queue_add(sk, req, child);
return child;

listen_overflow:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4de68cf..63ed9d6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -489,11 +489,6 @@ static void tcp_keepalive_timer (unsigned long data)
goto death;
}

- if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
- tcp_send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
goto out;

2008-06-06 19:50:21

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Fri, 6 Jun 2008, Ingo Molnar wrote:

>
> * Ilpo J?rvinen <[email protected]> wrote:
>
> > If you want an older kernel, you would have to go basically to 2.6.25
> > or so.
>
> correct, that's what i use as fallback, some distro kernel which is
> 2.6.25 or older.
>
> but i'm confused a bit, you say v2.6.25-rc6-475-gec3c098 introduced the
> locking problem - so 2.6.25 is affected as well?

No, you're probably just falling into a git-describe trap I also used
to fall:

ijjarvin@pointhope:~/linux/mainline$ git-log -n 1 --pretty=oneline
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ^v2.6.25 | cat -
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 [TCP]: TCP_DEFER_ACCEPT updates -
process as established
ijjarvin@pointhope:~/linux/mainline$ git-log -n 1 --pretty=oneline
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 ^v2.6.26-rc1 | cat -
ijjarvin@pointhope:~/linux/mainline$ git-describe
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
v2.6.25-rc6-475-gec3c098
ijjarvin@pointhope:~/linux/mainline$

The git-describe is not the way one can determine into which mainline
tag a commit was included, it basically just provides the closest tag
among ancestors, which can be a vastly different one and has _no_
relation whatsoever to the tag we'd desire to get. In here, Dave had
net-2.6 based on 2.5.25-rc6ish (or alternatively last merge to net-2.6
from Linus' tree's content came from that point of time), but Linus did
the merge from 2.6.25 but git-describe won't look anything that happens
after the asked commit. This is similar to the
bisect-lands-lower-tag-than-select-good-commit-was "mystery" that was
recently discussed extensively, again the Makefile only tracks ancestors,
not the future.

If somebody knows a trivial command to get that future information (to
where merged info), I'd pretty interested to hear.

> This is a significant
> question because the fallback kernel is kernel-2.6.25.3-18.fc9.x86_64 on
> the 16-way box. (all other build-boxes have 2.6.24 or older as a
> fallback kernel)

Please do get the receiver state if you still see such problem with it,
it is also relevant but it a different problem then (I'm yet to analyze
the data H?kan was collecting, dl it already by didn't even look into
that yet).

...Or also if you see stuck TCPs with other cases I've told should fix it:

1. 2.6.25 (pre-ec3c to be accurate)
2. 3+1 revert
3. ec3c+locking fix (this is the most unsure one because it still would
have the reversed socket lock taking order though nothing bad has been
found by some review neither by me nor Patrick)

Please collect at least /proc/net/tcp and the netstat -np, if there's
process associated to the flow with _Recv-Q_ (in localhost case there
are two of them, the other with Send-Q), also where the process is
waiting is useful. Hopefully clear enough now... :-)

> > To summarize. Both 3changes+1fix revert (you refer to it only as
> > 3-patch revert) _and_ the locking fix I made should fix the problem
> > (obviously they exclude each other). ...And end which is significant
> > is the one which has LISTENing sockets (please keep this in mind if
> > you still get the hang and provide some info).
>
> ok.
>
> For completeness, let me repeat the patch i referred to as the
> '3-patch-revert' below. (which indeed is 3+1 as you note)

...I know because there never have been any 3-patch-revert made... :-)

> this is the patch that appears to be working empirically. (Disclaimer:
> it might just hide the problem, change timings, have a lucky code
> layout, etc.)

Sure, but the revert also removes the obvious locking problem that was
introduced in ec3c.


--
i.

2008-06-06 20:08:58

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

This is all a bit confusing, but here are the conclusions I have drawn.

There definitely is a problem with the locking of the DA commit
ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 . That code was part of 26-rc1
but it never appeared in 25. It exists in pretty much the same form in
rc5 (there was 1 patch to it over that time to fix a different problem).

We're certain this code has a problem with the accept queue both because
of code inspection and the fact that Ingo can back it out (as the
significant part of the 3-patch revert) and the problem goes away in his
testing.

I have run tests that can reproduce the hung socket with distcc over
localhost using 26-rc5. I can also apparently cure it using the locking
fix patch Ilpo sent (c9454f0..d21d2b9) on top of that. (My test of rc5
+lockpatch is at 4.5+ hrs and counting without failures, it fails 6
times an hour with vanilla rc5)

Based on all of that, the right thing to do seems to be to apply the
lockpatch (c9454f0..d21d2b9) to Linus's tree and not revert anything -
just fix the code and I'll send Ilpo and Ingo cookies at Christmas time
for being great guys. Alternatively, Ingo could run the distcc servers
and clients on -tip with the lockpatch (nothing reverted) for more
testing.

The only lingering problem is Ingo's report yesterday
http://marc.info/?l=linux-netdev&m=121267587715976&w=2
of a distcc hang. In this one it was not over localhost and the distcc
server had the ec3c DA changes totally reverted. (The server is really
the only stack that matters in this case - the client is not impacted by
the DA changes). This has to be a different issue, because the ec3c code
we're talking about here wasn't on the server at all. As Ilpo mentions,
Hakon is beleived to have a different problem and maybe you've tripped
over that too?

If we're sure of that conclusion we should just take Ilpo's DA patch as
that will narrow the field for finding Hakon's issue. Its just with all
of these data points I'm not sure if I'm reaching the right conclusion.



2008-06-06 21:13:18

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

...added Arjan.

On Fri, 6 Jun 2008, Patrick McManus wrote:

> This is all a bit confusing, but here are the conclusions I have drawn.

Your observations here match what I've understood :-).

> There definitely is a problem with the locking of the DA commit
> ec3c0982a2dd1e671bad8e9d26c28dcba0039d87 . That code was part of 26-rc1
> but it never appeared in 25. It exists in pretty much the same form in
> rc5 (there was 1 patch to it over that time to fix a different problem).
>
> We're certain this code has a problem with the accept queue both because
> of code inspection and the fact that Ingo can back it out (as the
> significant part of the 3-patch revert) and the problem goes away in his
> testing.

Problems were at least these:
- Accept queue addition was racy and could leave dangling items
- Dangling items caused inconsistent sk_ack_backlog
- Checking for still in LISTEN state was racy, could be changed after
the check was made (shouldn't happen with distcc though)

I didn't read ->sk_data_ready that carefully, it could have some
additional problems that are not listed (but they all should be fixed
by the added locking anyway).

AFAICT, rest of that ec3c change is safe wrt. locking, just holding sk is
enough for the rest and those bits mostly shouldn't anyway be executed
with a distcc setup.

> I have run tests that can reproduce the hung socket with distcc over
> localhost using 26-rc5. I can also apparently cure it using the locking
> fix patch Ilpo sent (c9454f0..d21d2b9) on top of that. (My test of rc5
> +lockpatch is at 4.5+ hrs and counting without failures, it fails 6
> times an hour with vanilla rc5)
>
> Based on all of that, the right thing to do seems to be to apply the
> lockpatch (c9454f0..d21d2b9) to Linus's tree and not revert anything -
> just fix the code and I'll send Ilpo and Ingo cookies at Christmas time
> for being great guys. Alternatively, Ingo could run the distcc servers
> and clients on -tip with the lockpatch (nothing reverted) for more
> testing.

Anyway, we still would have an option to revert both the DA change + the
locking fix later if the problem is still clearly more likely than with
stable-2.6.25.

> The only lingering problem is Ingo's report yesterday
> http://marc.info/?l=linux-netdev&m=121267587715976&w=2
> of a distcc hang. In this one it was not over localhost and the distcc
> server had the ec3c DA changes totally reverted. (The server is really
> the only stack that matters in this case - the client is not impacted by
> the DA changes).

It definately didn't fit to picture that well if we would be talking just
a single bug here.

...I wish Ingo would have provided the receiver state already then. :-)

> This has to be a different issue, because the ec3c code
> we're talking about here wasn't on the server at all. As Ilpo mentions,
> Hakon is beleived to have a different problem and maybe you've tripped
> over that too?

...The H?kon's case is definately different thing, also the symptoms
are quite different because there's no deadlock at all but the TCP flow
eventually dies, I don't yet know with what timescale that dying happens.
Only common denominator actually was this receiver process missing, though
it provably still was there.

Besides, I don't know how long Ingo waited in this case until concluding
that the TCP was stuck again?

> If we're sure of that conclusion we should just take Ilpo's DA patch as
> that will narrow the field for finding Hakon's issue. Its just with all
> of these data points I'm not sure if I'm reaching the right conclusion.

Lets widen the scope to two to three bugs then, one down already...

In case you missed btw, also Arjan reported some problem quite early, but
in his case claws mua+imap was the workload, so I doubt that DEFER_ACCEPT
would be involved but who knows without strace, here:
http://marc.info/?l=linux-kernel&m=121182171000434&w=2

Arjan, can you please check if your workload uses setsockopt
TCP_DEFER_ACCEPT for the LISTENing socket? ...If not, then your case
is different from Ingo's.


--
i.

2008-06-06 21:24:32

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Sat, 7 Jun 2008 00:12:55 +0300 (EEST)
"Ilpo Järvinen" <[email protected]> wrote:
> Arjan, can you please check if your workload uses setsockopt
> TCP_DEFER_ACCEPT for the LISTENing socket? ...If not, then your case
> is different from Ingo's.

at this point the observed hung I had seems to be the app doing
something bad (it's the app that hung) so for now please disregard my
previous comments on this topic.

>
>


--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-06-06 21:28:21

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Fri, 6 Jun 2008, Arjan van de Ven wrote:

> On Sat, 7 Jun 2008 00:12:55 +0300 (EEST)
> "Ilpo J?rvinen" <[email protected]> wrote:
> > Arjan, can you please check if your workload uses setsockopt
> > TCP_DEFER_ACCEPT for the LISTENing socket? ...If not, then your case
> > is different from Ingo's.
>
> at this point the observed hung I had seems to be the app doing
> something bad (it's the app that hung) so for now please disregard my
> previous comments on this topic.

...Ok, thanks for letting us know.

--
i.

2008-06-10 22:33:15

by David Miller

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: "Ilpo_J?rvinen" <[email protected]>
Date: Wed, 4 Jun 2008 02:22:16 +0300 (EEST)

> [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
>
> It seems that replacement of DA code also moved parts outside
> of appropriate locking. The Ingo's problem seems to come from
> the fact that two flows could now race in
> (inet_csk_)reqsk_queue_add corrupting the queue. ...This can
> leave dangling socks around which won't resolve themselves
> without stimuli from outside (e.g., external RST would help
> I think).
>
> Then some details I'm not too sure of:
> I guess we want to put listen_sk->sk_state checking under the
> lock as well. I've not evaluated if ->sk_data_ready too
> requires locking but assumed it does.
>
> I'm by no means familiar with all locking variants, requirements,
> etc.
>
> Signed-off-by: Ilpo J?rvinen <[email protected]>

I took a close look at this, it seems this patch adds
an ABBA deadlock. But I might be wrong.

Normally the locking order is:

listen_sk --> some_child_sk

And this can be seen by the code paths that flow down
to tcp_child_process() in net/ipv4/tcp_minisocks.c (via
tcp_v4_do_rcv() for example).

However here in this patch we will lock in the:

child_sk --> listen_sk

order.

Unless... these defer accepted sockets live outside of the
normal socket collection found by tcp_v{4,6}_hnd_req(). If
that is the case, that ought to make this locking order OK
but I fear that lockdep will likely complain because it has
no way to see this distinction.

If we cannot find a simple and easy way to deal with this locking
problem, I am reverting the defer-accept changes entirely. It's not
the end of the world if this feature has to be deferred to 2.6.27
and this bug has been known for several weeks already.

2008-06-10 22:49:45

by David Miller

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

From: Patrick McManus <[email protected]>
Date: Fri, 06 Jun 2008 16:08:57 -0400

> If we're sure of that conclusion we should just take Ilpo's DA patch as
> that will narrow the field for finding Hakon's issue.

While the conclusion might be accurate, Ilpo's patch adds a lock
ordering problem that might introduce potential deadlocks (worst
case) or at least some lockdep warnings. Both of which are
unacceptable.

I personally don't see any clear solution, and because this new
DA code moves an operation into a code block where the locking
order pretty much has to be inverted, I don't see any real short
term solution other than a revert of this particular DA commit.

2008-06-11 13:10:23

by Patrick McManus

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 2008-06-10 at 15:32 -0700, David Miller wrote:

> If we cannot find a simple and easy way to deal with this locking
> problem, I am reverting the defer-accept changes entirely. It's not
> the end of the world if this feature has to be deferred to 2.6.27

based on this, plus the comments from vitaliy it is clear that moving
the DA socket to established is pretty tricky. I agree I should address
these things in the >= 27 time frame.

imho we don't need to revert the first two patches:
e4c78840284f3f51b1896cf3936d60a6033c4d2c and
539fae89bebd16ebeafd57a87169bc56eb530d76 .. they have nothing to do with
rearranging states or the locking and they do fix the more notable DA
behavior issues.

-Patrick

2008-06-11 15:13:27

by Ilpo Järvinen

[permalink] [raw]
Subject: Re: [fixed] [patch] Re: [bug] stuck localhost TCP connections, v2.6.26-rc3+

On Tue, 10 Jun 2008, David Miller wrote:

> From: "Ilpo_J?rvinen" <[email protected]>
> Date: Wed, 4 Jun 2008 02:22:16 +0300 (EEST)
>
> > [PATCH] tcp DEFER_ACCEPT: fix racy access to listen_sk
> >
> > It seems that replacement of DA code also moved parts outside
> > of appropriate locking. The Ingo's problem seems to come from
> > the fact that two flows could now race in
> > (inet_csk_)reqsk_queue_add corrupting the queue. ...This can
> > leave dangling socks around which won't resolve themselves
> > without stimuli from outside (e.g., external RST would help
> > I think).
> >
> > Then some details I'm not too sure of:
> > I guess we want to put listen_sk->sk_state checking under the
> > lock as well. I've not evaluated if ->sk_data_ready too
> > requires locking but assumed it does.
> >
> > I'm by no means familiar with all locking variants, requirements,
> > etc.
> >
> > Signed-off-by: Ilpo J?rvinen <[email protected]>
>
> I took a close look at this, it seems this patch adds
> an ABBA deadlock. But I might be wrong.
>
> Normally the locking order is:
>
> listen_sk --> some_child_sk
>
> And this can be seen by the code paths that flow down
> to tcp_child_process() in net/ipv4/tcp_minisocks.c (via
> tcp_v4_do_rcv() for example).
>
> However here in this patch we will lock in the:
>
> child_sk --> listen_sk
>
> order.
>
> Unless... these defer accepted sockets live outside of the
> normal socket collection found by tcp_v{4,6}_hnd_req().

All these collection/queues make the analysis a bit complex :-), but
I think I finally got a hold of it anyway and my analysis agrees with
yours. To me it seems that when DAed TCP sk moves to established, there's
no longer a connection from listen_sk to child_sk, we keep them alive by
incrementing those refcounts and the child sk is responsible about the
binding between them (until the DA gets resolution). But the child_sk
locking during the creation is also a suspect, but only after
tcp_rcv_established() part can run for the child sock.

Am I right that after tcp_v4_syn_recv_sock() it is possible to get
tcp_rcv_established() to execute for the created child? It's called
by this path:
tcp_v4_hnd_req ->
tcp_check_req ->
tcp_v4_syn_recv_sock (inet_csk(sk)->icsk_af_ops->syn_recv_sock)
...and the listen_sk part acquires that child's lock only after that in
tcp_v4_hnd_req(). ...Allowing it to deadlock during that short window!

...But from that point onward nothing in listen_sk needs to lock the
child. Then this connection from listen_sk to child_sk comes back once
we've reinserted the DAed child_sk back to the queue (in that racy part I
was trying to fix) but at that point of time we won't ever need to lock in
child_sk -> listen_sk order again because we've already passed that
point.

...But like you, I don't understand all of it that well... Btw, your
sk_callback_lock notes explained some mystery part for me as I came
across with it too while looking around... :-)

> If that is the case, that ought to make this locking order OK
> but I fear that lockdep will likely complain because it has
> no way to see this distinction.

I definately agree that the locks are taken in different order no matter
what (I actually referring to that already in the first mail about the
deadlock) so probably lockdep is not going to be too friendly :-), whether
it could cause a real deadlock, I was not that sure at that point of time.

> If we cannot find a simple and easy way to deal with this locking
> problem, I am reverting the defer-accept changes entirely.

To avoid all problem, I was already thinking of another approach (though
my time constraints hardly allows finishing it any time soon):

Because DA code resides quite late on the TCP path it would be quite easy
to do some preparatory work, drop child sk's lock and re-acquire both
locks in the usual order (listen->child) but that would require handling
correctly reset & other things that could intervene (luckily that pesky
userspace isn't there yet to mess around so not that many things can
happen :-)). But ipv6 seems to do some additional processing after
tcp_rcv_established() which I'd expect to choke if child sk's lock was
dropped there for a moment, while ipv4 part seems quite doable. I don't
even know what exactly are the requirements for that ipv6 part (see the
stuff after ipv6_pktoptions label in tcp_v6_do_rcv). ...I tried to do
this but came across those two things mentioned above.

IMHO, changing locking this late in the release cycle would be quite
risky anyway... And we would also be fiddling with TCP state machine.

> It's not the end of the world if this feature has to be deferred to
> 2.6.27

Agreed. Especially in the light of the another issue that has been
raised.

> and this bug has been known for several weeks already.

...That's partially because Ingo didn't even test my fix on the receiver
which got stuck but used 2.6.25 and got some other bug which looked alike
but couldn't be the same because these DA problems weren't in 2.6.25. What
could I've done for that :-).

--
i.