LinuxLists.cc - Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

2008-06-11 23:53:19

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Alexey Kuznetsov <[email protected]>
Date: Wed, 11 Jun 2008 17:57:18 +0400

> Major issue is that tcp_defer_accept_check() manipulates with not locked
> listening socket. And from all that I know it is impossible to take
> the lock in this context.
>
> Also I see no accounting for those sockets. With this patch any server, which
> set deferred accept, can be flooded with sockets until memory exhausts.
> I did not test and would be glad to be mistaken.
>
>
> Issue with locking can be solved by adding a separate spinlock for
> manipulations with accept_queue. Apparently, accounting and killing
> sockets, which become stale after closing listening socket and
> are going to be alive for up to 65535 seconds, also goes under this lock.
>
> Frankly, cost looks too high for this feature.
>
> Hiding from accept() sockets with only out-of-order data only
> is the only thing which is impossible with old approach. Is this really
> so valuable? My opinion: no, this is nothing but a new loophole
> to consume memory without control.

Yes, we discussed the locking issue over the past few days. See
the thread: "stuck localhost TCP connections, v2.6.26-rc3+".

More and more, the arguments are mounting to completely revert the
established code path changes, and frankly that is likely what I am
going to do by the end of today.

2008-06-12 23:32:26

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: David Miller <[email protected]>
Date: Wed, 11 Jun 2008 16:52:55 -0700 (PDT)

> More and more, the arguments are mounting to completely revert the
> established code path changes, and frankly that is likely what I am
> going to do by the end of today.

Here is the revert patch I intend to send to Linus:

tcp: Revert 'process defer accept as established' changes.

This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38
("tcp: Fix slab corruption with ipv6 and tcp6fuzz").

This change causes several problems, first reported by Ingo Molnar
as a distcc-over-loopback regression where connections were getting
stuck.

Ilpo Järvinen first spotted the locking problems. The new function
added by this code, tcp_defer_accept_check(), only has the
child socket locked, yet it is modifying state of the parent
listening socket.

Fixing that is non-trivial at best, because we can't simply just grab
the parent listening socket lock at this point, because it would
create an ABBA deadlock. The normal ordering is parent listening
socket --> child socket, but this code path would require the
reverse lock ordering.
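
The ordering constraint described above can be illustrated with a toy lock-rank check. This is a minimal user-space sketch, not kernel code: the rank values, `struct lock_ctx`, and `try_take_in_order()` are hypothetical names introduced only for illustration. It models the rule that the listening socket's lock must be taken before the child's, so a path that already holds the child lock cannot safely ask for the listener's lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks for illustration: a lower rank must be locked first. */
enum lock_rank { RANK_NONE = 0, RANK_LISTENER = 1, RANK_CHILD = 2 };

/* Tracks the highest-ranked lock one execution context currently holds. */
struct lock_ctx { enum lock_rank highest_held; };

/*
 * Returns true and records the lock if taking `next` respects the
 * listener --> child order; returns false if it would invert the order,
 * which is the ABBA pattern when another context uses the normal order.
 */
static bool try_take_in_order(struct lock_ctx *ctx, enum lock_rank next)
{
    assert(next != RANK_NONE);
    if (next <= ctx->highest_held)
        return false;
    ctx->highest_held = next;
    return true;
}
```

In this model, a context that takes `RANK_LISTENER` and then `RANK_CHILD` succeeds both times, while a context that starts with `RANK_CHILD` is refused when it then asks for `RANK_LISTENER`. The second case corresponds to the position tcp_defer_accept_check() is in.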

Next is a problem noticed by Vitaliy Gusev, who noted:

----------------------------------------
>--- a/net/ipv4/tcp_timer.c
>+++ b/net/ipv4/tcp_timer.c
>@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> goto death;
> }
>
>+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
>+ tcp_send_active_reset(sk, GFP_ATOMIC);
>+ goto death;

Here socket sk is not attached to listening socket's request queue. tcp_done()
will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
release this sk) as socket is not DEAD. Therefore socket sk will be lost for
freeing.
----------------------------------------

Finally, Alexey Kuznetsov argues that there might not even be any
real value or advantage to these new semantics even if we fix all
of the bugs:

----------------------------------------
Hiding from accept() sockets with only out-of-order data only
is the only thing which is impossible with old approach. Is this really
so valuable? My opinion: no, this is nothing but a new loophole
to consume memory without control.
----------------------------------------

So revert this thing for now.

Signed-off-by: David S. Miller <[email protected]>
---
include/linux/tcp.h | 7 ------
include/net/request_sock.h | 4 +-
include/net/tcp.h | 1 -
net/ipv4/inet_connection_sock.c | 11 +++++++--
net/ipv4/tcp.c | 18 +++++++++------
net/ipv4/tcp_input.c | 45 ---------------------------------------
net/ipv4/tcp_ipv4.c | 8 -------
net/ipv4/tcp_minisocks.c | 32 ++++++++++-----------------
net/ipv4/tcp_timer.c | 5 ----
9 files changed, 33 insertions(+), 98 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 18e62e3..b31b6b7 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -239,11 +239,6 @@ static inline struct tcp_request_sock *tcp_rsk(const struct request_sock *req)
return (struct tcp_request_sock *)req;
}

-struct tcp_deferred_accept_info {
- struct sock *listen_sk;
- struct request_sock *request;
-};
-
struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
@@ -379,8 +374,6 @@ struct tcp_sock {
unsigned int keepalive_intvl; /* time interval between keep alive probes */
int linger2;

- struct tcp_deferred_accept_info defer_tcp_accept;
-
unsigned long last_synq_overflow;

u32 tso_deferred;
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index b220b5f..0c96e7b 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -115,8 +115,8 @@ struct request_sock_queue {
struct request_sock *rskq_accept_head;
struct request_sock *rskq_accept_tail;
rwlock_t syn_wait_lock;
- u16 rskq_defer_accept;
- /* 2 bytes hole, try to pack */
+ u8 rskq_defer_accept;
+ /* 3 bytes hole, try to pack */
struct listen_sock *listen_opt;
};

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d448310..cf54034 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -139,7 +139,6 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
#define MAX_TCP_KEEPINTVL 32767
#define MAX_TCP_KEEPCNT 127
#define MAX_TCP_SYNCNT 127
-#define MAX_TCP_ACCEPT_DEFERRED 65535

#define TCP_SYNQ_INTERVAL (HZ/5) /* Period of SYNACK timer */

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 828ea21..045e799 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -419,7 +419,8 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
struct inet_connection_sock *icsk = inet_csk(parent);
struct request_sock_queue *queue = &icsk->icsk_accept_queue;
struct listen_sock *lopt = queue->listen_opt;
- int thresh = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int max_retries = icsk->icsk_syn_retries ? : sysctl_tcp_synack_retries;
+ int thresh = max_retries;
unsigned long now = jiffies;
struct request_sock **reqp, *req;
int i, budget;
@@ -455,6 +456,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
}
}

+ if (queue->rskq_defer_accept)
+ max_retries = queue->rskq_defer_accept;
+
budget = 2 * (lopt->nr_table_entries / (timeout / interval));
i = lopt->clock_hand;

@@ -462,8 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if (req->retrans < thresh &&
- !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ if ((req->retrans < (inet_rsk(req)->acked ? max_retries : thresh)) &&
+ (inet_rsk(req)->acked ||
+ !req->rsk_ops->rtx_syn_ack(parent, req))) {
unsigned long timeo;

if (req->retrans++ == 0)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab66683..fc54a48 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2112,12 +2112,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;

case TCP_DEFER_ACCEPT:
- if (val < 0) {
- err = -EINVAL;
- } else {
- if (val > MAX_TCP_ACCEPT_DEFERRED)
- val = MAX_TCP_ACCEPT_DEFERRED;
- icsk->icsk_accept_queue.rskq_defer_accept = val;
+ icsk->icsk_accept_queue.rskq_defer_accept = 0;
+ if (val > 0) {
+ /* Translate value in seconds to number of
+ * retransmits */
+ while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
+ val > ((TCP_TIMEOUT_INIT / HZ) <<
+ icsk->icsk_accept_queue.rskq_defer_accept))
+ icsk->icsk_accept_queue.rskq_defer_accept++;
+ icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;

@@ -2299,7 +2302,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
val = (val ? : sysctl_tcp_fin_timeout) / HZ;
break;
case TCP_DEFER_ACCEPT:
- val = icsk->icsk_accept_queue.rskq_defer_accept;
+ val = !icsk->icsk_accept_queue.rskq_defer_accept ? 0 :
+ ((TCP_TIMEOUT_INIT / HZ) << (icsk->icsk_accept_queue.rskq_defer_accept - 1));
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index eba873e..cad73b7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4541,49 +4541,6 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
}
}

-static int tcp_defer_accept_check(struct sock *sk)
-{
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (tp->defer_tcp_accept.request) {
- int queued_data = tp->rcv_nxt - tp->copied_seq;
- int hasfin = !skb_queue_empty(&sk->sk_receive_queue) ?
- tcp_hdr((struct sk_buff *)
- sk->sk_receive_queue.prev)->fin : 0;
-
- if (queued_data && hasfin)
- queued_data--;
-
- if (queued_data &&
- tp->defer_tcp_accept.listen_sk->sk_state == TCP_LISTEN) {
- if (sock_flag(sk, SOCK_KEEPOPEN)) {
- inet_csk_reset_keepalive_timer(sk,
- keepalive_time_when(tp));
- } else {
- inet_csk_delete_keepalive_timer(sk);
- }
-
- inet_csk_reqsk_queue_add(
- tp->defer_tcp_accept.listen_sk,
- tp->defer_tcp_accept.request,
- sk);
-
- tp->defer_tcp_accept.listen_sk->sk_data_ready(
- tp->defer_tcp_accept.listen_sk, 0);
-
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- } else if (hasfin ||
- tp->defer_tcp_accept.listen_sk->sk_state != TCP_LISTEN) {
- tcp_reset(sk);
- return -1;
- }
- }
- return 0;
-}
-
static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
{
struct tcp_sock *tp = tcp_sk(sk);
@@ -4944,8 +4901,6 @@ step5:

tcp_data_snd_check(sk);
tcp_ack_snd_check(sk);
-
- tcp_defer_accept_check(sk);
return 0;

csum_error:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 4f8485c..97a2300 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1918,14 +1918,6 @@ int tcp_v4_destroy_sock(struct sock *sk)
sk->sk_sndmsg_page = NULL;
}

- if (tp->defer_tcp_accept.request) {
- reqsk_free(tp->defer_tcp_accept.request);
- sock_put(tp->defer_tcp_accept.listen_sk);
- sock_put(sk);
- tp->defer_tcp_accept.listen_sk = NULL;
- tp->defer_tcp_accept.request = NULL;
- }
-
atomic_dec(&tcp_sockets_allocated);

return 0;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 019c8c1..8245247 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -571,8 +571,10 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
does sequence test, SYN is truncated, and thus we consider
it a bare ACK.

- Both ends (listening sockets) accept the new incoming
- connection and try to talk to each other. 8-)
+ If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
+ bare ACK. Otherwise, we create an established connection. Both
+ ends (listening sockets) accept the new incoming connection and try
+ to talk to each other. 8-)

Note: This case is both harmless, and rare. Possibility is about the
same as us discovering intelligent life on another plant tomorrow.
@@ -640,6 +642,13 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
if (!(flg & TCP_FLAG_ACK))
return NULL;

+ /* If TCP_DEFER_ACCEPT is set, drop bare ACK. */
+ if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
+ TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
+ inet_rsk(req)->acked = 1;
+ return NULL;
+ }
+
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
@@ -678,24 +687,7 @@ struct sock *tcp_check_req(struct sock *sk,struct sk_buff *skb,
inet_csk_reqsk_queue_unlink(sk, req, prev);
inet_csk_reqsk_queue_removed(sk, req);

- if (inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
- TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
-
- /* the accept queue handling is done is est recv slow
- * path so lets make sure to start there
- */
- tcp_sk(child)->pred_flags = 0;
- sock_hold(sk);
- sock_hold(child);
- tcp_sk(child)->defer_tcp_accept.listen_sk = sk;
- tcp_sk(child)->defer_tcp_accept.request = req;
-
- inet_csk_reset_keepalive_timer(child,
- inet_csk(sk)->icsk_accept_queue.rskq_defer_accept * HZ);
- } else {
- inet_csk_reqsk_queue_add(sk, req, child);
- }
-
+ inet_csk_reqsk_queue_add(sk, req, child);
return child;

listen_overflow:
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4de68cf..63ed9d6 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -489,11 +489,6 @@ static void tcp_keepalive_timer (unsigned long data)
goto death;
}

- if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
- tcp_send_active_reset(sk, GFP_ATOMIC);
- goto death;
- }
-
if (!sock_flag(sk, SOCK_KEEPOPEN) || sk->sk_state == TCP_CLOSE)
goto out;

--
1.5.5.1.308.g1fbb5
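
The TCP_DEFER_ACCEPT hunks in net/ipv4/tcp.c above restore a translation from seconds into a retransmit count on setsockopt, and back again on getsockopt. The following is a user-space sketch of that arithmetic only. It assumes `TCP_TIMEOUT_INIT / HZ` evaluates to 3 seconds, which is not stated in this thread, and the function names are hypothetical.

```c
#include <assert.h>

/* Assumption: TCP_TIMEOUT_INIT / HZ, i.e. the initial timeout in seconds. */
#define INIT_TIMEOUT_SECS 3ULL

/* Models the setsockopt side: seconds -> stored retransmit count. */
static unsigned int defer_secs_to_count(int secs)
{
    unsigned int count = 0;

    if (secs > 0) {
        /* Find the first doubling of the initial timeout that covers secs. */
        while (count < 32 &&
               (unsigned long long)secs > (INIT_TIMEOUT_SECS << count))
            count++;
        count++; /* stored value is offset by one so that 0 means "off" */
    }
    return count;
}

/* Models the getsockopt side: stored count -> seconds reported back. */
static unsigned long long defer_count_to_secs(unsigned int count)
{
    assert(count <= 33);
    return count ? (INIT_TIMEOUT_SECS << (count - 1)) : 0;
}
```

Under that 3-second assumption, a request for 10 seconds is stored as a count of 3 and read back as 12 seconds, so the value an application reads back is rounded up to the initial timeout times a power of two rather than echoed verbatim.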

2008-06-13 06:31:20

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: David Miller <[email protected]>
> Date: Wed, 11 Jun 2008 16:52:55 -0700 (PDT)
>
> > More and more, the arguments are mounting to completely revert the
> > established code path changes, and frankly that is likely what I am
> > going to do by the end of today.
>
> Here is the revert patch I intend to send to Linus:
>
> tcp: Revert 'process defer accept as established' changes.
>
> This reverts two changesets, ec3c0982a2dd1e671bad8e9d26c28dcba0039d87
> ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
> the follow-on bug fix 9ae27e0adbf471c7a6b80102e38e1d5a346b3b38
> ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
>
> This change causes several problems, first reported by Ingo Molnar
> as a distcc-over-loopback regression where connections were getting
> stuck.
>
> Ilpo Järvinen first spotted the locking problems. The new function
> added by this code, tcp_defer_accept_check(), only has the
> child socket locked, yet it is modifying state of the parent
> listening socket.
>
> Fixing that is non-trivial at best, because we can't simply just grab
> the parent listening socket lock at this point, because it would
> create an ABBA deadlock. The normal ordering is parent listening
> socket --> child socket, but this code path would require the
> reverse lock ordering.
>
> Next is a problem noticed by Vitaliy Gusev, who noted:
>
> ----------------------------------------
> >--- a/net/ipv4/tcp_timer.c
> >+++ b/net/ipv4/tcp_timer.c
> >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
> > goto death;
> > }
> >
> >+ if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
> >+ tcp_send_active_reset(sk, GFP_ATOMIC);
> >+ goto death;
>
> Here socket sk is not attached to listening socket's request queue. tcp_done()
> will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
> release this sk) as socket is not DEAD. Therefore socket sk will be lost for
> freeing.
> ----------------------------------------
>
> Finally, Alexey Kuznetsov argues that there might not even be any
> real value or advantage to these new semantics even if we fix all
> of the bugs:
>
> ----------------------------------------
> Hiding from accept() sockets with only out-of-order data only
> is the only thing which is impossible with old approach. Is this really
> so valuable? My opinion: no, this is nothing but a new loophole
> to consume memory without control.
> ----------------------------------------
>
> So revert this thing for now.
>
> Signed-off-by: David S. Miller <[email protected]>

the 3 reverts have been extensively tested in -tip via:

# tip/out-of-tree: 9e5b6ca: tcp: revert DEFER_ACCEPT modifications

and the distcc problems are fixed. (The locking fix alone did not fix it
conclusively in my testing, possibly due to the follow-on observations
outlined in your description.)

Tested-by: Ingo Molnar <[email protected]>

Ingo

2008-06-13 09:32:21

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Ingo Molnar <[email protected]>
Date: Fri, 13 Jun 2008 08:30:37 +0200

> the 3 reverts have been extensively tested in -tip via:
>
> # tip/out-of-tree: 9e5b6ca: tcp: revert DEFER_ACCEPT modifications
>
> and the distcc problems are fixed. (The locking fix alone did not fix it
> conclusively in my testing, possibly due to the follow-on observations
> outlined in your description.)
>
> Tested-by: Ingo Molnar <[email protected]>

I didn't revert all three changes, just the final part of that
3 part series.

Please test the patch I actually applied.

2008-06-13 11:09:48

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 13 Jun 2008 08:30:37 +0200
>
> > the 3 reverts have been extensively tested in -tip via:
> >
> > # tip/out-of-tree: 9e5b6ca: tcp: revert DEFER_ACCEPT modifications
> >
> > and the distcc problems are fixed. (The locking fix alone did not fix it
> > conclusively in my testing, possibly due to the follow-on observations
> > outlined in your description.)
> >
> > Tested-by: Ingo Molnar <[email protected]>
>
> I didn't revert all three changes, just the final part of that 3 part
> series.
>
> Please test the patch I actually applied.

i just updated all my testsystems to revert the change i tested so far,
and updated it to yours. The delta between the two is the 3-line patch
below.

A few testsystems already booted into your patch, so if i don't report a
hung TCP connection in the next 6 hours consider it:

Tested-by: Ingo Molnar <[email protected]>

Ingo

--------------->
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index ec83448..045e799 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -466,9 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if ((req->retrans < thresh ||
- (inet_rsk(req)->acked && req->retrans < max_retries))
- && !req->rsk_ops->rtx_syn_ack(parent, req)) {
+ if ((req->retrans < (inet_rsk(req)->acked ? max_retries : thresh)) &&
+ (inet_rsk(req)->acked ||
+ !req->rsk_ops->rtx_syn_ack(parent, req))) {
unsigned long timeo;

if (req->retrans++ == 0)
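
The behavioural difference between the two conditions in the delta above can be seen in a small user-space model. This is a sketch under stated assumptions, not kernel code: `struct req_model`, the counter `synack_sent`, and `rtx_syn_ack_model()` are hypothetical stand-ins for the request sock and for `rtx_syn_ack()`, which is assumed here to return 0 on success.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields of a pending request used here. */
struct req_model { int retrans; bool acked; };

static int synack_sent; /* counts modelled SYN-ACK retransmissions */

/* Models rtx_syn_ack(): assumed to return 0 when the SYN-ACK was sent. */
static int rtx_syn_ack_model(void)
{
    synack_sent++;
    return 0;
}

/* Condition on the '+' lines: the patch David Miller applied. */
static bool keep_req_applied(const struct req_model *r, int thresh,
                             int max_retries)
{
    assert(thresh >= 0 && max_retries >= 0);
    return (r->retrans < (r->acked ? max_retries : thresh)) &&
           (r->acked || !rtx_syn_ack_model());
}

/* Condition on the '-' lines: the variant Ingo Molnar had been testing. */
static bool keep_req_tested(const struct req_model *r, int thresh,
                            int max_retries)
{
    assert(thresh >= 0 && max_retries >= 0);
    return (r->retrans < thresh ||
            (r->acked && r->retrans < max_retries)) &&
           !rtx_syn_ack_model();
}
```

For a request that is already acked and under both limits, both predicates keep the request, but only the tested variant retransmits a SYN-ACK; the applied variant short-circuits on `acked` and sends nothing. For an acked request the applied variant also uses `max_retries` alone as the limit, whereas the tested variant keeps the request while it is under either limit.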

2008-06-13 11:48:35

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Ingo Molnar <[email protected]> wrote:

> > Please test the patch I actually applied.
>
> i just updated all my testsystems to revert the change i tested so
> far, and updated it to yours. The delta between the two is the 3 lines
> patch below.
>
> A few testsystems already booted into your patch, so if i dont report
> a hung TCP connection in the next 6 hours consider it:
>
> Tested-by: Ingo Molnar <[email protected]>

this threw the warning below - never saw that before in thousands of
bootups and this was the only networking change that happened. config
and bootlog attached. Might be unlucky coincidence.

Ingo

[ 173.354049] NETDEV WATCHDOG: eth0: transmit timed out
[ 173.354148] ------------[ cut here ]------------
[ 173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
[ 173.354298] Modules linked in:
[ 173.354421] Pid: 13452, comm: cc1 Tainted: G W 2.6.26-rc6-00273-g81ae43a-dirty #2573
[ 173.354516] [<c01250ca>] warn_on_slowpath+0x46/0x76
[ 173.354641] [<c011d428>] ? try_to_wake_up+0x1d6/0x1e0
[ 173.354815] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[ 173.357370] [<c011d43d>] ? default_wake_function+0xb/0xd
[ 173.357370] [<c014112a>] ? trace_hardirqs_off_caller+0x15/0xc9
[ 173.357370] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[ 173.357370] [<c0142b33>] ? trace_hardirqs_on_caller+0x16/0x15b
[ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[ 173.357370] [<c06bb3c9>] ? _spin_unlock_irqrestore+0x5b/0x71
[ 173.357370] [<c0133d46>] ? __queue_work+0x2d/0x32
[ 173.357370] [<c0134023>] ? queue_work+0x50/0x72
[ 173.357483] [<c0134059>] ? schedule_work+0x14/0x16
[ 173.357654] [<c05c59b8>] dev_watchdog+0x9a/0xec
[ 173.357783] [<c012d456>] run_timer_softirq+0x13d/0x19d
[ 173.357905] [<c05c591e>] ? dev_watchdog+0x0/0xec
[ 173.358073] [<c05c591e>] ? dev_watchdog+0x0/0xec
[ 173.360804] [<c0129ad7>] __do_softirq+0xb2/0x15c
[ 173.360804] [<c0129a25>] ? __do_softirq+0x0/0x15c
[ 173.360804] [<c0105526>] do_softirq+0x84/0xe9
[ 173.360804] [<c0129996>] irq_exit+0x4b/0x88
[ 173.360804] [<c010ec7a>] smp_apic_timer_interrupt+0x73/0x81
[ 173.360804] [<c0103ddd>] apic_timer_interrupt+0x2d/0x34
[ 173.360804] =======================
[ 173.360804] ---[ end trace a7919e7f17c0a725 ]---
[ 173.396182] evbug.c: Event. Dev: <NULL>, Type: 0, Code: 0, Value: 0
[ 173.446150] evbug.c: Event. Dev: <NULL>, Type: 0, Code: 0, Value: 0

Attachments:

(No filename) (2.43 kB)
config (44.94 kB)
boot.log.bz2 (60.62 kB)

2008-06-13 21:11:16

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Ingo Molnar <[email protected]> wrote:

> > far, and updated it to yours. The delta between the two is the 3 lines
> > patch below.
> >
> > A few testsystems already booted into your patch, so if i dont report
> > a hung TCP connection in the next 6 hours consider it:
> >
> > Tested-by: Ingo Molnar <[email protected]>
>
> this threw the warning below - never saw that before in thousands of
> bootups and this was the only networking change that happened. config
> and bootlog attached. Might be unlucky coincidence.

hm, threw a second warning after 6 more hours of testing:

[ 362.170209] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0xde/0xf0()

that appears to be more than just coincidence. I've applied the patch
below - which brings me back to the well-tested revert from Ilpo.

This is the only change i've done for the overnight -tip testruns, so if
the warning from sch_generic.c goes away it's this change that has an
impact on that warning.

Ingo

--------------------->
commit 3019ae9652fe44c099669e5dba116acad583cfcb
Author: Ingo Molnar <[email protected]>
Date: Fri Jun 13 23:09:28 2008 +0200

tcp: revert again

Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 045e799..ec83448 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -466,9 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if ((req->retrans < (inet_rsk(req)->acked ? max_retries : thresh)) &&
- (inet_rsk(req)->acked ||
- !req->rsk_ops->rtx_syn_ack(parent, req))) {
+ if ((req->retrans < thresh ||
+ (inet_rsk(req)->acked && req->retrans < max_retries))
+ && !req->rsk_ops->rtx_syn_ack(parent, req)) {
unsigned long timeo;

if (req->retrans++ == 0)

2008-06-16 23:59:14

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Ingo Molnar <[email protected]>
Date: Fri, 13 Jun 2008 13:47:46 +0200

> this threw the warning below - never saw that before in thousands of
> bootups and this was the only networking change that happened. config
> and bootlog attached. Might be unlucky coincidence.

So that we can make forward progress here, please confirm that the
following patch against -tip makes your problems go away for good.

Once you can confirm I will push it to Linus.

Thanks!

tcp: Revert reset of deferred accept changes in 2.6.26

Ingo's system is still seeing strange behavior, and he
reports that it goes away if the rest of the deferred
accept changes are reverted too.

Therefore this reverts e4c78840284f3f51b1896cf3936d60a6033c4d2c
("[TCP]: TCP_DEFER_ACCEPT updates - dont retxmt synack") and
539fae89bebd16ebeafd57a87169bc56eb530d76 ("[TCP]: TCP_DEFER_ACCEPT
updates - defer timeout conflicts with max_thresh").

Just like the other revert, these ideas can be revisited for
2.6.27.

Signed-off-by: David S. Miller <[email protected]>
---
net/ipv4/inet_connection_sock.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 045e799..ec83448 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -466,9 +466,9 @@ void inet_csk_reqsk_queue_prune(struct sock *parent,
reqp=&lopt->syn_table[i];
while ((req = *reqp) != NULL) {
if (time_after_eq(now, req->expires)) {
- if ((req->retrans < (inet_rsk(req)->acked ? max_retries : thresh)) &&
- (inet_rsk(req)->acked ||
- !req->rsk_ops->rtx_syn_ack(parent, req))) {
+ if ((req->retrans < thresh ||
+ (inet_rsk(req)->acked && req->retrans < max_retries))
+ && !req->rsk_ops->rtx_syn_ack(parent, req)) {
unsigned long timeo;

if (req->retrans++ == 0)
--
1.5.5.1.308.g1fbb5

2008-06-17 07:27:31

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Fri, 13 Jun 2008 13:47:46 +0200
>
> > this threw the warning below - never saw that before in thousands of
> > bootups and this was the only networking change that happened.
> > config and bootlog attached. Might be unlucky coincidence.
>
> So that we can make forward progress here, please confirm that the
> following patch against -tip makes your problems go away for good.
>
> Once you can confirm I will push it to Linus.

i triggered the net/sched/sch_generic.c:222 warning once more meanwhile
(yesterday) with the full revert applied (which i think is the same as
the patch below).

So i think it's either some unlucky coincidence or some timing
relationship - perhaps the change impacts packet ordering for certain
workload patterns? [but that same condition can occur without that patch
too]

I also checked kerneloops.org and this warning seems to have been
reported by others as well - although it's not triggering heavily. In
some of those other reports the warning came together with a dead
interface, while in my case it's just a warning with still working
networking.

So since there's no clear bug pattern and no sure reproducibility on my
side i'd suggest we track this problem separately and "do nothing" right
now. I've excluded this warning from my 'is the freshly booted kernel
buggy' list of conditions of -tip testing so it's not holding me up.

and i can apply any test-patch if that would be helpful - if it does a
WARN_ON() i'll notice it. (pure extra debug printks with no stack trace
are much harder to notice in automated tests)

btw., it would be nice if there was some .config driven networking debug
option that randomized packet ordering in the tx and rx queue.
(transparently enabled, with zero-config on the userspace side)

I.e. it would have an (expensive, because O(1)) debug mechanism that
randomized things - it would insert new packets into a random place
within the queue where it gets queued. We could hit races and rarer
codepaths much sooner that way - as especially in LAN based testing
there's a strong natural ordering of packets so randomizing it
artificially looks promising to me.

If you make that new option =y enable-able in the .config (dependent on
DEBUG_KERNEL && default off, etc.), and as long as it does not have to
be configured on the userspace side (i'm testing unmodified userspace
images with default distro installs, etc.) the randconfig test will
still be able to reach it in a percentage of the tests and i think we'll
be able to hit a lot of exciting races much sooner than with the normal
in-order/FIFO queueing methods.

it's basically massively parallel coverage testing. It doesn't matter how
unbelievably slow packet ordering randomization might be, the coverage
testing it would do would be worth gold i'm sure. (I'd love to test
something like that in -tip, if it comes in form of some standalone
patch against a mainline-ish tree.)

Ingo
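
The random-insertion idea in the message above can be sketched in user space as follows. This is only an illustration of the queueing behaviour, not a proposal for the kernel's tx/rx paths: `struct pkt_queue`, `enqueue_random()`, `queue_count()`, the fixed capacity, and the small linear congruential generator are all hypothetical choices made for the sketch, and the array shift makes each insert linear in the queue length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QCAP 64 /* hypothetical fixed capacity for the sketch */

struct pkt_queue {
    int ids[QCAP];    /* packet identifiers, head of queue at index 0 */
    size_t len;       /* number of queued packets */
    unsigned int rng; /* state of a small deterministic generator */
};

/* Simple linear congruential step; adequate for a test-only shuffle. */
static unsigned int next_rand(struct pkt_queue *q)
{
    q->rng = q->rng * 1103515245u + 12345u;
    return (q->rng >> 16) & 0x7fffu;
}

/* Inserts id at a random position in [0, len]; returns -1 if full. */
static int enqueue_random(struct pkt_queue *q, int id)
{
    size_t pos;

    if (q->len == QCAP)
        return -1;
    pos = next_rand(q) % (q->len + 1);
    /* Shift the tail up by one slot to open a hole at pos. */
    memmove(&q->ids[pos + 1], &q->ids[pos],
            (q->len - pos) * sizeof(q->ids[0]));
    q->ids[pos] = id;
    q->len++;
    return 0;
}

/* Counts how many times id appears in the queue. */
static size_t queue_count(const struct pkt_queue *q, int id)
{
    size_t i, n = 0;

    for (i = 0; i < q->len; i++)
        if (q->ids[i] == id)
            n++;
    return n;
}
```

Seeding `rng` with a fixed value keeps a failing order reproducible, which matters for the automated testing described above: every packet is still delivered exactly once, only the order changes.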

2008-06-17 07:38:45

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Ingo Molnar <[email protected]>
Date: Tue, 17 Jun 2008 09:26:58 +0200

> So since there's no clear bug pattern and no sure reproducability on my
> side i'd suggest we track this problem separately and "do nothing" right
> now. I've excluded this warning from my 'is the freshly booted kernel
> buggy' list of conditions of -tip testing so it's not holding me up.

I'm going to push the revert through just to be safe and I think it's
a good idea to do so because all of those defer accept changes should
be resubmitted as a group for 2.6.27.

> and i can apply any test-patch if that would be helpful - if it does a
> WARN_ON() i'll notice it. (pure extra debug printks with no stack trace
> are much harder to notice in automated tests)

I don't have time to work on your bug, sorry. Someone else will
have to step forward and help you with it.

FWIW I don't think your TX timeout problem has anything to do with
packet ordering. The TX element of the network device is totally
stateless, but it's hanging under some set of circumstances to the
point where we timeout and reset the hardware to get it going again.

2008-06-17 08:10:54

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 17 Jun 2008 09:26:58 +0200
>
> > So since there's no clear bug pattern and no sure reproducability on
> > my side i'd suggest we track this problem separately and "do
> > nothing" right now. I've excluded this warning from my 'is the
> > freshly booted kernel buggy' list of conditions of -tip testing so
> > it's not holding me up.
>
> I'm going to push the revert through just to be safe and I think it's
> a good idea to do so because all of those defer accept changes should
> be resubmitted as a group for 2.6.27

okay - in that case the full revert is well-tested on my side as well,
fwiw.

Tested-by: Ingo Molnar <[email protected]>

> > and i can apply any test-patch if that would be helpful - if it does
> > a WARN_ON() i'll notice it. (pure extra debug printks with no stack
> > trace are much harder to notice in automated tests)
>
> I don't have time to work on your bug, sorry. Someone else will have
> to step forward and help you with it.

it's not really "my bug" - i just offered help to debug someone else's
bug :-) This is pretty common hw so i guess there will be such reports.

Let me describe what i'm doing exactly: i do a lot of randomized testing
on about a dozen real systems (all across the x86 spectrum) so i tend to
trigger a lot of mainline bugs pretty early on.

My collection of kernel bugs for the last 8 months shows 1285 bugs
(kernel crashes or build failures - about 50%/50%) triggered. One
test-system alone has a serial log of 15 gigabytes - and there's a dozen
of them. That's about 5 kernel bugs a day handled by me, on average.

These systems have about 10 times the hardware variability of your
Niagara system for example, and many of them are rather difficult to
debug (laptops without serial port, etc.). So i physically cannot avoid
and debug all bugs on all my test-systems, like you do on the Niagara. I
will report bugs, i'll bisect anything that is bisectable (on average i
bisect once a day), and i can add patches and report any test-results,
and i'll of course debug any bugs that look like heavy mainline
showstoppers.

> FWIW I don't think your TX timeout problem has anything to do with
> packet ordering. The TX element of the network device is totally
> stateless, but it's hanging under some set of circumstances to the
> point where we timeout and reset the hardware to get it going again.

ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Lenovo ThinkPad T60
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
I/O ports at 2000 [size=32]
Capabilities: <access denied>
Kernel driver in use: e1000

the problem is this non-fatal warning showing up after bootup,
sporadically, in a non-reproducible way:

[ 173.354049] NETDEV WATCHDOG: eth0: transmit timed out
[ 173.354148] ------------[ cut here ]------------
[ 173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
[ 173.354298] Modules linked in:
[ 173.354421] Pid: 13452, comm: cc1 Tainted: G W 2.6.26-rc6-00273-g81ae43a-dirty #2573
[ 173.354516] [<c01250ca>] warn_on_slowpath+0x46/0x76
[ 173.354641] [<c011d428>] ? try_to_wake_up+0x1d6/0x1e0
[ 173.354815] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[ 173.357370] [<c011d43d>] ? default_wake_function+0xb/0xd
[ 173.357370] [<c014112a>] ? trace_hardirqs_off_caller+0x15/0xc9
[ 173.357370] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
[ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[ 173.357370] [<c0142b33>] ? trace_hardirqs_on_caller+0x16/0x15b
[ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
[ 173.357370] [<c06bb3c9>] ? _spin_unlock_irqrestore+0x5b/0x71
[ 173.357370] [<c0133d46>] ? __queue_work+0x2d/0x32
[ 173.357370] [<c0134023>] ? queue_work+0x50/0x72
[ 173.357483] [<c0134059>] ? schedule_work+0x14/0x16
[ 173.357654] [<c05c59b8>] dev_watchdog+0x9a/0xec
[ 173.357783] [<c012d456>] run_timer_softirq+0x13d/0x19d
[ 173.357905] [<c05c591e>] ? dev_watchdog+0x0/0xec
[ 173.358073] [<c05c591e>] ? dev_watchdog+0x0/0xec
[ 173.360804] [<c0129ad7>] __do_softirq+0xb2/0x15c
[ 173.360804] [<c0129a25>] ? __do_softirq+0x0/0x15c
[ 173.360804] [<c0105526>] do_softirq+0x84/0xe9
[ 173.360804] [<c0129996>] irq_exit+0x4b/0x88
[ 173.360804] [<c010ec7a>] smp_apic_timer_interrupt+0x73/0x81
[ 173.360804] [<c0103ddd>] apic_timer_interrupt+0x2d/0x34
[ 173.360804] =======================
[ 173.360804] ---[ end trace a7919e7f17c0a725 ]---

full report can be found at:

http://lkml.org/lkml/2008/6/13/224

i have 3 other test-systems with e1000 (with a similar CPU) which are
_not_ showing this symptom, so this could be some model-specific e1000
issue.

Ingo

2008-06-17 08:33:10

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Ingo Molnar <[email protected]> wrote:

>
> > FWIW I don't think your TX timeout problem has anything to do with
> > packet ordering. The TX element of the network device is totally
> > stateless, but it's hanging under some set of circumstances to the
> > point where we timeout and reset the hardware to get it going again.
>
> ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> Subsystem: Lenovo ThinkPad T60
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
> I/O ports at 2000 [size=32]
> Capabilities: <access denied>
> Kernel driver in use: e1000
>
> the problem is this non-fatal warning showing up after bootup,
> sporadically, in a non-reproducible way:
>
> [ 173.354049] NETDEV WATCHDOG: eth0: transmit timed out
> [ 173.354148] ------------[ cut here ]------------
> [ 173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
> [ 173.354298] Modules linked in:
> [ 173.354421] Pid: 13452, comm: cc1 Tainted: G W 2.6.26-rc6-00273-g81ae43a-dirty #2573
> [ 173.354516] [<c01250ca>] warn_on_slowpath+0x46/0x76
> [ 173.354641] [<c011d428>] ? try_to_wake_up+0x1d6/0x1e0
> [ 173.354815] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [<c011d43d>] ? default_wake_function+0xb/0xd
> [ 173.357370] [<c014112a>] ? trace_hardirqs_off_caller+0x15/0xc9
> [ 173.357370] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [<c0142b33>] ? trace_hardirqs_on_caller+0x16/0x15b
> [ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [<c06bb3c9>] ? _spin_unlock_irqrestore+0x5b/0x71
> [ 173.357370] [<c0133d46>] ? __queue_work+0x2d/0x32
> [ 173.357370] [<c0134023>] ? queue_work+0x50/0x72
> [ 173.357483] [<c0134059>] ? schedule_work+0x14/0x16
> [ 173.357654] [<c05c59b8>] dev_watchdog+0x9a/0xec
> [ 173.357783] [<c012d456>] run_timer_softirq+0x13d/0x19d
> [ 173.357905] [<c05c591e>] ? dev_watchdog+0x0/0xec
> [ 173.358073] [<c05c591e>] ? dev_watchdog+0x0/0xec
> [ 173.360804] [<c0129ad7>] __do_softirq+0xb2/0x15c
> [ 173.360804] [<c0129a25>] ? __do_softirq+0x0/0x15c
> [ 173.360804] [<c0105526>] do_softirq+0x84/0xe9
> [ 173.360804] [<c0129996>] irq_exit+0x4b/0x88
> [ 173.360804] [<c010ec7a>] smp_apic_timer_interrupt+0x73/0x81
> [ 173.360804] [<c0103ddd>] apic_timer_interrupt+0x2d/0x34
> [ 173.360804] =======================
> [ 173.360804] ---[ end trace a7919e7f17c0a725 ]---
>
> full report can be found at:
>
> http://lkml.org/lkml/2008/6/13/224
>
> i have 3 other test-systems with e1000 (with a similar CPU) which are
> _not_ showing this symptom, so this could be some model-specific e1000
> issue.

btw., this reminds me that this is the same system that has a serious
e1000 network latency bug which i have reported more than a year ago,
but which still does not appear to be fixed in latest mainline:

PING europe (10.0.1.15) 56(84) bytes of data.
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=1.51 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=404 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=487 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=296 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=305 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=1011 ms
64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.209 ms
64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=763 ms
64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.438 ms
64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.299 ms
^C
--- europe ping statistics ---
12 packets transmitted, 12 received, 0% packet loss, time 11085ms

those up to 1000 msec delays can be 'felt' via ssh too, if this problem
triggers then the system is almost unusable via the network. Local
latencies are perfect so it's an e1000 problem.
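
As an illustration only (not part of the original message), the ping output above can be summarized with a small Python sketch. The 100 ms "slow" threshold and the helper names `parse_rtts` and `summarize` are assumptions chosen for this example; the sample data is copied from the output above.

```python
import re
from statistics import median

# Round-trip lines copied from the ping run quoted above.
PING_OUTPUT = """\
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=1.51 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=404 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=487 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=296 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=305 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=1011 ms
64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.209 ms
64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=763 ms
64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.438 ms
64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.299 ms
"""


def parse_rtts(text):
    """Return every round-trip time (in ms) found in ping output."""
    return [float(value) for value in re.findall(r"time=([\d.]+) ms", text)]


def summarize(rtts, slow_ms=100.0):
    """Count replies at or above slow_ms (an arbitrary example threshold)."""
    slow = [t for t in rtts if t >= slow_ms]
    return {
        "count": len(rtts),
        "slow": len(slow),
        "median_ms": median(rtts),
        "max_ms": max(rtts),
    }


if __name__ == "__main__":
    print(summarize(parse_rtts(PING_OUTPUT)))
```

On a healthy LAN link one would expect every reply to stay well below such a threshold, as in the 2.6.18 run quoted later in the thread.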

Ingo

2008-06-17 08:39:20

by Vitaliy Gusev

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

On 17 June 2008 12:09:58 Ingo Molnar wrote:
> * David Miller <[email protected]> wrote:
> > From: Ingo Molnar <[email protected]>
> > Date: Tue, 17 Jun 2008 09:26:58 +0200
> >
> > > So since there's no clear bug pattern and no sure reproducibility on
> > > my side i'd suggest we track this problem separately and "do
> > > nothing" right now. I've excluded this warning from my 'is the
> > > freshly booted kernel buggy' list of conditions of -tip testing so
> > > it's not holding me up.
> >
> > I'm going to push the revert through just to be safe and I think it's
> > a good idea to do so because all of those defer accept changes should
> > be resubmitted as a group for 2.6.27
>
> okay - in that case the full revert is well-tested on my side as well,
> fwiw.
>
> Tested-by: Ingo Molnar <[email protected]>

The revert patch fixes the socket leak problem.
Tested-by: Vitaliy Gusev <[email protected]>

>
> > > and i can apply any test-patch if that would be helpful - if it does
> > > a WARN_ON() i'll notice it. (pure extra debug printks with no stack
> > > trace are much harder to notice in automated tests)
> >
> > I don't have time to work on your bug, sorry. Someone else will have
> > to step forward and help you with it.
>
> it's not really "my bug" - i just offered help to debug someone else's
> bug :-) This is pretty common hw so i guess there will be such reports.
>
> Let me describe what i'm doing exactly: i do a lot of randomized testing
> on about a dozen real systems (all across the x86 spectrum) so i tend to
> trigger a lot of mainline bugs pretty early on.
>
> My collection of kernel bugs for the last 8 months shows 1285 bugs
> (kernel crashes or build failures - about 50%/50%) triggered. One
> test-system alone has a serial log of 15 gigabytes - and there's a dozen
> of them. That's about 5 kernel bugs a day handled by me, on average.
>
> These systems have about 10 times the hardware variability of your
> Niagara system for example, and many of them are rather difficult to
> debug (laptops without serial port, etc.). So i physically cannot avoid
> and debug all bugs on all my test-systems, like you do on the Niagara. I
> will report bugs, i'll bisect anything that is bisectable (on average i
> bisect once a day), and i can add patches and report any test-results,
> and i'll of course debug any bugs that look like heavy mainline
> showstoppers.
>
> > FWIW I don't think your TX timeout problem has anything to do with
> > packet ordering. The TX element of the network device is totally
> > stateless, but it's hanging under some set of circumstances to the
> > point where we timeout and reset the hardware to get it going again.
>
> ok. That's e1000 then. Cc:s added. Stock T60 laptop, 32-bit:
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> Subsystem: Lenovo ThinkPad T60
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
> I/O ports at 2000 [size=32]
> Capabilities: <access denied>
> Kernel driver in use: e1000
>
> the problem is this non-fatal warning showing up after bootup,
> sporadically, in a non-reproducible way:
>
> [ 173.354049] NETDEV WATCHDOG: eth0: transmit timed out
> [ 173.354148] ------------[ cut here ]------------
> [ 173.354221] WARNING: at net/sched/sch_generic.c:222 dev_watchdog+0x9a/0xec()
> [ 173.354298] Modules linked in:
> [ 173.354421] Pid: 13452, comm: cc1 Tainted: G W 2.6.26-rc6-00273-g81ae43a-dirty #2573
> [ 173.354516] [<c01250ca>] warn_on_slowpath+0x46/0x76
> [ 173.354641] [<c011d428>] ? try_to_wake_up+0x1d6/0x1e0
> [ 173.354815] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [<c011d43d>] ? default_wake_function+0xb/0xd
> [ 173.357370] [<c014112a>] ? trace_hardirqs_off_caller+0x15/0xc9
> [ 173.357370] [<c01411e9>] ? trace_hardirqs_off+0xb/0xd
> [ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [<c0142b33>] ? trace_hardirqs_on_caller+0x16/0x15b
> [ 173.357370] [<c0142c83>] ? trace_hardirqs_on+0xb/0xd
> [ 173.357370] [<c06bb3c9>] ? _spin_unlock_irqrestore+0x5b/0x71
> [ 173.357370] [<c0133d46>] ? __queue_work+0x2d/0x32
> [ 173.357370] [<c0134023>] ? queue_work+0x50/0x72
> [ 173.357483] [<c0134059>] ? schedule_work+0x14/0x16
> [ 173.357654] [<c05c59b8>] dev_watchdog+0x9a/0xec
> [ 173.357783] [<c012d456>] run_timer_softirq+0x13d/0x19d
> [ 173.357905] [<c05c591e>] ? dev_watchdog+0x0/0xec
> [ 173.358073] [<c05c591e>] ? dev_watchdog+0x0/0xec
> [ 173.360804] [<c0129ad7>] __do_softirq+0xb2/0x15c
> [ 173.360804] [<c0129a25>] ? __do_softirq+0x0/0x15c
> [ 173.360804] [<c0105526>] do_softirq+0x84/0xe9
> [ 173.360804] [<c0129996>] irq_exit+0x4b/0x88
> [ 173.360804] [<c010ec7a>] smp_apic_timer_interrupt+0x73/0x81
> [ 173.360804] [<c0103ddd>] apic_timer_interrupt+0x2d/0x34
> [ 173.360804] =======================
> [ 173.360804] ---[ end trace a7919e7f17c0a725 ]---
>
> full report can be found at:
>
> http://lkml.org/lkml/2008/6/13/224
>
> i have 3 other test-systems with e1000 (with a similar CPU) which are
> _not_ showing this symptom, so this could be some model-specific e1000
> issue.
>
> Ingo

--
Thanks,
Vitaliy Gusev

2008-06-17 09:08:51

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Ingo Molnar <[email protected]>
Date: Tue, 17 Jun 2008 10:32:20 +0200

> those up to 1000 msec delays can be 'felt' via ssh too, if this problem
> triggers then the system is almost unusable via the network. Local
> latencies are perfect so it's an e1000 problem.

Or some kind of weird interrupt problem.

Such an interrupt-level bug would also account for the TX timeouts
you're seeing btw.

2008-06-17 09:27:52

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 17 Jun 2008 10:32:20 +0200
>
> > those up to 1000 msec delays can be 'felt' via ssh too, if this
> > problem triggers then the system is almost unusable via the network.
> > Local latencies are perfect so it's an e1000 problem.
>
> Or some kind of weird interrupt problem.
>
> Such an interrupt-level bug would also account for the TX timeouts
> you're seeing btw.

when i originally reported it i debugged it back to missing e1000 TX
completion IRQs. I tried various versions of the driver to figure out
whether new workarounds for e1000 cover it but it was fruitless. There
is a 1000 msec internal watchdog timer IRQ within e1000 that gets things
going if it's stuck.

But the line sch_generic.c:222 problem is new. It could be an
escalation of this same problem - not even the hw-internal watchdog
timeout fixing up things? So basically two levels of completion failed,
the third fallback level (a hard reset of the interface) helped things
get going. High score from me for networking layer robustness :-)

Ingo

2008-06-17 09:29:21

by David Miller

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: Ingo Molnar <[email protected]>
Date: Tue, 17 Jun 2008 11:27:06 +0200

> when i originally reported it i debugged it back to missing e1000 TX
> completion IRQs. I tried various versions of the driver to figure out
> whether new workarounds for e1000 cover it but it was fruitless. There
> is a 1000 msec internal watchdog timer IRQ within e1000 that gets things
> going if it's stuck.

Then that explains your latency, the chip is getting stuck and
TX interrupts stop, right.

> But the line sch_generic.c:222 problem is new. It could be an
> escalation of this same problem - not even the hw-internal watchdog
> timeout fixing up things? So basically two levels of completion failed,
> the third fallback level (a hard reset of the interface) helped things
> get going. High score from me for networking layer robustness :-)

I think it is an escalation of the same problem. My first thought
is that there must have been some change to the reset logic and it
isn't as foolproof as it used to be, especially under load.

2008-06-17 09:40:24

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* David Miller <[email protected]> wrote:

> From: Ingo Molnar <[email protected]>
> Date: Tue, 17 Jun 2008 11:27:06 +0200
>
> > when i originally reported it i debugged it back to missing e1000 TX
> > completion IRQs. I tried various versions of the driver to figure
> > out whether new workarounds for e1000 cover it but it was fruitless.
> > There is a 1000 msec internal watchdog timer IRQ within e1000 that
> > gets things going if it's stuck.
>
> Then that explains your latency, the chip is getting stuck and TX
> interrupts stop, right.

note that the 1000 msecs timer is AFAIK internal to the e1000
_hardware_, not the driver itself. I.e. probably the firmware detects
and works around a hung transmitter. This is not detectable from the OS
(it's not an OS timer), but it can be observed by a lot of testing on a
totally quiescent system - which i did back then ;-)

i also played a lot with the various knobs of the e1000, none of which
seemed to help.

/me digs in archives

i reported it to the e1000 folks in 2006:

Date: Mon, 4 Dec 2006 11:24:00 +0100

against 2.6.19. The original report is below - with a trace and various
things i tried to debug this.

i eventually got the suggestion from Auke to set RxIntDelay=8 which
seemed to work around the issue - but since i use a built-in driver i
don't have that setting here (RxIntDelay=8 is a module load parameter and
not exposed via Kconfig methods) and the e1000 driver does not seem to
have changed its default setting for RxIntDelay.

2.6.18-1.2849.fc6 was the last kernel that worked fine.

Ingo

-------------------->
Date: Wed, 13 Dec 2006 22:09:22 +0100
From: Ingo Molnar <[email protected]>
To: Auke Kok <[email protected]>
Subject: Re: e1000: 2.6.19 & long packet latencies
Cc: Jesse Brandeburg <[email protected]>,
"Ronciak, John" <[email protected]>

Jesse, et al.,

i'm having a weird packet processing latency problem with the e1000
driver and recent kernels.

The symptom is this: if i connect to a T60 laptop (which has an on-board
e1000) from the outside, i see large delays in network activity, and ssh
sessions are very sluggish.

ping latencies show it best under a dynticks kernel (but vanilla 2.6.19
is affected too):

titan:~/linux/linux> ping e
PING europe (10.0.1.15) 56(84) bytes of data.
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.340 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=757 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=1001 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=1001 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.356 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=2127 ms
64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=1002 ms
64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=0.320 ms
64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1002 ms
64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=2004 ms
64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1002 ms
64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.303 ms
64 bytes from europe (10.0.1.15): icmp_seq=13 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=14 ttl=64 time=2010 ms
64 bytes from europe (10.0.1.15): icmp_seq=15 ttl=64 time=1009 ms
64 bytes from europe (10.0.1.15): icmp_seq=16 ttl=64 time=0.283 ms

i have traced this and the 1000/2000 msecs values come from some sort of
e1000-internal 'heartbeat' interrupt. What seems to happen is that RX
packet processing is delayed indefinitely and the IRQ just does not
arrive.

NOTE: the vanilla 2.6.19 kernel shows this too, but the ping delays are
1/HZ.

here's a (filtered) trace of such a delay. IRQ 0x219 is the e1000
interrupt:

<idle>-0 0D.h1 761236us : do_IRQ (c0272a9b 219 0)
IRQ_219-356 0.... 761412us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 761416us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 761418us+: e1000_clean_tx_irq (e1000_intr)
<idle>-0 0D.h1 2760093us : do_IRQ (c0272a9b 219 0)
IRQ_219-356 0.... 2760268us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 2760273us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 2760275us : e1000_clean_tx_irq (e1000_intr)
<idle>-0 0D.h1 3804499us : do_IRQ (c0272a9b 219 0)
IRQ_219-356 0.... 3804674us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 3804679us+: e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 3804761us : e1000_clean_tx_irq (e1000_intr)
IRQ_219-356 0.... 3804763us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 3804765us : e1000_clean_tx_irq (e1000_intr)
softirq--7 0.... 3804810us : net_rx_action (ksoftirqd)
softirq--5 0D.h. 3805425us : do_IRQ (c01598ac 219 0)
IRQ_219-356 0.... 3805499us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 3805504us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 3805506us : e1000_clean_tx_irq (e1000_intr)
IRQ_219-356 0.... 3805547us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 3805549us : e1000_clean_tx_irq (e1000_intr)
softirq--6 0.... 3805641us : net_tx_action (ksoftirqd)
<idle>-0 0D.h1 4760910us : do_IRQ (c01451d4 219 0)
IRQ_219-356 0.... 4761347us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 4761352us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 4761353us : e1000_clean_tx_irq (e1000_intr)
<idle>-0 0D.h1 6761309us : do_IRQ (c0272a9b 219 0)
IRQ_219-356 0.... 6761483us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 6761488us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 6761490us : e1000_clean_tx_irq (e1000_intr)
softirq--5 0D.h. 8760595us : do_IRQ (c0135dc4 219 0)
IRQ_219-356 0.... 8760676us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 8760681us+: e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 8760739us : e1000_clean_tx_irq (e1000_intr)
IRQ_219-356 0.... 8760740us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 8760742us : e1000_clean_tx_irq (e1000_intr)
softirq--7 0.... 8760885us : net_rx_action (ksoftirqd)
softirq--7 0.... 8760914us+: icmp_rcv (ip_local_deliver)
softirq--7 0.... 8760923us+: icmp_reply (icmp_echo)
<idle>-0 0D.h1 8761661us : do_IRQ (c0272a9b 219 0)
IRQ_219-356 0.... 8761833us+: e1000_intr (handle_IRQ_event)
IRQ_219-356 0.... 8761838us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 8761840us : e1000_clean_tx_irq (e1000_intr)
IRQ_219-356 0.... 8761875us : e1000_clean_rx_irq (e1000_intr)
IRQ_219-356 0.... 8761876us : e1000_clean_tx_irq (e1000_intr)
softirq--6 0.... 8761921us : net_tx_action (ksoftirqd)

note that timestamps 2760093us, 4760910us, 6761309us and 8760595us are
some sort of traffic-independent 'periodic' interrupt that e1000
generates. That 'housekeeping' interrupt doesn't seem to be doing much.
The IRQ at 8760595us picks up an icmp packet and replies to it - but the
icmp packet in reality arrived somewhere between timestamps 6761309us
and 8760595us - but no IRQ was generated for it!
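
As an illustration only (not part of the original report), the spacing of those interrupts can be checked with a short Python sketch. The `do_IRQ` lines are copied from the filtered trace above; the helper names `irq_times_us` and `gaps_us` are hypothetical, and the regular expression only assumes the `<timestamp>us : do_IRQ` layout visible in this trace.

```python
import re

# do_IRQ entries copied from the filtered trace above (timestamps in microseconds).
TRACE = """\
<idle>-0 0D.h1 761236us : do_IRQ (c0272a9b 219 0)
<idle>-0 0D.h1 2760093us : do_IRQ (c0272a9b 219 0)
<idle>-0 0D.h1 3804499us : do_IRQ (c0272a9b 219 0)
softirq--5 0D.h. 3805425us : do_IRQ (c01598ac 219 0)
<idle>-0 0D.h1 4760910us : do_IRQ (c01451d4 219 0)
<idle>-0 0D.h1 6761309us : do_IRQ (c0272a9b 219 0)
softirq--5 0D.h. 8760595us : do_IRQ (c0135dc4 219 0)
<idle>-0 0D.h1 8761661us : do_IRQ (c0272a9b 219 0)
"""

# The four timestamps the report singles out as the periodic interrupt.
PERIODIC_US = [2760093, 4760910, 6761309, 8760595]


def irq_times_us(trace):
    """Extract the timestamp of every do_IRQ entry, in microseconds."""
    return [int(t) for t in re.findall(r"(\d+)us\+?\s*: do_IRQ", trace)]


def gaps_us(times):
    """Return the distance between each pair of consecutive timestamps."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print("all IRQ gaps (us):     ", gaps_us(irq_times_us(TRACE)))
    print("periodic IRQ gaps (us):", gaps_us(PERIODIC_US))
```

The gaps between the four named timestamps all come out within about a millisecond of 2 seconds, which matches the 1000/2000 msec ping delays reported earlier in the message.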

Suspecting the interrupt-rate controlling bits of the e1000 hw i have
tried the following tunes too:

-#define DEFAULT_RDTR 0
+#define DEFAULT_RDTR 1

-#define DEFAULT_RADV 128
+#define DEFAULT_RADV 1

-#define DEFAULT_TIDV 64
+#define DEFAULT_TIDV 1

-#define DEFAULT_TADV 64
+#define DEFAULT_TADV 1

-#define DEFAULT_ITR 8000
+#define DEFAULT_ITR 100000

but they made no difference.

a 2.6.18-ish kernel works fine (2.6.18-1.2849.fc6):

titan:~/linux/linux> ping e
PING europe (10.0.1.15) 56(84) bytes of data.
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.695 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.171 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.184 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.159 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.148 ms

e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:41:17:49:d2
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection

the precise hardware version is:

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Lenovo ThinkPad T60
Flags: bus master, fast devsel, latency 0, IRQ 90
Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
I/O ports at 2000 [size=32]
Capabilities: <access denied>

this laptop has a CoreDuo so i have tried maxcpus=1 too, but it didn't
make any difference.

Any ideas about what i should try next?

Ingo

2008-06-18 18:58:20

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

Ingo Molnar wrote:
> * David Miller <[email protected]> wrote:
>
>> From: Ingo Molnar <[email protected]>
>> Date: Tue, 17 Jun 2008 11:27:06 +0200
>>
>>> when i originally reported it i debugged it back to missing e1000 TX
>>> completion IRQs. I tried various versions of the driver to figure
>>> out whether new workarounds for e1000 cover it but it was fruitless.
>>> There is a 1000 msec internal watchdog timer IRQ within e1000 that
>>> gets things going if it's stuck.
>> Then that explains your latency, the chip is getting stuck and TX
>> interrupts stop, right.
>
> note that the 1000 msecs timer is AFAIK internal to the e1000
> _hardware_, not the driver itself. I.e. probably the firmware detects
> and works around a hung transmitter. This is not detectable from the OS
> (it's not an OS timer), but it can be observed by a lot of testing on a
> totally quiescent system - which i did back then ;-)
>
> i also played a lot with the various knobs of the e1000, none of which
> seemed to help.
>
> /me digs in archives
>
> i reported it to the e1000 folks in 2006:
>
> Date: Mon, 4 Dec 2006 11:24:00 +0100
>
> against 2.6.19. The original report is below - with a trace and various
> things i tried to debug this.
>
> i eventually got the suggestion from Auke to set RxIntDelay=8 which
> seemed to work around the issue - but since i use a built-in driver i
> don't have that setting here (RxIntDelay=8 is a module load parameter and
> not exposed via Kconfig methods) and the e1000 driver does not seem to
> have changed its default setting for RxIntDelay.
>
> 2.6.18-1.2849.fc6 was the last kernel that worked fine.
>
> Ingo
>
> -------------------->
> Date: Wed, 13 Dec 2006 22:09:22 +0100
> From: Ingo Molnar <[email protected]>
> To: Auke Kok <[email protected]>
> Subject: Re: e1000: 2.6.19 & long packet latencies
> Cc: Jesse Brandeburg <[email protected]>,
> "Ronciak, John" <[email protected]>
>
> Jesse, et al.,
>
> i'm having a weird packet processing latency problem with the e1000
> driver and recent kernels.
>
> The symptom is this: if i connect to a T60 laptop (which has an on-board
> e1000) from the outside, i see large delays in network activity, and ssh
> sessions are very sluggish.
>
> ping latencies show it best under a dynticks kernel (but vanilla 2.6.19
> is affected too):
>
> titan:~/linux/linux> ping e
> PING europe (10.0.1.15) 56(84) bytes of data.
> 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.340 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=757 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=1001 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=1001 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.356 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=2127 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=1002 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=0.320 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=1002 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=2004 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=1002 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=0.303 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=13 ttl=64 time=1000 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=14 ttl=64 time=2010 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=15 ttl=64 time=1009 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=16 ttl=64 time=0.283 ms
>
> i have traced this and the 1000/2000 msecs values come from some sort of
> e1000-internal 'heartbeat' interrupt. What seems to happen is that RX
> packet processing is delayed indefinitely and the IRQ just does not
> arrive.
>
> NOTE: the vanilla 2.6.19 kernel shows this too, but the ping delays are
> 1/HZ.
>
> here's a (filtered) trace of such a delay. IRQ 0x219 is the e1000
> interrupt:
>
> <idle>-0 0D.h1 761236us : do_IRQ (c0272a9b 219 0)
> IRQ_219-356 0.... 761412us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 761416us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 761418us+: e1000_clean_tx_irq (e1000_intr)
> <idle>-0 0D.h1 2760093us : do_IRQ (c0272a9b 219 0)
> IRQ_219-356 0.... 2760268us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 2760273us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 2760275us : e1000_clean_tx_irq (e1000_intr)
> <idle>-0 0D.h1 3804499us : do_IRQ (c0272a9b 219 0)
> IRQ_219-356 0.... 3804674us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 3804679us+: e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 3804761us : e1000_clean_tx_irq (e1000_intr)
> IRQ_219-356 0.... 3804763us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 3804765us : e1000_clean_tx_irq (e1000_intr)
> softirq--7 0.... 3804810us : net_rx_action (ksoftirqd)
> softirq--5 0D.h. 3805425us : do_IRQ (c01598ac 219 0)
> IRQ_219-356 0.... 3805499us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 3805504us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 3805506us : e1000_clean_tx_irq (e1000_intr)
> IRQ_219-356 0.... 3805547us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 3805549us : e1000_clean_tx_irq (e1000_intr)
> softirq--6 0.... 3805641us : net_tx_action (ksoftirqd)
> <idle>-0 0D.h1 4760910us : do_IRQ (c01451d4 219 0)
> IRQ_219-356 0.... 4761347us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 4761352us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 4761353us : e1000_clean_tx_irq (e1000_intr)
> <idle>-0 0D.h1 6761309us : do_IRQ (c0272a9b 219 0)
> IRQ_219-356 0.... 6761483us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 6761488us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 6761490us : e1000_clean_tx_irq (e1000_intr)
> softirq--5 0D.h. 8760595us : do_IRQ (c0135dc4 219 0)
> IRQ_219-356 0.... 8760676us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 8760681us+: e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 8760739us : e1000_clean_tx_irq (e1000_intr)
> IRQ_219-356 0.... 8760740us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 8760742us : e1000_clean_tx_irq (e1000_intr)
> softirq--7 0.... 8760885us : net_rx_action (ksoftirqd)
> softirq--7 0.... 8760914us+: icmp_rcv (ip_local_deliver)
> softirq--7 0.... 8760923us+: icmp_reply (icmp_echo)
> <idle>-0 0D.h1 8761661us : do_IRQ (c0272a9b 219 0)
> IRQ_219-356 0.... 8761833us+: e1000_intr (handle_IRQ_event)
> IRQ_219-356 0.... 8761838us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 8761840us : e1000_clean_tx_irq (e1000_intr)
> IRQ_219-356 0.... 8761875us : e1000_clean_rx_irq (e1000_intr)
> IRQ_219-356 0.... 8761876us : e1000_clean_tx_irq (e1000_intr)
> softirq--6 0.... 8761921us : net_tx_action (ksoftirqd)
>
> note that timestamps 2760093us, 4760910us, 6761309us and 8760595us are
> some sort of traffic-independent 'periodic' interrupt that e1000
> generates. That 'housekeeping' interrupt doesn't seem to be doing much.
> The IRQ at 8760595us picks up an icmp packet and replies to it - but the
> icmp packet in reality arrived somewhere between timestamps 6761309us
> and 8760595us - but no IRQ was generated for it!
>
> Suspecting the interrupt-rate controlling bits of the e1000 hw i have
> tried the following tunes too:
>
> -#define DEFAULT_RDTR 0
> +#define DEFAULT_RDTR 1
>
> -#define DEFAULT_RADV 128
> +#define DEFAULT_RADV 1
>
> -#define DEFAULT_TIDV 64
> +#define DEFAULT_TIDV 1
>
> -#define DEFAULT_TADV 64
> +#define DEFAULT_TADV 1
>
> -#define DEFAULT_ITR 8000
> +#define DEFAULT_ITR 100000
>
> but they made no difference.
>
> a 2.6.18-ish kernel works fine (2.6.18-1.2849.fc6):
>
> titan:~/linux/linux> ping e
> PING europe (10.0.1.15) 56(84) bytes of data.
> 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.695 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.171 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.184 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.159 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.148 ms
>
> e1000: 0000:02:00.0: e1000_probe: (PCI Express:2.5Gb/s:Width x1) 00:16:41:17:49:d2
> e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
>
> the precise hardware version is:
>
> 02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
> Subsystem: Lenovo ThinkPad T60
> Flags: bus master, fast devsel, latency 0, IRQ 90
> Memory at ee000000 (32-bit, non-prefetchable) [size=128K]
> I/O ports at 2000 [size=32]
> Capabilities: <access denied>
>
> this laptop has a CoreDuo so i have tried maxcpus=1 too, but it didn't
> make any difference.
>
> Any ideas about what i should try next?
>

have you tried e1000e?

2008-06-18 20:09:09

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Kok, Auke <[email protected]> wrote:

> > Any ideas about what i should try next?
>
> have you tried e1000e?

will try it.

But even if it solves the problem it's a nasty complication: given how
many times i have to bisect back into the times when there was only
e1000 around, how do i handle the transition? I have automated bisection
tools, etc. and i bisect very frequently.

It's a real practical problem for me: if i have E1000E=y in my .config
and go back to an older kernel, i lose that .config setting in 'make
oldconfig'. Then when the bisection run happens to go back into the
E1000E times, 'make oldconfig' picks up E1000E with a default-off
setting - and things break or work differently.

no other Linux driver i'm using forces me to do that and i rely on many
of them and i rely on proper 'make oldconfig' behavior on a daily basis.
Until now i was able to do automatic bisection back for _years_, to the
v2.6.19 times. You broke that.

And that's just one driver out of thousands of Linux drivers. Imagine
what would happen to bisectability and migration quality if every driver
version update were this careless about its installed base as
e1000/e1000e.

The e1000 -> e1000e migration was not only done in an incompetent,
amateurish way, you also ignored real feedback, and that combination
is totally lame and unacceptable behavior in my book. You
should not expect praise and roses from me as long as you do stupid
things like that.

Ingo

2008-06-18 21:32:54

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

Ingo Molnar wrote:
> * Kok, Auke <[email protected]> wrote:
>
>>> Any ideas about what i should try next?
>> have you tried e1000e?
>
> will try it.
>
> But even if it solves the problem it's a nasty complication: given how
> many times i have to bisect back into the times when there was only
> e1000 around, how do i handle the transition? I have automated bisection
> tools, etc. and i bisect very frequently.
>
> It's a real practical problem for me: if i have E1000E=y in my .config
> and go back to an older kernel, i lose that .config setting in 'make
> oldconfig'. Then when the bisection run happens to go back into the
> E1000E times, 'make oldconfig' picks up E1000E with a default-off
> setting - and things break or work differently.
>
> no other Linux driver i'm using forces me to do that and i rely on many
> of them and i rely on proper 'make oldconfig' behavior on a daily basis.
> Until now i was able to do automatic bisection back for _years_, to the
> v2.6.19 times. You broke that.
>
> And that's just one driver out of thousands of Linux drivers. Imagine
> what would happen to bisectability and migration quality if every driver
> version update were this careless about its installed base as
> e1000/e1000e.
>
> The e1000 -> e1000e migration was not only done in an incompetent,
> amateurish way, you also ignored real feedback, and that combination
> is totally lame and unacceptable behavior in my book. You
> should not expect praise and roses from me as long as you do stupid
> things like that.

where were you when we discussed this? We took over a year and a half to get to a
final plan and many people responded and provided feedback. In the end Jeff Garzik
and many community members suggested a plan and this is what I implemented. In not
a single way did I force anything down anyone's throat. I did exactly what the
community wanted me to do, and in the way that seemed best to everyone.

You only complain and do not provide a single solution to your problem. Your
continued screaming and whining is neither productive nor constructive at all,
and frankly is insulting since you completely ignore the fact that we worked with
the community for more than two years to come to some maintainable situation. All
you do is complain. Direct your problems to the network stack and driver
maintainers since they approved and worked with me to implement the changes.

*** NOTE: I NO LONGER MAINTAIN E1000/E1000E, nor do I represent them or speak for
them. ***

I frankly suggested that you try e1000e because this might provide valuable
information for the people who are taking this ungrateful job after me. This was
meant in a productive and constructive way.

your flame is totally inappropriate and unprofessional. Either come up with a
solution or start working on one, like I did when I took the much hated job as
e1000 maintainer.

I am totally open to suggestions and if needed I will work with the current
e1000/e1000e maintainers on working something out if I see a better solution than
the current situation. Until I see such a thing I can't do much else than ignore
your childish whining.

Auke

2008-06-18 21:33:37

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Ingo Molnar <[email protected]> wrote:

> * Kok, Auke <[email protected]> wrote:
>
> > > Any ideas about what i should try next?
> >
> > have you tried e1000e?
>
> will try it.

ok, i tried it now, and there's good news: the latency problem seems
largely fixed by e1000e. (yay!)

with e1000 i got these anomalous latencies:

64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=1000 ms
64 bytes from europe (10.0.1.15): icmp_seq=11 ttl=64 time=0.882 ms
64 bytes from europe (10.0.1.15): icmp_seq=12 ttl=64 time=1007 ms
64 bytes from europe (10.0.1.15): icmp_seq=13 ttl=64 time=0.522 ms
64 bytes from europe (10.0.1.15): icmp_seq=14 ttl=64 time=1003 ms
64 bytes from europe (10.0.1.15): icmp_seq=15 ttl=64 time=0.381 ms
64 bytes from europe (10.0.1.15): icmp_seq=16 ttl=64 time=1010 ms

with e1000e i get:

64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.212 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.372 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.815 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.961 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.201 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.788 ms

TCP latencies are fine too - ssh feels snappy again.

it still does not have nearly as good latencies as say forcedeth though:

64 bytes from mercury (10.0.1.13): icmp_seq=1 ttl=64 time=0.076 ms
64 bytes from mercury (10.0.1.13): icmp_seq=2 ttl=64 time=0.085 ms
64 bytes from mercury (10.0.1.13): icmp_seq=3 ttl=64 time=0.045 ms
64 bytes from mercury (10.0.1.13): icmp_seq=4 ttl=64 time=0.053 ms

that's 10 times better packet latencies.
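
As a rough check of that ratio, here is a minimal sketch (not part of the original measurements) that averages the round-trip times pasted above. The helper name rtts_ms is made up for illustration; the samples are copied from this message.

```python
import re
from statistics import mean

# ping lines copied from the e1000e and forcedeth runs quoted above
E1000E = """\
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.212 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.372 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.815 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.961 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.201 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.788 ms
"""
FORCEDETH = """\
64 bytes from mercury (10.0.1.13): icmp_seq=1 ttl=64 time=0.076 ms
64 bytes from mercury (10.0.1.13): icmp_seq=2 ttl=64 time=0.085 ms
64 bytes from mercury (10.0.1.13): icmp_seq=3 ttl=64 time=0.045 ms
64 bytes from mercury (10.0.1.13): icmp_seq=4 ttl=64 time=0.053 ms
"""


def rtts_ms(ping_output):
    """Extract the 'time=... ms' round-trip values from ping output."""
    return [float(t) for t in re.findall(r"time=([\d.]+) ms", ping_output)]


e1000e_mean = mean(rtts_ms(E1000E))
forcedeth_mean = mean(rtts_ms(FORCEDETH))
ratio = e1000e_mean / forcedeth_mean
print(f"e1000e {e1000e_mean:.3f} ms, forcedeth {forcedeth_mean:.3f} ms, ratio {ratio:.1f}x")
```

On these few samples the means come out at about 0.56 ms versus 0.065 ms, so the ratio is nearer 8.6 than 10; with so few pings that is only a ballpark figure.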

and even an ancient Realtek RTL-8139 over 10 megabit Ethernet (!) has
better latencies than the e1000e over 1000 megabit:

64 bytes from pluto (10.0.1.10): icmp_seq=2 ttl=64 time=0.309 ms
64 bytes from pluto (10.0.1.10): icmp_seq=3 ttl=64 time=0.333 ms
64 bytes from pluto (10.0.1.10): icmp_seq=4 ttl=64 time=0.329 ms
64 bytes from pluto (10.0.1.10): icmp_seq=5 ttl=64 time=0.311 ms
64 bytes from pluto (10.0.1.10): icmp_seq=6 ttl=64 time=0.302 ms

is it done intentionally perhaps? I dont think it makes much sense to
delay rx/tx processing on a completely idle box for such a long time.

The options i used are:

CONFIG_E1000=y
CONFIG_E1000_NAPI=y
# CONFIG_E1000_DISABLE_PACKET_SPLIT is not set
CONFIG_E1000E=y
CONFIG_E1000E_ENABLED=y

> But even if it solves the problem it's a nasty complication: given how
> many times i have to bisect back into the times when there was only
> e1000 around, how do i handle the transition? I have automated
> bisection tools, etc. and i bisect very frequently.

one possibility would be to change 'make oldconfig' to keep old options
around - as long as they look "unknown" to a particular kernel. It would
list them in some special "unknown options" section near the end of the
.config or so. That way the E1000E=y setting could survive a bisection
run which dives down into older kernel versions. (obviously old kernels
wont grow this capability magically, so if we do such a change we'll
have to wait years for it all to trickle through.)

and eventually E1000E could become the default.
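
Until 'make oldconfig' itself can do that, the same effect could be approximated from outside by a bisection wrapper. The following is only a sketch under assumptions: the function name restore_pinned, the "unknown options" comment text and the idea of re-applying a pinned list after every 'make oldconfig' are all hypothetical, not an existing kbuild feature.

```python
import re


def restore_pinned(config_text, pinned):
    """Return config_text with every pinned option forced to its pinned value.

    pinned maps option names to values, e.g. {"CONFIG_E1000E": "y"}.
    Options the current .config mentions are rewritten in place; options it
    does not mention at all are appended in a trailing section.
    """
    seen = set()
    out = []
    for line in config_text.splitlines():
        # matches both "CONFIG_FOO=y" and "# CONFIG_FOO is not set"
        m = re.match(r"(?:# )?(CONFIG_\w+)(?:=| is not set)", line)
        if m and m.group(1) in pinned:
            name = m.group(1)
            out.append(f"{name}={pinned[name]}")
            seen.add(name)
        else:
            out.append(line)
    missing = [name for name in pinned if name not in seen]
    if missing:
        out.append("# unknown options (kept across oldconfig)")
        out.extend(f"{name}={pinned[name]}" for name in missing)
    return "\n".join(out) + "\n"


# an old kernel's oldconfig dropped E1000E entirely
old_kernel = restore_pinned("CONFIG_E1000=y\n", {"CONFIG_E1000E": "y"})
# a newer kernel's oldconfig re-added it with the default-off setting
new_kernel = restore_pinned(
    "CONFIG_E1000=y\n# CONFIG_E1000E is not set\n", {"CONFIG_E1000E": "y"}
)
print(old_kernel, new_kernel, sep="---\n")
```

A wrapper like this would have to be run again after each 'make oldconfig', since an old kernel's oldconfig would presumably still strip the appended lines.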

Ingo

2008-06-18 21:41:59

by Denys Fedoryschenko

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

> * Ingo Molnar <[email protected]> wrote:
> with e1000e i get:
>
> 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.212 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.372 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.815 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.961 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.201 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.788 ms
>
> TCP latencies are fine too - ssh feels snappy again.
>
> it still does not have nearly as good latencies as say forcedeth though:
>
> 64 bytes from mercury (10.0.1.13): icmp_seq=1 ttl=64 time=0.076 ms
> 64 bytes from mercury (10.0.1.13): icmp_seq=2 ttl=64 time=0.085 ms
> 64 bytes from mercury (10.0.1.13): icmp_seq=3 ttl=64 time=0.045 ms
> 64 bytes from mercury (10.0.1.13): icmp_seq=4 ttl=64 time=0.053 ms
>
> that's 10 times better packet latencies.
>
> and even an ancient Realtek RTL-8139 over 10 megabit Ethernet (!) has
> better latencies than the e1000e over 1000 megabit:
>
> 64 bytes from pluto (10.0.1.10): icmp_seq=2 ttl=64 time=0.309 ms
> 64 bytes from pluto (10.0.1.10): icmp_seq=3 ttl=64 time=0.333 ms
> 64 bytes from pluto (10.0.1.10): icmp_seq=4 ttl=64 time=0.329 ms
> 64 bytes from pluto (10.0.1.10): icmp_seq=5 ttl=64 time=0.311 ms
> 64 bytes from pluto (10.0.1.10): icmp_seq=6 ttl=64 time=0.302 ms
>
> is it done intentionally perhaps? I dont think it makes much sense to
> delay rx/tx processing on a completely idle box for such a long time.
Idle box, ICH8 chipset, e1000e, latest git.

MegaRouterCore-KARAM ~ # ping 192.168.20.26
PING 192.168.20.26 (192.168.20.26) 56(84) bytes of data.
64 bytes from 192.168.20.26: icmp_seq=1 ttl=64 time=0.109 ms
64 bytes from 192.168.20.26: icmp_seq=2 ttl=64 time=0.134 ms
64 bytes from 192.168.20.26: icmp_seq=3 ttl=64 time=0.120 ms
64 bytes from 192.168.20.26: icmp_seq=4 ttl=64 time=0.117 ms
64 bytes from 192.168.20.26: icmp_seq=5 ttl=64 time=0.117 ms
64 bytes from 192.168.20.26: icmp_seq=6 ttl=64 time=0.113 ms

Disabling interrupt moderation
MegaRouterCore-KARAM ~ # ethtool -C eth0 rx-usecs 0
MegaRouterCore-KARAM ~ # ping 192.168.20.26
PING 192.168.20.26 (192.168.20.26) 56(84) bytes of data.
64 bytes from 192.168.20.26: icmp_seq=1 ttl=64 time=0.072 ms
64 bytes from 192.168.20.26: icmp_seq=2 ttl=64 time=0.091 ms
64 bytes from 192.168.20.26: icmp_seq=3 ttl=64 time=0.066 ms
64 bytes from 192.168.20.26: icmp_seq=4 ttl=64 time=0.065 ms
64 bytes from 192.168.20.26: icmp_seq=5 ttl=64 time=0.077 ms
64 bytes from 192.168.20.26: icmp_seq=6 ttl=64 time=0.073 ms

Maybe try the same?
ethtool -C eth0 rx-usecs 0

--
------
Technical Manager
Virtual ISP S.A.L.
Lebanon

2008-06-18 22:06:36

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Denys Fedoryshchenko <[email protected]> wrote:

> > * Ingo Molnar <[email protected]> wrote:
> > with e1000e i get:
> >
> > 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.212 ms
> > 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.372 ms
> > 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.815 ms
> > 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.961 ms
> > 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.201 ms
> > 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.788 ms
> >
> > TCP latencies are fine too - ssh feels snappy again.
> >
> > it still does not have nearly as good latencies as say forcedeth though:
> >
> > 64 bytes from mercury (10.0.1.13): icmp_seq=1 ttl=64 time=0.076 ms
> > 64 bytes from mercury (10.0.1.13): icmp_seq=2 ttl=64 time=0.085 ms
> > 64 bytes from mercury (10.0.1.13): icmp_seq=3 ttl=64 time=0.045 ms
> > 64 bytes from mercury (10.0.1.13): icmp_seq=4 ttl=64 time=0.053 ms
> >
> > that's 10 times better packet latencies.
> >
> > and even an ancient Realtek RTL-8139 over 10 megabit Ethernet (!) has
> > better latencies than the e1000e over 1000 megabit:
> >
> > 64 bytes from pluto (10.0.1.10): icmp_seq=2 ttl=64 time=0.309 ms
> > 64 bytes from pluto (10.0.1.10): icmp_seq=3 ttl=64 time=0.333 ms
> > 64 bytes from pluto (10.0.1.10): icmp_seq=4 ttl=64 time=0.329 ms
> > 64 bytes from pluto (10.0.1.10): icmp_seq=5 ttl=64 time=0.311 ms
> > 64 bytes from pluto (10.0.1.10): icmp_seq=6 ttl=64 time=0.302 ms
> >
> > is it done intentionally perhaps? I dont think it makes much sense to
> > delay rx/tx processing on a completely idle box for such a long time.
> Idle box, ICH8 chipset, e1000e, latest git.
>
> MegaRouterCore-KARAM ~ # ping 192.168.20.26
> PING 192.168.20.26 (192.168.20.26) 56(84) bytes of data.
> 64 bytes from 192.168.20.26: icmp_seq=1 ttl=64 time=0.109 ms
> 64 bytes from 192.168.20.26: icmp_seq=2 ttl=64 time=0.134 ms
> 64 bytes from 192.168.20.26: icmp_seq=3 ttl=64 time=0.120 ms
> 64 bytes from 192.168.20.26: icmp_seq=4 ttl=64 time=0.117 ms
> 64 bytes from 192.168.20.26: icmp_seq=5 ttl=64 time=0.117 ms
> 64 bytes from 192.168.20.26: icmp_seq=6 ttl=64 time=0.113 ms

ok, that looks much better! i have another box with e1000, ich7:

64 bytes from titan (10.0.1.14): icmp_seq=5 ttl=64 time=0.345 ms
64 bytes from titan (10.0.1.14): icmp_seq=6 ttl=64 time=1.03 ms
64 bytes from titan (10.0.1.14): icmp_seq=7 ttl=64 time=0.383 ms
64 bytes from titan (10.0.1.14): icmp_seq=8 ttl=64 time=0.320 ms
64 bytes from titan (10.0.1.14): icmp_seq=9 ttl=64 time=0.996 ms
64 bytes from titan (10.0.1.14): icmp_seq=10 ttl=64 time=0.248 ms

> Disabling interrupt moderation
> MegaRouterCore-KARAM ~ # ethtool -C eth0 rx-usecs 0
> MegaRouterCore-KARAM ~ # ping 192.168.20.26
> PING 192.168.20.26 (192.168.20.26) 56(84) bytes of data.
> 64 bytes from 192.168.20.26: icmp_seq=1 ttl=64 time=0.072 ms
> 64 bytes from 192.168.20.26: icmp_seq=2 ttl=64 time=0.091 ms
> 64 bytes from 192.168.20.26: icmp_seq=3 ttl=64 time=0.066 ms
> 64 bytes from 192.168.20.26: icmp_seq=4 ttl=64 time=0.065 ms
> 64 bytes from 192.168.20.26: icmp_seq=5 ttl=64 time=0.077 ms
> 64 bytes from 192.168.20.26: icmp_seq=6 ttl=64 time=0.073 ms
>
> Maybe try the same?
> ethtool -C eth0 rx-usecs 0

well i tend not to tweak my drivers with such options because i want to
experience and test what 99.9% of our users will experience in the
field. The reality is that if it's not the default behavior, it's almost
as if it didnt exist at all.

but even with that tune on e1000e (on the t60, ich7) i still get rather
large numbers:

earth4:~/s> ping eu
PING europe (10.0.1.15) 56(84) bytes of data.
64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.250 ms
64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.250 ms
64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.225 ms
64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.932 ms
64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.251 ms
64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.915 ms
64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.250 ms
64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=0.238 ms
64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=0.390 ms
64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.260 ms

Ingo

2008-06-18 22:12:36

by David Miller

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

From: "Kok, Auke" <[email protected]>
Date: Wed, 18 Jun 2008 14:25:28 -0700

> You only complain and do not provide a single solution to your
> problem. Your continued screaming and whining is totally not
> productive nor constructive at all, and frankly is insulting since
> you completely ignore the fact that we worked with the community
> for more than two years to come to some maintainable situation.

I completely and 100% agree with you.

> Until I see such a thing I can't do much else than ignore
> your childish whining.

Join the club.

I'm also ignoring everything he writes until he changes his modus
operandi to one that is more constructive than the pure hurtful
whining he is emitting as of late.

2008-06-18 22:44:44

by Denys Fedoryschenko

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

On Thursday 19 June 2008 01:05, Ingo Molnar wrote:
>
> ok, that looks much better! i have another box with e1000, ich7:
>
> 64 bytes from titan (10.0.1.14): icmp_seq=5 ttl=64 time=0.345 ms
> 64 bytes from titan (10.0.1.14): icmp_seq=6 ttl=64 time=1.03 ms
> 64 bytes from titan (10.0.1.14): icmp_seq=7 ttl=64 time=0.383 ms
> 64 bytes from titan (10.0.1.14): icmp_seq=8 ttl=64 time=0.320 ms
> 64 bytes from titan (10.0.1.14): icmp_seq=9 ttl=64 time=0.996 ms
> 64 bytes from titan (10.0.1.14): icmp_seq=10 ttl=64 time=0.248 ms
Maybe there is some flow-control involved?
ethtool -S eth0 ?

I guess this is interrupt throttling in e1000. e1000 has parameters for it too, but they can only be set at insmod time:
parm: TxIntDelay:Transmit Interrupt Delay (array of int)
parm: TxAbsIntDelay:Transmit Absolute Interrupt Delay (array of int)
parm: RxIntDelay:Receive Interrupt Delay (array of int)
parm: RxAbsIntDelay:Receive Absolute Interrupt Delay (array of int)
parm: InterruptThrottleRate:Interrupt Throttling Rate (array of int)
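
For a sense of scale, a small worked calculation, assuming InterruptThrottleRate (and the DEFAULT_ITR define quoted earlier in the thread) is expressed in interrupts per second. The function name is made up for illustration; the two rates are the old and new DEFAULT_ITR values from Ingo's tuning attempt.

```python
def min_interrupt_gap_us(interrupts_per_second):
    """Smallest spacing between two interrupts, in microseconds,
    if the NIC is capped at the given interrupt rate (assumed unit: ints/sec)."""
    return 1_000_000 / interrupts_per_second


for rate in (8000, 100000):
    print(f"{rate} ints/sec -> at least {min_interrupt_gap_us(rate):.0f} us between interrupts")
```

Under that assumption the default of 8000 spaces interrupts at least 125 us apart, and 100000 brings that down to 10 us.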

> well i tend not to tweak my drivers with such options because i want to
> experience and test what 99.9% of our users will experience in the
> field. The reality is that if it's not the default behavior, it's almost
> as if it didnt exist at all.

Every coin has two sides. On one side, low latencies (a difference of 1 ms, does it matter anywhere?);
on the other, performance.

>
> but even with that tune on e1000e (on the t60, ich7) i still get rather
> large numbers:
>
> earth4:~/s> ping eu
> PING europe (10.0.1.15) 56(84) bytes of data.
> 64 bytes from europe (10.0.1.15): icmp_seq=1 ttl=64 time=0.250 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=2 ttl=64 time=0.250 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=3 ttl=64 time=0.225 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=4 ttl=64 time=0.932 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=5 ttl=64 time=0.251 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=6 ttl=64 time=0.915 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=7 ttl=64 time=0.250 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=8 ttl=64 time=0.238 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=9 ttl=64 time=0.390 ms
> 64 bytes from europe (10.0.1.15): icmp_seq=10 ttl=64 time=0.260 ms

Are all these hosts on the same switch? Is the switch managed or not?
For example, i am having problems with packet loss on a long fiber link between two cheap Linksys switches.
Without flow control i cannot survive, and as a result i have 1-2ms of additional delay under load, and +-0.500ms of jitter "inside" these switches (probably from the switches).

Many things matter here. Maybe even processor sleep latencies are involved? Bus latency, PCI latency, whatever.

Also, laptops run dynamic frequency scaling (Speedstep).
With a 600 MHz PentiumM (Speedstep - ondemand):
64 bytes from 127.0.0.1: icmp_seq=17 ttl=64 time=0.017 ms
At full speed, 1.7 GHz:
64 bytes from 127.0.0.1: icmp_seq=33 ttl=64 time=0.007 ms

On the network i also see a difference of -0.030ms when i am running burnP6 (from the CPUburn package).

--
------
Technical Manager
Virtual ISP S.A.L.
Lebanon

2008-06-18 23:15:37

Subject: Re: [E1000-devel] [TCP]: TCP_DEFER_ACCEPT causes leak sockets

* Kok, Auke <[email protected]> wrote:

> You only complain and do not provide a single solution to your
> problem. [...]

i have reported the problem and even provided a fix.

I have triggered an e1000/e1000e related problem that got introduced in
the v2.6.25 merge window - one of my testboxes came up with no
networking and it took me an hour to figure out why. (i wasnt
particularly focusing on e1000, i just happened to hit that bug in 9
million lines of Linux kernel code)

I have reported it here, two and a half months ago:

http://lkml.org/lkml/2008/4/8/256

I even showed you which commit introduced the problem and gave you a
oneliner fix that i tested (it solved the problem):

http://bugzilla.kernel.org/attachment.cgi?id=15704&action=view

You were Cc:-ed to that. (attached below again for reference) The bug
was added to the regression list of v2.6.25. I never expected to spend
more than 10 minutes on this problem once i found out what's happening -
we fix dozens of bugs like this per stable kernel release.

I just checked latest -git, my fix is still not upstream (or any
equivalent solution - i really dont mind how it's solved and i'm not
maintaining this code).

no alternative patch was sent to me - i offered to test any solution
back then.

FYI, since i first reported it i've been hit by that problem roughly a
dozen times. (it happened sporadically so i forgot about it - until i
again had a system come up with no networking.) It caused me lost time
and lost work that could have been spent on better things.

Ingo

------------------------>
Subject: e1000=y && e1000e=m regression fix
From: Ingo Molnar <[email protected]>
Date: Wed Apr 09 21:09:35 CEST 2008

fix a regression from v2.6.24: do not transfer the e1000e PCI IDs from
e1000 to e1000e if e1000 is built-in and e1000e is a module.

Built-in drivers take precedence over modules in many ways - and in this
case it's clear that the user intended the e1000 driver to be the
primary one. "Silently change behavior and break existing configs" is
never a good migration strategy. Most users will use distro kernels that
are not affected by this problem at all - nor are they affected by this
patch - but this problem can hit users and developers who build their
kernels themselves and migrate from v2.6.24 to v2.6.25.

this fixes: http://bugzilla.kernel.org/show_bug.cgi?id=10427

Signed-off-by: Ingo Molnar <[email protected]>
---
drivers/net/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-x86.q/drivers/net/Kconfig
===================================================================
--- linux-x86.q.orig/drivers/net/Kconfig
+++ linux-x86.q/drivers/net/Kconfig
@@ -2022,7 +2022,7 @@ config E1000E
will be called e1000e.

config E1000E_ENABLED
- def_bool E1000E != n
+ def_bool E1000E = y || ((E1000E != n) && (E1000 = E1000E))

config IP1000
tristate "IP1000 Gigabit Ethernet support"
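
To see which configurations the one-liner changes, here is a simplified model of the two def_bool expressions. It is a sketch, not kconfig itself: tristate values are plain strings "n", "m", "y", and since every operand is a comparison the result is treated as a boolean. The function names are made up for illustration.

```python
from itertools import product


def old_enabled(e1000, e1000e):
    # def_bool E1000E != n
    return e1000e != "n"


def new_enabled(e1000, e1000e):
    # def_bool E1000E = y || ((E1000E != n) && (E1000 = E1000E))
    return e1000e == "y" or (e1000e != "n" and e1000 == e1000e)


# every (E1000, E1000E) combination where the patch changes E1000E_ENABLED
changed = [
    (e1000, e1000e)
    for e1000, e1000e in product("nmy", repeat=2)
    if old_enabled(e1000, e1000e) != new_enabled(e1000, e1000e)
]
for e1000, e1000e in changed:
    print(f"E1000={e1000} E1000E={e1000e}: "
          f"{old_enabled(e1000, e1000e)} -> {new_enabled(e1000, e1000e)}")
```

In this model E1000E_ENABLED flips from on to off in two cases: E1000=y with E1000E=m, which is the configuration the changelog describes, and E1000=n with E1000E=m, where presumably nothing is lost because there is no e1000 driver to give up PCI IDs.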

2008-06-19 07:02:13

by Jarek Poplawski

Subject: Re: [TCP]: TCP_DEFER_ACCEPT causes leak sockets

On 19-06-2008 00:12, David Miller wrote:
> From: "Kok, Auke" <[email protected]>
> Date: Wed, 18 Jun 2008 14:25:28 -0700
>
>> You only complain and do not provide a single solution to your
>> problem.

Technically, Ingo has asked for a solution (and, btw., he also gave a
proposal), so this argument is wrong.

>> Your continued screaming and whining is totally not
>> productive nor constructive at all,

Actually, screaming and whining often "helps" to get things done,
so this argument is wrong too.

>> and frankly is insulting

This one is right.

>> since
>> you completely ignore the fact that we worked with the community
>> for more than two years to come to some maintainable situation.

Two years of work doesn't guarantee the solution is right (but it might
be right), so - wrong argument.

>
> I completely and 100% agree with you.

I give 25...

Regards,
Jarek P.