2001-10-11 18:47:23

by Simon Kirby

Subject: Really slow netstat and /proc/net/tcp in 2.4

Is there something that changed from 2.2 -> 2.4 with regards to the
speed of netstat and /proc/net/tcp? We have some webservers we just
upgraded from 2.2.19 to 2.4.12, and some in-house monitoring tools that
check /proc/net/tcp have begun to suck up a lot of CPU cycles trying to
read that file.

A simple cat or wc -l on the file feels about two orders of magnitude
slower ("time" reports around a second when the file has 450 entries).
Some servers seem to be worse than others, and it does not
appear to be proportional to the number of entries across servers.

netstat -tn just crawls along on these servers. Should I enable
profile=1 or something to see what's happening here?

Examples:

2.2.19:

[sroot@marble:/root]# time wc -l /proc/net/tcp
858 /proc/net/tcp
0.000u 0.010s 0:00.01 100.0% 0+0k 0+0io 112pf+0w

2.4.12:

[sroot@pro:/root]# time wc -l /proc/net/tcp
463 /proc/net/tcp
0.000u 0.640s 0:00.64 100.0% 0+0k 0+0io 69pf+0w

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]


2001-10-11 19:31:38

by Alexey Kuznetsov

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

Hello!

> Is there something that changed from 2.2 -> 2.4 with regards to the
> speed of netstat and /proc/net/tcp?

The incredibly large size of the hash table, I think.
At least here its size is ~1MB, and all of it is walked for each 1K of
data read via /proc/. :-)

Alexey

2001-10-11 19:55:29

by Simon Kirby

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

On Thu, Oct 11, 2001 at 11:30:25PM +0400, [email protected] wrote:

> Hello!
>
> > Is there something that changed from 2.2 -> 2.4 with regards to the
> > speed of netstat and /proc/net/tcp?
>
> The incredibly large size of the hash table, I think.
> At least here its size is ~1MB, and all of it is walked for each 1K of
> data read via /proc/. :-)

So it's walking the hash table per block read, and the hash table is very
large? Hmm. I notice it's a bit faster if I use dd if=/proc/net/tcp
of=/dev/null bs=1024k, but not much.
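
For reference, here is a minimal C sketch of the same experiment, timing
one pass over /proc/net/tcp with 1KB reads versus 1MB reads (the helper
name and the buffer sizes are just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* Read /proc/net/tcp to EOF with a given buffer size and return the
 * elapsed wall-clock time in seconds. */
static double read_all(size_t bufsize)
{
    char *buf = malloc(bufsize);
    struct timeval t0, t1;
    ssize_t n;
    int fd = open("/proc/net/tcp", O_RDONLY);

    if (fd < 0 || !buf) {
        perror("open/malloc");
        exit(1);
    }
    gettimeofday(&t0, NULL);
    while ((n = read(fd, buf, bufsize)) > 0)
        ;                   /* discard the data, we only want the timing */
    gettimeofday(&t1, NULL);
    close(fd);
    free(buf);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    printf("1KB reads: %.3f s\n", read_all(1024));
    printf("1MB reads: %.3f s\n", read_all(1024 * 1024));
    return 0;
}

If the whole hash table really is rescanned for every chunk returned, the
1KB pass should be dramatically slower; and if the proc read path hands
back only a small chunk (on the order of a page) per read() regardless of
the user buffer, that would also explain why dd with bs=1024k helps only
a little.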

Is it possible to fix this? Was the 2.2 hash table just that much
smaller?

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-12 16:45:04

by Alexey Kuznetsov

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

Hello!

> Is it possible to fix this? Was the 2.2 hash table just that much
> smaller?

2.2 did not use the hash table for this; it kept a special single list
just for /proc.

If I understand correctly, it was removed because it added more data/work
and a new synchronization point on the main path while being useful only
for /proc. The approach would be justified if you had 100000 sockets; in
that case both approaches are equally slow. :-) But for 1000 sockets, a
hash table of 100000 entries is sort of overscaled.


> Is it possible to fix this?

To fix --- no. To do it differently --- yes.

Well, actually, if you are interested, drop me a note and I can pack up
some of my old work on this for you. It is fully functional, but the API
is still dirty. It requires some kernel patching, unfortunately.

Alexey

2001-10-12 19:36:13

by Simon Kirby

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

On Fri, Oct 12, 2001 at 08:44:58PM +0400, [email protected] wrote:

> > Is it possible to fix this? Was the 2.2 hash table just that much
> > smaller?
>
> 2.2 did not use the hash table for this; it kept a special single list
> just for /proc.
>
> If I understand correctly, it was removed because it added more data/work
> and a new synchronization point on the main path while being useful only
> for /proc. The approach would be justified if you had 100000 sockets; in
> that case both approaches are equally slow. :-) But for 1000 sockets, a
> hash table of 100000 entries is sort of overscaled.
>
> > Is it possible to fix this?
>
> To fix --- no. To do it differently --- yes.
>
> Well, actually, if you are interested, drop me a note and I can pack up
> some of my old work on this for you. It is fully functional, but the API
> is still dirty. It requires some kernel patching, unfortunately.

If it involves changing the TCP stack locking stuff and putting the old
list back, it's probably not worth the bother. The only thing we're
using the file for (other than the occasional admin-run "netstat") is to
check to see what ports are listening on the machine without actually
attempting to connect to them. We check our services this way more often
than actually connecting and requesting a response to reduce log clutter
and testing load on the server. Is there an easier way to accomplish
this than parsing /proc/net/tcp? We could attempt to bind to the ports
we want to check, but that would race with daemons trying to start up.
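
For what it's worth, a minimal sketch of that check in C: scan
/proc/net/tcp once and report whether a given local port is in the
LISTEN state (hex 0A), with no connect() or bind() involved. It assumes
the usual IPv4 field layout ("sl local_address rem_address st ..."); the
function name and the default port are arbitrary:

#include <stdio.h>
#include <stdlib.h>

/* Return 1 if some socket is listening on the given local port,
 * 0 if not, -1 on error. */
static int port_is_listening(unsigned int port)
{
    FILE *f = fopen("/proc/net/tcp", "r");
    char line[256];
    int found = 0;

    if (!f) {
        perror("/proc/net/tcp");
        return -1;
    }
    fgets(line, sizeof(line), f);              /* skip the header line */
    while (fgets(line, sizeof(line), f)) {
        unsigned int local_port, st;

        /* "sl" and the addresses are skipped; only the local port and
         * the state column are of interest here. */
        if (sscanf(line, "%*d: %*8X:%4X %*8X:%*4X %2X",
                   &local_port, &st) == 2 &&
            local_port == port && st == 0x0A) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

int main(int argc, char **argv)
{
    unsigned int port = argc > 1 ? atoi(argv[1]) : 80;
    int r = port_is_listening(port);

    if (r < 0)
        return 1;
    printf("port %u is %slistening\n", port, r ? "" : "not ");
    return 0;
}

This still pays the cost of one full pass over /proc/net/tcp per check,
of course, so it only avoids the connect/bind races, not the slow read.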

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-12 19:43:33

by Alexey Kuznetsov

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

Hello!

> If it involves changing the TCP stack locking stuff

No. It does not even touch the kernel, except for exporting 4 symbols
that are not currently exported.

> and testing load on the server. Is there an easier way to accomplish
> this than parsing /proc/net/tcp? We could attempt to bind to the ports
> we want to check, but that would race with daemons trying to start up.

To syn-flood with a single SYN using a packet socket.
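
A rough sketch of that idea follows, using a raw IPPROTO_TCP socket
rather than a packet socket for brevity: send a single SYN and classify
the answer --- a SYN-ACK means the port is listening, an RST means it is
not. The local source address is taken from the command line instead of
being looked up from the routing table, the source port and sequence
number are arbitrary, and it needs root, so treat it as an illustration
rather than a finished probe:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Pseudo-header used only for the TCP checksum calculation. */
struct pseudo_hdr {
    unsigned int   src, dst;
    unsigned char  zero, proto;
    unsigned short len;
};

/* Standard 16-bit one's complement checksum. */
static unsigned short csum(const unsigned short *p, int nbytes)
{
    unsigned long sum = 0;
    for (; nbytes > 1; nbytes -= 2)
        sum += *p++;
    if (nbytes)
        sum += *(const unsigned char *)p;
    sum = (sum >> 16) + (sum & 0xffff);
    sum += sum >> 16;
    return (unsigned short)~sum;
}

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s dst-ip dst-port src-ip\n", argv[0]);
        return 1;
    }
    int port = atoi(argv[2]);
    int s = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
    if (s < 0) { perror("socket (need root)"); return 1; }

    struct sockaddr_in dst = { 0 };
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr(argv[1]);

    /* Build the lone SYN; the kernel prepends the IP header itself. */
    struct tcphdr syn;
    memset(&syn, 0, sizeof(syn));
    syn.source = htons(61000);          /* arbitrary local port */
    syn.dest   = htons(port);
    syn.seq    = htonl(12345);          /* arbitrary sequence number */
    syn.doff   = 5;
    syn.syn    = 1;
    syn.window = htons(4096);

    /* TCP checksum over pseudo-header + TCP header (check still 0). */
    struct pseudo_hdr ph = { 0 };
    unsigned char sumbuf[sizeof(ph) + sizeof(syn)];
    ph.src   = inet_addr(argv[3]);
    ph.dst   = dst.sin_addr.s_addr;
    ph.proto = IPPROTO_TCP;
    ph.len   = htons(sizeof(syn));
    memcpy(sumbuf, &ph, sizeof(ph));
    memcpy(sumbuf + sizeof(ph), &syn, sizeof(syn));
    syn.check = csum((unsigned short *)sumbuf, sizeof(sumbuf));

    if (sendto(s, &syn, sizeof(syn), 0, (struct sockaddr *)&dst,
               sizeof(dst)) < 0) {
        perror("sendto");
        return 1;
    }

    /* The raw socket sees a copy of every incoming TCP packet, so wait
     * up to 2 seconds and filter for the answer to our probe. */
    struct timeval tv = { 2, 0 };
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    unsigned char buf[4096];
    ssize_t n;
    while ((n = recv(s, buf, sizeof(buf), 0)) > 0) {
        struct iphdr *ip = (struct iphdr *)buf;
        if (n < (ssize_t)(ip->ihl * 4 + sizeof(struct tcphdr)))
            continue;
        struct tcphdr *tcp = (struct tcphdr *)(buf + ip->ihl * 4);
        if (ip->saddr != dst.sin_addr.s_addr ||
            tcp->source != htons(port) || tcp->dest != syn.source)
            continue;
        if (tcp->syn && tcp->ack) {
            printf("port %d: open (got SYN-ACK)\n", port);
            return 0;
        }
        if (tcp->rst) {
            printf("port %d: closed (got RST)\n", port);
            return 0;
        }
    }
    printf("port %d: no answer\n", port);
    return 0;
}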

Alexey

2001-10-12 19:56:03

by Andi Kleen

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

In article <[email protected]>,
Simon Kirby <[email protected]> writes:
> On Thu, Oct 11, 2001 at 11:30:25PM +0400, [email protected] wrote:
>> Hello!
>>
>> > Is there something that changed from 2.2 -> 2.4 with regards to the
>> > speed of netstat and /proc/net/tcp?
>>
>> The incredibly large size of the hash table, I think.
>> At least here its size is ~1MB, and all of it is walked for each 1K of
>> data read via /proc/. :-)

> So it's walking the hash table per block read, and the hash table is very
> large? Hmm. I notice it's a bit faster if I use dd if=/proc/net/tcp
> of=/dev/null bs=1024k, but not much.

> Is it possible to fix this? Was the 2.2 hash table just that much
> smaller?

The hash table is likely too big anyway, eating cache and not helping
that much. If you're interested in some testing, I can send you patches
to change it by hand and collect statistics for average hash queue
length. Then you can figure out a good size for your workload with some
work. Longer term, I think the table sizing heuristics are far too
aggressive and need to be throttled back, but that needs more data from
real servers.

-Andi

2001-10-12 22:10:40

by Simon Kirby

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

On Fri, Oct 12, 2001 at 09:56:01PM +0200, Andi Kleen wrote:

> The hash table is likely too big anyway, eating cache and not helping
> that much. If you're interested in some testing, I can send you patches
> to change it by hand and collect statistics for average hash queue
> length. Then you can figure out a good size for your workload with some
> work. Longer term, I think the table sizing heuristics are far too
> aggressive and need to be throttled back, but that needs more data from
> real servers.

Wouldn't just counting the lines in /proc/net/tcp be sufficient to see
how many buckets should be used in an ideal hash table distribution
scenario? (In which case the size of the hash table depends largely on a
machine's work load...)

Most of our web servers seem to have 500-1000 entries in /proc/net/tcp.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]

2001-10-12 23:57:20

by Andi Kleen

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

On Fri, Oct 12, 2001 at 03:10:33PM -0700, Simon Kirby wrote:
> On Fri, Oct 12, 2001 at 09:56:01PM +0200, Andi Kleen wrote:
>
> > The hash table is likely too big anyway, eating cache and not helping
> > that much. If you're interested in some testing, I can send you patches
> > to change it by hand and collect statistics for average hash queue
> > length. Then you can figure out a good size for your workload with some
> > work. Longer term, I think the table sizing heuristics are far too
> > aggressive and need to be throttled back, but that needs more data from
> > real servers.
>
> Wouldn't just counting the lines in /proc/net/tcp be sufficient to see
> how many buckets should be used in an ideal hash table distribution
> scenario? (In which case the size of the hash table depends largely on a
> machine's work load...)

That won't tell you the list length of individual hash buckets. Keeping
that length low on average is the goal of the big hash tables, but I
suspect the 2.4 ones are far too big, so any possible benefit is lost to
cache non-locality.
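
A toy illustration of the difference, assuming the hash spreads entries
roughly like uniform random assignment (the entry and bucket counts below
are arbitrary examples, not the kernel's defaults):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int entries = 1000;       /* roughly what Simon's servers see */
    const int sizes[] = { 1024, 16384, 262144 };
    const int nsizes = sizeof(sizes) / sizeof(sizes[0]);
    int i, j;

    srand(1);
    for (i = 0; i < nsizes; i++) {
        int buckets = sizes[i];
        int *len = calloc(buckets, sizeof(int));
        int used = 0, max = 0;

        /* Throw the entries into random buckets and measure the chains. */
        for (j = 0; j < entries; j++)
            len[rand() % buckets]++;
        for (j = 0; j < buckets; j++) {
            if (len[j]) {
                used++;
                if (len[j] > max)
                    max = len[j];
            }
        }
        printf("%6d buckets: %4d used, avg chain %.2f, max chain %d\n",
               buckets, used, (double)entries / used, max);
        free(len);
    }
    return 0;
}

With ~1000 entries, going from 16K to 256K buckets barely changes the
chain lengths, but it multiplies the memory that /proc/net/tcp has to
walk and that the lookup path drags through the cache.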

I attached a patch. It allows you to get some simple statistics from
/proc/net/sockstat (unfortunately costly too). It also adds a new kernel
boot argument, tcpehashorder=order. Order is the log2 of how many pages
you want to use for the hash table (so it needs 2^order * 4096 bytes on
i386). You can experiment with various sizes and check which one still
gives a reasonable hash distribution under load.

The smallest one you can find is best.

BTW, it seems like the tables are 1/4 too big on SMP systems. The second
half, reserved for time-wait, has per-bucket rwlocks too, but they're not
used. If established and time-wait were split, this wastage could be
avoided and some memory (but not walk time) could be saved. It would also
halve the requirement on contiguous memory, which would be useful e.g. if
tcp/ip were ever turned into a module.

-Andi



--- net/ipv4/proc.c-o Wed May 16 19:21:45 2001
+++ net/ipv4/proc.c Sat Oct 13 03:37:55 2001
@@ -68,6 +68,7 @@
{
/* From net/socket.c */
extern int socket_get_info(char *, char **, off_t, int);
+ extern int tcp_v4_hash_statistics(char *) ;

int len = socket_get_info(buffer,start,offset,length);

@@ -82,6 +83,8 @@
fold_prot_inuse(&raw_prot));
len += sprintf(buffer+len, "FRAG: inuse %d memory %d\n",
ip_frag_nqueues, atomic_read(&ip_frag_mem));
+ len += tcp_v4_hash_statistics(buffer+len);
+
if (offset >= len)
{
*start = buffer;
--- net/ipv4/tcp.c-o Thu Oct 11 08:42:47 2001
+++ net/ipv4/tcp.c Sat Oct 13 03:56:58 2001
@@ -2442,6 +2442,15 @@
return 0;
}

+static unsigned tcp_ehash_order;
+static int __init tcp_hash_setup(char *str)
+{
+ tcp_ehash_order = simple_strtol(str,NULL,0);
+ return 0;
+}
+
+__setup("tcpehashorder=", tcp_hash_setup);
+

extern void __skb_cb_too_small_for_tcp(int, int);

@@ -2486,8 +2495,12 @@
else
goal = num_physpages >> (23 - PAGE_SHIFT);

- for(order = 0; (1UL << order) < goal; order++)
- ;
+ if (tcp_ehash_order)
+ order = tcp_ehash_order;
+ else {
+ for(order = 0; (1UL << order) < goal; order++)
+ ;
+ }
do {
tcp_ehash_size = (1UL << order) * PAGE_SIZE /
sizeof(struct tcp_ehash_bucket);
--- net/ipv4/tcp_ipv4.c-o Mon Oct 1 18:19:56 2001
+++ net/ipv4/tcp_ipv4.c Sat Oct 13 03:41:57 2001
@@ -2162,6 +2162,62 @@
return len;
}

+int tcp_v4_hash_statistics(char *buffer)
+{
+ int i;
+ int max_hlen = 0, hrun = 0, hcnt = 0 ;
+ char *bufs = buffer;
+
+ buffer += sprintf(buffer, "hash_buckets %d\n", tcp_ehash_size*2);
+
+ local_bh_disable();
+ for (i = 0; i < tcp_ehash_size; i++) {
+ struct tcp_ehash_bucket *head = &tcp_ehash[i];
+ struct sock *sk;
+ struct tcp_tw_bucket *tw;
+ int len = 0;
+
+ read_lock(&head->lock);
+ for(sk = head->chain; sk; sk = sk->next) {
+ if (!TCP_INET_FAMILY(sk->family))
+ continue;
+ ++len;
+ }
+
+ if (len > 0) {
+ if (len > max_hlen) max_hlen = len;
+ ++hcnt;
+ hrun += len;
+ }
+
+ len = 0;
+
+ for (tw = (struct tcp_tw_bucket *)tcp_ehash[i+tcp_ehash_size].chain;
+ tw != NULL;
+ tw = (struct tcp_tw_bucket *)tw->next) {
+ if (!TCP_INET_FAMILY(tw->family))
+ continue;
+ ++len;
+ }
+ read_unlock(&head->lock);
+
+ if (len > 0) {
+ if (len > max_hlen) max_hlen = len;
+ ++hcnt;
+ hrun += len;
+ }
+ }
+
+ local_bh_enable();
+
+ buffer += sprintf(buffer, "used hash buckets: %d\n", hcnt);
+ if (hcnt > 0)
+ buffer += sprintf(buffer, "average length: %d\n", hrun / hcnt);
+
+ return buffer - bufs;
+}
+
+
struct proto tcp_prot = {
name: "TCP",
close: tcp_close,
@@ -2210,3 +2266,4 @@
*/
tcp_socket->sk->prot->unhash(tcp_socket->sk);
}
+

2001-10-13 15:05:39

by Hugh Dickins

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

On Sat, 13 Oct 2001, Andi Kleen wrote:
>
> I attached a patch. It allows you to get some simple statistics from
> /proc/net/sockstat (unfortunately costly too). It also adds a new kernel
> boot argument, tcpehashorder=order. Order is the log2 of how many pages
> you want to use for the hash table (so it needs 2^order * 4096 bytes on
> i386). You can experiment with various sizes and check which one still
> gives a reasonable hash distribution under load.

Wouldn't something like "tcpehashbuckets" make a better boot tunable
than "tcpehashorder"? Rounded up to next power of two before used.

I come at this from the PAGE_SIZE angle, rather than the TCP angle:
"order" tunables seem confusing to me (being interested in configurable
PAGE_SIZE). And they're confusing to code too: note that the existing
calculation of goal from num_physpages gives you more hash buckets for
larger PAGE_SIZE (comment says "methodology is similar to that of the
buffer cache", but buffer cache gets it right - though for small memory,
would do better to multiply mempages by sizeof _before_ shifting right).
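
A standalone sketch of the arithmetic in question, mirroring the sizing
loop from Andi's patch context earlier in the thread. The 1GB memory
figure and the 8-byte bucket size are assumptions for illustration (the
real divisor is sizeof(struct tcp_ehash_bucket)):

#include <stdio.h>

int main(void)
{
    const unsigned long mem_mb = 1024;          /* example: 1GB machine */
    const unsigned long bucket_size = 8;        /* assumed, see above */
    const int page_shifts[] = { 12, 13, 16 };   /* 4K, 8K, 64K pages */
    int i;

    for (i = 0; i < 3; i++) {
        int page_shift = page_shifts[i];
        unsigned long page_size = 1UL << page_shift;
        unsigned long num_physpages = (mem_mb << 20) >> page_shift;
        /* Works out to one page of hash table per 8MB of RAM. */
        unsigned long goal = num_physpages >> (23 - page_shift);
        int order;

        for (order = 0; (1UL << order) < goal; order++)
            ;
        printf("PAGE_SIZE %6lu: goal %4lu pages, order %2d, "
               "table %8lu bytes, ~%lu buckets\n",
               page_size, goal, order,
               (1UL << order) * page_size,
               (1UL << order) * page_size / bucket_size);
    }
    return 0;
}

With these numbers the goal comes out to the same 128 pages in every
case, so a larger PAGE_SIZE buys a proportionally larger table in bytes
and in buckets --- which is the behaviour being objected to here.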

Hugh

2001-10-13 16:07:37

by Andi Kleen

Subject: Re: Really slow netstat and /proc/net/tcp in 2.4

In article <[email protected]>,
Hugh Dickins <[email protected]> writes:
> On Sat, 13 Oct 2001, Andi Kleen wrote:
>>
>> I attached a patch. It allows you to get some simple statistics from
>> /proc/net/sockstat (unfortunately costly too). It also adds a new kernel
>> boot argument, tcpehashorder=order. Order is the log2 of how many pages
>> you want to use for the hash table (so it needs 2^order * 4096 bytes on
>> i386). You can experiment with various sizes and check which one still
>> gives a reasonable hash distribution under load.

> Wouldn't something like "tcpehashbuckets" make a better boot tunable
> than "tcpehashorder"? Rounded up to next power of two before used.
[...]

I just hacked something together quickly so that people can test what
impact different hash table sizes have on their workload, and for that
using the order was easiest. The goal is of course to do automatic hash
table tuning; I don't expect this to be a permanent tunable. My hope is
that it'll turn out that smaller hash tables will be good enough.

-Andi