There is definitely something really broken here. One of our web servers
that was having the problem before has now decided to hit a load average
of 50 because identd is taking so long to parse /proc/net/tcp and give
back ident information.
stracing a process, I see:
open("/proc/net/tcp", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40009000
read(4, " sl local_address rem_address "..., 4096) = 4096
read(4, "00000000 01:0000023B 00000000 "..., 4096) = 4096
read(4, "0374800 2 c9ae0a60 25 4 0 2 -1 "..., 4096) = 4096
read(4, " \n 81: 0CDFAECC:0050"..., 4096) = 4096
read(4, "06 00000000:00000000 03:0000131F"..., 4096) = 4096
read(4, "0 0 0 2 e3a92800 "..., 4096) = 4096
[...] Switching to "strace -tt" on the same process, which still
hadn't finished by the time I'd typed the command, I see:
09:25:45.481608 read(4, "0 "..., 4096) = 4096
09:25:46.843441 read(4, " \n 354: 17612840:0050 23925BC0:"..., 4096) = 4096
09:25:47.542388 read(4, "0:00000000 03:00000B6C 00000000 "..., 4096) = 4096
09:25:49.649454 read(4, " 10385399 2 c9ae0a60 96 4 7 2 -1"..., 4096) = 4096
09:25:51.743338 read(4, "", 4096) = 0
Each 4k block is taking 2 seconds! It's probably competing with other
processes, but still, this is crazy.
[sroot@pro:/root]# time wc -l /proc/net/tcp
427 /proc/net/tcp
0.000u 0.650s 0:01.38 47.1% 0+0k 0+0io 69pf+0w
Also, other servers running exactly the same kernel on the exact same
hardware have _more_ entries in the table and are much faster:
[sroot@bridge:/root]# time wc -l /proc/net/tcp
805 /proc/net/tcp
0.010u 0.100s 0:00.10 100.0% 0+0k 0+0io 68pf+0w
Also, some inactive servers running exactly the same kernel return
instantly, which is why I don't understand how walking too big a hash
table could be the problem:
[sroot@devel:/root]# time wc -l /proc/net/tcp
10 /proc/net/tcp
0.000u 0.000s 0:00.00 0.0% 0+0k 0+0io 113pf+0w
What gives?
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
Let me guess, the machine exhibiting the problem has the largest
amount of physical memory?
Franks a lot,
David S. Miller
[email protected]
On Thu, Oct 18, 2001 at 09:49:56AM -0700, David S. Miller wrote:
> Let me guess, the machine exhibiting the problem has the largest
> amount of physical memory?
You're right. I was wrong about the identical hardware -- somebody has
recently added another 128 MB to the box, and so it has 640 MB in there
at the moment. Most other boxes are 512 MB, and that other inactive
server was only 128 MB. Ah, the one I was comparing it with was only 384
MB, even. So yeah, it's probably memory-size related.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
On Thu, Oct 18, 2001 at 09:42:22AM -0700, Simon Kirby wrote:
> There is definitely something really broken here. One of our web servers
> that was having the problem before has now decided to hit a load average
> of 50 because identd is taking so long to parse /proc/net/tcp and give
> back ident information.
See my last mail. The hash tables are too big. Set a smaller one with
this patch by using the new tcpehashorder= command line option.
It sets the hash table size to 2^order * 4096 bytes on i386. You can
see the default order by looking at the dmesg of a machine booted
without this option set.
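To make the mapping concrete, this is just the sizing line from
tcp_init() in the patch below; the example numbers assume i386 SMP,
where a struct tcp_ehash_bucket (an rwlock plus a chain pointer) is
8 bytes:

	tcp_ehash_size = (1UL << order) * PAGE_SIZE /
			 sizeof(struct tcp_ehash_bucket);
	/* tcpehashorder=5: 32 pages * 4096 / 8 = 16384 buckets (128KB) */
	/* order 10 (the 640MB box's default): 1024 pages = ~4MB        */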
You can find out how much it costs you by looking at
/proc/net/sockstat. If the tcp_ehash_buckets value is the same as
with the default hash table size, then it didn't cost you anything.
If the value is very similar, it's probably still ok; but if you see
an average bucket length greater than about 5-10, the table is
probably too small.
The smaller the hash table, the faster identd should work.
-Andi
P.S.: It would be possible to create a shortcut for identd that does
a direct hash lookup and avoids a full walk, but so far I'm not
convinced that it's necessary.
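Roughly what such a shortcut could look like (untested sketch; the
entry point is hypothetical, while tcp_hashfn(), tcp_ehash and the
sock fields are as in 2.4's net/ipv4/tcp_ipv4.c):

	/* Hypothetical: resolve one connection directly instead of
	 * dumping all of /proc/net/tcp.  A real version would also
	 * check the TIME_WAIT half of the table and hold a reference
	 * on the returned sock. */
	static struct sock *tcp_v4_ident_lookup(u32 laddr, u16 lport,
						u32 faddr, u16 fport)
	{
		int hash = tcp_hashfn(laddr, lport, faddr, fport);
		struct tcp_ehash_bucket *head = &tcp_ehash[hash];
		struct sock *sk;

		read_lock(&head->lock);
		for (sk = head->chain; sk; sk = sk->next)
			if (sk->rcv_saddr == laddr && sk->sport == lport &&
			    sk->daddr == faddr && sk->dport == fport)
				break;	/* sk->socket->inode carries the owner */
		read_unlock(&head->lock);
		return sk;
	}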
--- net/ipv4/proc.c-o Wed May 16 19:21:45 2001
+++ net/ipv4/proc.c Sat Oct 13 03:37:55 2001
@@ -68,6 +68,7 @@
{
/* From net/socket.c */
extern int socket_get_info(char *, char **, off_t, int);
+ extern int tcp_v4_hash_statistics(char *);
int len = socket_get_info(buffer,start,offset,length);
@@ -82,6 +83,8 @@
fold_prot_inuse(&raw_prot));
len += sprintf(buffer+len, "FRAG: inuse %d memory %d\n",
ip_frag_nqueues, atomic_read(&ip_frag_mem));
+ len += tcp_v4_hash_statistics(buffer+len);
+
if (offset >= len)
{
*start = buffer;
--- net/ipv4/tcp.c-o Thu Oct 11 08:42:47 2001
+++ net/ipv4/tcp.c Sat Oct 13 03:56:58 2001
@@ -2442,6 +2442,15 @@
return 0;
}
+static unsigned tcp_ehash_order;
+static int __init tcp_hash_setup(char *str)
+{
+ tcp_ehash_order = simple_strtol(str,NULL,0);
+ return 1; /* option consumed; __setup handlers return 1 when handled */
+}
+
+__setup("tcpehashorder=", tcp_hash_setup);
+
extern void __skb_cb_too_small_for_tcp(int, int);
@@ -2486,8 +2495,12 @@
else
goal = num_physpages >> (23 - PAGE_SHIFT);
- for(order = 0; (1UL << order) < goal; order++)
- ;
+ if (tcp_ehash_order)
+ order = tcp_ehash_order;
+ else {
+ for(order = 0; (1UL << order) < goal; order++)
+ ;
+ }
do {
tcp_ehash_size = (1UL << order) * PAGE_SIZE /
sizeof(struct tcp_ehash_bucket);
--- net/ipv4/tcp_ipv4.c-o Mon Oct 1 18:19:56 2001
+++ net/ipv4/tcp_ipv4.c Sat Oct 13 03:41:57 2001
@@ -2162,6 +2162,62 @@
return len;
}
+int tcp_v4_hash_statistics(char *buffer)
+{
+ int i;
+ int max_hlen = 0, hrun = 0, hcnt = 0;
+ char *bufs = buffer;
+
+ buffer += sprintf(buffer, "tcp_ehash_buckets %d\n", tcp_ehash_size*2);
+
+ local_bh_disable();
+ for (i = 0; i < tcp_ehash_size; i++) {
+ struct tcp_ehash_bucket *head = &tcp_ehash[i];
+ struct sock *sk;
+ struct tcp_tw_bucket *tw;
+ int len = 0;
+
+ read_lock(&head->lock);
+ for(sk = head->chain; sk; sk = sk->next) {
+ if (!TCP_INET_FAMILY(sk->family))
+ continue;
+ ++len;
+ }
+
+ if (len > 0) {
+ if (len > max_hlen) max_hlen = len;
+ ++hcnt;
+ hrun += len;
+ }
+
+ len = 0;
+
+ for (tw = (struct tcp_tw_bucket *)tcp_ehash[i+tcp_ehash_size].chain;
+ tw != NULL;
+ tw = (struct tcp_tw_bucket *)tw->next) {
+ if (!TCP_INET_FAMILY(tw->family))
+ continue;
+ ++len;
+ }
+ read_unlock(&head->lock);
+
+ if (len > 0) {
+ if (len > max_hlen) max_hlen = len;
+ ++hcnt;
+ hrun += len;
+ }
+ }
+
+ local_bh_enable();
+
+ buffer += sprintf(buffer, "used hash buckets: %d\n", hcnt);
+ if (hcnt > 0)
+ buffer += sprintf(buffer, "average length: %d\n", hrun / hcnt);
+
+ return buffer - bufs;
+}
+
+
struct proto tcp_prot = {
name: "TCP",
close: tcp_close,
@@ -2210,3 +2266,4 @@
*/
tcp_socket->sk->prot->unhash(tcp_socket->sk);
}
+
On Fri, Oct 19, 2001 at 02:57:50PM +0200, Andi Kleen wrote:
> On Thu, Oct 18, 2001 at 09:42:22AM -0700, Simon Kirby wrote:
> > There is definitely something really broken here. One of our web servers
> > that was having the problem before has now decided to hit a load average
> > of 50 because identd is taking so long to parse /proc/net/tcp and give
> > back ident information.
>
> See my last mail. The hash tables are too big. Set a smaller one with
> this patch by using the new tcpehashorder= command line option.
> It sets the hash table size to 2^order * 4096 bytes on i386. You can
> see the default order by looking at the dmesg of a machine booted
> without this option set.
> You can find out how much it costs you by looking at
> /proc/net/sockstat. If the tcp_ehash_buckets value is the same as
> with the default hash table size, then it didn't cost you anything.
> If the value is very similar, it's probably still ok; but if you see
> an average bucket length greater than about 5-10, the table is
> probably too small.
> The smaller the hash table, the faster identd should work.
Hmm, yeah, on this box with 640 MB I see:
TCP: Hash tables configured (established 262144 bind 65536)
(I'm guessing this means that the hash table has 32k buckets.)
Yes, that's fairly large, but not that huge for machines which will have
tons of sockets open.
Normally we use nothing like that:
sockets: used 133
TCP: inuse 118 orphan 3 tw 171 alloc 119 mem 132
RAW: inuse 0
FRAG: inuse 0 memory 0
...but shrinking the size slightly won't really fix the problem, it will
just make it less obvious.
Direct lookups for identd, service probes, etc., are definitely the
better way to go, but it would be nice if netstat didn't suck as well.
Simon-
[ Stormix Technologies Inc. ][ NetNation Communications Inc. ]
[ [email protected] ][ [email protected] ]
[ Opinions expressed are not necessarily those of my employers. ]
In article <[email protected]>,
Simon Kirby <[email protected]> writes:
> TCP: Hash tables configured (established 262144 bind 65536)
No, it means you have 512+k buckets (4 bytes each on UP, 8 on SMP,
so ~4MB) for the established hash and 64k for the bind hash (the
latter is only used internally and searched for netstat).
4MB for a hash table looks ridiculously large to me.
> ...but shrinking the size slightly won't really fix the problem, it will
> just make it less obvious.
The size is the problem. Walking a 4MB table will always be slow.
-Andi
I think Alexey's netlink socket querying mechanism is the way
to go for this. Once Alexey sends me an updated version of
these changes, I will put them in and identd can be fixed to
make use of it. BTW, what does identd need netstat information
for?
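(What identd actually needs is the uid that owns one connection.
Alexey's mechanism is what later shipped as the tcpdiag/inet_diag
netlink interface; the sketch below uses the modern inet_diag names,
which postdate this thread, to show a single-connection owner query
from userspace:)

	#include <linux/inet_diag.h>
	#include <linux/netlink.h>
	#include <linux/sock_diag.h>
	#include <netinet/in.h>
	#include <stdint.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	/* Ask the kernel for the uid owning one established TCP
	 * connection.  Addresses and ports are in network byte order;
	 * returns the uid, or -1 on error. */
	static int tcp_owner_uid(uint32_t laddr, uint16_t lport,
				 uint32_t raddr, uint16_t rport)
	{
		struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
		struct {
			struct nlmsghdr nlh;
			struct inet_diag_req_v2 req;
		} msg;
		char buf[8192];
		struct nlmsghdr *h = (struct nlmsghdr *)buf;
		int fd, uid = -1;

		fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_SOCK_DIAG);
		if (fd < 0)
			return -1;

		memset(&msg, 0, sizeof(msg));
		msg.nlh.nlmsg_len = sizeof(msg);
		msg.nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
		msg.nlh.nlmsg_flags = NLM_F_REQUEST;	/* exact lookup, not a dump */
		msg.req.sdiag_family = AF_INET;
		msg.req.sdiag_protocol = IPPROTO_TCP;
		msg.req.idiag_states = 1 << 1;		/* TCP_ESTABLISHED */
		msg.req.id.idiag_sport = lport;		/* local end */
		msg.req.id.idiag_src[0] = laddr;
		msg.req.id.idiag_dport = rport;		/* remote end */
		msg.req.id.idiag_dst[0] = raddr;
		msg.req.id.idiag_cookie[0] = INET_DIAG_NOCOOKIE;
		msg.req.id.idiag_cookie[1] = INET_DIAG_NOCOOKIE;

		if (sendto(fd, &msg, sizeof(msg), 0,
			   (struct sockaddr *)&kernel, sizeof(kernel)) >= 0 &&
		    recv(fd, buf, sizeof(buf), 0) >= (ssize_t)sizeof(*h) &&
		    h->nlmsg_type == SOCK_DIAG_BY_FAMILY) {
			struct inet_diag_msg *r = NLMSG_DATA(h);
			uid = r->idiag_uid;	/* the answer identd wants */
		}
		close(fd);
		return uid;
	}

The kernel does one hash lookup and returns one record, instead of
formatting the entire table through /proc for every ident query.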
And, for a 640MB ram machine, a 4MB hash table is perfectly
reasonable.
Franks a lot,
David S. Miller
[email protected]
On Fri, Oct 19, 2001 at 01:59:24PM -0700, David S. Miller wrote:
> And, for a 640MB ram machine, a 4MB hash table is perfectly
> reasonable.
That isn't wholly true. A 4MB hash table can never fit in the cache of
an Athlon, and for one that's being used as a workstation with 1GB of
ram and maybe 60 connections active on average, that's a huge waste of
ram, and a guarantee that there will be lots of cache misses which just
aren't required. Keep the cache footprint as low as possible -- it
results in a system that performs better.
-ben
From: Benjamin LaHaise <[email protected]>
Date: Fri, 19 Oct 2001 17:30:55 -0400
On Fri, Oct 19, 2001 at 01:59:24PM -0700, David S. Miller wrote:
> And, for a 640MB ram machine, a 4MB hash table is perfectly
> reasonable.
That isn't wholly true. A 4MB hash table can never fit in the cache of
an Athlon, and for one that's being used as a workstation with 1GB of
ram and maybe 60 connections active on average, that's a huge waste of
ram, and a guarantee that there will be lots of cache misses which just
aren't required. Keep the cache footprint as low as possible -- it
results in a system that performs better.
It doesn't need to "fit in the cache" to perform optimally; that's
a load of crap, Ben.
I actually tested this, in fact on a cpu that had a meager 512K
cache at the time, and it did turn out to be more important to keep
the hash chains short than to keep the table fitting in the cache.
So please don't give me any crap about "fitting in the cache" unless
you can show me hard numbers that show that it does in fact perform
worse.
Let me clue you in. If the hash chains get long, you (instead of
cache missing on the table itself) are missing the cache several
times over walking the long hash chains.
Franks a lot,
David S. Miller
[email protected]
On Fri, Oct 19, 2001 at 02:56:39PM -0700, David S. Miller wrote:
> It doesn't need to "fit in the cache" to perform optimally, that's
> a load of crap Ben.
>
> I actually tested this, and in fact on a cpu that has a meager 512K
> cache at the time, and it did turn out to be more important to keep
> the hash chains short than to keep it fitting in the cache.
>
> So please don't give me any crap about "fitting in the cache" unless
> you can show me hard numbers that show that it does in fact perform
> worse.
Okay, let's take a look at the case where I have 64 connections open: if
I'm using a 64 entry hash table with one 4 byte pointer per entry and
perfect hashing, then it has a cache footprint of 256 bytes. Max. Now,
the same hash table blown up to 4MB is going to have a cache footprint
of 64 bytes (1 cache line) per entry, for a total of a 4KB cache footprint.
Which is better?
> Let me clue you in. If the hash chains get long, you (instead of
> cache missing on the table itself) are missing the cache several
> times over walking the long hash chains.
Don't AssUMe that I don't realise this. What I'm saying is that a 4MB hash
table for a system with a puny number of connections is bloat. Needless
bloat. 4MB is enough memory for a copy of gcc. Or enough to run 4 shells.
If the hash table was grown dynamically, I wouldn't have this complaint.
-ben
From: Benjamin LaHaise <[email protected]>
Date: Fri, 19 Oct 2001 18:04:34 -0400
If the hash table was grown dynamically, I wouldn't have this complaint.
It's too expensive to implement. It adds a new check every
connection, _or_ some timer which periodically scans the chain
lengths.
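(For illustration, a sketch of the per-connection check in question;
every name here is hypothetical, nothing like it exists in the tree:)

	/* Invented for illustration: a grow-on-demand check at hash
	 * insert time.  chain_len, TCP_EHASH_MAX_CHAIN, the pending
	 * flag and the task-queue entry are all made up. */
	static inline void tcp_ehash_check_grow(struct tcp_ehash_bucket *head)
	{
		if (++head->chain_len > TCP_EHASH_MAX_CHAIN &&
		    !test_and_set_bit(0, &tcp_ehash_resize_pending))
			schedule_task(&tcp_ehash_resize_tq); /* rehash via keventd */
	}

Even that much adds a counter update and a branch to every insert,
plus the problem of swapping tables out from under readers.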
Franks a lot,
David S. Miller
[email protected]
On Fri, Oct 19, 2001 at 03:11:01PM -0700, David S. Miller wrote:
> From: Benjamin LaHaise <[email protected]>
> Date: Fri, 19 Oct 2001 18:04:34 -0400
>
> If the hash table was grown dynamically, I wouldn't have this complaint.
>
> It's too expensive to implement. It adds a new check every
> connection, _or_ some timer which periodically scans the chain
> lengths.
Then make it a sysctl. Sysadmins exist for a reason. ;-)
-ben