2006-11-06 06:07:07

by Zhao Xiaoming

Subject: ZONE_NORMAL memory exhausted by 4000 TCP sockets

Dear all,
I'm running a Linux box with kernel version 2.6.16. The hardware
has 2 Woodcrest Xeon CPUs (2 cores each) and 4 GB of RAM. The NICs are
Intel 82571 cards on the PCI-e bus.
The box is acting as an Ethernet bridge between two Gigabit Ethernet
segments. Using ebtables and iptables, an application runs as a TCP
proxy that intercepts all TCP connection requests from the network and
sets up another TCP connection to the actual server. The TCP proxy then
relays all traffic in both directions.
The problem is memory. Since the box must support thousands of
concurrent connections, I knew the size of ZONE_NORMAL would be a
bottleneck, as TCP packets need many buffers. After setting the upper
limits of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32 KB, our test
began.
My test scenario employs 2000 concurrent downloading connections
to an IIS server's port 80. The throughput is about 500~600 Mbps, which
is limited by the capability of the client application. Because all
traffic flows from server to client and the client machine's capability
is the bottleneck, I believe the receive buffers of the sockets
connected to the server and the send buffers of the sockets connected
to the client should be filled with packets up to their respective
windows. Thus, roughly 32K * 2000 + 32K * 2000 = 128 MB of memory
should be occupied by the TCP/IP stack for packet buffering. Data from
slabtop confirmed it: the memory cost is about 140 MB after I start the
traffic, which reasonably matches my estimate. However, /proc/meminfo
told a different story: 'LowFree' dropped from about 710 MB to 80 MB.
In other words, an additional 500 MB of ZONE_NORMAL memory was
allocated by something other than the slab allocator. Why?
I also ran another test with the upper limits of tcp_rmem and
tcp_wmem set to 64 KB. After the 2000 connections had been transferring
a lot of data for several seconds, the Linux box printed error messages
about failing to allocate memory pages and became unstable.
My questions are:

1. When calculating the memory requirements of TCP sockets, is any
other large amount of memory requested besides the send and receive
buffers?
2. Is there any logic in the TCP/IP stack that allocates memory pages
directly, instead of going through the slab allocator?

Thanks!

Xiaoming.
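
As a rough cross-check of the 128 MB estimate above: the 'mem' field on the
TCP line of /proc/net/sockstat reports the stack's buffer accounting in
pages, not bytes. A minimal sketch in Python (illustrative only; the 4 KB
page size is the i386 value, and the 2000-session / 32 KB figures are the
ones from the test description, not measured values):

PAGE_SIZE = 4096              # bytes per page on i386
sessions = 2000               # proxied sessions, i.e. 4000 sockets in total
buf_cap = 32 * 1024           # upper limit set on tcp_rmem and tcp_wmem

# One full receive buffer on the server-facing socket plus one full send
# buffer on the client-facing socket, per session.
estimate = sessions * buf_cap + sessions * buf_cap
print("estimated TCP buffer memory: %.0f MB" % (estimate / 2.0 ** 20))

# Compare with what the stack itself accounts for, reported in pages on
# the TCP line of /proc/net/sockstat.
with open("/proc/net/sockstat") as f:
    for line in f:
        if line.startswith("TCP:"):
            fields = line.split()
            pages = int(fields[fields.index("mem") + 1])
            print("TCP mem: %d pages = %.0f MB"
                  % (pages, pages * PAGE_SIZE / 2.0 ** 20))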


2006-11-06 07:34:22

by Eric Dumazet

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

Zhao Xiaoming a écrit :
> Dears,
> I'm running a linux box with kernel version 2.6.16. The hardware
> has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> Intel 82571 on PCI-e bus.
> The box is acting as ethernet bridge between 2 Gigabit Ethernets.
> By configuring ebtables and iptables, an application is running as TCP
> proxy which will intercept all TCP connections requests from the
> network and setup another TCP connection to the acture server. The
> TCP proxy then relays all traffics in both directions.
> The problem is the memory. Since the box must support thousands of
> concurrent connections, I know the memory size of ZONE_NORMAL would be
> a bottleneck as TCP packets would need many buffers. After setting
> upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
> our test began.
> My test scenario employs 2000 concurrent downloading connections
> to a IIS server's port 80. The throughput is about 500~600 Mbps which
> is limited by the capability of the client application. Because all
> traffics are from server to client and the capability of client
> machine is bottleneck, I believe the receiver side of the sockets
> connected with server and the sender side of the sockets connected
> with client should be filled with packets in correspondent windows.
> Thus, roughly there should be about 32K * 2000+ 32K*2000 = 128M bytes
> memory occupied by TCP/IP stack for packet buffering. Data from
> slabtop confermed it. it's about 140M bytes memory cost after I start
> the traffic. That reasonablly matched with my estimation. However,
> /proc/meminfo had a different story. The 'LowFree' dropped from about
> 710M to 80M. In other words, there's addtional 500M memory in
> ZONE_NORMAL allocated by someone other than the slab. Why?

We don't know. You might post some data so that we can get some ideas.

Also, this kind of question is better handled by the linux netdev mailing
list, so I have added a CC to that list.

cat /proc/slabinfo
cat /proc/meminfo
cat /proc/net/sockstat
cat /proc/buddyinfo

> I also made another test that the upper limit of tcp_rmem and
> tcp_wmem being set to 64K. After 2000 connections transfering a lot of
> data for several seconds, the linux box showed some error messages
> such as error allocating memory pages, etc. and became unstable.
> My questions are:
>
> 1. To calculate memory request of TCP sockets, is there any other
> large amount of memory requested besides send and receive buffer?
> 2. Is there any logics that emploied by TCP/IP stack that will
> dynamically allocating memory pages directly instead of from slab?

The TCP stack is one thing, but other things may consume RAM in your kernel.

Also, kernel memory allocations might use twice the RAM you intend to use,
because of power-of-two alignment.
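
To illustrate the power-of-two point: kmalloc() requests are served from
the generic size-32 ... size-131072 caches that show up in /proc/slabinfo,
so a buffer is charged for the next size class up rather than for its exact
length, and per-socket accounting (skb->truesize) is based on the allocated
size, not the payload. A small Python sketch (purely illustrative; the class
list simply mirrors the generic caches listed later in this thread):

GENERIC_CACHES = (32, 64, 128, 256, 512, 1024, 2048,
                  4096, 8192, 16384, 32768, 65536, 131072)

def kmalloc_class(size):
    # Smallest generic cache able to hold a request of 'size' bytes.
    for c in GENERIC_CACHES:
        if size <= c:
            return c
    raise ValueError("request larger than the biggest generic cache")

# A request just over a class boundary pays nearly double.
for request in (1500, 2048, 2248, 4600):
    got = kmalloc_class(request)
    print("request %5d bytes -> size-%-6d (overhead %4d bytes, %3.0f%%)"
          % (request, got, got - request, 100.0 * (got - request) / request))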

Are you using iptables connection tracking ?

If you plan to use a lot of RAM in the kernel, why don't you use a 64-bit
kernel, so that all RAM is available to the kernel, not only 900 MB?

Eric

2006-11-06 08:10:07

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Eric Dumazet <[email protected]>:
> Zhao Xiaoming a écrit :
> > Dears,
> > I'm running a linux box with kernel version 2.6.16. The hardware
> > has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> > Intel 82571 on PCI-e bus.
> > The box is acting as ethernet bridge between 2 Gigabit Ethernets.
> > By configuring ebtables and iptables, an application is running as TCP
> > proxy which will intercept all TCP connections requests from the
> > network and setup another TCP connection to the acture server. The
> > TCP proxy then relays all traffics in both directions.
> > The problem is the memory. Since the box must support thousands of
> > concurrent connections, I know the memory size of ZONE_NORMAL would be
> > a bottleneck as TCP packets would need many buffers. After setting
> > upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
> > our test began.
> > My test scenario employs 2000 concurrent downloading connections
> > to a IIS server's port 80. The throughput is about 500~600 Mbps which
> > is limited by the capability of the client application. Because all
> > traffics are from server to client and the capability of client
> > machine is bottleneck, I believe the receiver side of the sockets
> > connected with server and the sender side of the sockets connected
> > with client should be filled with packets in correspondent windows.
> > Thus, roughly there should be about 32K * 2000+ 32K*2000 = 128M bytes
> > memory occupied by TCP/IP stack for packet buffering. Data from
> > slabtop confermed it. it's about 140M bytes memory cost after I start
> > the traffic. That reasonablly matched with my estimation. However,
> > /proc/meminfo had a different story. The 'LowFree' dropped from about
> > 710M to 80M. In other words, there's addtional 500M memory in
> > ZONE_NORMAL allocated by someone other than the slab. Why?
>
> We dont know. You might post some data so that we can have some ideas.
>
> Also, these kind of question is better handled by linux netdev mailing list,
> so I added a CC to this list.
>
> cat /proc/slabinfo
> cat /proc/meminfo
> cat /proc/net/sockstat
> cat /proc/buddyinfo
>
> > I also made another test that the upper limit of tcp_rmem and
> > tcp_wmem being set to 64K. After 2000 connections transfering a lot of
> > data for several seconds, the linux box showed some error messages
> > such as error allocating memory pages, etc. and became unstable.
> > My questions are:
> >
> > 1. To calculate memory request of TCP sockets, is there any other
> > large amount of memory requested besides send and receive buffer?
> > 2. Is there any logics that emploied by TCP/IP stack that will
> > dynamically allocating memory pages directly instead of from slab?
>
> TCP stack is one thing, but other things may consume ram on your kernel.
>
> Also, kernel memory allocation might use twice the ram you intend to use
> because of power of two alignments.
>
> Are you using iptables connection tracking ?
>
> If you plan to use a lot of RAM in kernel, why dont you use a 64 bits kernel,
> so that all ram is available for kernel, not only 900 MB ?
>
> Eric
>
>
Thanks for the answer. I know this more likely relates to netdev.
However, it is still strange to have 400~500 MB of LOWMEM 'gone' while
it is not reported as occupied by the slab allocator. Both meminfo and
buddyinfo tell the same story.

with traffic from 2000 concurrent sessions:

cat /proc/meminfo
MemTotal: 4136580 kB
MemFree: 3298460 kB
Buffers: 4096 kB
Cached: 21124 kB
SwapCached: 0 kB
Active: 47416 kB
Inactive: 12532 kB
HighTotal: 3276160 kB
HighFree: 3214592 kB
LowTotal: 860420 kB
LowFree: 83868 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 12 kB
Writeback: 0 kB
Mapped: 42104 kB
Slab: 293952 kB
CommitLimit: 2068288 kB
Committed_AS: 58892 kB
PageTables: 1112 kB
VmallocTotal: 116728 kB
VmallocUsed: 2940 kB
VmallocChunk: 110548 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB


without traffic:

cat /proc/meminfo
MemTotal: 4136580 kB
MemFree: 3924276 kB
Buffers: 4460 kB
Cached: 21020 kB
SwapCached: 0 kB
Active: 47592 kB
Inactive: 12848 kB
HighTotal: 3276160 kB
HighFree: 3214716 kB
LowTotal: 860420 kB
LowFree: 709560 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 40 kB
Writeback: 0 kB
Mapped: 42368 kB
Slab: 132172 kB
CommitLimit: 2068288 kB
Committed_AS: 59220 kB
PageTables: 1140 kB
VmallocTotal: 116728 kB
VmallocUsed: 2940 kB
VmallocChunk: 110548 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB


Xiaoming.

2006-11-06 08:48:47

by Eric Dumazet

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On Monday 06 November 2006 09:10, Zhao Xiaoming wrote:
>
> Thanks for the answer. I know it's more likely relats to netdev.
> However, it's always a strange thing to have 400~500M bytes LOWMEM
> 'gone' while it's not reported to be occupied by slab. Both meminfo
> and buddyinfo tell the same.
>
> with traffics of 2000 concurrent sessions:

Slab: 293952 kB
So 292 MB used by slab for 2000 sessions.

Expect 600 MB used by slab for 4000 sessions.

So your precious LOWMEM is not gone at all. It *IS* used by SLAB.

You forgot to send
cat /proc/slabinfo

>
> cat /proc/meminfo
> MemTotal: 4136580 kB
> MemFree: 3298460 kB
> Buffers: 4096 kB
> Cached: 21124 kB
> SwapCached: 0 kB
> Active: 47416 kB
> Inactive: 12532 kB
> HighTotal: 3276160 kB
> HighFree: 3214592 kB
> LowTotal: 860420 kB
> LowFree: 83868 kB
> SwapTotal: 0 kB
> SwapFree: 0 kB
> Dirty: 12 kB
> Writeback: 0 kB
> Mapped: 42104 kB
> Slab: 293952 kB
> CommitLimit: 2068288 kB
> Committed_AS: 58892 kB
> PageTables: 1112 kB
> VmallocTotal: 116728 kB
> VmallocUsed: 2940 kB
> VmallocChunk: 110548 kB
> HugePages_Total: 0
> HugePages_Free: 0
> Hugepagesize: 2048 kB
>
>

2006-11-06 09:00:04

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Eric Dumazet <[email protected]>:
> We dont know. You might post some data so that we can have some ideas.
>
> Also, these kind of question is better handled by linux netdev mailing list,
> so I added a CC to this list.
>
> cat /proc/slabinfo
> cat /proc/meminfo
> cat /proc/net/sockstat
> cat /proc/buddyinfo
>
>
> TCP stack is one thing, but other things may consume ram on your kernel.
>
> Also, kernel memory allocation might use twice the ram you intend to use
> because of power of two alignments.
>
> Are you using iptables connection tracking ?
>
> If you plan to use a lot of RAM in kernel, why dont you use a 64 bits kernel,
> so that all ram is available for kernel, not only 900 MB ?
>
> Eric
>
>
Thank you again for your help. To gather more detailed statistics, I
ran another round of tests. The overall summary is below; the detailed
/proc/net/sockstat, /proc/meminfo, /proc/slabinfo and /proc/buddyinfo
output follows.
=================================================================
                   slab mem cost    tcp mem pages    lowmem free
with traffic:      254668 kB        34693            38772 kB
without traffic:   104080 kB        1                702652 kB
=================================================================

detailed info:
>>>>>>>>>>during the test (with traffic):>>>>>>>>>>>>>
[root@nj-research-nas-box ~]# cat /proc/net/sockstat
sockets: used 12058
TCP: inuse 4007 orphan 0 tw 0 alloc 4010 mem 34693
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0
[root@nj-research-nas-box ~]# cat /proc/meminfo
MemTotal: 4136580 kB
MemFree: 3169160 kB
Buffers: 42092 kB
Cached: 20048 kB
SwapCached: 0 kB
Active: 146808 kB
Inactive: 35492 kB
HighTotal: 3276160 kB
HighFree: 3130388 kB
LowTotal: 860420 kB
LowFree: 38772 kB
SwapTotal: 2031608 kB
SwapFree: 2031608 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 127720 kB
Slab: 254668 kB
CommitLimit: 4099896 kB
Committed_AS: 367784 kB
PageTables: 1696 kB
VmallocTotal: 116728 kB
VmallocUsed: 3876 kB
VmallocChunk: 110548 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB
[root@nj-research-nas-box ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
slabdata <active_slabs> <num_slabs> <sharedavail>
ip_conntrack_expect 0 0 92 42 1 : tunables 120
60 8 : slabdata 0 0 0
ip_conntrack 4049 4352 228 17 1 : tunables 120 60
8 : slabdata 256 256 0
bridge_fdb_cache 6 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
fib6_nodes 7 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
ip6_dst_cache 10 30 256 15 1 : tunables 120 60
8 : slabdata 2 2 0
ndisc_cache 1 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
RAWv6 7 10 768 5 1 : tunables 54 27
8 : slabdata 2 2 0
UDPv6 0 0 704 11 2 : tunables 54 27
8 : slabdata 0 0 0
tw_sock_TCPv6 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
request_sock_TCPv6 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
TCPv6 3 3 1344 3 1 : tunables 24 12
8 : slabdata 1 1 0
cifs_small_rq 30 36 448 9 1 : tunables 54 27
8 : slabdata 4 4 0
cifs_request 4 4 16512 1 8 : tunables 8 4
0 : slabdata 4 4 0
cifs_oplock_structs 0 0 32 113 1 : tunables 120
60 8 : slabdata 0 0 0
cifs_mpx_ids 3 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
cifs_inode_cache 0 0 496 8 1 : tunables 54 27
8 : slabdata 0 0 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12
8 : slabdata 4 4 0
rpc_tasks 8 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
rpc_inode_cache 6 7 576 7 1 : tunables 54 27
8 : slabdata 1 1 0
ip_fib_alias 9 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
ip_fib_hash 9 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
uhci_urb_priv 0 0 40 92 1 : tunables 120 60
8 : slabdata 0 0 0
dm-snapshot-in 128 134 56 67 1 : tunables 120 60
8 : slabdata 2 2 0
dm-snapshot-ex 0 0 24 145 1 : tunables 120 60
8 : slabdata 0 0 0
ext3_inode_cache 8275 18378 640 6 1 : tunables 54 27
8 : slabdata 3063 3063 0
ext3_xattr 1 78 48 78 1 : tunables 120 60
8 : slabdata 1 1 0
journal_handle 18 169 20 169 1 : tunables 120 60
8 : slabdata 1 1 0
journal_head 19 72 52 72 1 : tunables 120 60
8 : slabdata 1 1 0
revoke_table 4 254 12 254 1 : tunables 120 60
8 : slabdata 1 1 0
revoke_record 0 0 16 203 1 : tunables 120 60
8 : slabdata 0 0 0
dm_tio 525 609 16 203 1 : tunables 120 60
8 : slabdata 3 3 0
dm_io 525 676 20 169 1 : tunables 120 60
8 : slabdata 4 4 0
scsi_cmd_cache 3 10 384 10 1 : tunables 54 27
8 : slabdata 1 1 0
sgpool-128 32 33 2560 3 2 : tunables 24 12
8 : slabdata 11 11 0
sgpool-64 32 33 1280 3 1 : tunables 24 12
8 : slabdata 11 11 0
sgpool-32 32 36 640 6 1 : tunables 54 27
8 : slabdata 6 6 0
sgpool-16 32 36 320 12 1 : tunables 54 27
8 : slabdata 3 3 0
sgpool-8 34 40 192 20 1 : tunables 120 60
8 : slabdata 2 2 0
scsi_io_context 0 0 104 37 1 : tunables 120 60
8 : slabdata 0 0 0
UNIX 23 77 576 7 1 : tunables 54 27
8 : slabdata 11 11 0
ip_mrt_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
tcp_bind_bucket 2025 2233 16 203 1 : tunables 120 60
8 : slabdata 11 11 0
inet_peer_cache 1 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
secpath_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
xfrm_dst_cache 0 0 320 12 1 : tunables 54 27
8 : slabdata 0 0 0
ip_dst_cache 11 30 256 15 1 : tunables 120 60
8 : slabdata 2 2 0
arp_cache 3 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
RAW 5 6 640 6 1 : tunables 54 27
8 : slabdata 1 1 0
UDP 4 18 640 6 1 : tunables 54 27
8 : slabdata 3 3 0
tw_sock_TCP 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
request_sock_TCP 2 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
TCP 4029 4044 1280 3 1 : tunables 24 12
8 : slabdata 1348 1348 0
flow_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
msi_cache 8 8 3840 1 1 : tunables 24 12
8 : slabdata 8 8 0
cfq_ioc_pool 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
cfq_pool 0 0 96 40 1 : tunables 120 60
8 : slabdata 0 0 0
crq_pool 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
deadline_drq 0 0 52 72 1 : tunables 120 60
8 : slabdata 0 0 0
as_arq 12 118 64 59 1 : tunables 120 60
8 : slabdata 2 2 0
mqueue_inode_cache 1 11 704 11 2 : tunables 54 27
8 : slabdata 1 1 0
isofs_inode_cache 0 0 488 8 1 : tunables 54 27
8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 8 460 8 1 : tunables 54
27 8 : slabdata 1 1 0
ext2_inode_cache 0 0 620 6 1 : tunables 54 27
8 : slabdata 0 0 0
ext2_xattr 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
dnotify_cache 1 169 20 169 1 : tunables 120 60
8 : slabdata 1 1 0
dquot 0 0 192 20 1 : tunables 120 60
8 : slabdata 0 0 0
eventpoll_pwq 1 101 36 101 1 : tunables 120 60
8 : slabdata 1 1 0
eventpoll_epi 1 30 128 30 1 : tunables 120 60
8 : slabdata 1 1 0
inotify_event_cache 0 0 28 127 1 : tunables 120
60 8 : slabdata 0 0 0
inotify_watch_cache 0 0 36 101 1 : tunables 120
60 8 : slabdata 0 0 0
kioctx 0 0 256 15 1 : tunables 120 60
8 : slabdata 0 0 0
kiocb 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
fasync_cache 0 0 16 203 1 : tunables 120 60
8 : slabdata 0 0 0
shmem_inode_cache 252 273 568 7 1 : tunables 54 27
8 : slabdata 39 39 0
posix_timers_cache 0 0 112 35 1 : tunables 120 60
8 : slabdata 0 0 0
uid_cache 7 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
blkdev_ioc 22 254 28 127 1 : tunables 120 60
8 : slabdata 2 2 0
blkdev_queue 22 28 976 4 1 : tunables 54 27
8 : slabdata 7 7 0
blkdev_requests 12 44 176 22 1 : tunables 120 60
8 : slabdata 2 2 0
biovec-(256) 260 260 3072 2 2 : tunables 24 12
8 : slabdata 130 130 0
biovec-128 264 270 1536 5 2 : tunables 24 12
8 : slabdata 54 54 0
biovec-64 272 275 768 5 1 : tunables 54 27
8 : slabdata 55 55 0
biovec-16 272 280 192 20 1 : tunables 120 60
8 : slabdata 14 14 0
biovec-4 272 295 64 59 1 : tunables 120 60
8 : slabdata 5 5 0
biovec-1 288 406 16 203 1 : tunables 120 60
8 : slabdata 2 2 0
bio 272 300 128 30 1 : tunables 120 60
8 : slabdata 10 10 0
sock_inode_cache 12082 12120 512 8 1 : tunables 54 27
8 : slabdata 1515 1515 0
skbuff_fclone_cache 5211 6210 384 10 1 : tunables 54
27 8 : slabdata 621 621 189
skbuff_head_cache 106780 113080 192 20 1 : tunables 120 60
8 : slabdata 5654 5654 0
file_lock_cache 37 37 104 37 1 : tunables 120 60
8 : slabdata 1 1 0
acpi_operand 579 644 40 92 1 : tunables 120 60
8 : slabdata 7 7 0
acpi_parse_ext 0 0 44 84 1 : tunables 120 60
8 : slabdata 0 0 0
acpi_parse 0 0 28 127 1 : tunables 120 60
8 : slabdata 0 0 0
acpi_state 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
proc_inode_cache 296 504 476 8 1 : tunables 54 27
8 : slabdata 63 63 0
sigqueue 5 54 144 27 1 : tunables 120 60
8 : slabdata 2 2 0
radix_tree_node 2397 12264 276 14 1 : tunables 54 27
8 : slabdata 876 876 0
bdev_cache 22 42 640 6 1 : tunables 54 27
8 : slabdata 7 7 0
sysfs_dir_cache 3645 3680 40 92 1 : tunables 120 60
8 : slabdata 40 40 0
mnt_cache 28 60 128 30 1 : tunables 120 60
8 : slabdata 2 2 0
inode_cache 817 1048 460 8 1 : tunables 54 27
8 : slabdata 131 131 0
dentry_cache 10991 44361 144 27 1 : tunables 120 60
8 : slabdata 1643 1643 0
filp 4777 4860 192 20 1 : tunables 120 60
8 : slabdata 243 243 60
names_cache 1 1 4096 1 1 : tunables 24 12
8 : slabdata 1 1 0
avc_node 31 72 52 72 1 : tunables 120 60
8 : slabdata 1 1 0
key_jar 14 30 128 30 1 : tunables 120 60
8 : slabdata 1 1 0
idr_layer_cache 117 174 136 29 1 : tunables 120 60
8 : slabdata 6 6 0
buffer_head 10552 22104 52 72 1 : tunables 120 60
8 : slabdata 307 307 0
mm_struct 62 98 512 7 1 : tunables 54 27
8 : slabdata 14 14 0
vm_area_struct 21346 21504 92 42 1 : tunables 120 60
8 : slabdata 512 512 0
fs_cache 63 295 64 59 1 : tunables 120 60
8 : slabdata 5 5 0
files_cache 63 117 448 9 1 : tunables 54 27
8 : slabdata 13 13 0
signal_cache 109 160 384 10 1 : tunables 54 27
8 : slabdata 16 16 0
sighand_cache 104 129 1344 3 1 : tunables 24 12
8 : slabdata 43 43 0
task_struct 10095 10095 1360 3 1 : tunables 24 12
8 : slabdata 3365 3365 0
anon_vma 693 1015 24 145 1 : tunables 120 60
8 : slabdata 7 7 0
pgd 65 339 32 113 1 : tunables 120 60
8 : slabdata 3 3 0
pmd 138 138 4096 1 1 : tunables 24 12
8 : slabdata 138 138 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4
0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4
0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4
0 : slabdata 0 0 0
size-65536 2 2 65536 1 16 : tunables 8 4
0 : slabdata 2 2 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4
0 : slabdata 0 0 0
size-32768 2 2 32768 1 8 : tunables 8 4
0 : slabdata 2 2 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4
0 : slabdata 0 0 0
size-16384 0 0 16384 1 4 : tunables 8 4
0 : slabdata 0 0 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4
0 : slabdata 0 0 0
size-8192 4 4 8192 1 2 : tunables 8 4
0 : slabdata 4 4 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12
8 : slabdata 0 0 0
size-4096 10148 10148 4096 1 1 : tunables 24 12
8 : slabdata 10148 10148 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12
8 : slabdata 0 0 0
size-2048 144 144 2048 2 1 : tunables 24 12
8 : slabdata 72 72 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27
8 : slabdata 0 0 0
size-1024 106994 112672 1024 4 1 : tunables 54 27
8 : slabdata 28168 28168 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27
8 : slabdata 0 0 0
size-512 5722 7184 512 8 1 : tunables 54 27
8 : slabdata 898 898 189
size-256(DMA) 0 0 256 15 1 : tunables 120 60
8 : slabdata 0 0 0
size-256 378 390 256 15 1 : tunables 120 60
8 : slabdata 26 26 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60
8 : slabdata 0 0 0
size-32(DMA) 0 0 32 113 1 : tunables 120 60
8 : slabdata 0 0 0
size-32 28024 28024 32 113 1 : tunables 120 60
8 : slabdata 248 248 180
size-128 1621 1800 128 30 1 : tunables 120 60
8 : slabdata 60 60 0
size-64 162644 200128 64 59 1 : tunables 120 60
8 : slabdata 3392 3392 0
kmem_cache 165 165 256 15 1 : tunables 120 60
8 : slabdata 11 11 0
[root@nj-research-nas-box ~]# cat /proc/buddyinfo
Node 0, zone      DMA    490   202    23     0     1     1     1     0     1     1     0
Node 0, zone   Normal   2199  5197   505    39     2     0     1     0     1     0     0
Node 0, zone  HighMem      1  1028   866   485   293   128    47    38    25    25   718

>>>>>>>>>>>>>after the test (without traffic)>>>>>>>>>>>>>>>>>
[root@nj-research-nas-box ~]# cat /proc/net/sockstat
sockets: used 10058
TCP: inuse 7 orphan 0 tw 0 alloc 10 mem 1
UDP: inuse 4
RAW: inuse 0
FRAG: inuse 0 memory 0
[root@nj-research-nas-box ~]# cat /proc/meminfo
MemTotal: 4136580 kB
MemFree: 3806132 kB
Buffers: 39196 kB
Cached: 20084 kB
SwapCached: 0 kB
Active: 172524 kB
Inactive: 34140 kB
HighTotal: 3276160 kB
HighFree: 3103480 kB
LowTotal: 860420 kB
LowFree: 702652 kB
SwapTotal: 2031608 kB
SwapFree: 2031608 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 154996 kB
Slab: 104080 kB
CommitLimit: 4099896 kB
Committed_AS: 367676 kB
PageTables: 1696 kB
VmallocTotal: 116728 kB
VmallocUsed: 3876 kB
VmallocChunk: 110548 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB
[root@nj-research-nas-box ~]# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab>
<pagesperslab> : tunables <limit> <batchcount> <sharedfactor> :
slabdata <active_slabs> <num_slabs> <sharedavail>
ip_conntrack_expect 0 0 92 42 1 : tunables 120
60 8 : slabdata 0 0 0
ip_conntrack 4 51 228 17 1 : tunables 120 60
8 : slabdata 3 3 0
bridge_fdb_cache 7 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
fib6_nodes 7 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
ip6_dst_cache 10 30 256 15 1 : tunables 120 60
8 : slabdata 2 2 0
ndisc_cache 1 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
RAWv6 7 10 768 5 1 : tunables 54 27
8 : slabdata 2 2 0
UDPv6 0 0 704 11 2 : tunables 54 27
8 : slabdata 0 0 0
tw_sock_TCPv6 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
request_sock_TCPv6 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
TCPv6 3 3 1344 3 1 : tunables 24 12
8 : slabdata 1 1 0
cifs_small_rq 30 36 448 9 1 : tunables 54 27
8 : slabdata 4 4 0
cifs_request 4 4 16512 1 8 : tunables 8 4
0 : slabdata 4 4 0
cifs_oplock_structs 0 0 32 113 1 : tunables 120
60 8 : slabdata 0 0 0
cifs_mpx_ids 3 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
cifs_inode_cache 0 0 496 8 1 : tunables 54 27
8 : slabdata 0 0 0
rpc_buffers 8 8 2048 2 1 : tunables 24 12
8 : slabdata 4 4 0
rpc_tasks 8 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
rpc_inode_cache 6 7 576 7 1 : tunables 54 27
8 : slabdata 1 1 0
ip_fib_alias 9 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
ip_fib_hash 9 113 32 113 1 : tunables 120 60
8 : slabdata 1 1 0
uhci_urb_priv 0 0 40 92 1 : tunables 120 60
8 : slabdata 0 0 0
dm-snapshot-in 128 134 56 67 1 : tunables 120 60
8 : slabdata 2 2 0
dm-snapshot-ex 0 0 24 145 1 : tunables 120 60
8 : slabdata 0 0 0
ext3_inode_cache 6611 16482 640 6 1 : tunables 54 27
8 : slabdata 2747 2747 0
ext3_xattr 1 78 48 78 1 : tunables 120 60
8 : slabdata 1 1 0
journal_handle 24 169 20 169 1 : tunables 120 60
8 : slabdata 1 1 0
journal_head 11 72 52 72 1 : tunables 120 60
8 : slabdata 1 1 0
revoke_table 4 254 12 254 1 : tunables 120 60
8 : slabdata 1 1 0
revoke_record 0 0 16 203 1 : tunables 120 60
8 : slabdata 0 0 0
dm_tio 516 609 16 203 1 : tunables 120 60
8 : slabdata 3 3 0
dm_io 516 676 20 169 1 : tunables 120 60
8 : slabdata 4 4 0
scsi_cmd_cache 10 10 384 10 1 : tunables 54 27
8 : slabdata 1 1 0
sgpool-128 32 33 2560 3 2 : tunables 24 12
8 : slabdata 11 11 0
sgpool-64 32 33 1280 3 1 : tunables 24 12
8 : slabdata 11 11 0
sgpool-32 32 36 640 6 1 : tunables 54 27
8 : slabdata 6 6 0
sgpool-16 32 36 320 12 1 : tunables 54 27
8 : slabdata 3 3 0
sgpool-8 40 40 192 20 1 : tunables 120 60
8 : slabdata 2 2 0
scsi_io_context 0 0 104 37 1 : tunables 120 60
8 : slabdata 0 0 0
UNIX 23 77 576 7 1 : tunables 54 27
8 : slabdata 11 11 0
ip_mrt_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
tcp_bind_bucket 8 203 16 203 1 : tunables 120 60
8 : slabdata 1 1 0
inet_peer_cache 1 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
secpath_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
xfrm_dst_cache 0 0 320 12 1 : tunables 54 27
8 : slabdata 0 0 0
ip_dst_cache 12 30 256 15 1 : tunables 120 60
8 : slabdata 2 2 0
arp_cache 3 20 192 20 1 : tunables 120 60
8 : slabdata 1 1 0
RAW 5 6 640 6 1 : tunables 54 27
8 : slabdata 1 1 0
UDP 4 18 640 6 1 : tunables 54 27
8 : slabdata 3 3 0
tw_sock_TCP 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
request_sock_TCP 0 0 64 59 1 : tunables 120 60
8 : slabdata 0 0 0
TCP 7 12 1280 3 1 : tunables 24 12
8 : slabdata 4 4 0
flow_cache 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
msi_cache 8 8 3840 1 1 : tunables 24 12
8 : slabdata 8 8 0
cfq_ioc_pool 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
cfq_pool 0 0 96 40 1 : tunables 120 60
8 : slabdata 0 0 0
crq_pool 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
deadline_drq 0 0 52 72 1 : tunables 120 60
8 : slabdata 0 0 0
as_arq 13 118 64 59 1 : tunables 120 60
8 : slabdata 2 2 0
mqueue_inode_cache 1 11 704 11 2 : tunables 54 27
8 : slabdata 1 1 0
isofs_inode_cache 0 0 488 8 1 : tunables 54 27
8 : slabdata 0 0 0
hugetlbfs_inode_cache 1 8 460 8 1 : tunables 54
27 8 : slabdata 1 1 0
ext2_inode_cache 0 0 620 6 1 : tunables 54 27
8 : slabdata 0 0 0
ext2_xattr 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
dnotify_cache 1 169 20 169 1 : tunables 120 60
8 : slabdata 1 1 0
dquot 0 0 192 20 1 : tunables 120 60
8 : slabdata 0 0 0
eventpoll_pwq 1 101 36 101 1 : tunables 120 60
8 : slabdata 1 1 0
eventpoll_epi 1 30 128 30 1 : tunables 120 60
8 : slabdata 1 1 0
inotify_event_cache 0 0 28 127 1 : tunables 120
60 8 : slabdata 0 0 0
inotify_watch_cache 0 0 36 101 1 : tunables 120
60 8 : slabdata 0 0 0
kioctx 0 0 256 15 1 : tunables 120 60
8 : slabdata 0 0 0
kiocb 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
fasync_cache 0 0 16 203 1 : tunables 120 60
8 : slabdata 0 0 0
shmem_inode_cache 252 273 568 7 1 : tunables 54 27
8 : slabdata 39 39 0
posix_timers_cache 0 0 112 35 1 : tunables 120 60
8 : slabdata 0 0 0
uid_cache 7 59 64 59 1 : tunables 120 60
8 : slabdata 1 1 0
blkdev_ioc 22 254 28 127 1 : tunables 120 60
8 : slabdata 2 2 0
blkdev_queue 22 28 976 4 1 : tunables 54 27
8 : slabdata 7 7 0
blkdev_requests 13 44 176 22 1 : tunables 120 60
8 : slabdata 2 2 0
biovec-(256) 260 260 3072 2 2 : tunables 24 12
8 : slabdata 130 130 0
biovec-128 264 270 1536 5 2 : tunables 24 12
8 : slabdata 54 54 0
biovec-64 272 275 768 5 1 : tunables 54 27
8 : slabdata 55 55 0
biovec-16 272 280 192 20 1 : tunables 120 60
8 : slabdata 14 14 0
biovec-4 272 295 64 59 1 : tunables 120 60
8 : slabdata 5 5 0
biovec-1 286 406 16 203 1 : tunables 120 60
8 : slabdata 2 2 0
bio 300 300 128 30 1 : tunables 120 60
8 : slabdata 10 10 0
sock_inode_cache 10059 11712 512 8 1 : tunables 54 27
8 : slabdata 1464 1464 0
skbuff_fclone_cache 10 10 384 10 1 : tunables 54
27 8 : slabdata 1 1 0
skbuff_head_cache 825 9840 192 20 1 : tunables 120 60
8 : slabdata 492 492 0
file_lock_cache 13 37 104 37 1 : tunables 120 60
8 : slabdata 1 1 0
acpi_operand 579 644 40 92 1 : tunables 120 60
8 : slabdata 7 7 0
acpi_parse_ext 0 0 44 84 1 : tunables 120 60
8 : slabdata 0 0 0
acpi_parse 0 0 28 127 1 : tunables 120 60
8 : slabdata 0 0 0
acpi_state 0 0 48 78 1 : tunables 120 60
8 : slabdata 0 0 0
proc_inode_cache 326 504 476 8 1 : tunables 54 27
8 : slabdata 63 63 0
sigqueue 4 54 144 27 1 : tunables 120 60
8 : slabdata 2 2 0
radix_tree_node 2283 12264 276 14 1 : tunables 54 27
8 : slabdata 876 876 0
bdev_cache 22 42 640 6 1 : tunables 54 27
8 : slabdata 7 7 0
sysfs_dir_cache 3645 3680 40 92 1 : tunables 120 60
8 : slabdata 40 40 0
mnt_cache 28 60 128 30 1 : tunables 120 60
8 : slabdata 2 2 0
inode_cache 817 1048 460 8 1 : tunables 54 27
8 : slabdata 131 131 0
dentry_cache 5144 29268 144 27 1 : tunables 120 60
8 : slabdata 1084 1084 0
filp 825 2160 192 20 1 : tunables 120 60
8 : slabdata 108 108 104
names_cache 3 4 4096 1 1 : tunables 24 12
8 : slabdata 3 4 0
avc_node 31 72 52 72 1 : tunables 120 60
8 : slabdata 1 1 0
key_jar 14 30 128 30 1 : tunables 120 60
8 : slabdata 1 1 0
idr_layer_cache 117 174 136 29 1 : tunables 120 60
8 : slabdata 6 6 0
buffer_head 9838 21888 52 72 1 : tunables 120 60
8 : slabdata 304 304 0
mm_struct 76 98 512 7 1 : tunables 54 27
8 : slabdata 14 14 0
vm_area_struct 21343 21504 92 42 1 : tunables 120 60
8 : slabdata 512 512 0
fs_cache 63 295 64 59 1 : tunables 120 60
8 : slabdata 5 5 0
files_cache 64 117 448 9 1 : tunables 54 27
8 : slabdata 13 13 0
signal_cache 110 160 384 10 1 : tunables 54 27
8 : slabdata 16 16 0
sighand_cache 104 129 1344 3 1 : tunables 24 12
8 : slabdata 43 43 0
task_struct 10095 10095 1360 3 1 : tunables 24 12
8 : slabdata 3365 3365 0
anon_vma 692 1015 24 145 1 : tunables 120 60
8 : slabdata 7 7 0
pgd 63 339 32 113 1 : tunables 120 60
8 : slabdata 3 3 0
pmd 138 138 4096 1 1 : tunables 24 12
8 : slabdata 138 138 0
size-131072(DMA) 0 0 131072 1 32 : tunables 8 4
0 : slabdata 0 0 0
size-131072 0 0 131072 1 32 : tunables 8 4
0 : slabdata 0 0 0
size-65536(DMA) 0 0 65536 1 16 : tunables 8 4
0 : slabdata 0 0 0
size-65536 2 2 65536 1 16 : tunables 8 4
0 : slabdata 2 2 0
size-32768(DMA) 0 0 32768 1 8 : tunables 8 4
0 : slabdata 0 0 0
size-32768 2 2 32768 1 8 : tunables 8 4
0 : slabdata 2 2 0
size-16384(DMA) 0 0 16384 1 4 : tunables 8 4
0 : slabdata 0 0 0
size-16384 0 0 16384 1 4 : tunables 8 4
0 : slabdata 0 0 0
size-8192(DMA) 0 0 8192 1 2 : tunables 8 4
0 : slabdata 0 0 0
size-8192 4 4 8192 1 2 : tunables 8 4
0 : slabdata 4 4 0
size-4096(DMA) 0 0 4096 1 1 : tunables 24 12
8 : slabdata 0 0 0
size-4096 10149 10149 4096 1 1 : tunables 24 12
8 : slabdata 10149 10149 0
size-2048(DMA) 0 0 2048 2 1 : tunables 24 12
8 : slabdata 0 0 0
size-2048 144 144 2048 2 1 : tunables 24 12
8 : slabdata 72 72 0
size-1024(DMA) 0 0 1024 4 1 : tunables 54 27
8 : slabdata 0 0 0
size-1024 1027 2608 1024 4 1 : tunables 54 27
8 : slabdata 652 652 0
size-512(DMA) 0 0 512 8 1 : tunables 54 27
8 : slabdata 0 0 0
size-512 534 576 512 8 1 : tunables 54 27
8 : slabdata 72 72 0
size-256(DMA) 0 0 256 15 1 : tunables 120 60
8 : slabdata 0 0 0
size-256 376 390 256 15 1 : tunables 120 60
8 : slabdata 26 26 0
size-128(DMA) 0 0 128 30 1 : tunables 120 60
8 : slabdata 0 0 0
size-64(DMA) 0 0 64 59 1 : tunables 120 60
8 : slabdata 0 0 0
size-32(DMA) 0 0 32 113 1 : tunables 120 60
8 : slabdata 0 0 0
size-32 23722 25199 32 113 1 : tunables 120 60
8 : slabdata 223 223 0
size-128 1664 1800 128 30 1 : tunables 120 60
8 : slabdata 60 60 0
size-64 24367 151335 64 59 1 : tunables 120 60
8 : slabdata 2565 2565 0
kmem_cache 165 165 256 15 1 : tunables 120 60
8 : slabdata 11 11 0
[root@nj-research-nas-box ~]# cat /proc/buddyinfo
Node 0, zone      DMA     65    57    58    43    28    20     5     1     1     1     0
Node 0, zone   Normal  10782  7077  4643  3291  2087  1177   368    58     3     0     0
Node 0, zone  HighMem      0     1   225   485   293   128    47    38    25    25   718

2006-11-06 09:03:18

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Eric Dumazet <[email protected]>:
>
> Slab: 293952 kB
> So 292 MB used by slab for 2000 sessions.
>
> Expect 600 MB used by slab for 4000 sessions.
>
> So your precious LOWMEM is not gone at all. It *IS* used by SLAB.
>
> You forgot to send
> cat /proc/slabinfo
>
Sorry, I didn't make myself clear enough: 2000 sessions means 4000
sockets, 2000 facing the server and 2000 facing the client.

2006-11-06 09:22:55

by Eric Dumazet

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:

> Thank you again for your help. To have more detailed statistic data, I
> did another round of test and gathered some data. I give the overall
> description here and detailed /proc/net/sockstat, /proc/meminfo,
> /proc/slabinfo and /proc/buddyinfo follows.
> =================================================================
>                    slab mem cost    tcp mem pages    lowmem free
> with traffic:      254668 kB        34693            38772 kB
> without traffic:   104080 kB        1                702652 kB
> =================================================================

Thank you for the detailed info.

It appears you make extensive use of threads (about 10000), since:

> task_struct 10095 10095 1360 3 1 : tunables 24 12
> 8 : slabdata 3365 3365 0

Each thread has a kernel stack of 8 KB (i.e. 2 pages, an order-1 allocation),
plus a user VMA:

> vm_area_struct 21346 21504 92 42 1 : tunables 120 60
> 8 : slabdata 512 512 0

Most likely you don't need that many threads. A program with fewer threads
will perform better and use less RAM.
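
For scale, the kernel-stack cost of that many tasks is easy to put a number
on (a sketch; the task count is the task_struct figure quoted above, and
8 KB / 4 KB are the usual i386 stack sizes without and with CONFIG_4KSTACKS):

tasks = 10095                      # active task_struct objects from slabinfo
for stack_kb in (8, 4):            # default stacks vs CONFIG_4KSTACKS
    print("%d tasks x %d KB kernel stacks = %.0f MB of ZONE_NORMAL"
          % (tasks, stack_kb, tasks * stack_kb / 1024.0))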

2006-11-06 09:50:31

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Zhao Xiaoming <[email protected]>:
> 2006/11/6, Eric Dumazet <[email protected]>:
> > On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
> >
> > > Thank you again for your help. To have more detailed statistic data, I
> > > did another round of test and gathered some data. I give the overall
> > > description here and detailed /proc/net/sockstat, /proc/meminfo,
> > > /proc/slabinfo and /proc/buddyinfo follows.
> > > =================================================================
> > >                    slab mem cost    tcp mem pages    lowmem free
> > > with traffic:      254668 kB        34693            38772 kB
> > > without traffic:   104080 kB        1                702652 kB
> > > =================================================================
> >
> > Thank you for detailed infos.
> >
> > It appears you have an extensive use of threads (about 10000), since :
> >
> > > task_struct 10095 10095 1360 3 1 : tunables 24 12
> > > 8 : slabdata 3365 3365 0
> >
> > Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a
> > user vma
> >
> > > vm_area_struct 21346 21504 92 42 1 : tunables 120 60
> > > 8 : slabdata 512 512 0
> >
> > Most likely you dont need that much threads. A program with fewer threads will
> > perform better and use less ram.
> >
> >
> Thanks for the comments. I known the threads may cost many memory.
> However, I already excluded them from the statistics. The 'after test'
> info was gotten while the 10000 threads running but no traffics
> relayed. You may look at the meminfo of 'after test', there is still
> 104080 kB slab memory which should already included the thread kernel
> memory cost (8K*10000=80MB). I know 10000 threads are not necessary
> and just use the simple logic to do some test.
>
And I just tried 2500 threads; the results are the same.

2006-11-06 10:13:09

by Arjan van de Ven

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
> Dears,
> I'm running a linux box with kernel version 2.6.16. The hardware
> has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> Intel 82571 on PCI-e bus.

are you using a 32 bit or a 64 bit OS?


2006-11-06 10:15:59

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Eric Dumazet <[email protected]>:
> On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
>
> > Thank you again for your help. To have more detailed statistic data, I
> > did another round of test and gathered some data. I give the overall
> > description here and detailed /proc/net/sockstat, /proc/meminfo,
> > /proc/slabinfo and /proc/buddyinfo follows.
> > =================================================================
> >                    slab mem cost    tcp mem pages    lowmem free
> > with traffic:      254668 kB        34693            38772 kB
> > without traffic:   104080 kB        1                702652 kB
> > =================================================================
>
> Thank you for detailed infos.
>
> It appears you have an extensive use of threads (about 10000), since :
>
> > task_struct 10095 10095 1360 3 1 : tunables 24 12
> > 8 : slabdata 3365 3365 0
>
> Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation), plus a
> user vma
>
> > vm_area_struct 21346 21504 92 42 1 : tunables 120 60
> > 8 : slabdata 512 512 0
>
> Most likely you dont need that much threads. A program with fewer threads will
> perform better and use less ram.
>
>
Thanks for the comments. I know the threads may cost a lot of memory.
However, I already excluded them from the statistics. The 'after test'
info was taken while the 10000 threads were running but no traffic was
being relayed. If you look at the 'after test' meminfo, there is still
104080 kB of slab memory, which should already include the threads'
kernel memory cost (8K * 10000 = 80 MB). I know 10000 threads are not
necessary; I just used this simple logic to run some tests.

2006-11-06 10:21:15

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

2006/11/6, Arjan van de Ven <[email protected]>:
> On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
> > Dears,
> > I'm running a linux box with kernel version 2.6.16. The hardware
> > has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> > Intel 82571 on PCI-e bus.
>
> are you using a 32 bit or a 64 bit OS?
>
>
>

2006-11-06 12:11:38

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

32-bit. Of course a 64-bit kernel can help me get past the 900 MB
barrier. However, if I can't find the reason why so much memory is
getting 'lost', it will be difficult to support more heavily loaded
concurrent TCP connections.

> 2006/11/6, Arjan van de Ven <[email protected]>:
> > On Mon, 2006-11-06 at 14:07 +0800, Zhao Xiaoming wrote:
> > > Dears,
> > > I'm running a linux box with kernel version 2.6.16. The hardware
> > > has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> > > Intel 82571 on PCI-e bus.
> >
> > are you using a 32 bit or a 64 bit OS?
> >
> >
> >
>

2006-11-06 13:33:49

by Eric Dumazet

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On Monday 06 November 2006 10:46, Zhao Xiaoming wrote:
> 2006/11/6, Eric Dumazet <[email protected]>:
> > On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
> > > Thank you again for your help. To have more detailed statistic data, I
> > > did another round of test and gathered some data. I give the overall
> > > description here and detailed /proc/net/sockstat, /proc/meminfo,
> > > /proc/slabinfo and /proc/buddyinfo follows.
> > > =================================================================
> > >                    slab mem cost    tcp mem pages    lowmem free
> > > with traffic:      254668 kB        34693            38772 kB
> > > without traffic:   104080 kB        1                702652 kB
> > > =================================================================
> >
> > Thank you for detailed infos.
> >
> > It appears you have an extensive use of threads (about 10000), since :
> > > task_struct 10095 10095 1360 3 1 : tunables 24 12
> > > 8 : slabdata 3365 3365 0
> >
> > Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation),
> > plus a user vma
> >
> > > vm_area_struct 21346 21504 92 42 1 : tunables 120 60
> > > 8 : slabdata 512 512 0
> >
> > Most likely you dont need that much threads. A program with fewer threads
> > will perform better and use less ram.
>
> Thanks for the comments. I known the threads may cost many memory.
> However, I already excluded them from the statistics. The 'after test'
> info was gotten while the 10000 threads running but no traffics
> relayed. You may look at the meminfo of 'after test', there is still
> 104080 kB slab memory which should already included the thread kernel
> memory cost (8K*10000=80MB). I know 10000 threads are not necessary
> and just use the simple logic to do some test.

In fact, your kernel has CONFIG_4KSTACKS, so kernel thread stacks use 4 KB
instead of 8 KB.

If you want to increase LOWMEM (and keep a 32-bit kernel), you can choose a
2G/2G user/kernel split instead of the default 3G/1G split
(see the config option CONFIG_VMSPLIT_2G).
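
A quick way to see how much low memory a given split actually leaves is to
read LowTotal/LowFree from /proc/meminfo (a sketch; with the default 3G/1G
split the box above reports roughly 860 MB of LowTotal, and CONFIG_VMSPLIT_2G
should roughly double that):

def meminfo_kb(field):
    # Return a /proc/meminfo value in kB, e.g. meminfo_kb("LowTotal").
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

for field in ("LowTotal", "LowFree"):
    print("%s: %.0f MB" % (field, meminfo_kb(field) / 1024.0))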

Eric

2006-11-06 16:36:29

by Stephen Hemminger

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

Eric Dumazet wrote:
> Zhao Xiaoming a écrit :
>> Dears,
>> I'm running a linux box with kernel version 2.6.16. The hardware
>> has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
>> Intel 82571 on PCI-e bus.
>> The box is acting as ethernet bridge between 2 Gigabit Ethernets.
>> By configuring ebtables and iptables, an application is running as TCP
>> proxy which will intercept all TCP connections requests from the
>> network and setup another TCP connection to the acture server. The
>> TCP proxy then relays all traffics in both directions.
>> The problem is the memory. Since the box must support thousands of
>> concurrent connections, I know the memory size of ZONE_NORMAL would be
>> a bottleneck as TCP packets would need many buffers. After setting
>> upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
>> our test began.
>> My test scenario employs 2000 concurrent downloading connections
>> to a IIS server's port 80. The throughput is about 500~600 Mbps which
>> is limited by the capability of the client application. Because all
>> traffics are from server to client and the capability of client
>> machine is bottleneck, I believe the receiver side of the sockets
>> connected with server and the sender side of the sockets connected
>> with client should be filled with packets in correspondent windows.
>> Thus, roughly there should be about 32K * 2000+ 32K*2000 = 128M bytes
>> memory occupied by TCP/IP stack for packet buffering. Data from
>> slabtop confermed it. it's about 140M bytes memory cost after I start
>> the traffic. That reasonablly matched with my estimation. However,
>> /proc/meminfo had a different story. The 'LowFree' dropped from about
>> 710M to 80M. In other words, there's addtional 500M memory in
>> ZONE_NORMAL allocated by someone other than the slab. Why?
The amount of memory per socket is controlled by the socket buffering.
Your application could be setting the value by calling setsockopt().
Otherwise, the TCP memory is limited by the sysctl settings tcp_rmem
(receiver) and tcp_wmem (sender).

For example on this server:
$ cat /proc/sys/net/ipv4/tcp_wmem
4096 16384 131072

Each sending socket would start with 16 KB of buffering, but could grow up
to 128 KB based on TCP send autotuning.
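
A minimal sketch of the setsockopt() path mentioned above (Python; the
32 KB value simply mirrors the cap used in the test, and setting
SO_SNDBUF/SO_RCVBUF explicitly fixes the buffer sizes for that socket
instead of leaving them to tcp_wmem/tcp_rmem autotuning):

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 32 * 1024)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32 * 1024)

# Linux doubles the requested value to leave room for bookkeeping overhead,
# so getsockopt() reports roughly twice what was asked for.
print("SO_SNDBUF: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF: %d" % s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
s.close()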


2006-11-07 02:48:46

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On 11/6/06, Eric Dumazet <[email protected]> wrote:
> On Monday 06 November 2006 10:46, Zhao Xiaoming wrote:
> > 2006/11/6, Eric Dumazet <[email protected]>:
> > > On Monday 06 November 2006 09:59, Zhao Xiaoming wrote:
> > > > Thank you again for your help. To have more detailed statistic data, I
> > > > did another round of test and gathered some data. I give the overall
> > > > description here and detailed /proc/net/sockstat, /proc/meminfo,
> > > > /proc/slabinfo and /proc/buddyinfo follows.
> > > > =================================================================
> > > >                    slab mem cost    tcp mem pages    lowmem free
> > > > with traffic:      254668 kB        34693            38772 kB
> > > > without traffic:   104080 kB        1                702652 kB
> > > > =================================================================
> > >
> > > Thank you for detailed infos.
> > >
> > > It appears you have an extensive use of threads (about 10000), since :
> > > > task_struct 10095 10095 1360 3 1 : tunables 24 12
> > > > 8 : slabdata 3365 3365 0
> > >
> > > Each thread has a kernel stack, 8KB (ie 2 pages, order-1 allocation),
> > > plus a user vma
> > >
> > > > vm_area_struct 21346 21504 92 42 1 : tunables 120 60
> > > > 8 : slabdata 512 512 0
> > >
> > > Most likely you dont need that much threads. A program with fewer threads
> > > will perform better and use less ram.
> >
> > Thanks for the comments. I known the threads may cost many memory.
> > However, I already excluded them from the statistics. The 'after test'
> > info was gotten while the 10000 threads running but no traffics
> > relayed. You may look at the meminfo of 'after test', there is still
> > 104080 kB slab memory which should already included the thread kernel
> > memory cost (8K*10000=80MB). I know 10000 threads are not necessary
> > and just use the simple logic to do some test.
>
> In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K instead
> of 8K.
>
> If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a
> 2G/2G user/kernel split, instead of the 3G/1G default split.
> (see config : CONFIG_VMSPLIT_2G)
>
> Eric
>
Thank you for your advice. I know that increasing LOWMEM could help, but
my concern now is why I lose 500 MB of memory after excluding all known
memory costs.

2006-11-07 02:50:59

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On 11/7/06, Stephen Hemminger <[email protected]> wrote:
> Eric Dumazet wrote:
> > Zhao Xiaoming a écrit :
> >> Dears,
> >> I'm running a linux box with kernel version 2.6.16. The hardware
> >> has 2 Woodcrest Xeon CPUs (2 cores each) and 4G RAM. The NIC cards is
> >> Intel 82571 on PCI-e bus.
> >> The box is acting as ethernet bridge between 2 Gigabit Ethernets.
> >> By configuring ebtables and iptables, an application is running as TCP
> >> proxy which will intercept all TCP connections requests from the
> >> network and setup another TCP connection to the acture server. The
> >> TCP proxy then relays all traffics in both directions.
> >> The problem is the memory. Since the box must support thousands of
> >> concurrent connections, I know the memory size of ZONE_NORMAL would be
> >> a bottleneck as TCP packets would need many buffers. After setting
> >> upper limit of net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to 32K bytes,
> >> our test began.
> >> My test scenario employs 2000 concurrent downloading connections
> >> to a IIS server's port 80. The throughput is about 500~600 Mbps which
> >> is limited by the capability of the client application. Because all
> >> traffics are from server to client and the capability of client
> >> machine is bottleneck, I believe the receiver side of the sockets
> >> connected with server and the sender side of the sockets connected
> >> with client should be filled with packets in correspondent windows.
> >> Thus, roughly there should be about 32K * 2000+ 32K*2000 = 128M bytes
> >> memory occupied by TCP/IP stack for packet buffering. Data from
> >> slabtop confermed it. it's about 140M bytes memory cost after I start
> >> the traffic. That reasonablly matched with my estimation. However,
> >> /proc/meminfo had a different story. The 'LowFree' dropped from about
> >> 710M to 80M. In other words, there's addtional 500M memory in
> >> ZONE_NORMAL allocated by someone other than the slab. Why?
> The amount of memory per socket is controlled by the socket buffering.
> Your application
> could be setting the value by calling setsockopt(). Otherwise, the tcp
> memory is limited
> by the sysctl settings tcp_rmem (receiver) and tcp_wmem (sender).
>
> For example on this server:
> $ cat /proc/sys/net/ipv4/tcp_wmem
> 4096 16384 131072
>
> Each sending socket would start with 16K of buffering, but could grow up
> to 128K based
> on TCP send autotuning.
>
>
>
Of course I can change the TCP buffers, and I already described that I set
the upper limits of both tcp_rmem and tcp_wmem to 32 KB. If you go through
my earlier posts, you should notice that the TCP stack on my machine only
occupied about 34K memory pages for buffering, which is close to my
theoretical estimate of 128 MB. But at the same time, my free LOWMEM
dropped from over 700 MB to less than 100 MB. The question is where the
additional 500 MB went.

2006-11-07 05:53:33

by Eric Dumazet

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

Zhao Xiaoming a écrit :
> On 11/6/06, Eric Dumazet <[email protected]> wrote:
>> In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K
>> instead
>> of 8K.
>>
>> If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a
>> 2G/2G user/kernel split, instead of the 3G/1G default split.
>> (see config : CONFIG_VMSPLIT_2G)
>>
>> Eric

> Thank you for your advice. I know increase LOMEM could be help, but
> now my concern is why I lose my 500M bytes memory after excluding all
> known memory cost.

Unfortunately you don't provide very many details.
AFAIK you didn't even give which version of Linux you run, or which
programs you run...
You keep asking where you 'lost' your memory; it's quite puzzling.
Maybe some oracles on this list will see the light for you before
exchanging 100 mails with you?

2006-11-07 06:00:04

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

The latest update:
It seems that the Linux kernel memory management mechanisms, including
the buddy and slab algorithms, are not very efficient under my test
conditions, where the TCP stack requires a lot of packet buffers
(hundreds of MB) and releases them very frequently.
Here is the evidence: after changing my kernel configuration to a 2G/2G
VM split, LOWMEM consumption dropped to 270 MB, compared with 640 MB on
the 3G/1G kernel. All test conditions were the same, and the number of
memory pages allocated by the TCP stack was also the same, 34K~38K
pages. In other words, the 'lost' memory went from ~500 MB to ~130 MB.
Thus, I can only guess that having many more free pages makes the
slab/buddy algorithms more efficient and waste less memory.
Finally I got what I wanted. Thank you all for your help and advice.

Xiaoming.
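
Since the conclusion above rests on how the buddy allocator behaves when low
memory is tight, a small sketch of reading /proc/buddyinfo for the Normal
zone may help (illustrative only; page size assumed to be the 4 KB i386
value). In the snapshots earlier in the thread, the zone's free pages under
load sit almost entirely in the low orders, which is what makes higher-order
allocations, and therefore slab growth, fail first:

PAGE_KB = 4                        # page size in kB on i386

with open("/proc/buddyinfo") as f:
    for line in f:
        if "Normal" not in line:
            continue
        counts = [int(n) for n in line.split("Normal")[1].split()]
        total_kb = sum(c * (2 ** order) * PAGE_KB
                       for order, c in enumerate(counts))
        print("zone Normal: %d kB free in buddy lists" % total_kb)
        for order, c in enumerate(counts):
            print("  order %2d (%6d kB blocks): %5d free"
                  % (order, (2 ** order) * PAGE_KB, c))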

2006-11-07 06:08:34

by Zhao Xiaoming

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

On 11/7/06, Eric Dumazet <[email protected]> wrote:
> Zhao Xiaoming a écrit :
> > On 11/6/06, Eric Dumazet <[email protected]> wrote:
> >> In fact, your kernel has CONFIG_4KSTACKS, kernel thread stacks use 4K
> >> instead
> >> of 8K.
> >>
> >> If you want to increase LOWMEM, (and keep 32bits kernel), you can chose a
> >> 2G/2G user/kernel split, instead of the 3G/1G default split.
> >> (see config : CONFIG_VMSPLIT_2G)
> >>
> >> Eric
>
> > Thank you for your advice. I know increase LOMEM could be help, but
> > now my concern is why I lose my 500M bytes memory after excluding all
> > known memory cost.
>
> Unfortunatly you dont provide very much details.
> AFAIK you didnt even gave whcih version of linux you run, which programs you
> run...
> You keep answering where you 'lost' your mem, it's quite buging.
> Maybe some Oracles on this list will see the light for you, before exchanging
> 100 mails with you ?
>
>
I think I already gave the kernel version and introduced my application
in the first post. What further details do you want? The reason I keep
asking about the 'lost' memory is that I want to focus on the problem,
not on workarounds that may lead to further problems if I keep
increasing the concurrency.
Anyway, since the problem is already solved (see my last post), I'd
like to thank you for the help.

Xiaoming.

2006-11-07 08:04:10

by Al Boldi

Subject: Re: ZONE_NORMAL memory exhausted by 4000 TCP sockets

Zhao Xiaoming wrote:
> The latest update:
> It seems that Linux kernel memory management mechanisms including
> buddy and slab algorisms are not very efficient under my test
> conditions that tcp stack requires a lot of (hundreds of MB) packet
> buffers and release them very frequently.
> Here is the proof. After change my kernel configuration to support
> 2/2 VM splition, LOMEM consumption reduced to 270M bytes compared with
> 640M bytes of the 1/3 kernel. All test conditions are the same and
> memory pages allocated by TCP stack are also the same, 34K ~ 38K
> pages. In other words, 'lost' memory changed from ~500M to ~130M.
> Thus, I have nothing to do but guessing the much more free pages make
> the slab/buddy algorisms more efficient and waste less memory.

I kind of agree, and I always compile for a 2G/2G VM split, as this also
seems to affect certain OOM conditions positively.

What isn't quite clear, though, is why the 2G/2G VM split is not the
default.


Thanks!

--
Al