More bottlenecks located during SCI/Gigabit Ethernet testing
and profiling. The configuration is 2.4.19-pre2(3) running SCI
and Intel e1000 gigabit ethernet adapters. In this scenario,
the GNET adapter is DMA'ing frames from a gigabit segment
directly into reflective memory mapped into an SCI adapter
address space, then immediately triggering an outbound
DMA of the data over an SCI clustering fabric into
the memory of a remote node. In essence, this is a GNET-to-SCI
routing fabric.
Throughput numbers are stable for the most part, since we are
already at the maximum throughput of Intel's GNET adapters at
124 MB/s; however, processor utilization, locking, etc. are far
higher than they need to be. We are also spending too much time
calling kmalloc/kfree during skb construct/destruct operations.
In addition, Intel's adapter by default has the ring buffer size
in the driver set to 256 packets, and the skb hot_list limit (the
number of free skb headers kept before they are discarded) is, at
128, too low for these GNET adapters, resulting in intermittent
packet overruns.
Increasing these numbers, and using a fixed frame size consistent with GNET
(excluding 9K jumbo frames) instead of kmalloc'ing/kfree'ing the
skb->data portion of these frames all the time, yields a decrease
in remote receipt latency and lower processor and bus
utilization.
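To illustrate the direction (a simplified sketch only, not the actual
patch; gnet_frame_cache, GNET_FRAME_SIZE and gnet_frame_alloc are
placeholder names), the idea is a dedicated slab cache of maximum-sized
frame buffers handed to the skb, instead of kmalloc'ing and kfree'ing
skb->data for every packet:

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/skbuff.h>
#include <linux/slab.h>

#define GNET_MTU_FRAME  1514
/* Data area rounded up to whole cache lines (as alloc_skb would do),
 * plus room for the skb_shared_info that lives at the end of skb->data. */
#define GNET_FRAME_SIZE \
        (((GNET_MTU_FRAME + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1)) + \
         sizeof(struct skb_shared_info))

static kmem_cache_t *gnet_frame_cache;

static int __init gnet_frame_cache_init(void)
{
        /* One fixed-size object per frame. */
        gnet_frame_cache = kmem_cache_create("gnet_frames", GNET_FRAME_SIZE,
                                             0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        return gnet_frame_cache ? 0 : -ENOMEM;
}

static unsigned char *gnet_frame_alloc(int gfp_mask)
{
        /* Replaces the per-packet kmalloc of skb->data; the buffer is
         * attached to the skb by the (not shown) skbuff/driver changes
         * and is not released by skb_release_data. */
        return kmem_cache_alloc(gnet_frame_cache, gfp_mask);
}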
Measured latency of packets coming off the SCI interface on the remote
side of the clustering fabric is 3-4% higher with the stock code than
with the modified code.
The modifications made to skbuff.c are extensive, and driver changes
were also required to get around these performance problems. Data is
provided for review. Recommend a minimum change of increasing
the sysctl_hot_list_len from 128 to 1024 by default. I have reviewed
(and modified) the skbuff code and all the copy logic related to mapping
fragment lists, etc., and this code is quite a mess.
NetWare always created ECBs (Event Control Blocks) at the maximum size
supported by the network adapter rather than trying to allocate fragment
elements on the fly the way it is done in Linux with skbs.
The bottom line is that this is hurting performance and I/O bandwidth
and needs to be corrected. At a minimum, the default hot_list size
should be increased.
/usr/src/linux/net/core/skbuff.c
//int sysctl_hot_list_len = 128;
int sysctl_hot_list_len = 1024; // bump this value up
alloc_skb with calls to kmalloc/kfree, 2.4.19-pre2 with the code
"as is". Notice the high call rate to kmalloc/kfree and the corresponding
higher utilization (@ 7%).
36324 total 0.0210
28044 default_idle 584.2500
1117 __rdtsc_delay 34.9062
927 eth_type_trans 4.4567
733 skb_release_data 5.0903
645 kmalloc 2.5195
638 kfree 3.9875
463 __make_request 0.3180
415 __scsi_end_request 1.3651
382 alloc_skb 0.8843
372 tw_interrupt 0.3633
241 kfree_skbmem 1.8828
233 scsi_dispatch_cmd 0.4161
233 __generic_copy_to_user 3.6406
194 __kfree_skb 0.9327
184 scsi_request_fn 0.2396
103 ip_rcv 0.1238
88 __wake_up 0.5000
84 do_anonymous_page 0.4773
72 do_softirq 0.4091
52 processes: 51 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 32.8% system, 0.0% nice, 67.1% idle
Mem: 897904K av, 869248K used, 28656K free, 0K shrd, 3724K buff
Swap: 1052216K av, 0K used, 1052216K free 46596K cached
alloc_skb_frame with fixed 1514 + fragment list allocations,
sysctl_hot_list_len = 1024.
34880 total 0.0202
28581 default_idle 595.4375
1125 __rdtsc_delay 35.1562
1094 eth_type_trans 5.2596
657 skb_release_data 4.5625
378 __make_request 0.2596
335 alloc_skb_frame 1.1020
334 tw_interrupt 0.3262
293 __scsi_end_request 0.9638
208 scsi_dispatch_cmd 0.3714
193 __kfree_skb 0.9279
184 scsi_request_fn 0.2396
160 kfree_skbmem 1.2500
90 __generic_copy_to_user 1.4062
81 ip_rcv 0.0974
68 __wake_up 0.3864
59 do_anonymous_page 0.3352
48 do_softirq 0.2727
43 generic_make_request 0.1493
43 alloc_skb 0.0995
50 processes: 49 sleeping, 1 running, 0 zombie, 0 stopped
CPU states: 0.0% user, 27.5% system, 0.0% nice, 72.4% idle
Mem: 897904K av, 841280K used, 56624K free, 0K shrd, 2220K buff
Swap: 1052216K av, 0K used, 1052216K free 22292K cached
Jeff
"Jeff V. Merkey" wrote:
>
> ...
> 34880 total 0.0202
> 28581 default_idle 595.4375
> 1125 __rdtsc_delay 35.1562
> 1094 eth_type_trans 5.2596
> 657 skb_release_data 4.5625
> 378 __make_request 0.2596
> 335 alloc_skb_frame 1.1020
Note how eth_type_trans is now the most expensive function. This
is because it's the first place where the CPU touches the
just-arrived ethernet header.
It would be interesting to add a prefetch() to the driver at the
earliest possible time to get the header read underway. Maybe
the IP header too?
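Something along these lines in the RX cleanup path, perhaps (a sketch
only; the function and names are illustrative, not the real e1000 code):

#include <linux/etherdevice.h>
#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* Illustrative RX path, not the actual e1000 driver code. */
static void example_rx_one_packet(struct net_device *netdev, struct sk_buff *skb)
{
        /* Start the header reads before eth_type_trans() touches them. */
        prefetch(skb->data);                    /* ethernet header */
        prefetch(skb->data + ETH_HLEN);         /* likely start of the IP header */

        skb->protocol = eth_type_trans(skb, netdev);
        netif_rx(skb);
}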
-
"Jeff V. Merkey" <[email protected]> writes:
> /usr/src/linux/net/core/skbuff.c
>
> //int sysctl_hot_list_len = 128;
> int sysctl_hot_list_len = 1024; // bump this value up
>
The plan was actually to get rid of the skb hot list. It was just
a stopgap solution to get CPU-local SMP allocation before Linux had
CPU-local slab caches. The slab cache has since been fixed and runs
CPU-local now too, so there should be no need for the hot list anymore,
as the slab cache does essentially the same thing as the private hot list
cache (maintaining linked lists of objects, unlinking them quickly
on allocation and linking them again on free, all in O(1)).
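The hot list fast path looks roughly like this (paraphrased from memory
of 2.4's net/core/skbuff.c, where skb_head_pool, sysctl_hot_list_len and
skbuff_head_cache are file-scope; not the verbatim code):

/* One free list of skb heads per CPU; the common alloc/free is a single
 * O(1) unlink/link with local interrupts disabled. */
static struct sk_buff *skb_head_from_pool(void)
{
        struct sk_buff_head *list = &skb_head_pool[smp_processor_id()].list;
        struct sk_buff *skb = NULL;
        unsigned long flags;

        if (skb_queue_len(list)) {
                local_irq_save(flags);
                skb = __skb_dequeue(list);              /* O(1) unlink */
                local_irq_restore(flags);
        }
        return skb;                                     /* NULL -> use the slab cache */
}

static void skb_head_to_pool(struct sk_buff *skb)
{
        struct sk_buff_head *list = &skb_head_pool[smp_processor_id()].list;
        unsigned long flags;

        if (skb_queue_len(list) < sysctl_hot_list_len) {
                local_irq_save(flags);
                __skb_queue_head(list, skb);            /* O(1) link */
                local_irq_restore(flags);
                return;
        }
        kmem_cache_free(skbuff_head_cache, skb);        /* list full: back to slab */
}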
> alloc_skb_frame with fixed 1514 + fragment list allocations,
> sysctl_hot_list_len = 1024.
Something is bogus with your profile data. Increasing sysctl_hot_list_len
never changes the frequency with which kmalloc/kfree are called. All
it does is produce fewer calls to kmem_cache_alloc() for the skb head;
the skb data portion is always allocated using kmalloc(). Your
new profile doesn't show kmalloc, so you changed something else.
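Abridged from memory of 2.4's alloc_skb() (not verbatim), showing the two
separate allocations:

struct sk_buff *alloc_skb(unsigned int size, int gfp_mask)
{
        struct sk_buff *skb;
        u8 *data;

        /* skb head: per-CPU hot list first, slab cache as the fallback.
         * This is the only part sysctl_hot_list_len influences. */
        skb = skb_head_from_pool();
        if (skb == NULL) {
                skb = kmem_cache_alloc(skbuff_head_cache, gfp_mask & ~__GFP_DMA);
                if (skb == NULL)
                        return NULL;
        }

        /* data area: always a fresh kmalloc, hot list or not. */
        size = SKB_DATA_ALIGN(size);
        data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
        if (data == NULL) {
                skb_head_to_pool(skb);
                return NULL;
        }

        /* ... initialize skb->head/data/tail/end, users, shared info ... */
        return skb;
}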
-andi
> provided for review. Recommend a minimum change of increasing
> the sysctl_hot_list_len from 128 to 1024 by default. I have reviewed
Good way to kill low end boxes. It probably wants sizing based on system
size and load monitoring.
> NetWare always created ECBs (Event Control Blocks) at the maximum size
> supported by the network adapter rather than trying to allocate fragment
> elements on the fly the way it is done in Linux with skbs.
That's up to the network adapter. In fact the Linux drivers mostly do
keep preloaded with full-sized buffers and only copy if the packet size
is small (and copying 1 or 2 cache lines isn't going to hurt anyone).
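The usual shape is a "copybreak" in the driver RX path, roughly like this
(an illustrative sketch, not taken from any particular driver):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define RX_COPYBREAK    256     /* threshold is a per-driver choice */

/* RX ring stays pre-filled with full-sized buffers; small packets are
 * copied into a right-sized skb so the big buffer never leaves the ring. */
static struct sk_buff *rx_small_copy(struct sk_buff *ring_skb, int pkt_len,
                                     struct net_device *dev)
{
        struct sk_buff *skb;

        if (pkt_len >= RX_COPYBREAK)
                return NULL;                    /* hand the full-sized buffer up instead */

        skb = dev_alloc_skb(pkt_len + 2);
        if (skb == NULL)
                return NULL;
        skb->dev = dev;
        skb_reserve(skb, 2);                    /* align the IP header */
        memcpy(skb_put(skb, pkt_len), ring_skb->data, pkt_len);
        return skb;                             /* ring_skb stays in the ring */
}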
> 28044 default_idle 584.2500
You spent most of your time asleep 8)
> 1117 __rdtsc_delay 34.9062
Or doing delays
> 927 eth_type_trans 4.4567
And pulling a line into L1 cache
On Mon, Mar 04, 2002 at 09:28:21AM +0100, Andi Kleen wrote:
> "Jeff V. Merkey" <[email protected]> writes:
>
> > /usr/src/linux/net/core/skbuff.c
> >
> > //int sysctl_hot_list_len = 128;
> > int sysctl_hot_list_len = 1024; // bump this value up
> >
>
> The plan was actually to get rid of the skb hot list. It was just
> a stopgap solution to get CPU-local SMP allocation before Linux had
> CPU-local slab caches. The slab cache has since been fixed and runs
> CPU-local now too, so there should be no need for the hot list anymore,
> as the slab cache does essentially the same thing as the private hot list
> cache (maintaining linked lists of objects, unlinking them quickly
> on allocation and linking them again on free, all in O(1)).
>
>
> > alloc_skb_frame with fixed 1514 + fragment list allocations,
> > sysctl_hot_list_len = 1024.
>
> Something is bogus with your profile data. Increasing sysctl_hot_list_len
> never changes the frequency with which kmalloc/kfree are called. All
> it does is produce fewer calls to kmem_cache_alloc() for the skb head;
> the skb data portion is always allocated using kmalloc(). Your
> new profile doesn't show kmalloc, so you changed something else.
>
>
> -andi
Agreed. What's making the numbers better is the fact that I have
removed the calls to kmalloc/kfree in alloc_skb. This extra code path
increases latency with these high-speed adapters. Reread the post: I
said changing the hot list eliminated a lot of packet overruns, not
calls to kmalloc/kfree. The data is correct.
Jeff
On Mon, Mar 04, 2002 at 03:04:00PM +0000, Alan Cox wrote:
> > provided for review. Recommend a minimum change of increasing
> > the sysctl_hot_list_len from 128 to 1024 by default. I have reviewed
>
> Good way to kill low end boxes. It probably wants sizing based on system
> size and load monitoring.
>
> > NetWare always created ECBs (Event Control Blocks) at the maximum size
> > supported by the network adapter rather than trying to allocate fragment
> > elements on the fly the way it is done in Linux with skbs.
>
> That's up to the network adapter. In fact the Linux drivers mostly do
> keep preloaded with full-sized buffers and only copy if the packet size
> is small (and copying 1 or 2 cache lines isn't going to hurt anyone).
There's an increase in latency. For my application, I have no
problem keeping around a local patch that corrects this behavior
if folks don't feel it needs fixing. From everything I've ever
done in this space, having needless alloc/free calls in a
performance-intensive path that requires low latency, like a LAN
driver, is not a good thing.
The system is idle most of the time because I have eliminated all
of the copy activity by using SCI. Were I using the IP stack code
in Linux proper, the utilization would be through the roof.
:-)
Jeff
>
> > 28044 default_idle 584.2500
>
> You spent most of your time asleep 8)
>
> > 1117 __rdtsc_delay 34.9062
>
> Or doing delays
>
> > 927 eth_type_trans 4.4567
>
> And pulling a line into L1 cache
Hi Jeff,
Have you tried the NAPI patch and the NAPI'fied e1000 driver?
I'm not sure how far the development has come but I know it improves
performance quite a bit versus the regular e1000 driver.
You'll find it here:
ftp://robur.slu.se/pub/Linux/net-development/NAPI/
kernel/napi-patch-ank is the NAPI patch; you need to change
the get_fast_time() call to do_gettimeofday() for it to compile.
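That is, a substitution along these lines (the variable name in the real
patch may differ):

#include <linux/time.h>

static void example_stamp(struct timeval *stamp)
{
        do_gettimeofday(stamp);         /* was: get_fast_time(stamp); */
}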
e1000/ is the NAPI'fied e1000 driver; the latest release is from Jan 29,
but there is a document that describes how to check out the latest version
via CVS.
I've never tried the e1000 NAPI driver since I don't have one of these
boards but I use the tulip NAPI driver a lot here and it works great,
impressive performance.
I hope you get better performance.
/Martin
Never argue with an idiot. They drag you down to their level, then beat you with experience.
Thanks! I'll check it out. I've already done very heavy modifications
to the e1000 for my testing.
Jeff
On Mon, Mar 04, 2002 at 06:39:31PM +0100, Martin Josefsson wrote:
> Hi Jeff,
>
> Have you tried the NAPI patch and the NAPI'fied e1000 driver?
> I'm not sure how far the development has come but I know it improves
> performance quite a bit versus the regular e1000 driver.
>
> You'll find it here:
> ftp://robur.slu.se/pub/Linux/net-development/NAPI/
>
> kernel/napi-patch-ank is the NAPI patch; you need to change
> the get_fast_time() call to do_gettimeofday() for it to compile.
>
> e1000/ is the NAPI'fied e1000 driver; the latest release is from Jan 29,
> but there is a document that describes how to check out the latest version
> via CVS.
>
> I've never tried the e1000 NAPI driver since I don't have one of these
> boards but I use the tulip NAPI driver a lot here and it works great,
> impressive performance.
>
> I hope you get better performance.
>
> /Martin
>
> Never argue with an idiot. They drag you down to their level, then beat you with experience.
>
> > That's up to the network adapter. In fact the Linux drivers mostly do
> > keep preloaded with full-sized buffers and only copy if the packet size
> > is small (and copying 1 or 2 cache lines isn't going to hurt anyone).
>
> There's an increase in latency. For my application, I have no
A very tiny one (especially if you keep a small buffer pool around too).
Copying a packet is 2 cache line loads, which will dominate, some
writes that you won't be able to measure, and a writeback you won't be
able to instrument without a bus analyser.
For receive paths it's up to the driver. The copy to a smaller buffer is
something the driver can choose to do. It and it alone decides what skbuff
to throw at the kernel core.
The bigger ring helping is interesting, but it begs a question: do you
ever dirty, rather than merely reference, skbuff data? In that case a bigger
ring may simply be hiding the fact that the recycled skbuff has dirty
cached data that has to be written back. Does the combination of hardware
you have do the right thing when it comes to the invalidation, and do
you ever DMA into a partial cache line?
Alan
On Mon, Mar 04, 2002 at 06:34:33PM +0000, Alan Cox wrote:
> > > That's up to the network adapter. In fact the Linux drivers mostly do
> > > keep preloaded with full-sized buffers and only copy if the packet size
> > > is small (and copying 1 or 2 cache lines isn't going to hurt anyone).
> >
> > There's an increase in latency. For my application, I have no
>
> A very tiny one (especially if you keep a small buffer pool around too).
> Copying a packet is 2 cache line loads, which will dominate, some
> writes that you won't be able to measure, and a writeback you won't be
> able to instrument without a bus analyser.
>
> For receive paths it's up to the driver. The copy to a smaller buffer is
> something the driver can choose to do. It and it alone decides what skbuff
> to throw at the kernel core.
>
> The bigger ring helping is interesting, but it begs a question: do you
> ever dirty, rather than merely reference, skbuff data? In that case a bigger
Actually, I am plugging my own data blocks into the skbuff->data
pointer from the drivers themselves. When the driver allocates an
skbuff, I use an alternate cache allocator to load the buffer
into the skbuff, and I leave the two attached. skb_release_data does
not release one of my data blocks. These blocks are 4K aligned, so
I am not in the middle of a cache line (I think) in the data portion
when DMA is initiated.
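The simplest way to get that alignment in 2.4 is just to use whole pages,
along these lines (illustrative only, not my actual allocator):

#include <linux/mm.h>

static unsigned char *alloc_aligned_rx_block(void)
{
        /* Whole pages are always 4K-aligned, so a DMA into the block
         * never starts in the middle of a cache line shared with
         * anything else. */
        return (unsigned char *)__get_free_page(GFP_ATOMIC);
}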
> ring may simply be hiding the fact that the recycled skbuff has dirty
> cached data that has to be written back. Does the combination of hardware
This is probably happening. I have an Arium here now, and am hooking
it up. I can get memory bus traces and provide what's actually
happening on the bus. I do not ever DMA into a partial cache line.
Jeff
> you have do the right thing when it comes to the invalidation, and do
> you ever DMA into a partial cache line?
>
> Alan