2003-01-03 16:05:29

by Avery Fay

[permalink] [raw]
Subject: Gigabit/SMP performance problem

Hello,

I'm working with a dual xeon platform with 4 dual e1000 cards on different
pci-x buses. I'm having trouble getting better performance with the second
cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
(redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
around 90% utilization. This suggests to me that the network code is
serialized. I would expect one of two things from my understanding of the
2.4.x networking improvements (softirqs allowing execution on more than
one cpu):

1.) with smp I would get ~2.9 gb/s but the combined cpu utilization would
be that of one cpu at 90%.
2.) or with smp I would get more than ~2.9 gb/s.

Has anyone been able to utilize more than one cpu with pure forwarding?

Note: I realize that I am not using a stock kernel. I was in the past, but
I ran into the same problem (smp not improving performance), just at lower
speeds (redhat's kernel was faster). Therefore, this problem is neither
introduced nor solved by redhat's kernel. If anyone has suggestions for
improvements, I can move back to a stock kernel.

Note #2: I've tried tweaking a lot of different things, including binding
IRQs to specific CPUs, playing around with e1000 module settings, etc.

Thanks in advance and please CC me with any suggestions as I'm not
subscribed to the list.

Avery Fay

P.S. Only got one response on the linux-net list so I'm posting here. One
thing I did learn from that response is that redhat's kernel is faster
because they use a NAPI version of the e1000 driver.


2003-01-03 18:04:59

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):
>
> 1.) with smp I would get ~2.9 gb/s but the combined cpu utilization would
> be that of one cpu at 90%.
> 2.) or with smp I would get more than ~2.9 gb/s.
>
> Has anyone been able to utilize more than one cpu with pure forwarding?
>
> Note: I realize that I am not using a stock kernel. I was in the past, but
> I ran into the same problem (smp not improving performance), just at lower
> speeds (redhat's kernel was faster). Therefore, this problem is neither
> introduced nor solved by redhat's kernel. If anyone has suggestions for
> improvements, I can move back to a stock kernel.
>
> Note #2: I've tried tweaking a lot of different things including binding
> irq's to specific cpus, playing around with e1000 modules settings, etc.
>
> Thanks in advance and please CC me with any suggestions as I'm not
> subscribed to the list.

Dual what Xeon? I presume a P4 thing. Can you cat /proc/interrupts?
Are you using the irq_balance code? If so, I think you'll only use
1 cpu to process all the interrupts from each gigabit card. Not that
you have much choice, since Intel broke the P4's interrupt routing.

Which of the e1000 modules settings did you play with? tx_delay
and rx_delay? What rev of the e1000 chipset?

M.

2003-01-03 20:19:06

by Avery Fay

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load balancing as
shown below (seems to be applied to Red Hat's kernel). Here's
/proc/interrupts:

           CPU0       CPU1
  0:     179670     182501    IO-APIC-edge  timer
  1:        386        388    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          0    IO-APIC-edge  rtc
 12:          9          9    IO-APIC-edge  PS/2 Mouse
 14:       1698       1511    IO-APIC-edge  ide0
 24:    1300174    1298071   IO-APIC-level  eth2
 25:    1935085    1935625   IO-APIC-level  eth3
 28:    1162013    1162734   IO-APIC-level  eth4
 29:    1971246    1967758   IO-APIC-level  eth5
 48:    2753990    2753821   IO-APIC-level  eth0
 49:    2047386    2043894   IO-APIC-level  eth1
 72:     838987     841143   IO-APIC-level  eth6
 73:    2767885    2768307   IO-APIC-level  eth7
NMI:          0          0
LOC:     362009     362008
ERR:          0
MIS:          0

I started traffic at different times on the various interfaces so the
number of interrupts per interface aren't uniform.

I modified RxIntDelay, TxIntDelay, RxAbsIntDelay, TxAbsIntDelay,
FlowControl, RxDescriptors, TxDescriptors. Increasing the various
IntDelays seemed to improve performance slightly.
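For reference, a rough sketch of how those options get passed to the driver
(the parameter names are the e1000 module options mentioned above; the values
are made up for illustration, and each parameter takes a comma-separated list
with one value per port):

    # /etc/modules.conf entry -- illustrative values only, not recommendations
    options e1000 RxIntDelay=96,96 TxIntDelay=96,96 RxDescriptors=256,256 TxDescriptors=256,256

    # the same thing when loading the module by hand:
    modprobe e1000 RxIntDelay=96,96 TxIntDelay=96,96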

I'm using 3 Intel PRO/1000 MT Dual Port Server adapters as well as 2
onboard Intel PRO/1000 ports. The adapters use the 82546EB chips. I
believe that the onboard ports use the same although I'm not sure.

Should I get rid of IRQ load balancing? And what do you mean "Intel broke the P4's interrupt routing"?

Thanks,
Avery Fay





"Martin J. Bligh" <[email protected]>
01/03/2003 01:05 PM


To: Avery Fay <[email protected]>, [email protected]
cc:
Subject: Re: Gigabit/SMP performance problem


> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):
>
> 1.) with smp I would get ~2.9 gb/s but the combined cpu utilization would
> be that of one cpu at 90%.
> 2.) or with smp I would get more than ~2.9 gb/s.
>
> Has anyone been able to utilize more than one cpu with pure forwarding?
>
> Note: I realize that I am not using a stock kernel. I was in the past, but
> I ran into the same problem (smp not improving performance), just at lower
> speeds (redhat's kernel was faster). Therefore, this problem is neither
> introduced nor solved by redhat's kernel. If anyone has suggestions for
> improvements, I can move back to a stock kernel.
>
> Note #2: I've tried tweaking a lot of different things including binding
> irq's to specific cpus, playing around with e1000 modules settings, etc.
>
> Thanks in advance and please CC me with any suggestions as I'm not
> subscribed to the list.

Dual what Xeon? I presume a P4 thing. Can you cat /proc/interrupts?
Are you using the irq_balance code? If so, I think you'll only use
1 cpu to process all the interrupts from each gigabit card. Not that
you have much choice, since Intel broke the P4's interrupt routing.

Which of the e1000 modules settings did you play with? tx_delay
and rx_delay? What rev of the e1000 chipset?

M.




2003-01-03 21:03:13

by Robert Olsson

[permalink] [raw]
Subject: Gigabit/SMP performance problem


Avery Fay writes:
>
> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):

Well you have a gigabit router :-)

How is your routing setup? Packet size?

Also, you'll never get increased performance for a single flow with SMP.
Aggregate performance is the best you can hope for. I've been fighting with
this for some time too.

You have some important data in /proc/net/softnet_stat, which is per-CPU:
the packets-received and "cpu collisions" counters should interest you.

As far as I understand, there is no serialization in the forwarding path except
where it has to be -- when we add softirqs from different CPUs into a single
device. This is seen in "cpu collisions".
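A quick way to eyeball those counters (one line per CPU, hex fields; I'm
assuming the first field is packets received and the last is "cpu collisions"
-- check the layout against net/core/dev.c in your tree):

    cat /proc/net/softnet_stat

    # crude per-CPU summary of rx packets and collisions:
    awk '{ print "cpu" NR-1, "rx=0x" $1, "collisions=0x" $NF }' /proc/net/softnet_stat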

Also, here we get into the inherent SMP cache bouncing problem with TX
interrupts, when TX holds skbs which were processed/created on different CPUs.
Which CPU is going to take the interrupt? No matter how we run kfree, we are
going to see a lot of cache bouncing. For systems that have the same in/out
interface, smp_affinity can be used. In practice this is impossible for
forwarding.

And this bouncing hurts especially for small packets....

A little TX test illustrates this. Sender on cpu0.

UP 186 kpps
SMP Aff to cpu0 160 kpps
SMP Aff to cpu0, cpu1 124 kpps
SMP Aff to cpu1 106 kpps

We are playing with some code that might reduce this problem.


Cheers.
--ro

2003-01-03 21:11:00

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

On Fri, 2003-01-03 at 21:25, Avery Fay wrote:

> Should I get rid of IRQ load balancing? And what do you mean "Intel broke the P4's interrupt routing"?

well you can bind IRQ's to specific cpu's in /proc....
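For example (a sketch; the IRQ numbers are the ones from Avery's
/proc/interrupts dump, and the values are hex CPU bitmasks):

    # pin eth0's interrupt (IRQ 48 above) to CPU0 and eth1's (IRQ 49) to CPU1
    echo 1 > /proc/irq/48/smp_affinity
    echo 2 > /proc/irq/49/smp_affinity

    # verify
    cat /proc/irq/48/smp_affinity /proc/irq/49/smp_affinity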



2003-01-03 21:34:51

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

> Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load balancing as
> shown below (seems to be applied to Red Hat's kernel). Here's
> /proc/interrupts:

It's in 2.4.20-ac2 at least. See if arch/i386/kernel/io_apic.c
has a function called balance_irq.

> CPU0 CPU1
> 0: 179670 182501 IO-APIC-edge timer
> 1: 386 388 IO-APIC-edge keyboard
> 2: 0 0 XT-PIC cascade
> 8: 1 0 IO-APIC-edge rtc
> 12: 9 9 IO-APIC-edge PS/2 Mouse
> 14: 1698 1511 IO-APIC-edge ide0
> 24: 1300174 1298071 IO-APIC-level eth2
> 25: 1935085 1935625 IO-APIC-level eth3
> 28: 1162013 1162734 IO-APIC-level eth4
> 29: 1971246 1967758 IO-APIC-level eth5
> 48: 2753990 2753821 IO-APIC-level eth0
> 49: 2047386 2043894 IO-APIC-level eth1
> 72: 838987 841143 IO-APIC-level eth6
> 73: 2767885 2768307 IO-APIC-level eth7
> NMI: 0 0
> LOC: 362009 362008
> ERR: 0
> MIS: 0
>
> I started traffic at different times on the various interfaces so the
> number of interrupts per interface aren't uniform.
>
> I modified RxIntDelay, TxIntDelay, RxAbsIntDelay, TxAbsIntDelay,
> FlowControl, RxDescriptors, TxDescriptors. Increasing the various
> IntDelays seemed to improve performance slightly.

Makes sense, increasing the delays should reduce the interrupt load.

> I'm using 3 Intel PRO/1000 MT Dual Port Server adapters as well as 2
> onboard Intel PRO/1000 ports. The adapters use the 82546EB chips. I
> believe that the onboard ports use the same although I'm not sure.
>
> Should I get rid of IRQ load balancing? And what do you mean
> "Intel broke the P4's interrupt routing"?

P3's distributed interrupts round-robin amongst cpus. P4's send
everything to CPU 0. If you put irq_balance on, it'll spread
them around, but any given interrupt is still only handled by
one CPU (as far as I understand the code). If you hammer one
adaptor, does that generate more interrupts than 1 cpu can handle?
(turn irq balance off by sticking a return at the top of balance_irq,
and hammer one link, see how much CPU power that burns).
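A rough way to take that measurement (assuming a traffic generator on the far
end and standard procps tools; eth0 stands in for whichever link gets hammered):

    # watch how fast the hammered link's interrupt count grows
    watch -n 1 'grep -E "eth0|CPU" /proc/interrupts'

    # in another terminal: 'in' is interrupts/s, 'id' is idle CPU %
    vmstat 1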

M.

2003-01-03 21:41:05

by Ron cooper

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

On Friday 03 January 2003 12:05 pm, Martin J. Bligh wrote:

> Dual what Xeon? I presume a P4 thing. Can you cat /proc/interrupts?
> Are you using the irq_balance code? If so, I think you'll only use
> 1 cpu to process all the interrupts from each gigabit card. Not that
> you have much choice, since Intel broke the P4's interrupt routing.
>

You got my attention with this statement. I have dual Xeon Prestonias on
an i860 chipset (IWill DP400). cat /proc/interrupts indeed shows CPU0 as
processing all IRQs instead of sharing them with CPU1 on a 2.4.x kernel.

Is there a workaround for this, or is this *really* a problem? Some say it
might be a problem depending on how many interrupts need to be processed per
second. Others imply that cpu0 catching the IRQs might be a good thing.

I happen to have PIIIs using VIA chipsets that don't have this issue in
/proc/interrupts. This is very annoying, but I wonder if it is worth
worrying about.


Ron.




2003-01-03 21:47:48

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

>> Dual what Xeon? I presume a P4 thing. Can you cat /proc/interrupts?
>> Are you using the irq_balance code? If so, I think you'll only use
>> 1 cpu to process all the interrupts from each gigabit card. Not that
>> you have much choice, since Intel broke the P4's interrupt routing.
>
> You got my attention with this statement. I've have Dual Xeon Prestonias on
> an I860 chipset (IWill dp400). cat /proc/interrupts indeed shows CPU0 as
> processing all IRQ's instead of sharing them with CPU1 on a 2.4.x kernel.
>
> Is there a work around for this, or is this *really* a problem? Some say it
> might be a problem depending on how many interrupts need to be processed per
> second. Others imply that cpu0 catching the irq's might be a good thing.

Right - depends what you're doing. You can look at irq balance (in 2.5
or 2.4-ac), but I don't like it much as a solution. Or you could try
programming the TPR (there were some patches floating around). It would be
interesting to get some perf measurements from people using the TPR patches
(it is more expensive to set on a P4). Or someone from Intel posted some code
recently that seemed to do more intelligent things, but I haven't had the time
to look closely. If you want to experiment with that, I'm sure people would
be interested in the results.

> I happen to have PIII's using VIA chipsets that dont have this issue with
> proc/interrupts. This is very annonying, but I wonder if it is worth
> worrying about.

P3's aren't as brain damaged.

M.

2003-01-03 22:24:05

by Andrew Theurer

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

On Friday 03 January 2003 15:36, Martin J. Bligh wrote:
> > Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load balancing
> > as shown below (seems to be applied to Red Hat's kernel). Here's
> > /proc/interrupts:
>
> Is in 2.4.20-ac2 at least. See if arch/i386/kernel/io_apic.c
> has a function called balance_irq.
>
> > CPU0 CPU1
> > 0: 179670 182501 IO-APIC-edge timer
> > 1: 386 388 IO-APIC-edge keyboard
> > 2: 0 0 XT-PIC cascade
> > 8: 1 0 IO-APIC-edge rtc
> > 12: 9 9 IO-APIC-edge PS/2 Mouse
> > 14: 1698 1511 IO-APIC-edge ide0
> > 24: 1300174 1298071 IO-APIC-level eth2
> > 25: 1935085 1935625 IO-APIC-level eth3
> > 28: 1162013 1162734 IO-APIC-level eth4
> > 29: 1971246 1967758 IO-APIC-level eth5
> > 48: 2753990 2753821 IO-APIC-level eth0
> > 49: 2047386 2043894 IO-APIC-level eth1
> > 72: 838987 841143 IO-APIC-level eth6
> > 73: 2767885 2768307 IO-APIC-level eth7
> > NMI: 0 0
> > LOC: 362009 362008
> > ERR: 0
> > MIS: 0
> >
> > I started traffic at different times on the various interfaces so the
> > number of interrupts per interface aren't uniform.
> >
> > I modified RxIntDelay, TxIntDelay, RxAbsIntDelay, TxAbsIntDelay,
> > FlowControl, RxDescriptors, TxDescriptors. Increasing the various
> > IntDelays seemed to improve performance slightly.

Monitor for dropped packets when increasing int delay. At least on the older
e1000 adapters, you would get dropped packets, etc, making the problem worse
in other areas.
>
> Makes sense, increasing the delays should reduce the interrupt load.
>
> > I'm using 3 Intel PRO/1000 MT Dual Port Server adapters as well as 2
> > onboard Intel PRO/1000 ports. The adapters use the 82546EB chips. I
> > believe that the onboard ports use the same although I'm not sure.
> >
> > Should I get rid of IRQ load balancing? And what do you mean
> > "Intel broke the P4's interrupt routing"?
>
> P3's distributed interrupts round-robin amongst cpus. P4's send
> everything to CPU 0. If you put irq_balance on, it'll spread
> them around, but any given interrupt is still only handled by
> one CPU (as far as I understand the code). If you hammer one
> adaptor, does that generate more interrupts than 1 cpu can handle?
> (turn irq balance off by sticking a return at the top of balance_irq,
> and hammer one link, see how much CPU power that burns).

Another problem you may have is that irq_balance is random, and sometimes more
than one interrupt is serviced by the same cpu at the same time. Actually, let
me clarify. In your case, if your network load was "even" across the adapters,
ideally you would want cpu0 handling the first 4 adapters and cpu1 handling
the last 4 adapters. With irq_balance, this is usually not the case. There
will be times when one cpu is doing more work than the other, possibly
becoming a bottleneck.

Now, there was some code in SuSE's kernel (SuSE 8.0, 2.4.18) which did a
round-robin static assignment of interrupts to cpus. In your case, all even
interrupt numbers would go to cpu0 and all odd interrupt numbers would go to
cpu1. Since you have exactly 4 adapters on even interrupts and 4 on odd
interrupts, that would work perfectly. Now, that doesn't mean there isn't some
other problem, like PCI bandwidth, but it's a start. Also, you might be able
to emulate this with irq affinity (/proc/irq/<num>/smp_affinity), but the last
time I tried it on a P4, it didn't work at all - no interrupts!
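Concretely, that emulation would look something like this (a sketch using the
IRQ numbers from the /proc/interrupts dump above; note the caveat that on some
P4 boxes writing smp_affinity may not take effect at all):

    # even-numbered NIC IRQs -> CPU0, odd-numbered -> CPU1
    for irq in 24 28 48 72; do echo 1 > /proc/irq/$irq/smp_affinity; done
    for irq in 25 29 49 73; do echo 2 > /proc/irq/$irq/smp_affinity; done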

-Andrew

2003-01-04 03:29:48

by Anton Blanchard

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem


> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):

The Fujitsu guys have a nice summary of this:

http://www.labs.fujitsu.com/en/techinfo/linux/lse-0211/index.html

Skip forward to page 8.

Don't blame the networking code just yet :) Notice how much worse UP vs SMP
performance is on the P4 compared to the P3?

This brings up another point, is a single CPU with hyperthreading worth
it? As Rusty will tell you, you need to compare it with a UP kernel
since it avoids all the locking overhead. I suspect for a lot of cases
HT will be a loss (imagine your case, comparing UP and one CPU HT)

Anton

2003-01-06 18:21:20

by Bill Davidsen

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

On 4 Jan 2003, Daniel Blueman wrote:

> It's interesting you have IRQs balanced over the two logical
> processors. I can't get this on HT Xeons with stock RedHat 7.3 kernel.

I think he's using two physical processors, if by "logical processors" you
are thinking HT... I also recall he has HT off, but the original post
isn't handy.

>
> Can you post the exact kernel version string, please?
>
> TIA,
> Dan
>
> "Avery Fay" <[email protected]> wrote in message news:<OF256CD297.9F92C038-ON85256CA3.006A4034-85256CA3.00705DEA@symantec.com>...
> > Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load balancing as
> > shown below (seems to be applied to Red Hat's kernel). Here's
> > /proc/interrupts:

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-01-06 19:00:51

by Daniel Blueman

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

Even with HT turned off on this dual-Xeon box, all IRQs are routed to CPU 0.

Kernel here is the latest RedHat 2.4.18 one.

Just curious what kernel Avery is running...

Dan

> On 4 Jan 2003, Daniel Blueman wrote:
>
> > It's interesting you have IRQs balanced over the two logical
> > processors. I can't get this on HT Xeons with stock RedHat 7.3 kernel.
>
> I think he's using two physical processors, if by "logical processors" you
> are thinking HT... I also recall he has HT off, but the original post
> isn't handy.
>
> >
> > Can you post the exact kernel version string, please?
> >
> > TIA,
> > Dan
> >
> > "Avery Fay" <[email protected]> wrote in message
>
news:<OF256CD297.9F92C038-ON85256CA3.006A4034-85256CA3.00705DEA@symantec.com>...
> > > Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load
> balancing as
> > > shown below (seems to be applied to Red Hat's kernel). Here's
> > > /proc/interrupts:
>
> --
> bill davidsen <[email protected]>
> CTO, TMR Associates, Inc
> Doing interesting things with little computers since 1979.
>

--
Daniel J Blueman


2003-01-06 19:17:57

by Brian Tinsley

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

I've been able to distribute IRQ servicing to other processors on P4
Xeon HT systems as described in the IRQ-affinity.txt file in the
kernel-source Documentation directory. Well, it shows up as doing so in
/proc/interrupts anyway! Looks like CPUs 0, 2, 4, etc.. are the real
processors and 1,3,5, etc.. are the logical processors (which do not
handle interrupts).


Daniel Blueman wrote:

>Even with HT turned off on this dual-Xeon box, all IRQs are routed to CPU 0.
>
>Kernel here is the latest RedHat 2.4.18 one.
>
>Just curious what kernel Avery is running...
>
>Dan
>
>>On 4 Jan 2003, Daniel Blueman wrote:
>>
>>>It's interesting you have IRQs balanced over the two logical
>>>processors. I can't get this on HT Xeons with stock RedHat 7.3 kernel.
>>
>>I think he's using two physical processors, if by "logical processors" you
>>are thinking HT... I also recall he has HT off, but the original post
>>isn't handy.
>>
>>>Can you post the exact kernel version string, please?
>>>
>>>TIA,
>>> Dan
>>>
>>>"Avery Fay" <[email protected]> wrote in message
>>>news:<OF256CD297.9F92C038-ON85256CA3.006A4034-85256CA3.00705DEA@symantec.com>...
>>>>Dual Pentium 4 Xeon at 2.4 Ghz. I believe I am using irq load balancing as
>>>>shown below (seems to be applied to Red Hat's kernel). Here's
>>>>/proc/interrupts:
>>
>>--
>>bill davidsen <[email protected]>
>> CTO, TMR Associates, Inc
>>Doing interesting things with little computers since 1979.

--

-[========================]-
-[ Brian Tinsley ]-
-[ Chief Systems Engineer ]-
-[ Emageon ]-
-[========================]-




2003-01-06 19:34:16

by Jon Fraser

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem



What is your packet size? How many packets/second are you forwarding?

I did a lot of testing on 2.4.18 and 2.4.20 kernels with a couple of different
hardware platforms, using 82543 and 82544 chipsets. Cache
contention/invalidates due to locks, counters, and ring buffer access become
the bottleneck. I actually verified the stats using the cpu performance
counters. As traffic goes up, cache invalidates increase and useful cpu
cycles decrease.

I found I was best off binding the interrupts for each gig-e chip to a
different processor. That way, only one cpu is accessing the data structures
for that interface. You also do not suffer from packet reordering if you bind
the interrupts.

Also, be sure that you have the latest e1000 driver. If the driver is
refilling the ring buffer from a tasklet, find a later driver.

Play with the rx interrupt delay until you minimize the interrupts, if
you're not using NAPI. Be aware that earlier Intel chipsets have some
problems. I believe 82543 and earlier have unreliable rx interrupt delay and
can't use more than 256 ring buffers.

I don't have my numbers handy, but I believe I was able to achieve around
400 kpps, 64 byte size, with a dual cpu Dell box with, I believe, 1 GHz cpus.

By the way, your performance won't scale linearly with cpu speed. We had a
2.4 GHz dual HT cpu box from Intel for a bit, and it didn't run that much
faster.

You may want to search the archives for [email protected] for some work
being done on skbuff recycling. I did some work along those lines, avoiding
constantly allocating and freeing memory, and it made quite a difference.
It's been a month since I last looked, so there may be more progress.

If you happen to turn on vlans, I'd be curious about your results. Our
chipsets produced Cisco ISL frames instead of 802.1q frames. Intel admitted
the chipset would do it, but 'shouldn't be doing that...'

Jon

----- Original Message -----
From: "Avery Fay" <[email protected]>
To: <[email protected]>
Sent: Friday, January 03, 2003 11:12 AM
Subject: Gigabit/SMP performance problem


> Hello,
>
> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):
>
> 1.) with smp I would get ~2.9 gb/s but the combined cpu utilization would
> be that of one cpu at 90%.
> 2.) or with smp I would get more than ~2.9 gb/s.
>
> Has anyone been able to utilize more than one cpu with pure forwarding?
>
> Note: I realize that I am not using a stock kernel. I was in the past, but
> I ran into the same problem (smp not improving performance), just at lower
> speeds (redhat's kernel was faster). Therefore, this problem is neither
> introduced nor solved by redhat's kernel. If anyone has suggestions for
> improvements, I can move back to a stock kernel.
>
> Note #2: I've tried tweaking a lot of different things including binding
> irq's to specific cpus, playing around with e1000 modules settings, etc.
>
> Thanks in advance and please CC me with any suggestions as I'm not
> subscribed to the list.
>
> Avery Fay
>
> P.S. Only got one response on the linux-net list so I'm posting here. One
> thing I did learn from that response is that redhat's kernel is faster
> because they use a napi version of the e1000 driver.

2003-01-06 21:17:53

by Avery Fay

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

Well, judging by the fact that a UP kernel can route more traffic (and
consequently more interrupts p/s) than an SMP kernel, I think that one cpu
can probably handle all of the interrupts. Really the issue I'm trying to
solve is not routing performance, but rather the fact that SMP routing
performance is worse while using twice the cpu time (2 cpu's at around 95%
vs. 1 at around 95%).

Avery Fay





"Martin J. Bligh" <[email protected]>
01/03/2003 04:36 PM


To: Avery Fay <[email protected]>
cc: [email protected]
Subject: Re: Gigabit/SMP performance problem


P3's distributed interrupts round-robin amongst cpus. P4's send
everything to CPU 0. If you put irq_balance on, it'll spread
them around, but any given interrupt is still only handled by
one CPU (as far as I understand the code). If you hammer one
adaptor, does that generate more interrupts than 1 cpu can handle?
(turn irq balance off by sticking a return at the top of balance_irq,
and hammer one link, see how much CPU power that burns).

M.



2003-01-06 21:13:37

by Avery Fay

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

Right now, I have 4 interfaces in and 4 interfaces out (ideal routing
setup). I'm using just shy of 1500 byte udp packets for testing.

I tried binding the irqs for each pair of interfaces to a cpu... so for
example, if eth0 is sending to eth2, they would be bound to the same cpu.
This seemed to improve performance a little, but I didn't get definite
numbers and it certainly wasn't much.

I'm currently playing around with UP kernels, but when I go back I'll
check out softnet_stat.

Avery Fay





Robert Olsson <[email protected]>
01/03/2003 04:20 PM


To: "Avery Fay" <[email protected]>
cc: [email protected]
Subject: Gigabit/SMP performance problem



Avery Fay writes:
>
> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):

Well you have a gigabit router :-)

How is your routing setup? Packet size?

Also, you'll never get increased performance for a single flow with SMP.
Aggregate performance is the best you can hope for. I've been fighting with
this for some time too.

You have some important data in /proc/net/softnet_stat, which is per-CPU:
the packets-received and "cpu collisions" counters should interest you.

As far as I understand, there is no serialization in the forwarding path
except where it has to be -- when we add softirqs from different CPUs into a
single device. This is seen in "cpu collisions".

Also, here we get into the inherent SMP cache bouncing problem with TX
interrupts, when TX holds skbs which were processed/created on different CPUs.
Which CPU is going to take the interrupt? No matter how we run kfree, we are
going to see a lot of cache bouncing. For systems that have the same in/out
interface, smp_affinity can be used. In practice this is impossible for
forwarding.

And this bouncing hurts especially for small packets....

A little TX test illustrates this. Sender on cpu0.

UP 186 kpps
SMP Aff to cpu0 160 kpps
SMP Aff to cpu0, cpu1 124 kpps
SMP Aff to cpu1 106 kpps

We are playing with some code that might reduce this problem.


Cheers.
--ro



2003-01-06 21:23:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

> Well, judging by the fact that a UP kernel can route more traffic (and
> consequently more interrupts p/s) than an SMP kernel, I think that one cpu

Umm ... what are you comparing here? How many CPUs on your SMP kernel?
If I have an 8 CPU machine, you think it can handle less traffic than
a 1-cpu machine running a UP kernel?

> can probably handle all of the interrupts. Really the issue I'm trying to
> solve is not routing performance, but rather the fact that SMP routing
> performance is worse while using twice the cpu time (2 cpu's at around 95%
> vs. 1 at around 95%).

Can you mail out kernel profiles? What's burning all the time here?
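For 2.4 kernels that typically means booting with the profile=2 option and
using readprofile; a rough sketch, assuming the matching System.map is in
/boot:

    # zero the counters, run the forwarding test for a while, then dump the top hits
    readprofile -r
    readprofile -m /boot/System.map | sort -nr | head -20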

Thanks,

M.

2003-01-06 21:21:50

by Avery Fay

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

The numbers I got are taking into account packet drops. I think that the
point where performance starts to go down is when an interface is dropping
more than a couple hundred packets per second (at least in my testing). In
my testing scenario, traffic is perfectly distributed across interfaces
and I have bound the irqs using smp_affinity. Unfortunately, the
performance gain is small if any.

Avery Fay





Andrew Theurer <[email protected]>
01/03/2003 05:31 PM


To: "Martin J. Bligh" <[email protected]>, Avery Fay <[email protected]>
cc: [email protected]
Subject: Re: Gigabit/SMP performance problem


On Friday 03 January 2003 15:36, Martin J. Bligh wrote:

...

Monitor for dropped packets when increasing int delay. At least on the older
e1000 adapters, you would get dropped packets, etc, making the problem worse
in other areas.
>
> Makes sense, increasing the delays should reduce the interrupt load.
>
> > I'm using 3 Intel PRO/1000 MT Dual Port Server adapters as well as 2
> > onboard Intel PRO/1000 ports. The adapters use the 82546EB chips. I
> > believe that the onboard ports use the same although I'm not sure.
> >
> > Should I get rid of IRQ load balancing? And what do you mean
> > "Intel broke the P4's interrupt routing"?
>
> P3's distributed interrupts round-robin amongst cpus. P4's send
> everything to CPU 0. If you put irq_balance on, it'll spread
> them around, but any given interrupt is still only handled by
> one CPU (as far as I understand the code). If you hammer one
> adaptor, does that generate more interrupts than 1 cpu can handle?
> (turn irq balance off by sticking a return at the top of balance_irq,
> and hammer one link, see how much CPU power that burns).

Another problem you may have is that irq_balance is random, and sometimes more
than one interrupt is serviced by the same cpu at the same time. Actually, let
me clarify. In your case, if your network load was "even" across the adapters,
ideally you would want cpu0 handling the first 4 adapters and cpu1 handling
the last 4 adapters. With irq_balance, this is usually not the case. There
will be times when one cpu is doing more work than the other, possibly
becoming a bottleneck.

Now, there was some code in SuSE's kernel (SuSE 8.0, 2.4.18) which did a
round-robin static assignment of interrupts to cpus. In your case, all even
interrupt numbers would go to cpu0 and all odd interrupt numbers would go to
cpu1. Since you have exactly 4 adapters on even interrupts and 4 on odd
interrupts, that would work perfectly. Now, that doesn't mean there isn't some
other problem, like PCI bandwidth, but it's a start. Also, you might be able
to emulate this with irq affinity (/proc/irq/<num>/smp_affinity), but the last
time I tried it on a P4, it didn't work at all - no interrupts!

-Andrew



2003-01-06 21:25:46

by Avery Fay

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

Hmm. That paper is actually very interesting. I'm thinking maybe with the
P4 I'm better off with only 1 cpu. WRT hyperthreading, I actually disabled
it because it made performance worse (wasn't clear in the original email).

Avery Fay





Anton Blanchard <[email protected]>
01/03/2003 10:33 PM


To: Avery Fay <[email protected]>
cc: [email protected]
Subject: Re: Gigabit/SMP performance problem



> I'm working with a dual xeon platform with 4 dual e1000 cards on different
> pci-x buses. I'm having trouble getting better performance with the second
> cpu enabled (ht disabled). With a UP kernel (redhat's 2.4.18), I can route
> about 2.9 gigabits/s at around 90% cpu utilization. With a SMP kernel
> (redhat's 2.4.18), I can route about 2.8 gigabits/s with both cpus at
> around 90% utilization. This suggests to me that the network code is
> serialized. I would expect one of two things from my understanding of the
> 2.4.x networking improvements (softirqs allowing execution on more than
> one cpu):

The Fujitsu guys have a nice summary of this:

http://www.labs.fujitsu.com/en/techinfo/linux/lse-0211/index.html

Skip forward to page 8.

Don't blame the networking code just yet :) Notice how much worse UP vs SMP
performance is on the P4 compared to the P3?

This brings up another point, is a single CPU with hyperthreading worth
it? As Rusty will tell you, you need to compare it with a UP kernel
since it avoids all the locking overhead. I suspect for a lot of cases
HT will be a loss (imagine your case, comparing UP and one CPU HT)

Anton



2003-01-07 17:10:49

by Mike Black

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem

I just saw an article that might be of some help to all you gigabit hackers....
http://www.nwfusion.com/news/tech/2003/0106techupdate.html

2003-01-07 17:59:09

by Robert Olsson

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem


Avery Fay writes:
> Hmm. That paper is actually very interesting. I'm thinking maybe with the
> P4 I'm better off with only 1 cpu. WRT hyperthreading, I actually disabled
> it because it made performance worse (wasn't clear in the original email).


With 1CPU-SMP-HT I'm at UP level of performance, this with forwarding two
single flows evenly distributed between CPUs. So HT paid the SMP cost, so
to say.

Also, I tested the MB bandwidth with the new threaded version of pktgen; just
TX'ing packets on 6 GigE NICs I'm seeing almost 6 Gbit/s TX'ed with 1500 byte
packets.

I have problems populating all slots with GigE NICs. WoL (Wake on LAN) is
a real pain... Seems like my adapters need a standby current of 0.8A and
most power supplies give 2.0A for this. (Numbers come from SuperMicro.)
So booting fails randomly. You have 8 NICs -- didn't you have problems?

Anyway, I guess profiling is needed?

Cheers.
--ro

2003-01-08 12:09:58

by Jon Burgess

[permalink] [raw]
Subject: Re: Gigabit/SMP performance problem



Avery Fay wrote:
> can probably handle all of the interrupts. Really the issue I'm
> trying to solve is not routing performance, but rather the fact
> that SMP routing performance is worse while using twice
> the cpu time (2 cpu's at around 95% vs. 1 at around 95%).

Please forgive me if this is a silly suggestion, but are you sure this is a real
95% utilisation in the 2 CPU case? I think some versions of top show 0..200% for
the 2 CPU case, and therefore 95% utilisation represents a real CPU
utilisation of 47.5%.

Jon


2003-01-08 21:04:22

by Feldman, Scott

[permalink] [raw]
Subject: RE: Gigabit/SMP performance problem

> If you happen to turn on vlans, I be curious about your
> results. Our chipsets produced cisco ISL frames instead of
> 802.1q frames. Intel admitted the chipset would do it, but
> 'shouldn't be doing that...'

This problem has been fixed in tot 2.4 and 2.5. The VLANs were not
being restored after ifup.

-scott

2003-01-08 22:53:00

by Ronciak, John

[permalink] [raw]
Subject: RE: Gigabit/SMP performance problem

All,

We (Intel - LAN Access Division, e1000 driver) are taking a look at what is
going on here. We don't have any data yet but we'll keep you posted on what
we find.

Thanks for your patience.

Cheers,
John



> -----Original Message-----
> From: Robert Olsson [mailto:[email protected]]
> Sent: Tuesday, January 07, 2003 10:16 AM
> To: Avery Fay
> Cc: Anton Blanchard; [email protected]
> Subject: Re: Gigabit/SMP performance problem
>
>
>
> Avery Fay writes:
> > Hmm. That paper is actually very interesting. I'm thinking maybe with the
> > P4 I'm better off with only 1 cpu. WRT hyperthreading, I actually disabled
> > it because it made performance worse (wasn't clear in the original email).
>
> With 1CPU-SMP-HT I'm at UP level of performance, this with forwarding two
> single flows evenly distributed between CPUs. So HT paid the SMP cost, so
> to say.
>
> Also, I tested the MB bandwidth with the new threaded version of pktgen;
> just TX'ing packets on 6 GigE NICs I'm seeing almost 6 Gbit/s TX'ed with
> 1500 byte packets.
>
> I have problems populating all slots with GigE NICs. WoL (Wake on LAN) is
> a real pain... Seems like my adapters need a standby current of 0.8A and
> most power supplies give 2.0A for this. (Numbers come from SuperMicro.)
> So booting fails randomly. You have 8 NICs -- didn't you have problems?
>
> Anyway, I guess profiling is needed?
>
> Cheers.
> --ro

2003-01-09 12:32:42

by Robert Olsson

[permalink] [raw]
Subject: RE: Gigabit/SMP performance problem


Ronciak, John writes:
> All,
>
> We (Intel - LAN Access Division, e1000 driver) are taking a look at what is
> going on here. We don't have any data yet but we'll keep you posted on what
> we find.

Thanks.
FYI, SuperMicro reported they added a new MB jumper to disable standby power
in order to get systems to boot. I don't think "driver" operation was verified.

Cheers.
--ro