2003-01-08 02:41:43

by Kamble, Nitin A

Subject: [2.5] IRQ distribution in the 2.5.52 kernel

Hello All,

We were looking at the performance impact of the IRQ routing in the
2.5.52 Linux kernel. This email includes some of our findings about
the way interrupts are moved in the 2.5.52 kernel, along with a
discussion of, and a patch for, a new implementation. Let me know
what you think at [email protected]

Current implementation:
======================
We have found that the existing implementation works well on IA32
SMP systems under a light interrupt load, but that it does not work
as well under heavy interrupt load on these systems. The
observations are:

* The interrupt load of each IRQ is balanced across CPUs
independently of the load of the other IRQs, and the current
implementation moves the IRQs around randomly. This works well when
the interrupt load is light, but we start seeing an imbalance of the
interrupt load as soon as multiple heavy interrupt sources are
present. Frequently, multiple heavily loaded IRQs get moved to a
single CPU while other CPUs stay very lightly loaded. To achieve a
good interrupt load balance, it is important to consider the load of
all the interrupts together.
This can be explained further with an example of 4 CPUs and 4
heavy interrupt sources. With the existing random movement approach,
the chance of these heavy interrupt sources all landing on separate
CPUs is (4/4)*(3/4)*(2/4)*(1/4) = 3/32. That means about 29/32 =
90.6% of the time some CPUs are very lightly loaded while others
carry multiple heavy interrupts. This interrupt load imbalance
results in lower performance. In the case of 2 CPUs and 2 heavily
loaded interrupt sources, the imbalance happens 1/2 = 50% of the
time (see the short calculation below). The issue becomes more and
more severe as the number of heavy interrupt sources increases.
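
For reference, the figures above can be reproduced with a small
stand-alone calculation. This is plain user-space C, not part of the
patch:

#include <stdio.h>

/*
 * Probability that n heavy IRQs, each placed independently and uniformly
 * at random, all end up on distinct CPUs of an n-CPU system: n!/n^n.
 * For n = 2 this is 1/2; for n = 4 it is 3/32 (about 9.4%).
 */
static double prob_all_distinct(int n)
{
    double p = 1.0;
    int i;

    for (i = 0; i < n; i++)
        p *= (double)(n - i) / n;   /* i-th IRQ must avoid the i CPUs already taken */

    return p;
}

int main(void)
{
    int n;

    for (n = 2; n <= 8; n++)
        printf("%d heavy IRQs on %d CPUs: balanced %.1f%%, imbalanced %.1f%% of the time\n",
               n, n, 100.0 * prob_all_distinct(n),
               100.0 * (1.0 - prob_all_distinct(n)));
    return 0;
}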

* Another interesting observation is that we cannot see this
imbalance of the interrupt load from /proc/interrupts.
(/proc/interrupts shows the cumulative count of interrupts per CPU.)
If the interrupt load is imbalanced but that imbalance is rotated
among the CPUs continuously, then /proc/interrupts will still show
the interrupt load going to the processors very evenly. At the
frequency (HZ/50) at which IRQs are currently moved across CPUs, no
interrupt load imbalance is visible there; the toy example below
illustrates this.
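
Here is a toy user-space illustration (not kernel code) of why the
cumulative counters hide the problem: one heavy IRQ rotated among 4
CPUs every tick produces identical per-CPU totals, even though at any
instant a single CPU carries the whole load.

#include <stdio.h>

#define NR_CPUS  4
#define TICKS    1000
#define IRQ_RATE 5000    /* interrupts per tick from one heavy source */

int main(void)
{
    unsigned long cumulative[NR_CPUS] = { 0 };
    int tick, cpu;

    for (tick = 0; tick < TICKS; tick++) {
        cpu = tick % NR_CPUS;           /* the heavy IRQ is rotated every tick */
        cumulative[cpu] += IRQ_RATE;    /* that one CPU absorbs the whole load */
    }

    /* All four counters read the same, which is all /proc/interrupts shows. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        printf("CPU%d: %lu\n", cpu, cumulative[cpu]);

    return 0;
}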

* We have also found that, in certain cases, static IRQ binding
performs better than the kernel's existing distribution of the
interrupt load. The reason is that in a well-balanced interrupt load
situation, these interrupts are still moved across CPUs frequently
and unnecessarily. This adds extra overhead and also gives up the
benefit of CPU cache warmth.
This came out of performance measurements done on a 4-way HT
(8 logical processors) Pentium 4 Xeon system running 8 copies of
netperf. The 4 NICs in the system, taking different IRQs, generated
a sizable interrupt load with the help of the connected clients.

Here the netperf transactions/sec throughput numbers observed are:

IRQs nicely manually bound to CPUs: 56.20K
The current kernel implementation of IRQ movement: 50.05K
-----------------------
The static binding of IRQs has performed 12.28% better than the
current IRQ movement implemented in the kernel.
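
For reference, the "nicely manually bound" case above can be set up
through the /proc/irq/{n}/smp_affinity interface described later in
this mail. The sketch below is only illustrative; the IRQ numbers are
made up, and each mask is a hexadecimal CPU bitmask:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical IRQ numbers of the four NICs; check /proc/interrupts first. */
static const int   nic_irq[]  = { 16, 17, 18, 19 };
/* Hexadecimal CPU bitmasks: one distinct logical CPU per NIC (CPU0..CPU3). */
static const char *cpu_mask[] = { "1", "2", "4", "8" };

int main(void)
{
    char path[64];
    int i;

    for (i = 0; i < 4; i++) {
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", nic_irq[i]);
        f = fopen(path, "w");
        if (!f) {
            perror(path);
            return EXIT_FAILURE;
        }
        fprintf(f, "%s\n", cpu_mask[i]);    /* bind this IRQ to one CPU */
        fclose(f);
    }
    return 0;
}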

* The current implementation does not distinguish the sibling CPUs
of HT (Hyper-Threading(tm)) enabled processors. It would be
beneficial to balance the interrupt load across processor packages
first, and then among the logical CPUs inside each processor
package.
For example, if we have 2 heavy interrupt sources and 2 processor
packages (4 logical CPUs), assigning the two heavy interrupt sources
to different processor packages is better, because they then use
execution resources from different processor packages (a sketch of
such a selection follows below).
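
A minimal sketch of that package-first selection, assuming a made-up
topology table and load counters (these are not the data structures
in the patch):

#include <stdio.h>

#define NR_CPUS     8
#define NR_PACKAGES 4

/* Logical-CPU -> physical-package map for a 4-way HT box (assumed layout). */
static const int     cpu_to_package[NR_CPUS] = { 0, 0, 1, 1, 2, 2, 3, 3 };
static unsigned long cpu_irq_load[NR_CPUS];         /* per logical CPU */
static unsigned long package_irq_load[NR_PACKAGES]; /* per package     */

/* Pick the least-loaded package first, then the least-loaded sibling in it. */
static int pick_target_cpu(void)
{
    int p, cpu, pkg = 0, best_cpu = -1;

    for (p = 1; p < NR_PACKAGES; p++)
        if (package_irq_load[p] < package_irq_load[pkg])
            pkg = p;

    for (cpu = 0; cpu < NR_CPUS; cpu++) {
        if (cpu_to_package[cpu] != pkg)
            continue;
        if (best_cpu < 0 || cpu_irq_load[cpu] < cpu_irq_load[best_cpu])
            best_cpu = cpu;
    }
    return best_cpu;
}

int main(void)
{
    /* Example: package 0 is already loaded; the other packages are idle,
     * so the next heavy IRQ goes to a CPU in one of them. */
    cpu_irq_load[0] = 9000;
    cpu_irq_load[1] = 8000;
    package_irq_load[0] = 17000;

    printf("next heavy IRQ -> CPU%d\n", pick_target_cpu());
    return 0;
}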



New revised implementation:
==========================
We have also been working on a new implementation, with the
following points as the main focus:

* At any moment heavily loaded IRQs are distributed to different
CPUs to achieve as much balance as possible.

* Lightly loaded interrupt sources are left out of the load
balancing, as they do not cause considerable imbalance.

* Once the heavy interrupt sources are balanced, they are not moved
around. This also helps keep the CPU caches warm.

* It has been made HT aware. While distributing the load, the load
on the processor package to which the logical CPUs belong is also
considered.

* When there are only a few (fewer than num_cpus) heavy interrupt
sources, it is not possible to balance them evenly. In that case the
existing code has been reused to move the interrupts, with the
randomness of the original code removed.

* The time interval for redistribution has been made flexible. It
varies as the system interrupt load changes.

* A new kernel_thread is introduced to do the load balancing
calculations for all the interrupt sources. It keeps the balance
maps ready for the interrupt handlers, keeping the overhead in the
interrupt handling path to a minimum. (A simplified sketch of this
loop follows this list.)

* It allows the IRQ distribution to be disabled from the boot loader
command line, should anybody want to do that for any reason.

* The algorithm also takes into account the static binding of
interrupts to CPUs that the user imposes through the
/proc/irq/{n}/smp_affinity interface.
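
To tie these points together, here is a heavily simplified,
user-space-style skeleton of the balancing loop described above. The
names, thresholds and structures are illustrative assumptions only;
they are not the code in the attached patch, and the actual placement
step is left as a placeholder comment.

#include <time.h>

#define NR_IRQS          16
#define HEAVY_THRESHOLD  1000UL   /* sources below this rate are ignored     */
#define MIN_INTERVAL_MS  200U     /* rebalance often under heavy load        */
#define MAX_INTERVAL_MS  2000U    /* back off when the system is quiet       */

struct irq_stats {
    unsigned long count_delta;    /* interrupts seen since the last pass     */
    unsigned long user_affinity;  /* mask set via /proc/irq/{n}/smp_affinity */
    unsigned long balance_map;    /* mask the interrupt handler will pick up */
};

static struct irq_stats irqs[NR_IRQS];

static void sleep_ms(unsigned int ms)
{
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* One balancing pass: classify the sources, rebalance only what needs it. */
static unsigned long balance_pass(void)
{
    unsigned long heavy_load = 0;
    int i;

    for (i = 0; i < NR_IRQS; i++) {
        if (irqs[i].count_delta < HEAVY_THRESHOLD)
            continue;                    /* lightly loaded: leave it alone */
        heavy_load += irqs[i].count_delta;
        /*
         * Placeholder for the real work: if the heavy IRQs are already
         * spread evenly, leave balance_map untouched (caches stay warm);
         * otherwise recompute it, packages first, then logical CPUs,
         * always intersected with the user-imposed user_affinity mask.
         */
    }
    return heavy_load;
}

int main(void)
{
    unsigned int interval_ms;

    for (;;) {
        /* The redistribution interval adapts to the system interrupt load. */
        interval_ms = balance_pass() ? MIN_INTERVAL_MS : MAX_INTERVAL_MS;
        sleep_ms(interval_ms);
    }
    return 0;
}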


Throughput numbers with the netperf setup for the new implementation:

Current kernel IRQ balance implementation: 50.02K transactions/sec
The new IRQ balance implementation: 56.01K transactions/sec
---------------------
A performance improvement of 11.9% is observed on the P4 Xeon.

The new IRQ balance implementation also shows a small performance
improvement on P6 (Pentium II, III) systems.

On a P6 system the netperf throughput numbers are:
Current kernel IRQ balance implementation: 36.96K transactions/sec
The new IRQ balance implementation: 37.65K transactions/sec
---------------------
Here a performance improvement of about 2% is observed on the P6 system.


Thanks & Regards,
Nitin


2003-01-09 16:10:12

by Andrew Theurer

Subject: Re: [2.5] IRQ distribution in the 2.5.52 kernel

On Tuesday 07 January 2003 20:50, Kamble, Nitin A wrote:
> Hello All,
>
> We were looking at the performance impact of the IRQ routing in the
> 2.5.52 Linux kernel. This email includes some of our findings about
> the way interrupts are moved in the 2.5.52 kernel, along with a
> discussion of, and a patch for, a new implementation. Let me know
> what you think at [email protected]

Nitin,

I got a chance to run the NetBench benchmark with your patch on the
2.5.54-mjb2 kernel. NetBench measures SMB/CIFS performance by using
several SMB clients (in this case 44 Windows 2000 systems) sending SMB
requests to a Linux server running Samba 2.2.3a+sendfile. The result is
throughput in Mbps. Generally the network traffic on the server is 60%
recv, 40% tx.

I believe we have very similar systems. Mine is a 4 x 1.6 GHz, 1 MB L3 P4
Xeon with 4 GB DDR memory (3.2 GB/sec I believe). The chipset is "Summit".
I also have more than one Intel e1000 adapter.

I decided to run a few configurations, first with just one adapter,
with and without HT support in the kernel (acpi=off), then added
another adapter and tested again with/without HT.

Here are the results:

4P, no HT, 1 x e1000, no kirq: 1214 Mbps, 4% idle
4P, no HT, 1 x e1000, kirq: 1223 Mbps, 4% idle, +0.74%

I suppose we didn't see much of an improvement here because we never
ran into the situation where more than one interrupt with a high rate
is routed to a single CPU with irq_balance.

4P, HT, 1 x e1000, no kirq: 1214 Mbps, 25% idle
4P, HT, 1 x e1000, kirq: 1220 Mbps, 30% idle, +0.49%

Again, not much of a difference just yet, but lots of idle time. We may have
reached the limit at which one logical CPU can process interrupts for an
e1000 adapter. There are other things I can probably do to help this, like
int delay, and NAPI, which I will get to eventually.

4P, HT, 2 x e1000, no kirq: 1269 Mbps, 23% idle
4P, HT, 2 x e1000, kirq: 1329 Mbps, 18% idle +4.7%

OK, almost 5% better! This probably has to do with a couple of
things: the fact that your code does not route two different
interrupts to the same core/different logical CPUs (quite obvious
from looking at /proc/interrupts), and that more than one interrupt
does not go to the same CPU if possible. I suspect irq_balance did
some of those [bad] things some of the time, and we observed a
bottleneck in int processing that was lower than with kirq.

I don't think all of the idle time is because of an int processing
bottleneck. I'm just not sure what it is yet :) Hopefully something
will become obvious to me...

Overall I like the way it works, and I believe it can be tweaked to work with
NUMA when necessary. I hope to have access to a specweb system on a NUMA box
soon, so we can verify that.

-Andrew Theurer


2003-01-09 21:49:15

by Kamble, Nitin A

Subject: RE: [2.5] IRQ distribution in the 2.5.52 kernel

Hi Andrew,
Your benchmark results are very impressive. Thanks for trying it out.
I have some thoughts after seeing the results.

> Nitin,
>
> I got a chance to run the NetBench benchmark with your patch on the
> 2.5.54-mjb2 kernel. NetBench measures SMB/CIFS performance by using
> several SMB clients (in this case 44 Windows 2000 systems) sending SMB
> requests to a Linux server running Samba 2.2.3a+sendfile. The result is
> throughput in Mbps. Generally the network traffic on the server is 60%
> recv, 40% tx.
>
> I believe we have very similar systems. Mine is a 4 x 1.6 GHz, 1 MB L3
> P4 Xeon with 4 GB DDR memory (3.2 GB/sec I believe). The chipset is
> "Summit". I also have more than one Intel e1000 adapter.
>
> I decided to run a few configurations, first with just one adapter,
> with and without HT support in the kernel (acpi=off), then added
> another adapter and tested again with/without HT.
>
> Here are the results:
>
> 4P, no HT, 1 x e1000, no kirq: 1214 Mbps, 4% idle
> 4P, no HT, 1 x e1000, kirq:    1223 Mbps, 4% idle, +0.74%
[NK] It is surprising to see a single e1000 giving more than 1 Gbps
of bandwidth. What can be the reason for this extra bandwidth? ...
Maybe compression is happening somewhere.

>
> I suppose we didn't see much of an improvement here because we never
> ran into the situation where more than one interrupt with a high rate
> is routed to a single CPU with irq_balance.
>
> 4P, HT, 1 x e1000, no kirq: 1214 Mbps, 25% idle
> 4P, HT, 1 x e1000, kirq:    1220 Mbps, 30% idle, +0.49%
>
> Again, not much of a difference just yet, but lots of idle time. We may
> have reached the limit at which one logical CPU can process interrupts
> for an e1000 adapter. There are other things I can probably do to help
> this, like int delay, and NAPI, which I will get to eventually.
>
> 4P, HT, 2 x e1000, no kirq: 1269 Mbps, 23% idle
> 4P, HT, 2 x e1000, kirq:    1329 Mbps, 18% idle, +4.7%
[NK] It could be the case that the throughput is limited by the
network infrastructure or by the total load the clients can generate.
If we knew the theoretical maximum expected throughput, we would get
a better idea of where the bottleneck is. It would be interesting to
see the results after adding one more e1000 card to the server.

>
> OK, almost 5% better!
[NK] It's a pretty good number!

> This probably has to do with a couple of things: the fact that your
> code does not route two different interrupts to the same
> core/different logical CPUs (quite obvious from looking at
> /proc/interrupts), and that more than one interrupt does not go to the
> same CPU if possible. I suspect irq_balance did some of those [bad]
> things some of the time, and we observed a bottleneck in int processing
> that was lower than with kirq.
>
> I don't think all of the idle time is because of an int processing
> bottleneck. I'm just not sure what it is yet :) Hopefully something
> will become obvious to me...
>
> Overall I like the way it works, and I believe it can be tweaked to
> work with NUMA when necessary.
[NK] I also believe so.

> I hope to have access to a specweb system on a NUMA box soon, so we
> can verify that.
>
> -Andrew Theurer
[NK]
Thanks & regards,
Nitin

2003-01-09 22:02:19

by Andrew Theurer

Subject: Re: [2.5] IRQ distribution in the 2.5.52 kernel

<snip>
> > test again with/without HT.
> >
> > Here are the results:
> >
> > 4P, no HT, 1 x e1000, no kirq: 1214 Mbps, 4% idle
> > 4P, no HT, 1 x e1000, kirq:    1223 Mbps, 4% idle, +0.74%
> [NK] It is surprising to see a single e1000 giving more than 1 Gbps
> of bandwidth. What can be the reason for this extra bandwidth? ...
> Maybe compression is happening somewhere.

Full duplex. I suppose the theoretical full throughput is 2 Gbps. Sar
reported about 1174 Mb/sec with one adapter on one of the results
above, and it was 454 recv / 720 tx (I had the percentages incorrectly
swapped in the previous email). This is still with an MTU of 1500!

> > I suppose we didn't see much of an improvement here because we
> > never ran into the situation where more than one interrupt with a
> > high rate is routed to a single CPU with irq_balance.
> >
> > 4P, HT, 1 x e1000, no kirq: 1214 Mbps, 25% idle
> > 4P, HT, 1 x e1000, kirq:    1220 Mbps, 30% idle, +0.49%
> >
> > 4P, HT, 2 x e1000, no kirq: 1269 Mbps, 23% idle
> > 4P, HT, 2 x e1000, kirq:    1329 Mbps, 18% idle, +4.7%
>
> [NK] It could be the case that the throughput is limited by the
> network infrastructure or by the total load the clients can generate.
> If we knew the theoretical maximum expected throughput, we would get
> a better idea of where the bottleneck is. It would be interesting to
> see the results after adding one more e1000 card to the server.

It occurred to me later that the answer was obvious, and it is the one
you mentioned: clients. I originally had enough clients to reach 1000
Mbps, but I'm pretty sure 44 clients will not cut it for NetBench at
around 1500 Mbps (where this hopefully will end up). NetBench throttles
the clients, so I really can't drive them much harder. There is an
option to simulate more than one client per computer, but I have had
trouble with that in the past; I am going to give it one more try.
>
> > OK, almost 5% better!
>
> [NK] It's a pretty good number!