Hi,
This patch does the following:
* Reduces bouncing of IRQs between CPUs
If an IRQ has already been moved, check whether a further move makes sense
(prevents bouncing back and forth)
* Brings interrupts back to IRQ_PRIMARY_CPU (default = 0)
If the interrupt rate drops, the IRQ is routed back to the default CPU
* Introduces a desc_interrupt[irq].processor value
This is needed to decide which IRQs have to be routed back to the default CPU.
* Adds visualization of the desc_interrupt[irq].processor value in
'/proc/interrupts'
* If fewer than 2 CPUs are online at boot time, kirqd is not started
In any case, the rest of the logic would not be able to recognize CPUs
added at a later time.
* Fixes timer_irq_works(), which used an 'unsigned int' to store the jiffies value
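As a rough illustration of the route-back rule and the two-CPU startup check described above, here is a minimal user-space sketch in Python. It is not the patch's code: the threshold value, the function names, and the dictionary standing in for desc_interrupt[irq].processor are all assumptions made for this example.

```python
IRQ_PRIMARY_CPU = 0        # default named in the patch description
LOW_RATE_THRESHOLD = 100   # hypothetical: interrupts/sec below which an IRQ counts as quiet


def should_start_kirqd(online_cpus):
    """Balancing is only useful with at least two CPUs online."""
    return online_cpus >= 2


def route_back_quiet_irqs(processor, rates, threshold=LOW_RATE_THRESHOLD):
    """Route quiet IRQs back to IRQ_PRIMARY_CPU.

    processor: dict irq -> CPU currently handling it (a stand-in for
               desc_interrupt[irq].processor); updated in place
    rates:     dict irq -> recent interrupts per second
    Returns the IRQs that were moved, in ascending order.
    """
    moved = []
    for irq in sorted(processor):
        if processor[irq] != IRQ_PRIMARY_CPU and rates.get(irq, 0) < threshold:
            processor[irq] = IRQ_PRIMARY_CPU
            moved.append(irq)
    return moved
```

Recording the current CPU per IRQ is what makes the second function possible: without it there is no way to tell which IRQs have left the primary CPU.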
Would it be possible for you to test Arjan's irqbalance daemon?
We believe it is a superior solution to in-kernel irq balancing, but
also, can be safely used in addition to in-kernel irq balancing.
(we just have not run benchmarks to prove this yet :))
http://people.redhat.com/arjanv/irqbalance/
This userspace solution is shipping with current Red Hat, and is
portable to non-ia32 architectures.
Jeff
Hi Andrew, Kai,
The bouncing is seen because of the round robin IRQ distribution in some
particular cases. In some cases (such as a single heavy interrupt source in
a 2-way SMP system), binding heavy interrupt sources to different CPUs is not
going to remove the complete imbalance. In that case we fall back to Ingo's
round robin approach. We have studied the previous round robin interrupt
distribution implemented in the kernel, and we found that, at very high
interrupt rates, the performance of the system increased with the increasing
period of the round robin distribution. Please see the original LKML posting
for more details.
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0212.2/1122.html
So if there is significant imbalance left after binding the IRQs to CPUs,
there are two options:
1. Do not move the IRQs around. Let the significant imbalance stick to a
particular CPU.
2. Or move the heavy imbalance around all the CPUs in round robin
fashion at a high rate.
We could also make the choice between the two options configurable in the
kernel.
Both solutions will eliminate the bouncing behavior. The current
implementation is based on option 2, with the only difference being a
lower rate of distribution (5 sec). The optimal option is workload
dependent. With a static and heavy interrupt load, option 2 looks
better, while with a random interrupt load option 1 is good enough.
Thanks & Regards,
Nitin
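The two options above can be modelled in a few lines of Python. This is only a sketch of the idea, not the kirqd implementation: the function names are hypothetical, and the only number taken from the message is the 5 second period.

```python
ROTATION_PERIOD_SEC = 5  # period mentioned in the message above


def sticky_target(current_cpu):
    """Option 1: leave the residual imbalance on the CPU it is on."""
    return current_cpu


def round_robin_target(now_sec, online_cpus, period_sec=ROTATION_PERIOD_SEC):
    """Option 2: rotate the heavy IRQ across all online CPUs.

    The IRQ stays on one CPU for period_sec seconds, then moves on to
    the next CPU in the list, wrapping around at the end.
    """
    if not online_cpus:
        raise ValueError("need at least one online CPU")
    slot = int(now_sec // period_sec)
    return online_cpus[slot % len(online_cpus)]
```

With two CPUs, the heavy IRQ sits on CPU 0 for seconds 0 to 4, on CPU 1 for seconds 5 to 9, and so on. A longer period means fewer migrations, which matches the observation that performance improved as the round robin period increased.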
"Kamble, Nitin A" <[email protected]> wrote:
>
> Both the solutions will eliminate the bouncing behavior. The current
> implementation is based on the option 2, with the only difference of
> lower rate of distribution (5 sec). The optimal option is workload
> dependant. With static and heavy interrupt load, the option 2 looks
> better, while with random interrupt load the option 1 is good enough.
OK, thanks.
Now there has been some discussion as to whether these algorithmic decisions
can be moved out of the kernel altogether. And with periods of one and five
seconds that does appear to be feasible.
I believe that you have looked at this before and encountered some problem
with it. Could you please describe what happened there?
There are a few issues we found with the user level daemon approach.
Static binding compatibility: With the user level daemon, users cannot
use the /proc/irq/i/smp_affinity interface for the static binding of
interrupts.
There is some information which is only available in the kernel today.
Also, future implementations might need more kernel data. This is
important for interfaces such as NAPI, where interrupt handling changes
on the fly.
Thanks,
Nitin
> Now there has been some discussion as to whether these algorithmic
> decisions can be moved out of the kernel altogether. And with periods
> of one and five seconds that does appear to be feasible.
>
> I believe that you have looked at this before and encountered some
> problem with it. Could you please describe what happened there?
Kamble, Nitin A wrote:
> There are few issues we found with the user level daemon approach.
Thanks much for the response!
> Static binding compatibility: With the user level daemon, users cannot
> use the /proc/irq/i/smp_affinity interface for the static binding of
> interrupts.
Not terribly accurate: in "one-shot" mode, where the daemon balances
irqs once at startup, users can change smp_affinity all they want.
In the normal continuous-balance mode, it is quite easy to have the
daemon either (a) notice changes users make or (b) configure the daemon.
The daemon does not do (a) or (b) currently, but it is a simple change.
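For readers unfamiliar with the interface under discussion: /proc/irq/<irq>/smp_affinity holds a hexadecimal bitmask in which bit N stands for CPU N. The Python helpers below are a sketch for building and reading such masks on small systems. The helper names are hypothetical, and the sketch assumes a plain hex string without the comma-separated groups that large systems print.

```python
def cpus_to_affinity_mask(cpus):
    """Build the hex string for a set of CPU numbers, e.g. [0, 2] -> '5'."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_mask_to_cpus(mask_hex):
    """Decode a plain hex mask back into a sorted list of CPU numbers."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def affinity_path(irq):
    """Path of the per-IRQ affinity file."""
    return f"/proc/irq/{irq}/smp_affinity"
```

A daemon that wanted to notice manual changes, as in point (a) above, could remember the mask it last wrote and compare it with the decoded current value before rebalancing.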
> There is some information which is only available in the kernel today,
> Also the future implementation might need more kernel data. This is
> important for interfaces such as NAPI, where interrupts handling changes
> on the fly.
This depends on the information :) Some information that is useful for
balancing is only [easily] available from userspace. In-kernel
information may be easily exported through "sysfs", which is designed to
export in-kernel information.
Further, for NAPI and networking in general, it is recommended to bind
each NIC to a single interrupt, and never change that binding.
Delivering a single NIC's interrupts to multiple CPUs leads to a
noticeable performance loss. This is why some people complain that
their specific network setups are faster on a uniprocessor kernel than
an SMP kernel.
I have not examined interrupt delivery for other peripherals, such as
ATA or SCSI hosts, but for networking you definitely want to statically
bind each NIC's irqs to a separate CPU, and then not touch that binding.
Best regards, and thanks again for your valuable feedback,
Jeff
>
>
> 2. Or move the heavy imbalance around all the cpus in the round robin
> fashion at high rate.
>
>
>Both the solutions will eliminate the bouncing behavior. The current
>implementation is based on the option 2, with the only difference of
>lower rate of distribution (5 sec). The optimal option is workload
>dependant. With static and heavy interrupt load, the option 2 looks
>better, while with random interrupt load the option 1 is good enough.
>
>
>
Hi Nitin,
Thanks much for your response !
Are you really sure that option 2 looks better under a static and heavy
interrupt load?
If the load is generated by a few heavy sources (sources_count <
count(cpus)), why not distribute them (mostly) statically across the
available CPUs? What do you gain by rotating them round robin in
this case?
I think round robin only starts making sense if the number of heavy
sources is > the number of physical CPUs.
Kai
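The rule proposed in this message can be written down as a small decision function. This Python sketch is only one reading of the suggestion, with hypothetical names: heavy sources are bound one per CPU when they fit, and round robin is used only when there are more heavy sources than CPUs.

```python
def plan_distribution(heavy_irqs, cpus):
    """Return (policy, binding) for a list of heavy IRQs and a list of CPUs.

    policy is 'static' when every heavy IRQ can get its own CPU,
    otherwise 'round_robin'. binding maps irq -> cpu for the static
    case and is empty for the round robin case.
    """
    if not cpus:
        raise ValueError("need at least one CPU")
    if len(heavy_irqs) <= len(cpus):
        return "static", {irq: cpus[i] for i, irq in enumerate(heavy_irqs)}
    return "round_robin", {}
```

For example, two heavy sources on a 4-way box are bound to CPUs 0 and 1, while three heavy sources on a 2-way box fall through to rotation.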
[email protected] said:
> Further, for NAPI and networking in general, it is recommended to bind
> each NIC to a single interrupt, and never change that binding.
I assume you mean "bind each NIC interrupt to a single CPU" here. I've
done quite a lot of benchmarking on dual SMP that shows that for
high-load networking, you basically have two cases:
- the irq load is less than what can be handled by one CPU. This is the
case, for example, using a NAPI e1000 driver under any load on an SMP
machine faster than 1 GHz. Even with two e1000 cards under extreme load,
one CPU can run the interrupt handlers with cycles to spare (thanks
to NAPI). This config (all NIC interrupts on CPU0) is optimal as
long as that CPU doesn't become saturated. Trying to distribute the
interrupt load across multiple CPUs incurs measurable performance
loss, probably due to cache effects.
- the irq load is enough to livelock one CPU. It's easy for this to
happen with gigE NICs on a non-NAPI kernel, for example. In this
case, you're better off binding each heavy interrupt source to a
different CPU.
2.4's default behavior isn't optimal in either case.
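The two cases can be turned into a placement rule. The Python sketch below is an illustration under stated assumptions, not measured policy: load is expressed as a fraction of one CPU, the capacity of 1.0 and the function name are made up for the example, and the wrap-around when there are more sources than CPUs goes beyond what the two cases cover.

```python
def place_nic_irqs(irq_load, cpus, cpu_capacity=1.0):
    """Return a dict irq -> cpu.

    irq_load: dict irq -> estimated load as a fraction of one CPU.
    If the combined load fits on one CPU, everything goes to the first
    CPU (keeps caches warm). Otherwise each source gets a different
    CPU, heaviest first, wrapping around if sources outnumber CPUs.
    """
    if not cpus:
        raise ValueError("need at least one CPU")
    if sum(irq_load.values()) < cpu_capacity:
        return {irq: cpus[0] for irq in irq_load}
    ordered = sorted(irq_load, key=irq_load.get, reverse=True)
    return {irq: cpus[i % len(cpus)] for i, irq in enumerate(ordered)}
```

With two NICs at 30% and 40% of a CPU, both land on CPU 0. At 90% and 80%, they are split across CPU 0 and CPU 1.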
> Delivering a single NIC's interrupts to multiple CPUs leads to a
> noticeable performance loss. This is why some people complain that
> their specific network setups are faster on a uniprocessor kernel than
> an SMP kernel.
This is what I've seen as well. The good news is that you can pretty
much recapture the uniprocessor performance by binding all heavy
interrupt sources to one CPU, as long as that CPU can handle it. And any
modern machine with a NAPI kernel _can_ handle any realistic gigE load.
I should mention that these results are all measurements of gigabit
bridge performance, where every frame needs to be received on one NIC
and sent on the other. So there are obvious cache benefits to doing it
all on one CPU.
--
Jason Lunz Reflex Security
[email protected] http://www.reflexsecurity.com/
On Wed, 2003-03-05 at 05:21, Kamble, Nitin A wrote:
> There are few issues we found with the user level daemon approach.
>
> Static binding compatibility: With the user level daemon, users can
> not
> use the /proc/irq/i/smp_affinity interface for the static binding of
> interrupts.
no they can just write/change the config file, with a gui if needed
>
> There is some information which is only available in the kernel today,
there's also some information only available to userspace today that the
userspace daemon can and does use.
> Also the future implementation might need more kernel data. This is
> important for interfaces such as NAPI, where interrupts handling changes
> on the fly.
ehm. almost. but napi isn't it ....
and the userspace side can easily have a system vendor provided file
that represents all kinds of very specific system info about the numa
structure..... working with every kernel out there.
> -----Original Message-----
> From: Kai Bankett [mailto:[email protected]]
> Are you really sure that option 2 looks better on a static and heavy
> interrupt load ?
> If the load is generated by few heavy sources (sources_count <
> count(cpus)) why not distributed them (mostly) statically across the
> available cpus ? What gain do you have by rotating them round robin in
> this case ?
> I think round robin only starts making sense if the number of heavy
> sources is > number of physical cpus.
[NK] If there is no rotating around at all, then it is the same as
statically binding the IRQs to CPUs. And with the netstat benchmark,
kirq has performed about 10% better than nicely statically bound IRQs.
This happens because, after processing the interrupt, the benchmark
also has to do some processing, and if all the threads are doing the
processing at almost equal speed it gives better performance. If one
thread is faster and another is slower, the slower one slows down the
whole system.
Thanks,
Nitin
>
> Kai
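The "slowest thread sets the pace" argument above can be illustrated with a toy model in Python. This is purely an assumption-laden sketch: the work and interrupt costs are invented units, the function name is hypothetical, and the model ignores the cache and migration costs that other messages in this thread point to.

```python
def finish_time(work_per_cpu, irq_cost_per_cpu):
    """Time until the slowest CPU finishes its work plus its IRQ handling."""
    if len(work_per_cpu) != len(irq_cost_per_cpu):
        raise ValueError("need one IRQ cost per CPU")
    return max(w + i for w, i in zip(work_per_cpu, irq_cost_per_cpu))


# One heavy IRQ source on a 2-way system, equal benchmark work per CPU.
static_binding = finish_time([100, 100], [40, 0])   # all IRQ cost on CPU 0
rotated = finish_time([100, 100], [20, 20])         # IRQ cost shared over time
```

In this model the statically bound case finishes at 140 units and the rotated case at 120, because rotation spreads the interrupt cost so that no single thread lags behind. Whether that gain survives real cache effects depends on the workload.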
I think tuning for NUMA issues is different. The intention/scope of this patch was to provide efficient interrupt routing in software that would work for dual/SMP P4P based systems. Although we found this improved older systems as well, there was no need to do this earlier since it was done by the chipsets in the platform and software did not have to do anything.
Jun
> -----Original Message-----
> From: Arjan van de Ven [mailto:[email protected]]
> Sent: Wednesday, March 05, 2003 10:27 AM
> To: Kamble, Nitin A
> Cc: Andrew Morton; [email protected]; [email protected];
> [email protected]; Nakajima, Jun; Mallick, Asit K; Saxena, Sunil
> Subject: RE: [PATCH][IO_APIC] 2.5.63bk7 irq_balance improvments / bug-
> fixes
>
> On Wed, 2003-03-05 at 05:21, Kamble, Nitin A wrote:
> > There are few issues we found with the user level daemon approach.
> >
> > Static binding compatibility: With the user level daemon, users can
> > not
> > use the /proc/irq/i/smp_affinity interface for the static binding of
> > interrupts.
>
> no they can just write/change the config file, with a gui if needed
>
> >
> > There is some information which is only available in the kernel today,
>
> there's also some information only available to userspace today that the
> userspace daemon can and does use.
>
> > Also the future implementation might need more kernel data. This is
> > important for interfaces such as NAPI, where interrupts handling changes
> > on the fly.
>
> ehm. almost. but napi isn't it ....
>
> and the userspace side can easily have a system vendor provided file
> that represents all kinds of very specific system info about the numa
> structure..... working with every kernel out there.