Martin, a couple of days back I posted a kernel IRQ distribution patch with some discussion. There we tried to do the same things you're interested in here. We made the interval flexible and longer, and we removed the randomness from the algorithm.
Also, about fairness: the scheduler will not be able to solve the fairness issues caused by interrupts in all cases. For example, under very high interrupt load, some of the CPUs may become 100% busy just servicing interrupts, and there the scheduler cannot do anything. To get fairness, we need the interrupt distribution mechanism to move interrupts as required.
Maybe we can add some generic NUMA awareness to it. But I am not fully aware of how interrupt routing happens on various NUMA systems. If I can get this information, I can look into how we can add generic NUMA support to the new IRQ distribution code.
Thanks,
Nitin
-----Original Message-----
From: Martin J. Bligh [mailto:[email protected]]
Sent: Sunday, December 22, 2002 9:21 AM
To: Pallipadi, Venkatesh; William Lee Irwin III; Protasevich, Natalie
Cc: Christoph Hellwig; James Cleverdon; Linux Kernel; John Stultz;
Nakajima, Jun; Mallick, Asit K; Saxena, Sunil; Van Maren, Kevin; Andi
Kleen; Hubert Mantel; Kamble, Nitin A
Subject: RE: [PATCH][2.4] generic cluster APIC support for systems with
more than 8 CPUs
> I actually meant interrupt distribution (rather than interrupt routing).
> AFAIK, interrupt distribution right now assumes a flat logical setup and
> tries to distribute the interrupts, and it is disabled in clustered APIC
> mode. I was just thinking out loud about the changes the interrupt
> distribution code would need for systems using clustered APIC/physical
> mode (NUMAQ and non-NUMAQ).
Oh, you mean irq_balance? I'm happy to leave that turned off on NUMA-Q
until it does something less random than it does now. Getting some sort
of affinity for interrupts over a longer period is much more interesting
than providing pretty numbers under /proc/interrupts. Giving each of
the frequently used interrupts its own local CPU to process it would
be cool, but I see no purpose in continually moving them around. If you're
concerned about fairness, that's a scheduler problem to account for and
deal with, IMHO.
The provided topology functions should be able to do node_to_cpumask
and cpu_to_node mappings once that's sorted out. Treat each node as a
separate "system" and balance within that.
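Very roughly, something like this (a toy user-space model, not real
kernel code - node_to_cpumask/cpu_to_node are the topology functions
mentioned above, but the 4-cpus-per-node layout and pick_cpu_for_irq()
are made up just to illustrate balancing within a node):

#include <stdio.h>

#define CPUS_PER_NODE	4
#define NR_NODES	4
#define NR_CPUS		(CPUS_PER_NODE * NR_NODES)

static unsigned long node_to_cpumask(int node)
{
	return 0xfUL << (node * CPUS_PER_NODE);
}

static int cpu_to_node(int cpu)
{
	return cpu / CPUS_PER_NODE;
}

/* Round-robin an IRQ among the CPUs of its own node only,
 * instead of across the whole machine. */
static int pick_cpu_for_irq(int last_cpu)
{
	unsigned long mask = node_to_cpumask(cpu_to_node(last_cpu));
	int cpu = last_cpu;

	do {
		cpu = (cpu + 1) % NR_CPUS;
	} while (!(mask & (1UL << cpu)));

	return cpu;
}

int main(void)
{
	int cpu = 5;	/* IRQ currently serviced by cpu 5, node 1 */
	int i;

	for (i = 0; i < 6; i++) {
		cpu = pick_cpu_for_irq(cpu);
		printf("next cpu: %d\n", cpu);	/* stays within 4-7 */
	}
	return 0;
}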
M.
> Martin, a couple of days back I posted a kernel IRQ distribution patch with some discussion. There we tried to do the same things you're interested in here. We made the interval flexible and longer, and we removed the randomness from the algorithm.
Yup, saw it, but haven't given it the inspection it really deserves yet.
That code does need some work, and it sounds like you're doing the
right things to it.
> Also, about fairness: the scheduler will not be able to solve the fairness issues caused by interrupts in all cases. For example, under very high interrupt load, some of the CPUs may become 100% busy just servicing interrupts, and there the scheduler cannot do anything. To get fairness, we need the interrupt distribution mechanism to move interrupts as required.
Well, if the scheduler didn't ding the process for time spent in interrupts,
I think it'd work out - it could always run processes on another CPU ;-)
But that may not be practical to do in reality.
> Maybe we can add some generic NUMA awareness to it. But I am not fully aware of how interrupt routing happens on various NUMA systems. If I can get this information, I can look into how we can add generic NUMA support to the new IRQ distribution code.
Mmm... well it's late and I'm tired, but off the top of my head ... you
need to map from each PCI bus to the closest set of cpus - for me that's
a simple bus_to_node mapping (not sure that bit is added to the topology
infrastructure yet, but it's a trivial patch that's floating around ...
I'll try to dig it out and add it to the 2.5-mjb tree). Then just limit
the distribution for an interrupt to the closest set of CPUs (for UMA SMP
that would just be cpu_online_map), and have another abstracted function that
sets IO-APIC distribution up to a certain CPU (if doing balancing explicitly)
or group thereof. But it's late, so if that makes no sense, I'll take it
all back in the morning ;-)
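In code terms, roughly this (equally late-night and untested; the stub
bodies, the numa flag, and set_ioapic_dest() are made up just to show
the shape of it - bus_to_node is the trivial mapping mentioned above):

#include <stdio.h>

static unsigned long cpu_online_map = 0xff;	/* 8 CPUs online */

static int bus_to_node(int bus)			{ return bus / 2; }
static unsigned long node_to_cpumask(int node)	{ return 0xfUL << (node * 4); }

/* Stand-in for the abstracted function that programs the IO-APIC. */
static void set_ioapic_dest(int irq, unsigned long cpumask)
{
	printf("irq %d -> cpumask %#lx\n", irq, cpumask);
}

static void set_irq_distribution(int irq, int bus, int numa)
{
	unsigned long mask = cpu_online_map;	/* UMA SMP default */

	if (numa)	/* NUMA: restrict to the bus's home node */
		mask = node_to_cpumask(bus_to_node(bus)) & cpu_online_map;

	if (!mask)	/* node has no online CPUs; fall back */
		mask = cpu_online_map;

	set_ioapic_dest(irq, mask);
}

int main(void)
{
	set_irq_distribution(9, 3, 1);	/* NUMA: cpus 4-7 (node 1) */
	set_irq_distribution(9, 3, 0);	/* UMA: all online CPUs */
	return 0;
}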
If you're interested in working on it, I'm very happy to test it ...
(should probably be kept separate from your other stuff though).
I'll see if I can find someone in our performance team to evaluate
how your existing patch runs on SMP for us ...
M.
On Sun, 22 Dec 2002, Martin J. Bligh wrote:
> > Maybe we can add some generic NUMA awareness to it. But I am not fully aware of how interrupt routing happens on various NUMA systems. If I can get this information, I can look into how we can add generic NUMA support to the new IRQ distribution code.
>
> Mmm... well it's late and I'm tired, but off the top of my head ... you
> need to map from each PCI bus to the closest set of cpus - for me that's
> a simple bus_to_node mapping (not sure that bit is added to the topology
> infrastructure yet, but it's a trivial patch that's floating around ...
> I'll try to dig it out and add it to the 2.5-mjb tree). Then just limit
> the distribution for an interrupt to the closest set of CPUs (for UMA SMP
> that would just be cpu_online_map), and have another abstracted function that
> sets IO-APIC distribution up to a certain CPU (if doing balancing explicitly)
> or group thereof. But it's late, so if that makes no sense, I'll take it
> all back in the morning ;-)
How about using logical destination mode when programming the IOAPIC?
Currently we do physical in io_apic.c (the reason why it breaks on NUMAQ).
This way we can get node affinity just by setting the Destination Field
of an IOREDTBL entry to APIC_BROADCAST_ID, and targeting single cpus on a
node also becomes node-generic.
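Bit-wise, roughly this (sketch only; the field layout follows the
82093AA IO-APIC datasheet, but the macro names here are made up):

#include <stdio.h>
#include <stdint.h>

#define APIC_BROADCAST_ID	0xFFull	/* matches all CPUs in flat logical mode */

#define IOREDTBL_VECTOR(v)	((uint64_t)(v) & 0xff)
#define IOREDTBL_LOWEST_PRIO	(1ull << 8)	/* delivery mode 001 */
#define IOREDTBL_DEST_LOGICAL	(1ull << 11)	/* 0 = physical mode */
#define IOREDTBL_DEST(d)	((uint64_t)(d) << 56)

static uint64_t make_redir_entry(uint8_t vector, uint8_t logical_dest)
{
	return IOREDTBL_VECTOR(vector) |
	       IOREDTBL_LOWEST_PRIO |
	       IOREDTBL_DEST_LOGICAL |
	       IOREDTBL_DEST(logical_dest);
}

int main(void)
{
	/* Deliver to any CPU whose logical ID matches the broadcast
	 * mask; narrowing the mask targets single CPUs on a node. */
	uint64_t e = make_redir_entry(0x31, APIC_BROADCAST_ID);

	printf("IOREDTBL entry: %#018llx\n", (unsigned long long)e);
	return 0;
}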
Cheers,
Zwane Mwaikambo
PS This suggestion also comes with a possible nonsense disclaimer as I'm
also about to go to bed ;)
--
function.linuxpower.ca
> How about using logical destination mode when programming the IOAPIC?
> Currently we do physical in io_apic.c (the reason why it breaks on NUMAQ).
> This way we can get node affinity just by setting the Destination Field
> of an IOREDTBL entry to APIC_BROADCAST_ID, and targeting single cpus on a
> node also becomes node-generic.
Yup, that'll work fine once we have balance_IRQ set up with node affinity.
Using phys is just a cheapo lazy hacker's way to steal node affinity for
free from the mouths of babes.
M.