2002-01-18 22:46:58

by James Cleverdon

Subject: [RFC] Summit interrupt routing patches

For a forthcoming NUMA box that uses Foster processors and IBM's Summit
chipset, I've had to extend Martin Bligh's multiquad code beyond the usual
APIC flat mode and his clustered logical mode. The problem with the latter
is that Intel's new APIC numbering scheme only puts two physical CPUs per
cluster. Linux leaves all the TPRs and XTPRs at zero, so that means that all
the interrupts will hit just one CPU per cluster, and generally everything
goes to one cluster.

To combat that, I used the XAPIC's new physical cluster capability and a
simple (probably too simple) static round robin binding to roughly level out
the interrupt load. This works, though it seems somehow unsatisfying. The
alternative is much more invasive: adding interrupt code to adjust XTPRs,
hacks to the scheduler to somehow encapsulate task priority in 4 bits, etc.
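
A minimal sketch of what a static round-robin binding of this kind could look
like. It is not taken from the attached patch: the cluster count, the struct,
and the function name static_irq_target() are all hypothetical, and only the
figure of two CPUs per cluster comes from the text above.

```c
#include <assert.h>

#define NR_CLUSTERS      4  /* assumption: illustrative cluster count */
#define CPUS_PER_CLUSTER 2  /* two physical CPUs per cluster, as noted above */

/* Hypothetical destination: which cluster, and which CPU inside it. */
struct irq_target {
    unsigned int cluster;
    unsigned int member;
};

/*
 * Static round robin: consecutive IRQ numbers go to consecutive clusters,
 * and only after every cluster has one IRQ do we move to the next CPU
 * inside each cluster. The binding never changes at run time.
 */
static struct irq_target static_irq_target(unsigned int irq)
{
    struct irq_target t;

    t.cluster = irq % NR_CLUSTERS;
    t.member = (irq / NR_CLUSTERS) % CPUS_PER_CLUSTER;
    assert(t.cluster < NR_CLUSTERS && t.member < CPUS_PER_CLUSTER);
    return t;
}
```

With these illustrative constants, IRQs 0 through 3 land on the first CPU of
clusters 0 through 3, IRQs 4 through 7 on the second CPU of each cluster, and
IRQ 8 wraps back to where IRQ 0 went.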

What do you think?

--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]


Attachments:
summit_patch.2002-01-17_2.4.17 (24.80 kB)

2002-01-18 23:07:08

by Alan

Subject: Re: [RFC] Summit interrupt routing patches

> alternative is much more invasive: adding interrupt code to adjust XTPRs,
> hacks to the scheduler to somehow encapsulate task priority in 4 bits, etc.

Is it necessary to try something complex? We already keep per cpu/per irq
data and if you have a lot of interrupts it feels like you can handwave them
to be roughly the same amount of cpu load.

Given that, is it enough to once a second shuffle the irqs around to try and
get a rough balance based on a simple decaying history? Then all it needs
is a regular timer event to do what the hardware hasn't.
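
A rough sketch of the periodic shuffle suggested above, assuming a halving
decay and a "move one IRQ per tick" policy. Both are illustrative choices, not
something stated in this thread, and every name here (irq_count, irq_history,
irq_cpu, rebalance_tick) is hypothetical rather than an existing kernel symbol.

```c
#include <assert.h>

#define NR_CPUS 4  /* assumption: illustrative sizes */
#define NR_IRQS 8

static unsigned long irq_count[NR_IRQS];   /* interrupts seen since last tick */
static unsigned long irq_history[NR_IRQS]; /* decayed load estimate per IRQ */
static unsigned int irq_cpu[NR_IRQS];      /* CPU each IRQ is currently bound to */

/* Meant to be called from a regular timer event, e.g. once a second. */
static void rebalance_tick(void)
{
    unsigned long load[NR_CPUS] = { 0 };
    unsigned int busiest = 0, idlest = 0, cpu, irq;
    int best = -1;

    /* Decay the old history by half, fold in the new counts, sum per CPU. */
    for (irq = 0; irq < NR_IRQS; irq++) {
        irq_history[irq] = irq_history[irq] / 2 + irq_count[irq];
        irq_count[irq] = 0;
        assert(irq_cpu[irq] < NR_CPUS);
        load[irq_cpu[irq]] += irq_history[irq];
    }

    for (cpu = 1; cpu < NR_CPUS; cpu++) {
        if (load[cpu] > load[busiest])
            busiest = cpu;
        if (load[cpu] < load[idlest])
            idlest = cpu;
    }

    /*
     * On the busiest CPU, pick the heaviest IRQ whose load is still
     * smaller than the gap to the idlest CPU, so that moving it narrows
     * the gap instead of just swapping which CPU is overloaded.
     */
    for (irq = 0; irq < NR_IRQS; irq++) {
        if (irq_cpu[irq] != busiest || irq_history[irq] == 0)
            continue;
        if (irq_history[irq] >= load[busiest] - load[idlest])
            continue;
        if (best < 0 || irq_history[irq] > irq_history[best])
            best = (int)irq;
    }

    if (best >= 0)
        irq_cpu[best] = idlest;
}
```

Moving at most one IRQ per tick keeps the rebalancing cheap and avoids
bouncing interrupts between CPUs when the load is already close to even.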

2002-01-19 00:19:12

by James Cleverdon

Subject: Re: [RFC] Summit interrupt routing patches

On Friday 18 January 2002 03:15 pm, Alan Cox wrote:
> > alternative is much more invasive: adding interrupt code to adjust
> > XTPRs, hacks to the scheduler to somehow encapsulate task priority in 4
> > bits, etc.
>
> Is it necessary to try something complex? We already keep per cpu/per irq
> data and if you have a lot of interrupts it feels like you can handwave
> them to be roughly the same amount of cpu load.
>
> Given that, is it enough to once a second shuffle the irqs around to try and
> get a rough balance based on a simple decaying history? Then all it needs
> is a regular timer event to do what the hardware hasn't.

Thanks for the reply.

What I'd like to see is a scheme where we route interrupts preferentially to
idle CPUs. If none are idle, then aim them at the CPUs running the "least
important" tasks. Also, since each CPU's local APIC only has two interrupt
latches per `level' (the upper nibble of the IRQ's vector), it would be a
good idea to avoid sending IRQs to those CPUs that are already processing one.

Since we only have 4 bits in each XTPR, suppose we used the high bit to
indicate IRQ-in-progress and the bottom three for some kind of compressed
task goodness measure. Idle equals zero. The XTPR would be updated on every
interrupt entry/exit and on every task switch.
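
To make the proposed encoding concrete, here is a small sketch of packing the
4-bit value. The linear scaling of goodness into levels 1..7 and the names
xtpr_encode(), XTPR_IRQ_BUSY and XTPR_GOOD_MASK are assumptions made for
illustration. Only the bit layout (high bit for IRQ-in-progress, low three
bits for goodness, idle equals zero) comes from the paragraph above.

```c
#include <assert.h>

#define XTPR_IRQ_BUSY  0x8u  /* high bit: an IRQ is already being processed */
#define XTPR_GOOD_MASK 0x7u  /* low three bits: compressed task goodness */

/*
 * Hypothetical helper: build the 4-bit XTPR value for a CPU.
 * An idle CPU gets goodness level 0; running tasks are squeezed
 * linearly into levels 1..7 (assumed scaling, not from the thread).
 */
static unsigned int xtpr_encode(int idle, unsigned int goodness,
                                unsigned int max_goodness, int in_irq)
{
    unsigned int level = 0;

    assert(max_goodness > 0);
    if (!idle) {
        if (goodness > max_goodness)
            goodness = max_goodness;
        level = 1 + (goodness * 6) / max_goodness;
    }
    return (in_irq ? XTPR_IRQ_BUSY : 0u) | (level & XTPR_GOOD_MASK);
}
```

Under this layout any CPU that is already handling an interrupt compares
higher than every CPU that is not, so lowest priority delivery would prefer
idle CPUs first, then the least important tasks, and busy CPUs last.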

What's the catch? The cost of doing the above. Plus, we still only have two
CPUs in each cluster, four if hyperthreading is turned on. (Of course, the
hyperthreaded "sibling processors" don't have the power of a full CPU --
maybe 20% to 40% of one.) All interrupts have to be targeted at a particular
cluster. So, lowest priority interrupt delivery may not buy us very much.

To make matters even more interesting, the economy version (non-NUMA) of this
hardware started shipping last month and the full version will be shipping
soon. I wonder if Marcelo is going to allow this kind of futzing around with
interrupt and scheduler code in 2.4....

--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]

2002-01-19 01:09:39

by Alan

Subject: Re: [RFC] Summit interrupt routing patches

> idle CPUs. If none are idle, then aim them at the CPUs running the "least
> important" tasks. Also, since each CPU's local APIC only has two interrupt
> latches per `level' (the upper nibble of the IRQ's vector), it would be a
> good idea to avoid sending IRQs to those CPUs that are already processing one.

I'm not sure aiming at least important is worth anything. Aiming at idle
processors on a box not doing power management seems easy, providing you'll
accept 99.99% accuracy. Switch the priority up in the idle code, switch it
back down again before the idle task schedule()'s. If you hit during the
schedule, well, tough.

> soon. I wonder if Marcelo is going to allow this kind of futzing around with
> interrupt and scheduler code in 2.4....

That's another reason to keep it small and clean.

2002-01-19 01:13:39

by David Miller

Subject: Re: [RFC] Summit interrupt routing patches

   From: Alan Cox <[email protected]>
   Date: Sat, 19 Jan 2002 01:18:09 +0000 (GMT)

   Im not sure aiming at least important is worth anything. Aiming at idle
   processors on a box not doing power management seems easy providing you'll
   accept 99.99% accuracy. Switch the priority up in the idle code, switch it
   back down again before the idle task schedule()'s. If you hit during the
   schedule well tough.

$ egrep idle_me_harder arch/sparc64/kernel/process.c
$ egrep "idle_volume|redirect_intr" arch/sparc64/kernel/irq.c

Been there, done that :-)

Franks a lot,
David S. Miller
[email protected]

2002-01-22 01:00:00

by James Cleverdon

Subject: Re: [RFC] Summit interrupt routing patches

On Friday 18 January 2002 05:11 pm, David S. Miller wrote:
> From: Alan Cox <[email protected]>
> Date: Sat, 19 Jan 2002 01:18:09 +0000 (GMT)
>
> I'm not sure aiming at least important is worth anything. Aiming at idle
> processors on a box not doing power management seems easy providing
> you'll accept 99.99% accuracy. Switch the priority up in the idle code,
> switch it back down again before the idle task schedule()'s. If you hit
> during the schedule well tough.
>
> $ egrep idle_me_harder arch/sparc64/kernel/process.c
> $ egrep "idle_volume|redirect_intr" arch/sparc64/kernel/irq.c
>
> Been there, done that :-)
>
> Franks a lot,
> David S. Miller
> [email protected]

Yeah, and a colleague has some patches for ia64, which has a similar problem
with SAPICs. But my attempts at making an i386 version haven't panned out
yet (drops interrupts on Foster boxes). The code doesn't draw a strong
distinction between a CPU's number, its logical APIC ID, and its physical
APIC ID. Martin Bligh did some work in this department for his multiquad
patches, but it is by no means complete.

Until I can untangle this snarl, how about taking the static patches as a
first step? They are a bit less invasive than the lowest priority ones (i.e. no
apic_set_tpr calls inserted into do_IRQ, cpu_idle, etc.). Plus, they work on
all the hardware I can lay my hands on.

--
James Cleverdon, IBM xSeries Platform (NUMA), Beaverton
[email protected] | [email protected]



Attachments:
summit_patch.2002-01-17_2.4.17 (24.80 kB)