2002-03-13 19:09:28

by Martin Wilck

Subject: Severe IRQ problems on Foster (P4 Xeon) system


Hi Ingo and Maciej, everybody,

I am contacting you as the Linux experts for (IO)-Apics.

We are currently testing our new Foster (Pentium 4 Xeon with HT
technology) machines and facing really weird problems particularly
with the timer interrupt. The machine I am talking about has 2 physical,
i.e. 4 logical CPUs. It has 4 dual-port Intel Pro/100 network cards
built in.

** Before you read further: I cannot 100% exclude a hardware/BIOS
error yet, but something Linux does must have to do with it **

First of all, we see that virtually 100% of all IRQs are handled by CPU 0.
I have seen this reported a number of times before. I guess it can become
a severe performance problem in IRQ-intensive situations.

But much worse, we encounter the following problem:

- sporadically the timer interrupt becomes disabled sometime after
booting (typically around X startup).

- This is caused by a "junk" value that appears in the mask register
of the master 8259a PIC (the system is working in ExtINT mode for timer
IRQs).

- The system can be "healed" by rewriting the correct value into register
0x21.

- We see with a bus logic analyzer that the junk value is read after the
completely legal mask 0xfa was last written to the register.
However, according to my understanding, that register should not be
written after initialization, because IRQs are disabled through the
APIC, not the 8259a?

- The junk value always appears after CPU 1 gets a timer interrupt.
As stated above, this hardly happens at all, but CPU 1 seems to get
2-6 timer interrupts during each boot. These are all consecutive,
and starting with the second one the "junk" value is found in io-port
0x21.

Here is an excerpt from my syslog demonstrating this (I put some printk's
in do_IRQ; a printk is triggered if either a CPU other than 0 serves the
IRQ or the value in port 0x21 != 0xfa):

                               jiffies |    0x21        desc->status|
                                       V     V                      V
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8931: fa, CPU 1 IRQ 0 status 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8931: fa, CPU 1 IRQ 0 stat 5
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8932: fa, CPU 1 IRQ 0 status 1 <==
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8932: 07, CPU 1 IRQ 0 stat 5 <==
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8934: 07, CPU 1 IRQ 0 status 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8934: 07, CPU 0 IRQ 0 stat 0 <==
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8935: 07, CPU 1 IRQ 0 stat 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8935: 07, CPU 1 IRQ 0 status 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8936: 07, CPU 1 IRQ 0 stat 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8936: 07, CPU 1 IRQ 0 status 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8937: 07, CPU 1 IRQ 0 stat 0
Mar 12 16:51:40 pdb0374c kernel: IRQ1: 8937: 07, CPU 0 IRQ 28 stat 0
Mar 12 16:51:40 pdb0374c last message repeated 42 times
[ no more timer interrupts below here ].

Lines ending in "status" represent entry in do_IRQ(), "stat" represents
exit from do_IRQ() (label "out:").

Where the <== arrows are, it seems that CPU 0 and 1 are executing the
timer IRQ in parallel (!). The entry to do_IRQ for CPU 0 is not listed
because my conditions for printk were not met.

Very strange, too: After the PIC is masked, we see 4(!) more timer
interrupts - where do they come from ?

My guess: We are in a situation where CPU 0 is overwhelmed by interrupts
from the 5 network interfaces in this machine. This leads to timer
interrupts "piling up", and eventually an interrupt is even delivered
to CPU 1. If this happens to occur simultaneously with CPU 0, we get
some sort of race condition.

Our PCI bus protocol analysis gives another hint:

It seems that the problem occurs in this code fragment from time.c:

#ifdef CONFIG_X86_IO_APIC
	if (timer_ack) {
		/*
		 * Subtle, when I/O APICs are used we have to ack timer IRQ
		 * manually to reset the IRR bit for do_slow_gettimeoffset().
		 * This will also deassert NMI lines for the watchdog if run
		 * on an 82489DX-based system.
		 */
		spin_lock(&i8259A_lock);
		outb(0x0c, 0x20);
		/* Ack the IRQ; AEOI will end it automatically. */ <==
		inb(0x20);
		spin_unlock(&i8259A_lock);
	}
#endif

The above fragment is executed ~50us after the last write of the (correct)
value 0xfa to port 0x21.

Where the <== arrow is, the analyzer clearly shows the outb (0x0c, 0x20)
operation. After that, we see strange cycles for ~70us (probably special
cycles for arbitration, definitely no valid data transfers). The very next bus
operation is a read on port 0x21 that reveals the junk value 0x07! The
inb(0x20) call is not captured in our protocol, it must occur long after
the error. (We saw normal execution of the above code fragment where
there is ~1us between the outb and inb, whereas it is >120us here).

That's our state of insight so far. Any hints are very welcome.

Martin

PS We were testing with RedHat's 2.4.9-21 and SuSE's 2.4.16 kernel.
AFAICS, there are no significant changes between these and newer
kernels wrt APIC handling. The error occurs only once in a few
hours, so our testing possibilities are limited by time.
And yes, we were using Intel's e100.o driver for the ethernet
boards. But the driver is not to blame, the problem occurs even
if no driver is loaded for the cards at all.

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy






2002-03-13 19:14:58

by Ingo Molnar

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system


On Wed, 13 Mar 2002, Martin Wilck wrote:

> First of all, we see that virtually 100% of all IRQs are handled by
> CPU 0. I have seen this reported a number of times before. I guess it
> can become a severe performance problem in IRQ-intensive situations.

i've written a patch for this, it's enclosed in this email. It implements
a Brownian motion of IRQs, based on load patterns. The concept works
really well on Foster CPUs - eg. it will redirect IRQs to idle CPUs - but
if all CPUs are idle then the IRQs are randomly and evenly distributed
between CPUs.

(the patch can be made cheaper, but i've kept the overhead per-IRQ for the
time being to have more flexibility.)
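
For readers following along, here is a minimal user-space sketch of the CPU
selection walk that move() in the patch below performs. It is not the kernel
code: NUM_CPUS, cpu_idle[] and pick_cpu() are names made up for illustration,
idleness is a plain flag instead of the idle_timestamp check, and the
empty-mask guard is an addition of the sketch.

```c
#include <assert.h>

#define NUM_CPUS 4   /* assumption: 2 physical x 2 logical CPUs, as in the report */

/* Hypothetical stand-in for the kernel's idle tracking: nonzero = idle. */
static int cpu_idle[NUM_CPUS];

static int irq_allowed(int cpu, unsigned long allowed_mask)
{
    return ((1UL << cpu) & allowed_mask) != 0;
}

/*
 * Walk the CPUs in one direction, starting next to curr_cpu.
 * During the first lap only allowed CPUs that are idle are accepted.
 * Once the walk is back at curr_cpu without success, the next allowed
 * CPU is taken regardless of idleness.
 */
static int pick_cpu(int curr_cpu, unsigned long allowed_mask, int direction)
{
    int search_idle = 1;
    int cpu = curr_cpu;

    assert(curr_cpu >= 0 && curr_cpu < NUM_CPUS);

    /* Guard added for the sketch: with no allowed CPU the walk would never end. */
    if ((allowed_mask & ((1UL << NUM_CPUS) - 1)) == 0)
        return curr_cpu;

    for (;;) {
        if (direction == 1)
            cpu = (cpu + 1) % NUM_CPUS;
        else
            cpu = (cpu + NUM_CPUS - 1) % NUM_CPUS;

        if (irq_allowed(cpu, allowed_mask) && (!search_idle || cpu_idle[cpu]))
            return cpu;

        /* Wrapped around without finding an idle CPU: stop insisting on idle. */
        if (cpu == curr_cpu)
            search_idle = 0;
    }
}
```

With cpu_idle = {0, 0, 1, 0} and all four CPUs allowed, pick_cpu(0, 0xf, 1)
and pick_cpu(0, 0xf, 0) both land on CPU 2. With every CPU busy they return
1 and 3 respectively, i.e. the IRQ just steps to a neighbour in the randomly
chosen direction.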

let me know whether this fixes your problem,

Ingo

--- linux/kernel/sched.c.orig Tue Feb 5 13:11:35 2002
+++ linux/kernel/sched.c Tue Feb 5 13:12:48 2002
@@ -118,6 +118,11 @@
#define can_schedule(p,cpu) \
((p)->cpus_runnable & (p)->cpus_allowed & (1 << cpu))

+int idle_cpu(int cpu)
+{
+ return cpu_curr(cpu) == idle_task(cpu);
+}
+
#else

#define idle_task(cpu) (&init_task)
--- linux/include/linux/sched.h.orig Tue Feb 5 13:13:09 2002
+++ linux/include/linux/sched.h Tue Feb 5 13:14:00 2002
@@ -144,6 +144,7 @@

extern void sched_init(void);
extern void init_idle(void);
+extern int idle_cpu(int cpu);
extern void show_state(void);
extern void cpu_init (void);
extern void trap_init(void);
--- linux/include/asm-i386/hardirq.h.orig Tue Feb 5 13:10:39 2002
+++ linux/include/asm-i386/hardirq.h Tue Feb 5 13:14:00 2002
@@ -12,6 +12,7 @@
unsigned int __local_bh_count;
unsigned int __syscall_count;
struct task_struct * __ksoftirqd_task; /* waitqueue is too large */
+ unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
} ____cacheline_aligned irq_cpustat_t;

--- linux/arch/i386/kernel/io_apic.c.orig Tue Feb 5 13:10:37 2002
+++ linux/arch/i386/kernel/io_apic.c Tue Feb 5 13:15:23 2002
@@ -28,6 +28,7 @@
#include <linux/config.h>
#include <linux/smp_lock.h>
#include <linux/mc146818rtc.h>
+#include <linux/compiler.h>

#include <asm/io.h>
#include <asm/smp.h>
@@ -163,6 +164,86 @@
clear_IO_APIC_pin(apic, pin);
}

+static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
+{
+ unsigned long flags;
+
+ /*
+ * Only the first 8 bits are valid.
+ */
+ mask = mask << 24;
+ spin_lock_irqsave(&ioapic_lock, flags);
+ __DO_ACTION(1, = mask, )
+ spin_unlock_irqrestore(&ioapic_lock, flags);
+}
+
+#if CONFIG_SMP
+
+typedef struct {
+ unsigned int cpu;
+ unsigned long timestamp;
+} ____cacheline_aligned irq_balance_t;
+
+static irq_balance_t irq_balance[NR_IRQS] __cacheline_aligned
+ = { [ 0 ... NR_IRQS-1 ] = { 1, 0 } };
+
+extern unsigned long irq_affinity [NR_IRQS];
+
+#endif
+
+#define IDLE_ENOUGH(cpu,now) \
+ (idle_cpu(cpu) && ((now) - irq_stat[(cpu)].idle_timestamp > 1))
+
+#define IRQ_ALLOWED(cpu,allowed_mask) \
+ ((1 << cpu) & (allowed_mask))
+
+static unsigned long move(int curr_cpu, unsigned long allowed_mask, unsigned long now, int direction)
+{
+ int search_idle = 1;
+ int cpu = curr_cpu;
+
+ goto inside;
+
+ do {
+ if (unlikely(cpu == curr_cpu))
+ search_idle = 0;
+inside:
+ if (direction == 1) {
+ cpu++;
+ if (cpu >= smp_num_cpus)
+ cpu = 0;
+ } else {
+ cpu--;
+ if (cpu == -1)
+ cpu = smp_num_cpus-1;
+ }
+ } while (!IRQ_ALLOWED(cpu,allowed_mask) ||
+ (search_idle && !IDLE_ENOUGH(cpu,now)));
+
+ return cpu;
+}
+
+static inline void balance_irq(int irq)
+{
+#if CONFIG_SMP
+ irq_balance_t *entry = irq_balance + irq;
+ unsigned long now = jiffies;
+
+ if (unlikely(entry->timestamp != now)) {
+ unsigned long allowed_mask;
+ int random_number;
+
+ rdtscl(random_number);
+ random_number &= 1;
+
+ allowed_mask = cpu_online_map & irq_affinity[irq];
+ entry->timestamp = now;
+ entry->cpu = move(entry->cpu, allowed_mask, now, random_number);
+ set_ioapic_affinity(irq, 1 << entry->cpu);
+ }
+#endif
+}
+
/*
* support for broken MP BIOSs, enables hand-redirection of PIRQ0-7 to
* specific CPU-side IRQs.
@@ -653,8 +734,7 @@
}

/*
- * Set up the 8259A-master output pin as broadcast to all
- * CPUs.
+ * Set up the 8259A-master output pin:
*/
void __init setup_ExtINT_IRQ0_pin(unsigned int pin, int vector)
{
@@ -1174,6 +1254,7 @@
*/
static void ack_edge_ioapic_irq(unsigned int irq)
{
+ balance_irq(irq);
if ((irq_desc[irq].status & (IRQ_PENDING | IRQ_DISABLED))
== (IRQ_PENDING | IRQ_DISABLED))
mask_IO_APIC_irq(irq);
@@ -1213,6 +1294,7 @@
unsigned long v;
int i;

+ balance_irq(irq);
/*
* It appears there is an erratum which affects at least version 0x11
* of I/O APIC (that's the 82093AA and cores integrated into various
@@ -1268,19 +1350,6 @@
}

static void mask_and_ack_level_ioapic_irq (unsigned int irq) { /* nothing */ }
-
-static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
-{
- unsigned long flags;
- /*
- * Only the first 8 bits are valid.
- */
- mask = mask << 24;
-
- spin_lock_irqsave(&ioapic_lock, flags);
- __DO_ACTION(1, = mask, )
- spin_unlock_irqrestore(&ioapic_lock, flags);
-}

/*
* Level and edge triggered IO-APIC interrupts need different handling,
--- linux/arch/i386/kernel/irq.c.orig Tue Feb 5 13:10:34 2002
+++ linux/arch/i386/kernel/irq.c Tue Feb 5 13:11:15 2002
@@ -1076,7 +1076,7 @@

static struct proc_dir_entry * smp_affinity_entry [NR_IRQS];

-static unsigned long irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = ~0UL };
+unsigned long irq_affinity [NR_IRQS] = { [0 ... NR_IRQS-1] = ~0UL };
static int irq_affinity_read_proc (char *page, char **start, off_t off,
int count, int *eof, void *data)
{

2002-03-13 22:24:55

by Bill Davidsen

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Ingo Molnar wrote:

>
> On Wed, 13 Mar 2002, Martin Wilck wrote:
>
> > First of all, we see that virtually 100% of all IRQs are handled by
> > CPU 0. I have seen this reported a number of times before. I guess it
> > can become a severe performance problem in IRQ-intensive situations.
>
> i've written a patch for this, it's enclosed in this email. It implements
> a Brownian motion of IRQs, based on load patterns. The concept works
> really well on Foster CPUs - eg. it will redirect IRQs to idle CPUs - but
> if all CPUs are idle then the IRQs are randomly and evenly distributed
> between CPUs.

If several processors are idle, say CPU0 busy and CPU[123] idle, does it
preferentially use a "CPU" on another chip? And does that make any
difference? It's not clear to me if the HT CPUs share cache or not, they
obviously share bandwidth from L2 to RAM.

I'm looking at P4 chips and boards, my 2Q02 budget has some $$ for a
system. I also will be getting some laptops 3Q02, does the new P4-M mobile
chip by any chance have HT? If so a good reason to go Intel, assuming that
either the BIOS or Linux can get it to use the feature ;-)

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-03-13 22:36:26

by Alan

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

> If several processors are idle, say CPU0 busy and CPU[123] idle, does it
> preferentially use a "CPU" on another chip? And does that make any
> difference? It's not clear to me if the HT CPUs share cache or not, they
> obviously share bandwidth from L2 to RAM.

The scheduler changes try to schedule onto a new true CPU rather than a
sibling first. Typically you only gain 10-30% via the HT feature so you
want to load the "real" CPU's properly.

> I'm looking at P4 chips and boards, my 2Q02 budget has some $$ for a
> system. I also will be getting some laptops 3Q02, does the new P4-M mobile
> chip by any chance have HT? If so a good reason to go Intel, assuming that
> either the BIOS or Linux can get it to use the feature ;-)

At the moment HT is Xeon only. Linux can do the right thing with it as of
2.4.18 + acpismp=force. Autodetect should be in soon. I don't know about
Intel's future product plans for HT.

Alan

2002-03-14 05:30:37

by Randy.Dunlap

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Alan Cox wrote:

| > I'm looking at P4 chips and boards, my 2Q02 budget has some $$ for a
| > system. I also will be getting some laptops 3Q02, does the new P4-M mobile
| > chip by any chance have HT? If so a good reason to go Intel, assuming that
| > either the BIOS or Linux can get it to use the feature ;-)

They announced at IDF last week (or 2 weeks back) that "UP"
P4 next year sometime will include HT.
I think this is "Prescott."

I haven't heard details about the P4-M, but I doubt that
it has HT (yet).

| At the moment HT is Xeon only. Linux can do the right thing with it as of
| 2.4.18 + acpismp=force. Autodetect should be in soon. I don't know about
| Intel's future product plans for HT.

--
~Randy

2002-03-14 13:27:01

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Ingo Molnar wrote:

> let me know whether this fixes your problem,

The patch distributes the IRQs nicely between CPUs, but unfortunately does
not fix our timer IRQ problem.

Btw is it correct that one could also use the APIC Task Priority Registers
to implement "fair" IRQ routing? (If linux adjusted them, which it
currently doesn't).

Thanks & regards,
Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy





2002-03-14 14:14:14

by Alan

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

> They announced at IDF last week (or 2 weeks back) that "UP"
> P4 next year sometime will include HT.
> I think this is "Prescott."

They obviously don't expect to sell them in the UK then (Prescott is a not
wondrous UK political figure...). Good to know it will pop up elsewhere.

2002-03-14 14:34:55

by Martin J. Bligh

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

>> let me know whether this fixes your problem,
>
> The patch distributes the IRQs nicely between CPUs, but unfortunately does
> not fix our timer IRQ problem.
>
> Btw is it correct that one could also use the APIC Task Priority Registers
> to implement "fair" IRQ routing? (If linux adjusted them, which it
> currently doesn't).

Yes, and Dave Olien has already done this. It's a good idea for P3,
and seems to me to be essential for P4.

Dave, can you republish your patch?

M.

2002-03-14 18:23:07

by Martin J. Bligh

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

>> Btw is it correct that one could also use the APIC Task Priority Registers
>> to implement "fair" IRQ routing? (If linux adjusted them, which it
>> currently doesn't).
>
> Yes, and Dave Olien has already done this. It's a good idea for P3,
> and seems to me to be essential for P4.
>
> Dave, can you republish your patch?

Apparently he's out for a few days. I poked around, and here's the latest
version of his stuff I can find:

http://sourceforge.net/project/showfiles.php?group_id=8875

Look under "APIC routing". Read the notes carefully - you have to
activate it from the command line.


M.

2002-03-15 08:48:30

by Ingo Molnar

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system


On Thu, 14 Mar 2002, Martin Wilck wrote:

> > let me know whether this fixes your problem,
>
> The patch distributes the IRQs nicely between CPUs, but unfortunately
> does not fix our timer IRQ problem.

that is a different issue, a fix for it was sent to lkml.

> Btw is it correct that one could also use the APIC Task Priority
> Registers to implement "fair" IRQ routing? (If linux adjusted them,
> which it currently doesn't).

no. The TPR has a number of limitations, and it suffers from the same
problem that lowest priority routing is suffering.

1) the TPR is not really finegrained and does not match Linux's irq
architecture, it's a rather spl-alike metric to allow irqs in below/above
a certain level. Since Linux distributes IRQ sources in essence randomly,
there is no point in TPR-limiting a certain half of the IRQ vector
spectrum.

2) i initially played with the TPR and it does not really solve the P4
problem. It can be used to force irqs away from a busy CPU, but in the
common (idle, or mostly idle) case the TPR will be equivalent across CPUs,
resulting in the same 'ugly' IRQ imbalance that you see.

3) the irqbalance patch also takes CPU affinity into account, ie. it will
try to keep the same IRQ source on a single CPU, for some time. So the
micro-distribution of IRQ sources is 'CPU affine', while the
macro-distribution is statistically random, in a load-weighted way. The
TPR approach results in the same 'one IRQ goes to CPU1, next IRQ goes to
CPU2' type of cache-affinity problems.

4) irqbalance is a software-based distribution method. It was time for the
Linux/x86 IRQ routing code to 'grow up' and actually be clever in a number
of ways. The hardware made bad decisions - even in lowestprio mode.

eg. apply the irqbalance patch, start a single CPU-using script like:

while [ 1 ]; do N=1; done

and watch IRQ load wander to the idle CPU(s) instantly.

Ingo

2002-03-15 08:55:51

by Ingo Molnar

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system


On Thu, 14 Mar 2002, Martin J. Bligh wrote:

> >> Btw is it correct that one could also use the APIC Task Priority Registers
> >> to implement "fair" IRQ routing? (If linux adjusted them, which it
> >> currently doesn't).
> >
> > Yes, and Dave Olien has already done this. It's a good idea for P3,
> > and seems to me to be essential for P4.

another problem with TPR-based IRQ routing (in addition to the ones i
mentioned in the previous mail) is that if you 'deny' certain IRQs via the
TPR, then if all CPUs run kernel-intensive jobs, then IRQs will never be
served by any of the CPUs (or will be served only after a long latency).
Sure, this can be hacked around, but it gets ugly very fast and doesn't get
us very far. All in all, i found the TPR to be not flexible enough for
what we really want: good IRQ distribution and good IRQ affinity at once.

Ingo

2002-03-15 08:59:52

by Ingo Molnar

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system


On Wed, 13 Mar 2002, Bill Davidsen wrote:

> > i've written a patch for this, it's enclosed in this email. It implements
> > a Brownian motion of IRQs, based on load patterns. The concept works
> > really well on Foster CPUs - eg. it will redirect IRQs to idle CPUs - but
> > if all CPUs are idle then the IRQs are randomly and evenly distributed
> > between CPUs.
>
> If several processors are idle, say CPU0 busy and CPU[123] idle, does it
> preferentially use a "CPU" on another chip? And does that make any
> difference? It's not clear to me if the HT CPUs share cache or not, they
> obviously share bandwidth from L2 to RAM.

it has no HT affinity knowledge yet, but adding it should be
straightforward. The IRQ 'move' function is in the slow path and can be
made HT-aware without any performance-worries.

Ingo

2002-03-15 18:09:22

by Martin J. Bligh

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

--On Friday, March 15, 2002 08:43:38 +0100 Ingo Molnar <[email protected]> wrote:
>
> On Thu, 14 Mar 2002, Martin Wilck wrote:
>> Btw is it correct that one could also use the APIC Task Priority
>> Registers to implement "fair" IRQ routing? (If linux adjusted them,
>> which it currently doesn't).
>
> no. The TPR has a number of limitations, and it suffers from the same
> problem that lowest priority routing is suffering.
>
> 1) the TPR is not really finegrained and does not match Linux's irq
> architecture, it's a rather spl-alike metric to allow irqs in below/above
> a certain level. Since Linux distributes IRQ sources in essence randomly,
> there is no point in TPR-limiting a certain half of the IRQ vector
> spectrum.

OK, there's actually two levels here, there's the stuff you're talking about
that blocks interrupts, then there's a way underneath that to affect which
CPU gets priority. You're really going to make me trawl through the Intel
APIC docs again? Not sure what I did to deserve that level of pain ;-)
I think I can remember this, but please forgive me if I don't get it 100% right
first time.

Dave could explain this to you better than I could, but if I remember all this
correctly what he was really trying to do was program the APR,
not the TPR, but the APR is a read-only register ... it's affected by the way
you program the TPR. The docs are particularly opaque about how this really
works. I've had pretty much exactly this argument with Dave before, and we
thrashed out how it all worked in the process.

What I normally look at is Section 7.5 (APIC) of Vol 3 of the PIII Intel docs.
These are very confusing in this area, and Dave had some better docs
that I'll see if I can dig out. But if you look carefully at them (and you know
how it's meant to work before you start) it makes sense in a twisted sort of
way.

Look at the section marked "Interrupt distribution mechanisms":

Dynamic distribution assigns incoming interrupts to the lowest priority processor,
which is generally the least busy processor ... <snip> ... from all processors listed
in the destination, the processor selected is the one whose current arbitration
priority is the lowest. The latter is specified in the arbitration priority register (APR)
... <snip> ... If more than one processor shares the lowest priority, the processor
with the highest arbitration priority (the unique value in the Arb ID register) is selected.

The last sentence is how round robin happens on an SMP P3 system. I presume
this is what fell off for the P4.

In the section "valid interrupts", they define "priority = vector / 16".

Now look at the two paragraphs defining the TPR. The first para describes pretty
much what you describe. Note that this operation would only require 4 bits.
Now look at the second para, where they define the 4 msbs as corresponding
to the interrupt priorities, and mumble something about the 4 lsbs, giving very
little real information.

Now look at the section defining the APR, and look at the weird algorithm,
which does somewhat opaque things to derive the value of the APR from the TPR
(and some other registers). It's easy to figure out that they're coupled, it's harder
to figure out exactly how, and I can't remember exactly how this works right now.

Now read Dave's code, and see what he does ;-) Look at his notes (they're at
the same URL I gave), and amongst other things, you'll see he says:

Linux does not
assign APIC interrupt vectors below the value 0x20. Reading specifications on
the APIC indicates that vector values 0 through 0xf are "reserved". So,
tpr values ranging from 0x10 through 0x1f where assigned to the "idle" through
"kernel-mode executing an interrupt handler" states.
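
To make the quoted rules concrete, here is a small sketch. It only encodes
what is cited above: priority class = value / 16, the lowest arbitration
priority wins, and ties go to the highest Arb ID. Everything else is an
assumption for illustration: struct lapic and the function names are
invented, the arbitration priority is taken to equal the TPR (roughly the
case when nothing is pending or in service), Arb ID rotation is not
modelled, and the two middle TPR values are made up. Only the 0x10 and 0x1f
endpoints come from Dave's notes.

```c
#include <assert.h>
#include <stddef.h>

/* TPR value per CPU state. Endpoints are from the notes; middle values are hypothetical. */
enum cpu_state_tpr {
    TPR_IDLE        = 0x10,
    TPR_USER        = 0x14,  /* hypothetical */
    TPR_KERNEL      = 0x18,  /* hypothetical */
    TPR_IRQ_HANDLER = 0x1f
};

struct lapic {
    unsigned char tpr;     /* stands in for the arbitration priority (assumption) */
    unsigned char arb_id;  /* unique arbitration ID */
};

/* Priority class as quoted from the docs: priority = vector / 16. */
static unsigned prio_class(unsigned char value)
{
    return value / 16;
}

/* An interrupt is held back when its class is not above the TPR class. */
static int vector_blocked(unsigned char vector, unsigned char tpr)
{
    return prio_class(vector) <= prio_class(tpr);
}

/* Lowest priority wins; among equals, the highest Arb ID is selected. */
static size_t pick_lowest_priority(const struct lapic *cpus, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (cpus[i].tpr < cpus[best].tpr ||
            (cpus[i].tpr == cpus[best].tpr &&
             cpus[i].arb_id > cpus[best].arb_id))
            best = i;
    }
    return best;
}
```

Because Linux vectors start at 0x20 (class 2 and up), a TPR anywhere in
0x10-0x1f (class 1) never blocks a vector: vector_blocked(0x31, 0x1f) is 0.
The low nibble therefore only influences which CPU wins the arbitration,
which seems to be the point of the scheme.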

> 2) i initially played with the TPR and it does not really solve the P4
> problem. It can be used to force irqs away from a busy CPU, but in the
> common (idle, or mostly idle) case the TPR will be equivalent across CPUs,
> resulting in the same 'ugly' IRQ imbalance that you see.

As above, you can do a little more with it. I'm not too worried about the
'ugliness' of the distribution pattern, but we should throw enough randomness
in there to help. No, it's not deterministic, and you're correct, it probably doesn't
give you good enough guarantees to solve completely what you're discussing.

It does seem to me like a fine idea to get the idle cpus taking the interrupts,
and those doing user work to take them in preference to those cpus doing
interrupt processing already.

I'll go look at what you were doing to achieve what seems to be similar goals.

M.

2002-03-19 13:12:57

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Ingo Molnar wrote:

> i've written a patch for this, it's enclosed in this email. It implements
> a Brownian motion of IRQs, based on load patterns. The concept works
> really well on Foster CPUs - eg. it will redirect IRQs to idle CPUs - but
> if all CPUs are idle then the IRQs are randomly and evenly distributed
> between CPUs.

A nice idea. One note though -- the code depends on the TSC to be
present. It would be better to use:

	if (cpu_has_tsc)
		rdtscll(random_number);

and either preset random_number to a fixed value or even leave it
uninitialized to have it somewhat randomly set by what was found at the
stack for the TSC-less case.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-19 13:39:58

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Fri, 15 Mar 2002, Martin J. Bligh wrote:

> Dave could explain this to you better than I could, but if I remember
> all this correctly what he was really trying to do was program
> the APR, not the TPR, but the APR is a read-only register ... it's
> affected by the way you program the TPR. The docs are particularly
> opaque about how this really works. I've had pretty much exactly this
> argument with Dave before, and we thrashed out how it all worked in the
> process.
>
> What I normally look at is Section 7.5 (APIC) of Vol 3 of the PIII Intel
> docs. These are very confusing in this area, and Dave had some better
> docs that I'll see if I can dig out. But if you look carefully at them
> (and you know how it's meant to work before you start) it makes sense in
> a twisted sort of way.

I've found i82489DX docs to be most comprehensive in this as well as
other areas. The integrated APIC doesn't seem to differ much from the
i82489DX wrt the arbitration -- the only difference is the focus processor
checking feature that can't be disabled for the latter.

> Dynamic distribution assigns incoming interrupts to the lowest priority
> processor, which is generally the least busy processor ... <snip> ...
> from all processors listed in the destination, the processor selected is
> the one whose current arbitration priority is the lowest. The latter is
> specified in the arbitration priority register (APR) ... <snip> ... If
> more than one processor shares the lowest priority, the processor with
> the highest arbitration priority (the unique value in the Arb ID
> register) is selected.
>
> The last sentence is how round robin happens on an SMP P3 system. I presume
> this is what fell off for the P4.

Basically there is no way to arbitrate with the FSB delivery.

> Now look at the two paragraphs defining the TPR. The first para
> describes pretty much what you describe. Note that this operation would
> only require 4 bits. Now look at the second para, where they define the
> 4 msbs as corresponding to the interrupt priorities, and mumble
> something about the 4 lsbs, giving very little real information.
>
> Now look at the section defining the APR, and look at the weird
> algorithm, which does somewhat opaque things to derive the value of the
> APR from the TPR (and some other registers). It's easy to figure out
> that they're coupled, it's harder to figure out exactly how, and I can't
> remember exactly how this works right now.

Hmm, i82489DX docs explain it best, IIRC.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-19 14:33:25

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Martin Wilck wrote:

> Where the <== arrow is, the analyzer clearly shows the outb (0x0c, 0x20)
> operation. After that, we see strange cycles for ~70us (probably special
> cycles for arbitration, definitely no valid data transfers). The very next bus
> operation is a read on port 0x21 that reveals the junk value 0x07! The

The value is correct as after issuing the poll i8259 command the next
read cycle to the PIC returns an IRQ level (0x07 = no IRQ active; it
shouldn't happen here -- 0x80 is expected for active IRQ 0).
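
As a small aid for reading such analyzer traces, the poll byte can be
decoded as sketched below. The layout (bit 7 = interrupt pending, bits 2..0
= level) follows the 8259A poll word; struct poll_result and
decode_poll_byte() are names invented for this sketch.

```c
#include <stdbool.h>

struct poll_result {
    bool irq_pending;   /* bit 7 of the poll byte */
    unsigned level;     /* bits 2..0: highest-priority requesting level */
};

/* Split the byte returned by the read that follows a poll command. */
static struct poll_result decode_poll_byte(unsigned char b)
{
    struct poll_result r;

    r.irq_pending = (b & 0x80) != 0;
    r.level = b & 0x07u;
    return r;
}
```

Decoding 0x80 gives "pending, level 0" (the timer), while 0x07 gives "nothing
pending" with the level bits all set, matching the two values discussed
here.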

> inb(0x20) call is not captured in our protocol, it must occur long after
> the error. (We saw normal execution of the above code fragment where
> there is ~1us between the outb and inb, where it is >120us here).

Any chance your system uses the ExtINTA mode for the timer IRQ? This
would be very weird -- it's the fourth, last attempt to set up the timer
IRQ, meant not to happen really anytime (and it also means the i8259A
implementation is buggy, incomplete or is wired incorrectly). What does
your bootstrap log contain?

If that's really the case then the following patch should help -- there
is a subtle bug in the code indeed, sigh.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

patch-2.4.18-timer_ack-0
diff -up --recursive --new-file linux-2.4.18.macro/arch/i386/kernel/io_apic.c linux-2.4.18/arch/i386/kernel/io_apic.c
--- linux-2.4.18.macro/arch/i386/kernel/io_apic.c Fri Nov 23 15:32:04 2001
+++ linux-2.4.18/arch/i386/kernel/io_apic.c Tue Mar 19 14:21:35 2002
@@ -1567,6 +1567,7 @@ static inline void check_timer(void)

printk(KERN_INFO "...trying to set up timer as ExtINT IRQ...");

+ timer_ack = 0;
init_8259A(0);
make_8259A_irq(0);
apic_write_around(APIC_LVT0, APIC_DM_EXTINT);

2002-03-19 16:37:27

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

Hi Maciej,

thanks a lot for caring about this problem!

> > operation is a read on port 0x21 that reveals the junk value 0x07! The
>
> The value is correct as after issuing the poll i8259 command the next
> read cycle to the PIC returns an IRQ level (0x07 = no IRQ active; it
> shouldn't happen here -- 0x80 is expected for active IRQ 0).

We found this out in the meantime as well. Actually sometimes the read
obtains 0x80, though very seldom. If that happens, we may still get 0x07
later.

Our analysis so far has shown that:

- A system management interrupt (SMI) is generated *right after* the
outb (0x0c, 0x20); statement, i.e. before the inb (0x20).
The reason for that SMI is the aic7xxx driver probing for some
non-existing EISA device and causing a PCI bus abort. It could be
anything else, though.

- The SMI handler in the BIOS masks the PIC interrupts. It reads the
current mask (this is where the wrong value is obtained).
At exit from SMI, it restores this wrong value to register 0x21.
The SMI handler simply does not account for the fact that
the PIC may be in polling mode when it reads register 0x21.
We are right now trying a patched BIOS that does a dummy read
on the PIC, before trying to obtain the mask from 0x21.
This is an (almost) standard Phoenix BIOS, the IRQ masking code
was definitely not changed by anyone at our company.
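
The sequence described in the two points above can be replayed in a
small simulation. This is only a sketch under stated assumptions: the
PIC model keeps just the mask register and a "poll armed" latch, the
SMI handler is a hypothetical stand-in for the BIOS code (which we
cannot show), and all names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the master 8259A: mask register plus poll latch only. */
struct pic {
    uint8_t imr;        /* interrupt mask register, port 0x21 */
    bool poll_armed;    /* set by the poll command, cleared by next read */
    uint8_t poll_word;  /* value that next read will return */
};

static void pic_outb(struct pic *p, uint8_t val, uint16_t port)
{
    assert(port == 0x20 || port == 0x21);
    if (port == 0x20 && val == 0x0c)
        p->poll_armed = true;   /* OCW3 with the poll bit set */
    else if (port == 0x21)
        p->imr = val;           /* OCW1: new mask */
}

static uint8_t pic_inb(struct pic *p, uint16_t port)
{
    assert(port == 0x20 || port == 0x21);
    if (p->poll_armed) {        /* first read after poll, either port */
        p->poll_armed = false;
        return p->poll_word;
    }
    return port == 0x21 ? p->imr : 0;
}

/* Hypothetical SMI handler: save mask, mask everything, restore. */
static void smi_handler(struct pic *p, bool dummy_read_first)
{
    if (dummy_read_first)
        (void)pic_inb(p, 0x21); /* swallows a pending poll word */
    uint8_t saved = pic_inb(p, 0x21);
    pic_outb(p, 0xff, 0x21);
    /* ... SMM work would happen here ... */
    pic_outb(p, saved, 0x21);
}

/* Kernel issues the poll command, the SMI hits, then the kernel reads. */
static uint8_t mask_after_interrupted_ack(bool bios_dummy_read)
{
    struct pic p = { .imr = 0xfa, .poll_armed = false, .poll_word = 0x07 };

    pic_outb(&p, 0x0c, 0x20);
    smi_handler(&p, bios_dummy_read);
    (void)pic_inb(&p, 0x20);
    return p.imr;
}
```

In this model the mask ends up as 0x07 without the dummy read and stays
at 0xfa with it. Note that the dummy read consumes the poll word, so
the kernel's own inb no longer sees it.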

Thus yes, the timer_ack code is actually (partly) responsible for
our hang. I propose a patch for this below.

> Any chance your system uses the ExtINTA mode for the timer IRQ?

No, different problem - see above.

We use normal ExtInt operation:

Mar 12 16:51:30 pdb0374c kernel: ENABLING IO-APIC IRQs
Mar 12 16:51:30 pdb0374c kernel: ...changing IO-APIC physical APIC ID to 2 ... ok.
Mar 12 16:51:30 pdb0374c kernel: ...changing IO-APIC physical APIC ID to 3 ... ok.
Mar 12 16:51:30 pdb0374c kernel: ..TIMER: vector=0x31 pin1=-1 pin2=0
Mar 12 16:51:30 pdb0374c kernel: ...trying to set up timer (IRQ0) through the 8259A ...
Mar 12 16:51:30 pdb0374c kernel: ..... (found pin 0) ...works.

For the timer_ack stuff, I observe that it is only needed for
do_slow_gettimeoffset, i.e. only for CPUs which do not have a TSC.
Thus I propose the following patch:

--- ./arch/i386/kernel/time.c.ORIG Fri Mar 15 14:57:00 2002
+++ ./arch/i386/kernel/time.c Tue Mar 19 17:27:12 2002
@@ -384,7 +384,9 @@
/* last time the cmos clock got updated */
static long last_rtc_update;

+#ifndef CONFIG_X86_TSC
int timer_ack;
+#endif

/*
* timer_interrupt() needs to keep up the real-time clock,
@@ -392,7 +394,7 @@
*/
static inline void do_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
-#ifdef CONFIG_X86_IO_APIC
+#if defined (CONFIG_X86_IO_APIC) && ! defined (CONFIG_X86_TSC)
if (timer_ack) {
/*
* Subtle, when I/O APICs are used we have to ack timer IRQ
--- ./arch/i386/kernel/io_apic.c.ORIG Tue Mar 19 17:31:25 2002
+++ ./arch/i386/kernel/io_apic.c Tue Mar 19 17:30:02 2002
@@ -1478,7 +1478,9 @@
*/
static inline void check_timer(void)
{
+#ifndef CONFIG_X86_TSC
extern int timer_ack;
+#endif
int pin1, pin2;
int vector;

@@ -1498,7 +1500,7 @@
*/
apic_write_around(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
init_8259A(1);
- timer_ack = 1;
+ timer_ack = !(cpu_has_tsc);
enable_8259A_irq(0);

pin1 = find_isa_irq_pin(0, mp_INT);


This would rid us of our problem (although the BIOS hack may
suffice). However, more than that, it also spares ~2 us on each timer
interrupt for CPUs which do not need do_slow_gettimeoffset.

What do you think?

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy






2002-03-19 23:31:05

by Pavel Machek

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

Hi!

> --- ./arch/i386/kernel/time.c.ORIG Fri Mar 15 14:57:00 2002
> +++ ./arch/i386/kernel/time.c Tue Mar 19 17:27:12 2002
> @@ -384,7 +384,9 @@
> /* last time the cmos clock got updated */
> static long last_rtc_update;
>
> +#ifndef CONFIG_X86_TSC
> int timer_ack;
> +#endif
>
> /*
> * timer_interrupt() needs to keep up the real-time clock,
> @@ -392,7 +394,7 @@
> */
> static inline void do_timer_interrupt(int irq, void *dev_id, struct pt_regs *regs)
> {
> -#ifdef CONFIG_X86_IO_APIC
> +#if defined (CONFIG_X86_IO_APIC) && ! defined (CONFIG_X86_TSC)
> if (timer_ack) {
> /*
> * Subtle, when I/O APICs are used we have to ack timer IRQ
> --- ./arch/i386/kernel/io_apic.c.ORIG Tue Mar 19 17:31:25 2002
> +++ ./arch/i386/kernel/io_apic.c Tue Mar 19 17:30:02 2002
> @@ -1478,7 +1478,9 @@
> */
> static inline void check_timer(void)
> {
> +#ifndef CONFIG_X86_TSC
> extern int timer_ack;
> +#endif
> int pin1, pin2;
> int vector;
>
> @@ -1498,7 +1500,7 @@
> */
> apic_write_around(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
> init_8259A(1);
> - timer_ack = 1;
> + timer_ack = !(cpu_has_tsc);
> enable_8259A_irq(0);
>
> pin1 = find_isa_irq_pin(0, mp_INT);
>
>
> This would rid us of our problem (although the BIOS hack may
> suffice). However, more than that, it also spares ~2 us on each timer
> interrupt for CPUs which do not need do_slow_gettimeoffset.
>
> What do you think?

Well, you should get your bios fixed.

Then... Those ifdefs are not necessary, right? You only need ...

> @@ -1498,7 +1500,7 @@
> */
> apic_write_around(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
> init_8259A(1);
> - timer_ack = 1;
> + timer_ack = !(cpu_has_tsc);
> enable_8259A_irq(0);
>
> pin1 = find_isa_irq_pin(0, mp_INT);

... these lines.
Pavel

--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-03-20 07:50:03

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, Pavel Machek wrote:

> > This would rid us of our problem (although the BIOS hack may
> > suffice). However, more than that, it also spares ~2 us on each timer
> > interrupt for CPUs which do not need do_slow_gettimeoffset.
> >
> > What do you think?
>
> Well, you should get your bios fixed.

Whatever's wrong with our BIOS is not the focus of the patch.
There are two sides to this, and the Linux side is that the
IO-operations to port 0x20 are completely superfluous on modern CPUs.

> Then... Those ifdefs are not necessary, right? You only need ...
[ snip ]
> ... these lines.

There is no point in keeping the timer_ack variable in the kernel's
symbol table if it's compiled for CPUs with TSC only, and no need
to test it in every timer interrupt if we know it's going to be 0
anyway.

Please note that do_slow_gettimeoffset is completely #ifdef'ed out in
time.c for CPUs with TSC.

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy





2002-03-20 13:28:32

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, Martin Wilck wrote:

> Whatever's wrong with our BIOS is not the focus of the patch.

Sure, but working around hopeless cases is not Linux's job either.
We cannot guarantee operation on an out-of-spec system that carelessly
fiddles with registers beneath the kernel. The system needs to be fixed.

> There are two sides to this, and the Linux side is that the
> IO-operations to port 0x20 are completely superfluous on modern CPUs.

Depending on the configuration -- a user may specify "notsc" for whatever
reason (although admittedly, that's mostly a debugging option).

> There is no point in keeping the timer_ack variable in the kernel's
> symbol table if it's compiled for CPUs with TSC only, and no need

There is a point, actually. There are i82489DX-based Pentium systems.
They work fine with a CONFIG_X86_TSC kernel but still they require the
code in do_timer_interrupt() for NMI watchdog deassertion -- watch the
comment.

> to test it in every timer interrupt if we know it's going to be 0
> anyway.

The current code actually supports discrete i82489DX local APICs for any
CPU, so you don't really know unless you define an additional
configuration option to force integrated APIC support only. Do you really
think it's worth the hassle to save on the order of two or three CPU
cycles per 10ms?

> Please note that do_slow_gettimeoffset is completely #ifdef'ed out in
> time.c for CPUs with TSC.

In short some code like:

timer_ack = !(cpu_has_tsc &&
APIC_INTEGRATED(GET_APIC_VERSION(apic_read(APIC_LVR))));

should suffice as the condition to disable the code in
do_timer_interrupt() for systems using the through-8259A mode. There is
no need to keep it enabled unconditionally and I/O cycles are quite
expensive. The following patch implements it. Please test it. It should
cure your problems as a side effect, but that does not mean the BIOS isn't
to be fixed.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

patch-2.4.18-timer_ack-1
diff -up --recursive --new-file linux-2.4.18.macro/arch/i386/kernel/io_apic.c linux-2.4.18/arch/i386/kernel/io_apic.c
--- linux-2.4.18.macro/arch/i386/kernel/io_apic.c Fri Nov 23 15:32:04 2001
+++ linux-2.4.18/arch/i386/kernel/io_apic.c Wed Mar 20 12:36:11 2002
@@ -1481,6 +1481,10 @@ static inline void check_timer(void)
extern int timer_ack;
int pin1, pin2;
int vector;
+ unsigned int ver;
+
+ ver = apic_read(APIC_LVR);
+ ver = GET_APIC_VERSION(ver);

/*
* get/set the timer IRQ vector:
@@ -1494,11 +1498,15 @@ static inline void check_timer(void)
* mode for the 8259A whenever interrupts are routed
* through I/O APICs. Also IRQ0 has to be enabled in
* the 8259A which implies the virtual wire has to be
- * disabled in the local APIC.
+ * disabled in the local APIC. Finally timer interrupts
+ * need to be acknowledged manually in the 8259A for
+ * do_slow_gettimeoffset() and for the i82489DX when using
+ * the NMI watchdog.
*/
apic_write_around(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
init_8259A(1);
- timer_ack = 1;
+ timer_ack = !cpu_has_tsc;
+ timer_ack |= nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver);
enable_8259A_irq(0);

pin1 = find_isa_irq_pin(0, mp_INT);
@@ -1516,7 +1524,8 @@ static inline void check_timer(void)
disable_8259A_irq(0);
setup_nmi();
enable_8259A_irq(0);
- check_nmi_watchdog();
+ if (check_nmi_watchdog() < 0)
+ timer_ack = !cpu_has_tsc;
}
return;
}
@@ -1535,7 +1544,8 @@ static inline void check_timer(void)
printk("works.\n");
if (nmi_watchdog == NMI_IO_APIC) {
setup_nmi();
- check_nmi_watchdog();
+ if (check_nmi_watchdog() < 0)
+ timer_ack = !cpu_has_tsc;
}
return;
}
@@ -1567,6 +1577,7 @@ static inline void check_timer(void)

printk(KERN_INFO "...trying to set up timer as ExtINT IRQ...");

+ timer_ack = 0;
init_8259A(0);
make_8259A_irq(0);
apic_write_around(APIC_LVT0, APIC_DM_EXTINT);
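
The two conditions discussed in this message, the short form given in
the prose and the one the patch sets at the top of check_timer(), can
be compared in a small sketch. The macro definitions below are local
stand-ins that are assumed to mirror the kernel's apicdef.h; the
function names are hypothetical, and the later reset after a failed
watchdog check is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the kernel's apicdef.h macros (assumed). */
#define GET_APIC_VERSION(x) ((x) & 0xFFu)
#define APIC_INTEGRATED(x)  ((x) & 0xF0u)

/* Version 0x0x is a discrete i82489DX, 0x1x an integrated local APIC. */
static_assert(!APIC_INTEGRATED(0x01u) && APIC_INTEGRATED(0x14u),
              "high nibble separates discrete from integrated APICs");

/* Short form: ack manually unless there is a TSC and an integrated APIC. */
static bool timer_ack_short(bool has_tsc, unsigned lvr)
{
    return !(has_tsc && APIC_INTEGRATED(GET_APIC_VERSION(lvr)));
}

/* Patch form: a discrete APIC only matters with the I/O APIC watchdog. */
static bool timer_ack_patch(bool has_tsc, bool nmi_io_apic, unsigned lvr)
{
    bool ack = !has_tsc;

    ack |= nmi_io_apic && !APIC_INTEGRATED(GET_APIC_VERSION(lvr));
    return ack;
}
```

The two forms differ only for a TSC-capable CPU with a discrete APIC
and no I/O APIC NMI watchdog: the short form still acks, the patch form
does not.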

2002-03-20 13:49:56

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system


Dear Maciej,

thanks for your comments and the patch. I overlooked that NMI watchdog
thing. I'll test it, although I am certain that it'll fix our problem.

Btw, we _have_ already fixed our BIOS (at least on my test machine).
I just submitted the patch because I thought that Linux putting the
8259A in polling mode is also a dangerous thing that should be avoided
if possible. You have shown me that there are more situations where
it cannot be avoided than I had realized.

Many people seem to think our BIOS is particularly nasty. I'd like to
repeat that this is a pretty common Phoenix BIOS. Of course I can't tell
what other manufacturers do, but I'd consider it at least possible that
their BIOSes act similarly.

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy





2002-03-20 13:53:27

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Tue, 19 Mar 2002, Martin Wilck wrote:

> - The SMI handler in the BIOS masks the PIC interrupts. It reads the
> current mask (this is where the wrong value is obtained).
> At exit from SMI, it restores this wrong value to register 0x21.
> The SMI handler simply does not account for the fact that
> the PIC may be in polling mode when it reads register 0x21.
> We are right now trying a patched BIOS that does a dummy read
> on the PIC, before trying to obtain the mask from 0x21.
> This is an (almost) standard Phoenix BIOS, the IRQ masking code
> was definitely not changed by anyone at our company.

Why do you need to fiddle with the 8259A's IMR at all? What's wrong with
cli?

> Thus yes, the timer_ack code is actually (partly) responsible for
> our hang. I propose a patch for this below.

The current code is correct, written to 8259A's specs. I used original
ones ("8259A Programmable Interrupt Controller (8259A/8259A-2)", Intel's
order number 231468) as a reference to assure whatever happens is
well-specified and not an implementation-specific quirk. The SMI code is
broken for not being transparent (or for existing at all, but that's
another matter).

> We use normal ExtInt operation:
>
> Mar 12 16:51:30 pdb0374c kernel: ENABLING IO-APIC IRQs
> Mar 12 16:51:30 pdb0374c kernel: ...changing IO-APIC physical APIC ID to 2 ... ok.
> Mar 12 16:51:30 pdb0374c kernel: ...changing IO-APIC physical APIC ID to 3 ... ok.
> Mar 12 16:51:30 pdb0374c kernel: ..TIMER: vector=0x31 pin1=-1 pin2=0
> Mar 12 16:51:30 pdb0374c kernel: ...trying to set up timer (IRQ0) through the 8259A ...
> Mar 12 16:51:30 pdb0374c kernel: ..... (found pin 0) ...works.

That's not the normal mode -- that's the through-8259A mode workaround
designed originally for i82357 (ISP) based EISA systems. The chip
predates the APIC and does not make IRQ 0 from its internal 8254 core
available externally. I am actually very surprised to see new systems not
connecting IRQ 0 to the I/O APIC.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-20 14:13:45

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, Martin Wilck wrote:

> I just submitted the patch because I thought that Linux putting the
> 8259A in polling mode is also a dangerous thing that should be avoided
> if possible. You have shown me that there are more situations where
> it cannot be avoided than I had realized.

I don't think it's dangerous on sane systems -- the 8259A is such an old
and simple chip there is no excuse for not getting appropriate knowledge
on it before working on it both from the hardware and the software's
points of view. There is no need to waste cycles, of course.

> Many people seem to think our BIOS is particularly nasty. I'd like to
> repeat that this is a pretty common Phoenix BIOS. Of course I can't tell
> what other manufacturers do, but I'd consider it at least possible that
> their BIOSes act similarly.

Well, experience shows BIOSes tend to be nasty -- I don't think yours is
particular here, although bugs may differ.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-20 14:14:15

by Alan

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

> Depending on the configuration -- a user may specify "notsc" for whatever
> reason (although admittedly, that's mostly a debugging option).

Mixed multiplier x86 for one - I've got some basic code there to handle
this automatically but it's not yet finished. Plenty of BP6 folks have
mismatched celewrongs

> no need to keep it enabled unconditionally and I/O cycles are quite
> expensive. The following patch implements it. Please test it. It should
> cure your problems as a side effect, but that does not mean the BIOS isn't
> to be fixed.

The DMI strings for that BIOS version would be useful too, so that we can
panic with a "BIOS upgrade required" message

2002-03-20 16:17:34

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, Maciej W. Rozycki wrote:

> Why do you need to fiddle with the 8259A's IMR at all? What's wrong with
> cli?

Only Phoenix developers can answer this.

In principle not even cli is necessary because all interrupts are
automatically disabled upon entry into SMM mode. The BIOS does mask
the PIC, though, probably because it is possible for SMI handlers to
reenable interrupts while in SMM mode. AFAIK, our BIOS does _not_
include such handlers, but Phoenix seems to have put in that code
so that it is possible to write them if desired. It appears that the
Phoenix programmers overlooked the fact that the PIC might be in poll
mode when they try to read the IMR. Our BIOS people are trying to
contact Phoenix about this.

> The current code is correct, written to 8259A's specs. I used original
> ones ("8259A Programmable Interrupt Controller (8259A/8259A-2)", Intel's
> order number 231468) as a reference to assure whatever happens is
> well-specified and not an implementation-specific quirk. The SMI code is
> broken for not being transparent (or for existing at all, but that's
> another matter).

Can you tell me how the SMI code could correctly and transparently reestablish
the 8259A state in the situation we are facing?
I can't think of an algorithm that would put the PIC in exactly the same
state that it was in before the SMI, not to talk about the time elapsed
during SMI, which easily comes up to a few 100 us on our system.

> That's not the normal mode -- that's the through-8259A mode workaround
> designed originally for i82357 (ISP) based EISA systems. The chip
> predates the APIC and does not make IRQ 0 from its internal 8254 core
> available externally. I am actually very surprised to see new systems not
> connecting IRQ 0 to the I/O APIC.

Oops - I will contact our hardware guys about this.

Thanks for your comments,
Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy





2002-03-20 20:24:47

by William Lee Irwin III

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Tue, Mar 19, 2002 at 03:32:22PM +0100, Maciej W. Rozycki wrote:
> The value is correct as after issuing the poll i8259 command the next
> read cycle to the PIC returns an IRQ level (0x07 = no IRQ active; it
> shouldn't happen here -- 0x80 is expected for active IRQ 0).

On Wed, 13 Mar 2002, Martin Wilck wrote:
>> inb(0x20) call is not captured in our protocol, it must occur long after
>> the error. (We saw normal execution of the above code fragment where
>> there is ~1us between the outb and inb, where it is >120us here).


There is/was at least one simulator that shoots the kernel (and
everything else) dead in response to frobbing the PIC while the local
APIC timer etc. are going, and I wouldn't be surprised if there were
some real hardware that did so as well.

(this is the code under if (timer_ack) in do_timer_interrupt())


Cheers,
Bill

2002-03-21 07:04:54

by Zwane Mwaikambo

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, William Lee Irwin III wrote:

> There is/was at least one simulator that shoots the kernel (and
> everything else) dead in response to frobbing the PIC while the local
> APIC timer etc. are going, and I wouldn't be surprised if there were
> some real hardware that did so as well.

The problem was that the simulator didn't support polling mode, which
Linux 2.4 seems to use for handling the timer interrupt. The simulator PIC
implementation wasn't complete if it didn't support polling mode. You
shouldn't see this on real hardware.

Zwane


2002-03-21 13:03:45

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, Martin Wilck wrote:

> In principle not even cli is necessary because all interrupts are
> automatically disabled upon entry into SMM mode. The BIOS does mask
> the PIC, though, probably because it is possible for SMI handlers to
> reenable interrupts while in SMM mode. AFAIK, our BIOS does _not_
> include such handlers, but Phoenix seems to have put in that code
> so that it is possible to write them if desired. It appears that the
> Phoenix programmers overlooked the fact that the PIC might be in poll
> mode when they try to read the IMR. Our BIOS people are trying to
> contact Phoenix about this.

This does not make sense -- fiddling with 8259A chips does not assure
maskable interrupts won't arrive. They still may come from the APIC bus
and from local APIC sources. And you don't really want to handle
interrupts in the SMI mode unless you know details of the operating system
running -- if you execute an iret instruction NMIs get enabled and even
more mess may happen (this may be the reason of some systems failing with
the NMI watchdog enabled).

> Can you tell me how the SMI code could correctly and transparently reestablish
> the 8259A state in the situation we are facing?
> I can't think of an algorithm that would put the PIC in exactly the same
> state that it was in before the SMI, not to talk about the time elapsed
> during SMI, which easily comes up to a few 100 us on our system.

There are two ways I can think of at the moment:

1. Don't touch it at all. It's the preferred way if possible, as it
doesn't change the hardware's state.

2. Use alternate registers dedicated to save/restore the hardware's state
(typically used for power management), that provide details beyond the
standard ones. Of course if a particular implementation doesn't have them
(are there any these days?), you don't really have a choice apart from #1
above.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-21 18:51:30

by Martin Wilck

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

Dear Maciej,

> In short some code like:
>
> timer_ack = !(cpu_has_tsc &&
> APIC_INTEGRATED(GET_APIC_VERSION(apic_read(APIC_LVR))));
>
> should suffice as the condition to disable the code in
> do_timer_interrupt() for systems using the through-8259A mode. There is
> no need to keep it enabled unconditionally and I/O cycles are quite
> expensive. The following patch implements it. Please test it. It should
> cure your problems as a side effect, but that does not mean the BIOS isn't
> to be fixed.

please consider submitting this patch to Linus and Marcelo.
Perhaps it needs testing by some more people,
but I think it's the right thing to do.

Martin

--
Martin Wilck Phone: +49 5251 8 15113
Fujitsu Siemens Computers Fax: +49 5251 8 20409
Heinz-Nixdorf-Ring 1 mailto:[email protected]
D-33106 Paderborn http://www.fujitsu-siemens.com/primergy





2002-03-21 23:36:46

by William Lee Irwin III

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 20 Mar 2002, William Lee Irwin III wrote:
>> There is/was at least one simulator that shoots the kernel (and
>> everything else) dead in response to frobbing the PIC while the local
>> APIC timer etc. are going, and I wouldn't be surprised if there were
>> some real hardware that did so as well.

On Thu, Mar 21, 2002 at 08:54:53AM +0200, Zwane Mwaikambo wrote:
> The problem was that the simulator didn't support polling mode, which
> Linux 2.4 seems to use for handling the timer interrupt. The simulator PIC
> implementation wasn't complete if it didn't support polling mode. You
> shouldn't see this on real hardware.

Polling mode on the PIC wasn't there, true, so it did trigger a
simulator bug -- but the PIC should never have been touched. It was
using the local APIC timer etc. Or at least it certainly seems
counterintuitive to me that the PIC is set into polling mode and EOI'd
when the local APIC is the one whose timer is used and is where the interrupt
came from. We're using the APIC... why are we touching the PIC?


Cheers,
Bill

2002-03-22 00:20:23

by Maciej W. Rozycki

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Thu, 21 Mar 2002, William Lee Irwin III wrote:

> Polling mode on the PIC wasn't there, true, so it did trigger a
> simulator bug -- but the PIC should never have been touched. It was
> using the local APIC timer etc. Or at least it certainly seems
> counterintuitive to me that the PIC is set into polling mode and EOI'd
> when the local APIC is the one whose timer is used and is where the interrupt
> came from. We're using the APIC... why are we touching the PIC?

That's the trick to let the NMI watchdog work. And also to let timer
interrupts be delivered in the APIC's native mode for broken setups. 8254
interrupts are split and delivered via two ways:

1a. A normal LoPri delivery for the timer interrupt. Typically pin #2 of
I/O APIC #0 is used, which is directly connected to the output of timer #0
of the 8254.

1b. Alternatively for broken setups that do not connect the timer to an
I/O APIC the so called through-8259A mode is used. The 8259A is
programmed to pass its IRQ 0 transparently, typically to pin #0 of I/O
APIC #0. Again the LoPri delivery mode is used. But code in
do_slow_gettimeoffset() depends on the 8259A's IRR register being set
appropriately, so we need to ack timer interrupts to the 8259A manually
(the LoPri mode doesn't permit processor's INTA cycles to appear beyond
the local APIC).

2. The through-8259A mode is also used for broadcasting interrupts at the
same time to LINT0 local interrupt inputs of all processors (they are
typically tied together and connected to the INT output of the master
8259A). Local APICs are programmed to deliver LINT0 interrupts as NMIs.
But NMIs are level-triggered in the i82489DX APIC, so they need to be
deasserted ASAP to let system error NMIs (passed via LINT1) be
recognized. Again they get deasserted by sending an ack to the 8259A
manually.
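
The manual ack mentioned in 1b and 2 above is the poll command followed
by a read. What that pair does to the 8259A can be sketched with a toy
model. This is an illustration only: it ignores priority nesting and
cascading, the 0x07 "nothing pending" value is the one reported earlier
in this thread, and every name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy 8259A: request, in-service and mask registers plus poll latch. */
struct pic8259 {
    uint8_t irr;      /* interrupt request register */
    uint8_t isr;      /* in-service register */
    uint8_t imr;      /* interrupt mask register */
    bool aeoi;        /* automatic end-of-interrupt mode */
    bool poll_armed;  /* a poll command is waiting for its read */
};

/* Stands in for outb(0x0c, 0x20). */
static void pic_poll_command(struct pic8259 *p)
{
    assert(p);
    p->poll_armed = true;
}

/* Stands in for inb(0x20). */
static uint8_t pic_read(struct pic8259 *p)
{
    assert(p);
    if (!p->poll_armed)
        return p->irr;              /* plain read: IRR in this model */
    p->poll_armed = false;

    uint8_t req = (uint8_t)(p->irr & ~p->imr);
    for (unsigned level = 0; level < 8; level++) {
        uint8_t bit = (uint8_t)(1u << level);
        if (req & bit) {
            p->irr &= (uint8_t)~bit;    /* the read acts as the ack */
            if (!p->aeoi)
                p->isr |= bit;          /* AEOI ends it immediately */
            return (uint8_t)(0x80u | level);
        }
    }
    return 0x07;                    /* nothing pending */
}
```

With IRQ 0 requested, unmasked and AEOI enabled, one command/read pair
returns 0x80 and leaves both IRR and ISR clear, so the request is
deasserted without any INTA cycle reaching the 8259A.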

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2002-03-27 17:35:56

by Steffen Persvold

Subject: Re: Severe IRQ problems on Foster (P4 Xeon) system

On Wed, 13 Mar 2002, Ingo Molnar wrote:

>
> On Wed, 13 Mar 2002, Martin Wilck wrote:
>
> > First of all, we see that virtually 100% of all IRQs are handled by
> > CPU 0. I have seen this reported a number of times before. I guess it
> > can become a severe performance problem in IRQ-intensive situations.
>
> i've written a patch for this, it's enclosed in this email. It implements
> a Brownian motion of IRQs, based on load patterns. The concept works
> really well on Foster CPUs - eg. it will redirect IRQs to idle CPUs - but
> if all CPUs are idle then the IRQs are randomly and evenly distributed
> between CPUs.
>
> (the patch can be made cheaper, but i've kept the overhead per-IRQ for the
> time being to have more flexibility.)
>
> let me know whether this fixes your problem,
>

Hi Ingo,

I've tested your patch with a 2.4.18 kernel on a few SMP systems: i860,
Plumas (E7500), 760MP(X), ServerWorks HE-SL and ServerWorks LE. It works
fine in all cases. I had to modify the patch a little bit in order to make
it compile on uniprocessor. I've attached the modified patch.

Will this patch be included in 2.4.19 ?

Regards,
--
Steffen Persvold | Scalable Linux Systems | Try out the world's best
mailto:[email protected] | http://www.scali.com | performing MPI implementation:
Tel: (+47) 2262 8950 | Olaf Helsets vei 6 | - ScaMPI 1.13.8 -
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY | >320MBytes/s and <4uS latency


Attachments:
linux-2.4.18-irqbalancing.patch (4.96 kB)