2005-11-10 19:53:22

by Dinakar Guniguntala

Subject: IO-APIC problem with 2.6.14-rt9

Hi,

I get this on boot with 2.6.14-rt9

Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#3.
CPU3: Intel P4/Xeon Extended MCE MSRs (12) available
CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
Total of 4 processors activated (11165.69 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... failed.
...trying to set up timer as ExtINT IRQ... failed :(.
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the


It does boot up with noapic, but a nonRT kernel does not have the above problem

I have attached the .config below

Appreciate any help


-Dinakar




Attachments:
(No filename) (853.00 B)
config-llm11 (24.32 kB)

2005-11-10 20:02:10

by Ingo Molnar

Subject: Re: IO-APIC problem with 2.6.14-rt9


* Dinakar Guniguntala <[email protected]> wrote:

> Hi,
>
> I get this on boot with 2.6.14-rt9
>
> Intel machine check architecture supported.
> Intel machine check reporting enabled on CPU#3.
> CPU3: Intel P4/Xeon Extended MCE MSRs (12) available
> CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
> Total of 4 processors activated (11165.69 BogoMIPS).
> ENABLING IO-APIC IRQs
> ..TIMER: vector=0x31 pin1=2 pin2=-1
> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> ...trying to set up timer (IRQ0) through the 8259A ... failed.
> ...trying to set up timer as Virtual Wire IRQ... failed.
> ...trying to set up timer as ExtINT IRQ... failed :(.
> Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the

does it help if you edit include/asm-i386/timex.h and change this line:

//#define ARCH_HAS_READ_CURRENT_TIMER 1

to:

#define ARCH_HAS_READ_CURRENT_TIMER 1

?

Ingo

2005-11-10 20:20:45

by Dinakar Guniguntala

Subject: Re: IO-APIC problem with 2.6.14-rt9

On Thu, Nov 10, 2005 at 09:02:05PM +0100, Ingo Molnar wrote:
>
> * Dinakar Guniguntala <[email protected]> wrote:
>
> > Hi,
> >
> > I get this on boot with 2.6.14-rt9
> >
> > Intel machine check architecture supported.
> > Intel machine check reporting enabled on CPU#3.
> > CPU3: Intel P4/Xeon Extended MCE MSRs (12) available
> > CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
> > Total of 4 processors activated (11165.69 BogoMIPS).
> > ENABLING IO-APIC IRQs
> > ..TIMER: vector=0x31 pin1=2 pin2=-1
> > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > ...trying to set up timer (IRQ0) through the 8259A ... failed.
> > ...trying to set up timer as Virtual Wire IRQ... failed.
> > ...trying to set up timer as ExtINT IRQ... failed :(.
> > Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the
>
> does it help if you edit include/asm-i386/timex.h and change this line:
>
> //#define ARCH_HAS_READ_CURRENT_TIMER 1
>
> to:
>
> #define ARCH_HAS_READ_CURRENT_TIMER 1
>
> ?

It works !! Thanks Ingo for the immediate response

Just a clarification. The comment in the file include/asm-i386/timex.h
says

/*
* On an Athlon64 the cycles-based estimator is off by a
* factor of 2: udelay(100) takes 200 usecs. With the non-TSC
* based estimator the timings are precise. So turn it off.
*/
#define ARCH_HAS_READ_CURRENT_TIMER 1

Does this mean that this is not Athlon specific and needs to be
changed? I have an IBM x255 with Xeon processors

-Dinakar





2005-11-10 20:29:39

by john stultz

Subject: Re: IO-APIC problem with 2.6.14-rt9

On Fri, 2005-11-11 at 02:00 +0530, Dinakar Guniguntala wrote:
> On Thu, Nov 10, 2005 at 09:02:05PM +0100, Ingo Molnar wrote:
> >
> > * Dinakar Guniguntala <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I get this on boot with 2.6.14-rt9
> > >
> > > Intel machine check architecture supported.
> > > Intel machine check reporting enabled on CPU#3.
> > > CPU3: Intel P4/Xeon Extended MCE MSRs (12) available
> > > CPU3: Intel(R) Xeon(TM) MP CPU 2.50GHz stepping 05
> > > Total of 4 processors activated (11165.69 BogoMIPS).
> > > ENABLING IO-APIC IRQs
> > > ..TIMER: vector=0x31 pin1=2 pin2=-1
> > > ..MP-BIOS bug: 8254 timer not connected to IO-APIC
> > > ...trying to set up timer (IRQ0) through the 8259A ... failed.
> > > ...trying to set up timer as Virtual Wire IRQ... failed.
> > > ...trying to set up timer as ExtINT IRQ... failed :(.
> > > Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug and send a report. Then try booting with the
> >
> > does it help if you edit include/asm-i386/timex.h and change this line:
> >
> > //#define ARCH_HAS_READ_CURRENT_TIMER 1
> >
> > to:
> >
> > #define ARCH_HAS_READ_CURRENT_TIMER 1
> >
> > ?
>
> It works !! Thanks Ingo for the immediate response

Hrm. Could you post the value for BogoMIPS that you're getting now?

My patches touch the __delay() code, since using the TSC based delay has
just as many, if not more, problems as the loop based delay. So I want
to be careful that my changes are not further causing problems.

Ingo, did you comment out ARCH_HAS_READ_CURRENT_TIMER because of
problems with the new calibration code?

thanks
-john

2005-11-10 20:46:13

by Dinakar Guniguntala

Subject: Re: IO-APIC problem with 2.6.14-rt9

On Thu, Nov 10, 2005 at 12:29:34PM -0800, john stultz wrote:
>
> Hrm. Could you post the value for BogoMIPS that you're getting now?

Here it is

vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Xeon(TM) MP CPU 2.50GHz
stepping : 5
cpu MHz : 2488.063
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4975.15


I checked that this is the same even on a vanilla 2.6.14 as well

-Dinakar

2005-11-10 21:05:04

by Ingo Molnar

Subject: Re: IO-APIC problem with 2.6.14-rt9


* john stultz <[email protected]> wrote:

> > > //#define ARCH_HAS_READ_CURRENT_TIMER 1
> > >
> > > to:
> > >
> > > #define ARCH_HAS_READ_CURRENT_TIMER 1
> > >
> > > ?
> >
> > It works !! Thanks Ingo for the immediate response
>
> Hrm. Could you post the value for BogoMIPS that you're getting now?
>
> My patches touch the __delay() code, since using the TSC based delay
> has just as many, if not more, problems as the loop based delay. So I
> want to be careful that my changes are not further causing problems.
>
> Ingo, did you comment out ARCH_HAS_READ_CURRENT_TIMER because of
> problems with the new calibration code?

yes. traces show that the new calibration code results in a bogomips
value on Athlon64 CPUs that halves the timeout. I.e. udelay(100) now
takes 50 usecs (!). The calibration code seems to assume the number of
cycles == number of loops in __delay() - that is not valid. The
calibration needs to happen based on some real clock, such as the PIT,
or PIT-driven jiffies.

Ingo
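
The unit mismatch described above can be put as arithmetic. If the
calibration stores TSC ticks per jiffy but __delay() then spins that
many loop iterations, the real delay scales by however many TSC ticks
one iteration costs. A minimal userspace sketch; the function name and
the ticks-per-iteration figure are illustrative assumptions, not
measurements from this thread:

```c
/*
 * Hypothetical helper, not kernel code: the time a loop-based delay
 * really takes when its loop count was calibrated as if one loop
 * iteration were one TSC tick.
 *
 * requested_usecs:    what the caller asked udelay() for
 * tsc_ticks_per_loop: TSC ticks that elapse per loop iteration (assumed)
 */
static double actual_delay_usecs(double requested_usecs,
                                 double tsc_ticks_per_loop)
{
    return requested_usecs * tsc_ticks_per_loop;
}
```

With an assumed ratio of 0.5, actual_delay_usecs(100, 0.5) gives 50,
which matches the halved timeout reported here; only a ratio of 1.0
makes the two units agree.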

2005-11-10 21:42:59

by john stultz

Subject: Re: IO-APIC problem with 2.6.14-rt9

On Thu, 2005-11-10 at 22:04 +0100, Ingo Molnar wrote:
> * john stultz <[email protected]> wrote:
>
> > > > //#define ARCH_HAS_READ_CURRENT_TIMER 1
> > > >
> > > > to:
> > > >
> > > > #define ARCH_HAS_READ_CURRENT_TIMER 1
> > > >
> > > > ?
> > >
> > > It works !! Thanks Ingo for the immediate response
> >
> > Hrm. Could you post the value for BogoMIPS that you're getting now?
> >
> > My patches touch the __delay() code, since using the TSC based delay
> > has just as many, if not more, problems as the loop based delay. So I
> > want to be careful that my changes are not further causing problems.
> >
> > Ingo, did you comment out ARCH_HAS_READ_CURRENT_TIMER because of
> > problems with the new calibration code?
>
> yes. traces show that the new calibration code results in a bogomips
> value on Athlon64 CPUs that halves the timeout. I.e. udelay(100) now
> takes 50 usecs (!). The calibration code seems to assume the number of
> cycles == number of loops in __delay() - that is not valid.

Yea, that makes sense, because the READ_CURRENT_TIMER calibration is all
TSC based and with my code we use the loop based delay (since the TSC
based one can have a number of problems). So that doesn't mesh well when
the loop/cycle values are not equivalent.

That still leaves open the question why Dinakar is seeing issues w/ the
loop based calibration, but I've got some similar hardware in my lab, so
I can probably work that out.

I'll see if I can't avoid touching the delay code. It's such a sketchy
calibration sensitive code path that I'd really like to see it killed,
but maybe there's something simple that can be done.

Grumble. :( I was hoping to submit my tod code to Andrew tomorrow, but
this might block that.

thanks
-john




2005-11-11 07:38:43

by Ingo Molnar

Subject: Re: IO-APIC problem with 2.6.14-rt9


* john stultz <[email protected]> wrote:

> > yes. traces show that the new calibration code results in a bogomips
> > value on Athlon64 CPUs that halves the timeout. I.e. udelay(100) now
> > takes 50 usecs (!). The calibration code seems to assume the number of
> > cycles == number of loops in __delay() - that is not valid.
>
> Yea, that makes sense, because the READ_CURRENT_TIMER calibration is
> all TSC based and with my code we use the loop based delay (since the
> TSC based one can have a number of problems). So that doesn't mesh
> well when the loop/cycle values are not equivalent.
>
> That still leaves open the question why Dinakar is seeing issues w/
> the loop based calibration, but I've got some similar hardware in my
> lab, so I can probably work that out.
>
> I'll see if I can't avoid touching the delay code. It's such a sketchy
> calibration sensitive code path that I'd really like to see it killed,
> but maybe there's something simple that can be done.
>
> Grumble. :( I was hoping to submit my tod code to Andrew tomorrow, but
> this might block that.

hm, ARCH_HAS_READ_CURRENT_TIMER is upstream already. I have not measured
the udelay thing upstream, but i thought it would have the same issue.
Does the GTOD code impact this code?

Ingo

2005-11-11 08:20:10

by Ingo Molnar

Subject: Re: IO-APIC problem with 2.6.14-rt9


* Ingo Molnar <[email protected]> wrote:

> > Grumble. :( I was hoping to submit my tod code to Andrew tomorrow, but
> > this might block that.
>
> hm, ARCH_HAS_READ_CURRENT_TIMER is upstream already. I have not
> measured the udelay thing upstream, but i thought it would have the
> same issue. Does the GTOD code impact this code?

ah, i see - you changed __delay to always be loop-based. That is quite
incorrect.

but i think there is a generic way to solve this: just busy-poll
->read_cycles(). I.e. move __delay into the generic code too. This means
even more architecture-specific code would be consolidated, which is
always good.

Ingo
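
A generic busy-poll delay along the lines suggested above might look
like the following userspace sketch. The struct, the function names and
the fake counter are hypothetical stand-ins; only the idea of polling a
read_cycles() hook comes from the message:

```c
#include <stdint.h>

/* Hypothetical stand-in for a time source exposing a read_cycles() hook. */
struct cycle_source {
    uint64_t (*read_cycles)(void);
};

/*
 * Spin until at least `cycles` ticks of the source have elapsed.
 * Unsigned subtraction keeps the comparison correct across counter wrap.
 * Returns the number of ticks observed, for illustration only.
 */
static uint64_t delay_cycles(const struct cycle_source *src, uint64_t cycles)
{
    uint64_t start = src->read_cycles();
    uint64_t now = start;

    while (now - start < cycles)
        now = src->read_cycles();

    return now - start;
}

/* Fake counter advancing 3 ticks per read, so the sketch runs anywhere. */
static uint64_t fake_ticks;

static uint64_t fake_read_cycles(void)
{
    fake_ticks += 3;
    return fake_ticks;
}
```

Because the delay is expressed in ticks of whatever source is polled,
the calibration and the delay use the same unit, and no per-loop cost
has to be assumed.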

2005-11-12 02:26:03

by Pallipadi, Venkatesh

Subject: RE: IO-APIC problem with 2.6.14-rt9



>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of john stultz
>Sent: Thursday, November 10, 2005 1:43 PM
>To: Ingo Molnar
>Cc: [email protected]; [email protected]; Thomas Gleixner
>Subject: Re: IO-APIC problem with 2.6.14-rt9
>
>On Thu, 2005-11-10 at 22:04 +0100, Ingo Molnar wrote:
>> * john stultz <[email protected]> wrote:
>>
>> > > > //#define ARCH_HAS_READ_CURRENT_TIMER 1
>> > > >
>> > > > to:
>> > > >
>> > > > #define ARCH_HAS_READ_CURRENT_TIMER 1
>> > > >
>> > > > ?
>> > >
>> > > It works !! Thanks Ingo for the immediate response
>> >
>> > Hrm. Could you post the value for BogoMIPS that you're getting now?
>> >
>> > My patches touch the __delay() code, since using the TSC based delay
>> > has just as many, if not more, problems as the loop based delay. So I
>> > want to be careful that my changes are not further causing problems.
>> >
>> > Ingo, did you comment out ARCH_HAS_READ_CURRENT_TIMER because of
>> > problems with the new calibration code?
>>
>> yes. traces show that the new calibration code results in a bogomips
>> value on Athlon64 CPUs that halves the timeout. I.e. udelay(100) now
>> takes 50 usecs (!). The calibration code seems to assume the number of
>> cycles == number of loops in __delay() - that is not valid.
>
>Yea, that makes sense, because the READ_CURRENT_TIMER calibration is all
>TSC based and with my code we use the loop based delay (since the TSC
>based one can have a number of problems). So that doesn't mesh well when
>the loop/cycle values are not equivalent.
>
>That still leaves open the question why Dinakar is seeing issues w/ the
>loop based calibration, but I've got some similar hardware in my lab, so
>I can probably work that out.

The reason ARCH_HAS_READ_CURRENT_TIMER and related code was added is:
Due to SMIs happening during the calibration we were seeing calibration
returning very low values with previous calibrate_delay() algorithm.
With this low value, we don't wait long enough during the timer
initialization and we expect some number of timer interrupts and we
panic() when we don't see that many interrupts.

All the above was with TSC based delay.
I think we can keep calibrate_delay() do the calibration using
READ_CURRENT_TIMER (giving TSC per jiffy) and then convert it to some
number of TSCs per loop and use it in loop based delay.

Thanks,
Venki
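
The conversion proposed above could be sketched as follows, assuming a
short probe that measures how many TSC ticks a known number of loop
iterations takes. The function and parameter names are hypothetical and
this is not the kernel's calibrate_delay():

```c
#include <stdint.h>

/*
 * Hypothetical conversion from a TSC-based calibration to a loop count.
 *
 * tsc_per_jiffy: TSC ticks per jiffy from the timer-based calibration
 * sample_loops:  iterations of the delay loop run in a short probe
 * sample_ticks:  TSC ticks that probe took
 *
 * Returns loop iterations per jiffy, or 0 if the probe is unusable.
 * Assumes tsc_per_jiffy * sample_loops fits in 64 bits.
 */
static uint64_t loops_per_jiffy_from_tsc(uint64_t tsc_per_jiffy,
                                         uint64_t sample_loops,
                                         uint64_t sample_ticks)
{
    if (sample_ticks == 0)
        return 0;
    return tsc_per_jiffy * sample_loops / sample_ticks;
}
```

For example, 2500000 ticks per jiffy and a probe of 1000 iterations
costing 2000 ticks gives 1250000 loops per jiffy. The probe itself can
be disturbed by SMIs in the same way as the original calibration, so a
real implementation would likely need to repeat it.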

2005-11-12 02:34:59

by john stultz

Subject: RE: IO-APIC problem with 2.6.14-rt9

On Fri, 2005-11-11 at 18:25 -0800, Pallipadi, Venkatesh wrote:
> I think we can keep calibrate_delay() do the calibration using
> READ_CURRENT_TIMER (giving TSC per jiffy) and then convert it to some
> number of TSCs per loop and use it in loop based delay.

That sounds reasonable. However, for now, I'll just keep the old code.
In my patch I pulled the loop based and TSC based delay functions out of
the timer_opts and pick one at boot depending on if the TSC is around or
not.

I'm testing it now and will be sending out a new patchset before I head
home tonight.

thanks
-john
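
Picking the delay routine once at boot, as described above, is commonly
done through a function pointer. A rough sketch, with all names
hypothetical and the TSC variant left as a placeholder since this is
not the actual patch:

```c
#include <stdbool.h>

typedef void (*delay_fn)(unsigned long count);

/* Loop-based delay: spin for `count` iterations. */
static void delay_loop(unsigned long count)
{
    volatile unsigned long i;

    for (i = 0; i < count; i++)
        ;
}

/* Placeholder for a TSC-based delay; real code would poll the TSC. */
static void delay_tsc(unsigned long count)
{
    (void)count;
}

/* Choose the delay routine once, depending on whether a TSC is usable. */
static delay_fn delay_select(bool have_tsc)
{
    return have_tsc ? delay_tsc : delay_loop;
}
```

The caller would store the result of delay_select() at boot and route
every later delay through it, so the calibration can be matched to
whichever routine was chosen.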