LinuxLists.cc - Failing back to INSANE timesource :) Time stopped today.

2004-04-08 09:21:56

Subject: Failing back to INSANE timesource :) Time stopped today.

Hi,

I'm running Linux 2.6.5 on an IBM xSeries 305 with an Intel P4 2.8 GHz.

Something is very wrong. These are the last
messages in dmesg:

------
set_rtc_mmss: can't update from 52 to 0
set_rtc_mmss: can't update from 53 to 1
set_rtc_mmss: can't update from 54 to 2
set_rtc_mmss: can't update from 55 to 3
set_rtc_mmss: can't update from 56 to 4
set_rtc_mmss: can't update from 57 to 5
set_rtc_mmss: can't update from 58 to 6
Losing too many ticks!
TSC cannot be used as a timesource. <4>Possible reasons for this are:
You're running with Speedstep,
You don't have DMA enabled for your hard disk (see hdparm),
Incorrect TSC synchronization on an SMP system (see dmesg).
------
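
The "can't update from 52 to 0" lines can be read as a failed sanity check between the RTC minute and the system-clock minute. The following is a minimal sketch of that kind of check, written from memory of how the i386 set_rtc_mmss logic behaves; it is an assumption for illustration, not a quote of the kernel source, and the names rtc_minute_check, describe and base are hypothetical.

```python
def rtc_minute_check(nowtime, cmos_minutes):
    """Return (ok, real_minutes) for a minutes/seconds-only RTC update.

    nowtime      -- system time in seconds since the epoch
    cmos_minutes -- minute value currently held by the RTC (0-59)
    """
    real_minutes = nowtime // 60
    # An odd number of half hours apart is treated as a half-hour time zone.
    if ((abs(real_minutes - cmos_minutes) + 15) // 30) & 1:
        real_minutes += 30
    real_minutes %= 60
    # Only minutes and seconds are rewritten, so a gap of 30 minutes or
    # more cannot be corrected without touching the hour.
    ok = abs(real_minutes - cmos_minutes) < 30
    return ok, real_minutes


def describe(nowtime, cmos_minutes):
    ok, real_minutes = rtc_minute_check(nowtime, cmos_minutes)
    if ok:
        return "update to minute %d" % real_minutes
    return "set_rtc_mmss: can't update from %d to %d" % (cmos_minutes, real_minutes)


base = 60000 * 60  # hypothetical epoch time that falls exactly on an hour
print(describe(base, 52))            # system minute 0, RTC minute 52
print(describe(base + 6 * 60, 58))   # system minute 6, RTC minute 58
```

Under this reading, the RTC and the system clock disagree by 52 minutes as the check sees it (the comparison does not wrap around the hour), so every attempt is refused.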

The problem seems to be related to heavy load.
I experienced a similar problem yesterday. The machine hung completely
after that and I had to cut the power to reboot it. This time, however, it is
responsive and I can log on to it through ssh.

The problem is that the clock has stopped completely! I've never seen anything
like this before.

Local time is about 11 am here and date gives me:

[root@s151 root]# date
Thu Apr 8 03:51:21 CEST 2004

...10 s later, using my wristwatch, not sleep 10 ;)

[root@s151 root]# date
Thu Apr 8 03:51:21 CEST 2004

Any ideas, anyone? I'd really like to know why it is behaving this way.

Other useful(?) info from dmesg:
---
Detected 2800.731 MHz processor.
Using tsc for high-res timesource
...
init IO_APIC IRQs
IO-APIC (apicid-pin) 1-0, 2-0, 2-1, 2-2, 2-3, 2-4, 2-5, 2-6, 2-7, 2-8,
2-9, 2-10, 2-11, 2-12, 2-13, 2-14, 2-15, 3-0, 3-1, 3-2, 3-3, 3-4, 3-5,
3-6, 3-7, 3-8, 3-9, 3-10, 3-11, 3-12, 3-13, 3-14, 3-15 not connected.
..TIMER: vector=0x31 pin1=2 pin2=-1
..MP-BIOS bug: 8254 timer not connected to IO-APIC
...trying to set up timer (IRQ0) through the 8259A ... failed.
...trying to set up timer as Virtual Wire IRQ... works.
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 2798.0750 MHz.
...
----

Oh, yeah, I'm running NTP on the machine; however, the client seems to be
sleeping until its next poll.
I'm also using this machine as an NTP server, and one of its clients has the
following to say:

[root@s131 tmp]# ntpdc
ntpdc> peers
     remote           local      st poll reach  delay   offset    disp
=======================================================================
*time2           192.168.4.131    2 1024  377 0.00067  0.004991 0.01482
=time1           192.168.4.131    2 1024  377 0.00085  0.000558 0.01485
=s151            192.168.4.131    3 1024  377 0.00021 -25494.16 0.01482

Here s151 is the machine with the frozen clock.
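
As a rough cross-check, the ntpdc offset can be compared with the frozen date output. This sketch assumes the negative offset means s151 is behind the client s131, and that the ntpdc reading was taken at roughly the same moment as the date command; both are assumptions, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

frozen = datetime(2004, 4, 8, 3, 51, 21)  # what `date` printed on s151
offset_seconds = 25494.16                  # magnitude of the ntpdc offset for s151

lag = timedelta(seconds=offset_seconds)
estimated_real_time = frozen + lag

print("s151 is behind by", lag)
print("estimated real time:", estimated_real_time.strftime("%H:%M:%S"))
```

The result lands a few minutes before 11:00, which agrees with the "about 11 am" local time mentioned earlier.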

Cheers,

Niclas

2004-04-08 12:21:08

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

Running several date commands (about 1 s apart) shows that something is
still going on:

[root@s151 root]# date "+%X %N"
03:51:21 AM 262889000
[root@s151 root]# date "+%X %N"
03:51:21 AM 263185000
[root@s151 root]# date "+%X %N"
03:51:21 AM 262383000
[root@s151 root]# date "+%X %N"
03:51:21 AM 263328000
[root@s151 root]# date "+%X %N"
03:51:21 AM 263237000
[root@s151 root]# date "+%X %N"
03:51:21 AM 263049000
[root@s151 root]# date "+%X %N"
03:51:21 AM 262941000

... And a couple of hours later:

[root@s151 root]# date "+%X %N"
03:51:21 AM 348003000
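
The readings above can be summarised numerically. This is a small sketch over the values pasted in this message; the interpretation (a frozen tick-driven base plus a sub-tick offset that still moves) is a guess, not something the output proves.

```python
# Nanosecond fields from the seven consecutive `date "+%X %N"` calls.
samples_ns = [262889000, 263185000, 262383000, 263328000,
              263237000, 263049000, 262941000]
later_ns = 348003000  # reading taken a couple of hours later

# Consecutive pairs where the clock appears to step backwards.
backward_steps = [(a, b) for a, b in zip(samples_ns, samples_ns[1:]) if b < a]
spread_ns = max(samples_ns) - min(samples_ns)
drift_ns = later_ns - max(samples_ns)

print("backward steps:", len(backward_steps))
print("spread within the burst: %.3f ms" % (spread_ns / 1e6))
print("advance after a couple of hours: %.3f ms" % (drift_ns / 1e6))
```

So the seconds field never moves, the sub-second part wobbles within about one millisecond (sometimes backwards), and hours of real time add well under a tenth of a second.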

Regards,

Niclas

2004-04-08 22:58:48

by john stultz

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Thu, 2004-04-08 at 02:21, Niclas Gustafsson wrote:
> Hi,
>
> I'm running Linux 2.6.5 on an IBM xSeries 305 with an Intel P4 2.8 GHz.
>
> Something is very wrong. These are the last
> messages in dmesg:
>
> ------
> set_rtc_mmss: can't update from 52 to 0
> set_rtc_mmss: can't update from 53 to 1
> set_rtc_mmss: can't update from 54 to 2
> set_rtc_mmss: can't update from 55 to 3
> set_rtc_mmss: can't update from 56 to 4
> set_rtc_mmss: can't update from 57 to 5
> set_rtc_mmss: can't update from 58 to 6
> Losing too many ticks!
> TSC cannot be used as a timesource. <4>Possible reasons for this are:
> You're running with Speedstep,
> You don't have DMA enabled for your hard disk (see hdparm),
> Incorrect TSC synchronization on an SMP system (see dmesg).
> ------
>
> The problem seems to be related to heavy load.
> I experienced a similar problem yesterday. The machine hung completely
> after that and I had to cut the power to reboot it. This time, however, it is
> responsive and I can log on to it through ssh.
>
> The problem is that the clock has stopped completely! I've never seen anything
> like this before.
>
> Local time is about 11 am here and date gives me:
>
> [root@s151 root]# date
> Thu Apr 8 03:51:21 CEST 2004
>
> ...10 s later, using my wristwatch, not sleep 10 ;)
>
> [root@s151 root]# date
> Thu Apr 8 03:51:21 CEST 2004
>
>
> Any ideas, anyone? I'd really like to know why it is behaving this way.

Huh. Very very odd.

Does /proc/interrupts show timer ticks increasing?
Does setting the date change anything?

Would you mind sending me your complete dmesg?

I'll look into reproducing the error here if you can give me a better
description of what triggers it and how frequently you see the problem.

thanks
-john

2004-04-13 08:26:21

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

Hi,

Sorry for the late reply; I was away for the Easter holidays.

I'm attaching both the dmesg output (dmesg.265-IBM305_2) and the
unpacked /proc/config.gz (config.gz.IBM_265_2).

However, the last lines of dmesg,
---
set_rtc_mmss: can't update from 52 to 0
set_rtc_mmss: can't update from 53 to 1
set_rtc_mmss: can't update from 54 to 2
set_rtc_mmss: can't update from 55 to 3
set_rtc_mmss: can't update from 56 to 4
set_rtc_mmss: can't update from 57 to 5
set_rtc_mmss: can't update from 58 to 6
Losing too many ticks!
TSC cannot be used as a timesource. <4>Possible reasons for this are:
You're running with Speedstep,
You don't have DMA enabled for your hard disk (see hdparm),
Incorrect TSC synchronization on an SMP system (see dmesg).
---
were not synced to disk before I rebooted the system, so they are missing
from the attached file.

I've compiled a new kernel that seems to be working better although I
need to run some more tests to be sure.

Regards,

Niclas Gustafsson

Attachments:

dmesg.265-IBM305_2 (15.05 kB)
config.gz.IBM-265_2 (5.10 kB)

2004-04-14 08:56:51

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

Hi again,

It has now happened again with my newly compiled kernel. I've attached the
config file and dmesg for this kernel.

Here is /proc/interrupts sampled twice, about 10 s apart, after the "stop":

[root@s151 root]# more /proc/interrupts
CPU0
0: 66413955 local-APIC-edge timer
2: 0 XT-PIC cascade
9: 1 IO-APIC-level acpi
10: 0 IO-APIC-level ohci_hcd
14: 24 IO-APIC-edge ide0
20: 31244 IO-APIC-level aic7xxx
22: 19641795 IO-APIC-level eth0
NMI: 0
LOC: 67355837
ERR: 0
MIS: 0
[root@s151 root]# more /proc/interrupts
CPU0
0: 66413955 local-APIC-edge timer
2: 0 XT-PIC cascade
9: 1 IO-APIC-level acpi
10: 0 IO-APIC-level ohci_hcd
14: 24 IO-APIC-edge ide0
20: 31244 IO-APIC-level aic7xxx
22: 19652139 IO-APIC-level eth0
NMI: 0
LOC: 67379568
ERR: 0
MIS: 0

And some 10-15 min later:

[root@s151 root]# cat /proc/interrupts
CPU0
0: 66413964 local-APIC-edge timer
2: 0 XT-PIC cascade
9: 1 IO-APIC-level acpi
10: 0 IO-APIC-level ohci_hcd
14: 24 IO-APIC-edge ide0
20: 31245 IO-APIC-level aic7xxx
22: 19754446 IO-APIC-level eth0
NMI: 0
LOC: 68366976
ERR: 0
MIS: 0
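
The differences between the three snapshots are easier to see as deltas. A minimal sketch follows, using only the timer, eth0 and LOC rows pasted above; the helper names counters and delta are illustrative, and HZ = 1000 is an assumption about this kernel's tick rate.

```python
import re


def counters(snapshot):
    """Map each label ("0", "22", "LOC", ...) to its CPU0 count."""
    result = {}
    for line in snapshot.splitlines():
        match = re.match(r"\s*(\w+):\s+(\d+)", line)
        if match:
            result[match.group(1)] = int(match.group(2))
    return result


def delta(before, after):
    """Per-label increase between two snapshots."""
    return {label: after[label] - before[label] for label in before}


first = counters("""
0: 66413955 local-APIC-edge timer
22: 19641795 IO-APIC-level eth0
LOC: 67355837
""")
second = counters("""
0: 66413955 local-APIC-edge timer
22: 19652139 IO-APIC-level eth0
LOC: 67379568
""")
third = counters("""
0: 66413964 local-APIC-edge timer
22: 19754446 IO-APIC-level eth0
LOC: 68366976
""")

HZ = 1000  # assumed tick rate; not confirmed anywhere in this thread
for name, d in (("first -> second", delta(first, second)),
                ("second -> third", delta(second, third))):
    print(name, d, "LOC/HZ = %.1f s" % (d["LOC"] / HZ))
```

IRQ 0 does not advance at all between the first two samples and gains only 9 counts by the third, while eth0 and the local APIC timer (LOC) keep counting normally.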

Last vmstat output (with /proc/loadavg intermixed):

(Rows like 3.48 1.70 2.12 2/114 13232 are loadavg; the others are
vmstat rows.)

   procs                      memory      swap          io      system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo    in    cs  us  sy  id
 1  0  0      0 686752  48412  59104    0    0     0     0  2112   898  19   1  80
3.48 1.70 2.12 2/114 13232
 0  0  0      0 686832  48412  59104    0    0     0     0  2093   688  12   1  87
 0  0  0      0 686768  48416  59104    0    0     0   140  1623   998  15   1  84
3.20 1.67 2.11 2/114 13235
 0  0  0      0 686848  48416  59104    0    0     0     0  2128   960  16   1  83
3.20 1.67 2.11 2/114 13238
 0  0  0      0 686912  48416  59104    0    0     0     0  1597   809  11   1  88
 0  0  0      0 686880  48420  59104    0    0     0    22  2164   959  13   1  86
2.94 1.64 2.09 2/114 13241
 0  0  0      0 686816  48420  59104    0    0     0     0  2089   748  14   1  85
2.94 1.64 2.09 2/114 13244
 0  0  0      0 686640  48420  59104    0    0     0     0  2465  1170  20  21  59
   procs                      memory      swap          io      system         cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo    in    cs  us  sy  id
 1  0  0      0 649904  48420  59104    0    0     0    24 87091 90759  29  66   5
2.87 1.65 2.09 3/114 13247

Note the extreme increase in interrupts and context switches on the last
line of vmstat output. Whether this is a true picture of what happened I
cannot say; it may just be an artifact of the timing problem.

I do, however, see an increase in network activity just before the stop: the
system goes from 2 Mbps of output to about 55 Mbps at the last reading. (This
was also read on this machine, so I don't know about its validity.)

cpuinfo, in case it is useful:

[root@s151 root]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.80GHz
stepping : 9
cpu MHz : 2800.731
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid
bogomips : 5505.02

Where can I see what the system is currently using as a timing source
(TSC/HPET/PIT etc.)?

Cheers,

Niclas

Attachments:

dmesg.265-IBM305_3 (15.04 kB)
config.IBM-265_3 (18.21 kB)

2004-04-14 18:36:59

by john stultz

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Wed, 2004-04-14 at 01:54, Niclas Gustafsson wrote:
> It has now happened again with my newly compiled kernel. I've attached the
> config file and dmesg for this kernel.
>
> Here is /proc/interrupts sampled twice, about 10 s apart, after the "stop":
>
> [root@s151 root]# more /proc/interrupts
> CPU0
> 0: 66413955 local-APIC-edge timer
> 2: 0 XT-PIC cascade
> 9: 1 IO-APIC-level acpi
> 10: 0 IO-APIC-level ohci_hcd
> 14: 24 IO-APIC-edge ide0
> 20: 31244 IO-APIC-level aic7xxx
> 22: 19652139 IO-APIC-level eth0
> NMI: 0
> LOC: 67379568
> ERR: 0
> MIS: 0
>
> And some 10-15 min later:
>
> [root@s151 root]# cat /proc/interrupts
> CPU0
> 0: 66413964 local-APIC-edge timer
> 2: 0 XT-PIC cascade
> 9: 1 IO-APIC-level acpi
> 10: 0 IO-APIC-level ohci_hcd
> 14: 24 IO-APIC-edge ide0
> 20: 31245 IO-APIC-level aic7xxx
> 22: 19754446 IO-APIC-level eth0
> NMI: 0
> LOC: 68366976
> ERR: 0
> MIS: 0

Wow, clearly something is going wrong with interrupt delivery. You only
received 9 timer interrupts in 10-15 minutes!

> Note the extreme increase in interrupts and context switches on the last
> line of vmstat output. Whether this is a true picture of what happened I
> cannot say; it may just be an artifact of the timing problem.
>
> I do, however, see an increase in network activity just before the stop: the
> system goes from 2 Mbps of output to about 55 Mbps at the last reading. (This
> was also read on this machine, so I don't know about its validity.)

Yea, those values are both junk as the X/sec ratio is totally skewed.
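
To illustrate why a stalled clock skews every per-second figure, here is a toy calculation. All numbers are hypothetical, and it makes no claim about how vmstat measures its interval internally; it only shows what happens when events from a long real interval are divided by a shorter believed interval.

```python
def reported_rate(events, believed_seconds):
    """Rate a tool would print if it trusts its own idea of elapsed time."""
    return events / believed_seconds


true_rate = 2000        # hypothetical steady interrupts per second
real_seconds = 40       # hypothetical real length of one sampling interval
believed_seconds = 1    # what the tool thinks elapsed while the clock stalled

events = true_rate * real_seconds
print(reported_rate(events, believed_seconds))
```

A steady 2000 interrupts per second would be reported as 80000 per second in this made-up case, which is the same order of magnitude as the jump in the last vmstat row.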

> Where can I see what the system is currently using as a timing source
> (TSC/HPET/PIT etc.)?

Note the "Using tsc for high-res timesource" in your dmesg.
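
That line can also be picked out mechanically. This is a minimal sketch, assuming the dmesg text has been captured into a string; it matches only the two messages quoted in this thread, and the function name timesource_status is illustrative.

```python
import re


def timesource_status(dmesg_text):
    """Return (selected_timesource, tsc_rejected) from captured dmesg text."""
    selected = None
    tsc_rejected = False
    for line in dmesg_text.splitlines():
        match = re.search(r"Using (\S+) for high-res timesource", line)
        if match:
            selected = match.group(1)
        if "TSC cannot be used as a timesource" in line:
            tsc_rejected = True
    return selected, tsc_rejected


sample = """Detected 2800.731 MHz processor.
Using tsc for high-res timesource
Losing too many ticks!
TSC cannot be used as a timesource. <4>Possible reasons for this are:
"""
print(timesource_status(sample))
```

It reports which timesource was chosen at boot and whether the TSC was later rejected; it cannot tell which timesource took over afterwards, because no such message appears in the logs posted here.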

I'm working now to reproduce this w/ a 2G system here in our lab, and
just for completeness, could you also send me your BIOS revision number?

thanks
-john

2004-04-15 08:46:32

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Wed, 2004-04-14 at 20:36, john stultz wrote:

> > Where can I see what the system is currently using as a timing source
> > (TSC/HPET/PIT etc.)?
>
> Note the "Using tsc for high-res timesource" in your dmesg.
>
Yes, I noticed that. However, once it stops using the TSC, is there a way to
see which time source it has fallen back to? Or is the fallback always the
same, and therefore not exposed in procfs? I have only briefly looked at the
kernel source for this; I'll have a closer look today if I can find the
time.

> I'm working now to reproduce this w/ a 2G system here in our lab, and
> just for completeness, could you also send me your BIOS revision number?
>
Sure, here is some info from the BIOS:

Machine Type: 867373X
Flash EEPROM Revision Level: PLE161AUS
System Board Identifier: NA60B7Y0S3Q
System Serial Number: KDZZ6FC
Bios Date: 09/10/03

Some more info from Advanced/Cpu Frequency:
Bus: 133 MHz
Cpu Multiplier: 18 X
Processor Speed: 2.8 GHz
Single processor MP Table: Enabled
MP Table Version: 1.4
Hyper-Threading technology: ----------

Just a thought: how much of a performance increase, if any, can one gain
by disabling the Single processor MP Table option? The help text for that
option says, if I remember correctly, that a UP system can benefit from
disabling it.

Cheers,

Niclas

2004-04-15 14:47:22

by Maciej W. Rozycki

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Wed, 14 Apr 2004, Niclas Gustafsson wrote:

> Here is /proc/interrupts sampled twice, about 10 s apart, after the "stop":
>
> [root@s151 root]# more /proc/interrupts
> CPU0
> 0: 66413955 local-APIC-edge timer
[...]
> LOC: 67355837
> ERR: 0
> MIS: 0
> [root@s151 root]# more /proc/interrupts
> CPU0
> 0: 66413955 local-APIC-edge timer
[...]
> LOC: 67379568
> ERR: 0
> MIS: 0

This may be because buggy SMM firmware messes with the 8259A (configured
for a transparent mode -- yes that rare "local-APIC-edge" mode is tricky
;-) ) insanely. You've written this is an IBM box previously -- this
would be no surprise. The following patch should help -- I think it's
already included in the -mm series.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

patch-2.6.5-timer_ack-2
--- linux.macro/arch/i386/kernel/io_apic.c	Wed Apr 14 03:57:24 2004
+++ linux/arch/i386/kernel/io_apic.c	Thu Apr 15 14:41:10 2004
@@ -2152,6 +2152,10 @@ static inline void check_timer(void)
 {
 	int pin1, pin2;
 	int vector;
+	unsigned int ver;
+
+	ver = apic_read(APIC_LVR);
+	ver = GET_APIC_VERSION(ver);
 
 	/*
 	 * get/set the timer IRQ vector:
@@ -2165,11 +2169,15 @@ static inline void check_timer(void)
 	 * mode for the 8259A whenever interrupts are routed
 	 * through I/O APICs. Also IRQ0 has to be enabled in
 	 * the 8259A which implies the virtual wire has to be
-	 * disabled in the local APIC.
+	 * disabled in the local APIC. Finally timer interrupts
+	 * need to be acknowledged manually in the 8259A for
+	 * do_slow_timeoffset() and for the i82489DX when using
+	 * the NMI watchdog.
 	 */
 	apic_write_around(APIC_LVT0, APIC_LVT_MASKED | APIC_DM_EXTINT);
 	init_8259A(1);
-	timer_ack = 1;
+	timer_ack = !cpu_has_tsc;
+	timer_ack |= nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver);
 	enable_8259A_irq(0);
 
 	pin1 = find_isa_irq_pin(0, mp_INT);
@@ -2187,7 +2195,8 @@ static inline void check_timer(void)
 				disable_8259A_irq(0);
 				setup_nmi();
 				enable_8259A_irq(0);
-				check_nmi_watchdog();
+				if (check_nmi_watchdog() < 0)
+					timer_ack = !cpu_has_tsc;
 			}
 			return;
 		}
@@ -2210,7 +2219,8 @@ static inline void check_timer(void)
 				add_pin_to_irq(0, 0, pin2);
 			if (nmi_watchdog == NMI_IO_APIC) {
 				setup_nmi();
-				check_nmi_watchdog();
+				if (check_nmi_watchdog() < 0)
+					timer_ack = !cpu_has_tsc;
 			}
 			return;
 		}
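
The effect of the patch on the timer_ack flag can be modelled outside the kernel. This is a sketch only: it mirrors the three assignments in the diff above, the NMI_* values are illustrative stand-ins rather than values taken from kernel headers, and the function and argument names are hypothetical.

```python
# Illustrative stand-ins for the kernel's nmi_watchdog modes (values assumed).
NMI_NONE, NMI_IO_APIC, NMI_LOCAL_APIC = 0, 1, 2


def timer_ack_before_patch():
    # Old code: "timer_ack = 1;" unconditionally.
    return True


def timer_ack_after_patch(cpu_has_tsc, nmi_watchdog, apic_integrated,
                          nmi_check_failed=False):
    # timer_ack = !cpu_has_tsc;
    ack = not cpu_has_tsc
    # timer_ack |= nmi_watchdog == NMI_IO_APIC && !APIC_INTEGRATED(ver);
    ack = ack or (nmi_watchdog == NMI_IO_APIC and not apic_integrated)
    # if (check_nmi_watchdog() < 0) timer_ack = !cpu_has_tsc;
    if nmi_watchdog == NMI_IO_APIC and nmi_check_failed:
        ack = not cpu_has_tsc
    return ack


# A CPU with a TSC and an integrated local APIC, NMI watchdog assumed off.
print(timer_ack_before_patch())
print(timer_ack_after_patch(cpu_has_tsc=True, nmi_watchdog=NMI_NONE,
                            apic_integrated=True))
```

For a processor that reports the tsc flag and has an integrated APIC, the flag goes from always set to clear, so under this model the timer interrupt path would no longer acknowledge manually in the 8259A.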

2004-04-15 16:58:08

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

Hello and thanks,

I've compiled and deployed a kernel with your patch.
I'm about to start some more tests on the machine. It will be
interesting to see how it works out; I'll let you know.

Cheers,

Niclas

2004-04-19 16:44:58

by john stultz

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Thu, 2004-04-15 at 07:47, Maciej W. Rozycki wrote:
> This may be because buggy SMM firmware messes with the 8259A (configured
> for a transparent mode -- yes that rare "local-APIC-edge" mode is tricky
> ;-) ) insanely. You've written this is an IBM box previously -- this
> would be no surprise. The following patch should help -- I think it's
> already included in the -mm series.

Just a FYI: I opened bugme bug #2544 to track this issue.
http://bugme.osdl.org/show_bug.cgi?id=2544

thanks
-john

2004-04-20 09:24:24

by Niclas Gustafsson

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

Hi again,

I've now been running the system since last week, about 6 days now with
sometimes quite high load, both in regard to CPU usage and network
traffic.
And it seems to be running just fine with the patch from Maciej.

I've got a couple of questions,

When was this bug introduced? Was it 2.6.1 ( or rather somewhere in
2.5)? Or was it already present in 2.4?

When will this patch be merged into the 2.6-tree? I don't have to
stress the impact of this problem on IBM servers as they are rendered
quite useless.

Which other IBM models are affected? Can I run 2.6.5 on my 345s or
335s? Do they use the same buggy SMM firmware?

Cheers,

Niclas

2004-04-20 12:41:08

by Maciej W. Rozycki

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Tue, 20 Apr 2004, Niclas Gustafsson wrote:

> I've now been running the system since last week, about 6 days now with
> sometimes quite high load, both in regard to CPU usage and network
> traffic.
> And it seems to be running just fine with the patch from Maciej.

I'm glad to read this.

> I've got a couple of questions,
>
> When was this bug introduced? Was it 2.6.1 ( or rather somewhere in
> 2.5)? Or was it already present in 2.4?

Well, the bug was introduced by IBM in their firmware (SMM code).
;-) The patch only works around it. Functionally the changed code is the
same for your configuration.

If you are asking about the problematic code, then it has been there since
2.3.x, so it's in 2.4, too. It's a part of the NMI watchdog support,
though it's used for ordinary timer interrupts on certain systems as
well.

> When will this patch be merged into the 2.6-tree? I don't have to
> stress the impact of this problem on IBM servers as they are rendered
> quite useless.

Apparently there are problems with the workaround on certain AMD
Athlon-based systems. I suppose they need to be resolved somehow first.

> Which other IBM models are affected? Can I run 2.6.5 on my 345s or
> 335s? Do they use the same buggy SMM firmware?

Ask IBM. The reason is an incorrect handling of PIC (8259A) state
saving/restoration.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +

2004-04-20 22:10:01

by john stultz

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Tue, 2004-04-20 at 05:40, Maciej W. Rozycki wrote:
> On Tue, 20 Apr 2004, Niclas Gustafsson wrote:
>
> > I've now been running the system since last week, about 6 days now with
> > sometimes quite high load, both in regard to CPU usage and network
> > traffic.
> > And it seems to be running just fine with the patch from Maciej.
>
> I'm glad to read this.

It appears to be working here in our labs as well.

> > I've got a couple of questions,
> >
> > When was this bug introduced? Was it 2.6.1 ( or rather somewhere in
> > 2.5)? Or was it already present in 2.4?
>
> Well, the bug was introduced by IBM in their firmware (SMM code).
> ;-) The patch only works around it. Functionally the changed code is the
> same for your configuration.
>
> If you are asking about the problematic code, then it has been there since
> 2.3.x, so it's in 2.4, too. It's a part of the NMI watchdog support,
> though it's used for ordinary timer interrupts on certain systems as
> well.

Are you saying that 2.4 will exhibit this problem as well, or that 2.4
already has an equivalent workaround?

> > When will this patch be merged into the 2.6-tree? I don't have to
> > stress the impact of this problem on IBM servers as they are rendered
> > quite useless.
>
> Apparently there are problems with the workaround on certain AMD
> Athlon-based systems. I suppose they need to be resolved somehow first.

Can you point me to any threads on this issue? I'd like to do what I can
to help get this workaround in.

> > Which other IBM models are affected? Can I run 2.6.5 on my 345s or
> > 335s? Do they use the same buggy SMM firmware?
>
> Ask IBM. The reason is an incorrect handling of PIC (8259A) state
> saving/restoration.

I'm following up w/ our hardware group about this issue.

Thanks so much for the help!
-john

2004-04-21 14:20:36

by Maciej W. Rozycki

Subject: Re: Failing back to INSANE timesource :) Time stopped today.

On Tue, 20 Apr 2004, john stultz wrote:

> > If you are asking about the problematic code, then it has been there since
> > 2.3.x, so it's in 2.4, too. It's a part of the NMI watchdog support,
> > though it's used for ordinary timer interrupts on certain systems as
> > well.
>
> Are you saying that 2.4 will exhibit this problem as well, or that 2.4
> already has an equivalent workaround?

The former.

> > Apparently there are problems with the workaround on certain AMD
> > Athlon-based systems. I suppose they need to be resolved somehow first.
>
> Can you point me to any threads on this issue? I'd like to do what I can
> to help get this workaround in.

Here's my reply to a report which seems not to have reached LKML archives
for some reason:
"http://www.uwsg.indiana.edu/hypermail/linux/kernel/0403.2/0384". The
important part of the original mail is quoted.

--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: [email protected], PGP key available +