2003-01-30 14:25:44

by Praveen Ray

Subject: timer interrupts on HP machines

We have a few HP machines (LPR NetServers and LT6000) which run 2.4.18 (from
RedHat 8.0). The problem is that sometimes the timer interrupts stop coming -
i.e. the (timer) counts in /proc/interrupts stop getting incremented! When
this happens, the date on the system falls behind, 'sleep' calls stop working
and basically the machine becomes unusable. Has anyone else encountered this
problem? Is it an HP issue?
Thanks.
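
The check described above (watching whether the timer count in /proc/interrupts keeps rising) can be sketched in Python. This is a minimal sketch, not something from the thread: the names `timer_count` and `timer_is_ticking` are made up for illustration, and it assumes the 2.4-era layout in which the timer is IRQ 0 and each CPU has one count column.

```python
def timer_count(interrupts_text):
    """Sum the per-CPU counts on the IRQ 0 (timer) line of /proc/interrupts text.

    Returns None if no IRQ 0 line is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            total = 0
            # The count columns come first; stop at the controller/device names.
            for field in fields[1:]:
                if not field.isdigit():
                    break
                total += int(field)
            return total
    return None


def timer_is_ticking(before_text, after_text):
    """Return True if the timer count rose between two snapshots."""
    before = timer_count(before_text)
    after = timer_count(after_text)
    if before is None or after is None:
        raise ValueError("no IRQ 0 line found in snapshot")
    return after > before
```

Take the two snapshots by hand (for example, `cat /proc/interrupts` twice, a few seconds apart by a wall clock) rather than with a sleep call, since sleep is one of the things reported as broken when the interrupts stop.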


2003-01-30 15:41:11

by Alan

Subject: Re: timer interrupts on HP machines

On Thu, 2003-01-30 at 14:34, Praveen Ray wrote:
> We have a few HP machines (LPR NetServers and LT6000) which run 2.4.18 (from
> RedHat 8.0). The problem is that sometimes the timer interrupts stop coming -
> i.e. the (timer) counts in /proc/interrupts stop getting incremented! When
> this happens, the date on the system falls behind, 'sleep' calls stop working
> and basically the machine becomes unusable. Has anyone else encountered this
> problem? Is it an HP issue?

That I don't know, but my first question, other than the usual "Have you
applied the errata kernels", is probably whether it's hitting some of the APIC
funnies older hw occasionally has. Are they stable running "noapic"?

2003-01-30 16:52:32

by Matt C

Subject: Re: timer interrupts on HP machines

Hi Praveen-

We have a few LT6000r servers as well, and have the same problem on all
2.4 kernels -- this happens when your MP spec is set to 1.1 in the BIOS.
Change it to 1.4 and you should be okay.

The other common problem on these guys is the CPU speed misdetect, which
causes the kernel to think your CPU is roughly 2x as fast as it really is.
The solution to that one is to unplug and replug the power cords (even a
power-off doesn't fix it, go figure).

Hope that helps.

-Matt

On Thu, 30 Jan 2003, Praveen Ray wrote:

> We have a few HP machines (LPR NetServers and LT6000) which run 2.4.18 (from
> RedHat 8.0). The problem is that sometimes the timer interrupts stop coming -
> i.e. the (timer) counts in /proc/interrupts stop getting incremented! When
> this happens, the date on the system falls behind, 'sleep' calls stop working
> and basically the machine becomes unusable. Has anyone else encountered this
> problem? Is it an HP issue?
> Thanks.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2003-02-03 13:14:10

by Nohez

Subject: Re: timer interrupts on HP machines


We have a similar problem with our HP servers. We have been facing this
problem for more than a year. We have reported this problem to HP support.

We have five HP Netserver LH6000 running k_smp-2.4.18-47 (SuSE 7.1).
We are sure that MP spec is v1.4 in the BIOS. But we have not
checked /proc/interrupts. Will check the next time this problem occurs.

Problem:
--------

System time behaves erratically but the servers do not hang. We noticed that
all time-related apps (sendmail, ping, top, cron etc.) stopped. We
noticed that time goes forward & backward by seconds only.

server: # date
Mon Feb 3 17:38:26 IST 2003
server: # date
Mon Feb 3 17:38:30 IST 2003
server: # date
Mon Feb 3 17:38:20 IST 2003
server: # date
Mon Feb 3 17:38:25 IST 2003
server: # date
Mon Feb 3 17:38:28 IST 2003
server: # date
Mon Feb 3 17:38:21 IST 2003

The above is just an example. We could not find any pattern.
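
One way to quantify jumps like these is to paste the `date` output into a small script. A minimal Python sketch, assuming the six-field `date` format shown above; `parse_date_output` and `backward_jumps` are hypothetical names chosen for illustration, and the time zone field is ignored.

```python
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_date_output(line):
    """Parse a line such as 'Mon Feb 3 17:38:26 IST 2003' into a datetime."""
    _weekday, month, day, clock, _zone, year = line.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(int(year), MONTHS[month], int(day), hour, minute, second)


def backward_jumps(lines):
    """Return (previous, current, delta_seconds) for each step back in time."""
    stamps = [parse_date_output(line) for line in lines]
    jumps = []
    for previous, current in zip(stamps, stamps[1:]):
        delta = (current - previous).total_seconds()
        if delta < 0:
            jumps.append((previous, current, delta))
    return jumps
```

Fed the six samples above, it reports two backward steps: 17:38:30 to 17:38:20 and 17:38:28 to 17:38:21.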

We could not access the server remotely, but we could log in from the console.
All programs using system time failed - like sendmail, top, cron etc.

We could umount filesystems, but the server had to be forcibly shut down
(power reset). After system reboot everything was ok.

We have xntpd daemon running on all our servers.

Four servers are file/print servers (samba/nfs/cups) and one is a database
server. The above problem has NEVER occurred on the database server.
The only differences between the file servers and the database server are:
1. DB server has an external HP Ultrium & HP DDS4 tape drive
connected to an Adaptec 29160N Ultra160 SCSI adapter.
2. DB server has an Intel PRO/1000 Network (gigabit ethernet) card

Hardware details :
----------------
HP Netserver LH6000
6 * 550MHz Xeon CPUs
1GB RAM
Integrated Megaraid Ultra-2 SCSI Raid Controller
BIOS MP spec is v1.4

Software:
---------
SuSE Linux 7.1
kernel 2.4.18 (k_smp-2.4.18-47)
glibc-2.2-7
samba-2.0.10-0
xntp-4.0.99f-6
"Unsynced TSC support" is enabled in default SuSE kernel k_smp-2.4.18-47
Kernel debugging is set


Nohez.



2003-02-03 14:54:29

by Valdis Klētnieks

Subject: Re: timer interrupts on HP machines

On Mon, 03 Feb 2003 18:52:14 +0530, Nohez said:

> server: # date
> Mon Feb 3 17:38:30 IST 2003
> server: # date
> Mon Feb 3 17:38:20 IST 2003

> We have xntpd daemon running on all our servers.

Any xntpd messages in the syslog that correlate with these events? I've
seen similar behavior on my laptop (although the clock ran very slow and
was getting slammed 10-15 seconds forward by xntpd - was a missing interrupt
problem). I've seen oddness with corrupted /etc/ntp/drift files as well...
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech



2003-02-04 02:44:38

by Matt C

Subject: Re: timer interrupts on HP machines

Hi Nohez:

That's interesting. We've traced almost all of the times when this happens
back to an incorrect MP spec. I know it sounds goofy, but have you tried
unplugging AC power from the machine for ~5 minutes or so? We've seen that
make a difference in the Netservers. Also make sure you're up-to-date with
the firmware (latest is 4.06.43 or so?). Outside of that, I don't have any
other suggestions besides calling HP and having them replace the system
board.

-Matt


2003-02-04 04:25:01

by Nohez

Subject: Re: timer interrupts on HP machines


On Mon, 3 Feb 2003, Valdis Kletnieks wrote:

> On Mon, 03 Feb 2003 18:52:14 +0530, Nohez said:
>
> > server: # date
> > Mon Feb 3 17:38:30 IST 2003
> > server: # date
> > Mon Feb 3 17:38:20 IST 2003
>
> > We have xntpd daemon running on all our servers.
>
> Any xntpd messages in the syslog that correlate with these events? I've
> seen similar behavior on my laptop (although the clock ran very slow and
> was getting slammed 10-15 seconds forward by xntpd - was a missing interrupt
> problem). I've seen oddness with corrupted /etc/ntp/drift files as well...

I have attached ntp log entries for the relevant time period.
The server was rebooted at approx 10:15 and the server time stopped
at 4:57. Before xntpd we used to sync time using "netdate" once
every hour. The problem occurred even while using netdate.

/var/log/ntp:
-------------

7 Jan 00:46:11 xntpd[477]: offset -0.000146 sec freq 22.645 ppm error 0.000059 poll 9
7 Jan 01:46:47 xntpd[477]: offset -0.000174 sec freq 22.636 ppm error 0.000059 poll 10
7 Jan 02:47:23 xntpd[477]: offset 0.001350 sec freq 22.634 ppm error 0.000566 poll 10
7 Jan 03:47:59 xntpd[477]: offset -0.000288 sec freq 22.631 ppm error 0.000368 poll 10
7 Jan 04:48:35 xntpd[477]: offset -0.000312 sec freq 22.627 ppm error 0.000208 poll 10
7 Jan 10:18:52 xntpd[476]: system event 'event_restart' (0x01) status \
'sync_alarm, sync_unspec, 1 event, event_unspec'
7 Jan 10:19:08 xntpd[476]: peer LOCAL(0) event 'event_reach' (0x84) \
status 'unreach, conf, 1 event, event_reach' \
(0x801
7 Jan 10:19:09 xntpd[476]: peer xxx.x.x.xx event 'event_reach' (0x84) \
status 'unreach, conf, 1 event, event_reach' (0x8
7 Jan 10:22:21 xntpd[476]: system event 'event_peer/strat_chg' (0x04) \
status 'sync_alarm, sync_ntp, 2 events, event_res
7 Jan 10:22:21 xntpd[476]: system event 'event_sync_chg' (0x03) \
status 'leap_none, sync_ntp, 3 events, \
event_peer/strat
7 Jan 10:22:21 xntpd[476]: system event 'event_peer/strat_chg' (0x04) \
status 'leap_none, sync_ntp, 4 events, event_sync
7 Jan 11:19:28 xntpd[476]: offset 0.000093 sec freq 22.940 ppm error 0.000051 poll 7
7 Jan 12:20:04 xntpd[476]: offset 0.000134 sec freq 23.146 ppm error 0.000123 poll 6
7 Jan 13:20:40 xntpd[476]: offset -0.000233 sec freq 23.147 ppm error 0.000111 poll 10
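
The hole in the log above (hourly entries stop after 04:48:35 and resume at 10:18:52 after the reboot) can be found mechanically. A minimal Python sketch, assuming each entry starts with "day month hh:mm:ss" as in the excerpt; `log_gaps`, the two-hour threshold, and the default year are illustrative assumptions, since the log lines themselves carry no year.

```python
import re
from datetime import datetime, timedelta

# Matches the leading timestamp of an entry, e.g. "7 Jan 04:48:35 xntpd[477]: ..."
STAMP = re.compile(r"^\s*(\d{1,2}) (\w{3}) (\d{2}):(\d{2}):(\d{2}) ")

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def log_gaps(log_text, min_gap=timedelta(hours=2), year=2003):
    """Return (last_before, first_after) pairs where entries are >= min_gap apart.

    The year is an assumed value; continuation lines without a leading
    timestamp are skipped.
    """
    stamps = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if not match or match.group(2) not in MONTHS:
            continue
        day, month, hour, minute, second = match.groups()
        stamps.append(datetime(year, MONTHS[month], int(day),
                               int(hour), int(minute), int(second)))
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a >= min_gap]
```

With the default threshold, the regular one-hour spacing of the statistics lines is ignored and only the silent period before the reboot is reported.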


2003-02-04 04:50:25

by Nohez

Subject: Re: timer interrupts on HP machines


Hi Matt,

We have had the MP spec set to v1.4 for more than a year and the systems have
been unplugged for more than 1 hr for system maintenance many times. The
BIOS firmware is 4.06.43. We suspect the kernel is triggering a hardware bug
as we see this only on HP Netservers. We have other unbranded Intel
SMP machines running the same kernel, distro & same services without this
problem.

Nohez.

On Mon, 3 Feb 2003, Matt C wrote:

> Hi Nohez:
>
> That's interesting. We've traced almost all of the times when this happens
> back to an incorrect MP spec. I know it sounds goofy, but have you tried
> unplugging AC power from the machine for ~5 minutes or so? We've seen that
> make a difference in the Netservers. Also make sure you're up-to-date with
> the firmware (latest is 4.06.43 or so?). Outside of that, I don't have any
> other suggestions besides calling HP and having them replace the system
> board.
>


2003-02-04 19:39:01

by Matt C

Subject: Re: timer interrupts on HP machines

Yup, it's definitely the HP hardware, since we also only see this problem
on the NetServers. I haven't worked with the LH series, though, just the
LT series. We've brought up issues like this with their support
organization, with the inevitable response "unable to reproduce problem"
and a closed ticket. We've given up since they ditched the NetServer line
in favor of the Proliants anyways.

Good Luck.

-Matt
