2005-11-16 14:04:25

by Eyal Lebedinsky

Subject: hware clock left bad after a system failure

I recently had two cases where my machine locked up and needed
a hard reset. The last time magic SysRq did not respond at all.

In these cases I found that the hware clock was set incorrectly
and the machine comes up with a bad date. It seems that the clock
is ahead by as much as my TZ (+10 in my case). I may be able
to understand if it was set 10h behind (kernel set it to UTC)
but this is the other way. The machine comes up with UTC+20.

Now this is just trouble. The machine comes up and spends 15m
fscking. I then reset the clock and reboot and it does the whole
fsck again because it thinks the fs was not checked for eons. It
does not understand time in the future.

So the points are

- why is the clock mangled in this way?
- should e2fsck not allow future check time (maybe within some
limits)?
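
A minimal sketch of what "allow future check time within some limits"
could mean. This is not e2fsck's actual logic; the function name, the
180-day interval and the one-day slack are all assumptions made up for
illustration.

```python
from datetime import datetime, timedelta, timezone


def needs_check(last_check, now,
                interval=timedelta(days=180),
                future_slack=timedelta(days=1)):
    """Decide whether a periodic check looks due (hypothetical policy)."""
    if last_check > now + future_slack:
        # Far in the future: the clock or the superblock is suspect.
        return True
    if last_check > now:
        # Slightly in the future: treat it as a clock adjustment.
        return False
    # Normal case: check once the interval has elapsed.
    return now - last_check >= interval
```

With a one-day slack, a last-check stamp that is 20h ahead of the
corrected clock would not force a second full fsck, while a stamp
several days ahead still would.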

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>
attach .zip as .dat


2005-11-16 15:52:29

by linux-os (Dick Johnson)

Subject: Re: hware clock left bad after a system failure


On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:

> I recently had two cases where my machine locked up and needed
> a hard reset. The last time magic SysRq did not respond at all.
>
> In these cases I found that the hware clock was set incorrectly
> and the machine comes up with a bad date. It seems that the clock
> is ahead by as much as my TZ (+10 in my case). I may be able
> to understand if it was set 10h behind (kernel set it to UTC)
> but this is the other way. The machine comes up with UTC+20.
>
> Now this is just trouble. The machine comes up and spends 15m
> fscking. I then reset the clock and reboot and it does the whole
> fsck again because it thinks the fs was not checked for eons. It
> does not understand time in the future.
>
> So the points are
>
> - why is the clock mangled in this way?

I am assuming that you have an ix86 kind of machine.

It's probably mangled because you had a hardware-crash.

If you have a driver that accesses the RTC, it needs to leave
the index register at offset 0 so that a hardware crash can
only upset the seconds. Otherwise, even the RTC checksum
can get screwed up, forcing manual reconfiguration of the
BIOS.

During a hardware-crash, the chip enables may go TRUE. This
means that an RTC write can occur with junk that's on the
data-bus.
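
A toy model of the argument above, not real hardware access. It assumes
the conventional PC RTC layout in which an index register selects which
byte the data port touches, with seconds at offset 0 and hours at
offset 4. The class and function names are invented for the example.

```python
class ToyRtc:
    """Toy index/data register pair; it never touches real I/O ports."""
    SECONDS, HOURS = 0x00, 0x04

    def __init__(self):
        self.index = 0
        self.regs = {self.SECONDS: 30, self.HOURS: 14}

    def select(self, index):
        self.index = index

    def read(self):
        return self.regs[self.index]

    def spurious_write(self, junk):
        # A glitching chip enable latches whatever is on the data bus
        # into the register the index currently selects.
        self.regs[self.index] = junk


def read_hours(rtc, park_index):
    """Read the hours register, optionally parking the index at 0."""
    rtc.select(ToyRtc.HOURS)
    value = rtc.read()
    if park_index:
        rtc.select(ToyRtc.SECONDS)  # leave the index at offset 0
    return value
```

In this model a driver that parks the index loses at most the seconds
to a junk write; one that leaves the index on the hours register has
the hours overwritten instead.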

Now, you need to find out why you had a hardware-crash which
is quite unlike a software-crash. A hardware crash occurs when
you turn OFF the power or the power-good line from the
power-supply goes FALSE. You do not get a hardware-crash from
hitting the reset button. You may have induced the RTC failure
if you hit the power switch instead of the reset button.

> - should e2fsck not allow future check time (maybe within some
> limits)?

Doesn't the `init` script ask you if you want to fsck the
drive? Most distributions do. Anyways, a time in the future
is one of the ways e2fsck may discover that your file-system
is dorked. You certainly don't want it to ignore it by default.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.51 BogoMips).
Warning : 98.36% of all statistics are fiction.
.

****************************************************************
The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to [email protected] - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.

2005-11-16 21:52:22

by Eyal Lebedinsky

Subject: Re: hware clock left bad after a system failure

linux-os (Dick Johnson) wrote:
> On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:
>
>
>>I recently had two cases where my machine locked up and needed
>>a hard reset. The last time magic SysRq did not respond at all.
>>
>>In these cases I found that the hware clock was set incorrectly
>>and the machine comes up with a bad date. It seems that the clock
>>is ahead by as much as my TZ (+10 in my case). I may be able
>>to understand if it was set 10h behind (kernel set it to UTC)
>>but this is the other way. The machine comes up with UTC+20.
>>
>>Now this is just trouble. The machine comes up and spends 15m
>>fscking. I then reset the clock and reboot and it does the whole
>>fsck again because it thinks the fs was not checked for eons. It
>>does not understand time in the future.
>>
>>So the points are
>>
>>- why is the clock mangled in this way?
>
>
> I am assuming that you have an ix86 kind of machine.
>
> It's probably mangled because you had a hardware-crash.
>
> If you have a driver that accesses the RTC, it needs to leave
> the index register at offset 0 so that a hardware crash can
> only upset the seconds. Otherwise, even the RTC checksum
> can get screwed up, forcing manual reconfiguration of the
> BIOS.
>
> During a hardware-crash, the chip enables may go TRUE. This
> means that an RTC write can occur with junk that's on the
> data-bus.
>
> Now, you need to find out why you had a hardware-crash which
> is quite unlike a software-crash. A hardware crash occurs when
> you turn OFF the power or the power-good line from the
> power-supply goes FALSE. You do not get a hardware-crash from
> hitting the reset button. You may have induced the RTC failure
> if you hit the power switch instead of the reset button.

I hit the reset. In one case I managed to reboot using magic SysRq.

The crashes are related to disk problems. In one case the hda/b
controller went down (last message said DMA disabled on both)
and in another case the system was doing a proper shutdown
when it failed to complete and a reset was necessary. I suspect
a problem with the SATA card which I know has some driver issues
(promise SATA II 150 TX4).

The point of the post was the fact that the clock was not randomly
set but clearly at +20h after the reboot. This was the case in the
last two crashes. This is too coincidental and I suspect that some
logic does play with the RTC and if a proper shutdown does not
complete it may not be restored correctly.

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>
attach .zip as .dat

2005-11-17 13:37:05

by linux-os (Dick Johnson)

Subject: Re: hware clock left bad after a system failure


On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:

> linux-os (Dick Johnson) wrote:
>> On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:
>>
>>
>>> I recently had two cases where my machine locked up and needed
>>> a hard reset. The last time magic SysRq did not respond at all.
>>>
>>> In these cases I found that the hware clock was set incorrectly
>>> and the machine comes up with a bad date. It seems that the clock
>>> is ahead by as much as my TZ (+10 in my case). I may be able
>>> to understand if it was set 10h behind (kernel set it to UTC)
>>> but this is the other way. The machine comes up with UTC+20.
>>>
>>> Now this is just trouble. The machine comes up and spends 15m
>>> fscking. I then reset the clock and reboot and it does the whole
>>> fsck again because it thinks the fs was not checked for eons. It
>>> does not understand time in the future.
>>>
>>> So the points are
>>>
>>> - why is the clock mangled in this way?
>>
>>
>> I am assuming that you have an ix86 kind of machine.
>>
>> It's probably mangled because you had a hardware-crash.
>>
>> If you have a driver that accesses the RTC, it needs to leave
>> the index register at offset 0 so that a hardware crash can
>> only upset the seconds. Otherwise, even the RTC checksum
>> can get screwed up, forcing manual reconfiguration of the
>> BIOS.
>>
>> During a hardware-crash, the chip enables may go TRUE. This
>> means that an RTC write can occur with junk that's on the
>> data-bus.
>>
>> Now, you need to find out why you had a hardware-crash which
>> is quite unlike a software-crash. A hardware crash occurs when
>> you turn OFF the power or the power-good line from the
>> power-supply goes FALSE. You do not get a hardware-crash from
>> hitting the reset button. You may have induced the RTC failure
>> if you hit the power switch instead of the reset button.
>
> I hit the reset. In one case I managed to reboot using magic SysRq.
>
> The crashes are related to disk problems. In one case the hda/b
> controller went down (last message said DMA disabled on both)
> and in another case the system was doing a proper shutdown
> when it failed to complete and a reset was necessary. I suspect
> a problem with the SATA card which I know has some driver issues
> (promise SATA II 150 TX4).
>
> The point of the post was the fact that the clock was not randomly
> set but clearly at +20h after the reboot. This was the case in the
> last two crashes. This is too coincidental and I suspect that some
> logic does play with the RTC and if a proper shutdown does not
> complete it may not be restored correctly.

Red Hat distributions set the RTC during the shutdown sequence.
That sequence is in /etc/rc.d/init.d/halt line 232. If you just
comment out that line, your problem may go away while you search
for the underlying cause.

The only way the time could have been set to something sane,
rather than to random junk, is for something like this script
to have set it. It's likely that you have some memory corruption
that screwed up your environment and therefore the time-zone.
When the shutdown sequence ran, the RTC was set to the wrong
time because of this time-zone corruption.

If your machine was being heavily swapped when the disk problems
occurred, this __might__ explain the corruption. However, I would
first check RAM, do not overclock, etc. It might be that bad
RAM, in fact, is the reason for all your problems and you don't
really have disk or driver problems at all.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.44 BogoMips).
Warning : 98.36% of all statistics are fiction.
.


2005-11-17 14:02:20

by Eyal Lebedinsky

Subject: Re: hware clock left bad after a system failure

linux-os (Dick Johnson) wrote:
> On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:
[report of hwclock breakage trimmed]
>
> If your machine was being heavily swapped when the disk problems
> occurred, this __might__ explain the corruption. However, I would
> first check RAM, do not overclock, etc. It might be that bad
> RAM, in fact, is the reason for all your problems and you don't
> really have disk or driver problems at all.

I will now keep watching quietly for a while. Earlier today I had
another such hard lockup, identical errors claiming a disk failure.

This time I went into the BIOS on bootup and the clock was set
correctly. Good. Booted and it did the usual fscks but then dropped
into a shell when errors were found on /. I did the necessary fscks.

On a hunch I did 'date' and the clock was 11h ahead (we actually
are +11 now). So the problem is during the boot, not during the
crash. I consider that the boot thinks that I am running a UTC
hwclock and adjusts for this, when in fact I run a local time
hwclock. There must be something in the scripts that goes funny
if / does an fsck and then drops into the recovery shell.
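
The arithmetic behind this suspicion can be sketched as follows. The
function and parameter names are hypothetical; it only models a
hardware clock that holds local time being read as if it held UTC.

```python
from datetime import datetime, timedelta


def displayed_local_time(true_utc, tz_offset_hours,
                         rtc_holds_local, boot_assumes_utc):
    """Return what `date` would show under the given assumptions."""
    offset = timedelta(hours=tz_offset_hours)
    # What the RTC stores, as a plain wall-clock value.
    rtc_value = true_utc + offset if rtc_holds_local else true_utc
    # What the boot scripts turn that value into as system UTC.
    system_utc = rtc_value if boot_assumes_utc else rtc_value - offset
    # What `date` then prints in local time.
    return system_utc + offset
```

With a local-time hwclock read as UTC, the result is ahead of the real
local time by exactly the zone offset: 11h at +11, and UTC+20 at +10 as
in the original report.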

I will start looking in this direction.

This is Debian Sarge in x86.

Thanks

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>
attach .zip as .dat

2005-11-17 14:37:47

by Gene Heskett

Subject: Re: hware clock left bad after a system failure

On Thursday 17 November 2005 09:02, Eyal Lebedinsky wrote:
>linux-os (Dick Johnson) wrote:
>> On Wed, 16 Nov 2005, Eyal Lebedinsky wrote:
>
>[report of hwclock breakage trimmed]
>
>> If your machine was being heavily swapped when the disk problems
>> occurred, this __might__ explain the corruption. However, I would
>> first check RAM, do not overclock, etc. It might be that bad
>> RAM, in fact, is the reason for all your problems and you don't
>> really have disk or driver problems at all.
>
>I will now keep watching quietly for a while. Earlier today I had
>another such hard lockup, identical errors claiming a disk failure.
>
>This time I went into the BIOS on bootup and the clock was set
>correctly. Good. Booted and it did the usual fscks but then dropped
>into a shell when errors were found on /. I did the necessary fscks.
>
>On a hunch I did 'date' and the clock was 11h ahead (we actually
>are +11 now). So the problem is during the boot, not during the
>crash. I consider that the boot thinks that I am running a UTC
>hwclock and adjusts for this, when in fact I run a local time
>hwclock. There must be something in the scripts that goes funny
>if / does an fsck and then drops into the recovery shell.
>
>I will start looking in this direction.
>
>This is Debian Sarge in x86.
>
>Thanks

Thanks for looking into this. For me it has been exactly the PITA the
original poster described, for years and on all distros.

It seems the scenario is that if everything is on local time, it's not a
problem, but if the system is set to run its hardware clock on Greenwich
time, a crash leaves the hw clock afu because it's on local time at the
reboot init.

I did something to my init script to quasi-fix this years ago, fix now
lost in the sands of time as this FC2 install is truly long in the
tooth now. But it works, so I don't fix it. :)

I *think* the problem is that the assumption of Greenwich time is made
only at boot and at shutdown. At boot, I believe the hw clock is reset
to local time after getting the time reference from it, and at shutdown
it's reset to Greenwich. So a crash recovery finds it on local time but
assumes it's Greenwich & the result is a predictable mess.

If that's the case, then IMNSHO, the hw clock should ALWAYS be on
Greenwich time. This needless twiddling is rather counter-productive.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.36% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.


2005-11-17 15:37:43

by Petri Kaukasoina

Subject: Re: hware clock left bad after a system failure

On Fri, Nov 18, 2005 at 01:02:15AM +1100, Eyal Lebedinsky wrote:
> On a hunch I did 'date' and the clock was 11h ahead (we actually
> are +11 now). So the problem is during the boot, not during the
> crash. I consider that the boot thinks that I am running a UTC
> hwclock and adjusts for this, when in fact I run a local time
> hwclock.

A wild guess follows. If /usr is on a separate partition from root,
/etc/localtime should be a file copied from /usr/share/zoneinfo/something
and not just a symlink...
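
One way to test this guess is to see whether /etc/localtime is a
symlink that resolves under /usr, in which case the zone data would be
unreadable until /usr is mounted. A minimal sketch; the function name
is made up, and the default paths are the ones mentioned above.

```python
import os


def localtime_depends_on_usr(localtime_path="/etc/localtime",
                             usr_root="/usr"):
    """Return True if localtime_path is a symlink resolving under usr_root."""
    if not os.path.islink(localtime_path):
        # A regular file (a copy) is readable as soon as / is mounted.
        return False
    target = os.path.realpath(localtime_path)
    usr = os.path.realpath(usr_root)
    return target == usr or target.startswith(usr + os.sep)


if __name__ == "__main__":
    print(localtime_depends_on_usr())
```

A True result on a machine with a separate /usr partition would be
consistent with the guess; it does not prove that this is what the boot
scripts trip over.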