2006-10-20 05:30:48

by Gene Heskett

Subject: 2.6.19-rc1, timebomb?

Greetings;

I just arrived home a few hours ago, and my wife said the outside lights
hadn't worked for the last 2 days.

I came in to check, and this machine, which runs some heyu scripts to do
this, was powered down. So I powered it back up and it had to e2fsck
everything. I have a UPS with a fresh battery which passes the tests just
fine.

The only thing in the logs is a single line about eth0 being down:
Oct 17 05:31:11 coyote kernel: eth0: link down.
Oct 19 20:37:49 coyote syslogd 1.4.1: restart.

Uptime when this occurred was about 9 days. Was this a known problem?
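
The two log lines above bracket the outage: the last entry before the power-down and the syslogd restart afterwards. As a rough sketch of how that silence could be measured across a whole log, the following Python assumes classic syslog lines that start with "Mon DD HH:MM:SS" and carry no year, so the year is passed in by hand. The function names and the one-hour threshold are illustrative choices, not anything from the original report.

```python
from datetime import datetime, timedelta

def parse_syslog_time(line, year):
    """Parse the 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, year, threshold=timedelta(hours=1)):
    """Return (gap, previous_line, next_line) for each silence longer than threshold."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        now = parse_syslog_time(line, year)
        if prev_time is not None and now - prev_time > threshold:
            gaps.append((now - prev_time, prev_line, line))
        prev_time, prev_line = now, line
    return gaps

# The two lines quoted in the message above.
sample = [
    "Oct 17 05:31:11 coyote kernel: eth0: link down.",
    "Oct 19 20:37:49 coyote syslogd 1.4.1: restart.",
]

for gap, before, after in find_gaps(sample, 2006):
    print(f"silent for {gap}; last entry before the gap: {before!r}")
```

Run against a full log, every long silence that ends in a "syslogd ... restart" line would be a candidate unlogged power-down.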

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.


2006-10-21 04:21:40

by Chris Largret

Subject: Re: 2.6.19-rc1, timebomb?

On Fri, 20 Oct 2006 01:30:44 -0400
Gene Heskett <[email protected]> wrote:

> Greetings;
>
> I just arrived home a few hours ago, and my wife said the outside lights
> hadn't worked for the last 2 days.
>
> I came in to check, and this machine, which runs some heyu scripts to do
> this, was powered down. So I powered it back up and it had to e2fsck
> everything. I have a UPS with a fresh battery which passes the tests just
> fine.
>
> The only thing in the logs is a single line about eth0 being down:
> Oct 17 05:31:11 coyote kernel: eth0: link down.
> Oct 19 20:37:49 coyote syslogd 1.4.1: restart.
>
> Uptime when this occurred was about 9 days. Was this a known problem?

Out of curiosity, did you check the UPS logs? The low- (and mid- ?)
range ones I've played with have logs as well as the ability to tell
the computer when there is a power problem. I'd check those logs and
also look in the system BIOS for a way to power the computer back on
when power returns. If it was powered off, I don't believe it would be
kernel-related.

I could always be wrong, but from my own experiences kernel problems
result in a system that is on but not operational.

--
Chris Largret <http://www.largret.com>

2006-10-21 04:37:59

by Gene Heskett

Subject: Re: 2.6.19-rc1, timebomb?

On Saturday 21 October 2006 00:22, Chris Largret wrote:
>On Fri, 20 Oct 2006 01:30:44 -0400
>
>Gene Heskett <[email protected]> wrote:
>> Greetings;
>>
>> I just arrived home a few hours ago, and my wife said the outside
>> lights hadn't worked for the last 2 days.
>>
>> I came in to check, and this machine, which runs some heyu scripts to
>> do this, was powered down. So I powered it back up and it had to e2fsck
>> everything. I have a UPS with a fresh battery which passes the tests
>> just fine.
>>
>> The only thing in the logs is a single line about eth0 being down:
>> Oct 17 05:31:11 coyote kernel: eth0: link down.
>> Oct 19 20:37:49 coyote syslogd 1.4.1: restart.
>>
>> Uptime when this occurred was about 9 days. Was this a known problem?
>
>Out of curiosity, did you check the UPS logs? The low- (and mid- ?)
>range ones I've played with have logs as well as the ability to tell
>the computer when there is a power problem. I'd check those logs and
>also look in the system BIOS for a way to power the computer back on
>when power returns. If it was powered off, I don't believe it would be
>kernel-related.
>
Yes, they were clean. It's a 1500 VA Belkin, not exactly a small UPS.

>I could always be wrong, but from my own experiences kernel problems
>result in a system that is on but not operational.

ISTR that was the second time an un-logged powerdown has been done since
that kernel became the default. For all practical purposes, it's the
equivalent of tapping the hard reset button and, before it can start to
reboot, the 4 second powerdown expires and things get real quiet.

I guess I'm 'waiting for the other shoe to drop'. Until that time,
everything seems normal. But I did just note that 'fam' is using up to
99.3% of the cpu, which is unusual considering that amanda is also
running, and it's usually gtar that's the hog. This is according to htop.

That doesn't seem to be what I'd expect to see, that's for sure. Even
weirder, I just used htop to send it a SIGHUP and it's now gone. WTF?? Me
wanders off for some sleep while the real brains ponder that one.


2006-10-21 06:08:37

by Gene Heskett

Subject: Re: 2.6.19-rc1, timebomb?

On Saturday 21 October 2006 01:03, Chris Wedgwood wrote:
>On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:
>> I guess I'm 'waiting for the other shoe to drop' Until that time,
>> everything seems normal. But I did just note that 'fam' is using up
>> to 99.3% of the cpu, which is unusual considering that amanda is
>> also running, and it's usually gtar that's the hog. This is according
>> to htop.
>
>I've had a few spontaneous restarts (which actually might have been
>shutdowns, any key press will wake the machine up so a power down when
>working would probably look like a restart).
>
>I've assumed these were heat related, mostly because they also
>occurred when the CPU was working hard and the weather has been pretty
>warm lately.

These may be related. But I'm not convinced weather has anything to do
with it. The cpu is running about 120F, and is busier by quite a few
processes than it was when the last failure occurred.

The 'fam' that was using 99.3% of the cpu, and which disappeared when I
sent it a SIGHUP, has not returned, and amanda has completed her nightly
chores without any hiccups. It was not started as a service and is unknown
when I try to get a status report from it. So I'm wondering just where it
fits in the grand scheme of things?



2006-10-21 15:10:55

by Gene Heskett

Subject: Re: 2.6.19-rc1, timebomb?

On Saturday 21 October 2006 02:08, Gene Heskett wrote:
>On Saturday 21 October 2006 01:03, Chris Wedgwood wrote:
>>On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:
>>> I guess I'm 'waiting for the other shoe to drop' Until that time,
>>> everything seems normal. But I did just note that 'fam' is using up
>>> to 99.3% of the cpu, which is unusual considering that amanda is
>>> also running, and it's usually gtar that's the hog. This is according
>>> to htop.
>>
>>I've had a few spontaneous restarts (which actually might have been
>>shutdowns, any key press will wake the machine up so a power down when
>>working would probably look like a restart).
>>
>>I've assumed these were heat related, mostly because they also
>>occurred when the CPU was working hard and the weather has been pretty
>>warm lately.
>
>These may be related. But I'm not convinced weather has anything to do
>with it. The cpu is running about 120F, and is busier by quite a few
>processes than it was when the last failure occurred.
>
>The 'fam' that was using 99.3% of the cpu, and which disappeared when I
>sent it a SIGHUP, has not returned, and amanda has completed her nightly
>chores without any hiccups. It was not started as a service and is unk
> to getting a status report from it. So I'm wondering just where it fits
> in the grand scheme of things?
>
Further addendum: Another shutdown this morning, and the only line in the
log is the 3rd one here:

Oct 21 07:42:18 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3
Oct 21 07:51:39 coyote kernel: usb 3-2.1: reset low speed USB device using ohci_hcd and address 3
Oct 21 08:01:01 coyote kernel: eth0: link down. <<------<<<
Oct 21 10:53:12 coyote syslogd 1.4.1: restart.

That's a Microsoft wireless mouse it keeps resetting, and I hadn't been in
the room in 2+ hours. My logs are littered with that message, but the
mouse itself works fine. I'd estimate that the mouse reset is over 90% of
the content of my messages logs, and has been since back around the 2.6.16
days. The batteries are fine.

I'm back to 2.6.18, but will build 2.6.19-rc2 shortly.



2006-10-21 17:25:44

by Andi Kleen

Subject: Re: 2.6.19-rc1, timebomb?

Gene Heskett <[email protected]> writes:
>
> ISTR that was the second time an un-logged powerdown has been done since
> that kernel became the default.

It might be overheating. During a critical overheat condition the
ACPI code will just power off. It should still get console messages
out (but nothing on disk), so if you configure serial or net console
you would see a message.
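
As a sketch of the net console route: the kernel's netconsole module takes a parameter of the form src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac and sends console messages as UDP packets to the target. The Python below builds that string and provides a minimal UDP receiver for a second machine. All addresses and the MAC are placeholders, the ports are netconsole's documented defaults, and the log file name is an arbitrary choice. Whether a final message actually gets out before the power drops is not guaranteed.

```python
import socket

def netconsole_param(src_ip, dev, tgt_ip, tgt_mac="", src_port=6665, tgt_port=6666):
    """Build the value for the kernel's netconsole= parameter."""
    return f"{src_port}@{src_ip}/{dev},{tgt_port}@{tgt_ip}/{tgt_mac}"

def listen(bind_ip="0.0.0.0", port=6666, logfile="netconsole.log"):
    """Append every UDP datagram received on `port` to `logfile` (runs until killed)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_ip, port))
    with open(logfile, "ab") as out:
        while True:
            data, _sender = sock.recvfrom(4096)
            out.write(data)
            out.flush()  # keep the file current in case the receiver is interrupted

# Placeholder addresses: the failing box is .10, the receiving box is .20.
param = netconsole_param("192.168.1.10", "eth0", "192.168.1.20", "00:11:22:33:44:55")
print("netconsole=" + param)
```

The printed string would go on the kernel command line or be passed as a module option on the failing box, with listen() running on the receiving machine.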

And check your fans are ok.

-Andi

2006-10-22 00:11:40

by Gene Heskett

Subject: Re: 2.6.19-rc1, timebomb?

On Saturday 21 October 2006 13:25, Andi Kleen wrote:
>Gene Heskett <[email protected]> writes:
>> ISTR that was the second time an un-logged powerdown has been done
>> since that kernel became the default.
>
>It might be overheating. During a critical overheat condition the
>ACPI code will just power off. It should still get console messages
>out (but nothing on disk), so if you configure serial or net console
>you would see a message.
>
>And check your fans are ok.
>
>-Andi

Thanks Andi, but heating isn't a problem that I'm aware of. I'm no longer
running a seti client since they moved it all to BOINC & refused to set
priorities to reasonable values. Cpu temps are pretty steady at 120F.
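
For comparison with thermal trip points, which are normally given in Celsius, 120F works out to about 49C. A small sketch of the conversion follows, with a helper for reading a thermal zone. The sysfs path and its millidegree format are assumptions based on current kernels and may not match what a 2.6.19-era system exposes.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0

def read_thermal_zone_celsius(path="/sys/class/thermal/thermal_zone0/temp"):
    """Read a thermal zone value (assumed to be millidegrees Celsius) as Celsius.

    The default path is an assumption; it is not taken from the thread.
    """
    with open(path) as fh:
        return int(fh.read().strip()) / 1000.0

print(f"120F is about {fahrenheit_to_celsius(120):.1f}C")
```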

I tried to build and boot 2.6.19-rc2 twice today, but each time it fails
at the initrd read phase, saying no (mutter) or cpio magic. And this is
with exactly the same command line as always for generating the initrd and
then copying it to the /boot partition. This works well for 2.6.18, which
I just rebuilt after having discovered I'd lost the himem magic somehow.
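
A complaint about magic numbers at the initrd stage suggests the kernel did not recognise the start of the image. As a rough sanity check, the sketch below looks at the leading bytes of an image: gzip streams start with 0x1f 0x8b, and cpio "newc" archives start with the ASCII string 070701 (070702 for the CRC variant). The function names are illustrative and any path passed in is the reader's own. An old-style initrd that holds a filesystem image rather than a cpio archive would show up here as "unknown" after decompression, which is not in itself an error.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"
CPIO_NEWC_MAGICS = (b"070701", b"070702")  # newc and newc-with-CRC headers

def classify_image(head):
    """Classify the first bytes of an image as 'gzip', 'cpio-newc' or 'unknown'."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head[:6] in CPIO_NEWC_MAGICS:
        return "cpio-newc"
    return "unknown"

def inspect_initrd(path):
    """Return (outer, inner): the file's format and, if gzipped, its payload's format."""
    with open(path, "rb") as fh:
        outer = classify_image(fh.read(6))
    if outer != "gzip":
        return outer, None
    with gzip.open(path, "rb") as fh:
        inner = classify_image(fh.read(6))
    return outer, inner
```

Calling inspect_initrd() on the image that boots and on the one that fails would show whether the two files differ in format at all.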

In fact, that's the 2.6.18 I'm running on right now. If I get a decent
uptime here, then I'll be pretty well convinced it's something in
2.6.19-rc1 that's doing it.

I haven't tried to set up a serial console because both serial ports on this
box are already busy with other things. I could free a serial port if
someone could tell me how to make the bulldog UPS monitoring software from
Belkin use a USB port instead. Anyone have a clue to share on that
subject?


2006-10-22 11:21:31

by Gene Heskett

Subject: WAS Re: 2.6.19-rc1, timebomb?, now -rc2 progress

On Saturday 21 October 2006 20:11, Gene Heskett wrote:
>On Saturday 21 October 2006 13:25, Andi Kleen wrote:
>>Gene Heskett <[email protected]> writes:
>>> ISTR that was the second time an un-logged powerdown has been done
>>> since that kernel became the default.
>>
>>It might be overheating. During a critical overheat condition the
>>ACPI code will just power off. It should still get console messages
>>out (but nothing on disk), so if you configure serial or net console
>>you would see a message.
>>
>>And check your fans are ok.
>>
>>-Andi
>
>Thanks Andi, but heating isn't a problem that I'm aware of, I'm no longer
>running a seti client since they moved it all to BOINC & refused to set
>priorities to reasonable values. Cpu temps are pretty steady at 120F.
>
>I tried to build and boot to 2.6.19-rc2 twice today, but each time it
> fails at the initrd read phase, saying no (mutter) or cpio magic. And
> this is with exactly the same command line as always generating the
> initrd and then copying it to the /boot partition. This works well for
> 2.6.18, which I just rebuilt after having discovered I'd lost the himem
> magic somehow.

Someplace along the line, either a make oldconfig screwed up, or the .config
chain of succession my scripts use got totally fubared when I was trying
to build 19-rc2.

After 3 more rebuilds to add stuff like emu10k1 & the RAMFS bits, -rc2 has
now booted. So now we wait for the other shoe to drop & see if the auto
powerdowns persist.
[...]
>I haven't tried to set up a serial console because both serial ports on
> this box are already busy with other things. I could free a serial port
> if someone could tell me howto make the bulldog ups monitoring software
> from belkin use a usb port instead. Anyone have a clue to share on that
> subject?


2006-10-21 05:03:44

by Chris Wedgwood

Subject: Re: 2.6.19-rc1, timebomb?

On Sat, Oct 21, 2006 at 12:37:56AM -0400, Gene Heskett wrote:

> I guess I'm 'waiting for the other shoe to drop' Until that time,
> everything seems normal. But I did just note that 'fam' is using up
> to 99.3% of the cpu, which is unusual considering that amanda is
> also running, and it's usually gtar that's the hog. This is according
> to htop.

I've had a few spontaneous restarts (which actually might have been
shutdowns, any key press will wake the machine up so a power down when
working would probably look like a restart).

I've assumed these were heat related, mostly because they also
occurred when the CPU was working hard and the weather has been pretty
warm lately.