2012-02-17 22:54:27

by Keith Chew

Subject: Hang on "echo b > /proc/sysrq-trigger"

Hi

To test the reliability of some hardware, I have a script that reboots
the machine 15 minutes after every boot. The machine has dual video
outputs, VGA and DVI-D, both driven by an Intel GM45 chipset (I am
using kernel 2.6.39.24 with the in-kernel Intel graphics driver).
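
For anyone wanting to set up a similar soak test, here is a minimal
sketch in Python. It is not the actual script used here. The names
REBOOT_COMMANDS, boot_record, append_boot_record and plan_cycle are
hypothetical. The idea is to append one line per boot to a log on
persistent storage and fsync it, because an emergency restart may not
flush pending writes.

```python
import os
import time

# Hypothetical labels for the three reboot commands compared in this thread.
REBOOT_COMMANDS = {
    "sysrq": "echo b > /proc/sysrq-trigger",
    "forced": "reboot -fn",
    "orderly": "reboot",
}


def boot_record(boot_number, method, now=None):
    """Format one log line; `now` is seconds since the epoch (default: current time)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    return f"{stamp} UTC boot {boot_number}, next reboot via {method}"


def append_boot_record(log_path, method, now=None):
    """Append a record for this boot and return the running boot count."""
    try:
        with open(log_path) as log:
            boots = sum(1 for _ in log)
    except FileNotFoundError:
        boots = 0
    with open(log_path, "a") as log:
        log.write(boot_record(boots + 1, method, now) + "\n")
        log.flush()
        os.fsync(log.fileno())  # make sure the line reaches the disk before the reset
    return boots + 1


def plan_cycle(method, uptime_minutes=15):
    """Return (seconds to stay up, shell command to run afterwards)."""
    return uptime_minutes * 60, REBOOT_COMMANDS[method]
```

A boot-time job would call append_boot_record(), sleep for the delay
returned by plan_cycle(), and then run the returned command. The last
line in the log then shows when the final successful boot happened.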

Some interesting results (which can be reproduced consistently):
- "echo b > /proc/sysrq-trigger" via DVI-D: after 2-3 days, it hangs
  (freezes) before reboot (dmesg only shows "Resetting...", nothing
  after that, no panic, stack trace, etc)
- "echo b > /proc/sysrq-trigger" via VGA: runs > 1 week
- "reboot -fn" via VGA or DVI-D: runs > 1 week
- "reboot" via VGA or DVI-D: runs > 1 week

I suspect that the Intel graphics driver is not happy when
"echo b > /proc/sysrq-trigger" is issued while the driver is still
running.

I would like "echo b" to reboot the machine reliably, but this looks
like it may be a hardware bug. Is there anything that can be done in
the kernel to make "echo b" work 100% of the time?

Regards
Keith


2012-02-29 18:03:51

by Eric W. Biederman

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Keith Chew writes:

> Hi
>
> To test the reliability of a hardware, I have a script which reboots a
> machine every 15 minutes after boot up. This machine has a dual video
> output, VGA and DVI-D, both driven via an intel GM45 chipset (I am
> using kernel 2.6.39.24 kernel intel drivers).
>
> Some interesting results (which can be reproduced consistently):
> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
> (freezes) before reboot (dmesg only shows "Resetting...", nothing

My blind guess would be that it is the BIOS on the machine that is hung.

> after that, no panic, stack trace, etc)
> "echo b > /proc/sysrq-trigger" via VGA - runs > 1 week
> "reboot -fn" via VGA or DVI-D - runs > 1 week
> "reboot" via VGA or DVI-D - runs > 1 week
>
> I suspect that the intel graphics driver is not happy with the "echo b
> > /proc/sysrq-trigger" when it is still running.
>
> I would like to make the "echo b" successfully reboot the machine, but
> this would appear to be a hardware bug? Is there anything that can be
> done in the kernel to make the "echo b" successfully work 100%?

echo b > /proc/sysrq-trigger triggers the emergency_restart path, which
still tries to restart but skips some of the normal steps so that it has
a reasonable chance of working when the kernel is wedged. It looks like
some of the steps it skips are needed on your hardware.

Eric

2012-02-29 18:28:10

by Keith Chew

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Hi Eric

<snip>

>> Some interesting results (which can be reproduced consistently):
>> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
>> (freezes) before reboot (dmesg only shows "Resetting...", nothing
>
> My blind guess would be that it is the BIOS on the machine that is hung.
>

We have contacted the manufacturer, and they do not believe this is
the case as the BIOS does not really do much during the reboot.
Unfortunately, we do not have enough knowledge on the inner workings
of the BIOS to help or diagnose further. Any pointers here will be
helpful.

<snip>
>>
>> I would like to make the "echo b" successfully reboot the machine, but
>> this would appear to be a hardware bug? Is there anything that can be
>> done in the kernel to make the "echo b" successfully work 100%?
>
> echo b > /proc/sysrq-trigger triggers the emergency_restart path which
> tries but skips some steps so that it has a reasonable chance of working
> when the kernel is wedged, it looks like some of those steps it skips
> are needed on your hardware.
>

Yes, I have looked into the kernel code and it does not do much,
except to tell the hardware to reboot (either via BIOS, keyboard,
ACPI, etc). I have also tried the reboot=b, reboot=k and reboot=a
options, and all of them can cause a hang, with reboot=b lasting the
longest.
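
To double-check which method a given boot actually requested, the
reboot= setting can be read back from /proc/cmdline. Below is a small
Python sketch. The function name parse_reboot_option and the
REBOOT_FLAGS table are hypothetical, and the table reflects my reading
of the x86 reboot= documentation (only the first letter of each
comma-separated item is significant), so it should be checked against
the kernel tree in use.

```python
# Assumed meaning of the first letter of each reboot= item on x86.
REBOOT_FLAGS = {
    "w": "warm reset",
    "c": "cold reset",
    "b": "BIOS",
    "a": "ACPI",
    "k": "keyboard controller",
    "t": "triple fault",
    "e": "EFI",
    "p": "PCI reset register (0xcf9)",
    "f": "force",
}


def parse_reboot_option(cmdline):
    """Return descriptions of the reboot= items found in a kernel command line."""
    methods = []
    for word in cmdline.split():
        if not word.startswith("reboot="):
            continue
        for item in word[len("reboot="):].split(","):
            if item:
                methods.append(REBOOT_FLAGS.get(item[0], "unknown: " + item))
    return methods
```

On a running system this would be called as
parse_reboot_option(open("/proc/cmdline").read()).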

We have extended our testing time, and have some more worrying
results. The command "reboot -fn", which previously ran for > 1 week,
hung after 2 weeks of running. We are now testing with just "reboot"
to see how long that lasts.

Regards
Keith

2012-02-29 20:49:10

by Eric W. Biederman

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Keith Chew writes:

> Hi Eric
>
> <snip>
>
>>> Some interesting results (which can be reproduced consistently):
>>> "echo b > /proc/sysrq-trigger" via DVI-D - after 2-3 days, it hangs
>>> (freezes) before reboot (dmesg only shows "Resetting...", nothing
>>
>> My blind guess would be that it is the BIOS on the machine that is hung.
>>
>
> We have contacted the manufacturer, and they do not believe this is
> the case as the BIOS does not really do much during the reboot.
> Unfortunately, we do not have enough knowledge on the inner workings
> of the BIOS to help or diagnose further. Any pointers here will be
> helpful.

Historically a lot of issues have had to do with which CPU you are
entering the BIOS from. So you might try pinning your process
to different CPUs and see if you can make the failure more deterministic.
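
On a machine with more than one CPU, the pinning could be sketched in
Python roughly as follows. The function name trigger_from_cpu and its
dry_run flag are hypothetical. It relies on os.sched_setaffinity, which
is Linux-only, and with dry_run=False it really does reboot the machine.

```python
import os


def trigger_from_cpu(cpu, dry_run=True):
    """Pin this process to one CPU, then request the sysrq 'b' reboot from it."""
    available = os.sched_getaffinity(0)
    if cpu not in available:
        raise ValueError(f"CPU {cpu} not available; choose from {sorted(available)}")
    os.sched_setaffinity(0, {cpu})  # from here on we only run on `cpu`
    if dry_run:
        return f"pinned to CPU {cpu}; would write 'b' to /proc/sysrq-trigger"
    with open("/proc/sysrq-trigger", "w") as trigger:
        trigger.write("b")  # immediate reboot, nothing after this line runs
```

Running the soak test once per CPU number would show whether the hang
correlates with the CPU the reboot is issued from.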

>>> I would like to make the "echo b" successfully reboot the machine, but
>>> this would appear to be a hardware bug? Is there anything that can be
>>> done in the kernel to make the "echo b" successfully work 100%?
>>
>> echo b > /proc/sysrq-trigger triggers the emergency_restart path which
>> tries but skips some steps so that it has a reasonable chance of working
>> when the kernel is wedged, it looks like some of those steps it skips
>> are needed on your hardware.
>>
>
> Yes, I have looked into the kernel code and it does not do much,
> except to tell the hardware to reboot (either via BIOS, keyboard,
> ACPI, etc). I have also tried the reboot=b, reboot=k and reboot=a
> options, and all of them can cause a hang, with reboot=b lasting the
> longest.
>
> We have extended our testing time, and have some more worrying
> results. The command "reboot -fn" which runs > 1 week, got a hang
> after 2 weeks of running. We are now testing with just "reboot" to see
> how long that last.

Ugh. The other possibility is that there is an intermittent failure in
the hardware, that prevents the boot/reboot. Wrong values on pull-up
resistors have been known to cause that kind of thing.

Eric

2012-02-29 22:06:55

by Keith Chew

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Hi Eric

>
> Historically a lot of issues have had to do with which cpu you are
> entering the bios from.  So you might try pinning your process
> to different cpus and see if you can make the failure more deterministic.
>

We are using a Celeron 575 uniprocessor, so we do not have the option
of pinning to another CPU. I have tried compiling the kernel in both UP
and SMP configurations, but sadly the hang occurs with both.

>
> Ugh.  The other possibility is that there is an intermittent failure in
> the hardware, that prevents the boot/reboot.  Wrong values on pull-up
> resistors have been known to cause that kind of thing.
>

Thank you very much for this pointer, will feed that back to the
manufacturer and see if it will give them some clues. The original
purpose for this reboot exercise was to ensure the software will
handle a power failure without any OS/data corruptions. With this new
discovery of unreliable reboot, the next worry is "If reboot is not
reliable, is the boot process also susceptible to the same issue?". I
have not rigged up any hardware to simulate a periodic full shutdown
and boot up process, but will be planning to set this up next.

Thanks again, if you have any other suggestions for us to try, I am all ears!

Regards
Keith

2012-02-29 23:34:45

by Eric W. Biederman

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Keith Chew writes:

> Hi Eric
>
>>
>> Historically a lot of issues have had to do with which cpu you are
>> entering the bios from.  So you might try pinning your process
>> to different cpus and see if you can make the failure more deterministic.
>>
>
> We are using a Celeron 575 uniprocessor, so we do not have the option
> to pin on another cpu. I have tried compiling the kernel in both UP
> and SMP configuration, but sadly both causes the hang.

Ok. That rules out a bunch of things, and emergency_restart may not
be much different in practice.

>> Ugh.  The other possibility is that there is an intermittent failure in
>> the hardware, that prevents the boot/reboot.  Wrong values on pull-up
>> resistors have been known to cause that kind of thing.
>>
>
> Thank you very much for this pointer, will feed that back to the
> manufacturer and see if it will give them some clues. The original
> purpose for this reboot exercise was to ensure the software will
> handle a power failure without any OS/data corruptions. With this new
> discovery of unreliable reboot, the next worry is "If reboot is not
> reliable, is the boot process also susceptible to the same issue?". I
> have not rigged up any hardware to simulate a periodic full shutdown
> and boot up process, but will be planning to set this up next.
>
> Thanks again, if you have any other suggestions for us to try, I am
> all ears!

I would check with your BIOS folks and perhaps play with the kernel
reboot= option. The most reliable way to perform a reset is to trigger
a board reset by writing to 0xcf9 or a similar register. I expect your
BIOS does that and you can probably get the kernel to do that. I would
definitely test to see if you can write to the mostly standard
0xcf9 register directly from the kernel and trigger a reset directly.
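
As a rough illustration of the 0xcf9 write, here is a Python sketch that
does it from userspace through /dev/port rather than from inside the
kernel. That is an assumption made to keep the example short. It needs
root, and the names cf9_reset_sequence, read_port, write_port and
hard_reset_via_cf9 are hypothetical. The bit meanings are those of the
reset control register on Intel ICH-style chipsets as I understand them
and should be confirmed against the chipset datasheet.

```python
import time

CF9_PORT = 0xCF9
SYS_RST = 0x02  # bit 1: request a hard reset rather than a soft one
RST_CPU = 0x04  # bit 2: a 0 -> 1 transition starts the reset


def cf9_reset_sequence(current):
    """Return the two byte values to write, preserving the unrelated bits."""
    base = current & ~(SYS_RST | RST_CPU) & 0xFF
    return [base | SYS_RST, base | SYS_RST | RST_CPU]


def read_port(port, dev_port="/dev/port"):
    """Read one byte from an I/O port via /dev/port (root only)."""
    with open(dev_port, "rb", buffering=0) as ports:
        ports.seek(port)
        return ports.read(1)[0]


def write_port(port, value, dev_port="/dev/port"):
    """Write one byte to an I/O port via /dev/port (root only)."""
    with open(dev_port, "r+b", buffering=0) as ports:
        ports.seek(port)
        ports.write(bytes([value]))


def hard_reset_via_cf9():
    """Reset the board immediately. Nothing is synced or shut down first."""
    for value in cf9_reset_sequence(read_port(CF9_PORT)):
        write_port(CF9_PORT, value)
        time.sleep(0.001)  # short pause between selecting and triggering
```

The first write selects a hard reset with the trigger bit low and the
second raises the trigger bit. If I remember correctly, booting with
reboot=p asks the x86 kernel to use this same register for its own
restart path.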

Once past a reset and with a single cpu all of the failures will be
happening in the boot path. So the only possible points of failure
are in devices that are different between a soft reset and a power on
reset.

I would check to see if your board perhaps supports post codes or any
other debugging that will let you see where you are hanging.

It sounds like there is some very rare failure, that is going to be
a challenge to track down. I would definitely test more than one
motherboard to ensure that you can reproduce the problem on more
than one piece of hardware. Sometimes hardware is just broken.

Eric

2012-03-01 00:12:40

by Keith Chew

Subject: Re: Hang on "echo b > /proc/sysrq-trigger"

Hi Eric

> I would check with your BIOS folks and perhaps play with the kernel
> option.  The most reliable way to perform a reset is to trigger a board
> reset by writing to 0xcf9 or a similar register.  I expect your BIOS
> does that and you can probably get the kernel to do that.  I would
> definitely test to see if you can write to the mostly standard
> 0xcf9 register directly from the kernel and trigger a reset directly.
>
> Once past a reset and with a single cpu all of the failures will be
> happening in the boot path.  So the only possible points of failure
> are in devices that are different between a soft reset and a power on
> reset.
>
> I would check to see if your board perhaps supports post codes or any
> other debugging that will let you see where you are hanging.
>
> It sounds like there is some very rare failure, that is going to be
> a challenge to track down.  I would definitely test more than one
> motherboard to ensure that you can reproduce the problem on more
> than one piece of hardware.  Sometimes hardware is just broken.
>

These are really helpful suggestions, I will try to get to the bottom
of it. Yes, we have tried 3 different boards with different RAM, HDD
and CPU. The hang can be reproduced consistently (just not
deterministically at this stage).

Thank you very much again, will update the progress in due course.

Regards
Keith