2000-11-01 08:53:23

by Simon Byrnand

Subject: SMP freeze in <= 2.2.17 when toggling consoles

Hi Everyone,

Although I follow lkml on and off through newsgroups etc, I'm not
subscribed, so please CC me any important replies via email.

Recently while trying to pin down SCSI errors with a Dual PII machine with
onboard SCSI I discovered a nasty and easily repeatable way to cause a
total system freeze.

First the machine details, at least as much as I can remember off the top
of my head, as it is at work:

Tyan Thunder 100 M/B
256MB ram
2x PII-233 (klamath core)
Onboard dual channel AIC-7895 (aic7xxx driver built into the kernel)
Onboard eepro100 compatible ethernet.
2x Seagate ST34501W, 4GB, 10,000rpm drives

The install is Redhat 6.2 but with a custom compiled 2.2.17 kernel,
although I've been testing various different kernels.

The symptom is basically this - under heavy disk activity, rapidly
switching virtual consoles will trigger a system freeze. Nothing will
revive the machine; even the Magic SysRq key is inoperable. Generally I
can get it to freeze within a couple of seconds. If I just switch consoles
slowly it doesn't seem to freeze, but holding down ALT and tapping
F1-F2-F1-F2 as quick as possible will freeze it in a few seconds. Quite
often it will even freeze right in the middle of redrawing the new console
- The top half of the screen will show part of the console you were
switching to, while the bottom is still showing the previous console
because it froze before it finished drawing the screen.

Now here's the kicker - if I compile the exact same kernel, but with SMP
support disabled the freeze does *NOT* happen! Additionally, all kernel
versions (2.2 series only) I've tried exhibit the same problem. As well as
2.2.17 SMP/UP, I tried 2.2.16 SMP and UP, (UP is ok, SMP freezes) and also
the default kernels shipped with redhat 6.2 (2.2.14-5 from memory) and the
result is the same, the UP kernel is ok, the SMP kernel will freeze.
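
When comparing SMP and UP builds like this, it helps to confirm which one is actually running before each test. The following is a minimal sketch, assuming the build string printed by `uname -v` contains the word "SMP" for SMP kernels (as it normally does on Linux); the function name `is_smp_build` is only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given `uname -v` string marks an SMP build.
is_smp_build() {
    case "$1" in
        *SMP*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Report which kind of kernel is currently booted.
if is_smp_build "$(uname -v)"; then
    echo "running an SMP kernel"
else
    echo "running a UP kernel"
fi

# Count the CPUs the kernel brought up, if /proc/cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    echo "CPUs online: $(grep -c '^processor' /proc/cpuinfo)"
fi
```

On a dual-CPU board, a UP kernel should report a single processor here, so a mismatch points at a boot or configuration mix-up rather than at the freeze itself.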

Additionally, the freeze *only* happens when there is disk activity. How
much disk activity is required to trigger it is hard to estimate, but it
certainly does not freeze when the system is idle, only when there is
significant disk activity. To generate disk activity I was using something
like

dd if=/dev/zero of=testfile bs=1M count=256

Of course creating a 256 meg zero-filled file constitutes significant disk
activity, and reliably reproduces the problem.
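
The whole procedure could be scripted roughly as follows. This is only a sketch under some assumptions: it uses `chvt` (from the kbd/console-tools package) in place of tapping ALT-F1/ALT-F2 by hand, which may not switch as quickly as the keyboard does, and the names `generate_disk_load`, `toggle_consoles` and `RUN_REPRO` are made up for illustration. On an affected machine it is expected to hang the box, so it does nothing unless explicitly enabled.

```shell
#!/bin/sh
# Hypothetical reproducer sketch; names here are illustrative only.

# Write a zero-filled file of $2 MiB to path $1 (same dd call as above).
generate_disk_load() {
    dd if=/dev/zero of="$1" bs=1M count="$2" 2>/dev/null
}

# Switch between virtual consoles 1 and 2, $1 times in a row.
# chvt normally needs root or ownership of the current console.
toggle_consoles() {
    n=0
    while [ "$n" -lt "$1" ]; do
        chvt 1
        chvt 2
        n=$((n + 1))
    done
}

# Only run the real test when asked to, since it may freeze the machine.
if [ "${RUN_REPRO:-0}" = "1" ]; then
    generate_disk_load testfile 256 &   # disk activity in the background
    toggle_consoles 200                 # console switching in the foreground
    wait
    rm -f testfile
fi
```

Run it as `RUN_REPRO=1 sh repro.sh` from a text console, after a `sync`, and ideally with filesystems you do not mind having to fsck.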

So what am I looking at here ? Faulty SMP hardware which somehow works ok
in UP mode ? A buggy SMP motherboard implementation ? Bugs in the Linux SMP
implementation triggered by this motherboard ? SMP races in the SCSI
driver, the console driver, the filesystem, or something else ?

For what it's worth, the reason I've been testing the machine in the first
place is because of SCSI related problems with intermittent timeouts/parity
errors etc, but as far as _that_ problem goes, I think this discovery of
being able to freeze the machine by toggling the console is a red herring -
a genuine problem certainly, but (hopefully) unrelated to the problem I was
originally trying to solve. (At present a different hard drive is on test
to try and eliminate the cause of that problem..)

I can provide any other information required to help trace the problem,
(BIOS versions, log dumps, kernel .config etc) but please CC any questions
via email...

Regards,
Simon



2000-11-01 09:29:42

by Simen Thoresen

Subject: Re: SMP freeze in <= 2.2.17 when toggling consoles

> Recently while trying to pin down SCSI errors with a Dual PII machine with
> onboard SCSI I discovered a nasty and easily repeatable way to cause a
> total system freeze.
>
> Tyan Thunder 100 M/B
> 256MB ram
> 2x PII-233 (klamath core)
> Onboard dual channel AIC-7895 (aic7xxx driver built into the kernel)
> Onboard eepro100 compatible ethernet.
> 2x Seagate ST34501W, 4GB, 10,000rpm drives

Ok, our system is
Epox KP6-BS, (Also a dual 440BX)
256MB ram
2x PIII-750MHz Coppermine
Adaptec AHA-294X Ultra SCSI host adapter (Ultra Wide)
2x Kingston K110TX (Lite on tulip) nics
1x 20MB/s raid unit.

> The symptom is basically this - under heavy disk activity, rapidly
> switching virtual consoles will trigger a system freeze. Nothing will
> revive the machine, the Magic Sysreq key is also inoperable. Generally I
> can get it to freeze within a couple of seconds. If I just switch consoles
> slowly it doesn't seem to freeze, but holding down ALT and tapping
> F1-F2-F1-F2 as quick as possible will freeze it in a few seconds. Quite
> often it will even freeze right in the middle of redrawing the new console
> - The top half of the screen will show part of the console you were
> switching to, while the bottom is still showing the previous console
> because it froze before it finished drawing the screen.

I've tried this a few times, but was unable to get the system to crash.
I tried varying block-sizes and against both the ata-controller and the raid-box.

You might want to try to reproduce your problems with another scsi-
controller, cable and disk combination, and also against an ide disk.

Good luck.

-Simen
--
Simen Thoresen, Beowulf-cleaner and random artist.

Er det ikke rart?
The gnu RART-project on http://valinor.dolphinics.no:1080/~simentt/rart