2002-02-11 18:02:56

by kelley eicher

Subject: scsi abort 0x2002 and eth0: too much work on a dual amd 760mpx system

i'm having some problems with the 2.4.17 linux kernel on a dual athlon
system that i was hoping someone could shed some light on. there seem
to be multiple problems so i'm not quite sure which path to follow at this
point.

the scenario is that i have a system with the following hardware:

# awk '/\(/' /proc/pci
Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
ISA bridge: Advanced Micro Devices [AMD] AMD-768 [??] ISA (rev 4).
IDE interface: Advanced Micro Devices [AMD] AMD-768 [??] IDE (rev 4).
Bridge: Advanced Micro Devices [AMD] AMD-768 [??] ACPI (rev 3).
SCSI storage controller: Adaptec 7892A (rev 2).
PCI bridge: Advanced Micro Devices [AMD] AMD-768 [??] PCI (rev 4).
VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).

while this machine is at work i see aic7xxx_abort errors in dmesg from time
to time during heavy i/o. interestingly, this does not crash the machine or
the devices in question but recovers with an aic7xxx_dev_reset instruction
after 1-2 minutes of abort attempts.

during the time between the apparent scsi failures i see a few error
messages in the form of 'eth0: Too much work in interrupt, status e401.'
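
For reference, the hex value in that message is the NIC's interrupt status word. Below is a minimal sketch for decoding it. The bit names and the register-window field in the top three bits are assumptions modeled on the `vortex_status` enum as recalled from the 2.4 `3c59x.c` driver; they should be checked against the driver source for the kernel actually in use.

```python
# Assumed bit names, modeled on the vortex_status enum in 3c59x.c.
# These are NOT taken from the original post; verify against your kernel source.
STATUS_BITS = {
    0x0001: "IntLatch",
    0x0002: "HostError",
    0x0004: "TxComplete",
    0x0008: "TxAvailable",
    0x0010: "RxComplete",
    0x0020: "RxEarly",
    0x0040: "IntReq",
    0x0080: "StatsFull",
    0x0100: "DMADone",
    0x0200: "DownComplete",
    0x0400: "UpComplete",
    0x0800: "DMAInProgress",
    0x1000: "CmdInProgress",
}


def decode_status(status):
    """Return (register_window, [names of the status bits that are set])."""
    # Assumption: the top three bits hold the currently selected register window.
    window = (status >> 13) & 0x7
    names = [name for bit, name in sorted(STATUS_BITS.items()) if status & bit]
    return window, names


if __name__ == "__main__":
    window, names = decode_status(0xE401)
    print("window:", window, "bits:", ", ".join(names))
```

Under these assumptions, e401 reads as a latched interrupt with an upload (receive-side) completion pending and no host error flagged, which would fit a busy receive path more than a faulting card.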

looking at /proc/interrupts i see that indeed, the eth0 device is hard at
work.

# cat /proc/interrupts
           CPU0       CPU1
  0:   33669019   33490567    IO-APIC-edge  timer
  1:      19117      19797    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
 10:   32635348   32632811   IO-APIC-level  eth0
 11:     381721     381874   IO-APIC-level  aic7xxx
 14:     236054     249997    IO-APIC-edge  ide0
 15:          0          6    IO-APIC-edge  ide1
NMI:          0          0
LOC:   67153036   67153815
ERR:          0
MIS:         32
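
To put the counters above in perspective, here is a small sketch that parses `/proc/interrupts`-style text and estimates an average interrupt rate. The function names are made up for illustration, and the rate estimate assumes that the IRQ 0 (timer) total equals HZ times uptime with HZ=100, which is the usual value for an i386 2.4 kernel but is not stated in the post.

```python
SAMPLE = """\
CPU0 CPU1
0: 33669019 33490567 IO-APIC-edge timer
10: 32635348 32632811 IO-APIC-level eth0
11: 381721 381874 IO-APIC-level aic7xxx
ERR: 0
"""


def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())              # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:           # at most one counter per CPU
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return table


def mean_rate(table, irq, hz=100):
    """Average interrupts per second for `irq`, using IRQ 0 as the clock.

    Assumes sum(IRQ 0 counts) == hz * uptime; hz=100 is an assumption.
    """
    timer_counts, _ = table["0"]
    counts, _ = table[irq]
    uptime_seconds = sum(timer_counts) / hz
    return sum(counts) / uptime_seconds


if __name__ == "__main__":
    table = parse_interrupts(SAMPLE)
    for irq in ("10", "11"):
        counts, description = table[irq]
        print(irq, description, sum(counts), round(mean_rate(table, irq), 1))
```

With the numbers from this machine, and under the HZ=100 assumption, eth0 averages on the order of 100 interrupts per second over the whole uptime. That is an average only; the "too much work" message is about bursts within a single interrupt, which a long-run counter cannot show.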

i have researched the 'eth0: Too much work in interrupt, status e401.' message
a bit and found that it is possible to raise the threshold at which these
errors are printed. i did not attempt this because it seems less a solution
to the problem than a crutch, i.e. bad things still happen, you just don't
see them.
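
The "crutch" argument can be illustrated with a toy model of a bounded interrupt loop. This is a simplified sketch, not the driver's code: the function name is hypothetical, and the default budget of 32 mirrors what the 3c59x `max_interrupt_work` module parameter is believed to default to, which is an assumption here.

```python
from collections import deque


def service_interrupt(pending, max_work=32):
    """Drain `pending` events until it is empty or the work budget is spent.

    Returns (events_handled, gave_up). gave_up=True is the point where a
    driver with this kind of loop would log "Too much work in interrupt"
    and return, leaving the remaining events for the next interrupt.
    """
    handled = 0
    budget = max_work
    while pending:
        if budget == 0:
            return handled, True
        pending.popleft()        # stand-in for handling one rx/tx event
        handled += 1
        budget -= 1
    return handled, False


if __name__ == "__main__":
    # A burst of 100 events with the assumed default budget: warning fires.
    print(service_interrupt(deque(range(100))))
    # Same burst with a raised budget: no warning, but the same 100 events
    # still had to be handled inside one interrupt.
    print(service_interrupt(deque(range(100)), max_work=128))
```

In this model, raising the budget changes only whether the warning fires; the burst that caused it is the same size either way.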

another reason i refrained from making any adjustments to settings for the
driver is that i have an almost identical system, under a very similar load
and in a very similar role, that exhibits *none* of the problems mentioned.

# awk '/\)/' /proc/pci
Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 2).
IDE interface: Advanced Micro Devices [AMD] AMD-765 [Viper] IDE (rev 1).
Bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ACPI (rev 1).
USB Controller: Advanced Micro Devices [AMD] AMD-765 [Viper] USB (rev 7).
SCSI storage controller: Adaptec 7892A (rev 2).
SCSI storage controller: Adaptec 7892A (#2) (rev 2).
Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
VGA compatible controller: nVidia Corporation Riva TnT2 [NV5] (rev 17).

this second machine runs an identical 2.4.17 kernel to that of the first.

the most significant difference i see here is that the second machine's
chipset is the amd 760mp rather than the 760mpx; the mpx is supposedly
purely an improvement to the south bridge (765 -> 768).
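
The comparison of the two `/proc/pci` listings can also be done mechanically. A rough sketch, with hypothetical function names, that ignores revision numbers and reports which devices appear on only one machine:

```python
import re

MPX = """\
Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
ISA bridge: Advanced Micro Devices [AMD] AMD-768 [??] ISA (rev 4).
IDE interface: Advanced Micro Devices [AMD] AMD-768 [??] IDE (rev 4).
Bridge: Advanced Micro Devices [AMD] AMD-768 [??] ACPI (rev 3).
SCSI storage controller: Adaptec 7892A (rev 2).
PCI bridge: Advanced Micro Devices [AMD] AMD-768 [??] PCI (rev 4).
VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
"""

MP = """\
Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 2).
IDE interface: Advanced Micro Devices [AMD] AMD-765 [Viper] IDE (rev 1).
Bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ACPI (rev 1).
USB Controller: Advanced Micro Devices [AMD] AMD-765 [Viper] USB (rev 7).
SCSI storage controller: Adaptec 7892A (rev 2).
SCSI storage controller: Adaptec 7892A (#2) (rev 2).
Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
VGA compatible controller: nVidia Corporation Riva TnT2 [NV5] (rev 17).
"""


def device_set(listing):
    """Return the set of 'class: device' strings with the revision stripped."""
    devices = set()
    for line in listing.splitlines():
        line = line.strip()
        if line:
            devices.add(re.sub(r"\s*\(rev \d+\)\.?$", "", line))
    return devices


def compare(a, b):
    """Return (only_in_a, only_in_b, common) as sorted lists."""
    sa, sb = device_set(a), device_set(b)
    return sorted(sa - sb), sorted(sb - sa), sorted(sa & sb)


if __name__ == "__main__":
    only_mpx, only_mp, common = compare(MPX, MP)
    print("only on 760mpx box:", *only_mpx, sep="\n  ")
    print("only on 760mp box:", *only_mp, sep="\n  ")
```

Besides the south bridge functions, this also flags the different video cards and the second 7892A channel on the 760mp machine. It can only show differences that are visible on the PCI bus.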

so before i go tearing machines apart in hopes of debugging which piece of
hardware is the cause of this less than optimal behavior, would anyone care
to wager what the cause is?

-kelley


2002-02-13 16:00:11

by kelley eicher

Subject: Re: scsi abort 0x2002 and eth0: too much work on a dual amd 760mpx system

lo and behold, i found that this was all caused by improper scsi termination,
which i'm sort of surprised at because the controller didn't bitch about it
at boot while initializing. <shrug>

-kelley


On Mon, Feb 11, 2002 at 12:02:27PM -0600, kelley eicher wrote:
> i'm having some problems with the 2.4.17 linux kernel on a dual athlon
> system that i was hoping someone could shed some light on. there seem
> to be multiple problems so i'm not quite sure which path to follow at this
> point.
>
> the scenario is that i have a system with the following hardware:
>
> # awk '/\(/' /proc/pci
> Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
> PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
> ISA bridge: Advanced Micro Devices [AMD] AMD-768 [??] ISA (rev 4).
> IDE interface: Advanced Micro Devices [AMD] AMD-768 [??] IDE (rev 4).
> Bridge: Advanced Micro Devices [AMD] AMD-768 [??] ACPI (rev 3).
> SCSI storage controller: Adaptec 7892A (rev 2).
> PCI bridge: Advanced Micro Devices [AMD] AMD-768 [??] PCI (rev 4).
> VGA compatible controller: Matrox Graphics, Inc. MGA G400 AGP (rev 133).
> Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
>
> while this machine is at work i see aic7xxx_abort errors in dmesg from time
> to time during heavy i/o. interestingly, this does not crash the machine or
> the devices in question but recovers with an aic7xxx_dev_reset instruction
> after 1-2 minutes of abort attempts.
>
> during the time between the apparent scsi failures i see a few error
> messages in the form of 'eth0: Too much work in interrupt, status e401.'
>
> looking at /proc/interrupts i see that indeed, the eth0 device is hard at
> work.
>
> # cat /proc/interrupts
>            CPU0       CPU1
>   0:   33669019   33490567    IO-APIC-edge  timer
>   1:      19117      19797    IO-APIC-edge  keyboard
>   2:          0          0          XT-PIC  cascade
>  10:   32635348   32632811   IO-APIC-level  eth0
>  11:     381721     381874   IO-APIC-level  aic7xxx
>  14:     236054     249997    IO-APIC-edge  ide0
>  15:          0          6    IO-APIC-edge  ide1
> NMI:          0          0
> LOC:   67153036   67153815
> ERR:          0
> MIS:         32
>
> i have researched the 'eth0: Too much work in interrupt, status e401.' a bit
> and found that it is possible to increase the threshold for which these
> errors will be printed. i did not attempt this because it does not seem that
> it should be a solution to this problem but more of a crutch. i.e. bad things
> still happen, you just don't see them.
>
> another reason i refrained from making any adjustments to settings for the
> driver is that i have an almost identical system in a very similar load
> and role that exhibits *none* of the problems mentioned.
>
> # awk '/\)/' /proc/pci
> Host bridge: PCI device 1022:700c (Advanced Micro Devices [AMD]) (rev 17).
> PCI bridge: PCI device 1022:700d (Advanced Micro Devices [AMD]) (rev 0).
> ISA bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ISA (rev 2).
> IDE interface: Advanced Micro Devices [AMD] AMD-765 [Viper] IDE (rev 1).
> Bridge: Advanced Micro Devices [AMD] AMD-765 [Viper] ACPI (rev 1).
> USB Controller: Advanced Micro Devices [AMD] AMD-765 [Viper] USB (rev 7).
> SCSI storage controller: Adaptec 7892A (rev 2).
> SCSI storage controller: Adaptec 7892A (#2) (rev 2).
> Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 116).
> VGA compatible controller: nVidia Corporation Riva TnT2 [NV5] (rev 17).
>
> this second machine runs an identical 2.4.17 kernel to that of the first.
>
> the most significant difference i see here is that the chipset is the amd
> 760mp rather than 760mpx which is purely a supposed improvement to the
> south bridge 765->768.
>
> so before i go tearing machines apart in hopes of debugging which piece of
> hardware is the cause of this less than optimal behavior, would anyone care
> to wager what the cause is?
>
> -kelley
>


2002-02-13 20:27:08

by kelley eicher

Subject: Re: scsi abort 0x2002 and eth0: too much work on a dual amd 760mpx system

rik-

i have done extensive cpu load + i/o testing on the 760mp machine. it behaves
perfectly under very high cpu activity. one thing i should mention though is
that neither of these chipsets, amd 760mp and amd 760mpx, works with
multi-processor specification 1.4 under linux. i had several problems using
m.p.s. 1.4 on the 760mp in dual processor mode and the 760mpx wouldn't even
boot with m.p.s. 1.4 enabled.

as an fyi to anyone listening, the 760mpx crashed during apic timer
calibration while booting any smp linux kernel.

so my suggestion, rik, if you haven't done this already, is to change the
multi-processor specification in your bios from 1.4 to 1.1.

-kelley


On Wed, Feb 13, 2002 at 01:31:35PM -0500, Rik Faith wrote:
> I'm seeing possibly related problems with a 760MP board (Tyan Tiger MP)
> and 3ware adapters. I can reproduce the problem if I run two
> instances of burnK7 (from cpuburn: http://users.ev1.net/~redelm/) and
> then do an I/O operation. Due to what I think is a bug in the 3ware
> code, however, my machine locks hard during the SCSI reset. Could you
> try burnK7 (the idea is to load the CPUs -- not to heat them up) and see
> if your problem gets worse?
>
> I was hoping to "fix" my problem by getting an AMD-760MPX MB from Asus,
> but I'm no longer sure that's a good idea. I'm currently running Linux
> with maxcpus=1 on the boot prompt -- the system seems very stable when
> only one CPU is being used -- have you tried that?
>
> The general problem may be that an interrupt gets lost, but I'd like to
> get more evidence that leads to that conclusion. Thanks, Rik.
>
>
> BTW, I use:
> while :; do date; strace fdisk -l /dev/sda; uptime; echo; sleep 5; done
>
> and then look for a stall at "read(4," when I start the second burnK7.
> If I kill both burnK7's quick enough, the 3ware card will recover --
> otherwise, the machine locks.
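
The polling loop quoted above looks for a read that stalls. A rough Python equivalent of the same idea is sketched below. The function names, the 2-second threshold and the `/dev/sda` default are illustrative assumptions; reading the raw device needs root, and a read served from cache would not exercise the controller at all.

```python
import time


def find_stalls(operation, iterations, threshold, pause=0.0,
                clock=time.monotonic):
    """Run `operation` repeatedly and return [(iteration, seconds)] for
    every run that took longer than `threshold` seconds."""
    stalls = []
    for i in range(iterations):
        start = clock()
        operation()
        elapsed = clock() - start
        if elapsed > threshold:
            stalls.append((i, elapsed))
        if pause:
            time.sleep(pause)    # mirrors the "sleep 5" in the shell loop
    return stalls


def read_first_sector(path="/dev/sda"):
    """Read one 512-byte sector from the start of a block device."""
    with open(path, "rb", buffering=0) as dev:
        dev.read(512)


if __name__ == "__main__":
    # Start the cpu load in another terminal, then watch for slow reads.
    for iteration, seconds in find_stalls(read_first_sector, 20, 2.0, pause=5):
        print("iteration", iteration, "stalled for", round(seconds, 1), "s")
```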

--

>> kelley j eicher
<< UNIX architect
>> Univ. of MN Astronomy Dept.
<< ph: (612) 626-2067 or (612) 624-3589
>> fx: (612) 626-2029
<< office: 385 physics
>> carde at astro dot umn dot edu


2002-02-18 17:15:39

by Michael Kwasigroch

Subject: Re: scsi abort 0x2002 and eth0: too much work on a dual amd 760mpx system

At 2002-02-13 20:26:46 kelley eicher wrote:

> rik-
>
> i have done extensive cpu load + i/o testing on the 760mp machine. it handles
> perfectly under very high cpu activity. one thing i should mention though is
> that neither of these chipsets, amd 760mp and amd760mpx, work with
> multi-processor specification 1.4 under linux. i had several problems using
> m.p.s. 1.4 on the 760mp in dual processor mode and the 760mpx wouldn't even
> boot with m.p.s. 1.4 enabled.
>
> as an fyi to anyone listening, the 760mpx crashed while loading any smp linux
> kernel during apic timer calibration.
>
> so my suggestion rik, if you haven't done this already, is to change the
> multi-processor specification in your bios from 1.4 to 1.1.
>
> -kelley

I've got the Tyan Tiger MPX (S2466N) running SMP flawlessly with both
2.2.19 (io_apic.c patched) and 2.4.17 (w/ide-patch) ... and Windows 2000
Pro SP2 (;-).

In the BIOS I've left the setting to ACPI (which is the default).

- Why should one want to change that to MPS 1.1/1.4?
- What 760mpx board do you use?

I'm using an old Adaptec 2940 PCI SCSI adaptor for my DAT streamer and it
also works flawlessly (although I feel there is a little bit of improvement
possible by tweaking the PCI latency).

You might want to check http://www.2cpu.com for tips & tricks getting a
dual athlon system running. I'm not connected to this site but it gave me
all the help I needed while choosing the components for my nice new system.
It is #1 for duallies!!!


P.S.: Please cc me directly on any replies since I'm not subscribed to
linux-kernel. TIA.


Best regards


"The sooner you fall behind, the more time you'll have to catch up."

Michael Kwasigroch
FaxPlus/Open Development
________________________________________


INTERCOPE
International Communication Products Engineering GmbH

http://www.intercope.com