2004-01-16 22:04:54

by Stephen Smoogen

Subject: AIC7xxx kernel problem with 2.4.2[234] kernels

I hope this is in the correct mode and I am sending it to the correct
list :). Please let me know what other information I need to supply.


[1.] One line summary of the problem:

Booting problems with aic7xxx with stock kernel 2.4.24.

[2.] Full description of the problem/report:

I think I am seeing the same problem that was reported to the list from
Moal Tanguy on 2003-09-08. Our OEM systems use an older SuperMicro
motherboard with a built-in aic7xxx (Adaptec aic7892 Ultra160) SCSI
adapter. Most of the systems have a forward-mounted 'removable' 18
gigabyte SCSI disk drive (different manufacturers). When running any Red
Hat kernel we have no problems booting the system. When using 2.4.22,
2.4.23, or 2.4.24 kernels, the system loads in the aic7xxx module from
the initrd and probes the interfaces. In systems with multiple disks, it
reaches the first drive (the removable disk) and errors out with a slow
progression of


Unexpected busfree while idle
SEQ 0x01

After 10 or so of these, it will then go merrily onto the next disk
without reporting any errors, and then 'crashes' because it was unable
to find the root directory to continue.

Interestingly, after this has happened, using the hardware reset leaves
the system unable to find the SCSI ID0 disk to boot from. A complete
power cycle is needed for the SCSI controller to find the disk again.

I patched the 2.4.24 kernel with the latest driver from Justin Gibbs'
website, and the problem occurs in the same form. From what I could
google, I suspect it is hardware related in some way, but would love to
know what.

[3.] Keywords (i.e., modules, networking, kernel):

kernel | drivers | scsi | aic7xxx

[4.] Kernel version (from /proc/version):

Linux version 2.4.24 ([email protected]) (gcc version 2.96
20000731 (Red Hat Linux 7.3 2.96-113)) #1 Mon Jan 12 14:44:21 MST 2004

CONFIG_SCSI_AIC7XXX=m
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=15000
# CONFIG_AIC7XXX_PROBE_EISA_VL is not set
# CONFIG_AIC7XXX_BUILD_FIRMWARE is not set
# CONFIG_AIC7XXX_DEBUG_ENABLE is not set
CONFIG_AIC7XXX_DEBUG_MASK=0
# CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set


[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)

None

[6.] A small shell script or example program which triggers the
problem (if possible)

Not applicable

[7.] Environment
[7.1.] Software (add the output of the ver_linux script here)

This is from the machine that was used to build the kernel.

# sh ./scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux rh73dev.ds.lanl.gov 2.4.20-19.7 #1 Tue Jul 15 13:44:14 EDT 2003
i686 unknown

Gnu C 2.96
Gnu make 3.79.1
util-linux 2.11n
mount 2.11n
modutils 2.4.18
e2fsprogs 1.27
quota-tools 3.06.
Linux C Library 2.2.5
Dynamic linker (ldd) 2.2.5
Procps 2.0.7
Net-tools 1.60
Console-tools 0.3.3
Sh-utils 2.0.11
Modules Loaded autofs nfs lockd sunrpc eepro100 mii ipv6
usb-ohci usbcore ext3 jbd aic7xxx sd_mod scsi_mod


[7.2.] Processor information (from /proc/cpuinfo):


# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 8
model name : Pentium III (Coppermine)
stepping : 10
cpu MHz : 866.277
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 mmx fxsr sse
bogomips : 1730.15

[7.3.] Module information (from /proc/modules):

[All info below is from running Red Hat kernel 2.4.20-28.7 as it works.]

# cat /proc/modules
ide-cd 32256 0 (autoclean)
cdrom 32128 0 (autoclean) [ide-cd]
loop 10736 0 (unused)
eepro100 21068 1 (autoclean)
mii 3976 0 (autoclean) [eepro100]
usb-ohci 20544 0 (unused)
usbcore 73792 1 [usb-ohci]
aic7xxx 133248 6
sd_mod 12828 12
scsi_mod 107548 2 [aic7xxx sd_mod]


[7.4.] Loaded driver and hardware information (/proc/ioports,
/proc/iomem)

[All info below is from running Red Hat kernel 2.4.20-28.7 as it works.]

# cat /proc/ioports
0000-001f : dma1
0020-003f : pic1
0040-005f : timer
0060-006f : keyboard
0070-007f : rtc
0080-008f : dma page reg
00a0-00bf : pic2
00c0-00df : dma2
00f0-00ff : fpu
01f0-01f7 : ide0
02f8-02ff : serial(auto)
03c0-03df : vga+
03f6-03f6 : ide0
03f8-03ff : serial(auto)
0cf8-0cff : PCI conf1
d800-d83f : Intel Corp. 82557/8/9 [Ethernet Pro 100]
d800-d83f : eepro100
e400-e4ff : Adaptec AIC-7892P U160/m
e800-e8ff : ATI Technologies Inc Rage XL
ffa0-ffaf : ServerWorks OSB4 IDE Controller
ffa0-ffa7 : ide0
ffa8-ffaf : ide1

# cat /proc/iomem
00000000-0009fbff : System RAM
0009fc00-0009ffff : reserved
000a0000-000bffff : Video RAM area
000c0000-000c7fff : Video ROM
000c8000-000cc7ff : Extension ROM
000cc800-000cd7ff : Extension ROM
000f0000-000fffff : System ROM
00100000-3ffeffff : System RAM
00100000-00225f42 : Kernel code
00225f43-0031caff : Kernel data
3fff0000-3fffefff : ACPI Tables
3ffff000-3fffffff : ACPI Non-volatile Storage
fc900000-fc9fffff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
fcafe000-fcafefff : Intel Corp. 82557/8/9 [Ethernet Pro 100]
fcafe000-fcafefff : eepro100
fcaff000-fcafffff : ServerWorks OSB4/CSB5 OHCI USB Controller
fcaff000-fcafffff : usb-ohci
fd000000-fdffffff : ATI Technologies Inc Rage XL
febfe000-febfefff : Adaptec AIC-7892P U160/m
febfe000-febfefff : aic7xxx
febff000-febfffff : ATI Technologies Inc Rage XL
fec00000-fec01fff : reserved
fee00000-fee00fff : reserved
fff80000-ffffffff : reserved

[7.5.] PCI information ('lspci -vvv' as root)

# lspci -vvv
00:00.0 Host bridge: ServerWorks CNB20LE (rev 06)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort+ >SERR- <PERR-
Latency: 32, cache line size 08

00:00.1 Host bridge: ServerWorks CNB20LE (rev 06)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 16, cache line size 08

00:06.0 Ethernet controller: Intel Corporation 82557 [Ethernet Pro 100]
(rev 08)
Subsystem: Intel Corporation 82559 Fast Ethernet LAN on
Motherboard
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (2000ns min, 14000ns max), cache line size 08
Interrupt: pin A routed to IRQ 9
Region 0: Memory at fcafe000 (32-bit, non-prefetchable)
[size=4K]
Region 1: I/O ports at d800 [size=64]
Region 2: Memory at fc900000 (32-bit, non-prefetchable)
[size=1M]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-

00:0f.0 ISA bridge: ServerWorks OSB4 (rev 50)
Subsystem: ServerWorks OSB4
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 0

00:0f.1 IDE interface: ServerWorks: Unknown device 0211 (prog-if 8a
[Master SecP PriP])
Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64
Region 4: I/O ports at ffa0 [size=16]

00:0f.2 USB Controller: ServerWorks: Unknown device 0220 (rev 04)
(prog-if 10 [OHCI])
Subsystem: ServerWorks: Unknown device 0220
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap- 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (20000ns max), cache line size 08
Interrupt: pin A routed to IRQ 10
Region 0: Memory at fcaff000 (32-bit, non-prefetchable)
[size=4K]

01:01.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
(prog-if 00 [VGA])
Subsystem: ATI Technologies Inc: Unknown device 0008
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping+ SERR- FastB2B-
Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (2000ns min), cache line size 08
Interrupt: pin A routed to IRQ 11
Region 0: Memory at fd000000 (32-bit, non-prefetchable)
[size=16M]
Region 1: I/O ports at e800 [size=256]
Region 2: Memory at febff000 (32-bit, non-prefetchable)
[size=4K]
Expansion ROM at febc0000 [disabled] [size=128K]
Capabilities: [5c] Power Management version 2
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-

01:03.0 SCSI storage controller: Adaptec 7892P (rev 02)
Subsystem: Super Micro Computer Inc: Unknown device 9005
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (10000ns min, 6250ns max), cache line size 08
Interrupt: pin A routed to IRQ 10
BIST result: 00
Region 0: I/O ports at e400 [disabled] [size=256]
Region 1: Memory at febfe000 (64-bit, non-prefetchable)
[size=4K]
Expansion ROM at feba0000 [disabled] [size=128K]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-


[7.6.] SCSI information (from /proc/scsi/scsi)

# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: DDYS-T18350N Rev: S80D
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: IBM Model: DDYS-T18350N Rev: S9YB
Type: Direct-Access ANSI SCSI revision: 03


[7.7.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):
[X.] Other notes, patches, fixes, workarounds:


--
Stephen John Smoogen [email protected]
Los Alamos National Lab CCN-5 Sched 5/40 PH: 4-0645
Ta-03 SM-1498 MailStop B255 DP 10S Los Alamos, NM 87545
-- So shines a good deed in a weary world. = Willy Wonka --


2004-01-16 22:33:43

by Justin T. Gibbs

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

> Booting problems with aic7xxx with stock kernel 2.4.24.

...

> Unexpected busfree while idle
> SEQ 0x01

A problem with similar symptoms was corrected in driver version 6.2.37
back in August of last year. Can you try using the latest driver source
from here:

http://people.FreeBSD.org/~gibbs/linux/SRC/

and see if your problem persists? The aic79xx driver archive at the
above location includes both the aic7xxx and aic79xx drivers. If this
does not resolve your problem there are other debugging options we can
enable that may aid in tracking down the problem.

--
Justin

2004-01-16 22:59:16

by Stephen Smoogen

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Fri, 2004-01-16 at 15:39, Justin T. Gibbs wrote:
> > Booting problems with aic7xxx with stock kernel 2.4.24.
>
> ...
>
> > Unexpected busfree while idle
> > SEQ 0x01
>
> A problem with similar symptoms was corrected in driver version 6.2.37
> back in August of last year. Can you try using the latest driver source
> from here:
>
> http://people.FreeBSD.org/~gibbs/linux/SRC/
>
> and see if your problem persists? The aic79xx driver archive at the
> above location includes both the aic7xxx and aic79xx drivers. If this
> does not resolve your problem there are other debugging options we can
> enable that may aid in tracking down the problem.

Hi, I did that already; sorry for not being clearer about it in the bug
report. For some of my systems I had patched my kernel to have the
latest source code from your site for our aic79xx machines. I ran that
kernel on these other systems and it locked up in a similar state.

I am ready for the additional debugging options :). Thanks for your
quick response.



--
Stephen John Smoogen [email protected]
Los Alamos National Lab CCN-5 Sched 5/40 PH: 4-0645
Ta-03 SM-1498 MailStop B255 DP 10S Los Alamos, NM 87545
-- So shines a good deed in a weary world. = Willy Wonka --

2004-01-16 23:23:52

by Marcelo Tosatti

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels



On Fri, 16 Jan 2004, Justin T. Gibbs wrote:

> > Booting problems with aic7xxx with stock kernel 2.4.24.
>
> ...
>
> > Unexpected busfree while idle
> > SEQ 0x01
>
> A problem with similar symptoms was corrected in driver version 6.2.37
> back in August of last year. Can you try using the latest driver source
> from here:
>
> http://people.FreeBSD.org/~gibbs/linux/SRC/
>
> and see if your problem persists? The aic79xx driver archive at the
> above location includes both the aic7xxx and aic79xx drivers. If this
> does not resolve your problem there are other debugging options we can
> enable that may aid in tracking down the problem.

Hi Justin,

It might be interesting to merge these fixes in mainline?

2004-01-18 01:14:49

by Marcelo Tosatti

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels



On Fri, 16 Jan 2004, Justin T. Gibbs wrote:

> > Booting problems with aic7xxx with stock kernel 2.4.24.
>
> ...
>
> > Unexpected busfree while idle
> > SEQ 0x01
>
> A problem with similar symptoms was corrected in driver version 6.2.37
> back in August of last year. Can you try using the latest driver source
> from here:
>
> http://people.FreeBSD.org/~gibbs/linux/SRC/
>
> and see if your problem persists? The aic79xx driver archive at the
> above location includes both the aic7xxx and aic79xx drivers. If this
> does not resolve your problem there are other debugging options we can
> enable that may aid in tracking down the problem.

Hi Justin,

Stephen informed me privately that aic7xxx_old works for him.

About the aic7xxx update, well, I believe aic7xxx 6.2.36 is pretty stable
(I don't remember seeing any reliable bug report and I also can't find one
in lkml archives) except this one (and a pair of "lockup on initialization
with SMP").

What bugs are you aware of in 2.4's aic7xxx ?


2004-01-19 13:35:51

by Xose Vazquez Perez

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

Marcelo Tosatti wrote:

> About the aic7xxx update, well, I believe aic7xxx 6.2.36 is pretty stable
> (I don't remember seeing any reliable bug report and I also can't find one
> in lkml archives) except this one (and a pair of "lockup on initialization
> with SMP").

Justin already put updates in BK, but James did not like the "new error recovery"
code. So the kernel driver is *SIX months* behind the ADAPTEC driver release.

This linux-scsi thread has more info on why the patch was not applied:
http://marc.theaimsgroup.com/?l=linux-scsi&m=107228516327580&w=2

It looks like the _kernel_ driver is going to be without a maintainer
unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.


> What bugs are you aware of in 2.4's aic7xxx ?

aic7xxx/aic79xx CHANGELOG has info about all bugs fixed:

o Adaptec Aic7xxx

Version History:

6.3.4 (December 22nd, 2003)
- Provide a better description string for the 2915/30LP.
- Sniff sense information returned by targets for unit
attention errors that may indicate that the device has
been changed. If we see such status for non Domain
Validation related commands, start a DV scan for the
target. In the past, DV would only occur for hot-plugged
devices if no target had been previously probed for a
particular ID. This change guarantees that the DV process
will occur even if the user swaps devices without any
intervening I/O to tell us that a device has gone missing.
The old behavior, among other things, would fail to spin up
drives that were hot-plugged since the Linux mid-layer
will only spin-up drives on initial attach.

6.3.3 (November 6th, 2003)
- Support the 2.6.0-test9 kernel
- Fix rare deadlock caused by using del_timer_sync from within
a timer handler.

6.3.2 (October 28th, 2003)
- Enforce a bus settle delay for bus resets that the
driver initiates.
- Fall back to basic DV for U160 devices that lack an
echo buffer.
- Correctly detect that left over BIOS data has not
been initialized when the CHPRST status bit is set
during driver initialization.

6.3.1 (October 21st, 2003)
- Fix a compiler error when building with only EISA or PCI
support compiled into the kernel.
- Add chained dependencies to both the driver and aicasm Makefiles
to avoid problems with parallel builds.
- Move additional common routines to the aiclib OSM library
to reduce code duplication.
- Fix a bug in the testing of the AHC_TMODE_WIDEODD_BUG that
could cause target mode operations to hang.
- Leave removal of softcs from the global list of softcs to
the OSM. This allows us to avoid holding the list_lock during
device destruction.

6.3.0 (September 8th, 2003)
- Move additional common routines to the aiclib OSM library
to reduce code duplication.
- Bump minor number to reflect change in error recovery strategy.

6.2.38 (August 31st, 2003)
- Avoid an inadvertent reset of the controller during the
memory mapped I/O test should the controller be left in
the reset state prior to driver initialization. On some
systems, this extra reset resulted in a system hang due
to a chip access that occurred too soon after reset.
- Move additional common routines to the aiclib OSM library
to reduce code duplication.
- Add magic sysrq handler that causes a card dump to be output
to the console for each controller.

6.2.37 (August 12th, 2003)
- Perform timeout recovery within the driver instead of relying
on the Linux SCSI mid-layer to perform this function. The
mid-layer does not know the full state of the SCSI bus and
is therefore prone to looping for several minutes to effect
recovery. The new scheme recovers within 15 seconds of the
failure.
- Support writing 93c56/66 SEEPROM on newer cards.
- Avoid clearing ENBUSFREE during single stepping to avoid
spurious "unexpected busfree while idle" messages.
- Enable the use of the "Auto-Access-Pause" feature on the
aic7880 and aic7870 chips. It was disabled due to an
oversight. Using this feature drastically reduces command
delivery latency.

6.2.36 **KERNEL DRIVER**


o Adaptec Aic79xx

Version History:

2.0.5 (December 22nd, 2003)
- Correct a bug preventing the driver from renegotiating
during auto-request operations when a check condition
occurred for a zero length command.
- Sniff sense information returned by targets for unit
attention errors that may indicate that the device has
been changed. If we see such status for non Domain
Validation related commands, start a DV scan for the
target. In the past, DV would only occur for hot-plugged
devices if no target had been previously probed for a
particular ID. This change guarantees that the DV process
will occur even if the user swaps devices without any
intervening I/O to tell us that a device has gone missing.
The old behavior, among other things, would fail to spin up
drives that were hot-plugged since the Linux mid-layer
will only spin-up drives on initial attach.
- Correct several issues in the rundown of the good status
FIFO during error recovery. The typical failure scenario
evidenced by this defect was the loss of several commands
under high load when several queue full conditions occurred
back to back.

2.0.4 (November 6th, 2003)
- Support the 2.6.0-test9 kernel
- Fix rare deadlock caused by using del_timer_sync from within
a timer handler.

2.0.3 (October 21st, 2003)
- On 7902A4 hardware, use the slow slew rate for transfer
rates slower than U320. This behavior matches the Windows
driver.
- Fix some issues with the ahd_flush_qoutfifo() routine.
- Add a delay in the loop waiting for selection activity
to cease. Otherwise we may exhaust the loop counter too
quickly on fast machines.
- Return to processing bad status completions through the
qoutfifo. This reduces the amount of time the controller
is paused for these kinds of errors.
- Move additional common routines to the aiclib OSM library
to reduce code duplication.
- Leave removal of softcs from the global list of softcs to
the OSM. This allows us to avoid holding the list_lock during
device destruction.
- Enforce a bus settle delay for bus resets that the
driver initiates.
- Fall back to basic DV for U160 devices that lack an
echo buffer.

2.0.2 (September 4th, 2003)
- Move additional common routines to the aiclib OSM library
to reduce code duplication.
- Avoid an inadvertent reset of the controller during the
memory mapped I/O test should the controller be left in
the reset state prior to driver initialization. On some
systems, this extra reset resulted in a system hang due
to a chip access that occurred too soon after reset.
- Correct an endian bug in ahd_swap_with_next_hscb. This
corrects strong-arm support.
- Reset the bus for transactions that timeout waiting for
the bus to go free after a disconnect or command complete
message.

2.0.1 (August 26th, 2003)
- Add magic sysrq handler that causes a card dump to be output
to the console for each controller.
- Avoid waking the mid-layer's error recovery handler during
timeout recovery by returning DID_ERROR instead of DID_TIMEOUT
for timed-out commands that have been aborted.
- Move additional common routines to the aiclib OSM library
to reduce code duplication.

2.0.0 (August 20th, 2003)
- Remove MMAPIO definition and allow memory mapped
I/O for any platform that supports PCI.
- Avoid clearing ENBUSFREE during single stepping to avoid
spurious "unexpected busfree while idle" messages.
- Correct deadlock in ahd_run_qoutfifo() processing.
- Optimize support for the 7901B.
- Correct a few cases where an explicit flush of pending
register writes was required to ensure accuracy in delays.
- Correct problems in manually flushing completed commands
on the controller. The FIFOs are now flushed to ensure
that completed commands that are still draining to the
host are completed correctly.
- Correct incomplete CDB delivery detection on the 790XB.
- Ignore the cmd->underflow field since userland applications
using the legacy command pass-thru interface do not set
it correctly. Honoring this field led to spurious errors
when users used the "scsi_unique_id" program.
- Perform timeout recovery within the driver instead of relying
on the Linux SCSI mid-layer to perform this function. The
mid-layer does not know the full state of the SCSI bus and
is therefore prone to looping for several minutes to effect
recovery. The new scheme recovers within 15 seconds of the
failure.
- Correct support for manual termination settings.
- Increase maximum wait time for serial eeprom writes allowing
writes to function correctly.

1.3.12 (August 11, 2003)
- Implement new error recovery thread that supersedes the existing
Linux SCSI error recovery code.
- Fix termination logic for 29320ALP.
- Fix SEEPROM delay to compensate for write ops taking longer.

1.3.11 (July 11, 2003)
- Fix several deadlock issues.
- Add 29320ALP and 39320B IDs.

1.3.10 **KERNEL DRIVER**



2004-01-19 17:22:11

by James Bottomley

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
> It looks like the _kernel_ driver is going to be without a maintainer
> unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.

As I told you in private email, this is *not* the way I see it. At the
moment, Adaptec is the maintainer of that driver unless they choose
formally to relinquish it.

There is a glimmering of a resolution of the problem in an early
notification API for command timeouts.

Although throwing away successful completions when error recovery is in
progress isn't a bug (scsi commands are either idempotent or non
retryable), it's certainly not ideal. I'm thinking about a better
framework where we would quiesce the device but pull back from
activating the eh thread if all commands return. This would also fix
the tag starvation issue that many drivers tackle independently too.

James


2004-01-19 18:32:41

by Justin T. Gibbs

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

> On Mon, 2004-01-19 at 08:32, Xose Vazquez Perez wrote:
>> It looks like the _kernel_ driver is going to be without a maintainer
>> unless somebody works on it, porting ADAPTEC fixes/features to the kernel driver.
>
> As I told you in private email, this is *not* the way I see it. At the
> moment, Adaptec is the maintainer of that driver unless they choose
> formally to relinquish it.

Can you provide your definition of "maintainer"? I know that I am maintainer
of the drivers distributed from my website, but I don't feel I have ever
been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

> There is a glimmering of a resolution of the problem in an early
> notification API for command timeouts.

I'm open to ideas, but from this one line summary, this sounds like a
workaround and not a real solution. Can you say more about your proposal?

In my mind, an easy resolution would be to:

1) Let me fix the SCSI layer so that the error recovery handler override
already there will actually work - cleanly.

2) Let my drivers use that mechanism.

While working on 1, I would appreciate being able to "maintain" these
drivers with their current error recovery workaround in place.

> Although throwing away successful completions when error recovery is in
> progress isn't a bug (scsi commands are either idempotent or non
> retryable), it's certainly not ideal.

Most SCSI commands are only idempotent if replayed in the same order
as originally issued (consider FSes that rely on write ordering to
keep their meta-data coherent). Some commands are retriable but only if
they have actually failed. The mid-layer has no concept currently of these
issues, yet it acts on behalf of the peripheral drivers that can better
understand how the device they control behaves and act accordingly.
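
To make the ordering point concrete, here is a toy model (not kernel code; `replay` and the block contents are invented for illustration). Each write is idempotent when replayed in its original position, yet replaying the same set of writes in a different order leaves different data on the "disk":

```python
def replay(writes):
    """Apply (block, data) writes in order and return the final disk image."""
    disk = {}
    for block, data in writes:
        disk[block] = data
    return disk


# Two writes to the same metadata block, issued in this order.
original = [(7, "old inode"), (7, "new inode")]

in_order = replay(original)                    # replayed as issued
reordered = replay(list(reversed(original)))   # replayed out of order

print(in_order[7])
print(reordered[7])
```

Replaying in the issued order ends with the newer contents in block 7, while the reversed replay ends with the stale contents, which is the kind of incoherence a filesystem relying on write ordering would see.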

Bugs are defects that render non-ideal behavior. The only question is
what types of non-ideal behaviors you are willing to tolerate.

> I'm thinking about a better
> framework where we would quiesce the device but pull back from
> activating the eh thread if all commands return. This would also fix
> the tag starvation issue that many drivers tackle independently too.

That wouldn't help things. For example, let's say that there is one command
active on the bus holding up the completion of 32 others. "Waiting for a bit"
will never release the other 32 commands. You must abort the bus hog. Once
you abort the problem command, you get flooded with the completions of the
32 others. The bus is recovered. You can now safely go about your business.
An HBA watchdog handler can properly deal with this situation since it has
state that the mid-layer does not.
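
The scenario can be sketched as a small simulation. This is only a toy model under the assumptions stated in the paragraph above (one stuck command blocks every other completion); `completed_after` and its parameters are hypothetical names, not anything from the driver or the mid-layer:

```python
def completed_after(outstanding, hog, abort_hog, ticks=100):
    """Return the commands that complete within `ticks` time steps.

    While `hog` is still pending, nothing else can complete. If
    `abort_hog` is true, a watchdog aborts the hog on the first tick.
    """
    pending = set(outstanding)
    completed = set()
    for _ in range(ticks):
        if hog in pending:
            if not abort_hog:
                continue              # bus is stuck; waiting changes nothing
            pending.discard(hog)      # abort the problem command
        completed |= pending          # everything behind it now completes
        pending.clear()
    return completed


commands = set(range(33))  # command 0 is the hog, 1..32 wait behind it
waited = completed_after(commands, hog=0, abort_hog=False)
recovered = completed_after(commands, hog=0, abort_hog=True)

print(len(waited), len(recovered))
```

In this model, waiting alone completes nothing no matter how many ticks pass, while aborting the hog lets the other 32 commands complete at once.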

As for tag starvation, just inserting a periodic ordered tag on devices
that show signs of starvation is a much better approach than shutting
down the flow of commands to the whole controller at the first sign of
trouble. Luckily, most vendors stopped making drives with tag starvation
issues in the mid-90's. For this reason, the tag starvation code in
my drivers is off by default, but can be enabled via a module or kernel
command line option.
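
As a rough sketch of the "periodic ordered tag" idea (a toy model, not the driver's implementation; `Device`, `next_tag`, and the interval value are all assumptions made for illustration), the policy only changes the tag type for devices already flagged as starving and leaves every other device alone:

```python
from dataclasses import dataclass

SIMPLE, ORDERED = "SIMPLE", "ORDERED"


@dataclass
class Device:
    starving: bool = False   # set when a command has waited suspiciously long
    since_ordered: int = 0   # commands sent since the last ordered tag


def next_tag(dev, interval=8):
    """Pick the tag type for the next command sent to `dev`."""
    if not dev.starving:
        return SIMPLE                 # healthy devices are never throttled
    dev.since_ordered += 1
    if dev.since_ordered >= interval:
        dev.since_ordered = 0
        return ORDERED                # forces earlier commands to finish first
    return SIMPLE


healthy = Device()
sick = Device(starving=True)
healthy_tags = [next_tag(healthy, interval=4) for _ in range(8)]
sick_tags = [next_tag(sick, interval=4) for _ in range(8)]
print(sick_tags)
```

With an interval of 4, every fourth command to the starving device carries an ordered tag, so a neglected command can be overtaken by only a bounded number of later ones, and commands to other devices keep flowing.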

--
Justin

2004-01-20 00:54:15

by James Bottomley

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote:
> Can you provide your definition of "maintainer"? I know that I am maintainer
> of the drivers distributed from my website, but I don't feel I have ever
> been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.

A maintainer is a person who works with the kernel community to keep the
driver (or subsystem, filesystem or whatever) up to date. Such a person
may or possibly may not have an entry in the MAINTAINERS file.

If you want to maintain a reference driver and have someone else do the
legwork with the community, that's fine by me...do you have someone in
mind, or should I find this person?

> I'm open to ideas, but from this one line summary, this sounds like a
> workaround and not a real solution. Can you say more about your proposal?

It actually wasn't mine, it was Alan Cox's. There was a thread about it
(on which you were cc'd), but I'm currently on a 'plane to NY for
LinuxWorld and don't have it handy.

> In my mind, an easy resolution would be to:
>
> 1) Let me fix the SCSI layer so that the error recovery handler override
> already there will actually work - cleanly.
>
> 2) Let my drivers use that mechanism.
>
> While working on 1, I would appreciate being able to "maintain" these
> drivers with their current error recovery workaround in place.

Well, I'm thinking of having a class based error recovery scheme, built
upon an extension of the transport class patch that has been floating
around this list. However, my problem is that the aic7xxx/79xx chips
are basically SPI, and therefore, even under the new scheme, should be
using the SPI recovery class. Therefore, just providing an override
mechanism for all drivers to use isn't what I want. What I want is a
robust SPI recovery mechanism usable by all.

This is what "working with the kernel community" means. If there's a
bug, I don't want it fixed by driver work arounds, I want it fixed in
the core code. Having driver writers ignore the APIs and roll their own
will simply create problems.

> Most SCSI commands are only idempotent if replayed in the same order
> as originally issued (consider FSes that rely on write ordering to
> keep their meta-data coherent). Some commands are retriable but only if
> they have actually failed. The mid-layer has no concept currently of these
> issues, yet it acts on behalf of the peripheral drivers that can better
> understand how the device they control behaves and act accordingly.
>
> Bugs are defects that render non-ideal behavior. The only question is
> what types of non-ideal behaviors you are willing to tolerate.

This is the old barrier debate. The scsi subsystem does not advertise an
ordering property to the block layer and thus is not required to
preserve order over error recovery. This problem, therefore, does not
exist in linux.

We had this debate years ago...the upshot being that the performance
benefits of order preservation were uncertain at best so it was never
implemented. Linux works just fine without it.

> That wouldn't help things. For example, let's say that there is one command
> active on the bus holding up the completion of 32 others. "Waiting for a bit"
> will never release the other 32 commands. You must abort the bus hog. Once
> you abort the problem command, you get flooded with the completions of the
> 32 others. The bus is recovered. You can now safely go about your business.
> An HBA watchdog handler can properly deal with this situation since it has
> state that the mid-layer does not.

I don't understand this. If by "active on the bus" you mean is holding
the bus in a busy state, then you cannot get access to the bus to
send an abort or a device reset...the only recourse is a bus
reset...which the mid layer will do.

If the drive has actually freed the bus but lost the tag, then it's a
drive queueing bug, and the solution is usually to lower the TCQ depth
(we should probably have a blacklist for this). This is where the mid
layer quiesce is good...if all the other commands complete, the bus is
free and we still don't get a response from the missing command, then
you know the drive firmware lost it, and the driver should adjust the
queue depth downwards.
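
The lost-tag check described above could look roughly like this. It is only a sketch: `depth_after_quiesce` is a hypothetical helper, and stepping the depth down by one with a floor of 1 is an assumption, since the paragraph only says "downwards":

```python
def depth_after_quiesce(bus_free, others_completed, missing_responded,
                        queue_depth):
    """Return the queue depth to use after a quiesce.

    If the bus is free, every other command has completed, and the
    missing command still has not responded, assume the drive firmware
    lost the tag and reduce the depth (step size is an assumption).
    """
    if bus_free and others_completed and not missing_responded:
        return max(1, queue_depth - 1)
    return queue_depth


lost_tag = depth_after_quiesce(True, True, False, queue_depth=32)
still_busy = depth_after_quiesce(False, False, False, queue_depth=32)
print(lost_tag, still_busy)
```

Only the "bus free, everything else done, one command missing" combination lowers the depth; any other outcome leaves it unchanged and is left to the other recovery paths.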

If the drive is just off servicing other tags, then it's tag starvation.

> As for tag starvation, just inserting a periodic ordered tag on devices
> that show signs of starvation is a much better approach than shutting
> down the flow of commands to the whole controller at the first sign of
> trouble. Luckily, most vendors stopped making drives with tag starvation
> issues in the mid-90's. For this reason, the tag starvation code in
> my drivers is off by default, but can be enabled via a module or kernel
> command line option.

Well, I have to deal with old hardware...

James


2004-01-20 02:03:51

by Justin T. Gibbs

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

> On Mon, 2004-01-19 at 13:38, Justin T. Gibbs wrote:
>> Can you provide your definition of "maintainer"? I know that I am maintainer
>> of the drivers distributed from my website, but I don't feel I have ever
>> been maintainer of the drivers in the 2.4.X, 2.5.X, or 2.6.X trees.
>
> A maintainer is a person who works with the kernel community to keep the
> driver (or subsystem, filesystem or whatever) up to date. Such a person
> may or possibly may not have an entry in the MAINTAINERS file.

Does the maintainer have the ability to veto changes that harm the
code they maintain? In other words, you claim that I am the maintainer
of the drivers in the kernel.org tree. This has not prevented changes
from being made to these drivers without adequate review. Even your last
update to the driver threw away all of the changelog state and left at
least the aic79xx driver in a worse state than it was in before (see
changelog entries for the driver versions after the one that you imported
for details - this was exactly why I didn't submit that particular revision).
You didn't even bother to ask me if importing 1.3.11 was appropriate. This
is why I say I don't feel like a maintainer. I'm not given adequate control
over the end product yet I'm supposed to take the blame when it doesn't work.

>> I'm open to ideas, but from this one line summary, this sounds like a
>> workaround and not a real solution. Can you say more about your proposal?
>
> It actually wasn't mine, it was Alan Cox's. There was a thread about it
> (on which you were cc'd), but I'm currently on a 'plane to NY for
> LinuxWorld and don't have it handy.

That proposal was to allow the timeout handler to be redirected. This
is different than an early notification. Allowing the timeout handler
to be redirected is a required step toward making the recovery code
work.

> Well, I'm thinking of having a class based error recovery scheme, built
> upon an extension of the transport class patch that has been floating
> around this list.

That's all fine for "status based" recovery. All I'm trying to resolve
are issues with watchdog recovery. Can we limit the discussion to that
area?

> However, my problem is that the aic7xxx/79xx chips
> are basically SPI, and therefore, even under the new scheme, should be
> using the SPI recovery class. Therefore, just providing an override
> mechanism for all drivers to use isn't what I want. What I want is a
> robust SPI recovery mechanism usable by all.

I understand that, but it just isn't possible to do well for watchdog
recovery. For status based recovery, sure.

> This is what "working with the kernel community" means. If there's a
> bug, I don't want it fixed by driver work arounds, I want it fixed in
> the core code. Having driver writers ignore the APIs and roll their own
> will simply create problems.

In this case, the bug is that the mid-layer tries to handle watchdog
recovery on its own. It will never, in my opinion, having reviewed
lots of systems that have tried to do it in a centralized way, work well.
The mid-layer just doesn't have the necessary state to make intelligent
decisions and exporting that state will always be cumbersome and incomplete.

>> Most SCSI commands are only idempotent if replayed in the same order
>> as originally issued (consider FSes that rely on write ordering to
>> keep their meta-data coherent). Some commands are retriable but only if
>> they have actually failed. The mid-layer has no concept currently of these
>> issues, yet it acts on behalf of the peripheral drivers that can better
>> understand how the device they control behaves and act accordingly.
>>
>> Bugs are defects that render non-ideal behavior. The only question is
>> what types of non-ideal behaviors you are willing to tolerate.
>
> This is the old barrier debate.

Not entirely. Tapes are allowed to accept multiple commands and some
FCTape drives do. But even if you throw away that argument completely,
you still haven't resolved how to deal with retriable commands that
are only retriable if they have actually failed.

My feeling is that any situation where the mid-layer or HBA drivers fail
to provide complete and accurate state for the commands that are completed
is a bug. The peripheral drivers cannot do their job if they aren't
given good information.

>> That wouldn't help things. For example, let's say that there is one
>> command active on the bus holding up the completion of 32 others.
>> "Waiting for a bit" will never release the other 32 commands. You must
>> abort the bus hog. Once you abort the problem command, you get flooded
>> with the completions of the 32 others. The bus is recovered. You can now
>> safely go about your business. An HBA watchdog handler can properly deal
>> with this situation since it has state that the mid-layer does not.
>
> I don't understand this. If by "active on the bus" you mean is holding
> the bus in a busy state, then you cannot get access to the bus to
> send an abort or a device reset...the only recourse is a bus
> reset...which the mid layer will do.

If we are talking SPI, then aborts, device resets, and lun resets
are all handled with either message bytes transmitted via a message phase,
or via command packets with the task management function set appropriately.
If you cannot send an abort message, you cannot send any message, so claiming
that a BDR request will resolve the problem doesn't make any sense if you
believe that a device active on the bus prevents aborts from working. In any
event, just because a device is active on the bus doesn't mean that the
bus is hung and that you cannot abort a command. By raising the ATN line,
the target may decide to change phase to accept your message byte. It
may not, but if it doesn't, then your only recourse is to reset the
bus. Looping through all the other commands that happen to be stalled
and asking the driver to abort them will only delay the inevitable.
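
The escalation argued for above can be sketched roughly as follows. This is only an illustration: the enum, the function, and both boolean inputs are hypothetical names, not part of the aic7xxx driver or the mid-layer.

```c
#include <stdbool.h>

/* Hypothetical recovery steps, ordered from least to most disruptive. */
enum recovery_step {
    STEP_ABORT_TASK,    /* abort message for the one stalled command */
    STEP_DEVICE_RESET,  /* bus device reset (BDR) message to the target */
    STEP_BUS_RESET      /* reset the whole bus; affects every device */
};

/*
 * target_accepts_messages models whether the target changed phase to
 * take a message byte after ATN was raised. Abort and BDR both travel
 * as messages, so if no message can be delivered, neither can be sent
 * and the only recourse left is a bus reset.
 */
enum recovery_step next_recovery_step(bool target_accepts_messages,
                                      bool abort_already_failed)
{
    if (!target_accepts_messages)
        return STEP_BUS_RESET;
    if (abort_already_failed)
        return STEP_DEVICE_RESET;
    return STEP_ABORT_TASK;
}
```

The point of the sketch is the first branch: when the target will not accept a message, trying aborts on the other stalled commands cannot succeed either.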

> If the drive has actually freed the bus but lost the tag, then it's a
> drive queueing bug, and the solution is usually to lower the TCQ depth
> (we should probably have a blacklist for this). This is where the mid
> layer quiesce is good...if all the other commands complete, the bus is
> free and we still don't get a response from the missing command, then
> you know the drive firmware lost it, and the driver should adjust the
> queue depth downwards.

How does the mid-layer know that the "bus is free"? What transports even
have this concept? If one drive has lost a command, and the transport
is functioning normally, why are you penalizing the other devices attached
to the HBA while you "sort this out"? There is no need to do that.

As for reducing the queue depth in response to repeated timeouts by a
device, this is easy enough to do with your "multi-layered", status based
recovery code. All that is required is for the HBA to tell you that a
particular command was aborted due to timeout as well as indicate what
side-effects occurred because of the abort process (bus reset, device reset,
lun reset, LIP, etc). Some of the latter is already provided for by the
reset and bus reset entry points, but a better solution would be to have
a single "async event" callback that can encompass any transport notifications
needed by SPI, FC, SAS, and any future transports without adding more
entry points.
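
One possible shape for such a single "async event" callback is sketched below. Every name here (the event enum, the structs, the callback type, the 16-target limit) is a hypothetical illustration of the proposal, not an existing kernel interface.

```c
/* Hypothetical event types an HBA driver could report upward. */
enum transport_event {
    EVT_CMD_TIMEOUT_ABORT, /* a command was aborted due to timeout */
    EVT_LUN_RESET,
    EVT_DEVICE_RESET,
    EVT_BUS_RESET,
    EVT_LIP                /* FC loop initialization */
};

struct async_event {
    enum transport_event type;
    int channel;
    int target;
    int lun;
};

/* One entry point for all notifications, instead of one per event. */
typedef void (*async_event_fn)(const struct async_event *ev, void *ctx);

#define MAX_TARGETS 16 /* assumption: wide SPI bus */

struct timeout_stats {
    int timeouts[MAX_TARGETS]; /* timeout aborts seen per target */
};

/* Example consumer: count timeout aborts so a status based policy
 * could later decide to lower a device's queue depth. */
void count_timeouts(const struct async_event *ev, void *ctx)
{
    struct timeout_stats *stats = ctx;

    if (ev->type == EVT_CMD_TIMEOUT_ABORT &&
        ev->target >= 0 && ev->target < MAX_TARGETS)
        stats->timeouts[ev->target]++;
}
```

A new transport would add enum values rather than new entry points, which is the property the paragraph above asks for.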

> If the drive is just off servicing other tags, then it's tag starvation.
>> As for tag starvation, just inserting a periodic ordered tag on devices
>> that show signs of starvation is a much better approach than shutting
>> down the flow of commands to the whole controller at the first sign of
>> trouble. Luckily, most vendors stopped making drives with tag starvation
>> issues in the mid-90's. For this reason, the tag starvation code in
>> my drivers is off by default, but can be enabled via a module or kernel
>> command line option.
>
> Well, I have to deal with old hardware...

Sure, just don't penalize the other disks on the transport because you
have one disk out there that is affected by this issue.
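
For illustration, a per-device "periodic ordered tag" could look something like the sketch below. The struct, the interval field, and the convention that an interval of 0 means disabled are assumptions made for this example; this is not the code from the aic7xxx driver.

```c
enum tag_type { TAG_SIMPLE, TAG_ORDERED };

/* Hypothetical per-device state; nothing here touches other devices. */
struct starvation_guard {
    unsigned interval;      /* send an ordered tag every N commands; 0 = off */
    unsigned since_ordered; /* commands issued since the last ordered tag */
};

/*
 * Pick the tag type for the next command to this device. An ordered
 * tag forces the drive to finish everything queued before it, so an
 * occasional one bounds how long any single tag can be starved.
 */
enum tag_type next_tag(struct starvation_guard *g)
{
    if (g->interval != 0 && ++g->since_ordered >= g->interval) {
        g->since_ordered = 0;
        return TAG_ORDERED;
    }
    return TAG_SIMPLE;
}
```

Because the state is kept per device, only the affected disk pays the cost of the ordered tags.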

--
Justin

2004-01-20 04:45:27

by James Bottomley

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
> Does the maintainer have the ability to veto changes that harm the
> code they maintain? In other words, you claim that I am the maintainer
> of the drivers in the kernel.org tree. This has not prevented changes
> from being made to these drivers without adequate review. Even your last
> update to the driver threw away all of the changelog state and left at
> least the aic79xx driver in a worse state than it was in before (see
> changelog entries for the driver versions after the one that you imported
> for details - this was exactly why I didn't submit that particular revision).

I said "works with the kernel community". It's not about control, it's
about co-operation. The control you seek simply does not exist in the
kernel development process.

> You didn't even bother to ask me if importing 1.3.11 was appropriate. This
> is why I say I don't feel like a maintainer. I'm not given adequate control
> over the end product yet I'm supposed to take the blame when it doesn't work.

In the previous thread about the driver you said "You can integrate the
driver at whatever revision suits you.", so I took you at your word; if
that wasn't what you meant, it's a little late to whine about it now.
Small bug fixes would, as ever, be welcome...

As for blame, apart from the occasional flamewar, the community seems
generally welcoming of anyone who provides fixes. We tend to be more
interested in fixing things than assigning blame.

> That proposal was to allow the timeout handler to be redirected. This
> is different than an early notification. Allowing the timeout handler
> to be redirected is a required step toward making the recovery code
> work.

The recovery code does work. You may want it to work differently, and
that may make it work better, but that's an enhancement not a bug fix.

> In this case, the bug is that the mid-layer tries to handle watchdog
> recovery on its own. It will never, in my opinion, having reviewed
> lots of systems that have tried to do it in a centralized way, work well.
> The mid-layer just doesn't have the necessary state to make intelligent
> decisions and exporting that state will always be cumbersome and incomplete.

But it does do it successfully. Something that currently works but
could work better is an enhancement not a bug.

> How does the mid-layer know that the "bus is free"? What transports even
> have this concept? If one drive has lost a command, and the transport
> is functioning normally, why are you penalizing the other devices attached
> to the HBA while you "sort this out"? There is no need to do that.

Again, this is a "could do better", not a required bug fix.

I'm not against enhancements, even at this late stage in the
stabilisation process. However, they have to be small, self contained
and obviously correct. If you have them, send them to the list and
they'll get reviewed.

James


2004-01-20 05:38:05

by Justin T. Gibbs

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

> On Mon, 2004-01-19 at 21:02, Justin T. Gibbs wrote:
>> Does the maintainer have the ability to veto changes that harm the
>> code they maintain? In other words, you claim that I am the maintainer
>> of the drivers in the kernel.org tree. This has not prevented changes
>> from being made to these drivers without adequate review. Even your last
>> update to the driver threw away all of the changelog state and left at
>> least the aic79xx driver in a worse state than it was in before (see
>> changelog entries for the driver versions after the one that you imported
>> for details - this was exactly why I didn't submit that particular revision).
>
> I said "works with the kernel community". It's not about control, it's
> about co-operation. The control you seek simply does not exist in the
> kernel development process.

Then I ask again, what does it mean to be a maintainer? It sounds like
I'm on equal footing with anyone who decides to post some patch to the
lists. I've lost count of the number of occasions that some random
patch from some random individual was accepted without any consultation
with "the maintainer" of these drivers. The end result was more email
in my mailbox complaining about "the broken driver that I maintain."

As for control, the type of control "I seek" does exist. You have it.
You can also delegate some of that control if it suits you.

A maintainer takes on responsibility to ensure that something is maintained
and works. Without some level of control, how can the maintainer fulfill
that responsibility?

>> You didn't even bother to ask me if importing 1.3.11 was appropriate. This
>> is why I say I don't feel like a maintainer. I'm not given adequate control
>> over the end product yet I'm supposed to take the blame when it doesn't work.
>
> In the previous thread about the driver you said "You can integrate the
> driver at whatever revision suits you.", so I took you at your word; if
> that wasn't what you meant, it's a little late to whine about it now.
> Small bug fixes would, as ever, be welcome...

I provided all of the information required for you to make a reasoned
decision of which change sets to integrate. I had no idea that you
would completely disregard the wealth of information in the change sets
and change set comments when coming up with an integration point. Your
actions show that you didn't review or understand the changes well enough
to submit them into the tree. You probably didn't even test the resulting
driver on real hardware before you submitted the changes.

> The recovery code does work. You may want it to work differently, and
> that may make it work better, but that's an enhancement not a bug fix.

No. The recovery code doesn't work. Many of the people that know this
don't bother complaining to you about it. They complain to the HBA driver
authors and the tech support departments of the companies that make the HBAs.
The HBA driver authors then do what they have to ensure that the system
remains viable after recovery.

I mean honestly. Do you think I would have gone to all of the trouble
I did in doing my own watchdog recovery if the recovery code worked
correctly? Or that I would stand so firm in my position if these issues
didn't have real customer impact?

--
Justin

2004-01-20 07:15:57

by Linus Torvalds

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels



On Mon, 19 Jan 2004, Justin T. Gibbs wrote:
>
> Does the maintainer have the ability to veto changes that harm the
> code they maintain?

Nope. Nobody has that right.

Even _I_ don't veto changes that the right people push (my motto:
"everybody is wrong sometimes: when enough people complain, even I am
wrong").

In particular, maintainers of "conceptually higher" generally always have
priority. If Al Viro says a filesystem is doing something wrong from a VFS
standpoint, then that filesystem is broken - regardless of whether the
filesystem maintainer agrees or not. Because the VFS layer requirements
trump any low-level filesystem issues.

But perhaps more importantly (and it's the reason even _I_ don't have the
right, regardless of how high up in the maintainership chain I am), nobody
has veto-power over anything. That's to keep people honest: nobody should
_ever_ think that they are "in control", and that nobody else can replace
them.

In other words: maintainership is not ownership. It's a stewardship.

End result: maintainership is a nasty and mostly unthankful job. It
doesn't really give many privileges, and most of what it does is just have
people complain to you about bugs. The satisfaction is there, of course,
but

And finally: maintainership is largely about working with people.
There's some code in there too, but people tend to be more important.

Linus

2004-01-20 08:32:29

by Andre Hedrick

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels


Linus,

Would have been nice to list such rules at the top of the MAINTAINERS
file, this may have saved me some grief, well maybe not ...

Andre Hedrick
LAD Storage Consulting Group

On Mon, 19 Jan 2004, Linus Torvalds wrote:

>
>
> On Mon, 19 Jan 2004, Justin T. Gibbs wrote:
> >
> > Does the maintainer have the ability to veto changes that harm the
> > code they maintain?
>
> Nope. Nobody has that right.
>
> Even _I_ don't veto changes that the right people push (my motto:
> "everybody is wrong sometimes: when enough people complain, even I am
> wrong").
>
> In particular, maintainers of "conceptually higher" generally always have
> priority. If Al Viro says a filesystem is doing something wrong from a VFS
> standpoint, then that filesystem is broken - regardless of whether the
> filesystem maintainer agrees or not. Because the VFS layer requirements
> trump any low-level filesystem issues.
>
> But perhaps more importantly (and it's the reason even _I_ don't have the
> right, regardless of how high up in the maintainership chain I am), nobody
> has veto-power over anything. That's to keep people honest: nobody should
> _ever_ think that they are "in control", and that nobody else can replace
> them.
>
> In other words: maintainership is not ownership. It's a stewardship.
>
> End result: maintainership is a nasty and mostly unthankful job. It
> doesn't really give many privileges, and most of what it does is just have
> people complain to you about bugs. The satisfaction is there, of course,
> but
>
> And finally: maintainership is largely about working with people.
> There's some code in there too, but people tend to be more important.
>
> Linus

2004-01-20 11:25:22

by Chiaki

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

I have a feeling that Linus summed up
what the maintainer is like in the context
of linux kernel development.

But I have a comment.

> The recovery code does work.

Maybe. I have not tried a few problematic devices under
my PC lately. These devices
usually caused trouble under 2.2.xx series, and even under late
2.3.y series for a while.
The symptom was essentially a reset storm that made the system
unusable.
Given the various patches accumulating, maybe the symptom
is tolerable now, but again I see some mention of
unusable system response even today.
So I suspect the problem is still there for certain types
of hardware problems.

>You may want it to work differently, and
> that may make it work better, but that's an enhancement not a bug fix.

To people who have been bitten with such unusable system symptoms
the above statement simply won't pass.

It is essentially a "performance *bug*" and
should be corrected IMHO.

> But it does do it successfully. Something that currently works but
> could work better is an enhancement not a bug.

Again, to those people this is a correctable (and should be corrected)
performance "bug".

> I'm not against enhancements, even at this late stage in the
> stabilisation process.

I am a little confused here. Are we talking about 2.4 series?
OK the subject line states 2.4.2[234].

I believe that there is a large user base, especially people
who use server type machines with SCSI interfaces (and AIC chips
seem to be popular among these machines),
who would appreciate the enhanced (== performance
bug corrected) version of the SCSI subsystem.

I, for one, don't use AIC chip on my home PCs, but
do have some machines at the office which use them and
will appreciate "enhanced" SCSI subsystem after all these years.

As for 2.6.zz, isn't there any chance of introducing hooks into the EH
framework? The previous discussion suggested that it needs
to wait for 2.7 series. Too bad :-(

I feel that these error handling issues in the SCSI subsystem
will have to be solved once and for all, sooner or later, in the mainline.
Otherwise the vendors of commercial distributions will probably
need to keep a separate tree for a long time to come (which they may
have to do anyway, to deal with other quirks of the mainline kernel, etc.),
and this is rather a waste of manpower IMHO.

In any case, with all due respect
I don't think that the discussion goes anywhere unless we
recognize that someone's "mere enhancement" is actually
other people's "serious performance bug correction".
I, for one, tend to see the topic discussed as
the performance "BUG" and so
am a little frustrated at the pace the
error handling scheme is being improved.

This is just a comment from a third party observer who,
unfortunately doesn't have the time to dig into
the code and offer a patch. (Yes I actually tried
once during 2.2.xx time-frame but was then repulsed at
the spaghetti code of the time and gave up.).

PS: One other thing is that this type of bug
is hard to trigger unless you have a controlled
facility or some seemingly working and yet
faulty devices which
generate a bad condition in a short time, say a few minutes
into the operation. So I agree that
not all people see such problems.

Intermittently faulty SCSI devices are rather rare, aren't they?
A SCSI device such as a disk is either completely dead
or healthy. Finding a faulty device that triggers an error condition
from time to time is probably the key to observe the
problematic symptom being discussed. I wonder
if some disk manufacturers or someone could produce
a special firmware to generate error condition every minute or so
and send such disks to SCSI subsystem developers :-)

PPS: Some would argue that if such devices are so rare
then we can ignore them. Heck, no!
I have seen Solaris log files where such faulty
behavior occurred from time to time and was dealt with
very gracefully without the system being rendered unusable.
So the ratio of such devices is small, but the sheer number
of computer installations today makes such incidents visible indeed.

--
int main(void){int j=2003;/*(c)2003 cishikawa. */
char t[] ="<CI> @abcdefghijklmnopqrstuvwxyz.,\n\"";
char *i ="g>qtCIuqivb,gCwe\[email protected]\"tqkvv is>dnamz";
while(*i)((j+=strchr(t,*i++)-(int)t),(j%=sizeof t-1),
(putchar(t[j])));return 0;}/* under GPL */

2004-01-21 20:00:11

by Stephen Smoogen

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

Hopefully this won't spark more contention, but here is the additional
info I have dug up:

The cabling goes from the onboard scsi card to a removable carrier tray
to another disk to a terminator. Cabling was pulled out and looked at
for cuts, breaks, and other possible problems.. none seen.

07/26/2001 Supermicro 370DL3/370DLE/P3TDL3/P3TDLE BIOS R1.31A
PentiumIII(tm), 866MHz
133MHz Host Bus, PC133 SDRAM
Checking NVRAM..
1024 MB OK
WAIT...

(C) American Megatrends Inc.,
63-0726-009999-00101111-072601-AMIBIOS-3DL0726-Y2KC-5

*** Press <Ctrl><A> for SCSISelect(TM) Utility! ***

Adaptec SCSI BIOS v3.10
(c) 2001 Adaptec, Inc. All Rights Reserved.
*** Press <Ctrl><A> for SCSISelect(TM) Utility! ***

Slot Ch ID LUN Vendor Product Size Sync Bus
*******************************************************************
00 A 0 0 IBM DDYS-T18350N 17GB 160 16
00 A 4 0 IBM DDYS-T18350N 17GB 160 16
00 A 6 0

Putting in the debug options Justin sent in a separate email, I don't
get much more data.

Red Hat nash version 3.3.10 starting
SCSI subsystem driver Revision: 1.00
Loading scsi_mod module
PCI: Found IRQ 10 for device 01:03.0
Loading sd_mod module
ahc_pci:1:3:0: Reading SEEPROM...done.
Loading aic7xxx module
ahc_pci:1:3:0: Manual LVD Termination
ahc_pci:1:3:0: BIOS eeprom is present
ahc_pci:1:3:0: Secondary High byte termination Enabled
ahc_pci:1:3:0: Secondary Low byte termination Enabled
ahc_pci:1:3:0: Primary Low Byte termination Enabled
ahc_pci:1:3:0: Primary High Byte termination Enabled
ahc_pci:1:3:0: Downloading Sequencer Program... 423 instructions downloaded
ahc_pci:1:3:0: Features 0x1def6, Bugs 0x40, Flags 0x20485560
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.3.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

blk: queue f7ecb374, I/O limit 4095Mb (mask 0xffffffff)
(scsi0:A:0:0): Sending PPR bus_width 1, period 9, offset 7f, ppr_options 2
(scsi0:A:0:0): Received PPR width 1, period 9, offset 3f,options 2
Filtered to width 1, period 9, offset 3f, options 2
(scsi0:A:0): 6.600MB/s transfers (16bit)
scsi0: target 0 using 16bit transfers
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
scsi0: target 0 synchronous at 80.0MHz DT, offset = 0x3f
(scsi0:A:0): 80.000MB/s transfers (80.000MHz DT, offset 63)
scsi0: target 0 using 8bit transfers
(scsi0:A:0): 3.300MB/s transfers
scsi0: target 0 using asynchronous transfers
scsi0: Unexpected busfree while idle
SEQADDR == 0x1
scsi0: Unexpected busfree while idle
SEQADDR == 0x1
scsi0: Unexpected busfree while idle
SEQADDR == 0x1
scsi0: Unexpected busfree while idle
SEQADDR == 0x1

wait 30-40 minutes

scsi0: Unexpected busfree while idle
SEQADDR == 0x1
(scsi0:A:4:0): Sending PPR bus_width 1, period 9, offset 7f, ppr_options 2
(scsi0:A:4:0): Received PPR width 1, period 9, offset 3f,options 2
Filtered to width 1, period 9, offset 3f, options 2
(scsi0:A:4): 6.600MB/s transfers (16bit)
scsi0: target 4 using 16bit transfers
(scsi0:A:4): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
scsi0: target 4 synchronous at 80.0MHz DT, offset = 0x3f
Vendor: IBM Model: DDYS-T18350N Rev: S9YB
Type: Direct-Access ANSI SCSI revision: 03
blk: queue f7ecb474, I/O limit 4095Mb (mask 0xffffffff)
(scsi0:A:4): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
scsi0:A:4:0: Tagged Queuing enabled. Depth 32
Attached scsi disk sda at scsi0, channel 0, id 4, lun 0
SCSI device sda: 35843670 512-byte hdwr sectors (18352 MB)
Partition check:
sda:
Mounting /proc filesystem
Creating root device
Mounting root filesystem
mount: error 6 mounting ext2
pivotroot: pivot_root(/sysroot,/sysroot/initrd) failed: 2
Freeing unused kernel memory: 116k freed
Kernel panic: No init found. Try passing init= option to kernel.

A good boot looks like the following:

Loading aic7xxx module
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

blk: queue c364fe14, I/O limit 4095Mb (mask 0xffffffff)
Vendor: IBM Model: DDYS-T18350N Rev: S80D
Type: Direct-Access ANSI SCSI revision: 03
blk: queue f7c9a014, I/O limit 4095Mb (mask 0xffffffff)
Vendor: IBM Model: DDYS-T18350N Rev: S9YB
Type: Direct-Access ANSI SCSI revision: 03
blk: queue f7c9a414, I/O limit 4095Mb (mask 0xffffffff)
scsi0:A:0:0: Tagged Queuing enabled. Depth 32
scsi0:A:4:0: Tagged Queuing enabled. Depth 32
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 4, lun 0
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
SCSI device sda: 35843670 512-byte hdwr sectors (18352 MB)
Partition check:
sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 >
(scsi0:A:4): 160.000MB/s transfers (80.000MHz DT, offset 63, 16bit)
SCSI device sdb: 35843670 512-byte hdwr sectors (18352 MB)
sdb:
Mounting /proc filesystem
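
As a rough cross-check of the figures in the two logs above, the arithmetic can be sketched as follows. The helper names are hypothetical, the "80.000MHz DT" figure is treated as millions of transfers per second, and the rounding is an assumption for illustration, not how the sd or aic7xxx drivers compute what they print.

```c
#include <stdint.h>

/* MB/s = millions of transfers per second times bytes per transfer. */
unsigned sync_rate_mb(unsigned mtransfers_per_s, unsigned width_bits)
{
    return mtransfers_per_s * (width_bits / 8);
}

/* Capacity in decimal megabytes, rounded to the nearest MB. */
uint64_t capacity_mb(uint64_t sectors, unsigned sector_bytes)
{
    return (sectors * sector_bytes + 500000) / 1000000;
}
```

Under these assumptions, 80 with a 16bit bus gives the 160 MB/s reported in the good boot, 80 with an 8bit bus gives the 80 MB/s seen once target 0 fell back to narrow transfers, and 35843670 sectors of 512 bytes come to the 18352 MB printed for sda.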



On Fri, 2004-01-16 at 15:59, Stephen Smoogen wrote:
> On Fri, 2004-01-16 at 15:39, Justin T. Gibbs wrote:
> > > Booting problems with aic7xxx with stock kernel 2.4.24.
> >
> > ...
> >
> > > Unexpected busfree while idle
> > > SEQ 0x01
> >
> > A problem with similar symptoms was corrected in driver version 6.2.37
> > back in August of last year. Can you try using the latest driver source
> > from here:
> >
> > http://people.FreeBSD.org/~gibbs/linux/SRC/
> >
> > and see if your problem persists? The aic79xx driver archive at the
> > above location includes both the aic7xxx and aic79xx drivers. If this
> > does not resolve your problem there are other debugging options we can
> > enable that may aid in tracking down the problem.
>
> Hi I did that already; sorry for not being clearer about it in the bug
> report. For some of my systems I had patched my kernel to have the
> latest source code from your site for our aic79xx machines. I ran that
> kernel on these other systems and it locked up in a similar state.
>
> I am ready for the additional debugging options :). Thanks for your
> quick response.
--
Stephen John Smoogen [email protected]
Los Alamos National Lab CCN-5 Sched 5/40 PH: 4-0645
Ta-03 SM-1498 MailStop B255 DP 10S Los Alamos, NM 87545
-- So shines a good deed in a weary world. = Willy Wonka --

2004-01-21 20:41:05

by Guennadi Liakhovetski

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Mon, 19 Jan 2004, Linus Torvalds wrote:

> On Mon, 19 Jan 2004, Justin T. Gibbs wrote:
> >
> > Does the maintainer have the ability to veto changes that harm the
> > code they maintain?
>
> Nope. Nobody has that right.
>
> Even _I_ don't veto changes that the right people push (my motto:
> "everybody is wrong sometimes: when enough people complain, even I am
> wrong").
>
> In particular, maintainers of "conceptually higher" generally always have
> priority. If Al Viro says a filesystem is doing something wrong from a VFS
> standpoint, then that filesystem is broken - regardless of whether the
> filesystem maintainer agrees or not. Because the VFS layer requirements
> trump any low-level filesystem issues.

Linus

May I try to sweeten the pill a bit? I think, I am not contradicting what
you said, but just complementing it, thinking, that the direct code
maintainer has a right and priority in modifying the code, even over the
"conceptually higher" level. Say, if some code is found to be broken,
the problem and possible fixes should first be reported to the direct
maintainer. And only if the maintainer is not co-operating suitably (e.g.,
in the opinion of those "higher" ones), only then necessary modifications
can be made directly. In other words, a situation, when, say, a subsystem
maintainer silently modifies some driver-code, without even letting the
direct maintainer know, is undesirable. A better solution would be to
inform the driver maintainer of the problem / send a patch. And only if no
suitable action follows, force the necessary modifications.

That was just a mere speculation, not pertaining to any specific case.

Thanks
Guennadi
---
Guennadi Liakhovetski




2004-01-22 05:16:23

by James Bottomley

Subject: Re: AIC7xxx kernel problem with 2.4.2[234] kernels

On Tue, 2004-01-20 at 00:43, Justin T. Gibbs wrote:
> As for control, the type of control "I seek" does exist. You have it.
> You can also delegate some of that control if it suits you.

Well, as you have heard from the horse's mouth: I don't.

> I provided all of the information required for you to make a reasoned
> decision of which change sets to integrate. I had no idea that you
> would completely disregard the wealth of information in the change sets
> and change set comments when coming up with an integration point. Your
> actions show that you didn't review or understand the changes well enough
> to submit them into the tree. You probably didn't even test the resulting
> driver on real hardware before you submitted the changes.

Actually, I would have done nothing but for some 2.6 migration reports
of total lockups with the then in-tree aic79xx driver. The patch that
went into the tree was tested by the people reporting the lockups.

> > The recovery code does work. You may want it to work differently, and
> > that may make it work better, but that's an enhancement not a bug fix.
>
> No. The recovery code doesn't work. Many of the people that know this
> don't bother complaining to you about it. They complain to the HBA driver
> authors and the tech support departments of the companies that make the HBAs.
> The HBA driver authors then do what they have to ensure that the system
> remains viable after recovery.

You haven't outlined any incorrect cases in your emails, just "could do
better" cases. If you have all these bug reports that you haven't been
passing on, could you at least distil them to the mid-layer failure
scenario that we can discuss fixing?

> I mean honestly. Do you think I would have gone to all of the trouble
> I did in doing my own watchdog recovery if the recovery code worked
> correctly? Or that I would stand so firm in my position if these issues
> didn't have real customer impact?

Well, in coming up with the mid-layer changes from 2.4 to 2.6 I did look
at what issues the main drivers had workarounds for. I used these
workarounds and an email you sent in September 2002 as the basis for a lot
of the mid-layer changes in 2.6. None of the other drivers does this
timer interception and the issue wasn't mentioned in your email, so I am
dubious about the seriousness of the impact.

The way fixes get into Linux is that either lots of people complain or one
person sends a fix. Neither has happened in this case, which again leads
me to suspect that it's not a huge problem.

The still outstanding question is, now that you have a clearer idea of
what being a Maintainer entails, do you wish to be the maintainer for
aic7xxx?

James