2006-02-12 15:53:04

by Matthias Andree

[permalink] [raw]
Subject: 2.6.15.4 ide-cd causes 'local unit communication failure' on NEC ND-4550A?

Greetings,

catchy subject, here's the story:

SUSE Linux kernels 2.6.13-15.7,
2.6.13-15.8 (both SUSE Linux 10.0 i586) and
Vanilla kernel 2.6.15.4

cause "local unit communication failure" when running "readcd -c2scan"
on my NEC ND-4550A CD/DVD drive (VIA KT8237, Athlon XP 2500+, 1 GB RAM)
after at most two seconds of reading.

I am aware of three ways to trigger the problem currently:

1. as little as switching to a different tab in GNOME-Terminal.

2. hald-addon-storage's poll with CDROM_MEDIA_CHANGED and
CDROM_DRIVE_STATUS ioctls cause the same problem (hald-addon-storage
doesn't do anything else except opening and closing the device and
sleeping for two seconds).

3. running two readcd -c2scan dev=/dev/hdc commands on the same drive
triggers the problem, too. One might argue this isn't very sensible,
but it shouldn't confuse the drive either.

The problem is invariant to hdparm -d0 (or -d1) /dev/hdc, and hdparm -u0
or -u1 make no difference either.

A Plextor PX-W4824TA (/dev/hdd) on the same cable as the NEC works like
a charm, and with FreeBSD 6-STABLE, both work like a charm with (3.).
(This is a dualboot machine, so it's the very same hardware.)

readcd uses commands like these:

Executing 'read_cd' command on Bus 1 Target 0, Lun 0 timeout 40s
CDB: BE 00 00 00 02 7D 00 00 31 FA 00 00
ioctl(3, SG_IO, 0xbfa5b2ac) = 0

In a second run -- I needed to capture the output ;-) -- I get this:

CDB: BE 00 00 00 00 31 00 00 31 FA 00 00
status: 0x2 (CHECK CONDITION)
Sense Bytes: 70 00 04 00 00 00 00 0A 00 00 00 00 08 00 00 00
Sense Key: 0x4 Hardware Error, Segment 0
Sense Code: 0x08 Qual 0x00 (logical unit communication failure) Fru 0x0
Sense flags: Blk 0 (not valid)
resid: 123748
cmd finished after 6.408s timeout 40s
readcd: Success. read_cd: scsi sendcmd: no error

I'd been using "sudo" to run this with ruid == euid == 0 to rule out
command filtering issues. I hope that was the right approach.

What else do you need to debug this? Is there some low-level debugging
output I could enable to see what the kernel sends and gets to/from the
drive?

J?rg Schilling suspected (on cdwrite(#)other,debian!org - typos are
mine) this might be a DMA problem (but what does he know of Linux...
little compared to his Solaris-fu), but I think we can rule that out,
the problem persists after hdparm -d0. J?rg also insisted I tell him if
I had a 40 or 80 wire cable, _after_ I'd posted that FreeBSD worked OK
on the same hardware.

--
Matthias Andree


2006-02-13 20:10:42

by Matthias Andree

[permalink] [raw]
Subject: Re: 2.6.15.4 ide-cd causes 'local unit communication failure' on NEC ND-4550A?

Matthias Andree schrieb am 2006-02-12:

> Greetings,
>
> catchy subject, here's the story:
>
> SUSE Linux kernels 2.6.13-15.7,
> 2.6.13-15.8 (both SUSE Linux 10.0 i586) and
> Vanilla kernel 2.6.15.4
>
> cause "local unit communication failure" when running "readcd -c2scan"
> on my NEC ND-4550A CD/DVD drive (VIA KT8237, Athlon XP 2500+, 1 GB RAM)
> after at most two seconds of reading.

Detail info: the failed command always takes a touch more than 6.4 s to
timeout. Only the time until it happens for the first time differs.
It appears to me that the drive has an internal watchdog to wait for the
host to do something.

And directions how to dig up useful debug information for this issue?
Have there been related fixes in 2.6.16-rc* that make testing
worthwhile?

--
Matthias Andree