LinuxLists.cc - new aic7xxx bug, 2.4.13/6.2.4

2001-11-02 03:25:27

[permalink] [raw]

Subject: new aic7xxx bug, 2.4.13/6.2.4

I'm having troubles with the last few revisions of the new Gibbs aic7xxx
driver that until now has served me well. This is kernel
2.4.13-preempt-lvmrc4 with the 6.2.4 scsi driver, but I saw it happen on
2.4.12 with 6.2.1. I haven't tried older versions with this CD.

When trying to rip a particular audio CD with cdrdao 1.1.5, I get this
error:

?: Input/output error. : scsi sendcmd: retryable error
CDB: 42 00 40 03 00 00 03 00 30 00
status: 0x0 (GOOD STATUS)
cmd finished after 45.934s timeout 20s
WARNING: Cannot read ISRC code.
Track 4...
?: No such device or address. Cannot send SCSI cmd via ioctl

And this appears in the kernel output:

VFS: Disk change detected on device sr(11,1)
scsi0:0:3:0: Attempting to queue an ABORT message
scsi0: Dumping Card State while idle, at SEQADDR 0x7
ACCUM = 0x16, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0x0
DFCNTRL = 0x0, DFSTATUS = 0x2d
LASTPHASE = 0x1, SCSISIGI = 0x0, SXFRCTL0 = 0x80
SSTAT0 = 0x5, SSTAT1 = 0xa
STACK == 0x3, 0x186, 0x156, 0x0
SCB count = 4
Kernel NEXTQSCB = 2
Card NEXTQSCB = 2
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 0:3
QOUTFIFO entries:
Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Pending list: 3
Kernel Free SCB list: 1 0
Untagged Q(3): 3
DevQ(0:2:0): 0 waiting
DevQ(0:3:0): 0 waiting
(scsi0:A:3:0): Queuing a recovery SCB
scsi0:0:3:0: Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:3:0): Abort Message Sent
(scsi0:A:3:0): SCB 3 - Abort Completed.
Recovery SCB completes
Recovery code awake
aic7xxx_abort returns 0x2002
(scsi0:A:3:0): Unexpected busfree in Command phase
SEQADDR == 0x15c
scsi0:0:3:0: Attempting to queue a TARGET RESET message
scsi0:0:3:0: Command not found
aic7xxx_dev_reset returns 0x2002
scsi0:0:3:0: Attempting to queue an ABORT message
scsi0: Dumping Card State while idle, at SEQADDR 0x7
ACCUM = 0xf7, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0x0
DFCNTRL = 0x0, DFSTATUS = 0x2d
LASTPHASE = 0x1, SCSISIGI = 0x0, SXFRCTL0 = 0x80
SSTAT0 = 0x5, SSTAT1 = 0xa
STACK == 0x3, 0x186, 0x156, 0x35
SCB count = 4
Kernel NEXTQSCB = 2
Card NEXTQSCB = 2
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 0:3
QOUTFIFO entries:
Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Pending list: 3
Kernel Free SCB list: 1 0
Untagged Q(3): 3
DevQ(0:2:0): 0 waiting
DevQ(0:3:0): 0 waiting
(scsi0:A:3:0): Queuing a recovery SCB
scsi0:0:3:0: Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:3:0): Abort Message Sent
(scsi0:A:3:0): SCB 3 - Abort Completed.
Recovery SCB completes
Recovery code awake
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset: host 0 channel 0 id 3 lun 0

After this, subsequent attempts to use that drive fail. Trying to mount
a data CD, for example, now gives:

[orr](0) % mount /cdrom
mount: No medium found

with this in dmesg:

cdrom: open failed.
VFS: Disk change detected on device sr(11,1)

Some other information follows.

/proc/scsi/scsi:

Attached devices:
Host: scsi0 Channel: 00 Id: 02 Lun: 00
Vendor: YAMAHA Model: CRW6416S Rev: 1.0d
Type: CD-ROM ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: PIONEER Model: DVD-ROM DVD-305 Rev: 1.00
Type: CD-ROM ANSI SCSI revision: 02

/proc/scsi/aic7xxx/0:

Adaptec AIC7xxx driver version: 6.2.4
aic7880: Single Channel A, SCSI Id=7, 16/253 SCBs
Channel A Target 0 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Channel A Target 1 Negotiation Settings
User: 8.064MB/s transfers (8.064MHz, offset 255)
Channel A Target 2 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Goal: 10.000MB/s transfers (10.000MHz, offset 15)
Curr: 10.000MB/s transfers (10.000MHz, offset 15)
Channel A Target 2 Lun 0 Settings
Commands Queued 94
Commands Active 0
Command Openings 1
Max Tagged Openings 0
Device Queue Frozen Count 0
Channel A Target 3 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Goal: 10.000MB/s transfers (10.000MHz, offset 15)
Curr: 10.000MB/s transfers (10.000MHz, offset 15)
Channel A Target 3 Lun 0 Settings
Commands Queued 977404
Commands Active 0
Command Openings 1
Max Tagged Openings 0
Device Queue Frozen Count 0
Channel A Target 4 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Channel A Target 5 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Channel A Target 6 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)
Channel A Target 7 Negotiation Settings
User: 10.000MB/s transfers (10.000MHz, offset 255)

I'm happy to help investigate further if there's anything you want to
try. Let me know if you need any other output or have a patch I can try
out.

Jason

2001-11-02 14:03:37

by Krzysztof Halasa

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

Jason Lunz <[email protected]> writes:

> I'm having troubles with the last few revisions of the new Gibbs aic7xxx
> driver that until now has served me well. This is kernel
> 2.4.13-preempt-lvmrc4 with the 6.2.4 scsi driver, but I saw it happen on
> 2.4.12 with 6.2.1. I haven't tried older versions with this CD.

Something similar here, but I have disks on this SCSI :-(

Plain 2.4.13-ac5 kernel, dual PIII-1000, VIA686B chipset, AHA2940UW
(more details available on request):

Adaptec AIC7xxx driver version: 6.2.4
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/253 SCBs
Channel A Target 0 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Goal: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Curr: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Channel A Target 0 Lun 0 Settings
Commands Queued 7357
Commands Active 0
Command Openings 253
Max Tagged Openings 253
Device Queue Frozen Count 0
Channel A Target 1 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Goal: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Curr: 40.000MB/s transfers (20.000MHz, offset 8, 16bit)
Channel A Target 1 Lun 0 Settings
Commands Queued 365
Commands Active 0
Command Openings 253
Max Tagged Openings 253
Device Queue Frozen Count 0
Channel A Target 2 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 3 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 4 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 5 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 6 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 7 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 8 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 9 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 10 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 11 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 12 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 13 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 14 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)
Channel A Target 15 Negotiation Settings
User: 40.000MB/s transfers (20.000MHz, offset 255, 16bit)

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: IBM Model: DNES-309170W Rev: SA30
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: IBM Model: DNES-318350W Rev: SAH0
Type: Direct-Access ANSI SCSI revision: 03

scsi0:A:0: parity error detected in Message-in phase. SEQADDR(0x1b6) SCSIRATE(0x0)
scsi0: Unexpected busfree in Message-out phase
SEQADDR == 0x166
scsi0:0:0:0: Attempting to queue an ABORT message
scsi0: Dumping Card State while idle, at SEQADDR 0x7
ACCUM = 0x33, SINDEX = 0x7, DINDEX = 0x8c, ARG_2 = 0x0
HCNT = 0x0
SCSISEQ = 0x12, SBLKCTL = 0x2
DFCNTRL = 0x0, DFSTATUS = 0x29
LASTPHASE = 0x1, SCSISIGI = 0x0, SXFRCTL0 = 0x80
SSTAT0 = 0x5, SSTAT1 = 0xa
STACK == 0x3, 0x105, 0x160, 0xe4
SCB count = 20
Kernel NEXTQSCB = 19
Card NEXTQSCB = 19
QINFIFO entries:
Waiting Queue entries:
Disconnected Queue entries: 7:5
QOUTFIFO entries:
Sequencer Free SCB List: 8 11 4 5 10 15 2 3 1 0 9 13 14 6 12
Pending list: 5
Kernel Free SCB list: 7 12 0 4 10 6 9 14 3 1 8 11 15 13 2 18 17 16
DevQ(0:0:0): 0 waiting
DevQ(0:1:0): 0 waiting
(scsi0:A:0:0): Queuing a recovery SCB
scsi0:0:0:0: Device is disconnected, re-queuing SCB
Recovery code sleeping
(scsi0:A:0:0): Abort Tag Message Sent
(scsi0:A:0:0): SCB 5 - Abort Tag Completed.
Recovery SCB completes
Recovery code awake
aic7xxx_abort returns 0x2002
scsi0:A:0: parity error detected in Message-in phase. SEQADDR(0x1b6) SCSIRATE(0x0)
Kernel panic: HOST_MSG_LOOP with invalid SCB ff

In interrupt handler - not syncing

--
Krzysztof Halasa
Network Administrator

2001-11-02 14:43:46

by Justin T. Gibbs

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

>
>I'm having troubles with the last few revisions of the new Gibbs aic7xxx
>driver that until now has served me well. This is kernel
>2.4.13-preempt-lvmrc4 with the 6.2.4 scsi driver, but I saw it happen on
>2.4.12 with 6.2.1. I haven't tried older versions with this CD.

Its not clear to me that this is the driver's fault. Let's look at
the abort log.

>And this appears in the kernel output:
>
>VFS: Disk change detected on device sr(11,1)
>scsi0:0:3:0: Attempting to queue an ABORT message
>scsi0: Dumping Card State while idle, at SEQADDR 0x7

Upper layer has timed out a command while the SCSI bus
is idle.

>ACCUM = 0x16, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
>HCNT = 0x0
>SCSISEQ = 0x12, SBLKCTL = 0x0

We have reselection on

>Kernel NEXTQSCB = 2
>Card NEXTQSCB = 2
>QINFIFO entries:

No new commands to run.

>Disconnected Queue entries: 0:3

At least one command is disconnected on a target.

>QOUTFIFO entries:
>Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>Pending list: 3
>Kernel Free SCB list: 1 0
>Untagged Q(3): 3

The disconnected command is for target 3.

>DevQ(0:2:0): 0 waiting
>DevQ(0:3:0): 0 waiting

All of the rest of the state is normal.

>(scsi0:A:3:0): Queuing a recovery SCB
>scsi0:0:3:0: Device is disconnected, re-queuing SCB
>Recovery code sleeping
>(scsi0:A:3:0): Abort Message Sent
>(scsi0:A:3:0): SCB 3 - Abort Completed.
>Recovery SCB completes
>Recovery code awake
>aic7xxx_abort returns 0x2002

We successfully selected the target and aborted the command.

>(scsi0:A:3:0): Unexpected busfree in Command phase
>SEQADDR == 0x15c

This, I can't really explain unless the target is somewhat
unstable just after an abort occurs. I'd need to see a
bus trace.

>scsi0:0:3:0: Attempting to queue a TARGET RESET message
>scsi0:0:3:0: Command not found

The upper layer tells us to perform a target reset for
a command that doesn't exist. It was likely aborted
by the unexpected bus free above, but the mid-layer ignores
completions during error recovery.

>aic7xxx_dev_reset returns 0x2002
>scsi0:0:3:0: Attempting to queue an ABORT message
>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>ACCUM = 0xf7, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0

Target decideds not to return our command again, so we
are told to perform recovery.

>(scsi0:A:3:0): Abort Message Sent
>(scsi0:A:3:0): SCB 3 - Abort Completed.
>Recovery SCB completes
>Recovery code awake
>aic7xxx_abort returns 0x2002

And we were successful.

>scsi: device set offline - not ready or command retry failed after bus reset:
>host 0 channel 0 id 3 lun 0

But the mid-layer has already decided that it can't recover this device,
so it calls it dead and refuses to allow I/O to it anymore.

>I'm happy to help investigate further if there's anything you want to
>try. Let me know if you need any other output or have a patch I can try
>out.

Have you recently changed your version of cdrdao? Perhaps that program
is issuing a command that this particular drive simply will not accept?

--
Justin

2001-11-02 14:56:17

by Justin T. Gibbs

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

>Jason Lunz <[email protected]> writes:
>
>> I'm having troubles with the last few revisions of the new Gibbs aic7xxx
>> driver that until now has served me well. This is kernel
>> 2.4.13-preempt-lvmrc4 with the 6.2.4 scsi driver, but I saw it happen on
>> 2.4.12 with 6.2.1. I haven't tried older versions with this CD.
>
>Something similar here, but I have disks on this SCSI :-(

Actually not that similar.

>scsi0:A:0: parity error detected in Message-in phase. SEQADDR(0x1b6) SCSIRATE(
>0x0)

Here's your first clue. Parity errors occur on "broken" SCSI busses.
Your cabling or terminator is bad, you have a bent pin, or just too
much noise in and around your cabling.

>scsi0: Unexpected busfree in Message-out phase
>SEQADDR == 0x166

When we tried to tell the target about the parity error during message-in
phase, the target was unable to see our message without parity errors.
So, it had to go bus free to terminate the transaction. It is a bug
here that we didn't properly identify the active transaction and abort
it here due to some overly protective code, but that can be readily
fixed.

>scsi0:0:0:0: Attempting to queue an ABORT message

Because we failed to abort the command above, the mid-layer
timesout the command and we start recovery.

>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>ACCUM = 0x33, SINDEX = 0x7, DINDEX = 0x8c, ARG_2 = 0x0
>HCNT = 0x0
>SCSISEQ = 0x12, SBLKCTL = 0x2
> DFCNTRL = 0x0, DFSTATUS = 0x29
>LASTPHASE = 0x1, SCSISIGI = 0x0, SXFRCTL0 = 0x80
>SSTAT0 = 0x5, SSTAT1 = 0xa
>STACK == 0x3, 0x105, 0x160, 0xe4
>SCB count = 20
>Kernel NEXTQSCB = 19
>Card NEXTQSCB = 19
>QINFIFO entries:
>Waiting Queue entries:
>Disconnected Queue entries: 7:5
>QOUTFIFO entries:
>Sequencer Free SCB List: 8 11 4 5 10 15 2 3 1 0 9 13 14 6 12
>Pending list: 5
>Kernel Free SCB list: 7 12 0 4 10 6 9 14 3 1 8 11 15 13 2 18 17 16
>DevQ(0:0:0): 0 waiting
>DevQ(0:1:0): 0 waiting
>(scsi0:A:0:0): Queuing a recovery SCB
>scsi0:0:0:0: Device is disconnected, re-queuing SCB
>Recovery code sleeping
>(scsi0:A:0:0): Abort Tag Message Sent
>(scsi0:A:0:0): SCB 5 - Abort Tag Completed.
>Recovery SCB completes
>Recovery code awake
>aic7xxx_abort returns 0x2002

And we abort the transaction successfully.

>scsi0:A:0: parity error detected in Message-in phase. SEQADDR(0x1b6) SCSIRATE(
>0x0)
>Kernel panic: HOST_MSG_LOOP with invalid SCB ff

We get another parity error in a state without a proper connection (probably
on the identify message after a reselection) and some more defensive
programming prevents us from handling it correctly. 8-(

The end result is that you need to fix your SCSI bus to be more reliable.
I will fix the issues you've discovered in parity error and unexpected
busfree handling in the next driver release. If your bus is working
properly, you shouldn't see these errors.

--
Justin

2001-11-02 19:36:09

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

In mlist.linux-kernel, you wrote:
>>VFS: Disk change detected on device sr(11,1)
>>scsi0:0:3:0: Attempting to queue an ABORT message
>>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>
> Upper layer has timed out a command while the SCSI bus
> is idle.

I don't understand this. The error is in the middle of a CD rip
(actually the pre-rip of cdrdao where it looks at sub-channel info for
pre-gaps and such). The only way to get a timeout while the scsi bus
is idle would be if the drive just stopped cold, with the mid-layer
expecting it to go on, right?

>>ACCUM = 0x16, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
>>HCNT = 0x0
>>SCSISEQ = 0x12, SBLKCTL = 0x0
>
> We have reselection on
>
>>Kernel NEXTQSCB = 2
>>Card NEXTQSCB = 2
>>QINFIFO entries:
>
> No new commands to run.
>
>>Disconnected Queue entries: 0:3
>
> At least one command is disconnected on a target.

I don't know what it means for a command to be disconnected. Can you
clarify that?

>>QOUTFIFO entries:
>>Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>>Pending list: 3
>>Kernel Free SCB list: 1 0
>>Untagged Q(3): 3
>
> The disconnected command is for target 3.

Right, the DVD-ROM we're ripping from.

>>(scsi0:A:3:0): Queuing a recovery SCB
>>scsi0:0:3:0: Device is disconnected, re-queuing SCB
>>Recovery code sleeping
>>(scsi0:A:3:0): Abort Message Sent
>>(scsi0:A:3:0): SCB 3 - Abort Completed.
>>Recovery SCB completes
>>Recovery code awake
>>aic7xxx_abort returns 0x2002
>
> We successfully selected the target and aborted the command.

The command that timed out, I'm assuming...

>>(scsi0:A:3:0): Unexpected busfree in Command phase
>>SEQADDR == 0x15c
>
> This, I can't really explain unless the target is somewhat
> unstable just after an abort occurs. I'd need to see a
> bus trace.

Possibly, but I hope not. This is otherwise the best optical drive I've
ever worked with, a new "Vendor: PIONEER Model: DVD-ROM DVD-305".

>>scsi0:0:3:0: Attempting to queue a TARGET RESET message
>>scsi0:0:3:0: Command not found
>
> The upper layer tells us to perform a target reset for
> a command that doesn't exist. It was likely aborted
> by the unexpected bus free above, but the mid-layer ignores
> completions during error recovery.

ok, I think this is where the kernel starts to go wrong. The mid-layer
wants to reset the command that resulted in an "unexpected busfree",
which the driver and drive are already done with.

>>aic7xxx_dev_reset returns 0x2002
>>scsi0:0:3:0: Attempting to queue an ABORT message
>>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>>ACCUM = 0xf7, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
>
> Target decideds not to return our command again, so we
> are told to perform recovery.

It makes sense that the target wouldn't return the command if that
command already died. In that case, shouldn't we avoid trying to reset
it? Should the mid-layer not be requesting a reset at this point? Is
this part of the rumored 2.5 scsi rewrite?

>>(scsi0:A:3:0): Abort Message Sent
>>(scsi0:A:3:0): SCB 3 - Abort Completed.
>>Recovery SCB completes
>>Recovery code awake
>>aic7xxx_abort returns 0x2002
>
> And we were successful.

we successfully aborted, but did we have to?

>>scsi: device set offline - not ready or command retry failed after bus reset:
>>host 0 channel 0 id 3 lun 0
>
> But the mid-layer has already decided that it can't recover this device,
> so it calls it dead and refuses to allow I/O to it anymore.

This is definitely wrong. The drive won't do anything now without a
reboot (or maybe removing and reinserting all scsi modules; I could do
that but I haven't tried it).

> Have you recently changed your version of cdrdao? Perhaps that program
> is issuing a command that this particular drive simply will not accept?

This is the same drive and version of cdrdao that have ripped more than
100 CDs. It's just this particular CD that breaks in this way at the
same spot every time.

If the DVD-ROM can't handle that CD then that's fine, but it would be
nice if such a broken CD didn't result in not being able to use that
drive at all anymore.

thanks for your help,

Jason

2001-11-02 20:04:04

by Justin T. Gibbs

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

>In mlist.linux-kernel, you wrote:
>>>VFS: Disk change detected on device sr(11,1)
>>>scsi0:0:3:0: Attempting to queue an ABORT message
>>>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>>
>> Upper layer has timed out a command while the SCSI bus
>> is idle.
>
>I don't understand this. The error is in the middle of a CD rip
>(actually the pre-rip of cdrdao where it looks at sub-channel info for
>pre-gaps and such). The only way to get a timeout while the scsi bus
>is idle would be if the drive just stopped cold, with the mid-layer
>expecting it to go on, right?

A quick discussion about "disconnection" on a SCSI bus may be helpful
here. Since the SCSI bus is shared, devices often tell the initiator
(of the command, in this case your Adaptec controller), "I'm dropping
off the bus for a while to do more work on this command, but I'll
get back to you". This "disconnection" allows other devices that are
ready to accept or send additional data to use the bus. In the message
above, the aic7xxx driver says that the adapter is currently idle. This
means that it is not currently talking to any device on the bus. Another
initiator may be talking to someone but this will only occur if you
happen to be in a multi-initiator environment (e.g. clustering).

So why might the bus be idle? Well, your Pioneer could be trying
"really hard" to read a sector of data that is hard to read. If
the application setting the timeout for this particular command
has set it too short, we may timeout the command before the drive
has given up on its attempts to recover this sector.

>>>ACCUM = 0x16, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
>>>HCNT = 0x0
>>>SCSISEQ = 0x12, SBLKCTL = 0x0
>>
>> We have reselection on

Meaning that a device can engage us to pick up where we
left off on a disconnected transaction.

>>>(scsi0:A:3:0): SCB 3 - Abort Completed.
>>>Recovery SCB completes
>>>Recovery code awake
>>>aic7xxx_abort returns 0x2002
>>
>> We successfully selected the target and aborted the command.
>
>The command that timed out, I'm assuming...

Yup.

>>>(scsi0:A:3:0): Unexpected busfree in Command phase
>>>SEQADDR == 0x15c
>>
>> This, I can't really explain unless the target is somewhat
>> unstable just after an abort occurs. I'd need to see a
>> bus trace.
>
>Possibly, but I hope not. This is otherwise the best optical drive I've
>ever worked with, a new "Vendor: PIONEER Model: DVD-ROM DVD-305".

Well, I can't say for sure without a bus analyzer. I don't happen
to have access to one of these drives to test it out here. I'd
probably need your particular disc to replicate this too.

>>>scsi0:0:3:0: Attempting to queue a TARGET RESET message
>>>scsi0:0:3:0: Command not found
>>
>> The upper layer tells us to perform a target reset for
>> a command that doesn't exist. It was likely aborted
>> by the unexpected bus free above, but the mid-layer ignores
>> completions during error recovery.
>
>ok, I think this is where the kernel starts to go wrong. The mid-layer
>wants to reset the command that resulted in an "unexpected busfree",
>which the driver and drive are already done with.

Yeah, but we told the upper layer that we successfully killed it,
so it should go on. The real killer is below.

>>>aic7xxx_dev_reset returns 0x2002
>>>scsi0:0:3:0: Attempting to queue an ABORT message
>>>scsi0: Dumping Card State while idle, at SEQADDR 0x7
>>>ACCUM = 0xf7, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
>>
>> Target decideds not to return our command again, so we
>> are told to perform recovery.
>
>It makes sense that the target wouldn't return the command if that
>command already died. In that case, shouldn't we avoid trying to reset
>it? Should the mid-layer not be requesting a reset at this point? Is
>this part of the rumored 2.5 scsi rewrite?

I should have been more clear. The upper layer performed its recovery
and then reissued the command. Again that command timed out. After
a certain number of these cycles, the mid-layer will call the device
dead.

As to the 2.5 scsi rewrite... well there are lots of things in
the SCSI layer to fix for 2.5. 8-)

>>>(scsi0:A:3:0): Abort Message Sent
>>>(scsi0:A:3:0): SCB 3 - Abort Completed.
>>>Recovery SCB completes
>>>Recovery code awake
>>>aic7xxx_abort returns 0x2002
>>
>> And we were successful.
>
>we successfully aborted, but did we have to?

The command was still outstanding and it timed-out. So, yes and no.
See below.

>>>scsi: device set offline - not ready or command retry failed after bus reset
>:
>>>host 0 channel 0 id 3 lun 0
>>
>> But the mid-layer has already decided that it can't recover this device,
>> so it calls it dead and refuses to allow I/O to it anymore.
>
>This is definitely wrong. The drive won't do anything now without a
>reboot (or maybe removing and reinserting all scsi modules; I could do
>that but I haven't tried it).

There may be some way to recover the device via /proc/scsi, but I haven't
looked into it.

What this boils down to is that the mid-layer's error recovery makes one,
big, invalid, assumption. Any command that is issued to a device should
complete successfully (i.e. return a status prior to timing out). If
you send a command that violates that assumption (timeout too short,
device wigs out on a non-vital command, etc.) the device *will* be set
offline.

In my opinion, the mid-layer simply doesn't have enough information to
perform so bold of an action. The client (disk, tape, cdrom, or userland
driver using SG) issuing the command may well have the information to call
a device dead, but generic code can't make generic assumptions that will
work in all cases.

>If the DVD-ROM can't handle that CD then that's fine, but it would be
>nice if such a broken CD didn't result in not being able to use that
>drive at all anymore.

If you bump the timeout on all of the commands in cdrdao, you might
get better results. The drive should return status for a failed read
*eventually*. *Eventually* is not part of the SCSI spec, so it could
be seconds or hours before the drive gives up.

--
Justin

2001-11-02 20:17:56

by Richard B. Johnson

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

On Fri, 2 Nov 2001, Jason Lunz wrote:

>
> In mlist.linux-kernel, you wrote:
> >>VFS: Disk change detected on device sr(11,1)
> >>scsi0:0:3:0: Attempting to queue an ABORT message
> >>scsi0: Dumping Card State while idle, at SEQADDR 0x7
> >
> > Upper layer has timed out a command while the SCSI bus
> > is idle.
>
> I don't understand this. The error is in the middle of a CD rip
> (actually the pre-rip of cdrdao where it looks at sub-channel info for
> pre-gaps and such). The only way to get a timeout while the scsi bus
> is idle would be if the drive just stopped cold, with the mid-layer
> expecting it to go on, right?
>
> >>ACCUM = 0x16, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
> >>HCNT = 0x0
> >>SCSISEQ = 0x12, SBLKCTL = 0x0
> >
> > We have reselection on
> >
> >>Kernel NEXTQSCB = 2
> >>Card NEXTQSCB = 2
> >>QINFIFO entries:
> >
> > No new commands to run.
> >
> >>Disconnected Queue entries: 0:3
> >
> > At least one command is disconnected on a target.
>
> I don't know what it means for a command to be disconnected. Can you
> clarify that?
>
> >>QOUTFIFO entries:
> >>Sequencer Free SCB List: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> >>Pending list: 3
> >>Kernel Free SCB list: 1 0
> >>Untagged Q(3): 3
> >
> > The disconnected command is for target 3.
>
> Right, the DVD-ROM we're ripping from.
>
> >>(scsi0:A:3:0): Queuing a recovery SCB
> >>scsi0:0:3:0: Device is disconnected, re-queuing SCB
> >>Recovery code sleeping
> >>(scsi0:A:3:0): Abort Message Sent
> >>(scsi0:A:3:0): SCB 3 - Abort Completed.
> >>Recovery SCB completes
> >>Recovery code awake
> >>aic7xxx_abort returns 0x2002
> >
> > We successfully selected the target and aborted the command.
>
> The command that timed out, I'm assuming...
>
> >>(scsi0:A:3:0): Unexpected busfree in Command phase
> >>SEQADDR == 0x15c
> >
> > This, I can't really explain unless the target is somewhat
> > unstable just after an abort occurs. I'd need to see a
> > bus trace.
>
> Possibly, but I hope not. This is otherwise the best optical drive I've
> ever worked with, a new "Vendor: PIONEER Model: DVD-ROM DVD-305".
>
> >>scsi0:0:3:0: Attempting to queue a TARGET RESET message
> >>scsi0:0:3:0: Command not found
> >
> > The upper layer tells us to perform a target reset for
> > a command that doesn't exist. It was likely aborted
> > by the unexpected bus free above, but the mid-layer ignores
> > completions during error recovery.
>
> ok, I think this is where the kernel starts to go wrong. The mid-layer
> wants to reset the command that resulted in an "unexpected busfree",
> which the driver and drive are already done with.
>
> >>aic7xxx_dev_reset returns 0x2002
> >>scsi0:0:3:0: Attempting to queue an ABORT message
> >>scsi0: Dumping Card State while idle, at SEQADDR 0x7
> >>ACCUM = 0xf7, SINDEX = 0x37, DINDEX = 0x24, ARG_2 = 0x0
> >
> > Target decideds not to return our command again, so we
> > are told to perform recovery.
>
> It makes sense that the target wouldn't return the command if that
> command already died. In that case, shouldn't we avoid trying to reset
> it? Should the mid-layer not be requesting a reset at this point? Is
> this part of the rumored 2.5 scsi rewrite?
>
> >>(scsi0:A:3:0): Abort Message Sent
> >>(scsi0:A:3:0): SCB 3 - Abort Completed.
> >>Recovery SCB completes
> >>Recovery code awake
> >>aic7xxx_abort returns 0x2002
> >
> > And we were successful.
>
> we successfully aborted, but did we have to?
>
> >>scsi: device set offline - not ready or command retry failed after bus reset:
> >>host 0 channel 0 id 3 lun 0
> >
> > But the mid-layer has already decided that it can't recover this device,
> > so it calls it dead and refuses to allow I/O to it anymore.
>
> This is definitely wrong. The drive won't do anything now without a
> reboot (or maybe removing and reinserting all scsi modules; I could do
> that but I haven't tried it).
>
> > Have you recently changed your version of cdrdao? Perhaps that program
> > is issuing a command that this particular drive simply will not accept?
>
> This is the same drive and version of cdrdao that have ripped more than
> 100 CDs. It's just this particular CD that breaks in this way at the
> same spot every time.
>
> If the DVD-ROM can't handle that CD then that's fine, but it would be
> nice if such a broken CD didn't result in not being able to use that
> drive at all anymore.
>
> thanks for your help,
>
> Jason
> -

I don't think it's a NEW bug. The BusLogic also shows the same
problem on 2.4.1

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST32171W Rev: 0484
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST318233LWV Rev: 0002
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
Vendor: SEAGATE Model: ST39102LW Rev: 0005
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: YAMAHA Model: CRW6416S Rev: 1.0b
Type: CD-ROM ANSI SCSI revision: 02

***** BusLogic SCSI Driver Version 2.1.15 of 17 August 1998 *****
Copyright 1995-1998 by Leonard N. Zubkoff <[email protected]>
Configuring BusLogic Model BT-958 PCI Wide Ultra SCSI Host Adapter
Firmware Version: 5.06J, I/O Address: 0xB400, IRQ Channel: 11/Level
PCI Bus: 0, Device: 12, Address: 0xDE800000, Host Adapter SCSI ID: 7
Parity Checking: Enabled, Extended Translation: Enabled
Synchronous Negotiation: UUUUUUF#UUUUUUUU, Wide Negotiation: Enabled
Disconnect/Reconnect: Enabled, Tagged Queuing: Enabled
Scatter/Gather Limit: 128 of 8192 segments, Mailboxes: 211
Driver Queue Depth: 211, Host Adapter Queue Depth: 192
Tagged Queue Depth: Automatic, Untagged Queue Depth: 3
Error Recovery Strategy: Default, SCSI Bus Reset: Enabled
SCSI Bus Termination: High Enabled, SCAM: Disabled
*** BusLogic BT-958 Initialized Successfully ***

Target 0: Queue Depth 28, Wide Synchronous at 40.0 MB/sec, offset 15
Target 1: Queue Depth 28, Wide Synchronous at 40.0 MB/sec, offset 15
Target 2: Queue Depth 28, Wide Synchronous at 40.0 MB/sec, offset 15
Target 4: Queue Depth 3, Synchronous at 10.0 MB/sec, offset 15

Current Driver Queue Depth: 211
Currently Allocated CCBs: 91

DATA TRANSFER STATISTICS

Target Tagged Queuing Queue Depth Active Attempted Completed
====== ============== =========== ====== ========= =========
0 Active 28 0 71 71
1 Active 28 0 2279 2279
2 Permitted 28 0 23 23
4 Not Supported 3 0 1 1

Target Read Commands Write Commands Total Bytes Read Total Bytes Written
====== ============= ============== =================== ===================
0 68 0 36864 0
1 1123 1153 19695616 5369856
2 18 2 58368 8192
4 0 0 0 0

Target Command 0-1KB 1-2KB 2-4KB 4-8KB 8-16KB
====== ======= ========= ========= ========= ========= =========
0 Read 64 4 0 0 0
0 Write 0 0 0 0 0
1 Read 0 2 0 581 35
1 Write 0 0 0 1018 129
2 Read 0 5 0 13 0
2 Write 0 0 0 2 0
4 Read 0 0 0 0 0
4 Write 0 0 0 0 0

Target Command 16-32KB 32-64KB 64-128KB 128-256KB 256KB+
====== ======= ========= ========= ========= ========= =========
0 Read 0 0 0 0 0
0 Write 0 0 0 0 0
1 Read 277 124 104 0 0
1 Write 6 0 0 0 0
2 Read 0 0 0 0 0
2 Write 0 0 0 0 0
4 Read 0 0 0 0 0
4 Write 0 0 0 0 0

ERROR RECOVERY STATISTICS

Command Aborts Bus Device Resets Host Adapter Resets
Target Requested Completed Requested Completed Requested Completed
ID \\\\ Attempted //// \\\\ Attempted //// \\\\ Attempted ////
====== ===== ===== ===== ===== ===== ===== ===== ===== =====
0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0

External Host Adapter Resets: 0
Host Adapter Internal Errors: 0

If I put a deliberate scratch in a Write-once CD, and attempt to
burn it with this:

Vendor: YAMAHA Model: CRW6416S Rev: 1.0b

cdrecord -scanbus

Cdrecord release 1.8a35 Copyright (C) 1995-1999 J?rg Schilling
scsibus0:
0,0,0 0) 'SEAGATE ' 'ST32171W ' '0484' Disk
0,1,0 1) 'SEAGATE ' 'ST318233LWV ' '0002' Disk
0,2,0 2) 'SEAGATE ' 'ST39102LW ' '0005' Disk
0,3,0 3) *
0,4,0 4) 'YAMAHA ' 'CRW6416S ' '1.0b' Removable CD-ROM
0,5,0 5) *
0,6,0 6) *
0,7,0 7) *

The result is a permanent SCSI bus error that requires the reset
or power switch to clear.

This, even though the BusLogic driver attempts to reset the bus
forever. There seems to be something that is detected as a SCSI
bus error, but isn't one. This makes the driver(s) think that
it has to reset the bus. However, whatever it finds isn't cleared
by the bus reset so it does it forever.

I think that the driver has to issue a 'TEST UNIT READY' command
and not reset the SCSI bus if the device responds. Since all the
devices on the SCSI bus will 'reboot' to their power-on state
when the SCSI bus is reset, the driver(s) have to reload all the
SCBs and queued commands after such a reset. The last time I looked,
they aborted queued commands, even though the bus reset had aborted
them anyway (if the device impliments a "hard" reset), or completed
any commands that were fully identified ("soft" reset). Note that
the driver "knows" what commands have completed. It can always
re-write or re-read anything to completely recover after a bus
reset. For hot-swap, the SCSI driver has got to be fairly robust
and be able to re-do anything without any fuss anyway.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

I was going to compile a list of innovations that could be
attributed to Microsoft. Once I realized that Ctrl-Alt-Del
was handled in the BIOS, I found that there aren't any.

2001-11-12 18:46:03

by Hans-Peter Jansen

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

On Friday, 2. November 2001 20:35, Jason Lunz wrote:
> In mlist.linux-kernel, you wrote:

[...]

> > But the mid-layer has already decided that it can't recover this device,
> > so it calls it dead and refuses to allow I/O to it anymore.
>
> This is definitely wrong. The drive won't do anything now without a
> reboot (or maybe removing and reinserting all scsi modules; I could do
> that but I haven't tried it).
>
> > Have you recently changed your version of cdrdao? Perhaps that program
> > is issuing a command that this particular drive simply will not accept?
>
> This is the same drive and version of cdrdao that have ripped more than
> 100 CDs. It's just this particular CD that breaks in this way at the
> same spot every time.
>
> If the DVD-ROM can't handle that CD then that's fine, but it would be
> nice if such a broken CD didn't result in not being able to use that
> drive at all anymore.

FYI: I've found a similar result under totally different conditions:
2.4.13-ac7
Vendor: TOSHIBA Model: DVD-ROM SD-M1502 Rev: 1012
Type: CD-ROM ANSI SCSI revision: 02
via ide-scsi

cdrdao read-cd --device /dev/sr0 --driver generic-mmc --buffers 80 -n
--eject --paranoia-mode 0 toc
[...]
?: Input/output error. : scsi sendcmd: retryable error
CDB: BE 00 00 04 2C 67 00 00 1A F8 01 00
status: 0x0 (GOOD STATUS)
cmd finished after 20.101s timeout 20s
?: Input/output error. : scsi sendcmd: retryable error
CDB: BE 00 00 04 2E 43 00 00 1A F8 01 00
status: 0x0 (GOOD STATUS)
cmd finished after 20.101s timeout 20s
[...]
killed with ^c

locked the drive completely. Need to reboot to eject the cd...
I suspect some bad interference between DVD firmware, kernel
SCSI error handling and cdrdao. A plextor reader finally
succeeded on this job (wink :)

Maybe you're barking up the wrong tree.

Hans-Peter

> thanks for your help,
>
> Jason

2001-11-12 22:01:33

[permalink] [raw]

Subject: Re: new aic7xxx bug, 2.4.13/6.2.4

On Mon, Nov 12, 2001 at 7:45PM +0100, Hans-Peter Jansen wrote:
> cdrdao read-cd --device /dev/sr0 --driver generic-mmc --buffers 80 -n
> --eject --paranoia-mode 0 toc
> [...]
> ?: Input/output error. : scsi sendcmd: retryable error
> CDB: BE 00 00 04 2C 67 00 00 1A F8 01 00
> status: 0x0 (GOOD STATUS)
> cmd finished after 20.101s timeout 20s
> ?: Input/output error. : scsi sendcmd: retryable error
> CDB: BE 00 00 04 2E 43 00 00 1A F8 01 00
> status: 0x0 (GOOD STATUS)
> cmd finished after 20.101s timeout 20s
> [...]
> killed with ^c
>
> locked the drive completely. Need to reboot to eject the cd...
> I suspect some bad interference between DVD firmware, kernel
> SCSI error handling and cdrdao. A plextor reader finally
> succeeded on this job (wink :)

I agree it's the same effect, but I disagree about the cause. It's the
fault of the scsi mid-layer; it marks the device as dead when a command
times out and won't allow further accesses to it. The fact that it
happens to both of us with different drives and different scsi drivers
(you don't even have scsi, but ide-scsi emulation) shows that the scsi
midlayer is broken in both cases.

And as Justin Gibbs suggested, I'd bet that raising the timeout on the
failing scsi command in cdrdao would probably fix this for both of us,
but I haven't had time to try it. If so, it's not a userspace bug
because it allows any user with cdrom access to disable a cdrom until
reboot.

Jason