2002-06-08 20:57:28

by Borsenkow Andrej

[permalink] [raw]
Subject: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread


It looks like a media went corrupted so one expects some reasonable
error message. What happens is I get endless loop in SCSI error recovery
thread. The only way to clear up situation is to reboot and even then it
hangs on umount so I have to do SysRq-B. Please, what additional
information can be provided; I hope I can keep this media intact for
some time for testing :-)

the kernel is 2.4.18-18mdk based on 2.4.1-pre9.

I am not on lkml so please Cc me. Thank you.

-andrej


{pts/0}% cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: iomega Model: jaz 2GB Rev: E.17
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 03 Lun: 00
Vendor: PHILIPS Model: CDD3610 CD-R/RW Rev: 3.09
Type: CD-ROM ANSI SCSI revision: 02
{pts/0}% cat /proc/scsi/ppa/0
Version : 2.07 (for Linux 2.4.x)
Parport : parport0
Mode : EPP 32 bit
{pts/0}% uname -a
Linux localhost.localdomain 2.4.18-18mdkcustom #5 Sat Jun 8 23:25:23 MSD
2002 i686 unknown

SCSI log:

Jun 8 20:49:59 localhost kernel: Activating command for device 4 (1)
Jun 8 20:49:59 localhost kernel: Doing sd request, dev = 0x804, block = 944344
Jun 8 20:49:59 localhost kernel: sda : real dev = /dev/0, block = 944376
Jun 8 20:49:59 localhost kernel: sda : reading 248/248 512 byte blocks.
Jun 8 20:49:59 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12892)
Jun 8 20:49:59 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cd8b1e58, buffer = c009e000,
Jun 8 20:49:59 localhost kernel: bufflen = 126976, done = d2813b7a)
Jun 8 20:49:59 localhost kernel: queuecommand : routine at d2a2f1e0
Jun 8 20:49:59 localhost kernel: leaving scsi_dispatch_cmnd()
Jun 8 20:50:00 localhost kernel: Clearing timer for command cd8b1e00 1
Jun 8 20:50:00 localhost kernel: Command failed cd8b1e00 2 active=1 busy=1 failed=0
Jun 8 20:50:00 localhost kernel: bh08:04: old sense key None
Jun 8 20:50:00 localhost kernel: Non-extended sense class 0 code 0x0
Jun 8 20:50:00 localhost kernel: Waking error handler thread (-1)
Jun 8 20:50:00 localhost kernel: Error handler waking up
Jun 8 20:50:00 localhost kernel: scsi_unjam_host: Checking to see if we need to request sense
Jun 8 20:50:00 localhost kernel: scsi_unjam_host: Requesting sense for 4
Jun 8 20:50:00 localhost kernel: Adding timer for command cd8b1e00 at 1000 (d2a12a28)
Jun 8 20:50:00 localhost kernel: In eh_done cd8b1e00 result:0
Jun 8 20:50:00 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:00 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jun 8 20:50:00 localhost kernel: Sense requested for cd8b1e00 - result 2
Jun 8 20:50:00 localhost kernel: Info fld=0xe6960, Current bh08:04: sense key Medium Error
Jun 8 20:50:00 localhost kernel: Additional sense indicates Address mark not found for data field
Jun 8 20:50:00 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:03 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:03 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:03 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:03 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:06 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:06 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:06 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:06 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:09 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:09 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:09 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:09 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:12 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:12 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:12 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:12 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:15 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:15 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:15 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:15 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:17 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:17 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:17 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:17 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:20 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:20 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:20 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:20 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:23 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:23 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:23 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:23 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:26 localhost kernel: In eh_done cd8b1e00 result:2
Jun 8 20:50:26 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:26 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001


2002-06-08 21:00:05

by Borsenkow Andrej

[permalink] [raw]
Subject: Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread

Oh, and when I tried to simply switch off Jaz during this loop I got
panic:

kernel panic: scsi_unjam_host
Miscount of number of failed commands

Not sure if it helps very much ...

-andrej


? ???, 09.06.2002, ? 00:55, Andrej Borsenkow ???????:
>
> It looks like a media went corrupted so one expects some reasonable
> error message. What happens is I get endless loop in SCSI error recovery
> thread. The only way to clear up situation is to reboot and even then it
> hangs on umount so I have to do SysRq-B. Please, what additional
> information can be provided; I hope I can keep this media intact for
> some time for testing :-)
>
> the kernel is 2.4.18-18mdk based on 2.4.1-pre9.
>
> I am not on lkml so please Cc me. Thank you.
>
> -andrej
>
>
> {pts/0}% cat /proc/scsi/scsi
> Attached devices:
> Host: scsi0 Channel: 00 Id: 04 Lun: 00
> Vendor: iomega Model: jaz 2GB Rev: E.17
> Type: Direct-Access ANSI SCSI revision: 02
> Host: scsi1 Channel: 00 Id: 03 Lun: 00
> Vendor: PHILIPS Model: CDD3610 CD-R/RW Rev: 3.09
> Type: CD-ROM ANSI SCSI revision: 02
> {pts/0}% cat /proc/scsi/ppa/0
> Version : 2.07 (for Linux 2.4.x)
> Parport : parport0
> Mode : EPP 32 bit
> {pts/0}% uname -a
> Linux localhost.localdomain 2.4.18-18mdkcustom #5 Sat Jun 8 23:25:23 MSD
> 2002 i686 unknown
>
> SCSI log:
>
> Jun 8 20:49:59 localhost kernel: Activating command for device 4 (1)
> Jun 8 20:49:59 localhost kernel: Doing sd request, dev = 0x804, block = 944344
> Jun 8 20:49:59 localhost kernel: sda : real dev = /dev/0, block = 944376
> Jun 8 20:49:59 localhost kernel: sda : reading 248/248 512 byte blocks.
> Jun 8 20:49:59 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12892)
> Jun 8 20:49:59 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cd8b1e58, buffer = c009e000,
> Jun 8 20:49:59 localhost kernel: bufflen = 126976, done = d2813b7a)
> Jun 8 20:49:59 localhost kernel: queuecommand : routine at d2a2f1e0
> Jun 8 20:49:59 localhost kernel: leaving scsi_dispatch_cmnd()
> Jun 8 20:50:00 localhost kernel: Clearing timer for command cd8b1e00 1
> Jun 8 20:50:00 localhost kernel: Command failed cd8b1e00 2 active=1 busy=1 failed=0
> Jun 8 20:50:00 localhost kernel: bh08:04: old sense key None
> Jun 8 20:50:00 localhost kernel: Non-extended sense class 0 code 0x0
> Jun 8 20:50:00 localhost kernel: Waking error handler thread (-1)
> Jun 8 20:50:00 localhost kernel: Error handler waking up
> Jun 8 20:50:00 localhost kernel: scsi_unjam_host: Checking to see if we need to request sense
> Jun 8 20:50:00 localhost kernel: scsi_unjam_host: Requesting sense for 4
> Jun 8 20:50:00 localhost kernel: Adding timer for command cd8b1e00 at 1000 (d2a12a28)
> Jun 8 20:50:00 localhost kernel: In eh_done cd8b1e00 result:0
> Jun 8 20:50:00 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:00 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> Jun 8 20:50:00 localhost kernel: Sense requested for cd8b1e00 - result 2
> Jun 8 20:50:00 localhost kernel: Info fld=0xe6960, Current bh08:04: sense key Medium Error
> Jun 8 20:50:00 localhost kernel: Additional sense indicates Address mark not found for data field
> Jun 8 20:50:00 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:03 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:03 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:03 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:03 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:06 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:06 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:06 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:06 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:09 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:09 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:09 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:09 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:12 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:12 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:12 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:12 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:15 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:15 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:15 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:15 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:17 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:17 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:17 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:17 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:20 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:20 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:20 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:20 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:23 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:23 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:23 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
> Jun 8 20:50:23 localhost kernel: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
> Jun 8 20:50:26 localhost kernel: In eh_done cd8b1e00 result:2
> Jun 8 20:50:26 localhost kernel: send_eh_cmnd: cd8b1e00 eh_state:2002
> Jun 8 20:50:26 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001

2002-06-08 22:50:40

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread

From: Borsenkow Andrej <[email protected]>

It looks like a media went corrupted so one expects some reasonable
error message. What happens is I get endless loop in SCSI error recovery
thread. The only way to clear up situation is to reboot

That is the normal situation. I have never seen the SCSI error handler
do anything else but bring down the system. Commenting it out
really increases the stability of Linux (on my hardware).

the kernel is 2.4.18-18mdk based on 2.4.1-pre9.

Jun 8 20:50:03 localhost: In eh_done cd8b1e00 result:2
Jun 8 20:50:03 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:03 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:03 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:06 localhost: In eh_done cd8b1e00 result:2
Jun 8 20:50:06 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:06 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:06 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)
Jun 8 20:50:09 localhost: In eh_done cd8b1e00 result:2
Jun 8 20:50:09 localhost: send_eh_cmnd: cd8b1e00 eh_state:2002
Jun 8 20:50:09 localhost: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 8 20:50:09 localhost: Adding timer for command cd8b1e00 at 6000 (d2a12a28)

Hmm. It is clear what happens.
You are in scsi_error.c:scsi_send_eh_cmnd() and send a command,
the return status is NEEDS_RETRY, and the same command is sent again. Etc.

Now retrying is evil, and trying more than three times almost always
stupid, so if you add in this function

int retry_count = 0;

and

retry:
if (++retry_count > 3) {
SCpnt->eh_state = FAILED;
return;
}

that would surely be an improvement locally.
Globally one wonders: there is a retry count already - why isn't it
honoured by everybody involved? (But it isnt - it is only used in
scsi_decide_disposition().)

Where did we get this NEEDS_RETRY in the first place?
Well, scsi_check_sense() will return NEEDS_RETRY for MEDIUM_ERROR.
(Bad: the hardware also did retry many times already; it is usually very
counterproductive to keep on hitting the same bad spot on the media.)

There are also other code paths that will loop, but this one
looks most likely here.

Andries

2002-06-09 16:06:18

by Borsenkow Andrej

[permalink] [raw]
Subject: Re: 2.4.18-pre9, Iomega Jaz, PPA - endless loop in SCSI recovery thread

? ???, 09.06.2002, ? 02:50, [email protected] ???????:
> From: Borsenkow Andrej <[email protected]>
>
> It looks like a media went corrupted so one expects some reasonable
> error message. What happens is I get endless loop in SCSI error recovery
> thread. The only way to clear up situation is to reboot
>
> That is the normal situation. I have never seen the SCSI error handler
> do anything else but bring down the system. Commenting it out
> really increases the stability of Linux (on my hardware).
>

MAINTAINERS:

SCSI SUBSYSTEM
L: [email protected]
S: Unmaintained

Oh, well ...


> Hmm. It is clear what happens.
> You are in scsi_error.c:scsi_send_eh_cmnd() and send a command,
> the return status is NEEDS_RETRY, and the same command is sent again. Etc.
>
> Now retrying is evil, and trying more than three times almost always
> stupid, so if you add in this function
>
> int retry_count = 0;
>
> and
>
> retry:
> if (++retry_count > 3) {
> SCpnt->eh_state = FAILED;
> return;
> }
>
> that would surely be an improvement locally.
> Globally one wonders: there is a retry count already - why isn't it
> honoured by everybody involved? (But it isnt - it is only used in
> scsi_decide_disposition().)
>

Which actually looks like an oversight ... and comments in unjam_host
loop even indicate this must be used.

Anyway, returning FAILED does not look right here. As long as I
understand what happens, FAILED means hard error (like failing to
communicate with device) while here we have more or less "soft" error.
So I decided to copy relevant bits from scsi_decide_disposition into
scsi_eh_completed_normally which resulted in pretty nice overall error
handling:

Jun 9 19:30:45 localhost kernel: Activating command for device 4 (1)
Jun 9 19:30:45 localhost kernel: Doing sd request, dev = 0x804, block = 944448
Jun 9 19:30:45 localhost kernel: sda : real dev = /dev/0, block = 944480
Jun 9 19:30:45 localhost kernel: sda : reading 8/8 512 byte blocks.
Jun 9 19:30:45 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13964)
Jun 9 19:30:45 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = ca4d4000,
Jun 9 19:30:45 localhost kernel: bufflen = 4096, done = d2813b84)
Jun 9 19:30:45 localhost kernel: queuecommand : routine at d2a38292
Jun 9 19:30:45 localhost kernel: leaving scsi_dispatch_cmnd()
Jun 9 19:30:47 localhost kernel: Clearing timer for command cfe5a000 1
Jun 9 19:30:47 localhost kernel: Command failed cfe5a000 2 active=1 busy=1 failed=0
Jun 9 19:30:47 localhost kernel: bh08:04: old sense key None
Jun 9 19:30:47 localhost kernel: Non-extended sense class 0 code 0x0
Jun 9 19:30:47 localhost kernel: Waking error handler thread (-1)
Jun 9 19:30:47 localhost kernel: Error handler waking up
Jun 9 19:30:47 localhost kernel: scsi_unjam_host: Checking to see if we need to request sense
Jun 9 19:30:47 localhost kernel: scsi_unjam_host: Requesting sense for 4
Jun 9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13af8)
Jun 9 19:30:47 localhost kernel: In eh_done cfe5a000 result:0
Jun 9 19:30:47 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun 9 19:30:47 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jun 9 19:30:47 localhost kernel: Sense requested for cfe5a000 - result 2
Jun 9 19:30:47 localhost kernel: Info fld=0xe6960, Current bh08:04: sense key Medium Error
Jun 9 19:30:47 localhost kernel: Additional sense indicates Address mark not found for data field
Jun 9 19:30:47 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun 9 19:30:49 localhost kernel: In eh_done cfe5a000 result:2
Jun 9 19:30:49 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun 9 19:30:49 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 9 19:30:49 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun 9 19:30:51 localhost kernel: In eh_done cfe5a000 result:2
Jun 9 19:30:51 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun 9 19:30:51 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 9 19:30:51 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun 9 19:30:53 localhost kernel: In eh_done cfe5a000 result:2
Jun 9 19:30:53 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun 9 19:30:53 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2001
Jun 9 19:30:53 localhost kernel: Adding timer for command cfe5a000 at 6000 (d2a13af8)
Jun 9 19:30:55 localhost kernel: In eh_done cfe5a000 result:2
Jun 9 19:30:55 localhost kernel: send_eh_cmnd: cfe5a000 eh_state:2002
Jun 9 19:30:55 localhost kernel: scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
Jun 9 19:30:55 localhost kernel: Total of 0+0 commands on 0 devices require eh work
Jun 9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 100 (d2a13af8)
Jun 9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
Jun 9 19:30:55 localhost kernel: scsi_error.c: Waking up host to restart
Jun 9 19:30:55 localhost kernel: Error handler sleeping
Jun 9 19:30:55 localhost kernel: Command finished 1 0 0x2
Jun 9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 8000002
Jun 9 19:30:55 localhost kernel: sda : rw_intr(0, 8000002 [f0 3])
Jun 9 19:30:55 localhost kernel: scsi0: ERROR on channel 0, id 4, lun 0, CDB: Read (10) 00 00 0e 69 60 00 00 08 00
Jun 9 19:30:55 localhost kernel: Info fld=0xe6960, Current sd08:04: sense key Medium Error
Jun 9 19:30:55 localhost kernel: Additional sense indicates Address mark not found for data field
Jun 9 19:30:55 localhost kernel: I/O error: dev 08:04, sector 944448
Jun 9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
Jun 9 19:30:55 localhost kernel: Open returning 1
Jun 9 19:30:55 localhost kernel: Trying ioctl with scsi command 30
Jun 9 19:30:55 localhost kernel: scsi_do_req (host = 0, channel = 0 target = 4, buffer =00000000, bufflen = 0, done = d2a0f088, timeout = 1000, retries = 5)
Jun 9 19:30:55 localhost kernel: command : 1e 00 00 00 00 00
Jun 9 19:30:55 localhost kernel: Activating command for device 4 (1)
Jun 9 19:30:55 localhost kernel: Leaving scsi_init_cmd_from_req()
Jun 9 19:30:55 localhost kernel: Adding timer for command cfe5a000 at 1000 (d2a13964)
Jun 9 19:30:55 localhost kernel: scsi_dispatch_cmnd (host = 0, channel = 0, target = 4, command = cfe5a058, buffer = 00000000,
Jun 9 19:30:55 localhost kernel: bufflen = 0, done = d2a0f088)
Jun 9 19:30:55 localhost kernel: queuecommand : routine at d2a38292
Jun 9 19:30:55 localhost kernel: leaving scsi_dispatch_cmnd()
Jun 9 19:30:55 localhost kernel: Leaving scsi_do_req()
Jun 9 19:30:55 localhost kernel: Clearing timer for command cfe5a000 1
Jun 9 19:30:55 localhost kernel: Command finished 1 0 0x0
Jun 9 19:30:55 localhost kernel: Notifying upper driver of completion for device 4 0
Jun 9 19:30:55 localhost kernel: Deactivating command for device 4 (active=0, failed=0)
Jun 9 19:30:55 localhost kernel: Ioctl returned 0x0
Jun 9 19:30:55 localhost kernel: IOCTL Releasing command


> Where did we get this NEEDS_RETRY in the first place?
> Well, scsi_check_sense() will return NEEDS_RETRY for MEDIUM_ERROR.
> (Bad: the hardware also did retry many times already; it is usually very
> counterproductive to keep on hitting the same bad spot on the media.)
>

I do not dare to touch it. Whoever wrote the code must have some reasons
behind. Still, the obvious patch the shuts at least this code path
follows.

-andrej

> There are also other code paths that will loop, but this one
> looks most likely here.
>
> Andries
>

--- linux-2.4.18-18mdk/drivers/scsi/scsi_error.c.scsi-eh-timeout Thu May 30 16:22:37 2002
+++ linux-2.4.18-18mdk/drivers/scsi/scsi_error.c Sun Jun 9 19:18:11 2002
@@ -1100,6 +1100,8 @@
*/
STATIC int scsi_eh_completed_normally(Scsi_Cmnd * SCpnt)
{
+ int rtn;
+
/*
* First check the host byte, to see if there is anything in there
* that would indicate what we need to do.
@@ -1113,14 +1115,18 @@
* otherwise we just flag it as success.
*/
SCpnt->flags &= ~IS_RESETTING;
- return NEEDS_RETRY;
+ goto maybe_retry;
}
/*
* Rats. We are already in the error handler, so we now get to try
* and figure out what to do next. If the sense is valid, we have
* a pretty good idea of what to do. If not, we mark it as failed.
*/
- return scsi_check_sense(SCpnt);
+ rtn = scsi_check_sense(SCpnt);
+ if (rtn == NEEDS_RETRY) {
+ goto maybe_retry;
+ }
+ return rtn;
}
if (host_byte(SCpnt->result) != DID_OK) {
return FAILED;
@@ -1139,7 +1145,11 @@
case COMMAND_TERMINATED:
return SUCCESS;
case CHECK_CONDITION:
- return scsi_check_sense(SCpnt);
+ rtn = scsi_check_sense(SCpnt);
+ if (rtn == NEEDS_RETRY) {
+ goto maybe_retry;
+ }
+ return rtn;
case CONDITION_GOOD:
case INTERMEDIATE_GOOD:
case INTERMEDIATE_C_GOOD:
@@ -1154,6 +1164,17 @@
return FAILED;
}
return FAILED;
+
+ maybe_retry:
+
+ if ((++SCpnt->retries) < SCpnt->allowed) {
+ return NEEDS_RETRY;
+ } else {
+ /*
+ * No more retries - report this one back to upper level.
+ */
+ return SUCCESS;
+ }
}

/*