2003-03-06 00:51:12

by Andries E. Brouwer

Subject: 2.5.63/64 do not boot: loop in scsi_error

See that 2.5.64 came out - good. Time to send the next dev_t patch.
Unfortunately 2.5.63 and 2.5.64 do not boot.

A moment ago I looked at what goes wrong, and it turns out that
scsi_error is activated
[always a bad sign - I have never seen it do any good, and
often see it crash the machine]
and an infinite loop occurs, leaving the machine rather dead.

(Total of 1 commands require eh work; scsi_unjam_host; requesting sense;
scsi_eh_done: result 0) - infinite repeat.

Have no time tonight to make a patch, but I suppose the author of
the 2.5.63 scsi_error.c changes knows what she did wrong.

Andries


[I can make 2.5.64 boot if I make sure no errors ever occur.
That means that I must disable get_evpd_page, get_serialnumber,
get_cachetype that my old stuff doesn't know about.
If I do that all is well.]


2003-03-06 01:15:01

by Patrick Mansfield

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Andries -

On Thu, Mar 06, 2003 at 02:01:38AM +0100, [email protected] wrote:
> See that 2.5.64 came out - good. Time to send the next dev_t patch.
> Unfortunately 2.5.63 and 2.5.64 do not boot.

Did you try the patch to scsi_error.c Mike A. recently posted?

> [I can make 2.5.64 boot if I make sure no errors ever occur.
> That means that I must disable get_evpd_page, get_serialnumber,
> get_cachetype that my old stuff doesn't know about.
> If I do that all is well.]

That sucks - even if error handling recovers from them.

-- Patrick Mansfield

2003-03-06 01:14:29

by Linus Torvalds

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error


On Thu, 6 Mar 2003 [email protected] wrote:
>
> See that 2.5.64 came out - good. Time to send the next dev_t patch.
> Unfortunately 2.5.63 and 2.5.64 do not boot.
>
> A moment ago I looked at what goes wrong, and it turns out that
> scsi_error is activated

See if this fixes it..

Linus

---
# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1088 -> 1.1089
# drivers/scsi/scsi_error.c 1.38 -> 1.39
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/03/05 [email protected] 1.1089
# [PATCH] Fix SCSI error handler abort case
#
# I had my list empty checks reversed if aborting and bus device reset
# failed. The condition that causes the error handler to run is still
# unknown.
# --------------------------------------------
#
diff -Nru a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Wed Mar 5 17:21:56 2003
+++ b/drivers/scsi/scsi_error.c Wed Mar 5 17:21:56 2003
@@ -1490,9 +1490,9 @@
 			       struct list_head *work_q,
 			       struct list_head *done_q)
 {
-	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
-		if (scsi_eh_bus_reset(shost, work_q, done_q))
-			if (scsi_eh_host_reset(work_q, done_q))
+	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
+		if (!scsi_eh_bus_reset(shost, work_q, done_q))
+			if (!scsi_eh_host_reset(work_q, done_q))
 				scsi_eh_offline_sdevs(work_q, done_q);
 }


2003-03-06 04:18:44

by Rob Radez

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, Mar 06, 2003 at 02:01:38AM +0100, [email protected] wrote:
> See that 2.5.64 came out - good. Time to send the next dev_t patch.
> Unfortunately 2.5.63 and 2.5.64 do not boot.
>
> A moment ago I looked at what goes wrong, and it turns out that
> scsi_error is activated
> [always a bad sign - I have never seen it do any good, and
> often see it crash the machine]
> and an infinite loop occurs, leaving the machine rather dead.
>
> (Total of 1 commands require eh work; scsi_unjam_host; requesting sense;
> scsi_eh_done: result 0) - infinite repeat.
>
> Have no time tonight to make a patch, but I suppose the author of
> the 2.5.63 scsi_error.c changes knows what she did wrong.

Even with the patch to scsi_error.c floating around, I still get the
same hang/infinite loop after the information for my scsi cd-rom is
printed on both 2.5.63 and .64.

Regards,
Rob Radez

2003-03-06 06:28:50

by Andries E. Brouwer

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

> See if this fixes it..

No, I am afraid not. My infinite loop does not pass through
scsi_eh_ready_devs().

Andries

2003-03-06 06:37:05

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

[email protected] [[email protected]] wrote:
> > See if this fixes it..
>
> No, I am afraid not. My infinite loop does not pass through
> scsi_eh_ready_devs().
>

Can you send me your console log? If you have scsi_logging=1, that would
be great also.

-andmike
--
Michael Anderson
[email protected]

2003-03-06 07:51:35

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Wed, 5 Mar 2003, Mike Anderson wrote:

> [email protected] [[email protected]] wrote:
> > > See if this fixes it..
> >
> > No, I am afraid not. My infinite loop does not pass through
> > scsi_eh_ready_devs().
> >
>
> Can you send me your console log? If you have scsi_logging=1, that would
> be great also.

See if you can figure out which paths this goes through, because on 2.5.63
it locks up completely right before printing 'scsi: Device offlined'. I
can't provide much more information at present.

scsi1 : QLogic ISP1020 SCSI on PCI bus 04 device 70 irq 89 MEM base 0xf8a18000
scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 0 lun 0
scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 1 lun 0

Zwane
--
function.linuxpower.ca

2003-03-06 08:20:04

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Zwane Mwaikambo [[email protected]] wrote:
> scsi1 : QLogic ISP1020 SCSI on PCI bus 04 device 70 irq 89 MEM base 0xf8a18000
> scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 0 lun 0
> scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 1 lun 0
>

Did this work in 2.5.62? The qlogicisp driver does have any error
handlers. Any error will cause a device offline state. You
should see a message at boot like:
ERROR: This is not a safe way to run your SCSI host
ERROR: The error handling must be added to this driver

This does not explain what is causing the error handler to start up or
do anything to help your problem.

We have been switching to the feral driver to handle the qlogic isp
card. This driver contains error handling routines. I believe the 2.5
version of the driver is in the -mm tree. I also believe Andrew has it
as a separate patch.

I did try running the qlogicisp driver and it appears to be loading for
me, but I do not have any non-disk devices on the system at the moment.

-andmike
--
Michael Anderson
[email protected]

2003-03-06 08:25:27

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Mike Anderson [[email protected]] wrote:
> Zwane Mwaikambo [[email protected]] wrote:
> > scsi1 : QLogic ISP1020 SCSI on PCI bus 04 device 70 irq 89 MEM base 0xf8a18000
> > scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 0 lun 0
> > scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 1 lun 0
> >
>
> Did this work in 2.5.62? The qlogicisp driver does have any error
The above line should read "does not have any error"
> handlers. Any error will cause a device offline state. You
> should see a message at boot like:
> ERROR: This is not a safe way to run your SCSI host
> ERROR: The error handling must be added to this driver

-andmike
--
Michael Anderson
[email protected]

2003-03-06 08:27:37

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 6 Mar 2003, Mike Anderson wrote:

> Zwane Mwaikambo [[email protected]] wrote:
> > scsi1 : QLogic ISP1020 SCSI on PCI bus 04 device 70 irq 89 MEM base 0xf8a18000
> > scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 0 lun 0
> > scsi: Device offlined - not ready or command retry failed after error recovery: host 1 channel 0 id 1 lun 0
> >
>
> Did this work in 2.5.62? The qlogicisp driver does have any error
> handlers. Any error will cause a device offline state. You
> should see a message at boot like:
> ERROR: This is not a safe way to run your SCSI host
> ERROR: The error handling must be added to this driver

That error was from a booting 2.5.62 and i do get the warnings about
missing error handling.

> This does not explain what is causing the error handler to start up or
> do anything to help your problem.

I'm not concerned about that; that was peripheral damage from another
patch (it affected irq handling). The difference is that 2.5.62 boots
after printing those errors a couple of times, but 2.5.63 doesn't.

> We have been switching to the feral driver to handle the qlogic isp
> card. This driver contains error handling routines. I believe the 2.5
> version of the driver is in the -mm tree. I also believe Andrew has it
> as a separate patch.
>
> I did try running the qlogicisp driver and it appears to be loading for
> me, but I do not have any non-disk devices on the system at the moment.

I'm currently using it with the following devices, and it survives general
usage.

scsi0 : QLogic ISP1020 SCSI on PCI bus 01 device 70 irq 41 MEM base 0xf8a16000
Vendor: IBM Model: DRHS36V Rev: 0270
Type: Direct-Access ANSI SCSI revision: 03
Vendor: IBM Model: DRHS36V Rev: 0270
Type: Direct-Access ANSI SCSI revision: 03
Vendor: PLEXTOR Model: CD-ROM PX-32CS Rev: 1.02
Type: CD-ROM ANSI SCSI revision: 02
SCSI device sda: 72170879 512-byte hdwr sectors (36951 MB)
SCSI device sda: drive cache: write through
sda: sda1 sda2 sda3
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sdb: 72170879 512-byte hdwr sectors (36951 MB)
SCSI device sdb: drive cache: write through
sdb: unknown partition table
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
sr0: scsi-1 drive

Zwane
--
function.linuxpower.ca

2003-03-06 08:42:51

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Zwane Mwaikambo [[email protected]] wrote:
> I'm not concerned about that, that was peripheral damage from another
> patch (affected irq handling), the difference being that with 2.5.62 it boots
> after printing those errors a couple of times, but with 2.5.63 it doesn't.

OK, I will keep looking at this. I believe I have a PLEXTOR CD drive in
the lab; I will add it to my qlogic isp bus and see if I can get the error
to show up. I am running cd drives on the other adapters and I am not
seeing a problem.

-andmike
--
Michael Anderson
[email protected]

2003-03-06 08:52:02

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 6 Mar 2003, Mike Anderson wrote:

> Zwane Mwaikambo [[email protected]] wrote:
> > I'm not concerned about that, that was peripheral damage from another
> > patch (affected irq handling), the difference being that with 2.5.62 it boots
> > after printing those errors a couple of times, but with 2.5.63 it doesn't.
>
> Ok I will keep looking at this , I believe I have a PLEXTOR CD in the
> lab I will add this to my qlogic isp bus and see if I can get the error
> to show up. I am running cd drives on the other adapters and I am not
> seeing a problem.

My apologies, I think I wasn't being too clear. You won't be able to
replicate that exact error by default; I got it because I killed
interrupt routing/handling on the interrupt controllers servicing the bus
that the scsi controller is on. The errors generated by the SCSI layer
in turn kill the box in 2.5.63, whilst 2.5.62 only spews those errors and
continues booting.

Zwane
--
function.linuxpower.ca

2003-03-06 09:06:09

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Zwane Mwaikambo [[email protected]] wrote:
> On Thu, 6 Mar 2003, Mike Anderson wrote:
>
> > Zwane Mwaikambo [[email protected]] wrote:
> > > I'm not concerned about that, that was peripheral damage from another
> > > patch (affected irq handling), the difference being that with 2.5.62 it boots
> > > after printing those errors a couple of times, but with 2.5.63 it doesn't.
> >
> > Ok I will keep looking at this , I believe I have a PLEXTOR CD in the
> > lab I will add this to my qlogic isp bus and see if I can get the error
> > to show up. I am running cd drives on the other adapters and I am not
> > seeing a problem.
>
> My apologies, i think i wasn't being too clear. You won't be able to
> replicate that exact error by default, i got it because i killed
> interrupt routing/handling on the interrupt controllers servicing the bus
> on which the scsi controller is on. The errors generated by the SCSI layer
> in turn kill the box in 2.5.63 whilst only spewing those errors and
> continuing boot with 2.5.62

Would it be possible for you to send me console output with
scsi_logging=1 so that I can narrow down the failure case?

-andmike
--
Michael Anderson
[email protected]

2003-03-06 09:12:21

by Andries E. Brouwer

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

> Can you send me your console log.

Patience. Fourteen hours from now I'll look at this some more.

Andries

2003-03-06 09:50:37

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 6 Mar 2003, Mike Anderson wrote:

> Would it be possible for you to send me a console output with
> scsi_logging=1 so that I can narrow down the failure case.

The following is from 2.5.63-mjb2

http://function.linuxpower.ca/patches/numaq/dmesg-scsi_logging

The [disconnect] point is where it locks up

Zwane
--
function.linuxpower.ca

2003-03-06 16:21:27

by James Bottomley

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 2003-03-06 at 03:58, Zwane Mwaikambo wrote:
> On Thu, 6 Mar 2003, Mike Anderson wrote:
>
> > Would it be possible for you to send me a console output with
> > scsi_logging=1 so that I can narrow down the failure case.
>
> The following is from 2.5.63-mjb2
>
> http://function.linuxpower.ca/patches/numaq/dmesg-scsi_logging

This log implies the error handling finished after the BDR (bus device
reset). That looks like the system doesn't have Mike's latest patch for the
logic reversal problem in scsi_eh_ready_devs. Could you check this?

Thanks,

James


2003-03-06 17:07:41

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 6 Mar 2003, James Bottomley wrote:

> This log implies the error handling finished after the BDR. That looks
> like the system doesn't have Mike's latest patch for the logic reversal
> problem in scsi_eh_ready_devs, could you check this?

static void scsi_eh_ready_devs(struct Scsi_Host *shost,
			       struct list_head *work_q,
			       struct list_head *done_q)
{
	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
		if (scsi_eh_bus_reset(shost, work_q, done_q))
			if (scsi_eh_host_reset(work_q, done_q))
				scsi_eh_offline_sdevs(work_q, done_q);
}

That is what I currently have; I'll try a boot with:

-		if (scsi_eh_bus_reset(shost, work_q, done_q))
+		if (!scsi_eh_bus_reset(shost, work_q, done_q))

Thanks,
Zwane
--
function.linuxpower.ca

2003-03-06 17:12:07

by James Bottomley

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 2003-03-06 at 11:15, Zwane Mwaikambo wrote:
> On Thu, 6 Mar 2003, James Bottomley wrote:
>
> > This log implies the error handling finished after the BDR. That looks
> > like the system doesn't have Mike's latest patch for the logic reversal
> > problem in scsi_eh_ready_devs, could you check this?
>
> static void scsi_eh_ready_devs(struct Scsi_Host *shost,
> 			       struct list_head *work_q,
> 			       struct list_head *done_q)
> {
> 	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
> 		if (scsi_eh_bus_reset(shost, work_q, done_q))
> 			if (scsi_eh_host_reset(work_q, done_q))
> 				scsi_eh_offline_sdevs(work_q, done_q);
> }
>
> That is what i currently have, i'll try a boot with;
>
> -		if (scsi_eh_bus_reset(shost, work_q, done_q))
> +		if (!scsi_eh_bus_reset(shost, work_q, done_q))
>
> Thanks,
> Zwane


Actually, all three if's need nots in front:

diff -Nru a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
+++ b/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
@@ -1490,9 +1490,9 @@
 			       struct list_head *work_q,
 			       struct list_head *done_q)
 {
-	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
-		if (scsi_eh_bus_reset(shost, work_q, done_q))
-			if (scsi_eh_host_reset(work_q, done_q))
+	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
+		if (!scsi_eh_bus_reset(shost, work_q, done_q))
+			if (!scsi_eh_host_reset(work_q, done_q))
 				scsi_eh_offline_sdevs(work_q, done_q);
 }


2003-03-06 17:12:41

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Zwane Mwaikambo [[email protected]] wrote:
> On Thu, 6 Mar 2003, James Bottomley wrote:
>
> > This log implies the error handling finished after the BDR. That looks
> > like the system doesn't have Mike's latest patch for the logic reversal
> > problem in scsi_eh_ready_devs, could you check this?
>
> static void scsi_eh_ready_devs(struct Scsi_Host *shost,
> 			       struct list_head *work_q,
> 			       struct list_head *done_q)
> {
> 	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
> 		if (scsi_eh_bus_reset(shost, work_q, done_q))
> 			if (scsi_eh_host_reset(work_q, done_q))
> 				scsi_eh_offline_sdevs(work_q, done_q);
> }
>
> That is what i currently have, i'll try a boot with;
>
> -		if (scsi_eh_bus_reset(shost, work_q, done_q))
> +		if (!scsi_eh_bus_reset(shost, work_q, done_q))
>

That change alone should not fix your problem. You should apply the whole
patch, as the reversed check on scsi_eh_bus_device_reset is the one you
should be hitting.

The patch below should apply to your kernel version.

-andmike
--
Michael Anderson
[email protected]


=====
name: 00_scsi_error_ready_devs-1.diff
version: 2003-03-05.10:39:28-0800
against: 2.5.63

scsi_error.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

=====
===== drivers/scsi/scsi_error.c 1.38 vs edited =====
--- 1.38/drivers/scsi/scsi_error.c Sat Feb 22 08:17:01 2003
+++ edited/drivers/scsi/scsi_error.c Wed Mar 5 10:14:22 2003
@@ -1490,9 +1490,9 @@
 			       struct list_head *work_q,
 			       struct list_head *done_q)
 {
-	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
-		if (scsi_eh_bus_reset(shost, work_q, done_q))
-			if (scsi_eh_host_reset(work_q, done_q))
+	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
+		if (!scsi_eh_bus_reset(shost, work_q, done_q))
+			if (!scsi_eh_host_reset(work_q, done_q))
 				scsi_eh_offline_sdevs(work_q, done_q);
 }

2003-03-06 17:31:27

by Zwane Mwaikambo

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

On Thu, 6 Mar 2003, James Bottomley wrote:

> Actually, all three if's need nots in front:
>
> diff -Nru a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> --- a/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
> +++ b/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
> @@ -1490,9 +1490,9 @@
>  			       struct list_head *work_q,
>  			       struct list_head *done_q)
>  {
> -	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
> -		if (scsi_eh_bus_reset(shost, work_q, done_q))
> -			if (scsi_eh_host_reset(work_q, done_q))
> +	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
> +		if (!scsi_eh_bus_reset(shost, work_q, done_q))
> +			if (!scsi_eh_host_reset(work_q, done_q))
>  				scsi_eh_offline_sdevs(work_q, done_q);
>  }

OK, patched 2.5.63 is back to booting like 2.5.62. Would you like any more
information?

Thanks,
Zwane
--
function.linuxpower.ca

2003-03-06 18:02:01

by Mike Anderson

Subject: Re: 2.5.63/64 do not boot: loop in scsi_error

Zwane Mwaikambo [[email protected]] wrote:
> On Thu, 6 Mar 2003, James Bottomley wrote:
>
> > Actually, all three if's need nots in front:
> >
> > diff -Nru a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > --- a/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
> > +++ b/drivers/scsi/scsi_error.c Thu Mar 6 11:21:22 2003
> > @@ -1490,9 +1490,9 @@
> >  			       struct list_head *work_q,
> >  			       struct list_head *done_q)
> >  {
> > -	if (scsi_eh_bus_device_reset(shost, work_q, done_q))
> > -		if (scsi_eh_bus_reset(shost, work_q, done_q))
> > -			if (scsi_eh_host_reset(work_q, done_q))
> > +	if (!scsi_eh_bus_device_reset(shost, work_q, done_q))
> > +		if (!scsi_eh_bus_reset(shost, work_q, done_q))
> > +			if (!scsi_eh_host_reset(work_q, done_q))
> >  				scsi_eh_offline_sdevs(work_q, done_q);
> >  }
>
> Ok patched 2.5.63 is back to booting as 2.5.62, would you like any more
> information?
>

I believe we have all the information we need.

Thanks for sending the previous data and trying the patch.

I still need to understand the error signature for Andries, as it sounds
different than what you are seeing.

-andmike
--
Michael Anderson
[email protected]