2005-09-13 12:55:06

by Dipankar Sarma

[permalink] [raw]
Subject: [2.6.14-rc1] sym scsi boot hang

My ppc64 box refuses to boot with 2.6.14-rc1. The console log
is included below. I see a lot of repeat of the last message
in the log and then the box hangs.

Any idea what might have caused this ?

Thanks
Dipankar

sym0: <1010-66> rev 0x1 at pci 0001:01:01.0 irq 115
sym0: No NVRAM, ID 7, Fast-80, LVD, parity checking
sym0: SCSI BUS has been reset.
scsi0 : sym-2.2.1
target0:0:8: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 31)
Vendor: IBM Model: IC35L036UCDY10-0 Rev: S25M
Type: Direct-Access ANSI SCSI revision: 03
target0:0:8: tagged command queuing enabled, command queue depth 16.
target0:0:8: Beginning Domain Validation
target0:0:8: asynchronous.
target0:0:8: wide asynchronous.
target0:0:8: FAST-80 WIDE SCSI 160.0 MB/s DT IU QAS (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:8: Write Buffer failure 700ff
target0:0:8: Domain Validation Disabing Information Units
target0:0:8: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:8: Write Buffer failure 700ff
target0:0:8: Domain Validation detected failure, dropping back
target0:0:8: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:8: Ending Domain Validation
target0:0:9: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 31)
Vendor: IBM Model: IC35L036UCDY10-0 Rev: S25M
Type: Direct-Access ANSI SCSI revision: 03
target0:0:9: tagged command queuing enabled, command queue depth 16.
target0:0:9: Beginning Domain Validation
target0:0:9: asynchronous.
target0:0:9: wide asynchronous.
target0:0:9: FAST-80 WIDE SCSI 160.0 MB/s DT IU QAS (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:9: Write Buffer failure 700ff
target0:0:9: Domain Validation Disabing Information Units
target0:0:9: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:9: Write Buffer failure 700ff
target0:0:9: Domain Validation detected failure, dropping back
target0:0:9: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:9: Ending Domain Validation
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 31)
Vendor: IBM Model: IC35L036UCDY10-0 Rev: S25M
Type: Direct-Access ANSI SCSI revision: 03
target0:0:10: tagged command queuing enabled, command queue depth 16.
target0:0:10: Beginning Domain Validation
target0:0:10: asynchronous.
target0:0:10: wide asynchronous.
target0:0:10: Domain Validation skipping write tests
target0:0:10: FAST-80 WIDE SCSI 160.0 MB/s DT IU QAS (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:10: Domain Validation Disabing Information Units
target0:0:10: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
sym0: unexpected disconnect
target0:0:10: Domain Validation detected failure, dropping back
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: Ending Domain Validation
Vendor: IBM Model: HSBPM2 PU2SCSI Rev: 0016
Type: Enclosure ANSI SCSI revision: 02
target0:0:14: Beginning Domain Validation
0:0:14:0: phase change 6-7 9@100503a8 resid=7.
0:0:14:0: phase change 6-7 9@100503a8 resid=7.
0:0:14:0: phase change 6-7 9@100503a8 resid=7.
0:0:14:0: phase change 6-7 9@100503a8 resid=7.
target0:0:14: Ending Domain Validation
Vendor: IBM Model: HSBPD4M PU3SCSI Rev: 0016
Type: Enclosure ANSI SCSI revision: 02
target0:0:15: Beginning Domain Validation
0:0:15:0: phase change 6-7 9@100503a8 resid=7.
0:0:15:0: phase change 6-7 9@100503a8 resid=7.
0:0:15:0: phase change 6-7 9@100503a8 resid=7.
0:0:15:0: phase change 6-7 9@100503a8 resid=7.
target0:0:15: Ending Domain Validation
sym1: <1010-66> rev 0x1 at pci 0001:01:01.1 irq 116
sym1: No NVRAM, ID 7, Fast-80, LVD, parity checking
sym1: SCSI BUS has been reset.
scsi1 : sym-2.2.1
sym2: <1010-66> rev 0x1 at pci 0001:41:01.0 irq 119
sym2: No NVRAM, ID 7, Fast-80, LVD, parity checking
sym2: SCSI BUS has been reset.
scsi2 : sym-2.2.1
sym3: <1010-66> rev 0x1 at pci 0001:41:01.1 irq 120
sym3: No NVRAM, ID 7, Fast-80, LVD, parity checking
sym3: SCSI BUS has been reset.
scsi3 : sym-2.2.1
st: Version 20050830, fixed bufsize 32768, s/g segs 256
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
SCSI device sda: drive cache: write through
SCSI device sda: 71096640 512-byte hdwr sectors (36401 MB)
SCSI device sda: drive cache: write through
sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
Attached scsi disk sda at scsi0, channel 0, id 8, lun 0
SCSI device sdb: 71096640 512-byte hdwr sectors (36401 MB)
SCSI device sdb: drive cache: write through
SCSI device sdb: 71096640 512-byte hdwr sectors (36401 MB)
SCSI device sdb: drive cache: write through
sdb: sdb1 sdb2
Attached scsi disk sdb at scsi0, channel 0, id 9, lun 0
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
sdc: Spinning up disk....<6> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)


2005-09-13 13:46:51

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


Hi,

> My ppc64 box refuses to boot with 2.6.14-rc1. The console log
> is included below. I see a lot of repeat of the last message
> in the log and then the box hangs.
>
> Any idea what might have caused this ?

...

> Attached scsi disk sdb at scsi0, channel 0, id 9, lun 0
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> sdc: Spinning up disk....<6> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)
> target0:0:10: FAST-40 WIDE SCSI 80.0 MB/s DT (25 ns, offset 31)

Looks like a change between 2.6.13-git11 and 2.6.14-rc1 caused this - so
something in the last 24 hours.

Anton

2005-09-13 15:02:22

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


Hi,

> Looks like a change between 2.6.13-git11 and 2.6.14-rc1 caused this - so
> something in the last 24 hours.

I just noticed a similar hang on the ibmvscsi driver. The following
backout patch seems to fix it (part of the scsi merge yesterday), I'll
look closer after I get some sleep.

Anton

diff -urN build/drivers/scsi/scsi_lib.c build2/drivers/scsi/scsi_lib.c
--- build/drivers/scsi/scsi_lib.c 2005-09-13 15:13:32.000000000 +1000
+++ build2/drivers/scsi/scsi_lib.c 2005-09-14 00:44:57.000000000 +1000
@@ -97,30 +97,6 @@
}

static void scsi_run_queue(struct request_queue *q);
-static void scsi_release_buffers(struct scsi_cmnd *cmd);
-
-/*
- * Function: scsi_unprep_request()
- *
- * Purpose: Remove all preparation done for a request, including its
- * associated scsi_cmnd, so that it can be requeued.
- *
- * Arguments: req - request to unprepare
- *
- * Lock status: Assumed that no locks are held upon entry.
- *
- * Returns: Nothing.
- */
-static void scsi_unprep_request(struct request *req)
-{
- struct scsi_cmnd *cmd = req->special;
-
- req->flags &= ~REQ_DONTPREP;
- req->special = (req->flags & REQ_SPECIAL) ? cmd->sc_request : NULL;
-
- scsi_release_buffers(cmd);
- scsi_put_command(cmd);
-}

/*
* Function: scsi_queue_insert()
@@ -140,14 +116,12 @@
* commands.
* Notes: This could be called either from an interrupt context or a
* normal process context.
- * Notes: Upon return, cmd is a stale pointer.
*/
int scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
{
struct Scsi_Host *host = cmd->device->host;
struct scsi_device *device = cmd->device;
struct request_queue *q = device->request_queue;
- struct request *req = cmd->request;
unsigned long flags;

SCSI_LOG_MLQUEUE(1,
@@ -188,9 +162,8 @@
* function. The SCSI request function detects the blocked condition
* and plugs the queue appropriately.
*/
- scsi_unprep_request(req);
spin_lock_irqsave(q->queue_lock, flags);
- blk_requeue_request(q, req);
+ blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

scsi_run_queue(q);
@@ -366,7 +339,7 @@
int result;

if (sshdr) {
- sense = kmalloc(SCSI_SENSE_BUFFERSIZE, GFP_NOIO);
+ sense = kmalloc(SCSI_SENSE_BUFFERSIZE, GFP_KERNEL);
if (!sense)
return DRIVER_ERROR << 24;
memset(sense, 0, SCSI_SENSE_BUFFERSIZE);
@@ -579,16 +552,15 @@
* I/O errors in the middle of the request, in which case
* we need to request the blocks that come after the bad
* sector.
- * Notes: Upon return, cmd is a stale pointer.
*/
static void scsi_requeue_command(struct request_queue *q, struct scsi_cmnd *cmd)
{
- struct request *req = cmd->request;
unsigned long flags;

- scsi_unprep_request(req);
+ cmd->request->flags &= ~REQ_DONTPREP;
+
spin_lock_irqsave(q->queue_lock, flags);
- blk_requeue_request(q, req);
+ blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

scsi_run_queue(q);
@@ -623,14 +595,13 @@
*
* Lock status: Assumed that lock is not held upon entry.
*
- * Returns: cmd if requeue required, NULL otherwise.
+ * Returns: cmd if requeue done or required, NULL otherwise
*
* Notes: This is called for block device requests in order to
* mark some number of sectors as complete.
*
* We are guaranteeing that the request queue will be goosed
* at some point during this call.
- * Notes: If cmd was requeued, upon return it will be a stale pointer.
*/
static struct scsi_cmnd *scsi_end_request(struct scsi_cmnd *cmd, int uptodate,
int bytes, int requeue)
@@ -653,15 +624,14 @@
if (!uptodate && blk_noretry_request(req))
end_that_request_chunk(req, 0, leftover);
else {
- if (requeue) {
+ if (requeue)
/*
* Bleah. Leftovers again. Stick the
* leftovers in the front of the
* queue, and goose the queue again.
*/
scsi_requeue_command(q, cmd);
- cmd = NULL;
- }
+
return cmd;
}
}
@@ -887,13 +857,15 @@
* requeueing right here - we will requeue down below
* when we handle the bad sectors.
*/
+ cmd = scsi_end_request(cmd, 1, good_bytes, result == 0);

/*
- * If the command completed without error, then either
- * finish off the rest of the command, or start a new one.
+ * If the command completed without error, then either finish off the
+ * rest of the command, or start a new one.
*/
- if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
+ if (result == 0 || cmd == NULL ) {
return;
+ }
}
/*
* Now, if we were good little boys and girls, Santa left us a request
@@ -908,7 +880,7 @@
* and quietly refuse further access.
*/
cmd->device->changed = 1;
- scsi_end_request(cmd, 0,
+ cmd = scsi_end_request(cmd, 0,
this_count, 1);
return;
} else {
@@ -942,7 +914,7 @@
scsi_requeue_command(q, cmd);
result = 0;
} else {
- scsi_end_request(cmd, 0, this_count, 1);
+ cmd = scsi_end_request(cmd, 0, this_count, 1);
return;
}
break;
@@ -959,7 +931,7 @@
dev_printk(KERN_INFO,
&cmd->device->sdev_gendev,
"Device not ready.\n");
- scsi_end_request(cmd, 0, this_count, 1);
+ cmd = scsi_end_request(cmd, 0, this_count, 1);
return;
case VOLUME_OVERFLOW:
if (!(req->flags & REQ_QUIET)) {
@@ -969,7 +941,7 @@
__scsi_print_command(cmd->data_cmnd);
scsi_print_sense("", cmd);
}
- scsi_end_request(cmd, 0, block_bytes, 1);
+ cmd = scsi_end_request(cmd, 0, block_bytes, 1);
return;
default:
break;
@@ -1000,7 +972,7 @@
block_bytes = req->hard_cur_sectors << 9;
if (!block_bytes)
block_bytes = req->data_len;
- scsi_end_request(cmd, 0, block_bytes, 1);
+ cmd = scsi_end_request(cmd, 0, block_bytes, 1);
}
}
EXPORT_SYMBOL(scsi_io_completion);
@@ -1146,7 +1118,7 @@
if (unlikely(!scsi_device_online(sdev))) {
printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to offline device\n",
sdev->host->host_no, sdev->id, sdev->lun);
- goto kill;
+ return BLKPREP_KILL;
}
if (unlikely(sdev->sdev_state != SDEV_RUNNING)) {
/* OK, we're not in a running state don't prep
@@ -1156,7 +1128,7 @@
* at all allowed down */
printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to dead device\n",
sdev->host->host_no, sdev->id, sdev->lun);
- goto kill;
+ return BLKPREP_KILL;
}
/* OK, we only allow special commands (i.e. not
* user initiated ones */
@@ -1188,11 +1160,11 @@
if(unlikely(specials_only) && !(req->flags & REQ_SPECIAL)) {
if(specials_only == SDEV_QUIESCE ||
specials_only == SDEV_BLOCK)
- goto defer;
+ return BLKPREP_DEFER;

printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to device being removed\n",
sdev->host->host_no, sdev->id, sdev->lun);
- goto kill;
+ return BLKPREP_KILL;
}


@@ -1210,7 +1182,7 @@
cmd->tag = req->tag;
} else {
blk_dump_rq_flags(req, "SCSI bad req");
- goto kill;
+ return BLKPREP_KILL;
}

/* note the overloading of req->special. When the tag
@@ -1248,13 +1220,8 @@
* required).
*/
ret = scsi_init_io(cmd);
- switch(ret) {
- case BLKPREP_KILL:
- /* BLKPREP_KILL return also releases the command */
- goto kill;
- case BLKPREP_DEFER:
- goto defer;
- }
+ if (ret) /* BLKPREP_KILL return also releases the command */
+ return ret;

/*
* Initialize the actual SCSI command for this request.
@@ -1264,7 +1231,7 @@
if (unlikely(!drv->init_command(cmd))) {
scsi_release_buffers(cmd);
scsi_put_command(cmd);
- goto kill;
+ return BLKPREP_KILL;
}
} else {
memcpy(cmd->cmnd, req->cmd, sizeof(cmd->cmnd));
@@ -1295,9 +1262,6 @@
if (sdev->device_busy == 0)
blk_plug_device(q);
return BLKPREP_DEFER;
- kill:
- req->errors = DID_NO_CONNECT << 16;
- return BLKPREP_KILL;
}

/*
@@ -1372,24 +1336,19 @@
}

/*
- * Kill a request for a dead device
+ * Kill requests for a dead device
*/
-static void scsi_kill_request(struct request *req, request_queue_t *q)
+static void scsi_kill_requests(request_queue_t *q)
{
- struct scsi_cmnd *cmd = req->special;
-
- blkdev_dequeue_request(req);
+ struct request *req;

- if (unlikely(cmd == NULL)) {
- printk(KERN_CRIT "impossible request in %s.\n",
- __FUNCTION__);
- BUG();
+ while ((req = elv_next_request(q)) != NULL) {
+ blkdev_dequeue_request(req);
+ req->flags |= REQ_QUIET;
+ while (end_that_request_first(req, 0, req->nr_sectors))
+ ;
+ end_that_request_last(req);
}
-
- scsi_init_cmd_errh(cmd);
- cmd->result = DID_NO_CONNECT << 16;
- atomic_inc(&cmd->device->iorequest_cnt);
- __scsi_done(cmd);
}

/*
@@ -1412,8 +1371,7 @@

if (!sdev) {
printk("scsi: killing requests for dead queue\n");
- while ((req = elv_next_request(q)) != NULL)
- scsi_kill_request(req, q);
+ scsi_kill_requests(q);
return;
}

@@ -1440,7 +1398,11 @@
if (unlikely(!scsi_device_online(sdev))) {
printk(KERN_ERR "scsi%d (%d:%d): rejecting I/O to offline device\n",
sdev->host->host_no, sdev->id, sdev->lun);
- scsi_kill_request(req, q);
+ blkdev_dequeue_request(req);
+ req->flags |= REQ_QUIET;
+ while (end_that_request_first(req, 0, req->nr_sectors))
+ ;
+ end_that_request_last(req);
continue;
}

@@ -1453,14 +1415,6 @@
sdev->device_busy++;

spin_unlock(q->queue_lock);
- cmd = req->special;
- if (unlikely(cmd == NULL)) {
- printk(KERN_CRIT "impossible request in %s.\n"
- "please mail a stack trace to "
- "[email protected]",
- __FUNCTION__);
- BUG();
- }
spin_lock(shost->host_lock);

if (!scsi_host_queue_ready(q, shost, sdev))
@@ -1479,6 +1433,15 @@
*/
spin_unlock_irq(shost->host_lock);

+ cmd = req->special;
+ if (unlikely(cmd == NULL)) {
+ printk(KERN_CRIT "impossible request in %s.\n"
+ "please mail a stack trace to "
+ "[email protected]",
+ __FUNCTION__);
+ BUG();
+ }
+
/*
* Finally, initialize any error handling parameters, and set up
* the timers for timeouts.
@@ -1514,7 +1477,6 @@
* cases (host limits or settings) should run the queue at some
* later time.
*/
- scsi_unprep_request(req);
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
sdev->device_busy--;
diff -urN build/drivers/scsi/scsi_priv.h build2/drivers/scsi/scsi_priv.h
--- build/drivers/scsi/scsi_priv.h 2005-09-13 15:13:32.000000000 +1000
+++ build2/drivers/scsi/scsi_priv.h 2005-09-14 00:44:57.000000000 +1000
@@ -124,7 +124,6 @@
extern void scsi_sysfs_device_initialize(struct scsi_device *);
extern int scsi_sysfs_target_initialize(struct scsi_device *);
extern struct scsi_transport_template blank_transport_template;
-extern void __scsi_remove_device(struct scsi_device *);

extern struct bus_type scsi_bus_type;

diff -urN build/drivers/scsi/scsi_scan.c build2/drivers/scsi/scsi_scan.c
--- build/drivers/scsi/scsi_scan.c 2005-09-13 15:13:32.000000000 +1000
+++ build2/drivers/scsi/scsi_scan.c 2005-09-14 00:44:57.000000000 +1000
@@ -870,12 +870,8 @@
out_free_sdev:
if (res == SCSI_SCAN_LUN_PRESENT) {
if (sdevp) {
- if (scsi_device_get(sdev) == 0) {
- *sdevp = sdev;
- } else {
- __scsi_remove_device(sdev);
- res = SCSI_SCAN_NO_RESPONSE;
- }
+ scsi_device_get(sdev);
+ *sdevp = sdev;
}
} else {
if (sdev->host->hostt->slave_destroy)
@@ -1264,19 +1260,6 @@
}
EXPORT_SYMBOL(__scsi_add_device);

-int scsi_add_device(struct Scsi_Host *host, uint channel,
- uint target, uint lun)
-{
- struct scsi_device *sdev =
- __scsi_add_device(host, channel, target, lun, NULL);
- if (IS_ERR(sdev))
- return PTR_ERR(sdev);
-
- scsi_device_put(sdev);
- return 0;
-}
-EXPORT_SYMBOL(scsi_add_device);
-
void scsi_rescan_device(struct device *dev)
{
struct scsi_driver *drv;
@@ -1293,8 +1276,27 @@
}
EXPORT_SYMBOL(scsi_rescan_device);

-static void __scsi_scan_target(struct device *parent, unsigned int channel,
- unsigned int id, unsigned int lun, int rescan)
+/**
+ * scsi_scan_target - scan a target id, possibly including all LUNs on the
+ * target.
+ * @sdevsca: Scsi_Device handle for scanning
+ * @shost: host to scan
+ * @channel: channel to scan
+ * @id: target id to scan
+ *
+ * Description:
+ * Scan the target id on @shost, @channel, and @id. Scan at least LUN
+ * 0, and possibly all LUNs on the target id.
+ *
+ * Use the pre-allocated @sdevscan as a handle for the scanning. This
+ * function sets sdevscan->host, sdevscan->id and sdevscan->lun; the
+ * scanning functions modify sdevscan->lun.
+ *
+ * First try a REPORT LUN scan, if that does not scan the target, do a
+ * sequential scan of LUNs on the target id.
+ **/
+void scsi_scan_target(struct device *parent, unsigned int channel,
+ unsigned int id, unsigned int lun, int rescan)
{
struct Scsi_Host *shost = dev_to_shost(parent);
int bflags = 0;
@@ -1308,7 +1310,9 @@
*/
return;

+
starget = scsi_alloc_target(parent, channel, id);
+
if (!starget)
return;

@@ -1354,33 +1358,6 @@

put_device(&starget->dev);
}
-
-/**
- * scsi_scan_target - scan a target id, possibly including all LUNs on the
- * target.
- * @parent: host to scan
- * @channel: channel to scan
- * @id: target id to scan
- * @lun: Specific LUN to scan or SCAN_WILD_CARD
- * @rescan: passed to LUN scanning routines
- *
- * Description:
- * Scan the target id on @parent, @channel, and @id. Scan at least LUN 0,
- * and possibly all LUNs on the target id.
- *
- * First try a REPORT LUN scan, if that does not scan the target, do a
- * sequential scan of LUNs on the target id.
- **/
-void scsi_scan_target(struct device *parent, unsigned int channel,
- unsigned int id, unsigned int lun, int rescan)
-{
- struct Scsi_Host *shost = dev_to_shost(parent);
-
- down(&shost->scan_mutex);
- if (scsi_host_scan_allowed(shost))
- __scsi_scan_target(parent, channel, id, lun, rescan);
- up(&shost->scan_mutex);
-}
EXPORT_SYMBOL(scsi_scan_target);

static void scsi_scan_channel(struct Scsi_Host *shost, unsigned int channel,
@@ -1406,12 +1383,10 @@
order_id = shost->max_id - id - 1;
else
order_id = id;
- __scsi_scan_target(&shost->shost_gendev, channel,
- order_id, lun, rescan);
+ scsi_scan_target(&shost->shost_gendev, channel, order_id, lun, rescan);
}
else
- __scsi_scan_target(&shost->shost_gendev, channel,
- id, lun, rescan);
+ scsi_scan_target(&shost->shost_gendev, channel, id, lun, rescan);
}

int scsi_scan_host_selected(struct Scsi_Host *shost, unsigned int channel,
@@ -1509,15 +1484,12 @@
*/
struct scsi_device *scsi_get_host_dev(struct Scsi_Host *shost)
{
- struct scsi_device *sdev = NULL;
+ struct scsi_device *sdev;
struct scsi_target *starget;

- down(&shost->scan_mutex);
- if (!scsi_host_scan_allowed(shost))
- goto out;
starget = scsi_alloc_target(&shost->shost_gendev, 0, shost->this_id);
if (!starget)
- goto out;
+ return NULL;

sdev = scsi_alloc_sdev(starget, 0, NULL);
if (sdev) {
@@ -1525,8 +1497,6 @@
sdev->borken = 0;
}
put_device(&starget->dev);
- out:
- up(&shost->scan_mutex);
return sdev;
}
EXPORT_SYMBOL(scsi_get_host_dev);
diff -urN build/drivers/scsi/scsi_sysfs.c build2/drivers/scsi/scsi_sysfs.c
--- build/drivers/scsi/scsi_sysfs.c 2005-09-13 15:13:32.000000000 +1000
+++ build2/drivers/scsi/scsi_sysfs.c 2005-09-14 00:44:57.000000000 +1000
@@ -653,7 +653,7 @@
error = attr_add(&sdev->sdev_gendev,
sdev->host->hostt->sdev_attrs[i]);
if (error) {
- __scsi_remove_device(sdev);
+ scsi_remove_device(sdev);
goto out;
}
}
@@ -667,7 +667,7 @@
scsi_sysfs_sdev_attrs[i]);
error = device_create_file(&sdev->sdev_gendev, attr);
if (error) {
- __scsi_remove_device(sdev);
+ scsi_remove_device(sdev);
goto out;
}
}
@@ -687,10 +687,17 @@
return error;
}

-void __scsi_remove_device(struct scsi_device *sdev)
+/**
+ * scsi_remove_device - unregister a device from the scsi bus
+ * @sdev: scsi_device to unregister
+ **/
+void scsi_remove_device(struct scsi_device *sdev)
{
+ struct Scsi_Host *shost = sdev->host;
+
+ down(&shost->scan_mutex);
if (scsi_device_set_state(sdev, SDEV_CANCEL) != 0)
- return;
+ goto out;

class_device_unregister(&sdev->sdev_classdev);
device_del(&sdev->sdev_gendev);
@@ -699,17 +706,8 @@
sdev->host->hostt->slave_destroy(sdev);
transport_unregister_device(&sdev->sdev_gendev);
put_device(&sdev->sdev_gendev);
-}
-
-/**
- * scsi_remove_device - unregister a device from the scsi bus
- * @sdev: scsi_device to unregister
- **/
-void scsi_remove_device(struct scsi_device *sdev)
-{
- down(&sdev->host->scan_mutex);
- __scsi_remove_device(sdev);
- up(&sdev->host->scan_mutex);
+out:
+ up(&shost->scan_mutex);
}
EXPORT_SYMBOL(scsi_remove_device);

diff -urN build/include/scsi/scsi_device.h build2/include/scsi/scsi_device.h
--- build/include/scsi/scsi_device.h 2005-09-13 15:13:32.000000000 +1000
+++ build2/include/scsi/scsi_device.h 2005-09-14 00:44:57.000000000 +1000
@@ -178,8 +178,8 @@

extern struct scsi_device *__scsi_add_device(struct Scsi_Host *,
uint, uint, uint, void *hostdata);
-extern int scsi_add_device(struct Scsi_Host *host, uint channel,
- uint target, uint lun);
+#define scsi_add_device(host, channel, target, lun) \
+ __scsi_add_device(host, channel, target, lun, NULL)
extern void scsi_remove_device(struct scsi_device *);
extern int scsi_device_cancel(struct scsi_device *, int);

2005-09-13 16:36:07

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 00:29 +1000, Anton Blanchard wrote:
> I just noticed a similar hang on the ibmvscsi driver. The following
> backout patch seems to fix it (part of the scsi merge yesterday), I'll
> look closer after I get some sleep.

If that's the cause, it's probably a double down of the host scan
semaphore somewhere in the code. alt-sysrq-t should work in this case,
can you get a stack trace of the blocked process?

Thanks,

James


2005-09-13 17:18:10

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


Hi,

> If that's the cause, it's probably a double down of the host scan
> semaphore somewhere in the code. alt-sysrq-t should work in this case,
> can you get a stack trace of the blocked process?

Good idea, I wonder why the IO isnt completing.

Anton

[c0000000004945fc] schedule+0x63c/0xf70
[c0000000004954c8] wait_for_completion+0xb8/0x140
[c0000000002a3800] blk_execute_rq+0xb0/0x120
[c000000000338a98] scsi_execute+0xf8/0x150
[c000000000338bc0] scsi_execute_req+0xd0/0x140
[c00000000033bce4] scsi_probe_and_add_lun+0x204/0x9f0
[c00000000033cfc4] __scsi_scan_target+0x164/0x4f0
[c00000000033d430] scsi_scan_channel+0xe0/0x120
[c00000000033d598] scsi_scan_host_selected+0x128/0x1d0
[c000000000361c20] ibmvscsi_probe+0x270/0x400
[c0000000000367fc] vio_bus_probe+0x7c/0x90
[c000000000299a78] driver_probe_device+0x98/0x160
[c000000000299ce8] __driver_attach+0xa8/0xd0
[c000000000298938] bus_for_each_dev+0x88/0xe0
[c000000000299738] driver_attach+0x28/0x40
[c000000000299074] bus_add_driver+0xc4/0x200
[c00000000029a1cc] driver_register+0x5c/0x80
[c0000000000365b0] vio_register_driver+0x50/0x70
[c000000000565bac] ibmvscsi_module_init+0x1c/0x40

2005-09-13 17:32:46

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 02:47 +1000, Anton Blanchard wrote:
> Good idea, I wonder why the IO isnt completing.
>
> Anton
>
> [c0000000004945fc] schedule+0x63c/0xf70
> [c0000000004954c8] wait_for_completion+0xb8/0x140
> [c0000000002a3800] blk_execute_rq+0xb0/0x120
> [c000000000338a98] scsi_execute+0xf8/0x150
> [c000000000338bc0] scsi_execute_req+0xd0/0x140
> [c00000000033bce4] scsi_probe_and_add_lun+0x204/0x9f0
> [c00000000033cfc4] __scsi_scan_target+0x164/0x4f0
> [c00000000033d430] scsi_scan_channel+0xe0/0x120
> [c00000000033d598] scsi_scan_host_selected+0x128/0x1d0
> [c000000000361c20] ibmvscsi_probe+0x270/0x400
> [c0000000000367fc] vio_bus_probe+0x7c/0x90
> [c000000000299a78] driver_probe_device+0x98/0x160
> [c000000000299ce8] __driver_attach+0xa8/0xd0
> [c000000000298938] bus_for_each_dev+0x88/0xe0
> [c000000000299738] driver_attach+0x28/0x40
> [c000000000299074] bus_add_driver+0xc4/0x200
> [c00000000029a1cc] driver_register+0x5c/0x80
> [c0000000000365b0] vio_register_driver+0x50/0x70
> [c000000000565bac] ibmvscsi_module_init+0x1c/0x40

That trace says the ibmvscsi driver (not sym2) has lost an I/O

James


2005-09-13 17:39:38

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Tue, Sep 13, 2005 at 12:32:40PM -0500, James Bottomley wrote:
> On Wed, 2005-09-14 at 02:47 +1000, Anton Blanchard wrote:
> > Good idea, I wonder why the IO isnt completing.
> >
> > Anton
> >
> > [c0000000004945fc] schedule+0x63c/0xf70
> > [c0000000004954c8] wait_for_completion+0xb8/0x140
> > [c0000000002a3800] blk_execute_rq+0xb0/0x120
> > [c000000000338a98] scsi_execute+0xf8/0x150
> > [c000000000338bc0] scsi_execute_req+0xd0/0x140
> > [c00000000033bce4] scsi_probe_and_add_lun+0x204/0x9f0
> > [c00000000033cfc4] __scsi_scan_target+0x164/0x4f0
> > [c00000000033d430] scsi_scan_channel+0xe0/0x120
> > [c00000000033d598] scsi_scan_host_selected+0x128/0x1d0
> > [c000000000361c20] ibmvscsi_probe+0x270/0x400
> > [c0000000000367fc] vio_bus_probe+0x7c/0x90
> > [c000000000299a78] driver_probe_device+0x98/0x160
> > [c000000000299ce8] __driver_attach+0xa8/0xd0
> > [c000000000298938] bus_for_each_dev+0x88/0xe0
> > [c000000000299738] driver_attach+0x28/0x40
> > [c000000000299074] bus_add_driver+0xc4/0x200
> > [c00000000029a1cc] driver_register+0x5c/0x80
> > [c0000000000365b0] vio_register_driver+0x50/0x70
> > [c000000000565bac] ibmvscsi_module_init+0x1c/0x40
>
> That trace says the ibmvscsi driver (not sym2) has lost an I/O

Anton's trace is from a different machine. Mine has symbios.
Backing out the scsi merge (Anton's backout patch) of a day ago
fixes my problem.

Thanks
Dipankar

2005-09-13 17:43:21

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


> That trace says the ibmvscsi driver (not sym2) has lost an I/O

Yep, this is another machine. The sym2 box is hanging at boot also.
Ill get a backtrace on it next.

Anton

2005-09-14 08:36:39

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


Hi,

> If that's the cause, it's probably a double down of the host scan
> semaphore somewhere in the code. alt-sysrq-t should work in this case,
> can you get a stack trace of the blocked process?

It appears to be this patch:

[SCSI] SCSI core: fix leakage of scsi_cmnd's

From: Alan Stern <[email protected]>

This patch (as559b) adds a new routine, scsi_unprep_request, which
gets called every place a request is requeued. (That includes
scsi_queue_insert as well as scsi_requeue_command.) It also changes
scsi_kill_requests to make it call __scsi_done with result equal to
DID_NO_CONNECT << 16. (I'm not sure if it's necessary to call
scsi_init_cmd_errh here; maybe you can check on that.) Finally, the
patch changes the return value from scsi_end_request, to avoid
returning a stale pointer in the case where the request was requeued.
Fortunately the return value is used in only place, and the change
actually simplified it.


And in particular it looks like the scsi_unprep_request in
scsi_queue_insert is causing it. The following patch fixes the boot
problems on the vscsi machine:


Index: build/drivers/scsi/scsi_lib.c
===================================================================
--- build.orig/drivers/scsi/scsi_lib.c 2005-09-14 18:23:34.000000000 +1000
+++ build/drivers/scsi/scsi_lib.c 2005-09-14 18:27:33.000000000 +1000
@@ -188,7 +188,7 @@
* function. The SCSI request function detects the blocked condition
* and plugs the queue appropriately.
*/
- scsi_unprep_request(req);
+ //scsi_unprep_request(req);
spin_lock_irqsave(q->queue_lock, flags);
blk_requeue_request(q, req);
spin_unlock_irqrestore(q->queue_lock, flags);

2005-09-14 15:49:49

by Alan Stern

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 14 Sep 2005, Anton Blanchard wrote:

> Hi,
>
> > If that's the cause, it's probably a double down of the host scan
> > semaphore somewhere in the code. alt-sysrq-t should work in this case,
> > can you get a stack trace of the blocked process?
>
> It appears to be this patch:
>
> [SCSI] SCSI core: fix leakage of scsi_cmnd's
>
> From: Alan Stern <[email protected]>

> And in particular it looks like the scsi_unprep_request in
> scsi_queue_insert is causing it. The following patch fixes the boot
> problems on the vscsi machine:

In general the scsi_unprep_request routine is correct and needs to be
there. The one part that might be questionable is the assignment to
req->special. It may turn out that the real solution is to have
scsi_execute set req->special to NULL; I assumed it would be NULL already
but perhaps I was wrong.

(James, I see a possible problem with scsi_insert_special_req. It adds to
the queue a request with REQ_DONTPREP set. How can such a request, with
no associated scsi_cmnd, ever work? Also, won't scsi_end_request and
__scsi_release_request end up putting the same scsi_command twice?)

Here is a patch that addresses the first problem and fixes up a few other
loose ends. Please see if it helps.

Alan Stern



Index: usb-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_lib.c
+++ usb-2.6/drivers/scsi/scsi_lib.c
@@ -116,7 +116,13 @@ static void scsi_unprep_request(struct r
struct scsi_cmnd *cmd = req->special;

req->flags &= ~REQ_DONTPREP;
- req->special = (req->flags & REQ_SPECIAL) ? cmd->sc_request : NULL;
+ req->special = NULL;
+ if (req->flags & REQ_SPECIAL) {
+ struct scsi_request *sreq = cmd->sc_request;
+
+ if (sreq->sr_magic == SCSI_REQ_MAGIC)
+ req->special = sreq;
+ }

scsi_release_buffers(cmd);
scsi_put_command(cmd);
@@ -343,6 +349,7 @@ int scsi_execute(struct scsi_device *sde
req->sense_len = 0;
req->timeout = timeout;
req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;
+ req->special = NULL;

/*
* head injection *required* here otherwise quiesce won't work
@@ -1072,9 +1079,6 @@ static int scsi_init_io(struct scsi_cmnd
printk(KERN_ERR "req nr_sec %lu, cur_nr_sec %u\n", req->nr_sectors,
req->current_nr_sectors);

- /* release the command and kill it */
- scsi_release_buffers(cmd);
- scsi_put_command(cmd);
return BLKPREP_KILL;
}

@@ -1176,13 +1180,13 @@ static int scsi_prep_fn(struct request_q
if (req->flags & REQ_SPECIAL && req->special) {
struct scsi_request *sreq = req->special;

- if (sreq->sr_magic == SCSI_REQ_MAGIC) {
- cmd = scsi_get_command(sreq->sr_device, GFP_ATOMIC);
- if (unlikely(!cmd))
- goto defer;
- scsi_init_cmd_from_req(cmd, sreq);
- } else
- cmd = req->special;
+ if (sreq->sr_magic != SCSI_REQ_MAGIC)
+ printk(KERN_ERR "invalid sr_magic in %s\n",
+ __FUNCTION__);
+ cmd = scsi_get_command(sreq->sr_device, GFP_ATOMIC);
+ if (unlikely(!cmd))
+ goto defer;
+ scsi_init_cmd_from_req(cmd, sreq);
} else if (req->flags & (REQ_CMD | REQ_BLOCK_PC)) {

if(unlikely(specials_only) && !(req->flags & REQ_SPECIAL)) {
@@ -1194,17 +1198,14 @@ static int scsi_prep_fn(struct request_q
sdev->host->host_no, sdev->id, sdev->lun);
goto kill;
}
-

/*
* Now try and find a command block that we can use.
*/
- if (!req->special) {
- cmd = scsi_get_command(sdev, GFP_ATOMIC);
- if (unlikely(!cmd))
- goto defer;
- } else
- cmd = req->special;
+ cmd = scsi_get_command(sdev, GFP_ATOMIC);
+ if (unlikely(!cmd))
+ goto defer;
+ cmd->sc_request = NULL;

/* pull a tag out of the request if we have one */
cmd->tag = req->tag;
@@ -1250,7 +1251,7 @@ static int scsi_prep_fn(struct request_q
ret = scsi_init_io(cmd);
switch(ret) {
case BLKPREP_KILL:
- /* BLKPREP_KILL return also releases the command */
+ scsi_unprep_request(req);
goto kill;
case BLKPREP_DEFER:
goto defer;

2005-09-14 16:52:27

by Mike Christie

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

Alan Stern wrote:
> On Wed, 14 Sep 2005, Anton Blanchard wrote:
>
>
>>Hi,
>>
>>
>>>If that's the cause, it's probably a double down of the host scan
>>>semaphore somewhere in the code. alt-sysrq-t should work in this case,
>>>can you get a stack trace of the blocked process?
>>
>>It appears to be this patch:
>>
>> [SCSI] SCSI core: fix leakage of scsi_cmnd's
>>
>> From: Alan Stern <[email protected]>
>
>
>>And in particular it looks like the scsi_unprep_request in
>>scsi_queue_insert is causing it. The following patch fixes the boot
>>problems on the vscsi machine:
>
>
> In general the scsi_unprep_request routine is correct and needs to be
> there. The one part that might be questionable is the assignment to
> req->special. It may turn out that the real solution is to have
> scsi_execute set req->special to NULL; I assumed it would be NULL already
> but perhaps I was wrong.

I think we have scsi_execute and friends setting REQ_SPECIAL. This is
could cause a problem becuase it does not have a scsi_request.

2005-09-14 16:53:58

by Mike Christie

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

Mike Christie wrote:
> Alan Stern wrote:
>
>> On Wed, 14 Sep 2005, Anton Blanchard wrote:
>>
>>
>>> Hi,
>>>
>>>
>>>> If that's the cause, it's probably a double down of the host scan
>>>> semaphore somewhere in the code. alt-sysrq-t should work in this case,
>>>> can you get a stack trace of the blocked process?
>>>
>>>
>>> It appears to be this patch:
>>>
>>> [SCSI] SCSI core: fix leakage of scsi_cmnd's
>>>
>>> From: Alan Stern <[email protected]>
>>
>>
>>
>>> And in particular it looks like the scsi_unprep_request in
>>> scsi_queue_insert is causing it. The following patch fixes the boot
>>> problems on the vscsi machine:
>>
>>
>>
>> In general the scsi_unprep_request routine is correct and needs to be
>> there. The one part that might be questionable is the assignment to
>> req->special. It may turn out that the real solution is to have
>> scsi_execute set req->special to NULL; I assumed it would be NULL already
>> but perhaps I was wrong.
>
>
> I think we have scsi_execute and friends setting REQ_SPECIAL. This is
> could cause a problem becuase it does not have a scsi_request.
>

well now actually it won't becuase sc_request should be null for those
scsi_execute block pc commands I think.

2005-09-14 19:33:54

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 18:06 +1000, Anton Blanchard wrote:
> And in particular it looks like the scsi_unprep_request in
> scsi_queue_insert is causing it. The following patch fixes the boot
> problems on the vscsi machine:

OK, my fault. Your fix is almost correct .. I was going to do this
eventually, honest, because there's no need to unprep and reprep a
command that comes in through scsi_queue_insert().

However, I decided to leave it in to exercise the scsi_unprep_request()
path just to make sure it was working. What's happening, I think, is
that we also use this path for retries. Since we kill and reget the
command each time, the retries decrement is never seen, so we're
retrying forever.

This should be the correct reversal.

James
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -140,14 +140,12 @@ static void scsi_unprep_request(struct r
* commands.
* Notes: This could be called either from an interrupt context or a
* normal process context.
- * Notes: Upon return, cmd is a stale pointer.
*/
int scsi_queue_insert(struct scsi_cmnd *cmd, int reason)
{
struct Scsi_Host *host = cmd->device->host;
struct scsi_device *device = cmd->device;
struct request_queue *q = device->request_queue;
- struct request *req = cmd->request;
unsigned long flags;

SCSI_LOG_MLQUEUE(1,
@@ -188,9 +186,8 @@ int scsi_queue_insert(struct scsi_cmnd *
* function. The SCSI request function detects the blocked condition
* and plugs the queue appropriately.
*/
- scsi_unprep_request(req);
spin_lock_irqsave(q->queue_lock, flags);
- blk_requeue_request(q, req);
+ blk_requeue_request(q, cmd->request);
spin_unlock_irqrestore(q->queue_lock, flags);

scsi_run_queue(q);


2005-09-14 20:19:51

by Alan Stern

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 14 Sep 2005, James Bottomley wrote:

> OK, my fault. Your fix is almost correct .. I was going to do this
> eventually, honest, because there's no need to unprep and reprep a
> command that comes in through scsi_queue_insert().
>
> However, I decided to leave it in to exercise the scsi_unprep_request()
> path just to make sure it was working. What's happening, I think, is
> that we also use this path for retries. Since we kill and reget the
> command each time, the retries decrement is never seen, so we're
> retrying forever.
>
> This should be the correct reversal.

Then shouldn't you also avoid unprepping and reprepping a command that is
deferred because the host isn't ready?

And isn't it necessary to make sure that req->special is NULL when
submitting a special request with no scsi_request, and that
cmd->sc_request is NULL when associating a command block to a special
request with no scsi_request?

In short, is this patch needed?

Alan Stern



Index: usb-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_lib.c
+++ usb-2.6/drivers/scsi/scsi_lib.c
@@ -343,6 +343,7 @@ int scsi_execute(struct scsi_device *sde
req->sense_len = 0;
req->timeout = timeout;
req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;
+ req->special = NULL;

/*
* head injection *required* here otherwise quiesce won't work
@@ -1072,9 +1073,6 @@ static int scsi_init_io(struct scsi_cmnd
printk(KERN_ERR "req nr_sec %lu, cur_nr_sec %u\n", req->nr_sectors,
req->current_nr_sectors);

- /* release the command and kill it */
- scsi_release_buffers(cmd);
- scsi_put_command(cmd);
return BLKPREP_KILL;
}

@@ -1205,6 +1203,7 @@ static int scsi_prep_fn(struct request_q
goto defer;
} else
cmd = req->special;
+ cmd->sc_request = NULL;

/* pull a tag out of the request if we have one */
cmd->tag = req->tag;
@@ -1250,7 +1249,7 @@ static int scsi_prep_fn(struct request_q
ret = scsi_init_io(cmd);
switch(ret) {
case BLKPREP_KILL:
- /* BLKPREP_KILL return also releases the command */
+ scsi_unprep_request(req);
goto kill;
case BLKPREP_DEFER:
goto defer;
@@ -1514,7 +1513,6 @@ static void scsi_request_fn(struct reque
* cases (host limits or settings) should run the queue at some
* later time.
*/
- scsi_unprep_request(req);
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
sdev->device_busy--;

2005-09-14 20:35:22

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 11:49 -0400, Alan Stern wrote:
> (James, I see a possible problem with scsi_insert_special_req. It adds to
> the queue a request with REQ_DONTPREP set. How can such a request, with
> no associated scsi_cmnd, ever work? Also, won't scsi_end_request and
> __scsi_release_request end up putting the same scsi_command twice?)

It's a historical anomaly which will hopefully die when we finally
manage to get sg and st converted to the generic request infrastructure.
Then scsi_request can be killed and this along with it.

What used to happen (as the comment implies) is that drivers would
allocate a single request and then reuse it for multiple independent
commands. Since they weren't too picky about cleaning it up after each
use, we had to reset the DONTPREP flag to ensure each new invocation was
actually correctly prepared.

James


2005-09-14 20:44:26

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 16:19 -0400, Alan Stern wrote:
> On Wed, 14 Sep 2005, James Bottomley wrote:
> Then shouldn't you also avoid unprepping and reprepping a command that is
> deferred because the host isn't ready?

Yes ... really the only case for unprep is when we've partially released
the command (like in scsi_io_completion) where we need to tear the rest
of it down.

The rule should be that if it needs preparing, then it's in the same
state as the block layer would send it to us in (with no appendages).

For most requeues, we have all the prepared resources attached, so they
don't need tearing down and repreparing.

> And isn't it necessary to make sure that req->special is NULL when
> submitting a special request with no scsi_request, and that

Yes, but only if the command will be prepared again.

> cmd->sc_request is NULL when associating a command block to a special
> request with no scsi_request?

No, that's zeroed out when the command is allocated. It's only set in
the path that sends down a scsi_request.

James



> In short, is this patch needed?
>
> Alan Stern
>
>
>
> Index: usb-2.6/drivers/scsi/scsi_lib.c
> ===================================================================
> --- usb-2.6.orig/drivers/scsi/scsi_lib.c
> +++ usb-2.6/drivers/scsi/scsi_lib.c
> @@ -343,6 +343,7 @@ int scsi_execute(struct scsi_device *sde
> req->sense_len = 0;
> req->timeout = timeout;
> req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;
> + req->special = NULL;
>
> /*
> * head injection *required* here otherwise quiesce won't work
> @@ -1072,9 +1073,6 @@ static int scsi_init_io(struct scsi_cmnd
> printk(KERN_ERR "req nr_sec %lu, cur_nr_sec %u\n", req->nr_sectors,
> req->current_nr_sectors);
>
> - /* release the command and kill it */
> - scsi_release_buffers(cmd);
> - scsi_put_command(cmd);
> return BLKPREP_KILL;
> }
>
> @@ -1205,6 +1203,7 @@ static int scsi_prep_fn(struct request_q
> goto defer;
> } else
> cmd = req->special;
> + cmd->sc_request = NULL;
>
> /* pull a tag out of the request if we have one */
> cmd->tag = req->tag;
> @@ -1250,7 +1249,7 @@ static int scsi_prep_fn(struct request_q
> ret = scsi_init_io(cmd);
> switch(ret) {
> case BLKPREP_KILL:
> - /* BLKPREP_KILL return also releases the command */
> + scsi_unprep_request(req);
> goto kill;
> case BLKPREP_DEFER:
> goto defer;
> @@ -1514,7 +1513,6 @@ static void scsi_request_fn(struct reque
> * cases (host limits or settings) should run the queue at some
> * later time.
> */
> - scsi_unprep_request(req);
> spin_lock_irq(q->queue_lock);
> blk_requeue_request(q, req);
> sdev->device_busy--;
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2005-09-14 21:33:41

by Alan Stern

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 14 Sep 2005, James Bottomley wrote:

> On Wed, 2005-09-14 at 16:19 -0400, Alan Stern wrote:
> > On Wed, 14 Sep 2005, James Bottomley wrote:
> > Then shouldn't you also avoid unprepping and reprepping a command that is
> > deferred because the host isn't ready?
>
> Yes ... really the only case for unprep is when we've partially released
> the command (like in scsi_io_completion) where we need to tear the rest
> of it down.

In other words, in scsi_requeue_command and nowhere else.

> The rule should be that if it needs preparing, then it's in the same
> state as the block layer would send it to us in (with no appendages).

That's what the unprep routine was supposed to accomplish.

> For most requeues, we have all the prepared resources attached, so they
> don't need tearing down and repreparing.
>
> > And isn't it necessary to make sure that req->special is NULL when
> > submitting a special request with no scsi_request, and that
>
> Yes, but only if the command will be prepared again.

Or will be prepared for the first time, as in scsi_execute. As far as I
can tell, a new struct request is not set to all 0's. So if you queue a
request with REQ_SPECIAL set and you fail to clear req->special, you're in
trouble. Do you have any idea why this hasn't been causing errors all
along?

> > cmd->sc_request is NULL when associating a command block to a special
> > request with no scsi_request?
>
> No, that's zeroed out when the command is allocated. It's only set in
> the path that sends down a scsi_request.

Oops, yes. I must have been reading __scsi_get_command instead of
scsi_get_command.

Okay, then how does this patch look (moved the routine over to where it
gets used, plus two real changes)?

Alan Stern



Index: usb-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_lib.c
+++ usb-2.6/drivers/scsi/scsi_lib.c
@@ -100,29 +100,6 @@ static void scsi_run_queue(struct reques
static void scsi_release_buffers(struct scsi_cmnd *cmd);

/*
- * Function: scsi_unprep_request()
- *
- * Purpose: Remove all preparation done for a request, including its
- * associated scsi_cmnd, so that it can be requeued.
- *
- * Arguments: req - request to unprepare
- *
- * Lock status: Assumed that no locks are held upon entry.
- *
- * Returns: Nothing.
- */
-static void scsi_unprep_request(struct request *req)
-{
- struct scsi_cmnd *cmd = req->special;
-
- req->flags &= ~REQ_DONTPREP;
- req->special = (req->flags & REQ_SPECIAL) ? cmd->sc_request : NULL;
-
- scsi_release_buffers(cmd);
- scsi_put_command(cmd);
-}
-
-/*
* Function: scsi_queue_insert()
*
* Purpose: Insert a command in the midlevel queue.
@@ -343,6 +320,7 @@ int scsi_execute(struct scsi_device *sde
req->sense_len = 0;
req->timeout = timeout;
req->flags |= flags | REQ_BLOCK_PC | REQ_SPECIAL | REQ_QUIET;
+ req->special = NULL;

/*
* head injection *required* here otherwise quiesce won't work
@@ -564,6 +542,29 @@ static void scsi_run_queue(struct reques
}

/*
+ * Function: scsi_unprep_request()
+ *
+ * Purpose: Remove all preparation done for a request, including its
+ * associated scsi_cmnd, so that it can be requeued.
+ *
+ * Arguments: req - request to unprepare
+ *
+ * Lock status: Assumed that no locks are held upon entry.
+ *
+ * Returns: Nothing.
+ */
+static void scsi_unprep_request(struct request *req)
+{
+ struct scsi_cmnd *cmd = req->special;
+
+ req->flags &= ~REQ_DONTPREP;
+ req->special = (req->flags & REQ_SPECIAL) ? cmd->sc_request : NULL;
+
+ scsi_release_buffers(cmd);
+ scsi_put_command(cmd);
+}
+
+/*
* Function: scsi_requeue_command()
*
* Purpose: Handle post-processing of completed commands.
@@ -1514,7 +1515,6 @@ static void scsi_request_fn(struct reque
* cases (host limits or settings) should run the queue at some
* later time.
*/
- scsi_unprep_request(req);
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
sdev->device_busy--;


2005-09-15 13:56:44

by James Bottomley

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Wed, 2005-09-14 at 17:33 -0400, Alan Stern wrote:
> On Wed, 14 Sep 2005, James Bottomley wrote:
> > Yes ... really the only case for unprep is when we've partially released
> > the command (like in scsi_io_completion) where we need to tear the rest
> > of it down.
>
> In other words, in scsi_requeue_command and nowhere else.

Pretty much, yes.

> Or will be prepared for the first time, as in scsi_execute. As far as I
> can tell, a new struct request is not set to all 0's. So if you queue a
> request with REQ_SPECIAL set and you fail to clear req->special, you're in
> trouble. Do you have any idea why this hasn't been causing errors all
> along?

That's true, it's not. However ll_rq_blk.c:rq_init() clears req-
>special (and initialises all other important fields).

> Okay, then how does this patch look (moved the routine over to where it
> gets used, plus two real changes)?

Well ... under pressure to fix this in -mm, I already commited a version
to rc-fixes. What I did was fully reverse the changes to the
scsi_insert_queue() [the patch I sent Anton]. We can move the unprep
function if you feel strongly about it, but I'm also happy to keep it
where it is.

James


2005-09-15 14:13:18

by Alan Stern

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Thu, 15 Sep 2005, James Bottomley wrote:

> On Wed, 2005-09-14 at 17:33 -0400, Alan Stern wrote:
> > On Wed, 14 Sep 2005, James Bottomley wrote:
> > > Yes ... really the only case for unprep is when we've partially released
> > > the command (like in scsi_io_completion) where we need to tear the rest
> > > of it down.
> >
> > In other words, in scsi_requeue_command and nowhere else.
>
> Pretty much, yes.
>
> > Or will be prepared for the first time, as in scsi_execute. As far as I
> > can tell, a new struct request is not set to all 0's. So if you queue a
> > request with REQ_SPECIAL set and you fail to clear req->special, you're in
> > trouble. Do you have any idea why this hasn't been causing errors all
> > along?
>
> That's true, it's not. However ll_rq_blk.c:rq_init() clears req-
> >special (and initialises all other important fields).

(*Sigh*... I'm trying to do this too fast, not following up properly on
all the code paths.) Okay, good, glad to hear it.

> > Okay, then how does this patch look (moved the routine over to where it
> > gets used, plus two real changes)?
>
> Well ... under pressure to fix this in -mm, I already commited a version
> to rc-fixes. What I did was fully reverse the changes to the
> scsi_insert_queue() [the patch I sent Anton]. We can move the unprep
> function if you feel strongly about it, but I'm also happy to keep it
> where it is.

I don't care where the function goes, so just leave it.

That leaves only the question of the call to scsi_unprep_request near the
end of scsi_request_fn, in the not_ready: section. Looks like that call
isn't needed and can be taken out also, do you agree?

Alan Stern

2005-09-15 17:52:55

by Alan Stern

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang

On Thu, 15 Sep 2005, James Bottomley wrote:

> On Wed, 2005-09-14 at 17:33 -0400, Alan Stern wrote:
> > On Wed, 14 Sep 2005, James Bottomley wrote:
> > > Yes ... really the only case for unprep is when we've partially released
> > > the command (like in scsi_io_completion) where we need to tear the rest
> > > of it down.
> >
> > In other words, in scsi_requeue_command and nowhere else.
>
> Pretty much, yes.

I found one other thing that needs to be fixed. The call to
scsi_release_buffers in scsi_unprep_request causes an oops, because the
sgtable has already been freed in scsi_io_completion. The following patch
is needed.

Alan Stern



Signed-off-by: Alan Stern <[email protected]>

Index: usb-2.6/drivers/scsi/scsi_lib.c
===================================================================
--- usb-2.6.orig/drivers/scsi/scsi_lib.c
+++ usb-2.6/drivers/scsi/scsi_lib.c
@@ -118,7 +118,6 @@ static void scsi_unprep_request(struct r
req->flags &= ~REQ_DONTPREP;
req->special = (req->flags & REQ_SPECIAL) ? cmd->sc_request : NULL;

- scsi_release_buffers(cmd);
scsi_put_command(cmd);
}

@@ -1514,7 +1513,6 @@ static void scsi_request_fn(struct reque
* cases (host limits or settings) should run the queue at some
* later time.
*/
- scsi_unprep_request(req);
spin_lock_irq(q->queue_lock);
blk_requeue_request(q, req);
sdev->device_busy--;

2005-09-16 10:31:58

by Anton Blanchard

[permalink] [raw]
Subject: Re: [2.6.14-rc1] sym scsi boot hang


Hi,

> OK, my fault. Your fix is almost correct .. I was going to do this
> eventually, honest, because there's no need to unprep and reprep a
> command that comes in through scsi_queue_insert().
>
> However, I decided to leave it in to exercise the scsi_unprep_request()
> path just to make sure it was working. What's happening, I think, is
> that we also use this path for retries. Since we kill and reget the
> command each time, the retries decrement is never seen, so we're
> retrying forever.
>
> This should be the correct reversal.

Thanks James, that did the trick.

Anton