2003-11-25 00:23:54

by Peter Chubb

Subject: test10 hangs on startup: NMI watchdog hits Adaptec driver


Hi folks,
I've been seeing random hangs on a dual 500MHz Celeron here, so I
rebooted this morning with the NMI watchdog turned on.

With the watchdog, the machine shows the trace below. Looks to me as if
the lock taken at aic7xxx_osm.c:1709, which is released *after*
ahc_linux_initialize_scsi_bus(), should perhaps be released earlier.
Otherwise the host lock is held for the duration of that call.


NMI Watchdog detected LOCKUP on CPU0, eip c02fb737, registers:
CPU: 0
EIP: 0060:[<c02fb737>] Not tainted
EFLAGS: 00000086
EIP is at .text.lock.scsi+0x81/0xaa
eax: c1b01a50 ebx: c1b01800 ecx: c1b01800 edx: c1b01a00
esi: c1b01800 edi: 00000000 ebp: 00000046 esp: f7fa5da4
ds: 007b es: 007b ss: 0068
Process swapper (pid: 1, threadinfo=f7fa4000 task=c1a47900)
Stack: 00000000 c1b01800 00000000 c1b01c00 00000000 c02ff2bb c1b01800 00000000
ffffff41 ffffffff c031f932 c1b01800 00000000 00000000 c1b01c98 00000000
c1b01c00 00000000 c030492f c1b01c00 c1b01c90 00000000 00000000 00000000
Call Trace:
[<c02ff2bb>] scsi_report_bus_reset+0x1b/0x50
[<c031f932>] ahc_send_async+0xa2/0x2e0
[<c030492f>] ahc_run_untagged_queues+0x2f/0x40
[<c030fb23>] ahc_abort_scbs+0x403/0x4c0
[<c0310016>] ahc_reset_channel+0x2b6/0x5d0
[<c018ba09>] proc_create+0x89/0xe0
[<c031c0dc>] ahc_linux_initialize_scsi_bus+0x1fc/0x210
[<c031bca8>] ahc_linux_register_host+0x178/0x360
[<c01905f4>] sysfs_add_file+0xa4/0xb0
[<c018fee0>] init_file+0x0/0x20
[<c028a888>] pci_create_newid_file+0x28/0x30
[<c028ad5c>] pci_register_driver+0x7c/0xa0
[<c031aa3c>] ahc_linux_detect+0x4c/0x80
[<c0527b8f>] ahc_linux_init+0xf/0x30
[<c051095c>] do_initcalls+0x2c/0xa0
[<c013271f>] init_workqueues+0xf/0x24
[<c01050f6>] init+0x56/0x180
[<c01050a0>] init+0x0/0x180
[<c0107259>] kernel_thread_helper+0x5/0xc

Code: f3 90 80 38 00 7e f9 e9 fc fc ff ff f3 90 80 38 00 7e f9 e9
console shuts up ...


2003-11-25 00:39:57

by James Bottomley

Subject: Re: test10 hangs on startup: NMI watchdog hits Adaptec driver

On Mon, 2003-11-24 at 18:23, Peter Chubb wrote:
> I've been seeing random hangs on a dual 500MHz Celeron here, so I
> rebooted this morning with the NMI watchdog turned on.
>
> With the watchdog, the machine shows the trace below. Looks to me as if
> the lock taken at aic7xxx_osm.c:1709, which is released *after*
> ahc_linux_initialize_scsi_bus(), should perhaps be released earlier.
> Otherwise the host lock is held for the duration of that call.

There have been several threads on this.

The fix is attached.

James

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1483 -> 1.1484
# drivers/scsi/scsi_error.c 1.65 -> 1.66
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/11/24 [email protected] 1.1484
# Fix locking problems in scsi_report_bus_reset() causing aic7xxx to hang
#
# All the users of this function in the SCSI tree call it with the host
# lock held. With the new list traversal code, it was trying to take
# the lock again to traverse the list.
#
# Fix it to use the unlocked version of list traversal and modify the
# header comments to make it clear that the lock is expected to be held
# on calling it.
# --------------------------------------------
#
diff -Nru a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Mon Nov 24 17:27:38 2003
+++ b/drivers/scsi/scsi_error.c Mon Nov 24 17:27:38 2003
@@ -911,7 +911,9 @@

 	if (rtn == SUCCESS) {
 		scsi_sleep(BUS_RESET_SETTLE_TIME);
+		spin_lock_irqsave(scmd->device->host->host_lock, flags);
 		scsi_report_bus_reset(scmd->device->host, scmd->device->channel);
+		spin_unlock_irqrestore(scmd->device->host->host_lock, flags);
 	}

 	return rtn;
@@ -940,7 +942,9 @@

 	if (rtn == SUCCESS) {
 		scsi_sleep(HOST_RESET_SETTLE_TIME);
+		spin_lock_irqsave(scmd->device->host->host_lock, flags);
 		scsi_report_bus_reset(scmd->device->host, scmd->device->channel);
+		spin_unlock_irqrestore(scmd->device->host->host_lock, flags);
 	}

 	return rtn;
@@ -1608,7 +1612,7 @@
  *
  * Returns: Nothing
  *
- * Lock status: No locks are assumed held.
+ * Lock status: Host lock must be held.
  *
  * Notes: This only needs to be called if the reset is one which
  *        originates from an unknown location. Resets originated
@@ -1622,7 +1626,7 @@
 {
 	struct scsi_device *sdev;

-	shost_for_each_device(sdev, shost) {
+	__shost_for_each_device(sdev, shost) {
 		if (channel == sdev->channel) {
 			sdev->was_reset = 1;
 			sdev->expecting_cc_ua = 1;
@@ -1642,7 +1646,7 @@
  *
  * Returns: Nothing
  *
- * Lock status: No locks are assumed held.
+ * Lock status: Host lock must be held
  *
  * Notes: This only needs to be called if the reset is one which
  *        originates from an unknown location. Resets originated
@@ -1656,7 +1660,7 @@
 {
 	struct scsi_device *sdev;

-	shost_for_each_device(sdev, shost) {
+	__shost_for_each_device(sdev, shost) {
 		if (channel == sdev->channel &&
 		    target == sdev->id) {
 			sdev->was_reset = 1;
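
For readers following the changelog above, here is a minimal userspace sketch of the locking pattern it describes: a traversal that takes the lock itself cannot be called by someone who already holds that lock. This is not the kernel code. The names demo_host, demo_device, mark_channel_reset() and report_reset_locking() are hypothetical, and a POSIX error-checking mutex stands in for the host lock, so the second lock attempt returns EDEADLK rather than spinning forever as a kernel spinlock would.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the SCSI host and device; not the kernel types. */
struct demo_device { int channel; int was_reset; };

struct demo_host {
	pthread_mutex_t lock;		/* plays the role of host_lock */
	struct demo_device devs[3];
	size_t ndevs;
};

static int demo_host_init(struct demo_host *h)
{
	pthread_mutexattr_t attr;
	int rc = pthread_mutexattr_init(&attr);

	if (rc)
		return rc;
	/* ERRORCHECK: relocking from the same thread fails with EDEADLK. */
	rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (rc == 0)
		rc = pthread_mutex_init(&h->lock, &attr);
	pthread_mutexattr_destroy(&attr);

	h->ndevs = 3;
	for (size_t i = 0; i < h->ndevs; i++) {
		h->devs[i].channel = (i < 2) ? 0 : 1;	/* two devices on channel 0 */
		h->devs[i].was_reset = 0;
	}
	return rc;
}

/* Unlocked traversal: the caller must already hold h->lock. */
static int mark_channel_reset(struct demo_host *h, int channel)
{
	int n = 0;

	for (size_t i = 0; i < h->ndevs; i++) {
		if (h->devs[i].channel == channel) {
			h->devs[i].was_reset = 1;
			n++;
		}
	}
	return n;
}

/* Locked traversal: takes the lock itself, so the caller must NOT hold it. */
static int report_reset_locking(struct demo_host *h, int channel)
{
	int rc = pthread_mutex_lock(&h->lock);
	int n;

	if (rc)
		return -rc;		/* -EDEADLK when the caller already holds the lock */
	n = mark_channel_reset(h, channel);
	pthread_mutex_unlock(&h->lock);
	return n;
}

/* Pre-patch shape: caller holds the lock, callee tries to take it again. */
int demo_prepatch(void)
{
	struct demo_host h;
	int rc = demo_host_init(&h);

	assert(rc == 0);
	pthread_mutex_lock(&h.lock);
	rc = report_reset_locking(&h, 0);
	pthread_mutex_unlock(&h.lock);
	pthread_mutex_destroy(&h.lock);
	return rc;
}

/* Post-patch shape: caller holds the lock, callee uses the unlocked traversal. */
int demo_postpatch(void)
{
	struct demo_host h;
	int n, rc = demo_host_init(&h);

	assert(rc == 0);
	pthread_mutex_lock(&h.lock);
	n = mark_channel_reset(&h, 0);
	pthread_mutex_unlock(&h.lock);
	pthread_mutex_destroy(&h.lock);
	return n;
}
```

In this sketch demo_prepatch() reports the relock as an error, while demo_postpatch() marks both channel-0 devices. The patch makes the same split: callers that did not hold the host lock now take it explicitly, and the reporting functions use the unlocked iterator.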