2004-03-25 20:06:32

by Florian Lohoff

[permalink] [raw]
Subject: 2.6.3 gdth driver NMI Watchdog detected LOCKUP


Hi,
i got this NMI on one of our Machines - I have seen these kinds of deadlocks
on 2.4 too - Iirc its a locking problem on long overdue requests and the scsi_eh
kicking in. I think i spotted this already in 2.4.18 back in May 2002 (Mail attached).

Dual PIII 1Ghz, Serverworks Chipset, ICP Vortex
Vanilla 2.6.3

00:05.0 SCSI storage controller: ICP Vortex Computersysteme GmbH GDT 6123RS/6523RS
Subsystem: ICP Vortex Computersysteme GmbH GDT 6123RS/6523RS
Flags: bus master, medium devsel, latency 32, IRQ 18
Memory at fe000000 (32-bit, prefetchable) [size=16K]
Expansion ROM at <unassigned> [disabled] [size=32K]
Capabilities: [80] Power Management version 2

Adapter 0: Host Drive 0: resetted locally
Adapter 0: Host Drive 0: resetted locally
NMI Watchdog detected LOCKUP on CPU0, eip c0116e24, registers:
CPU: 0
EIP: 0060:[<c0116e24>] Not tainted
EFLAGS: 00000083
EIP is at delay_tsc+0x14/0x20
eax: 52bff37b ebx: 000f0b90 ecx: 52b42ab1 edx: 00006b5a
esi: c00981c0 edi: 00000000 ebp: 00000002 esp: f7fabe8c
ds: 007b es: 007b ss: 0068
Process scsi_eh_2 (pid: 18, threadinfo=f7faa000 task=f7dc6c00)
Stack: ffffffff c01eade2 000f0b90 c0261ffd 000f0b90 0001734e c026258a 00000001
c00981c0 00000000 00000000 c009ae80 c00981c0 00000002 00000000 c02626e0
00000000 00000002 000186a0 000186a0 00000018 09000000 00000000 c00981c0
Call Trace:
[<c01eade2>] __delay+0x12/0x20
[<c0261ffd>] gdth_delay+0x2d/0x60
[<c026258a>] gdth_wait+0x6a/0xc0
[<c02626e0>] gdth_internal_cmd+0x100/0x1e0
[<c026582c>] gdth_eh_bus_reset+0x24c/0x2c0
[<c024e456>] scsi_try_bus_reset+0x56/0xf0
[<c024b7e4>] __scsi_iterate_devices+0x84/0xa0
[<c024e639>] scsi_eh_bus_reset+0x59/0xf0
[<c024eb80>] scsi_eh_ready_devs+0x50/0x80
[<c024ed20>] scsi_unjam_host+0xe0/0xf0
[<c024ee28>] scsi_error_handler+0xf8/0x150
[<c024ed30>] scsi_error_handler+0x0/0x150
[<c0108c19>] kernel_thread_helper+0x5/0xc

Code: 29 c8 39 d8 72 f6 5b c3 8d 74 26 00 55 b8 00 e0 ff ff 57 56
console shuts up ...
EIP is at __preempt_spin_lock+ 0x50/0x70




I guess modifying "gdth_polling" needs to be move infront of taking the
lock and needs to be guranteed written out to mem (memory barrier ?
atomic_set ?)

drivers/scsi/gdth.c:gdth_eh_bus_reset

4779
4780 if (b == ha->virt_bus) {
4781 /* host drives */
4782 for (i = 0; i < MAX_HDRIVES; ++i) {
4783 if (ha->hdr[i].present) {
4784 GDTH_LOCK_HA(ha, flags);
4785 gdth_polling = TRUE;
4786 while (gdth_test_busy(hanum))
4787 gdth_delay(0);
4788 if (gdth_internal_cmd(hanum, CACHESERVICE,
4789 GDT_CLUST_RESET, i, 0, 0))
4790 ha->hdr[i].cluster_type &= ~CLUSTER_RESERVED;
4791 gdth_polling = FALSE;
4792 GDTH_UNLOCK_HA(ha, flags);
4793 }
4794 }
4795 } else {
4796 /* raw devices */
4797 GDTH_LOCK_HA(ha, flags);
4798 for (i = 0; i < MAXID; ++i)
4799 ha->raw[BUS_L2P(ha,b)].io_cnt[i] = 0;
4800 gdth_polling = TRUE;
4801 while (gdth_test_busy(hanum))
4802 gdth_delay(0);
4803 gdth_internal_cmd(hanum, SCSIRAWSERVICE, GDT_RESET_BUS,
4804 BUS_L2P(ha,b), 0, 0);
4805 gdth_polling = FALSE;
4806 GDTH_UNLOCK_HA(ha, flags);
4807 }
4808 return SUCCESS;
4809 }
4810

vs.

drivers/scsi/gdth.c:gdth_interrupt

3390 if (!gdth_polling)
3391 GDTH_LOCK_HA((gdth_ha_str *)dev_id,flags);
3392 wait_index = 0;

So we have a small race.

I would propose something like this:


--- linux-2.6.3/drivers/scsi/gdth.c.orig 2004-03-25 20:58:18.000000000 +0100
+++ linux-2.6.3/drivers/scsi/gdth.c 2004-03-25 20:59:28.000000000 +0100
@@ -4781,29 +4781,29 @@
/* host drives */
for (i = 0; i < MAX_HDRIVES; ++i) {
if (ha->hdr[i].present) {
- GDTH_LOCK_HA(ha, flags);
gdth_polling = TRUE;
+ GDTH_LOCK_HA(ha, flags);
while (gdth_test_busy(hanum))
gdth_delay(0);
if (gdth_internal_cmd(hanum, CACHESERVICE,
GDT_CLUST_RESET, i, 0, 0))
ha->hdr[i].cluster_type &= ~CLUSTER_RESERVED;
- gdth_polling = FALSE;
GDTH_UNLOCK_HA(ha, flags);
+ gdth_polling = FALSE;
}
}
} else {
/* raw devices */
+ gdth_polling = TRUE;
GDTH_LOCK_HA(ha, flags);
for (i = 0; i < MAXID; ++i)
ha->raw[BUS_L2P(ha,b)].io_cnt[i] = 0;
- gdth_polling = TRUE;
while (gdth_test_busy(hanum))
gdth_delay(0);
gdth_internal_cmd(hanum, SCSIRAWSERVICE, GDT_RESET_BUS,
BUS_L2P(ha,b), 0, 0);
- gdth_polling = FALSE;
GDTH_UNLOCK_HA(ha, flags);
+ gdth_polling = FALSE;
}
return SUCCESS;
}


--
Florian Lohoff [email protected] +49-171-2280134
Heisenberg may have been here.


Attachments:
(No filename) (0.00 B)
(No filename) (189.00 B)
Download all attachments

2004-03-26 12:49:39

by Leubner, Achim

[permalink] [raw]
Subject: RE: 2.6.3 gdth driver NMI Watchdog detected LOCKUP

Hi,

thanks for reporting this problem. I will include the change into our new driver version and will send a patch for 2.6.4 on Monday next week to Linus.

Achim Leubner
Research & Development
ICP vortex Computersysteme GmbH
Gostritzer Str. 61-63
D-01217 Dresden, Germany
Phone: +49-351-871-8291
Fax: +49-351-871-8448
Email: [email protected]
Web: http://www.icp-vortex.com?


> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Florian
> Lohoff
> Sent: Donnerstag, 25. M?rz 2004 21:05
> To: [email protected]
> Cc: [email protected]
> Subject: 2.6.3 gdth driver NMI Watchdog detected LOCKUP
>
>
> Hi,
> i got this NMI on one of our Machines - I have seen these kinds of deadlocks
> on 2.4 too - Iirc its a locking problem on long overdue requests and the scsi_eh
> kicking in. I think i spotted this already in 2.4.18 back in May 2002 (Mail attached).
>
> Dual PIII 1Ghz, Serverworks Chipset, ICP Vortex
> Vanilla 2.6.3
>
> 00:05.0 SCSI storage controller: ICP Vortex Computersysteme GmbH GDT 6123RS/6523RS
> Subsystem: ICP Vortex Computersysteme GmbH GDT 6123RS/6523RS
> Flags: bus master, medium devsel, latency 32, IRQ 18
> Memory at fe000000 (32-bit, prefetchable) [size=16K]
> Expansion ROM at <unassigned> [disabled] [size=32K]
> Capabilities: [80] Power Management version 2
>
> Adapter 0: Host Drive 0: resetted locally
> Adapter 0: Host Drive 0: resetted locally
> NMI Watchdog detected LOCKUP on CPU0, eip c0116e24, registers:
> CPU: 0
> EIP: 0060:[<c0116e24>] Not tainted
> EFLAGS: 00000083
> EIP is at delay_tsc+0x14/0x20
> eax: 52bff37b ebx: 000f0b90 ecx: 52b42ab1 edx: 00006b5a
> esi: c00981c0 edi: 00000000 ebp: 00000002 esp: f7fabe8c
> ds: 007b es: 007b ss: 0068
> Process scsi_eh_2 (pid: 18, threadinfo=f7faa000 task=f7dc6c00)
> Stack: ffffffff c01eade2 000f0b90 c0261ffd 000f0b90 0001734e c026258a 00000001
> c00981c0 00000000 00000000 c009ae80 c00981c0 00000002 00000000 c02626e0
> 00000000 00000002 000186a0 000186a0 00000018 09000000 00000000 c00981c0
> Call Trace:
> [<c01eade2>] __delay+0x12/0x20
> [<c0261ffd>] gdth_delay+0x2d/0x60
> [<c026258a>] gdth_wait+0x6a/0xc0
> [<c02626e0>] gdth_internal_cmd+0x100/0x1e0
> [<c026582c>] gdth_eh_bus_reset+0x24c/0x2c0
> [<c024e456>] scsi_try_bus_reset+0x56/0xf0
> [<c024b7e4>] __scsi_iterate_devices+0x84/0xa0
> [<c024e639>] scsi_eh_bus_reset+0x59/0xf0
> [<c024eb80>] scsi_eh_ready_devs+0x50/0x80
> [<c024ed20>] scsi_unjam_host+0xe0/0xf0
> [<c024ee28>] scsi_error_handler+0xf8/0x150
> [<c024ed30>] scsi_error_handler+0x0/0x150
> [<c0108c19>] kernel_thread_helper+0x5/0xc
>
> Code: 29 c8 39 d8 72 f6 5b c3 8d 74 26 00 55 b8 00 e0 ff ff 57 56
> console shuts up ...
> EIP is at __preempt_spin_lock+ 0x50/0x70
>
>
>
>
> I guess modifying "gdth_polling" needs to be move infront of taking the
> lock and needs to be guranteed written out to mem (memory barrier ?
> atomic_set ?)
>
> drivers/scsi/gdth.c:gdth_eh_bus_reset
>
> 4779
> 4780 if (b == ha->virt_bus) {
> 4781 /* host drives */
> 4782 for (i = 0; i < MAX_HDRIVES; ++i) {
> 4783 if (ha->hdr[i].present) {
> 4784 GDTH_LOCK_HA(ha, flags);
> 4785 gdth_polling = TRUE;
> 4786 while (gdth_test_busy(hanum))
> 4787 gdth_delay(0);
> 4788 if (gdth_internal_cmd(hanum, CACHESERVICE,
> 4789 GDT_CLUST_RESET, i, 0, 0))
> 4790 ha->hdr[i].cluster_type &= ~CLUSTER_RESERVED;
> 4791 gdth_polling = FALSE;
> 4792 GDTH_UNLOCK_HA(ha, flags);
> 4793 }
> 4794 }
> 4795 } else {
> 4796 /* raw devices */
> 4797 GDTH_LOCK_HA(ha, flags);
> 4798 for (i = 0; i < MAXID; ++i)
> 4799 ha->raw[BUS_L2P(ha,b)].io_cnt[i] = 0;
> 4800 gdth_polling = TRUE;
> 4801 while (gdth_test_busy(hanum))
> 4802 gdth_delay(0);
> 4803 gdth_internal_cmd(hanum, SCSIRAWSERVICE, GDT_RESET_BUS,
> 4804 BUS_L2P(ha,b), 0, 0);
> 4805 gdth_polling = FALSE;
> 4806 GDTH_UNLOCK_HA(ha, flags);
> 4807 }
> 4808 return SUCCESS;
> 4809 }
> 4810
>
> vs.
>
> drivers/scsi/gdth.c:gdth_interrupt
>
> 3390 if (!gdth_polling)
> 3391 GDTH_LOCK_HA((gdth_ha_str *)dev_id,flags);
> 3392 wait_index = 0;
>
> So we have a small race.
>
> I would propose something like this:
>
>
> --- linux-2.6.3/drivers/scsi/gdth.c.orig 2004-03-25 20:58:18.000000000 +0100
> +++ linux-2.6.3/drivers/scsi/gdth.c 2004-03-25 20:59:28.000000000 +0100
> @@ -4781,29 +4781,29 @@
> /* host drives */
> for (i = 0; i < MAX_HDRIVES; ++i) {
> if (ha->hdr[i].present) {
> - GDTH_LOCK_HA(ha, flags);
> gdth_polling = TRUE;
> + GDTH_LOCK_HA(ha, flags);
> while (gdth_test_busy(hanum))
> gdth_delay(0);
> if (gdth_internal_cmd(hanum, CACHESERVICE,
> GDT_CLUST_RESET, i, 0, 0))
> ha->hdr[i].cluster_type &= ~CLUSTER_RESERVED;
> - gdth_polling = FALSE;
> GDTH_UNLOCK_HA(ha, flags);
> + gdth_polling = FALSE;
> }
> }
> } else {
> /* raw devices */
> + gdth_polling = TRUE;
> GDTH_LOCK_HA(ha, flags);
> for (i = 0; i < MAXID; ++i)
> ha->raw[BUS_L2P(ha,b)].io_cnt[i] = 0;
> - gdth_polling = TRUE;
> while (gdth_test_busy(hanum))
> gdth_delay(0);
> gdth_internal_cmd(hanum, SCSIRAWSERVICE, GDT_RESET_BUS,
> BUS_L2P(ha,b), 0, 0);
> - gdth_polling = FALSE;
> GDTH_UNLOCK_HA(ha, flags);
> + gdth_polling = FALSE;
> }
> return SUCCESS;
> }
>
>
> --
> Florian Lohoff [email protected] +49-171-2280134
> Heisenberg may have been here.

2004-03-28 15:02:42

by Florian Lohoff

[permalink] [raw]
Subject: Re: 2.6.3 gdth driver NMI Watchdog detected LOCKUP

On Fri, Mar 26, 2004 at 01:49:24PM +0100, Leubner, Achim wrote:
> Hi,
>
> thanks for reporting this problem. I will include the change into our new driver version and will send a patch for 2.6.4 on Monday next week to Linus.
>

I dont think this patch will be sufficient - There is another problem
with the driver - What for example happens if we are in the
gdth_interrupt - we do have locked the HA because we are not in polling
mode. Then the error handler kicks in and sets polling_mode to TRUE -
Now the gdth_interrupt will not unlock the HA which will get one cpu to
keep spinning in the error handler.

I think calling the gdth_interrupt in polling and non-polling mode is
bogus by design.

Flo
--
Florian Lohoff [email protected] +49-171-2280134
Heisenberg may have been here.


Attachments:
(No filename) (837.00 B)
(No filename) (189.00 B)
Download all attachments