2008-11-18 20:15:26

by Randy Dunlap

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Miller, Mike (OS Dev) wrote:
>
>> -----Original Message-----
>> From: Randy Dunlap [mailto:[email protected]]
>> Sent: Thursday, September 25, 2008 3:40 PM
>> To: scsi
>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>>
>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
>>
>>> Jens Axboe wrote:
>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
>> <do_cciss_intr+2509>
>>>>>>>>>
>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
>>>>>>>>>
>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
>>>>>> 06
>>>>>>>> OK ...that's confusing. It seems to be saying that
>> ctrlr_info_t
>>>>>>>> * was NULL. However, I can't see a way of getting into the
>>>>>> fifo_full
>>>>>>>> callback from do_cciss_intr ..
>>>>>>>> especially not with an NULL host.
>>>>>>>>
>>>>>>>> James
>>>>>>> That is weird. Even if we could get there fifo_full doesn't
>>>>>> do anything but wait for a bit.
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
>>>>>>
>>>>>> ~Randy
>>>>> Thanks Randy. I think. :)
>>>>>
>>>>> I'll try to recreate in my lab.
>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
>>>> oopsing in in removeQ (I don't think Randy's analysis is
>> correct in
>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
>> cannot be
>>>> NULL, it's c->prev or c->next that are NULL.

This BUG: has happened (now) 5 times today. Higher frequency than usual for
some reason.

I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call
to removeQ(), both c->next and c->prev are NULL.
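
For readers following along, here is a minimal user-space sketch of a circular doubly linked command queue of the kind being discussed. It is not the driver's code: `struct cmd`, its field layout, and the `-1` guard in `removeQ()` are assumptions made for illustration. It shows why a command with NULL `next`/`prev` (one that was never added) cannot be unlinked safely: the unguarded store `c->prev->next = c->next` would write through a NULL pointer, which would be consistent with the write fault at a small offset (0x248) reported in the oops if `next` sits at that offset in the real structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's command structure. */
struct cmd {
    unsigned long busaddr;
    struct cmd *prev;
    struct cmd *next;
};

/* Append c to the circular list whose head pointer is *qptr. */
static void addQ(struct cmd **qptr, struct cmd *c)
{
    assert(qptr != NULL && c != NULL);
    if (*qptr == NULL) {
        /* First element: it is its own neighbour in both directions. */
        *qptr = c;
        c->next = c->prev = c;
    } else {
        c->prev = (*qptr)->prev;
        c->next = *qptr;
        (*qptr)->prev->next = c;
        (*qptr)->prev = c;
    }
}

/*
 * Unlink c from the list. Returns 0 on success, -1 if c was never linked.
 * The guard is an addition for this sketch; without it, the two stores
 * below dereference c->prev and c->next, which are NULL for an unlinked c.
 * Note that it does not detect a command linked on a different queue.
 */
static int removeQ(struct cmd **qptr, struct cmd *c)
{
    if (qptr == NULL || *qptr == NULL || c == NULL ||
        c->next == NULL || c->prev == NULL)
        return -1;

    if (c->next != c) {
        if (*qptr == c)
            *qptr = c->next;
        c->prev->next = c->next;
        c->next->prev = c->prev;
    } else {
        /* c was the only element, so the queue becomes empty. */
        *qptr = NULL;
    }
    c->next = c->prev = NULL;
    return 0;
}
```

With a single queued command, `next` and `prev` both point back at the command itself, which matches the first removeQ line in the log below; the second removeQ line shows a command whose links are both NULL.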

Here's the kernel log output from cciss:

cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
command = 147
irq = 36
board_id = 3211103c
cciss 0000:42:08.0: irq 87 for MSI/MSI-X
address 0 = fdf80000
cfg base address = 10
cfg base address index = 0
cfg offset = 400
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0xffc


Trying to put board into Simple mode
I counter got to 1 0
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0xffc

cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
cciss: intr_pending 8
cciss: removeQ: Qptr=ffff88027e7500b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
Sending 7f83e000 - down to controller
cciss: intr_pending 8
cciss: Read 4 back from board
cciss: removeQ: Qptr=ffff88027e7500c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
IP: [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/block/ram15/dev
CPU 2
Modules linked in: cciss(+) ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #1
RIP: 0010:[<ffffffffa002502b>] [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
RSP: 0018:ffff88017fa9fee8 EFLAGS: 00010087
RAX: 0000000000000000 RBX: ffff88007f840000 RCX: 000000000000a3d9
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff8080e634
RBP: ffff88017fa9ff18 R08: 0000000000000000 R09: ffff88017e918800
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88027e740000
R13: 0000000000000000 R14: 0000000000000057 R15: 0000000000000086
FS: 00000000008558f0(0000) GS:ffff88017fc01c80(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000248 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88027f660000, task ffff88027f635400)
Stack:
0000000000000030 ffff88017ea73640 0000000000000000 0000000000000000
0000000000000057 0000000000000000 ffff88017fa9ff48 ffffffff8026a8b9
ffffffff8074ab00 0000000000000057 ffff88017ea73640 ffffffff8074ab58
Call Trace:
<IRQ> <0> [<ffffffff8026a8b9>] handle_IRQ_event+0x27/0x57
[<ffffffff8026c424>] handle_edge_irq+0xde/0x11f
[<ffffffff8020e29b>] do_IRQ+0xfc/0x175
[<ffffffff8020c3e6>] ret_from_intr+0x0/0xa
<EOI> <0> [<ffffffff8023c7d2>] ? ksoftirqd+0x0/0xa6
[<ffffffff80212575>] ? default_idle+0x2b/0x40
[<ffffffff80212799>] ? c1e_idle+0xe5/0xec
[<ffffffff8056a7f6>] ? atomic_notifier_call_chain+0xf/0x11
[<ffffffff8020acd1>] ? cpu_idle+0x40/0x5e
[<ffffffff8056284e>] ? start_secondary+0x174/0x179
Code: 8b 83 48 02 00 00 48 39 d8 74 37 49 39 9c 24 c0 00 01 00 75 08 49 89 84 24 c0 00 01 00 48 8b 83 40 02 00 00 48 8b 93 48 02 00 00 <48> 89 90 48 02 00 00 48 8b 93 48 02 00 00 48 89 82 40 02 00 00
RIP [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
RSP <ffff88017fa9fee8>
CR2: 0000000000000248
Kernel panic - not syncing: Fatal exception in interrupt


Any ideas/suggestions?

Thanks,
~Randy


2008-11-18 20:20:34

by Randy Dunlap

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Randy Dunlap wrote:
> Miller, Mike (OS Dev) wrote:
>>> -----Original Message-----
>>> From: Randy Dunlap [mailto:[email protected]]
>>> Sent: Thursday, September 25, 2008 3:40 PM
>>> To: scsi
>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>>>
>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
>>>
>>>> Jens Axboe wrote:
>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
>>> <do_cciss_intr+2509>
>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
>>>>>>>>>>
>>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
>>>>>>> 06
>>>>>>>>> OK ...that's confusing. It seems to be saying that
>>> ctrlr_info_t
>>>>>>>>> * was NULL. However, I can't see a way of getting into the
>>>>>>> fifo_full
>>>>>>>>> callback from do_cciss_intr ..
>>>>>>>>> especially not with an NULL host.
>>>>>>>>>
>>>>>>>>> James
>>>>>>>> That is weird. Even if we could get there fifo_full doesn't
>>>>>>> do anything but wait for a bit.
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
>>>>>>>
>>>>>>> ~Randy
>>>>>> Thanks Randy. I think. :)
>>>>>>
>>>>>> I'll try to recreate in my lab.
>>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
>>>>> oopsing in in removeQ (I don't think Randy's analysis is
>>> correct in
>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
>>> cannot be
>>>>> NULL, it's c->prev or c->next that are NULL.
>
> This BUG: has happened (now) 5 times today. Higher frequency than usual for
> some reason.
>
> I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call

s/first/second/


> to removeQ(), both c->next and c->prev are NULL.
>
> Here's the kernel log output from cciss:
>
> cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
> command = 147
> irq = 36
> board_id = 3211103c
> cciss 0000:42:08.0: irq 87 for MSI/MSI-X
> address 0 = fdf80000
> cfg base address = 10
> cfg base address index = 0
> cfg offset = 400
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0xffc
>
>
> Trying to put board into Simple mode
> I counter got to 1 0
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0xffc
>
> cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
> cciss: intr_pending 8
> cciss: removeQ: Qptr=ffff88027e7500b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
> Sending 7f83e000 - down to controller
> cciss: intr_pending 8
> cciss: Read 4 back from board
> cciss: removeQ: Qptr=ffff88027e7500c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
> IP: [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
> PGD 0
> Oops: 0002 [#1] SMP
> last sysfs file: /sys/block/ram15/dev
> CPU 2
> Modules linked in: cciss(+) ehci_hcd ohci_hcd uhci_hcd
> Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #1
> RIP: 0010:[<ffffffffa002502b>] [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
> RSP: 0018:ffff88017fa9fee8 EFLAGS: 00010087
> RAX: 0000000000000000 RBX: ffff88007f840000 RCX: 000000000000a3d9
> RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff8080e634
> RBP: ffff88017fa9ff18 R08: 0000000000000000 R09: ffff88017e918800
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff88027e740000
> R13: 0000000000000000 R14: 0000000000000057 R15: 0000000000000086
> FS: 00000000008558f0(0000) GS:ffff88017fc01c80(0000) knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000248 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper (pid: 0, threadinfo ffff88027f660000, task ffff88027f635400)
> Stack:
> 0000000000000030 ffff88017ea73640 0000000000000000 0000000000000000
> 0000000000000057 0000000000000000 ffff88017fa9ff48 ffffffff8026a8b9
> ffffffff8074ab00 0000000000000057 ffff88017ea73640 ffffffff8074ab58
> Call Trace:
> <IRQ> <0> [<ffffffff8026a8b9>] handle_IRQ_event+0x27/0x57
> [<ffffffff8026c424>] handle_edge_irq+0xde/0x11f
> [<ffffffff8020e29b>] do_IRQ+0xfc/0x175
> [<ffffffff8020c3e6>] ret_from_intr+0x0/0xa
> <EOI> <0> [<ffffffff8023c7d2>] ? ksoftirqd+0x0/0xa6
> [<ffffffff80212575>] ? default_idle+0x2b/0x40
> [<ffffffff80212799>] ? c1e_idle+0xe5/0xec
> [<ffffffff8056a7f6>] ? atomic_notifier_call_chain+0xf/0x11
> [<ffffffff8020acd1>] ? cpu_idle+0x40/0x5e
> [<ffffffff8056284e>] ? start_secondary+0x174/0x179
> Code: 8b 83 48 02 00 00 48 39 d8 74 37 49 39 9c 24 c0 00 01 00 75 08 49 89 84 24 c0 00 01 00 48 8b 83 40 02 00 00 48 8b 93 48 02 00 00 <48> 89 90 48 02 00 00 48 8b 93 48 02 00 00 48 89 82 40 02 00 00
> RIP [<ffffffffa002502b>] do_cciss_intr+0x6c8/0xb10 [cciss]
> RSP <ffff88017fa9fee8>
> CR2: 0000000000000248
> Kernel panic - not syncing: Fatal exception in interrupt
>
>
> Any ideas/suggestions?

2008-11-18 21:33:19

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr


> FS: 00000000008558f0(0000) GS:ffff88017fc01c80(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 0000000000000248 CR3: 0000000000201000 CR4: 00000000000006e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400 Process swapper (pid: 0, threadinfo
> ffff88027f660000, task ffff88027f635400)
> Stack:
> 0000000000000030 ffff88017ea73640 0000000000000000 0000000000000000
> 0000000000000057 0000000000000000 ffff88017fa9ff48
> ffffffff8026a8b9 ffffffff8074ab00 0000000000000057
> ffff88017ea73640 ffffffff8074ab58 Call Trace:
> <IRQ> <0> [<ffffffff8026a8b9>] handle_IRQ_event+0x27/0x57
> [<ffffffff8026c424>] handle_edge_irq+0xde/0x11f
> [<ffffffff8020e29b>] do_IRQ+0xfc/0x175 [<ffffffff8020c3e6>]
> ret_from_intr+0x0/0xa <EOI> <0> [<ffffffff8023c7d2>] ?
> ksoftirqd+0x0/0xa6 [<ffffffff80212575>] ?
> default_idle+0x2b/0x40 [<ffffffff80212799>] ?
> c1e_idle+0xe5/0xec [<ffffffff8056a7f6>] ?
> atomic_notifier_call_chain+0xf/0x11
> [<ffffffff8020acd1>] ? cpu_idle+0x40/0x5e
> [<ffffffff8056284e>] ? start_secondary+0x174/0x179
> Code: 8b 83 48 02 00 00 48 39 d8 74 37 49 39 9c 24 c0 00 01
> 00 75 08 49 89 84 24 c0 00 01 00 48 8b 83 40 02 00 00 48 8b
> 93 48 02 00 00 <48> 89 90 48 02 00 00 48 8b 93 48 02 00 00 48
> 89 82 40 02 00 00 RIP [<ffffffffa002502b>]
> do_cciss_intr+0x6c8/0xb10 [cciss] RSP <ffff88017fa9fee8>
> CR2: 0000000000000248
> Kernel panic - not syncing: Fatal exception in interrupt
>
>
> Any ideas/suggestions?
>
> Thanks,
> ~Randy
>

Randy,
I've been in and out of the office, mostly out. I finally tracked down a blade. I can't reproduce this on a 385, and the e200 isn't detected in my 585.

So I'm back on this, for what it's worth.

-- mikem

2008-11-18 21:33:35

by Randy Dunlap

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Randy Dunlap wrote:
> Randy Dunlap wrote:
>> Miller, Mike (OS Dev) wrote:
>>>> -----Original Message-----
>>>> From: Randy Dunlap [mailto:[email protected]]
>>>> Sent: Thursday, September 25, 2008 3:40 PM
>>>> To: scsi
>>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
>>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>>>>
>>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
>>>>
>>>>> Jens Axboe wrote:
>>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
>>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
>>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
>>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
>>>> <do_cciss_intr+2509>
>>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
>>>>>>>>>>>
>>>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
>>>>>>>> 06
>>>>>>>>>> OK ...that's confusing. It seems to be saying that
>>>> ctrlr_info_t
>>>>>>>>>> * was NULL. However, I can't see a way of getting into the
>>>>>>>> fifo_full
>>>>>>>>>> callback from do_cciss_intr ..
>>>>>>>>>> especially not with an NULL host.
>>>>>>>>>>
>>>>>>>>>> James
>>>>>>>>> That is weird. Even if we could get there fifo_full doesn't
>>>>>>>> do anything but wait for a bit.
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
>>>>>>>>
>>>>>>>> ~Randy
>>>>>>> Thanks Randy. I think. :)
>>>>>>>
>>>>>>> I'll try to recreate in my lab.
>>>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
>>>>>> oopsing in in removeQ (I don't think Randy's analysis is
>>>> correct in
>>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
>>>> cannot be
>>>>>> NULL, it's c->prev or c->next that are NULL.
>> This BUG: has happened (now) 5 times today. Higher frequency than usual for
>> some reason.
>>
>> I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call
>
> s/first/second/
>
>
>> to removeQ(), both c->next and c->prev are NULL.
>>
>> Here's the kernel log output from cciss:

I added a printk() in addQ() as well. Here's the new output:

HP CISS Driver (v 3.6.20)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54
cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
command = 147
irq = 36
board_id = 3211103c
cciss 0000:42:08.0: irq 87 for MSI/MSI-X
address 0 = fdf80000
cfg base address = 10
cfg base address index = 0
cfg offset = 400
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0x1672


Trying to put board into Simple mode
I counter got to 1 0
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0x1672


cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
cciss: intr_pending 8
cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
Sending 7f83e000 - down to controller
cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
cciss: intr_pending 8
cciss: Read 4 back from board
cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
IP: [<ffffffffa0025106>] do_cciss_intr+0x706/0xb6c [cciss]
PGD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/block/ram15/dev
CPU 2
Modules linked in: cciss(+) ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.28-rc5 #1
RIP: 0010:[<ffffffffa0025106>] [<ffffffffa0025106>] do_cciss_intr+0x706/0xb6c [cciss]
RSP: 0018:ffff88027f643ee8 EFLAGS: 00010087
RAX: 0000000000000000 RBX: ffff88007f840000 RCX: 000000000000a44f
RDX: 0000000000000000 RSI: 0000000000000046 RDI: ffffffff8080e634
RBP: ffff88027f643f18 R08: 0000000000000000 R09: ffff88017e964800
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88027e000000
R13: 0000000000000000 R14: 0000000000000057 R15: 0000000000000086
FS: 00000000008558f0(0000) GS:ffff88017fc01c80(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000248 CR3: 0000000000201000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffff88017fa9e000, task ffff88017fa5d400)
Stack:
0000000000000030 ffff88027f627500 0000000000000000 0000000000000000
0000000000000057 0000000000000000 ffff88027f643f48 ffffffff8026a8b9
ffffffff8074ab00 0000000000000057 ffff88027f627500 ffffffff8074ab58
Call Trace:
<IRQ> <0> [<ffffffff8026a8b9>] handle_IRQ_event+0x27/0x57
[<ffffffff8026c424>] handle_edge_irq+0xde/0x11f
[<ffffffff8020e29b>] do_IRQ+0xfc/0x175
[<ffffffff8020c3e6>] ret_from_intr+0x0/0xa
<EOI> <0> [<ffffffff8023c7d2>] ? ksoftirqd+0x0/0xa6
[<ffffffff80212575>] ? default_idle+0x2b/0x40
[<ffffffff80212799>] ? c1e_idle+0xe5/0xec
[<ffffffff8056a7f6>] ? atomic_notifier_call_chain+0xf/0x11
[<ffffffff8020acd1>] ? cpu_idle+0x40/0x5e
[<ffffffff8056284e>] ? start_secondary+0x174/0x179
Code: 8b 83 48 02 00 00 48 39 d8 74 37 49 39 9c 24 c0 00 01 00 75 08 49 89 84 24 c0 00 01 00 48 8b 83 40 02 00 00 48 8b 93 48 02 00 00 <48> 89 90 48 02 00 00 48 8b 93 48 02 00 00 48 89 82 40 02 00 00
RIP [<ffffffffa0025106>] do_cciss_intr+0x706/0xb6c [cciss]
RSP <ffff88027f643ee8>
CR2: 0000000000000248
Kernel panic - not syncing: Fatal exception in interrupt

~Randy

2008-11-19 08:54:29

by Jens Axboe

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Tue, Nov 18 2008, Randy Dunlap wrote:
> Randy Dunlap wrote:
> > Randy Dunlap wrote:
> >> Miller, Mike (OS Dev) wrote:
> >>>> -----Original Message-----
> >>>> From: Randy Dunlap [mailto:[email protected]]
> >>>> Sent: Thursday, September 25, 2008 3:40 PM
> >>>> To: scsi
> >>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
> >>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> >>>>
> >>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
> >>>>
> >>>>> Jens Axboe wrote:
> >>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
> >>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
> >>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
> >>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
> >>>> <do_cciss_intr+2509>
> >>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
> >>>>>>>>>>>
> >>>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
> >>>>>>>> 06
> >>>>>>>>>> OK ...that's confusing. It seems to be saying that
> >>>> ctrlr_info_t
> >>>>>>>>>> * was NULL. However, I can't see a way of getting into the
> >>>>>>>> fifo_full
> >>>>>>>>>> callback from do_cciss_intr ..
> >>>>>>>>>> especially not with an NULL host.
> >>>>>>>>>>
> >>>>>>>>>> James
> >>>>>>>>> That is weird. Even if we could get there fifo_full doesn't
> >>>>>>>> do anything but wait for a bit.
> >>>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
> >>>>>>>>
> >>>>>>>> ~Randy
> >>>>>>> Thanks Randy. I think. :)
> >>>>>>>
> >>>>>>> I'll try to recreate in my lab.
> >>>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
> >>>>>> oopsing in in removeQ (I don't think Randy's analysis is
> >>>> correct in
> >>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
> >>>> cannot be
> >>>>>> NULL, it's c->prev or c->next that are NULL.
> >> This BUG: has happened (now) 5 times today. Higher frequency than usual for
> >> some reason.
> >>
> >> I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call
> >
> > s/first/second/
> >
> >
> >> to removeQ(), both c->next and c->prev are NULL.
> >>
> >> Here's the kernel log output from cciss:
>
> I added a printk() in addQ() as well. Here's the new output:
>
> HP CISS Driver (v 3.6.20)
> ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54
> cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
> command = 147
> irq = 36
> board_id = 3211103c
> cciss 0000:42:08.0: irq 87 for MSI/MSI-X
> address 0 = fdf80000
> cfg base address = 10
> cfg base address index = 0
> cfg offset = 400
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0x1672
>
>
> Trying to put board into Simple mode
> I counter got to 1 0
> Controller Configuration information
> ------------------------------------
> Signature = CISS
> Spec Number = 1
> Transport methods supported = 0x6
> Transport methods active = 0x3
> Requested transport Method = 0x0
> Coalesce Interrupt Delay = 0x0
> Coalesce Interrupt Count = 0x1
> Max outstanding commands = 0x256
> Bus Types = 0x200000
> Server Name =
> Heartbeat Counter = 0x1672
>
>
> cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
> cciss: intr_pending 8
> cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
> cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
> Sending 7f83e000 - down to controller
> cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
> cciss: intr_pending 8
> cciss: Read 4 back from board
> cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000248

Randy, can you post the debug patch you used? The above goes boom when
it attempts to remove a command that isn't on the list; the Qptr in the
last example should be empty, hence the oops. So I'd be interested in
seeing which removeQ() call this is. I'm assuming it's this bit in
do_cciss_intr():

    ...
            while (c->busaddr != a) {
                c = c->next;
                if (c == h->cmpQ)
                    break;
            }
        }
        /*
         * If we've found the command, take it off the
         * completion Q and free it
         */
        if (c->busaddr == a) {
            removeQ(&h->cmpQ, c);
            if (c->cmd_type == CMD_RWREQ) {
                complete_command(h, c, 0);
    ...

If so, what part of the c lookup are you hitting - the one that does:

    c = h->cmd_pool + a2;

or the c->busaddr check that is shown above?
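
To make the two lookup paths concrete, here is a small user-space sketch. It is an illustration only: `struct host`, `find_completed()`, and in particular the tag layout (bit 0x04 marking a direct pool lookup, with the pool index in `a >> 3`) are assumptions for this sketch, not something established in this thread. If that layout holds, a completion value of 4 (the "Read 4 back from board" line) would select pool slot 0 without checking that the command is on the completion queue. That is one possible reading of the log, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's structures. */
struct cmd {
    unsigned long busaddr;
    struct cmd *prev, *next;
};

struct host {
    struct cmd *cmd_pool;   /* array of preallocated commands */
    size_t nr_cmds;         /* number of entries in cmd_pool */
    struct cmd *cmpQ;       /* circular completion queue, NULL if empty */
};

#define TAG_DIRECT 0x04UL   /* assumed: marks "index into cmd_pool" */

/* Return the command matching completion value a, or NULL if none. */
static struct cmd *find_completed(struct host *h, unsigned long a)
{
    struct cmd *c;

    if (a & TAG_DIRECT) {
        /* Path 1: direct lookup, c = h->cmd_pool + a2. */
        size_t a2 = a >> 3;

        if (a2 >= h->nr_cmds)
            return NULL;
        /* Nothing here verifies that the command is linked on cmpQ. */
        return h->cmd_pool + a2;
    }

    /* Path 2: walk the completion queue comparing bus addresses. */
    a &= ~3UL;
    c = h->cmpQ;
    if (c == NULL)
        return NULL;
    while (c->busaddr != a) {
        c = c->next;
        if (c == h->cmpQ)
            return NULL;    /* wrapped around without a match */
    }
    return c;
}
```

Under these assumptions, path 2 can only return a command that is actually linked on the queue, while path 1 can return a pool entry whose `next` and `prev` are still NULL.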

--
Jens Axboe

2008-11-19 17:01:59

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr



> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Wednesday, November 19, 2008 2:52 AM
> To: Randy Dunlap
> Cc: scsi; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>
> On Tue, Nov 18 2008, Randy Dunlap wrote:
> > Randy Dunlap wrote:
> > > Randy Dunlap wrote:
> > >> Miller, Mike (OS Dev) wrote:
> > >>>> -----Original Message-----
> > >>>> From: Randy Dunlap [mailto:[email protected]]
> > >>>> Sent: Thursday, September 25, 2008 3:40 PM
> > >>>> To: scsi
> > >>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml;
> > >>>> akpm
> > >>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> > >>>>
> > >>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
> > >>>>
> > >>>>> Jens Axboe wrote:
> > >>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
> > >>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
> > >>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
> > >>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
> > >>>> <do_cciss_intr+2509>
> > >>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627
> > >>>>>>>>>>> SA5_fifo_full
> > >>>>>>>>>>>
> > >>>>
> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:
> > >>>> 2
> > >>>>>>>> 06
> > >>>>>>>>>> OK ...that's confusing. It seems to be saying that
> > >>>> ctrlr_info_t
> > >>>>>>>>>> * was NULL. However, I can't see a way of
> getting into the
> > >>>>>>>> fifo_full
> > >>>>>>>>>> callback from do_cciss_intr ..
> > >>>>>>>>>> especially not with an NULL host.
> > >>>>>>>>>>
> > >>>>>>>>>> James
> > >>>>>>>>> That is weird. Even if we could get there
> fifo_full doesn't
> > >>>>>>>> do anything but wait for a bit.
> > >>>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> This just happened again. This time it's on
> 2.6.27-rc5-git3.
> > >>>>>>>>
> > >>>>>>>> ~Randy
> > >>>>>>> Thanks Randy. I think. :)
> > >>>>>>>
> > >>>>>>> I'll try to recreate in my lab.
> > >>>>>> This looks somewhat strange, mostly like 'c' is NULL
> and it's
> > >>>>>> oopsing in in removeQ (I don't think Randy's analysis is
> > >>>> correct in
> > >>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
> > >>>> cannot be
> > >>>>>> NULL, it's c->prev or c->next that are NULL.
> > >> This BUG: has happened (now) 5 times today. Higher
> frequency than
> > >> usual for some reason.
> > >>
> > >> I enabled CCISS_DEBUG and added one printk in removeQ(). On the
> > >> first call
> > >
> > > s/first/second/
> > >
> > >
> > >> to removeQ(), both c->next and c->prev are NULL.
> > >>
> > >> Here's the kernel log output from cciss:
> >
> > I added a printk() in addQ() as well. Here's the new output:
> >
> > HP CISS Driver (v 3.6.20)
> > ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54 cciss
> 0000:42:08.0:
> > PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54 command =
> > 147 irq = 36 board_id = 3211103c cciss 0000:42:08.0: irq 87 for
> > MSI/MSI-X address 0 = fdf80000 cfg base address = 10 cfg
> base address
> > index = 0 cfg offset = 400 Controller Configuration information
> > ------------------------------------
> > Signature = CISS
> > Spec Number = 1
> > Transport methods supported = 0x6
> > Transport methods active = 0x3
> > Requested transport Method = 0x0
> > Coalesce Interrupt Delay = 0x0
> > Coalesce Interrupt Count = 0x1
> > Max outstanding commands = 0x256
> > Bus Types = 0x200000
> > Server Name =
> > Heartbeat Counter = 0x1672
> >
> >
> > Trying to put board into Simple mode
> > I counter got to 1 0
> > Controller Configuration information
> > ------------------------------------
> > Signature = CISS
> > Spec Number = 1
> > Transport methods supported = 0x6
> > Transport methods active = 0x3
> > Requested transport Method = 0x0
> > Coalesce Interrupt Delay = 0x0
> > Coalesce Interrupt Count = 0x1
> > Max outstanding commands = 0x256
> > Bus Types = 0x200000
> > Server Name =
> > Heartbeat Counter = 0x1672
> >
> >
> > cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
> > cciss: intr_pending 8
> > cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
> > cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000,
> > next=ffff88007f83e000, prev=ffff88007f83e000 Sending
> 7f83e000 - down
> > to controller
> > cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
> > cciss: intr_pending 8
> > cciss: Read 4 back from board
> > cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000,
> > next=0000000000000000, prev=0000000000000000
> > BUG: unable to handle kernel NULL pointer dereference at
> > 0000000000000248
>
> Randy, can you post the debug patch you used? The above goes
> boom when it attempts to remove a command that isn't on the
> list, the Qptr in the last example should be empty, hence the
> oops. So I'd be interested in seeing what removeQ() calls
> this is, I'm assuming it's this bit in
> do_cciss_intr():
>
> ...
> while (c->busaddr != a) {
> c = c->next;
> if (c == h->cmpQ)
> break;
> }
> }
> /*
> * If we've found the command, take it off the
> * completion Q and free it
> */
> if (c->busaddr == a) {
> removeQ(&h->cmpQ, c);
> if (c->cmd_type == CMD_RWREQ) {
> complete_command(h, c, 0);
> ...
>
> If so, what part of the c lookup are you hitting - the on that does:
>
> c = h->cmd_pool + a2;
>
> or the c->busaddr check that his shown above?
>
> --
Randy,
I still can't reproduce this bug. I have your config file on a BL465c w/e200i. Just to confirm, you only see this at init time, correct?
Please post your debug patch as Jens requested.

-- mikem

2008-11-19 17:19:35

by Randy Dunlap

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Jens Axboe wrote:
> On Tue, Nov 18 2008, Randy Dunlap wrote:
>> Randy Dunlap wrote:
>>> Randy Dunlap wrote:
>>>> Miller, Mike (OS Dev) wrote:
>>>>>> -----Original Message-----
>>>>>> From: Randy Dunlap [mailto:[email protected]]
>>>>>> Sent: Thursday, September 25, 2008 3:40 PM
>>>>>> To: scsi
>>>>>> Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
>>>>>> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>>>>>>
>>>>>> On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:
>>>>>>
>>>>>>> Jens Axboe wrote:
>>>>>>>> On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
>>>>>>>>>>>>> 0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
>>>>>>>>>>>>> 0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
>>>>>>>>>>>>> 0x3bba <do_cciss_intr+1657>: je 0x3f0e
>>>>>> <do_cciss_intr+2509>
>>>>>>>>>>>>> $ addr2line -e cciss.o -f do_cciss_intr+0x627 SA5_fifo_full
>>>>>>>>>>>>>
>>>>>> /home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:2
>>>>>>>>>> 06
>>>>>>>>>>>> OK ...that's confusing. It seems to be saying that
>>>>>> ctrlr_info_t
>>>>>>>>>>>> * was NULL. However, I can't see a way of getting into the
>>>>>>>>>> fifo_full
>>>>>>>>>>>> callback from do_cciss_intr ..
>>>>>>>>>>>> especially not with an NULL host.
>>>>>>>>>>>>
>>>>>>>>>>>> James
>>>>>>>>>>> That is weird. Even if we could get there fifo_full doesn't
>>>>>>>>>> do anything but wait for a bit.
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> This just happened again. This time it's on 2.6.27-rc5-git3.
>>>>>>>>>>
>>>>>>>>>> ~Randy
>>>>>>>>> Thanks Randy. I think. :)
>>>>>>>>>
>>>>>>>>> I'll try to recreate in my lab.
>>>>>>>> This looks somewhat strange, mostly like 'c' is NULL and it's
>>>>>>>> oopsing in in removeQ (I don't think Randy's analysis is
>>>>>> correct in
>>>>>>>> assuming it's 'h' and it's in fifo_full). Given that 'c'
>>>>>> cannot be
>>>>>>>> NULL, it's c->prev or c->next that are NULL.
>>>> This BUG: has happened (now) 5 times today. Higher frequency than usual for
>>>> some reason.
>>>>
>>>> I enabled CCISS_DEBUG and added one printk in removeQ(). On the first call
>>> s/first/second/
>>>
>>>
>>>> to removeQ(), both c->next and c->prev are NULL.
>>>>
>>>> Here's the kernel log output from cciss:
>> I added a printk() in addQ() as well. Here's the new output:
>>
>> HP CISS Driver (v 3.6.20)
>> ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54
>> cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
>> command = 147
>> irq = 36
>> board_id = 3211103c
>> cciss 0000:42:08.0: irq 87 for MSI/MSI-X
>> address 0 = fdf80000
>> cfg base address = 10
>> cfg base address index = 0
>> cfg offset = 400
>> Controller Configuration information
>> ------------------------------------
>> Signature = CISS
>> Spec Number = 1
>> Transport methods supported = 0x6
>> Transport methods active = 0x3
>> Requested transport Method = 0x0
>> Coalesce Interrupt Delay = 0x0
>> Coalesce Interrupt Count = 0x1
>> Max outstanding commands = 0x256
>> Bus Types = 0x200000
>> Server Name =
>> Heartbeat Counter = 0x1672
>>
>>
>> Trying to put board into Simple mode
>> I counter got to 1 0
>> Controller Configuration information
>> ------------------------------------
>> Signature = CISS
>> Spec Number = 1
>> Transport methods supported = 0x6
>> Transport methods active = 0x3
>> Requested transport Method = 0x0
>> Coalesce Interrupt Delay = 0x0
>> Coalesce Interrupt Count = 0x1
>> Max outstanding commands = 0x256
>> Bus Types = 0x200000
>> Server Name =
>> Heartbeat Counter = 0x1672
>>
>>
>> cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
>> cciss: intr_pending 8
>> cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
>> cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
>> Sending 7f83e000 - down to controller
>> cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
>> cciss: intr_pending 8
>> cciss: Read 4 back from board
>> cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
>> BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
>
> Randy, can you post the debug patch you used? The above goes boom when

Sure. I have 2 patches. One is a fix for CCISS_DEBUG printk formats that I
posted to linux-scsi yesterday. The other just adds more debug code.

> it attempts to remove a command that isn't on the list, the Qptr in the
> last example should be empty, hence the oops. So I'd be interested in
> seeing what removeQ() calls this is, I'm assuming it's this bit in
> do_cciss_intr():
>
> 	...
> 			while (c->busaddr != a) {
> 				c = c->next;
> 				if (c == h->cmpQ)
> 					break;
> 			}
> 		}
> 		/*
> 		 * If we've found the command, take it off the
> 		 * completion Q and free it
> 		 */
> 		if (c->busaddr == a) {
> 			removeQ(&h->cmpQ, c);
> 			if (c->cmd_type == CMD_RWREQ) {
> 				complete_command(h, c, 0);
> 	...
>
> If so, what part of the c lookup are you hitting - the one that does:
>
> c = h->cmd_pool + a2;
>
> or the c->busaddr check that is shown above?

I don't know that the patch will tell us which call it is.
The added code is inside removeQ() and addQ(), not near the calls
to them.

--
~Randy


Attachments:
cciss-check-more.patch (1.79 kB)
cciss-debug-printk.patch (2.21 kB)
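
Note on telling the removeQ() callers apart: since the debug prints sit inside removeQ() and addQ(), one possible way to see which call site is involved is to print the return address from inside the function. The following is only a userspace sketch of that idea, not the patch posted in this thread. The names struct cmd, removeQ_debug, last_caller, call_from_start_io and call_from_intr are made up for illustration, and __builtin_return_address() is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Reduced stand-in for CommandList_struct: only the old link fields. */
struct cmd {
	struct cmd *prev;
	struct cmd *next;
};

/* Hypothetical: remembers who called the debug hook last. */
static void *last_caller;

/*
 * Debug variant of the print added to removeQ(). noinline keeps this a
 * real call, so __builtin_return_address(0) is an address inside the
 * function that called us.
 */
__attribute__((noinline))
static void removeQ_debug(struct cmd **Qptr, struct cmd *c)
{
	assert(Qptr != NULL && c != NULL);
	last_caller = __builtin_return_address(0);
	fprintf(stderr,
		"removeQ: Qptr=%p, c=%p, next=%p, prev=%p, caller=%p\n",
		(void *)Qptr, (void *)c, (void *)c->next, (void *)c->prev,
		last_caller);
}

/* Hypothetical call site standing in for start_io(). */
__attribute__((noinline))
static void *call_from_start_io(struct cmd **Qptr, struct cmd *c)
{
	removeQ_debug(Qptr, c);
	return last_caller;
}

/* Hypothetical call site standing in for do_cciss_intr(). */
__attribute__((noinline))
static void *call_from_intr(struct cmd **Qptr, struct cmd *c)
{
	removeQ_debug(Qptr, c);
	return last_caller;
}
```

The printed caller address can then be resolved to a function with addr2line, much as was done earlier in the thread. In the kernel itself the same address could be handed to printk(), or dump_stack() could be called from removeQ() to get a full backtrace; neither is part of the patches attached here.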

2008-11-19 17:23:50

by Randy Dunlap

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Miller, Mike (OS Dev) wrote:
>
> Randy,
> I still can't reproduce this bug. I have your config file on a BL465c w/e200i. Just to confirm, you only see this at init time, correct?

Yes, only at init time.

> Please post your debug patch as Jens requested.

Done (separately).

I need to back up a bit. Yesterday these BUGs happened consistently,
so I wondered why. Then I recalled that for debugging another bug/problem,
I had changed the test system's normal boot kernel from 2.6.25 to
2.6.18-8. The test system is used to build and then boot the new kernel
*via kexec*, so it's quite possible (or certain) that something in the kexec
world has been fixed since 2.6.18. I don't recall seeing this problem
lately when using 2.6.25 to kexec/boot the new test kernel, so I'm quite
willing to drop the bug for now and then re-open it if I see the problem
again. OK??

--
~Randy

2008-11-19 17:28:21

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr



> -----Original Message-----
> From: Randy Dunlap [mailto:[email protected]]
> Sent: Wednesday, November 19, 2008 11:23 AM
> To: Miller, Mike (OS Dev)
> Cc: Jens Axboe; scsi; James Bottomley; lkml; akpm
> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>
> Miller, Mike (OS Dev) wrote:
> >
> > Randy,
> > I still can't reproduce this bug. I have your config file
> > on a BL465c w/e200i. Just to confirm, you only see this at
> > init time, correct?
>
> Yes, only at init time.
>
> > Please post your debug patch as Jens requested.
>
> Done (separately).
>
> I need to back up a bit. Yesterday these BUGs happened
> consistently, so I wondered why. Then I recalled that for
> debugging another bug/problem, I had changed the test
> system's normal boot kernel from 2.6.25 to 2.6.18-8. The
> test system is used to build and then boot the new kernel
> *via kexec*, so it's quite possible (or certain) that
> something in the kexec world has been fixed since 2.6.18. I
> don't recall seeing this problem lately when using 2.6.25 to
> kexec/boot the new test kernel, so I'm quite willing to drop
> the bug for now and then re-open it if I see the problem again. OK??

Ahhhh, the kexec piece was missing. Now I don't feel quite so clueless. I'm OK with dropping the bug for now. Jens, James?

-- mikem


>
> --
> ~Randy
>

2008-11-19 17:31:33

by Jens Axboe

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Wed, Nov 19 2008, Miller, Mike (OS Dev) wrote:
>
>
> > -----Original Message-----
> > From: Randy Dunlap [mailto:[email protected]]
> > Sent: Wednesday, November 19, 2008 11:23 AM
> > To: Miller, Mike (OS Dev)
> > Cc: Jens Axboe; scsi; James Bottomley; lkml; akpm
> > Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
> >
> > Miller, Mike (OS Dev) wrote:
> > >
> > > Randy,
> > > I still can't reproduce this bug. I have your config file
> > > on a BL465c w/e200i. Just to confirm, you only see this at
> > > init time, correct?
> >
> > Yes, only at init time.
> >
> > > Please post your debug patch as Jens requested.
> >
> > Done (separately).
> >
> > I need to back up a bit. Yesterday these BUGs happened
> > consistently, so I wondered why. Then I recalled that for
> > debugging another bug/problem, I had changed the test
> > system's normal boot kernel from 2.6.25 to 2.6.18-8. The
> > test system is used to build and then boot the new kernel
> > *via kexec*, so it's quite possible (or certain) that
> > something in the kexec world has been fixed since 2.6.18. I
> > don't recall seeing this problem lately when using 2.6.25 to
> > kexec/boot the new test kernel, so I'm quite willing to drop
> > the bug for now and then re-open it if I see the problem again. OK??
>
> Ahhhh, the kexec piece was missing. Now I don't feel quite so
> clueless. I'm OK with dropping the bug for now. Jens, James?

Yeah, kexec is definitely a clue. My guess is that we got some sort of
left over completion. Regardless of the status of this particular bug,
I think it would be a good idea to add some checks for when an attempt
is made to remove a command from a queue it isn't currently on.

--
Jens Axboe
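
Note on the suggested check: with the hand-rolled circular list, a command whose next/prev are NULL (as in the log above) makes removeQ() follow c->prev and fault. A minimal userspace sketch of such a check might look like the following. addQ() mirrors the driver's existing logic; cmd_t, on_queue() and removeQ_checked() are hypothetical names used only for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for CommandList_struct: only the link fields. */
typedef struct cmd {
	struct cmd *prev;
	struct cmd *next;
} cmd_t;

/* Same logic as the driver's addQ(): circular list, *Qptr is the head. */
static void addQ(cmd_t **Qptr, cmd_t *c)
{
	if (*Qptr == NULL) {
		*Qptr = c;
		c->next = c->prev = c;
	} else {
		c->prev = (*Qptr)->prev;
		c->next = *Qptr;
		(*Qptr)->prev->next = c;
		(*Qptr)->prev = c;
	}
}

/* Hypothetical helper: walk the ring once and look for c. */
static int on_queue(const cmd_t *head, const cmd_t *c)
{
	const cmd_t *p = head;

	if (p == NULL)
		return 0;
	do {
		if (p == c)
			return 1;
		p = p->next;
	} while (p != head);
	return 0;
}

/*
 * removeQ() with a membership check in front. Returns NULL, and leaves
 * the queue untouched, if c is not linked on the queue at *Qptr.
 */
static cmd_t *removeQ_checked(cmd_t **Qptr, cmd_t *c)
{
	if (c == NULL || !on_queue(*Qptr, c))
		return NULL;

	if (c->next != c) {
		if (*Qptr == c)
			*Qptr = c->next;
		c->prev->next = c->next;
		c->next->prev = c->prev;
	} else {
		*Qptr = NULL;	/* c was the only entry */
	}
	c->next = c->prev = NULL;
	return c;
}
```

The walk makes every removal linear in the queue length, which is one reason a per-command "am I linked?" test is the more attractive approach.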

2008-11-19 19:16:40

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr

Jens wrote:

>
> Yeah, kexec is definitely a clue. My guess is that we got
> some sort of left over completion. Regardless of the status
> of this particular bug or not, I think it would be a good
> idea to add some checks for when a command is attempted
> removed from a queue it isn't currently on.
>

I agree, I'll fix.

-- mikem

2008-11-19 20:48:28

by Jens Axboe

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Wed, Nov 19 2008, Miller, Mike (OS Dev) wrote:
> Jens wrote:
>
> >
> > Yeah, kexec is definitely a clue. My guess is that we got
> > some sort of left over completion. Regardless of the status
> > of this particular bug or not, I think it would be a good
> > idea to add some checks for when a command is attempted
> > removed from a queue it isn't currently on.
> >
>
> I agree, I'll fix.

I'd propose just converting it to list_head instead of doing it
manually. Heck, that should be a 5 minute job, let me just do it...

OK, here it is, totally untested (it compiles, must be golden...)

3 files changed, 24 insertions(+), 40 deletions(-)

diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c
index 12de1fd..d2923de 100644
--- a/drivers/block/cciss.c
+++ b/drivers/block/cciss.c
@@ -215,30 +215,17 @@ static struct block_device_operations cciss_fops = {
/*
* Enqueuing and dequeuing functions for cmdlists.
*/
-static inline void addQ(CommandList_struct **Qptr, CommandList_struct *c)
+static inline void addQ(struct list_head *list, CommandList_struct *c)
{
- if (*Qptr == NULL) {
- *Qptr = c;
- c->next = c->prev = c;
- } else {
- c->prev = (*Qptr)->prev;
- c->next = (*Qptr);
- (*Qptr)->prev->next = c;
- (*Qptr)->prev = c;
- }
+ list_add(&c->list, list);
}

-static inline CommandList_struct *removeQ(CommandList_struct **Qptr,
- CommandList_struct *c)
+static inline CommandList_struct *removeQ(CommandList_struct *c)
{
- if (c && c->next != c) {
- if (*Qptr == c)
- *Qptr = c->next;
- c->prev->next = c->next;
- c->next->prev = c->prev;
- } else {
- *Qptr = NULL;
- }
+ if (WARN_ON(list_empty(&c->list)))
+ return NULL;
+
+ list_del_init(&c->list);
return c;
}

@@ -506,6 +493,7 @@ static CommandList_struct *cmd_alloc(ctlr_info_t *h, int get_from_pool)
c->cmdindex = i;
}

+ INIT_LIST_HEAD(&c->list);
c->busaddr = (__u32) cmd_dma_handle;
temp64.val = (__u64) err_dma_handle;
c->ErrDesc.Addr.lower = temp64.val32.lower;
@@ -2543,7 +2531,8 @@ static void start_io(ctlr_info_t *h)
{
CommandList_struct *c;

- while ((c = h->reqQ) != NULL) {
+ while (!list_empty(&h->reqQ)) {
+ c = list_entry(h->reqQ.next, CommandList_struct, list);
/* can't do anything if fifo is full */
if ((h->access.fifo_full(h))) {
printk(KERN_WARNING "cciss: fifo full\n");
@@ -2551,7 +2540,7 @@ static void start_io(ctlr_info_t *h)
}

/* Get the first entry from the Request Q */
- removeQ(&(h->reqQ), c);
+ removeQ(c);
h->Qdepth--;

/* Tell the controller execute command */
@@ -2981,15 +2970,8 @@ static irqreturn_t do_cciss_intr(int irq, void *dev_id)

} else {
a &= ~3;
- if ((c = h->cmpQ) == NULL) {
- printk(KERN_WARNING
- "cciss: Completion of %08x ignored\n",
- a1);
- continue;
- }
- while (c->busaddr != a) {
- c = c->next;
- if (c == h->cmpQ)
+ list_for_each_entry(c, &h->cmpQ, list) {
+ if (c->busaddr == a)
break;
}
}
@@ -2998,7 +2980,7 @@ static irqreturn_t do_cciss_intr(int irq, void *dev_id)
* completion Q and free it
*/
if (c->busaddr == a) {
- removeQ(&h->cmpQ, c);
+ removeQ(c);
if (c->cmd_type == CMD_RWREQ) {
complete_command(h, c, 0);
} else if (c->cmd_type == CMD_IOCTL_PEND) {
@@ -3417,6 +3399,8 @@ static int __devinit cciss_init_one(struct pci_dev *pdev,
return -1;

hba[i]->busy_initializing = 1;
+ INIT_LIST_HEAD(&hba[i]->cmpQ);
+ INIT_LIST_HEAD(&hba[i]->reqQ);

if (cciss_pci_init(hba[i], pdev) != 0)
goto clean1;
@@ -3724,15 +3708,16 @@ static void fail_all_cmds(unsigned long ctlr)
pci_disable_device(h->pdev); /* Make sure it is really dead. */

/* move everything off the request queue onto the completed queue */
- while ((c = h->reqQ) != NULL) {
- removeQ(&(h->reqQ), c);
+ while (!list_empty(&h->reqQ)) {
+ c = list_entry(h->reqQ.next, CommandList_struct, list);
+ removeQ(c);
h->Qdepth--;
addQ(&(h->cmpQ), c);
}

/* Now, fail everything on the completed queue with a HW error */
- while ((c = h->cmpQ) != NULL) {
- removeQ(&h->cmpQ, c);
+ while (!list_empty(&h->cmpQ)) {
+ removeQ(c);
c->err_info->CommandStatus = CMD_HARDWARE_ERR;
if (c->cmd_type == CMD_RWREQ) {
complete_command(h, c, 0);
diff --git a/drivers/block/cciss.h b/drivers/block/cciss.h
index 24a7efa..5a9806a 100644
--- a/drivers/block/cciss.h
+++ b/drivers/block/cciss.h
@@ -89,8 +89,8 @@ struct ctlr_info
struct access_method access;

/* queue and queue Info */
- CommandList_struct *reqQ;
- CommandList_struct *cmpQ;
+ struct list_head reqQ;
+ struct list_head cmpQ;
unsigned int Qdepth;
unsigned int maxQsinceinit;
unsigned int maxSG;
diff --git a/drivers/block/cciss_cmd.h b/drivers/block/cciss_cmd.h
index 43bf559..899cc0e 100644
--- a/drivers/block/cciss_cmd.h
+++ b/drivers/block/cciss_cmd.h
@@ -265,8 +265,7 @@ typedef struct _CommandList_struct {
int ctlr;
int cmd_type;
long cmdindex;
- struct _CommandList_struct *prev;
- struct _CommandList_struct *next;
+ struct list_head list;
struct request * rq;
struct completion *waiting;
int retry_count;


--
Jens Axboe
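
Note on the list_head approach: the point of the conversion above is that an unlinked entry points at itself, so removeQ() can tell cheaply whether the command is on any queue. The sketch below re-implements just enough of the list helpers in userspace to show that behaviour. It is not the kernel's <linux/list.h>, struct cmd is a reduced stand-in for CommandList_struct, and the WARN_ON() from the patch is replaced by a plain NULL return.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-implementation of the list_head helpers. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* True for an empty head, and also for an entry that is not linked. */
static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink entry and make it point at itself again. */
static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	INIT_LIST_HEAD(entry);
}

/* Reduced stand-in for CommandList_struct. */
struct cmd {
	unsigned int busaddr;
	struct list_head list;
};

static void addQ(struct list_head *list, struct cmd *c)
{
	list_add(&c->list, list);
}

/* Refuses to unlink a command that is not on any queue. */
static struct cmd *removeQ(struct cmd *c)
{
	if (list_empty(&c->list))
		return NULL;
	list_del_init(&c->list);
	return c;
}
```

One thing the sketch makes visible: list_add() inserts at the front, while the old addQ() appended at the tail, so taking entries from the front of reqQ would return the newest command first. list_add_tail() would keep the old ordering, if that matters here.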

2008-11-20 09:15:51

by Jens Axboe

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Wed, Nov 19 2008, Jens Axboe wrote:
> On Wed, Nov 19 2008, Miller, Mike (OS Dev) wrote:
> > Jens wrote:
> >
> > >
> > > Yeah, kexec is definitely a clue. My guess is that we got
> > > some sort of left over completion. Regardless of the status
> > > of this particular bug or not, I think it would be a good
> > > idea to add some checks for when a command is attempted
> > > removed from a queue it isn't currently on.
> > >
> >
> > I agree, I'll fix.
>
> I'd propose just converting it to list_head instead of doing it
> manually. Heck, that should be a 5 minute job, let me just do it...
>
> OK, here it is, totally untested (it compiles, must be golden...)

It was missing a list_entry() in fail_all_cmds(), apart from that it was
fine. I changed it to use hlist instead, as that is more appropriate and
similar to how it worked before. It also means there's no extra space
usage in the controller structure. I've tested it and it works fine for
me.

Mike, can you give this a look-over and give me a Reviewed-by or
similar? As an extra bonus, it also gets rid of some code.

3 files changed, 33 insertions(+), 46 deletions(-)

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e303054e55acd1b6478b8859a5f8648bfaf69a44

--
Jens Axboe
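
Note on the hlist variant: the space argument is that an hlist head is a single pointer, the same size as the old CommandList_struct * queue pointers, while a list_head is two. The following userspace sketch re-implements a few hlist helpers to illustrate this and the "not on a queue" test. It has not been checked against the commit linked above; struct cmd, addQ() and removeQ() here are illustrative stand-ins only.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-implementation of the hlist helpers. */
struct hlist_node {
	struct hlist_node *next;
	struct hlist_node **pprev;	/* address of the pointer that points at us */
};

struct hlist_head {
	struct hlist_node *first;	/* one pointer, NULL when empty */
};

static void INIT_HLIST_NODE(struct hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

/* A node with no back pointer is not linked on any list. */
static int hlist_unhashed(const struct hlist_node *n)
{
	return n->pprev == NULL;
}

static int hlist_empty(const struct hlist_head *h)
{
	return h->first == NULL;
}

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Unlink n if it is linked, then reset it to the unlinked state. */
static void hlist_del_init(struct hlist_node *n)
{
	if (hlist_unhashed(n))
		return;
	*n->pprev = n->next;
	if (n->next)
		n->next->pprev = n->pprev;
	INIT_HLIST_NODE(n);
}

/* Reduced stand-in for CommandList_struct. */
struct cmd {
	unsigned int busaddr;
	struct hlist_node list;
};

static void addQ(struct hlist_head *list, struct cmd *c)
{
	hlist_add_head(&c->list, list);
}

/* Refuses to unlink a command that is not on any queue. */
static struct cmd *removeQ(struct cmd *c)
{
	if (hlist_unhashed(&c->list))
		return NULL;
	hlist_del_init(&c->list);
	return c;
}
```

An empty hlist head is simply NULL, which also matches how the old reqQ and cmpQ pointers signalled an empty queue.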

2008-11-20 16:42:23

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr

Jens wrote:

> >
> > I'd propose just converting it to list_head instead of doing it
> > manually. Heck, that should be a 5 minute job, let me just do it...
> >
> > OK, here it is, totally untested (it compiles, must be golden...)
>
> It was missing a list_entry() in fail_all_cmds(), apart from
> that it was fine. I changed it to use hlist instead, as that
> is more appropriate and similar to how it worked before. It
> also means there's no extra space usage in the controller
> structure. I've tested it and it works fine for me.
>
> Mike, can you give this a look-over and give me a Reviewed-by
> or similar? As an extra bonus, it also gets rid of some code.
>
> 3 files changed, 33 insertions(+), 46 deletions(-)
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e303054e55acd1b6478b8859a5f8648bfaf69a44

That works for me. :)

Acked-by: Mike Miller <[email protected]>

2008-11-20 17:52:32

by Jens Axboe

Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Thu, Nov 20 2008, Miller, Mike (OS Dev) wrote:
> Jens wrote:
>
> > >
> > > I'd propose just converting it to list_head instead of doing it
> > > manually. Heck, that should be a 5 minute job, let me just do it...
> > >
> > > OK, here it is, totally untested (it compiles, must be golden...)
> >
> > It was missing a list_entry() in fail_all_cmds(), apart from
> > that it was fine. I changed it to use hlist instead, as that
> > is more appropriate and similar to how it worked before. It
> > also means there's no extra space usage in the controller
> > structure. I've tested it and it works fine for me.
> >
> > Mike, can you give this a look-over and give me a Reviewed-by
> > or similar? As an extra bonus, it also gets rid of some code.
> >
> > 3 files changed, 33 insertions(+), 46 deletions(-)
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e303054e55acd1b6478b8859a5f8648bfaf69a44
>
> That works for me. :)
>
> Acked-by: Mike Miller <[email protected]>

Excellent, thanks Mike. If you could run it through a cycle or so of
your regular testing, I'd feel 100% confident in it.

--
Jens Axboe

2008-11-20 19:14:23

by Mike Miller

Subject: RE: in 2.6.23-rc3-git7 in do_cciss_intr



> -----Original Message-----
> From: Jens Axboe [mailto:[email protected]]
> Sent: Thursday, November 20, 2008 11:51 AM
> To: Miller, Mike (OS Dev)
> Cc: Randy Dunlap; scsi; James Bottomley; lkml; akpm
> Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr
>
> On Thu, Nov 20 2008, Miller, Mike (OS Dev) wrote:
> > Jens wrote:
> >
> > > >
> > > > I'd propose just converting it to list_head instead of doing it
> > > > manually. Heck, that should be a 5 minute job, let me just do it...
> > > >
> > > > OK, here it is, totally untested (it compiles, must be golden...)
> > >
> > > It was missing a list_entry() in fail_all_cmds(), apart from that it
> > > was fine. I changed it to use hlist instead, as that is more
> > > appropriate and similar to how it worked before. It also means
> > > there's no extra space usage in the controller structure. I've
> > > tested it and it works fine for me.
> > >
> > > Mike, can you give this a look-over and give me a Reviewed-by or
> > > similar? As an extra bonus, it also gets rid of some code.
> > >
> > > 3 files changed, 33 insertions(+), 46 deletions(-)
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e303054e55acd1b6478b8859a5f8648bfaf69a44
> >
> > That works for me. :)
> >
> > Acked-by: Mike Miller <[email protected]>
>
> Excellent, thanks Mike. If you could run it through a cycle
> or so of your regular testing, I'd feel 100% confident in it.

Jens,
I'm porting the changes into our build environment so they will go
through the full QA cycle. I'm also testing in my lab before giving the
changes to QA.

Thanks for the quick fix.

-- mikem