2002-10-06 06:53:06

by Murray J. Root

[permalink] [raw]
Subject: 2.5.40-ac4 kernel BUG at slab.c:1477!

ASUS P4S533 (SiS645DX chipset)
P4 2Ghz
1G PC2700 RAM
NVidia GeForce2 GTS (XFree86 driver, not nvidia binary)
hda: Maxtor 4G100J5, ATA DISK drive
hdc: LG CD-RW CED-8120B, ATAPI CD/DVD-ROM drive
hdd: SAMSUNG DVD-ROM SD-616T, ATAPI CD/DVD-ROM drive

This happens at random during boot when loading modules.
About half of the time ide-scsi works fine.
The system continues to boot after the BUG with /dev/hdc unaccessible.

>From /var/log/messages:

SCSI subsystem driver Revision: 1.00
scsi0 : SCSI host adapter emulation for IDE ATAPI devices
scsi_eh_offline_sdevs: Device offlined - not ready or command retry failed after error recovery: host 0 channel 0 id 0 lun 0
Vendor: Model: Rev:
Type: Direct-Access ANSI SCSI revision: 00
hdc: lost interrupt
hdc: status timeout: status=0xd0 { Busy }
hdc: DMA disabled
hdc: drive not ready for command
hdc: ATAPI reset complete
------------[ cut here ]------------
kernel BUG at slab.c:1477!
invalid operand: 0000
ide-scsi scsi_mod rtc
CPU: 0
EIP: 0060:[<c01365ee>] Not tainted
EFLAGS: 00010016
EIP is at kmem_cache_free_one+0x7e/0x240
eax: 5a2cf071 ebx: c1009c10 ecx: f7561e90 edx: c1b0d5fc
esi: c038c080 edi: c038c000 ebp: f7561e28 esp: f7561e04
ds: 0068 es: 0068 ss: 0068
Process insmod (pid: 257, threadinfo=f7560000 task=f7599b00)
Stack: 742f3073 65677261 6c2f3074 00306e75 c038c000 c1b0d5fc c1009c10 c038c084
00009c10 f7561e48 c0135bcf c1b0d5fc c038c084 00000286 f7cd3600 c1b616b4
c038c084 f7561e7c fa8eee32 c038c084 f7561e68 c1b616b4 c038c084 f7561e6c
Call Trace:
[<c0135bcf>]kfree+0x5f/0xb0
[<fa8eee32>]scsi_probe_and_add_lun+0x92/0x190 [scsi_mod]
[<fa8ef08e>]scsi_scan_target+0x4e/0x90 [scsi_mod]
[<fa8ef301>]scan_scsis+0x91/0x17c [scsi_mod]
[<fa8f8d92>].rodata.str1.32+0x392/0x3e6 [ide-scsi]
[<fa8e79d2>]scsi_register_host_Rb0dc194c+0x222/0x350 [scsi_mod]
[<fa8f84fe>]init_module+0x1e/0x30 [ide-scsi]
[<fa8f96c0>]idescsi_template+0x0/0x80 [ide-scsi]
[<c011ab1c>]sys_init_module+0x53c/0x690
[<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
[<fa8f89cf>]__ksymtab+0x0/0x31 [ide-scsi]
[<fa8f8fac>].kmodtab+0x0/0xc [ide-scsi]
[<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
[<c010781b>]syscall_call+0x7/0xb

Code: 0f 0b c5 05 07 ba 2e c0 8b 4d f0 89 f2 b8 71 f0 2c 5a 8b 59

--
Murray J. Root
------------------------------------------------
DISCLAIMER: http://www.goldmark.org/jeff/stupid-disclaimers/
------------------------------------------------
Mandrake on irc.freenode.net:
#mandrake & #mandrake-linux = help for newbies
#mdk-cooker = Mandrake Cooker


2002-10-06 13:25:46

by Manfred Spraul

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!

--- 2.5/drivers/scsi/scsi_scan.c Sun Sep 22 06:25:17 2002
+++ build-2.5/drivers/scsi/scsi_scan.c Sun Oct 6 14:21:58 2002
@@ -1526,8 +1526,9 @@
GFP_DMA : 0);
if (scsi_result == NULL)
goto alloc_failed;
-
+printk(KERN_INFO"scsi_result: %p, start %lxh.\n",scsi_result, ((unsigned long*)scsi_result)[-1]);
scsi_probe_lun(sreq, scsi_result, &bflags);
+printk(KERN_INFO"scsi_result: %p, start %lxh.\n",scsi_result, ((unsigned long*)scsi_result)[-1]);
if (sreq->sr_result)
res = SCSI_SCAN_NO_RESPONSE;
else {
@@ -1550,8 +1551,10 @@
" no device added\n"));
res = SCSI_SCAN_TARGET_PRESENT;
} else {
+printk(KERN_INFO"scsi_result: %p, start %lxh.\n",scsi_result, ((unsigned long*)scsi_result)[-1]);
res = scsi_add_lun(sdevscan, &sdev, sreq, scsi_result,
&bflags);
+printk(KERN_INFO"scsi_result: %p, start %lxh.\n",scsi_result, ((unsigned long*)scsi_result)[-1]);
if (res == SCSI_SCAN_LUN_PRESENT) {
BUG_ON(sdev == NULL);
if ((bflags & BLIST_KEY) != 0) {
@@ -1574,9 +1577,13 @@
}
}
}
+printk(KERN_INFO"scsi_result: %p, start %lxh.\n",scsi_result, ((unsigned long*)scsi_result)[-1]);
kfree(scsi_result);
+printk(KERN_INFO"after kfree\n");
scsi_release_request(sreq);
+printk(KERN_INFO"after release_request\n");
scsi_release_commandblocks(sdevscan);
+printk(KERN_INFO"after release_commandblocks\n");
return res;

alloc_failed:


Attachments:
patch-scsi-debug (1.37 kB)

2002-10-06 22:37:39

by Murray J. Root

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!

On Sun, Oct 06, 2002 at 03:31:03PM +0200, Manfred Spraul wrote:
> > This happens at random during boot when loading modules.
> > About half of the time ide-scsi works fine.
> > The system continues to boot after the BUG with /dev/hdc unaccessible.
>
> from mm/slab.c:
>
> 1475 if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
> 1476 /* Either write before start, or a double free. */
> 1477 BUG();
>
> You run an uniprocessor kernel, with slab debugging enabled, and the
> red-zoning test notices a write before the beginning of the buffer
> during scsi_probe_and_add_lun, with ide-scsi.
>
> Andre: Do you know if ide-scsi makes any assumptions about memory
> alignment of the input buffers? With slab debugging disabled, the
> alignment is 32 or 64 bytes, with debugging enabled, it's just 4 byte
> [actually sizeof(void*)] aligned.
>
> Murray, could you apply the attached patch? It dumps the redzone value
> during scsi_probe_and_add_lun. Hopefully this will help to find who
> corrupts the buffers.
>
After patching:
Soft reboot:
kernel oops - rebooted (I'll try to get the oops data from a pic I
took of the screen later).

Hard reboot:
SCSI subsystem driver Revision: 1.00
scsi0 : SCSI host adapter emulation for IDE ATAPI devices
scsi_result: c0334084, start 170fc2a5h.
scsi_eh_offline_sdevs: Device offlined - not ready or command retry failed after error recovery: host 0 channel 0 id 0 lun 0
scsi_result: c0334084, start 170fc2a5h.
scsi_result: c0334084, start 170fc2a5h.
Vendor: Model: Rev:
Type: Direct-Access ANSI SCSI revision: 00
hdc: lost interrupt
hdc: status timeout: status=0xd0 { Busy }
hdc: DMA disabled
hdc: drive not ready for command
hdc: ATAPI reset complete
scsi_eh_offline_sdevs: Device offlined - not ready or command retry failed after error recovery: host 0 channel 0 id 0 lun 0
scsi scan: host 0 channel 0 id 0 lun 0 identifier too long, length 116, max 80. Device might be improperly identified.
scsi_result: c0334084, start 5a2cf071h.
scsi_result: c0334084, start 5a2cf071h.
------------[ cut here ]------------
kernel BUG at slab.c:1477!
invalid operand: 0000
ide-scsi scsi_mod rtc
CPU: 0
EIP: 0060:[<c01365ee>] Not tainted
EFLAGS: 00010016
EIP is at kmem_cache_free_one+0x7e/0x240
eax: 5a2cf071 ebx: c1008cf0 ecx: c02db214 edx: c1b0d5fc
esi: c0334080 edi: c0334000 ebp: f784fe28 esp: f784fe04
ds: 0068 es: 0068 ss: 0068
Process insmod (pid: 280, threadinfo=f784e000 task=f7956800)
Stack: ffffdaf6 c034198b 00000246 0000002b c0334000 c1b0d5fc c1008cf0 c0334084
00008cf0 f784fe48 c0135bcf c1b0d5fc c0334084 00000286 c0334084 f7cd5a00
c1b645ac f784fe7c fa8eee71 c0334084 c0334084 5a2cf071 c0334084 f784fe6c
Call Trace:
[<c0135bcf>]kfree+0x5f/0xb0
[<fa8eee71>]scsi_probe_and_add_lun+0xd1/0x220 [scsi_mod]
[<fa8ef11e>]scsi_scan_target+0x4e/0x90 [scsi_mod]
[<fa8ef391>]scan_scsis+0x91/0x17c [scsi_mod]
[<fa8f8d92>].rodata.str1.32+0x392/0x3e6 [ide-scsi]
[<fa8e79d2>]scsi_register_host_Rb0dc194c+0x222/0x350 [scsi_mod]
[<fa8f84fe>]init_module+0x1e/0x30 [ide-scsi]
[<fa8f96c0>]idescsi_template+0x0/0x80 [ide-scsi]
[<c011ab1c>]sys_init_module+0x53c/0x690
[<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
[<fa8f89cf>]__ksymtab+0x0/0x31 [ide-scsi]
[<fa8f8fac>].kmodtab+0x0/0xc [ide-scsi]
[<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
[<c010781b>]syscall_call+0x7/0xb

Code: 0f 0b c5 05 f2 2c 2a c0 8b 4d f0 89 f2 b8 71 f0 2c 5a 8b 59

--
Murray J. Root
------------------------------------------------
DISCLAIMER: http://www.goldmark.org/jeff/stupid-disclaimers/
------------------------------------------------
Mandrake on irc.freenode.net:
#mandrake & #mandrake-linux = help for newbies
#mdk-cooker = Mandrake Cooker

2002-10-06 23:09:31

by Murray J. Root

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!

On Sun, Oct 06, 2002 at 03:31:03PM +0200, Manfred Spraul wrote:
> > This happens at random during boot when loading modules.
> > About half of the time ide-scsi works fine.
> > The system continues to boot after the BUG with /dev/hdc unaccessible.
>
> from mm/slab.c:
>
> 1475 if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
> 1476 /* Either write before start, or a double free. */
> 1477 BUG();
>
> You run an uniprocessor kernel, with slab debugging enabled, and the
> red-zoning test notices a write before the beginning of the buffer
> during scsi_probe_and_add_lun, with ide-scsi.
>
> Andre: Do you know if ide-scsi makes any assumptions about memory
> alignment of the input buffers? With slab debugging disabled, the
> alignment is 32 or 64 bytes, with debugging enabled, it's just 4 byte
> [actually sizeof(void*)] aligned.
>
> Murray, could you apply the attached patch? It dumps the redzone value
> during scsi_probe_and_add_lun. Hopefully this will help to find who
> corrupts the buffers.
>
After patch did a soft reboot and got a kernel oops.
I took a pic of it, but did a lousy job, so deciphering it is tricky - here's
what I can be sure is correct:

Mounting local filesystems: Unable to handle kernel paging request at virtual address 00001124
*pde = 00000000
Oops: 0000
ide-scsi scsi_mod rtc
CPU: 0
EIP: 0060:[<fa8ebe82>] Not tainted
EFLAGS: 00010296
EIP is at scsi_decide_disposition+0x12/0x160 [scsi_mod]

Call Trace:
scsi_softirq+0x4b/0xd0 [scsi_mod]
tasklet_hi_action+0x46/0x70
do_softirq+0xb5/0xc0
do_IRQ+0x109/0x130
default_idle+0x0/0x40
default_idle+0x0/0x40
common_interrupt+0x10/0x20
default_idle+0x0/0x40
default_idle+0x0/0x40
default_idle+0x24/0x40
cpu_idle+0x3a/0x50
stext+0x0/0x30

Code: f6 81 24 01 00 00 01 74 35 0f b6 83 3e 01 00 00 8b 93 3c 01
<0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing

I'll save the pic in case you need the register or stack values

--
Murray J. Root
------------------------------------------------
DISCLAIMER: http://www.goldmark.org/jeff/stupid-disclaimers/
------------------------------------------------
Mandrake on irc.freenode.net:
#mandrake & #mandrake-linux = help for newbies
#mdk-cooker = Mandrake Cooker

2002-10-07 01:46:00

by Andre Hedrick

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!


Only that if it is SG it must conform the ATA/ATAPI DMA on 64kb on 4b
boundaries. Some what basic stuff iirc, but remember this is piped via
scsi and the eh-stratagy/joke leaves something to be desired.

One concern could be crossing SG lists with the device in PIO, that could
get hairy.

On Sun, 6 Oct 2002, Manfred Spraul wrote:

> > This happens at random during boot when loading modules.
> > About half of the time ide-scsi works fine.
> > The system continues to boot after the BUG with /dev/hdc unaccessible.
>
> from mm/slab.c:
>
> 1475 if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
> 1476 /* Either write before start, or a double free. */
> 1477 BUG();
>
> You run an uniprocessor kernel, with slab debugging enabled, and the
> red-zoning test notices a write before the beginning of the buffer
> during scsi_probe_and_add_lun, with ide-scsi.
>
> Andre: Do you know if ide-scsi makes any assumptions about memory
> alignment of the input buffers? With slab debugging disabled, the
> alignment is 32 or 64 bytes, with debugging enabled, it's just 4 byte
> [actually sizeof(void*)] aligned.
>
> Murray, could you apply the attached patch? It dumps the redzone value
> during scsi_probe_and_add_lun. Hopefully this will help to find who
> corrupts the buffers.
>
> --
> Manfred
>

Andre Hedrick
LAD Storage Consulting Group

2002-10-07 15:38:10

by Manfred Spraul

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!

Andre Hedrick wrote:
>
> Only that if it is SG it must conform the ATA/ATAPI DMA on 64kb on 4b
> boundaries.
>
What does that mean?

Is a 4 byte aligned buffer that doesn't cross a page boundary (4096
byte) an acceptable buffer for the inquiry buffer of ide-scsi?

--
Manfred

2002-10-08 02:53:04

by Murray J. Root

[permalink] [raw]
Subject: Re: 2.5.40-ac4 kernel BUG at slab.c:1477!

On Sun, Oct 06, 2002 at 06:43:26PM -0400, Murray J. Root wrote:
> On Sun, Oct 06, 2002 at 03:31:03PM +0200, Manfred Spraul wrote:
> > > This happens at random during boot when loading modules.
> > > About half of the time ide-scsi works fine.
> > > The system continues to boot after the BUG with /dev/hdc unaccessible.
> >
> > from mm/slab.c:
> >
> > 1475 if (xchg((unsigned long *)objp, RED_MAGIC1) != RED_MAGIC2)
> > 1476 /* Either write before start, or a double free. */
> > 1477 BUG();
> >
> > You run an uniprocessor kernel, with slab debugging enabled, and the
> > red-zoning test notices a write before the beginning of the buffer
> > during scsi_probe_and_add_lun, with ide-scsi.
> >
> > Andre: Do you know if ide-scsi makes any assumptions about memory
> > alignment of the input buffers? With slab debugging disabled, the
> > alignment is 32 or 64 bytes, with debugging enabled, it's just 4 byte
> > [actually sizeof(void*)] aligned.
> >
> > Murray, could you apply the attached patch? It dumps the redzone value
> > during scsi_probe_and_add_lun. Hopefully this will help to find who
> > corrupts the buffers.
> >
> After patching:
> Soft reboot:
> kernel oops - rebooted (I'll try to get the oops data from a pic I
> took of the screen later).
>
> Hard reboot:
> SCSI subsystem driver Revision: 1.00
> scsi0 : SCSI host adapter emulation for IDE ATAPI devices
> scsi_result: c0334084, start 170fc2a5h.
> scsi_eh_offline_sdevs: Device offlined - not ready or command retry failed after error recovery: host 0 channel 0 id 0 lun 0
> scsi_result: c0334084, start 170fc2a5h.
> scsi_result: c0334084, start 170fc2a5h.
> Vendor: Model: Rev:
> Type: Direct-Access ANSI SCSI revision: 00
> hdc: lost interrupt
> hdc: status timeout: status=0xd0 { Busy }
> hdc: DMA disabled
> hdc: drive not ready for command
> hdc: ATAPI reset complete
> scsi_eh_offline_sdevs: Device offlined - not ready or command retry failed after error recovery: host 0 channel 0 id 0 lun 0
> scsi scan: host 0 channel 0 id 0 lun 0 identifier too long, length 116, max 80. Device might be improperly identified.
> scsi_result: c0334084, start 5a2cf071h.
> scsi_result: c0334084, start 5a2cf071h.
> ------------[ cut here ]------------
> kernel BUG at slab.c:1477!
> invalid operand: 0000
> ide-scsi scsi_mod rtc
> CPU: 0
> EIP: 0060:[<c01365ee>] Not tainted
> EFLAGS: 00010016
> EIP is at kmem_cache_free_one+0x7e/0x240
> eax: 5a2cf071 ebx: c1008cf0 ecx: c02db214 edx: c1b0d5fc
> esi: c0334080 edi: c0334000 ebp: f784fe28 esp: f784fe04
> ds: 0068 es: 0068 ss: 0068
> Process insmod (pid: 280, threadinfo=f784e000 task=f7956800)
> Stack: ffffdaf6 c034198b 00000246 0000002b c0334000 c1b0d5fc c1008cf0 c0334084
> 00008cf0 f784fe48 c0135bcf c1b0d5fc c0334084 00000286 c0334084 f7cd5a00
> c1b645ac f784fe7c fa8eee71 c0334084 c0334084 5a2cf071 c0334084 f784fe6c
> Call Trace:
> [<c0135bcf>]kfree+0x5f/0xb0
> [<fa8eee71>]scsi_probe_and_add_lun+0xd1/0x220 [scsi_mod]
> [<fa8ef11e>]scsi_scan_target+0x4e/0x90 [scsi_mod]
> [<fa8ef391>]scan_scsis+0x91/0x17c [scsi_mod]
> [<fa8f8d92>].rodata.str1.32+0x392/0x3e6 [ide-scsi]
> [<fa8e79d2>]scsi_register_host_Rb0dc194c+0x222/0x350 [scsi_mod]
> [<fa8f84fe>]init_module+0x1e/0x30 [ide-scsi]
> [<fa8f96c0>]idescsi_template+0x0/0x80 [ide-scsi]
> [<c011ab1c>]sys_init_module+0x53c/0x690
> [<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
> [<fa8f89cf>]__ksymtab+0x0/0x31 [ide-scsi]
> [<fa8f8fac>].kmodtab+0x0/0xc [ide-scsi]
> [<fa8f7060>]idescsi_discard_data+0x0/0x40 [ide-scsi]
> [<c010781b>]syscall_call+0x7/0xb
>
> Code: 0f 0b c5 05 f2 2c 2a c0 8b 4d f0 89 f2 b8 71 f0 2c 5a 8b 59
>

Same thing in 2.5.41-ac1

--
Murray J. Root
------------------------------------------------
DISCLAIMER: http://www.goldmark.org/jeff/stupid-disclaimers/
------------------------------------------------
Mandrake on irc.freenode.net:
#mandrake & #mandrake-linux = help for newbies
#mdk-cooker = Mandrake Cooker