2006-10-22 03:11:55

by Dave Jones

[permalink] [raw]
Subject: sn9c10x list corruption in 2.6.18.1

Here's a bizarre oops that one of our FC5 users filed since I pushed
out a 2.6.18.1 based kernel update.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=198869

When the user unplugs the camera, the following oops occurs..

Linux video capture interface: v2.00
sn9c102: V4L2 driver for SN9C10x PC Camera Controllers v1:1.27
usb 2-2: SN9C10[12] PC Camera Controller detected (vid/pid 0x0C45/0x6005)
usb 2-2: TAS5110C1B image sensor detected
usb 2-2: Initialization succeeded
usb 2-2: V4L2 device registered as /dev/video0
usbcore: registered new driver sn9c102
usb 2-2: USB disconnect, address 2
usb 2-2: Disconnecting SN9C10x PC Camera...
usb 2-2: V4L2 device /dev/video0 deregistered
list_add corruption. next->prev should be df7cb7c8, but was 00200200
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:26!
invalid opcode: 0000 [#1]
last sysfs file: /class/net/eth0/carrier
Modules linked in: sn9c102 videodev v4l1_compat v4l2_common i915 drm autofs4 p4_clockmod dm_mirror dm_mod video sbs i2c_ec container button battery asus_acpi ac ipv6 lp parport_pc parport ohci1394 ieee1394 ehci_hcd uhci_hcd joydev snd_intel8x0m natsemi serio_raw snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd i2c_i801 soundcore i2c_i810 i2c_algo_bit i2c_core snd_page_alloc ide_cd pcspkr cdrom ext3 jbd
CPU: 0
EIP: 0060:[<c04d8cfb>] Not tainted VLI
EFLAGS: 00010096 (2.6.18-1.2200.fc5 #1)
EIP is at __list_add+0x27/0x62
eax: 00000048 ebx: de4daba0 ecx: c0665450 edx: c9048000
esi: df7cb7c8 edi: de4daba0 ebp: df7cb840 esp: c9048ea4
ds: 007b es: 007b ss: 0068
Process udevd (pid: 2225, ti=c9048000 task=c90ded70 task.ti=c9048000)
Stack: c0628352 df7cb7c8 00200200 00000000 de4daba0 0000000b c045d8a5 00000001
000000d0 df7cfec0 df7cb7c0 00000000 df7cb7c0 00000010 000000d0 00000001
de4dabbc 00000246 000000d0 df7cfec0 caa4c750 c045d766 c906b0c0 c906bcb4
Call Trace:
[<c045d8a5>] cache_alloc_refill+0x135/0x409
[<c045d766>] kmem_cache_alloc+0x45/0x4f
[<c0419e12>] mm_init+0x94/0xb9
[<c041a93a>] copy_process+0x96c/0x10f1
[<c041b110>] do_fork+0x51/0x125
[<c0401220>] sys_clone+0x36/0x3b
[<c0402d9b>] syscall_call+0x7/0xb
DWARF2 unwinder stuck at syscall_call+0x7/0xb
Leftover inexact backtrace:
=======================
Code: c4 0c 5b c3 57 89 c7 56 89 d6 53 89 cb 83 ec 0c 8b 41 04 39 d0 74 1c 89 54 24 04 89 44 24 08 c7 04 24 52 83 62 c0 e8 ae 3a f4 ff <0f> 0b 1a 00 04 83 62 c0 8b 06 39 d8 74 1c 89 5c 24 04 89 44 24
Oct 21 20:11:32 localhost kernel: EIP: [<c04d8cfb>] __list_add+0x27/0x62 SS:ESP 0068:c9048ea4


What's odd here is that we have a list entry still on a list, with its ->next set to
LIST_POISON2, which should only ever happen after an entry has been removed from
a list. The list manipulation in cache_alloc_refill is all done under l3->list_lock,
so I'm puzzled how this is possible.

I found one area in the driver where we do list manipulation without any locking,
but I'm not entirely convinced that this is the source of the bug yet.

Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.18.noarch/drivers/media/video/sn9c102/sn9c102_core.c~ 2006-10-21 22:57:32.000000000 -0400
+++ linux-2.6.18.noarch/drivers/media/video/sn9c102/sn9c102_core.c 2006-10-21 22:58:43.000000000 -0400
@@ -197,13 +197,16 @@ static void sn9c102_empty_framequeues(st
static void sn9c102_requeue_outqueue(struct sn9c102_device* cam)
{
struct sn9c102_frame_t *i;
+ unsigned long lock_flags;

+ spin_lock_irqsave(&cam->queue_lock, lock_flags);
list_for_each_entry(i, &cam->outqueue, frame) {
i->state = F_QUEUED;
list_add(&i->frame, &cam->inqueue);
}

INIT_LIST_HEAD(&cam->outqueue);
+ spin_unlock_irqrestore(&cam->queue_lock, lock_flags);
}


--
http://www.codemonkey.org.uk


2006-10-22 11:46:55

by Luca Risolia

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

Hello Dave.

> I found one area in the driver where we do list manipulation without any locking,
> but I'm not entirely convinced that this is the source of the bug yet.

Spinlocks are not necessary there, becouse that piece of code is already
isolated when running, to prevent concurrent accesses to the list from within
the irq handler. The bug should be elsewhere (probably not in the driver).

Best regards
Luca


> > --- linux-2.6.18.noarch/drivers/media/video/sn9c102/sn9c102_core.c~ 2006-10-21 22:57:32.000000000 -0400
> +++ linux-2.6.18.noarch/drivers/media/video/sn9c102/sn9c102_core.c 2006-10-21 22:58:43.000000000 -0400
> @@ -197,13 +197,16 @@ static void sn9c102_empty_framequeues(st
> static void sn9c102_requeue_outqueue(struct sn9c102_device* cam)
> {
> struct sn9c102_frame_t *i;
> + unsigned long lock_flags;
>
> + spin_lock_irqsave(&cam->queue_lock, lock_flags);
> list_for_each_entry(i, &cam->outqueue, frame) {
> i->state = F_QUEUED;
> list_add(&i->frame, &cam->inqueue);
> }
>
> INIT_LIST_HEAD(&cam->outqueue);
> + spin_unlock_irqrestore(&cam->queue_lock, lock_flags);
> }

2006-10-22 18:15:48

by Dave Jones

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

On Sun, Oct 22, 2006 at 01:46:52PM +0200, Luca Risolia wrote:

> > I found one area in the driver where we do list manipulation without any locking,
> > but I'm not entirely convinced that this is the source of the bug yet.
> Spinlocks are not necessary there, becouse that piece of code is already
> isolated when running, to prevent concurrent accesses to the list from within
> the irq handler.

ok

> The bug should be elsewhere (probably not in the driver).

But it only happens when the user unplugs the camera, and no other
webcam driver seems to be affected by this problem.

That's fairly conclusive to me that the driver is misbehaving.

Dave

--
http://www.codemonkey.org.uk

2006-10-22 20:37:10

by Pekka Enberg

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

On 10/22/06, Dave Jones <[email protected]> wrote:
> What's odd here is that we have a list entry still on a list, with its ->next set to
> LIST_POISON2, which should only ever happen after an entry has been removed from
> a list. The list manipulation in cache_alloc_refill is all done under l3->list_lock,
> so I'm puzzled how this is possible.
>
> I found one area in the driver where we do list manipulation without any locking,
> but I'm not entirely convinced that this is the source of the bug yet.

But I don't see how that could cause a slab list to go bad. An
old-fashioned slab corruption sounds more like it. Does the the kernel
have CONFIG_SLAB_DEBUG enabled?

Pekka

2006-10-22 20:54:22

by Dave Jones

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

On Sun, Oct 22, 2006 at 11:37:06PM +0300, Pekka Enberg wrote:
> On 10/22/06, Dave Jones <[email protected]> wrote:
> > What's odd here is that we have a list entry still on a list, with its ->next set to
> > LIST_POISON2, which should only ever happen after an entry has been removed from
> > a list. The list manipulation in cache_alloc_refill is all done under l3->list_lock,
> > so I'm puzzled how this is possible.
> >
> > I found one area in the driver where we do list manipulation without any locking,
> > but I'm not entirely convinced that this is the source of the bug yet.
>
> But I don't see how that could cause a slab list to go bad. An
> old-fashioned slab corruption sounds more like it. Does the the kernel
> have CONFIG_SLAB_DEBUG enabled?

No, but I'll do a test build for the next update with it enabled to see if
that's any more enlightening.

Dave

--
http://www.codemonkey.org.uk

2006-10-22 21:22:55

by Luca Risolia

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

Alle 20:15, domenica 22 ottobre 2006, Dave Jones ha scritto:
> But it only happens when the user unplugs the camera, and no other
> webcam driver seems to be affected by this problem.

Simply unplugging the camera does not reproduce any problem here. This is
the first time I see this bug.

> That's fairly conclusive to me that the driver is misbehaving.

I do not think this implication is correct, as not all the drivers are
implemented the same way and run under the same kernel configurations.

The code in the driver seems to be okay to me.

Best regards,
Luca

2006-10-23 02:35:42

by Teunis Peters

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Luca Risolia wrote:
> Alle 20:15, domenica 22 ottobre 2006, Dave Jones ha scritto:
>> But it only happens when the user unplugs the camera, and no other
>> webcam driver seems to be affected by this problem.
>
> Simply unplugging the camera does not reproduce any problem here. This is
> the first time I see this bug.
>
>> That's fairly conclusive to me that the driver is misbehaving.
>
> I do not think this implication is correct, as not all the drivers are
> implemented the same way and run under the same kernel configurations.
>
> The code in the driver seems to be okay to me.
>

I've seen crashes (rock solid - system halted, following a core dump)
occasionally when unplugging this device:
sn9c102: V4L2 driver for SN9C10x PC Camera Controllers v1:1.27
usb 2-1: SN9C10[12] PC Camera Controller detected (vid/pid 0x0C45/0x6005)
usb 2-1: TAS5110C1B image sensor detected
usb 2-1: Initialization succeeded
usb 2-1: V4L2 device registered as /dev/video0
usbcore: registered new interface driver sn9c102

It crashed 100% with 2.6.19-rc1-git6. 2.6.19-rc2-mm2: it only crashes
occasionally (1/10 times) making me suspect a lock issue. Haven't the
foggiest where to look and as I only tested this hardware on a whim,
it's not high enough priority for me to debug. I'm also so rusty on
drivers that my eyes would probably be of no help.

I can duplicate this on several different systems - but they're all
EHCI-based USB2 hosts. (Intel Corporation 82801DB/DBM on this laptop)
Maybe that helps.

- - Teunis
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFPCpPbFT/SAfwLKMRAu/SAJ9p2pEAZST1b0aI1Pbphp5UF0MgGwCdH3sh
3o7bNGOUXos6A/HjI0izn9U=
=9Fn7
-----END PGP SIGNATURE-----

2006-10-23 02:40:00

by Teunis Peters

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Luca Risolia wrote:
> Alle 20:15, domenica 22 ottobre 2006, Dave Jones ha scritto:
>> But it only happens when the user unplugs the camera, and no other
>> webcam driver seems to be affected by this problem.
>
> Simply unplugging the camera does not reproduce any problem here. This is
> the first time I see this bug.
>
>> That's fairly conclusive to me that the driver is misbehaving.
>
> I do not think this implication is correct, as not all the drivers are
> implemented the same way and run under the same kernel configurations.
>
> The code in the driver seems to be okay to me.

Aha!
No idea if this helps, but here's the fault report:

usb 2-1: USB disconnect, address 7
PM: Removing info for No Bus:usbdev2.7_ep81
PM: Removing info for No Bus:usbdev2.7_ep82
PM: Removing info for No Bus:usbdev2.7_ep83
usb 2-1: Disconnecting SN9C10x PC Camera...
usb 2-1: V4L2 device /dev/video0 deregistered
PM: Removing info for usb:2-1:1.0
PM: Removing info for No Bus:usbdev2.7_ep00
PM: Removing info for usb:2-1
get_unused_fd: slot 30 not NULL!
BUG: unable to handle kernel paging request at virtual address 01401e76
printing eip:
c016af2f
*pde = 00000000
Oops: 0000 [#1]
SMP
last sysfs file: /class/net/eth0/carrier
Modules linked in: sn9c102 videodev v4l1_compat v4l2_common ipaq
usbserial iptable_filter ip_tables x_tables fglrx(P) vmnet(P) vmmon(P)
cpufreq_ondemand cpufreq_performance speedstep_centrino freq_table
rfcomm l2cap bluetooth ppdev lp button ac battery ipv6 sbp2 loop
snd_intel8x0 snd_intel8x0m snd_ac97_codec snd_ac97_bus snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd ipw2200 soundcore eth1394 ieee80211
i2c_i801 irda snd_page_alloc shpchp parport_pc parport ieee80211_crypt
i2c_core iTCO_wdt crc_ccitt serio_raw joydev tsdev pcspkr ext3 jbd
dm_mirror dm_snapshot dm_mod ide_generic ide_cd cdrom ide_disk ata_piix
libata scsi_mod generic uhci_hcd ehci_hcd piix ide_core tg3 ohci1394
ieee1394
CPU: 0
EIP: 0060:[<c016af2f>] Tainted: P VLI
EFLAGS: 00210202 (2.6.19-rc2-mm2 #1)
EIP is at fget+0x22/0x45
eax: d62c0000 ebx: fffffff7 ecx: 01401e62 edx: 00000001
esi: 00000000 edi: 00000001 ebp: 00000000 esp: c77fbf98
ds: 007b es: 007b ss: 0068
Process laptop_mode (pid: 1474, ti=c77fa000 task=d323e570 task.ti=c77fa000)
Stack: fffffff7 c0173eb9 00008441 00000001 00000000 b7f75ff4 c77fa000
c0103d32
00000001 00000001 00000000 00000000 b7f75ff4 bff7b708 000000dd
0000007b
0000007b c0100033 000000dd b7fe7410 00000073 00200246 bff7b6e4
0000007b
Call Trace:
[<c0173eb9>] sys_fcntl64+0x1d/0x85
[<c0103d32>] sysenter_past_esp+0x5f/0x85
=======================
Code: 00 ef 3a c0 01 31 c0 5b c3 53 89 c2 65 a1 08 00 00 00 8b 80 54 04
00 00 31 c9 8b 40 04 3b 10 73 29 8b 40 04 8b 0c 90 85 c9 74 1f <8b> 51
14 85 d2 74 16 8d 5a 01 89 d0 90 0f b1 59 14 39 d0 74 04
EIP: [<c016af2f>] fget+0x22/0x45 SS:ESP 0068:c77fbf98
<1>BUG: unable to handle kernel paging request at virtual address 4f3e3450
printing eip:
c0168835
*pde = 00000000
Oops: 0000 [#2]
SMP
last sysfs file: /class/net/eth0/carrier
Modules linked in: sn9c102 videodev v4l1_compat v4l2_common ipaq
usbserial iptable_filter ip_tables x_tables fglrx(P) vmnet(P) vmmon(P)
cpufreq_ondemand cpufreq_performance speedstep_centrino freq_table
rfcomm l2cap bluetooth ppdev lp button ac battery ipv6 sbp2 loop
snd_intel8x0 snd_intel8x0m snd_ac97_codec snd_ac97_bus snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd ipw2200 soundcore eth1394 ieee80211
i2c_i801 irda snd_page_alloc shpchp parport_pc parport ieee80211_crypt
i2c_core iTCO_wdt crc_ccitt serio_raw joydev tsdev pcspkr ext3 jbd
dm_mirror dm_snapshot dm_mod ide_generic ide_cd cdrom ide_disk ata_piix
libata scsi_mod generic uhci_hcd ehci_hcd piix ide_core tg3 ohci1394
ieee1394
CPU: 0
EIP: 0060:[<c0168835>] Tainted: P VLI
EFLAGS: 00210282 (2.6.19-rc2-mm2 #1)
EIP is at filp_close+0xa/0x59
eax: 4f3e343c ebx: 4f3e343c ecx: 4f3e343c edx: dbee9e00
esi: dbee9e00 edi: 00000000 ebp: cc3e8840 esp: c77fbe80
ds: 007b es: 007b ss: 0068
Process laptop_mode (pid: 1474, ti=c77fa000 task=d323e570 task.ti=c77fa000)
Stack: 00000000 dbee9e00 7fffffff 00000000 c012324e c988fe40 00000000
00000000
dbee9e00 d323e570 00000001 0000000b c01242b6 c77fa000 0000000f
c77fbf98
00000068 00200202 c01220f0 c0318f4b 00000000 c77fbf60 c77fbf98
00000068
Call Trace:
[<c012324e>] put_files_struct+0x79/0xbf
[<c01242b6>] do_exit+0x22f/0x769
[<c01220f0>] printk+0x1b/0x1f
[<c01051b4>] die+0x207/0x22c
[<c02e6fe6>] do_page_fault+0x412/0x4ec
[<c02e6bd4>] do_page_fault+0x0/0x4ec
[<c02e590c>] error_code+0x7c/0x84
[<c016af2f>] fget+0x22/0x45
[<c0173eb9>] sys_fcntl64+0x1d/0x85
[<c0103d32>] sysenter_past_esp+0x5f/0x85
=======================
Code: 00 00 00 8b 41 04 01 d0 83 38 00 74 04 0f 0b eb fe 03 51 04 89 3a
c6 86 80 00 00 00 01 5b 5e 5f c3 57 56 89 d6 53 89 c3 83 ec 04 <8b> 40
14 85 c0 75 10 c7 04 24 c2 8f 30 c0 31 ff e8 8b 98 fb ff
EIP: [<c0168835>] filp_close+0xa/0x59 SS:ESP 0068:c77fbe80
<1>Fixing recursive fault but reboot is needed!

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFPCtbbFT/SAfwLKMRAkgfAJ9DiU0jWpzVkaVqvblAUsx04koDqACgl/Ju
ApA93IN5EwNJTgzrAc25ObI=
=vXgZ
-----END PGP SIGNATURE-----

2006-10-23 11:37:49

by Luca Risolia

[permalink] [raw]
Subject: Re: sn9c10x list corruption in 2.6.18.1

Alle 04:35, luned? 23 ottobre 2006, teunis ha scritto:
> Luca Risolia wrote:
> > Alle 20:15, domenica 22 ottobre 2006, Dave Jones ha scritto:
> >> But it only happens when the user unplugs the camera, and no other
> >> webcam driver seems to be affected by this problem.
> >
> > Simply unplugging the camera does not reproduce any problem here. This is
> > the first time I see this bug.
> >
> >> That's fairly conclusive to me that the driver is misbehaving.
> >
> > I do not think this implication is correct, as not all the drivers are
> > implemented the same way and run under the same kernel configurations.
> >
> > The code in the driver seems to be okay to me.
> >
>
> I've seen crashes (rock solid - system halted, following a core dump)
> occasionally when unplugging this device:

Could you test the version I maintain out of the tree?
http://www.linux-projects.org/modules/mydownloads/visit.php?cid=2&lid=44

I see the driver has been patched in the latest kernels and I didn't have
the time to review and test the new code.

Best regards
Luca


> sn9c102: V4L2 driver for SN9C10x PC Camera Controllers v1:1.27
> usb 2-1: SN9C10[12] PC Camera Controller detected (vid/pid 0x0C45/0x6005)
> usb 2-1: TAS5110C1B image sensor detected
> usb 2-1: Initialization succeeded
> usb 2-1: V4L2 device registered as /dev/video0
> usbcore: registered new interface driver sn9c102
>
> It crashed 100% with 2.6.19-rc1-git6. 2.6.19-rc2-mm2: it only crashes
> occasionally (1/10 times) making me suspect a lock issue. Haven't the
> foggiest where to look and as I only tested this hardware on a whim,
> it's not high enough priority for me to debug. I'm also so rusty on
> drivers that my eyes would probably be of no help.
>
> I can duplicate this on several different systems - but they're all
> EHCI-based USB2 hosts. (Intel Corporation 82801DB/DBM on this laptop)
> Maybe that helps.
>
> - Teunis
>
>