LinuxLists.cc - BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

2011-03-04 02:06:28

Subject: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

Hi,

I got a BUG when loading the cx18.ko module (which in turn requests the
cx18-alsa.ko module) on a kernel built from this repository

http://git.linuxtv.org/media_tree.git staging/for_v2.6.39

which I beleive is based on 2.6.38-rc2.

The BUG is mmap related and I'm almost certain it has to do with
userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
doesn't provide any mmap() related file ops.

So here is my transcription of a fuzzy digital photo of the screen:

kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/module/snd_pcm/initstate
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
...
Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
...
CR2: 0000000000000000 ...
....
Process udevd (pid: 25__, threadinfo ffff________, ...
Stack:
000000000000015e ffff00003bc0e1d0 0000000000000246 ....
.....
Call Trace:
... mmput+0x63/0xcf
... exit_mm+0x132/0x13f
... do_exit+0x238/0x749
... ? __dequeue_signal+0xfa/0x12f
... do_group_exit+0x7d/0xa5
... get_signal_to_deliver+0x371/0x395
... do_signal+0x72/0x692
... ? do_page_fault+0x24a/0x391
... ? printk+0x41/0x47
... ? sigprocmask+0xa3/0xcd
... do_notify_resume+0x2c/0x64
... retint_signal+0x48/0x8c

Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
RSP <ffff880018c1fc28>
general protection fault: 0000 [#2] SMP
last sysfs file: /sys/devices/virtual/sound/card2/uevent
CPU 1
Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
videodev v4l2_compat_ioctl32

I'm not very familiar with mmap() nor ALSA and I did not author the
cx18-alsa part of the cx18 driver, so any hints at where to look for the
problem are appreciated.

Regards,
Andy

2011-03-04 15:50:26

by Devin Heitmueller

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <[email protected]> wrote:
> Hi,
>
> I got a BUG when loading the cx18.ko module (which in turn requests the
> cx18-alsa.ko module) on a kernel built from this repository
>
> ? ? ? ?http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
>
> which I beleive is based on 2.6.38-rc2.
>
> The BUG is mmap related and I'm almost certain it has to do with
> userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> doesn't provide any mmap() related file ops.
>
> So here is my transcription of a fuzzy digital photo of the screen:
<snip>
> I'm not very familiar with mmap() nor ALSA and I did not author the
> cx18-alsa part of the cx18 driver, so any hints at where to look for the
> problem are appreciated.

Hi Andy,

I'm traveling on business for about two weeks, so I won't be able to
look into this right now.

Any idea whether this is some new regression? I'm just trying to
understand whether this is something that has always been there since
I originally added the ALSA support to cx18 or whether it's something
that is new, in which case it might make sense to drag the ALSA people
into the conversation since there haven't been any changes in the cx18
driver lately.

Devin

--
Devin J. Heitmueller - Kernel Labs
http://www.kernellabs.com

2011-03-04 17:12:36

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote:
> On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <[email protected]> wrote:
> > Hi,
> >
> > I got a BUG when loading the cx18.ko module (which in turn requests the
> > cx18-alsa.ko module) on a kernel built from this repository
> >
> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> >
> > which I beleive is based on 2.6.38-rc2.
> >
> > The BUG is mmap related and I'm almost certain it has to do with
> > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> > doesn't provide any mmap() related file ops.
> >
> > So here is my transcription of a fuzzy digital photo of the screen:
> <snip>
> > I'm not very familiar with mmap() nor ALSA and I did not author the
> > cx18-alsa part of the cx18 driver, so any hints at where to look for the
> > problem are appreciated.
>
> Hi Andy,
>
> I'm traveling on business for about two weeks, so I won't be able to
> look into this right now.
>
> Any idea whether this is some new regression?

I do not know. I normally don't let cx18-alsa.ko load, due to
PulseAudio's persistence at keeping the device nodes open (which makes
unloading the cx18.ko module for development a hassle.)

> I'm just trying to
> understand whether this is something that has always been there since
> I originally added the ALSA support to cx18 or whether it's something
> that is new, in which case it might make sense to drag the ALSA people
> into the conversation since there haven't been any changes in the cx18
> driver lately.

I can add some information about what is going on in userspace. This
was on a Fedora 10 machine. When devices nodes show up, the HAL daemon
and PulseAudio start using the device nodes right away.

That activity triggers cx18.ko to do a firmware load which gets udevd
running to satisfy firmware requests, and then the cx18 driver issues
some simple commands to the CX23418 firmware, which will have
acknowledgment interrupts coming back from the CX23418. I resolved the
firmware race in cx18*.ko a while ago, so I'm confident its not an
issue.

The BUG looks like some sort of mmap() race or memory management problem
outside of the cx18*.ko modules, given that mmput(), which appears to be
an mm specific reference counting function, is involved.

It could also be in ALSA I guess.

I'm not sure how in the cx18-alsa.ko things can be screwed up so badly
that it messes up the kernel's reference counting of mm structures.

I'll take a harder look at it myself this weekend, but the kernel mm
system is a little out of my current realm of experience. Looks like I
get to learn, because I'm not going to bisect a BUG() that halts the
machine and risks disk corruption every time.

Regards,
Andy

2011-03-05 21:59:18

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> Hi,
>
> I got a BUG when loading the cx18.ko module (which in turn requests the
> cx18-alsa.ko module) on a kernel built from this repository
>
> http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
>
> which I beleive is based on 2.6.38-rc2.

[snip]

> So here is my transcription of a fuzzy digital photo of the screen:
>
> kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> invalid opcode: 0000 [#1] SMP
> last sysfs file: /sys/module/snd_pcm/initstate
> Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> ...
> Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> ...
> CR2: 0000000000000000 ...
> ....
> Process udevd (pid: 25__, threadinfo ffff________, ...
> Stack:
> 000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> .....
> Call Trace:
> ... mmput+0x63/0xcf
> ... exit_mm+0x132/0x13f
> ... do_exit+0x238/0x749
> ... ? __dequeue_signal+0xfa/0x12f
> ... do_group_exit+0x7d/0xa5
> ... get_signal_to_deliver+0x371/0x395
> ... do_signal+0x72/0x692
> ... ? do_page_fault+0x24a/0x391
> ... ? printk+0x41/0x47
> ... ? sigprocmask+0xa3/0xcd
> ... do_notify_resume+0x2c/0x64
> ... retint_signal+0x48/0x8c
>
> Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> RSP <ffff880018c1fc28>
> general protection fault: 0000 [#2] SMP
> last sysfs file: /sys/devices/virtual/sound/card2/uevent
> CPU 1
> Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> videodev v4l2_compat_ioctl32

I'm dumping all my previous assumtpions about this BUG. After a bit of
reading, all I can say is that it's a page table deallocation problem at
process exit. After all the page table deallocations on exit,
mm->nr_ptes is still > 0, and that's a bad thing.

It apparently happened in a child udevd exiting shortly after cx18.ko
loaded. The cx18 driver allocating large amounts kernel memory for DMA
buffers upon load may be related to triggering the problem, but I doubt
it is a root cause of the BUG.

This monsterous thread from 5 years ago is somewhat enlightening:

http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html

so it gives me a place to start looking for the problem.

Any advice on what data to collect is appreciated.

Regards,
Andy

2011-03-06 02:03:58

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> > Hi,
> >
> > I got a BUG when loading the cx18.ko module (which in turn requests the
> > cx18-alsa.ko module) on a kernel built from this repository
> >
> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> >
> > which I beleive is based on 2.6.38-rc2.
>
> [snip]
>
> > So here is my transcription of a fuzzy digital photo of the screen:
> >
> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> > invalid opcode: 0000 [#1] SMP
> > last sysfs file: /sys/module/snd_pcm/initstate
> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> > ...
> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> > ...
> > CR2: 0000000000000000 ...
> > ....
> > Process udevd (pid: 25__, threadinfo ffff________, ...
> > Stack:
> > 000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> > .....
> > Call Trace:
> > ... mmput+0x63/0xcf
> > ... exit_mm+0x132/0x13f
> > ... do_exit+0x238/0x749
> > ... ? __dequeue_signal+0xfa/0x12f
> > ... do_group_exit+0x7d/0xa5
> > ... get_signal_to_deliver+0x371/0x395
> > ... do_signal+0x72/0x692
> > ... ? do_page_fault+0x24a/0x391
> > ... ? printk+0x41/0x47
> > ... ? sigprocmask+0xa3/0xcd
> > ... do_notify_resume+0x2c/0x64
> > ... retint_signal+0x48/0x8c
> >
> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> > RSP <ffff880018c1fc28>
> > general protection fault: 0000 [#2] SMP
> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
> > CPU 1
> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> > videodev v4l2_compat_ioctl32
>
>
> I'm dumping all my previous assumtpions about this BUG. After a bit of
> reading, all I can say is that it's a page table deallocation problem at
> process exit. After all the page table deallocations on exit,
> mm->nr_ptes is still > 0, and that's a bad thing.
>
> It apparently happened in a child udevd exiting shortly after cx18.ko
> loaded. The cx18 driver allocating large amounts kernel memory for DMA
> buffers upon load may be related to triggering the problem, but I doubt
> it is a root cause of the BUG.
>
>
> This monsterous thread from 5 years ago is somewhat enlightening:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
>
> so it gives me a place to start looking for the problem.
>
> Any advice on what data to collect is appreciated.

When attemtping to reproduce this BUG, I got another bug related to
memory management:

(Details handtyped from a photo):
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
PGD 37cdd067 PUD 336c__67 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
CPU 0
Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+
RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
...
RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
RDX: 0000000000000000 ....
...
CR2: 0000000000000000 ....
Stack:
ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
....
Call Trace:
... __vunmap+0x3e/0xbd
... vfree+0x2e/0x30
... dvb_dmx_init+0x7e/0x253 [dvb_core]
... cx18_dvb_register+0xd2/0x75c [cx18]
... cx18_streams_resgister+0x6a/0x26a [cx18]
... cx18_streams_setup+0x3cc/0x486 [cx18]
... cx18_probe+0x11cc/0x12fb [cx18]
......

The code appears to be failing here:

/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
161d: eb 06 jmp 1625 <remove_vm_area+0x45>
161f: 48 89 c2 mov %rax,%rdx
1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p)
1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm;
1628: 75 f5 jne 161f <remove_vm_area+0x3f>
/home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354

Corresponding to this code in mm/vmalloc.c:

struct vm_struct *remove_vm_area(const void *addr)
{
struct vmap_area *va;

va = find_vmap_area((unsigned long)addr);
if (va && va->flags & VM_VM_AREA) {
struct vm_struct *vm = va->private;
struct vm_struct *tmp, **p;
/*
* remove from list and disallow access to this vm_struct
* before unmap. (address range confliction is maintained by
* vmap.)
*/
write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops
;
[...]

That for() loop appears to assume the vm_struct will be on the vmlist
somewhere. If it isn't, then I suppose the for() loop could end up
doing a NULL dereference.

This BUG happened in the final stages of the cx18 driver setting up a
CX23418 card instance. I have 2 cards in this machine, so a number of
buffers had certainly been allocated using kmalloc(). The code in the
dvb_core that is failing got BUG'ed in this case was this:

int dvb_dmx_init(struct dvb_demux *dvbdemux)
{
int i;
struct dmx_demux *dmx = &dvbdemux->dmx;

dvbdemux->cnt_storage = NULL;
dvbdemux->users = 0;
dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));

if (!dvbdemux->filter)
return -ENOMEM;

dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
if (!dvbdemux->feed) {
vfree(dvbdemux->filter); <------- BUG/Oops happened in this call
dvbdemux->filter = NULL;
return -ENOMEM;
}
...

Which is kind of interesting:
1. The first vmalloc() succeeded.
2. The second vmalloc() failed.
3. The vfree() of the pointer from the first vmalloc() caused an
Oops/BUG.

I'm not sure where to go from here.

Regards,
Andy

> Regards,
> Andy
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-media" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-03-06 18:38:08

by Hugh Dickins

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <[email protected]> wrote:
> On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
>> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
>> > Hi,
>> >
>> > I got a BUG when loading the cx18.ko module (which in turn requests the
>> > cx18-alsa.ko module) on a kernel built from this repository
>> >
>> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
>> >
>> > which I beleive is based on 2.6.38-rc2.
>>
>> [snip]
>>
>> > So here is my transcription of a fuzzy digital photo of the screen:
>> >
>> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
>> > invalid opcode: 0000 [#1] SMP
>> > last sysfs file: /sys/module/snd_pcm/initstate
>> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
>> > ...
>> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
>> > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
>> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
>> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
>> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
>> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
>> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
>> > ...
>> > CR2: 0000000000000000 ...
>> > ....
>> > Process udevd (pid: 25__, threadinfo ffff________, ...
>> > Stack:
>> > 000000000000015e ffff00003bc0e1d0 0000000000000246 ....
>> > .....
>> > Call Trace:
>> > ... mmput+0x63/0xcf
>> > ... exit_mm+0x132/0x13f
>> > ... do_exit+0x238/0x749
>> > ... ? __dequeue_signal+0xfa/0x12f
>> > ... do_group_exit+0x7d/0xa5
>> > ... get_signal_to_deliver+0x371/0x395
>> > ... do_signal+0x72/0x692
>> > ... ? do_page_fault+0x24a/0x391
>> > ... ? printk+0x41/0x47
>> > ... ? sigprocmask+0xa3/0xcd
>> > ... do_notify_resume+0x2c/0x64
>> > ... retint_signal+0x48/0x8c
>> >
>> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
>> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
>> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
>> > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
>> > RSP <ffff880018c1fc28>
>> > general protection fault: 0000 [#2] SMP
>> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
>> > CPU 1
>> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
>> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
>> > videodev v4l2_compat_ioctl32
>>
>>
>> I'm dumping all my previous assumtpions about this BUG. After a bit of
>> reading, all I can say is that it's a page table deallocation problem at
>> process exit. After all the page table deallocations on exit,
>> mm->nr_ptes is still > 0, and that's a bad thing.
>>
>> It apparently happened in a child udevd exiting shortly after cx18.ko
>> loaded. The cx18 driver allocating large amounts kernel memory for DMA
>> buffers upon load may be related to triggering the problem, but I doubt
>> it is a root cause of the BUG.
>>
>>
>> This monsterous thread from 5 years ago is somewhat enlightening:
>>
>> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
>> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
>>
>> so it gives me a place to start looking for the problem.
>>
>> Any advice on what data to collect is appreciated.
>
> When attemtping to reproduce this BUG, I got another bug related to
> memory management:
>
> (Details handtyped from a photo):
> BUG: unable to handle kernel NULL pointer dereference at (null)
> IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> PGD 37cdd067 PUD 336c__67 PMD 0
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
> CPU 0
> Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
> Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+
> RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> ...
> RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
> RDX: 0000000000000000 ....
> ...
> CR2: 0000000000000000 ....
> Stack:
> ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
> ....
> Call Trace:
> ... __vunmap+0x3e/0xbd
> ... vfree+0x2e/0x30
> ... dvb_dmx_init+0x7e/0x253 [dvb_core]
> ... cx18_dvb_register+0xd2/0x75c [cx18]
> ... cx18_streams_resgister+0x6a/0x26a [cx18]
> ... cx18_streams_setup+0x3cc/0x486 [cx18]
> ... cx18_probe+0x11cc/0x12fb [cx18]
> ......
>
> The code appears to be failing here:
>
> /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
> 161d: eb 06 jmp 1625 <remove_vm_area+0x45>
> 161f: 48 89 c2 mov %rax,%rdx
> 1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p)
> 1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm;
> 1628: 75 f5 jne 161f <remove_vm_area+0x3f>
> /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354
>
> Corresponding to this code in mm/vmalloc.c:
>
> struct vm_struct *remove_vm_area(const void *addr)
> {
> struct vmap_area *va;
>
> va = find_vmap_area((unsigned long)addr);
> if (va && va->flags & VM_VM_AREA) {
> struct vm_struct *vm = va->private;
> struct vm_struct *tmp, **p;
> /*
> * remove from list and disallow access to this vm_struct
> * before unmap. (address range confliction is maintained by
> * vmap.)
> */
> write_lock(&vmlist_lock);
> for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops
> ;
> [...]
>
> That for() loop appears to assume the vm_struct will be on the vmlist
> somewhere. If it isn't, then I suppose the for() loop could end up
> doing a NULL dereference.
>
> This BUG happened in the final stages of the cx18 driver setting up a
> CX23418 card instance. I have 2 cards in this machine, so a number of
> buffers had certainly been allocated using kmalloc(). The code in the
> dvb_core that is failing got BUG'ed in this case was this:
>
> int dvb_dmx_init(struct dvb_demux *dvbdemux)
> {
> int i;
> struct dmx_demux *dmx = &dvbdemux->dmx;
>
> dvbdemux->cnt_storage = NULL;
> dvbdemux->users = 0;
> dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));
>
> if (!dvbdemux->filter)
> return -ENOMEM;
>
> dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
> if (!dvbdemux->feed) {
> vfree(dvbdemux->filter); <------- BUG/Oops happened in this call
> dvbdemux->filter = NULL;
> return -ENOMEM;
> }
> ...
>
> Which is kind of interesting:
> 1. The first vmalloc() succeeded.
> 2. The second vmalloc() failed.
> 3. The vfree() of the pointer from the first vmalloc() caused an
> Oops/BUG.
>
> I'm not sure where to go from here.

Thanks for all the effort you are putting into investigating this: you
deserve a better response than I can give you.

mm/vmalloc.c's vmap_area handling is entirely separate from
mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
explained if a next pointer has been corrupted to NULL.

Probably just coincidence that they both manifest that way, though the
underlying problem may turn out to be one.

If you have not already, it would be well worth turning on
CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
CONFIG_SLUB_DEBUG_ON.

If that BUG_ON(mm->nr_ptes ...) in exit_mmap() is preventing you from
getting on with your work, or slowing down reproduction of the
testcase, you should be able to replace it by a WARN_ON. You will
probably leak at least one page (the page table) and perhaps many
pages (those that that page table points to) each time it hits, but it
shouldn't actually be unsafe to continue - it's really a development
BUG_ON, to check that new architectures added are freeing all the
page tables they have allocated.

I do expect the underlying problem to be somewhere down the driver
end, given that nobody else has been reporting these issues. I'm
hoping that once the cx18 guys have time to try to reproduce it,
they'll be better able to track it down. But you are having trouble
reproducing it yourself? hitting this vmalloc one before you could
reproduce the exit_mmap one? No chance to bisect it to a particular
commit if you cannot reliably reproduce it.

There was a horrid list corruption bug in early 2.6.38-rc, fixed in
-rc6; but although I guess it could cause all kinds of havoc, its
particular signature was not like this, so I don't really believe that
one was to blame here.

Hugh

2011-03-07 03:14:57

by Hugh Dickins

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sun, 6 Mar 2011, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
>
> > There was a horrid list corruption bug in early 2.6.38-rc, fixed in
> > -rc6; but although I guess it could cause all kinds of havoc, its
> > particular signature was not like this, so I don't really believe that
> > one was to blame here.
>
> Sounds like it may be worth me reviewing the commits that introduced the
> failure and the commit that fixed it. Do you happen to know what they
> are?

Here are the several fixes, which reference LKML threads and culprits:
it seems to have been a danger since 2.6.33, made much worse recently.

commit ceaaec98ad99859ac90ac6863ad0a6cd075d8e0e
Author: Eric Dumazet <[email protected]>
Date: Thu Feb 17 22:59:19 2011 +0000

net: deinit automatic LIST_HEAD

commit 9b5e383c11b08784 (net: Introduce
unregister_netdevice_many()) left an active LIST_HEAD() in
rollback_registered(), with possible memory corruption.

Even if device is freed without touching its unreg_list (and therefore
touching the previous memory location holding LISTE_HEAD(single), better
close the bug for good, since its really subtle.

(Same fix for default_device_exit_batch() for completeness)

Reported-by: Michal Hocko <[email protected]>
Tested-by: Michal Hocko <[email protected]>
Reported-by: Eric W. Biderman <[email protected]>
Tested-by: Eric W. Biderman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Octavian Purdila <[email protected]>
CC: stable <[email protected]> [.33+]
Signed-off-by: David S. Miller <[email protected]>

commit f87e6f47933e3ebeced9bb12615e830a72cedce4
Author: Linus Torvalds <[email protected]>
Date: Thu Feb 17 22:54:38 2011 +0000

net: dont leave active on stack LIST_HEAD

Eric W. Biderman and Michal Hocko reported various memory corruptions
that we suspected to be related to a LIST head located on stack, that
was manipulated after thread left function frame (and eventually exited,
so its stack was freed and reused).

Eric Dumazet suggested the problem was probably coming from commit
443457242beb (net: factorize
sync-rcu call in unregister_netdevice_many)

This patch fixes __dev_close() and dev_close() to properly deinit their
respective LIST_HEAD(single) before exiting.

References: https://lkml.org/lkml/2011/2/16/304
References: https://lkml.org/lkml/2011/2/14/223

Reported-by: Michal Hocko <[email protected]>
Tested-by: Michal Hocko <[email protected]>
Reported-by: Eric W. Biderman <[email protected]>
Tested-by: Eric W. Biderman <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Octavian Purdila <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

commit 3c18d4de86e4a7f93815c081e50e0543fa27200f
Author: Linus Torvalds <[email protected]>
Date: Fri Feb 18 11:32:28 2011 -0800

Expand CONFIG_DEBUG_LIST to several other list operations

When list debugging is enabled, we aim to readably show list corruption
errors, and the basic list_add/list_del operations end up having extra
debugging code in them to do some basic validation of the list entries.

However, "list_del_init()" and "list_move[_tail]()" ended up avoiding
the debug code due to how they were written. This fixes that.

So the _next_ time we have list_move() problems with stale list entries,
we'll hopefully have an easier time finding them..

Signed-off-by: Linus Torvalds <[email protected]>

2011-03-06 21:04:01

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
> On Sat, Mar 5, 2011 at 6:03 PM, Andy Walls <[email protected]> wrote:
> > On Sat, 2011-03-05 at 16:59 -0500, Andy Walls wrote:
> >> On Thu, 2011-03-03 at 21:06 -0500, Andy Walls wrote:
> >> > Hi,
> >> >
> >> > I got a BUG when loading the cx18.ko module (which in turn requests the
> >> > cx18-alsa.ko module) on a kernel built from this repository
> >> >
> >> > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> >> >
> >> > which I beleive is based on 2.6.38-rc2.
> >>
> >> [snip]
> >>
> >> > So here is my transcription of a fuzzy digital photo of the screen:
> >> >
> >> > kernel BUG at /home/andy/cx18dev/git/media_tree/mm/mmap.c:2309!
> >> > invalid opcode: 0000 [#1] SMP
> >> > last sysfs file: /sys/module/snd_pcm/initstate
> >> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple ...
> >> > ...
> >> > Pid: 2580, comm: udevd Not tainted 2.6.38-rc2-cx18-vb2-proto+
> >> > RIP: 0010:[<ffffffff810eb50b>] [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> >> > RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0020000000000000
> >> > RDX: 0000000000160011 RSI: ffffea____c42___ RDI: 0000000000000202
> >> > RBP: ffff____18c1f_58 R08: ffff____________ R09: 0000000000000004
> >> > R10: ffff_______bb_38 R11: 0000000000000000 R12: ffff____344a6680
> >> > R13: 00007fff22______ R14: ffff____________ R15: 0000000000000001
> >> > ...
> >> > CR2: 0000000000000000 ...
> >> > ....
> >> > Process udevd (pid: 25__, threadinfo ffff________, ...
> >> > Stack:
> >> > 000000000000015e ffff00003bc0e1d0 0000000000000246 ....
> >> > .....
> >> > Call Trace:
> >> > ... mmput+0x63/0xcf
> >> > ... exit_mm+0x132/0x13f
> >> > ... do_exit+0x238/0x749
> >> > ... ? __dequeue_signal+0xfa/0x12f
> >> > ... do_group_exit+0x7d/0xa5
> >> > ... get_signal_to_deliver+0x371/0x395
> >> > ... do_signal+0x72/0x692
> >> > ... ? do_page_fault+0x24a/0x391
> >> > ... ? printk+0x41/0x47
> >> > ... ? sigprocmask+0xa3/0xcd
> >> > ... do_notify_resume+0x2c/0x64
> >> > ... retint_signal+0x48/0x8c
> >> >
> >> > Code: ff ff 48 8b 7d d8 4c 89 ea 31 f6 e8 3e fe ff ff 48 89 df e8 78 fe
> >> > ff ff 48 85 c0 48 89 c3 75 f0 49 83 bc 24 e0 00 00 00 00 74 04 <0f> 0b
> >> > eb fe 48 83 c4 18 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 57
> >> > RIP [<ffffffff810eb50b>] exit_mmap+0x10f/0x11e
> >> > RSP <ffff880018c1fc28>
> >> > general protection fault: 0000 [#2] SMP
> >> > last sysfs file: /sys/devices/virtual/sound/card2/uevent
> >> > CPU 1
> >> > Modules linked in: cx18-alsa tda9887 tda8290 mxl5005s s5h1409
> >> > tuner_simple tuner_types cs5345 tuner cx18 dvb_core cx2341x v4l2_common
> >> > videodev v4l2_compat_ioctl32
> >>
> >>
> >> I'm dumping all my previous assumtpions about this BUG. After a bit of
> >> reading, all I can say is that it's a page table deallocation problem at
> >> process exit. After all the page table deallocations on exit,
> >> mm->nr_ptes is still > 0, and that's a bad thing.
> >>
> >> It apparently happened in a child udevd exiting shortly after cx18.ko
> >> loaded. The cx18 driver allocating large amounts kernel memory for DMA
> >> buffers upon load may be related to triggering the problem, but I doubt
> >> it is a root cause of the BUG.
> >>
> >>
> >> This monsterous thread from 5 years ago is somewhat enlightening:
> >>
> >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1680.html
> >> http://lkml.indiana.edu/hypermail/linux/kernel/0503.2/1787.html
> >>
> >> so it gives me a place to start looking for the problem.
> >>
> >> Any advice on what data to collect is appreciated.
> >
> > When attemtping to reproduce this BUG, I got another bug related to
> > memory management:
> >
> > (Details handtyped from a photo):
> > BUG: unable to handle kernel NULL pointer dereference at (null)
> > IP: [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> > PGD 37cdd067 PUD 336c__67 PMD 0
> > Oops: 0000 [#1] SMP
> > last sysfs file: /sys/devices/pci0000:00/0000:00:14.4/0000:03:00.0/firmware/0000:03:00.0/loading
> > CPU 0
> > Modules linked in: tda9887 tda8290 mxl5005s s5h1409 tuner_simple tuner_types cx5345 tuner cx18(+) dvb_core cx2341x ...
> > Pid: 2470, comm: work_for_cpu Tainted: G W 2.6.28-rc2-cx18-vb2-proto+
> > RIP: 0010:[<ffffffff010f22fa>] [<ffffffff010f22fa>] remove_vm_area+0x42/0x77
> > ...
> > RAX: 0000000000000000 RBX: ffff____35e7c540 RCX: 0000000000001000
> > RDX: 0000000000000000 ....
> > ...
> > CR2: 0000000000000000 ....
> > Stack:
> > ffff__0011485968 000000000000001 ffff____1147dc9_ ffffffff_1_f23__
> > ....
> > Call Trace:
> > ... __vunmap+0x3e/0xbd
> > ... vfree+0x2e/0x30
> > ... dvb_dmx_init+0x7e/0x253 [dvb_core]
> > ... cx18_dvb_register+0xd2/0x75c [cx18]
> > ... cx18_streams_resgister+0x6a/0x26a [cx18]
> > ... cx18_streams_setup+0x3cc/0x486 [cx18]
> > ... cx18_probe+0x11cc/0x12fb [cx18]
> > ......
> >
> > The code appears to be failing here:
> >
> > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1352
> > 161d: eb 06 jmp 1625 <remove_vm_area+0x45>
> > 161f: 48 89 c2 mov %rax,%rdx
> > 1622: 48 8b 00 mov (%rax),%rax <--- Oops p = &tmp->next) (tmp = *p)
> > 1625: 48 39 d8 cmp %rbx,%rax (tmp = *p) != vm;
> > 1628: 75 f5 jne 161f <remove_vm_area+0x3f>
> > /home/andy/cx18dev/git/media_tree/mm/vmalloc.c:1354
> >
> > Corresponding to this code in mm/vmalloc.c:
> >
> > struct vm_struct *remove_vm_area(const void *addr)
> > {
> > struct vmap_area *va;
> >
> > va = find_vmap_area((unsigned long)addr);
> > if (va && va->flags & VM_VM_AREA) {
> > struct vm_struct *vm = va->private;
> > struct vm_struct *tmp, **p;
> > /*
> > * remove from list and disallow access to this vm_struct
> > * before unmap. (address range confliction is maintained by
> > * vmap.)
> > */
> > write_lock(&vmlist_lock);
> > for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next) <--- Ooops
> > ;
> > [...]
> >
> > That for() loop appears to assume the vm_struct will be on the vmlist
> > somewhere. If it isn't, then I suppose the for() loop could end up
> > doing a NULL dereference.
> >
> > This BUG happened in the final stages of the cx18 driver setting up a
> > CX23418 card instance. I have 2 cards in this machine, so a number of
> > buffers had certainly been allocated using kmalloc(). The code in the
> > dvb_core that is failing got BUG'ed in this case was this:
> >
> > int dvb_dmx_init(struct dvb_demux *dvbdemux)
> > {
> > int i;
> > struct dmx_demux *dmx = &dvbdemux->dmx;
> >
> > dvbdemux->cnt_storage = NULL;
> > dvbdemux->users = 0;
> > dvbdemux->filter = vmalloc(dvbdemux->filternum * sizeof(struct dvb_demux_filter));
> >
> > if (!dvbdemux->filter)
> > return -ENOMEM;
> >
> > dvbdemux->feed = vmalloc(dvbdemux->feednum * sizeof(struct dvb_demux_feed));
> > if (!dvbdemux->feed) {
> > vfree(dvbdemux->filter); <------- BUG/Oops happened in this call
> > dvbdemux->filter = NULL;
> > return -ENOMEM;
> > }
> > ...
> >
> > Which is kind of interesting:
> > 1. The first vmalloc() succeeded.
> > 2. The second vmalloc() failed.
> > 3. The vfree() of the pointer from the first vmalloc() caused an
> > Oops/BUG.
> >
> > I'm not sure where to go from here.
>
> Thanks for all the effort you are putting into investigating this: you
> deserve a better response than I can give you.
>
> mm/vmalloc.c's vmap_area handling is entirely separate from
> mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
> explained if a next pointer has been corrupted to NULL.
>
> Probably just coincidence that they both manifest that way, though the
> underlying problem may turn out to be one.

Hi Hugh,

I suspect the underlying problem is the same. Because VM area and
Vmalloc handling is very different, I agree, it is likely not in either
one, but somewhere else. (e.g. list.[ch] maybe?)

> If you have not already, it would be well worth turning on
> CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
> CONFIG_SLUB_DEBUG_ON.

Will do.

> If that BUG_ON(mm->nr_ptes ...) in exit_mmap() is preventing you from
> getting on with your work, or slowing down reproduction of the
> testcase, you should be able to replace it by a WARN_ON. You will
> probably leak at least one page (the page table) and perhaps many
> pages (those that that page table points to) each time it hits, but it
> shouldn't actually be unsafe to continue - it's really a development
> BUG_ON, to check that new architectures added are freeing all the
> page tables they have allocated.

I added debug to dump the VM Areas' start & end values and get past the
BUG. That's when I got the seconds BUG with involing vmlist. :(

> I do expect the underlying problem to be somewhere down the driver
> end, given that nobody else has been reporting these issues. I'm
> hoping that once the cx18 guys have time to try to reproduce it,
> they'll be better able to track it down.

Well, I am the cx18 guy (except for Devin who wrote the guts of
cx18-alsa*c). Given that the cx18 driver has been pretty static for the
past few months, I'm doubting it is there.

I have been doing a recent change to add dynamic SCB MDL entry
management, but that is in a small 64kB region in the ioremap()'ed
memory of the CX23418 device.

(one patch I have locally hasn't been pushed yet, but here is the one
patch I pushed before I began testing:
http://git.linuxtv.org/awalls/media_tree.git?a=shortlog;h=refs/heads/cx18-vb2-proto )

My overall objective is to start convert cx18 to use the videobuf2
infrastruutre instead of allocating so many buffers at module load.

The cx18 driver kmalloc()'s *a lot* of buffers per card at module load,
and I have two cards in my machine. I think these consecutive
allocations by the cx18 driver are creating conditions that induce
corruption in the memory management system. I'm guessing it is probably
due to some locking or list handling bug.

> But you are having trouble
> reproducing it yourself?

I can't say yet. I'm currently two for two.

I hate BUG-ing our household "production" machine that has all the kids'
homework, financal tracking, saved e-mails, etc. I've got to make a
backup first or just slap in a new disk for development.

> hitting this vmalloc one before you could
> reproduce the exit_mmap one?

I hit the exit_mmap() one; added debug to get past the BUG; and then got
the new Oops/BUG.

Loading the cx18 driver was clearly the action in both cases that caused
the BUGs.

> No chance to bisect it to a particular
> commit if you cannot reliably reproduce it.

I have to back up home directories before I can begin a bisect on this
one.

> There was a horrid list corruption bug in early 2.6.38-rc, fixed in
> -rc6; but although I guess it could cause all kinds of havoc, its
> particular signature was not like this, so I don't really believe that
> one was to blame here.

Sounds like it may be worth me reviewing the commits that introduced the
failure and the commit that fixed it. Do you happen to know what they
are?

Regards,
Andy

> Hugh

2011-03-07 10:33:00

by Takashi Iwai

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

At Fri, 04 Mar 2011 12:13:04 -0500,
Andy Walls wrote:
>
> On Fri, 2011-03-04 at 10:50 -0500, Devin Heitmueller wrote:
> > On Thu, Mar 3, 2011 at 9:06 PM, Andy Walls <[email protected]> wrote:
> > > Hi,
> > >
> > > I got a BUG when loading the cx18.ko module (which in turn requests the
> > > cx18-alsa.ko module) on a kernel built from this repository
> > >
> > > http://git.linuxtv.org/media_tree.git staging/for_v2.6.39
> > >
> > > which I beleive is based on 2.6.38-rc2.
> > >
> > > The BUG is mmap related and I'm almost certain it has to do with
> > > userspace accessing cx18-alsa.ko ALSA device nodes, since cx18.ko
> > > doesn't provide any mmap() related file ops.
> > >
> > > So here is my transcription of a fuzzy digital photo of the screen:
> > <snip>
> > > I'm not very familiar with mmap() nor ALSA and I did not author the
> > > cx18-alsa part of the cx18 driver, so any hints at where to look for the
> > > problem are appreciated.
> >
> > Hi Andy,
> >
> > I'm traveling on business for about two weeks, so I won't be able to
> > look into this right now.
> >
> > Any idea whether this is some new regression?
>
> I do not know. I normally don't let cx18-alsa.ko load, due to
> PulseAudio's persistence at keeping the device nodes open (which makes
> unloading the cx18.ko module for development a hassle.)
>
>
> > I'm just trying to
> > understand whether this is something that has always been there since
> > I originally added the ALSA support to cx18 or whether it's something
> > that is new, in which case it might make sense to drag the ALSA people
> > into the conversation since there haven't been any changes in the cx18
> > driver lately.
>
> I can add some information about what is going on in userspace. This
> was on a Fedora 10 machine. When devices nodes show up, the HAL daemon
> and PulseAudio start using the device nodes right away.
>
> That activity triggers cx18.ko to do a firmware load which gets udevd
> running to satisfy firmware requests, and then the cx18 driver issues
> some simple commands to the CX23418 firmware, which will have
> acknowledgment interrupts coming back from the CX23418. I resolved the
> firmware race in cx18*.ko a while ago, so I'm confident its not an
> issue.
>
> The BUG looks like some sort of mmap() race or memory management problem
> outside of the cx18*.ko modules, given that mmput(), which appears to be
> an mm specific reference counting function, is involved.
>
> It could also be in ALSA I guess.

There is no change in ALSA core regarding mmap for really long time.
If it's a regression, it must be triggered by some other changes.

Takashi

2011-03-09 00:36:50

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:

> >
> > Thanks for all the effort you are putting into investigating this: you
> > deserve a better response than I can give you.
> >
> > mm/vmalloc.c's vmap_area handling is entirely separate from
> > mm/mmap.c's vm_area_struct handling, yet both misbehaviors would be
> > explained if a next pointer has been corrupted to NULL.
> >
> > Probably just coincidence that they both manifest that way, though the
> > underlying problem may turn out to be one.

> > If you have not already, it would be well worth turning on
> > CONFIG_DEBUG_LIST and CONFIG_DEBUG_SLAB or CONFIG_SLUB_DEBUG with
> > CONFIG_SLUB_DEBUG_ON.

>
> > But you are having trouble
> > reproducing it yourself?
>
> I can't say yet. I'm currently two for two.

After backing up the machine and testing again, I'm now 3 for 3.

This time it happened in the memset() in kernel/module.c:move_module()
when modprobe was trying to load the cx18-alsa.ko module.

static int move_module(struct module *mod, struct load_info
*info)
{
int i;
void *ptr;

/* Do the allocs. */
ptr = module_alloc_update_bounds(mod->core_size);
/*
* The pointer to this block is stored in the module structure
* which is inside the block. Just mark it as not being
a
* leak.
*/
kmemleak_not_leak(ptr);
if (!ptr)
return -ENOMEM;

memset(ptr, 0, mod->core_size); <----- Ooops/BUG

/home/andy/cx18dev/git/media_tree/kernel/module.c:2529
385c: 41 8b 8c 24 64 01 00 mov 0x164(%r12),%ecx
3863: 00
3864: 31 c0 xor %eax,%eax
3866: 48 89 d7 mov %rdx,%rdi
3869: f3 aa rep stos %al,%es:(%rdi) <----- Oops/BUG

ptr had a value of 0x0000000000001000

I'm starting a git bisect now.

Regards,
Andy

2011-03-11 00:34:26

by Andy Walls

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:

> > I do expect the underlying problem to be somewhere down the driver
> > end, given that nobody else has been reporting these issues. I'm
> > hoping that once the cx18 guys have time to try to reproduce it,
> > they'll be better able to track it down.

Hi Hugh,

You were correct. The mistake was in the cx18 driver, in the last thing
that I touched, of course. The code causing the bug isn't anywhere
aside from my private repo.

All,

Sorry for all the noise.

The bug was so idiotic, I fell compelled to show the fix:

diff --git a/drivers/media/video/cx18/cx18-scb.c b/drivers/media/video/cx18/cx18
index fd89ad0..d17ffc8 100644
--- a/drivers/media/video/cx18/cx18-scb.c
+++ b/drivers/media/video/cx18/cx18-scb.c
@@ -28,8 +28,8 @@

int cx18_scb_init_mdl_ent_mgmt(struct cx18 *cx)
{
- cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES),
- GFP_KERNEL);
+ cx->scb_mdl_ent_map = kzalloc(BITS_TO_LONGS(SCB_MDL_ENTRIES)
+ * sizeof(long), GFP_KERNEL);
if (cx->scb_mdl_ent_map == NULL) {
CX18_ERR("Fatal: unable to allocate bitmap for managing SCB MDL"
"entries\n");

So now the subsequent call to

bitmap_zero(cx->scb_mdl_ent_map, SCB_MDL_ENTRIES);

doesn't walk off the end of what was allocated.

Apparently BITS_TO_LONGS() is not BITS_TO_LONGS_TO_BYTES().

Regards,
Andy

2011-03-11 00:47:17

by Hugh Dickins

[permalink] [raw]

Subject: Re: BUG at mm/mmap.c:2309 when cx18.ko and cx18-alsa.ko loaded

On Thu, Mar 10, 2011 at 4:34 PM, Andy Walls <[email protected]> wrote:
> On Sun, 2011-03-06 at 16:04 -0500, Andy Walls wrote:
>> On Sun, 2011-03-06 at 10:37 -0800, Hugh Dickins wrote:
>
>> > I do expect the underlying problem to be somewhere down the driver
>> > end, given that nobody else has been reporting these issues. I'm
>> > hoping that once the cx18 guys have time to try to reproduce it,
>> > they'll be better able to track it down.
>
> Hi Hugh,
>
> You were correct. The mistake was in the cx18 driver, in the last thing
> that I touched, of course. The code causing the bug isn't anywhere
> aside from my private repo.

Thanks a lot for reporting back, Andy: relief all round.

Hugh