2012-01-26 22:36:30

by Larry Finger

[permalink] [raw]
Subject: Kernel BUG due to kernel page fault

I am trying to convert the rtlwifi family of drivers to use asynchronous
firmware loading. With recent changes in udev, many problems are being reported.
See Bug 42632] at b.k.o.

Most of the code is now working, but I still have one problem. When the firmware
is available, everything is fine. I delay the starting of the driver and the MAC
layer until the firmware is available. The problem occurs when I start the
driver with the firmware file not available. Under those conditions, when I
unload the driver, I get the following BUG output:

BUG: unable to handle kernel paging request at ffffffffa0056641
IP: [<ffffffff811e3ecb>] strnlen+0xb/0x30
PGD 1607067 PUD 160b063 PMD b2aed067 PTE 0
Oops: 0000 [#1] SMP
CPU 1
Modules linked in: fuse autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device
edd nfs nfsd lockd auth_rpcgss nfs_acl sunrpc exportfs ipv6 vboxpci(O)
vboxnetadp(O) vboxnetflt(O) vboxdrv(O) cpufreq_conservative cpufreq_userspace
cpufreq_powersave powernow_k8 mperf ext3 jbd ide_cd_mod cdrom
snd_hda_codec_conexant ide_pci_generic snd_hda_intel snd_hda_codec snd_pcm
k8temp amd74xx sg snd_timer ide_core joydev button hwmon i2c_nforce2 serio_raw
ac battery forcedeth snd i2c_core video soundcore snd_page_alloc ext4 mbcache
jbd2 crc16 sd_mod ohci_hcd ahci libahci libata ehci_hcd scsi_mod usbcore
usb_common fan processor thermal [last unloaded: rfkill]

Pid: 3605, comm: irqbalance Tainted: G O 3.3.0-rc1-wl+ #175
Hewlett-Packard HP Pavilion dv2700 Notebook PC/30D6
RIP: 0010:[<ffffffff811e3ecb>] [<ffffffff811e3ecb>] strnlen+0xb/0x30
RSP: 0018:ffff8800b3179c98 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff880084cdd28f RCX: 0000000000000000
RDX: ffffffffa0056641 RSI: ffffffffffffffff RDI: ffffffffa0056641
RBP: ffff8800b3179c98 R08: 000000000000ffff R09: 000000000000ffff
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0056641
R13: ffff880084cde000 R14: 0000000000000000 R15: 000000000000ffff
FS: 00007fa070fdd720(0000) GS:ffff8800bbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffa0056641 CR3: 00000000b2bba000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process irqbalance (pid: 3605, threadinfo ffff8800b3178000, task ffff8800b5875880)
Stack:
ffff8800b3179cd8 ffffffff811e58ae ffff8800b3179cf8 ffff880084cdd28f
ffffffff8151f664 ffff8800b3179d70 ffffffff8151f664 ffff880084cde000
ffff8800b3179d58 ffffffff811e6889 000000000000ffff 0000000000000002
Call Trace:
[<ffffffff811e58ae>] string.isra.4+0x3e/0xd0
[<ffffffff811e6889>] vsnprintf+0x219/0x600
[<ffffffff81153093>] seq_printf+0x53/0x80
[<ffffffff811e21db>] ? radix_tree_lookup+0xb/0x10
[<ffffffff810ba5ab>] show_interrupts+0x24b/0x300
[<ffffffff81153474>] seq_read+0x2b4/0x3f0
[<ffffffff811531c0>] ? seq_lseek+0x100/0x100
[<ffffffff8118b651>] proc_reg_read+0x81/0xc0
[<ffffffff8112f967>] vfs_read+0xa7/0x160
[<ffffffff8112fa65>] sys_read+0x45/0x90
[<ffffffff813a33a2>] system_call_fastpath+0x16/0x1b
Code: 31 c0 80 3f 00 55 48 89 e5 74 11 48 89 f8 66 90 48 83 c0 01 80 38 00 75 f7
48 29 f8 5d c3 66 90 55 31 c0 48 85 f6 48 89 e5 74 23 <80> 3f 00 74 1e 48 89 f8
eb 0c 0f 1f 00 48 83 ee 01 80 38 00 74
RIP [<ffffffff811e3ecb>] strnlen+0xb/0x30
RSP <ffff8800b3179c98>
CR2: ffffffffa0056641
---[ end trace ad5f2d26ea4a9371 ]---

Nothing that I have tried has made any difference. Does this kind of fault
suggest a fix? I found nothing useful with Google.

Thanks,

Larry


2012-01-28 04:31:05

by Johannes Berg

[permalink] [raw]
Subject: Re: Kernel BUG due to kernel page fault

Larry,

On 1/26/2012 2:36 PM, Larry Finger wrote:
> I am trying to convert the rtlwifi family of drivers to use asynchronous
> firmware loading. With recent changes in udev, many problems are being
> reported. See Bug 42632] at b.k.o.

I'm not sure what caused your problem, but your email reminded me of an
issue we had with asynchronous firmware loading that you may want to
consider.

When request_firmware_async() is called, it eventually calls the
callback you give it. This may, however, be a long time later (I think
the current default timeout is 1 minute). In the meantime, the module
could be unloaded by the user, and then the system crashes as soon as
the firmwaer code attempts to invoke the callback.

To fix this, we have a completion in iwlwifi that some exit code path
waits for to make sure this scenario doesn't happen.

I think you should implement that.

A better fix would probably be to make request_firmware_async() take a
struct module * argument and pass THIS_MODULE to it, and make the
firmware code handle this, but currently it doesn't.

johannes

2012-01-27 16:42:28

by Stanislaw Gruszka

[permalink] [raw]
Subject: Re: Kernel BUG due to kernel page fault

On Thu, Jan 26, 2012 at 04:36:47PM -0600, Larry Finger wrote:
> Modules linked in: fuse autofs4 snd_pcm_oss snd_mixer_oss snd_seq
> snd_seq_device edd nfs nfsd lockd auth_rpcgss nfs_acl sunrpc
> exportfs ipv6 vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> RIP: 0010:[<ffffffff811e3ecb>] [<ffffffff811e3ecb>] strnlen+0xb/0x30
[snip]
> Call Trace:
> [<ffffffff811e58ae>] string.isra.4+0x3e/0xd0
> [<ffffffff811e6889>] vsnprintf+0x219/0x600
> [<ffffffff81153093>] seq_printf+0x53/0x80
> [<ffffffff811e21db>] ? radix_tree_lookup+0xb/0x10
> [<ffffffff810ba5ab>] show_interrupts+0x24b/0x300
> [<ffffffff81153474>] seq_read+0x2b4/0x3f0
> [<ffffffff811531c0>] ? seq_lseek+0x100/0x100
> [<ffffffff8118b651>] proc_reg_read+0x81/0xc0
> [<ffffffff8112f967>] vfs_read+0xa7/0x160
> [<ffffffff8112fa65>] sys_read+0x45/0x90
> [<ffffffff813a33a2>] system_call_fastpath+0x16/0x1b

This is call trace from cat /proc/interrupts, seems something pass
malformed string to request_irq(). I do not think this is rtlwifi,
rather that vbox thing.

Stanislaw

2012-01-27 18:07:56

by Larry Finger

[permalink] [raw]
Subject: Re: Kernel BUG due to kernel page fault

On 01/27/2012 10:42 AM, Stanislaw Gruszka wrote:
> On Thu, Jan 26, 2012 at 04:36:47PM -0600, Larry Finger wrote:
>> Modules linked in: fuse autofs4 snd_pcm_oss snd_mixer_oss snd_seq
>> snd_seq_device edd nfs nfsd lockd auth_rpcgss nfs_acl sunrpc
>> exportfs ipv6 vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> RIP: 0010:[<ffffffff811e3ecb>] [<ffffffff811e3ecb>] strnlen+0xb/0x30
> [snip]
>> Call Trace:
>> [<ffffffff811e58ae>] string.isra.4+0x3e/0xd0
>> [<ffffffff811e6889>] vsnprintf+0x219/0x600
>> [<ffffffff81153093>] seq_printf+0x53/0x80
>> [<ffffffff811e21db>] ? radix_tree_lookup+0xb/0x10
>> [<ffffffff810ba5ab>] show_interrupts+0x24b/0x300
>> [<ffffffff81153474>] seq_read+0x2b4/0x3f0
>> [<ffffffff811531c0>] ? seq_lseek+0x100/0x100
>> [<ffffffff8118b651>] proc_reg_read+0x81/0xc0
>> [<ffffffff8112f967>] vfs_read+0xa7/0x160
>> [<ffffffff8112fa65>] sys_read+0x45/0x90
>> [<ffffffff813a33a2>] system_call_fastpath+0x16/0x1b
>
> This is call trace from cat /proc/interrupts, seems something pass
> malformed string to request_irq(). I do not think this is rtlwifi,
> rather that vbox thing.

Thanks for responding. Vbox may be guilty of a lot of things, but this is not
one of them.

Just before I received your reply, I found the problem. The PCI driver in
rtlwifi sets a flag to indicate that interrupts have been enabled, and uses that
flag to tell if the cleanup needs to disable them. When I moved the code that
starts mac80211, I also moved the statement that sets the flag. As a result, the
driver was exiting with interrupts still enabled, and the BUG resulted from
that. I can now unload the driver when firmware is not available.

Larry





2012-01-28 23:02:55

by Larry Finger

[permalink] [raw]
Subject: Re: Kernel BUG due to kernel page fault

On 01/27/2012 10:30 PM, Johannes Berg wrote:
> Larry,
>
> On 1/26/2012 2:36 PM, Larry Finger wrote:
>> I am trying to convert the rtlwifi family of drivers to use asynchronous
>> firmware loading. With recent changes in udev, many problems are being
>> reported. See Bug 42632] at b.k.o.
>
> I'm not sure what caused your problem, but your email reminded me of an issue we
> had with asynchronous firmware loading that you may want to consider.
>
> When request_firmware_async() is called, it eventually calls the callback you
> give it. This may, however, be a long time later (I think the current default
> timeout is 1 minute). In the meantime, the module could be unloaded by the user,
> and then the system crashes as soon as the firmwaer code attempts to invoke the
> callback.
>
> To fix this, we have a completion in iwlwifi that some exit code path waits for
> to make sure this scenario doesn't happen.
>
> I think you should implement that.
>
> A better fix would probably be to make request_firmware_async() take a struct
> module * argument and pass THIS_MODULE to it, and make the firmware code handle
> this, but currently it doesn't.

Thanks for explaining why the completion queue is in iwlwifi. That was easy to
implement.

Larry