2008-11-15 18:20:39

by Bob Copeland

Subject: Re: Kernel oops when loading ath5k from compat-wireless in 2.6.27

On Sat, Nov 15, 2008 at 12:29:34AM -0600, Dan McGee wrote:
> On Fri, Nov 14, 2008 at 8:57 PM, Dan McGee <[email protected]> wrote:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 00000082
> > IP: [<7818ca71>] sysfs_find_dirent+0x9/0x23
> > Oops: 0000 [#1] PREEMPT
> > Modules linked in: ath5k(+) mac80211

So, just to recap, this is with Luis' patch; now you get a null pointer
dereference in sysfs instead of in ieee80211_register_hw? It does look
like we're deep in register_netdevice now. If you revert his patch, you
can still get the error in register_hw every time?

> > Pid: 818 comm: modprobe Not tainted (2.6.27.6eee #1)
> > EIP: 0060:[<7818ca71>] EFLAGS: 00010206 CPU: 0
> > EIP is at sysfs_find_dirent+0x9/0x23
> > EAX: 00000001 EBX: 00000072 ECX: 00000001 EDX: b730b4f0
> > ESI: b730b4f0 EDI: fffffff4 EBP: b7311490 ESP: b73ffd34

EBX is 00000072, definitely not a pointer.

> And I had the code completely wrong, oops. Looks like we are bailing
> on the strcmp call in this function or something along those lines? I
> wish I could be a bigger help with debugging this stuff.

Yep, or at least in the setup code for that. Don't worry, you're being
a big help; I think we just don't have a good enough theory yet to
propose decent debugging patches.

> struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
> const unsigned char *name)
> {
> 1bc: 56 push %esi
> 1bd: 89 d6 mov %edx,%esi
> 1bf: 53 push %ebx
> struct sysfs_dirent *sd;
>
> for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
> 1c0: 8b 58 18 mov 0x18(%eax),%ebx
> 1c3: eb 11 jmp 1d6 <sysfs_find_dirent+0x1a>
> if (!strcmp(sd->s_name, name))
> 1c5: 8b 43 10 mov 0x10(%ebx),%eax

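EBX appears to be sd (it's loaded at 1c0 from parent_sd + 0x18, which is
parent_sd->s_dir.children, then it jumps to the loop test). The faulting
instruction is the one at 1c5 (sysfs_find_dirent+0x9): it loads sd->s_name
from sd + 0x10 into EAX as the first argument to strcmp. With sd = 0x72,
that read is at 0x82, which matches the address in the oops.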
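
As a sanity check on the offsets, here is a minimal userspace sketch of the
arithmetic. The struct is a hypothetical stand-in, not the kernel's
definition: it only pins down the two offsets visible in the disassembly
(0x10 for s_name, 0x18 for s_dir.children) and pads everything else. The
names mock_dirent32, fault_address and check_offsets are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical 32-bit mock: only the two offsets seen in the disassembly
 * are modelled; the padding stands in for fields we have not looked at.
 */
struct mock_dirent32 {
    uint8_t  pad0[0x10];  /* whatever precedes s_name */
    uint32_t s_name;      /* read by: mov 0x10(%ebx),%eax */
    uint8_t  pad1[4];     /* whatever sits between the two */
    uint32_t children;    /* read by: mov 0x18(%eax),%ebx */
};

/* Address touched when s_name is read through a bogus sd pointer. */
static uint32_t fault_address(uint32_t sd)
{
    return sd + (uint32_t)offsetof(struct mock_dirent32, s_name);
}

/* Offset of the faulting instruction relative to the function start. */
static uint32_t insn_offset(uint32_t insn, uint32_t func_start)
{
    return insn - func_start;
}

static void check_offsets(void)
{
    assert(offsetof(struct mock_dirent32, s_name) == 0x10);
    assert(offsetof(struct mock_dirent32, children) == 0x18);
}
```

Calling fault_address(0x72) gives the address reported in the oops, and
insn_offset(0x1c5, 0x1bc) gives the +0x9 from the EIP line.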

So, while traversing the sibling pointers, one of them happens to be
00000072 (instead of what should probably have been NULL). 0x72 is not
a poison value I'm aware of. At this point, things have gone south, but
the real problem happened earlier.
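
To make the failure mode concrete, here is a rough userspace sketch of the
same lookup loop. It assumes nothing beyond the source lines shown in the
disassembly above; mock_dirent, mock_find_dirent and mock_demo are
hypothetical names and the struct carries only the two fields the loop
touches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with just the fields the loop uses. */
struct mock_dirent {
    struct mock_dirent *s_sibling;
    const char *s_name;
};

/*
 * Same shape as the quoted loop: walk the sibling chain until a name
 * matches or the chain ends. The only thing that stops the walk is a
 * NULL sibling, so any other garbage value gets dereferenced.
 */
static struct mock_dirent *mock_find_dirent(struct mock_dirent *children,
                                            const char *name)
{
    struct mock_dirent *sd;

    for (sd = children; sd; sd = sd->s_sibling)
        if (!strcmp(sd->s_name, name))
            return sd;
    return NULL;
}

/* A well-formed two-entry list: lookups hit or fall off the NULL end. */
static void mock_demo(void)
{
    struct mock_dirent second = { NULL, "second" };
    struct mock_dirent first = { &second, "first" };

    assert(mock_find_dirent(&first, "second") == &second);
    assert(mock_find_dirent(&first, "absent") == NULL);
}
```

If second.s_sibling held a small non-NULL value such as 0x72 instead of
NULL, the "absent" lookup would go on to read s_name through that value,
which is the pattern the oops shows.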

Can you post your .config?

--
Bob Copeland %% http://www.bobcopeland.com



2008-11-16 00:12:09

by Dan McGee

Subject: Re: Kernel oops when loading ath5k from compat-wireless in 2.6.27

On Sat, Nov 15, 2008 at 12:19 PM, Bob Copeland <[email protected]> wrote:
> On Sat, Nov 15, 2008 at 12:29:34AM -0600, Dan McGee wrote:
>> On Fri, Nov 14, 2008 at 8:57 PM, Dan McGee <[email protected]> wrote:
>> >
>> > BUG: unable to handle kernel NULL pointer dereference at 00000082
>> > IP: [<7818ca71>] sysfs_find_dirent+0x9/0x23
>> > Oops: 0000 [#1] PREEMPT
>> > Modules linked in: ath5k(+) mac80211
>
> So, just to recap, this is with Luis' patch; now you get a null pointer
> dereference in sysfs instead of in ieee80211_register_hw? It does look
> like we're deep in register_netdevice now. If you revert his patch, you
> can still get the error in register_hw every time?

Yeah, this is with Luis' patch. Without that patch it always bugs out
at the earlier step in register_hw(). And like I said, I can't
reproduce this one with debug symbols built into the kernel
unfortunately.

>> > Pid: 818 comm: modprobe Not tainted (2.6.27.6eee #1)
>> > EIP: 0060:[<7818ca71>] EFLAGS: 00010206 CPU: 0
>> > EIP is at sysfs_find_dirent+0x9/0x23
>> > EAX: 00000001 EBX: 00000072 ECX: 00000001 EDX: b730b4f0
>> > ESI: b730b4f0 EDI: fffffff4 EBP: b7311490 ESP: b73ffd34
>
> EBX is 00000072, definitely not a pointer.
>
>> And I had the code completely wrong, oops. Looks like we are bailing
>> on the strcmp call in this function or something along those lines? I
>> wish I could be a bigger help with debugging this stuff.
>
> Yep, or at least in the setup code for that. Don't worry, you're being
> a big help; I think we just don't have a good enough theory yet to
> propose decent debugging patches.
>
>> struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
>> const unsigned char *name)
>> {
>> 1bc: 56 push %esi
>> 1bd: 89 d6 mov %edx,%esi
>> 1bf: 53 push %ebx
>> struct sysfs_dirent *sd;
>>
>> for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
>> 1c0: 8b 58 18 mov 0x18(%eax),%ebx
>> 1c3: eb 11 jmp 1d6 <sysfs_find_dirent+0x1a>
>> if (!strcmp(sd->s_name, name))
>> 1c5: 8b 43 10 mov 0x10(%ebx),%eax
>
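> EBX appears to be sd (it's loaded at 1c0 from parent_sd + 0x18, which is
> parent_sd->s_dir.children, then it jumps to the loop test). The faulting
> instruction is the one at 1c5 (sysfs_find_dirent+0x9): it loads sd->s_name
> from sd + 0x10 into EAX as the first argument to strcmp. With sd = 0x72,
> that read is at 0x82, which matches the address in the oops.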
>
> So, while traversing the sibling pointers, one of them happens to be
> 00000072 (instead of what should probably have been NULL). 0x72 is not
> a poison value I'm aware of. At this point, things have gone south, but
> the real problem happened earlier.

Yeah, I figured it was something earlier that didn't quite work out,
but I really had no idea where to start poking.

> Can you post your .config?

Sure- here it is: http://www.toofishes.net/uploads/kernelconfig

-Dan