2003-09-23 10:42:25

by Sebastian Piecha

Subject: How to understand an oops?

Hello,

Using Samba 2.2.8a with kernel 2.4.20 or 2.4.23pre1 ends in an oops.
I asked for help in several postings to the lkml, but unfortunately
nobody answered.

I'm trying to interpret all the oopses myself. I reproduced the
kernel panic in different configurations, with and without LVM or
RAID. What all the oopses have in common is the "Code" section,
which lists skb_drop_fraglist. skb_drop_fraglist is defined in
net/core/skbuff.c.

Does the code section mean that the kernel panic occurred during
execution of this code?
How likely is it that a bug in net/core/skbuff.c is causing the
kernel panic?
How can I find other code/modules from which skb_drop_fraglist is
called and used?
What is the best way to interpret such an oops?

################## oops #####################
Oops: 0000
CPU: 0
EIP: 0010:[<c0219cd7>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: c40866a0 ebx: 00200000 ecx: c40866a0 edx: 00200000
esi: cec57360 edi: fffffff9 ebp: 00000046 esp: c0303f2c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=c0303000)
Stack: cec57360 c0219d6e cec57360 cec57360 cec57360 c0219dab cec57360 cec57360
       c0219efc cec57360 cf49cb20 c021e173 cec57360 00000003 c032c568 c0120629
       c032c568 00000006 0000000e c0303f98 d3e02e40 c010a091 c0106f40 c0302000
Call Trace: [<c0219d6e>] [<c0219dab>] [<c0219efc>] [<c021e173>] [<c0120629>]
   [<c010a091>] [<c0106f40>] [<c010c4e8>] [<c0106f40>] [<c0106f64>] [<c0106fd2>]
   [<c0105000>]
Code: 8b 1b 8b 42 74 48 74 0a ff 4a 74 0f 94 c0 84 c0 74 07 52 e8


>>EIP; c0219cd7 <skb_drop_fraglist+17/40> <=====

>>eax; c40866a0 <_end+3cf81fc/14e64bbc>
>>ecx; c40866a0 <_end+3cf81fc/14e64bbc>
>>esi; cec57360 <_end+e8c8ebc/14e64bbc>
>>esp; c0303f2c <init_task_union+1f2c/2000>

Trace; c0219d6e <skb_release_data+4e/80>
Trace; c0219dab <kfree_skbmem+b/70>
Trace; c0219efc <__kfree_skb+ec/150>
Trace; c021e173 <net_tx_action+33/a0>
Trace; c0120629 <do_softirq+99/a0>
Trace; c010a091 <do_IRQ+a1/b0>
Trace; c0106f40 <default_idle+0/30>
Trace; c010c4e8 <call_do_IRQ+5/d>
Trace; c0106f40 <default_idle+0/30>
Trace; c0106f64 <default_idle+24/30>
Trace; c0106fd2 <cpu_idle+42/60>
Trace; c0105000 <_stext+0/0>

Code; c0219cd7 <skb_drop_fraglist+17/40>
00000000 <_EIP>:
Code; c0219cd7 <skb_drop_fraglist+17/40> <=====
0: 8b 1b mov (%ebx),%ebx <=====
Code; c0219cd9 <skb_drop_fraglist+19/40>
2: 8b 42 74 mov 0x74(%edx),%eax
Code; c0219cdc <skb_drop_fraglist+1c/40>
5: 48 dec %eax
Code; c0219cdd <skb_drop_fraglist+1d/40>
6: 74 0a je 12 <_EIP+0x12>
Code; c0219cdf <skb_drop_fraglist+1f/40>
8: ff 4a 74 decl 0x74(%edx)
Code; c0219ce2 <skb_drop_fraglist+22/40>
b: 0f 94 c0 sete %al
Code; c0219ce5 <skb_drop_fraglist+25/40>
e: 84 c0 test %al,%al
Code; c0219ce7 <skb_drop_fraglist+27/40>
10: 74 07 je 19 <_EIP+0x19>
Code; c0219ce9 <skb_drop_fraglist+29/40>
12: 52 push %edx
Code; c0219cea <skb_drop_fraglist+2a/40>
13: e8 00 00 00 00 call 18 <_EIP+0x18>

<0>Kernel panic: Aiee, killing interrupt handler!
#################################

Best regards,
Sebastian Piecha

EMail: [email protected]


2003-09-23 16:43:20

by Randy.Dunlap

Subject: Re: How to understand an oops?

On Tue, 23 Sep 2003 12:42:22 +0200 "Sebastian Piecha" <[email protected]> wrote:

| Hello,
|
| using samba 2.2.8a with kernel 2.4.20 or 2.4.23pre1 ends in an oops.
| In several mailings to the lkml I tried to get help but unfortunately
| nobody did answer me.

You would probably get more linux-networking help on
[email protected] or
[email protected]

| I'm trying to interpret all the oopses myself. I reproduced the
| kernel panic in different configurations - with or without LVM or
| RAID. What all oopses do have common is the "Code" section.
| skb_drop_fraglist is listed herein. skb_drop_fraglist is defined in
| net/core/skbuff.c.

Yes, and that's the only source file where it is used.

| Does the code section mean that the kernel panic occurred during
| execution of this code?

Yes.
Are there a few informative lines missing before the Oops: line below?

| How likely is it that a bug in net/core/skbuff.c is causing the
| kernel panic?

Dunno. It's trying to access memory at (%ebx) == 00200000.
That could be a single-bit error in memory which was supposed to be
0, which would have terminated the while loop in skb_drop_fraglist().

The common suggestion based on this would be to run memtest86
(or memtst86 ?) overnight to check for memory errors.
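
The single-bit reasoning above can be sketched as follows. This is a minimal model in Python, not kernel code: the fragment list is represented as a dictionary mapping an address to the next pointer stored there, and the list addresses and the helpers `flipped_bits` and `walk` are hypothetical. Only the value 0x00200000 comes from the oops (ebx). A single-bit difference from 0 is consistent with a memory error but does not prove one.

```python
def flipped_bits(value, expected=0):
    """Return the bit positions in which value differs from expected."""
    diff = value ^ expected
    return [bit for bit in range(32) if diff & (1 << bit)]


def walk(memory, head):
    """Follow next pointers until a 0 terminator; return visited addresses.

    Raises KeyError when a pointer refers to an address that is not in the
    model, which stands in for the page fault reported by the oops.
    """
    visited = []
    addr = head
    while addr != 0:
        visited.append(addr)
        addr = memory[addr]  # analogous to: mov (%ebx),%ebx
    return visited


# Hypothetical two-element list, correctly terminated with 0.
good = {0x1000: 0x2000, 0x2000: 0}
# Same list, but the terminator holds the value seen in ebx instead of 0.
bad = {0x1000: 0x2000, 0x2000: 0x00200000}

# One entry in the result means the value differs from 0 in a single bit.
print(flipped_bits(0x00200000))
print([hex(a) for a in walk(good, 0x1000)])
try:
    walk(bad, 0x1000)
except KeyError as fault:
    print("fault when dereferencing", hex(fault.args[0]))
```

In this model the loop ends cleanly for the correctly terminated list, while the corrupted terminator is treated as one more element and dereferenced.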

| How can I find other code/modules from which skb_drop_fraglist is
| called and used?

Use grep (or cscope, but that would be overkill in this case).
I found it only in net/core/skbuff.c.
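
For readers without grep at hand, the same search can be sketched in Python. The helper name `find_symbol` and the source directory `linux-2.4.23-pre1` are assumptions made for illustration; point it at wherever the kernel tree is unpacked.

```python
import os


def find_symbol(root, symbol, suffixes=(".c", ".h")):
    """Return sorted (path, line number, line) tuples for lines containing symbol."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffixes):
                continue  # only look at C sources and headers
            path = os.path.join(dirpath, name)
            with open(path, errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if symbol in line:
                        hits.append((path, lineno, line.rstrip()))
    return sorted(hits)


# Hypothetical location of the unpacked kernel source; prints nothing if absent.
for path, lineno, line in find_symbol("linux-2.4.23-pre1", "skb_drop_fraglist"):
    print("%s:%d: %s" % (path, lineno, line))
```

This is a plain substring match, so it also reports comments and longer identifiers that contain the name.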

| What is the best way interpreting such an oops?
|
| ################## oops #####################
| Oops: 0000
| CPU: 0
| EIP: 0010:[<c0219cd7>] Not tainted
| Using defaults from ksymoops -t elf32-i386 -a i386
| EFLAGS: 00010206
| eax: c40866a0 ebx: 00200000 ecx: c40866a0 edx: 00200000
| esi: cec57360 edi: fffffff9 ebp: 00000046 esp: c0303f2c
| ds: 0018 es: 0018 ss: 0018
| Process swapper (pid: 0, stackpage=c0303000)
| Stack: cec57360 c0219d6e cec57360 cec57360 cec57360 c0219dab cec57360
| cec57360
| c0219efc cec57360 cf49cb20 c021e173 cec57360 00000003 c032c568
| c0120629
| c032c568 00000006 0000000e c0303f98 d3e02e40 c010a091 c0106f40
| c0302000
| Call Trace: [<c0219d6e>] [<c0219dab>] [<c0219efc>] [<c021e173>]
| [<c0120629>]
| [<c010a091>] [<c0106f40>] [<c010c4e8>] [<c0106f40>] [<c0106f64>]
| [<c0106fd2>]
| [<c0105000>]
| Code: 8b 1b 8b 42 74 48 74 0a ff 4a 74 0f 94 c0 84 c0 74 07 52 e8
|
|
| >>EIP; c0219cd7 <skb_drop_fraglist+17/40> <=====
^^^^^^^^^^^^^^^^^^^^^^^
This is the faulting code location.

| >>eax; c40866a0 <_end+3cf81fc/14e64bbc>
| >>ecx; c40866a0 <_end+3cf81fc/14e64bbc>
| >>esi; cec57360 <_end+e8c8ebc/14e64bbc>
| >>esp; c0303f2c <init_task_union+1f2c/2000>

This is the call stack. skb_drop_fraglist() was called from here:
vvvvvvvvvvvvvvvvvvvvvv
| Trace; c0219d6e <skb_release_data+4e/80>
| Trace; c0219dab <kfree_skbmem+b/70>
| Trace; c0219efc <__kfree_skb+ec/150>
| Trace; c021e173 <net_tx_action+33/a0>
| Trace; c0120629 <do_softirq+99/a0>
| Trace; c010a091 <do_IRQ+a1/b0>
| Trace; c0106f40 <default_idle+0/30>
| Trace; c010c4e8 <call_do_IRQ+5/d>
| Trace; c0106f40 <default_idle+0/30>
| Trace; c0106f64 <default_idle+24/30>
| Trace; c0106fd2 <cpu_idle+42/60>
| Trace; c0105000 <_stext+0/0>
|
| Code; c0219cd7 <skb_drop_fraglist+17/40>
| 00000000 <_EIP>:
| Code; c0219cd7 <skb_drop_fraglist+17/40> <=====
| 0: 8b 1b mov (%ebx),%ebx <=====
^^^^^^^^^^^^^^^^^^
This is the faulting instruction. Accessing memory at (%ebx) == 00200000.

| Code; c0219cd9 <skb_drop_fraglist+19/40>
| 2: 8b 42 74 mov 0x74(%edx),%eax
| Code; c0219cdc <skb_drop_fraglist+1c/40>
| 5: 48 dec %eax
| Code; c0219cdd <skb_drop_fraglist+1d/40>
| 6: 74 0a je 12 <_EIP+0x12>
| Code; c0219cdf <skb_drop_fraglist+1f/40>
| 8: ff 4a 74 decl 0x74(%edx)
| Code; c0219ce2 <skb_drop_fraglist+22/40>
| b: 0f 94 c0 sete %al
| Code; c0219ce5 <skb_drop_fraglist+25/40>
| e: 84 c0 test %al,%al
| Code; c0219ce7 <skb_drop_fraglist+27/40>
| 10: 74 07 je 19 <_EIP+0x19>
| Code; c0219ce9 <skb_drop_fraglist+29/40>
| 12: 52 push %edx
| Code; c0219cea <skb_drop_fraglist+2a/40>
| 13: e8 00 00 00 00 call 18 <_EIP+0x18>
|
| <0>Kernel panic: Aiee, killing interrupt handler!
| #################################

HTH.
--
~Randy

2003-09-23 17:31:13

by Sebastian Piecha

Subject: Re: How to understand an oops?

>...
> | Does the code section mean that the kernel panic occurred during
> | execution of this code?
>
> Yes.
> Are there a few informative lines missing before the Oops: line below?
>
Only the following lines:
ksymoops 2.4.8 on i686 2.4.23pre1-usbtest. Options used
-V (specified)
-k ksyms (specified)
-l modules (specified)
-o /lib/modules/2.4.23pre1-usbtest/ (default)
-m System.map (specified)

> | How likely is it that a bug in net/core/skbuff.c is causing the
> | kernel panic?
>
> Dunno. It's trying to access memory at (%ebx) == 00200000.
> That could be a single-bit error in memory which was supposed to be
> 0, which would have terminated the while loop in skb_drop_fraglist().
>
> The common suggestion based on this would be to run memtest86
> (or memtst86 ?) overnight to check for memory errors.
>

I have already run memtest for about 25 hours without any errors.
If it's not a memory error how can I find out what caused the error?
Debugger?

> | How can I find other code/modules from which skb_drop_fraglist is
> | called and used?
>
> Use grep (or cscope, but that would be overkill in this case).
> I found it only in net/core/skbuff.c.
>

What I found out in the meantime:
skb_drop_fraglist is part of the sk_buff code, and sk_buff is used by
network drivers and, it seems, also by IDE drivers. There is
documentation on how to use sk_buff (Network Buffers and Memory
Management, Alan Cox,
http://www.linuxjournal.com/article.php?sid=1312).

> | What is the best way interpreting such an oops?
> |
> | ################## oops #####################
> | Oops: 0000
> | CPU: 0
> | EIP: 0010:[<c0219cd7>] Not tainted
> | Using defaults from ksymoops -t elf32-i386 -a i386
> | EFLAGS: 00010206
> | eax: c40866a0 ebx: 00200000 ecx: c40866a0 edx: 00200000
> | esi: cec57360 edi: fffffff9 ebp: 00000046 esp: c0303f2c
> | ds: 0018 es: 0018 ss: 0018
> | Process swapper (pid: 0, stackpage=c0303000)
> | Stack: cec57360 c0219d6e cec57360 cec57360 cec57360 c0219dab cec57360
> | cec57360
> | c0219efc cec57360 cf49cb20 c021e173 cec57360 00000003 c032c568
> | c0120629
> | c032c568 00000006 0000000e c0303f98 d3e02e40 c010a091 c0106f40
> | c0302000
> | Call Trace: [<c0219d6e>] [<c0219dab>] [<c0219efc>] [<c021e173>]
> | [<c0120629>]
> | [<c010a091>] [<c0106f40>] [<c010c4e8>] [<c0106f40>] [<c0106f64>]
> | [<c0106fd2>]
> | [<c0105000>]
> | Code: 8b 1b 8b 42 74 48 74 0a ff 4a 74 0f 94 c0 84 c0 74 07 52 e8
> |
> |
> | >>EIP; c0219cd7 <skb_drop_fraglist+17/40> <=====
> ^^^^^^^^^^^^^^^^^^^^^^^
> This is the faulting code location.
>
> | >>eax; c40866a0 <_end+3cf81fc/14e64bbc>
> | >>ecx; c40866a0 <_end+3cf81fc/14e64bbc>
> | >>esi; cec57360 <_end+e8c8ebc/14e64bbc>
> | >>esp; c0303f2c <init_task_union+1f2c/2000>
>
> This is the call stack. skb_drop_fraglist() was called from here:
> vvvvvvvvvvvvvvvvvvvvvv
> | Trace; c0219d6e <skb_release_data+4e/80>

What does +4e/80 mean in the line above?

>...

Best regards,
Sebastian Piecha

EMail: [email protected]

2003-09-23 18:23:38

by Randy.Dunlap

Subject: Re: How to understand an oops?

On Tue, 23 Sep 2003 19:31:10 +0200 "Sebastian Piecha" <[email protected]> wrote:

| >...
| > | Does the code section mean that the kernel panic occurred during
| > | execution of this code?
| >
| > Yes.
| > Are there a few informative lines missing before the Oops: line below?

I was expecting something more like this (cut from the source code):

("Unable to handle kernel NULL pointer dereference");
or
("Unable to handle kernel paging request");
and
(" at virtual address %08lx\n",address);
(" printing eip:\n");
("%08lx\n", regs->eip);
("*pde = %08lx\n", page);
("*pte = %08lx\n", page);

| Only the following lines:
| ksymoops 2.4.8 on i686 2.4.23pre1-usbtest. Options used
| -V (specified)
| -k ksyms (specified)
| -l modules (specified)
| -o /lib/modules/2.4.23pre1-usbtest/ (default)
| -m System.map (specified)
|
| > | How likely is it that a bug in net/core/skbuff.c is causing the
| > | kernel panic?
| >
| > Dunno. It's trying to access memory at (%ebx) == 00200000.
| > That could be a single-bit error in memory which was supposed to be
| > 0, which would have terminated the while loop in skb_drop_fraglist().
| >
| > The common suggestion based on this would be to run memtest86
| > (or memtst86 ?) overnight to check for memory errors.
| >
|
| I already run memtest for about 25 hours without any error.
| If it's not a memory error how can I find out what caused the error?
| Debugger?
|
| > | How can I find other code/modules from which skb_drop_fraglist is
| > | called and used?
| >
| > Use grep (or cscope, but that would be overkill in this case).
| > I found it only in net/core/skbuff.c.
| >

In the meantime, you haven't tried the other mailing lists that
I suggested....

| What I found out in the meantime:
| skb_drop_fraglist is part of sk_buff and sk_buff is used by network
| drivers and it seems also IDE drivers. There's a documentation how to
| use sk_buff (Network Buffers and Memory Management, Alan Cox,
| http://www.linuxjournal.com/article.php?sid=1312).
|
| > | What is the best way interpreting such an oops?
| > |
| > | ################## oops #####################
| > | Oops: 0000
| > | CPU: 0
| > | EIP: 0010:[<c0219cd7>] Not tainted
| > | Using defaults from ksymoops -t elf32-i386 -a i386
| > | EFLAGS: 00010206
| > | eax: c40866a0 ebx: 00200000 ecx: c40866a0 edx: 00200000
| > | esi: cec57360 edi: fffffff9 ebp: 00000046 esp: c0303f2c
| > | ds: 0018 es: 0018 ss: 0018
| > | Process swapper (pid: 0, stackpage=c0303000)
| > | Stack: cec57360 c0219d6e cec57360 cec57360 cec57360 c0219dab cec57360
| > | cec57360
| > | c0219efc cec57360 cf49cb20 c021e173 cec57360 00000003 c032c568
| > | c0120629
| > | c032c568 00000006 0000000e c0303f98 d3e02e40 c010a091 c0106f40
| > | c0302000
| > | Call Trace: [<c0219d6e>] [<c0219dab>] [<c0219efc>] [<c021e173>]
| > | [<c0120629>]
| > | [<c010a091>] [<c0106f40>] [<c010c4e8>] [<c0106f40>] [<c0106f64>]
| > | [<c0106fd2>]
| > | [<c0105000>]
| > | Code: 8b 1b 8b 42 74 48 74 0a ff 4a 74 0f 94 c0 84 c0 74 07 52 e8
| > |
| > |
| > | >>EIP; c0219cd7 <skb_drop_fraglist+17/40> <=====
| > ^^^^^^^^^^^^^^^^^^^^^^^
| > This is the faulting code location.
| >
| > | >>eax; c40866a0 <_end+3cf81fc/14e64bbc>
| > | >>ecx; c40866a0 <_end+3cf81fc/14e64bbc>
| > | >>esi; cec57360 <_end+e8c8ebc/14e64bbc>
| > | >>esp; c0303f2c <init_task_union+1f2c/2000>
| >
| > This is the call stack. skb_drop_fraglist() was called from here:
| > vvvvvvvvvvvvvvvvvvvvvv
| > | Trace; c0219d6e <skb_release_data+4e/80>
|
| What does +4e/80 mean in the line above?

That's the return address from skb_drop_fraglist() back to
skb_release_data(): 0x4e bytes into the function, which is
a total of 0x80 bytes long.
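
The symbol+offset/size notation can be decoded mechanically. The following is a minimal sketch, assuming the annotation format shown in the ksymoops output in this thread; the regular expression and the `decode` helper are illustrative and not part of ksymoops.

```python
import re

# address <symbol+offset/size>, all numbers in hexadecimal
ANNOTATION = re.compile(r"([0-9a-f]{8}) <(\w+)\+([0-9a-f]+)/([0-9a-f]+)>")


def decode(line):
    """Split a ksymoops annotation into address, symbol, offset and size."""
    match = ANNOTATION.search(line)
    if match is None:
        raise ValueError("no symbol+offset/size annotation in: " + line)
    address = int(match.group(1), 16)
    offset = int(match.group(3), 16)
    size = int(match.group(4), 16)
    return {
        "symbol": match.group(2),
        "address": address,
        "offset": offset,
        "size": size,
        "start": address - offset,         # first byte of the function
        "end": address - offset + size,    # first byte past the function
    }


info = decode("Trace; c0219d6e <skb_release_data+4e/80>")
print(info["symbol"], hex(info["start"]), hex(info["end"]))
```

Subtracting the offset from the address gives the start of the function, and adding the size to that gives the address just past its end.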

--
~Randy

2003-09-23 23:10:15

by Sebastian Piecha

Subject: Re: How to understand an oops?

...
> | > Are there a few informative lines missing before the Oops: line below?
>
> I was expecting more like (cut from source code):
>
> ("Unable to handle kernel NULL pointer dereference");
> or
> ("Unable to handle kernel paging request");
> and
> (" at virtual address %08lx\n",address);
> (" printing eip:\n");
> ("%08lx\n", regs->eip);
> ("*pde = %08lx\n", page);
> ("*pte = %08lx\n", page);
>

There was no output of that kind, only the oops I posted.

> ...
>
> In the meantime, you haven't tried the other mailing lists that
> I suggested....
>

In the meantime I did.

>...

Best regards,
Sebastian Piecha

EMail: [email protected]

2003-09-27 10:23:18

by Sebastian Piecha

Subject: Re: How to understand an oops?

>...
> | > | How likely is it that a bug in net/core/skbuff.c is causing the
> | > | kernel panic?
> | >
> | > Dunno. It's trying to access memory at (%ebx) == 00200000.
> | > That could be a single-bit error in memory which was supposed to be
> | > 0, which would have terminated the while loop in skb_drop_fraglist().
> | >
> | > The common suggestion based on this would be to run memtest86
> | > (or memtst86 ?) overnight to check for memory errors.
> | >
> |
> | I already run memtest for about 25 hours without any error.
> | If it's not a memory error how can I find out what caused the error?
> | Debugger?
> |
> | > | How can I find other code/modules from which skb_drop_fraglist is
> | > | called and used?
> | >
> | > Use grep (or cscope, but that would be overkill in this case).
> | > I found it only in net/core/skbuff.c.
> | >
>

sk_buff is used by network drivers. How can I narrow down the error
in skb_drop_fraglist? Maybe it's a network driver problem and not
Promise-IDE related.

> In the meantime, you haven't tried the other mailing lists that
> I suggested....
>

I did, but didn't get any response.

I don't know how to proceed. I can't verify my configuration under
kernel 2.6.x or 2.4.22-ac4: 2.6.x doesn't boot (blank screen, and
yes, I tried different CONFIG_CONSOLE settings), and 2.4.22-ac4 hangs
with an oops when booting. Under 2.4.23pre1 I'm getting the same
errors.

Can I do some debugging to narrow down the problem? I still don't
know if the error is network or Promise driver related.

Best regards,
Sebastian Piecha

EMail: [email protected]