2003-11-02 23:09:33

by Udo A. Steinberg

Subject: [OOPS] Linux-2.6.0-test9


Hi,

The following oops just happened with Linux-2.6.0-test9. All filesystems are
ext3, except for /dev/hda6, which is ext3 over aes-cryptoloop.

-Udo.


Unable to handle kernel paging request at virtual address 50d35f7c
printing eip:
c016c912
*pde = 00000000
Oops: 0000 [#2]
CPU: 0
EIP: 0060:[<c016c912>] Not tainted
EFLAGS: 00010246
EIP is at __d_lookup+0xf2/0x150
eax: 50d35f7c ebx: d1d35f00 ecx: 00000004 edx: d1d35f64
esi: 50d35f7c edi: d7329000 ebp: d1d35f70 esp: d4ed7e68
ds: 007b es: 007b ss: 0068
Process sylpheed (pid: 19589, threadinfo=d4ed6000 task=d3dfa0a0)
Stack: d7fdd6c0 000796b7 00001000 00000000 d7f46aa0 d4ed6000 c0190473 00000000
d7329000 00d03008 00000004 d7329000 d4ed7f38 d7ff4660 d4ed7ee4 c0161f30
d21db820 d4ed7eec d21db820 d7329000 d4ed7ee4 d4ed7f38 d4ed7eec c016245f
Call Trace:
[<c0190473>] ext3_getblk+0x93/0x280
[<c0161f30>] do_lookup+0x30/0xb0
[<c016245f>] link_path_walk+0x4af/0x940
[<c0162df9>] __user_walk+0x49/0x60
[<c015dd2f>] vfs_stat+0x1f/0x60
[<c015e4cb>] sys_stat64+0x1b/0x40
[<c010942b>] syscall_call+0x7/0xb

Code: f3 a6 75 9d 8b 44 24 14 ff 40 14 8b 54 24 0c 3b 53 58 75 0c
<6>note: sylpheed[19589] exited with preempt_count 1
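
For anyone decoding the trace by hand, the key fields can be pulled apart mechanically. What follows is a minimal Python sketch and not part of the original report: the `summarize` helper and its dictionary keys are made-up names, and the excerpt is copied from the oops above. It derives where `__d_lookup` starts (EIP minus the 0xf2 offset) and lists the registers that hold the faulting address.

```python
import re

# Excerpt copied from the oops above; only the lines the sketch needs.
OOPS = """\
Unable to handle kernel paging request at virtual address 50d35f7c
EIP: 0060:[<c016c912>] Not tainted
EIP is at __d_lookup+0xf2/0x150
eax: 50d35f7c ebx: d1d35f00 ecx: 00000004 edx: d1d35f64
esi: 50d35f7c edi: d7329000 ebp: d1d35f70 esp: d4ed7e68
"""


def summarize(oops):
    """Hypothetical helper: extract a few facts from an i386 oops text."""
    fault = int(re.search(r"virtual address ([0-9a-f]{8})", oops).group(1), 16)
    eip = int(
        re.search(r"EIP:\s+[0-9a-f]{4}:\[<([0-9a-f]{8})>\]", oops).group(1), 16
    )
    m = re.search(r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", oops)
    symbol, offset, size = m.group(1), int(m.group(2), 16), int(m.group(3), 16)
    # General-purpose registers are printed as "eax: 50d35f7c" and so on.
    regs = {
        name: int(value, 16)
        for name, value in re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})", oops)
    }
    return {
        "symbol": symbol,
        "function_start": eip - offset,  # EIP = function start + offset
        "size": size,
        "fault_address": fault,
        "regs_holding_fault": sorted(n for n, v in regs.items() if v == fault),
    }


if __name__ == "__main__":
    info = summarize(OOPS)
    print(info["symbol"], hex(info["function_start"]), info["regs_holding_fault"])
```

On this excerpt the function start comes out as 0xc016c820, and both eax and esi hold 50d35f7c, so the bad pointer was already loaded into registers when the access faulted.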



2003-11-02 23:19:17

by Udo A. Steinberg

Subject: Re: [OOPS] Linux-2.6.0-test9

On Mon, 3 Nov 2003 00:09:24 +0100 Udo A. Steinberg (UAS) wrote:

UAS> Unable to handle kernel paging request at virtual address 50d35f7c
UAS> printing eip:
UAS> c016c912
UAS> *pde = 00000000
UAS> Oops: 0000 [#2]

I just noticed that my last report was actually the second OOPS in a whole
series. Here is the first one.

-Udo.

Unable to handle kernel paging request at virtual address 50d35f7c
printing eip:
c016c912
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<c016c912>] Not tainted
EFLAGS: 00010246
EIP is at __d_lookup+0xf2/0x150
eax: 50d35f7c ebx: d1d35f00 ecx: 00000004 edx: d1d35f64
esi: 50d35f7c edi: d650e000 ebp: d1d35f70 esp: d3c85e68
ds: 007b es: 007b ss: 0068
Process sylpheed (pid: 265, threadinfo=d3c84000 task=d3dfa6c0)
Stack: d7ff4660 d3c85ee4 c0109598 00000000 d7f46aa0 d3c84000 d3c85f38 00000000
d650e000 00d03008 00000004 d650e000 d3c85f38 d7ff4660 d3c85ee4 c0161f30
d21db820 d3c85eec d21db820 d650e000 d3c85ee4 d3c85f38 d3c85eec c016245f
Call Trace:
[<c0109598>] common_interrupt+0x18/0x20
[<c0161f30>] do_lookup+0x30/0xb0
[<c016245f>] link_path_walk+0x4af/0x940
[<c0162df9>] __user_walk+0x49/0x60
[<c015dd2f>] vfs_stat+0x1f/0x60
[<c015e4cb>] sys_stat64+0x1b/0x40
[<c010942b>] syscall_call+0x7/0xb

Code: f3 a6 75 9d 8b 44 24 14 ff 40 14 8b 54 24 0c 3b 53 58 75 0c
<6>note: sylpheed[265] exited with preempt_count 1
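
The two reports can also be compared register by register. This is a small Python sketch added for illustration, not something posted in the thread; `parse_regs` and `differing` are hypothetical helper names, and the register lines are copied from the two oopses.

```python
import re

# Register lines copied from oops [#1] (pid 265) and oops [#2] (pid 19589).
OOPS_1 = """\
eax: 50d35f7c ebx: d1d35f00 ecx: 00000004 edx: d1d35f64
esi: 50d35f7c edi: d650e000 ebp: d1d35f70 esp: d3c85e68
"""
OOPS_2 = """\
eax: 50d35f7c ebx: d1d35f00 ecx: 00000004 edx: d1d35f64
esi: 50d35f7c edi: d7329000 ebp: d1d35f70 esp: d4ed7e68
"""


def parse_regs(dump):
    """Hypothetical helper: map register name -> hex string."""
    return dict(re.findall(r"\b(e[a-z]{2}): ([0-9a-f]{8})", dump))


def differing(a, b):
    """Names of registers whose values differ between two dumps."""
    ra, rb = parse_regs(a), parse_regs(b)
    return sorted(name for name in ra if ra[name] != rb.get(name))


if __name__ == "__main__":
    print(differing(OOPS_1, OOPS_2))
```

Only edi and esp differ. The faulting pointer in eax/esi and the values in ebx, edx and ebp are identical in both dumps, which would fit both lookups tripping over the same damaged data; that reading is an inference from the numbers, not a conclusion stated by the reporter.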



2003-11-03 03:02:34

by Linus Torvalds

Subject: Re: [OOPS] Linux-2.6.0-test9

Udo A. Steinberg wrote:
>
> I just noticed that my last report was actually the second OOPS in a whole
> series. Here is the first one.
>
> Unable to handle kernel paging request at virtual address 50d35f7c
> EIP is at __d_lookup+0xf2/0x150

Ok, looks like your dentry lists got seriously corrupted.

The good news (?) is that you seem to have preempt enabled, and there is one
known (but fairly hard-to-hit) race in UP+preempt locking due to bad
barrier ordering in test9 (and all previous kernels too, for that matter).
That should be fixed in the current BK snapshots, but you can also avoid
the problem by just not enabling preempt.

Of course, if you see the bug without preempt, or if this was on an SMP
kernel (which _should_ have hidden the preempt problem even with preempt
enabled), holler.

Linus

2003-11-03 03:48:49

by Udo A. Steinberg

Subject: Re: [OOPS] Linux-2.6.0-test9

On Mon, 03 Nov 2003 11:10:06 +1100 Peter Lieverdink (PL) wrote:

PL> What command do you execute right before this oops occurs? Is this an
PL> Athlon system?

The system was unattended and rather idle at that time. It's not an Athlon,
but a Pentium-III (Coppermine).

-Udo.



2003-11-03 03:56:44

by Udo A. Steinberg

Subject: Re: [OOPS] Linux-2.6.0-test9

On Sun, 02 Nov 2003 19:01:51 -0800 Linus Torvalds (LT) wrote:

LT> The good news (?) is that you seem to have preempt enabled, and there is one
LT> known (but fairly hard-to-hit) race in UP+preempt locking due to bad
LT> barrier ordering in test9 (and all previous kernels too, for that matter).
LT> That should be fixed in the current BK snapshots, but you can also avoid
LT> the problem by just not enabling preempt.

Yes, the kernel is UP + preempt. I'll try the current BK snapshot and will
let you know should the problem occur again.

-Udo.



2003-11-03 04:49:50

by Udo A. Steinberg

Subject: Re: [OOPS] Linux-2.6.0-test9

On Sun, 2 Nov 2003 20:45:55 -0800 (PST) Linus Torvalds (LT) wrote:

LT> Btw, what compiler version do you have? The UP+preempt bug is a real bug,
LT> but as far as I've been able to find out it's almost impossible to get gcc
LT> to actually generate code that might trigger it. So while it's entirely
LT> possible that the bug you're seeing is due to the (now fixed) UP+preempt
LT> bug, it's also quite possible that there's something else going on too.

I'm using gcc 3.3.2.

Reading specs from /usr/lib/gcc-lib/i486-slackware-linux/3.3.2/specs
Configured with: ../gcc-3.3.2/configure --prefix=/usr --enable-shared
--enable-threads=posix --enable-__cxa_atexit --disable-checking --with-gnu-ld
--verbose --target=i486-slackware-linux --host=i486-slackware-linux
Thread model: posix
gcc version 3.3.2

-Udo.



2003-11-03 04:46:04

by Linus Torvalds

Subject: Re: [OOPS] Linux-2.6.0-test9


On Mon, 3 Nov 2003, Udo A. Steinberg wrote:
>
> Yes, the kernel is UP + preempt. I'll try the current BK snapshot and will
> let you know should the problem occur again.

Btw, what compiler version do you have? The UP+preempt bug is a real bug,
but as far as I've been able to find out it's almost impossible to get gcc
to actually generate code that might trigger it. So while it's entirely
possible that the bug you're seeing is due to the (now fixed) UP+preempt
bug, it's also quite possible that there's something else going on too.

Linus