2003-01-16 16:19:59

by Phil Oester

Subject: 2.4.21-pre3-ac4 oops in free_pages_ok

Had a qmail server crash this morning with the oops below. Also had 2 squid servers running the same kernel die with no indication of why in syslog (couldn't see the console). Think I'll stick with 2.4.20 for now...

Phil Oester


Jan 16 08:34:34 mail34 kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000004
Jan 16 08:34:34 mail34 kernel: c0131566
Jan 16 08:34:34 mail34 kernel: *pde = 00000000
Jan 16 08:34:34 mail34 kernel: Oops: 0002
Jan 16 08:34:34 mail34 kernel: CPU: 0
Jan 16 08:34:34 mail34 kernel: EIP: 0010:[<c0131566>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
Jan 16 08:34:34 mail34 kernel: EFLAGS: 00010246
Jan 16 08:34:34 mail34 kernel: eax: 00000000 ebx: c16eaf50 ecx: c9a00000 edx: c9a0005c
Jan 16 08:34:34 mail34 kernel: esi: 00000000 edi: 00000000 ebp: 00000000 esp: c9a01d84
Jan 16 08:34:34 mail34 kernel: ds: 0018 es: 0018 ss: 0018
Jan 16 08:34:34 mail34 kernel: Process smtp_message (pid: 12110, stackpage=c9a01000)
Jan 16 08:34:34 mail34 kernel: Stack: d63b3840 40014000 00000001 d56f2b00 c0126a82 d63b3840 d56f2b00 c1c0fdb8
Jan 16 08:34:34 mail34 kernel: e4e51000 c1000020 cd552cc0 c012e78a c1c0fdb8 00000025 00000000 c1c22660
Jan 16 08:34:34 mail34 kernel: c012fb39 c1c0fdb8 cd552cc0 00000000 c1c0fdc0 c1c0fdc8 c5523b50 c9a00000
Jan 16 08:34:34 mail34 kernel: Call Trace: [<c0126a82>] [<c012e78a>] [<c012fb39>] [<c0130cc9>] [<c0130d6c>]
Warning (Oops_read): Code line not seen, dumping what data is available

>>EIP; c0131566 <__free_pages_ok+286/2a0> <=====
Trace; c0126a82 <handle_mm_fault+62/d0>
Trace; c012e78a <kmem_slab_destroy+aa/d0>
Trace; c012fb39 <kmem_cache_reap+2b9/330>
Trace; c0130cc9 <shrink_caches+19/80>
Trace; c0130d6c <try_to_free_pages_zone+3c/60>
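
A note for readers decoding the log above: on i386 the number on the "Oops:" line is the page-fault error code, a small bit field. The Python sketch below is not part of the original post, and the function name decode_oops_code is made up for illustration. It decodes the three low bits as described in the 2.4 i386 fault handler: bit 0 distinguishes a missing page from a protection fault, bit 1 a write from a read, and bit 2 user mode from kernel mode.

```python
def decode_oops_code(code_hex):
    """Decode the i386 page-fault error code printed on the 'Oops:' line.

    Only the three low bits are interpreted; the name and the returned
    dictionary layout are illustrative choices, not a kernel interface.
    """
    code = int(code_hex, 16)
    return {
        # bit 0: 0 = no page found, 1 = protection fault
        "cause": "protection fault" if code & 1 else "page not present",
        # bit 1: 0 = read, 1 = write
        "access": "write" if code & 2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 4 else "kernel",
    }


# The value reported in this oops.
print(decode_oops_code("0002"))
```

For 0002 this gives a kernel-mode write to a page that is not present, which is consistent with the reported fault address 00000004 and eax being zero, that is, a store through a NULL-based pointer.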


2003-01-16 19:04:24

by Tupshin Harper

Subject: Re: 2.4.21-pre3-ac4 oops in free_pages_ok

There have been multiple other threads about this problem recently: one
started by me, as well as a few others.
The consensus is that it's a problem in the ac tree, and is not present
in 2.4.21-pre3.
Some people seem to avoid the problem by disabling highmem, but this
doesn't work for me. Quota has been mentioned as a possible culprit, but
disabling that also doesn't help me. The ac changes to mm/shmem.c are
still a possibility, though one reporter seems to have tried that
without any success. The remaining candidate that has been mentioned to
me is the buffer cache changes in the ac tree; this seems moderately
likely. I don't see any obvious way to break out those changes from
Alan's large ac4 patch, so I emailed him hoping to get a patch free of
those changes, but I haven't heard back yet (it's been 9 hours
already...how dare he ignore me ;-).

-Tupshin



2003-01-17 07:38:18

by Ralf Hildebrandt

Subject: Re: 2.4.21-pre3-ac4 oops in free_pages_ok

* Tupshin Harper <[email protected]>:
> There are multiple other threads about this problem recently. One
> started by me, as well as a few others.
> The consensus is that it's a problem in the ac tree, and is not present
> in 2.4.21-pre3.

Yep.

> Some people seem to avoid the problem by disabling highmem, but this
> doesn't work for me. Quota has been mentioned as a possible culprit, but
> disabling that also doesn't help me.

Correct. I use neither highmem nor quotas, and it still crashes.

> The ac changes to mm/shmem.c are
> still a possibility, though one reporter seems to have tried that
> without any success.

Yup, I tried that. No go.

> The remaining candidate that has been mentioned to
> me is the buffer cache changes in the ac tree, this seems moderately
> likely. I don't see any obvious way to break out those changes from
> Alan's large ac4 patch, so I emailed him hoping to get a patch free of
> those changes, but I haven't heard back yet(it's been 9 hours
> already...how dare he ignore me ;-).

--
Ralf Hildebrandt (Im Auftrag des Referat V a) [email protected]
Charite Campus Mitte Tel. +49 (0)30-450 570-155
Referat V a - Kommunikationsnetze - Fax. +49 (0)30-450 570-916
"The report of my death was an exaggeration."
-Mark Twain, After reading his own obituary, June 2, 1897

2003-01-17 23:56:45

by Bryan Andersen

Subject: Re: 2.4.21-pre3-ac4 oops in free_pages_ok

I too have been seeing this oops crash problem. I can
consistently reproduce mine by running:

$ mke2fs -c -j -i 16768 /dev/hdc6

The interesting thing is running:

$ mke2fs -j -i 16768 /dev/hdc6

does not cause an oops crash. The only difference is
the bad block scan.

This is the ksymoops output for the stack trace part
of the oops. Kernel version is
linux-2.4.21-pre3-ac4.

Adhoc c013f4b7 <try_to_free_buffers+c7/140>
Adhoc c013d8c9 <try_to_release_page+49/50>
Adhoc c01348ac <__free_pages+1c/20>
Adhoc c0133863 <shrink_cache+383/3b0>
Adhoc c01339f6 <shrink_caches+56/80>
Adhoc c0133a5c <try_to_free_pages_zone+3c/60>
Adhoc c013452e <balance_classzone+5e/1d0>
Adhoc c01347b2 <__alloc_pages+112/160>
Adhoc c012ebc1 <generic_file_write+3f1/710>
Adhoc c01344c6 <_alloc_pages+16/20>
Adhoc c012ebdd <generic_file_write+40d/710>
Adhoc c013b316 <sys_write+96/110>
Adhoc c0106f8b <system_call+33/38>

Adhoc c0134239 <__free_pages_ok+279/2a0>
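
The ksymoops lines in this thread have the form address <symbol+offset/size>, with offset and size in hex. As a minimal sketch (added for illustration, not something anyone in the thread ran; the names LINE_RE and parse_symbol_line are hypothetical), the Python below pulls those fields out of a line and derives the function's start address as address minus offset, which makes it easy to compare fault sites between reports.

```python
import re

# Matches "c0134239 <__free_pages_ok+279/2a0>" anywhere in a line, so the
# "Trace;", ">>EIP;" and "Adhoc" prefixes seen in this thread are all accepted.
LINE_RE = re.compile(
    r"(?P<addr>[0-9a-f]{8})\s+<(?P<sym>[^+>]+)\+(?P<off>[0-9a-f]+)/(?P<size>[0-9a-f]+)>"
)


def parse_symbol_line(line):
    """Return symbol, address, offset, size and start address, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    off = int(m.group("off"), 16)
    return {
        "symbol": m.group("sym"),
        "address": addr,
        "offset": off,                    # bytes into the function
        "size": int(m.group("size"), 16),  # total function length in bytes
        "start": addr - off,              # address of the function's first byte
    }


# One line from each of the two reports in this thread.
for line in (
    ">>EIP; c0131566 <__free_pages_ok+286/2a0> <=====",
    "Adhoc c0134239 <__free_pages_ok+279/2a0>",
):
    info = parse_symbol_line(line)
    print(info["symbol"], hex(info["start"]), hex(info["offset"]), hex(info["size"]))
```

Both reports land near the end of __free_pages_ok (offsets 0x286 and 0x279 of a 0x2a0-byte function). The derived start addresses differ (c01312e0 versus c0133fc0), which suggests the two kernels were configured or built differently even though the function has the same size in both.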

I'm going to do further tests with the generic IDE
driver instead of the NVIDIA one. Then I plan on
teasing out the NVIDIA2-specific changes from the ac4
patch and applying only those to a pre3 patched kernel.

- Bryan

2003-01-19 10:52:58

by Bryan Andersen

Subject: Re: 2.4.21-pre3-ac4 oops in free_pages_ok

As I and others suspected, it looks like the -ac4
patch for 2.4.21-pre3 has some error in it that causes the
oops. I've tested a 2.4.21-pre3 with only the nvidia2
related patches from -ac4 added to it, and it appears to be
stable. It stayed up through a few kernel compiles and 53
mke2fs runs with bad block scan. I'm now running memtest86
overnight to rule out a memory error. So far
it has passed the first pass of the tests; we'll see what
happens overnight.

What are some good system abuse test suites? In the past
I've used kernel compiles in an endless loop: do one
run, then compare the output of each later run to the first
run. Any difference constitutes a failure.
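
The repeat-and-compare approach described above can be sketched as follows. This is only an illustration under stated assumptions, not Bryan's actual script: the helper names run_digest and stress are made up, the workload is a small placeholder command rather than a kernel compile, and "output" is taken to mean the command's exit status plus its standard output. For a real build you would more likely hash the resulting object files, and the output would need to be free of timestamps for the comparison to be meaningful.

```python
import hashlib
import subprocess
import sys


def run_digest(cmd):
    """Run cmd and return a SHA-256 digest of its exit status and stdout."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    h = hashlib.sha256()
    h.update(str(result.returncode).encode())
    h.update(result.stdout)
    return h.hexdigest()


def stress(cmd, runs):
    """Run cmd `runs` times and compare every later run to the first.

    Returns the indices (1-based after the baseline run 0) of runs whose
    digest differs from the baseline; an empty list means no failures.
    """
    baseline = run_digest(cmd)
    return [i for i in range(1, runs) if run_digest(cmd) != baseline]


# Placeholder workload (an assumption): a deterministic computation standing
# in for a kernel compile. Replace with the real command list.
workload = [sys.executable, "-c", "print(sum(range(100000)))"]
print("mismatching runs:", stress(workload, runs=5))
```

A bounded run count is used here instead of an endless loop so the sketch terminates; wrapping the call in a loop gives the original behaviour.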
