2004-03-22 14:49:37

by Matthias Andree

Subject: 2.4.25 SMP - BUG at page_alloc.c:105

Hi,

I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:

kernel BUG at page_alloc.c:105!
invalid operand: 0000
CPU: 0
EIP: 0010:[__free_pages_ok+80/704] Not tainted
EFLAGS: 00010286
eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
ds: 0018 es: 0018 ss: 0018
Process svscan (pid: 1348, stackpage=f6901000)
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] [mmput+88/176] [do_exit+259/800]
[sig_exit+195/208] [dequeue_signal+95/192] [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] [sys_nanosleep+232/448]
[do_page_fault+0/1347] [signal_return+20/24]

Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.

What's causing this?

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95


2004-03-24 19:57:48

by Marcelo Tosatti

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105


The backtrace is odd to me.

set_page_dirty() does not call __free_pages_ok() directly or indirectly.

How can it be?
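
For anyone decoding the trace: the bracketed entries appear to be in klogd's resolved form, [symbol+offset/size], with both numbers in decimal, so __free_pages_ok+80/704 would be byte 80 of a 704-byte function. A minimal C sketch of pulling those fields apart follows. The struct and function names are made up for illustration and belong to no kernel or klogd API, and the decimal reading is an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for one resolved trace entry. */
struct trace_entry {
    char name[64];
    unsigned int offset;  /* bytes from the start of the symbol */
    unsigned int size;    /* total size of the symbol in bytes */
};

/* Parse "[symbol+offset/size]"; returns 1 on success, 0 on malformed input.
 * ASSUMPTION: offset and size are decimal, as klogd seems to print them. */
int parse_trace_entry(const char *text, struct trace_entry *out)
{
    assert(text != NULL && out != NULL);
    /* %63[^+] copies everything up to the '+' into name (bounded). */
    return sscanf(text, "[%63[^+]+%u/%u", out->name,
                  &out->offset, &out->size) == 3;
}
```

Passing "[__free_pages_ok+80/704]" fills name with __free_pages_ok, offset with 80 and size with 704; an entry without the +offset/size part is rejected.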

---

Hi,

I found this in the logs of a Dual Athlon MP machine (Tyan board)
running 2.4.25-SMP:

kernel BUG at page_alloc.c:105!
invalid operand: 0000
CPU: 0
EIP: 0010:[__free_pages_ok+80/704] Not tainted
EFLAGS: 00010286
eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
ds: 0018 es: 0018 ss: 0018
Process svscan (pid: 1348, stackpage=f6901000)
Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
[mmput+88/176] [do_exit+259/800] [sig_exit+195/208] [dequeue_signal+95/192] \
[do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
[sys_nanosleep+232/448] [do_page_fault+0/1347] [signal_return+20/24]

Other than this BUG (that took down the machine hard, I was lucky to log
across the network), there appear to be no relevant logs shortly before
this crash.


2004-03-24 20:26:12

by Andrew Morton

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105

Marcelo Tosatti wrote:
>
>
> The backtrace is odd to me.
>
> set_page_dirty() does not call __free_pages_ok() directly or indirectly.
>

I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
anonymous page which had a non-null ->mapping. It could be a hardware bug.
Without seeing the actual value of page->mapping it's hard to know.

It would be good to backport the bad_page() debug code so we get a bit more
info when this sort of thing happens.
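
A sketch of how stale words can end up in a trace: the 2.4 i386 oops code, as far as I know, does not unwind frames but prints every stack word that falls inside the kernel text range. The C below applies that test to the words in the "Stack:" dump. The bounds are assumptions: 0xc0100000 is the usual i386 link address, and the size is the "1902k kernel code" figure this kernel prints at boot; the real limits are the _stext and _etext symbols. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMPTION: kernel text at 0xc0100000, about 1902 KiB long. */
#define ASSUMED_TEXT_START 0xc0100000u
#define ASSUMED_TEXT_END   (ASSUMED_TEXT_START + 1902u * 1024u)

/* True if the word could be a return address into kernel text. */
int looks_like_kernel_text(uint32_t word)
{
    return word >= ASSUMED_TEXT_START && word < ASSUMED_TEXT_END;
}

/* Print each stack word inside the assumed text range; return the count.
 * Nothing here checks that the word is a live return address, which is
 * why leftovers from earlier calls can appear in a trace. */
size_t scan_stack(const uint32_t *stack, size_t nwords)
{
    size_t hits = 0;

    assert(stack != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++) {
        if (looks_like_kernel_text(stack[i])) {
            printf("[<%08x>] ", (unsigned int)stack[i]);
            hits++;
        }
    }
    printf("\n");
    return hits;
}

/* The 24 words shown on the "Stack:" lines of the oops. */
const uint32_t oops_stack[] = {
    0xc033364c, 0xf741cbc0, 0xf22f7a84, 0x00000001,
    0x0804c000, 0xc0133ea6, 0xf22f79c0, 0x00000004,
    0x00000001, 0x00000001, 0x0804c000, 0x00000001,
    0xc01308fa, 0xc1b2d720, 0xf68e3080, 0x0804b000,
    0x00001000, 0x0844b000, 0xc03ac4e0, 0x00000001,
    0x0804c000, 0xf68e3084, 0xf42baa40, 0xf7212440,
};
```

Under those assumed bounds, scan_stack(oops_stack, 24) picks out c0133ea6 and c01308fa only; the other c0xxxxxx words lie above the assumed end of text and would be data.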



> ---
>
> Hi,
>
> I found this in the logs of a Dual Athlon MP machine (Tyan board)
> running 2.4.25-SMP:
>
> kernel BUG at page_alloc.c:105!
> invalid operand: 0000
> CPU: 0
> EIP: 0010:[__free_pages_ok+80/704] Not tainted
> EFLAGS: 00010286
> eax: c0333674 ebx: c1b2d720 ecx: 00000000 edx: f22f7a84
> esi: 00000001 edi: 00000000 ebp: 00000001 esp: f6901e3c
> ds: 0018 es: 0018 ss: 0018
> Process svscan (pid: 1348, stackpage=f6901000)
> Stack: c033364c f741cbc0 f22f7a84 00000001 0804c000 c0133ea6 f22f79c0 00000004
> 00000001 00000001 0804c000 00000001 c01308fa c1b2d720 f68e3080 0804b000
> 00001000 0844b000 c03ac4e0 00000001 0804c000 f68e3084 f42baa40 f7212440
> Call Trace: [set_page_dirty+166/176] [zap_page_range+330/400] [exit_mmap+221/352] \
> [mmput+88/176] [do_exit+259/800] [sig_exit+195/208] [dequeue_signal+95/192] \
> [do_signal+448/694] [schedule_timeout+94/176] [process_timeout+0/96] \
> [sys_nanosleep+232/448] [do_page_fault+0/1347] [signal_return+20/24]
>
> Other than this BUG (that took down the machine hard, I was lucky to log
> across the network), there appear to be no relevant logs shortly before
> this crash.

2004-03-24 20:50:45

by Marcelo Tosatti

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105


On Wed, Mar 24, 2004 at 12:28:06PM -0800, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> >
> > The backtrace is odd to me.
> >
> > set_page_dirty() does not call __free_pages_ok() directly or indirectly.
> >
>
> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping. It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.
>
> It would be good to backport the bad_page() debug code so we get a bit more
> info when this sort of thing happens.

This should work. Matthias, please apply and try to reproduce.

--- mm/page_alloc.c.orig 2004-03-24 18:42:53.693251224 -0300
+++ mm/page_alloc.c 2004-03-24 18:47:52.484828000 -0300
@@ -81,6 +81,20 @@
* -- wli
*/

+static void bad_page(const char *function, struct page *page)
+{
+ printk("Bad page state at %s\n", function);
+ printk("flags:0x%08lx mapping:%p buffers:%p count:%d\n",
+ page->flags, page->mapping,
+ page->buffers, page_count(page));
+ printk("Backtrace:\n");
+ dump_stack();
+ printk("bad_page: Trying to fix it up.\n");
+ set_page_count(page, 0);
+ page->mapping = NULL;
+}
+
+
static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
static void __free_pages_ok (struct page *page, unsigned int order)
{
@@ -101,8 +115,8 @@

if (page->buffers)
BUG();
- if (page->mapping)
- BUG();
+ if (page->mapping)
+ bad_page(page);
if (!VALID_PAGE(page))
BUG();
if (PageLocked(page))

2004-03-24 21:12:57

by Matthias Andree

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105

On Wed, 24 Mar 2004, Andrew Morton wrote:

> I'd suspect that's just gunk on the stack and that zap_pte_range() freed an
> anonymous page which had a non-null ->mapping. It could be a hardware bug.
> Without seeing the actual value of page->mapping it's hard to know.

Is there any chance of retrieving that now that the machine has been
rebooted? I fear there is none.

I have these log entries from boot-up (after the crash), seems the BIOS
isn't perfect (Tyan S2460 "Tiger MP" w/ BIOS 1.05):

...
128MB HIGHMEM available.
896MB LOWMEM available.
ACPI: have wakeup address 0xc0002000
found SMP MP-table at 000f7510
hm, page 000f7000 reserved twice.
hm, page 000f8000 reserved twice.
hm, page 0009f000 reserved twice.
hm, page 000a0000 reserved twice.
On node 0 totalpages: 262144
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 32768 pages.
ACPI: Unable to locate RSDP
Intel MultiProcessor Specification v1.4
Virtual Wire compatibility mode.
OEM ID: TYAN Product ID: GUINNESS APIC at: 0xFEE00000
Processor #1 Pentium(tm) Pro APIC version 16
Processor #0 Pentium(tm) Pro APIC version 16
I/O APIC #2 Version 17 at 0xFEC00000.
Enabling APIC mode: Flat. Using 1 I/O APICs
Processors: 2
Kernel command line: root=/dev/hda5 vga=791 splash=silent showopts noapic
Initializing CPU#0
Detected 1533.378 MHz processor.
Console: colour dummy device 80x25
Calibrating delay loop... 3060.53 BogoMIPS
Memory: 1032772k/1048576k available (1902k kernel code, 15416k reserved, 636k data, 152k init, 131072k highmem)
...
Intel machine check reporting enabled on CPU#0.
CPU: After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU: Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU0: AMD Athlon(tm) MP 1800+ stepping 02
Intel machine check reporting enabled on CPU#1.
CPU: After generic, caps: 0383fbff c1cbfbff 00000000 00000000
CPU: Common caps: 0383fbff c1cbfbff 00000000 00000000
CPU1: AMD Athlon(tm) Processor stepping 02
Total of 2 processors activated (6121.06 BogoMIPS).
Using local APIC timer interrupts.
calibrating APIC timer ...
..... CPU clock speed is 1533.3658 MHz.
..... host bus clock speed is 266.6723 MHz.
cpu: 0, clocks: 2666723, slice: 888907
CPU0<T0:2666720,T1:1777808,D:5,S:888907,C:2666723>
cpu: 1, clocks: 2666723, slice: 888907
CPU1<T0:2666720,T1:888896,D:10,S:888907,C:2666723>
checking TSC synchronization across CPUs: passed.
Waiting on wait_init_idle (map = 0x2)
All processors have done init_idle
mtrr: your CPUs had inconsistent fixed MTRR settings
mtrr: probably your BIOS does not setup all CPUs
ACPI: Subsystem revision 20040116
ACPI: Interpreter disabled.
PCI: PCI BIOS revision 2.10 entry at 0xfd7e0, last bus=1
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI: ACPI tables contain no PCI IRQ routing entries
PCI: Probing PCI hardware (bus 00)
BIOS failed to enable PCI standards compliance, fixing this error.
I/O APIC: AMD Errata #22 may be present. In the event of instability try
: booting with the "noapic" option.
...


Don't waste countless efforts debugging this

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

2004-03-24 21:37:11

by Matthias Andree

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105

On Wed, 24 Mar 2004, Marcelo Tosatti wrote:

> This should work. Matthias, please apply and try to reproduce.

Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
page); instead. If the first argument must be something else, let me
know. It doesn't immediately make sense with just one caller, but I know
nothing better right now.

As I don't know a specific scenario to reproduce the crash, it may take
longer (possibly weeks) until I can come up with results.

Here's the error:

gcc -D__KERNEL__ -I/usr/src/linux-2.4.25/include -Wall -Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpreferred-stack-boundary=2 -march=athlon -nostdinc -iwithprefix include -DKBUILD_BASENAME=page_alloc -DEXPORT_SYMTAB -c page_alloc.c
page_alloc.c: In function `__free_pages_ok':
page_alloc.c:119: warning: passing arg 1 of `bad_page' from incompatible pointer type
page_alloc.c:119: error: too few arguments to function `bad_page'
make[2]: *** [page_alloc.o] Error 1
make[2]: Leaving directory `/usr/src/linux-2.4.25/mm'

The relevant parts of the patch were:

> --- mm/page_alloc.c.orig 2004-03-24 18:42:53.693251224 -0300
> +++ mm/page_alloc.c 2004-03-24 18:47:52.484828000 -0300
> @@ -81,6 +81,20 @@
> * -- wli
> */
>
> +static void bad_page(const char *function, struct page *page)
> +{
> + printk("Bad page state at %s\n", function);
...
> @@ -101,8 +115,8 @@
>
> if (page->buffers)
> BUG();
> - if (page->mapping)
> - BUG();
> + if (page->mapping)
> + bad_page(page);
> if (!VALID_PAGE(page))
> BUG();
> if (PageLocked(page))
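
For reference, the two-argument call described above can be tried outside the kernel. This is only a userland sketch: struct page here is a hypothetical stand-in with just the fields the patch prints, free_page_check() is an invented name for the check in __free_pages_ok(), and __func__ is used as the standard C99 spelling of GCC's __FUNCTION__.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page {
    unsigned long flags;
    void *mapping;
    void *buffers;
    int count;
};

/* Same shape as the patch's bad_page(): report, then try to repair. */
void bad_page(const char *function, struct page *page)
{
    assert(function != NULL && page != NULL);
    printf("Bad page state at %s\n", function);
    printf("flags:0x%08lx mapping:%p buffers:%p count:%d\n",
           page->flags, page->mapping, page->buffers, page->count);
    printf("bad_page: Trying to fix it up.\n");
    page->count = 0;
    page->mapping = NULL;
}

/* Stand-in for the mapping check; returns 1 if the page was bad. */
int free_page_check(struct page *page)
{
    if (page->mapping) {
        /* Two arguments: the caller's name first, then the page. */
        bad_page(__func__, page);
        return 1;
    }
    return 0;
}
```

With the function name passed first, the prototype and the call agree, which is what the "too few arguments" error was complaining about.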

--
Matthias Andree

Encrypt your mail: my GnuPG key ID is 0x052E7D95

2004-03-24 23:22:20

by Marcelo Tosatti

Subject: Re: 2.4.25 SMP - BUG at page_alloc.c:105

On Wed, Mar 24, 2004 at 10:36:48PM +0100, Matthias Andree wrote:
> On Wed, 24 Mar 2004, Marcelo Tosatti wrote:
>
> > This should work. Matthias, please apply and try to reproduce.
>
> Didn't compile. I have changed that line 119 to bad_page(__FUNCTION__,
> page); instead. If the first argument must be something else, let me
> know. It doesn't immediately make sense with just one caller, but I know
> nothing better right now.

Right. My mistake.

> As I don't know a specific scenario to reproduce the crash, it may take
> longer (possibly weeks) until I can come up with results.

Let's wait and see.

Did you try older 2.4 kernels or 2.6?