2006-10-04 10:40:35

by Andre Noll

[permalink] [raw]
Subject: 2.6.18: Kernel BUG at mm/rmap.c:522

Hi

MATLAB triggers the following bug on both of our new 16-way opteron
machines (64G Ram): The same kernel is running with no problems on a
bunch of smaller (8-way, 4-way, max 32G Ram) cluster nodes.

Any hints?
Andre


----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [1] SMP
CPU 14
Pid: 12948, comm: MATLAB Not tainted 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810207a19d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff81101aa1dd90 RCX: 000000000000001c
RDX: 00002aaad63b2000 RSI: 00002aaad63b2000 RDI: ffff81102d129ce8
RBP: 0000000f59e3b067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff8106079df408 R11: ffff8105970284a8 R12: ffff81102d129ce8
R13: ffff810e3005e160 R14: 00002aaad63b2000 R15: 0000000000000000
FS: 00002b3f6aa704a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaad8592000 CR3: 00000005bdd17000 CR4: 00000000000006a0
Process MATLAB (pid: 12948, threadinfo ffff810207a18000, task ffff8101ea581080)
Stack: ffffffff80157a37 ffff810e30000680 00002aaad63b3000 ffff81101aa1dd98
ffff81102fb53668 ffffffffffffffb8 ffff8106079df400 ffff810207a19e98
00002aaad6400000 ffff810feecdc088 00002aaad6400000 ffff810b24d11588
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810207a19d70>
<0>Bad page state in process 'MATLAB'
page:ffff81102d129ce8 flags:0x1e00000000000014 mapping:0000000000000000 mapcount:-1 count:0
Trying to fix it up, but a reboot is needed
Backtrace:

Call Trace:
[<ffffffff8014f325>] bad_page+0x51/0x7b
[<ffffffff8014f788>] prep_new_page+0x57/0x15f
[<ffffffff8014feb1>] buffered_rmqueue+0x128/0x14a
[<ffffffff8015002c>] get_page_from_freelist+0xbd/0xe2
[<ffffffff801500a3>] __alloc_pages+0x52/0x29f
[<ffffffff80159a26>] do_anonymous_page+0x46/0x1b8
[<ffffffff8015a023>] __handle_mm_fault+0x18f/0x29d
[<ffffffff8011b359>] do_page_fault+0x1bd/0x4e7
[<ffffffff8015bf42>] do_mmap_pgoff+0x5fd/0x6de
[<ffffffff8022094b>] __up_write+0x14/0x108
[<ffffffff8010a3f9>] error_exit+0x0/0x84

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [2] SMP
CPU 14
Pid: 12079, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810a0916fd70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e7c68d300 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aab5ea60000 RDI: ffff81102bb25c10
RBP: 0000000ef53ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff810c01b69c88 R11: ffff810bce5484a8 R12: ffff81102bb25c10
R13: ffff810e3005e160 R14: 00002aab5ea60000 R15: 0000000000000000
FS: 00002b1f8100d4a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aab31345000 CR3: 0000000c1bebd000 CR4: 00000000000006a0
Process MATLAB (pid: 12079, threadinfo ffff810a0916e000, task ffff810a092d2180)
Stack: ffffffff80157a37 ffff810e30000680 00002aab5ea61000 ffff810e7c68d308
ffff81102a0b6ee8 00000000ffffff9f ffff810c01b69c80 ffff810a0916fe98
00002aab5ec00000 ffff810fd4924298 00002aab5ec00000 ffff8100812777a8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810a0916fd70>
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [3] SMP
CPU 15
Pid: 20344, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff8101113b7d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e8d660 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aaaf6acc000 RDI: ffff81102ef01410
RBP: 0000000fe24ee067 R08: 0000000000000023 R09: ffff810e30000680
R10: ffff81010a2d5248 R11: ffff810162b6a608 R12: ffff81102ef01410
R13: ffff810e300639e0 R14: 00002aaaf6acc000 R15: 0000000000000000
FS: 00002b88026364a0(0000) GS:ffff810e301f4440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab0284c0d0 CR3: 00000001038e3000 CR4: 00000000000006a0
Process MATLAB (pid: 20344, threadinfo ffff8101113b6000, task ffff81018fbd18a0)
Stack: ffffffff80157a37 ffff810e30000680 00002aaaf6acd000 ffff810e87e8d668
ffff81102a33aee8 00000000ffffff3f ffff81010a2d5240 ffff8101113b7e98
00002aaaf6c00000 ffff810ff0aa7818 00002aaaf6c00000 ffff810eaffd3da8
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff8101113b7d70>
<3>swap_free: Unused swap offset entry 00000060
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [4] SMP
CPU 14
Pid: 5985, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810204875d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e87e852b0 RCX: ffff8101ef34d970
RDX: 00002aaae1856000 RSI: ffff8102280524d0 RDI: ffff81102d127358
RBP: 0000000f59d7d067 R08: 000000000000001e R09: 0000000000000000
R10: 0000000000000206 R11: 00000000fffffffa R12: ffff81102d127358
R13: ffff810e3005e160 R14: 00002aaae1856000 R15: 0000000000000000
FS: 00002ada130564a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaae418a000 CR3: 0000000c0bdad000 CR4: 00000000000006a0
Process MATLAB (pid: 5985, threadinfo ffff810204874000, task ffff8101110f3140)
Stack: ffffffff80157a37 ffff810e30000680 00002aaae1857000 ffff810e87e852b8
ffff81102a33ad28 ffffffffffffffaa ffff810a30128a40 ffff810204875e98
00002aaae1a00000 ffff810fffed14a8 00002aaae1a00000 ffff810be0f79860
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810204875d70>
<3>swap_free: Unused swap offset entry 00000060
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [5] SMP
CPU 14
Pid: 19074, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810021587d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e8ca8f6a0 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aaaf7cd4000 RDI: ffff81102a447b08
RBP: 0000000e8cb57067 R08: 00000000fffffffe R09: ffff810e30000680
R10: ffff810230080708 R11: ffff810389b47b88 R12: ffff81102a447b08
R13: ffff810e3005e160 R14: 00002aaaf7cd4000 R15: 0000000000000000
FS: 00002b1e7eb3b4a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab08206870 CR3: 00000002b40e4000 CR4: 00000000000006a0
Process MATLAB (pid: 19074, threadinfo ffff810021586000, task ffff81013b640040)
Stack: ffffffff80157a37 ffff810e30000680 00002aaaf7cd5000 ffff810e8ca8f6a8
ffff81102a444f58 00000000ffffff38 ffff810230080700 ffff810021587e98
00002aaaf7e00000 ffff810fecefdb88 00002aaaf7e00000 ffff810092d8fdf0
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810021587d70>
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...aid0/home/maan/scm/stable/linux-2.6.18.y/mm/rmap.c:522
invalid opcode: 0000 [6] SMP
CPU 14
Pid: 20028, comm: MATLAB Tainted: G B 2.6.18-tt64-6 #1
RIP: 0010:[<ffffffff8015ee54>] [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP: 0018:ffff810dad0e7d70 EFLAGS: 00010286
RAX: 00000000ffffffff RBX: ffff810e7ea8d0e0 RCX: 000000000000001c
RDX: ffff810e30000000 RSI: 00002aaaad21c000 RDI: ffff81102cf5e918
RBP: 0000000f51b05067 R08: 00002aaaad400000 R09: ffff810e30000680
R10: ffff810428d0f888 R11: ffff810fcb122ad8 R12: ffff81102cf5e918
R13: ffff810e3005e160 R14: 00002aaaad21c000 R15: 0000000000000000
FS: 00002adbe435e4a0(0000) GS:ffff810e301b9440(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aaab12b8000 CR3: 0000000394dab000 CR4: 00000000000006a0
Process MATLAB (pid: 20028, threadinfo ffff810dad0e6000, task ffff810dbc128140)
Stack: ffffffff80157a37 ffff810e301b9340 00002aaaad21d000 ffff810e7ea8d0e8
ffff81102a134ee8 00000000fffffff2 ffff810428d0f880 ffff810dad0e7e98
00002aaaad400000 ffff810fe56a0d98 00002aaaad400000 ffff810175c0eb48
Call Trace:
[<ffffffff80157a37>] zap_pte_range+0x1c4/0x2c0
[<ffffffff80157d0e>] unmap_page_range+0x1db/0x23a
[<ffffffff80157e5b>] unmap_vmas+0xee/0x1e3
[<ffffffff8015c6fe>] unmap_region+0xb4/0x127
[<ffffffff8015caa7>] do_munmap+0x183/0x19a
[<ffffffff8015caf7>] sys_munmap+0x39/0x52
[<ffffffff80109726>] system_call+0x7e/0x83


Code: 0f 0b 68 c0 4e 46 80 c2 0a 02 31 f6 f6 47 18 01 40 0f 94 c6
RIP [<ffffffff8015ee54>] page_remove_rmap+0x13/0x2d
RSP <ffff810dad0e7d70>



--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (10.78 kB)
signature.asc (189.00 B)
Digital signature
Download all attachments

2006-10-04 13:59:30

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h 2006-10-04 23:51:55.000000000 +1000
+++ linux-2.6/include/linux/rmap.h 2006-10-04 23:52:28.000000000 +1000
@@ -72,7 +72,7 @@ void __anon_vma_link(struct vm_area_stru
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *);
+void page_remove_rmap(struct page *, struct vm_area_struct *);

/**
* page_dup_rmap - duplicate pte mapping to a page
Index: linux-2.6/mm/filemap_xip.c
===================================================================
--- linux-2.6.orig/mm/filemap_xip.c 2006-10-04 23:56:59.000000000 +1000
+++ linux-2.6/mm/filemap_xip.c 2006-10-04 23:57:08.000000000 +1000
@@ -189,7 +189,7 @@ __xip_unmap (struct address_space * mapp
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2006-10-04 23:57:00.000000000 +1000
+++ linux-2.6/mm/fremap.c 2006-10-04 23:57:13.000000000 +1000
@@ -33,7 +33,7 @@ static int zap_pte(struct mm_struct *mm,
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
}
} else {
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2006-10-04 23:57:00.000000000 +1000
+++ linux-2.6/mm/memory.c 2006-10-04 23:57:26.000000000 +1000
@@ -677,7 +677,7 @@ static unsigned long zap_pte_range(struc
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
tlb_remove_page(tlb, page);
continue;
}
@@ -1540,7 +1540,7 @@ gotten:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
- page_remove_rmap(old_page);
+ page_remove_rmap(old_page, vma);
if (!PageAnon(old_page)) {
dec_mm_counter(mm, file_rss);
inc_mm_counter(mm, anon_rss);
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2006-10-04 23:49:15.000000000 +1000
+++ linux-2.6/mm/rmap.c 2006-10-04 23:58:00.000000000 +1000
@@ -508,7 +508,7 @@ void page_add_file_rmap(struct page *pag
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page)
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
#ifdef CONFIG_DEBUG_VM
@@ -517,6 +517,7 @@ void page_remove_rmap(struct page *page)
printk (KERN_EMERG " page->flags = %lx\n", page->flags);
printk (KERN_EMERG " page->count = %x\n", page_count(page));
printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
+ print_symbol (KERN_EMERG " vma->vm_ops = %s\n", vma->vm_ops);
}
#endif
BUG_ON(page_mapcount(page) < 0);
@@ -621,7 +622,7 @@ static int try_to_unmap_one(struct page
dec_mm_counter(mm, file_rss);


- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);

out_unmap:
@@ -711,7 +712,7 @@ static void try_to_unmap_cluster(unsigne
if (pte_dirty(pteval))
set_page_dirty(page);

- page_remove_rmap(page);
+ page_remove_rmap(page, vma);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;


Attachments:
mm-rmap-debug-more.patch (3.77 kB)

2006-10-04 15:48:05

by Andre Noll

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On 23:59, Nick Piggin wrote:

> Ah, this old thing. I hope it is repeatable?

Well, it happened on both of the new machines we got last week. One
of these is still up BTW and I'm able to ssh into it.

> What we really want is the bit before this, the "Eeek! page_mapcount went
> negative" part.

There's no such message in the log. The preceeding lines are just normal
startup messages:

Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT

> It is also nice if we can work out where the page actually came from. The
> following attached patch should help out a bit with that, if you could
> run with it?

Okay. I'll reboot with your patch and let you know if it crashes again.

Thanks for the quick response.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (958.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2006-10-04 15:57:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On Wed, 2006-10-04 at 17:42 +0200, Andre Noll wrote:
> On 23:59, Nick Piggin wrote:
>
> > Ah, this old thing. I hope it is repeatable?
>
> Well, it happened on both of the new machines we got last week. One
> of these is still up BTW and I'm able to ssh into it.
>
> > What we really want is the bit before this, the "Eeek! page_mapcount went
> > negative" part.
>
> There's no such message in the log. The preceeding lines are just normal
> startup messages:
>
> Adding 16779852k swap on /dev/sda1. Priority:42 extents:1 across:16779852k
> Adding 16779852k swap on /dev/sdb1. Priority:42 extents:1 across:16779852k
> process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
>
> > It is also nice if we can work out where the page actually came from. The
> > following attached patch should help out a bit with that, if you could
> > run with it?
>
> Okay. I'll reboot with your patch and let you know if it crashes again.

enable CONFIG_DEBUG_VM to get that.

2006-10-04 16:13:10

by Andre Noll

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On 17:49, Peter Zijlstra wrote:

> enable CONFIG_DEBUG_VM to get that.

Yup, that was disabled. It's enabled now.

Thanks
Andre

--
The only person who always got his work done by Friday was Robinson Crusoe


Attachments:
(No filename) (208.00 B)
signature.asc (189.00 B)
Digital signature
Download all attachments

2006-10-04 20:40:06

by Dave Jones

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On Wed, Oct 04, 2006 at 05:49:00PM +0200, Peter Zijlstra wrote:

> > > It is also nice if we can work out where the page actually came from. The
> > > following attached patch should help out a bit with that, if you could
> > > run with it?
> > Okay. I'll reboot with your patch and let you know if it crashes again.
> enable CONFIG_DEBUG_VM to get that.

Given this warnings still pops up from time to time, I question whether
putting that check under DEBUG_VM was such a good idea. It's not as
if it's a major performance impact. This has potential for us to lose
valuable debugging info for a few nanoseconds performance increase in
an already costly path.

This patch brings it back unconditionally, and moves the BUG()
into the if arm.

Signed-off-by: Dave Jones <[email protected]>

--- local-git/mm/rmap.c~ 2006-10-04 16:38:06.000000000 -0400
+++ local-git/mm/rmap.c 2006-10-04 16:38:24.000000000 -0400
@@ -576,15 +576,14 @@ void page_add_file_rmap(struct page *pag
void page_remove_rmap(struct page *page)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
-#ifdef CONFIG_DEBUG_VM
if (unlikely(page_mapcount(page) < 0)) {
printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
printk (KERN_EMERG " page->flags = %lx\n", page->flags);
printk (KERN_EMERG " page->count = %x\n", page_count(page));
printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
+ BUG();
}
-#endif
- BUG_ON(page_mapcount(page) < 0);
+
/*
* It would be tidy to reset the PageAnon mapping here,
* but that might overwrite a racing page_add_anon_rmap
--
http://www.codemonkey.org.uk

2006-10-05 06:28:39

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

Dave Jones wrote:
> On Wed, Oct 04, 2006 at 05:49:00PM +0200, Peter Zijlstra wrote:
>
> > > > It is also nice if we can work out where the page actually came from. The
> > > > following attached patch should help out a bit with that, if you could
> > > > run with it?
> > > Okay. I'll reboot with your patch and let you know if it crashes again.
> > enable CONFIG_DEBUG_VM to get that.
>
> Given this warnings still pops up from time to time, I question whether
> putting that check under DEBUG_VM was such a good idea. It's not as
> if it's a major performance impact. This has potential for us to lose
> valuable debugging info for a few nanoseconds performance increase in
> an already costly path.

Frustratingly, it usually doesn't tell us much (without my previous
patch), because it is normally some driver that has stuffed up their
refcounting and the page-> fields don't tell us where the page has
come from.

But..

> This patch brings it back unconditionally, and moves the BUG()
> into the if arm.

... this shouldn't hurt if gcc moves the unlikely code out of the
linear instruction stream, which I think it usually does. It shouldn't
cost _anything_ because we're already doing the branch for the BUG_ON
which you remove.

> Signed-off-by: Dave Jones <[email protected]>

Thanks,
Acked-by: Nick Piggin <[email protected]>

>
> --- local-git/mm/rmap.c~ 2006-10-04 16:38:06.000000000 -0400
> +++ local-git/mm/rmap.c 2006-10-04 16:38:24.000000000 -0400
> @@ -576,15 +576,14 @@ void page_add_file_rmap(struct page *pag
> void page_remove_rmap(struct page *page)
> {
> if (atomic_add_negative(-1, &page->_mapcount)) {
> -#ifdef CONFIG_DEBUG_VM
> if (unlikely(page_mapcount(page) < 0)) {
> printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
> printk (KERN_EMERG " page->flags = %lx\n", page->flags);
> printk (KERN_EMERG " page->count = %x\n", page_count(page));
> printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
> + BUG();
> }
> -#endif
> - BUG_ON(page_mapcount(page) < 0);
> +
> /*
> * It would be tidy to reset the PageAnon mapping here,
> * but that might overwrite a racing page_add_anon_rmap


--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-11 16:07:45

by Michael Harris

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522


Hi, I can readily reproduce this with 2.6.18 doing 4 simultanous kernel compiles on two disks to load test a P4 3.2 HT with 2GB. I have SMP and SMT scheduling enabled, and the 4GB memory option. Here is output with CONFIG_DEBUG_VM enabled followed by another crash before CONFIG_DEBUG_VM was enabled.

[*ROOT* hen /usr/src/linux-2.6.18 14 ] uname -a
Linux hen.igconcepts.com 2.6.18 #2 SMP Tue Oct 10 11:46:01 CDT 2006 i686 i686 i386 GNU/Linux
[*ROOT* hen /usr/src/linux-2.6.18 15 ] gcc -v
Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.2/specs
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --host=i386-redhat-linux
Thread model: posix
gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)


----------------
Oct 11 04:53:35 hen kernel: VM: killing process cc1
Oct 11 04:53:35 hen kernel: swap_free: Unused swap offset entry 00004000
Oct 11 04:53:35 hen kernel: Eeek! page_mapcount(page) went negative! (-1)
Oct 11 04:53:35 hen kernel: page->flags = c0080014
Oct 11 04:53:35 hen kernel: page->count = 0
Oct 11 04:53:35 hen kernel: page->mapping = 00000000
Oct 11 04:53:35 hen kernel: ------------[ cut here ]------------
Oct 11 04:53:35 hen kernel: kernel BUG at mm/rmap.c:522!
Oct 11 04:53:35 hen kernel: invalid opcode: 0000 [#1]
Oct 11 04:53:35 hen kernel: SMP DEBUG_PAGEALLOC
Oct 11 04:53:35 hen kernel: CPU: 1
Oct 11 04:53:35 hen kernel: EIP: 0060:[<c01494c2>] Not tainted VLI
Oct 11 04:53:35 hen kernel: EFLAGS: 00210286 (2.6.18 #2)
Oct 11 04:53:35 hen kernel: EIP is at page_remove_rmap+0x44/0xb9
Oct 11 04:53:35 hen kernel: eax: ffffffff ebx: c1b5cd80 ecx: ffffffff edx: 00200046
Oct 11 04:53:35 hen kernel: esi: 403fc000 edi: c1b5cd80 ebp: c20108a0 esp: d85e1ea8
Oct 11 04:53:35 hen kernel: ds: 007b es: 007b ss: 0068
Oct 11 04:53:35 hen kernel: Process cc1 (pid: 23624, ti=d85e0000 task=e817c030 task.ti=d85e0000)
Oct 11 04:53:35 hen kernel: Stack: c0390116 00000000 d80adff0 c0143249 5ae6c067 c13015ac ffffff14 ffffffff
Oct 11 04:53:35 hen kernel: eceade40 f3c50b00 40400000 d8929400 40400000 40310000 c014347e 40310000
Oct 11 04:53:35 hen kernel: 40400000 d85e1f3c 00000000 4046ffff d8929400 d8929400 f3c50b00 c20108a0
Oct 11 04:53:35 hen kernel: Call Trace:
Oct 11 04:53:35 hen kernel: [<c0143249>] zap_pte_range+0x15b/0x273
Oct 11 04:53:35 hen kernel: [<c014347e>] unmap_page_range+0x11d/0x13f
Oct 11 04:53:35 hen kernel: [<c0143569>] unmap_vmas+0xc9/0x1bf
Oct 11 04:53:35 hen kernel: [<c014797c>] exit_mmap+0x79/0xf3
Oct 11 04:53:35 hen kernel: [<c0117d41>] mmput+0x33/0x91
Oct 11 04:53:35 hen kernel: [<c011c9e8>] do_exit+0x110/0x3c8
Oct 11 04:53:35 hen kernel: [<c0111a65>] do_page_fault+0x13b/0x54c
Oct 11 04:53:35 hen kernel: [<c0147626>] do_munmap+0xea/0x113
Oct 11 04:53:35 hen kernel: [<c011cd47>] do_group_exit+0x87/0xa7
Oct 11 04:53:35 hen kernel: [<c0102ca7>] syscall_call+0x7/0xb
Oct 11 04:53:35 hen kernel: Code: 94 00 00 00 8b 43 08 83 c0 01 78 29 8b 43 08 83 c0 01 78 17 31 d2 89 d8 f6 43 10 01 0f 94 c2 8b 5c 24 08 83 c4 0c e9 92 7b ff ff <0f> 0b 0a 02 d4 00 39 c0 eb df 8b 43 08 c7 04 24 e0 0c 39 c0 83
Oct 11 04:53:35 hen kernel: EIP: [<c01494c2>] page_remove_rmap+0x44/0xb9 SS:ESP 0068:d85e1ea8
Oct 11 04:53:35 hen kernel: <1>Fixing recursive fault but reboot is needed!
Oct 11 04:54:31 hen kernel: Bad page state in process 'tripwire'
Oct 11 04:54:31 hen kernel: page:c1b5cd80 flags:0xc0000014 mapping:00000000 mapcount:-1 count:0
Oct 11 04:54:31 hen kernel: Trying to fix it up, but a reboot is needed
Oct 11 04:54:31 hen kernel: Backtrace:
Oct 11 04:54:31 hen kernel: [<c013afe4>] bad_page+0x60/0x8f
Oct 11 04:54:31 hen kernel: [<c013b5bd>] prep_new_page+0x197/0x1a4
Oct 11 04:54:31 hen kernel: [<c013bb7a>] buffered_rmqueue+0xda/0x152
Oct 11 04:54:31 hen kernel: [<c013bcfc>] get_page_from_freelist+0x6d/0x98
Oct 11 04:54:31 hen kernel: [<c013bd71>] __alloc_pages+0x4a/0x2b0
Oct 11 04:54:31 hen kernel: [<c03664bb>] __wait_on_bit_lock+0x41/0x5e
Oct 11 04:54:31 hen kernel: [<c0136e2c>] sync_page+0x0/0x4c
Oct 11 04:54:31 hen kernel: [<c013db15>] __do_page_cache_readahead+0x110/0x14f
Oct 11 04:54:31 hen kernel: [<c0138861>] filemap_nopage+0x312/0x368
Oct 11 04:54:31 hen kernel: [<c0144cb4>] do_no_page+0x6f/0x295
Oct 11 04:54:31 hen kernel: [<c0145049>] __handle_mm_fault+0xec/0x1f9
Oct 11 04:54:31 hen kernel: [<c0111a65>] do_page_fault+0x13b/0x54c
Oct 11 04:54:32 hen kernel: [<c011192a>] do_page_fault+0x0/0x54c
Oct 11 04:54:32 hen kernel: [<c010374d>] error_code+0x39/0x40

----------------
Another crash from a day earlier before enabling DEBUG_VM

Oct 10 05:19:43 hen kernel: VM: killing process cc1
Oct 10 05:19:43 hen kernel: swap_free: Unused swap offset entry 00002000
Oct 10 05:19:56 hen kernel: swap_free: Unused swap offset entry 00000400
Oct 10 05:20:13 hen kernel: ------------[ cut here ]------------
Oct 10 05:20:13 hen kernel: kernel BUG at mm/rmap.c:522!
Oct 10 05:20:13 hen kernel: invalid opcode: 0000 [#1]
Oct 10 05:20:13 hen kernel: SMP
Oct 10 05:20:13 hen kernel: CPU: 0
Oct 10 05:20:13 hen kernel: EIP: 0060:[<c0148ffc>] Not tainted VLI
Oct 10 05:20:13 hen kernel: EFLAGS: 00210286 (2.6.18 #1)
Oct 10 05:20:13 hen kernel: EIP is at page_remove_rmap+0x2a/0x35
Oct 10 05:20:13 hen kernel: eax: ffffffff ebx: d80df18c ecx: c1ffc800 edx: 0007fe40
Oct 10 05:20:13 hen kernel: esi: 08063000 edi: c1ffc800 ebp: c200be20 esp: d8e35dd8
Oct 10 05:20:13 hen kernel: ds: 007b es: 007b ss: 0068
Oct 10 05:20:13 hen kernel: Process sh (pid: 23162, ti=d8e34000 task=db85a030 task.ti=d8e34000)
Oct 10 05:20:13 hen kernel: Stack: c0142f1d 7fe40025 c1301bec 00000000 ffffffe4 f264d200 f7a975c0 080dc000
Oct 10 05:20:13 hen kernel: d8b33080 080dc000 08048000 c0143152 08048000 080dc000 d8e35e60 00000000
Oct 10 05:20:13 hen kernel: 080dbfff d8b33080 d8b33080 f7a975c0 c200be20 d8e35e84 f7a975c0 08048000
Oct 10 05:20:13 hen kernel: Call Trace:
Oct 10 05:20:13 hen kernel: [<c0142f1d>] zap_pte_range+0x15b/0x273
Oct 10 05:20:13 hen kernel: [<c0143152>] unmap_page_range+0x11d/0x13f
Oct 10 05:20:13 hen kernel: [<c014323d>] unmap_vmas+0xc9/0x1bf
Oct 10 05:20:13 hen kernel: [<c0147605>] exit_mmap+0x79/0xf3
Oct 10 05:20:13 hen kernel: [<c0117c29>] mmput+0x33/0x91
Oct 10 05:20:13 hen kernel: [<c011c8c8>] do_exit+0x110/0x3c8
Oct 10 05:20:13 hen kernel: [<c011cc27>] do_group_exit+0x87/0xa7
Oct 10 05:20:13 hen kernel: [<c0125e6f>] get_signal_to_deliver+0x200/0x2e5
Oct 10 05:20:13 hen kernel: [<c0102a2f>] do_signal+0x6a/0x180
Oct 10 05:20:13 hen kernel: [<c0124490>] signal_wake_up+0x28/0x36
Oct 10 05:20:13 hen kernel: [<c012498b>] specific_send_sig_info+0xaa/0xc3
Oct 10 05:20:13 hen kernel: [<c0124a15>] force_sig_info+0x71/0x8b
Oct 10 05:20:13 hen kernel: [<c010471b>] do_general_protection+0xa7/0x188
Oct 10 05:20:13 hen kernel: [<c0104674>] do_general_protection+0x0/0x188
Oct 10 05:20:13 hen kernel: [<c0102b8a>] do_notify_resume+0x45/0x47
Oct 10 05:20:13 hen kernel: [<c0102d32>] work_notifysig+0x13/0x19
Oct 10 05:20:13 hen kernel: Code: ff 89 c1 b8 ff ff ff ff f0 01 41 08 0f 98 c0 84 c0 74 22 8b 41 08 83 c0 01 78 10 31 d2 89 c8 f6 41 10 01 0f 94 c2 e9 34 7d ff ff <0f> 0b 0a 02 94 ff 38 c0 eb e6 c3 55 57 56 89 d6 53 89 cb 31 c9
Oct 10 05:20:13 hen kernel: EIP: [<c0148ffc>] page_remove_rmap+0x2a/0x35 SS:ESP 0068:d8e35dd8
Oct 10 05:20:13 hen kernel: <1>Fixing recursive fault but reboot is needed!



2006-10-12 07:00:36

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On Wed, 11 Oct 2006 11:07:40 -0500
Michael Harris <[email protected]> wrote:


> Hi, I can readily reproduce this with 2.6.18 doing 4 simultanous kernel
> compiles on two disks to load test a P4 3.2 HT with 2GB. I have SMP and
> SMT scheduling enabled, and the 4GB memory option. Here is output with
> CONFIG_DEBUG_VM enabled followed by another crash before CONFIG_DEBUG_VM
> was enabled.

Repeatable. That's good news.

>
> [*ROOT* hen /usr/src/linux-2.6.18 14 ] uname -a
> Linux hen.igconcepts.com 2.6.18 #2 SMP Tue Oct 10 11:46:01 CDT 2006 i686 i686 i386 GNU/Linux
> [*ROOT* hen /usr/src/linux-2.6.18 15 ] gcc -v
> Reading specs from /usr/lib/gcc-lib/i386-redhat-linux/3.2.2/specs
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --disable-checking --with-system-zlib --enable-__cxa_atexit --host=i386-redhat-linux
> Thread model: posix
> gcc version 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
>
>
> ----------------
> Oct 11 04:53:35 hen kernel: VM: killing process cc1
> Oct 11 04:53:35 hen kernel: swap_free: Unused swap offset entry 00004000
> Oct 11 04:53:35 hen kernel: Eeek! page_mapcount(page) went negative! (-1)
> Oct 11 04:53:35 hen kernel: page->flags = c0080014
> Oct 11 04:53:35 hen kernel: page->count = 0
> Oct 11 04:53:35 hen kernel: page->mapping = 00000000
> Oct 11 04:53:35 hen kernel: ------------[ cut here ]------------
> Oct 11 04:53:35 hen kernel: kernel BUG at mm/rmap.c:522!

Does that machine run any earlier kernels OK? If so, which?

Thanks.

2006-10-12 07:28:46

by Dave Jones

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

On Thu, Oct 12, 2006 at 12:00:26AM -0700, Andrew Morton wrote:

> > ----------------
> > Oct 11 04:53:35 hen kernel: VM: killing process cc1
> > Oct 11 04:53:35 hen kernel: swap_free: Unused swap offset entry 00004000
> > Oct 11 04:53:35 hen kernel: Eeek! page_mapcount(page) went negative! (-1)
> > Oct 11 04:53:35 hen kernel: page->flags = c0080014
> > Oct 11 04:53:35 hen kernel: page->count = 0
> > Oct 11 04:53:35 hen kernel: page->mapping = 00000000
> > Oct 11 04:53:35 hen kernel: ------------[ cut here ]------------
> > Oct 11 04:53:35 hen kernel: kernel BUG at mm/rmap.c:522!
>
> Does that machine run any earlier kernels OK? If so, which?

FWIW, I've seen a bunch of reports of this being triggered in Fedora bugzilla
which have had the nvidia module loaded. Is that the case here ?

Dave

--
http://www.codemonkey.org.uk

2006-10-12 08:16:23

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

Michael Harris wrote:
> Hi, I can readily reproduce this with 2.6.18 doing 4 simultanous kernel compiles on two disks to load test a P4 3.2 HT with 2GB. I have SMP and SMT scheduling enabled, and the 4GB memory option. Here is output with CONFIG_DEBUG_VM enabled followed by another crash before CONFIG_DEBUG_VM was enabled.

> Oct 11 04:53:35 hen kernel: swap_free: Unused swap offset entry 00004000
> Oct 11 04:53:35 hen kernel: Eeek! page_mapcount(page) went negative! (-1)
> Oct 11 04:53:35 hen kernel: page->flags = c0080014
> Oct 11 04:53:35 hen kernel: page->count = 0
> Oct 11 04:53:35 hen kernel: page->mapping = 00000000

Hmm, this is a new one. The page is free and not reserved, wheras we are
used to seeing them reserved here.

> Oct 11 04:54:31 hen kernel: Bad page state in process 'tripwire'
> Oct 11 04:54:31 hen kernel: page:c1b5cd80 flags:0xc0000014 mapping:00000000 mapcount:-1 count:0

> Another crash from a day earlier before enabling DEBUG_VM
> Oct 10 05:19:43 hen kernel: VM: killing process cc1
> Oct 10 05:19:43 hen kernel: swap_free: Unused swap offset entry 00002000
> Oct 10 05:19:56 hen kernel: swap_free: Unused swap offset entry 00000400

These unused swap offset entry messages seem to indicate extensive memory
corruption in your page tables. Probably bad RAM, or system overheating
when you load it up :(

Can you run a good memory tester like memtest86+ overnight?

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-10-12 14:19:09

by Michael Harris

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

> Michael Harris wrote:
> >Hi, I can readily reproduce this with 2.6.18 doing 4 simultanous kernel
> >compiles on two disks to load test a P4 3.2 HT with 2GB. I have SMP and
> >SMT scheduling enabled, and the 4GB memory option. Here is output with
> >CONFIG_DEBUG_VM enabled followed by another crash before CONFIG_DEBUG_VM
> >was enabled.
>
> >Oct 11 04:53:35 hen kernel: swap_free: Unused swap offset entry 00004000
> >Oct 11 04:53:35 hen kernel: Eeek! page_mapcount(page) went negative! (-1)
> >Oct 11 04:53:35 hen kernel: page->flags = c0080014
> >Oct 11 04:53:35 hen kernel: page->count = 0
> >Oct 11 04:53:35 hen kernel: page->mapping = 00000000
>
> Hmm, this is a new one. The page is free and not reserved, wheras we are
> used to seeing them reserved here.
>
> >Oct 11 04:54:31 hen kernel: Bad page state in process 'tripwire'
> >Oct 11 04:54:31 hen kernel: page:c1b5cd80 flags:0xc0000014
> >mapping:00000000 mapcount:-1 count:0
>
> >Another crash from a day earlier before enabling DEBUG_VM
> >Oct 10 05:19:43 hen kernel: VM: killing process cc1
> >Oct 10 05:19:43 hen kernel: swap_free: Unused swap offset entry 00002000
> >Oct 10 05:19:56 hen kernel: swap_free: Unused swap offset entry 00000400
>
> These unused swap offset entry messages seem to indicate extensive memory
> corruption in your page tables. Probably bad RAM, or system overheating
> when you load it up :(
>
> Can you run a good memory tester like memtest86+ overnight?


Hi, I think this is the case, a stick of ram gone bad. It had worked
testing under 2.4 but coincidentally failed about the time I upgraded
to 2.6. memtest86 uncovered it at once. Sorry for the trouble and thanks
for the help.
Mike

2006-10-12 17:42:20

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.6.18: Kernel BUG at mm/rmap.c:522

Michael Harris wrote:
>>Michael Harris wrote:

>>These unused swap offset entry messages seem to indicate extensive memory
>>corruption in your page tables. Probably bad RAM, or system overheating
>>when you load it up :(
>>
>>Can you run a good memory tester like memtest86+ overnight?
>
>
>
> Hi, I think this is the case, a stick of ram gone bad. It had worked
> testing under 2.4 but coincidentally failed about the time I upgraded
> to 2.6. memtest86 uncovered it at once. Sorry for the trouble and thanks
> for the help.

No problem, thanks for reporting.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com