LinuxLists.cc - mm/rmap.c negative page map count BUG.

2006-01-03 08:26:12

Subject: mm/rmap.c negative page map count BUG.

This has cropped up from time to time in the last few Fedora
kernels, reported by several users. I just got another report that it's
still a problem on 2.6.15-rc7 based kernels (so likely .15 final too).

kernel: kernel BUG at mm/rmap.c:486!
kernel: invalid operand: 0000 [#1]
kernel: Modules linked in: parport_pc lp parport nfs lockd nfs_acl autofs4 sunrpc dm_mod ipv6 uhci_hcd shpchp i2c_piix4 i2c_core snd_es18xx snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tlan floppy ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
kernel: CPU: 0
kernel: EIP: 0060:[<c01502b2>] Not tainted VLI
kernel: EFLAGS: 00010286 (2.6.14-1.1769_FC4)
kernel: EIP is at page_remove_rmap+0x25/0x2f
kernel: eax: ffffffff ebx: c8331e30 ecx: c1152360 edx: c1152360
kernel: esi: 08f8c000 edi: c1152360 ebp: 00000000 esp: c2114d78
kernel: ds: 007b es: 007b ss: 0068
kernel: Process udevd (pid: 11892, threadinfo=c2114000 task=c54af030)
kernel: Stack: c0149a90 c3d70e34 c0419b20 c349d440 ffffffff ffffff3f c349d490 c2e6b08c
kernel: 09000000 c2114dfc c2e6b08c c0149ccb 08ecb000 09000000 c2114dfc 00000000
kernel: c3d70e34 c0419b20 c2e6b08c 0913dfff 08ecb000 c3d70e34 0913e000 c2114e24
kernel: Call Trace:
kernel: [<c0149a90>] zap_pte_range+0x105/0x25a [<c0149ccb>] unmap_page_range+0xe6/0x110
kernel: [<c0149dc7>] unmap_vmas+0xd2/0x1f1 [<c014e5f2>] exit_mmap+0x5f/0xda
kernel: [<c0119669>] mmput+0x1f/0x95 [<c0162f1f>] exec_mmap+0xc7/0x149
kernel: [<c0163084>] flush_old_exec+0x7b/0x8b7 [<c01595ff>] vfs_read+0xf6/0x158
kernel: [<c0162e4e>] kernel_read+0x37/0x41 [<c0182e30>] load_elf_binary+0x2b9/0xd8e
kernel: [<c01408b0>] __alloc_pages+0x57/0x2ed [<c01df430>] copy_from_user+0x42/0x82
kernel: [<c0182b77>] load_elf_binary+0x0/0xd8e [<c0163b32>] search_binary_handler+0x7a/0x243
kernel: [<c0163ee3>] do_execve+0x1e8/0x210 [<c0101b3f>] sys_execve+0x30/0x72
kernel: [<c0102ec5>] syscall_call+0x7/0xb
kernel: Code: 2e 0d 33 c0 eb bf 89 c2 83 40 08 ff 0f 98 c0 84 c0 75 01 c3 8b 42 08 83 c0 01 78 0f ba ff ff ff ff b8 10 00 00 00 e9 32 0b ff ff <0f> 0b e6 01 2e 0d 33 c0 eb e7 55 57 56 53 83 ec 0c 89 c7 89 d3

The BUG it's hitting is the BUG_ON(page_mapcount(page) < 0); in page_remove_rmap()

anyone with any ideas wtf happened here ?

Shortly after hitting this, users usually report things like ...

kernel: Bad page state at free_hot_cold_page (in process 'kswapd0', page c1152360)
kernel: flags:0x80000010 mapping:00000000 mapcount:-1 count:0

None of the reports seen so far had binary modules loaded, and there are no obvious
signs of hardware failure (some of the users have run memtest86 with no problems found).

Dave

2006-01-03 11:42:17

by Nick Piggin

Subject: Re: mm/rmap.c negative page map count BUG.

Dave Jones wrote:
> This has cropped up from time to time in the last few Fedora
> kernels, reported by several users. I just got another report that it's
> still a problem on 2.6.15-rc7 based kernels (so likely .15 final too).
>
> kernel: kernel BUG at mm/rmap.c:486!
> kernel: invalid operand: 0000 [#1]
> kernel: Modules linked in: parport_pc lp parport nfs lockd nfs_acl autofs4 sunrpc dm_mod ipv6 uhci_hcd shpchp i2c_piix4 i2c_core snd_es18xx snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tlan floppy ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
> kernel: CPU: 0
> kernel: EIP: 0060:[<c01502b2>] Not tainted VLI
> kernel: EFLAGS: 00010286 (2.6.14-1.1769_FC4)
> kernel: EIP is at page_remove_rmap+0x25/0x2f
> kernel: eax: ffffffff ebx: c8331e30 ecx: c1152360 edx: c1152360
> kernel: esi: 08f8c000 edi: c1152360 ebp: 00000000 esp: c2114d78
> kernel: ds: 007b es: 007b ss: 0068
> kernel: Process udevd (pid: 11892, threadinfo=c2114000 task=c54af030)
> kernel: Stack: c0149a90 c3d70e34 c0419b20 c349d440 ffffffff ffffff3f c349d490 c2e6b08c
> kernel: 09000000 c2114dfc c2e6b08c c0149ccb 08ecb000 09000000 c2114dfc 00000000
> kernel: c3d70e34 c0419b20 c2e6b08c 0913dfff 08ecb000 c3d70e34 0913e000 c2114e24
> kernel: Call Trace:
> kernel: [<c0149a90>] zap_pte_range+0x105/0x25a [<c0149ccb>] unmap_page_range+0xe6/0x110
> kernel: [<c0149dc7>] unmap_vmas+0xd2/0x1f1 [<c014e5f2>] exit_mmap+0x5f/0xda
> kernel: [<c0119669>] mmput+0x1f/0x95 [<c0162f1f>] exec_mmap+0xc7/0x149
> kernel: [<c0163084>] flush_old_exec+0x7b/0x8b7 [<c01595ff>] vfs_read+0xf6/0x158
> kernel: [<c0162e4e>] kernel_read+0x37/0x41 [<c0182e30>] load_elf_binary+0x2b9/0xd8e
> kernel: [<c01408b0>] __alloc_pages+0x57/0x2ed [<c01df430>] copy_from_user+0x42/0x82
> kernel: [<c0182b77>] load_elf_binary+0x0/0xd8e [<c0163b32>] search_binary_handler+0x7a/0x243
> kernel: [<c0163ee3>] do_execve+0x1e8/0x210 [<c0101b3f>] sys_execve+0x30/0x72
> kernel: [<c0102ec5>] syscall_call+0x7/0xb
> kernel: Code: 2e 0d 33 c0 eb bf 89 c2 83 40 08 ff 0f 98 c0 84 c0 75 01 c3 8b 42 08 83 c0 01 78 0f ba ff ff ff ff b8 10 00 00 00 e9 32 0b ff ff <0f> 0b e6 01 2e 0d 33 c0 eb e7 55 57 56 53 83 ec 0c 89 c7 89 d3
>
> The BUG it's hitting is the BUG_ON(page_mapcount(page) < 0); in page_remove_rmap()
>
> anyone with any ideas wtf happened here ?
>
> Shortly after hitting this, users usually report things like ...
>
> kernel: Bad page state at free_hot_cold_page (in process 'kswapd0', page c1152360)
> kernel: flags:0x80000010 mapping:00000000 mapcount:-1 count:0
>

Well it isn't PG_reserved, so it is unlikely to be something like ZERO_PAGE.
That kswapd eventually frees it indicates it is a regular pagecache page on
the LRU... so it is unusual that nobody has reported it here.

Can you reproduce it? On a kernel.org kernel? Can you print ->flags, ->count,
->mapping, etc instead of going BUG?

Thanks,
Nick

--
SUSE Labs, Novell Inc.
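
The check behind "it isn't PG_reserved" can be reproduced by hand on the reported flags value. What follows is a minimal userspace sketch, not kernel code. The bit numbers (PG_dirty = 4, PG_reserved = 10) are quoted from memory of include/linux/page-flags.h in kernels of this era and are an assumption to verify against the tree being debugged; the helper and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* ASSUMPTION: bit numbers as recalled from include/linux/page-flags.h
 * around 2.6.14; check them against the tree you are debugging. */
#define SKETCH_PG_DIRTY     4
#define SKETCH_PG_RESERVED 10

/* Hypothetical helper: is bit `bit` set in a page->flags value? */
static bool sketch_flag_set(unsigned long flags, unsigned int bit)
{
	assert(bit < 32);	/* the reports in this thread are from 32-bit i386 */
	return (flags >> bit) & 1UL;
}
```

Under those assumed bit numbers, sketch_flag_set(0x80000010UL, SKETCH_PG_RESERVED) is false for the value in the first report, which matches the observation above; the only low bit set is bit 4.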

2006-01-03 13:53:21

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

On Tue, Jan 03, 2006 at 10:42:07PM +1100, Nick Piggin wrote:

> Well it isn't PG_reserved, so it is unlikely to be something like ZERO_PAGE.
> That kswapd eventually frees it indicates it is a regular pagecache page on
> the LRU... so it is unusual that nobody has reported it here.
>
> Can you reproduce it?

I can't :(

> On a kernel.org kernel?

Only some of our users hit it, which makes it tricky to reproduce.

> Can you print ->flags, ->count, ->mapping, etc instead of going BUG?

I can add some instrumentation like this though, and see what turns up.

thanks,

Dave

2006-01-04 23:51:45

by Andrew Morton

Subject: Re: mm/rmap.c negative page map count BUG.

Dave Jones wrote:
>
> > Can you print ->flags, ->count, ->mapping, etc instead of going BUG?
>
> I can add some instrumentation like this though, and see what turns up.

Can we get that instrumentation into the upstream kernel please? We do
seem to be hitting rmap assertions too often for it to be dud
hardware/bodgy drivers/etc.

2006-01-04 23:56:49

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

2006-01-05 00:14:59

by Andrew Morton

Subject: Re: mm/rmap.c negative page map count BUG.

Dave Jones wrote:
>
> + printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page->_mapcount);

page_mapcount(page);

> + printk (KERN_EMERG " page->flags = %x\n", page->flags);

%lx

> + printk (KERN_EMERG " page->count = %x\n", page->_count);

page_count(page);

2006-01-05 00:31:13

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

On Wed, Jan 04, 2006 at 04:16:40PM -0800, Andrew Morton wrote:
> Dave Jones wrote:
> >
> > + printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page->_mapcount);
>
> page_mapcount(page);
>
> > + printk (KERN_EMERG " page->flags = %x\n", page->flags);
>
> %lx
>
> > + printk (KERN_EMERG " page->count = %x\n", page->_count);
>
> page_count(page);

Ugh, almost an error per line. I suck.

Dave

--- linux-2.6.14/mm/rmap.c~ 2006-01-03 08:53:32.000000000 -0500
+++ linux-2.6.14/mm/rmap.c 2006-01-03 08:58:19.000000000 -0500
@@ -484,6 +484,13 @@ void page_remove_rmap(struct page *page)
 	BUG_ON(PageReserved(page));
 
 	if (atomic_add_negative(-1, &page->_mapcount)) {
+		if (page_mapcount(page) < 0) {
+			printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
+			printk (KERN_EMERG "  page->flags = %lx\n", page->flags);
+			printk (KERN_EMERG "  page->count = %x\n", page_count(page));
+			printk (KERN_EMERG "  page->mapping = %p\n", page->mapping);
+		}
+
 		BUG_ON(page_mapcount(page) < 0);
 		/*
 		 * It would be tidy to reset the PageAnon mapping here,
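
For readers unfamiliar with the counter convention this patch relies on: page->_mapcount starts at -1 for an unmapped page, and page_mapcount() reports that raw value plus one. The following is a minimal userspace model of the path being instrumented, under stated assumptions: it uses a plain int rather than atomic_t, and the struct and function names are hypothetical stand-ins, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct page; only the field discussed here. */
struct sketch_page {
	int raw_mapcount;	/* models page->_mapcount; -1 means "not mapped" */
};

/* Models page_mapcount(): the raw counter plus one. */
static int sketch_page_mapcount(const struct sketch_page *p)
{
	assert(p != NULL);
	return p->raw_mapcount + 1;
}

/* Models atomic_add_negative(-1, &page->_mapcount) without atomicity:
 * decrement, then report whether the result is negative. */
static bool sketch_dec_is_negative(struct sketch_page *p)
{
	assert(p != NULL);
	p->raw_mapcount -= 1;
	return p->raw_mapcount < 0;
}

/* Returns true when the BUG_ON() condition in the hunk above would fire. */
static bool sketch_remove_rmap_would_bug(struct sketch_page *p)
{
	if (sketch_dec_is_negative(p))
		return sketch_page_mapcount(p) < 0;
	return false;
}
```

In this model a page mapped once (raw value 0) drops to raw -1 on the first removal, which is the normal "last mapping gone" case. A second removal takes the raw value to -2, so the reported mapcount is -1 and the check fires. That appears consistent with the eax: ffffffff in the oopses, though the model says nothing about how the extra decrement or corruption happened.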

2006-01-05 07:47:43

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

On Wed, Jan 04, 2006 at 03:53:26PM -0800, Andrew Morton wrote:
> Dave Jones wrote:
> >
> > > Can you print ->flags, ->count, ->mapping, etc instead of going BUG?
> >
> > I can add some instrumentation like this though, and see what turns up.
>
> Can we get that instrumentation into the upstream kernel please? We do
> seem to be hitting rmap assertions too often for it to be dud
> hardware/bodgy drivers/etc.

I had a quick skim through bugme.osdl.org & Red Hat bugzilla.

Seems to be a few variants of this problem reported.
Quite a few Fedora users have hit it over the last year,
but what I find fascinating is that there's not a single
occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.

Dave

2005-08-07
http://bugme.osdl.org/show_bug.cgi?id=3636

Oct 25 04:41:47 www kernel: kernel BUG at mm/rmap.c:474!
Oct 25 04:41:47 www kernel: invalid operand: 0000 [#4]
Oct 25 04:41:47 www kernel: PREEMPT
Oct 25 04:41:47 www kernel: Modules linked in:
Oct 25 04:41:47 www kernel: CPU: 0
Oct 25 04:41:47 www kernel: EIP: 0060:[<c0147319>] Not tainted VLI
Oct 25 04:41:47 www kernel: EFLAGS: 00010286 (2.6.9)
Oct 25 04:41:47 www kernel: EIP is at page_remove_rmap+0x29/0x40
Oct 25 04:41:47 www kernel: eax: ffffffff ebx: 000dd000 ecx: c1160bc0 edx: c1160bc0
Oct 25 04:41:47 www kernel: esi: c5e6f894 edi: c1160bc0 ebp: 00100000 esp: c9e93e90
Oct 25 04:41:47 www kernel: ds: 007b es: 007b ss: 0068
Oct 25 04:41:47 www kernel: Process show_bug.cgi (pid: 16375, threadinfo=c9e92000 task=cdac9020)
Oct 25 04:41:47 www kernel: Stack: c0140ce6 c1160bc0 c02e6790 c9dec7a0 00000000 0b05e067 08948000 c4325088
Oct 25 04:41:47 www kernel: 08648000 00000000 c0140e47 c045a008 c4325084 08548000 00100000 00000000
Oct 25 04:41:47 www kernel: c045a008 08548000 c4325088 08648000 00000000 c0140ebb c045a008 c4325084
Oct 25 04:41:47 www kernel: Call Trace:
Oct 25 04:41:47 www kernel: [<c0140ce6>] zap_pte_range+0x126/0x230
Oct 25 04:41:47 www kernel: [<c02e6790>] ip_rcv_finish+0x0/0x270
Oct 25 04:41:47 www kernel: [<c0140e47>] zap_pmd_range+0x57/0x80
Oct 25 04:41:47 www kernel: [<c0140ebb>] unmap_page_range+0x4b/0x80
Oct 25 04:41:47 www kernel: [<c0140fed>] unmap_vmas+0xfd/0x1c0
Oct 25 04:41:47 www kernel: [<c0145593>] exit_mmap+0x83/0x160
Oct 25 04:41:47 www kernel: [<c01161d4>] mmput+0x64/0xb0
Oct 25 04:41:47 www kernel: [<c011aa72>] do_exit+0x152/0x420
Oct 25 04:41:47 www kernel: [<c010654d>] do_IRQ+0xfd/0x130
Oct 25 04:41:47 www kernel: [<c011adca>] do_group_exit+0x3a/0xb0
Oct 25 04:41:47 www kernel: [<c010421b>] syscall_call+0x7/0xb

2005-03-22
http://bugme.osdl.org/show_bug.cgi?id=4388

Nov 4 13:55:03 localhost kernel: kernel BUG at mm/rmap.c:487!
Nov 4 13:55:03 localhost kernel: invalid operand: 0000 [#1]
Nov 4 13:55:03 localhost kernel: PREEMPT
Nov 4 13:55:03 localhost kernel: Modules linked in: radeon drm
Nov 4 13:55:03 localhost kernel: CPU: 0
Nov 4 13:55:03 localhost kernel: EIP: 0060:[page_remove_rmap+71/96] Not tainted VLI
Nov 4 13:55:03 localhost kernel: EFLAGS: 00010286 (2.6.14)
Nov 4 13:55:03 localhost kernel: EIP is at page_remove_rmap+0x47/0x60
Nov 4 13:55:03 localhost kernel: eax: ffffffff ebx: ccdbd244 ecx: 00000002 edx: c11cb8c0
Nov 4 13:55:03 localhost kernel: esi: c11cb8c0 edi: 41891000 ebp: ce246d88 esp: ce246d80
Nov 4 13:55:03 localhost kernel: ds: 007b es: 007b ss: 0068
Nov 4 13:55:03 localhost kernel: Process postmaster (pid: 1914, threadinfo=ce246000 task=ce179560)
Nov 4 13:55:04 localhost kernel: Stack: c014943d ccdbd244 ce246dac c014dd6c c11cb8c0 00000000 00000001 0e5c6025
Nov 4 13:55:04 localhost kernel: cebab41c 41897000 41897000 ce246dd8 c014df24 c04e94ac cebab418 4188f000
Nov 4 13:55:04 localhost kernel: 41897000 00000000 41896fff 00008000 41897000 cd7a8634 ce246e18 c014e039
Nov 4 13:55:04 localhost kernel: Call Trace:
Nov 4 13:55:04 localhost kernel: [show_stack+171/240] show_stack+0xab/0xf0
Nov 4 13:55:04 localhost kernel: [show_registers+399/560] show_registers+0x18f/0x230
Nov 4 13:55:04 localhost kernel: [die+237/400] die+0xed/0x190
Nov 4 13:55:04 localhost kernel: [do_trap+137/208] do_trap+0x89/0xd0
Nov 4 13:55:04 localhost kernel: [do_invalid_op+170/192] do_invalid_op+0xaa/0xc0
Nov 4 13:55:04 localhost kernel: [error_code+79/84] error_code+0x4f/0x54
Nov 4 13:55:04 localhost kernel: [zap_pte_range+220/512] zap_pte_range+0xdc/0x200
Nov 4 13:55:04 localhost kernel: [unmap_page_range+148/208] unmap_page_range+0x94/0xd0
Nov 4 13:55:04 localhost kernel: [unmap_vmas+217/544] unmap_vmas+0xd9/0x220
Nov 4 13:55:04 localhost kernel: [exit_mmap+130/352] exit_mmap+0x82/0x160
Nov 4 13:55:04 localhost kernel: [mmput+53/176] mmput+0x35/0xb0
Nov 4 13:55:04 localhost kernel: [exit_mm+170/352] exit_mm+0xaa/0x160
Nov 4 13:55:04 localhost kernel: [do_exit+206/1184] do_exit+0xce/0x4a0
Nov 4 13:55:04 localhost kernel: [do_group_exit+59/208] do_group_exit+0x3b/0xd0
Nov 4 13:55:04 localhost kernel: [get_signal_to_deliver+515/848] get_signal_to_deliver+0x203/0x350
Nov 4 13:55:04 localhost kernel: [do_signal+87/288] do_signal+0x57/0x120
Nov 4 13:55:04 localhost kernel: [do_notify_resume+42/60] do_notify_resume+0x2a/0x3c
Nov 4 13:55:04 localhost kernel: [work_notifysig+19/25] work_notifysig+0x13/0x19

2005-08-23
http://bugme.osdl.org/show_bug.cgi?id=4873

Jul 11 17:55:09 us401 kernel: kernel BUG at mm/rmap.c:493!
Jul 11 17:55:09 us401 kernel: invalid operand: 0000 [#1]
Jul 11 17:55:09 us401 kernel: SMP
Jul 11 17:55:09 us401 kernel: Modules linked in: netconsole iptable_nat ipv6 ipt_TOS iptable_mangle ip_conntrack_ftp ip_conntrack_irc ipt_LOG ipt_limit ipt_multiport autofs ipt_REJECT ipt_state ip_conntrack iptable_filter ip_tables sg scsi_mod parport_pc parport microcode loop video thermal processor fan button battery ac raid1
Jul 11 17:55:09 us401 kernel: CPU: 2
Jul 11 17:55:09 us401 kernel: EIP: 0060:[<c0151e99>] Not tainted VLI
Jul 11 17:55:09 us401 kernel: EFLAGS: 00010286 (2.6.12.1)
Jul 11 17:55:09 us401 kernel: EIP is at page_remove_rmap+0x39/0x50
Jul 11 17:55:09 us401 kernel: eax: ffffffff ebx: 00013508 ecx: 00000038 edx: c126a100
Jul 11 17:55:09 us401 kernel: esi: ef60d720 edi: c126a100 ebp: 08ae4000 esp: ee869e84
Jul 11 17:55:09 us401 kernel: ds: 007b es: 007b ss: 0068
Jul 11 17:55:09 us401 kernel: Process httpd (pid: 28353, threadinfo=ee868000 task=d2d0c530)
Jul 11 17:55:09 us401 kernel: Stack: c0145cd4 00013508 c014a9a7 c126a100 d2065be8 13508067 00000000 00000000
Jul 11 17:55:09 us401 kernel: f5e52228 08ad0000 08b27000 c014ac16 c201a900 f5e52228 08ad0000 08b27000
Jul 11 17:55:09 us401 kernel: 00000000 08b26fff 08b26fff 08b27000 f77ba380 00057000 08b27000 08b27000
Jul 11 17:55:09 us401 kernel: Call Trace:
Jul 11 17:55:09 us401 kernel: [<c0145cd4>] mark_page_accessed+0x34/0x40
Jul 11 17:55:09 us401 kernel: [<c014a9a7>] zap_pte_range+0x107/0x270
Jul 11 17:55:09 us401 kernel: [<c014ac16>] unmap_page_range+0x106/0x150
Jul 11 17:55:09 us401 kernel: [<c014ad56>] unmap_vmas+0xf6/0x250
Jul 11 17:55:09 us401 kernel: [<c014f6b3>] unmap_region+0xb3/0x160
Jul 11 17:55:09 us401 kernel: [<c014f9df>] do_munmap+0x10f/0x150
Jul 11 17:55:09 us401 kernel: [<c014de22>] sys_brk+0x112/0x120
Jul 11 17:55:09 us401 kernel: [<c0102daf>] sysenter_past_esp+0x54/0x75
Jul 11 17:55:09 us401 kernel: Code: f0 83 42 08 ff 0f 98 c0 84 c0 74 1b 8b 42 08 40 78 19 c7 04 24 10 00 00 00 b8 ff ff ff ff 89 44 24 04 e8 bb f3 fe ff 83 c4 08 c3

2005-11-27
http://bugme.osdl.org/show_bug.cgi?id=5666

kernel BUG at mm/rmap.c:487!
invalid operand: 0000 [#1]
Modules linked in: af_packet ipt_limit ipt_state iptable_mangle iptable_nat
ip_nat iptable_filter ipt_ULOG ip_tables ipv6 ip_conntrack_ftp ip_conntrack
via_rhine sis900 mii unix
CPU: 0
EIP: 0060:[<c014b5a7>] Tainted: G M VLI
EFLAGS: 00010286 (2.6.14)
EIP is at page_remove_rmap+0x37/0x50
eax: ffffffff ebx: d5097c20 ecx: c03e9dcc edx: c11fa560
esi: b7f08000 edi: c11fa560 ebp: 00000020 esp: cf9ddebc
ds: 007b es: 007b ss: 0068
Process apache2 (pid: 22104, threadinfo=cf9dc000 task=dd0850b0)
Stack: c11f3fe0 d5097c20 c0145298 c11fa560 b76bc000 d7daab7c b7f2d000 b7f2d000
b7f2cfff c014541a c03e9dcc d7daab7c b7f06000 b7f2d000 00000000 00027000
b7f2d000 b7f2d000 d15e7284 c0145529 c03e9dcc d15e7284 b7f06000 b7f2d000
Call Trace:
[<c0145298>] zap_pte_range+0xd8/0x1d0
[<c014541a>] unmap_page_range+0x8a/0xb0
[<c0145529>] unmap_vmas+0xe9/0x1e0
[<c0149a59>] exit_mmap+0x79/0x150
[<c01181dc>] mmput+0x2c/0x80
[<c011c3a8>] do_exit+0xd8/0x390
[<c011c6d4>] do_group_exit+0x34/0x70
[<c0103075>] syscall_call+0x7/0xb
Code: 75 33 83 42 08 ff 0f 98 c0 84 c0 74 1a 8b 42 08 40 78 18 c7 44 24 04 ff
ff ff ff c7 04 24 10 00 00 00 e8 8d 10 ff ff 83 c4 08 c3 <0f> 0b e7 01 c0 2a 33
c0 eb de 0f 0b e4 01 c0 2a 33 c0 eb c3 90

2005-12-16
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=175925

Dec 15 02:57:13 garvin kernel: kernel BUG at mm/rmap.c:487!
Dec 15 02:57:13 garvin kernel: invalid operand: 0000 [#1]
Dec 15 02:57:13 garvin kernel: Modules linked in: loop parport_pc lp parport nfs lockd nfs_acl autofs4 sunrpc dm_mod ipv6 uhci_hcd i2c_piix4 i2c_core snd_es18xx snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore tlan floppy ext3 jbd aic7xxx scsi_transport_spi sd_mod scsi_mod
Dec 15 02:57:13 garvin kernel: CPU: 0
Dec 15 02:57:13 garvin kernel: EIP: 0060:[<c014f97b>] Not tainted VLI
Dec 15 02:57:13 garvin kernel: EFLAGS: 00010286 (2.6.14-1.1637_FC4)
Dec 15 02:57:13 garvin kernel: EIP is at page_remove_rmap+0x37/0x41
Dec 15 02:57:13 garvin kernel: eax: ffffffff ebx: c85d5e30 ecx: 00000006 edx: c115c580
Dec 15 02:57:13 garvin kernel: esi: c115c580 edi: 0038c000 ebp: c03f7a7c esp: cd7ddec8
Dec 15 02:57:13 garvin kernel: ds: 007b es: 007b ss: 0068
Dec 15 02:57:13 garvin kernel: Process udev (pid: 4008, threadinfo=cd7dd000 task=c7059ab0)
Dec 15 02:57:13 garvin kernel: Stack: c0149137 00000000 00391000 c03f7a7c c0a7d000 00391000 00391000 00390fff
Dec 15 02:57:13 garvin kernel: c01492ca 00391000 00000000 c03f7a7c 00009000 00391000 c4ce3ddc 00391000
Dec 15 02:57:13 garvin kernel: c0149401 00391000 00000000 cd7dd000 cdb671c0 cd7ddf58 002d7000 00000000
Dec 15 02:57:13 garvin kernel: Call Trace:
Dec 15 02:57:13 garvin kernel: [<c0149137>] zap_pte_range+0xe5/0x1f5
Dec 15 02:57:13 garvin kernel: [<c01492ca>] unmap_page_range+0x83/0xb7
Dec 15 02:57:13 garvin kernel: [<c0149401>] unmap_vmas+0x103/0x222
Dec 15 02:57:13 garvin kernel: [<c014dc05>] exit_mmap+0x7c/0x14c
Dec 15 02:57:13 garvin kernel: [<c01189a0>] mmput+0x1f/0x95
Dec 15 02:57:13 garvin kernel: [<c011d33d>] do_exit+0xe0/0x3b8
Dec 15 02:57:13 garvin kernel: [<c011d66a>] do_group_exit+0x29/0x90
Dec 15 02:57:13 garvin kernel: [<c0102edd>] syscall_call+0x7/0xb
Dec 15 02:57:13 garvin kernel: Code: ff 0f 98 c0 84 c0 75 01 c3 8b 42 08 83 c0 01 90 78 19 ba ff ff ff ff b8 10 00 00 00 e9 43 0c ff ff 0f 0b e4 01 ad 4a 32 c0 eb d2 <0f> 0b e7 01 ad 4a 32 c0 eb dd 55 57 56 53 83 ec 04 89 c7 89 d3

2004-09-11
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121902
(mention of the BUG in comment #46 on 2.6.8, albeit nvidia tainted).

2004-06-21
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=126454

Two instances, at least one 'went away' with a hardware upgrade.
Could be a coincidence.

2004-07-15
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=127903
Wow, the oldest so far. All the way back to 2.6.6.
But again 'went away' with memory module replacements.

2004-11-28
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=141035
Several flavours. Nothing conclusive. Was mistakenly
believed to be possibly related to the amd errata at the time
and closed.

2005-06-02
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=157557
More of the same. Memory corruption after the first oops perhaps?

2005-07-09
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=159364
Another AMD user. Reports the problem 'went away' with an
update to 2.6.12.3

2006-01-05 08:12:07

by Arjan van de Ven

Subject: Re: mm/rmap.c negative page map count BUG.

> Quite a few Fedora users have hit it over the last year,
> but what I find fascinating is that there's not a single
> occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.

could mean it's caused by consumer hardware code...

2006-01-05 11:15:45

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

On Thu, Jan 05, 2006 at 09:11:51AM +0100, Arjan van de Ven wrote:
>
> > Quite a few Fedora users have hit it over the last year,
> > but what I find fascinating is that there's not a single
> > occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.
>
> could mean it's caused by consumer hardware code...

Yeah. People buying enterprise distros do tend to buy branded RAM
with goodies like ECC from big name suppliers instead of a cheap $20
noname DIMM from "Joe's computers".

So it *could* be a lot of these are crappy hardware, especially
as some of the reports do indicate that the problem went away
when they upgraded their RAM. Some of the others though, I'm
not so sure.

Dave

2006-01-05 11:19:00

by Arjan van de Ven

Subject: Re: mm/rmap.c negative page map count BUG.

On Thu, 2006-01-05 at 06:15 -0500, Dave Jones wrote:
> On Thu, Jan 05, 2006 at 09:11:51AM +0100, Arjan van de Ven wrote:
> >
> > > Quite a few Fedora users have hit it over the last year,
> > > but what I find fascinating is that there's not a single
> > > occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.
> >
> > could mean it's caused by consumer hardware code...
>
> Yeah. People buying enterprise distros do tend to buy branded RAM
> with goodies like ECC from big name suppliers instead of a cheap $20
> noname DIMM from "Joe's computers".
>
> So it *could* be a lot of these are crappy hardware, especially
> as some of the reports do indicate that the problem went away
> when they upgraded their RAM. Some of the others though, I'm
> not so sure.

it could also be some consumer-mostly device, or driver thereof. say
video capture or weird usb gizmo

2006-01-05 11:26:29

by Dave Jones

Subject: Re: mm/rmap.c negative page map count BUG.

On Thu, Jan 05, 2006 at 12:18:43PM +0100, Arjan van de Ven wrote:
> On Thu, 2006-01-05 at 06:15 -0500, Dave Jones wrote:
> > On Thu, Jan 05, 2006 at 09:11:51AM +0100, Arjan van de Ven wrote:
> > >
> > > > Quite a few Fedora users have hit it over the last year,
> > > > but what I find fascinating is that there's not a single
> > > > occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.
> > >
> > > could mean it's caused by consumer hardware code...
> >
> > Yeah. People buying enterprise distros do tend to buy branded RAM
> > with goodies like ECC from big name suppliers instead of a cheap $20
> > noname DIMM from "Joe's computers".
> >
> > So it *could* be a lot of these are crappy hardware, especially
> > as some of the reports do indicate that the problem went away
> > when they upgraded their RAM. Some of the others though, I'm
> > not so sure.
>
> it could also be some consumer-mostly device, or driver thereof. say
> video capture or weird usb gizmo

except looking at the oopses, there's no obvious pattern amongst
the modules loaded. Though they could all have a commonality as
a built-in driver, it's a long-shot.

even looking at the Fedora ones alone, which have no built-in
drivers, there's nothing that immediately jumps out like
"ooh, radeon again". I'll look through them again tomorrow,
but first, sleep.

Dave

2006-01-05 19:04:56

by Octavio Alvarez Piza

Subject: Re: mm/rmap.c negative page map count BUG.

On Thu, 05 Jan 2006 03:15:20 -0800, Dave Jones wrote:

> On Thu, Jan 05, 2006 at 09:11:51AM +0100, Arjan van de Ven wrote:
> >
> > > Quite a few Fedora users have hit it over the last year,
> > > but what I find fascinating is that there's not a single
> > > > occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.
> >
> > could mean it's caused by consumer hardware code...
>
> Yeah. People buying enterprise distros do tend to buy branded RAM
> with goodies like ECC from big name suppliers instead of a cheap $20
> noname DIMM from "Joe's computers".
>
> So it *could* be a lot of these are crappy hardware, especially
> as some of the reports do indicate that the problem went away
> when they upgraded their RAM. Some of the others though, I'm
> not so sure.

Nevertheless, there are more instances of the bug in recent versions.
For me, version 2.6.10 or 2.6.11 seems to be the big difference, from
1 bug monthly to --suddenly-- 4 weekly.

I'm experiencing that problem too. I have noticed that "bad_page_state"
sometimes triggers before the BUG is reported.

http://lkml.org/lkml/2005/12/14/449

I have already installed the instrumentation Dave provided. I'll see how
it goes.

2006-01-11 08:43:59

by Octavio Alvarez Piza

Subject: Re: mm/rmap.c negative page map count BUG.

On Thu, 05 Jan 2006 11:00:41 -0800
"Octavio Alvarez" wrote:

> On Thu, 05 Jan 2006 03:15:20 -0800, Dave Jones wrote:
>
> > On Thu, Jan 05, 2006 at 09:11:51AM +0100, Arjan van de Ven wrote:
> > >
> > > > Quite a few Fedora users have hit it over the last year,
> > > > but what I find fascinating is that there's not a single
> > > > > occurrence of "BUG at mm/rmap.c" in our 2.6.9 based RHEL4 bug reports.
> > >
> > > could mean it's caused by consumer hardware code...
> >
> > Yeah. People buying enterprise distros do tend to buy branded RAM
> > with goodies like ECC from big name suppliers instead of a cheap $20
> > noname DIMM from "Joe's computers".
> >
> > So it *could* be a lot of these are crappy hardware, especially
> > as some of the reports do indicate that the problem went away
> > when they upgraded their RAM. Some of the others though, I'm
> > not so sure.
>
> Nevertheless, there are more instances of the bug in recent versions.
> For me, version 2.6.10 or 2.6.11 seems to be the big difference, from
> 1 bug monthly to --suddenly-- 4 weekly.
>
> I'm experiencing that problem too. I have noticed that "bad_page_state"
> sometimes triggers before the BUG is reported.
>
> http://lkml.org/lkml/2005/12/14/449
>
> I have already installed the instrumentation Dave provided. I'll see how
> it goes.

I have found another instance of "bad_page_state" with mapcount:-1
before hitting BUG_ON().

sh-3.00$ cat /var/log/kernel | tail -n 19
Bad page state at free_hot_cold_page (in process 'X', page c1140c60)
flags:0x80010008 mapping:00000000 mapcount:-65536 count:0
Backtrace:
[<c012eee2>] bad_page+0x5c/0x92
[<c012f56c>] free_hot_cold_page+0x58/0xc2
[<c012fbb6>] __pagevec_free+0x17/0x1d
[<c0133e28>] __pagevec_release_nonlru+0x72/0x7f
[<c0134c04>] shrink_list+0x2ef/0x386
[<c0134e23>] shrink_cache+0xe7/0x210
[<c0135323>] shrink_zone+0xac/0xc4
[<c0135389>] shrink_caches+0x4e/0x5b
[<c0135466>] try_to_free_pages+0xd0/0x190
[<c012fa0a>] __alloc_pages+0x170/0x271
[<c01382ca>] do_anonymous_page+0x37/0x107
[<c013865c>] __handle_mm_fault+0xa6/0x15e
[<c02a4cf4>] do_page_fault+0x188/0x545
[<c02a4b6c>] do_page_fault+0x0/0x545
[<c0102c9f>] error_code+0x4f/0x54
Trying to fix it up, but a reboot is needed

sh-3.00$ uptime
23:56:29 up 1 day, 2:45, 4 users, load average: 1.22, 1.16, 1.12

sh-3.00$ uname -a
Linux octavio 2.6.15 #13 Sat Jan 7 17:37:22 PST 2006 i686 unknown unknown GNU/Linux

I ran memtest86+ for 24 hours before installing the latest kernel,
and it reported no errors.

2006-01-11 16:12:23

by Hugh Dickins

Subject: Re: mm/rmap.c negative page map count BUG.

On Wed, 11 Jan 2006, Octavio Alvarez Piza wrote:
> On Thu, 05 Jan 2006 11:00:41 -0800
> "Octavio Alvarez" wrote:
>
> I have found another instance of "bad_page_state" with mapcount:-1
> before hitting BUG_ON().
>
> Bad page state at free_hot_cold_page (in process 'X', page c1140c60)
> flags:0x80010008 mapping:00000000 mapcount:-65536 count:0

No, that's mapcount -65536 not -1.

That means page->_mapcount contained 0xfffeffff when it should have
contained 0xffffffff. A single bit got cleared. Probably bad memory,
overheating, something of that kind.

> I ran memtest86+ for 24 hours prior to installing the latest kernel boot
> with no errors reported.

Well, you've done your best to rule out that possibility, yes.

We can't rule out that something somewhere in the kernel has
scribbled on that location, but I've no guesses what.

Hugh
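
The arithmetic in the message above can be checked mechanically. This is a small userspace sketch, assuming only what the thread states: the "mapcount:" figure in the bad page report is page_mapcount(), which is the raw page->_mapcount plus one, and an unmapped page holds the raw value 0xffffffff. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The report prints raw _mapcount + 1. Undo the +1 in unsigned
 * arithmetic to recover the raw 32-bit word. */
static uint32_t sketch_raw_mapcount(int32_t reported)
{
	return (uint32_t)reported - 1u;
}

/* Bits that differ from the expected idle pattern 0xffffffff. */
static uint32_t sketch_changed_bits(uint32_t raw)
{
	return raw ^ UINT32_MAX;
}

/* True when exactly one bit is set in x. */
static bool sketch_is_single_bit(uint32_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/* Index of the lowest set bit; x must be nonzero. */
static unsigned int sketch_bit_index(uint32_t x)
{
	unsigned int n = 0;

	assert(x != 0);
	while ((x & 1u) == 0) {
		x >>= 1;
		n++;
	}
	return n;
}
```

Feeding in the reported -65536 gives a raw word of 0xfffeffff, which differs from 0xffffffff in bit 16 only. That is the "single bit got cleared" reading; the sketch cannot say whether hardware or a stray kernel write cleared it.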

2006-01-11 16:21:43

by Arjan van de Ven

Subject: Re: mm/rmap.c negative page map count BUG.

>
> That means page->_mapcount contained 0xfffeffff when it should have

> We can't rule out that something somewhere in the kernel has
> scribbled on that location, but I've no guesses what.

could be an rwsem/rwlock

btw.. which video driver is in use? (X tends to do rather evil things at
times via /dev/mem, but that is very much driver specific)

2006-01-11 17:03:53

by Octavio Alvarez Piza

Subject: Re: mm/rmap.c negative page map count BUG.

On Wed, 11 Jan 2006 16:12:54 +0000 (GMT)
Hugh Dickins wrote:

> On Wed, 11 Jan 2006, Octavio Alvarez Piza wrote:
> > On Thu, 05 Jan 2006 11:00:41 -0800
> > "Octavio Alvarez" wrote:
> >
> > I have found another instance of "bad_page_state" with mapcount:-1
> > before hitting BUG_ON().
> >
> > Bad page state at free_hot_cold_page (in process 'X', page c1140c60)
> > flags:0x80010008 mapping:00000000 mapcount:-65536 count:0
>
> No, that's mapcount -65536 not -1.
>

That's right, this might be a different issue. Now that it was X and not
"kswapd0", and Arjan has asked, I've realized that I'm using the
binary nVidia driver. I had gotten pretty much the same issue with the
open driver, though. Still, since I changed kernels to 2.6.15, I'll try
again to catch the bad page state with the free nv driver.

> That means page->_mapcount contained 0xfffeffff when it should have
> contained 0xffffffff. A single bit got cleared. Probably bad memory,
> overheating, something of that kind.

BTW, what's the first 8 in flags:0x80010008? I can't find 1<<31 in
include/linux/page-flags.h

Octavio.

2006-01-11 17:17:58

by Hugh Dickins

Subject: Re: mm/rmap.c negative page map count BUG.

On Wed, 11 Jan 2006, Octavio Alvarez Piza wrote:
>
> BTW, what's the first 8 in flags:0x80010008? I can't find 1<<31 in
> include/linux/page-flags.h

It's the zone that page belongs to (you won't, I think, get involved
in nodes and sections): see helpful comment on "page->flags layout"
in include/linux/mm.h, and definitions in include/linux/mmzone.h.

I'd have to make a fool of myself by doing arithmetic in public,
probably getting it wrong, to tell you precisely which zone the
8 meant in 2.6.15 in your config; but it's not interesting anyway.

Hugh

2006-01-11 17:25:05

by Andrew Morton

Subject: Re: mm/rmap.c negative page map count BUG.

Octavio Alvarez Piza wrote:
>
> > That means page->_mapcount contained 0xfffeffff when it should have
> > contained 0xffffffff. A single bit got cleared. Probably bad memory,
> > overheating, something of that kind.
>
> BTW, what's the first 8 in flags:0x80010008? I can't find 1<<31 in
> include/linux/page-flags.h

That's the page's zone identifier. We stuff that into the high bits of
page->flags for page_zone().
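
To separate the zone identifier from the real flag bits when reading such a report, something like the following userspace sketch may help. The width of the zone field depends on the kernel version and configuration, as noted earlier in the thread, so it is a parameter here; the width of 2 used in the note below is purely an illustrative assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* The top `width` bits of a 32-bit page->flags value. */
static uint32_t sketch_top_bits(uint32_t flags, unsigned int width)
{
	assert(width >= 1 && width <= 31);
	return flags >> (32 - width);
}

/* The same flags value with the top `width` bits cleared. */
static uint32_t sketch_low_flags(uint32_t flags, unsigned int width)
{
	assert(width >= 1 && width <= 31);
	return flags & (UINT32_MAX >> width);
}
```

With an assumed width of 2, sketch_top_bits(0x80010008u, 2) is 2 and sketch_low_flags(0x80010008u, 2) is 0x00010008, so the leading 8 contributes nothing to the PG_* bits. Which zone the value 2 names is left to the definitions in include/linux/mmzone.h for the tree in question.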