2004-09-19 01:38:01

by Mike Kirk

[permalink] [raw]
Subject: 2.6.9-rc2: "kernel BUG at mm/rmap.c:473!"

Not sure what this means: but the system kept running and I only lost a
bzip2 process: 2.6.9-rc2 w/preempt AMD 2700+ on A7N8X motherboard, 1GB
DDR400:

==============================
kernel BUG at mm/rmap.c:473!
invalid operand: 0000 [#1]
PREEMPT
Modules linked in:
CPU: 0
EIP: 0060:[<c014a8f9>] Not tainted VLI
EFLAGS: 00010286 (2.6.8.1N)
EIP is at page_remove_rmap+0x29/0x40
eax: ffffffff ebx: 00005000 ecx: c1017a18 edx: c1017a00
esi: fffedffc edi: 00006000 ebp: c1017a00 esp: f6c99e84
ds: 007b es: 007b ss: 0068
Process bzip2 (pid: 11701, threadinfo=f6c98000 task=c0ec8560)
Stack: c0143f21 c1017a00 00000007 c0ecab7c c0ecab7c 00bd0067 00000000
bfffa000
c04f9c94 c03fa000 c0ecac00 c0000000 00000000 c0144087 c04f9c94 c0ecabfc
bfffa000 00006000 00000000 c04f9c94 bfffa000 c0ecac00 c0000000 00000000
Call Trace:
[<c0143f21>] zap_pte_range+0x161/0x270
[<c0144087>] zap_pmd_range+0x57/0x80
[<c01440fb>] unmap_page_range+0x4b/0x80
[<c014422d>] unmap_vmas+0xfd/0x1c0
[<c0148a53>] exit_mmap+0x83/0x160
[<c011abb4>] mmput+0x64/0xa0
[<c011ef42>] do_exit+0x152/0x420
[<c011f29a>] do_group_exit+0x3a/0xb0
[<c0105c07>] syscall_call+0x7/0xb
Code: 26 00 8b 54 24 04 8b 02 f6 c4 08 75 28 83 42 08 ff 0f 98 c0 84 c0 74
11 8b 42 08 40 78 0d 9c 58 fa ff 0d 90 7e 50 c0 50 9d 90 c3 <0f> 0b d9 01 6d
eb 3c c0 eb e9 0f 0b d6 01 6d eb 3c c0 eb ce 8d
<6>note: bzip2[11701] exited with preempt_count 2
bad: scheduling while atomic!
[<c03aac06>] schedule+0x4c6/0x4d0
[<c0155773>] wake_up_buffer+0x13/0x40
[<c01558e3>] unlock_buffer+0x13/0x20
[<c015903f>] ll_rw_block+0x4f/0x80
[<c01a8212>] search_by_key+0xfc2/0x1030
[<c0268550>] pty_write+0x130/0x140
[<c02b8897>] ppp_async_push+0x97/0x170
[<c02b87f2>] ppp_async_send+0x42/0x50
[<c02b3e68>] ppp_push+0x98/0x100
[<c0124983>] __mod_timer+0x123/0x170
[<c0328e09>] htb_delay_by+0x49/0xb0
[<c032903f>] htb_dequeue+0x1cf/0x260
[<c01a8331>] search_for_position_by_key+0xb1/0x3e0
[<c031fa77>] qdisc_restart+0x17/0x1d0
[<c0314c61>] net_tx_action+0xa1/0xd0
[<c0191629>] make_cpu_key+0x59/0x70
[<c01aa2f6>] reiserfs_do_truncate+0xc6/0x5a0
[<c019529d>] reiserfs_truncate_file+0xed/0x240
[<c0196de6>] reiserfs_file_release+0x266/0x470
[<c016bba4>] dput+0x24/0x210
[<c01554e0>] __fput+0x110/0x130
[<c0153c09>] filp_close+0x59/0x90
[<c011e23a>] put_files_struct+0x5a/0xc0
[<c011ef7d>] do_exit+0x18d/0x420
[<c0107220>] do_invalid_op+0x0/0x120
[<c0106e68>] die+0x188/0x190
[<c014a8f9>] page_remove_rmap+0x29/0x40
[<c0115676>] fixup_exception+0x16/0x60
[<c0107332>] do_invalid_op+0x112/0x120
[<c014a8f9>] page_remove_rmap+0x29/0x40
[<c0124eeb>] update_wall_time+0xb/0x40
[<c012535f>] do_timer+0xdf/0xf0
[<c01212bd>] __do_softirq+0x7d/0x90
[<c0106611>] error_code+0x2d/0x38
[<c014a8f9>] page_remove_rmap+0x29/0x40
[<c0143f21>] zap_pte_range+0x161/0x270
[<c0144087>] zap_pmd_range+0x57/0x80
[<c01440fb>] unmap_page_range+0x4b/0x80
[<c014422d>] unmap_vmas+0xfd/0x1c0
[<c0148a53>] exit_mmap+0x83/0x160
[<c011abb4>] mmput+0x64/0xa0
[<c011ef42>] do_exit+0x152/0x420
[<c011f29a>] do_group_exit+0x3a/0xb0
[<c0105c07>] syscall_call+0x7/0xb
==============================

Regards,

Mike


2004-09-19 12:07:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: 2.6.9-rc2: "kernel BUG at mm/rmap.c:473!"

On Sat, 18 Sep 2004, Mike Kirk wrote:
> Not sure what this means: but the system kept running and I only lost a
> bzip2 process: 2.6.9-rc2 w/preempt AMD 2700+ on A7N8X motherboard, 1GB
> DDR400:
> ==============================
> kernel BUG at mm/rmap.c:473!
> EIP is at page_remove_rmap+0x29/0x40

Was there a "Bad page state" message and backtrace shortly before this?
I say "shortly" because I don't suppose bzip2 had been running for hours,
I'd expect the underlying error to have occurred while it was running.

BUG_ON(page_mapcount(page) < 0);

Previous incidents of this BUG (or its antecedents: the mapcount has
recently changed to atomic from guarded by spinlock) have been after
something elsewhere has freed a page it no longer owned, which has
meanwhile got mapped into process address space.

If that's the case this time around, then I hope the bad_page backtrace
will shed light on what's freeing that shouldn't. But if there was no
"Bad page state" message before, then I'll have to start worrying about
rmap mapcount consistency.

(The page_remove_rmap BUG follows as a consequence of bad_page resetting
the mapcount on such a page. This is unsatisfactory: that code remote
from the cause should force a BUG, whereas code nearer the cause try to
continue without forcing a BUG. Just removing the BUG from mm/rmap.c
would let me off the hook, but seems irresponsible. Adding a BUG into
bad_page would revert an intentional relaxation. I don't know.)

Hugh

2004-11-30 15:06:48

by Benoit Boissinot

[permalink] [raw]
Subject: 2.6.10-rc2-mm3 [was: Re: 2.6.9-rc2: "kernel BUG at mm/rmap.c:473!"]

On Sun, 19 Sep 2004, Hugh Dickins wrote:
> On Sat, 18 Sep 2004, Mike Kirk wrote:
> > Not sure what this means: but the system kept running and I only lost a
> > bzip2 process: 2.6.9-rc2 w/preempt AMD 2700+ on A7N8X motherboard, 1GB
> > DDR400:
> > ==============================
> > kernel BUG at mm/rmap.c:473!
> > EIP is at page_remove_rmap+0x29/0x40
>
> Was there a "Bad page state" message and backtrace shortly before this?
> I say "shortly" because I don't suppose bzip2 had been running for hours,
> I'd expect the underlying error to have occurred while it was running.
>
> BUG_ON(page_mapcount(page) < 0);
>

I had the same BUG_ON with 2.6.10-rc2-mm3 while transcoding a video.

------------[ cut here ]------------
kernel BUG at mm/rmap.c:479!
invalid operand: 0000 [#1]
Modules linked in: radeon drm snd_seq snd_via82xx snd_ac97_codec snd_pcm snd_timer snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore ide_cd cdrom sk98lin
CPU: 0
EIP: 0060:[<c01454a4>] Not tainted VLI
EFLAGS: 00010286 (2.6.10-rc2-mm3-arakou)
EIP is at page_remove_rmap+0x34/0x40
eax: ffffffff ebx: 00007000 ecx: c03be100 edx: c17712c0
esi: cbfa60bc edi: 003ce000 ebp: c6648d78 esp: c6648d78
ds: 007b es: 007b ss: 0068
Process transcode (pid: 14505, threadinfo=c6648000 task=dac3aaa0)
Stack: c6648dac c013f2af c6648e34 c02d74dd 00000003 0000000e 3b896067 c17712c0
b2028000 c03be100 b2428000 c8027b24 b23f6000 c6648dd4 c013f439 003ce000
00000000 00000000 c0310524 c03be100 b2028000 c8027b24 b23f6000 c6648df4
Call Trace:
[<c0103b2a>] show_stack+0x7a/0x90
[<c0103cad>] show_registers+0x14d/0x1c0
[<c0103e82>] die+0xc2/0x140
[<c01042a3>] do_invalid_op+0xa3/0xb0
[<c01037e3>] error_code+0x2b/0x30
[<c013f2af>] zap_pte_range+0x13f/0x280
[<c013f439>] zap_pmd_range+0x49/0x70
[<c013f49d>] zap_pgd_range+0x3d/0x70
[<c013f4fb>] unmap_page_range+0x2b/0x40
[<c013f600>] unmap_vmas+0xf0/0x1f0
[<c014399b>] exit_mmap+0x6b/0x120
[<c0116431>] mmput+0x41/0xd0
[<c011a2db>] do_exit+0x15b/0x420
[<c011a616>] do_group_exit+0x36/0x70
[<c0122b0b>] get_signal_to_deliver+0x1bb/0x2b0
[<c0102b76>] do_signal+0x66/0x120
[<c0102c69>] do_notify_resume+0x39/0x3c
[<c0102daa>] work_notifysig+0x13/0x15
Code: 08 75 1e 83 42 08 ff 0f 98 c0 84 c0 74 11 8b 42 08 40 78 17 9c 58 fa ff 0d d0 c4 3c c0 50 9d c9 c3 0f 0b dc 01 db 95 2d c0 eb d8 <0f> 0b df 01 db 95 2d c0 eb df 89 f6 55 89 e5 83 ec 18 89 5d f4
BUG: atomic counter underflow at:
[<c0103b57>] dump_stack+0x17/0x20
[<c011a51e>] do_exit+0x39e/0x420
[<c0103efc>] die+0x13c/0x140
[<c01042a3>] do_invalid_op+0xa3/0xb0
[<c01037e3>] error_code+0x2b/0x30
[<c013f2af>] zap_pte_range+0x13f/0x280
[<c013f439>] zap_pmd_range+0x49/0x70
[<c013f49d>] zap_pgd_range+0x3d/0x70
[<c013f4fb>] unmap_page_range+0x2b/0x40
[<c013f600>] unmap_vmas+0xf0/0x1f0
[<c014399b>] exit_mmap+0x6b/0x120
[<c0116431>] mmput+0x41/0xd0
[<c011a2db>] do_exit+0x15b/0x420
[<c011a616>] do_group_exit+0x36/0x70
[<c0122b0b>] get_signal_to_deliver+0x1bb/0x2b0
[<c0102b76>] do_signal+0x66/0x120
[<c0102c69>] do_notify_resume+0x39/0x3c
[<c0102daa>] work_notifysig+0x13/0x15
Bad page state at prep_new_page (in process 'mpd', page c17712c0)
flags:0x40020114 mapping:00000000 mapcount:-1 count:0
Backtrace:
[<c0103b57>] dump_stack+0x17/0x20
[<c0136352>] bad_page+0x72/0xb0
[<c013669b>] prep_new_page+0x2b/0x80
[<c0136d8b>] buffered_rmqueue+0xcb/0x170
[<c0136ffd>] __alloc_pages+0x11d/0x340
[<c013925c>] do_page_cache_readahead+0xcc/0x120
[<c0139476>] page_cache_readahead+0x1c6/0x1f0
[<c0132ecb>] do_generic_mapping_read+0x11b/0x630
[<c013361d>] __generic_file_aio_read+0x16d/0x1e0
[<c0133784>] generic_file_read+0x94/0xc0
[<c014e6f1>] vfs_read+0x101/0x140
[<c014e99d>] sys_read+0x3d/0x70
[<c0102d0d>] sysenter_past_esp+0x52/0x71
Trying to fix it up, but a reboot is needed

Regards,

Benoit


Attachments:
(No filename) (3.78 kB)
config.gz (7.51 kB)
lspci.log (10.23 kB)
output of lspci -vv
Download all attachments

2004-12-01 01:00:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: 2.6.10-rc2-mm3 [was: Re: 2.6.9-rc2: "kernel BUG at mm/rmap.c:473!"]

On Tue, 30 Nov 2004, Benoit Boissinot wrote:
>
> I had the same BUG_ON with 2.6.10-rc2-mm3 while transcoding a video.
>
> kernel BUG at mm/rmap.c:479!
> CPU: 0
> EIP: 0060:[<c01454a4>] Not tainted VLI
> EFLAGS: 00010286 (2.6.10-rc2-mm3-arakou)
> EIP is at page_remove_rmap+0x34/0x40
...
> BUG: atomic counter underflow at:
> [<c0103b57>] dump_stack+0x17/0x20
> [<c011a51e>] do_exit+0x39e/0x420
> [<c0103efc>] die+0x13c/0x140
...
> Bad page state at prep_new_page (in process 'mpd', page c17712c0)
> flags:0x40020114 mapping:00000000 mapcount:-1 count:0
> Backtrace:
> [<c0103b57>] dump_stack+0x17/0x20
> [<c0136352>] bad_page+0x72/0xb0
> [<c013669b>] prep_new_page+0x2b/0x80

Thanks for the report. I'm still searching for something useful
to say. I've recently spent several days trying to deduce what's
behind such page_remove_rmap BUGs, but not yet come up with any
convincing hypothesis. Yours is the first I've seen without
CONFIG_PREEMPT, so that's another potential culprit exonerated.

Some things which _might_ help me to shed more light on your case:
outputs of "objdump -rd" on your mm/memory.o, mm/rmap.o, kernel/exit.o;
and "cat /proc/$(pidof transcode)/maps" while transcode is running.

The atomic counter underflow in do_exit does suggest corruption
elsewhere than in transcode's page table (though I'm not at all
sure that is corrupt) - as always, it is worth giving memtest86
a thorough run to check your memory.

Thanks,
Hugh

2004-12-01 01:13:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.6.10-rc2-mm3 [was: Re: 2.6.9-rc2: "kernel BUG at mm/rmap.c:473!"]

On Wed, Dec 01, 2004 at 12:49:39AM +0000, Hugh Dickins wrote:
> The atomic counter underflow in do_exit does suggest corruption
> elsewhere than in transcode's page table (though I'm not at all
> sure that is corrupt) - as always, it is worth giving memtest86
> a thorough run to check your memory.

Transcode should be 99% cpu bound in userspace and it shouldn't be
kernel intensive at all. It's one of the few desktop apps 99% cpu bound,
in turn the reasoning that the cpu is overheating sounds reasonable to
me. It might also be using sse2 to compress faster etc...