2005-11-29 18:34:20

by Ryan Richter

[permalink] [raw]
Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, Nov 29, 2005 at 09:24:32AM -0800, Ryan Richter wrote:

Not sure if this matters, but this apparently happened in two stages.
This first part happened during the backups, as I said earlier:

> Bad page state at free_hot_cold_page (in process 'taper', page ffff81000260b6f8)
> flags:0x010000000000000c mapping:ffff8100355f1dd8 mapcount:2 count:0
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a965>{free_hot_cold_page+101}
> <ffffffff80162007>{__page_cache_release+151} <ffffffff802b8fe8>{sgl_unmap_user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
> Trying to fix it up, but a reboot is needed
> Bad page state at free_hot_cold_page (in process 'taper', page ffff81000260b6f8)
> flags:0x010000000000081c mapping:ffff81005c0fc310 mapcount:0 count:0
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a965>{free_hot_cold_page+101}
> <ffffffff80162007>{__page_cache_release+151} <ffffffff802b8fe8>{sgl_unmap
> _user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
> Trying to fix it up, but a reboot is needed
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at include/linux/mm.h:341
> invalid operand: 0000 [1] SMP
> CPU 1
> Modules linked in: bonding
> Pid: 2418, comm: taper Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff802b8fcd>] <ffffffff802b8fcd>{sgl_unmap_user_pages+93}
> RSP: 0018:ffff810035725e18 EFLAGS: 00010256
> RAX: 0000000000000000 RBX: 0000000000000007 RCX: 000000000000000f
> RDX: 00000000000000e0 RSI: 0000000000000001 RDI: ffff81000260b6f8
> RBP: ffff810004852068 R08: 00000000ffffffff R09: 0000000000000000
> R10: 0000000000008000 R11: 0000000000000200 R12: 0000000000000008
> R13: 0000000000000000 R14: 0000000000008000 R15: ffff810004949d10
> FS: 00002aaaab53d880(0000) GS:ffffffff804db880(0000) knlGS:00000000556b6920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac0000 CR3: 0000000035691000 CR4: 00000000000006e0
> Process taper (pid: 2418, threadinfo ffff810035724000, task ffff81017d680300)
> Stack: ffff8101423f3600 ffff810004852000 0000000000000040 0000000000008000
> ffff810004949c00 ffffffff802b48fb ffff810004852000 ffffffff802b4fb1
> ffff810000000000 ffffffff00000001
> Call Trace:<ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
>
> Code: 0f 0b 68 ba 12 3a 80 c2 55 01 f0 83 47 08 ff 0f 98 c0 84 c0
> RIP <ffffffff802b8fcd>{sgl_unmap_user_pages+93} RSP <ffff810035725e18>
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at mm/rmap.c:487
> invalid operand: 0000 [2] SMP
> CPU 1
> Modules linked in: bonding
> Pid: 2418, comm: taper Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff8016f3f7>] <ffffffff8016f3f7>{page_remove_rmap+39}
> RSP: 0018:ffff810035725ab0 EFLAGS: 00010286
> RAX: 00000000ffffffff RBX: ffff8100356976f8 RCX: ffff81000000f000
> RDX: 0000000000000000 RSI: 8000000064c69067 RDI: ffff81000260b6f8
> RBP: 00002aaaaaadf000 R08: 0000000000000000 R09: ffff81000260b688
> R10: 00000000fffffffa R11: 0000000000000000 R12: ffff810101c22380
> R13: 8000000064c69067 R14: ffff81000260b6f8 R15: 0000000000000000
> FS: 00002aaaab53d880(0000) GS:ffffffff804db880(0000) knlGS:00000000556b6920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaaaac0000 CR3: 0000000035691000 CR4: 00000000000006e0
> Process taper (pid: 2418, threadinfo ffff810035724000, task ffff81017d680300)
> Stack: ffffffff80166ecd 00002aaaaab62000 ffff810035696aa8 00002aaaaab62000
> 00002aaaaab62000 00002aaaaab61fff ffff810035695550 00002aaaaab62000
> ffffffff80167180 ffff810035725d68
> Call Trace:<ffffffff80166ecd>{zap_pte_range+477} <ffffffff80167180>{unmap_page_range+496}
> <ffffffff801672e5>{unmap_vmas+293} <ffffffff8016cfa2>{exit_mmap+162}
> <ffffffff80131ce1>{mmput+49} <ffffffff801371c6>{do_exit+438}
> <ffffffff8010f6f1>{die+81} <ffffffff8010f9df>{do_invalid_op+159}
> <ffffffff802b8fcd>{sgl_unmap_user_pages+93} <ffffffff80381f76>{thread_return+86}
> <ffffffff802a8662>{sym_setup_data_and_start+402} <ffffffff8010e84d>{error_exit+0}
> <ffffffff802b8fcd>{sgl_unmap_user_pages+93} <ffffffff802b8fe8>{sgl_unmap_user_pages+120}
> <ffffffff802b48fb>{release_buffering+27} <ffffffff802b4fb1>{st_write+1697}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
>
> Code: 0f 0b 68 9b 35 3a 80 c2 e7 01 48 c7 c6 ff ff ff ff bf 20 00
> RIP <ffffffff8016f3f7>{page_remove_rmap+39} RSP <ffff810035725ab0>
> <1>Fixing recursive fault but reboot is needed!
> Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
> <ffffffff801b9c7b>{ext3_prepare_write+27}
> PGD 355bc067 PUD 355c9067 PMD 0
> Oops: 0000 [3] SMP
> CPU 0
> Modules linked in: bonding
> Pid: 2416, comm: driver Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff801b9c7b>] <ffffffff801b9c7b>{ext3_prepare_write+27}
> RSP: 0018:ffff8100355e7b48 EFLAGS: 00010296
> RAX: 0000000000000000 RBX: ffffffff8040f660 RCX: 000000000000017d
> RDX: 0000000000000094 RSI: ffff81000260b6f8 RDI: ffff810035b09cc0
> RBP: 000000000000000e R08: 00000000fffffffa R09: 00000000000000e9
> R10: ffff81001190c818 R11: 0000000000000000 R12: ffff81000260b6f8
> R13: ffff81000260b6f8 R14: 000000000000017d R15: 0000000000000094
> FS: 00002aaaab53d8e0(0000) GS:ffffffff804db800(0000) knlGS:00000000555bc920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 0000000000000000 CR3: 0000000035555000 CR4: 00000000000006e0
> Process driver (pid: 2416, threadinfo ffff8100355e6000, task ffff8100f43e8a80)
> Stack: ffff81014e643310 ffffffff8040f660 000000000000000e ffff81000260b6f8
> ffff81005c0fc310 0000000000000094 00000000000000e9 ffffffff80158247
> 0000000000000292 00002aaaaaac0000
> Call Trace:<ffffffff80158247>{generic_file_buffered_write+551}
> <ffffffff801c06bd>{__ext3_journal_stop+45} <ffffffff8019ec74>{__mark_inode_dirty+52}
> <ffffffff801963ec>{inode_update_time+188} <ffffffff80158a08>{__generic_file_aio_write_nolock+936}
> <ffffffff80381f76>{thread_return+86} <ffffffff8013d349>{lock_timer_base+41}
> <ffffffff80158cfe>{generic_file_aio_write+110} <ffffffff801b7783>{ext3_file_write+35}
> <ffffffff8017ae43>{do_sync_write+211} <ffffffff8018ecc0>{__pollwait+0}
> <ffffffff8014a2b0>{autoremove_wake_function+0} <ffffffff8018f681>{sys_select+1153}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
>
> Code: 48 8b 28 48 89 ef e8 aa 26 00 00 c7 44 24 04 00 00 00 00 89
> RIP <ffffffff801b9c7b>{ext3_prepare_write+27} RSP <ffff8100355e7b48>
> CR2: 0000000000000000
> <0>Bad page state at prep_new_page (in process 'dumper', page ffff81000260b6f8)
> flags:0x010000000000001d mapping:0000000000000000 mapcount:-1 count:1
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a371>{prep_new_page+65}
> <ffffffff8015ab2e>{buffered_rmqueue+302} <ffffffff8015ad85>{__alloc_pages+261}
> <ffffffff801581bd>{generic_file_buffered_write+413}
> <ffffffff80139509>{current_fs_time+105} <ffffffff8019636e>{inode_update_time+62}
> <ffffffff80158a08>{__generic_file_aio_write_nolock+936}
> <ffffffff8031f4a4>{sock_common_recvmsg+52} <ffffffff8031bb30>{sock_aio_read+272}
> <ffffffff80158cfe>{generic_file_aio_write+110} <ffffffff801b7783>{ext3_file_write+35}
> <ffffffff8017ae43>{do_sync_write+211} <ffffffff8018ecc0>{__pollwait+0}
> <ffffffff8014a2b0>{autoremove_wake_function+0} <ffffffff8018f681>{sys_select+1153}
> <ffffffff8017af46>{vfs_write+198} <ffffffff8017b0a3>{sys_write+83}
> <ffffffff8010db7a>{system_call+126}
> Trying to fix it up, but a reboot is needed

Everything from here on happened several hours later while updatedb was
running.

> Bad page state at prep_new_page (in process 'find', page ffff81000260b6f8)
> flags:0x0100000000000064 mapping:ffff8100f3be9be9 mapcount:1 count:1
> Backtrace:
>
> Call Trace:<ffffffff80159f93>{bad_page+99} <ffffffff8015a371>{prep_new_page+65}
> <ffffffff8015ab2e>{buffered_rmqueue+302} <ffffffff8015ad85>{__alloc_pages+261}
> <ffffffff8015e7a3>{kmem_getpages+99} <ffffffff8015fbb0>{cache_grow+192}
> <ffffffff8015fe3b>{cache_alloc_refill+459} <ffffffff80160226>{kmem_cache_alloc+54}
> <ffffffff80193831>{d_alloc+33} <ffffffff80188fe9>{real_lookup+105}
> <ffffffff801893c0>{do_lookup+112} <ffffffff80189e07>{__link_path_walk+2551}
> <ffffffff8018a382>{link_path_walk+178} <ffffffff8018a8ce>{path_lookup+446}
> <ffffffff8018aa9e>{__user_walk+62} <ffffffff801849b6>{vfs_lstat+38}
> <ffffffff80184dff>{sys_newlstat+31} <ffffffff8010db7a>{system_call+126}
>
> Trying to fix it up, but a reboot is needed
> Unable to handle kernel paging request at 00002aaaab9c5b61 RIP:
> <ffffffff8015fdba>{cache_alloc_refill+330}
> PGD c2512067 PUD c2513067 PMD 0
> Oops: 0002 [4] SMP
> CPU 0
> Modules linked in: bonding
> Pid: 3011, comm: find Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff8015fdba>] <ffffffff8015fdba>{cache_alloc_refill+330}
> RSP: 0018:ffff810112f05c28 EFLAGS: 00010082
> RAX: 00002aaaab9c5b59 RBX: 0000000000000010 RCX: 0000000000029ba6
> RDX: 00002aaaab9c5bb3 RSI: ffff810064c69040 RDI: ffff81000c01a288
> RBP: ffff8100f6fc4800 R08: ffff81000c01a250 R09: ffff81000c01a260
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff81000c01a240
> R13: ffff8100f6fc3640 R14: ffff81000c01a288 R15: 00000000000000d0
> FS: 00002aaaaae00640(0000) GS:ffffffff804db800(0000) knlGS:00000000555bc920
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002aaaab9c5b61 CR3: 00000000c2bb3000 CR4: 00000000000006e0
> Process find (pid: 3011, threadinfo ffff810112f04000, task ffff810102b46040)
> Stack: ffff810112f05e68 ffff810179923cb8 fffffffffffffff4 ffff810112f05d28
> ffff810179923cb8 ffff810112f05d28 ffff810112f05e68 ffffffff80160226
> 0000000000000292 ffffffff80193831
> Call Trace:<ffffffff80160226>{kmem_cache_alloc+54} <ffffffff80193831>{d_alloc+33}
> <ffffffff80188fe9>{real_lookup+105} <ffffffff801893c0>{do_lookup+112}
> <ffffffff80189e07>{__link_path_walk+2551} <ffffffff8018a382>{link_path_walk+178}
> <ffffffff8018a8ce>{path_lookup+446} <ffffffff8018aa9e>{__user_walk+62}
> <ffffffff801849b6>{vfs_lstat+38} <ffffffff80184dff>{sys_newlstat+31}
> <ffffffff8010db7a>{system_call+126}
>
> Code: 48 89 50 08 48 89 02 48 c7 46 08 00 02 20 00 83 7e 24 ff 48
> RIP <ffffffff8015fdba>{cache_alloc_refill+330} RSP <ffff810112f05c28>
> CR2: 00002aaaab9c5b61
> NMI Watchdog detected LOCKUP on CPU 1
> CPU 1
> Modules linked in: bonding
> Pid: 7, comm: events/1 Tainted: G B 2.6.14.2 #1
> RIP: 0010:[<ffffffff803837dd>] <ffffffff803837dd>{.text.lock.spinlock+118}
> RSP: 0018:ffff810004869dd0 EFLAGS: 00000086
> RAX: ffff81000c01a240 RBX: ffff81000c01a288 RCX: ffff8100f6fc3640
> RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffff81000c01a288
> RBP: ffff810100009dc0 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000ffffffff R11: 0000000000000066 R12: 0000000000000000
> R13: ffff810100009dd0 R14: 0000000000000292 R15: ffff810100009e40
> FS: 00002aaaaae00640(0000) GS:ffffffff804db880(0000) knlGS:00000000556b6920
> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> CR2: 00002aaaaaf1df40 CR3: 000000017f448000 CR4: 00000000000006e0
> Process events/1 (pid: 7, threadinfo ffff810004868000, task ffff8100f6fb6080)
> Stack: ffffffff8015e35b ffff8100f6fc3640 ffff810100009f60 0000000000000001
> ffff810100009e40 ffff8100f6fc3640 ffff8100f6fc38e0 ffff810100009f88
> ffffffff80161414 ffff810004869e58
> Call Trace:<ffffffff8015e35b>{drain_alien_cache+123} <ffffffff80161414>{cache_reap+164}
> <ffffffff80161370>{cache_reap+0} <ffffffff8014553c>{worker_thread+476}
> <ffffffff8012ed70>{default_wake_function+0} <ffffffff8012ed70>{default_wake_function+0}
> <ffffffff80145360>{worker_thread+0} <ffffffff80149c82>{kthread+146}
> <ffffffff8010ea02>{child_rip+8} <ffffffff80145360>{worker_thread+0}
> <ffffffff80149bf0>{kthread+0} <ffffffff8010e9fa>{child_rip+0}
>
>
> Code: 80 3f 00 7e f9 e9 59 fe ff ff e8 58 41 e9 ff e9 6f fe ff ff
> console shuts up ...
> <0>Kernel panic - not syncing: Aiee, killing interrupt handler!