2004-09-24 07:54:28

by Roel van der Made

Subject: kernel 2.6.9-rc2-mm1 system hangup

Hi there,

I'm having problems with servers hanging spontaneously without any logging
or console output. They're running a 2.6.9-rc2-mm1 kernel and are Dell
PowerEdge 1750 dual Xeon servers with 4G ECC Reg. and 3 disks in
sw-raid 5.

The systems still respond to ping and still listen on e.g. the MySQL port, but
do not give a MySQL prompt. It seems the disks are in some kind of deadlock state.

Using the SysRq showTasks I see the following traces (I will only show
some, since the total log is much too long to post here; the full log,
including the .config, can be found at http://roel.net/backtrace/):

pdflush D 00000008 0 56 15 57 23 (L-TLB)
f7ca7c10 00000046 f7ca7c00 00000008 00000003 00000000 00000000 6c4e4dc0
00000011 c31c6000 00000000 00000000 f64ff800 00000000 00000000 00000000
00000000 00000008 00000000 006d3803 00000013 c2fca020 00000003 c31c6154
Call Trace:
[<c0361ce2>] __down+0x8e/0x10d
[<c011818f>] default_wake_function+0x0/0x12
[<c0361eb0>] __down_failed+0x8/0xc
[<c03030fc>] .text.lock.md+0x9b/0xf3
[<c02f9c92>] make_request+0x1df/0x226
[<c026508f>] generic_make_request+0x113/0x194
[<c01395d2>] mempool_alloc+0x8b/0x15d
[<c012f9ab>] autoremove_wake_function+0x0/0x57
[<c0265180>] submit_bio+0x70/0x121
[<c0159707>] bio_alloc+0xd9/0x1ac
[<c0158fba>] submit_bh+0xe0/0x133
[<c019604a>] reiserfs_write_full_page+0x2eb/0x4d0
[<c019626a>] reiserfs_writepage+0x26/0x3e
[<c017659d>] mpage_writepages+0x207/0x3b2
[<c0196244>] reiserfs_writepage+0x0/0x3e
[<c016cabc>] d_rehash+0x55/0x79
[<c017388b>] simple_lookup+0x40/0x44
[<c013c17c>] do_writepages+0x3d/0x43
[<c0174a37>] __sync_single_inode+0x66/0x1f1
[<c0174c32>] __writeback_single_inode+0x70/0x157
[<c0174ef6>] generic_sync_sb_inodes+0x1dd/0x2e7
[<c013caa9>] pdflush+0x0/0x2c
[<c01751ce>] sync_inodes_sb+0xba/0xc9
[<c0175264>] get_super_to_sync+0x54/0x75
[<c01752b0>] sync_inodes+0x2b/0x90
[<c0155df5>] do_sync+0x23/0x74
[<c013c9d7>] __pdflush+0xbf/0x191
[<c013cad1>] pdflush+0x28/0x2c
[<c0155dd2>] do_sync+0x0/0x74
[<c013caa9>] pdflush+0x0/0x2c
[<c012f4e5>] kthread+0xb7/0xbd
[<c012f42e>] kthread+0x0/0xbd
[<c0103f59>] kernel_thread_helper+0x5/0xb

sshd D 00000008 0 24085 1220 24084 (NOTLB)
e8e59cac 00000082 e8e59c98 00000008 00000003 00000001 e9e60b7c bd0483c0
00000004 e8e19000 e8e29900 e8e28440 e8e1b080 00000082 c04830c0 f3ff6550
00000000 e9e60b7c e8e1b080 dcda1904 00000012 c2fca020 00000003 e8e19154
Call Trace:
[<c0361ce2>] __down+0x8e/0x10d
[<c011818f>] default_wake_function+0x0/0x12
[<c0361eb0>] __down_failed+0x8/0xc
[<c01b3474>] .text.lock.journal+0x4b/0xcb
[<c01b1816>] journal_begin+0x6b/0xc3
[<c019f63a>] reiserfs_dirty_inode+0x58/0xd3
[<c0174982>] __mark_inode_dirty+0x1d2/0x1d7
[<c016e996>] update_atime+0xd9/0xde
[<c0136ac2>] do_generic_mapping_read+0x3cb/0x53d
[<c0136f0f>] __generic_file_aio_read+0x1ef/0x22b
[<c0136c34>] file_read_actor+0x0/0xec
[<c01626ad>] link_path_walk+0x9ed/0xd38
[<c0137079>] generic_file_read+0xba/0xd2
[<c01473ff>] __vma_link+0x44/0x73
[<c01481da>] do_mmap_pgoff+0x509/0x779
[<c012f9ab>] autoremove_wake_function+0x0/0x57
[<c015e3b2>] sys_fstat64+0x37/0x39
[<c01546f3>] vfs_read+0xbc/0x127
[<c01549c1>] sys_read+0x51/0x80
[<c0105dab>] syscall_call+0x7/0xb

There are more. I'm not sure which are relevant here, so please look at the full log at the location mentioned above.

Thanks,

--
Roel van der Made .--.
GNU/Linux Systems/Network Engineer |o_o |
Telegraaf Media ICT BV - Internet Services |:_/ |
Tel.: +31 (0)20 585 2229 // \ \
GnuPG Key available at: http://roel.net/gpgkey.asc


2004-09-27 20:57:01

by Chris Mason

Subject: Re: kernel 2.6.9-rc2-mm1 system hangup

On Fri, 2004-09-24 at 09:54 +0200, Roel van der Made wrote:
> Hi there,
>
> I'm having problems with servers hanging spontaneously without any logging
> or console output. They're running a 2.6.9-rc2-mm1 kernel and are Dell
> PowerEdge 1750 dual Xeon servers with 4G ECC Reg. and 3 disks in
> sw-raid 5.
>
> The systems still respond to ping and still listen on e.g. the MySQL port, but
> do not give a MySQL prompt. It seems the disks are in some kind of deadlock state.
>
> Using the SysRq showTasks I see the following traces (I will only show
> some, since the total log is much too long to post here; the full log,
> including the .config, can be found at http://roel.net/backtrace/):

For reiserfs deadlocks, it's usually the task stuck in do_journal_end
that everyone else is waiting on. Those two procs are below. Does anyone
have ideas where the md code is stuck?

pdflush D 00000008 0 57 15 59 56 (L-TLB)
f7cbbba4 00000046 f7cbbb94 00000008 00000001 00000008 00000008 94891fbc
0000003e f7c77000 f7cbbc28 00081600 f650fa80 f7e7db50 f7cbbc28 00000000
f7c77000 00000008 00000000 b31919b8 0000000d c2fba020 00000001 f7c77154
Call Trace:
[<c026942f>] as_update_arq+0x2e/0x75
[<c0362a8c>] wait_for_completion+0x90/0xef
[<c011818f>] default_wake_function+0x0/0x12
[<c011818f>] default_wake_function+0x0/0x12
[<c0261b6f>] elv_merged_request+0x1f/0x21
[<c02fd432>] sync_page_io+0xa3/0xb1
[<c02fd373>] bi_complete+0x0/0x1c
[<c02fec51>] write_disk_sb+0x78/0xb0
[<c02fecb4>] sync_sbs+0x2b/0x43
[<c02fed75>] md_update_sb+0xa9/0xdb
[<c0117cb3>] load_balance_newidle+0x35/0x98
[<c03021a8>] md_write_start+0x95/0xa0
[<c02f9c92>] make_request+0x1df/0x226
[<c026508f>] generic_make_request+0x113/0x194
[<c01395d2>] mempool_alloc+0x8b/0x15d
[<c012f9ab>] autoremove_wake_function+0x0/0x57
[<c0265180>] submit_bio+0x70/0x121
[<c0159707>] bio_alloc+0xd9/0x1ac
[<c0158fba>] submit_bh+0xe0/0x133
[<c0159075>] ll_rw_block+0x68/0x88
[<c01ae555>] flush_commit_list+0x454/0x48f
[<c01b30b1>] do_journal_end+0x898/0xb63
[<c013caa9>] pdflush+0x0/0x2c
[<c01b1e39>] journal_end_sync+0x4d/0x89
[<c019e9e5>] reiserfs_sync_fs+0x65/0xa8
[<c015b041>] sync_supers+0x9b/0x9d
[<c013befb>] wb_kupdate+0x60/0x13b
[<c013c9d7>] __pdflush+0xbf/0x191
[<c013cad1>] pdflush+0x28/0x2c
[<c013be9b>] wb_kupdate+0x0/0x13b
[<c013caa9>] pdflush+0x0/0x2c
[<c012f4e5>] kthread+0xb7/0xbd
[<c012f42e>] kthread+0x0/0xbd
[<c0103f59>] kernel_thread_helper+0x5/0xb

and

munin-node D 00000008 0 23920 978 23921 23533 (NOTLB)
e9cd59a0 00000086 e9cd598c 00000008 00000003 00000008 00000008 f753ac80
00000007 ea202550 e9cd5a24 00201c00 f6466300 00000082 c04830c0 f6449550
00000000 f6888e1c 00000008 c45a45e6 0000000d c2fca020 00000003 ea2026a4
Call Trace:
[<c0362a8c>] wait_for_completion+0x90/0xef
[<c011818f>] default_wake_function+0x0/0x12
[<c015701f>] __find_get_block+0x5e/0xc2
[<c011818f>] default_wake_function+0x0/0x12
[<c01a76d6>] is_tree_node+0x6f/0x71
[<c02fd432>] sync_page_io+0xa3/0xb1
[<c02fd373>] bi_complete+0x0/0x1c
[<c02fec51>] write_disk_sb+0x78/0xb0
[<c02fecb4>] sync_sbs+0x2b/0x43
[<c02fed75>] md_update_sb+0xa9/0xdb
[<c0193ca4>] inode2sd+0xcc/0x116
[<c03021a8>] md_write_start+0x95/0xa0
[<c02f9c92>] make_request+0x1df/0x226
[<c026508f>] generic_make_request+0x113/0x194
[<c01395d2>] mempool_alloc+0x8b/0x15d
[<c012f9ab>] autoremove_wake_function+0x0/0x57
[<c0265180>] submit_bio+0x70/0x121
[<c015701f>] __find_get_block+0x5e/0xc2
[<c0159707>] bio_alloc+0xd9/0x1ac
[<c01a76d6>] is_tree_node+0x6f/0x71
[<c0158fba>] submit_bh+0xe0/0x133
[<c01ad9c8>] submit_logged_buffer+0x5e/0x62
[<c01adad0>] write_chunk+0x3d/0x47
[<c01af127>] kupdate_transactions+0x129/0x14c
[<c0157000>] __find_get_block+0x3f/0xc2
[<c015e4d9>] inode_get_bytes+0x3d/0x54
[<c012484c>] run_timer_softirq+0x109/0x19d
[<c01180b8>] scheduler_tick+0x192/0x269
[<c0156f8f>] bh_lru_install+0xb0/0xe2
[<c01af209>] flush_used_journal_lists+0xbf/0xe1
[<c01b27fa>] flush_old_journal_lists+0x3f/0x5e
[<c01b2fcf>] do_journal_end+0x7b6/0xb63
[<c01b1bd5>] journal_end+0xa2/0xc0
[<c019f66e>] reiserfs_dirty_inode+0x8c/0xd3
[<c013a30b>] __rmqueue+0xe8/0x139
[<c0174982>] __mark_inode_dirty+0x1d2/0x1d7
[<c013a38b>] rmqueue_bulk+0x2f/0x6f
[<c016ea47>] inode_update_time+0xac/0xd6
[<c019a218>] reiserfs_file_write+0x2f4/0x7a3
[<c0145716>] do_wp_page+0x20a/0x380
[<c01442ed>] pte_alloc_map+0xaa/0xd1
[<c0146658>] handle_mm_fault+0x15c/0x172
[<c01154fd>] do_page_fault+0x19f/0x5c9
[<c01280e9>] do_sigaction+0x1e7/0x203
[<c012484c>] run_timer_softirq+0x109/0x19d
[<c0154905>] vfs_write+0xbc/0x127
[<c030a069>] sys_socketcall+0xf7/0x256
[<c0154a41>] sys_write+0x51/0x80
[<c0105dab>] syscall_call+0x7/0xb
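
For anyone who wants to repeat this kind of search on the full log, here is a minimal sketch in Python. It picks out the tasks in D state whose trace mentions do_journal_end and lists the frames printed above it. The names parse_tasks and tasks_containing, the regular expressions, and the shortened SAMPLE text are illustrative assumptions based only on the dump format shown in this thread; other kernel versions print these dumps differently. Note also that these traces can contain stale stack entries, so a match is a hint rather than proof of where the task is blocked.

```python
import re

# Task header as printed in the dumps above (assumed format), e.g.
#   pdflush D 00000008 0 57 15 59 56 (L-TLB)
TASK_RE = re.compile(
    r"^(?P<name>\S+)\s+(?P<state>[A-Z])\s+[0-9A-Fa-f]{8}\s+\d+\s+"
    r"(?P<pid>\d+)\b.*\((?:L-TLB|NOTLB)\)\s*$"
)
# Call-trace line, e.g.  [<c01b30b1>] do_journal_end+0x898/0xb63
FRAME_RE = re.compile(
    r"^\[<[0-9a-f]+>\]\s+(?P<sym>.+?)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_tasks(log_text):
    """Split a showTasks dump into tasks with their trace symbols."""
    tasks = []
    for raw in log_text.splitlines():
        line = raw.strip()
        header = TASK_RE.match(line)
        if header:
            tasks.append({
                "name": header.group("name"),
                "state": header.group("state"),
                "pid": int(header.group("pid")),
                "frames": [],
            })
            continue
        frame = FRAME_RE.match(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group("sym"))
    return tasks


def tasks_containing(log_text, symbol, state="D"):
    """Return (name, pid, frames printed above `symbol`) per matching task."""
    hits = []
    for task in parse_tasks(log_text):
        frames = task["frames"]
        if task["state"] == state and symbol in frames:
            hits.append((task["name"], task["pid"],
                         frames[:frames.index(symbol)]))
    return hits


# Shortened excerpt of two traces from this thread, for illustration only.
SAMPLE = """
pdflush D 00000008 0 56 15 57 23 (L-TLB)
Call Trace:
[<c0361ce2>] __down+0x8e/0x10d
[<c0361eb0>] __down_failed+0x8/0xc
[<c03030fc>] .text.lock.md+0x9b/0xf3
[<c02f9c92>] make_request+0x1df/0x226

pdflush D 00000008 0 57 15 59 56 (L-TLB)
Call Trace:
[<c0362a8c>] wait_for_completion+0x90/0xef
[<c02fd432>] sync_page_io+0xa3/0xb1
[<c02fed75>] md_update_sb+0xa9/0xdb
[<c03021a8>] md_write_start+0x95/0xa0
[<c01b30b1>] do_journal_end+0x898/0xb63
[<c013cad1>] pdflush+0x28/0x2c
"""

if __name__ == "__main__":
    for name, pid, above in tasks_containing(SAMPLE, "do_journal_end"):
        print(name, pid, " <- ".join(above))
```

To run it against a saved copy of the full dump, read the file into a string and pass that string in place of SAMPLE.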