From: Graham Murray Subject: Re: All processes accessing etx4 partition stuck in 'D' state Date: Tue, 10 Feb 2009 19:57:02 +0000 Message-ID: <87skmmoyk1.fsf@newton.gmurray.org.uk> References: <87ljt7zt9q.fsf@newton.gmurray.org.uk> <20090119124305.GB7598@mit.edu> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii To: linux-ext4@vger.kernel.org Return-path: Received: from newton.gmurray.org.uk ([81.2.114.237]:57493 "EHLO newton.gmurray.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754191AbZBJT5G (ORCPT ); Tue, 10 Feb 2009 14:57:06 -0500 Received: from newton.gmurray.org.uk (localhost [127.0.0.1]) by newton.gmurray.org.uk (8.14.3/8.14.3) with ESMTP id n1AJv2hn017211 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Tue, 10 Feb 2009 19:57:02 GMT In-Reply-To: <20090119124305.GB7598@mit.edu> (Theodore Tso's message of "Mon, 19 Jan 2009 07:43:05 -0500") Sender: linux-ext4-owner@vger.kernel.org List-ID: Theodore Tso writes: > On Mon, Jan 19, 2009 at 06:47:45AM +0000, Graham Murray wrote: >> I do not know if this is ext4 related or not, so my apologies if it is >> not. >> >> I am running the latest linus 2.6.29-rc2 git kernel together with the >> master branch of the ext4 git tree announced here a couple of days ago. >> >> Yesterday evening, all processes accessing /home which is formatted as >> ext4 (created as ext4 under 2.6.26 with the ext4 patches applied, not >> converted from ext3) were stuck in 'D' state and top showed both cores >> of the core2 CPU in 100% Waiting state. Everything not accessing /home >> was responsive. To eliminate a physical problem I ran a SMART self-test >> on the drive containing /home and it passed with no errors. / is also >> etx4, but was converted from ext3. There were no kernel messages, just >> the hung processes. The system would not reboot normally (because of the >> processes in 'D' state), but following sysrq sync, mount r/o and reboot, >> the filesystem showed as clean when restarted. To be sure, the first >> thing I did was to unmount it and manually ran fsck -f and that did not >> report any problems. > > If this happens again, could you use sysrq-l and capture resulting the > stack backtraces? That would help determine what parts of the kernel > are responsible. It has happened again for the third time, and this time I was able to use sysrq. In all three cases, the first process to hang was emacs when exiting a group in gnus and top showed it in WCHAN page_sync. Now using:- Linux newton 2.6.29-rc4-00026-gf06da26 #1 SMP Mon Feb 9 21:17:53 GMT 2009 i686 Intel(R) Core(TM)2 CPU 6700 @ 2.66GHz GenuineIntel GNU/Linux Sysrq-l does not show much Feb 10 17:13:01 newton SysRq : Show backtrace of all active CPUs Feb 10 17:13:01 newton CPU0: Feb 10 17:13:01 newton Feb 10 17:13:01 newton Pid: 0, comm: swapper Not tainted (2.6.29-rc4-00026-gf06da26 #1) P5W DH Deluxe Feb 10 17:13:01 newton EIP: 0060:[] EFLAGS: 00000246 CPU: 0 Feb 10 17:13:01 newton EIP is at mwait_idle+0x32/0x40 Feb 10 17:13:01 newton EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000 Feb 10 17:13:01 newton ESI: c0741200 EDI: c06df814 EBP: c06e7fb0 ESP: c06e7fb0 Feb 10 17:13:01 newton DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 Feb 10 17:13:01 newton CR0: 8005003b CR2: 0a1d2000 CR3: 357b6000 CR4: 000006f0 Feb 10 17:13:01 newton DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 Feb 10 17:13:01 newton DR6: ffff0ff0 DR7: 00000400 Feb 10 17:13:01 newton Call Trace: Feb 10 17:13:01 newton [] cpu_idle+0x68/0xc0 Feb 10 17:13:01 newton [] rest_init+0x4e/0x60 But Sysrq-w shows Feb 10 17:13:28 newton [] ? security_file_permission+0xf/0x20 Feb 10 17:13:28 newton [] ? rw_verify_area+0x48/0xc0 Feb 10 17:13:28 newton [] vfs_write+0x9c/0x160 Feb 10 17:13:28 newton [] ? do_sync_write+0x0/0x110 Feb 10 17:13:28 newton [] sys_write+0x3d/0x70 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton bash D 00000000 0 6734 1 Feb 10 17:13:28 newton ebfb3dbc 00000082 001200d2 00000000 80000000 00000001 bed82067 ced0c7f3 Feb 10 17:13:28 newton 00003399 80000000 09f2a060 ea8f8f78 ebfde000 c0741300 e71c5630 e71c58a4 Feb 10 17:13:28 newton c2816300 00000000 80000000 ced1303c 00003399 80000000 e71c58a4 00100073 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x266/0xd80 Feb 10 17:13:28 newton [] ? do_page_fault+0x235/0x950 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] user_path_at+0x4a/0x80 Feb 10 17:13:28 newton [] ? restore_i387_xstate+0xfc/0x1f0 Feb 10 17:13:28 newton [] ? vfs_ioctl+0x28/0x90 Feb 10 17:13:28 newton [] vfs_stat_fd+0x21/0x50 Feb 10 17:13:28 newton [] vfs_stat+0x11/0x20 Feb 10 17:13:28 newton [] sys_stat64+0x14/0x30 Feb 10 17:13:28 newton [] ? sys_sigreturn+0xc5/0xe0 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton emacs D 00000000 0 6766 6751 Feb 10 17:13:28 newton e57abd44 00000082 c281633c 00000000 c281201c 00000000 00000001 7740f708 Feb 10 17:13:28 newton 00003372 f6fc0034 e57abd08 c0123657 e57abd20 c0741300 e71c24f0 e71c2764 Feb 10 17:13:28 newton c2816300 00000000 00d6a406 0000001c e57abdf8 e4c25d28 e71c2764 00000000 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? set_next_entity+0x97/0x180 Feb 10 17:13:28 newton [] ? ktime_get_ts+0x48/0x50 Feb 10 17:13:28 newton [] io_schedule+0x55/0x90 Feb 10 17:13:28 newton [] sync_page+0x2d/0x60 Feb 10 17:13:28 newton [] __wait_on_bit_lock+0x4a/0xa0 Feb 10 17:13:28 newton [] ? sync_page+0x0/0x60 Feb 10 17:13:28 newton [] __lock_page+0x87/0xa0 Feb 10 17:13:28 newton [] ? wake_bit_function+0x0/0x60 Feb 10 17:13:28 newton [] write_cache_pages+0x196/0x370 Feb 10 17:13:28 newton [] ? __mpage_da_writepage+0x0/0x150 Feb 10 17:13:28 newton [] ext4_da_writepages+0x1fe/0x470 Feb 10 17:13:28 newton [] ? mutex_unlock+0x8/0x10 Feb 10 17:13:28 newton [] ? generic_file_aio_write+0x64/0xd0 Feb 10 17:13:28 newton [] ? do_sync_write+0xd1/0x110 Feb 10 17:13:28 newton [] ? ext4_da_get_block_write+0x0/0x210 Feb 10 17:13:28 newton [] ? ext4_da_writepages+0x0/0x470 Feb 10 17:13:28 newton [] do_writepages+0x2e/0x50 Feb 10 17:13:28 newton [] __filemap_fdatawrite_range+0x61/0x70 Feb 10 17:13:28 newton [] filemap_fdatawrite+0x26/0x30 Feb 10 17:13:28 newton [] vfs_fsync+0x4f/0xd0 Feb 10 17:13:28 newton [] do_fsync+0x2d/0x50 Feb 10 17:13:28 newton [] sys_fsync+0xd/0x10 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton procmail D bf93d025 0 11337 11336 Feb 10 17:13:28 newton f5387df4 00200082 f1da6910 bf93d025 c149978c b7f59238 eeebb790 3f0bfa17 Feb 10 17:13:28 newton 00003388 c1499780 c27f27a0 00000000 f1d68070 c0741300 eee40000 eee40274 Feb 10 17:13:28 newton c2816300 00000000 00d6ff4f c01b239a 00000000 e6411d48 eee40274 c01b378b Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? d_rehash+0x2a/0x40 Feb 10 17:13:28 newton [] ? d_splice_alias+0x3b/0xd0 Feb 10 17:13:28 newton [] ? prepare_to_wait+0x3a/0x70 Feb 10 17:13:28 newton [] start_this_handle+0x1fd/0x4f0 Feb 10 17:13:28 newton [] ? mntput_no_expire+0x1d/0x130 Feb 10 17:13:28 newton [] ? _atomic_dec_and_lock+0x4f/0x70 Feb 10 17:13:28 newton [] ? autoremove_wake_function+0x0/0x40 Feb 10 17:13:28 newton [] jbd2_journal_start+0x93/0xd0 Feb 10 17:13:28 newton [] ext4_journal_start_sb+0x55/0x80 Feb 10 17:13:28 newton [] ext4_create+0x76/0x110 Feb 10 17:13:28 newton [] vfs_create+0x94/0xb0 Feb 10 17:13:28 newton [] do_filp_open+0x731/0x7e0 Feb 10 17:13:28 newton [] do_sys_open+0x59/0xf0 Feb 10 17:13:28 newton [] ? sys_wait4+0x87/0xd0 Feb 10 17:13:28 newton [] sys_open+0x29/0x40 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton [] ? schedule+0x2c0/0x9e0 Feb 10 17:13:28 newton procmail D e70d7f44 0 11339 11338 Feb 10 17:13:28 newton e4c4ddbc 00200082 f7009eec e70d7f44 00000000 00000000 00000000 7793278a Feb 10 17:13:28 newton 00003388 c0179ac1 c16aaa8c e4c4ddc4 c018e717 c0741300 eee46ed0 eee47144 Feb 10 17:13:28 newton c2816300 00000000 00d70055 b7eed238 e4c21370 e5767880 eee47144 c16aaa80 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? unlock_page+0x41/0x50 Feb 10 17:13:28 newton [] ? __do_fault+0x187/0x3f0 Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] ? handle_mm_fault+0x13b/0x780 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x266/0xd80 Feb 10 17:13:28 newton [] ? do_page_fault+0x235/0x950 Feb 10 17:13:28 newton [] ? in_group_p+0x25/0x30 Feb 10 17:13:28 newton [] ? file_move+0x1f/0x50 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] user_path_at+0x4a/0x80 Feb 10 17:13:28 newton [] ? remove_wait_queue+0x38/0x50 Feb 10 17:13:28 newton [] vfs_lstat_fd+0x1e/0x50 Feb 10 17:13:28 newton [] vfs_lstat+0x11/0x20 Feb 10 17:13:28 newton [] sys_lstat64+0x14/0x30 Feb 10 17:13:28 newton [] ? sys_wait4+0x87/0xd0 Feb 10 17:13:28 newton [] ? sys_time+0x19/0x50 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton [] ? schedule+0x2c0/0x9e0 Feb 10 17:13:28 newton sshd D c0179edd 0 11344 6175 Feb 10 17:13:28 newton e70e8d98 00000082 e70e8d64 c0179edd 00000003 00000044 c01b8a8d bd47338f Feb 10 17:13:28 newton 0000338e ebfc43c8 eabd8300 f7009eec eabd8344 c0741300 eee449e0 eee44c54 Feb 10 17:13:28 newton c2816300 00000000 00d71ab9 c155246c e70e8dc4 c018e717 eee44c54 ea923268 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? filemap_fault+0x12d/0x3d0 Feb 10 17:13:28 newton [] ? mntput_no_expire+0x1d/0x130 Feb 10 17:13:28 newton [] ? __do_fault+0x187/0x3f0 Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x154/0xd80 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] path_lookup_open+0x4b/0xb0 Feb 10 17:13:28 newton [] do_filp_open+0xb1/0x7e0 Feb 10 17:13:28 newton [] ? sched_slice+0x58/0x90 Feb 10 17:13:28 newton [] do_sys_open+0x59/0xf0 Feb 10 17:13:28 newton [] ? commit_creds+0xce/0x150 Feb 10 17:13:28 newton [] sys_open+0x29/0x40 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton [] ? schedule+0x2c0/0x9e0 Feb 10 17:13:28 newton sshd D c06c6b40 0 11355 6175 Feb 10 17:13:28 newton e5795d98 00000086 c073e220 c06c6b40 00000002 00000000 00000000 da8d6e19 Feb 10 17:13:28 newton 0000339a 00000246 00000001 c06c6c80 00000000 c0741300 e71c6ed0 e71c7144 Feb 10 17:13:28 newton c2816300 00000000 00d74d56 e71c6ed0 00000010 e5795dc4 e71c7144 c06c73c0 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x154/0xd80 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] path_lookup_open+0x4b/0xb0 Feb 10 17:13:28 newton [] do_filp_open+0xb1/0x7e0 Feb 10 17:13:28 newton [] ? sched_slice+0x58/0x90 Feb 10 17:13:28 newton [] do_sys_open+0x59/0xf0 Feb 10 17:13:28 newton [] ? commit_creds+0xce/0x150 Feb 10 17:13:28 newton [] sys_open+0x29/0x40 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton sshd D c06c6b40 0 11370 6175 Feb 10 17:13:28 newton ea944d98 00000086 c073e220 c06c6b40 00000002 00000000 00000000 0673d3f6 Feb 10 17:13:28 newton 000033c0 00000246 00000001 c06c6c80 e4c63000 c0741300 e71c18a0 e71c1b14 Feb 10 17:13:28 newton c281f300 00000001 c06c6b40 067677ce 000033c0 ea944dc4 e71c1b14 c06c73c0 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x154/0xd80 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] path_lookup_open+0x4b/0xb0 Feb 10 17:13:28 newton [] do_filp_open+0xb1/0x7e0 Feb 10 17:13:28 newton [] ? sched_slice+0x58/0x90 Feb 10 17:13:28 newton [] do_sys_open+0x59/0xf0 Feb 10 17:13:28 newton [] ? commit_creds+0xce/0x150 Feb 10 17:13:28 newton [] sys_open+0x29/0x40 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton procmail D eaba8bc4 0 11408 11407 Feb 10 17:13:28 newton f3160dbc 00200082 f7009eec eaba8bc4 00000000 00000000 00000000 1359ca9e Feb 10 17:13:28 newton 000033f4 c0179ac1 c149822c f3160dc4 c018e717 c0741300 ea820c50 ea820ec4 Feb 10 17:13:28 newton c281f300 00000001 00d8c333 b7fb1238 eaa28738 e5766540 ea820ec4 c1498220 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? unlock_page+0x41/0x50 Feb 10 17:13:28 newton [] ? __do_fault+0x187/0x3f0 Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] ? handle_mm_fault+0x13b/0x780 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x266/0xd80 Feb 10 17:13:28 newton [] ? do_page_fault+0x235/0x950 Feb 10 17:13:28 newton [] ? in_group_p+0x25/0x30 Feb 10 17:13:28 newton [] ? file_move+0x1f/0x50 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] user_path_at+0x4a/0x80 Feb 10 17:13:28 newton [] ? remove_wait_queue+0x38/0x50 Feb 10 17:13:28 newton [] vfs_lstat_fd+0x1e/0x50 Feb 10 17:13:28 newton [] vfs_lstat+0x11/0x20 Feb 10 17:13:28 newton [] sys_lstat64+0x14/0x30 Feb 10 17:13:28 newton [] ? sys_wait4+0x87/0xd0 Feb 10 17:13:28 newton [] ? sys_time+0x19/0x50 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton [] ? schedule+0x2c0/0x9e0 Feb 10 17:13:28 newton procmail D e714a6c4 0 11420 11419 Feb 10 17:13:28 newton eaaf9dbc 00200082 f7009eec e714a6c4 00000000 00000000 00000000 ced77a41 Feb 10 17:13:28 newton 000033fd c0179ac1 c149b70c eaaf9dc4 c018e717 c0741300 ea823d90 ea824004 Feb 10 17:13:28 newton c281f300 00000001 00d8ec3a b7eef238 f5734898 e5766380 ea824004 c149b700 Feb 10 17:13:28 newton Call Trace: Feb 10 17:13:28 newton [] ? unlock_page+0x41/0x50 Feb 10 17:13:28 newton [] ? __do_fault+0x187/0x3f0 Feb 10 17:13:28 newton [] __mutex_lock_slowpath+0xb8/0x250 Feb 10 17:13:28 newton [] ? handle_mm_fault+0x13b/0x780 Feb 10 17:13:28 newton [] mutex_lock+0x12/0x20 Feb 10 17:13:28 newton [] do_lookup+0xd7/0x1d0 Feb 10 17:13:28 newton [] __link_path_walk+0x266/0xd80 Feb 10 17:13:28 newton [] ? do_page_fault+0x235/0x950 Feb 10 17:13:28 newton [] ? in_group_p+0x25/0x30 Feb 10 17:13:28 newton [] ? file_move+0x1f/0x50 Feb 10 17:13:28 newton [] path_walk+0x54/0xc0 Feb 10 17:13:28 newton [] do_path_lookup+0xb9/0x1a0 Feb 10 17:13:28 newton [] user_path_at+0x4a/0x80 Feb 10 17:13:28 newton [] ? remove_wait_queue+0x38/0x50 Feb 10 17:13:28 newton [] vfs_lstat_fd+0x1e/0x50 Feb 10 17:13:28 newton [] vfs_lstat+0x11/0x20 Feb 10 17:13:28 newton [] sys_lstat64+0x14/0x30 Feb 10 17:13:28 newton [] ? sys_wait4+0x87/0xd0 Feb 10 17:13:28 newton [] ? sys_time+0x19/0x50 Feb 10 17:13:28 newton [] sysenter_do_call+0x12/0x21 Feb 10 17:13:28 newton [] ? schedule+0x2c0/0x9e0 and quite a few more procmails which had been launched after it started hanging.