Date:    Mon, 11 Feb 2008 17:02:05 -0800
From:    Linda Walsh
To:      Linux-Xfs
CC:      Linux-Kernel
Subject: xfs [_fsr] probs in 2.6.24.0
Message-ID: <47B0F00D.3060802@tlinx.org>

I'm getting similar errors on an x86-32 and an x86-64 kernel.  The x86-64
system (second log below, with date+time stamps) was unusable this morning:
one or more of its XFS file systems had gone offline due to some unknown
error (on reboot no errors were reported; all of the partitions are on the
same physical disk).

I keep ending up with random failures on both systems.  The 32-bit system
more often than not just locks up -- no messages, no keyboard response,
nothing to do except reset -- and of course, nothing in the logs.  I'm
turning on all the stats and diagnostics that don't seem to carry much of a
performance penalty to see if that helps.

Perhaps one issue is the "bug" (it looks like multiple instances of the same
bug) in XFS.  The 32-bit machine does a lot of disk activity in the early
morning hours, and coincidentally it seemed to be crashing (not consistently)
in the morning.  With 2.6.24 the 32-bit machine "seems" a bit more stable --
up for over 3 days now, where the average before was under 48 hours.  But the
64-bit machine went bonkers (not sure of the exact time); it *had* been
stable (non-crashing, anyway) before 2.6.24.

I vaguely remember a similar XFS lock bug a few versions back; I was told not
to worry about it and to turn off the lock checking to "avoid the problem"
( :^| ).  So I did, but I'm still trying to track down the randomness.  So I
turned back on what checks I could, and... ding -- still a problem in the XFS
area, and "coincidentally" (this has happened on more than one occasion) an
XFS partition developed a run-time error and shut itself down.

This also happened on the 32-bit machine with the SATA disk (I thought it
might be SATA-specific), so I removed its controller and disk and threw in a
same-size PATA drive.  No more file-system errors on 32-bit.  But this is the
first time I've had a filesystem forced offline on the 64-bit machine (which
only just switched to 2.6.24, for obvious reasons).  Sadly, it also seemed
that the 32-bit machine, while it had the SATA controller+disk installed, was
more likely to crash if xfs_fsr was running at the same time remote backups
were being written to the SATA drive.  I couldn't repeat that reliably,
though, so I'm not sure what's going on there.

Both machines are SMP, so that's a potential factor, along with the fact that
both also use SCSI (SAS, in the 64-bit machine's case).  Neither runs a
graphical console; both are accessed remotely.  Let me know if I can provide
any info to help address these problems.  I don't know if they are related to
the other system crashes, but the usage patterns suggest it could be
disk-activity related.
So eliminating known bugs and problems in that area might help(?)...

Suspect log messages follow.  The first, without date/time stamps, is from
dmesg on the 32-bit machine; the second, on the 64-bit machine, had gone
unnoticed until I started looking through the logs for clues about last
night's failures.  The first set of errors, which I have some diagnostics
for, I could only find in dmesg (the "ish" 32-bit machine):

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.24-ish #1
-------------------------------------------------------
xfs_fsr/2119 is trying to acquire lock:
 (&mm->mmap_sem){----}, at: [] dio_get_page+0x62/0x160

but task is already holding lock:
 (&(&ip->i_iolock)->mr_lock){----}, at: [] xfs_ilock+0x5b/0xb0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&(&ip->i_iolock)->mr_lock){----}:
       [] __lock_acquire+0xce6/0x1150
       [] xfs_ilock+0x97/0xb0
       [] trace_hardirqs_on+0xaf/0x160
       [] lock_acquire+0x6a/0x80
       [] xfs_ilock+0x97/0xb0
       [] down_write_nested+0x43/0x80
       [] xfs_ilock+0x97/0xb0
       [] xfs_ilock+0x97/0xb0
       [] xfs_free_eofblocks+0x231/0x2e0
       [] xfs_release+0x1a7/0x230
       [] xfs_file_release+0xb/0x10
       [] __fput+0xac/0x180
       [] remove_vma+0x35/0x50
       [] do_munmap+0x190/0x200
       [] sys_munmap+0x2f/0x50
       [] sysenter_past_esp+0x5f/0xa5
       [] 0xffffffff

-> #0 (&mm->mmap_sem){----}:
       [] print_circular_bug_entry+0x41/0x50
       [] __lock_acquire+0xad9/0x1150
       [] mark_held_locks+0x43/0x80
       [] lock_acquire+0x6a/0x80
       [] dio_get_page+0x62/0x160
       [] down_read+0x38/0x80
       [] dio_get_page+0x62/0x160
       [] dio_get_page+0x62/0x160
       [] __spin_lock_init+0x32/0x60
       [] __blockdev_direct_IO+0x483/0xc70
       [] kmem_cache_alloc+0x9b/0xc0
       [] _spin_unlock+0x14/0x20
       [] lockdep_init_map+0x3d/0x4c0
       [] xfs_alloc_ioend+0x120/0x170
       [] xfs_vm_direct_IO+0x15f/0x170
       [] xfs_get_blocks_direct+0x0/0x30
       [] xfs_end_io_direct+0x0/0x90
       [] generic_file_direct_IO+0xd2/0x160
       [] generic_file_direct_write+0x5a/0x170
       [] up_write+0x14/0x30
       [] xfs_write+0x3e8/0x920
       [] file_read_actor+0x0/0x120
       [] xfs_file_aio_write+0x5e/0x70
       [] do_sync_write+0xc7/0x110
       [] autoremove_wake_function+0x0/0x40
       [] lock_release_holdtime+0x47/0x70
       [] vfs_read+0xd0/0x160
       [] do_sync_write+0x0/0x110
       [] vfs_write+0xa0/0x160
       [] sys_write+0x41/0x70
       [] sysenter_past_esp+0x5f/0xa5
       [] 0xffffffff

other info that might help us debug this:

1 lock held by xfs_fsr/2119:
 #0:  (&(&ip->i_iolock)->mr_lock){----}, at: [] xfs_ilock+0x5b/0xb0

stack backtrace:
Pid: 2119, comm: xfs_fsr Not tainted 2.6.24-ish #1
 [] print_circular_bug_tail+0x7a/0x90
 [] __lock_acquire+0xad9/0x1150
 [] mark_held_locks+0x43/0x80
 [] lock_acquire+0x6a/0x80
 [] dio_get_page+0x62/0x160
 [] down_read+0x38/0x80
 [] dio_get_page+0x62/0x160
 [] dio_get_page+0x62/0x160
 [] __spin_lock_init+0x32/0x60
 [] __blockdev_direct_IO+0x483/0xc70
 [] kmem_cache_alloc+0x9b/0xc0
 [] _spin_unlock+0x14/0x20
 [] lockdep_init_map+0x3d/0x4c0
 [] xfs_alloc_ioend+0x120/0x170
 [] xfs_vm_direct_IO+0x15f/0x170
 [] xfs_get_blocks_direct+0x0/0x30
 [] xfs_end_io_direct+0x0/0x90
 [] generic_file_direct_IO+0xd2/0x160
 [] generic_file_direct_write+0x5a/0x170
 [] up_write+0x14/0x30
 [] xfs_write+0x3e8/0x920
 [] file_read_actor+0x0/0x120
 [] xfs_file_aio_write+0x5e/0x70
 [] do_sync_write+0xc7/0x110
 [] autoremove_wake_function+0x0/0x40
 [] lock_release_holdtime+0x47/0x70
 [] vfs_read+0xd0/0x160
 [] do_sync_write+0x0/0x110
 [] vfs_write+0xa0/0x160
 [] sys_write+0x41/0x70
 [] sysenter_past_esp+0x5f/0xa5
=======================
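To make the ordering in that report easier to see, here is a tiny userspace
analogy of the two acquisition orders lockdep recorded.  The lock names just
mirror the kernel locks (mm->mmap_sem and the inode's i_iolock); nothing in
it is kernel code, it only illustrates the AB-BA pattern:

/*
 * Userspace analogy only -- not kernel code.  Two rwlocks named after the
 * kernel locks in the report, and two functions that take them in the two
 * orders lockdep recorded:
 *
 *   path_munmap():    mmap_sem -> i_iolock   (munmap -> fput -> xfs_release)
 *   path_dio_write(): i_iolock -> mmap_sem   (xfs_write -> dio_get_page)
 *
 * Read/write lock modes are simplified.  Build with: cc demo.c -lpthread
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_rwlock_t i_iolock = PTHREAD_RWLOCK_INITIALIZER;

static void *path_munmap(void *arg)
{
        (void)arg;
        pthread_rwlock_wrlock(&mmap_sem);  /* sys_munmap takes mmap_sem...     */
        pthread_rwlock_wrlock(&i_iolock);  /* ...then the final fput() ends up */
        puts("munmap path:       mmap_sem -> i_iolock");  /* in xfs_ilock()    */
        pthread_rwlock_unlock(&i_iolock);
        pthread_rwlock_unlock(&mmap_sem);
        return NULL;
}

static void *path_dio_write(void *arg)
{
        (void)arg;
        pthread_rwlock_wrlock(&i_iolock);  /* xfs_write takes the iolock...     */
        pthread_rwlock_rdlock(&mmap_sem);  /* ...then direct I/O needs mmap_sem */
        puts("direct-write path: i_iolock -> mmap_sem");  /* in dio_get_page()  */
        pthread_rwlock_unlock(&mmap_sem);
        pthread_rwlock_unlock(&i_iolock);
        return NULL;
}

int main(void)
{
        pthread_t t;

        /* Run the two paths one after the other so the demo always finishes;
         * it is overlapping them on the same inode that risks the AB-BA
         * deadlock the report is about. */
        pthread_create(&t, NULL, path_munmap, NULL);
        pthread_join(t, NULL);
        pthread_create(&t, NULL, path_dio_write, NULL);
        pthread_join(t, NULL);
        return 0;
}

If the two paths overlap on the same file, each can end up waiting for the
lock the other already holds, which is exactly the inversion lockdep is
warning about -- whether or not it ever actually deadlocks here.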
--------------

The other system shows a similar message in its logs:

Feb 7 02:01:50 kern: =======================================================
Feb 7 02:01:50 kern: [ INFO: possible circular locking dependency detected ]
Feb 7 02:01:50 kern: 2.6.24-asa64def #3
Feb 7 02:01:50 kern: -------------------------------------------------------
Feb 7 02:01:50 kern: xfs_fsr/6313 is trying to acquire lock:
Feb 7 02:01:50 kern:  (&(&ip->i_lock)->mr_lock/2){----}, at: [] xfs_ilock+0x82/0xc0
Feb 7 02:01:50 kern:
Feb 7 02:01:50 kern: but task is already holding lock:
Feb 7 02:01:50 kern:  (&(&ip->i_iolock)->mr_lock/3){--..}, at: [] xfs_ilock+0xa5/0xc0
Feb 7 02:01:50 kern:
Feb 7 02:01:50 kern: which lock already depends on the new lock.
Feb 7 02:01:50 kern:
Feb 7 02:01:50 kern:
Feb 7 02:01:50 kern: the existing dependency chain (in reverse order) is:
Feb 7 02:01:50 kern:
Feb 7 02:01:50 kern: -> #1 (&(&ip->i_iolock)->mr_lock/3){--..}:
Feb 7 02:01:50 kern:        [] __lock_acquire+0xc39/0x1090
Feb 7 02:01:50 kern:        [] lock_acquire+0x61/0x80
Feb 7 02:01:50 kern:        [] xfs_ilock+0xa5/0xc0
Feb 7 02:01:50 kern:        [] down_write_nested+0x3a/0x80
Feb 7 02:01:50 kern:        [] xfs_ilock+0xa5/0xc0
Feb 7 02:01:50 kern:        [] xfs_lock_inodes+0x143/0x190
Feb 7 02:01:50 kern:        [] xfs_swap_extents+0xa1/0x630
Feb 7 02:01:50 kern:        [] lock_release_holdtime+0x45/0x70
Feb 7 02:01:50 kern:        [] xfs_swapext+0x131/0x160
Feb 7 02:01:50 kern:        [] xfs_ioctl+0x5ce/0x740
Feb 7 02:01:50 kern:        [] lock_release_holdtime+0x45/0x70
Feb 7 02:01:50 kern:        [] __mutex_unlock_slowpath+0xca/0x1a0
Feb 7 02:01:50 kern:        [] xfs_file_ioctl_invis+0x36/0x80
Feb 7 02:01:50 kern:        [] do_ioctl+0x2f/0xa0
Feb 7 02:01:50 kern:        [] vfs_ioctl+0x230/0x2d0
Feb 7 02:01:50 kern:        [] trace_hardirqs_on+0xc1/0x160
Feb 7 02:01:50 kern:        [] sys_ioctl+0x49/0x80
Feb 7 02:01:51 kern:        [] system_call+0x7e/0x83
Feb 7 02:01:51 kern:        [] 0xffffffffffffffff
Feb 7 02:01:51 kern:
Feb 7 02:01:51 kern: -> #0 (&(&ip->i_lock)->mr_lock/2){----}:
Feb 7 02:01:51 kern:        [] print_circular_bug_header+0xe8/0xf0
Feb 7 02:01:51 kern:        [] __lock_acquire+0xaa8/0x1090
Feb 7 02:01:51 kern:        [] lock_acquire+0x61/0x80
Feb 7 02:01:51 kern:        [] xfs_ilock+0x82/0xc0
Feb 7 02:01:51 kern:        [] _spin_unlock+0x17/0x20
Feb 7 02:01:51 kern:        [] down_write_nested+0x3a/0x80
Feb 7 02:01:51 kern:        [] xfs_ilock+0x82/0xc0
Feb 7 02:01:51 kern:        [] xfs_lock_inodes+0x143/0x190
Feb 7 02:01:51 kern:        [] xfs_swap_extents+0x31e/0x630
Feb 7 02:01:51 kern:        [] xfs_swapext+0x131/0x160
Feb 7 02:01:51 kern:        [] xfs_ioctl+0x5ce/0x740
Feb 7 02:01:51 kern:        [] lock_release_holdtime+0x45/0x70
Feb 7 02:01:51 kern:        [] __mutex_unlock_slowpath+0xca/0x1a0
Feb 7 02:01:51 kern:        [] xfs_file_ioctl_invis+0x36/0x80
Feb 7 02:01:51 kern:        [] do_ioctl+0x2f/0xa0
Feb 7 02:01:51 kern:        [] vfs_ioctl+0x230/0x2d0
Feb 7 02:01:51 kern:        [] trace_hardirqs_on+0xc1/0x160
Feb 7 02:01:51 kern:        [] sys_ioctl+0x49/0x80
Feb 7 02:01:51 kern:        [] system_call+0x7e/0x83
Feb 7 02:01:51 kern:        [] 0xffffffffffffffff
Feb 7 02:01:51 kern:
Feb 7 02:01:51 kern: other info that might help us debug this:
Feb 7 02:01:51 kern:
Feb 7 02:01:51 kern: 2 locks held by xfs_fsr/6313:
Feb 7 02:01:51 kern:  #0:  (&(&ip->i_iolock)->mr_lock/2){--..}, at: [] xfs_ilock+0xa5/0xc0
Feb 7 02:01:51 kern:  #1:  (&(&ip->i_iolock)->mr_lock/3){--..}, at: [] xfs_ilock+0xa5/0xc0
Feb 7 02:01:51 kern:
Feb 7 02:01:51 kern: stack backtrace:
Feb 7 02:01:51 kern: Pid: 6313, comm: xfs_fsr Not tainted 2.6.24-asa64def #3
Feb 7 02:01:51 kern:
Feb 7 02:01:51 kern: Call Trace:
Feb 7 02:01:51 kern:  [] print_circular_bug_tail+0x83/0x90
Feb 7 02:01:51 kern:  [] print_circular_bug_header+0xe8/0xf0
Feb 7 02:01:51 kern:  [] __lock_acquire+0xaa8/0x1090
Feb 7 02:01:51 kern:  [] lock_acquire+0x61/0x80
Feb 7 02:01:51 kern:  [] xfs_ilock+0x82/0xc0
Feb 7 02:01:51 kern:  [] _spin_unlock+0x17/0x20
Feb 7 02:01:51 kern:  [] down_write_nested+0x3a/0x80
Feb 7 02:01:51 kern:  [] xfs_ilock+0x82/0xc0
Feb 7 02:01:51 kern:  [] xfs_lock_inodes+0x143/0x190
Feb 7 02:01:51 kern:  [] xfs_swap_extents+0x31e/0x630
Feb 7 02:01:51 kern:  [] xfs_swapext+0x131/0x160
Feb 7 02:01:51 kern:  [] xfs_ioctl+0x5ce/0x740
Feb 7 02:01:51 kern:  [] lock_release_holdtime+0x45/0x70
Feb 7 02:01:51 kern:  [] __mutex_unlock_slowpath+0xca/0x1a0
Feb 7 02:01:51 kern:  [] xfs_file_ioctl_invis+0x36/0x80
Feb 7 02:01:51 kern:  [] do_ioctl+0x2f/0xa0
Feb 7 02:01:51 kern:  [] vfs_ioctl+0x230/0x2d0
Feb 7 02:01:51 kern:  [] trace_hardirqs_on+0xc1/0x160
Feb 7 02:01:51 kern:  [] sys_ioctl+0x49/0x80
Feb 7 02:01:51 kern:  [] system_call+0x7e/0x83
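For context on how xfs_fsr reaches the xfs_swapext/xfs_swap_extents path in
that trace: the defrag pass builds a temporary, contiguous copy of the file
and then asks the kernel to swap the two inodes' extents.  The sketch below
shows roughly what that call looks like from userspace.  It assumes the
xfsprogs development headers (<xfs/xfs.h>); the struct fields shown are from
memory and may differ by version; sx_stat is left zeroed where the real tool
fills it from a bulkstat call (so the kernel should refuse the swap).  It is
only meant to illustrate the entry point, not to be a working defragmenter:

/*
 * Rough sketch of the XFS_IOC_SWAPEXT call xfs_fsr issues (the ioctl that
 * leads into xfs_swapext -> xfs_swap_extents in the trace above).
 * Assumptions: xfsprogs development headers installed as <xfs/xfs.h>, and
 * struct field names as remembered -- they may differ by version.
 * sx_stat is deliberately left zeroed (the real tool fills it via bulkstat
 * so the kernel can verify the target hasn't changed), so the kernel is
 * expected to reject the swap -- illustration only.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <xfs/xfs.h>            /* XFS_IOC_SWAPEXT, struct xfs_swapext */

int main(int argc, char **argv)
{
        struct xfs_swapext sx;
        struct stat st;
        int target, tmp;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <target-on-xfs> <tmpcopy-on-same-fs>\n",
                        argv[0]);
                return 1;
        }

        /* xfs_fsr opens both the target and its temporary copy with
         * O_RDWR|O_DIRECT on the same filesystem before requesting the swap. */
        target = open(argv[1], O_RDWR | O_DIRECT);
        tmp    = open(argv[2], O_RDWR | O_DIRECT);
        if (target < 0 || tmp < 0 || fstat(target, &st) < 0) {
                perror("open/fstat");
                return 1;
        }

        memset(&sx, 0, sizeof(sx));
        sx.sx_version  = XFS_SX_VERSION;
        sx.sx_fdtarget = target;        /* file being defragmented          */
        sx.sx_fdtmp    = tmp;           /* freshly written, contiguous copy */
        sx.sx_offset   = 0;
        sx.sx_length   = st.st_size;
        /* sx.sx_stat left zeroed -- see the note above. */

        if (ioctl(target, XFS_IOC_SWAPEXT, &sx) < 0)
                fprintf(stderr, "XFS_IOC_SWAPEXT: %s\n", strerror(errno));
        else
                puts("extents swapped");

        close(tmp);
        close(target);
        return 0;
}

The relevant point for the lockdep report is just that this single ioctl has
the kernel take the iolock and ilock of two different inodes back to back in
xfs_lock_inodes(), which appears to be where the /2 and /3 lock subclasses in
the trace come from.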