Date: Mon, 19 Oct 2009 14:04:56 +1100
From: Dave Chinner <david@fromorbit.com>
To: Justin Piszcz
Cc: linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org, xfs@oss.sgi.com, Alan Piszcz
Subject: Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48 hours (sysrq-t+w available)
Message-ID: <20091019030456.GS9464@discord.disaster>

On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush       D 0000000000000001     0   362      2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095]  [] ? xfs_trans_brelse+0x30/0x130
> [76325.608099]  [] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103]  [] ? default_wake_function+0x0/0x10
> [76325.608106]  [] ? _xfs_log_force+0x51/0x80
> [76325.608108]  [] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd      D 0000000000000000     0   831      2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216]  [] ? xlog_state_sync+0x49/0x2a0
> [76325.608220]  [] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222]  [] ? autoremove_wake_function+0x0/0x30
> [76325.608225]  [] ? xfs_iflush+0xdd/0x2f0
> [76325.608228]  [] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231]  [] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233]  [] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236]  [] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log I/O not completing. That is, all of the D-state
processes are backed up on locks or waiting for I/O completion
processing. Many of them are waiting for _xfs_log_force to complete;
others are waiting for inodes to be unpinned, or are backed up behind
locked inodes that are themselves waiting on log I/O to complete
before their transaction can finish and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads do not appear
anywhere in the output provided, so I can't tell whether they have
deadlocked themselves and caused the problem. However, my experience
with pile-ups like this is that an I/O completion has not been run
for some reason, and that is the root cause. I don't know whether you
can provide enough information to tell us if that is what happened
here. Failing that, do you have a test case you can share?
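In the meantime, something like the sketch below would tell us whether
those completion threads are present and what everything is stuck in.
It's only a rough sketch: it assumes the capture is the sysrq-t.txt
file from above, and /proc/<pid>/stack is only available if the kernel
was built with CONFIG_STACKTRACE.

    # Does the capture contain the XFS completion threads at all?
    # (They normally show up as xfslogd/N and xfsdatad/N, one per CPU.)
    grep -A 10 -E 'xfslogd|xfsdatad' sysrq-t.txt

    # On the live machine: what is every D-state task blocked in?
    # wchan is the kernel function the task is sleeping in.
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'

    # Grab the completion threads' current kernel stacks directly
    # (needs CONFIG_STACKTRACE for /proc/<pid>/stack):
    for pid in $(pgrep 'xfslogd|xfsdatad'); do
        echo "=== pid $pid ($(ps -p $pid -o comm=)) ==="
        cat /proc/$pid/stack
    done

If those threads never show up, or their stacks sit in the same place
across repeated samples while everything else is queued behind
xfs_log_force, that would back up the stuck-I/O-completion theory.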
Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com