Date: Tue, 29 Mar 2011 12:51:37 +1100
From: Dave Chinner
To: Sean Noonan
Cc: Michel Lespinasse, Christoph Hellwig, linux-kernel@vger.kernel.org,
	Martin Bligh, Trammell Hudson, Christos Zoulas,
	linux-xfs@oss.sgi.com, Stephen Degler, linux-mm@kvack.org
Subject: Re: XFS memory allocation deadlock in 2.6.38
Message-ID: <20110329015137.GD3008@dastard>
In-Reply-To: <081DDE43F61F3D43929A181B477DCA95639B534E@MSXAOA6.twosigma.com>

On Mon, Mar 28, 2011 at 05:34:09PM -0400, Sean Noonan wrote:
> > Could you test if you see the deadlock before
> > 5ecfda041e4b4bd858d25bbf5a16c2a6c06d7272 without MAP_POPULATE ?
>
> Built and tested 72ddc8f72270758951ccefb7d190f364d20215ab.
> Confirmed that the original bug does not present in this version.
> Confirmed that removing MAP_POPULATE does cause the deadlock to occur.
>
> Here is the stack of the test:
> # cat /proc/3846/stack
> [] call_rwsem_down_read_failed+0x14/0x30
> [] xfs_ilock+0x9d/0x110
> [] xfs_ilock_map_shared+0x1e/0x50
> [] __xfs_get_blocks+0xc5/0x4e0
> [] xfs_get_blocks+0xc/0x10
> [] do_mpage_readpage+0x462/0x660
> [] mpage_readpage+0x4a/0x60
> [] xfs_vm_readpage+0x13/0x20
> [] filemap_fault+0x2d0/0x4e0
> [] __do_fault+0x50/0x510
> [] handle_mm_fault+0x1a2/0xe60
> [] do_page_fault+0x146/0x440
> [] page_fault+0x1f/0x30
> [] 0xffffffffffffffff

Something else is holding the inode locked here.

> xfssyncd is stuck in D state.
> # cat /proc/2484/stack
> [] down+0x3c/0x50
> [] xfs_buf_lock+0x72/0x170
> [] xfs_getsb+0x1d/0x50
> [] xfs_trans_getsb+0x5f/0x150
> [] xfs_mod_sb+0x4e/0xe0
> [] xfs_fs_log_dummy+0x5a/0xb0
> [] xfs_sync_worker+0x83/0x90
> [] xfssyncd+0x172/0x220
> [] kthread+0x96/0xa0
> [] kernel_thread_helper+0x4/0x10
> [] 0xffffffffffffffff

And this is indicating that something else is holding the superblock
locked here. IOWs, whatever thread is having trouble with memory
allocation is causing these threads to block, and so they can be
ignored. What's the stack trace of the thread that is throwing the
"I can't allocate a page" errors?

As it is, the question I'd really like answered is how a machine with
48GB of RAM can possibly be short of memory when running mmap() on a
16GB file. The error XFS is throwing indicates that the machine cannot
allocate a single page of memory, so where has all your memory gone,
and why hasn't the OOM killer been let off the leash? What is consuming
the other 32GB of RAM, or preventing it from being allocated?

Also, I was unable to reproduce this at all on a machine with only 2GB
of RAM, regardless of the kernel version and/or MAP_POPULATE, so I'm
left to wonder what is special about your test system...

Perhaps the output of xfs_bmap -vvp after a successful vs. deadlocked
run would be instructive....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com