From: Miklos Szeredi <miklos@szeredi.hu>
To: akpm@linux-foundation.org
CC: miklos@szeredi.hu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: dirty balancing deadlock
Date: Mon, 19 Feb 2007 00:22:11 +0100
In-reply-to: <20070218145929.547c21c7.akpm@linux-foundation.org>
	(message from Andrew Morton on Sun, 18 Feb 2007 14:59:29 -0800)
References: <20070218125307.4103c04a.akpm@linux-foundation.org>
	<20070218145929.547c21c7.akpm@linux-foundation.org>

> > > > I was testing the new fuse shared writable mmap support, and found
> > > > that bash-shared-mapping deadlocks (which isn't so strange ;).  What
> > > > is more strange is that this is not an OOM situation at all: there
> > > > are plenty of free and cached pages.
> > > >
> > > > A little more investigation shows that a similar deadlock happens
> > > > reliably with bash-shared-mapping on a loopback mount, even if only
> > > > half of the total memory is used.
> > > >
> > > > The cause is slightly different in the two cases:
> > > >
> > > >  - loopback mount: allocation by the underlying filesystem is
> > > >    stalled in throttle_vm_writeout()
> > > >
> > > >  - fuse-loop: page dirtying on the underlying filesystem is stalled
> > > >    in balance_dirty_pages()
> > > >
> > > > In both cases the underlying fs is totally innocent, with no dirty
> > > > or writeback pages, yet it's waiting for the global dirty+writeback
> > > > count to go below the threshold, which obviously won't happen until
> > > > the allocation/dirtying succeeds.
> > > >
> > > > I'm not quite sure what the solution is, and am asking for thoughts.
> > >
> > > But.... these things don't just throttle.  They also perform large
> > > amounts of writeback, which causes the dirty levels to subside.
> > >
> > > From your description it appears that this writeback isn't happening,
> > > or isn't working.  How come?
> >
> >  - filesystems A and B
> >  - a write to A ends up as a write to B
> >  - dirty pages in A manage to go over the dirty threshold
> >  - page writeback is started from A
> >  - this triggers writeback for a couple of pages in B
> >  - that writeback finishes normally, but dirty+writeback pages are
> >    still over the threshold
> >  - balance_dirty_pages() in B gets stuck, and nothing ever moves
> >    after this
> >
> > At least this is my theory for what happens.
>
> Is B a real filesystem?

Yes.

> If so, writes to B will decrease the dirty memory threshold.

Yes, but not by enough.  Say A dirties 1100 pages and the limit is
1000.  Some of A's pages are queued for writeback (it doesn't matter
how many).  B writes back one page; 1099 dirty pages remain in A and
zero in B.  balance_dirty_pages() for B doesn't know that there's
nothing more to write back for B; it just sits there waiting for those
1099 pages, which will never get written.
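To make the arithmetic concrete, here is a minimal user space model of
the accounting (just an illustration of the logic with made-up names,
not the real mm/page-writeback.c code):

/*
 * Toy model of the deadlock.  Hypothetical names; the real kernel
 * keeps these counters in global page state, not in two variables.
 */
#include <stdio.h>

#define DIRTY_THRESH 1000

static long nr_dirty_A = 1099;  /* A's dirty pages, writable only via B */
static long nr_dirty_B = 0;     /* B itself is completely clean */

/* Writeback issued on behalf of B: cleans only B's own pages. */
static long writeback_B(void)
{
	long written = nr_dirty_B;

	nr_dirty_B = 0;
	return written;
}

/* The shape of balance_dirty_pages() as seen from B. */
static void balance_dirty_pages_B(void)
{
	/* The check is against the GLOBAL dirty count... */
	while (nr_dirty_A + nr_dirty_B > DIRTY_THRESH) {
		/* ...but B can only clean its own pages. */
		if (writeback_B() == 0) {
			/*
			 * Nothing of B's left to clean.  The global
			 * count can only drop if A's pages get written,
			 * and that requires this function to return
			 * first.  The real code loops here in
			 * congestion_wait(), forever.
			 */
			printf("stuck: %ld dirty pages, all in A\n",
			       nr_dirty_A + nr_dirty_B);
			break;
		}
	}
}

int main(void)
{
	balance_dirty_pages_B();
	return 0;
}

The threshold check is global, but the only writeback B can issue is
for B's own pages, so once all the excess dirty pages belong to A, the
loop can never terminate from inside B.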
> The writeout code _should_ just sit there transferring dirtiness from
> A to B and cleaning pages via B, looping around, alternating between
> both.
>
> What does sysrq-t say?

This is the fuse daemon thread that got stuck.  There are lots of
others that are stuck on some ext3 mutex as a result of this.

fusexmp_fh_no D 40045401     0   527    493           533   495 (NOTLB)
088d55f8 00000001 00000000 08dcfb14 0805d8cb 08a09b78 088d55f8 08dc8000
08dc8000 08dcfb3c 0805a38a 08a09680 088d5100 08dcfb2c 08dc8000 08dc8000
0847c300 088d5100 08a09680 08dcfb94 08182fe6 08a09680 088d5100 08a09680
Call Trace:
08dcfb00:  [<0805d8cb>] switch_to_skas+0x3b/0x83
08dcfb18:  [<0805a38a>] _switch_to+0x49/0x99
08dcfb40:  [<08182fe6>] schedule+0x246/0x547
08dcfb98:  [<08183a03>] schedule_timeout+0x4e/0xb6
08dcfbcc:  [<08183991>] io_schedule_timeout+0x11/0x20
08dcfbd4:  [<080a0cf2>] congestion_wait+0x72/0x87
08dcfc04:  [<0809c693>] balance_dirty_pages+0xa8/0x153
08dcfc5c:  [<0809c7bf>] balance_dirty_pages_ratelimited_nr+0x43/0x45
08dcfc68:  [<080992b5>] generic_file_buffered_write+0x3e3/0x6f5
08dcfd20:  [<0809988e>] __generic_file_aio_write_nolock+0x2c7/0x5dd
08dcfda8:  [<08099cb6>] generic_file_aio_write+0x55/0xc7
08dcfddc:  [<080ea1e6>] ext3_file_write+0x39/0xaf
08dcfe04:  [<080b060b>] do_sync_write+0xd8/0x10e
08dcfebc:  [<080b06e3>] vfs_write+0xa2/0x1cb
08dcfeec:  [<080b09b8>] sys_pwrite64+0x65/0x69
08dcff10:  [<0805dd54>] handle_syscall+0x90/0xbc
08dcff64:  [<0806d56c>] handle_trap+0x27/0x121
08dcff8c:  [<0806dc65>] userspace+0x1de/0x226
08dcffe4:  [<0805da19>] fork_handler+0x76/0x88
08dcfffc:  [] 0xd4cf0007

/proc/vmstat:

nr_anon_pages 668
nr_mapped 3168
nr_file_pages 5191
nr_slab_reclaimable 173
nr_slab_unreclaimable 494
nr_page_table_pages 65
nr_dirty 2174
nr_writeback 10
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
pgpgin 10955
pgpgout 421091
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_normal 268761
pgfree 269709
pgactivate 128287
pgdeactivate 31253
pgfault 237350
pgmajfault 4340
pgrefill_dma 0
pgrefill_normal 127899
pgsteal_dma 0
pgsteal_normal 46892
pgscan_kswapd_dma 0
pgscan_kswapd_normal 47104
pgscan_direct_dma 0
pgscan_direct_normal 36544
pginodesteal 0
slabs_scanned 2048
kswapd_steal 25083
kswapd_inodesteal 335
pageoutrun 656
allocstall 423
pgrotated 0

Breakpoint 3, balance_dirty_pages (mapping=0xa01feb0)
    at mm/page-writeback.c:202
202                     dirty_exceeded = 1;
(gdb) p dirty_thresh
$1 = 2113
(gdb)

For completeness' sake, here's the backtrace of the stuck loopback
thread as well:

loop0         D BFFFE101     0   499      5           500    59 (L-TLB)
088cc578 00000001 00000000 09197c4c 0805d8cb 084fe6f8 088cc578 09190000
09190000 09197c74 0805a38a 084fe200 088cc080 09197c64 09190000 09190000
086d9c80 088cc080 084fe200 09197ccc 08182ab6 084fe200 088cc080 084fe200
Call Trace:
09197c38:  [<0805d8cb>] switch_to_skas+0x3b/0x83
09197c50:  [<0805a38a>] _switch_to+0x49/0x99
09197c78:  [<08182ab6>] schedule+0x246/0x547
09197cd0:  [<081834d3>] schedule_timeout+0x4e/0xb6
09197d04:  [<08183461>] io_schedule_timeout+0x11/0x20
09197d0c:  [<080a0c62>] congestion_wait+0x72/0x87
09197d3c:  [<0809c7e8>] throttle_vm_writeout+0x27/0x6a
09197d60:  [<0809faec>] shrink_zone+0xaf/0x103
09197d8c:  [<0809fbb2>] shrink_zones+0x72/0x8a
09197db0:  [<0809fc87>] try_to_free_pages+0xbd/0x185
09197dfc:  [<0809ba76>] __alloc_pages+0x155/0x335
09197e50:  [<080975eb>] find_or_create_page+0x85/0x99
09197e78:  [<0812785e>] do_lo_send_aops+0x8d/0x233
09197ee4:  [<08127c56>] lo_send+0x92/0x10d
09197f20:  [<08127ee6>] do_bio_filebacked+0x6d/0x74
09197f44:  [<081280e0>] loop_thread+0x89/0x188
09197f84:  [<0808a03a>] kthread+0xa7/0xab
09197fb4:  [<0806a0f1>] run_kernel_thread+0x41/0x50
09197fe0:  [<0805d975>] new_thread_handler+0x62/0x8b
09197ffc:  [<00000000>] nosmp+0xf7fb7000/0x14
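The loopback case is the same global-counter problem one level down,
in the allocator.  If I remember the 2.6.20 code correctly, the loop
that loop0 is spinning in looks roughly like this (a paraphrased
sketch from memory, not a verbatim copy of mm/page-writeback.c):

/* Paraphrased sketch of throttle_vm_writeout(); details from memory. */
void throttle_vm_writeout(void)
{
	long background_thresh;
	long dirty_thresh;

	for ( ; ; ) {
		get_dirty_limits(&background_thresh, &dirty_thresh, NULL);

		/* Allow some headroom above the normal dirty limit. */
		dirty_thresh += dirty_thresh / 10;

		if (global_page_state(NR_UNSTABLE_NFS) +
		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;

		congestion_wait(WRITE, HZ/10);
	}
}

Again the counters are global: loop0 waits for the writeback count to
drop, but the pages under writeback belong to the loop-mounted
filesystem, and they can only complete once loop0 itself makes
progress with this allocation.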
Miklos