Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757349Ab1BPAVr (ORCPT );
	Tue, 15 Feb 2011 19:21:47 -0500
Received: from kroah.org ([198.145.64.141]:39367 "EHLO coco.kroah.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1757336Ab1BPAVn (ORCPT );
	Tue, 15 Feb 2011 19:21:43 -0500
X-Mailbox-Line: From gregkh@clark.kroah.org Tue Feb 15 16:14:37 2011
Message-Id: <20110216001437.822305986@clark.kroah.org>
User-Agent: quilt/0.48-11.2
Date: Tue, 15 Feb 2011 16:13:05 -0800
From: Greg KH
To: linux-kernel@vger.kernel.org, stable@kernel.org
Cc: stable-review@kernel.org, torvalds@linux-foundation.org,
	akpm@linux-foundation.org, alan@lxorguk.ukuu.org.uk,
	Jan Kara, Wu Fengguang, Johannes Weiner, Dave Chinner,
	Christoph Hellwig, Jan Engelhardt, Jens Axboe
Subject: [128/272] writeback: avoid livelocking WB_SYNC_ALL writeback
In-Reply-To: <20110216001559.GA31413@kroah.com>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4418
Lines: 123

2.6.37-stable review patch.  If anyone has any objections, please let us know.

------------------

From: Jan Kara

commit b9543dac5bbc4aef0a598965b6b34f6259ab9a9b upstream.

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX.  The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we
easily end up with non-positive nr_to_write after the function returns,
if the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment.

When nr_to_write is <= 0 wb_writeback() decides we need another round of
writeback but this is wrong in some cases!  For example when a single
large file is continuously dirtied, we would never finish syncing it
because each pass would be able to write MAX_WRITEBACK_PAGES and inode
dirty timestamp never gets updated (as inode is never completely clean).
Thus __writeback_inodes_sb() would write the redirtied inode again and
again.

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode.  We
do not need nr_to_write in WB_SYNC_ALL mode anyway since
write_cache_pages() does livelock avoidance using page tagging in
WB_SYNC_ALL mode.

This makes wb_writeback() call __writeback_inodes_sb() only once on
WB_SYNC_ALL.  The latter function won't livelock because it works on

- a finite set of files by doing queue_io() once at the beginning
- a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.

[fengguang.wu@intel.com: fix locking comment]
Signed-off-by: Jan Kara
Signed-off-by: Wu Fengguang
Cc: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Jan Engelhardt
Cc: Jens Axboe
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

---
 fs/fs-writeback.c |   27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -641,6 +642,24 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}
 
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 * wb_writeback()
+	 *     __writeback_inodes_sb()     <== called only once
+	 *         write_cache_pages()     <== called once for each inode
+	 *           (quickly) tag currently dirty pages
+	 *           (maybe slowly) sync all tagged pages
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;
+
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
 		/*
@@ -667,7 +686,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -677,8 +696,8 @@ static long wb_writeback(struct bdi_writ
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -693,7 +712,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
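
The changelog refers to a test program at http://lkml.org/lkml/2010/10/24/154
that could stall sync forever before this fix.  The sketch below is not that
program; it is only a minimal, hypothetical reproducer of the scenario the
changelog describes (the file path and write size are made up): a child
process keeps redirtying one large file while the parent calls sync().
Without the fix, each WB_SYNC_ALL pass writes at most MAX_WRITEBACK_PAGES of
the never-clean inode and the inode is requeued again and again; with the
fix, the single page-tagged pass bounds the work and sync() returns.

/*
 * Hypothetical reproducer sketch -- NOT the program from the lkml.org
 * link above.  File name and size are invented for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd, i;

	memset(buf, 'x', sizeof(buf));

	fd = open("/tmp/dirty-file", O_CREAT | O_WRONLY, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (fork() == 0) {
		/*
		 * Child: rewrite ~128MB over and over so the inode is
		 * never completely clean and keeps getting redirtied.
		 */
		for (;;) {
			lseek(fd, 0, SEEK_SET);
			for (i = 0; i < 32768; i++)
				if (write(fd, buf, sizeof(buf)) < 0)
					_exit(1);
		}
	}

	/* Parent: with the patched kernel this returns in bounded time. */
	sync();
	printf("sync() returned\n");
	return 0;
}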