Date: Sun, 11 Nov 2007 04:56:11 +0100 (CET)
From: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
To: Andrew Morton <akpm@linux-foundation.org>
cc: linux-kernel@vger.kernel.org, Peter Zijlstra <a.p.zijlstra@chello.nl>,
       WU Fengguang <wfg@mail.ustc.edu.cn>
Subject: Re: Temporary lockup on loopback block device
In-Reply-To: <Pine.LNX.4.64.0711110114300.18189@artax.karlin.mff.cuni.cz>
Message-ID: <Pine.LNX.4.64.0711110454050.26435@artax.karlin.mff.cuni.cz>
References: <Pine.LNX.4.64.0711102025440.17691@artax.karlin.mff.cuni.cz>
 <20071110145444.a6993df1.akpm@linux-foundation.org>
 <Pine.LNX.4.64.0711110114300.18189@artax.karlin.mff.cuni.cz>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4544
Lines: 106

On Sun, 11 Nov 2007, Mikulas Patocka wrote:

> On Sat, 10 Nov 2007, Andrew Morton wrote:
> 
> > On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> wrote:
> > 
> > > Hi
> > > 
> > > I am experiencing a transient lockup in 'D' state with a loopback device. It 
> > > happens when a process writes to a filesystem on loopback with a command like
> > > dd if=/dev/zero of=/s/fill bs=4k 
> > > 
> > > CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in 
> > > congestion_wait called from balance_dirty_pages.
> > > 
> > > After about 30 seconds, the lockup is gone and dd resumes, but it locks up 
> > > soon again.
> > > 
> > > I added a printk to the balance_dirty_pages
> > > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, 
> > > pages_written %d, write_chunk %d\n", nr_reclaimable, 
> > > global_page_state(NR_WRITEBACK), dirty_thresh, pages_written, 
> > > write_chunk);
> > > 
> > > and it shows this during the lockup:
> > > 
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, 
> > > pages_written 1021, write_chunk 1522
> > > 
> > > What apparently happens:
> > > 
> > > writeback_inodes syncs inodes only on the given wbc->bdi, however 
> > > balance_dirty_pages checks against global counts of dirty pages. So if 
> > > there's nothing to sync on a given device, but there are other dirty pages 
> > > so that the counts are over the limit, it will loop without doing any 
> > > work.
> > > 
> > > To reproduce it, you need a totally idle machine (no GUI, etc.) -- if 
> > > something writes to the backing device, it flushes the dirty pages 
> > > generated by the loopback and the lockup is gone. If you add the printk, don't 
> > > forget to stop klogd, otherwise logging would end the lockup.
> > 
> > erk.
> > 
> > > The hotfix (that I verified to work) is to not set wbc->bdi, so that all 
> > > devices are flushed ... but the code probably needs some redesign (i.e. 
> > > either account per-device and flush per-device, or account-global and 
> > > flush-global).
> > > 
> > > Mikulas
> > > 
> > > 
> > > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > > --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> > > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > > @@ -214,7 +214,6 @@
> > > 
> > > 	for (;;) {
> > > 		struct writeback_control wbc = {
> > > -			.bdi            = bdi,
> > > 			.sync_mode      = WB_SYNC_NONE,
> > > 			.older_than_this = NULL,
> > > 			.nr_to_write    = write_chunk,
> > 
> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem.  There's machinery for this which things like DM stacks use.
> 
> If you change the loopback backing device, you just turn this nicely 
> reproducible example into a subtle race condition that can happen whether 
> you use loopback or not. Consider what happens when a different process 
> dirties memory:
> 
> You have process "A" that dirtied a lot of pages on device "1" but has not 
> started writing them.
> You have process "B" that is trying to write to device "2", sees dirty 
> page count over limit, but can't do anything about it, because it is only 
> allowed to flush pages on device "2". --- so it endlessly loops.
> 
> If you want to keep the current flushing semantics, you have to audit 
> the whole kernel to make sure that whenever some process sees an over-limit 
> dirty page count, there is another process flushing the pages. Currently 
> that is not true: the "dd" process sees an over-limit count, but there is 
> no one writing.
> 
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
> 
> I will try 2.6.24-rc2, but I don't think the root cause of this went away. 
> Maybe you just reduced its probability.
> 
> Mikulas

So I compiled 2.6.24-rc2 and I no longer see any lock-ups. The writeback 
loop there doesn't depend on any global page count, so the above scenario 
can't happen. Good.
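
For illustration, here is a minimal user-space sketch of the scenario 
discussed above. It is a toy model, not kernel code: struct model, 
flush_dev() and both throttle_*() helpers are hypothetical names, and the 
per-device variant simply compares one device's dirty count against the 
same threshold, which is an assumption rather than the actual 2.6.24 logic. 
It only shows why a global check combined with a per-device flush can spin 
without making progress.

```c
#include <assert.h>

#define NDEV 2

/* Hypothetical toy state: dirty pages per device plus one threshold. */
struct model {
	long dirty[NDEV];
	long dirty_thresh;
};

/* Sum of dirty pages over all devices (stands in for the global count). */
static long global_dirty(const struct model *m)
{
	long sum = 0;
	for (int i = 0; i < NDEV; i++)
		sum += m->dirty[i];
	return sum;
}

/* Write back up to nr pages of ONE device; return pages actually written. */
static long flush_dev(struct model *m, int dev, long nr)
{
	assert(dev >= 0 && dev < NDEV);
	long written = m->dirty[dev] < nr ? m->dirty[dev] : nr;
	m->dirty[dev] -= written;
	return written;
}

/*
 * Global check, per-device flush (the 2.6.23 behaviour as described in
 * this thread). Returns the number of iterations that wrote nothing; the
 * real code would sleep in congestion_wait() on each of them.
 */
static int throttle_global_check(struct model *m, int dev,
				 long write_chunk, int max_iters)
{
	int stalled = 0;
	for (int i = 0; i < max_iters; i++) {
		if (global_dirty(m) <= m->dirty_thresh)
			break;
		if (flush_dev(m, dev, write_chunk) == 0)
			stalled++;
	}
	return stalled;
}

/*
 * Per-device check, per-device flush (assumed simplification): the loop
 * condition only looks at pages this caller is able to flush.
 */
static int throttle_per_device_check(struct model *m, int dev,
				     long write_chunk, int max_iters)
{
	int stalled = 0;
	for (int i = 0; i < max_iters; i++) {
		if (m->dirty[dev] <= m->dirty_thresh)
			break;
		if (flush_dev(m, dev, write_chunk) == 0)
			stalled++;
	}
	return stalled;
}
```

With the numbers from the printk output (3099 dirty pages, all on device 0, 
dirty_thresh 2985, write_chunk 1522), a writer restricted to device 1 stalls 
on every iteration of throttle_global_check(), while 
throttle_per_device_check() returns immediately.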

Mikulas