Date: Sun, 11 Nov 2007 04:56:11 +0100 (CET)
From: Mikulas Patocka
To: Andrew Morton
cc: linux-kernel@vger.kernel.org, Peter Zijlstra, WU Fengguang
Subject: Re: Temporary lockup on loopback block device
References: <20071110145444.a6993df1.akpm@linux-foundation.org>

On Sun, 11 Nov 2007, Mikulas Patocka wrote:

> On Sat, 10 Nov 2007, Andrew Morton wrote:
>
> > On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka wrote:
> >
> > > Hi
> > >
> > > I am experiencing a transient lockup in 'D' state with loopback device. It
> > > happens when process writes to a filesystem in loopback with command like
> > > dd if=/dev/zero of=/s/fill bs=4k
> > >
> > > CPU is idle, disk is idle too, yet the dd process is waiting in 'D' in
> > > congestion_wait called from balance_dirty_pages.
> > >
> > > After about 30 seconds, the lockup is gone and dd resumes, but it locks up
> > > soon again.
> > >
> > > I added a printk to the balance_dirty_pages
> > > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d,
> > > pages_written %d, write_chunk %d\n", nr_reclaimable,
> > > global_page_state(NR_WRITEBACK), dirty_thresh, pages_written,
> > > write_chunk);
> > >
> > > and it shows this during the lockup:
> > >
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985,
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985,
> > > pages_written 1021, write_chunk 1522
> > > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985,
> > > pages_written 1021, write_chunk 1522
> > >
> > > What apparently happens:
> > >
> > > writeback_inodes syncs inodes only on the given wbc->bdi, however
> > > balance_dirty_pages checks against global counts of dirty pages. So if
> > > there's nothing to sync on a given device, but there are other dirty pages
> > > so that the counts are over the limit, it will loop without doing any
> > > work.
> > >
> > > To reproduce it, you need totally idle machine (no GUI, etc.) -- if
> > > something writes to the backing device, it flushes the dirty pages
> > > generated by the loopback and the lockup is gone. If you add printk, don't
> > > forget to stop klogd, otherwise logging would end the lockup.
> >
> > erk.
> >
> > > The hotfix (that I verified to work) is to not set wbc->bdi, so that all
> > > devices are flushed ... but the code probably needs some redesign (i.e.
> > > either account per-device and flush per-device, or account-global and
> > > flush-global).
> > >
> > > Mikulas
> > >
> > >
> > > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > > --- ../x/linux-2.6.23.1/mm/page-writeback.c 2007-10-12 18:43:44.000000000 +0200
> > > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > > @@ -214,7 +214,6 @@
> > >
> > >          for (;;) {
> > >                  struct writeback_control wbc = {
> > > -                        .bdi = bdi,
> > >                          .sync_mode = WB_SYNC_NONE,
> > >                          .older_than_this = NULL,
> > >                          .nr_to_write = write_chunk,
> >
> > Arguably we just have the wrong backing-device here, and what we should do
> > is to propagate the real backing device's pointer through up into the
> > filesystem. There's machinery for this which things like DM stacks use.
>
> If you change loopback backing-device, you just turn this nicely
> reproducible example into a subtle race condition that can happen whenever
> you use loopback or not. Think, what happens when different process
> dirties memory:
>
> You have process "A" that dirtied a lot of pages on device "1" but has not
> started writing them.
> You have process "B" that is trying to write to device "2", sees dirty
> page count over limit, but can't do anything about it, because it is only
> allowed to flush pages on device "2". --- so it endlessly loops.
>
> If you want to use the current flushing semantics, you just have to audit
> the whole kernel to make sure that if some process sees over-limit dirty
> page count, there is another process that is flushing the pages. Currently
> it is not true, the "dd" process sees over-limit count, but there is
> no-one writing.
>
> > I wonder if the post-2.6.23 changes happened to make this problem go away.
>
> I will try 2.6.24-rc2, but I don't think the root cause of this went away.
> Maybe you just reduced probability.
>
> Mikulas

So I compiled it and I don't see any more lock-ups. The writeback loop
doesn't depend on any global page count, so the above scenario can't happen
here. Good.

Mikulas
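
The mismatch described in the thread (a global dirty-page check combined
with a per-device flush) can be sketched as a small user-space toy model.
This is not kernel code: toy_state, toy_writeback, toy_balance and the
"bdi < 0 means all devices" convention are hypothetical names invented for
this illustration. The numbers come from the printk output quoted above;
which device holds the dirty pages is an assumption.

```c
#include <assert.h>

#define NR_DEVICES 2

/* Toy model only: hypothetical names, not the kernel's data structures. */
struct toy_state {
    long dirty[NR_DEVICES]; /* dirty pages per backing device */
    long dirty_thresh;      /* global dirty-page limit */
};

/* Sum of dirty pages over all devices (the "global count"). */
static long toy_total_dirty(const struct toy_state *s)
{
    long total = 0;
    for (int d = 0; d < NR_DEVICES; d++)
        total += s->dirty[d];
    return total;
}

/*
 * Write back up to nr_to_write pages. If bdi >= 0, only that device is
 * flushed (like setting wbc->bdi); if bdi < 0, every device is flushed
 * (like the hotfix that leaves wbc->bdi unset). Returns pages written.
 */
static long toy_writeback(struct toy_state *s, int bdi, long nr_to_write)
{
    long written = 0;

    assert(bdi < NR_DEVICES);
    for (int d = 0; d < NR_DEVICES && written < nr_to_write; d++) {
        if (bdi >= 0 && d != bdi)
            continue;
        long n = s->dirty[d];
        if (n > nr_to_write - written)
            n = nr_to_write - written;
        s->dirty[d] -= n;
        written += n;
    }
    return written;
}

/*
 * Loop until the global count drops to the limit. Returns the number of
 * flush passes needed, or -1 if max_iters passes made no difference
 * (the toy equivalent of sitting in congestion_wait with nothing to do).
 */
static int toy_balance(struct toy_state *s, int bdi, long write_chunk,
                       int max_iters)
{
    for (int i = 0; i < max_iters; i++) {
        if (toy_total_dirty(s) <= s->dirty_thresh)
            return i;
        toy_writeback(s, bdi, write_chunk);
    }
    return -1;
}
```

With dirty = {3099, 0}, dirty_thresh = 2985 and write_chunk = 1522, calling
toy_balance() restricted to device 1 never gets under the limit, because
device 1 has nothing to flush while the global count stays over. Calling it
with bdi < 0 finishes after a single pass.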