Date: Tue, 22 Sep 2009 18:59:41 -0700
From: Andrew Morton
To: Wu Fengguang
Cc: Chris Mason, Peter Zijlstra, "Li, Shaohua", linux-kernel@vger.kernel.org, richard@rsk.demon.co.uk, jens.axboe@oracle.com
Subject: Re: regression in page writeback
Message-Id: <20090922185941.1118e011.akpm@linux-foundation.org>
In-Reply-To: <20090923014500.GA11076@localhost>

On Wed, 23 Sep 2009 09:45:00 +0800 Wu Fengguang wrote:

> On Wed, Sep 23, 2009 at 09:28:32AM +0800, Andrew Morton wrote:
> > On Wed, 23 Sep 2009 09:17:58 +0800 Wu Fengguang wrote:
> >
> > > On Wed, Sep 23, 2009 at 08:54:52AM +0800, Andrew Morton wrote:
> > > > On Wed, 23 Sep 2009 08:22:20 +0800 Wu Fengguang wrote:
> > > >
> > > > > Jens' per-bdi writeback has another improvement.
> > > > > In 2.6.31, when superblocks A and B both have 100000 dirty pages,
> > > > > it will first exhaust A's 100000 dirty pages before going on to
> > > > > sync B's.
> > > >
> > > > That would only be true if someone broke 2.6.31.  Did they?
> > > >
> > > > SYSCALL_DEFINE0(sync)
> > > > {
> > > >         wakeup_pdflush(0);
> > > >         sync_filesystems(0);
> > > >         sync_filesystems(1);
> > > >         if (unlikely(laptop_mode))
> > > >                 laptop_sync_completion();
> > > >         return 0;
> > > > }
> > > >
> > > > the sync_filesystems(0) is supposed to non-blockingly start IO against
> > > > all devices.  It used to do that correctly.  But people mucked with it
> > > > so perhaps it no longer does.
> > >
> > > I'm referring to writeback_inodes(). Each invocation of which (to sync
> > > 4MB) will do the same iteration over superblocks A => B => C ... So if
> > > A has dirty pages, it will always be served first.
> > >
> > > So if wbc->bdi == NULL (which is true for kupdate/background sync), it
> > > will have to first exhaust A before going on to B and C.
> >
> > But that works OK.  We fill the first device's queue, then it gets
> > congested and sync_sb_inodes() does nothing and we advance to the next
> > queue.
>
> So in common cases "exhaust" is a bit exaggerated, but A does receive
> much more opportunity than B. Computation resources for IO submission
> are unbalanced for A, and there are pointless overheads in rechecking A.

That's unquantified handwaving.  One CPU can do a *lot* of IO.

> > If a device has more than a queue's worth of dirty data then we'll
> > probably leave some of that dirty memory un-queued, so there's some
> > lack of concurrency in that situation.
>
> Good insight.

It was wrong.  See the other email.

> That possibly explains one major factor of the
> performance gains of Jens' per-bdi writeback.

I've yet to see any believable and complete explanation for these
gains.  I've asked about these things multiple times and nothing happened.
I suspect that what happened over time was that previously-working code
got broken, then later people noticed the breakage but failed to
analyse and fix it in favour of simply ripping everything out and
starting again.

So for the want of analysing and fixing several possible regressions,
we've tossed away some very sensitive core kernel code which had tens of
millions of machine-years testing.  I find this incredibly rash.