Date: Thu, 28 Jun 2007 08:35:48 +1000
From: David Chinner
To: Chris Mason
Cc: Nick Piggin, David Chinner, Nick Piggin, Linux Kernel Mailing List, Linux Memory Management List, linux-fsdevel@vger.kernel.org
Subject: Re: [RFC] fsblock
Message-ID: <20070627223548.GS989688@sgi.com>
References: <20070624014528.GA17609@wotan.suse.de> <20070626030640.GM989688@sgi.com> <46808E1F.1000509@yahoo.com.au> <20070626092309.GF31489@sgi.com> <20070626123449.GM14224@think.oraclecorp.com> <20070627053245.GA6033@wotan.suse.de> <20070627115056.GW14224@think.oraclecorp.com>
In-Reply-To: <20070627115056.GW14224@think.oraclecorp.com>

On Wed, Jun 27, 2007 at 07:50:56AM -0400, Chris Mason wrote:
> On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:
> > On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> > > On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > > > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
> > >
> > > [ ... fsblocks vs extent range mapping ]
> > >
> > > > iomaps can double as range locks simply because iomaps are
> > > > expressions of ranges within the file.
> > > > Seeing as you can only access a given range exclusively to modify
> > > > it, inserting an empty mapping into the tree as a range lock gives
> > > > an effective method of allowing safe parallel reads, writes and
> > > > allocation into the file.
> > > >
> > > > The fsblocks and the vm page cache interface cannot be used to
> > > > facilitate this because a radix tree is the wrong type of tree to
> > > > store this information in. A sparse, range based tree (e.g. btree)
> > > > is the right way to do this and it matches very well with
> > > > a range based API.
> > >
> > > I'm really not against the extent based page cache idea, but I kind of
> > > assumed it would be too big a change for this kind of generic setup. At
> > > any rate, if we'd like to do it, it may be best to ditch the idea of
> > > "attach mapping information to a page", and switch to "lookup mapping
> > > information and range locking for a page".
> >
> > Well the get_block equivalent API is extent based one now, and I'll
> > look at what is required in making map_fsblock a more generic call
> > that could be used for an extent-based scheme.
> >
> > An extent based thing IMO really isn't appropriate as the main generic
> > layer here though. If it is really useful and popular, then it could
> > be turned into generic code and sit along side fsblock or underneath
> > fsblock...
>
> Lets look at a typical example of how IO actually gets done today,
> starting with sys_write():
>
> sys_write(file, buffer, 1MB)
> for each page:
>     prepare_write()
>         allocate contiguous chunks of disk
>         attach buffers
>     copy_from_user()
>     commit_write()
>         dirty buffers
>
> pdflush:
>     writepages()
>         find pages with contiguous chunks of disk
>         build and submit large bios
>
> So, we replace prepare_write and commit_write with an extent based api,
> but we keep the dirty each buffer part. writepages has to turn that
> back into extents (bio sized), and the result is completely full of dark
> dark corner cases.

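As a rough illustration of the writepages() step quoted above, here is a
minimal user-space sketch in Python (not kernel code) of turning per-block
dirty state back into extents. The Extent type and coalesce_dirty_blocks()
are hypothetical names invented for this example, and it assumes each dirty
block is described by a (file block, disk block) pair.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Extent(NamedTuple):
    """Hypothetical extent record, for illustration only."""
    file_block: int   # first file block covered by the extent
    disk_block: int   # first disk block backing it
    length: int       # number of blocks in the extent


def coalesce_dirty_blocks(dirty: Iterable[Tuple[int, int]]) -> List[Extent]:
    """Merge (file_block, disk_block) pairs into contiguous extents.

    Two blocks join the same extent only if they are adjacent both in the
    file and on disk. This is the check a writepages-style pass has to
    repeat for every dirty buffer it finds.
    """
    extents: List[Extent] = []
    for file_block, disk_block in sorted(dirty):
        if extents:
            last = extents[-1]
            if (file_block == last.file_block + last.length
                    and disk_block == last.disk_block + last.length):
                # Adjacent in the file and on disk: grow the current extent.
                extents[-1] = last._replace(length=last.length + 1)
                continue
        # Hole in the file or discontiguous on disk: start a new extent.
        extents.append(Extent(file_block, disk_block, 1))
    return extents
```

Even this toy version has to treat holes in the file and discontiguous disk
allocation as separate cases, which hints at where the corner cases come
from once locking, partial pages and state changes are added.
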
Yup - I've been on the painful end of those dark corner cases several
times in the last few months.

It's also worth pointing out that mpage_readpages() already works on
an extent basis - it overloads bufferheads to provide a "map_bh" that
can point to a range of blocks in the same state. The code then iterates
the map_bh range a page at a time building bios (i.e. not even using
buffer heads) from that map......

> I do think fsblocks is a nice cleanup on its own, but Dave has a good
> point that it makes sense to look for ways generalize things even more.

*nod*

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
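
The map_bh pattern described above (one mapping that covers a range of
blocks, walked a page at a time to build bios) can be modelled roughly as
follows. This is a user-space Python sketch, not the kernel implementation:
MapRange, Bio, build_read_bios() and the map_blocks callback are all
hypothetical names, and it assumes one block per page to keep the
arithmetic simple.

```python
from typing import Callable, List, NamedTuple, Optional


class MapRange(NamedTuple):
    """Hypothetical stand-in for a map_bh that covers a range of blocks."""
    file_block: int   # first file block the mapping covers
    disk_block: int   # disk block backing file_block
    length: int       # number of blocks in the same state


class Bio(NamedTuple):
    """Hypothetical stand-in for a bio: one contiguous disk I/O."""
    disk_block: int   # first disk block of the I/O
    nr_pages: int     # pages (== blocks in this model) in the I/O


def build_read_bios(first_page: int, nr_pages: int,
                    map_blocks: Callable[[int], MapRange]) -> List[Bio]:
    """Walk pages one at a time, building bios from a cached range mapping.

    map_blocks(block) is assumed to return a MapRange containing `block`.
    """
    bios: List[Bio] = []
    cached: Optional[MapRange] = None
    for page in range(first_page, first_page + nr_pages):
        # Only ask the filesystem again once the cached range runs out.
        if cached is None or not (
                cached.file_block <= page < cached.file_block + cached.length):
            cached = map_blocks(page)
        disk = cached.disk_block + (page - cached.file_block)
        if bios and bios[-1].disk_block + bios[-1].nr_pages == disk:
            # Contiguous on disk with the previous page: extend the bio.
            bios[-1] = bios[-1]._replace(nr_pages=bios[-1].nr_pages + 1)
        else:
            bios.append(Bio(disk, 1))
    return bios
```

In this model the mapping callback runs once per extent rather than once per
page, and no per-block state is attached to the pages at all.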