From: David Chinner
Subject: Re: [RFC] Ext3 online defrag
Date: Fri, 27 Oct 2006 11:32:52 +1000
Message-ID: <20061027013252.GP8394166@melbourne.sgi.com>
References: <200610250225.MAA23029@larry.melbourne.sgi.com>
	<20061025024257.GA23769@havoc.gtf.org>
	<20061025042753.GV8394166@melbourne.sgi.com>
	<20061025044844.GB32486@havoc.gtf.org>
	<20061025053823.GX8394166@melbourne.sgi.com>
	<20061025060142.GD32486@havoc.gtf.org>
	<20061025081137.GB8394166@melbourne.sgi.com>
	<20061025170052.GA19513@havoc.gtf.org>
	<20061026014020.GC8394166@melbourne.sgi.com>
	<20061026113722.GA23610@atrey.karlin.mff.cuni.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: David Chinner, Jeff Garzik, Barry Naujok, "'Dave Kleikamp'",
	"'Alex Tomas'", "'Theodore Tso'", linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org
Return-path:
To: Jan Kara
Content-Disposition: inline
In-Reply-To: <20061026113722.GA23610@atrey.karlin.mff.cuni.cz>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Thu, Oct 26, 2006 at 01:37:22PM +0200, Jan Kara wrote:
> > On Wed, Oct 25, 2006 at 01:00:52PM -0400, Jeff Garzik wrote:
> > We don't need to expose anything filesystem specific to userspace to
> > implement this. Online data movement (i.e. the defrag mechanism)
> > becomes something like:
> >
> > 	do {
> > 		get_free_list(dst_fd, location, len, list)
> > 		/* select extent to use */
> > 		alloc_from_list(dst_fd, list[X], off, len)
> > 	} while (ENOALLOC)
> > 	move_data(src_fd, dst_fd, off, len);
>
>   Up to the extent-selection point I can imagine we can be perfectly
> generic. With alloc_from_list() and move_data() it's not clear how well
> we can do with just a generic interface. Every filesystem needs to have
> some additional metadata to keep a list of data blocks. In the case of
> ext2/ext3/reiserfs this is not a negligible amount of space, and the
> placement of this metadata is important for performance.

Yes, the same can be said for XFS.
However, XFS's extent btree implementation uses readahead to hide a lot
of the latency involved with reading the extent map, and it only needs
to read it once per inode lifecycle.

> So either we focus only on data blocks and let the
> implementation of alloc_from_list() allocate metadata wherever it wants
> (but then we get suboptimal performance because there need not be space
> for indirect blocks close before our provided extent)

I think the first step would be to focus on data blocks using something
like the above. There are many steps to full filesystem defragmentation,
but data fragmentation is typically the most common symptom of
fragmentation that we see.

> or we allocate
> metadata from the provided list, but then we need some knowledge of the
> fs to know how much we should expect to spend on metadata and where
> this metadata should be placed.

That's the second step, I think. For example, we could count the blocks
used in a metadata structure (say, a block list), allocate a new chunk
like above, and then execute a "move_metadata()" type of operation,
which the filesystem does internally in a transactionally safe manner.
Once again: generic interface, filesystem-specific implementations.

> For example if you know that the indirect block
> for your interval is at block B, then you'd like to allocate somewhere
> close after this point, or to relocate that indirect block (and all the
> data it references). But for that you need to know you have something
> like indirect blocks => filesystem knowledge.

*nod*

This is far less of a problem with extent-based filesystems - coalescing
all the fragments into a single extent removes the need for indirect
blocks, and you get the extent list for free when you read the inode.
When we do have a fragmented file, XFS uses readahead to speed up btree
searching and reading, so it hides a lot of the latency overhead that
fragmented metadata can cause.
Either way, these lists can still be optimised by allocating a set of
contiguous blocks, copying the metadata into them and updating the
pointers to the new blocks. This can be done separately from the data
move, and really should be done after the data has been defragmented....

> So I think that to get this working, we also need some way to tell
> the program that if it wants to allocate some data, it also needs to
> account for this amount of metadata, and that some of it is already
> allocated in the given blocks...

If you want to do it all in one step. However, it's not quite that
simple for something like XFS. An allocation may require a btree split
(or three, actually), and the number of blocks required depends on the
height of the btrees. So we don't know how many blocks we'll need ahead
of time, and we'd have to reach deep into the allocator and abuse it
badly to do anything like this. It's not something I want to even
contemplate doing. :/

Also, we don't want to be mingling global metadata with inode-specific
metadata, so we don't want to put most of the new metadata blocks near
the extent we are putting the data into. That means I'd prefer to be
able to optimise metadata objects separately, e.g. rewrite a btree into
a single contiguous extent with the btree blocks laid out so the
readahead patterns result in sequential I/O.

The kernel would need to do this in XFS because we'd have to lock the
entire btree a block at a time, copy it and then issue a "swap btree"
transaction. Most other journalling filesystems will have similar
requirements, I think, for doing this online.... That's a very similar
concept to the move_data() interface...

> > I see substantial benefit moving forward from having filesystem
> > independent interfaces. Many features that filesystems implement
> > are common, and as time goes on the common feature set of the
> > different filesystems gets larger.
> > So why shouldn't we be
> > trying to make common operations generic so that every filesystem
> > can benefit from the latest and greatest tool?
>
> So you prefer to handle only the "data blocks" part of the problem and
> let the filesystem sort out metadata?

The filesystem should already be attempting to do the right thing with
the metadata blocks. If it can't, then I don't think we should
complicate the interface to handle this, as a separate metadata
optimisation pass can do it for us. Also, it would make the interface
harder to use for general applications that really only need to
guarantee that their data files are unfragmented (e.g. BitTorrent).

FWIW, the only user of metadata optimisation that I can see is defrag
applications. Hence I think it is best to keep it separate and let the
filesystem do best-effort metadata placement on all allocations,
whether they are directed by userspace or not. The block/extent lists
can then be further optimised as a separate phase of a defrag
application if the fs isn't able to get it right the first time....

No, wait, I just thought of another user - online shrinking of a
filesystem requires you to move lots of metadata in a similarly safe
manner.... :)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group