Date: Mon, 20 Apr 2009 16:42:42 +0200
From: Jan Kara
To: Chris Mason
Cc: Linus Torvalds, "Theodore Ts'o", Linux Kernel Developers List,
	Ext4 Developers List
Subject: Re: [PATCH] Add ext3 data=guarded mode
Message-ID: <20090420144241.GF14699@duck.suse.cz>
References: <1239816159-6868-1-git-send-email-chris.mason@oracle.com>
	<1239910921.21233.98.camel@think.oraclecorp.com>
	<20090420134423.GE14699@duck.suse.cz>
	<1240237105.16213.41.camel@think.oraclecorp.com>
In-Reply-To: <1240237105.16213.41.camel@think.oraclecorp.com>

On Mon 20-04-09 10:18:25, Chris Mason wrote:
> On Mon, 2009-04-20 at 15:44 +0200, Jan Kara wrote:
> >   Hi Chris,
> >
> >   On Thu 16-04-09 15:42:01, Chris Mason wrote:
> > >
> > > ext3 data=ordered mode makes sure that data blocks are on disk before
> > > the metadata that references them, which avoids files full of garbage
> > > or previously deleted data after a crash.  It does this by adding
> > > every dirty buffer onto a list of things that must be written before
> > > a commit.
> > >
> > > This makes every fsync write out all the dirty data on the entire FS,
> > > which has high latencies and is generally much more expensive than it
> > > needs to be.
> > >
> > > Another way to avoid exposing stale data after a crash is to wait
> > > until after the data buffers are written before updating the on-disk
> > > record of the file's size.  If we crash before the data IO is done,
> > > i_size doesn't yet include the new blocks and no stale data is
> > > exposed.
> > >
> > > This patch adds the delayed i_size update to ext3, along with a new
> > > mount option (data=guarded) to enable it.  The basic mechanism works
> > > like this:
> > >
> > > * Change block_write_full_page to take an end_io handler as a
> > > parameter.  This allows us to make an end_io handler that queues
> > > buffer heads for a workqueue where the real work of updating the
> > > on-disk i_size is done.
> > >
> > > * Add an rbtree to the in-memory ext3 inode for tracking data=guarded
> > > buffer heads that are waiting to be sent to disk.
> > >
> > > * Add an ext3 guarded write_end call to add buffer heads for newly
> > > allocated blocks into the rbtree.  If we have a newly allocated block
> > > that is filling a hole inside i_size, this is done as an old style
> > > data=ordered write instead.
> > >
> > > * Add an ext3 guarded writepage call that uses a special buffer head
> > > end_io handler for buffers that are marked as guarded.  Again, if we
> > > find newly allocated blocks filling holes, they are sent through
> > > data=ordered instead of data=guarded.
> > >
> > > * When a guarded IO finishes, kick a per-FS workqueue to do the
> > > on-disk i_size updates.  The workqueue function must be very careful.
> > > We only update the on-disk i_size if all of the IO between the old
> > > on-disk i_size and the new on-disk i_size is complete.  This is why
> > > an rbtree is used to track the pending buffers; that way we can
> > > verify all of the IO is actually done.  The on-disk i_size is
> > > incrementally updated to the largest safe value every time an IO
> > > completes.
> > >
> > > * When we start tracking guarded buffers on a given inode, we put the
> > > inode into ext3's orphan list.  This way if we do crash, the file
> > > will be truncated back down to the on-disk i_size and we'll free any
> > > blocks that were not completely written.  The inode is removed from
> > > the orphan list only after all the guarded buffers are done.
> > >
> > > Signed-off-by: Chris Mason
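
[Illustrative aside: the completion rule in the last two bullets can be
modeled in a few lines of userspace C.  This is only a sketch of the idea,
not code from the patch -- a sorted list stands in for the rbtree here, and
every identifier below is made up:]

	/*
	 * Toy model of the "largest safe value" rule: pending guarded IOs
	 * are kept sorted by end offset (standing in for the rbtree), and
	 * the on-disk i_size may only advance across a contiguous prefix
	 * of completed entries, so it never covers data still in flight.
	 */
	#include <stdio.h>

	struct guarded_io {
		long long end;			/* i_size this IO makes safe */
		int done;
		struct guarded_io *next;
	};

	static struct guarded_io *pending;	/* sorted by ->end */
	static long long disk_isize;

	static void guarded_io_start(struct guarded_io *io)
	{
		struct guarded_io **p = &pending;

		/* insert in end-offset order */
		while (*p && (*p)->end < io->end)
			p = &(*p)->next;
		io->next = *p;
		*p = io;
	}

	static void guarded_io_done(struct guarded_io *io)
	{
		io->done = 1;
		/* pop every completed IO off the front, publish its size */
		while (pending && pending->done) {
			if (pending->end > disk_isize)
				disk_isize = pending->end;
			pending = pending->next;
		}
	}

	int main(void)
	{
		struct guarded_io a = { .end = 4096 }, b = { .end = 8192 };

		guarded_io_start(&a);
		guarded_io_start(&b);
		guarded_io_done(&b);	/* out of order: nothing safe yet */
		printf("disk i_size = %lld\n", disk_isize);	/* 0 */
		guarded_io_done(&a);	/* prefix complete: jump to 8192 */
		printf("disk i_size = %lld\n", disk_isize);	/* 8192 */
		return 0;
	}

[Running it prints 0 after the out-of-order completion and 8192 once the
prefix is complete, which is the "incrementally updated to the largest safe
value" behaviour described above.]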
> >   I've read the patch.  I don't think I've got all the subtleties, but
> > before diving into it more I'd like to ask why we do things in such a
> > complicated way.
>
> Thanks for reviewing things!
>
> >   Maybe I'm missing some issues, so let's see:
> > 1) If I got it right, hole filling goes through standard ordered mode,
> > so we can ignore such writes.  So why do we have a special writepage?
> > It should look just like the writepage for ordered mode, and we could
> > just tweak ext3_ordered_writepage() (probably renamed) to do:
> >
> > 	if (ext3_should_order_data(inode))
> > 		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > 					NULL, journal_dirty_data_fn);
> > 	else
> > 		err = walk_page_buffers(handle, page_bufs, 0, PAGE_CACHE_SIZE,
> > 					NULL, journal_dirty_guarded_data_fn);
>
> That would work.  My first writepage was more complex; it shrank as the
> patch evolved.  Another question is whether we want to use exactly the
> same writepage for guarded and ordered.  I've always thought
> data=ordered should only order new blocks...
  Hmm, true.  After my last change we already don't file data buffers in
ordered_writepage() if the page is fully mapped to disk, so doing this in
all cases is fine.  Actually, I'll soon write ext3_page_mkwrite() to do
the block allocation at page fault time, so after that we can get rid of
most of the code in ext3_ordered_writepage().

> > 2) Why is there any RB tree after all?  Since guarded writes are just
> > extending writes, we can have a linked list of guarded buffers.  We
> > always append at the end, and we update i_size if the current buffer
> > has no predecessor in the list.
>
> A guarded write is anything from write() that is past the disk i_size.
> lseek and friends mean it could happen in any order.
  Are you sure?  Looking at ext3_guarded_write_end():

...
+	if (test_clear_buffer_datanew(bh)) {
+		/*
+		 * if we're filling a hole inside i_size, we need to
+		 * fall back to the old style data=ordered
+		 */
+		if (offset < inode->i_size) {
+			ret = ext3_journal_dirty_data(handle, bh);
+			goto out;
+		}
...

  So it seems we always do an ordered write unless we are appending /
have blocks allocated.  You could have i_disksize in the check, but is it
really worth it?  IMO getting rid of the RB tree might be better ;)
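
[Illustrative aside: Jan's list-based alternative in (2) -- including the
carried-i_disksize refinement he spells out under (3) below -- might look
roughly like this userspace sketch; again the identifiers are made up, not
from any posted patch:]

	/*
	 * Toy model of the proposed list scheme: guarded writes only
	 * extend the file, so appending keeps the list sorted.  Each
	 * entry carries the i_disksize it would make safe.  On completion
	 * the head publishes its value; a non-head entry instead hands
	 * its value back to its predecessor, which publishes it later.
	 */
	#include <stdio.h>

	struct guarded_bh {
		long long new_disksize;	/* i_disksize this bh should publish */
		struct guarded_bh *prev, *next;
	};

	static struct guarded_bh *head, *tail;
	static long long i_disksize;

	static void guarded_add(struct guarded_bh *bh)
	{
		/* always appended at the end: guarded writes only extend */
		bh->prev = tail;
		bh->next = NULL;
		if (tail)
			tail->next = bh;
		else
			head = bh;
		tail = bh;
	}

	static void guarded_end_io(struct guarded_bh *bh)
	{
		if (bh == head)
			i_disksize = bh->new_disksize; /* no older IO pending */
		else if (bh->prev->new_disksize < bh->new_disksize)
			bh->prev->new_disksize = bh->new_disksize; /* hand back */

		/* unlink from the list */
		if (bh->prev)
			bh->prev->next = bh->next;
		else
			head = bh->next;
		if (bh->next)
			bh->next->prev = bh->prev;
		else
			tail = bh->prev;
	}

	int main(void)
	{
		struct guarded_bh a = { .new_disksize = 4096 };
		struct guarded_bh b = { .new_disksize = 8192 };

		guarded_add(&a);
		guarded_add(&b);
		guarded_end_io(&b);	/* out of order: a inherits 8192 */
		guarded_end_io(&a);
		printf("i_disksize = %lld\n", i_disksize);	/* 8192 */
		return 0;
	}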
> > 3) Currently truncate() does filemap_write_and_wait() - is it really
> > needed?  Each guarded bh could carry with it the i_disksize it should
> > update to when its IO is finished.  An extending truncate will just
> > update this i_disksize at the last member of the list (or update
> > i_disksize directly when the list is empty).
> >
> >   A shortening truncate will walk the list of guarded bh's, removing
> > from the list those beyond the new i_size, and then it will behave
> > like the extending truncate (this works even if the current i_disksize
> > is larger than the new i_size).  Note that before we get to
> > ext3_truncate(), mm walks all the pages beyond i_size and waits for
> > page writeback, so by the time ext3_truncate() is called, all the IO
> > is finished and dirty pages are canceled.
>
> The problem here was the disk i_size being updated by ext3_setattr
> before vmtruncate calls ext3_truncate().  So the guarded IO might
> wander in and change the i_disksize update done by setattr.
>
> It all made me a bit dizzy and I just tossed the write_and_wait in
> instead.
>
> At the end of the day, we're waiting for guarded writes only, and we
> probably would have ended up waiting on those exact same pages in
> vmtruncate anyway.  So I do agree we could avoid the write with more
> code, but is this really a performance critical section?
  Well, not really critical, but also not negligible - mainly because
with your approach we end up *submitting* new writes that could otherwise
just be canceled.  Without fdatawrite(), the data of short-lived files
need never reach the disk, similarly to writeback mode (OK, this is
connected with the fact that you actually don't have fdatawrite() before
ext3_truncate() in ext3_delete_inode(), and that's what initially puzzled
me).

> >   The IO-finished callback will update i_disksize to the carried
> > value if the buffer is the first in the list; otherwise it will copy
> > its value to the previous member of the list.
> >
> > 4) Do we have to call end_page_writeback() from the work queue?  That
> > could make IO completion times significantly longer on a good disk
> > array, couldn't it?
>
> My understanding is that XFS is doing something similar with the
> workqueue already, without big performance problems.
  OK.

> >   There is a way to solve this, I believe, although it might be too
> > hacky / complicated.  We have to update i_disksize before calling
> > end_page_writeback() because of truncate races, and generally for
> > filemap_fdatawrite() to work.  So what we could do is:
> >
> > guarded_end_io():
> > 	update i_disksize
> > 	call something like __mark_inode_dirty(inode, I_DIRTY_DATASYNC)
> > 	  but avoid calling ext3_dirty_inode(), or somehow make sure we
> > 	  immediately return from it
> > 	call end_buffer_async_write()
> > 	queue addition of the inode to the transaction / removal from
> > 	  the orphan list
>
> It could, but we end up with a list of inodes that must be logged
> before they can be freed.  This was a problem in the past (before the
> dirty_inode operation was added) because logging an inode is relatively
> expensive, and we have no mechanism to throttle them.
>
> In the past it led to deadlocks because kswapd would try to log all
> the dirty inodes, and someone else had the transaction pinned while
> waiting on kswapd to find free RAM.  We might be able to do better
> today, but I didn't want to cram that into this patch series as well.
  Ah, OK.  I didn't know this.  Anyway, if we find that getting rid of
the work queue is useful, we can do it later.  It would be a rather
local change.

								Honza
--
Jan Kara
SUSE Labs, CR