Date: Wed, 25 Mar 2009 20:57:47 +0100
From: Jens Axboe
To: Ric Wheeler
Cc: Jeff Garzik, Linus Torvalds, Theodore Tso, Ingo Molnar, Alan Cox,
    Arjan van de Ven, Andrew Morton, Peter Zijlstra, Nick Piggin,
    David Rees, Jesper Krogh, Linux Kernel Mailing List
Subject: Re: Linux 2.6.29
Message-ID: <20090325195747.GC27476@kernel.dk>
In-Reply-To: <49CA8ADA.3040709@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Mar 25 2009, Ric Wheeler wrote:
> Jens Axboe wrote:
>> On Wed, Mar 25 2009, Jeff Garzik wrote:
>>
>>> Jens Axboe wrote:
>>>
>>>> On Tue, Mar 24 2009, Jeff Garzik wrote:
>>>>
>>>>> Linus Torvalds wrote:
>>>>>
>>>>>> But I really don't understand filesystem people who think that
>>>>>> "fsck" is the important part, regardless of whether the data is
>>>>>> valid or not. That's just stupid and _obviously_ bogus.
>>>>>>
>>>>> I think I can understand that point of view, at least:
>>>>>
>>>>> More customers complain about hours-long fsck times than they do
>>>>> about silent data corruption of non-fsync'd files.
>>>>>
>>>>>> The point is, if you write your metadata earlier (say, every 5
>>>>>> sec) and the real data later (say, every 30 sec), you're
>>>>>> actually MORE LIKELY to see corrupt files than if you try to
>>>>>> write them together.
>>>>>>
>>>>>> And if you write your data _first_, you're never going to see
>>>>>> corruption at all.
>>>>>>
>>>>> Amen.
>>>>>
>>>>> And, personal filesystem pet peeve: please encourage proper
>>>>> FLUSH CACHE use to give users the data guarantees they deserve.
>>>>> Linux's sync(2) and fsync(2) (and fdatasync, etc.) should poke
>>>>> the block layer to guarantee a media write.
>>>>>
>>>> fsync already does that, at least if you have barriers enabled on your
>>>> drive.
>>>>
>>> Erm, no, you don't enable barriers on your drive, they are not a
>>> hardware feature. You enable barriers via your filesystem.
>>>
>>
>> Thanks for the lesson Jeff, I'm obviously not aware how that stuff
>> works...
>>
>>> Stating "fsync already does that" borders on false, because that assumes
>>> (a) the user has a fs that supports barriers
>>> (b) the user is actually aware of a 'barriers' mount option and what
>>> it means
>>> (c) the user has turned on an option normally defaulted to off.
>>>
>>> Or in other words, it pretty much never happens.
>>>
>>
>> That is true, except if you use xfs/ext4. And this discussion is fine,
>> as was the one a few months back that got ext4 to enable barriers by
>> default. If I had submitted patches to do that back in 2001/2 when the
>> barrier stuff was written, I would have been shot for introducing such a
>> slow down. After people found out that it just wasn't something silly,
>> then you have a way to enable it.
>>
>> I'd still wager that most people would rather have a 'good enough
>> fsync' on their desktops than incur the penalty of barriers or write
>> through caching. I know I do.
>>
>>> Furthermore, a blatantly obvious place to flush data to media --
>>> fsync(2), fdatasync(2) and sync_file_range(2) -- should cause the
>>> block layer to issue a FLUSH CACHE for __any__ filesystem. But that
>>> doesn't happen either.
>>>
>>> So, no, for 95% of Linux users, fsync does _not_ already do that. If
>>> you are lucky enough to use XFS or ext4, you're covered. That's it.
>>>
>>
>> The point is that you need to expose this choice somewhere, and that
>> 'somewhere' isn't manually editing fstab and enabling barriers or
>> fsync-for-real. And it should be easier.
>>
>> Another problem is that FLUSH_CACHE sucks. Really. And not just on
>> ext3/ordered, generally. Write a 50 byte file, fsync, flush cache and
>> wait for the world to finish. Pretty hard to teach people to use a nicer
>> fdatasync(), when the majority of the cost now becomes flushing the
>> cache of that 1TB drive you happen to have 8 partitions on. Good luck
>> with that.
>>
>>
> And, as I am sure that you do know, to add insult to injury, FLUSH_CACHE
> is per device (not file system).
>
> When you issue an fsync() on a disk with multiple partitions, you will
> flush the data for all of its partitions from the write cache....

Exactly, that's what my (vague) 8 partition reference was for :-)
A range flush would be so much more palatable.

-- 
Jens Axboe
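
For reference, the small-write case described above ("write a 50 byte
file, fsync") can be sketched in a few lines. This is only an
illustration and not code from the thread: the helper name
timed_small_write and its parameters are hypothetical. The time it
reports is how long the sync call took to return. It says nothing about
whether the data actually reached the media, since that depends on the
filesystem and barrier settings discussed above.

```python
import os
import tempfile
import time

# Hypothetical helper for illustration only; not from the original thread.
def timed_small_write(directory, sync=os.fsync, size=50):
    """Write `size` bytes to a new file in `directory`, call `sync` on
    its descriptor, and return (path, seconds spent in the sync call)."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"x" * size)
        start = time.perf_counter()
        sync(fd)  # os.fsync or os.fdatasync
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return path, elapsed


if __name__ == "__main__":
    # os.fdatasync is not available on every platform; fall back to fsync.
    fdatasync = getattr(os, "fdatasync", os.fsync)
    with tempfile.TemporaryDirectory() as d:
        for name, call in (("fsync", os.fsync), ("fdatasync", fdatasync)):
            path, secs = timed_small_write(d, call)
            size = os.path.getsize(path)
            print(f"{name}: {secs * 1000:.3f} ms for {size} bytes")
```

To get a number that means anything, pass a directory on the disk and
filesystem you care about. The default temporary directory may well sit
on tmpfs, where the sync call is close to free.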