Date: Tue, 31 Mar 2009 09:15:06 -0700 (PDT)
From: Linus Torvalds
To: Ric Wheeler
Cc: Jens Axboe, Fernando Luis Vázquez Cao, Jeff Garzik, Christoph Hellwig,
    Theodore Tso, Ingo Molnar, Alan Cox, Arjan van de Ven, Andrew Morton,
    Peter Zijlstra, Nick Piggin, David Rees, Jesper Krogh,
    Linux Kernel Mailing List, chris.mason@oracle.com, david@fromorbit.com,
    tj@kernel.org
Subject: Re: [PATCH 1/7] block: Add block_flush_device()
In-Reply-To: <49D239A0.5080405@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 31 Mar 2009, Ric Wheeler wrote:
>
> The question is really what we do when you have a storage device in your box
> with a volatile write cache that does support flush or fua or similar.

Ok. Then you are talking about a different case - not EOPNOTSUPP.

[ Although it may be related in that maybe the admin can _force_ an
  EOPNOTSUPP thing for when he wants to disable any "write barrier implies
  flush" thing. IOW, we may end up with an _implementation_ detail where
  we overload a potential QUEUE_FLUSH_EOPNOTSUPP flag with two meanings -
  either "the driver told me a barrier isn't supported" or "the admin set
  that same flag by hand to disable barrier-related flush commands".

  But that's just an implementation detail, of course. We could use two
  different flags, we could do the flags at different levels, whatever. ]

> Using barriers & ordered transactions for these types of devices will
> give you a more reliable file system - less fsck time needed and better
> data integrity support for the (few?) applications that use fsync
> properly.

Sure. And it still shouldn't be the filesystem that _requires_ use of it.

The user (or low-level driver) may simply know better. The user may know
that he trusts the disk more than anything else, and prefers to not
actually emit the "FLUSH" command. Again, that's not something that the
filesystem should know about, or care about.

If the user trusts the disk subsystem and wants the performance, it's the
user's choice.

Even the _driver_ may know better. Knowing the kinds of firmware bugs
those drives have, it could even be a driver that simply black-lists
certain disks as having known-broken FLUSH commands. We have _CPUs_ that
corrupt memory on cache writeback ("wbinvd"), and those things are a lot
more tested than most drive firmware is.

Do you realize just how buggy some of those flash drives are? Some of them
will literally (a) report the wrong size and (b) lock up if you try to
read from the last sector. Oops. Do you really expect such crap to even
bother to honor some flush command? Good luck with that. They're designed
as a floppy replacement.

Now, you can tell me that I shouldn't put a reliable filesystem on an
el-cheapo flash drive and expect it to work, but I'm sorry, you're wrong.
People _are_ supposed to be able to move their data around, and the
filesystem shouldn't make judgement calls.

If you want judgement calls, call your mom. Not your filesystem.

For another example, the driver might be a driver for a high-end
battery-backup SCSI RAID controller. It knows that the controller _will_
write things out in the right order even in the case of a crash, but it
may also know that the controller _also_ has a way to force a flush to
actual hardware.

When do you want to force a flush? For hotplug events, for example. Maybe
the disks won't be _connected_ any more afterwards - then the battery
backup on the controller won't be helping, will it?

So there may well be a flush event thing, but it's really up to the admin
to decide whether it should be connected to a write barrier thing, or be a
separate admin activity.

Maybe the admin is extra careful and anal, and decides that he wants to
flush to disk platters _despite_ the battery backup. Maybe he doesn't
trust the card. Maybe he does. Whatever. The point is that the admin might
want to set a driver flag that does the flush or not, and it's totally not
a filesystem issue.

See? The filesystem has absolutely _no_place_ deciding these kinds of
things. The only thing it can ask for is "please serialize", but what
_level_ of serialization is simply not a filesystem decision to make.

And that very much includes the level of serialization that says "no
serialization what-so-ever, and please go absolutely crazy with your
cache". Not your choice.

So no, you can't have a pony.

		Linus
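
As an illustration of the layering argued for above (not code from this
thread), here is a minimal C sketch in which the filesystem only asks for
serialization and a per-queue policy decides what that request turns into.
Every name in it (`barrier_action`, `queue_policy`, `resolve_barrier`) is
hypothetical and not an existing kernel symbol. It assumes two separate
flags rather than one overloaded flag, which the message notes is equally
possible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: none of these are real kernel symbols. */

/* What the lower layers actually do when the filesystem says "please serialize". */
enum barrier_action {
	BARRIER_IGNORE,		/* no serialization at all, the cache runs free */
	BARRIER_ORDER_ONLY,	/* keep ordering, but never emit a FLUSH */
	BARRIER_FLUSH,		/* ordering plus a cache flush to the media */
};

/* Policy owned by the driver and the admin, invisible to the filesystem. */
struct queue_policy {
	bool driver_supports_flush;	/* false: driver said EOPNOTSUPP or black-listed the drive */
	bool admin_disable_flush;	/* admin trusts the hardware and wants the performance */
	bool admin_disable_ordering;	/* admin asked for no serialization whatsoever */
};

/* Map a filesystem barrier request onto whatever the policy allows. */
enum barrier_action resolve_barrier(const struct queue_policy *q)
{
	assert(q != NULL);

	if (q->admin_disable_ordering)
		return BARRIER_IGNORE;
	if (!q->driver_supports_flush || q->admin_disable_flush)
		return BARRIER_ORDER_ONLY;
	return BARRIER_FLUSH;
}
```

With `driver_supports_flush` set and both admin flags clear,
`resolve_barrier()` returns `BARRIER_FLUSH`. Setting `admin_disable_flush`
drops that to `BARRIER_ORDER_ONLY`, and the caller's request looks exactly
the same in both cases.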