Date: Mon, 30 Mar 2009 09:34:30 -0700 (PDT)
From: Linus Torvalds
To: Ric Wheeler
cc: "Andreas T.Auer", Alan Cox, Theodore Tso, Mark Lord, Stefan Richter,
    Jeff Garzik, Matthew Garrett, Andrew Morton, David Rees, Jesper Krogh,
    Linux Kernel Mailing List
Subject: Re: Linux 2.6.29
In-Reply-To: <49D0EF1E.9040806@redhat.com>
References: <49CD7B10.7010601@garzik.org> <49CD891A.7030103@rtr.ca>
    <49CD9047.4060500@garzik.org> <49CE2633.2000903@s5r6.in-berlin.de>
    <49CE3186.8090903@garzik.org> <49CE35AE.1080702@s5r6.in-berlin.de>
    <49CE3F74.6090103@rtr.ca> <20090329231451.GR26138@disturbed>
    <20090330003948.GA13356@mit.edu> <49D0710A.1030805@ursus.ath.cx>
    <20090330100546.51907bd2@the-village.bc.nu> <49D0A3D6.4000300@ursus.ath.cx>
    <49D0AA4A.6020308@redhat.com> <49D0EF1E.9040806@redhat.com>

On Mon, 30 Mar 2009, Ric Wheeler wrote:
>
> I still disagree strongly with the don't force flush idea - we have an
> absolute and critical need to have ordered writes that will survive a power
> failure for any file system that is built on transactions (or data base).

Read that sentence of yours again.

In particular, read the "we" part, and ponder.

YOU have that absolute and critical need.

Others? Likely not so much.
The reason people run "data=ordered" on their laptops is not just because
it's the default - rather, it's the default _because_ it's the one that
avoids most obvious problems. And for 99% of all people, that's what they
want.

And as mentioned, if you have to have absolute requirements, you absolutely
MUST be using real RAID with real protection (not just RAID0).

Not "should". MUST. If you don't do redundancy, your disk _will_ eventually
eat your data. Not because the OS wrote in the wrong order, or the disk
cached writes, but simply because bad things do happen.

But turn that around, and say: if you don't have redundant disks, then
pretty much by definition those drive flushes won't be guaranteeing your
data _anyway_, so why pay the price?

> The big issues are that for s-ata drives, our flush mechanism is really,
> really primitive and brutal. We could/should try to validate a better and less
> onerous mechanism (with ordering tags? experimental flush ranges? etc).

That's one of the issues. The cost of those flushes can be really quite
high, and as mentioned, in the absence of redundancy you don't actually get
the guarantees that you seem to think that you get.

> I spent a very long time looking at huge numbers of installed systems
> (millions of file systems deployed in the field), including taking part in
> weekly analysis of why things failed, whether the rates of failure went up or
> down with a given configuration, etc. so I can fully appreciate all of the
> ways drives (or SSD's!) can magically eat your data.

Well, I can go mainly by my own anecdotal evidence, and so far I've actually
had more catastrophic data failure from failed drives than anything else.

OS crashes in the middle of a "yum update"? Yup, been there, done that, it
was really painful. But it was painful in a "damn, I need to force a
re-install of a couple of rpms".

Actual failed drives that got read errors? I seem to average almost one a
year.
It's been overheating laptops, and it's been power outages that apparently
happened at really bad times. I have a UPS now.

> What you have to keep in mind is the order of magnitude of various buckets of
> failures - software crashes/code bugs tend to dominate, followed by drive
> failures, followed by power supplies, etc.

Sure. And those "write flushes" really only cover a rather small percentage.
For many setups, the other corruption issues (drive failure) are not just
more common, but generally more disastrous anyway. So why would a person
like that worry about the (rare) power failure?

> I have personally seen a huge reduction in the "software" rate of failures
> when you get the write barriers (forced write cache flushing) working properly
> with a very large installed base, tested over many years :-)

The software rate of failures should only care about the software write
barriers (ie the ones that order the OS elevator - NOT the ones that
actually tell the disk to flush itself).

		Linus