From: Jamie Lokier
Subject: Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes
Date: Tue, 20 May 2008 23:26:09 +0100
Message-ID: <20080520222609.GC27853@shareable.org>
References: <482DDA56.6000301@redhat.com>
 <200805201202.54420.chris.mason@oracle.com>
 <20080520162710.GM16676@shareable.org>
 <200805201308.03956.chris.mason@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: Andi Kleen, Andrew Morton, Eric Sandeen,
 linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org
To: Chris Mason
Return-path:
Content-Disposition: inline
In-Reply-To: <200805201308.03956.chris.mason@oracle.com>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

Chris Mason wrote:
> Jens and I talked about tossing the barriers completely and just
> doing FUA for all metadata writes. For drives with NCQ, we'll get
> something close to optimal because the higher layer elevators are
> already doing most of the hard work.

Will need benchmarking, but might just work.

Still need barriers for reasonable performance + integrity on systems
without NCQ, like the kit on my desk. Without barriers I have to turn
write-cache off (I do see those power-off errors). But that's too slow.

> Either way, you do want the flush to cover all the data=ordered writes, at
> least all the ordered writes from the transaction you're about to commit.
> Telling the difference between data=ordered from an old transaction or from
> the running transaction gets into pushing ordering down to the lower levels
> (see below)

I grant you, it is all about pushing ordering down as far as reasonable.

> > > But, it complicates the decision about when you're allowed to dirty
> > > a metadata block for writeback. It used to be dirty-after-commit
> > > and it would change to dirty-after-barrier. I suspect that is some
> > > significant surgery into jbd.
> >
> > Rather than tracking when it's "allowed" to dirty a metadata block, it
> > will be simpler to keep a flag saying "barrier needed", and just issue
> > the barrier prior to writing a metadata block, if the flag is set.
>
> Adding explicit ordering into the IO path is really interesting. We
> toss a bunch of IO down to the lower layers with information about
> dependencies and let the lower layers figure it out.

That's right. You can keep it quite simple: just distinguish barriers
from flushes. Or make it more detailed, distinguishing different
block sets and partial barriers between them.

> James had a bunch of ideas here, but I'm afraid the only people that
> understood it were James and the whiteboard he was scribbling on.

I think simply distinguishing barrier from flush is relatively
understandable, isn't it?

> The trick is to code the ordering in such a way that an IO failure breaks the
> chain, and that the filesystem has some sensible chance to deal with all
> these requests that have failed because an earlier write failed.

Yeah, if you're going to do it _properly_ that's the way :-)

But the I/O failure strategy is really a different problem, and
shouldn't necessarily distract from barrier I/O elevator scheduling.

The I/O failure strategy applies just the same even with no barriers
and no disk cache at all. Even there, you want a failed write to
block subsequent dependent writes.

> Also, once we go down the ordering road, it is really tempting to forget that
> ordering does ensure consistency but doesn't tell us the write actually
> happened. fsync and friends need to hook into the dependency chain to wait
> for the barrier instead of waiting for the commit.

True, but fortunately it's quite a simple thing. fsync et al need to
ask for flush, nothing else does. You simply don't need consistency
from journalling writes without fsync. Journal writes without fsync
happen at times determined by arcane kernel daemons.
Applications and even users don't have any say, and therefore no
particular expectation.

> My test prog is definitely a worst case, but I'm pretty confident that most
> mail server workloads end up doing similar IO.

Mail servers seem like the most vulnerable systems to me too.

They are often designed to relay a successful fsync() result across
the network. RAID doesn't protect against power-fails with flushless
fsync, and SMTP does not redundantly hold the same email at different
locations for recovery. (The most solid email services will use a
distributed database which is not vulnerable to these problems. But
the most common mail systems are).

> A 16MB or 32MB disk cache is common these days, and that is a very sizable
> percentage of the jbd log size. I think the potential for corruptions on
> power failure is only growing over time.

That depends on the disk implementation. Another use of the 32MB is
to hold write-through NCQ commands in flight while they are sorted.
If the trend is towards that, perhaps it's quite resistant to power
failure.

Still, I'm on the side of safety - having witnessed these power-fail
corruptions myself. The type I've seen will get more common, as the
trend in home "appliances" is towards pulling the plug to turn them
off, and to have these larger storage devices in them.

-- Jamie