From: Theodore Tso
Subject: Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes
Date: Fri, 16 May 2008 20:20:30 -0400
Message-ID: <20080517002030.GA7374@mit.edu>
References: <482DDA56.6000301@redhat.com>
	<20080516130545.845a3be9.akpm@linux-foundation.org>
	<482DF44B.50204@redhat.com>
	<20080516220315.GB15334@shareable.org>
	<482E08E6.4030507@redhat.com>
	<20080516225304.GG15334@shareable.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
To: Eric Sandeen, Andrew Morton, linux-ext4@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Content-Disposition: inline
In-Reply-To: <20080516225304.GG15334@shareable.org>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
> > > If you just want to test the block I/O layer and drive itself, don't
> > > use the filesystem, but write a program which just access the block
> > > device, continuously writing with/without barriers every so often, and
> > > after power cycle read back to see what was and wasn't written.
> >
> > Well, I think it is worth testing through the filesystem, different
> > journaling mechanisms will probably react^wcorrupt in different ways.
>
> I agree, but intentional tests on the block device will show the
> drives characteristics on power failure much sooner and more
> consistently.  Then you can concentrate on the worst drivers :-)

I suspect the real reason why we get away with it so much with ext3 is
that the journal is usually contiguous on disk.  Hence, when you write
to the journal, it's highly unlikely that the commit block will be
written while the blocks before the commit block have not.  In
addition, because we are doing physical block journalling, the journal
replay repairs a huge amount of damage.  So as we are writing the
journal, the disk drive sees a large contiguous write stream, followed
by singleton writes where the disk blocks end up on disk.
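
For reference, the raw block device test suggested in the quoted text
could be sketched roughly as follows.  This is only an illustration
under stated assumptions: the block layout, the names make_block,
write_pass and read_back, and the use of fsync() as a stand-in for
"with/without barriers" are all mine, and whether fsync() really
reaches the drive's write cache depends on the kernel and the drive.

```python
import os
import struct
import zlib

BLOCK_SIZE = 4096
HEADER = struct.Struct("<QI")  # hypothetical layout: sequence number, CRC32


def make_block(seq):
    """Build one block stamped with seq and a CRC32 of its payload."""
    payload_len = BLOCK_SIZE - HEADER.size
    payload = (struct.pack("<Q", seq) * (payload_len // 8)).ljust(payload_len, b"\0")
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def write_pass(path, nblocks, start_seq, sync_every):
    """Write nblocks stamped blocks; fsync every sync_every blocks (0 = never)."""
    fd = os.open(path, os.O_WRONLY)
    try:
        for i in range(nblocks):
            os.pwrite(fd, make_block(start_seq + i), i * BLOCK_SIZE)
            if sync_every and (i + 1) % sync_every == 0:
                os.fsync(fd)  # assumption: stands in for a barrier/flush
    finally:
        os.close(fd)


def read_back(path, nblocks):
    """After the power cycle: return each block's sequence number,
    or None where the block is short or fails its checksum."""
    results = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(nblocks):
            block = os.pread(fd, BLOCK_SIZE, i * BLOCK_SIZE)
            if len(block) != BLOCK_SIZE:
                results.append(None)
                continue
            seq, crc = HEADER.unpack_from(block)
            ok = zlib.crc32(block[HEADER.size:]) == crc
            results.append(seq if ok else None)
    finally:
        os.close(fd)
    return results
```

Running write_pass() in a loop with an increasing start_seq, cutting
power, and then comparing the read_back() output against the last
sequence number known to have been synced would show which blocks the
drive did and did not commit.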

The most important reason, though, is that the blocks which are dirty
don't get flushed out to disk right away!  They don't have to, since
they are in the journal, and the journal replay will write the correct
data to disk.  Before the journal commit, the buffer heads are pinned,
so they can't be written out.  After the journal commit, the buffer
heads may be written out, but they don't get written out right away;
the kernel will only write them out when the periodic buffer cache
flush takes place, *or* if the journal needs to wrap, at which point if
there are pending writes to an old commit that haven't been superseded
by another journal commit, the jbd layer has to force them out.  But
the point is this is done in an extremely lazy fashion.

As a result, it's very tough to create a situation where a hard drive
will reorder write requests aggressively enough that we would see a
potential problem.

I suspect if we want to demonstrate the problem, we would need to do a
number of things:

	* create a highly fragmented journal inode, forcing the jbd
	  layer to seek all over the disk while writing out the
	  journal blocks during a commit
	* make the journal small, forcing the journal to wrap very often
	* run with data=journal mode, to put maximal stress on the journal
	* make the workload one which creates and deletes a large number
	  of files scattered all over the directory hierarchy, so that
	  we limit the number of blocks which are rewritten
	* forcibly crash the system while subjecting the ext3
	  filesystem to this torture test

Given that most ext3 filesystems have their journal created at mkfs
time, so it is contiguous, and the journal is generally nice and
large, in practice I suspect it's relatively difficult (I didn't say
impossible) for us to trigger corruption given how far away we are
from the worst case scenario described above.

						- Ted
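
A minimal sketch of the create/delete workload from the list above,
for anyone who wants to try it.  The function name churn, its
parameters, and the directory naming are all hypothetical; it only
generates the file churn and does nothing about fragmenting the
journal, shrinking it, or crashing the machine.

```python
import os
import random


def churn(root, n_dirs=64, n_ops=10000, file_size=4096, seed=0):
    """Create and delete files scattered over a directory tree under root.

    Returns (created, deleted, live), where live lists the paths that
    still exist when the run ends.
    """
    rng = random.Random(seed)

    # Spread the target directories over a two-level hierarchy.
    dirs = []
    for i in range(n_dirs):
        d = os.path.join(root, f"d{i % 8}", f"s{i}")
        os.makedirs(d, exist_ok=True)
        dirs.append(d)

    live = []
    created = deleted = 0
    for op in range(n_ops):
        if live and rng.random() < 0.5:
            # Delete a randomly chosen existing file.
            victim = live.pop(rng.randrange(len(live)))
            os.unlink(victim)
            deleted += 1
        else:
            # Create a fresh file (never rewrite an old one) in a random dir.
            path = os.path.join(rng.choice(dirs), f"f{op}")
            with open(path, "wb") as f:
                f.write(os.urandom(file_size))
            live.append(path)
            created += 1
    return created, deleted, live
```

Each file gets a new name and is written exactly once, which keeps
the number of rewritten blocks low, as the fourth item asks.  It would
be pointed at a directory on the ext3 filesystem under test and left
running until the machine is forcibly crashed.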