From: Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 2/2] ext2: Add blk_issue_flush() to syncing paths
Date: Wed, 14 Jan 2009 10:18:34 -0800
Message-ID: <20090114101834.fbb9ea12.akpm@linux-foundation.org>
References: <1231945948-23676-1-git-send-email-jack@suse.cz>
	<1231945948-23676-2-git-send-email-jack@suse.cz>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	pavel@suse.cz
To: Jan Kara <jack@suse.cz>
In-Reply-To: <1231945948-23676-2-git-send-email-jack@suse.cz>
Sender: linux-ext4-owner@vger.kernel.org

On Wed, 14 Jan 2009 16:12:28 +0100 Jan Kara <jack@suse.cz> wrote:

> To be really safe that the data hit the platter, we should also flush drive's
> writeback caches on fsync and for O_SYNC files or O_DIRSYNC inodes.
> 

It's not good to randomly sprinkle blkdev_issue_flush() calls all over
the filesystem like this.  How do we know that you didn't miss a site? 
How do we ensure that people who modify the fs in the future don't
forget to add the blkdev_issue_flush() call, if needed?

IOW, it is fragile.  Is there anything we can do to make this more
robust?  Do the flush calls from some higher-level callsite?  Perhaps
even the VFS?

> @@ -97,8 +98,10 @@ static int ext2_commit_chunk(struct page *page, loff_t pos, unsigned len)
>  
>  	if (IS_DIRSYNC(dir)) {
>  		err = write_one_page(page, 1);
> -		if (!err)
> +		if (!err) {
>  			err = ext2_sync_inode(dir);
> +			blkdev_issue_flush(dir->i_sb->s_bdev, NULL);

The patch itself would have been a bit neater if it had added

int ext2_blkdev_issue_flush(struct inode *inode)

and called that, IMO.
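
Something like the following is what I have in mind.  This is only a
rough userspace sketch: the struct definitions and the
blkdev_issue_flush() body are stand-in stubs so that it compiles outside
the kernel, and ext2_blkdev_issue_flush() is a hypothetical helper name,
not something which exists in the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types, just enough for the sketch to compile. */
typedef unsigned long long sector_t;
struct block_device { int flush_count; };
struct super_block { struct block_device *s_bdev; };
struct inode { struct super_block *i_sb; };

/* Stub only: the real blkdev_issue_flush() lives in the block layer. */
static int blkdev_issue_flush(struct block_device *bdev, sector_t *error_sector)
{
	(void)error_sector;
	assert(bdev != NULL);
	bdev->flush_count++;	/* record that a flush was requested */
	return 0;
}

/*
 * Hypothetical helper: the single place from which ext2 asks the drive
 * to flush its writeback cache for the device backing @inode.
 */
static int ext2_blkdev_issue_flush(struct inode *inode)
{
	return blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
```

The callsites then become ext2_blkdev_issue_flush(dir), and the comment
over the helper is the obvious place to document what is (and is not)
guaranteed once it returns.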


Also, the changelog needs some work, methinks.

/**
 * blkdev_issue_flush - queue a flush
 * @bdev:	blockdev to issue flush for
 * @error_sector:	error sector
 *
 * Description:
 *    Issue a flush for the block device in question. Caller can supply
 *    room for storing the error offset in case of a flush error, if they
 *    wish to.  Caller must run wait_for_completion() on its own.
 */

So afaict the change you've made is incomplete.  We'll queue a
writeback command to the disk but we won't wait for it to be sent down
the wire.  Nor do we wait for the command to complete at the device
end.  So it can still be a looong time (seconds!) before the data which
the user thinks is on disk really is safe.

Yes?

If so, this design decision should be described in the changelog, and
justified.  Actually, doing this in the comment over
ext2_blkdev_issue_flush() would be good.