Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758143Ab0GEDxN (ORCPT ); Sun, 4 Jul 2010 23:53:13 -0400 Received: from rcsinet10.oracle.com ([148.87.113.121]:54202 "EHLO rcsinet10.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755889Ab0GEDxM (ORCPT ); Sun, 4 Jul 2010 23:53:12 -0400 Message-ID: <4C3156D0.5090001@oracle.com> Date: Mon, 05 Jul 2010 11:51:44 +0800 From: Tao Ma User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4 MIME-Version: 1.0 To: Dave Chinner , Linus Torvalds , Linux Kernel , ocfs2-devel@oss.oracle.com, Dave Chinner , Christoph Hellwig , Mark Fasheh , Joel Becker Subject: Re: [PATCH 1/2] ocfs2: Zero the tail cluster when extending past i_size v2 References: <20100628173529.GA10573@mail.oracle.com> <20100629002421.GY6590@dastard> <20100629005403.GC24343@mail.oracle.com> <20100629015615.GZ6590@dastard> <20100629020420.GE24343@mail.oracle.com> <20100629022757.GA6590@dastard> <20100629071817.GA4150@mail.oracle.com> <20100702224912.GC5800@mail.oracle.com> <20100703213219.GB21262@mail.oracle.com> In-Reply-To: <20100703213219.GB21262@mail.oracle.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Source-IP: acsmt354.oracle.com [141.146.40.154] X-Auth-Type: Internal IP X-CT-RefId: str=0001.0A090204.4C315706.0011:SCFMA4539814,ss=1,fgs=0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6516 Lines: 187 Hi Joel, On 07/04/2010 05:32 AM, Joel Becker wrote: > Here's the second version of my corruption fix. It fixes two > bugs: > > 1) i_size can obviously be at a place that is a hole, so don't BUG on > that. > 2) Fix an off-by-one when checking whether the write position is within > the tail allocation. > > This version passes my tail corruption test as well as the kernel > compile that exposed the two bugs above. > > Joel > > --------------------------------------------------------------- > > ocfs2's allocation unit is the cluster. This can be larger than a block > or even a memory page. This means that a file may have many blocks in > its last extent that are beyond the block containing i_size. > > When ocfs2 grows a file, it zeros the entire cluster in order to ensure > future i_size growth will see cleared blocks. Unfortunately, > block_write_full_page() drops the pages past i_size. This means that > ocfs2 is actually leaking garbage data into the tail end of that last > cluster. > > We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect > when a write or truncate is past i_size. If there is any existing > allocation between the block containing the current i_size and the > location of the write or truncate, zeros will be written to that > allocation. > > This is only for sparse filesystems. Non-sparse filesystems already get > this via ocfs2_extend_no_holes(). > > Signed-off-by: Joel Becker > --- > fs/ocfs2/aops.c | 22 ++++---- > fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------ > fs/ocfs2/file.h | 2 + > 3 files changed, 150 insertions(+), 28 deletions(-) > > diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c > index 6a13ea6..7fca78d 100644 > --- a/fs/ocfs2/file.c > +++ b/fs/ocfs2/file.c > @@ -848,6 +848,137 @@ out: > return ret; > } > > +/* > + * This function is a helper for ocfs2_zero_tail(). It calculates > + * what blocks need zeroing and does any CoW necessary. > + */ > +static int ocfs2_zero_tail_prepare(struct inode *inode, > + struct buffer_head *di_bh, > + loff_t pos, u64 *start_blkno, > + u64 *blocks) > +{ > + int rc = 0; > + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); > + u32 tail_cpos, pos_cpos, p_cpos; > + u64 tail_blkno, pos_blkno, blocks_to_zero; > + unsigned int num_clusters = 0; > + unsigned int ext_flags = 0; > + > + /* > + * The block containing i_size has already been zeroed, so our tail > + * block is the first block after i_size. The block containing > + * pos will be zeroed. So we only need to do anything if > + * tail_blkno is before pos_blkno. > + */ > + tail_blkno = (i_size_read(inode)>> inode->i_sb->s_blocksize_bits) + 1; > + pos_blkno = pos>> inode->i_sb->s_blocksize_bits; > + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno); > + if (pos_blkno<= tail_blkno) > + goto out; > + blocks_to_zero = pos_blkno - tail_blkno; > + > + /* > + * If tail_blkno is in the cluster past i_size, we don't need > + * to touch the cluster containing i_size at all. > + */ > + tail_cpos = i_size_read(inode)>> osb->s_clustersize_bits; > + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)> tail_cpos) > + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, > + tail_blkno); Can we always set tail_cpos in one line? tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)? tail_cpos is either the same cluster as i_size or the next cluster and both works for tail_blkno I guess? > + > + rc = ocfs2_get_clusters(inode, tail_cpos,&p_cpos,&num_clusters, > + &ext_flags); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + > + /* Is there a cluster to zero? */ > + if (!p_cpos) > + goto out; For unwritten extent, we also need to clear the pages? If yes, the solution doesn't complete if we have 2 unwritten extent, one contains i_size while one passes i_size. Here we only clear the pages for the 1st unwritten extent and leave the 2nd one untouched. > + > + pos_cpos = pos>> osb->s_clustersize_bits; > + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n", > + (unsigned int)tail_cpos, (unsigned int)num_clusters, > + (unsigned int)pos_cpos, (unsigned long long)tail_blkno, > + (unsigned long long)pos_blkno); From here to the call of CoW is a bit hard to understand. In 'if', num_clusters is set for CoW and in 'else', blocks_to_zero is set. So it isn't easy for the reader to tell why these 2 clauses are setting different values. So how about my code below? It looks more straightforward I think. > + if ((tail_cpos + num_clusters)> pos_cpos) { > + num_clusters = pos_cpos - tail_cpos; > + if (pos_blkno> > + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos)) > + num_clusters += 1; > + } else { > + blocks_to_zero = > + ocfs2_clusters_to_blocks(inode->i_sb, > + tail_cpos + num_clusters); > + blocks_to_zero -= tail_blkno; > + } > + > + /* Now CoW the clusters we're about to zero */ > + if (ext_flags& OCFS2_EXT_REFCOUNTED) { > + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, > + num_clusters, UINT_MAX); > + if (rc) { > + mlog_errno(rc); > + goto out; > + } > + } /* Decrease blocks_to_zero if there is some hole after extent */ if (tail_cpos + num_clusters <= pos_cpos) { blocks_to_zero = ocfs2_clusters_to_blocks(inode->i_sb, tail_cpos + num_clusters); blocks_to_zero -= tail_blkno; } /* Now CoW if we have some refcounted clusters. */ if (ext_flags & OCFS2_EXT_REFCOUNTED) { /* * We add one more cluster here since it will be * written shortly and if the pos_blkno isn't aligned * to the cluster size, we have to zero the blocks * before it. */ if (tail_cpos + num_clusters > pos_cpos) num_clusters = pos_cpos - tail_cpos + 1; rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos, num_clusters, UINT_MAX); if (rc) { mlog_errno(rc); goto out; } } > + > + *start_blkno = tail_blkno; > + *blocks = blocks_to_zero; > + mlog(0, "start_blkno = %llu, blocks = %llu\n", > + (unsigned long long)(*start_blkno), > + (unsigned long long)(*blocks)); > + > +out: > + return rc; > +} Regards, Tao -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/