From: Brian Foster <bfoster@redhat.com>
Subject: Re: [PATCH v6 2/10] xfs: Add support FALLOC_FL_INSERT_RANGE for
	fallocate
Date: Tue, 2 Dec 2014 15:48:02 -0500
Message-ID: <20141202204801.GA44400@bfoster.bfoster>
References: <002b01d007ae$319ed0a0$94dc71e0$@samsung.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: Theodore Ts'o <tytso@mit.edu>, linux-kernel@vger.kernel.org,
	xfs@oss.sgi.com, Ashish Sangwan <a.sangwan@samsung.com>,
	linux-fsdevel@vger.kernel.org, linux-ext4 <linux-ext4@vger.kernel.org>
To: Namjae Jeon <namjae.jeon@samsung.com>
Content-Disposition: inline
In-Reply-To: <002b01d007ae$319ed0a0$94dc71e0$@samsung.com>
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com

On Mon, Nov 24, 2014 at 03:16:33PM +0900, Namjae Jeon wrote:
> This patch implements fallocate's FALLOC_FL_INSERT_RANGE for XFS.
> 
> 1) Make sure that both offset and len are block size aligned.
> 2) Update the i_size of inode by len bytes.
> 3) Compute the file's logical block number against offset. If the computed
>    block number is not the starting block of the extent, split the extent
>    such that the block number is the starting block of the extent.
> 4) Shift all the extents which are lying bewteen [offset, last allocated extent]
>    towards right by len bytes. This step will make a hole of len bytes
>    at offset.
> 
> Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
> Signed-off-by: Ashish Sangwan <a.sangwan@samsung.com>
> ---

Hi Namjae,

Here are some review notes. I haven't got to any of the test code or
played around with it just yet...

> Changelog
> 
> v6:
>  - This version is based upon Brian's changes to collapse paths.
>  - Instead of having seperate functions for shifting extents left/right, the
>    current extent shift function is made generic to shift in both directions.
> 
> v5:
>  - remove allocation part.
> 
> v4:
>  - set cur->bc_private.b.allocated to zero before calling xfs_btree_del_cursor.
> 
> v3:
>  - remove XFS_TRANS_RESERVE and assert.
>  - update the comment of blockcount calculation.
>  - use 'if(blockcount)' instead of 'if (got.br_blockcount < blockcount)'.
>  - move insert_file_space() calling under xfs_setattr_size to avoid code duplicate.
> 
> v2:
>  - remove reserved enable.
>  - add xfs_qm_dqattach.
>  - reset blockcount in xfs_bmap_shift_extents_right.
>  - update i_size to avoid data loss before insert_file_space() is called.
>  - use in-memory extent array size that delayed allocation extents
> 
>  fs/xfs/libxfs/xfs_bmap.c | 368 +++++++++++++++++++++++++++++++++++++++++------
>  fs/xfs/libxfs/xfs_bmap.h |  14 +-
>  fs/xfs/xfs_bmap_util.c   | 118 +++++++++++----
>  fs/xfs/xfs_bmap_util.h   |   2 +
>  fs/xfs/xfs_file.c        |  38 ++++-
>  fs/xfs/xfs_trace.h       |   1 +
>  6 files changed, 463 insertions(+), 78 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 79c9819..da01890 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5528,56 +5528,87 @@ xfs_bmse_shift_one(

FYI, you're probably going to need to rebase the xfs_bmse_shift_one()
bits on top of Dave's recent cleanup here:

http://oss.sgi.com/archives/xfs/2014-11/msg00458.html

>  	int				*current_ext,
>  	struct xfs_bmbt_rec_host	*gotp,
>  	struct xfs_btree_cur		*cur,
> -	int				*logflags)
> +	int				*logflags,
> +	enum SHIFT_DIRECTION		SHIFT)
>  {
>  	struct xfs_ifork		*ifp;
>  	xfs_fileoff_t			startoff;
> -	struct xfs_bmbt_rec_host	*leftp;
> +	struct xfs_bmbt_rec_host	*contp;
>  	struct xfs_bmbt_irec		got;
> -	struct xfs_bmbt_irec		left;
> +	struct xfs_bmbt_irec		cont;
>  	int				error;
>  	int				i;
> +	int				total_extents;
>  
>  	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
>  
>  	xfs_bmbt_get_all(gotp, &got);
> -	startoff = got.br_startoff - offset_shift_fsb;
>  
>  	/* delalloc extents should be prevented by caller */
>  	XFS_WANT_CORRUPTED_GOTO(!isnullstartblock(got.br_startblock),
>  				out_error);
>  
> -	/*
> -	 * If this is the first extent in the file, make sure there's enough
> -	 * room at the start of the file and jump right to the shift as there's
> -	 * no left extent to merge.
> -	 */
> -	if (*current_ext == 0) {
> -		if (got.br_startoff < offset_shift_fsb)
> +	if (SHIFT == SHIFT_LEFT) {
> +		startoff = got.br_startoff - offset_shift_fsb;
> +		/*
> +		 * If this is the first extent in the file, make sure there's
> +		 * enough room at the start of the file and jump right to the
> +		 * shift as there's no left extent to merge.
> +		 */
> +		if (*current_ext == 0) {
> +			if (got.br_startoff < offset_shift_fsb)
> +				return -EINVAL;
> +			goto shift_extent;
> +		}
> +
> +		/* grab the left extent and check for a large enough hole */
> +		contp = xfs_iext_get_ext(ifp, *current_ext - 1);
> +		xfs_bmbt_get_all(contp, &cont);
> +
> +		if (startoff < cont.br_startoff + cont.br_blockcount)
>  			return -EINVAL;
> -		goto shift_extent;
> -	}
>  
> -	/* grab the left extent and check for a large enough hole */
> -	leftp = xfs_iext_get_ext(ifp, *current_ext - 1);
> -	xfs_bmbt_get_all(leftp, &left);
> +		/* check whether to merge the extent or shift it down */
> +		if (!xfs_bmse_can_merge(&cont, &got, offset_shift_fsb))
> +			goto shift_extent;
>  
> -	if (startoff < left.br_startoff + left.br_blockcount)
> -		return -EINVAL;
> +		return xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> +				      *current_ext, gotp, contp, cur, logflags);
> +	} else {
> +		startoff = got.br_startoff + offset_shift_fsb;
> +		/*
> +		 * If this is the last extent in the file, make sure there's
> +		 * enough room at the end of the file and jump right to the
> +		 * shift as there's no right extent to merge.
> +		 */
> +		if (*current_ext == (total_extents - 1))
> +			goto shift_extent;
>  
> -	/* check whether to merge the extent or shift it down */
> -	if (!xfs_bmse_can_merge(&left, &got, offset_shift_fsb))
> -		goto shift_extent;
> +		/* grab the right extent and check for a large enough hole */
> +		contp = xfs_iext_get_ext(ifp, *current_ext + 1);
> +		xfs_bmbt_get_all(contp, &cont);
>  
> -	return xfs_bmse_merge(ip, whichfork, offset_shift_fsb, *current_ext,
> -			      gotp, leftp, cur, logflags);
> +		if (startoff > cont.br_startoff)
> +			return -EINVAL;

Shouldn't this be 'if (startoff + got.br_blockount > cont.br_startoff)'?

> +
> +		if (!xfs_bmse_can_merge(&got, &cont,  offset_shift_fsb))
> +			goto shift_extent;
> +
> +		return xfs_bmse_merge(ip, whichfork, offset_shift_fsb,
> +				      *current_ext + 1, contp, gotp, cur,
> +				      logflags);

It doesn't look like xfs_bmse_merge() is bidirectional in this sense.
The can_merge() helper might work Ok since we're just checking whether
the extents line up. The merge code, however, will always extend the
block count of the left extent and delete the right. The left extent is
gotp in this case, which is the extent we want to shift right. In other
words, it seems like we should adjust the start offset of the right
extent to the right-shifted start offset of the left and delete the
left.

That said, I wonder whether we even care about a merge in the right
shift case since we haven't punched a hole in the file and thus have not
changed the "neighbors" of any of the extents we're shuffling around. I
would think any extents that are already contiguous as such are already
a single extent.

> +	}
>  
>  shift_extent:
>  	/*
>  	 * Increment the extent index for the next iteration, update the start
>  	 * offset of the in-core extent and update the btree if applicable.
>  	 */
> -	(*current_ext)++;
> +	if (SHIFT == SHIFT_LEFT)
> +		(*current_ext)++;
> +	else
> +		(*current_ext)--;
>  	xfs_bmbt_set_startoff(gotp, startoff);
>  	*logflags |= XFS_ILOG_CORE;
>  	if (!cur) {
> @@ -5604,10 +5635,10 @@ out_error:
>  }
>  
>  /*
> - * Shift extent records to the left to cover a hole.
> + * Shift extent records to the left/right to cover/create a hole.
>   *
>   * The maximum number of extents to be shifted in a single operation is
> - * @num_exts. @start_fsb specifies the file offset to start the shift and the
> + * @num_exts. @stop_fsb specifies the file offset at which to stop shift and the
>   * file offset where we've left off is returned in @next_fsb. @offset_shift_fsb
>   * is the length by which each extent is shifted. If there is no hole to shift
>   * the extents into, this will be considered invalid operation and we abort
> @@ -5617,12 +5648,13 @@ int
>  xfs_bmap_shift_extents(
>  	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
> -	xfs_fileoff_t		start_fsb,
> +	xfs_fileoff_t		stop_fsb,
>  	xfs_fileoff_t		offset_shift_fsb,
>  	int			*done,
>  	xfs_fileoff_t		*next_fsb,
>  	xfs_fsblock_t		*firstblock,
>  	struct xfs_bmap_free	*flist,
> +	enum SHIFT_DIRECTION	SHIFT,
>  	int			num_exts)
>  {
>  	struct xfs_btree_cur		*cur = NULL;
> @@ -5636,6 +5668,7 @@ xfs_bmap_shift_extents(
>  	int				whichfork = XFS_DATA_FORK;
>  	int				logflags = 0;
>  	int				total_extents;
> +	int				stop_extent;
>  
>  	if (unlikely(XFS_TEST_ERROR(
>  	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> @@ -5651,6 +5684,7 @@ xfs_bmap_shift_extents(
>  
>  	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
> +	ASSERT(SHIFT == SHIFT_LEFT || SHIFT == SHIFT_RIGHT);
>  
>  	ifp = XFS_IFORK_PTR(ip, whichfork);
>  	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> @@ -5668,43 +5702,87 @@ xfs_bmap_shift_extents(
>  	}
>  
>  	/*
> +	 * There may be delalloc extents in the data fork before the range we
> +	 * are collapsing out, so we cannot use the count of real extents here.
> +	 * Instead we have to calculate it from the incore fork.
> +	 */
> +	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
> +	if (total_extents == 0) {
> +		*done = 1;
> +		goto del_cursor;
> +	}
> +
> +	/*
> +	 * In case of first right shift, we need to initialize next_fsb
> +	 */
> +	if (*next_fsb == NULLFSBLOCK) {
> +		ASSERT(SHIFT == SHIFT_RIGHT);
> +		gotp = xfs_iext_get_ext(ifp, total_extents - 1);
> +		xfs_bmbt_get_all(gotp, &got);
> +		*next_fsb = got.br_startoff;
> +		if (stop_fsb > *next_fsb) {
> +			*done = 1;
> +			goto del_cursor;
> +		}
> +	}
> +
> +	/* Lookup the extent index at which we have to stop */
> +	if (SHIFT == SHIFT_RIGHT)
> +		gotp = xfs_iext_bno_to_ext(ifp, stop_fsb, &stop_extent);
> +	else
> +		stop_extent = total_extents;
> +
> +	/*
>  	 * Look up the extent index for the fsb where we start shifting. We can
>  	 * henceforth iterate with current_ext as extent list changes are locked
>  	 * out via ilock.
>  	 *
>  	 * gotp can be null in 2 cases: 1) if there are no extents or 2)
> -	 * start_fsb lies in a hole beyond which there are no extents. Either
> +	 * *next_fsb lies in a hole beyond which there are no extents. Either
>  	 * way, we are done.
>  	 */
> -	gotp = xfs_iext_bno_to_ext(ifp, start_fsb, &current_ext);
> +	gotp = xfs_iext_bno_to_ext(ifp, *next_fsb, &current_ext);
>  	if (!gotp) {
>  		*done = 1;
>  		goto del_cursor;
>  	}
>  
> -	/*
> -	 * There may be delalloc extents in the data fork before the range we
> -	 * are collapsing out, so we cannot use the count of real extents here.
> -	 * Instead we have to calculate it from the incore fork.
> -	 */
> -	total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
> -	while (nexts++ < num_exts && current_ext < total_extents) {
> +	/* some sanity checking before we finally start shifting extents */
> +	if ((SHIFT == SHIFT_LEFT && current_ext >= stop_extent) ||
> +	     (SHIFT == SHIFT_RIGHT && current_ext < stop_extent)) {
> +		error = EIO;
> +		goto del_cursor;
> +	}

It looks like stop_extent is exclusive for left shifts (== total_extents
rather than the last extent number) and inclusive for right shifts.
Could we be consistent between the two?

> +
> +	while (nexts++ < num_exts) {
>  		error = xfs_bmse_shift_one(ip, whichfork, offset_shift_fsb,
> -					&current_ext, gotp, cur, &logflags);
> +					   &current_ext, gotp, cur, &logflags,
> +					   SHIFT);
>  		if (error)
>  			goto del_cursor;
> -
> -		/* update total extent count and grab the next record */
> +		/*
> +		 * In case there was an extent merge after shifting extent,
> +		 * extent numbers would change.
> +		 * Update total extent count and grab the next record.
> +		 */
>  		total_extents = ifp->if_bytes / sizeof(xfs_bmbt_rec_t);
> -		if (current_ext >= total_extents)
> -			break;
> +		if (SHIFT == SHIFT_RIGHT) {
> +			gotp = xfs_iext_bno_to_ext(ifp, stop_fsb, &stop_extent);
> +			if (current_ext < stop_extent) {
> +				*done = 1;
> +				break;
> +			}
> +		} else {
> +			stop_extent = total_extents;
> +			if (current_ext == stop_extent) {
> +				*done = 1;
> +				break;
> +			}

... and if we can make stop_extent consistently exclusive, it looks like
we could use 'if (current_ext == stop_extent)' as a stop condition for
both cases, yes?

> +		}
>  		gotp = xfs_iext_get_ext(ifp, current_ext);
>  	}
>  
> -	/* Check if we are done */
> -	if (current_ext == total_extents) {
> -		*done = 1;
> -	} else if (next_fsb) {
> +	if (!*done) {
>  		xfs_bmbt_get_all(gotp, &got);
>  		*next_fsb = got.br_startoff;
>  	}

Might be good to set next_fsb to NULLFSBLOCK or some such value if we
are done.

> @@ -5719,3 +5797,203 @@ del_cursor:
>  
>  	return error;
>  }
> +
> +/*
> + * Splits an extent into two extents at split_fsb block that it is
> + * the first block of the current_ext. @current_ext is a target extent
> + * to be split. @split_fsb is a block where the extents is split.
> + * If split_fsb lies in a hole or the first block of extents, just return 0.
> + */
> +STATIC int
> +xfs_bmap_split_extent_at(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	xfs_fileoff_t		split_fsb,
> +	xfs_fsblock_t		*firstfsb,
> +	struct xfs_bmap_free	*free_list)
> +{
> +	int				whichfork = XFS_DATA_FORK;
> +	struct xfs_btree_cur		*cur;
> +	struct xfs_bmbt_rec_host	*gotp;
> +	struct xfs_bmbt_irec		got;
> +	struct xfs_bmbt_irec		new; /* split extent */
> +	struct xfs_mount		*mp = ip->i_mount;
> +	struct xfs_ifork		*ifp;
> +	xfs_fsblock_t			gotblkcnt; /* new block count for got */
> +	xfs_extnum_t			current_ext;
> +	int				error = 0;
> +	int				logflags;
> +	int				i = 0;
> +
> +	if (unlikely(XFS_TEST_ERROR(
> +	    (XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
> +	     XFS_IFORK_FORMAT(ip, whichfork) != XFS_DINODE_FMT_BTREE),
> +	     mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) {
> +		XFS_ERROR_REPORT("xfs_bmap_split_extent_at",
> +				 XFS_ERRLEVEL_LOW, mp);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	if (XFS_FORCED_SHUTDOWN(mp))
> +		return -EIO;
> +
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> +		/* Read in all the extents */
> +		error = xfs_iread_extents(tp, ip, whichfork);
> +		if (error)
> +			return error;
> +	}
> +
> +	gotp = xfs_iext_bno_to_ext(ifp, split_fsb, &current_ext);
> +	/*
> +	 * gotp can be null in 2 cases: 1) if there are no extents
> +	 * or 2) split_fsb lies in a hole beyond which there are
> +	 * no extents. Either way, we are done.
> +	 */
> +	if (!gotp)
> +		return 0;
> +
> +	xfs_bmbt_get_all(gotp, &got);
> +
> +	/*
> +	 * Check split_fsb lies in a hole or the start boundary offset
> +	 * of the extent.
> +	 */
> +	if (got.br_startoff >= split_fsb)
> +		return 0;
> +
> +	gotblkcnt = split_fsb - got.br_startoff;
> +	new.br_startoff = split_fsb;
> +	new.br_startblock = got.br_startblock + gotblkcnt;
> +	new.br_blockcount = got.br_blockcount - gotblkcnt;
> +	new.br_state = got.br_state;
> +
> +	/* We are going to change core inode */
> +	logflags = XFS_ILOG_CORE;
> +
> +	if (ifp->if_flags & XFS_IFBROOT) {
> +		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
> +		cur->bc_private.b.firstblock = *firstfsb;
> +		cur->bc_private.b.flist = free_list;
> +		cur->bc_private.b.flags = 0;
> +	} else {
> +		cur = NULL;
> +		logflags |= XFS_ILOG_DEXT;
> +	}

This looks like it suffers from a similar problem as the bmap shift code
with regard to logflags and error handling. Check out the subsequent fix
for reference:

ca446d88 xfs: don't log inode unless extent shift makes extent modifications

We basically init. logflags to 0 and delay setting the actual flags as
long as possible, until we actually make a change to the extent tree or
bmap btree.

Otherwise, if the following lookup were to fail, for example, we'd still
log the inode even though we haven't changed anything and ultimately the
fs will shutdown on transaction cancel.

> +
> +	if (cur) {
> +		error = xfs_bmbt_lookup_eq(cur, got.br_startoff,
> +				got.br_startblock,
> +				got.br_blockcount,
> +				&i);
> +		if (error)
> +			goto del_cursor;
> +		XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
> +	}
> +
> +	xfs_bmbt_set_blockcount(gotp, gotblkcnt);
> +	got.br_blockcount = gotblkcnt;
> +	if (cur) {
> +		error = xfs_bmbt_update(cur, got.br_startoff,
> +				got.br_startblock,
> +				got.br_blockcount,
> +				got.br_state);
> +		if (error)
> +			goto del_cursor;
> +	}
> +
> +	/* Add new extent */
> +	current_ext++;
> +	xfs_iext_insert(ip, current_ext, 1, &new, 0);
> +	XFS_IFORK_NEXT_SET(ip, whichfork,
> +			XFS_IFORK_NEXTENTS(ip, whichfork) + 1);
> +
> +	if (cur) {
> +		error = xfs_bmbt_lookup_eq(cur, new.br_startoff,
> +				new.br_startblock, new.br_blockcount,
> +				&i);
> +		if (error)
> +			goto del_cursor;
> +		XFS_WANT_CORRUPTED_GOTO(i == 0, del_cursor);
> +		cur->bc_rec.b.br_state = new.br_state;
> +
> +		error = xfs_btree_insert(cur, &i);
> +		if (error)
> +			goto del_cursor;
> +		XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
> +	}
> +
> +	/*
> +	 * Convert to a btree if necessary.
> +	 */
> +	if (xfs_bmap_needs_btree(ip, whichfork)) {
> +		int tmp_logflags; /* partial log flag return val */
> +
> +		ASSERT(cur == NULL);
> +		error = xfs_bmap_extents_to_btree(tp, ip, firstfsb, free_list,
> +				&cur, 0, &tmp_logflags, whichfork);
> +		logflags |= tmp_logflags;
> +	}

Hmm, looks Ok, but it would be nice if we had a test case for this
convert to btree scenario. I suspect something that falloc's just the
right number extents for a known fs format and does an insert range
right in the middle of one would suffice (and probably only require a
few seconds to run).

> +
> +del_cursor:
> +	if (cur) {
> +		cur->bc_private.b.allocated = 0;
> +		xfs_btree_del_cursor(cur,
> +				error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	}
> +	xfs_trans_log_inode(tp, ip, logflags);
> +	return error;
> +}
> +
> +int
> +xfs_bmap_split_extent(
> +		struct xfs_inode        *ip,
> +		xfs_fileoff_t           split_fsb)

You can line up the above params with the local vars below.

> +{
> +	struct xfs_mount        *mp = ip->i_mount;
> +	struct xfs_trans        *tp;
> +	struct xfs_bmap_free    free_list;
> +	xfs_fsblock_t           firstfsb;
> +	int                     committed;
> +	int                     error;
> +
> +	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
> +	error = xfs_trans_reserve(tp, &M_RES(mp)->tr_write,
> +			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0);
> +	if (error) {
> +		xfs_trans_cancel(tp, 0);
> +		return error;
> +	}
> +
> +	xfs_ilock(ip, XFS_ILOCK_EXCL);
> +	error = xfs_trans_reserve_quota(tp, mp, ip->i_udquot,
> +			ip->i_gdquot, ip->i_pdquot,
> +			XFS_DIOSTRAT_SPACE_RES(mp, 0), 0,
> +			XFS_QMOPT_RES_REGBLKS);
> +	if (error)
> +		goto out;
> +
> +	xfs_trans_ijoin(tp, ip, 0);

Might as well transfer the lock to the tp here? That avoids the need for
the unlocks below. We just need to make sure we order things correctly
such that the inode is unlocked on error conditions.

> +
> +	xfs_bmap_init(&free_list, &firstfsb);
> +
> +	error = xfs_bmap_split_extent_at(tp, ip, split_fsb,
> +			&firstfsb, &free_list);
> +	if (error)
> +		goto out;
> +
> +	error = xfs_bmap_finish(&tp, &free_list, &committed);
> +	if (error)
> +		goto out;
> +
> +	error = xfs_trans_commit(tp, XFS_TRANS_RELEASE_LOG_RES);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +
> +	return error;
> +
> +out:
> +	xfs_trans_cancel(tp, XFS_TRANS_RELEASE_LOG_RES | XFS_TRANS_ABORT);
> +	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> +	return error;
> +}
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 44db6db..94c80aa 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -135,6 +135,11 @@ static inline void xfs_bmap_init(xfs_bmap_free_t *flp, xfs_fsblock_t *fbp)
>   */
>  #define XFS_BMAP_MAX_SHIFT_EXTENTS	1
>  
> +enum SHIFT_DIRECTION {
> +	SHIFT_LEFT = 0,
> +	SHIFT_RIGHT,
> +};
> +
>  #ifdef DEBUG
>  void	xfs_bmap_trace_exlist(struct xfs_inode *ip, xfs_extnum_t cnt,
>  		int whichfork, unsigned long caller_ip);
> @@ -180,6 +185,13 @@ uint	xfs_default_attroffset(struct xfs_inode *ip);
>  int	xfs_bmap_shift_extents(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_fileoff_t start_fsb, xfs_fileoff_t offset_shift_fsb,
>  		int *done, xfs_fileoff_t *next_fsb, xfs_fsblock_t *firstblock,
> -		struct xfs_bmap_free *flist, int num_exts);
> +		struct xfs_bmap_free *flist, enum SHIFT_DIRECTION SHIFT,
> +		int num_exts);
> +int	xfs_bmap_split_extent(struct xfs_inode *ip, xfs_fileoff_t split_offset);
> +int	xfs_bmap_shift_extents_right(struct xfs_trans *tp, struct xfs_inode *ip,
> +		xfs_fileoff_t *start_fsb, xfs_fileoff_t *next_fsb,
> +		xfs_fileoff_t offset_shift_fsb, int *done,
> +		xfs_fsblock_t *firstblock, struct xfs_bmap_free *flist,
> +		int num_exts);
>  
>  #endif	/* __XFS_BMAP_H__ */
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 2810026..4ce7f92 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1379,22 +1379,21 @@ out:
>  }
>  
>  /*
> - * xfs_collapse_file_space()
> - *	This routine frees disk space and shift extent for the given file.
> - *	The first thing we do is to free data blocks in the specified range
> - *	by calling xfs_free_file_space(). It would also sync dirty data
> - *	and invalidate page cache over the region on which collapse range
> - *	is working. And Shift extent records to the left to cover a hole.
> - * RETURNS:
> - *	0 on success
> - *	errno on error
> - *
> + * @next_fsb will keep track of the extent currently undergoing shift.
> + * @stop_fsb will keep track of the extent at which we have to stop.
> + * If we are shifting left, we will start with block (offset + len) and
> + * shift each extent till last extent.
> + * If we are shifting right, we will start with last extent inside file space
> + * and continue until we reach the block corresponding to offset.
> + * If right shift, delegate the work of
> + * initialization of next_fsb to xfs_bmap_shift_extent as it has ilock held.

Could you move the bit of the comment about the next_fsb right-shift
init down where we set it to NULLFSBLOCK? That way it is a bit more
clear.

>   */
>  int
> -xfs_collapse_file_space(
> -	struct xfs_inode	*ip,
> -	xfs_off_t		offset,
> -	xfs_off_t		len)
> +xfs_shift_file_space(
> +	struct xfs_inode        *ip,
> +	xfs_off_t               offset,
> +	xfs_off_t               len,
> +	enum SHIFT_DIRECTION	SHIFT)
>  {
>  	int			done = 0;
>  	struct xfs_mount	*mp = ip->i_mount;
> @@ -1403,21 +1402,22 @@ xfs_collapse_file_space(
>  	struct xfs_bmap_free	free_list;
>  	xfs_fsblock_t		first_block;
>  	int			committed;
> -	xfs_fileoff_t		start_fsb;
> +	xfs_fileoff_t		stop_fsb;
>  	xfs_fileoff_t		next_fsb;
>  	xfs_fileoff_t		shift_fsb;
>  
> -	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +	ASSERT(SHIFT == SHIFT_LEFT || SHIFT == SHIFT_RIGHT);
>  
> -	trace_xfs_collapse_file_space(ip);
> +	if (SHIFT == SHIFT_LEFT) {
> +		next_fsb = XFS_B_TO_FSB(mp, offset + len);
> +		stop_fsb = XFS_B_TO_FSB(mp, VFS_I(ip)->i_size);
> +	} else {
> +		next_fsb = NULLFSBLOCK;
> +		stop_fsb = XFS_B_TO_FSB(mp, offset);
> +	}
>  
> -	next_fsb = XFS_B_TO_FSB(mp, offset + len);
>  	shift_fsb = XFS_B_TO_FSB(mp, len);
>  
> -	error = xfs_free_file_space(ip, offset, len);
> -	if (error)
> -		return error;
> -
>  	/*
>  	 * Trim eofblocks to avoid shifting uninitialized post-eof preallocation
>  	 * into the accessible region of the file.
> @@ -1430,20 +1430,23 @@ xfs_collapse_file_space(
>  
>  	/*
>  	 * Writeback and invalidate cache for the remainder of the file as we're
> -	 * about to shift down every extent from the collapse range to EOF. The
> -	 * free of the collapse range above might have already done some of
> -	 * this, but we shouldn't rely on it to do anything outside of the range
> -	 * that was freed.
> +	 * about to shift down every extent from offset to EOF.
>  	 */
>  	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -					     offset + len, -1);
> +					     offset, -1);
>  	if (error)
>  		return error;
>  	error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> -					(offset + len) >> PAGE_CACHE_SHIFT, -1);
> +					offset >> PAGE_CACHE_SHIFT, -1);
>  	if (error)
>  		return error;
>  
> +	if (SHIFT == SHIFT_RIGHT) {
> +		error = xfs_bmap_split_extent(ip, stop_fsb);
> +		if (error)
> +			return error;
> +	}
> +
>  	while (!error && !done) {
>  		tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
>  		/*
> @@ -1475,10 +1478,9 @@ xfs_collapse_file_space(
>  		 * We are using the write transaction in which max 2 bmbt
>  		 * updates are allowed
>  		 */
> -		start_fsb = next_fsb;
> -		error = xfs_bmap_shift_extents(tp, ip, start_fsb, shift_fsb,
> +		error = xfs_bmap_shift_extents(tp, ip, stop_fsb, shift_fsb,

Nice clean up, but could we reorder next_fsb prior to stop_fsb to be a
bit more clean?

>  				&done, &next_fsb, &first_block, &free_list,
> -				XFS_BMAP_MAX_SHIFT_EXTENTS);
> +				SHIFT, XFS_BMAP_MAX_SHIFT_EXTENTS);
>  		if (error)
>  			goto out;
>  
> @@ -1499,6 +1501,60 @@ out:
>  }
>  
>  /*
> + * xfs_collapse_file_space()
> + *	This routine frees disk space and shift extent for the given file.
> + *	The first thing we do is to free data blocks in the specified range
> + *	by calling xfs_free_file_space(). It would also sync dirty data
> + *	and invalidate page cache over the region on which collapse range
> + *	is working. And Shift extent records to the left to cover a hole.
> + * RETURNS:
> + *	0 on success
> + *	errno on error
> + *
> + */
> +int
> +xfs_collapse_file_space(
> +	struct xfs_inode	*ip,
> +	xfs_off_t		offset,
> +	xfs_off_t		len)
> +{
> +	int error;
> +
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +	trace_xfs_collapse_file_space(ip);
> +
> +	error = xfs_free_file_space(ip, offset, len);
> +	if (error)
> +		return error;
> +
> +	return xfs_shift_file_space(ip, offset, len, SHIFT_LEFT);
> +}
> +
> +/*
> + * xfs_insert_file_space()
> + *	This routine create hole space by shifting extents for the given file.
> + *	The first thing we do is to sync dirty data and invalidate page cache
> + *	over the region on which insert range is working. And split an extent
> + *	to two extents at given offset by calling xfs_bmap_split_extent.
> + *	And shift all extent records which are laying between [offset,
> + *	last allocated extent] to the right to reserve hole range.
> + * RETURNS:
> + *	0 on success
> + *	errno on error
> + */
> +int
> +xfs_insert_file_space(
> +	struct xfs_inode	*ip,
> +	loff_t			offset,
> +	loff_t			len)
> +{
> +	ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL));
> +	trace_xfs_insert_file_space(ip);
> +
> +	return xfs_shift_file_space(ip, offset, len, SHIFT_RIGHT);
> +}
> +
> +/*
>   * We need to check that the format of the data fork in the temporary inode is
>   * valid for the target inode before doing the swap. This is not a problem with
>   * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 2fdb72d..6cb116c 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -98,6 +98,8 @@ int	xfs_zero_file_space(struct xfs_inode *ip, xfs_off_t offset,
>  			    xfs_off_t len);
>  int	xfs_collapse_file_space(struct xfs_inode *, xfs_off_t offset,
>  				xfs_off_t len);
> +int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
> +				xfs_off_t len);
>  
>  /* EOF block manipulation functions */
>  bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index eb596b4..93b568d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -790,11 +790,13 @@ xfs_file_fallocate(
>  	struct xfs_trans	*tp;
>  	long			error;
>  	loff_t			new_size = 0;
> +	int			do_file_insert = 0;
>  
>  	if (!S_ISREG(inode->i_mode))
>  		return -EINVAL;
>  	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> -		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE))
> +		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
> +		     FALLOC_FL_INSERT_RANGE))
>  		return -EOPNOTSUPP;
>  
>  	xfs_ilock(ip, XFS_IOLOCK_EXCL);
> @@ -824,6 +826,28 @@ xfs_file_fallocate(
>  		error = xfs_collapse_file_space(ip, offset, len);
>  		if (error)
>  			goto out_unlock;
> +	} else if (mode & FALLOC_FL_INSERT_RANGE) {
> +		unsigned blksize_mask = (1 << inode->i_blkbits) - 1;
> +
> +		if (offset & blksize_mask || len & blksize_mask) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		/* Check for wrap through zero */
> +		if (inode->i_size + len > inode->i_sb->s_maxbytes) {
> +			error = -EFBIG;
> +			goto out_unlock;
> +		}
> +
> +		/* Offset should be less than i_size */
> +		if (offset >= i_size_read(inode)) {
> +			error = -EINVAL;
> +			goto out_unlock;
> +		}
> +
> +		new_size = i_size_read(inode) + len;
> +		do_file_insert = 1;
>  	} else {
>  		if (!(mode & FALLOC_FL_KEEP_SIZE) &&
>  		    offset + len > i_size_read(inode)) {

There's a check that sets XFS_DIFLAG_PREALLOC down after this hunk but
before the next that we probably want to update to exclude insert range
(it already handles collapse).

Brian

> @@ -874,8 +898,20 @@ xfs_file_fallocate(
>  		iattr.ia_valid = ATTR_SIZE;
>  		iattr.ia_size = new_size;
>  		error = xfs_setattr_size(ip, &iattr);
> +		if (error)
> +			goto out_unlock;
>  	}
>  
> +	/*
> +	 * Some operations are performed after the inode size is updated. For
> +	 * example, insert range expands the address space of the file, shifts
> +	 * all subsequent extents to create a hole inside the file. Updating
> +	 * the size first ensures that shifted extents aren't left hanging
> +	 * past EOF in the event of a crash or failure.
> +	 */
> +	if (do_file_insert)
> +		error = xfs_insert_file_space(ip, offset, len);
> +
>  out_unlock:
>  	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
>  	return error;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 51372e3..7e45fa1 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -664,6 +664,7 @@ DEFINE_INODE_EVENT(xfs_alloc_file_space);
>  DEFINE_INODE_EVENT(xfs_free_file_space);
>  DEFINE_INODE_EVENT(xfs_zero_file_space);
>  DEFINE_INODE_EVENT(xfs_collapse_file_space);
> +DEFINE_INODE_EVENT(xfs_insert_file_space);
>  DEFINE_INODE_EVENT(xfs_readdir);
>  #ifdef CONFIG_XFS_POSIX_ACL
>  DEFINE_INODE_EVENT(xfs_get_acl);
> -- 
> 1.7.11-rc0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs