2008-03-07 10:53:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

Ext3 to Ext4 migrate takes inode->i_mutex lock to make sure write
and truncate of the original inode doesn't happen during the migrate.
But a write to mmap area mapping holes of the file can still happen.
Such writes results in new block allocation and changes to the i_data
field of inode. Add i_migrate_sem read write semaphore to protect against
this. The semaphore is taken in read mode when trying to write to the
mmap area. This happens only once via page_mkwrite call back. Taking
the semaphore in read mode ensures that multiple mmap write can still
happen. In the migrate code we take the semaphore in write mode.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 17 +++++++++++++----
fs/ext4/migrate.c | 7 +++++++
fs/ext4/super.c | 1 +
include/linux/ext4_fs_i.h | 12 ++++++++++++
4 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42bc666..7a4c72c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3496,15 +3496,24 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)

int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
{
+ int retval;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
/*
* if ext4_get_block resulted in a split of an uninitialized extent,
* in file system full case, we will have to take the journal write
* access and zero out the page. The journal handle get initialized
* in ext4_get_block.
*/
- /* FIXME!! should we take inode->i_mutex ? Currently we can't because
- * it has a circular locking dependency with DIO. But migrate expect
- * i_mutex to ensure no i_data changes
+ /* we need to make sure we don't allow migrate to happen. This is
+ * needed because a write to area mapping holes can result in block
+ * allocation and update of i_data field. We can't take inode->i_mutex
+ * here because it changes the locking order with rspect to directIO.
+ * directIO expect i_mutex -> mmap_sem. We are here already holding
+ * mmap_sem
*/
- return block_page_mkwrite(vma, page, ext4_get_block);
+ down_read(&(EXT4_I(inode)->i_migrate_sem));
+ retval = block_page_mkwrite(vma, page, ext4_get_block);
+ up_read(&(EXT4_I(inode)->i_migrate_sem));
+
+ return retval;
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 5c1e27d..a3fbd9f 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -508,6 +508,12 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
* switch the inode format to prevent read.
*/
mutex_lock(&(inode->i_mutex));
+ /*
+ * prevent a mmap write to holes
+ * i_mutex is ranked above i_migrate_sem.
+ * The order is i_mutex -> mmap_sem -> i_migrate_sem
+ */
+ down_write(&(EXT4_I(inode)->i_migrate_sem));
handle = ext4_journal_start(inode, 1);

ei = EXT4_I(inode);
@@ -593,6 +599,7 @@ err_out:
tmp_inode->i_nlink = 0;

ext4_journal_stop(handle);
+ up_write(&(EXT4_I(inode)->i_migrate_sem));
mutex_unlock(&(inode->i_mutex));

if (tmp_inode)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2ac34ac..32c2c0e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -599,6 +599,7 @@ static void init_once(struct kmem_cache *cachep, void *foo)
#endif
init_rwsem(&ei->i_data_sem);
inode_init_once(&ei->vfs_inode);
+ init_rwsem(&ei->i_migrate_sem);
}

static int init_inodecache(void)
diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
index d5508d3..96c0b4f 100644
--- a/include/linux/ext4_fs_i.h
+++ b/include/linux/ext4_fs_i.h
@@ -162,6 +162,18 @@ struct ext4_inode_info {
/* mballoc */
struct list_head i_prealloc_list;
spinlock_t i_prealloc_lock;
+
+ /* When doing migrate we need to ensure that the i_data field
+ * doesn't change. With respect to write and truncate we can ensure
+ * the same by taking inode->i_mutex. But a write to mmap area
+ * mapping holes doesn't take i_mutex since it doesn't change the
+ * i_size. We also can't take i_data_sem because we would like to
+ * extend/restart the journal and locking order prevents us from
+ * restarting journal within i_data_sem. This will be taken in
+ * page_mkwrite in the read mode and migrate will take it in the
+ * write mode.
+ */
+ struct rw_semaphore i_migrate_sem;
};

#endif /* _LINUX_EXT4_FS_I */
--
1.5.4.3.422.g34cd6.dirty



2008-03-07 11:18:09

by Mingming Cao

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Fri, 2008-03-07 at 16:23 +0530, Aneesh Kumar K.V wrote:
Hi Aneesh,

> static int init_inodecache(void)
> diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
> index d5508d3..96c0b4f 100644
> --- a/include/linux/ext4_fs_i.h
> +++ b/include/linux/ext4_fs_i.h
> @@ -162,6 +162,18 @@ struct ext4_inode_info {
> /* mballoc */
> struct list_head i_prealloc_list;
> spinlock_t i_prealloc_lock;
> +
> + /* When doing migrate we need to ensure that the i_data field
> + * doesn't change. With respect to write and truncate we can ensure
> + * the same by taking inode->i_mutex. But a write to mmap area
> + * mapping holes doesn't take i_mutex since it doesn't change the
> + * i_size. We also can't take i_data_sem because we would like to
> + * extend/restart the journal and locking order prevents us from
> + * restarting journal within i_data_sem.

How about we start a journal with estimated worse case transaction
credits and then take the i_data_sem down? So that we could ensure that
whenever the i_data_sem is hold, the i_data is protected. That is what
currently DIO does, I think. It would be nice to avoid introducing
another semaphore to protect i_data for migration if we could.

> This will be taken in
> + * page_mkwrite in the read mode and migrate will take it in the
> + * write mode.
> + */
> + struct rw_semaphore i_migrate_sem;
> };
>
> #endif /* _LINUX_EXT4_FS_I */

Mingming


2008-03-07 11:31:12

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Fri, Mar 07, 2008 at 03:17:33AM -0800, Mingming Cao wrote:
> On Fri, 2008-03-07 at 16:23 +0530, Aneesh Kumar K.V wrote:
> Hi Aneesh,
>
> > static int init_inodecache(void)
> > diff --git a/include/linux/ext4_fs_i.h b/include/linux/ext4_fs_i.h
> > index d5508d3..96c0b4f 100644
> > --- a/include/linux/ext4_fs_i.h
> > +++ b/include/linux/ext4_fs_i.h
> > @@ -162,6 +162,18 @@ struct ext4_inode_info {
> > /* mballoc */
> > struct list_head i_prealloc_list;
> > spinlock_t i_prealloc_lock;
> > +
> > + /* When doing migrate we need to ensure that the i_data field
> > + * doesn't change. With respect to write and truncate we can ensure
> > + * the same by taking inode->i_mutex. But a write to mmap area
> > + * mapping holes doesn't take i_mutex since it doesn't change the
> > + * i_size. We also can't take i_data_sem because we would like to
> > + * extend/restart the journal and locking order prevents us from
> > + * restarting journal within i_data_sem.
>
> How about we start a journal with estimated worse case transaction
> credits and then take the i_data_sem down? So that we could ensure that
> whenever the i_data_sem is hold, the i_data is protected. That is what
> currently DIO does, I think. It would be nice to avoid introducing
> another semaphore to protect i_data for migration if we could.
>


Estimating transaction for a single page directIO write may be easy. But
in case of migrate it involves new blocks allocated to carry the extents
and also we free the indirect blocks of ext3 and that would involve
update of bitmap from different groups. I am not sure we will be able to
come up with a value. But if yes and if we can get that many credits
from journal i agree that would be better than introducing a new
semaphore.


> > This will be taken in
> > + * page_mkwrite in the read mode and migrate will take it in the
> > + * write mode.
> > + */
> > + struct rw_semaphore i_migrate_sem;
> > };
> >
> > #endif /* _LINUX_EXT4_FS_I */
>
> Mingming
>

2008-03-07 23:48:03

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Mar 07, 2008 17:01 +0530, Aneesh Kumar K.V wrote:
> On Fri, Mar 07, 2008 at 03:17:33AM -0800, Mingming Cao wrote:
> > How about we start a journal with estimated worse case transaction
> > credits and then take the i_data_sem down? So that we could ensure that
> > whenever the i_data_sem is hold, the i_data is protected. That is what
> > currently DIO does, I think. It would be nice to avoid introducing
> > another semaphore to protect i_data for migration if we could.
>
> Estimating transaction for a single page directIO write may be easy. But
> in case of migrate it involves new blocks allocated to carry the extents
> and also we free the indirect blocks of ext3 and that would involve
> update of bitmap from different groups. I am not sure we will be able to
> come up with a value. But if yes and if we can get that many credits
> from journal i agree that would be better than introducing a new
> semaphore.

Agreed - and if we have a generic routine to calculate the journal
credits needed for a full-file (or better a range) indirect block
operation (including bitmaps, group descriptors, and [dt]indirect
blocks).

I don't think there would be a serious failure case if it wasn't possible
to convert a block-mapped file to extent-mapped while it was mmapped.
At worst the administrator would need to do that some time later, or
after a system reboot, so long as the conversion actually failed if the
file had any mmaps. If this same requirement is introduced when we
get defrag for ext4 (because the block mapping is changing on the file)
then we may have to reconsider the benefits of the more complex code.


Note we can also use the "journal credits needed" for fixing truncate in
a similar manner to do it all in a single transaction to avoid zeroing
all of the indirect blocks. All that would be needed for trunate is to
call the above function, update the on-disk i_size, possibly zero out the
partially-truncated block, and update the group descriptors and bitmaps.
That would also allow "undelete" to work on ext3 again because the
inode i_blocks and indirect blocks wouldn't be zeroed out anymore,
like it was in ext2.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-11 15:25:38

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

> On Mar 07, 2008 17:01 +0530, Aneesh Kumar K.V wrote:
> > On Fri, Mar 07, 2008 at 03:17:33AM -0800, Mingming Cao wrote:
> > > How about we start a journal with estimated worse case transaction
> > > credits and then take the i_data_sem down? So that we could ensure that
> > > whenever the i_data_sem is hold, the i_data is protected. That is what
> > > currently DIO does, I think. It would be nice to avoid introducing
> > > another semaphore to protect i_data for migration if we could.
> >
> > Estimating transaction for a single page directIO write may be easy. But
> > in case of migrate it involves new blocks allocated to carry the extents
> > and also we free the indirect blocks of ext3 and that would involve
> > update of bitmap from different groups. I am not sure we will be able to
> > come up with a value. But if yes and if we can get that many credits
> > from journal i agree that would be better than introducing a new
> > semaphore.
>
> Agreed - and if we have a generic routine to calculate the journal
> credits needed for a full-file (or better a range) indirect block
> operation (including bitmaps, group descriptors, and [dt]indirect
> blocks).
>
> I don't think there would be a serious failure case if it wasn't possible
> to convert a block-mapped file to extent-mapped while it was mmapped.
> At worst the administrator would need to do that some time later, or
> after a system reboot, so long as the conversion actually failed if the
> file had any mmaps. If this same requirement is introduced when we
> get defrag for ext4 (because the block mapping is changing on the file)
> then we may have to reconsider the benefits of the more complex code.
I agree here. IMHO the better option would be to just build the
extent-tree for converted inode on best-effort basis. If we find in
the end that someone has allocated new block to the file (via mmap
filling a hole) while we are converting, we can just cancel the
conversion. Because I think the cost of extra rwsem (both in terms of
additional memory needed for each inode structure and in time needed for
rwsem acquisitions) is more than I as a user would like to bear given
how rare the conversion is.

> Note we can also use the "journal credits needed" for fixing truncate in
> a similar manner to do it all in a single transaction to avoid zeroing
> all of the indirect blocks. All that would be needed for trunate is to
> call the above function, update the on-disk i_size, possibly zero out the
> partially-truncated block, and update the group descriptors and bitmaps.
> That would also allow "undelete" to work on ext3 again because the
> inode i_blocks and indirect blocks wouldn't be zeroed out anymore,
> like it was in ext2.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2008-03-11 16:59:12

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Tue, Mar 11, 2008 at 04:25:37PM +0100, Jan Kara wrote:
> > On Mar 07, 2008 17:01 +0530, Aneesh Kumar K.V wrote:
> > > On Fri, Mar 07, 2008 at 03:17:33AM -0800, Mingming Cao wrote:
> > > > How about we start a journal with estimated worse case transaction
> > > > credits and then take the i_data_sem down? So that we could ensure that
> > > > whenever the i_data_sem is hold, the i_data is protected. That is what
> > > > currently DIO does, I think. It would be nice to avoid introducing
> > > > another semaphore to protect i_data for migration if we could.
> > >
> > > Estimating transaction for a single page directIO write may be easy. But
> > > in case of migrate it involves new blocks allocated to carry the extents
> > > and also we free the indirect blocks of ext3 and that would involve
> > > update of bitmap from different groups. I am not sure we will be able to
> > > come up with a value. But if yes and if we can get that many credits
> > > from journal i agree that would be better than introducing a new
> > > semaphore.
> >
> > Agreed - and if we have a generic routine to calculate the journal
> > credits needed for a full-file (or better a range) indirect block
> > operation (including bitmaps, group descriptors, and [dt]indirect
> > blocks).
> >
> > I don't think there would be a serious failure case if it wasn't possible
> > to convert a block-mapped file to extent-mapped while it was mmapped.
> > At worst the administrator would need to do that some time later, or
> > after a system reboot, so long as the conversion actually failed if the
> > file had any mmaps. If this same requirement is introduced when we
> > get defrag for ext4 (because the block mapping is changing on the file)
> > then we may have to reconsider the benefits of the more complex code.
> I agree here. IMHO the better option would be to just build the
> extent-tree for converted inode on best-effort basis. If we find in
> the end that someone has allocated new block to the file (via mmap
> filling a hole) while we are converting, we can just cancel the
> conversion. Because I think the cost of extra rwsem (both in terms of
> additional memory needed for each inode structure and in time needed for
> rwsem acquisitions) is more than I as a user would like to bear given
> how rare the conversion is.
>
Something like the below ??

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 059f2fc..a52904b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3502,9 +3502,5 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
* access and zero out the page. The journal handle get initialized
* in ext4_get_block.
*/
- /* FIXME!! should we take inode->i_mutex ? Currently we can't because
- * it has a circular locking dependency with DIO. But migrate expect
- * i_mutex to ensure no i_data changes
- */
return block_page_mkwrite(vma, page, ext4_get_block);
}
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 5c1e27d..c6391e9 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -327,7 +327,7 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
}

static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
- struct inode *tmp_inode)
+ struct inode *tmp_inode, blkcnt_t total_blocks)
{
int retval;
__le32 i_data[3];
@@ -350,6 +350,13 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
i_data[2] = ei->i_data[EXT4_TIND_BLOCK];

down_write(&EXT4_I(inode)->i_data_sem);
+ /* check for number of blocks */
+ if (total_blocks != inode->i_blocks) {
+ retval = -EAGAIN;
+ up_write(&EXT4_I(inode)->i_data_sem);
+ goto err_out;
+
+ }
/*
* We have the extent map build with the tmp inode.
* Now copy the i_data across
@@ -445,6 +452,7 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
struct inode *tmp_inode = NULL;
struct list_blocks_struct lb;
unsigned long max_entries;
+ blkcnt_t total_blocks;

if (!test_opt(inode->i_sb, EXTENTS))
/*
@@ -508,6 +516,12 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
* switch the inode format to prevent read.
*/
mutex_lock(&(inode->i_mutex));
+ /*
+ * Even though we take i_mutex we can still cause block allocation
+ * via mmap write to holes. If we have allocated new blocks we fail
+ * migrate.
+ */
+ total_blocks = inode->i_blocks;
handle = ext4_journal_start(inode, 1);

ei = EXT4_I(inode);
@@ -561,7 +575,7 @@ err_out:
free_ext_block(handle, tmp_inode);
else
retval = ext4_ext_swap_inode_data(handle, inode,
- tmp_inode);
+ tmp_inode, total_blocks);

/* We mark the tmp_inode dirty via ext4_ext_tree_init. */
if (ext4_journal_extend(handle, 1) != 0)

2008-03-12 08:57:37

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Mar 11, 2008 22:28 +0530, Aneesh Kumar K.V wrote:
> On Tue, Mar 11, 2008 at 04:25:37PM +0100, Jan Kara wrote:
> > I agree here. IMHO the better option would be to just build the
> > extent-tree for converted inode on best-effort basis. If we find in
> > the end that someone has allocated new block to the file (via mmap
> > filling a hole) while we are converting, we can just cancel the
> > conversion. Because I think the cost of extra rwsem (both in terms of
> > additional memory needed for each inode structure and in time needed for
> > rwsem acquisitions) is more than I as a user would like to bear given
> > how rare the conversion is.
>
> Something like the below ??
>
> down_write(&EXT4_I(inode)->i_data_sem);
> + /* check for number of blocks */
> + if (total_blocks != inode->i_blocks) {
> + retval = -EAGAIN;
> + up_write(&EXT4_I(inode)->i_data_sem);
> + goto err_out;

Is this enough, or should we use the inode version instead?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-12 09:08:21

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Wed, Mar 12, 2008 at 02:56:29AM -0600, Andreas Dilger wrote:
> On Mar 11, 2008 22:28 +0530, Aneesh Kumar K.V wrote:
> > On Tue, Mar 11, 2008 at 04:25:37PM +0100, Jan Kara wrote:
> > > I agree here. IMHO the better option would be to just build the
> > > extent-tree for converted inode on best-effort basis. If we find in
> > > the end that someone has allocated new block to the file (via mmap
> > > filling a hole) while we are converting, we can just cancel the
> > > conversion. Because I think the cost of extra rwsem (both in terms of
> > > additional memory needed for each inode structure and in time needed for
> > > rwsem acquisitions) is more than I as a user would like to bear given
> > > how rare the conversion is.
> >
> > Something like the below ??
> >
> > down_write(&EXT4_I(inode)->i_data_sem);
> > + /* check for number of blocks */
> > + if (total_blocks != inode->i_blocks) {
> > + retval = -EAGAIN;
> > + up_write(&EXT4_I(inode)->i_data_sem);
> > + goto err_out;
>
> Is this enough, or should we use the inode version instead

We are already holding inode->i_mutex. So the only possible operation is
adding new blocks via mmap write to holes. Also inode version is tricky
because it is available only with i_version mount option. We are also
interested only in new blocks added event.


-aneesh

2008-03-12 11:19:48

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH] ext4: Fix the locking with respect to ext3 to ext4 migrate.

On Tue 11-03-08 22:28:59, Aneesh Kumar K.V wrote:
> On Tue, Mar 11, 2008 at 04:25:37PM +0100, Jan Kara wrote:
> > > On Mar 07, 2008 17:01 +0530, Aneesh Kumar K.V wrote:
> > > > On Fri, Mar 07, 2008 at 03:17:33AM -0800, Mingming Cao wrote:
> > > > > How about we start a journal with estimated worse case transaction
> > > > > credits and then take the i_data_sem down? So that we could ensure that
> > > > > whenever the i_data_sem is hold, the i_data is protected. That is what
> > > > > currently DIO does, I think. It would be nice to avoid introducing
> > > > > another semaphore to protect i_data for migration if we could.
> > > >
> > > > Estimating transaction for a single page directIO write may be easy. But
> > > > in case of migrate it involves new blocks allocated to carry the extents
> > > > and also we free the indirect blocks of ext3 and that would involve
> > > > update of bitmap from different groups. I am not sure we will be able to
> > > > come up with a value. But if yes and if we can get that many credits
> > > > from journal i agree that would be better than introducing a new
> > > > semaphore.
> > >
> > > Agreed - and if we have a generic routine to calculate the journal
> > > credits needed for a full-file (or better a range) indirect block
> > > operation (including bitmaps, group descriptors, and [dt]indirect
> > > blocks).
> > >
> > > I don't think there would be a serious failure case if it wasn't possible
> > > to convert a block-mapped file to extent-mapped while it was mmapped.
> > > At worst the administrator would need to do that some time later, or
> > > after a system reboot, so long as the conversion actually failed if the
> > > file had any mmaps. If this same requirement is introduced when we
> > > get defrag for ext4 (because the block mapping is changing on the file)
> > > then we may have to reconsider the benefits of the more complex code.
> > I agree here. IMHO the better option would be to just build the
> > extent-tree for converted inode on best-effort basis. If we find in
> > the end that someone has allocated new block to the file (via mmap
> > filling a hole) while we are converting, we can just cancel the
> > conversion. Because I think the cost of extra rwsem (both in terms of
> > additional memory needed for each inode structure and in time needed for
> > rwsem acquisitions) is more than I as a user would like to bear given
> > how rare the conversion is.
> >
> Something like the below ??
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 059f2fc..a52904b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3502,9 +3502,5 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> * access and zero out the page. The journal handle get initialized
> * in ext4_get_block.
> */
> - /* FIXME!! should we take inode->i_mutex ? Currently we can't because
> - * it has a circular locking dependency with DIO. But migrate expect
> - * i_mutex to ensure no i_data changes
> - */
> return block_page_mkwrite(vma, page, ext4_get_block);
> }
> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index 5c1e27d..c6391e9 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -327,7 +327,7 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
> }
>
> static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> - struct inode *tmp_inode)
> + struct inode *tmp_inode, blkcnt_t total_blocks)
> {
> int retval;
> __le32 i_data[3];
> @@ -350,6 +350,13 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
> i_data[2] = ei->i_data[EXT4_TIND_BLOCK];
>
> down_write(&EXT4_I(inode)->i_data_sem);
> + /* check for number of blocks */
> + if (total_blocks != inode->i_blocks) {
> + retval = -EAGAIN;
> + up_write(&EXT4_I(inode)->i_data_sem);
> + goto err_out;
> +
> + }
Hmm, I'm just wondering whether this check wouldn't be unreliable because
we can e.g. free xattr block (AFAICS this is protected only by xattr_sem)
and allocate data block. I agree with you that checking i_version isn't
good as well because it gets updated unnecessarily often and we don't
always maintain it. We could take xattr_sem to prevent the race but that
seems a bit awkward. I'm wondering whether we don't have another way
of checking whether allocation to the file changed or not.
If there's no better way of checking, we could also do something as
follows: when we start migration, we set "MIGRATING" flag in memory for the
inode (this setting would be protected by i_data_sem). In allocation routines,
we just unconditionally clear this flag - we are again under i_data_sem so
this should be safe. In the end of migration, we check that MIGRATING flag
is still set. This should be quite lightweight and safe...

> /*
> * We have the extent map build with the tmp inode.
> * Now copy the i_data across
> @@ -445,6 +452,7 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
> struct inode *tmp_inode = NULL;
> struct list_blocks_struct lb;
> unsigned long max_entries;
> + blkcnt_t total_blocks;
>
> if (!test_opt(inode->i_sb, EXTENTS))
> /*
> @@ -508,6 +516,12 @@ int ext4_ext_migrate(struct inode *inode, struct file *filp,
> * switch the inode format to prevent read.
> */
> mutex_lock(&(inode->i_mutex));
> + /*
> + * Even though we take i_mutex we can still cause block allocation
> + * via mmap write to holes. If we have allocated new blocks we fail
> + * migrate.
> + */
> + total_blocks = inode->i_blocks;
> handle = ext4_journal_start(inode, 1);
>
> ei = EXT4_I(inode);
> @@ -561,7 +575,7 @@ err_out:
> free_ext_block(handle, tmp_inode);
> else
> retval = ext4_ext_swap_inode_data(handle, inode,
> - tmp_inode);
> + tmp_inode, total_blocks);
>
> /* We mark the tmp_inode dirty via ext4_ext_tree_init. */
> if (ext4_journal_extend(handle, 1) != 0)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR