2008-08-28 17:28:00

by Aneesh Kumar K.V

Subject: [PATCH] ext4: truncate block allocated on a failed ext4_write_begin

For blocksize < pagesize we need to remove blocks that
got allocated in block_write_begin if we fail with ENOSPC
for later blocks. block_write_begin internally does
this if it allocated the page locally. This makes sure
we don't have blocks outside inode.i_size during
ENOSPC.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 473d888..a8c6bc8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1394,6 +1394,13 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
unlock_page(page);
ext4_journal_stop(handle);
page_cache_release(page);
+ /*
+ * block_write_begin may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
}

if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
--
1.6.0.1.90.g27a6e
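
Editorial sketch (not part of the patch): the failure mode described in the commit message can be modelled in userspace. Everything below is hypothetical: the `model_*` names, the structs, and the assumption that the file has no holes are illustration only, and `model_truncate_to_i_size` merely stands in for `vmtruncate(inode, inode->i_size)`. The point it tries to show is that a page-sized write over several smaller blocks can fail part-way, leaving the earlier blocks allocated past i_size unless the error path trims them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace model only; none of these names exist in the kernel. */
struct model_fs {
	uint64_t free_blocks;	/* blocks still available on the filesystem */
};

struct model_inode {
	uint64_t i_size;	/* file size in bytes */
	uint64_t nblocks;	/* blocks allocated, assuming a file without holes */
	uint32_t blocksize;	/* smaller than the page size in this scenario */
};

/* Allocate the blocks backing [pos, pos + len) one by one; may fail part-way. */
static int model_block_write_begin(struct model_fs *fs, struct model_inode *inode,
				   uint64_t pos, uint32_t len)
{
	assert(inode->blocksize != 0 && len != 0);

	uint64_t first = pos / inode->blocksize;
	uint64_t last = (pos + len - 1) / inode->blocksize;

	for (uint64_t b = first; b <= last; b++) {
		if (b < inode->nblocks)
			continue;		/* block already allocated */
		if (fs->free_blocks == 0)
			return -ENOSPC;		/* earlier blocks stay allocated */
		fs->free_blocks--;
		inode->nblocks = b + 1;
	}
	return 0;
}

/* Stand-in for vmtruncate(inode, inode->i_size): free blocks past i_size. */
static void model_truncate_to_i_size(struct model_fs *fs, struct model_inode *inode)
{
	uint64_t keep = (inode->i_size + inode->blocksize - 1) / inode->blocksize;

	if (inode->nblocks > keep) {
		fs->free_blocks += inode->nblocks - keep;
		inode->nblocks = keep;
	}
}

/* write_begin with the cleanup the patch adds on the failure path. */
static int model_write_begin(struct model_fs *fs, struct model_inode *inode,
			     uint64_t pos, uint32_t len)
{
	int ret = model_block_write_begin(fs, inode, pos, len);

	if (ret && pos + len > inode->i_size)
		model_truncate_to_i_size(fs, inode);
	return ret;
}
```

In this model, a 4096-byte write at offset 4096 with 1024-byte blocks and only two free blocks allocates two blocks and then fails; without the trim the inode would keep six blocks for a 4096-byte file, with it the two blocks return to the free pool.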



2008-08-28 17:28:03

by Aneesh Kumar K.V

Subject: [PATCH] ext4: Properly update i_disksize.

With delayed allocation we use i_data_sem to update
i_disksize. We need to update i_disksize only
if the new size specified is greater than the
current value, and we need to make sure we don't
race with other i_disksize updates. With delayed
allocation we will switch to the nondelalloc write_begin
if we are low on free blocks. That means the nondelalloc
write_begin also needs to use the same locking.

We also need to check and update i_disksize even
if the new size is less than inode.i_size because
of delayed allocation.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/ext4.h | 11 +++++++++
fs/ext4/extents.c | 9 ++++---
fs/ext4/inode.c | 61 ++++++++++++++++++++++++++++++++--------------------
3 files changed, 53 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 13c69ed..bc856e3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1307,6 +1307,17 @@ do { \
#define EXT4_FREEBLOCKS_WATERMARK 0
#endif

+static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
+{
+ /*
+ * XXX: replace with spinlock if seen contended -bzzz
+ */
+ down_write(&EXT4_I(inode)->i_data_sem);
+ if (newsize > EXT4_I(inode)->i_disksize)
+ EXT4_I(inode)->i_disksize = newsize;
+ up_write(&EXT4_I(inode)->i_data_sem);
+ return ;
+}

/*
* Inodes and files operations
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 268e96d..7def792 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3018,10 +3018,11 @@ static void ext4_falloc_update_inode(struct inode *inode,
* Update only when preallocation was requested beyond
* the file size.
*/
- if (!(mode & FALLOC_FL_KEEP_SIZE) &&
- new_size > i_size_read(inode)) {
- i_size_write(inode, new_size);
- EXT4_I(inode)->i_disksize = new_size;
+ if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+ if (new_size > i_size_read(inode))
+ i_size_write(inode, new_size);
+ if (new_size > EXT4_I(inode)->i_disksize)
+ ext4_update_i_disksize(inode, new_size);
}

}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a8c6bc8..605ae24 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1437,16 +1437,18 @@ static int ext4_ordered_write_end(struct file *file,
ret = ext4_jbd2_file_inode(handle, inode);

if (ret == 0) {
- /*
- * generic_write_end() will run mark_inode_dirty() if i_size
- * changes. So let's piggyback the i_disksize mark_inode_dirty
- * into that.
- */
loff_t new_i_size;

new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize)
- EXT4_I(inode)->i_disksize = new_i_size;
+ if (new_i_size > EXT4_I(inode)->i_disksize) {
+ ext4_update_i_disksize(inode, new_i_size);
+ /* We need to mark inode dirty even if
+ * new_i_size is less than inode->i_size
+ * but greater than i_disksize. (hint: delalloc)
+ */
+ ext4_mark_inode_dirty(handle, inode);
+ }
+
ret2 = generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
@@ -1471,8 +1473,14 @@ static int ext4_writeback_write_end(struct file *file,
loff_t new_i_size;

new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize)
- EXT4_I(inode)->i_disksize = new_i_size;
+ if (new_i_size > EXT4_I(inode)->i_disksize) {
+ ext4_update_i_disksize(inode, new_i_size);
+ /* We need to mark inode dirty even if
+ * new_i_size is less than inode->i_size
+ * but greater than i_disksize. (hint: delalloc)
+ */
+ ext4_mark_inode_dirty(handle, inode);
+ }

ret2 = generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
@@ -1497,6 +1505,7 @@ static int ext4_journalled_write_end(struct file *file,
int ret = 0, ret2;
int partial = 0;
unsigned from, to;
+ loff_t new_i_size;

from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1511,11 +1520,12 @@ static int ext4_journalled_write_end(struct file *file,
to, &partial, write_end_fn);
if (!partial)
SetPageUptodate(page);
- if (pos+copied > inode->i_size)
+ new_i_size = pos + copied;
+ if (new_i_size > inode->i_size)
i_size_write(inode, pos+copied);
EXT4_I(inode)->i_state |= EXT4_STATE_JDATA;
- if (inode->i_size > EXT4_I(inode)->i_disksize) {
- EXT4_I(inode)->i_disksize = inode->i_size;
+ if (new_i_size > EXT4_I(inode)->i_disksize) {
+ ext4_update_i_disksize(inode, new_i_size);
ret2 = ext4_mark_inode_dirty(handle, inode);
if (!ret)
ret = ret2;
@@ -2227,18 +2237,9 @@ static int ext4_da_get_block_write(struct inode *inode, sector_t iblock,
if (disksize > i_size_read(inode))
disksize = i_size_read(inode);
if (disksize > EXT4_I(inode)->i_disksize) {
- /*
- * XXX: replace with spinlock if seen contended -bzzz
- */
- down_write(&EXT4_I(inode)->i_data_sem);
- if (disksize > EXT4_I(inode)->i_disksize)
- EXT4_I(inode)->i_disksize = disksize;
- up_write(&EXT4_I(inode)->i_data_sem);
-
- if (EXT4_I(inode)->i_disksize == disksize) {
- ret = ext4_mark_inode_dirty(handle, inode);
- return ret;
- }
+ ext4_update_i_disksize(inode, disksize);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ return ret;
}
ret = 0;
}
@@ -2567,6 +2568,13 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
unlock_page(page);
ext4_journal_stop(handle);
page_cache_release(page);
+ /*
+ * block_write_begin may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
}

if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -2647,6 +2655,11 @@ static int ext4_da_write_end(struct file *file,
EXT4_I(inode)->i_disksize = new_i_size;
}
up_write(&EXT4_I(inode)->i_data_sem);
+ /* We need to mark inode dirty even if
+ * new_i_size is less than inode->i_size
+ * but greater than i_disksize. (hint: delalloc)
+ */
+ ext4_mark_inode_dirty(handle, inode);
}
}
ret2 = generic_write_end(file, mapping, pos, len, copied,
--
1.6.0.1.90.g27a6e
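
Editorial sketch (not part of the patch): the locking rule the commit message describes, "only grow i_disksize, and do the compare and the store under one lock", can be modelled in userspace. The struct and function names are hypothetical, and a pthread mutex is swapped in for the kernel's i_data_sem rw_semaphore taken in write mode; this is an analogy under those assumptions, not the kernel implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model; the struct and function names are illustrative only. */
struct model_inode_info {
	pthread_mutex_t i_data_sem;	/* stands in for the rw_semaphore held for write */
	int64_t i_disksize;		/* size recorded for the on-disk inode */
};

/* Raise i_disksize to newsize, never lower it, with check and store under the lock. */
static void model_update_i_disksize(struct model_inode_info *ei, int64_t newsize)
{
	assert(ei != NULL);

	pthread_mutex_lock(&ei->i_data_sem);
	if (newsize > ei->i_disksize)
		ei->i_disksize = newsize;
	pthread_mutex_unlock(&ei->i_data_sem);
}
```

Because the comparison happens inside the critical section, two concurrent callers cannot overwrite a larger value with a smaller one, whichever order they run in. An unlocked pre-check by the caller, as the write_end paths in the patch do, only serves to skip the lock when no update is needed.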


2008-08-28 17:28:01

by Aneesh Kumar K.V

Subject: [PATCH] ext3: truncate block allocated on a failed ext3_write_begin

For blocksize < pagesize we need to remove blocks that
got allocated in block_write_begin if we fail with ENOSPC
for later blocks. block_write_begin internally does
this if it allocated the page locally. This makes sure
we don't have blocks outside inode.i_size during
ENOSPC.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/inode.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index ebfec4d..f8424ad 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1186,6 +1186,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);
+ /*
+ * block_write_begin may have instantiated a few blocks
+ * outside i_size. Trim these off again. Don't need
+ * i_size_read because we hold i_mutex.
+ */
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
}
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
--
1.6.0.1.90.g27a6e


2008-08-28 18:14:00

by Eric Sandeen

Subject: Re: [PATCH] ext3: truncate block allocated on a failed ext3_write_begin

Aneesh Kumar K.V wrote:
> For blocksize < pagesize we need to remove blocks that
> got allocated in block_write_begin if we fail with ENOSPC
> for later blocks. block_write_begin internally does
> this if it allocated the page locally. This makes sure
> we don't have blocks outside inode.i_size during
> ENOSPC.

I think this is good; here's an easy testcase:

# dd if=/dev/zero of=fsfile bs=1k count=2048
# mkfs.ext3 -F fsfile
# mkdir mnt
# mount -o loop fsfile mnt/
# dd if=/dev/zero of=mnt/1kfile bs=1k count=1
# cd mnt/
# dd if=/dev/zero of=bigfile bs=4k
# cd ..
# umount mnt
# e2fsck -f fsfile
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Inode 13, i_size is 974848, should be 976896. Fix<y>?
...

can you test with that, unless you already have a testcase you've used?

Assuming it fixes it (it should) you can add:

Acked-by: Eric Sandeen <[email protected]>

-Eric

> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/ext3/inode.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index ebfec4d..f8424ad 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1186,6 +1186,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> ext3_journal_stop(handle);
> unlock_page(page);
> page_cache_release(page);
> + /*
> + * block_write_begin may have instantiated a few blocks
> + * outside i_size. Trim these off again. Don't need
> + * i_size_read because we hold i_mutex.
> + */
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> }
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
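
Editorial sketch (not part of the message): the two sizes in the e2fsck output of the testcase above can be checked with a little arithmetic. The helper names are hypothetical, and the 1 KiB block size and 4 KiB page size are assumptions (a small image like this one would typically get 1k blocks from mkfs, but the message does not say so).

```c
#include <assert.h>
#include <stdint.h>

/* Blocks between the inode's i_size and the size e2fsck derives from the block map. */
static uint64_t blocks_past_i_size(uint64_t i_size, uint64_t fsck_size,
				   uint32_t blocksize)
{
	assert(blocksize != 0);
	if (fsck_size <= i_size)
		return 0;
	/* Round up so a partially covered block still counts. */
	return (fsck_size - i_size + blocksize - 1) / blocksize;
}

/* Non-zero when value is a whole multiple of unit. */
static int is_aligned(uint64_t value, uint32_t unit)
{
	assert(unit != 0);
	return value % unit == 0;
}
```

With i_size = 974848 and the expected size 976896, the difference is 2048 bytes, i.e. two blocks under the 1k assumption. The i_size is a multiple of 4096 while the expected size is not, which is consistent with the last 4k write from dd getting two of its four blocks before hitting ENOSPC.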


2008-08-29 08:43:45

by Dmitry Monakhov

Subject: Re: [PATCH] ext3: truncate block allocated on a failed ext3_write_begin

"Aneesh Kumar K.V" <[email protected]> writes:

> For blocksize < pagesize we need to remove blocks that
> got allocated in block_write_begin if we fail with ENOSPC
> for later blocks. block_write_begin internally does
> this if it allocated the page locally. This makes sure
> we don't have blocks outside inode.i_size during
> ENOSPC.
BTW, why was this check moved from generic_XXX_write to fs-specific code?
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/ext3/inode.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index ebfec4d..f8424ad 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1186,6 +1186,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> ext3_journal_stop(handle);
> unlock_page(page);
> page_cache_release(page);
> + /*
> + * block_write_begin may have instantiated a few blocks
> + * outside i_size. Trim these off again. Don't need
> + * i_size_read because we hold i_mutex.
> + */
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> }
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> --
> 1.6.0.1.90.g27a6e
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-29 09:51:24

by Aneesh Kumar K.V

Subject: Re: [PATCH] ext3: truncate block allocated on a failed ext3_write_begin

On Fri, Aug 29, 2008 at 12:33:24PM +0400, Dmitri Monakhov wrote:
> "Aneesh Kumar K.V" <[email protected]> writes:
>
> > For blocksize < pagesize we need to remove blocks that
> > got allocated in block_write_begin if we fail with ENOSPC
> > for later blocks. block_write_begin internally does
> > this if it allocated the page locally. This makes sure
> > we don't have blocks outside inode.i_size during
> > ENOSPC.
> BTW, why was this check moved from generic_XXX_write to fs-specific code?


Not quite sure what you mean by that. The generic block_write_begin
code already does the vmtruncate if it allocated the page locally.
ext3/4 allocate the page (grab_cache_page) in write_begin and pass the
page pointer to block_write_begin. That implies block_write_begin
won't do the truncate, so we have to do it in the filesystem's
write_begin. We also have to do it after unlocking the page.
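
Editorial sketch (not part of the message): the division of responsibility described above can be written as a small decision function. All names are hypothetical and the model is deliberately reduced to one question, namely who supplied the page; it does not model page locking or the ordering against unlock_page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder type; only whether a page pointer was passed in matters here. */
struct model_page {
	int placeholder;
};

/* The generic helper cleans up only when it allocated the page itself. */
static bool generic_helper_truncates(const struct model_page *caller_page)
{
	return caller_page == NULL;
}

/* True when the filesystem's write_begin has to trim blocks past i_size itself. */
static bool fs_must_truncate(const struct model_page *caller_page, int ret,
			     uint64_t pos, uint32_t len, uint64_t i_size)
{
	assert(len != 0);
	if (ret == 0)
		return false;		/* nothing failed, nothing to undo */
	if (generic_helper_truncates(caller_page))
		return false;		/* generic code already trimmed */
	return pos + len > i_size;	/* same condition as in the patch */
}
```

Since ext3 and ext4 pass their own page in, the first branch of the cleanup never applies to them, which is why the patch repeats the `pos + len > inode->i_size` check in the filesystem code.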


> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > fs/ext3/inode.c | 7 +++++++
> > 1 files changed, 7 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index ebfec4d..f8424ad 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1186,6 +1186,13 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> > ext3_journal_stop(handle);
> > unlock_page(page);
> > page_cache_release(page);
> > + /*
> > + * block_write_begin may have instantiated a few blocks
> > + * outside i_size. Trim these off again. Don't need
> > + * i_size_read because we hold i_mutex.
> > + */
> > + if (pos + len > inode->i_size)
> > + vmtruncate(inode, inode->i_size);
> > }
> > if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> > goto retry;
> > --
> > 1.6.0.1.90.g27a6e
> >

2008-08-29 14:47:34

by Aneesh Kumar K.V

Subject: Re: [PATCH] ext3: truncate block allocated on a failed ext3_write_begin

On Thu, Aug 28, 2008 at 12:56:28PM -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > For blocksize < pagesize we need to remove blocks that
> > got allocated in block_write_begin if we fail with ENOSPC
> > for later blocks. block_write_begin internally does
> > this if it allocated the page locally. This makes sure
> > we don't have blocks outside inode.i_size during
> > ENOSPC.
>
> I think this is good; here's an easy testcase:
>
> # dd if=/dev/zero of=fsfile bs=1k count=2048
> # mkfs.ext3 -F fsfile
> # mkdir mnt
> # mount -o loop fsfile mnt/
> # dd if=/dev/zero of=mnt/1kfile bs=1k count=1
> # cd mnt/
> # dd if=/dev/zero of=bigfile bs=4k
> # cd ..
> # umount mnt
> # e2fsck -f fsfile
> e2fsck 1.39 (29-May-2006)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 13, i_size is 974848, should be 976896. Fix<y>?
> ...
>
> can you test with that, unless you already have a testcase you've used?

I tested the above and it works fine. The problem appeared during an
fsstress run, and I used fsstress to debug and fix it.


>
> Assuming it fixes it (it should) you can add:
>
> Acked-by: Eric Sandeen <[email protected]>
>

-aneesh