2009-03-26 18:21:54

by Jan Kara

[permalink] [raw]
Subject: [PATCH] ext3: Avoid false EIO errors

Sometimes block_write_begin() can map buffers in a page but later we fail to
copy data into those buffers (because the source page has been paged out in the
mean time). We then end up with !uptodate mapped buffers. To add a bit more to
the confusion, block_write_end() does not commit any data (and thus does not
any mark buffers as uptodate) if we didn't succeed with copying all the data.

Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
missed these cases and thus we were inserting non-uptodate buffers to
transaction's list which confuses JBD code and it reports IO errors, aborts
a transaction and generally makes users afraid about their data ;-P.

This patch fixes the problem by reorganizing ext3_..._write_end() code to
first call block_write_end() to mark buffers with valid data uptodate and
after that we file only uptodate buffers to transaction's lists. Also
fix a problem where we could leave blocks allocated beyond i_size (i_disksize
in fact).

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 99 +++++++++++++++++++++++-------------------------------
1 files changed, 42 insertions(+), 57 deletions(-)

As a side note, ext4 / JBD2 only needs the "proper i_disksize update"
part of the fix (we got rid of special handling of ordered mode buffers
there). But I have to first figure out how to properly do it...
Andrew, would you please merge the patch? Thanks.

Honza

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5fa453b..e230f7a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1211,6 +1211,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
return err;
}

+/* For ordered writepage and write_end functions */
+static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
+{
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (buffer_mapped(bh) && buffer_uptodate(bh))
+ return ext3_journal_dirty_data(handle, bh);
+ return 0;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1221,26 +1233,29 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
}

/*
- * Generic write_end handler for ordered and writeback ext3 journal modes.
- * We can't use generic_write_end, because that unlocks the page and we need to
- * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
- * after block_write_end.
+ * This is nasty and subtle: ext3_write_begin() could have allocated blocks
+ * for the whole page but later we failed to copy the data in. So the disk
+ * size we really have allocated is pos + len (block_write_end() has zeroed
+ * the freshly allocated buffers so we aren't going to write garbage). But we
+ * want to keep i_size at the place where data copying finished so that we
+ * don't confuse readers. The worst what can happen is that we expose a page
+ * of zeros at the end of file after a crash...
*/
-static int ext3_generic_write_end(struct file *file,
- struct address_space *mapping,
- loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata)
+static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
+ unsigned copied)
{
- struct inode *inode = file->f_mapping->host;
-
- copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ int mark_dirty = 0;

- if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
- mark_inode_dirty(inode);
+ if (pos + len > EXT3_I(inode)->i_disksize) {
+ mark_dirty = 1;
+ EXT3_I(inode)->i_disksize = pos + len;
}
-
- return copied;
+ if (pos + copied > inode->i_size) {
+ i_size_write(inode, pos + copied);
+ mark_dirty = 1;
+ }
+ if (mark_dirty)
+ mark_inode_dirty(inode);
}

/*
@@ -1260,29 +1275,17 @@ static int ext3_ordered_write_end(struct file *file,
unsigned from, to;
int ret = 0, ret2;

+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+ /* See comment at update_file_sizes() for why we check buffers upto
+ * from + len */
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
-
ret = walk_page_buffers(handle, page_buffers(page),
- from, to, NULL, ext3_journal_dirty_data);
+ from, to, NULL, journal_dirty_data_fn);

- if (ret == 0) {
- /*
- * generic_write_end() will run mark_inode_dirty() if i_size
- * changes. So let's piggyback the i_disksize mark_inode_dirty
- * into that.
- */
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
- }
+ if (ret == 0)
+ update_file_sizes(inode, pos, len, copied);
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
@@ -1299,22 +1302,11 @@ static int ext3_writeback_write_end(struct file *file,
{
handle_t *handle = ext3_journal_current_handle();
struct inode *inode = file->f_mapping->host;
- int ret = 0, ret2;
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
-
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
+ int ret;

- ret2 = ext3_journal_stop(handle);
- if (!ret)
- ret = ret2;
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ update_file_sizes(inode, pos, len, copied);
+ ret = ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);

@@ -1428,13 +1420,6 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

-static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
-{
- if (buffer_mapped(bh))
- return ext3_journal_dirty_data(handle, bh);
- return 0;
-}


2009-03-27 18:08:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Thu, Mar 26, 2009 at 07:21:51PM +0100, Jan Kara wrote:
> Sometimes block_write_begin() can map buffers in a page but later we fail to
> copy data into those buffers (because the source page has been paged out in the
> mean time). We then end up with !uptodate mapped buffers. To add a bit more to
> the confusion, block_write_end() does not commit any data (and thus does not
> any mark buffers as uptodate) if we didn't succeed with copying all the data.
>
> Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> missed these cases and thus we were inserting non-uptodate buffers to
> transaction's list which confuses JBD code and it reports IO errors, aborts
> a transaction and generally makes users afraid about their data ;-P.
>
> This patch fixes the problem by reorganizing ext3_..._write_end() code to
> first call block_write_end() to mark buffers with valid data uptodate and
> after that we file only uptodate buffers to transaction's lists. Also
> fix a problem where we could leave blocks allocated beyond i_size (i_disksize
> in fact).
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext3/inode.c | 99 +++++++++++++++++++++++-------------------------------
> 1 files changed, 42 insertions(+), 57 deletions(-)
>
> As a side note, ext4 / JBD2 only needs the "proper i_disksize update"
> part of the fix (we got rid of special handling of ordered mode buffers
> there). But I have to first figure out how to properly do it...
> Andrew, would you please merge the patch? Thanks.
>
> Honza
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 5fa453b..e230f7a 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1211,6 +1211,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> return err;
> }
>
> +/* For ordered writepage and write_end functions */
> +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> +{
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (buffer_mapped(bh) && buffer_uptodate(bh))
> + return ext3_journal_dirty_data(handle, bh);
> + return 0;
> +}
> +
> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1221,26 +1233,29 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> }
>
> /*
> - * Generic write_end handler for ordered and writeback ext3 journal modes.
> - * We can't use generic_write_end, because that unlocks the page and we need to
> - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> - * after block_write_end.
> + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> + * for the whole page but later we failed to copy the data in. So the disk
> + * size we really have allocated is pos + len (block_write_end() has zeroed
> + * the freshly allocated buffers so we aren't going to write garbage). But we
> + * want to keep i_size at the place where data copying finished so that we
> + * don't confuse readers. The worst what can happen is that we expose a page
> + * of zeros at the end of file after a crash...
> */
> -static int ext3_generic_write_end(struct file *file,
> - struct address_space *mapping,
> - loff_t pos, unsigned len, unsigned copied,
> - struct page *page, void *fsdata)
> +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> + unsigned copied)
> {
> - struct inode *inode = file->f_mapping->host;
> -
> - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> + int mark_dirty = 0;
>
> - if (pos+copied > inode->i_size) {
> - i_size_write(inode, pos+copied);
> - mark_inode_dirty(inode);
> + if (pos + len > EXT3_I(inode)->i_disksize) {
> + mark_dirty = 1;
> + EXT3_I(inode)->i_disksize = pos + len;
> }

Won't this result in file having wrong contents if we failed to copy
the contents from user space? Now if we successfully allocated
blocks and we failed to copy the contents from user space, the above
would result in update of i_disksize and a mark_inode_dirty. Doesn't
that imply we have wrong contents in those block for which we failed to
copy the contents from user space ?

> -
> - return copied;
> + if (pos + copied > inode->i_size) {
> + i_size_write(inode, pos + copied);
> + mark_dirty = 1;
> + }
> + if (mark_dirty)
> + mark_inode_dirty(inode);
> }
>

-aneesh

2009-03-27 20:24:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Fri 27-03-09 23:38:06, Aneesh Kumar K.V wrote:
> On Thu, Mar 26, 2009 at 07:21:51PM +0100, Jan Kara wrote:
> > Sometimes block_write_begin() can map buffers in a page but later we fail to
> > copy data into those buffers (because the source page has been paged out in the
> > mean time). We then end up with !uptodate mapped buffers. To add a bit more to
> > the confusion, block_write_end() does not commit any data (and thus does not
> > any mark buffers as uptodate) if we didn't succeed with copying all the data.
> >
> > Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> > missed these cases and thus we were inserting non-uptodate buffers to
> > transaction's list which confuses JBD code and it reports IO errors, aborts
> > a transaction and generally makes users afraid about their data ;-P.
> >
> > This patch fixes the problem by reorganizing ext3_..._write_end() code to
> > first call block_write_end() to mark buffers with valid data uptodate and
> > after that we file only uptodate buffers to transaction's lists. Also
> > fix a problem where we could leave blocks allocated beyond i_size (i_disksize
> > in fact).
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext3/inode.c | 99 +++++++++++++++++++++++-------------------------------
> > 1 files changed, 42 insertions(+), 57 deletions(-)
> >
> > As a side note, ext4 / JBD2 only needs the "proper i_disksize update"
> > part of the fix (we got rid of special handling of ordered mode buffers
> > there). But I have to first figure out how to properly do it...
> > Andrew, would you please merge the patch? Thanks.
> >
> > Honza
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index 5fa453b..e230f7a 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1211,6 +1211,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> > return err;
> > }
> >
> > +/* For ordered writepage and write_end functions */
> > +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > +{
> > + /*
> > + * Write could have mapped the buffer but it didn't copy the data in
> > + * yet. So avoid filing such buffer into a transaction.
> > + */
> > + if (buffer_mapped(bh) && buffer_uptodate(bh))
> > + return ext3_journal_dirty_data(handle, bh);
> > + return 0;
> > +}
> > +
> > /* For write_end() in data=journal mode */
> > static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > {
> > @@ -1221,26 +1233,29 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > }
> >
> > /*
> > - * Generic write_end handler for ordered and writeback ext3 journal modes.
> > - * We can't use generic_write_end, because that unlocks the page and we need to
> > - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> > - * after block_write_end.
> > + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> > + * for the whole page but later we failed to copy the data in. So the disk
> > + * size we really have allocated is pos + len (block_write_end() has zeroed
> > + * the freshly allocated buffers so we aren't going to write garbage). But we
> > + * want to keep i_size at the place where data copying finished so that we
> > + * don't confuse readers. The worst what can happen is that we expose a page
> > + * of zeros at the end of file after a crash...
> > */
> > -static int ext3_generic_write_end(struct file *file,
> > - struct address_space *mapping,
> > - loff_t pos, unsigned len, unsigned copied,
> > - struct page *page, void *fsdata)
> > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > + unsigned copied)
> > {
> > - struct inode *inode = file->f_mapping->host;
> > -
> > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > + int mark_dirty = 0;
> >
> > - if (pos+copied > inode->i_size) {
> > - i_size_write(inode, pos+copied);
> > - mark_inode_dirty(inode);
> > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > + mark_dirty = 1;
> > + EXT3_I(inode)->i_disksize = pos + len;
> > }
>
> Won't this result in file having wrong contents if we failed to copy
> the contents from user space? Now if we successfully allocated
> blocks and we failed to copy the contents from user space, the above
> would result in update of i_disksize and a mark_inode_dirty. Doesn't
> that imply we have wrong contents in those block for which we failed to
> copy the contents from user space ?
block_write_end() zeros all new buffers. So yes, if we crash after
this transaction commits but before we manage to redo the write, then user
could see zeros at the end of file (previously inode could have blocks
allocated beyond EOF).
I was also thinking about truncating the newly created buffers but it's a
bit tricky. We need to do it in the same transaction (otherwise the race
would be still there) but standard truncate path would like to add inode
to the orphan list, lock pages etc and we have no credits for that and also
lock ordering might be troublesome. So I've chosen the simple path.

> > -
> > - return copied;
> > + if (pos + copied > inode->i_size) {
> > + i_size_write(inode, pos + copied);
> > + mark_dirty = 1;
> > + }
> > + if (mark_dirty)
> > + mark_inode_dirty(inode);
> > }
Thanks for reading the patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-30 08:25:50

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Fri, Mar 27, 2009 at 09:24:21PM +0100, Jan Kara wrote:
> On Fri 27-03-09 23:38:06, Aneesh Kumar K.V wrote:
> > On Thu, Mar 26, 2009 at 07:21:51PM +0100, Jan Kara wrote:
> > > Sometimes block_write_begin() can map buffers in a page but later we fail to
> > > copy data into those buffers (because the source page has been paged out in the
> > > mean time). We then end up with !uptodate mapped buffers. To add a bit more to
> > > the confusion, block_write_end() does not commit any data (and thus does not
> > > any mark buffers as uptodate) if we didn't succeed with copying all the data.
> > >
> > > Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> > > missed these cases and thus we were inserting non-uptodate buffers to
> > > transaction's list which confuses JBD code and it reports IO errors, aborts
> > > a transaction and generally makes users afraid about their data ;-P.
> > >
> > > This patch fixes the problem by reorganizing ext3_..._write_end() code to
> > > first call block_write_end() to mark buffers with valid data uptodate and
> > > after that we file only uptodate buffers to transaction's lists. Also
> > > fix a problem where we could leave blocks allocated beyond i_size (i_disksize
> > > in fact).
> > >
> > > Signed-off-by: Jan Kara <[email protected]>
> > > ---
> > > fs/ext3/inode.c | 99 +++++++++++++++++++++++-------------------------------
> > > 1 files changed, 42 insertions(+), 57 deletions(-)
> > >
> > > As a side note, ext4 / JBD2 only needs the "proper i_disksize update"
> > > part of the fix (we got rid of special handling of ordered mode buffers
> > > there). But I have to first figure out how to properly do it...
> > > Andrew, would you please merge the patch? Thanks.
> > >
> > > Honza
> > >
> > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > index 5fa453b..e230f7a 100644
> > > --- a/fs/ext3/inode.c
> > > +++ b/fs/ext3/inode.c
> > > @@ -1211,6 +1211,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> > > return err;
> > > }
> > >
> > > +/* For ordered writepage and write_end functions */
> > > +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > > +{
> > > + /*
> > > + * Write could have mapped the buffer but it didn't copy the data in
> > > + * yet. So avoid filing such buffer into a transaction.
> > > + */
> > > + if (buffer_mapped(bh) && buffer_uptodate(bh))
> > > + return ext3_journal_dirty_data(handle, bh);
> > > + return 0;
> > > +}
> > > +
> > > /* For write_end() in data=journal mode */
> > > static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > > {
> > > @@ -1221,26 +1233,29 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > > }
> > >
> > > /*
> > > - * Generic write_end handler for ordered and writeback ext3 journal modes.
> > > - * We can't use generic_write_end, because that unlocks the page and we need to
> > > - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> > > - * after block_write_end.
> > > + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> > > + * for the whole page but later we failed to copy the data in. So the disk
> > > + * size we really have allocated is pos + len (block_write_end() has zeroed
> > > + * the freshly allocated buffers so we aren't going to write garbage). But we
> > > + * want to keep i_size at the place where data copying finished so that we
> > > + * don't confuse readers. The worst what can happen is that we expose a page
> > > + * of zeros at the end of file after a crash...
> > > */
> > > -static int ext3_generic_write_end(struct file *file,
> > > - struct address_space *mapping,
> > > - loff_t pos, unsigned len, unsigned copied,
> > > - struct page *page, void *fsdata)
> > > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > > + unsigned copied)
> > > {
> > > - struct inode *inode = file->f_mapping->host;
> > > -
> > > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > + int mark_dirty = 0;
> > >
> > > - if (pos+copied > inode->i_size) {
> > > - i_size_write(inode, pos+copied);
> > > - mark_inode_dirty(inode);
> > > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > > + mark_dirty = 1;
> > > + EXT3_I(inode)->i_disksize = pos + len;
> > > }
> >
> > Won't this result in file having wrong contents if we failed to copy
> > the contents from user space? Now if we successfully allocated
> > blocks and we failed to copy the contents from user space, the above
> > would result in update of i_disksize and a mark_inode_dirty. Doesn't
> > that imply we have wrong contents in those block for which we failed to
> > copy the contents from user space ?
> block_write_end() zeros all new buffers. So yes, if we crash after
> this transaction commits but before we manage to redo the write, then user
> could see zeros at the end of file (previously inode could have blocks
> allocated beyond EOF).
> I was also thinking about truncating the newly created buffers but it's a
> bit tricky. We need to do it in the same transaction (otherwise the race
> would be still there) but standard truncate path would like to add inode
> to the orphan list, lock pages etc and we have no credits for that and also
> lock ordering might be troublesome. So I've chosen the simple path.
>

We do a vmtruncate if we failed to allocate blocks in
ext3_write_begin. That is done after the closing the current
transaction. If we crash in between (ie, after committing the
transaction allocating blocks and before committing the transaction that
is doing truncate) we would only have some data blocks leaking. But
that would be better than user seeing zero's in the file ?. Also if we
happen to add the inode to the orphan list and crash, the recovery would
truncate it properly. So by doing a vmtruncate I guess the window would be
small and we are already doing that in ext3_write_begin.

-aneesh

2009-03-30 10:32:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Mon 30-03-09 13:55:32, Aneesh Kumar K.V wrote:
> On Fri, Mar 27, 2009 at 09:24:21PM +0100, Jan Kara wrote:
> > On Fri 27-03-09 23:38:06, Aneesh Kumar K.V wrote:
> > > On Thu, Mar 26, 2009 at 07:21:51PM +0100, Jan Kara wrote:
> > > > Sometimes block_write_begin() can map buffers in a page but later we fail to
> > > > copy data into those buffers (because the source page has been paged out in the
> > > > mean time). We then end up with !uptodate mapped buffers. To add a bit more to
> > > > the confusion, block_write_end() does not commit any data (and thus does not
> > > > any mark buffers as uptodate) if we didn't succeed with copying all the data.
> > > >
> > > > Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> > > > missed these cases and thus we were inserting non-uptodate buffers to
> > > > transaction's list which confuses JBD code and it reports IO errors, aborts
> > > > a transaction and generally makes users afraid about their data ;-P.
> > > >
> > > > This patch fixes the problem by reorganizing ext3_..._write_end() code to
> > > > first call block_write_end() to mark buffers with valid data uptodate and
> > > > after that we file only uptodate buffers to transaction's lists. Also
> > > > fix a problem where we could leave blocks allocated beyond i_size (i_disksize
> > > > in fact).
> > > >
> > > > Signed-off-by: Jan Kara <[email protected]>
> > > > ---
> > > > fs/ext3/inode.c | 99 +++++++++++++++++++++++-------------------------------
> > > > 1 files changed, 42 insertions(+), 57 deletions(-)
> > > >
> > > > As a side note, ext4 / JBD2 only needs the "proper i_disksize update"
> > > > part of the fix (we got rid of special handling of ordered mode buffers
> > > > there). But I have to first figure out how to properly do it...
> > > > Andrew, would you please merge the patch? Thanks.
> > > >
> > > > Honza
> > > >
> > > > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > > > index 5fa453b..e230f7a 100644
> > > > --- a/fs/ext3/inode.c
> > > > +++ b/fs/ext3/inode.c
> > > > @@ -1211,6 +1211,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> > > > return err;
> > > > }
> > > >
> > > > +/* For ordered writepage and write_end functions */
> > > > +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > > > +{
> > > > + /*
> > > > + * Write could have mapped the buffer but it didn't copy the data in
> > > > + * yet. So avoid filing such buffer into a transaction.
> > > > + */
> > > > + if (buffer_mapped(bh) && buffer_uptodate(bh))
> > > > + return ext3_journal_dirty_data(handle, bh);
> > > > + return 0;
> > > > +}
> > > > +
> > > > /* For write_end() in data=journal mode */
> > > > static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > > > {
> > > > @@ -1221,26 +1233,29 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> > > > }
> > > >
> > > > /*
> > > > - * Generic write_end handler for ordered and writeback ext3 journal modes.
> > > > - * We can't use generic_write_end, because that unlocks the page and we need to
> > > > - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> > > > - * after block_write_end.
> > > > + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> > > > + * for the whole page but later we failed to copy the data in. So the disk
> > > > + * size we really have allocated is pos + len (block_write_end() has zeroed
> > > > + * the freshly allocated buffers so we aren't going to write garbage). But we
> > > > + * want to keep i_size at the place where data copying finished so that we
> > > > + * don't confuse readers. The worst what can happen is that we expose a page
> > > > + * of zeros at the end of file after a crash...
> > > > */
> > > > -static int ext3_generic_write_end(struct file *file,
> > > > - struct address_space *mapping,
> > > > - loff_t pos, unsigned len, unsigned copied,
> > > > - struct page *page, void *fsdata)
> > > > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > > > + unsigned copied)
> > > > {
> > > > - struct inode *inode = file->f_mapping->host;
> > > > -
> > > > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > + int mark_dirty = 0;
> > > >
> > > > - if (pos+copied > inode->i_size) {
> > > > - i_size_write(inode, pos+copied);
> > > > - mark_inode_dirty(inode);
> > > > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > > > + mark_dirty = 1;
> > > > + EXT3_I(inode)->i_disksize = pos + len;
> > > > }
> > >
> > > Won't this result in file having wrong contents if we failed to copy
> > > the contents from user space? Now if we successfully allocated
> > > blocks and we failed to copy the contents from user space, the above
> > > would result in update of i_disksize and a mark_inode_dirty. Doesn't
> > > that imply we have wrong contents in those block for which we failed to
> > > copy the contents from user space ?
> > block_write_end() zeros all new buffers. So yes, if we crash after
> > this transaction commits but before we manage to redo the write, then user
> > could see zeros at the end of file (previously inode could have blocks
> > allocated beyond EOF).
> > I was also thinking about truncating the newly created buffers but it's a
> > bit tricky. We need to do it in the same transaction (otherwise the race
> > would be still there) but standard truncate path would like to add inode
> > to the orphan list, lock pages etc and we have no credits for that and also
> > lock ordering might be troublesome. So I've chosen the simple path.
> >
> We do a vmtruncate if we failed to allocate blocks in
> ext3_write_begin. That is done after the closing the current
> transaction. If we crash in between (ie, after committing the
> transaction allocating blocks and before committing the transaction that
> is doing truncate) we would only have some data blocks leaking. But
> that would be better than user seeing zero's in the file ?. Also if we
> happen to add the inode to the orphan list and crash, the recovery would
> truncate it properly. So by doing a vmtruncate I guess the window would be
> small and we are already doing that in ext3_write_begin.
Hmm, are you sure some assertion would not fire if we find allocated
blocks beyond i_size (which could happen with the old code)? Frankly, I
prefer user seeing zeros at the end of file (so that he can come and yell
at me ;) rather than silently leaking blocks, getting inode into an
unexpected state and then debug some mysterious problem. But hopefully this
problem has a solution which can make both of us happy ;): We can reserve
enough credits (actually just one block more) and when we see we need to
do truncate because of failed write, we first add inode to the orphan list
before stopping the current handle (so that if we crash it gets properly
truncated) and then truncate the blocks in a separate transaction. Does it
sound good to you?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-30 10:58:30

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Mon, Mar 30, 2009 at 12:32:48PM +0200, Jan Kara wrote:
> > > > > - struct address_space *mapping,
> > > > > - loff_t pos, unsigned len, unsigned copied,
> > > > > - struct page *page, void *fsdata)
> > > > > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > > > > + unsigned copied)
> > > > > {
> > > > > - struct inode *inode = file->f_mapping->host;
> > > > > -
> > > > > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > + int mark_dirty = 0;
> > > > >
> > > > > - if (pos+copied > inode->i_size) {
> > > > > - i_size_write(inode, pos+copied);
> > > > > - mark_inode_dirty(inode);
> > > > > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > > > > + mark_dirty = 1;
> > > > > + EXT3_I(inode)->i_disksize = pos + len;
> > > > > }
> > > >
> > > > Won't this result in file having wrong contents if we failed to copy
> > > > the contents from user space? Now if we successfully allocated
> > > > blocks and we failed to copy the contents from user space, the above
> > > > would result in update of i_disksize and a mark_inode_dirty. Doesn't
> > > > that imply we have wrong contents in those block for which we failed to
> > > > copy the contents from user space ?
> > > block_write_end() zeros all new buffers. So yes, if we crash after
> > > this transaction commits but before we manage to redo the write, then user
> > > could see zeros at the end of file (previously inode could have blocks
> > > allocated beyond EOF).
> > > I was also thinking about truncating the newly created buffers but it's a
> > > bit tricky. We need to do it in the same transaction (otherwise the race
> > > would be still there) but standard truncate path would like to add inode
> > > to the orphan list, lock pages etc and we have no credits for that and also
> > > lock ordering might be troublesome. So I've chosen the simple path.
> > >
> > We do a vmtruncate if we failed to allocate blocks in
> > ext3_write_begin. That is done after the closing the current
> > transaction. If we crash in between (ie, after committing the
> > transaction allocating blocks and before committing the transaction that
> > is doing truncate) we would only have some data blocks leaking. But
> > that would be better than user seeing zero's in the file ?. Also if we
> > happen to add the inode to the orphan list and crash, the recovery would
> > truncate it properly. So by doing a vmtruncate I guess the window would be
> > small and we are already doing that in ext3_write_begin.
> Hmm, are you sure some assertion would not fire if we find allocated
> blocks beyond i_size (which could happen with the old code)? Frankly, I
> prefer user seeing zeros at the end of file (so that he can come and yell
> at me ;) rather than silently leaking blocks, getting inode into an
> unexpected state and then debug some mysterious problem. But hopefully this
> problem has a solution which can make both of us happy ;): We can reserve
> enough credits (actually just one block more) and when we see we need to
> do truncate because of failed write, we first add inode to the orphan list
> before stopping the current handle (so that if we crash it gets properly
> truncated) and then truncate the blocks in a separate transaction. Does it
> sound good to you?

Yes. We also should unlock the page before the truncate

-aneesh

2009-03-30 13:54:30

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

Aneesh Kumar K.V wrote:

> We do a vmtruncate if we failed to allocate blocks in
> ext3_write_begin. That is done after the closing the current
> transaction. If we crash in between (ie, after committing the
> transaction allocating blocks and before committing the transaction that
> is doing truncate) we would only have some data blocks leaking. But
> that would be better than user seeing zero's in the file ?. Also if we
> happen to add the inode to the orphan list and crash, the recovery would
> truncate it properly. So by doing a vmtruncate I guess the window would be
> small and we are already doing that in ext3_write_begin.

I don't agree that leaking data blocks is better than exposing zeros...
the former is a security flaw, the latter a (significant) annoyance.

-Eric

2009-03-30 14:45:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Mon, Mar 30, 2009 at 08:53:42AM -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
>
> > We do a vmtruncate if we failed to allocate blocks in
> > ext3_write_begin. That is done after the closing the current
> > transaction. If we crash in between (ie, after committing the
> > transaction allocating blocks and before committing the transaction that
> > is doing truncate) we would only have some data blocks leaking. But
> > that would be better than user seeing zero's in the file ?. Also if we
> > happen to add the inode to the orphan list and crash, the recovery would
> > truncate it properly. So by doing a vmtruncate I guess the window would be
> > small and we are already doing that in ext3_write_begin.
>
> I don't agree that leaking data blocks is better than exposing zeros...
> the former is a security flaw, the latter a (significant) annoyance.
>

Even when we fail to track few data blocks we do zero them using
page_zero_new_buffers. So it should not imply a security flaw. I guess
if we crash failing to commit the truncate fsck will look at the bitmap
and find the blocks which are not tracked by any inode and will mark them
free.

-aneesh

2009-03-30 15:14:00

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

Aneesh Kumar K.V wrote:
> On Mon, Mar 30, 2009 at 08:53:42AM -0500, Eric Sandeen wrote:
>> Aneesh Kumar K.V wrote:
>>
>>> We do a vmtruncate if we failed to allocate blocks in
>>> ext3_write_begin. That is done after the closing the current
>>> transaction. If we crash in between (ie, after committing the
>>> transaction allocating blocks and before committing the transaction that
>>> is doing truncate) we would only have some data blocks leaking. But
>>> that would be better than user seeing zero's in the file ?. Also if we
>>> happen to add the inode to the orphan list and crash, the recovery would
>>> truncate it properly. So by doing a vmtruncate I guess the window would be
>>> small and we are already doing that in ext3_write_begin.
>> I don't agree that leaking data blocks is better than exposing zeros...
>> the former is a security flaw, the latter a (significant) annoyance.
>>
>
> Even when we fail to track few data blocks we do zero them using
> page_zero_new_buffers. So it should not imply a security flaw. I guess
> if we crash failing to commit the truncate fsck will look at the bitmap
> and find the blocks which are not tracked by any inode and will mark them
> free.

Oh, perhaps I misunderstood. I thought you meant leaking data in
uninitialized blocks, but you meant losing track of those blocks'
allocation, I guess. Sorry for the confusion...

-Eric

2009-03-30 16:05:23

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Mon 30-03-09 16:28:21, Aneesh Kumar K.V wrote:
> On Mon, Mar 30, 2009 at 12:32:48PM +0200, Jan Kara wrote:
> > > > > > - struct address_space *mapping,
> > > > > > - loff_t pos, unsigned len, unsigned copied,
> > > > > > - struct page *page, void *fsdata)
> > > > > > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > > > > > + unsigned copied)
> > > > > > {
> > > > > > - struct inode *inode = file->f_mapping->host;
> > > > > > -
> > > > > > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > > + int mark_dirty = 0;
> > > > > >
> > > > > > - if (pos+copied > inode->i_size) {
> > > > > > - i_size_write(inode, pos+copied);
> > > > > > - mark_inode_dirty(inode);
> > > > > > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > > > > > + mark_dirty = 1;
> > > > > > + EXT3_I(inode)->i_disksize = pos + len;
> > > > > > }
> > > > >
> > > > > Won't this result in file having wrong contents if we failed to copy
> > > > > the contents from user space? Now if we successfully allocated
> > > > > blocks and we failed to copy the contents from user space, the above
> > > > > would result in update of i_disksize and a mark_inode_dirty. Doesn't
> > > > > that imply we have wrong contents in those block for which we failed to
> > > > > copy the contents from user space ?
> > > > block_write_end() zeros all new buffers. So yes, if we crash after
> > > > this transaction commits but before we manage to redo the write, then user
> > > > could see zeros at the end of file (previously inode could have blocks
> > > > allocated beyond EOF).
> > > > I was also thinking about truncating the newly created buffers but it's a
> > > > bit tricky. We need to do it in the same transaction (otherwise the race
> > > > would be still there) but standard truncate path would like to add inode
> > > > to the orphan list, lock pages etc and we have no credits for that and also
> > > > lock ordering might be troublesome. So I've chosen the simple path.
> > > >
> > > We do a vmtruncate if we failed to allocate blocks in
> > > ext3_write_begin. That is done after the closing the current
> > > transaction. If we crash in between (ie, after committing the
> > > transaction allocating blocks and before committing the transaction that
> > > is doing truncate) we would only have some data blocks leaking. But
> > > that would be better than user seeing zero's in the file ?. Also if we
> > > happen to add the inode to the orphan list and crash, the recovery would
> > > truncate it properly. So by doing a vmtruncate I guess the window would be
> > > small and we are already doing that in ext3_write_begin.
> > Hmm, are you sure some assertion would not fire if we find allocated
> > blocks beyond i_size (which could happen with the old code)? Frankly, I
> > prefer user seeing zeros at the end of file (so that he can come and yell
> > at me ;) rather than silently leaking blocks, getting inode into an
> > unexpected state and then debug some mysterious problem. But hopefully this
> > problem has a solution which can make both of us happy ;): We can reserve
> > enough credits (actually just one block more) and when we see we need to
> > do truncate because of failed write, we first add inode to the orphan list
> > before stopping the current handle (so that if we crash it gets properly
> > truncated) and then truncate the blocks in a separate transaction. Does it
> > sound good to you?
>
> Yes. We also should unlock the page before the truncate
OK, below is improved patch that adds inode to orphan list before
stopping the current handle.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--

>From 8f02ffb17a23c52ec980800fdccf0fa11d96f2a7 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Wed, 25 Mar 2009 18:51:52 +0100
Subject: [PATCH] ext3: Avoid false EIO errors

Sometimes block_write_begin() can map buffers in a page but later we fail to
copy data into those buffers (because the source page has been paged out in the
mean time). We then end up with !uptodate mapped buffers. To add a bit more to
the confusion, block_write_end() does not commit any data (and thus does not
any mark buffers as uptodate) if we didn't succeed with copying all the data.

Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
missed these cases and thus we were inserting non-uptodate buffers to
transaction's list which confuses JBD code and it reports IO errors, aborts
a transaction and generally makes users afraid about their data ;-P.

This patch fixes the problem by reorganizing ext3_..._write_end() code to
first call block_write_end() to mark buffers with valid data uptodate and
after that we file only uptodate buffers to transaction's lists.

We also fix a problem where we could leave blocks allocated beyond i_size
(i_disksize in fact) because of failed write. We now add inode to orphan
list when write fails (to be safe in case we crash) and then truncate blocks
beyond i_size in a separate transaction.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 123 +++++++++++++++++++++++++++----------------------------
1 files changed, 61 insertions(+), 62 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 4a09ff1..40eb569 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1149,12 +1149,15 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
+ int ret;
handle_t *handle;
int retries = 0;
struct page *page;
pgoff_t index;
unsigned from, to;
+ /* Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason */
+ int needed_blocks = ext3_writepage_trans_blocks(inode) + 1;

index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1184,15 +1187,20 @@ retry:
}
write_begin_failed:
if (ret) {
- ext3_journal_stop(handle);
- unlock_page(page);
- page_cache_release(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
+ *
+ * Add inode to orphan list in case we crash before truncate
+ * finishes.
*/
if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ext3_journal_stop(handle);
+ unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
}
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
@@ -1211,6 +1219,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
return err;
}

+/* For ordered writepage and write_end functions */
+static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
+{
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (buffer_mapped(bh) && buffer_uptodate(bh))
+ return ext3_journal_dirty_data(handle, bh);
+ return 0;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1221,26 +1241,20 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
}

/*
- * Generic write_end handler for ordered and writeback ext3 journal modes.
- * We can't use generic_write_end, because that unlocks the page and we need to
- * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
- * after block_write_end.
+ * This is nasty and subtle: ext3_write_begin() could have allocated blocks
+ * for the whole page but later we failed to copy the data in. Update inode
+ * size according to what we managed to copy. The rest is going to be
+ * truncated in write_end function.
*/
-static int ext3_generic_write_end(struct file *file,
- struct address_space *mapping,
- loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata)
+static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
{
- struct inode *inode = file->f_mapping->host;
-
- copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
- if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
+ /* What matters to us is i_disksize. We don't write i_size anywhere */
+ if (pos + copied > inode->i_size)
+ i_size_write(inode, pos + copied);
+ if (pos + copied > EXT3_I(inode)->i_disksize) {
+ EXT3_I(inode)->i_disksize = pos + copied;
mark_inode_dirty(inode);
}
-
- return copied;
}

/*
@@ -1260,35 +1274,29 @@ static int ext3_ordered_write_end(struct file *file,
unsigned from, to;
int ret = 0, ret2;

- from = pos & (PAGE_CACHE_SIZE - 1);
- to = from + len;
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + copied;
ret = walk_page_buffers(handle, page_buffers(page),
- from, to, NULL, ext3_journal_dirty_data);
+ from, to, NULL, journal_dirty_data_fn);

- if (ret == 0) {
- /*
- * generic_write_end() will run mark_inode_dirty() if i_size
- * changes. So let's piggyback the i_disksize mark_inode_dirty
- * into that.
- */
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
- }
+ if (ret == 0)
+ update_file_sizes(inode, pos, copied);
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
unlock_page(page);
page_cache_release(page);

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
return ret ? ret : copied;
}

@@ -1299,25 +1307,22 @@ static int ext3_writeback_write_end(struct file *file,
{
handle_t *handle = ext3_journal_current_handle();
struct inode *inode = file->f_mapping->host;
- int ret = 0, ret2;
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
-
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
+ int ret;

- ret2 = ext3_journal_stop(handle);
- if (!ret)
- ret = ret2;
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ update_file_sizes(inode, pos, copied);
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ret = ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
return ret ? ret : copied;
}

@@ -1428,17 +1433,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

-static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
-{
- if (buffer_mapped(bh))
- return ext3_journal_dirty_data(handle, bh);
- return 0;
-}

2009-03-31 04:46:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors

On Mon, Mar 30, 2009 at 06:05:17PM +0200, Jan Kara wrote:
> On Mon 30-03-09 16:28:21, Aneesh Kumar K.V wrote:
> > On Mon, Mar 30, 2009 at 12:32:48PM +0200, Jan Kara wrote:
> > > > > > > - struct address_space *mapping,
> > > > > > > - loff_t pos, unsigned len, unsigned copied,
> > > > > > > - struct page *page, void *fsdata)
> > > > > > > +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned len,
> > > > > > > + unsigned copied)
> > > > > > > {
> > > > > > > - struct inode *inode = file->f_mapping->host;
> > > > > > > -
> > > > > > > - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> > > > > > > + int mark_dirty = 0;
> > > > > > >
> > > > > > > - if (pos+copied > inode->i_size) {
> > > > > > > - i_size_write(inode, pos+copied);
> > > > > > > - mark_inode_dirty(inode);
> > > > > > > + if (pos + len > EXT3_I(inode)->i_disksize) {
> > > > > > > + mark_dirty = 1;
> > > > > > > + EXT3_I(inode)->i_disksize = pos + len;
> > > > > > > }
> > > > > >
> > > > > > Won't this result in file having wrong contents if we failed to copy
> > > > > > the contents from user space? Now if we successfully allocated
> > > > > > blocks and we failed to copy the contents from user space, the above
> > > > > > would result in update of i_disksize and a mark_inode_dirty. Doesn't
> > > > > > that imply we have wrong contents in those block for which we failed to
> > > > > > copy the contents from user space ?
> > > > > block_write_end() zeros all new buffers. So yes, if we crash after
> > > > > this transaction commits but before we manage to redo the write, then user
> > > > > could see zeros at the end of file (previously inode could have blocks
> > > > > allocated beyond EOF).
> > > > > I was also thinking about truncating the newly created buffers but it's a
> > > > > bit tricky. We need to do it in the same transaction (otherwise the race
> > > > > would be still there) but standard truncate path would like to add inode
> > > > > to the orphan list, lock pages etc and we have no credits for that and also
> > > > > lock ordering might be troublesome. So I've chosen the simple path.
> > > > >
> > > > We do a vmtruncate if we failed to allocate blocks in
> > > > ext3_write_begin. That is done after the closing the current
> > > > transaction. If we crash in between (ie, after committing the
> > > > transaction allocating blocks and before committing the transaction that
> > > > is doing truncate) we would only have some data blocks leaking. But
> > > > that would be better than user seeing zero's in the file ?. Also if we
> > > > happen to add the inode to the orphan list and crash, the recovery would
> > > > truncate it properly. So by doing a vmtruncate I guess the window would be
> > > > small and we are already doing that in ext3_write_begin.
> > > Hmm, are you sure some assertion would not fire if we find allocated
> > > blocks beyond i_size (which could happen with the old code)? Frankly, I
> > > prefer user seeing zeros at the end of file (so that he can come and yell
> > > at me ;) rather than silently leaking blocks, getting inode into an
> > > unexpected state and then debug some mysterious problem. But hopefully this
> > > problem has a solution which can make both of us happy ;): We can reserve
> > > enough credits (actually just one block more) and when we see we need to
> > > do truncate because of failed write, we first add inode to the orphan list
> > > before stopping the current handle (so that if we crash it gets properly
> > > truncated) and then truncate the blocks in a separate transaction. Does it
> > > sound good to you?
> >
> > Yes. We also should unlock the page before the truncate
> OK, below is improved patch that adds inode to orphan list before
> stopping the current handle.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
>
> From 8f02ffb17a23c52ec980800fdccf0fa11d96f2a7 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Wed, 25 Mar 2009 18:51:52 +0100
> Subject: [PATCH] ext3: Avoid false EIO errors
>
> Sometimes block_write_begin() can map buffers in a page but later we fail to
> copy data into those buffers (because the source page has been paged out in the
> mean time). We then end up with !uptodate mapped buffers. To add a bit more to
> the confusion, block_write_end() does not commit any data (and thus does not
> any mark buffers as uptodate) if we didn't succeed with copying all the data.
>
> Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> missed these cases and thus we were inserting non-uptodate buffers to
> transaction's list which confuses JBD code and it reports IO errors, aborts
> a transaction and generally makes users afraid about their data ;-P.
>
> This patch fixes the problem by reorganizing ext3_..._write_end() code to
> first call block_write_end() to mark buffers with valid data uptodate and
> after that we file only uptodate buffers to transaction's lists.
>
> We also fix a problem where we could leave blocks allocated beyond i_size
> (i_disksize in fact) because of failed write. We now add inode to orphan
> list when write fails (to be safe in case we crash) and then truncate blocks
> beyond i_size in a separate transaction.
>
> Signed-off-by: Jan Kara <[email protected]>

ext4 would need the orphan_add and truncate changes.

Reviewed-by: Aneesh Kumar K.V <[email protected]>

> ---
> fs/ext3/inode.c | 123 +++++++++++++++++++++++++++----------------------------
> 1 files changed, 61 insertions(+), 62 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 4a09ff1..40eb569 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1149,12 +1149,15 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
> + int ret;
> handle_t *handle;
> int retries = 0;
> struct page *page;
> pgoff_t index;
> unsigned from, to;
> + /* Reserve one block more for addition to orphan list in case
> + * we allocate blocks but write fails for some reason */
> + int needed_blocks = ext3_writepage_trans_blocks(inode) + 1;
>
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> @@ -1184,15 +1187,20 @@ retry:
> }
> write_begin_failed:
> if (ret) {
> - ext3_journal_stop(handle);
> - unlock_page(page);
> - page_cache_release(page);
> /*
> * block_write_begin may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Add inode to orphan list in case we crash before truncate
> + * finishes.
> */
> if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ext3_journal_stop(handle);
> + unlock_page(page);
> + page_cache_release(page);
> + if (pos + len > inode->i_size)
> vmtruncate(inode, inode->i_size);
> }
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> @@ -1211,6 +1219,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> return err;
> }
>
> +/* For ordered writepage and write_end functions */
> +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> +{
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (buffer_mapped(bh) && buffer_uptodate(bh))
> + return ext3_journal_dirty_data(handle, bh);
> + return 0;
> +}
> +
> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1221,26 +1241,20 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> }
>
> /*
> - * Generic write_end handler for ordered and writeback ext3 journal modes.
> - * We can't use generic_write_end, because that unlocks the page and we need to
> - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> - * after block_write_end.
> + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> + * for the whole page but later we failed to copy the data in. Update inode
> + * size according to what we managed to copy. The rest is going to be
> + * truncated in write_end function.
> */
> -static int ext3_generic_write_end(struct file *file,
> - struct address_space *mapping,
> - loff_t pos, unsigned len, unsigned copied,
> - struct page *page, void *fsdata)
> +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
> {
> - struct inode *inode = file->f_mapping->host;
> -
> - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -
> - if (pos+copied > inode->i_size) {
> - i_size_write(inode, pos+copied);
> + /* What matters to us is i_disksize. We don't write i_size anywhere */
> + if (pos + copied > inode->i_size)
> + i_size_write(inode, pos + copied);
> + if (pos + copied > EXT3_I(inode)->i_disksize) {
> + EXT3_I(inode)->i_disksize = pos + copied;
> mark_inode_dirty(inode);
> }
> -
> - return copied;
> }
>
> /*
> @@ -1260,35 +1274,29 @@ static int ext3_ordered_write_end(struct file *file,
> unsigned from, to;
> int ret = 0, ret2;
>
> - from = pos & (PAGE_CACHE_SIZE - 1);
> - to = from + len;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
>
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> ret = walk_page_buffers(handle, page_buffers(page),
> - from, to, NULL, ext3_journal_dirty_data);
> + from, to, NULL, journal_dirty_data_fn);
>
> - if (ret == 0) {
> - /*
> - * generic_write_end() will run mark_inode_dirty() if i_size
> - * changes. So let's piggyback the i_disksize mark_inode_dirty
> - * into that.
> - */
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> - }
> + if (ret == 0)
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> ret2 = ext3_journal_stop(handle);
> if (!ret)
> ret = ret2;
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1299,25 +1307,22 @@ static int ext3_writeback_write_end(struct file *file,
> {
> handle_t *handle = ext3_journal_current_handle();
> struct inode *inode = file->f_mapping->host;
> - int ret = 0, ret2;
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> -
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> + int ret;
>
> - ret2 = ext3_journal_stop(handle);
> - if (!ret)
> - ret = ret2;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ret = ext3_journal_stop(handle);
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1428,17 +1433,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> -static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> -{
> - if (buffer_mapped(bh))
> - return ext3_journal_dirty_data(handle, bh);
> - return 0;
> -}
> -
> static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
> {
> return !buffer_mapped(bh);
> }
> +
> /*
> * Note that we always start a transaction even if we're not journalling
> * data. This is to preserve ordering: any hole instantiation within
> --
> 1.6.0.2
>

2009-03-31 09:29:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

We should add inode to the orphan list in the same transaction
as block allocation. This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2231a65..074185f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1424,7 +1424,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks;
handle_t *handle;
int retries = 0;
struct page *page;
@@ -1435,6 +1435,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
"dev %s ino %lu pos %llu len %u flags %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, flags);
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason
+ */
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1468,14 +1473,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,

if (ret) {
unlock_page(page);
- ext4_journal_stop(handle);
page_cache_release(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
+ *
+ * Add inode to orphan list in case we crash before
+ * truncate finishes
*/
if (pos + len > inode->i_size)
+ ext4_orphan_add(handle, inode);
+
+ ext4_journal_stop(handle);
+ if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
}

--
1.6.2.1.404.gb0085.dirty


2009-03-31 09:29:52

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH 2/2] [PATCH] ext4: truncate the file properly if we fail to copy data from userspace.

In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size. We should truncate the file in the above case
so that we don't have blocks allocated outside inode->i_size. Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the truncate.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 103 +++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 77 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 074185f..3997999 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1505,6 +1505,52 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
return ext4_handle_dirty_metadata(handle, NULL, bh);
}

+static int ext4_generic_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ int i_size_changed = 0;
+ struct inode *inode = mapping->host;
+ handle_t *handle = ext4_journal_current_handle();
+
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ *
+ * But it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ if (pos + copied > inode->i_size) {
+ i_size_write(inode, pos + copied);
+ i_size_changed = 1;
+ }
+
+ if (pos + copied > EXT4_I(inode)->i_disksize) {
+ /* We need to mark inode dirty even if
+ * new_i_size is less that inode->i_size
+ * bu greater than i_disksize.(hint delalloc)
+ */
+ ext4_update_i_disksize(inode, (pos + copied));
+ i_size_changed = 1;
+ }
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (i_size_changed)
+ ext4_mark_inode_dirty(handle, inode);
+
+ return copied;
+}
+
/*
* We need to pick up the new inode size which generic_commit_write gave us
* `file' can be NULL - eg, when called from page_symlink().
@@ -1528,21 +1574,15 @@ static int ext4_ordered_write_end(struct file *file,
ret = ext4_jbd2_file_inode(handle, inode);

if (ret == 0) {
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
if (ret2 < 0)
ret = ret2;
}
@@ -1550,6 +1590,9 @@ static int ext4_ordered_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+
return ret ? ret : copied;
}

@@ -1561,25 +1604,21 @@ static int ext4_writeback_write_end(struct file *file,
handle_t *handle = ext4_journal_current_handle();
struct inode *inode = mapping->host;
int ret = 0, ret2;
- loff_t new_i_size;

trace_mark(ext4_writeback_write_end,
"dev %s ino %lu pos %llu len %u copied %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, copied);
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
+
if (ret2 < 0)
ret = ret2;

@@ -1587,6 +1626,9 @@ static int ext4_writeback_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+
return ret ? ret : copied;
}

@@ -1631,10 +1673,19 @@ static int ext4_journalled_write_end(struct file *file,
}

unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
+
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
- page_cache_release(page);
+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);

return ret ? ret : copied;
}
--
1.6.2.1.404.gb0085.dirty


2009-03-31 09:34:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Tue 31-03-09 14:59:25, Aneesh Kumar K.V wrote:
> We should add inode to the orphan list in the same transaction
> as block allocation. This ensures that if we crash after a failed
> block allocation and before we do a vmtruncate we don't leak block
> (ie block marked as used in bitmap but not claimed by the inode).
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> CC: Jan Kara <[email protected]>
Looks fine.

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 15 +++++++++++++--
> 1 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2231a65..074185f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1424,7 +1424,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> + int ret, needed_blocks;
> handle_t *handle;
> int retries = 0;
> struct page *page;
> @@ -1435,6 +1435,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> "dev %s ino %lu pos %llu len %u flags %u",
> inode->i_sb->s_id, inode->i_ino,
> (unsigned long long) pos, len, flags);
> + /*
> + * Reserve one block more for addition to orphan list in case
> + * we allocate blocks but write fails for some reason
> + */
> + needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
> @@ -1468,14 +1473,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>
> if (ret) {
> unlock_page(page);
> - ext4_journal_stop(handle);
> page_cache_release(page);
> /*
> * block_write_begin may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Add inode to orphan list in case we crash before
> + * truncate finishes
> */
> if (pos + len > inode->i_size)
> + ext4_orphan_add(handle, inode);
> +
> + ext4_journal_stop(handle);
> + if (pos + len > inode->i_size)
> vmtruncate(inode, inode->i_size);
> }
>
> --
> 1.6.2.1.404.gb0085.dirty
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-31 09:38:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] ext4: truncate the file properly if we fail to copy data from userspace.

On Tue 31-03-09 14:59:26, Aneesh Kumar K.V wrote:
> In generic_perform_write if we fail to copy the user data we don't
> update the inode->i_size. We should truncate the file in the above case
> so that we don't have blocks allocated outside inode->i_size. Add
> the inode to orphan list in the same transaction as block allocation
> This ensures that if we crash in between the recovery would do the truncate.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> CC: Jan Kara <[email protected]>
This patch looks fine as well.
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 103 +++++++++++++++++++++++++++++++++++++++++--------------
> 1 files changed, 77 insertions(+), 26 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 074185f..3997999 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1505,6 +1505,52 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> return ext4_handle_dirty_metadata(handle, NULL, bh);
> }
>
> +static int ext4_generic_write_end(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata)
> +{
> + int i_size_changed = 0;
> + struct inode *inode = mapping->host;
> + handle_t *handle = ext4_journal_current_handle();
> +
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> +
> + /*
> + * No need to use i_size_read() here, the i_size
> + * cannot change under us because we hold i_mutex.
> + *
> + * But it's important to update i_size while still holding page lock:
> + * page writeout could otherwise come in and zero beyond i_size.
> + */
> + if (pos + copied > inode->i_size) {
> + i_size_write(inode, pos + copied);
> + i_size_changed = 1;
> + }
> +
> + if (pos + copied > EXT4_I(inode)->i_disksize) {
> + /* We need to mark inode dirty even if
> + * new_i_size is less that inode->i_size
> + * bu greater than i_disksize.(hint delalloc)
> + */
> + ext4_update_i_disksize(inode, (pos + copied));
> + i_size_changed = 1;
> + }
> + unlock_page(page);
> + page_cache_release(page);
> +
> + /*
> + * Don't mark the inode dirty under page lock. First, it unnecessarily
> + * makes the holding time of page lock longer. Second, it forces lock
> + * ordering of page lock and transaction start for journaling
> + * filesystems.
> + */
> + if (i_size_changed)
> + ext4_mark_inode_dirty(handle, inode);
> +
> + return copied;
> +}
> +
> /*
> * We need to pick up the new inode size which generic_commit_write gave us
> * `file' can be NULL - eg, when called from page_symlink().
> @@ -1528,21 +1574,15 @@ static int ext4_ordered_write_end(struct file *file,
> ret = ext4_jbd2_file_inode(handle, inode);
>
> if (ret == 0) {
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT4_I(inode)->i_disksize) {
> - ext4_update_i_disksize(inode, new_i_size);
> - /* We need to mark inode dirty even if
> - * new_i_size is less that inode->i_size
> - * bu greater than i_disksize.(hint delalloc)
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> -
> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> + ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> page, fsdata);
> copied = ret2;
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> if (ret2 < 0)
> ret = ret2;
> }
> @@ -1550,6 +1590,9 @@ static int ext4_ordered_write_end(struct file *file,
> if (!ret)
> ret = ret2;
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> +
> return ret ? ret : copied;
> }
>
> @@ -1561,25 +1604,21 @@ static int ext4_writeback_write_end(struct file *file,
> handle_t *handle = ext4_journal_current_handle();
> struct inode *inode = mapping->host;
> int ret = 0, ret2;
> - loff_t new_i_size;
>
> trace_mark(ext4_writeback_write_end,
> "dev %s ino %lu pos %llu len %u copied %u",
> inode->i_sb->s_id, inode->i_ino,
> (unsigned long long) pos, len, copied);
> - new_i_size = pos + copied;
> - if (new_i_size > EXT4_I(inode)->i_disksize) {
> - ext4_update_i_disksize(inode, new_i_size);
> - /* We need to mark inode dirty even if
> - * new_i_size is less that inode->i_size
> - * bu greater than i_disksize.(hint delalloc)
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> -
> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> + ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> page, fsdata);
> copied = ret2;
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> +
> if (ret2 < 0)
> ret = ret2;
>
> @@ -1587,6 +1626,9 @@ static int ext4_writeback_write_end(struct file *file,
> if (!ret)
> ret = ret2;
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> +
> return ret ? ret : copied;
> }
>
> @@ -1631,10 +1673,19 @@ static int ext4_journalled_write_end(struct file *file,
> }
>
> unlock_page(page);
> + page_cache_release(page);
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> +
> ret2 = ext4_journal_stop(handle);
> if (!ret)
> ret = ret2;
> - page_cache_release(page);
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
>
> return ret ? ret : copied;
> }
> --
> 1.6.2.1.404.gb0085.dirty
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-31 09:46:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors (version 4)

Hello,

as Aneesh pointed to me, I forgot to add truncation to the data=journaled
mode. This patch fixes it. Hopefully final version of the fix ;).

Honza
---
Jan Kara <[email protected]>
SUSE Labs, CR
---

>From ec5c2977f87328fb93fee1aa35043bafeb53cea1 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Wed, 25 Mar 2009 18:51:52 +0100
Subject: [PATCH] ext3: Avoid false EIO errors (version 4)

Sometimes block_write_begin() can map buffers in a page but later we fail to
copy data into those buffers (because the source page has been paged out in the
mean time). We then end up with !uptodate mapped buffers. To add a bit more to
the confusion, block_write_end() does not commit any data (and thus does not
any mark buffers as uptodate) if we didn't succeed with copying all the data.

Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
missed these cases and thus we were inserting non-uptodate buffers to
transaction's list which confuses JBD code and it reports IO errors, aborts
a transaction and generally makes users afraid about their data ;-P.

This patch fixes the problem by reorganizing ext3_..._write_end() code to
first call block_write_end() to mark buffers with valid data uptodate and
after that we file only uptodate buffers to transaction's lists.

We also fix a problem where we could leave blocks allocated beyond i_size
(i_disksize in fact) because of failed write. We now add inode to orphan
list when write fails (to be safe in case we crash) and then truncate blocks
beyond i_size in a separate transaction.

Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/inode.c | 139 +++++++++++++++++++++++++++++--------------------------
1 files changed, 74 insertions(+), 65 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 4a09ff1..d3ef656 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1149,12 +1149,15 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
+ int ret;
handle_t *handle;
int retries = 0;
struct page *page;
pgoff_t index;
unsigned from, to;
+ /* Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason */
+ int needed_blocks = ext3_writepage_trans_blocks(inode) + 1;

index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
@@ -1184,15 +1187,20 @@ retry:
}
write_begin_failed:
if (ret) {
- ext3_journal_stop(handle);
- unlock_page(page);
- page_cache_release(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
+ *
+ * Add inode to orphan list in case we crash before truncate
+ * finishes.
*/
if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ext3_journal_stop(handle);
+ unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
}
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
@@ -1211,6 +1219,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
return err;
}

+/* For ordered writepage and write_end functions */
+static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
+{
+ /*
+ * Write could have mapped the buffer but it didn't copy the data in
+ * yet. So avoid filing such buffer into a transaction.
+ */
+ if (buffer_mapped(bh) && buffer_uptodate(bh))
+ return ext3_journal_dirty_data(handle, bh);
+ return 0;
+}
+
/* For write_end() in data=journal mode */
static int write_end_fn(handle_t *handle, struct buffer_head *bh)
{
@@ -1221,26 +1241,20 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
}

/*
- * Generic write_end handler for ordered and writeback ext3 journal modes.
- * We can't use generic_write_end, because that unlocks the page and we need to
- * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
- * after block_write_end.
+ * This is nasty and subtle: ext3_write_begin() could have allocated blocks
+ * for the whole page but later we failed to copy the data in. Update inode
+ * size according to what we managed to copy. The rest is going to be
+ * truncated in write_end function.
*/
-static int ext3_generic_write_end(struct file *file,
- struct address_space *mapping,
- loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata)
+static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
{
- struct inode *inode = file->f_mapping->host;
-
- copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-
- if (pos+copied > inode->i_size) {
- i_size_write(inode, pos+copied);
+ /* What matters to us is i_disksize. We don't write i_size anywhere */
+ if (pos + copied > inode->i_size)
+ i_size_write(inode, pos + copied);
+ if (pos + copied > EXT3_I(inode)->i_disksize) {
+ EXT3_I(inode)->i_disksize = pos + copied;
mark_inode_dirty(inode);
}
-
- return copied;
}

/*
@@ -1260,35 +1274,29 @@ static int ext3_ordered_write_end(struct file *file,
unsigned from, to;
int ret = 0, ret2;

- from = pos & (PAGE_CACHE_SIZE - 1);
- to = from + len;
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);

+ from = pos & (PAGE_CACHE_SIZE - 1);
+ to = from + copied;
ret = walk_page_buffers(handle, page_buffers(page),
- from, to, NULL, ext3_journal_dirty_data);
+ from, to, NULL, journal_dirty_data_fn);

- if (ret == 0) {
- /*
- * generic_write_end() will run mark_inode_dirty() if i_size
- * changes. So let's piggyback the i_disksize mark_inode_dirty
- * into that.
- */
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
- }
+ if (ret == 0)
+ update_file_sizes(inode, pos, copied);
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
unlock_page(page);
page_cache_release(page);

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
return ret ? ret : copied;
}

@@ -1299,25 +1307,22 @@ static int ext3_writeback_write_end(struct file *file,
{
handle_t *handle = ext3_journal_current_handle();
struct inode *inode = file->f_mapping->host;
- int ret = 0, ret2;
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT3_I(inode)->i_disksize)
- EXT3_I(inode)->i_disksize = new_i_size;
-
- ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
- page, fsdata);
- copied = ret2;
- if (ret2 < 0)
- ret = ret2;
+ int ret;

- ret2 = ext3_journal_stop(handle);
- if (!ret)
- ret = ret2;
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+ update_file_sizes(inode, pos, copied);
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
+ ret = ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
return ret ? ret : copied;
}

@@ -1338,15 +1343,23 @@ static int ext3_journalled_write_end(struct file *file,
if (copied < len) {
if (!PageUptodate(page))
copied = 0;
- page_zero_new_buffers(page, from+copied, to);
+ page_zero_new_buffers(page, from + copied, to);
+ to = from + copied;
}

ret = walk_page_buffers(handle, page_buffers(page), from,
to, &partial, write_end_fn);
if (!partial)
SetPageUptodate(page);
- if (pos+copied > inode->i_size)
- i_size_write(inode, pos+copied);
+
+ if (pos + copied > inode->i_size)
+ i_size_write(inode, pos + copied);
+ /*
+ * There may be allocated blocks outside of i_size because
+ * we failed to copy some data. Prepare for truncate.
+ */
+ if (pos + len > inode->i_size)
+ ext3_orphan_add(handle, inode);
EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
if (inode->i_size > EXT3_I(inode)->i_disksize) {
EXT3_I(inode)->i_disksize = inode->i_size;
@@ -1361,6 +1374,8 @@ static int ext3_journalled_write_end(struct file *file,
unlock_page(page);
page_cache_release(page);

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
return ret ? ret : copied;
}

@@ -1428,17 +1443,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
return 0;
}

-static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
-{
- if (buffer_mapped(bh))
- return ext3_journal_dirty_data(handle, bh);
- return 0;
-}

2009-04-01 00:08:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors (version 4)

On Tue, 31 Mar 2009 11:46:18 +0200
Jan Kara <[email protected]> wrote:

> Hello,
>
> as Aneesh pointed to me, I forgot to add truncation to the data=journaled
> mode. This patch fixes it. Hopefully final version of the fix ;).
>

I _think_ I got the right version here.

>
> >From ec5c2977f87328fb93fee1aa35043bafeb53cea1 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Wed, 25 Mar 2009 18:51:52 +0100
> Subject: [PATCH] ext3: Avoid false EIO errors (version 4)
>
> Sometimes block_write_begin() can map buffers in a page but later we fail to
> copy data into those buffers (because the source page has been paged out in the
> mean time).

Really? We should just page it back in again. Do you mean "it was an
invalid address and we got -EFAULT"? Or "a parallel thread unmapped
it" or...?

> We then end up with !uptodate mapped buffers.

OK, I suppose that has to be a legal buffer state.

> To add a bit more to
> the confusion, block_write_end() does not commit any data (and thus does not
> any mark buffers as uptodate) if we didn't succeed with copying all the data.
>
> Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> missed these cases and thus we were inserting non-uptodate buffers to
> transaction's list which confuses JBD code and it reports IO errors, aborts
> a transaction and generally makes users afraid about their data ;-P.

hm. Did any other filesystems break for these reasons?

> This patch fixes the problem by reorganizing ext3_..._write_end() code to
> first call block_write_end() to mark buffers with valid data uptodate and
> after that we file only uptodate buffers to transaction's lists.
>
> We also fix a problem where we could leave blocks allocated beyond i_size
> (i_disksize in fact) because of failed write. We now add inode to orphan
> list when write fails (to be safe in case we crash) and then truncate blocks
> beyond i_size in a separate transaction.
>
> Signed-off-by: Jan Kara <[email protected]>
> Reviewed-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/ext3/inode.c | 139 +++++++++++++++++++++++++++++--------------------------
> 1 files changed, 74 insertions(+), 65 deletions(-)
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 4a09ff1..d3ef656 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1149,12 +1149,15 @@ static int ext3_write_begin(struct file *file, struct address_space *mapping,
> struct page **pagep, void **fsdata)
> {
> struct inode *inode = mapping->host;
> - int ret, needed_blocks = ext3_writepage_trans_blocks(inode);
> + int ret;
> handle_t *handle;
> int retries = 0;
> struct page *page;
> pgoff_t index;
> unsigned from, to;
> + /* Reserve one block more for addition to orphan list in case
> + * we allocate blocks but write fails for some reason */
> + int needed_blocks = ext3_writepage_trans_blocks(inode) + 1;
>
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> @@ -1184,15 +1187,20 @@ retry:
> }
> write_begin_failed:
> if (ret) {
> - ext3_journal_stop(handle);
> - unlock_page(page);
> - page_cache_release(page);
> /*
> * block_write_begin may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Add inode to orphan list in case we crash before truncate
> + * finishes.
> */
> if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ext3_journal_stop(handle);
> + unlock_page(page);
> + page_cache_release(page);
> + if (pos + len > inode->i_size)
> vmtruncate(inode, inode->i_size);
> }
> if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
> @@ -1211,6 +1219,18 @@ int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> return err;
> }
>
> +/* For ordered writepage and write_end functions */
> +static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> +{
> + /*
> + * Write could have mapped the buffer but it didn't copy the data in
> + * yet. So avoid filing such buffer into a transaction.
> + */
> + if (buffer_mapped(bh) && buffer_uptodate(bh))
> + return ext3_journal_dirty_data(handle, bh);
> + return 0;
> +}
> +
> /* For write_end() in data=journal mode */
> static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> {
> @@ -1221,26 +1241,20 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> }
>
> /*
> - * Generic write_end handler for ordered and writeback ext3 journal modes.
> - * We can't use generic_write_end, because that unlocks the page and we need to
> - * unlock the page after ext3_journal_stop, but ext3_journal_stop must run
> - * after block_write_end.
> + * This is nasty and subtle: ext3_write_begin() could have allocated blocks
> + * for the whole page but later we failed to copy the data in. Update inode
> + * size according to what we managed to copy. The rest is going to be
> + * truncated in write_end function.
> */
> -static int ext3_generic_write_end(struct file *file,
> - struct address_space *mapping,
> - loff_t pos, unsigned len, unsigned copied,
> - struct page *page, void *fsdata)
> +static void update_file_sizes(struct inode *inode, loff_t pos, unsigned copied)
> {
> - struct inode *inode = file->f_mapping->host;
> -
> - copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -
> - if (pos+copied > inode->i_size) {
> - i_size_write(inode, pos+copied);
> + /* What matters to us is i_disksize. We don't write i_size anywhere */
> + if (pos + copied > inode->i_size)
> + i_size_write(inode, pos + copied);
> + if (pos + copied > EXT3_I(inode)->i_disksize) {
> + EXT3_I(inode)->i_disksize = pos + copied;
> mark_inode_dirty(inode);
> }
> -
> - return copied;
> }
>
> /*
> @@ -1260,35 +1274,29 @@ static int ext3_ordered_write_end(struct file *file,
> unsigned from, to;
> int ret = 0, ret2;
>
> - from = pos & (PAGE_CACHE_SIZE - 1);
> - to = from + len;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
>
> + from = pos & (PAGE_CACHE_SIZE - 1);
> + to = from + copied;
> ret = walk_page_buffers(handle, page_buffers(page),
> - from, to, NULL, ext3_journal_dirty_data);
> + from, to, NULL, journal_dirty_data_fn);
>
> - if (ret == 0) {
> - /*
> - * generic_write_end() will run mark_inode_dirty() if i_size
> - * changes. So let's piggyback the i_disksize mark_inode_dirty
> - * into that.
> - */
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> - }
> + if (ret == 0)
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> ret2 = ext3_journal_stop(handle);
> if (!ret)
> ret = ret2;
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1299,25 +1307,22 @@ static int ext3_writeback_write_end(struct file *file,
> {
> handle_t *handle = ext3_journal_current_handle();
> struct inode *inode = file->f_mapping->host;
> - int ret = 0, ret2;
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT3_I(inode)->i_disksize)
> - EXT3_I(inode)->i_disksize = new_i_size;
> -
> - ret2 = ext3_generic_write_end(file, mapping, pos, len, copied,
> - page, fsdata);
> - copied = ret2;
> - if (ret2 < 0)
> - ret = ret2;
> + int ret;
>
> - ret2 = ext3_journal_stop(handle);
> - if (!ret)
> - ret = ret2;
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> + update_file_sizes(inode, pos, copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> + ret = ext3_journal_stop(handle);
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1338,15 +1343,23 @@ static int ext3_journalled_write_end(struct file *file,
> if (copied < len) {
> if (!PageUptodate(page))
> copied = 0;
> - page_zero_new_buffers(page, from+copied, to);
> + page_zero_new_buffers(page, from + copied, to);
> + to = from + copied;
> }
>
> ret = walk_page_buffers(handle, page_buffers(page), from,
> to, &partial, write_end_fn);
> if (!partial)
> SetPageUptodate(page);
> - if (pos+copied > inode->i_size)
> - i_size_write(inode, pos+copied);
> +
> + if (pos + copied > inode->i_size)
> + i_size_write(inode, pos + copied);
> + /*
> + * There may be allocated blocks outside of i_size because
> + * we failed to copy some data. Prepare for truncate.
> + */
> + if (pos + len > inode->i_size)
> + ext3_orphan_add(handle, inode);
> EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
> if (inode->i_size > EXT3_I(inode)->i_disksize) {
> EXT3_I(inode)->i_disksize = inode->i_size;
> @@ -1361,6 +1374,8 @@ static int ext3_journalled_write_end(struct file *file,
> unlock_page(page);
> page_cache_release(page);
>
> + if (pos + len > inode->i_size)
> + vmtruncate(inode, inode->i_size);
> return ret ? ret : copied;
> }
>
> @@ -1428,17 +1443,11 @@ static int bput_one(handle_t *handle, struct buffer_head *bh)
> return 0;
> }
>
> -static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> -{
> - if (buffer_mapped(bh))
> - return ext3_journal_dirty_data(handle, bh);
> - return 0;
> -}
> -
> static int buffer_unmapped(handle_t *handle, struct buffer_head *bh)
> {
> return !buffer_mapped(bh);
> }
> +
> /*
> * Note that we always start a transaction even if we're not journalling
> * data. This is to preserve ordering: any hole instantiation within
> --
> 1.6.0.2

2009-04-01 09:49:43

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: Avoid false EIO errors (version 4)

On Tue 31-03-09 17:06:21, Andrew Morton wrote:
> On Tue, 31 Mar 2009 11:46:18 +0200
> Jan Kara <[email protected]> wrote:
>
> > Hello,
> >
> > as Aneesh pointed to me, I forgot to add truncation to the data=journaled
> > mode. This patch fixes it. Hopefully final version of the fix ;).
> >
>
> I _think_ I got the right version here.
Yes. I'm sorry I haven't started adding "version" suffix to the patches
at the second submission... That would make is easier for you I guess.

> > >From ec5c2977f87328fb93fee1aa35043bafeb53cea1 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <[email protected]>
> > Date: Wed, 25 Mar 2009 18:51:52 +0100
> > Subject: [PATCH] ext3: Avoid false EIO errors (version 4)
> >
> > Sometimes block_write_begin() can map buffers in a page but later we fail to
> > copy data into those buffers (because the source page has been paged out in the
> > mean time).
>
> Really? We should just page it back in again. Do you mean "it was an
> invalid address and we got -EFAULT"? Or "a parallel thread unmapped
> it" or...?
I meant parallel thread freeing pages unmapped it.

> > We then end up with !uptodate mapped buffers.
>
> OK, I suppose that has to be a legal buffer state.
Yes, it is a state which makes sence. ext3 code just was not expecting
it.

> > To add a bit more to
> > the confusion, block_write_end() does not commit any data (and thus does not
> > any mark buffers as uptodate) if we didn't succeed with copying all the data.
> >
> > Commit f4fc66a894546bdc88a775d0e83ad20a65210bcb (ext3: convert to new aops)
> > missed these cases and thus we were inserting non-uptodate buffers to
> > transaction's list which confuses JBD code and it reports IO errors, aborts
> > a transaction and generally makes users afraid about their data ;-P.
>
> hm. Did any other filesystems break for these reasons?
Well, I've checked and ext4/ocfs2 are fine because they add inode, not
individual buffers, to the transaction and writepage() then handles those
buffers fine (they are not marked dirty so we just skip them). ext4
definitely has a problem with blocks instantiated outside of i_size, Aneesh
already sent a fix for that. ocfs2 might have this problem as well but I'm
not sure - it may be a legal state for it, although I doubt it (adding Mark
to CC).
Simple filesystems not doing journaling and using generic_write_end()
don't care (ext2, udf, fat and most of the others). So I *hope* noone else
has this problem (at least grepping for generic_write_end and
block_write_end does not seem to reveal any other callers which would have
problems with copied < len or block_write_end() setting copied to 0).

> > This patch fixes the problem by reorganizing ext3_..._write_end() code to
> > first call block_write_end() to mark buffers with valid data uptodate and
> > after that we file only uptodate buffers to transaction's lists.
> >
> > We also fix a problem where we could leave blocks allocated beyond i_size
> > (i_disksize in fact) because of failed write. We now add inode to orphan
> > list when write fails (to be safe in case we crash) and then truncate blocks
> > beyond i_size in a separate transaction.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > Reviewed-by: Aneesh Kumar K.V <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-05 04:09:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] ext4: truncate the file properly if we fail to copy data from userspace.

On Tue, Mar 31, 2009 at 02:59:26PM +0530, Aneesh Kumar K.V wrote:
> In generic_perform_write if we fail to copy the user data we don't
> update the inode->i_size. We should truncate the file in the above case
> so that we don't have blocks allocated outside inode->i_size. Add
> the inode to orphan list in the same transaction as block allocation
> This ensures that if we crash in between the recovery would do the truncate.

Same problem as my comment in for the last patch; it seems rather
dangerous to try to call ext4_orphan_add() outside of ext4_truncate().
Can we instead call vmtruncate() inside the same transaction handle?
i.e., figure out how many journal credits will be needed for the
potential truncate, and add it to number of credits to reserve for the
begin_write and/or write_end handle, and then call vmtruncate before
calling ext4_journal_stop(). Can anyone see a problem with this
approach?

- Ted

2009-04-05 04:09:40

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote:
> We should add inode to the orphan list in the same transaction
> as block allocation. This ensures that if we crash after a failed
> block allocation and before we do a vmtruncate we don't leak block
> (ie block marked as used in bitmap but not claimed by the inode).

How likely is this going to happen? If it's a failure in the block
allocation, we will have really end up with blocks outside i_size?
And in that case, I'm missing how this ends up being a leaked block,
since presumably the block is still being referenced by the inode. Am
I missing something here?

I guess it can happen if do_journal_get_write_access() returns an
error, which could potentially happen if there is a ENOMEM error
coming from the jbd2 layer. But that brings up another potential
problem. It's possible for vmtruncate() to fail; if ext4_truncate()
fails too early (particularly in ext4_ext_truncate, due to a memory
error in a jbd2 routines), it's possible for it to return without
calling ext4_orphan_del() --- and stray inodes on the orphan list will
cause the kernel to panic on umount.

I think this can be fixed by making sure that ext4_truncate() and
ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
paths. That *should* the problem, since at the moment, it doesn't
look vmtruncate() will return without calling inode->i_op->truncate().
But could you double check this carefully?

Thanks,

- Ted

2009-04-06 10:05:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

Hi,

On Sat 04-04-09 23:11:16, Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote:
> > We should add inode to the orphan list in the same transaction
> > as block allocation. This ensures that if we crash after a failed
> > block allocation and before we do a vmtruncate we don't leak block
> > (ie block marked as used in bitmap but not claimed by the inode).
>
> How likely is this going to happen? If it's a failure in the block
> allocation, we will have really end up with blocks outside i_size?
> And in that case, I'm missing how this ends up being a leaked block,
> since presumably the block is still being referenced by the inode. Am
> I missing something here?
Aneesh's changelog was not completely precise here. First note that some
problems (namely those in ext4_write_begin()) can happen only if blocksize
< pagesize - there we can succeed in allocating some blocks for the page
but not all of them. Since we then decide to redo the whole write, we have
to truncate blocks we have already allocated.
The problem in ext4_..._write_end() is different (and rather easy to hit
under heavy memory pressure) - the problem is that generic_perform_write()
fails to copy data into our kernel page because the user page from which
we should copy the data has been paged-out. In this situation we again
decide to redo the write and so we should truncate the already allocated
blocks since i_size is still set to the value before write. Otherwise
we'd have blocks allocated beyond i_size and so a crash before we
successfully redo the write would "leak" blocks (you're right the inode
would be still referencing them but they'll be above i_size and I guess
it could confuse some code).

> I guess it can happen if do_journal_get_write_access() returns an
> error, which could potentially happen if there is a ENOMEM error
> coming from the jbd2 layer. But that brings up another potential
> problem. It's possible for vmtruncate() to fail; if ext4_truncate()
> fails too early (particularly in ext4_ext_truncate, due to a memory
> error in a jbd2 routines), it's possible for it to return without
> calling ext4_orphan_del() --- and stray inodes on the orphan list will
> cause the kernel to panic on umount.
>
> I think this can be fixed by making sure that ext4_truncate() and
> ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> paths. That *should* the problem, since at the moment, it doesn't
> look vmtruncate() will return without calling inode->i_op->truncate().
> But could you double check this carefully?
Ah, OK, that should be fixed. But note that current ext4_setattr()
does exactly the same thing on standard truncates - it adds inode to
orphan list and calls inode_setattr() which end's up calling vmtruncate().

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-04-06 10:07:17

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/2] [PATCH] ext4: truncate the file properly if we fail to copy data from userspace.

On Sat 04-04-09 23:22:11, Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:59:26PM +0530, Aneesh Kumar K.V wrote:
> > In generic_perform_write if we fail to copy the user data we don't
> > update the inode->i_size. We should truncate the file in the above case
> > so that we don't have blocks allocated outside inode->i_size. Add
> > the inode to orphan list in the same transaction as block allocation
> > This ensures that if we crash in between the recovery would do the truncate.
>
> Same problem as my comment in for the last patch; it seems rather
> dangerous to try to call ext4_orphan_add() outside of ext4_truncate().
> Can we instead call vmtruncate() inside the same transaction handle?
> i.e., figure out how many journal credits will be needed for the
> potential truncate, and add it to number of credits to reserve for the
> begin_write and/or write_end handle, and then call vmtruncate before
> calling ext4_journal_stop(). Can anyone see a problem with this
> approach?
Just adding inode to orphan list seems more elegant to me and as I wrote
in my previous email, we already do it from a common path so if there are
bugs, we should fix them anyway ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-05 04:31:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote:
> > I think this can be fixed by making sure that ext4_truncate() and
> > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> > paths. That *should* the problem, since at the moment, it doesn't
> > look vmtruncate() will return without calling inode->i_op->truncate().
> > But could you double check this carefully?
>
> Ah, OK, that should be fixed. But note that current ext4_setattr()
> does exactly the same thing on standard truncates - it adds inode to
> orphan list and calls inode_setattr() which end's up calling vmtruncate().

I finally had a chance to take a closer look at this. ext4_setattr()
is safe, because it does this after calling inode_setattr():

/* If inode_setattr's call to ext4_truncate failed to get a
* transaction handle at all, we need to clean up the in-core
* orphan list manually. */
if (inode->i_nlink)
ext4_orphan_del(NULL, inode);

So if we put the same thing into the ext4_write_begin() and
ext4_writeback_write_end() in these patches, it should be OK. The key
is that if the inode is already is on the orphan list, it's harmless
to call ext4_orphan_add() --- and if the inode has already been
removed from the orphan list, it's harmless to call ext4_orphan_del()
on it.

- Ted

2009-06-05 06:22:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

I've tried applying these two patches to the ext4 patch queue. On a
metadata-heavy workload (specifically, fsx), it causes a 5%
degradation in the wall clock run-time of the fsx run-time.

Maybe there's no way around that, but it's rather unfortunate....

- Ted

2009-06-05 07:24:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote:
> I've tried applying these two patches to the ext4 patch queue. On a
> metadata-heavy workload (specifically, fsx), it causes a 5%
> degradation in the wall clock run-time of the fsx run-time.

I just did some more timing tests, and it looks like I can't confirm
it. The test times are noisy enough I need to run some more under very
controlled circumsntances. So ignore this for now, it might not be a
problem after all.

- Ted

2009-06-05 23:42:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Fri 05-06-09 03:24:17, Theodore Tso wrote:
> On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote:
> > I've tried applying these two patches to the ext4 patch queue. On a
> > metadata-heavy workload (specifically, fsx), it causes a 5%
> > degradation in the wall clock run-time of the fsx run-time.
>
> I just did some more timing tests, and it looks like I can't confirm
> it. The test times are noisy enough I need to run some more under very
> controlled circumsntances. So ignore this for now, it might not be a
> problem after all.
It would be strange if the patches caused a measurable slowdown. They
change something only in the failure path where we fail to allocate some
blocks or fail to copy in all the data and these should better be
exceptional cases...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-05 23:44:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2] [PATCH] ext4: Add inode to the orphan list during block allocation failure

On Fri 05-06-09 00:31:17, Theodore Tso wrote:
> On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote:
> > > I think this can be fixed by making sure that ext4_truncate() and
> > > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> > > paths. That *should* the problem, since at the moment, it doesn't
> > > look vmtruncate() will return without calling inode->i_op->truncate().
> > > But could you double check this carefully?
> >
> > Ah, OK, that should be fixed. But note that current ext4_setattr()
> > does exactly the same thing on standard truncates - it adds inode to
> > orphan list and calls inode_setattr() which end's up calling vmtruncate().
>
> I finally had a chance to take a closer look at this. ext4_setattr()
> is safe, because it does this after calling inode_setattr():
>
> /* If inode_setattr's call to ext4_truncate failed to get a
> * transaction handle at all, we need to clean up the in-core
> * orphan list manually. */
> if (inode->i_nlink)
> ext4_orphan_del(NULL, inode);
>
> So if we put the same thing into the ext4_write_begin() and
> ext4_writeback_write_end() in these patches, it should be OK. The key
> is that if the inode is already is on the orphan list, it's harmless
> to call ext4_orphan_add() --- and if the inode has already been
> removed from the orphan list, it's harmless to call ext4_orphan_del()
> on it.
Ah, good point. I haven't noticed this in ext4_inode_setattr when I
was checking it. Aneesh, will you take care of it for ext4? I'll cook up
an appropriate change for ext3... Thanks Ted for looking into this.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-08 04:35:24

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V2 1/2] ext4: Add inode to the orphan list during block allocation failure

We should add inode to the orphan list in the same transaction
as block allocation. This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 22 ++++++++++++++++++++--
1 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 00ebc99..820cb58 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1459,7 +1459,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks;
handle_t *handle;
int retries = 0;
struct page *page;
@@ -1470,6 +1470,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
"dev %s ino %lu pos %llu len %u flags %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, flags);
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason
+ */
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1503,14 +1508,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,

if (ret) {
unlock_page(page);
- ext4_journal_stop(handle);
page_cache_release(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
+ *
+ * Add inode to orphan list in case we crash before
+ * truncate finishes
*/
if (pos + len > inode->i_size)
+ ext4_orphan_add(handle, inode);
+
+ ext4_journal_stop(handle);
+ if (pos + len > inode->i_size)
vmtruncate(inode, inode->i_size);
}

@@ -1519,6 +1530,13 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
out:
if (ret)
block_unlock_hole_extend(inode);
+ /*
+ * if vmtruncate failed to remove the inode from
+ * orphan list remove ourself
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+
return ret;
}

--
1.6.3.1.244.gf9275


2009-06-08 04:35:36

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size. We should truncate the file in the above case
so that we don't have blocks allocated outside inode->i_size. Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the truncate.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 120 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 93 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 820cb58..a90e8eb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1549,6 +1549,53 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
return ext4_handle_dirty_metadata(handle, NULL, bh);
}

+static int ext4_generic_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ int i_size_changed = 0;
+ struct inode *inode = mapping->host;
+ handle_t *handle = ext4_journal_current_handle();
+
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ *
+ * But it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ if (pos + copied > inode->i_size) {
+ i_size_write(inode, pos + copied);
+ i_size_changed = 1;
+ }
+
+ if (pos + copied > EXT4_I(inode)->i_disksize) {
+ /* We need to mark inode dirty even if
+ * new_i_size is less that inode->i_size
+ * bu greater than i_disksize.(hint delalloc)
+ */
+ ext4_update_i_disksize(inode, (pos + copied));
+ i_size_changed = 1;
+ }
+ unlock_page(page);
+ page_cache_release(page);
+ block_unlock_hole_extend(inode);
+
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (i_size_changed)
+ ext4_mark_inode_dirty(handle, inode);
+
+ return copied;
+}
+
/*
* We need to pick up the new inode size which generic_commit_write gave us
* `file' can be NULL - eg, when called from page_symlink().
@@ -1572,21 +1619,15 @@ static int ext4_ordered_write_end(struct file *file,
ret = ext4_jbd2_file_inode(handle, inode);

if (ret == 0) {
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
if (ret2 < 0)
ret = ret2;
}
@@ -1594,6 +1635,14 @@ static int ext4_ordered_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ /*
+ * if vmtruncate failed to remove the inode from
+ * orphan list remove ourself
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
return ret ? ret : copied;
}

@@ -1605,25 +1654,21 @@ static int ext4_writeback_write_end(struct file *file,
handle_t *handle = ext4_journal_current_handle();
struct inode *inode = mapping->host;
int ret = 0, ret2;
- loff_t new_i_size;

trace_mark(ext4_writeback_write_end,
"dev %s ino %lu pos %llu len %u copied %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, copied);
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
+
if (ret2 < 0)
ret = ret2;

@@ -1631,6 +1676,14 @@ static int ext4_writeback_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ /*
+ * if vmtruncate failed to remove the inode from
+ * orphan list remove ourself
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
return ret ? ret : copied;
}

@@ -1675,12 +1728,25 @@ static int ext4_journalled_write_end(struct file *file,
}

unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
- page_cache_release(page);
block_unlock_hole_extend(inode);

2009-06-08 16:29:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -V2 1/2] ext4: Add inode to the orphan list during block allocation failure

Aneesh, thanks for working on a replacement patch, but I had already
added the call to ext4_orphan_del in the patch in the patch queue.
The difference is I added the call right after the call to
vmtruncate(), which avoids calling ext4_orphan_del() in a few cases
where it's not necessary.

This is what I have in the patch queue.

- Ted

ext4: Avoid leaking blocks after a block allocation failure

From: "Aneesh Kumar K.V" <[email protected]>

We should add inode to the orphan list in the same transaction
as block allocation. This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 95a3f45..036552a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1462,7 +1462,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
{
struct inode *inode = mapping->host;
- int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+ int ret, needed_blocks;
handle_t *handle;
int retries = 0;
struct page *page;
@@ -1473,6 +1473,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
"dev %s ino %lu pos %llu len %u flags %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, flags);
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason
+ */
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
@@ -1506,15 +1511,30 @@ retry:

if (ret) {
unlock_page(page);
- ext4_journal_stop(handle);
page_cache_release(page);
/*
* block_write_begin may have instantiated a few blocks
* outside i_size. Trim these off again. Don't need
* i_size_read because we hold i_mutex.
+ *
+ * Add inode to orphan list in case we crash before
+ * truncate finishes
*/
if (pos + len > inode->i_size)
+ ext4_orphan_add(handle, inode);
+
+ ext4_journal_stop(handle);
+ if (pos + len > inode->i_size) {
vmtruncate(inode, inode->i_size);
+ /*
+ * If vmtruncate failed early the inode might
+ * still be on the orphan list; we need to
+ * make sure the inode is removed from the
+ * orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
}

if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))

2009-06-08 16:30:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

I had also already fixed this up in the patch queue. This is what is
currently there.

- Ted

From: "Aneesh Kumar K.V" <[email protected]>

ext4: truncate the file properly if we fail to copy data from userspace

In generic_perform_write if we fail to copy the user data we don't
update the inode->i_size. We should truncate the file in the above
case so that we don't have blocks allocated outside inode->i_size. Add
the inode to orphan list in the same transaction as block allocation
This ensures that if we crash in between the recovery would do the
truncate.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
CC: Jan Kara <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 50ef1a2..8225f21 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1551,6 +1551,52 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
return ext4_handle_dirty_metadata(handle, NULL, bh);
}

+static int ext4_generic_write_end(struct file *file,
+ struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ int i_size_changed = 0;
+ struct inode *inode = mapping->host;
+ handle_t *handle = ext4_journal_current_handle();
+
+ copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
+
+ /*
+ * No need to use i_size_read() here, the i_size
+ * cannot change under us because we hold i_mutex.
+ *
+ * But it's important to update i_size while still holding page lock:
+ * page writeout could otherwise come in and zero beyond i_size.
+ */
+ if (pos + copied > inode->i_size) {
+ i_size_write(inode, pos + copied);
+ i_size_changed = 1;
+ }
+
+ if (pos + copied > EXT4_I(inode)->i_disksize) {
+ /* We need to mark inode dirty even if
+ * new_i_size is less that inode->i_size
+ * bu greater than i_disksize.(hint delalloc)
+ */
+ ext4_update_i_disksize(inode, (pos + copied));
+ i_size_changed = 1;
+ }
+ unlock_page(page);
+ page_cache_release(page);
+
+ /*
+ * Don't mark the inode dirty under page lock. First, it unnecessarily
+ * makes the holding time of page lock longer. Second, it forces lock
+ * ordering of page lock and transaction start for journaling
+ * filesystems.
+ */
+ if (i_size_changed)
+ ext4_mark_inode_dirty(handle, inode);
+
+ return copied;
+}
+
/*
* We need to pick up the new inode size which generic_commit_write gave us
* `file' can be NULL - eg, when called from page_symlink().
@@ -1574,21 +1620,15 @@ static int ext4_ordered_write_end(struct file *file,
ret = ext4_jbd2_file_inode(handle, inode);

if (ret == 0) {
- loff_t new_i_size;
-
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
if (ret2 < 0)
ret = ret2;
}
@@ -1596,6 +1636,18 @@ static int ext4_ordered_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size) {
+ vmtruncate(inode, inode->i_size);
+ /*
+ * If vmtruncate failed early the inode might still be
+ * on the orphan list; we need to make sure the inode
+ * is removed from the orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
+
return ret ? ret : copied;
}

@@ -1607,25 +1659,21 @@ static int ext4_writeback_write_end(struct file *file,
handle_t *handle = ext4_journal_current_handle();
struct inode *inode = mapping->host;
int ret = 0, ret2;
- loff_t new_i_size;

trace_mark(ext4_writeback_write_end,
"dev %s ino %lu pos %llu len %u copied %u",
inode->i_sb->s_id, inode->i_ino,
(unsigned long long) pos, len, copied);
- new_i_size = pos + copied;
- if (new_i_size > EXT4_I(inode)->i_disksize) {
- ext4_update_i_disksize(inode, new_i_size);
- /* We need to mark inode dirty even if
- * new_i_size is less that inode->i_size
- * bu greater than i_disksize.(hint delalloc)
- */
- ext4_mark_inode_dirty(handle, inode);
- }
-
- ret2 = generic_write_end(file, mapping, pos, len, copied,
+ ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
page, fsdata);
copied = ret2;
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
+
if (ret2 < 0)
ret = ret2;

@@ -1633,6 +1681,17 @@ static int ext4_writeback_write_end(struct file *file,
if (!ret)
ret = ret2;

+ if (pos + len > inode->i_size) {
+ vmtruncate(inode, inode->i_size);
+ /*
+ * If vmtruncate failed early the inode might still be
+ * on the orphan list; we need to make sure the inode
+ * is removed from the orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
return ret ? ret : copied;
}

@@ -1677,10 +1736,27 @@ static int ext4_journalled_write_end(struct file *file,
}

unlock_page(page);
+ page_cache_release(page);
+ if (pos + len > inode->i_size)
+ /* if we have allocated more blocks and copied
+ * less. We will have blocks allocated outside
+ * inode->i_size. So truncate them
+ */
+ ext4_orphan_add(handle, inode);
+
ret2 = ext4_journal_stop(handle);
if (!ret)
ret = ret2;
- page_cache_release(page);
+ if (pos + len > inode->i_size) {
+ vmtruncate(inode, inode->i_size);
+ /*
+ * If vmtruncate failed early the inode might still be
+ * on the orphan list; we need to make sure the inode
+ * is removed from the orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }

return ret ? ret : copied;
}

2009-06-08 16:44:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

On Mon, Jun 08, 2009 at 12:29:59PM -0400, Theodore Tso wrote:
> I had also already fixed this up in the patch queue. This is what is
> currently there.
>
> - Ted
>
> From: "Aneesh Kumar K.V" <[email protected]>
>
> ext4: truncate the file properly if we fail to copy data from userspace
>
> In generic_perform_write if we fail to copy the user data we don't
> update the inode->i_size. We should truncate the file in the above
> case so that we don't have blocks allocated outside inode->i_size. Add
> the inode to orphan list in the same transaction as block allocation
> This ensures that if we crash in between the recovery would do the
> truncate.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> CC: Jan Kara <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 50ef1a2..8225f21 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1551,6 +1551,52 @@ static int write_end_fn(handle_t *handle, struct buffer_head *bh)
> return ext4_handle_dirty_metadata(handle, NULL, bh);
> }
>
> +static int ext4_generic_write_end(struct file *file,
> + struct address_space *mapping,
> + loff_t pos, unsigned len, unsigned copied,
> + struct page *page, void *fsdata)
> +{
> + int i_size_changed = 0;
> + struct inode *inode = mapping->host;
> + handle_t *handle = ext4_journal_current_handle();
> +
> + copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> +
> + /*
> + * No need to use i_size_read() here, the i_size
> + * cannot change under us because we hold i_mutex.
> + *
> + * But it's important to update i_size while still holding page lock:
> + * page writeout could otherwise come in and zero beyond i_size.
> + */
> + if (pos + copied > inode->i_size) {
> + i_size_write(inode, pos + copied);
> + i_size_changed = 1;
> + }
> +
> + if (pos + copied > EXT4_I(inode)->i_disksize) {
> + /* We need to mark inode dirty even if
> + * new_i_size is less that inode->i_size
> + * bu greater than i_disksize.(hint delalloc)
> + */
> + ext4_update_i_disksize(inode, (pos + copied));
> + i_size_changed = 1;
> + }
> + unlock_page(page);
> + page_cache_release(page);


You missed a block_unlock_hole_extend here. So you need to either fix
jan's patches or move this patch after jan's patch


> +
> + /*
> + * Don't mark the inode dirty under page lock. First, it unnecessarily
> + * makes the holding time of page lock longer. Second, it forces lock
> + * ordering of page lock and transaction start for journaling
> + * filesystems.
> + */
> + if (i_size_changed)
> + ext4_mark_inode_dirty(handle, inode);
> +
> + return copied;
> +}
> +
> /*
> * We need to pick up the new inode size which generic_commit_write gave us
> * `file' can be NULL - eg, when called from page_symlink().
> @@ -1574,21 +1620,15 @@ static int ext4_ordered_write_end(struct file *file,
> ret = ext4_jbd2_file_inode(handle, inode);
>
> if (ret == 0) {
> - loff_t new_i_size;
> -
> - new_i_size = pos + copied;
> - if (new_i_size > EXT4_I(inode)->i_disksize) {
> - ext4_update_i_disksize(inode, new_i_size);
> - /* We need to mark inode dirty even if
> - * new_i_size is less that inode->i_size
> - * bu greater than i_disksize.(hint delalloc)
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> -
> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> + ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> page, fsdata);
> copied = ret2;
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> if (ret2 < 0)
> ret = ret2;
> }
> @@ -1596,6 +1636,18 @@ static int ext4_ordered_write_end(struct file *file,
> if (!ret)
> ret = ret2;
>
> + if (pos + len > inode->i_size) {
> + vmtruncate(inode, inode->i_size);
> + /*
> + * If vmtruncate failed early the inode might still be
> + * on the orphan list; we need to make sure the inode
> + * is removed from the orphan list in that case.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> +
> +
> return ret ? ret : copied;
> }
>
> @@ -1607,25 +1659,21 @@ static int ext4_writeback_write_end(struct file *file,
> handle_t *handle = ext4_journal_current_handle();
> struct inode *inode = mapping->host;
> int ret = 0, ret2;
> - loff_t new_i_size;
>
> trace_mark(ext4_writeback_write_end,
> "dev %s ino %lu pos %llu len %u copied %u",
> inode->i_sb->s_id, inode->i_ino,
> (unsigned long long) pos, len, copied);
> - new_i_size = pos + copied;
> - if (new_i_size > EXT4_I(inode)->i_disksize) {
> - ext4_update_i_disksize(inode, new_i_size);
> - /* We need to mark inode dirty even if
> - * new_i_size is less that inode->i_size
> - * bu greater than i_disksize.(hint delalloc)
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> -
> - ret2 = generic_write_end(file, mapping, pos, len, copied,
> + ret2 = ext4_generic_write_end(file, mapping, pos, len, copied,
> page, fsdata);
> copied = ret2;
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> +
> if (ret2 < 0)
> ret = ret2;
>
> @@ -1633,6 +1681,17 @@ static int ext4_writeback_write_end(struct file *file,
> if (!ret)
> ret = ret2;
>
> + if (pos + len > inode->i_size) {
> + vmtruncate(inode, inode->i_size);
> + /*
> + * If vmtruncate failed early the inode might still be
> + * on the orphan list; we need to make sure the inode
> + * is removed from the orphan list in that case.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> +
> return ret ? ret : copied;
> }
>
> @@ -1677,10 +1736,27 @@ static int ext4_journalled_write_end(struct file *file,
> }
>
> unlock_page(page);
> + page_cache_release(page);
> + if (pos + len > inode->i_size)
> + /* if we have allocated more blocks and copied
> + * less. We will have blocks allocated outside
> + * inode->i_size. So truncate them
> + */
> + ext4_orphan_add(handle, inode);
> +
> ret2 = ext4_journal_stop(handle);
> if (!ret)
> ret = ret2;
> - page_cache_release(page);
> + if (pos + len > inode->i_size) {
> + vmtruncate(inode, inode->i_size);
> + /*
> + * If vmtruncate failed early the inode might still be
> + * on the orphan list; we need to make sure the inode
> + * is removed from the orphan list in that case.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }

Here also need special care with block_unlock_hole_extend. We need to do
a vmtruncate outside block_unlock_hole_extend.


>
> return ret ? ret : copied;
> }

I think i both the case Jan's patch
allocate-blocks-correctly-with-subpage-blocksize need an update ? Since you have put
the patch before Jan's changes.

-aneesh

2009-06-08 19:14:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

On Mon, Jun 08, 2009 at 10:13:57PM +0530, Aneesh Kumar K.V wrote:
> I think i both the case Jan's patch
> allocate-blocks-correctly-with-subpage-blocksize need an update ? Since you have put
> the patch before Jan's changes.

OK, here's how I updated Jan's patch. I'm going to assume that we'll
submit the ext4 patch queue immediately as soon as the merge window
opens, since my impression is Jan is still waiting for some mm
developers to review his patch set.

Annesh, does this look good to you?

Jan, if we go down this path, you'll need to update your ext4 patch
with this updated one, to take into account the changes from Aneesh's
patch.

(We could do it the other way, but that means even more patches queued
up behind Jan's patch series, and I didn't realize originally that
Aneesh intended for these to be queued after Jan's patches. So it
seemed easier to just order Aneesh's patches to avoid block allocation
leaks first, since they are at least functionally (if not
syntactically) independent of the subpagesize blocksize patches.)

- Ted


ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize

From: Jan Kara <[email protected]>

In a situation like:
truncate(f, 1024);
a = mmap(f, 0, 4096);
a[0] = 'a';
truncate(f, 4096);

we end up with a dirty page which does not have all blocks allocated /
reserved. Fix the problem by using new VFS infrastructure.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 2 +-
fs/ext4/inode.c | 22 ++++++++++++++++++++--
2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9c35a7b..f2a2591 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3069,7 +3069,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
*/
if (!(mode & FALLOC_FL_KEEP_SIZE)) {
if (new_size > i_size_read(inode))
- i_size_write(inode, new_size);
+ block_extend_i_size(inode, new_size, 0);
if (new_size > EXT4_I(inode)->i_disksize)
ext4_update_i_disksize(inode, new_size);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f4cf46e..6450b55 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1491,7 +1491,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
index = pos >> PAGE_CACHE_SHIFT;
from = pos & (PAGE_CACHE_SIZE - 1);
to = from + len;
-
+ block_lock_hole_extend(inode, pos);
retry:
handle = ext4_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -1550,6 +1550,8 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
+ if (ret)
+ block_unlock_hole_extend(inode);
return ret;
}

@@ -1595,6 +1597,7 @@ static int ext4_generic_write_end(struct file *file,
}
unlock_page(page);
page_cache_release(page);
+ block_unlock_hole_extend(inode);

/*
* Don't mark the inode dirty under page lock. First, it unnecessarily
@@ -1739,6 +1742,8 @@ static int ext4_journalled_write_end(struct file *file,

unlock_page(page);
page_cache_release(page);
+ block_unlock_hole_extend(inode);
+
if (pos + len > inode->i_size)
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -2888,6 +2893,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
}
*fsdata = (void *)0;
trace_ext4_da_write_begin(inode, pos, len, flags);
+ block_lock_hole_extend(inode, pos);
retry:
/*
* With delayed allocation, we don't log the i_disksize update
@@ -2930,6 +2936,8 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
out:
+ if (ret)
+ block_unlock_hole_extend(inode);
return ret;
}

@@ -3468,7 +3476,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
loff_t end = offset + ret;
if (end > inode->i_size) {
ei->i_disksize = end;
- i_size_write(inode, end);
+ block_extend_i_size(inode, offset, ret);
/*
* We're going to return a positive `ret'
* here due to non-zero-length I/O, so there's
@@ -3513,6 +3521,7 @@ static const struct address_space_operations ext4_ordered_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_ordered_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
@@ -3528,6 +3537,7 @@ static const struct address_space_operations ext4_writeback_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_writeback_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
@@ -3543,6 +3553,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_write_begin,
.write_end = ext4_journalled_write_end,
+ .extend_i_size = block_extend_i_size,
.set_page_dirty = ext4_journalled_set_page_dirty,
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
@@ -3558,6 +3569,7 @@ static const struct address_space_operations ext4_da_aops = {
.sync_page = block_sync_page,
.write_begin = ext4_da_write_begin,
.write_end = ext4_da_write_end,
+ .extend_i_size = block_extend_i_size,
.bmap = ext4_bmap,
.invalidatepage = ext4_da_invalidatepage,
.releasepage = ext4_releasepage,
@@ -5404,6 +5416,12 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
struct address_space *mapping = inode->i_mapping;

/*
+ * Wait for extending of i_size, after this moment, next truncate /
+ * write can create holes under us but they writeprotect our page so
+ * we'll be called again to fill the hole.
+ */
+ block_wait_on_hole_extend(inode, page_offset(page));
+ /*
* Get i_alloc_sem to stop truncates messing with the inode. We cannot
* get i_mutex because we are already holding mmap_sem.
*/

2009-06-08 19:24:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

Hi,

On Mon 08-06-09 15:14:20, Theodore Tso wrote:
> On Mon, Jun 08, 2009 at 10:13:57PM +0530, Aneesh Kumar K.V wrote:
> > I think i both the case Jan's patch
> > allocate-blocks-correctly-with-subpage-blocksize need an update ? Since you have put
> > the patch before Jan's changes.
>
> OK, here's how I updated Jan's patch. I'm going to assume that we'll
> submit the ext4 patch queue immediately as soon as the merge window
> opens, since my impression is Jan is still waiting for some mm
> developers to review his patch set.
Yes, no review yet :( So put my patches to the end so that they don't
block other fixes in the patch queue.

> Annesh, does this look good to you?
>
> Jan, if we go down this path, you'll need to update your ext4 patch
> with this updated one, to take into account the changes from Aneesh's
> patch.
Thanks for the patch.

Honza

> ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize
>
> From: Jan Kara <[email protected]>
>
> In a situation like:
> truncate(f, 1024);
> a = mmap(f, 0, 4096);
> a[0] = 'a';
> truncate(f, 4096);
>
> we end up with a dirty page which does not have all blocks allocated /
> reserved. Fix the problem by using new VFS infrastructure.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/extents.c | 2 +-
> fs/ext4/inode.c | 22 ++++++++++++++++++++--
> 2 files changed, 21 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 9c35a7b..f2a2591 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3069,7 +3069,7 @@ static void ext4_falloc_update_inode(struct inode *inode,
> */
> if (!(mode & FALLOC_FL_KEEP_SIZE)) {
> if (new_size > i_size_read(inode))
> - i_size_write(inode, new_size);
> + block_extend_i_size(inode, new_size, 0);
> if (new_size > EXT4_I(inode)->i_disksize)
> ext4_update_i_disksize(inode, new_size);
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f4cf46e..6450b55 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1491,7 +1491,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
> index = pos >> PAGE_CACHE_SHIFT;
> from = pos & (PAGE_CACHE_SIZE - 1);
> to = from + len;
> -
> + block_lock_hole_extend(inode, pos);
> retry:
> handle = ext4_journal_start(inode, needed_blocks);
> if (IS_ERR(handle)) {
> @@ -1550,6 +1550,8 @@ retry:
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> out:
> + if (ret)
> + block_unlock_hole_extend(inode);
> return ret;
> }
>
> @@ -1595,6 +1597,7 @@ static int ext4_generic_write_end(struct file *file,
> }
> unlock_page(page);
> page_cache_release(page);
> + block_unlock_hole_extend(inode);
>
> /*
> * Don't mark the inode dirty under page lock. First, it unnecessarily
> @@ -1739,6 +1742,8 @@ static int ext4_journalled_write_end(struct file *file,
>
> unlock_page(page);
> page_cache_release(page);
> + block_unlock_hole_extend(inode);
> +
> if (pos + len > inode->i_size)
> /* if we have allocated more blocks and copied
> * less. We will have blocks allocated outside
> @@ -2888,6 +2893,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
> }
> *fsdata = (void *)0;
> trace_ext4_da_write_begin(inode, pos, len, flags);
> + block_lock_hole_extend(inode, pos);
> retry:
> /*
> * With delayed allocation, we don't log the i_disksize update
> @@ -2930,6 +2936,8 @@ retry:
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> out:
> + if (ret)
> + block_unlock_hole_extend(inode);
> return ret;
> }
>
> @@ -3468,7 +3476,7 @@ static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
> loff_t end = offset + ret;
> if (end > inode->i_size) {
> ei->i_disksize = end;
> - i_size_write(inode, end);
> + block_extend_i_size(inode, offset, ret);
> /*
> * We're going to return a positive `ret'
> * here due to non-zero-length I/O, so there's
> @@ -3513,6 +3521,7 @@ static const struct address_space_operations ext4_ordered_aops = {
> .sync_page = block_sync_page,
> .write_begin = ext4_write_begin,
> .write_end = ext4_ordered_write_end,
> + .extend_i_size = block_extend_i_size,
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> .releasepage = ext4_releasepage,
> @@ -3528,6 +3537,7 @@ static const struct address_space_operations ext4_writeback_aops = {
> .sync_page = block_sync_page,
> .write_begin = ext4_write_begin,
> .write_end = ext4_writeback_write_end,
> + .extend_i_size = block_extend_i_size,
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> .releasepage = ext4_releasepage,
> @@ -3543,6 +3553,7 @@ static const struct address_space_operations ext4_journalled_aops = {
> .sync_page = block_sync_page,
> .write_begin = ext4_write_begin,
> .write_end = ext4_journalled_write_end,
> + .extend_i_size = block_extend_i_size,
> .set_page_dirty = ext4_journalled_set_page_dirty,
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> @@ -3558,6 +3569,7 @@ static const struct address_space_operations ext4_da_aops = {
> .sync_page = block_sync_page,
> .write_begin = ext4_da_write_begin,
> .write_end = ext4_da_write_end,
> + .extend_i_size = block_extend_i_size,
> .bmap = ext4_bmap,
> .invalidatepage = ext4_da_invalidatepage,
> .releasepage = ext4_releasepage,
> @@ -5404,6 +5416,12 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> struct address_space *mapping = inode->i_mapping;
>
> /*
> + * Wait for extending of i_size, after this moment, next truncate /
> + * write can create holes under us but they writeprotect our page so
> + * we'll be called again to fill the hole.
> + */
> + block_wait_on_hole_extend(inode, page_offset(page));
> + /*
> * Get i_alloc_sem to stop truncates messing with the inode. We cannot
> * get i_mutex because we are already holding mmap_sem.
> */
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-06-08 20:21:00

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

On Mon, Jun 08, 2009 at 03:14:20PM -0400, Theodore Tso wrote:
> On Mon, Jun 08, 2009 at 10:13:57PM +0530, Aneesh Kumar K.V wrote:
> > I think i both the case Jan's patch
> > allocate-blocks-correctly-with-subpage-blocksize need an update ? Since you have put
> > the patch before Jan's changes.
>
> OK, here's how I updated Jan's patch. I'm going to assume that we'll
> submit the ext4 patch queue immediately as soon as the merge window
> opens, since my impression is Jan is still waiting for some mm
> developers to review his patch set.
>
> Annesh, does this look good to you?

I just did a quick look. Should we do a block_unlock_hole_extend after
journal_stop. We do a block_lock_hole_extend before journal_start.

>
> Jan, if we go down this path, you'll need to update your ext4 patch
> with this updated one, to take into account the changes from Aneesh's
> patch.
>
> (We could do it the other way, but that means even more patches queued
> up behind Jan's patch series, and I didn't realize originally that
> Aneesh intended for these to be queued after Jan's patches. So it
> seemed easier to just order Aneesh's patches to avoid block allocation
> leaks first, since they are at least functionally (if not
> syntactically) independent of the subpagesize blocksize patches.)

-aneesh

2009-06-09 10:12:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH -V2 2/2] ext4: truncate the file properly if we fail to copy data from userspace

On Tue 09-06-09 01:50:49, Aneesh Kumar K.V wrote:
> On Mon, Jun 08, 2009 at 03:14:20PM -0400, Theodore Tso wrote:
> > On Mon, Jun 08, 2009 at 10:13:57PM +0530, Aneesh Kumar K.V wrote:
> > > I think i both the case Jan's patch
> > > allocate-blocks-correctly-with-subpage-blocksize need an update ? Since you have put
> > > the patch before Jan's changes.
> >
> > OK, here's how I updated Jan's patch. I'm going to assume that we'll
> > submit the ext4 patch queue immediately as soon as the merge window
> > opens, since my impression is Jan is still waiting for some mm
> > developers to review his patch set.
> >
> > Annesh, does this look good to you?
>
> I just did a quick look. Should we do a block_unlock_hole_extend after
> journal_stop. We do a block_lock_hole_extend before journal_start.
Well, it's not really necessary (and it was not like that in my original
patch). You just have to do it after i_size has been updated and the
function just needs to acquire inode_lock spinlock so there's no risk of
deadlock anywhere...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR