2009-05-07 10:39:41

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH 1/3] ext4: Properly initialize the buffer_head state

These buffer_heads are allocated on stack and are
used only to make get_blocks calls. So we can set the
b_state to 0

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/extents.c | 1 +
fs/ext4/inode.c | 2 +-
fs/mpage.c | 2 +-
3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e963870..10b3028 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3141,6 +3141,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
ret = PTR_ERR(handle);
break;
}
+ map_bh.b_state = 0;
ret = ext4_get_blocks_wrap(handle, inode, block,
max_blocks, &map_bh,
EXT4_CREATE_UNINITIALIZED_EXT, 0, 0);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43884e3..c3cd00f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2104,7 +2104,7 @@ static int mpage_da_map_blocks(struct mpage_da_data *mpd)
if ((mpd->b_state & (1 << BH_Mapped)) &&
!(mpd->b_state & (1 << BH_Delay)))
return 0;
- new.b_state = mpd->b_state;
+ new.b_state = 0;
new.b_blocknr = 0;
new.b_size = mpd->b_size;
next = mpd->b_blocknr;
diff --git a/fs/mpage.c b/fs/mpage.c
index 680ba60..cd98409 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -412,7 +412,7 @@ int mpage_readpage(struct page *page, get_block_t get_block)
struct buffer_head map_bh;
unsigned long first_logical_block = 0;

- clear_buffer_mapped(&map_bh);
+ map_bh.b_state = 0;
bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
&map_bh, &first_logical_block, get_block);
if (bio)
--
1.6.3.rc4



2009-05-07 10:39:41

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

ext4_get_blocks_wrap does a block lookup requesting to
allocate new blocks. A lookup of blocks in prealloc area
result in setting the unwritten flag in buffer_head. So
a write to an unwritten extent will cause the buffer_head
to have unwritten and mapped flag set. Clear hte unwritten
buffer_head flag before requesting to allocate blocks.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/inode.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c3cd00f..f6d7e9b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1149,6 +1149,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
int retval;

clear_buffer_mapped(bh);
+ clear_buffer_unwritten(bh);

/*
* Try to see if we can get the block without requesting
@@ -1179,6 +1180,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
return retval;

/*
+ * The above get_blocks can cause the buffer to be
+ * marked unwritten. So clear the same.
+ */
+ clear_buffer_unwritten(bh);
+
+ /*
* New blocks allocate and/or writing to uninitialized extent
* will possibly result in updating i_data, so we take
* the write lock of i_data_sem, and call get_blocks()
--
1.6.3.rc4


2009-05-07 10:39:43

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH 3/3] vfs: Add BUG_ON for delayed and unwritten extents in submit_bh

We should not do submit_bh for delayed and unwritten extents. So
add a BUG_ON on them.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/buffer.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b3e5be7..ec480c8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2931,6 +2931,8 @@ int submit_bh(int rw, struct buffer_head * bh)
BUG_ON(!buffer_locked(bh));
BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);
+ BUG_ON(buffer_delay(bh));
+ BUG_ON(buffer_unwritten(bh));

/*
* Mask in barrier bit for a write (could be either a WRITE or a
--
1.6.3.rc4


2009-05-07 15:20:31

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

Aneesh Kumar K.V wrote:
> These buffer_heads are allocated on stack and are
> used only to make get_blocks calls. So we can set the
> b_state to 0
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

I'd noticed this too, thanks for fixing up.

> ---
> fs/ext4/extents.c | 1 +
> fs/ext4/inode.c | 2 +-
> fs/mpage.c | 2 +-
> 3 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e963870..10b3028 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3141,6 +3141,7 @@ long ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
> ret = PTR_ERR(handle);
> break;
> }
> + map_bh.b_state = 0;
> ret = ext4_get_blocks_wrap(handle, inode, block,
> max_blocks, &map_bh,
> EXT4_CREATE_UNINITIALIZED_EXT, 0, 0);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 43884e3..c3cd00f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2104,7 +2104,7 @@ static int mpage_da_map_blocks(struct mpage_da_data *mpd)
> if ((mpd->b_state & (1 << BH_Mapped)) &&
> !(mpd->b_state & (1 << BH_Delay)))
> return 0;
> - new.b_state = mpd->b_state;
> + new.b_state = 0;

hm can you explain why we want 0 rather than mpd->b_state? The others
are obvious, b_state was largely uninitialized, but this is changing
what looked like a different intentional initialization. Can you update
the changelog to say why it's wrong?

While we're at it could we name this something other than "new?"

If it's a mapping bh, maybe "map_bh" like normal? :)

> new.b_blocknr = 0;
> new.b_size = mpd->b_size;
> next = mpd->b_blocknr;
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 680ba60..cd98409 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -412,7 +412,7 @@ int mpage_readpage(struct page *page, get_block_t get_block)
> struct buffer_head map_bh;
> unsigned long first_logical_block = 0;
>
> - clear_buffer_mapped(&map_bh);
> + map_bh.b_state = 0;
> bio = do_mpage_readpage(bio, page, 1, &last_block_in_bio,
> &map_bh, &first_logical_block, get_block);
> if (bio)

the rest looks good to me; there are places in the core kernel that
don't initialize state and just clear flags they "know" they'll care
about... it always struck me as messy, clearing state is much much better.

-Eric

2009-05-07 15:36:53

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

Aneesh Kumar K.V wrote:
> ext4_get_blocks_wrap does a block lookup requesting to
> allocate new blocks. A lookup of blocks in prealloc area
> result in setting the unwritten flag in buffer_head. So
> a write to an unwritten extent will cause the buffer_head
> to have unwritten and mapped flag set. Clear hte unwritten
> buffer_head flag before requesting to allocate blocks.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/ext4/inode.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c3cd00f..f6d7e9b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1149,6 +1149,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> int retval;
>
> clear_buffer_mapped(bh);
> + clear_buffer_unwritten(bh);
>
> /*
> * Try to see if we can get the block without requesting
> @@ -1179,6 +1180,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> return retval;
>
> /*
> + * The above get_blocks can cause the buffer to be
> + * marked unwritten. So clear the same.
> + */
> + clear_buffer_unwritten(bh);

hm, thinking out loud here.

ext4_ext_get_blocks() will only set unwritten if (!create) ... but then
ext4_get_blocks_wrap() calls ext4_ext_get_blocks() !create as an
argument no matter what, the first time, for an initial lookup.

But if ext4_get_blocks_wrap() was called with !create, then we return
regardless, so ok - by the time you get to the above hunk, we -are- in
create mode, we're planning to write it ... so I guess clearing the
unwritten state makes sense here.

But is this too late, because it's after this?

/*
* Returns if the blocks have already allocated
*
* Note that if blocks have been preallocated
* ext4_ext_get_block() returns th create = 0
* with buffer head unmapped.
*/
if (retval > 0 && buffer_mapped(bh))
return retval;

I guess not; ext4_ext_get_blocks() won't map the buffer if it's found to
be preallocated/unwritten because it was called with !create. If we're
going on to write it, we want to clear unwritten.

So I guess this looks right, although I can't help but think that in
general, the buffer_head state management is really getting to be a
hard-to-follow mess...

-Eric

2009-05-07 15:37:40

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 3/3] vfs: Add BUG_ON for delayed and unwritten extents in submit_bh

Aneesh Kumar K.V wrote:
> We should not do submit_bh for delayed and unwritten extents. So
> add a BUG_ON on them.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

Makes sense to me, once we get our own house in order. :)

Reviewed-by: Eric Sandeen <[email protected]>

> ---
> fs/buffer.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b3e5be7..ec480c8 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2931,6 +2931,8 @@ int submit_bh(int rw, struct buffer_head * bh)
> BUG_ON(!buffer_locked(bh));
> BUG_ON(!buffer_mapped(bh));
> BUG_ON(!bh->b_end_io);
> + BUG_ON(buffer_delay(bh));
> + BUG_ON(buffer_unwritten(bh));
>
> /*
> * Mask in barrier bit for a write (could be either a WRITE or a


2009-05-08 08:13:28

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Thu, May 07, 2009 at 10:36:49AM -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > ext4_get_blocks_wrap does a block lookup requesting to
> > allocate new blocks. A lookup of blocks in prealloc area
> > result in setting the unwritten flag in buffer_head. So
> > a write to an unwritten extent will cause the buffer_head
> > to have unwritten and mapped flag set. Clear hte unwritten
> > buffer_head flag before requesting to allocate blocks.
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > fs/ext4/inode.c | 7 +++++++
> > 1 files changed, 7 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index c3cd00f..f6d7e9b 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -1149,6 +1149,7 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > int retval;
> >
> > clear_buffer_mapped(bh);
> > + clear_buffer_unwritten(bh);
> >
> > /*
> > * Try to see if we can get the block without requesting
> > @@ -1179,6 +1180,12 @@ int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode, sector_t block,
> > return retval;
> >
> > /*
> > + * The above get_blocks can cause the buffer to be
> > + * marked unwritten. So clear the same.
> > + */
> > + clear_buffer_unwritten(bh);
>
> hm, thinking out loud here.
>
> ext4_ext_get_blocks() will only set unwritten if (!create) ... but then
> ext4_get_blocks_wrap() calls ext4_ext_get_blocks() !create as an
> argument no matter what, the first time, for an initial lookup.
>
> But if ext4_get_blocks_wrap() was called with !create, then we return
> regardless, so ok - by the time you get to the above hunk, we -are- in
> create mode, we're planning to write it ... so I guess clearing the
> unwritten state makes sense here.
>
> But is this too late, because it's after this?
>
> /*
> * Returns if the blocks have already allocated
> *
> * Note that if blocks have been preallocated
> * ext4_ext_get_block() returns th create = 0
> * with buffer head unmapped.
> */
> if (retval > 0 && buffer_mapped(bh))
> return retval;
>
> I guess not; ext4_ext_get_blocks() won't map the buffer if it's found to
> be preallocated/unwritten because it was called with !create. If we're
> going on to write it, we want to clear unwritten.
>
> So I guess this looks right, although I can't help but think that in
> general, the buffer_head state management is really getting to be a
> hard-to-follow mess...

To further clarify what i think was causing the I/O error.

1) We do a multi block delayed alloc to prealloc space. That would get
us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12)
2) pdflush attempt to write some pages (say mapping block 10) which cause
a get_block call with create = 1. That would attempt to convert
uninitialized extent to initialized one. This can cause multiple blocks
to be marked initialized. ( say 10, 11 , 12)
3) We do an overwrite of block 11. That would mean we call
ext4_da_get_block_prep, which would again do a get_block for block 11
with create = 0. But remember we already have buffer_head marked with
BH_Unwritten flag. But the buffer was unmapped because it is unwritten
( We are fixing this mess in the patch for 2.6.31).
4) The get_block call will find the buffer mapped due to step b. And
mark the buffer_head mapped. There we go . We end up with buffer_head
mapped and unwritten
5) later in ext4_da_get_block_prep we check whether the buffer_head in marked
BH_Unwritten if so we set the block number to ~0. This is introduced by
[PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten extents
6) So now we have a buffer_head that is mapped, unwritten, with
b_blocknr = ~0. That would result in the I/O error.

-aneesh



2009-05-11 08:22:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

On Thu, May 07, 2009 at 10:20:26AM -0500, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > These buffer_heads are allocated on stack and are
> > used only to make get_blocks calls. So we can set the
> > b_state to 0
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> I'd noticed this too, thanks for fixing up.

Is this just a clean-up, or does this fix a bug? It wasn't obvious
the patch description. (I'm not a big fan of Ingo's 'Impact: '
header, but it is good to make sure the patch description explains the
impact of a patch.)

In the long run, we should really look at cleaning up the get_blocks*
interfaces so they don't use buffer_head when all they're really doing
is passing back a block number. All aside from the confusion it
causes, it also bloats our stack usage.

- Ted

2009-05-11 09:24:48

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

On Sun, May 10, 2009 at 07:57:41PM -0400, Theodore Tso wrote:
> On Thu, May 07, 2009 at 10:20:26AM -0500, Eric Sandeen wrote:
> > Aneesh Kumar K.V wrote:
> > > These buffer_heads are allocated on stack and are
> > > used only to make get_blocks calls. So we can set the
> > > b_state to 0
> > >
> > > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> >
> > I'd noticed this too, thanks for fixing up.
>
> Is this just a clean-up, or does this fix a bug? It wasn't obvious
> the patch description. (I'm not a big fan of Ingo's 'Impact: '
> header, but it is good to make sure the patch description explains the
> impact of a patch.)

If you are taking patch 3/3 you would need this patch. But otherwise you
can drop this. The fix is actually in patch 2/3.


-aneesh

2009-05-11 11:31:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

On Mon, May 11, 2009 at 02:54:43PM +0530, Aneesh Kumar K.V wrote:
> On Sun, May 10, 2009 at 07:57:41PM -0400, Theodore Tso wrote:
> > On Thu, May 07, 2009 at 10:20:26AM -0500, Eric Sandeen wrote:
> > > Aneesh Kumar K.V wrote:
> > > > These buffer_heads are allocated on stack and are
> > > > used only to make get_blocks calls. So we can set the
> > > > b_state to 0
> > > >
> > > > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > >
> > > I'd noticed this too, thanks for fixing up.
> >
> > Is this just a clean-up, or does this fix a bug? It wasn't obvious
> > the patch description. (I'm not a big fan of Ingo's 'Impact: '
> > header, but it is good to make sure the patch description explains the
> > impact of a patch.)
>
> If you are taking patch 3/3 you would need this patch. But otherwise you
> can drop this. The fix is actually in patch 2/3.

OK, thanks. And these patches are orthogonal to these patches in the
patch queue, right?

fix-sub-block-zeroing-for-unwritten-extents
use-a-fake-block-number-for-delalloc-bh

So it looks like we probably want to push these two, plus patch 2/3.
I'm also very much concerned about your for-2.6.31 patch. It is
complex, so we probably want wait, but it looks like we have a real
bug which these patches will just expose. And your 2/3 patch isn't
going to fix that, right, since these are orthogonal problems.

(Again, if you expect patches to supercede existing ones in the ext4
patchwork or which are in the patch queue, please tell me. Sometimes
it really isn't obvious....)

- Ted


2009-05-11 14:49:29

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

Theodore Tso wrote:
> On Thu, May 07, 2009 at 10:20:26AM -0500, Eric Sandeen wrote:
>> Aneesh Kumar K.V wrote:
>>> These buffer_heads are allocated on stack and are used only to
>>> make get_blocks calls. So we can set the b_state to 0
>>>
>>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>>>
>> I'd noticed this too, thanks for fixing up.
>
> Is this just a clean-up, or does this fix a bug? It wasn't obvious
> the patch description. (I'm not a big fan of Ingo's 'Impact: '
> header, but it is good to make sure the patch description explains
> the impact of a patch.)

Aneesh responded, but AFAICT it doesn't actually fix a bug, but letting
buffer heads float around with indeterminate state can't be good in the
long run.


> In the long run, we should really look at cleaning up the get_blocks*
> interfaces so they don't use buffer_head when all they're really
> doing is passing back a block number. All aside from the confusion
> it causes, it also bloats our stack usage.

Overall, the kernel in general could use something in place of these
buffer-heads-that-aren't-buffer-heads, imho.

Pretty sure we use it for more than just a block nr, but it's not really
a buffer head either, it's one of these "map_bh's" - we should probably
at least try to consistently label them as such in ext*

-Eric

> - Ted


2009-05-12 03:08:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
> ext4_get_blocks_wrap does a block lookup requesting to
> allocate new blocks. A lookup of blocks in prealloc area
> result in setting the unwritten flag in buffer_head. So
> a write to an unwritten extent will cause the buffer_head
> to have unwritten and mapped flag set. Clear hte unwritten
> buffer_head flag before requesting to allocate blocks.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

I've rewritten the commit changelog to this, which I believe more
accurately describes the patch. Comments, please?

ext4: Clear the unwritten buffer_head flag after the extent is initialized

From: "Aneesh Kumar K.V" <[email protected]>

The BH_Unwritten flag indicates that the buffer is allocated on disk
but has not been written; that is, the disk was part of a persistent
preallocation area. That flag should only be set when a get_blocks()
function is looking up a inode's logical to physical block mapping.

When ext4_get_blocks_wrap() is called with create=1, the uninitialized
extent is converted into an initialized one, so the BH_Unwritten flag
is no longer appropriate. Hence, we need to make sure the
BH_Unwritten is not left set, to avoid the ensuing confusion and
hilarty.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

- Ted

2009-05-12 03:17:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/3] vfs: Add BUG_ON for delayed and unwritten extents in submit_bh

On Thu, May 07, 2009 at 04:09:30PM +0530, Aneesh Kumar K.V wrote:
> We should not do submit_bh for delayed and unwritten extents. So
> add a BUG_ON on them.

Hmm, maybe we should add a BUG_ON(buffer_new(bh)) here to? That flag
should never leak out to submit_bh as well, right?

- Ted

2009-05-12 03:17:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

On Thu, May 07, 2009 at 04:09:28PM +0530, Aneesh Kumar K.V wrote:
> These buffer_heads are allocated on stack and are
> used only to make get_blocks calls. So we can set the
> b_state to 0
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>

I've rewritten the commit description to this:

ext4: Properly initialize the buffer_head state

From: "Aneesh Kumar K.V" <[email protected]>

These struct buffer_heads are allocated on the stack (and hence are
initialized with stack garbage). They are only used to call a
get_blocks() function, so that's mostly OK, but b_state must be
initialized to be 0.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

Comments?

- Ted

2009-05-12 04:46:50

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Mon, May 11, 2009 at 11:08:56PM -0400, Theodore Tso wrote:
> On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
> > ext4_get_blocks_wrap does a block lookup requesting to
> > allocate new blocks. A lookup of blocks in prealloc area
> > result in setting the unwritten flag in buffer_head. So
> > a write to an unwritten extent will cause the buffer_head
> > to have unwritten and mapped flag set. Clear hte unwritten
> > buffer_head flag before requesting to allocate blocks.
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> I've rewritten the commit changelog to this, which I believe more
> accurately describes the patch. Comments, please?
>
> ext4: Clear the unwritten buffer_head flag after the extent is initialized
>
> From: "Aneesh Kumar K.V" <[email protected]>
>
> The BH_Unwritten flag indicates that the buffer is allocated on disk
> but has not been written; that is, the disk was part of a persistent
> preallocation area. That flag should only be set when a get_blocks()
> function is looking up a inode's logical to physical block mapping.
>
> When ext4_get_blocks_wrap() is called with create=1, the uninitialized
> extent is converted into an initialized one, so the BH_Unwritten flag
> is no longer appropriate. Hence, we need to make sure the
> BH_Unwritten is not left set, to avoid the ensuing confusion and
> hilarty.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>

I think it is good. But one thing missing in the commit message is,
what happens if we do a write to prealloc space. Since a
get_block(create = 1) is now split into __get_block(create = 0 ) and
__get_block(create = 1). That would mean if we pass a buffer head with
BH_Unwritten cleared we will have


1) buffer_head as BH_Unwritten cleared.

2) __get_block(create = 0 ) -> Since it is prealloc space we will have
BH_Unwritten set .

3) __get_block(create = 1) -> get the blocks out of prealloc space.
and retun with BH_Mapped set.

That would imply we have BH_Unwritten and BH_Mapped set in the above
case which is wrong. So we need a BH_Unwritten clear between (2) and
(3). The patch does the same. May be we need to capture it in commit
message.

-aneesh




2009-05-12 04:47:27

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 1/3] ext4: Properly initialize the buffer_head state

On Mon, May 11, 2009 at 11:17:56PM -0400, Theodore Tso wrote:
> On Thu, May 07, 2009 at 04:09:28PM +0530, Aneesh Kumar K.V wrote:
> > These buffer_heads are allocated on stack and are
> > used only to make get_blocks calls. So we can set the
> > b_state to 0
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
>
> I've rewritten the commit description to this:
>
> ext4: Properly initialize the buffer_head state
>
> From: "Aneesh Kumar K.V" <[email protected]>
>
> These struct buffer_heads are allocated on the stack (and hence are
> initialized with stack garbage). They are only used to call a
> get_blocks() function, so that's mostly OK, but b_state must be
> initialized to be 0.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> Comments?
>

good.

-aneesh

2009-05-12 04:52:21

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 3/3] vfs: Add BUG_ON for delayed and unwritten extentsin submit_bh

On Mon, May 11, 2009 at 11:17:09PM -0400, Theodore Tso wrote:
> On Thu, May 07, 2009 at 04:09:30PM +0530, Aneesh Kumar K.V wrote:
> > We should not do submit_bh for delayed and unwritten extents. So
> > add a BUG_ON on them.
>
> Hmm, maybe we should add a BUG_ON(buffer_new(bh)) here to? That flag
> should never leak out to submit_bh as well, right?
>

Yes . But since BH_New is used by all the filesystems we need to
verify them and make sure other filesystems gets it right. IIUC BH_Delay and
BH_Unwritten are only used by ext4, xfs and btrfs

-aneesh

2009-05-12 13:25:49

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 3/3] vfs: Add BUG_ON for delayed and unwritten extentsin submit_bh

Aneesh Kumar K.V wrote:
> On Mon, May 11, 2009 at 11:17:09PM -0400, Theodore Tso wrote:
>> On Thu, May 07, 2009 at 04:09:30PM +0530, Aneesh Kumar K.V wrote:
>>> We should not do submit_bh for delayed and unwritten extents. So
>>> add a BUG_ON on them.
>> Hmm, maybe we should add a BUG_ON(buffer_new(bh)) here to? That flag
>> should never leak out to submit_bh as well, right?
>>
>
> Yes . But since BH_New is used by all the filesystems we need to
> verify them and make sure other filesystems gets it right. IIUC BH_Delay and
> BH_Unwritten are only used by ext4, xfs and btrfs
>
> -aneesh

I have confirmed that xfs won't care what we do in submit_bh... so
BUG_ONs here related to delay/unwritten won't worry xfs.

-Eric

2009-05-13 18:56:59

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

Aneesh Kumar K.V wrote:
> On Mon, May 11, 2009 at 11:08:56PM -0400, Theodore Tso wrote:
>> On Thu, May 07, 2009 at 04:09:29PM +0530, Aneesh Kumar K.V wrote:
>>> ext4_get_blocks_wrap does a block lookup requesting to
>>> allocate new blocks. A lookup of blocks in prealloc area
>>> result in setting the unwritten flag in buffer_head. So
>>> a write to an unwritten extent will cause the buffer_head
>>> to have unwritten and mapped flag set. Clear hte unwritten
>>> buffer_head flag before requesting to allocate blocks.
>>>
>>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> I've rewritten the commit changelog to this, which I believe more
>> accurately describes the patch. Comments, please?
>>
>> ext4: Clear the unwritten buffer_head flag after the extent is initialized
>>
>> From: "Aneesh Kumar K.V" <[email protected]>
>>
>> The BH_Unwritten flag indicates that the buffer is allocated on disk
>> but has not been written; that is, the disk was part of a persistent
>> preallocation area. That flag should only be set when a get_blocks()
>> function is looking up a inode's logical to physical block mapping.
>>
>> When ext4_get_blocks_wrap() is called with create=1, the uninitialized
>> extent is converted into an initialized one, so the BH_Unwritten flag
>> is no longer appropriate. Hence, we need to make sure the
>> BH_Unwritten is not left set, to avoid the ensuing confusion and
>> hilarty.
>>
>> Signed-off-by: Aneesh Kumar K.V <[email protected]>
>> Signed-off-by: "Theodore Ts'o" <[email protected]>
>>
>
> I think it is good. But one thing missing in the commit message is,
> what happens if we do a write to prealloc space. Since a
> get_block(create = 1) is now split into __get_block(create = 0 ) and
> __get_block(create = 1). That would mean if we pass a buffer head with
> BH_Unwritten cleared we will have
>
>
> 1) buffer_head as BH_Unwritten cleared.
>
> 2) __get_block(create = 0 ) -> Since it is prealloc space we will have
> BH_Unwritten set .

Why do we need to set BH_Unwritten on a !create call at all?

Or maybe another way of asking is, are there any !create callers of
get_block who -want- BH_Unwritten set?

Which is to say, should we just not be setting BH_Unwritten in get_block
in the !create case, ever?

The comment:

/*
+ * The above get_blocks can cause the buffer to be
+ * marked unwritten. So clear the same.
+ */
+ clear_buffer_unwritten(bh);

is imho not helpful; to me it says "we -just- set this, so clear it!"
It leaves me scratching my head.

> 3) __get_block(create = 1) -> get the blocks out of prealloc space.
> and retun with BH_Mapped set.
>
> That would imply we have BH_Unwritten and BH_Mapped set in the above
> case which is wrong. So we need a BH_Unwritten clear between (2) and
> (3). The patch does the same. May be we need to capture it in commit
> message.

Better in comments, I think. :)

-Eric

> -aneesh
>
>
>


2009-05-13 22:28:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> The comment:
>
> /*
> + * The above get_blocks can cause the buffer to be
> + * marked unwritten. So clear the same.
> + */
> + clear_buffer_unwritten(bh);
>
> is imho not helpful; to me it says "we -just- set this, so clear it!"
> It leaves me scratching my head.

I've updated it the comment to say this instead.

/*
* When we call get_blocks without the create flag, the
* BH_Unwritten flag could have gotten set if the blocks
* requested were part of a uninitialized extent. We need to
* clear this flag now that we are committed to convert all or
* part of the uninitialized extent to be an initialized
* extent. This is because we need to avoid the combination
* of BH_Unwritten and BH_Mapped flags being simultaneously
* set on the buffer_head.
*/

I'm still not entirely clear what is the downside of having
BH_Unwritten and BH_Mapped at the same time; whether it is just a CPU
time-waster that will cause some unneeded calls to get_blocks() in the
write_begin path which "just" wastes a little CPU, or whether there
are other massive disasters that take place on the combination of
BH_Unwritten and BH_Mapped.

What we *desperately* need is documentation that describes which
functions sets these flags, and whether they are intended for
long-term use (either stored in buffer heads attached to a page, or in
the buffer cache, and if so, which), or just as short-term hacks to
pass information between two functions, or multiple functions deep in
a call stack, and if so, what the implications are of the bits, and
when they should be cleared --- and then we should add assert's or
BUG_ON's to enforce what we think the abstraction invariant should be.

Magic flags attached to objects that are really there because we don't
want to change function signatures are very scary; for example,
i_delalloc_reserved_flag in ext4_inode_info --- it's not clear to me
exactly what this is supposed to do, but I note that mballoc.c does a
lot more with the flag than balloc.c. So it's not clear to me if
there is magic semantics associated with that flag which are being
done in mballoc.c, but we never got around to implementing them in
balloc.c. This leads to the kind of code fragility that I've been
complaining about for some time.

The fact that Aneesh missed one of these magic code paths in his patch
attests to why this style of programming is really bad and not
long-term maintainable. So we need to document all of this mess, and
then gradually clean it up, one tiny patch at a time, each one
describing what we were doing before, and what the new way of doing
this should be, and why it's safe to make that change.

> > That would imply we have BH_Unwritten and BH_Mapped set in the above
> > case which is wrong. So we need a BH_Unwritten clear between (2) and
> > (3). The patch does the same. May be we need to capture it in commit
> > message.
>
> Better in comments, I think. :)

The comprehensive description of all of this should be in comments,
yes. See the example of the sort of comments that I've added in the
clear-unwritten-bh-flag and initialize-map_bh-state. If we don't like
such verbose comments, then we should use simpler programming
semantics when passing information between the various abstracton
layers. :-)

- Ted

2009-05-14 05:40:14

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> >
> > I think it is good. But one thing missing in the commit message is,
> > what happens if we do a write to prealloc space. Since a
> > get_block(create = 1) is now split into __get_block(create = 0 ) and
> > __get_block(create = 1). That would mean if we pass a buffer head with
> > BH_Unwritten cleared we will have
> >
> >
> > 1) buffer_head as BH_Unwritten cleared.
> >
> > 2) __get_block(create = 0 ) -> Since it is prealloc space we will have
> > BH_Unwritten set .
>
> Why do we need to set BH_Unwritten on a !create call at all?
>
> Or maybe another way of asking is, are there any !create callers of
> get_block who -want- BH_Unwritten set?
>
> Which is to say, should we just not be setting BH_Unwritten in get_block
> in the !create case, ever?

It should only be set in the !create case. With create == 1, we would
have already converted the uninitialized extent to initialized one and
the buffer_head should not be unwritten at all. My understanding is
unwritten flag is used to indicate the buffer_head state between a
write_begin and write_page phase with delayed allocation. ie, when we
write to fallocate space, since we have delalloc enabled, we just
do a block lookup (get_block with create = 0). The buffer_head returned
in the above case should have unwritten set so that during writepage
we do the actual block allocation (get_block writh create = 1)
looking at the flag.


>
> The comment:
>
> /*
> + * The above get_blocks can cause the buffer to be
> + * marked unwritten. So clear the same.
> + */
> + clear_buffer_unwritten(bh);
>
> is imho not helpful; to me it says "we -just- set this, so clear it!"
> It leaves me scratching my head.
>
> > 3) __get_block(create = 1) -> get the blocks out of prealloc space.
> > and retun with BH_Mapped set.
> >
> > That would imply we have BH_Unwritten and BH_Mapped set in the above
> > case which is wrong. So we need a BH_Unwritten clear between (2) and
> > (3). The patch does the same. May be we need to capture it in commit
> > message.
>
> Better in comments, I think. :)
>

-aneesh

2009-05-14 06:00:42

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Wed, May 13, 2009 at 06:28:47PM -0400, Theodore Tso wrote:
> On Wed, May 13, 2009 at 01:56:56PM -0500, Eric Sandeen wrote:
> > The comment:
> >
> > /*
> > + * The above get_blocks can cause the buffer to be
> > + * marked unwritten. So clear the same.
> > + */
> > + clear_buffer_unwritten(bh);
> >
> > is imho not helpful; to me it says "we -just- set this, so clear it!"
> > It leaves me scratching my head.
>
> I've updated it the comment to say this instead.
>
> /*
> * When we call get_blocks without the create flag, the
> * BH_Unwritten flag could have gotten set if the blocks
> * requested were part of a uninitialized extent. We need to
> * clear this flag now that we are committed to convert all or
> * part of the uninitialized extent to be an initialized
> * extent. This is because we need to avoid the combination
> * of BH_Unwritten and BH_Mapped flags being simultaneously
> * set on the buffer_head.
> */

The last line in the above comment is not a problem with the latest
kernel with all the patches in the patch queue. The patch that does that
is "ext4: Mark the unwritten buffer_head as mapped during write_begin"

The unwritten and mapped state together was a problem with the code path
we had before in ext4_da_get_block_prep
we had:

ret = ext4_get_blocks(NULL, inode, iblock, 1, bh_result, 0);
.....
..
} else if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
if (buffer_unwritten(bh_result)) {
bh_result.b_blocknr = ~0;
....
}
}

The above can result in

1) We do a multi block delayed alloc to prealloc space. That would get
us multiple buffer_heads marked with BH_Unwritten. (say 10, 11, 12)
2) pdflush attempt to write some pages (say mapping block 10) which
cause
a get_block call with create = 1. That would attempt to convert
uninitialized extent to initialized one. This can cause multiple blocks
to be marked initialized. ( say 10, 11 , 12)
3) We do an overwrite of block 11. That would mean we call
ext4_da_get_block_prep, which would again do a get_block for block 11
with create = 0. But remember we already have buffer_head marked with
BH_Unwritten flag. But the buffer was unmapped because it is unwritten
( We are fixing this mess in the patch for 2.6.31).
4) The get_block call will find the buffer mapped due to step b. And
mark the buffer_head mapped. There we go . We end up with buffer_head
mapped and unwritten
5) later in ext4_da_get_block_prep we check whether the buffer_head in
marked
BH_Unwritten if so we set the block number to ~0. This is introduced by
[PATCH -V4 1/2] Fix sub-block zeroing for buffered writes into unwritten
extents
6) So now we have a buffer_head that is mapped, unwritten, with
b_blocknr = ~0. That would result in the I/O error.


Now that we have the actual block number in bh_result.b_blocknr I guess
we should be ok to have the unwritten flag set. But then i guess it is
against the VFS assumption of buffer_head state. The state unwritten
indicate the blocks are not yet written. So if we don't do a
clear_buffer_unwritten as in the patch we end up with a block that is
written ( done via create = 1 get_block call) and still marked
unwritten. (see the 6 step example above)

Now to explain what "ext4: Mark the unwritten buffer_head as mapped
during write_begin" does.

It marks the buffer_head as mapped in the write_begin phase.
And that helps in making sure we don't end up calling get_block
multiple times. So with delayed allocation we now have between
the write_begin and writepage phase a buffer_head which is mapped
and unwritten for blocks in the fallocate space. Once we do
writepage for the block we will have buffer_head which is mapped and
unwritten flag cleared. The unwritten get cleared when we do a get_block
call with create = 1.


To achieve the above we need to make sure writepage code path looks at
the unwritten flag and does a get_blocks call with create = 1. With
mainline we have in writepage code path

mpage_da_map_blocks(...)

if ((mpd->b_state & (1 << BH_Mapped)) &&
!(mpd->b_state & (1 << BH_Delay)))
return 0;
...
..
ext4_da_get_block_write()


If we start marking buffer_head mapped for fallocate blocks in the
write_begin phase, then the above code will return without doing
any get_block(create = 1) call. That means we don't convert the
uninitialized extent to initialized one. So along with marking
buffer_head mapped and unwritten in the write_begin phase we also
need changes to writepage code path which does something like below

mpage_da_map_blocks(...)

if ((mpd->b_state & (1 << BH_Mapped)) &&
!(mpd->b_state & (1 << BH_Delay)) &&
!(mpd->b_state & (1 << BH_Unwritten)))
return 0;
...
..
ext4_da_get_block_write()

this is done in patch "ext4: Mark the unwritten buffer_head as mapped
during write_begin".

-aneesh

2009-05-14 13:14:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/3] ext4: Clear the unwritten buffer_head flag properly

On Thu, May 14, 2009 at 11:10:02AM +0530, Aneesh Kumar K.V wrote:
>
> It should only be set in the !create case. With create == 1, we would
> have already converted the uninitialized extent to initialized one and
> the buffer_head should not be unwritten at all. My understanding is
> unwritten flag is used to indicate the buffer_head state between a
> write_begin and write_page phase with delayed allocation. ie, when we
> write to fallocate space, since we have delalloc enabled, we just
> do a block lookup (get_block with create = 0). The buffer_head returned
> in the above case should have unwritten set so that during writepage
> we do the actual block allocation (get_block writh create = 1)
> looking at the flag.

At the moment, ext4_da_get_block_prep(), which is used as a callback
function by ext4_da_write_begin(), checks for buffer_unwritten(), and
if true, set BH_New and BH_Mapped.

So between the time that that write_begin() and the time when the page
is actually written out, BH_Unwritten and BH_Mapped will both be set.
If we end up bailing due to some error of some kind, such that we
don't complete the write(2) operation we *can* have some pages that
are simultaneously have BH_Unwritten and BH_Mapped flags set. So this
had better be a harmless case, since I think it can happen.

What's confusing then is some of the comments which have been made
about why BH_Unwritten and BH_Mapped simultaneously are a bad. It may
be bad at some points in time, but at other points in time it's
completely normal operations. Or am I missing something?

- Ted