2009-09-04 20:06:22

by Chris Mason

[permalink] [raw]
Subject: [PATCH RFC] Add locking to ext3_do_update_inode

Hello everyone,

I've been struggling with this off and on while I've been testing the
data=guarded work. The symptom is corrupted orphan lists and inodes
with the wrong i_size stored on disk. I was convinced the
data=guarded code was just missing a call to ext3_mark_inode_dirty, but
tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
wasn't actually making it to the drive.

ext3_mark_inode_dirty can be called without locks held (atime updates
and a few others), so the data=guarded code uses locks while updating
the in-memory inode, and then calls ext3_mark_inode_dirty
without any locks held.

But, ext3_mark_inode_dirty has no internal locking to make sure that
only one CPU is updating the buffer head at a time. Generally this
works out ok because everyone that changes the inode then calls
ext3_mark_inode_dirty themselves. Even though it races, eventually
someone updates the buffer heads and things move on.

But there is still a risk of the wrong values getting in, and the
data=guarded code seems to hit the race very often.

Since everyone that changes the inode also logs it, it should be
possible to fix this with some memory barriers. I'll leave that as an
exercise to the reader and lock the buffer head instead.

It it probably a good idea to have a different patch series for lockless
bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears
EXT3_STATE_NEW without any locks held.

Signed-off-by: Chris Mason <[email protected]>

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 00f5dc1..6a0a056 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3466,6 +3479,10 @@ static int ext3_do_update_inode(handle_t *handle,
struct buffer_head *bh = iloc->bh;
int err = 0, rc, block;

+again:
+ /* we can't allow multiple procs in here at once, its a bit racey */
+ lock_buffer(bh);
+
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT3_STATE_NEW)
@@ -3525,16 +3542,20 @@ static int ext3_do_update_inode(handle_t *handle,
/* If this is the first large file
* created, add a flag to the superblock.
*/
+ unlock_buffer(bh);
err = ext3_journal_get_write_access(handle,
EXT3_SB(sb)->s_sbh);
if (err)
goto out_brelse;
+
ext3_update_dynamic_rev(sb);
EXT3_SET_RO_COMPAT_FEATURE(sb,
EXT3_FEATURE_RO_COMPAT_LARGE_FILE);
handle->h_sync = 1;
err = ext3_journal_dirty_metadata(handle,
EXT3_SB(sb)->s_sbh);
+ /* get our lock and start over */
+ goto again;
}
}
}
@@ -3557,6 +3578,7 @@ static int ext3_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);

BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
+ unlock_buffer(bh);
rc = ext3_journal_dirty_metadata(handle, bh);
if (!err)
err = rc;


2009-09-07 22:14:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] Add locking to ext3_do_update_inode

On Fri 04-09-09 16:06:13, Chris Mason wrote:
> Hello everyone,
>
> I've been struggling with this off and on while I've been testing the
> data=guarded work. The symptom is corrupted orphan lists and inodes
> with the wrong i_size stored on disk. I was convinced the
> data=guarded code was just missing a call to ext3_mark_inode_dirty, but
> tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
> wasn't actually making it to the drive.
>
> ext3_mark_inode_dirty can be called without locks held (atime updates
> and a few others), so the data=guarded code uses locks while updating
> the in-memory inode, and then calls ext3_mark_inode_dirty
> without any locks held.
>
> But, ext3_mark_inode_dirty has no internal locking to make sure that
> only one CPU is updating the buffer head at a time. Generally this
> works out ok because everyone that changes the inode then calls
> ext3_mark_inode_dirty themselves. Even though it races, eventually
> someone updates the buffer heads and things move on.
>
> But there is still a risk of the wrong values getting in, and the
> data=guarded code seems to hit the race very often.
>
> Since everyone that changes the inode also logs it, it should be
> possible to fix this with some memory barriers. I'll leave that as an
> exercise to the reader and lock the buffer head instead.
>
> It it probably a good idea to have a different patch series for lockless
> bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears
> EXT3_STATE_NEW without any locks held.
>
> Signed-off-by: Chris Mason <[email protected]>
The patch looks good. I've added it to my tree...

Honza

> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index 00f5dc1..6a0a056 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3466,6 +3479,10 @@ static int ext3_do_update_inode(handle_t *handle,
> struct buffer_head *bh = iloc->bh;
> int err = 0, rc, block;
>
> +again:
> + /* we can't allow multiple procs in here at once, its a bit racey */
> + lock_buffer(bh);
> +
> /* For fields not not tracking in the in-memory inode,
> * initialise them to zero for new inodes. */
> if (ei->i_state & EXT3_STATE_NEW)
> @@ -3525,16 +3542,20 @@ static int ext3_do_update_inode(handle_t *handle,
> /* If this is the first large file
> * created, add a flag to the superblock.
> */
> + unlock_buffer(bh);
> err = ext3_journal_get_write_access(handle,
> EXT3_SB(sb)->s_sbh);
> if (err)
> goto out_brelse;
> +
> ext3_update_dynamic_rev(sb);
> EXT3_SET_RO_COMPAT_FEATURE(sb,
> EXT3_FEATURE_RO_COMPAT_LARGE_FILE);
> handle->h_sync = 1;
> err = ext3_journal_dirty_metadata(handle,
> EXT3_SB(sb)->s_sbh);
> + /* get our lock and start over */
> + goto again;
> }
> }
> }
> @@ -3557,6 +3578,7 @@ static int ext3_do_update_inode(handle_t *handle,
> raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
>
> BUFFER_TRACE(bh, "call ext3_journal_dirty_metadata");
> + unlock_buffer(bh);
> rc = ext3_journal_dirty_metadata(handle, bh);
> if (!err)
> err = rc;
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-09-07 22:30:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC] Add locking to ext3_do_update_inode

Hi,

On Fri 04-09-09 16:06:13, Chris Mason wrote:
> I've been struggling with this off and on while I've been testing the
> data=guarded work. The symptom is corrupted orphan lists and inodes
> with the wrong i_size stored on disk. I was convinced the
> data=guarded code was just missing a call to ext3_mark_inode_dirty, but
> tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
> wasn't actually making it to the drive.
>
> ext3_mark_inode_dirty can be called without locks held (atime updates
> and a few others), so the data=guarded code uses locks while updating
> the in-memory inode, and then calls ext3_mark_inode_dirty
> without any locks held.
>
> But, ext3_mark_inode_dirty has no internal locking to make sure that
> only one CPU is updating the buffer head at a time. Generally this
> works out ok because everyone that changes the inode then calls
> ext3_mark_inode_dirty themselves. Even though it races, eventually
> someone updates the buffer heads and things move on.
>
> But there is still a risk of the wrong values getting in, and the
> data=guarded code seems to hit the race very often.
>
> Since everyone that changes the inode also logs it, it should be
> possible to fix this with some memory barriers. I'll leave that as an
> exercise to the reader and lock the buffer head instead.
One more thing - Ted, I believe ext4 needs a similar patch.

> It it probably a good idea to have a different patch series for lockless
> bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears
> EXT3_STATE_NEW without any locks held.
Yeah, the locking around handling of i_state and i_flags is kind of
unclean...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR