2009-01-17 18:43:57

by Theodore Ts'o

[permalink] [raw]
Subject: Ext4 tree backports for 2.6.27.10 and 2.6.28


I've created a couple of ext4 backport branches which have been uploaded
to the ext4 git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git

The master branch contains the latest ext4 patch queue, against Linus's
tip; currently, this is versus 2.6.29-rc2. The 'for_linus' tag is
located on this branch, and currently contains a number of urgent fixes
that I plan to be pushing to Linus after he gets back from
linux.conf.au. They are mostly fixes that prevent OOPS or cpu lockups
when ext4 mounts an intentionally corrupted filesystem.

The ext4-stable branch is based off of 2.6.28, and contains all of the
patches which were pushed to linus during the 2.6.29-rc1 merge window,
plus the fixes listed above in the 'master' branch. It is designed for
people who want the very latest in the ext4 tree versus a stable kernel.

The for-stable branch is currently based off of 2.6.28, and contains a
candidate set of patches to be included in the 2.6.28.y stable tree.
Ext4 developers --- I would appreciate it if you could review the
patches on the ext4-stable tree, and see if I missed any patches which
in your opinion should be pushed to the 2.6.28.y tree. Furthermore, if
some folks could test the for-stable branch and let me know whether or
not you found it stable, I would appreciate it. Some of the patches
were relatively painful to backport, given the desire to remove
"non-critical" patches, so I'm not 100% certain I got the backports
completely right. Please test!

The for-stable-2.6.27 is currently based off of 2.6.27.11, and it
contains a candidate set of patches to be incuded in the 2.6.27.y tree.
It was even more difficult to backport these patches to 2.6.27.y, and so
I would **really** appreciate if some folks could review and test this
branch. In addition, a number of changes (in particular some of
Aneesh's resize race condition patches) were painful enough that I
decided to abort and not try to do the backport. It was late, and I was
getting tired.... If someone would like to try to backport some of
these missing patches, I would appreciate it; Aneesh, you might have
better luck since they were originally your patches, and they were
complicated enough that I was worried that there might have been
prerequisites that I had missed so they would function correctly.

The patch backports can be summarized in this table below. It contains
the original mainline commit ID, the commit ID in the 2.6.28 for-stable
branch, and the commit ID in the for-stable-2.6.27 branch. If you see
"-----" in the column for the 2.6.27-stable column, those were patches
which I did not backport due to lack of valor/courage at 11pm at night.

- Ted

mainline 2.6.28 2.6.27
commit-description
-------------------------------------------------------------------------
f99b2589 485f02f 11599d0
ext4: Add support for non-native signed/unsigned htree hash algorithms

2a21e37e 7426272 8443aef
ext4: tone down ext4_da_writepages warnings

791b7f08 2efd58c c7eef47
ext4: Fix the delalloc writepages to allocate blocks at the right offset.

565a9617 ce99b0d b2a193d
ext4: avoid ext4_error when mounting a fs with a single bg

ff7ef329 3a04ef3 626e5b9
ext4: Widen type of ext4_sb_info.s_mb_maxs[]

fd98496f f97e641 dc270b3
jbd2: Add barrier not supported test to journal_wait_on_commit_record

032115fc b9475aa c1944c2
ext4: Don't overwrite allocation_context ac_status

e21675d4 c31a2b2 -----
ext4: Add blocks added during resize to bitmap

920313a7 2a4f6ca -----
ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize

c3a326a6 66364e6 24a5c92
ext4: cleanup mballoc header files

7a2fcbf7 39a0b8b -----
ext4: don't use blocks freed but not yet committed in buddy cache init

e8134b27 83a082c c712c85
ext4: Fix race between read_block_bitmap() and mark_diskspace_used()

39341867 8e53df4 4d3302c
ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()

e97fcd95 20d6100 7e081e8
jbd2: Add BH_JBDPrivateStart

2ccb5fb9 92f1c0e -----
ext4: Use new buffer_head flag to check uninit group bitmaps initialization

648f5879 51eef9f 469a48a
ext4: mark the blocks/inode bitmap beyond end of group as used

8556e8f3 5a2c7ad 686beef
ext4: Don't allow new groups to be added during block allocation

29eaf024 39d994e 0c56383
ext4: Init the complete page while building buddy cache

0087d9fb 808dfdb -----
ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc

4ec11028 cf7da20 -----
ext4: Add sanity checks for the superblock before mounting the filesystem


2009-01-17 22:16:57

by Malte Schröder

[permalink] [raw]
Subject: ext4-stable build failure (Re: Ext4 tree backports for 2.6.27.10 and 2.6.28)

Hello,

On Sat, 17 Jan 2009 13:43:55 -0500
"Theodore Ts'o" <[email protected]> wrote:

>
> I've created a couple of ext4 backport branches which have been uploaded
> to the ext4 git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git
>

kernel 2.6.28 with ext4-stable fails to build for SMP on AMD64.

Building modules, stage 2.
MODPOST 905 modules
ERROR: "inode_lock" [fs/ext4/ext4.ko] undefined!

I had to revert 5cb8613e44b6e6c77dcf231d691b96c91a5adeab (ext4: Add a
delayed allocation debugging ioctl), after that it works.

Regards
Malte Schröder


Attachments:
signature.asc (197.00 B)

2009-01-17 23:03:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4-stable build failure (Re: Ext4 tree backports for 2.6.27.10 and 2.6.28)

On Sat, Jan 17, 2009 at 11:16:51PM +0100, Malte Schr?der wrote:
>
> kernel 2.6.28 with ext4-stable fails to build for SMP on AMD64.
>
> Building modules, stage 2.
> MODPOST 905 modules
> ERROR: "inode_lock" [fs/ext4/ext4.ko] undefined!
>
> I had to revert 5cb8613e44b6e6c77dcf231d691b96c91a5adeab (ext4: Add a
> delayed allocation debugging ioctl), after that it works.

I see the problem; inode_lock isn't currently an exported symbol, and
I generally don't compile ext4 as a module, so I didn't notice the
problem. The patch in question is a debugging patch that is never
meant for mainline or a for-stable series, so the simplest thing to do
for now is just to revert the patch. I'll look into a module-friendly
way of accomplishing the same goal.

Regards,

- Ted

2009-01-19 11:33:00

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Ext4 tree backports for 2.6.27.10 and 2.6.28

On Sat, Jan 17, 2009 at 01:43:55PM -0500, Theodore Ts'o wrote:
>
> I've created a couple of ext4 backport branches which have been uploaded
> to the ext4 git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git
>
> The master branch contains the latest ext4 patch queue, against Linus's
> tip; currently, this is versus 2.6.29-rc2. The 'for_linus' tag is
> located on this branch, and currently contains a number of urgent fixes
> that I plan to be pushing to Linus after he gets back from
> linux.conf.au. They are mostly fixes that prevent OOPS or cpu lockups
> when ext4 mounts an intentionally corrupted filesystem.
>
> The ext4-stable branch is based off of 2.6.28, and contains all of the
> patches which were pushed to linus during the 2.6.29-rc1 merge window,
> plus the fixes listed above in the 'master' branch. It is designed for
> people who want the very latest in the ext4 tree versus a stable kernel.
>
> The for-stable branch is currently based off of 2.6.28, and contains a
> candidate set of patches to be included in the 2.6.28.y stable tree.
> Ext4 developers --- I would appreciate it if you could review the
> patches on the ext4-stable tree, and see if I missed any patches which
> in your opinion should be pushed to the 2.6.28.y tree. Furthermore, if
> some folks could test the for-stable branch and let me know whether or
> not you found it stable, I would appreciate it. Some of the patches
> were relatively painful to backport, given the desire to remove
> "non-critical" patches, so I'm not 100% certain I got the backports
> completely right. Please test!
>
> The for-stable-2.6.27 is currently based off of 2.6.27.11, and it
> contains a candidate set of patches to be incuded in the 2.6.27.y tree.
> It was even more difficult to backport these patches to 2.6.27.y, and so
> I would **really** appreciate if some folks could review and test this
> branch. In addition, a number of changes (in particular some of
> Aneesh's resize race condition patches) were painful enough that I
> decided to abort and not try to do the backport. It was late, and I was
> getting tired.... If someone would like to try to backport some of
> these missing patches, I would appreciate it; Aneesh, you might have
> better luck since they were originally your patches, and they were
> complicated enough that I was worried that there might have been
> prerequisites that I had missed so they would function correctly.
>
> The patch backports can be summarized in this table below. It contains
> the original mainline commit ID, the commit ID in the 2.6.28 for-stable
> branch, and the commit ID in the for-stable-2.6.27 branch. If you see
> "-----" in the column for the 2.6.27-stable column, those were patches
> which I did not backport due to lack of valor/courage at 11pm at night.
>
> - Ted
>
> mainline 2.6.28 2.6.27
> commit-description
> -------------------------------------------------------------------------
> f99b2589 485f02f 11599d0
> ext4: Add support for non-native signed/unsigned htree hash algorithms
>
> 2a21e37e 7426272 8443aef
> ext4: tone down ext4_da_writepages warnings
>
> 791b7f08 2efd58c c7eef47
> ext4: Fix the delalloc writepages to allocate blocks at the right offset.
>
> 565a9617 ce99b0d b2a193d
> ext4: avoid ext4_error when mounting a fs with a single bg
>
> ff7ef329 3a04ef3 626e5b9
> ext4: Widen type of ext4_sb_info.s_mb_maxs[]
>
> fd98496f f97e641 dc270b3
> jbd2: Add barrier not supported test to journal_wait_on_commit_record
>
> 032115fc b9475aa c1944c2
> ext4: Don't overwrite allocation_context ac_status
>
> e21675d4 c31a2b2 -----
> ext4: Add blocks added during resize to bitmap

Will send the patch as a reply to this mail to linux-ext4

>
> 920313a7 2a4f6ca -----
> ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize

Will send the patch as a reply to this mail to linux-ext4

>
> c3a326a6 66364e6 24a5c92
> ext4: cleanup mballoc header files
>
> 7a2fcbf7 39a0b8b -----
> ext4: don't use blocks freed but not yet committed in buddy cache init

This would need the patch "use rb-tree for free blocks tracking".
Will send the patch as a reply to this mail to linux-ext4

>
> e8134b27 83a082c c712c85
> ext4: Fix race between read_block_bitmap() and mark_diskspace_used()
>
> 39341867 8e53df4 4d3302c
> ext4: Fix the race between read_inode_bitmap() and ext4_new_inode()
>
> e97fcd95 20d6100 7e081e8
> jbd2: Add BH_JBDPrivateStart
>
> 2ccb5fb9 92f1c0e -----
> ext4: Use new buffer_head flag to check uninit group bitmaps initialization

Will send the patch as a reply to this mail to linux-ext4

>
> 648f5879 51eef9f 469a48a
> ext4: mark the blocks/inode bitmap beyond end of group as used
>
> 8556e8f3 5a2c7ad 686beef
> ext4: Don't allow new groups to be added during block allocation
>
> 29eaf024 39d994e 0c56383
> ext4: Init the complete page while building buddy cache
>
> 0087d9fb 808dfdb -----
> ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc

2.6.27 doesn't have ext4: Add percpu dirty block accounting. So we
won't need this patch.

>
> 4ec11028 cf7da20 -----
> ext4: Add sanity checks for the superblock before mounting the filesystem

-aneesh

2009-01-19 11:34:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: Ext4 tree backports for 2.6.27.10 and 2.6.28


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH] ext4: Add blocks added during resize to bitmap

With this change new blocks added during resize
are marked as free in the block bitmap and the
group is flagged with EXT4_GROUP_INFO_NEED_INIT_BIT
flag. This make sure when mballoc tries to allocate
blocks from the new group we would reload the
buddy information using the bitmap present in the disk.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>


---
fs/ext4/balloc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 6 ++-
fs/ext4/resize.c | 11 +----
3 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e9fa960..1b375bb 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
#include "ext4.h"
#include "ext4_jbd2.h"
#include "group.h"
+#include "mballoc.h"

/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -837,6 +838,131 @@ error_return:
}

/**
+ * ext4_add_groupblocks() -- Add given blocks to an existing group
+ * @handle: handle to this transaction
+ * @sb: super block
+ * @block: start physcial block to add to the block group
+ * @count: number of blocks to free
+ *
+ * This marks the blocks as free in the bitmap. We ask the
+ * mballoc to reload the buddy after this by setting group
+ * EXT4_GROUP_INFO_NEED_INIT_BIT flag
+ */
+void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
+ ext4_fsblk_t block, unsigned long count)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ struct buffer_head *gd_bh;
+ ext4_group_t block_group;
+ ext4_grpblk_t bit;
+ unsigned long i;
+ struct ext4_group_desc * desc;
+ struct ext4_super_block * es;
+ struct ext4_sb_info *sbi;
+ int err = 0, ret;
+ ext4_grpblk_t blocks_freed;
+ struct ext4_group_info *grp;
+
+ sbi = EXT4_SB(sb);
+ es = sbi->s_es;
+ ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);
+
+ ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
+ /*
+ * Check to see if we are freeing blocks across a group
+ * boundary.
+ */
+ if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
+ goto error_return;
+ }
+ bitmap_bh = ext4_read_block_bitmap(sb, block_group);
+ if (!bitmap_bh)
+ goto error_return;
+ desc = ext4_get_group_desc (sb, block_group, &gd_bh);
+ if (!desc)
+ goto error_return;
+
+ if (in_range(ext4_block_bitmap(sb, desc), block, count) ||
+ in_range(ext4_inode_bitmap(sb, desc), block, count) ||
+ in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) ||
+ in_range(block + count - 1, ext4_inode_table(sb, desc),
+ sbi->s_itb_per_group)) {
+ ext4_error(sb, __func__,
+ "Adding blocks in system zones - "
+ "Block = %llu, count = %lu",
+ block, count);
+ goto error_return;
+ }
+
+ /*
+ * We are about to add blocks to the bitmap,
+ * so we need undo access.
+ */
+ BUFFER_TRACE(bitmap_bh, "getting undo access");
+ err = ext4_journal_get_undo_access(handle, bitmap_bh);
+ if (err)
+ goto error_return;
+
+ /*
+ * We are about to modify some metadata. Call the journal APIs
+ * to unshare ->b_data if a currently-committing transaction is
+ * using it
+ */
+ BUFFER_TRACE(gd_bh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, gd_bh);
+ if (err)
+ goto error_return;
+
+ for (i = 0, blocks_freed = 0; i < count; i++) {
+ BUFFER_TRACE(bitmap_bh, "clear bit");
+ if (!ext4_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
+ bit + i, bitmap_bh->b_data)) {
+ ext4_error(sb, __func__,
+ "bit already cleared for block %llu",
+ (ext4_fsblk_t)(block + i));
+ BUFFER_TRACE(bitmap_bh, "bit already cleared");
+ } else {
+ blocks_freed++;
+ }
+ }
+ spin_lock(sb_bgl_lock(sbi, block_group));
+ le16_add_cpu(&desc->bg_free_blocks_count, blocks_freed);
+ desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
+
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
+ spin_lock(sb_bgl_lock(sbi, flex_group));
+ sbi->s_flex_groups[flex_group].free_blocks += blocks_freed;
+ spin_unlock(sb_bgl_lock(sbi, flex_group));
+ }
+
+ /* We dirtied the bitmap block */
+ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
+ err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+
+ /* And the group descriptor block */
+ BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
+ ret = ext4_journal_dirty_metadata(handle, gd_bh);
+ if (!err)
+ err = ret;
+ sb->s_dirt = 1;
+ /*
+ * request to reload the buddy with the
+ * new bitmap information
+ */
+ grp = ext4_get_group_info(sb, block_group);
+ set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
+ ext4_mb_update_group_info(grp, blocks_freed);
+
+error_return:
+ brelse(bitmap_bh);
+ ext4_std_error(sb, err);
+ return;
+}
+
+/**
* ext4_free_blocks() -- Free given blocks and update quota
* @handle: handle for this transaction
* @inode: inode
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0483ce..98de963 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -992,8 +992,10 @@ extern ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
ext4_fsblk_t block, unsigned long count, int metadata);
extern void ext4_free_blocks_sb (handle_t *handle, struct super_block *sb,
- ext4_fsblk_t block, unsigned long count,
- unsigned long *pdquot_freed_blocks);
+ ext4_fsblk_t block, unsigned long count,
+ unsigned long *pdquot_freed_blocks);
+extern void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
+ ext4_fsblk_t block, unsigned long count);
extern ext4_fsblk_t ext4_count_free_blocks (struct super_block *);
extern void ext4_check_blocks_bitmap (struct super_block *);
extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 5726c5e..d7e9761 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -974,9 +974,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
struct buffer_head * bh;
handle_t *handle;
int err;
- unsigned long freed_blocks;
ext4_group_t group;
- struct ext4_group_info *grp;

/* We don't need to worry about locking wrt other resizers just
* yet: we're going to revalidate es->s_blocks_count after
@@ -1075,7 +1073,8 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
unlock_super(sb);
ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
- ext4_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks);
+ /* We add the blocks to the bitmap and set the group need init bit */
+ ext4_add_groupblocks(handle, sb, o_blocks_count, add);
ext4_debug("freed blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
if ((err = ext4_journal_stop(handle)))
@@ -1111,12 +1110,6 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
ClearPageUptodate(page);
page_cache_release(page);
}
-
- /* Get the info on the last group */
- grp = ext4_get_group_info(sb, group);

2009-01-19 11:35:09

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH] ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize

The new groups added during resize are flagged as
need_init group. Make sure we properly initialize these
groups. When we have block size < page size and we are adding
new groups the page may still be marked uptodate even though
we haven't initialized the group. While forcing the init
of buddy cache we need to make sure other groups part of the
same page of buddy cache is not using the cache.
group_info->alloc_sem is added to ensure the same.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

---
fs/ext4/balloc.c | 21 +++--
fs/ext4/ext4.h | 7 +-
fs/ext4/mballoc.c | 259 +++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/mballoc.h | 3 +
fs/ext4/resize.c | 42 ++-------
5 files changed, 229 insertions(+), 103 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 1b375bb..4235817 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -868,6 +868,7 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);

ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
+ grp = ext4_get_group_info(sb, block_group);
/*
* Check to see if we are freeing blocks across a group
* boundary.
@@ -912,7 +913,11 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
err = ext4_journal_get_write_access(handle, gd_bh);
if (err)
goto error_return;
-
+ /*
+ * make sure we don't allow a parallel init on other groups in the
+ * same buddy cache
+ */
+ down_write(&grp->alloc_sem);
for (i = 0, blocks_freed = 0; i < count; i++) {
BUFFER_TRACE(bitmap_bh, "clear bit");
if (!ext4_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
@@ -937,6 +942,13 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
sbi->s_flex_groups[flex_group].free_blocks += blocks_freed;
spin_unlock(sb_bgl_lock(sbi, flex_group));
}
+ /*
+ * request to reload the buddy with the
+ * new bitmap information
+ */
+ set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
+ ext4_mb_update_group_info(grp, blocks_freed);
+ up_write(&grp->alloc_sem);

/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -948,13 +960,6 @@ void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
if (!err)
err = ret;
sb->s_dirt = 1;
- /*
- * request to reload the buddy with the
- * new bitmap information
- */
- grp = ext4_get_group_info(sb, block_group);
- set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
- ext4_mb_update_group_info(grp, blocks_freed);

error_return:
brelse(bitmap_bh);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 98de963..e605295 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1043,12 +1043,13 @@ extern int __init init_ext4_mballoc(void);
extern void exit_ext4_mballoc(void);
extern void ext4_mb_free_blocks(handle_t *, struct inode *,
unsigned long, unsigned long, int, unsigned long *);
-extern int ext4_mb_add_more_groupinfo(struct super_block *sb,
+extern int ext4_mb_add_groupinfo(struct super_block *sb,
ext4_group_t i, struct ext4_group_desc *desc);
extern void ext4_mb_update_group_info(struct ext4_group_info *grp,
ext4_grpblk_t add);
-
-
+extern int ext4_mb_get_buddy_cache_lock(struct super_block *, ext4_group_t);
+extern void ext4_mb_put_buddy_cache_lock(struct super_block *,
+ ext4_group_t, int);
/* inode.c */
int ext4_forget(handle_t *handle, int is_metadata, struct inode *inode,
struct buffer_head *bh, ext4_fsblk_t blocknr);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 9cf497c..dd7a092 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -896,18 +896,20 @@ static noinline_for_stack int
ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
struct ext4_buddy *e4b)
{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct inode *inode = sbi->s_buddy_cache;
int blocks_per_page;
int block;
int pnum;
int poff;
struct page *page;
int ret;
+ struct ext4_group_info *grp;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *inode = sbi->s_buddy_cache;

mb_debug("load group %lu\n", group);

blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ grp = ext4_get_group_info(sb, group);

e4b->bd_blkbits = sb->s_blocksize_bits;
e4b->bd_info = ext4_get_group_info(sb, group);
@@ -915,6 +917,15 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
e4b->bd_group = group;
e4b->bd_buddy_page = NULL;
e4b->bd_bitmap_page = NULL;
+ e4b->alloc_semp = &grp->alloc_sem;
+
+ /* Take the read lock on the group alloc
+ * sem. This would make sure a parallel
+ * ext4_mb_init_group happening on other
+ * groups mapped by the page is blocked
+ * till we are done with allocation
+ */
+ down_read(e4b->alloc_semp);

/*
* the buddy cache inode stores the block bitmap
@@ -930,6 +941,14 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
page = find_get_page(inode->i_mapping, pnum);
if (page == NULL || !PageUptodate(page)) {
if (page)
+ /*
+ * drop the page reference and try
+ * to get the page with lock. If we
+ * are not uptodate that implies
+ * somebody just created the page but
+ * is yet to initialize the same. So
+ * wait for it to initialize.
+ */
page_cache_release(page);
page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
if (page) {
@@ -995,6 +1014,9 @@ err:
page_cache_release(e4b->bd_buddy_page);
e4b->bd_buddy = NULL;
e4b->bd_bitmap = NULL;
+
+ /* Done with the buddy cache */
+ up_read(e4b->alloc_semp);
return ret;
}

@@ -1004,6 +1026,8 @@ static void ext4_mb_release_desc(struct ext4_buddy *e4b)
page_cache_release(e4b->bd_bitmap_page);
if (e4b->bd_buddy_page)
page_cache_release(e4b->bd_buddy_page);
+ /* Done with the buddy cache */
+ up_read(e4b->alloc_semp);
}


@@ -1714,6 +1738,173 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
return 0;
}

+/*
+ * lock the group_info alloc_sem of all the groups
+ * belonging to the same buddy cache page. This
+ * make sure other parallel operation on the buddy
+ * cache doesn't happen whild holding the buddy cache
+ * lock
+ */
+int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
+{
+ int i;
+ int block, pnum;
+ int blocks_per_page;
+ int groups_per_page;
+ ext4_group_t first_group;
+ struct ext4_group_info *grp;
+
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ /*
+ * the buddy cache inode stores the block bitmap
+ * and buddy information in consecutive blocks.
+ * So for each group we need two blocks.
+ */
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ first_group = pnum * blocks_per_page / 2;
+
+ groups_per_page = blocks_per_page >> 1;
+ if (groups_per_page == 0)
+ groups_per_page = 1;
+ /* read all groups the page covers into the cache */
+ for (i = 0; i < groups_per_page; i++) {
+
+ if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
+ break;
+ grp = ext4_get_group_info(sb, first_group + i);
+ /* take all groups write allocation
+ * semaphore. This make sure there is
+ * no block allocation going on in any
+ * of that groups
+ */
+ down_write(&grp->alloc_sem);
+ }
+ return i;
+}
+
+void ext4_mb_put_buddy_cache_lock(struct super_block *sb,
+ ext4_group_t group, int locked_group)
+{
+ int i;
+ int block, pnum;
+ int blocks_per_page;
+ ext4_group_t first_group;
+ struct ext4_group_info *grp;
+
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ /*
+ * the buddy cache inode stores the block bitmap
+ * and buddy information in consecutive blocks.
+ * So for each group we need two blocks.
+ */
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ first_group = pnum * blocks_per_page / 2;
+ /* release locks on all the groups */
+ for (i = 0; i < locked_group; i++) {
+
+ grp = ext4_get_group_info(sb, first_group + i);
+ /* take all groups write allocation
+ * semaphore. This make sure there is
+ * no block allocation going on in any
+ * of that groups
+ */
+ up_write(&grp->alloc_sem);
+ }
+
+}
+
+static int ext4_mb_init_group(struct super_block *sb, ext4_group_t group)
+{
+
+ int ret;
+ void *bitmap;
+ int blocks_per_page;
+ int block, pnum, poff;
+ int num_grp_locked = 0;
+ struct ext4_group_info *this_grp;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct inode *inode = sbi->s_buddy_cache;
+ struct page *page = NULL, *bitmap_page = NULL;
+
+ mb_debug("init group %lu\n", group);
+ blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
+ this_grp = ext4_get_group_info(sb, group);
+ /*
+ * This ensures we don't add group
+ * to this buddy cache via resize
+ */
+ num_grp_locked = ext4_mb_get_buddy_cache_lock(sb, group);
+ if (!EXT4_MB_GRP_NEED_INIT(this_grp)) {
+ /*
+ * somebody initialized the group
+ * return without doing anything
+ */
+ ret = 0;
+ goto err;
+ }
+ /*
+ * the buddy cache inode stores the block bitmap
+ * and buddy information in consecutive blocks.
+ * So for each group we need two blocks.
+ */
+ block = group * 2;
+ pnum = block / blocks_per_page;
+ poff = block % blocks_per_page;
+ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ if (page) {
+ BUG_ON(page->mapping != inode->i_mapping);
+ ret = ext4_mb_init_cache(page, NULL);
+ if (ret) {
+ unlock_page(page);
+ goto err;
+ }
+ unlock_page(page);
+ }
+ if (page == NULL || !PageUptodate(page)) {
+ ret = -EIO;
+ goto err;
+ }
+ mark_page_accessed(page);
+ bitmap_page = page;
+ bitmap = page_address(page) + (poff * sb->s_blocksize);
+
+ /* init buddy cache */
+ block++;
+ pnum = block / blocks_per_page;
+ poff = block % blocks_per_page;
+ page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS);
+ if (page == bitmap_page) {
+ /*
+ * If both the bitmap and buddy are in
+ * the same page we don't need to force
+ * init the buddy
+ */
+ unlock_page(page);
+ } else if (page) {
+ BUG_ON(page->mapping != inode->i_mapping);
+ ret = ext4_mb_init_cache(page, bitmap);
+ if (ret) {
+ unlock_page(page);
+ goto err;
+ }
+ unlock_page(page);
+ }
+ if (page == NULL || !PageUptodate(page)) {
+ ret = -EIO;
+ goto err;
+ }
+ mark_page_accessed(page);
+err:
+ ext4_mb_put_buddy_cache_lock(sb, group, num_grp_locked);
+ if (bitmap_page)
+ page_cache_release(bitmap_page);
+ if (page)
+ page_cache_release(page);
+ return ret;
+}
+
static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
@@ -1797,7 +1988,7 @@ repeat:
group = 0;

/* quick check to skip empty groups */
- grp = ext4_get_group_info(ac->ac_sb, group);
+ grp = ext4_get_group_info(sb, group);
if (grp->bb_free == 0)
continue;

@@ -1810,10 +2001,9 @@ repeat:
* we need full data about the group
* to make a good selection
*/
- err = ext4_mb_load_buddy(sb, group, &e4b);
+ err = ext4_mb_init_group(sb, group);
if (err)
goto out;
- ext4_mb_release_desc(&e4b);
}

/*
@@ -2321,6 +2511,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,
}

INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
+ init_rwsem(&meta_group_info[i]->alloc_sem);

#ifdef DOUBLE_CHECK
{
@@ -2347,54 +2538,6 @@ exit_meta_group_info:
} /* ext4_mb_add_groupinfo */

/*
- * Add a group to the existing groups.
- * This function is used for online resize
- */
-int ext4_mb_add_more_groupinfo(struct super_block *sb, ext4_group_t group,
- struct ext4_group_desc *desc)
-{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct inode *inode = sbi->s_buddy_cache;
- int blocks_per_page;
- int block;
- int pnum;
- struct page *page;
- int err;
-
- /* Add group based on group descriptor*/
- err = ext4_mb_add_groupinfo(sb, group, desc);
- if (err)
- return err;
-
- /*
- * Cache pages containing dynamic mb_alloc datas (buddy and bitmap
- * datas) are set not up to date so that they will be re-initilaized
- * during the next call to ext4_mb_load_buddy
- */
-
- /* Set buddy page as not up to date */
- blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
- block = group * 2;
- pnum = block / blocks_per_page;
- page = find_get_page(inode->i_mapping, pnum);
- if (page != NULL) {
- ClearPageUptodate(page);
- page_cache_release(page);
- }
-
- /* Set bitmap page as not up to date */
- block++;
- pnum = block / blocks_per_page;
- page = find_get_page(inode->i_mapping, pnum);
- if (page != NULL) {
- ClearPageUptodate(page);
- page_cache_release(page);
- }
-
- return 0;
-}
-
-/*
* Update an existing group.
* This function is used for online resize
*/
@@ -4705,11 +4848,6 @@ do_more:
err = ext4_journal_get_write_access(handle, gd_bh);
if (err)
goto error_return;
-
- err = ext4_mb_load_buddy(sb, block_group, &e4b);
- if (err)
- goto error_return;
-
#ifdef AGGRESSIVE_CHECK
{
int i;
@@ -4723,6 +4861,8 @@ do_more:
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+ if (err)
+ goto error_return;

if (ac) {
ac->ac_b_ex.fe_group = block_group;
@@ -4731,6 +4871,9 @@ do_more:
ext4_mb_store_history(ac);
}

+ err = ext4_mb_load_buddy(sb, block_group, &e4b);
+ if (err)
+ goto error_return;
if (metadata) {
/* blocks being freed are metadata. these blocks shouldn't
* be used until this transaction is committed */
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 850aa8a..9a020e6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/seq_file.h>
#include <linux/version.h>
+#include <linux/mutex.h>
#include "ext4_jbd2.h"
#include "ext4.h"
#include "group.h"
@@ -119,6 +120,7 @@ struct ext4_group_info {
#ifdef DOUBLE_CHECK
void *bb_bitmap;
#endif
+ struct rw_semaphore alloc_sem;
unsigned short bb_counters[];
};

@@ -244,6 +246,7 @@ struct ext4_buddy {
struct super_block *bd_sb;
__u16 bd_blkbits;
ext4_group_t bd_group;
+ struct rw_semaphore *alloc_semp;
};
#define EXT4_MB_BITMAP(e4b) ((e4b)->bd_bitmap)
#define EXT4_MB_BUDDY(e4b) ((e4b)->bd_buddy)
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index d7e9761..cfb8e36 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -745,6 +745,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
struct inode *inode = NULL;
handle_t *handle;
int gdb_off, gdb_num;
+ int num_grp_locked = 0;
int err, err2;

gdb_num = input->group / EXT4_DESC_PER_BLOCK(sb);
@@ -785,6 +786,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
}
}

+
if ((err = verify_group_input(sb, input)))
goto exit_put;

@@ -853,6 +855,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
* using the new disk blocks.
*/

+ num_grp_locked = ext4_mb_get_buddy_cache_lock(sb, input->group);
/* Update group descriptor block for new group */
gdp = (struct ext4_group_desc *)((char *)primary->b_data +
gdb_off * EXT4_DESC_SIZE(sb));
@@ -869,9 +872,11 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
* descriptor
*/
if (test_opt(sb, MBALLOC)) {
- err = ext4_mb_add_more_groupinfo(sb, input->group, gdp);
- if (err)
+ err = ext4_mb_add_groupinfo(sb, input->group, gdp);
+ if (err) {
+ ext4_mb_put_buddy_cache_lock(sb, input->group, num_grp_locked);
goto exit_journal;
+ }
}
/*
* Make the new blocks and inodes valid next. We do this before
@@ -913,6 +918,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)

/* Update the global fs size fields */
sbi->s_groups_count++;
+ ext4_mb_put_buddy_cache_lock(sb, input->group, num_grp_locked);

ext4_journal_dirty_metadata(handle, primary);

@@ -1080,38 +1086,6 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
if ((err = ext4_journal_stop(handle)))
goto exit_put;

- /*
- * Mark mballoc pages as not up to date so that they will be updated
- * next time they are loaded by ext4_mb_load_buddy.
- */
- if (test_opt(sb, MBALLOC)) {
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct inode *inode = sbi->s_buddy_cache;
- int blocks_per_page;
- int block;
- int pnum;
- struct page *page;
-
- /* Set buddy page as not up to date */
- blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
- block = group * 2;
- pnum = block / blocks_per_page;
- page = find_get_page(inode->i_mapping, pnum);
- if (page != NULL) {
- ClearPageUptodate(page);
- page_cache_release(page);
- }
-
- /* Set bitmap page as not up to date */
- block++;
- pnum = block / blocks_per_page;
- page = find_get_page(inode->i_mapping, pnum);
- if (page != NULL) {
- ClearPageUptodate(page);
- page_cache_release(page);
- }
- }

2009-01-19 11:35:44

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.

With this patch we track the block freed during a transaction using
red-black tree. We also make sure contiguous blocks freed are collected
in one node in the tree.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Theodore Ts'o <[email protected]>

---
fs/ext4/mballoc.c | 185 ++++++++++++++++++++++++++++++++++-------------------
fs/ext4/mballoc.h | 25 ++++---
2 files changed, 133 insertions(+), 77 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dd7a092..93e4c09 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -332,6 +332,7 @@
*/
static struct kmem_cache *ext4_pspace_cachep;
static struct kmem_cache *ext4_ac_cachep;
+static struct kmem_cache *ext4_free_ext_cachep;
static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
ext4_group_t group);
static int ext4_mb_init_per_dev_proc(struct super_block *sb);
@@ -2512,6 +2513,7 @@ int ext4_mb_add_groupinfo(struct super_block *sb, ext4_group_t group,

INIT_LIST_HEAD(&meta_group_info[i]->bb_prealloc_list);
init_rwsem(&meta_group_info[i]->alloc_sem);
+ meta_group_info[i]->bb_free_root.rb_node = NULL;;

#ifdef DOUBLE_CHECK
{
@@ -2825,13 +2827,11 @@ int ext4_mb_release(struct super_block *sb)
static noinline_for_stack void
ext4_mb_free_committed_blocks(struct super_block *sb)
{
- struct ext4_sb_info *sbi = EXT4_SB(sb);
- int err;
- int i;
- int count = 0;
- int count2 = 0;
- struct ext4_free_metadata *md;
struct ext4_buddy e4b;
+ struct ext4_group_info *db;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int err, count = 0, count2 = 0;
+ struct ext4_free_data *entry;

if (list_empty(&sbi->s_committed_transaction))
return;
@@ -2839,44 +2839,46 @@ ext4_mb_free_committed_blocks(struct super_block *sb)
/* there is committed blocks to be freed yet */
do {
/* get next array of blocks */
- md = NULL;
+ entry = NULL;
spin_lock(&sbi->s_md_lock);
if (!list_empty(&sbi->s_committed_transaction)) {
- md = list_entry(sbi->s_committed_transaction.next,
- struct ext4_free_metadata, list);
- list_del(&md->list);
+ entry = list_entry(sbi->s_committed_transaction.next,
+ struct ext4_free_data, list);
+ list_del(&entry->list);
}
spin_unlock(&sbi->s_md_lock);

- if (md == NULL)
+ if (entry == NULL)
break;

mb_debug("gonna free %u blocks in group %lu (0x%p):",
- md->num, md->group, md);
+ entry->count, entry->group, entry);

- err = ext4_mb_load_buddy(sb, md->group, &e4b);
+ err = ext4_mb_load_buddy(sb, entry->group, &e4b);
/* we expect to find existing buddy because it's pinned */
BUG_ON(err != 0);

+ db = e4b.bd_info;
/* there are blocks to put in buddy to make them really free */
- count += md->num;
+ count += entry->count;
count2++;
- ext4_lock_group(sb, md->group);
- for (i = 0; i < md->num; i++) {
- mb_debug(" %u", md->blocks[i]);
- mb_free_blocks(NULL, &e4b, md->blocks[i], 1);
+ ext4_lock_group(sb, entry->group);
+ /* Take it out of per group rb tree */
+ rb_erase(&entry->node, &(db->bb_free_root));
+ mb_free_blocks(NULL, &e4b, entry->start_blk, entry->count);
+
+ if (!db->bb_free_root.rb_node) {
+ /* No more items in the per group rb tree
+ * balance refcounts from ext4_mb_free_metadata()
+ */
+ page_cache_release(e4b.bd_buddy_page);
+ page_cache_release(e4b.bd_bitmap_page);
}
- mb_debug("\n");
- ext4_unlock_group(sb, md->group);
-
- /* balance refcounts from ext4_mb_free_metadata() */
- page_cache_release(e4b.bd_buddy_page);
- page_cache_release(e4b.bd_bitmap_page);
+ ext4_unlock_group(sb, entry->group);

- kfree(md);
+ kmem_cache_free(ext4_free_ext_cachep, entry);
ext4_mb_release_desc(&e4b);
-
- } while (md);
+ } while (1);

mb_debug("freed %u blocks in %u structures\n", count, count2);
}
@@ -3047,6 +3049,7 @@ void exit_ext4_mballoc(void)
#ifdef CONFIG_PROC_FS
remove_proc_entry("fs/ext4", NULL);
#endif
+ kmem_cache_destroy(ext4_free_ext_cachep);
}


@@ -3566,6 +3569,16 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
ac->ac_criteria = 20;
return 1;
}
+
+ ext4_free_ext_cachep =
+ kmem_cache_create("ext4_free_block_extents",
+ sizeof(struct ext4_free_data),
+ 0, SLAB_RECLAIM_ACCOUNT, NULL);
+ if (ext4_free_ext_cachep == NULL) {
+ kmem_cache_destroy(ext4_pspace_cachep);
+ kmem_cache_destroy(ext4_ac_cachep);
+ return -ENOMEM;
+ }
return 0;
}

@@ -4690,6 +4703,21 @@ static void ext4_mb_poll_new_transaction(struct super_block *sb,
ext4_mb_free_committed_blocks(sb);
}

+/*
+ * We can merge two free data extents only if the physical blocks
+ * are contiguous, AND the extents were freed by the same transaction,
+ * AND the blocks are associated with the same group.
+ */
+static int can_merge(struct ext4_free_data *entry1,
+ struct ext4_free_data *entry2)
+{
+ if ((entry1->t_tid == entry2->t_tid) &&
+ (entry1->group == entry2->group) &&
+ ((entry1->start_blk + entry1->count) == entry2->start_blk))
+ return 1;
+ return 0;
+}
+
static noinline_for_stack int
ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
ext4_group_t group, ext4_grpblk_t block, int count)
@@ -4697,57 +4725,80 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
struct ext4_group_info *db = e4b->bd_info;
struct super_block *sb = e4b->bd_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct ext4_free_metadata *md;
- int i;
+ struct ext4_free_data *entry, *new_entry;
+ struct rb_node **n = &db->bb_free_root.rb_node, *node;
+ struct rb_node *parent = NULL, *new_node;
+

BUG_ON(e4b->bd_bitmap_page == NULL);
BUG_ON(e4b->bd_buddy_page == NULL);

+ new_entry = kmem_cache_alloc(ext4_free_ext_cachep, GFP_NOFS);
+ new_entry->start_blk = block;
+ new_entry->group = group;
+ new_entry->count = count;
+ new_entry->t_tid = handle->h_transaction->t_tid;
+ new_node = &new_entry->node;
+
ext4_lock_group(sb, group);
- for (i = 0; i < count; i++) {
- md = db->bb_md_cur;
- if (md && db->bb_tid != handle->h_transaction->t_tid) {
- db->bb_md_cur = NULL;
- md = NULL;
+ if (!*n) {
+ /* first free block exent. We need to
+ protect buddy cache from being freed,
+ * otherwise we'll refresh it from
+ * on-disk bitmap and lose not-yet-available
+ * blocks */
+ page_cache_get(e4b->bd_buddy_page);
+ page_cache_get(e4b->bd_bitmap_page);
+ }
+ while (*n) {
+ parent = *n;
+ entry = rb_entry(parent, struct ext4_free_data, node);
+ if (block < entry->start_blk)
+ n = &(*n)->rb_left;
+ else if (block >= (entry->start_blk + entry->count))
+ n = &(*n)->rb_right;
+ else {
+ ext4_error(sb, __func__,
+ "Double free of blocks %d (%d %d)\n",
+ block, entry->start_blk, entry->count);
+ return 0;
}
+ }

- if (md == NULL) {
- ext4_unlock_group(sb, group);
- md = kmalloc(sizeof(*md), GFP_NOFS);
- if (md == NULL)
- return -ENOMEM;
- md->num = 0;
- md->group = group;
-
- ext4_lock_group(sb, group);
- if (db->bb_md_cur == NULL) {
- spin_lock(&sbi->s_md_lock);
- list_add(&md->list, &sbi->s_active_transaction);
- spin_unlock(&sbi->s_md_lock);
- /* protect buddy cache from being freed,
- * otherwise we'll refresh it from
- * on-disk bitmap and lose not-yet-available
- * blocks */
- page_cache_get(e4b->bd_buddy_page);
- page_cache_get(e4b->bd_bitmap_page);
- db->bb_md_cur = md;
- db->bb_tid = handle->h_transaction->t_tid;
- mb_debug("new md 0x%p for group %lu\n",
- md, md->group);
- } else {
- kfree(md);
- md = db->bb_md_cur;
- }
+ rb_link_node(new_node, parent, n);
+ rb_insert_color(new_node, &db->bb_free_root);
+
+ /* Now try to see the extent can be merged to left and right */
+ node = rb_prev(new_node);
+ if (node) {
+ entry = rb_entry(node, struct ext4_free_data, node);
+ if (can_merge(entry, new_entry)) {
+ new_entry->start_blk = entry->start_blk;
+ new_entry->count += entry->count;
+ rb_erase(node, &(db->bb_free_root));
+ spin_lock(&sbi->s_md_lock);
+ list_del(&entry->list);
+ spin_unlock(&sbi->s_md_lock);
+ kmem_cache_free(ext4_free_ext_cachep, entry);
}
+ }

- BUG_ON(md->num >= EXT4_BB_MAX_BLOCKS);
- md->blocks[md->num] = block + i;
- md->num++;
- if (md->num == EXT4_BB_MAX_BLOCKS) {
- /* no more space, put full container on a sb's list */
- db->bb_md_cur = NULL;
+ node = rb_next(new_node);
+ if (node) {
+ entry = rb_entry(node, struct ext4_free_data, node);
+ if (can_merge(new_entry, entry)) {
+ new_entry->count += entry->count;
+ rb_erase(node, &(db->bb_free_root));
+ spin_lock(&sbi->s_md_lock);
+ list_del(&entry->list);
+ spin_unlock(&sbi->s_md_lock);
+ kmem_cache_free(ext4_free_ext_cachep, entry);
}
}
+ /* Add the extent to active_transaction list */
+ spin_lock(&sbi->s_md_lock);
+ list_add(&new_entry->list, &sbi->s_active_transaction);
+ spin_unlock(&sbi->s_md_lock);
ext4_unlock_group(sb, group);
return 0;
}
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 9a020e6..0a28dd3 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -97,22 +97,27 @@
*/
#define MB_DEFAULT_GROUP_PREALLOC 512

-#ifdef EXT4_BB_MAX_BLOCKS
-#undef EXT4_BB_MAX_BLOCKS
-#endif
-#define EXT4_BB_MAX_BLOCKS 30
+struct ext4_free_data {
+ /* this links the free block information from group_info */
+ struct rb_node node;

-struct ext4_free_metadata {
- ext4_group_t group;
- unsigned short num;
- ext4_grpblk_t blocks[EXT4_BB_MAX_BLOCKS];
+ /* this links the free block information from ext4_sb_info */
struct list_head list;
+
+ /* group which free block extent belongs */
+ ext4_group_t group;
+
+ /* free block extent */
+ ext4_grpblk_t start_blk;
+ ext4_grpblk_t count;
+
+ /* transaction which freed this extent */
+ tid_t t_tid;
};

struct ext4_group_info {
unsigned long bb_state;
- unsigned long bb_tid;
- struct ext4_free_metadata *bb_md_cur;
+ struct rb_root bb_free_root;
unsigned short bb_first_free;
unsigned short bb_free;
unsigned short bb_fragments;
--
tg: (5d9c0a4..) use-rb-tree-for-free-blocks-tracking (depends on: aneesh-3-use_EXT4_GROUP_INFO_NEED_INIT_BIT-during-resize)

2009-01-19 17:14:35

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.

On 1/19/09, Aneesh Kumar K.V <[email protected]> wrote:
>
> From: Aneesh Kumar K.V <[email protected]>
> Subject: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.
>
> With this patch we track the block freed during a transaction using
> red-black tree. We also make sure contiguous blocks freed are collected
> in one node in the tree.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Theodore Ts'o <[email protected]>

Any particular reason for (re)posting this patch?

This appears to have been committed upstream via commit:
c894058d66637c7720569fbe12957f4de64d9991

Thanks,
Mike

2009-01-19 17:16:16

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.

On 1/19/09, Mike Snitzer <[email protected]> wrote:
> On 1/19/09, Aneesh Kumar K.V <[email protected]> wrote:
> >
> > From: Aneesh Kumar K.V <[email protected]>
> > Subject: [PATCH] ext4: Use an rbtree for tracking blocks freed during transaction.
> >
> > With this patch we track the block freed during a transaction using
> > red-black tree. We also make sure contiguous blocks freed are collected
> > in one node in the tree.
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > Signed-off-by: Theodore Ts'o <[email protected]>
>
>
> Any particular reason for (re)posting this patch?
>
> This appears to have been committed upstream via commit:
> c894058d66637c7720569fbe12957f4de64d9991

Ah, sorry I missed the discussion in the "Ext4 tree backports for
2.6.27.10 and 2.6.28" thread.

regards,
Mike

2009-01-20 06:03:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH]ext4: Use new buffer_head flag to check uninit group bitmaps initialization


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH]ext4: Use new buffer_head flag to check uninit group bitmaps initialization

For uninit block group, the ondisk bitmap is not initialized. That implies
we cannot depend on the uptodate flag on the bitmap buffer_head to
find bitmap validity. Use a new buffer_head flag which would be set after
we properly initialize the bitmap. This also prevent the initializing
the uninit group bitmap initialization every time we do a
ext4_read_block_bitmap.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

---
fs/ext4/balloc.c | 25 ++++++++++++++++++++++++-
fs/ext4/ext4.h | 18 ++++++++++++++++++
fs/ext4/ialloc.c | 24 +++++++++++++++++++++++-
fs/ext4/mballoc.c | 24 +++++++++++++++++++++++-
4 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 4235817..0cedf3b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -319,18 +319,41 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+
+ if (bitmap_uptodate(bh))
return bh;

+ lock_buffer(bh);
+ if (bitmap_uptodate(bh)) {
+ unlock_buffer(bh);
+ return bh;
+ }
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);
+ set_bitmap_uptodate(bh);
set_buffer_uptodate(bh);
unlock_buffer(bh);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
return bh;
}
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
+ if (buffer_uptodate(bh)) {
+ /*
+ * if not uninit if bh is uptodate,
+ * bitmap is also uptodate
+ */
+ set_bitmap_uptodate(bh);
+ unlock_buffer(bh);
+ return bh;
+ }
+ /*
+ * submit the buffer_head for read. We can
+ * safely mark the bitmap as uptodate now.
+ * We do it here so the bitmap uptodate bit
+ * get set with buffer lock held.
+ */
+ set_bitmap_uptodate(bh);
if (bh_submit_read(bh) < 0) {
put_bh(bh);
ext4_error(sb, __func__,
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e605295..1bbab6a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -19,6 +19,7 @@
#include <linux/types.h>
#include <linux/blkdev.h>
#include <linux/magic.h>
+#include <linux/jbd2.h>
#include "ext4_i.h"

/*
@@ -1250,6 +1251,23 @@ extern int ext4_get_blocks_wrap(handle_t *handle, struct inode *inode,
sector_t block, unsigned long max_blocks,
struct buffer_head *bh, int create,
int extend_disksize, int flag);
+/*
+ * Add new method to test wether block and inode bitmaps are properly
+ * initialized. With uninit_bg reading the block from disk is not enough
+ * to mark the bitmap uptodate. We need to also zero-out the bitmap
+ */
+#define BH_BITMAP_UPTODATE BH_JBDPrivateStart
+
+static inline int bitmap_uptodate(struct buffer_head *bh)
+{
+ return (buffer_uptodate(bh) &&
+ test_bit(BH_BITMAP_UPTODATE, &(bh)->b_state));
+}
+static inline void set_bitmap_uptodate(struct buffer_head *bh)
+{
+ set_bit(BH_BITMAP_UPTODATE, &(bh)->b_state);
+}
+
#endif /* __KERNEL__ */

#endif /* _EXT4_H */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index c61168d..9b9705c 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -115,18 +115,40 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (bitmap_uptodate(bh))
return bh;

+ lock_buffer(bh);
+ if (bitmap_uptodate(bh)) {
+ unlock_buffer(bh);
+ return bh;
+ }
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
ext4_init_inode_bitmap(sb, bh, block_group, desc);
+ set_bitmap_uptodate(bh);
set_buffer_uptodate(bh);
unlock_buffer(bh);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
return bh;
}
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
+ if (buffer_uptodate(bh)) {
+ /*
+ * if not uninit if bh is uptodate,
+ * bitmap is also uptodate
+ */
+ set_bitmap_uptodate(bh);
+ unlock_buffer(bh);
+ return bh;
+ }
+ /*
+ * submit the buffer_head for read. We can
+ * safely mark the bitmap as uptodate now.
+ * We do it here so the bitmap uptodate bit
+ * get set with buffer lock held.
+ */
+ set_bitmap_uptodate(bh);
if (bh_submit_read(bh) < 0) {
put_bh(bh);
ext4_error(sb, __func__,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index ea13c5e..417e254 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -796,20 +796,42 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (bh[i] == NULL)
goto out;

- if (bh_uptodate_or_lock(bh[i]))
+ if (bitmap_uptodate(bh[i]))
continue;

+ lock_buffer(bh[i]);
+ if (bitmap_uptodate(bh[i])) {
+ unlock_buffer(bh[i]);
+ continue;
+ }
spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],
first_group + i, desc);
+ set_bitmap_uptodate(bh[i]);
set_buffer_uptodate(bh[i]);
unlock_buffer(bh[i]);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
continue;
}
spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
+ if (buffer_uptodate(bh[i])) {
+ /*
+ * if not uninit if bh is uptodate,
+ * bitmap is also uptodate
+ */
+ set_bitmap_uptodate(bh[i]);
+ unlock_buffer(bh[i]);
+ continue;
+ }
get_bh(bh[i]);
+ /*
+ * submit the buffer_head for read. We can
+ * safely mark the bitmap as uptodate now.
+ * We do it here so the bitmap uptodate bit
+ * get set with buffer lock held.
+ */
+ set_bitmap_uptodate(bh[i]);
bh[i]->b_end_io = end_buffer_read_sync;
submit_bh(READ, bh[i]);
mb_debug("read bitmap for group %lu\n", first_group + i);
--
tg: (56a5653..) ext4-Use-new-buffer_head-flag-to-check-uninit-group.patch (depends on: aneesh-7-dont-use-blocks-freed-but-not-yet-committed-in-buddy-cache-init)

2009-01-20 07:55:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext4: don't use blocks freed but not yet committed in buddy cache init


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH] ext4: don't use blocks freed but not yet committed in buddy cache init

When we generate buddy cache (especially during resize) we need to
make sure we don't use the blocks freed but not yet comitted. This
makes sure we have the right value of free blocks count in the group
info and also in the bitmap. This also ensures the ordered mode
consistency

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

---
fs/ext4/mballoc.c | 82 ++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 60 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 93e4c09..ea13c5e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -335,6 +335,8 @@ static struct kmem_cache *ext4_ac_cachep;
static struct kmem_cache *ext4_free_ext_cachep;
static void ext4_mb_generate_from_pa(struct super_block *sb, void *bitmap,
ext4_group_t group);
+static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
+ ext4_group_t group);
static int ext4_mb_init_per_dev_proc(struct super_block *sb);
static int ext4_mb_destroy_per_dev_proc(struct super_block *sb);
static void ext4_mb_free_committed_blocks(struct super_block *);
@@ -859,7 +861,9 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
/*
* incore got set to the group block bitmap below
*/
+ ext4_lock_group(sb, group);
ext4_mb_generate_buddy(sb, data, incore, group);
+ ext4_unlock_group(sb, group);
incore = NULL;
} else {
/* this is block of bitmap */
@@ -873,6 +877,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)

/* mark all preallocated blks used in in-core bitmap */
ext4_mb_generate_from_pa(sb, data, group);
+ ext4_mb_generate_from_freelist(sb, data, group);
ext4_unlock_group(sb, group);

/* set incore so that the buddy information can be
@@ -3583,6 +3588,32 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
}

/*
+ * the function goes through all block freed in the group
+ * but not yet committed and marks them used in in-core bitmap.
+ * buddy must be generated from this bitmap
+ * Need to be called with ext4 group lock (ext4_lock_group)
+ */
+static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
+ ext4_group_t group)
+{
+ struct rb_node *n;
+ struct ext4_group_info *grp;
+ struct ext4_free_data *entry;
+
+ grp = ext4_get_group_info(sb, group);
+ n = rb_first(&(grp->bb_free_root));
+
+ while (n) {
+ entry = rb_entry(n, struct ext4_free_data, node);
+ mb_set_bits(sb_bgl_lock(EXT4_SB(sb), group),
+ bitmap, entry->start_blk,
+ entry->count);
+ n = rb_next(n);
+ }
+ return;
+}
+
+/*
* the function goes through all preallocation in this group and marks them
* used in in-core bitmap. buddy must be generated from this bitmap
* Need to be called with ext4 group lock (ext4_lock_group)
@@ -4720,27 +4751,22 @@ static int can_merge(struct ext4_free_data *entry1,

static noinline_for_stack int
ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
- ext4_group_t group, ext4_grpblk_t block, int count)
+ struct ext4_free_data *new_entry)
{
+ ext4_grpblk_t block;
+ struct ext4_free_data *entry;
struct ext4_group_info *db = e4b->bd_info;
struct super_block *sb = e4b->bd_sb;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- struct ext4_free_data *entry, *new_entry;
struct rb_node **n = &db->bb_free_root.rb_node, *node;
struct rb_node *parent = NULL, *new_node;

-
BUG_ON(e4b->bd_bitmap_page == NULL);
BUG_ON(e4b->bd_buddy_page == NULL);

- new_entry = kmem_cache_alloc(ext4_free_ext_cachep, GFP_NOFS);
- new_entry->start_blk = block;
- new_entry->group = group;
- new_entry->count = count;
- new_entry->t_tid = handle->h_transaction->t_tid;
new_node = &new_entry->node;
+ block = new_entry->start_blk;

- ext4_lock_group(sb, group);
if (!*n) {
/* first free block exent. We need to
protect buddy cache from being freed,
@@ -4799,7 +4825,6 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
spin_lock(&sbi->s_md_lock);
list_add(&new_entry->list, &sbi->s_active_transaction);
spin_unlock(&sbi->s_md_lock);
- ext4_unlock_group(sb, group);
return 0;
}

@@ -4906,15 +4931,6 @@ do_more:
BUG_ON(!mb_test_bit(bit + i, bitmap_bh->b_data));
}
#endif
- mb_clear_bits(sb_bgl_lock(sbi, block_group), bitmap_bh->b_data,
- bit, count);
-
- /* We dirtied the bitmap block */
- BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
- err = ext4_journal_dirty_metadata(handle, bitmap_bh);
- if (err)
- goto error_return;
-
if (ac) {
ac->ac_b_ex.fe_group = block_group;
ac->ac_b_ex.fe_start = bit;
@@ -4926,11 +4942,29 @@ do_more:
if (err)
goto error_return;
if (metadata) {
- /* blocks being freed are metadata. these blocks shouldn't
- * be used until this transaction is committed */
- ext4_mb_free_metadata(handle, &e4b, block_group, bit, count);
+ struct ext4_free_data *new_entry;
+ /*
+ * blocks being freed are metadata. these blocks shouldn't
+ * be used until this transaction is committed
+ */
+ new_entry = kmem_cache_alloc(ext4_free_ext_cachep, GFP_NOFS);
+ new_entry->start_blk = bit;
+ new_entry->group = block_group;
+ new_entry->count = count;
+ new_entry->t_tid = handle->h_transaction->t_tid;
+ ext4_lock_group(sb, block_group);
+ mb_clear_bits(sb_bgl_lock(sbi, block_group), bitmap_bh->b_data,
+ bit, count);
+ ext4_mb_free_metadata(handle, &e4b, new_entry);
+ ext4_unlock_group(sb, block_group);
} else {
ext4_lock_group(sb, block_group);
+ /* need to update group_info->bb_free and bitmap
+ * with group lock held. generate_buddy look at
+ * them with group lock_held
+ */
+ mb_clear_bits(sb_bgl_lock(sbi, block_group), bitmap_bh->b_data,
+ bit, count);
mb_free_blocks(inode, &e4b, bit, count);
ext4_mb_return_to_preallocation(inode, &e4b, block, count);
ext4_unlock_group(sb, block_group);
@@ -4953,6 +4987,10 @@ do_more:

*freed += count;

+ /* We dirtied the bitmap block */
+ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
+ err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+
/* And the group descriptor block */
BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
ret = ext4_journal_dirty_metadata(handle, gd_bh);
--
tg: (7830455..) aneesh-7-dont-use-blocks-freed-but-not-yet-committed-in-buddy-cache-init (depends on: use-rb-tree-for-free-blocks-tracking)

2009-01-20 08:29:33

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext4: Add blocks added during resize to bitmap


From: Aneesh Kumar K.V <[email protected]>
Subject: [PATCH] ext4: Add blocks added during resize to bitmap

With this change new blocks added during resize
are marked as free in the block bitmap and the
group is flagged with EXT4_GROUP_INFO_NEED_INIT_BIT
flag. This make sure when mballoc tries to allocate
blocks from the new group we would reload the
buddy information using the bitmap present in the disk.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>

---
fs/ext4/balloc.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 6 ++-
fs/ext4/resize.c | 11 +----
3 files changed, 132 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index e9fa960..1b375bb 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -20,6 +20,7 @@
#include "ext4.h"
#include "ext4_jbd2.h"
#include "group.h"
+#include "mballoc.h"

/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -837,6 +838,131 @@ error_return:
}

/**
+ * ext4_add_groupblocks() -- Add given blocks to an existing group
+ * @handle: handle to this transaction
+ * @sb: super block
+ * @block: start physcial block to add to the block group
+ * @count: number of blocks to free
+ *
+ * This marks the blocks as free in the bitmap. We ask the
+ * mballoc to reload the buddy after this by setting group
+ * EXT4_GROUP_INFO_NEED_INIT_BIT flag
+ */
+void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
+ ext4_fsblk_t block, unsigned long count)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ struct buffer_head *gd_bh;
+ ext4_group_t block_group;
+ ext4_grpblk_t bit;
+ unsigned long i;
+ struct ext4_group_desc * desc;
+ struct ext4_super_block * es;
+ struct ext4_sb_info *sbi;
+ int err = 0, ret;
+ ext4_grpblk_t blocks_freed;
+ struct ext4_group_info *grp;
+
+ sbi = EXT4_SB(sb);
+ es = sbi->s_es;
+ ext4_debug("Adding block(s) %llu-%llu\n", block, block + count - 1);
+
+ ext4_get_group_no_and_offset(sb, block, &block_group, &bit);
+ /*
+ * Check to see if we are freeing blocks across a group
+ * boundary.
+ */
+ if (bit + count > EXT4_BLOCKS_PER_GROUP(sb)) {
+ goto error_return;
+ }
+ bitmap_bh = ext4_read_block_bitmap(sb, block_group);
+ if (!bitmap_bh)
+ goto error_return;
+ desc = ext4_get_group_desc (sb, block_group, &gd_bh);
+ if (!desc)
+ goto error_return;
+
+ if (in_range(ext4_block_bitmap(sb, desc), block, count) ||
+ in_range(ext4_inode_bitmap(sb, desc), block, count) ||
+ in_range(block, ext4_inode_table(sb, desc), sbi->s_itb_per_group) ||
+ in_range(block + count - 1, ext4_inode_table(sb, desc),
+ sbi->s_itb_per_group)) {
+ ext4_error(sb, __func__,
+ "Adding blocks in system zones - "
+ "Block = %llu, count = %lu",
+ block, count);
+ goto error_return;
+ }
+
+ /*
+ * We are about to add blocks to the bitmap,
+ * so we need undo access.
+ */
+ BUFFER_TRACE(bitmap_bh, "getting undo access");
+ err = ext4_journal_get_undo_access(handle, bitmap_bh);
+ if (err)
+ goto error_return;
+
+ /*
+ * We are about to modify some metadata. Call the journal APIs
+ * to unshare ->b_data if a currently-committing transaction is
+ * using it
+ */
+ BUFFER_TRACE(gd_bh, "get_write_access");
+ err = ext4_journal_get_write_access(handle, gd_bh);
+ if (err)
+ goto error_return;
+
+ for (i = 0, blocks_freed = 0; i < count; i++) {
+ BUFFER_TRACE(bitmap_bh, "clear bit");
+ if (!ext4_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
+ bit + i, bitmap_bh->b_data)) {
+ ext4_error(sb, __func__,
+ "bit already cleared for block %llu",
+ (ext4_fsblk_t)(block + i));
+ BUFFER_TRACE(bitmap_bh, "bit already cleared");
+ } else {
+ blocks_freed++;
+ }
+ }
+ spin_lock(sb_bgl_lock(sbi, block_group));
+ le16_add_cpu(&desc->bg_free_blocks_count, blocks_freed);
+ desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
+ spin_unlock(sb_bgl_lock(sbi, block_group));
+ percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
+
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t flex_group = ext4_flex_group(sbi, block_group);
+ spin_lock(sb_bgl_lock(sbi, flex_group));
+ sbi->s_flex_groups[flex_group].free_blocks += blocks_freed;
+ spin_unlock(sb_bgl_lock(sbi, flex_group));
+ }
+
+ /* We dirtied the bitmap block */
+ BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
+ err = ext4_journal_dirty_metadata(handle, bitmap_bh);
+
+ /* And the group descriptor block */
+ BUFFER_TRACE(gd_bh, "dirtied group descriptor block");
+ ret = ext4_journal_dirty_metadata(handle, gd_bh);
+ if (!err)
+ err = ret;
+ sb->s_dirt = 1;
+ /*
+ * request to reload the buddy with the
+ * new bitmap information
+ */
+ grp = ext4_get_group_info(sb, block_group);
+ set_bit(EXT4_GROUP_INFO_NEED_INIT_BIT, &(grp->bb_state));
+ ext4_mb_update_group_info(grp, blocks_freed);
+
+error_return:
+ brelse(bitmap_bh);
+ ext4_std_error(sb, err);
+ return;
+}
+
+/**
* ext4_free_blocks() -- Free given blocks and update quota
* @handle: handle for this transaction
* @inode: inode
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b0483ce..98de963 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -992,8 +992,10 @@ extern ext4_fsblk_t ext4_has_free_blocks(struct ext4_sb_info *sbi,
extern void ext4_free_blocks (handle_t *handle, struct inode *inode,
ext4_fsblk_t block, unsigned long count, int metadata);
extern void ext4_free_blocks_sb (handle_t *handle, struct super_block *sb,
- ext4_fsblk_t block, unsigned long count,
- unsigned long *pdquot_freed_blocks);
+ ext4_fsblk_t block, unsigned long count,
+ unsigned long *pdquot_freed_blocks);
+extern void ext4_add_groupblocks(handle_t *handle, struct super_block *sb,
+ ext4_fsblk_t block, unsigned long count);
extern ext4_fsblk_t ext4_count_free_blocks (struct super_block *);
extern void ext4_check_blocks_bitmap (struct super_block *);
extern struct ext4_group_desc * ext4_get_group_desc(struct super_block * sb,
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 5726c5e..d7e9761 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -974,9 +974,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
struct buffer_head * bh;
handle_t *handle;
int err;
- unsigned long freed_blocks;
ext4_group_t group;
- struct ext4_group_info *grp;

/* We don't need to worry about locking wrt other resizers just
* yet: we're going to revalidate es->s_blocks_count after
@@ -1075,7 +1073,8 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
unlock_super(sb);
ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
- ext4_free_blocks_sb(handle, sb, o_blocks_count, add, &freed_blocks);
+ /* We add the blocks to the bitmap and set the group need init bit */
+ ext4_add_groupblocks(handle, sb, o_blocks_count, add);
ext4_debug("freed blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
if ((err = ext4_journal_stop(handle)))
@@ -1111,12 +1110,6 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
ClearPageUptodate(page);
page_cache_release(page);
}
-
- /* Get the info on the last group */
- grp = ext4_get_group_info(sb, group);

2009-01-22 19:50:12

by Greg KH

[permalink] [raw]
Subject: Re: [stable] Ext4 tree backports for 2.6.27.10 and 2.6.28

On Sat, Jan 17, 2009 at 01:43:55PM -0500, Theodore Ts'o wrote:
>
> I've created a couple of ext4 backport branches which have been uploaded
> to the ext4 git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git

Any resolution of ext4 patches to go into the next -stable releases?
I'm pushing out the next review later today, and was wondering what the
status of these were.

thanks,

greg k-h

2009-08-30 04:25:47

by Ed Tomlinson

[permalink] [raw]
Subject: Ext4 corruption that will not go away

Hi,

I am running 2.6.31-rc8+ and have ext4 corruption that will not go away.

My root fs is ext4 on sdb3. I have moved the directory with corruption into lost+found and booted to a rescuse
system (arch linux) and run fsck.ext4 on the filesystem, which then reports its clean... Booting back into my
gentoo system and attempting to remove the xx directory from lost+found gives:

[ 172.408799] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
[ 172.429410] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
[ 172.449920] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)

The above is repeatable.

How can I _really_ clean this fs? What info is needed to help the process?

TIA,
Ed Tomlinson

2009-08-30 13:43:53

by LDB

[permalink] [raw]
Subject: Re: Ext4 corruption that will not go away

Ed Tomlinson wrote:
> Hi,
>
> I am running 2.6.31-rc8+ and have ext4 corruption that will not go away.
>
> My root fs is ext4 on sdb3. I have moved the directory with corruption into lost+found and booted to a rescuse
> system (arch linux) and run fsck.ext4 on the filesystem, which then reports its clean... Booting back into my
> gentoo system and attempting to remove the xx directory from lost+found gives:
>
> [ 172.408799] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> [ 172.429410] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> [ 172.449920] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
>
> The above is repeatable.
>
> How can I _really_ clean this fs? What info is needed to help the process?
>
> TIA,
> Ed Tomlinson
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>

Have you tried "dumpe2fs /dev/sdb3" to determine if you can use a
different superblock number, since the problem is recurring?


LDB

2009-08-30 14:24:56

by Nick Dokos

[permalink] [raw]
Subject: Re: Ext4 corruption that will not go away

> Hi,
>
> I am running 2.6.31-rc8+ and have ext4 corruption that will not go away.
>
> My root fs is ext4 on sdb3. I have moved the directory with corruption into lost+found and booted to a rescuse
> system (arch linux) and run fsck.ext4 on the filesystem, which then reports its clean... Booting back into my
> gentoo system and attempting to remove the xx directory from lost+found gives:
>
> [ 172.408799] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> [ 172.429410] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> [ 172.449920] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
>
> The above is repeatable.
>
> How can I _really_ clean this fs? What info is needed to help the process?
>

The first two pieces of information needed would be the version of e2fsprogs
that you are running (e2fsck -V) and the stat of inode 706801:

debugfs -R 'stat <706801>' /dev/sdb3

Thanks,
Nick

2009-08-30 15:41:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Ext4 corruption that will not go away

On Sun, Aug 30, 2009 at 12:25:47AM -0400, Ed Tomlinson wrote:
> Hi,
>
> I am running 2.6.31-rc8+ and have ext4 corruption that will not go away.
>
> My root fs is ext4 on sdb3. I have moved the directory with corruption into lost+found and booted to a rescuse
> system (arch linux) and run fsck.ext4 on the filesystem, which then reports its clean... Booting back into my
> gentoo system and attempting to remove the xx directory from lost+found gives:
>
> [ 172.408799] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)

Can you send in the output of running fsck.ext4 on the filesystem, please?

- Ted

2009-08-30 23:19:57

by Ed Tomlinson

[permalink] [raw]
Subject: Re: Ext4 corruption that will not go away

On Sunday 30 August 2009 10:24:56 Nick Dokos wrote:
> > Hi,
> >
> > I am running 2.6.31-rc8+ and have ext4 corruption that will not go away.
> >
> > My root fs is ext4 on sdb3. I have moved the directory with corruption into lost+found and booted to a rescuse
> > system (arch linux) and run fsck.ext4 on the filesystem, which then reports its clean... Booting back into my
> > gentoo system and attempting to remove the xx directory from lost+found gives:
> >
> > [ 172.408799] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> > [ 172.429410] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> > [ 172.449920] EXT4-fs error (device sdb3): ext4_ext_check_inode: bad header/extent in inode #706801: invalid magic - magic 0, entries 0, max 0(0), depth 0(0)
> >
> > The above is repeatable.
> >
> > How can I _really_ clean this fs? What info is needed to help the process?
> >
>
> The first two pieces of information needed would be the version of e2fsprogs
> that you are running (e2fsck -V) and the stat of inode 706801:
>
> debugfs -R 'stat <706801>' /dev/sdb3
>

e2fsck 1.41.9 (22-Aug-2009)
Using EXT2FS Library version 1.41.9, 22-Aug-2009

Inode: 706801 Type: regular Mode: 0644 Flags: 0x80000
Generation: 28075061 Version: 0x00000000:00000001
User: 0 Group: 0 Size: 1442
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4a8beb6a:a8c2e6e4 -- Wed Aug 19 08:09:14 2009
atime: 0x4a8beb6a:00000000 -- Wed Aug 19 08:09:14 2009
mtime: 0x4a8b17d9:00000000 -- Tue Aug 18 17:06:33 2009
crtime: 0x4a8beb6a:9e456724 -- Wed Aug 19 08:09:14 2009
Size of extra inode fields: 28
EXTENTS:

and of 76804 which also has a problem

Inode: 706804 Type: regular Mode: 0644 Flags: 0x80000
Generation: 4140131203 Version: 0x00000000:00000001
User: 1000 Group: 100 Size: 523
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 0
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4a95c957:348db0b8 -- Wed Aug 26 19:46:31 2009
atime: 0x4a991f2b:cf3a606c -- Sat Aug 29 08:29:31 2009
mtime: 0x4a95c957:348db0b8 -- Wed Aug 26 19:46:31 2009
crtime: 0x4a95c957:348db0b8 -- Wed Aug 26 19:46:31 2009
Size of extra inode fields: 28
EXTENTS:

Hope this helps
Ed