2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH RFC 0/5] Eliminate most lock_super() calls from ext4


On Fri, Apr 24, 2009 at 08:40:47PM +0200, Christoph Hellwig wrote:
> ext3/4 internal bits? Doesn't seem to be used for any journal related
> activity but mostly as protection against resizing (the whole lock_super
> usage in ext3/4 looks odd to me, interestingly there's none at all in
> ext2. Maybe someone of the extN crowd should audit and get rid of it in
> favour of a better fs-specific lock)

Here are some patches which eliminate most of the lock_super() and
unlock_super() calls from ext4. The last remaining calls are designed
to protect against another CPU calling write_super(), which is the
original intended use.

If these patches work out, we can backport these patches to ext3.

Comments and review appreciated; thanks!!

- Ted



2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH 2/5] ext4: Remove outdated comment about lock_super()

ext4_fill_super() is no longer called by read_super(), and it is no
longer called with the superblock locked. The
unlock_super()/lock_super() pair is no longer present, so this comment
is entirely superfluous.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e20ff9c..e0b0c9f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2806,14 +2806,6 @@ no_journal:
goto failed_mount4;
};

- /*
- * akpm: core read_super() calls in here with the superblock locked.
- * That deadlocks, because orphan cleanup needs to lock the superblock
- * in numerous places. Here we just pop the lock - it's relatively
- * harmless, because we are now ready to accept write_super() requests,
- * and aviro says that's the only reason for hanging onto the
- * superblock lock.
- */
EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
ext4_orphan_cleanup(sb, es);
EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
--
1.5.6.3


2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list

Use a separate lock to protect the orphan list, so we can stop
overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4_sb.h | 1 +
fs/ext4/namei.c | 20 +++++++++++---------
fs/ext4/super.c | 1 +
3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 57b71fe..4bda2f7 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -71,6 +71,7 @@ struct ext4_sb_info {
struct inode *s_journal_inode;
struct journal_s *s_journal;
struct list_head s_orphan;
+ struct mutex s_orphan_lock;
unsigned long s_commit_interval;
u32 s_max_batch_time;
u32 s_min_batch_time;
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 22098e1..8018e49 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1997,7 +1997,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
if (!ext4_handle_valid(handle))
return 0;

- lock_super(sb);
+ mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
if (!list_empty(&EXT4_I(inode)->i_orphan))
goto out_unlock;

@@ -2006,9 +2006,13 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)

/* @@@ FIXME: Observation from aviro:
* I think I can trigger J_ASSERT in ext4_orphan_add(). We block
- * here (on lock_super()), so race with ext4_link() which might bump
+ * here (on s_orphan_lock), so race with ext4_link() which might bump
* ->i_nlink. For, say it, character device. Not a regular file,
* not a directory, not a symlink and ->i_nlink > 0.
+ *
+ * tytso, 4/25/2009: I'm not sure how that could happen;
+ * shouldn't the fs core protect us from these sort of
+ * unlink()/link() races?
*/
J_ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);
@@ -2045,7 +2049,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
jbd_debug(4, "orphan inode %lu will point to %d\n",
inode->i_ino, NEXT_ORPHAN(inode));
out_unlock:
- unlock_super(sb);
+ mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
ext4_std_error(inode->i_sb, err);
return err;
}
@@ -2066,11 +2070,9 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (!ext4_handle_valid(handle))
return 0;

- lock_super(inode->i_sb);
- if (list_empty(&ei->i_orphan)) {
- unlock_super(inode->i_sb);
- return 0;
- }
+ mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
+ if (list_empty(&ei->i_orphan))
+ goto out;

ino_next = NEXT_ORPHAN(inode);
prev = ei->i_orphan.prev;
@@ -2120,7 +2122,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
out_err:
ext4_std_error(inode->i_sb, err);
out:
- unlock_super(inode->i_sb);
+ mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
return err;

out_brelse:
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 176d43f..c23e82c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2623,6 +2623,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->dq_op = &ext4_quota_operations;
#endif
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
+ mutex_init(&sbi->s_orphan_lock);

sb->s_root = NULL;

--
1.5.6.3


2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering

Ext4's on-line resizing adds a new block group and then, only at the
last step, adjusts s_groups_count. However, it's possible on SMP
systems that another CPU could see the updated s_groups_count and
not see the newly initialized data structures for the just-added block
group. For this reason, it's important to insert an SMP read barrier
after reading s_groups_count and before reading any dependent data,
such as the block group descriptors made visible by the new group
count.

Unfortunately, we rather blatantly violate this locking protocol as
documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will try to use a just-added block group before the memory
ordering issues resolve themselves. So problems here are evidently
hard to hit in practice: ext3 is vulnerable to the same race, and no
one has complained.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/balloc.c | 6 ++++--
fs/ext4/ialloc.c | 34 +++++++++++++++++++++-------------
fs/ext4/mballoc.c | 49 ++++++++++++++++++++++++++++++++-----------------
3 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 53c72ad..d1615f2 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -88,9 +88,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group, struct ext4_group_desc *gdp)
{
int bit, bit_max;
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
unsigned free_blocks, group_blocks;
struct ext4_sb_info *sbi = EXT4_SB(sb);

+ smp_rmb(); /* after reading s_groups_count first */
if (bh) {
J_ASSERT_BH(bh, buffer_locked(bh));

@@ -123,7 +125,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
bit_max += ext4_bg_num_gdb(sb, block_group);
}

- if (block_group == sbi->s_groups_count - 1) {
+ if (block_group == ngroups - 1) {
/*
* Even though mke2fs always initialize first and last group
* if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
@@ -131,7 +133,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
*/
group_blocks = ext4_blocks_count(sbi->s_es) -
le32_to_cpu(sbi->s_es->s_first_data_block) -
- (EXT4_BLOCKS_PER_GROUP(sb) * (sbi->s_groups_count - 1));
+ (EXT4_BLOCKS_PER_GROUP(sb) * (ngroups - 1));
} else {
group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
}
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f18e0a0..52ce274 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -322,6 +322,7 @@ static int find_group_dir(struct super_block *sb, struct inode *parent,
ext4_group_t group;
int ret = -1;

+ smp_rmb(); /* after reading s_groups_count first */
freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
avefreei = freei / ngroups;

@@ -362,7 +363,8 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
ext4_group_t n_fbg_groups;
ext4_group_t i;

- n_fbg_groups = (sbi->s_groups_count + flex_size - 1) >>
+ smp_rmb(); /* after reading s_groups_count first */
+ n_fbg_groups = (ngroups + flex_size - 1) >>
sbi->s_log_groups_per_flex;

find_close_to_parent:
@@ -478,18 +480,18 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- ext4_group_t ngroups = sbi->s_groups_count;
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
unsigned int freei, avefreei;
ext4_fsblk_t freeb, avefreeb;
unsigned int ndirs;
int max_dirs, min_inodes;
ext4_grpblk_t min_blocks;
- ext4_group_t i, grp, g;
+ ext4_group_t i, grp, g, ngroups = sbi->s_groups_count;
struct ext4_group_desc *desc;
struct orlov_stats stats;
int flex_size = ext4_flex_bg_size(sbi);

+ smp_rmb(); /* after reading s_groups_count first */
if (flex_size > 1) {
ngroups = (ngroups + flex_size - 1) >>
sbi->s_log_groups_per_flex;
@@ -585,6 +587,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
fallback:
ngroups = sbi->s_groups_count;
avefreei = freei / ngroups;
+ smp_rmb();
fallback_retry:
parent_group = EXT4_I(parent)->i_block_group;
for (i = 0; i < ngroups; i++) {
@@ -613,11 +616,11 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
ext4_group_t *group, int mode)
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
- ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t i, last, ngroups = EXT4_SB(sb)->s_groups_count;
struct ext4_group_desc *desc;
- ext4_group_t i, last;
int flex_size = ext4_flex_bg_size(EXT4_SB(sb));

+ smp_rmb(); /* after reading s_groups_count first */
/*
* Try to place the inode is the same flex group as its
* parent. If we can't find space, use the Orlov algorithm to
@@ -799,7 +802,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
struct super_block *sb;
struct buffer_head *inode_bitmap_bh = NULL;
struct buffer_head *group_desc_bh;
- ext4_group_t group = 0;
+ ext4_group_t ngroups, group = 0;
unsigned long ino = 0;
struct inode *inode;
struct ext4_group_desc *gdp = NULL;
@@ -851,12 +854,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
ret2 = find_group_other(sb, dir, &group, mode);

got_group:
+ ngroups = sbi->s_groups_count;
+ smp_rmb();
EXT4_I(dir)->i_last_alloc_group = group;
err = -ENOSPC;
if (ret2 == -1)
goto out;

- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
err = -EIO;

gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
@@ -917,7 +922,7 @@ repeat_in_this_group:
* group descriptor metadata has not yet been updated.
* So we just go onto the next blockgroup.
*/
- if (++group == sbi->s_groups_count)
+ if (++group == ngroups)
group = 0;
}
err = -ENOSPC;
@@ -1158,17 +1163,18 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
{
unsigned long desc_count;
struct ext4_group_desc *gdp;
- ext4_group_t i;
+ ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
#ifdef EXT4FS_DEBUG
struct ext4_super_block *es;
unsigned long bitmap_count, x;
struct buffer_head *bitmap_bh = NULL;

+ smp_rmb(); /* after reading s_groups_count first */
es = EXT4_SB(sb)->s_es;
desc_count = 0;
bitmap_count = 0;
gdp = NULL;
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
@@ -1189,8 +1195,9 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
return desc_count;
#else
+ smp_rmb(); /* after reading s_groups_count first */
desc_count = 0;
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
@@ -1205,9 +1212,10 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
unsigned long ext4_count_dirs(struct super_block * sb)
{
unsigned long count = 0;
- ext4_group_t i;
+ ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;

- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ smp_rmb(); /* after reading s_groups_count first */
+ for (i = 0; i < ngroups; i++) {
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f871677..ecc2d49 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -739,6 +739,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,

static int ext4_mb_init_cache(struct page *page, char *incore)
{
+ ext4_group_t ngroups;
int blocksize;
int blocks_per_page;
int groups_per_page;
@@ -757,6 +758,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)

inode = page->mapping->host;
sb = inode->i_sb;
+ ngroups = EXT4_SB(sb)->s_groups_count;
+ smp_rmb();
blocksize = 1 << inode->i_blkbits;
blocks_per_page = PAGE_CACHE_SIZE / blocksize;

@@ -780,7 +783,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
for (i = 0; i < groups_per_page; i++) {
struct ext4_group_desc *desc;

- if (first_group + i >= EXT4_SB(sb)->s_groups_count)
+ if (first_group + i >= ngroups)
break;

err = -EIO;
@@ -852,7 +855,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
struct ext4_group_info *grinfo;

group = (first_block + i) >> 1;
- if (group >= EXT4_SB(sb)->s_groups_count)
+ if (group >= ngroups)
break;

/*
@@ -1788,9 +1791,11 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
int block, pnum;
int blocks_per_page;
int groups_per_page;
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
ext4_group_t first_group;
struct ext4_group_info *grp;

+ smp_rmb(); /* after reading s_groups_count first */
blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
/*
* the buddy cache inode stores the block bitmap
@@ -1807,7 +1812,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
/* read all groups the page covers into the cache */
for (i = 0; i < groups_per_page; i++) {

- if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
+ if ((first_group + i) >= ngroups)
break;
grp = ext4_get_group_info(sb, first_group + i);
/* take all groups write allocation
@@ -1945,8 +1950,7 @@ err:
static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
- ext4_group_t group;
- ext4_group_t i;
+ ext4_group_t ngroups, group, i;
int cr;
int err = 0;
int bsbits;
@@ -1957,6 +1961,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)

sb = ac->ac_sb;
sbi = EXT4_SB(sb);
+ ngroups = EXT4_SB(sb)->s_groups_count;
+ smp_rmb();
BUG_ON(ac->ac_status == AC_STATUS_FOUND);

/* first, try the goal */
@@ -2017,11 +2023,11 @@ repeat:
*/
group = ac->ac_g_ex.fe_group;

- for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
+ for (i = 0; i < ngroups; group++, i++) {
struct ext4_group_info *grp;
struct ext4_group_desc *desc;

- if (group == EXT4_SB(sb)->s_groups_count)
+ if (group == ngroups)
group = 0;

/* quick check to skip empty groups */
@@ -2320,7 +2326,7 @@ static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)

if (*pos < 0 || *pos >= sbi->s_groups_count)
return NULL;
-
+ smp_rmb();
group = *pos + 1;
return (void *) ((unsigned long) group);
}
@@ -2334,6 +2340,7 @@ static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
++*pos;
if (*pos < 0 || *pos >= sbi->s_groups_count)
return NULL;
+ smp_rmb();
group = *pos + 1;
return (void *) ((unsigned long) group);
}
@@ -2587,6 +2594,7 @@ void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)

static int ext4_mb_init_backend(struct super_block *sb)
{
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
ext4_group_t i;
int metalen;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -2597,8 +2605,10 @@ static int ext4_mb_init_backend(struct super_block *sb)
struct ext4_group_info **meta_group_info;
struct ext4_group_desc *desc;

+ smp_rmb(); /* after reading s_groups_count first */
+
/* This is the number of blocks used by GDT */
- num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
+ num_meta_group_infos = (ngroups + EXT4_DESC_PER_BLOCK(sb) -
1) >> EXT4_DESC_PER_BLOCK_BITS(sb);

/*
@@ -2644,7 +2654,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
for (i = 0; i < num_meta_group_infos; i++) {
if ((i + 1) == num_meta_group_infos)
metalen = sizeof(*meta_group_info) *
- (sbi->s_groups_count -
+ (ngroups -
(i << EXT4_DESC_PER_BLOCK_BITS(sb)));
meta_group_info = kmalloc(metalen, GFP_KERNEL);
if (meta_group_info == NULL) {
@@ -2655,7 +2665,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
sbi->s_group_info[i] = meta_group_info;
}

- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
desc = ext4_get_group_desc(sb, i, NULL);
if (desc == NULL) {
printk(KERN_ERR
@@ -2781,13 +2791,15 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)

int ext4_mb_release(struct super_block *sb)
{
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
ext4_group_t i;
int num_meta_group_infos;
struct ext4_group_info *grinfo;
struct ext4_sb_info *sbi = EXT4_SB(sb);

+ smp_rmb(); /* after reading s_groups_count first */
if (sbi->s_group_info) {
- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
grinfo = ext4_get_group_info(sb, i);
#ifdef DOUBLE_CHECK
kfree(grinfo->bb_bitmap);
@@ -2797,7 +2809,7 @@ int ext4_mb_release(struct super_block *sb)
ext4_unlock_group(sb, i);
kfree(grinfo);
}
- num_meta_group_infos = (sbi->s_groups_count +
+ num_meta_group_infos = (ngroups +
EXT4_DESC_PER_BLOCK(sb) - 1) >>
EXT4_DESC_PER_BLOCK_BITS(sb);
for (i = 0; i < num_meta_group_infos; i++)
@@ -4121,7 +4133,7 @@ static void ext4_mb_return_to_preallocation(struct inode *inode,
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
struct super_block *sb = ac->ac_sb;
- ext4_group_t i;
+ ext4_group_t ngroups, i;

printk(KERN_ERR "EXT4-fs: Can't allocate:"
" Allocation context details:\n");
@@ -4145,7 +4157,9 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
ac->ac_found);
printk(KERN_ERR "EXT4-fs: groups: \n");
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ ngroups = EXT4_SB(ac->ac_sb)->s_groups_count;
+ smp_rmb();
+ for (i = 0; i < ngroups; i++) {
struct ext4_group_info *grp = ext4_get_group_info(sb, i);
struct ext4_prealloc_space *pa;
ext4_grpblk_t start;
@@ -4469,13 +4483,14 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)

static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
- ext4_group_t i;
+ ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
int ret;
int freed = 0;

+ smp_rmb(); /* after reading s_groups_count first */
trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
sb->s_id, needed);
- for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
+ for (i = 0; i < ngroups && needed > 0; i++) {
ret = ext4_mb_discard_group_preallocations(sb, i, needed);
freed += ret;
needed -= ret;
--
1.5.6.3


2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super

The function ext4_mark_recovery_complete() is called from two call
paths: either (a) while mounting the filesystem, in which case there's
no danger of any other CPU calling write_super() until the mount is
completed, and (b) while remounting the filesystem read-write, in
which case the fs core has already locked the superblock, and in any
case write_super() wouldn't be called until the filesystem is
successfully changed from being mounted read-only to read-write.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/super.c | 11 +----------
1 files changed, 1 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e0b0c9f..176d43f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3198,14 +3198,12 @@ static void ext4_mark_recovery_complete(struct super_block *sb,
if (jbd2_journal_flush(journal) < 0)
goto out;

- lock_super(sb);
if (EXT4_HAS_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER) &&
sb->s_flags & MS_RDONLY) {
EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
sb->s_dirt = 0;
ext4_commit_super(sb, es, 1);
}
- unlock_super(sb);

out:
jbd2_journal_unlock_updates(journal);
@@ -3438,15 +3436,8 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
(sbi->s_mount_state & EXT4_VALID_FS))
es->s_state = cpu_to_le16(sbi->s_mount_state);

- /*
- * We have to unlock super so that we can wait for
- * transactions.
- */
- if (sbi->s_journal) {
- unlock_super(sb);
+ if (sbi->s_journal)
ext4_mark_recovery_complete(sb, es);
- lock_super(sb);
- }
} else {
int ret;
if ((ret = EXT4_HAS_RO_COMPAT_FEATURE(sb,
--
1.5.6.3


2009-04-26 03:49:32

by Theodore Ts'o

Subject: [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing

Use a separate lock to protect s_groups_count and the other block
group descriptors which get changed via an on-line resize operation,
so we can stop overloading the use of lock_super().

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/ext4_sb.h | 1 +
fs/ext4/resize.c | 35 ++++++++++++++++++-----------------
fs/ext4/super.c | 1 +
3 files changed, 20 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 4bda2f7..2d36223 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -72,6 +72,7 @@ struct ext4_sb_info {
struct journal_s *s_journal;
struct list_head s_orphan;
struct mutex s_orphan_lock;
+ struct mutex s_resize_lock;
unsigned long s_commit_interval;
u32 s_max_batch_time;
u32 s_min_batch_time;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 546c7dd..e8ded13 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -193,7 +193,7 @@ static int setup_new_group_blocks(struct super_block *sb,
if (IS_ERR(handle))
return PTR_ERR(handle);

- lock_super(sb);
+ mutex_lock(&sbi->s_resize_lock);
if (input->group != sbi->s_groups_count) {
err = -EBUSY;
goto exit_journal;
@@ -302,7 +302,7 @@ exit_bh:
brelse(bh);

exit_journal:
- unlock_super(sb);
+ mutex_unlock(&sbi->s_resize_lock);
if ((err2 = ext4_journal_stop(handle)) && !err)
err = err2;

@@ -643,11 +643,12 @@ exit_free:
* important part is that the new block and inode counts are in the backup
* superblocks, and the location of the new group metadata in the GDT backups.
*
- * We do not need lock_super() for this, because these blocks are not
- * otherwise touched by the filesystem code when it is mounted. We don't
- * need to worry about last changing from sbi->s_groups_count, because the
- * worst that can happen is that we do not copy the full number of backups
- * at this time. The resize which changed s_groups_count will backup again.
+ * We do not need to take the s_resize_lock for this, because these
+ * blocks are not otherwise touched by the filesystem code when it is
+ * mounted. We don't need to worry about last changing from
+ * sbi->s_groups_count, because the worst that can happen is that we
+ * do not copy the full number of backups at this time. The resize
+ * which changed s_groups_count will backup again.
*/
static void update_backups(struct super_block *sb,
int blk_off, char *data, int size)
@@ -809,7 +810,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
goto exit_put;
}

- lock_super(sb);
+ mutex_lock(&sbi->s_resize_lock);
if (input->group != sbi->s_groups_count) {
ext4_warning(sb, __func__,
"multiple resizers run on filesystem!");
@@ -840,7 +841,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
/*
* OK, now we've set up the new group. Time to make it active.
*
- * Current kernels don't lock all allocations via lock_super(),
+ * We do not lock all allocations via s_resize_lock
* so we have to be safe wrt. concurrent accesses the group
* data. So we need to be careful to set all of the relevant
* group descriptor data etc. *before* we enable the group.
@@ -900,12 +901,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
*
* The precise rules we use are:
*
- * * Writers of s_groups_count *must* hold lock_super
+ * * Writers of s_groups_count *must* hold s_resize_lock
* AND
* * Writers must perform a smp_wmb() after updating all dependent
* data and before modifying the groups count
*
- * * Readers must hold lock_super() over the access
+ * * Readers must hold s_resize_lock over the access
* OR
* * Readers must perform an smp_rmb() after reading the groups count
* and before reading any dependent data.
@@ -948,7 +949,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
sb->s_dirt = 1;

exit_journal:
- unlock_super(sb);
+ mutex_unlock(&sbi->s_resize_lock);
if ((err2 = ext4_journal_stop(handle)) && !err)
err = err2;
if (!err) {
@@ -986,7 +987,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,

/* We don't need to worry about locking wrt other resizers just
* yet: we're going to revalidate es->s_blocks_count after
- * taking lock_super() below. */
+ * taking the s_resize_lock below. */
o_blocks_count = ext4_blocks_count(es);
o_groups_count = EXT4_SB(sb)->s_groups_count;

@@ -1056,11 +1057,11 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
goto exit_put;
}

- lock_super(sb);
+ mutex_lock(&EXT4_SB(sb)->s_resize_lock);
if (o_blocks_count != ext4_blocks_count(es)) {
ext4_warning(sb, __func__,
"multiple resizers run on filesystem!");
- unlock_super(sb);
+ mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
ext4_journal_stop(handle);
err = -EBUSY;
goto exit_put;
@@ -1070,14 +1071,14 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
EXT4_SB(sb)->s_sbh))) {
ext4_warning(sb, __func__,
"error %d on journal write access", err);
- unlock_super(sb);
+ mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
ext4_journal_stop(handle);
goto exit_put;
}
ext4_blocks_count_set(es, o_blocks_count + add);
ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
sb->s_dirt = 1;
- unlock_super(sb);
+ mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
o_blocks_count + add);
/* We add the blocks to the bitmap and set the group need init bit */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c23e82c..b4735e4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2624,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
#endif
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
+ mutex_init(&sbi->s_resize_lock);

sb->s_root = NULL;

--
1.5.6.3


2009-04-26 07:07:14

by Christoph Hellwig

Subject: Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super

On Sat, Apr 25, 2009 at 11:49:23PM -0400, Theodore Ts'o wrote:
> The function ext4_mark_recovery_complete() is called from two call
> paths: either (a) while mounting the filesystem, in which case there's
> no danger of any other CPU calling write_super() until the mount is
> completed, and (b) while remounting the filesystem read-write, in
> which case the fs core has already locked the superblock, and in any
> case write_super() wouldn't be called until the filesystem is
> successfully changed from being mounted read-only to read-write.

Currently ext4_remount releases/reacquires lock_super around
ext4_mark_recovery_complete, and unfortunately ->write_super can
currently be called on a r/o filesystem (that's why we have the
MS_RDONLY checks in all instances; I plan to clean that mess up).

So for now I would keep that instance of lock_super; it'll also serve
as documentation for the locking requirements once we push lock_super
down into the filesystems in ->write_super and ->remount_fs, so that
it can eventually be replaced with an ext4-local lock.

2009-04-26 11:49:04

by Christoph Hellwig

Subject: Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super

On Sun, Apr 26, 2009 at 07:46:08AM -0400, Theodore Tso wrote:
> That's true, but the patch also takes out the release/reacquire in
> ext4_remount (which was particularly ugly, belch).

Sorry, missed the second hunk of the patch.

> So even if
> write_super gets called on an r/o filesystem (why?!?),

No good reason really. Hopefully we'll sort all that out soon.

> we should be
> safe because remount will hold lock_super() throughout the entire
> remount operation.
>
> We could delay this cleanup until you clean the mess with write_super,
> but I don't think it would be harmful in removing the
> lock_super()/unlock_super() pair in ext4_mark_recovery_complete(), and
> the unlock_super()/lock_super() pair in ext4_remount before then. Am
> I missing something?

No, I was just missing the second hunk of the patch.

2009-04-26 11:46:08

by Theodore Ts'o

Subject: Re: [PATCH 3/5] ext4: ext4_mark_recovery_complete() doesn't need to use lock_super

On Sun, Apr 26, 2009 at 03:07:14AM -0400, Christoph Hellwig wrote:
> On Sat, Apr 25, 2009 at 11:49:23PM -0400, Theodore Ts'o wrote:
> > The function ext4_mark_recovery_complete() is called from two call
> > paths: either (a) while mounting the filesystem, in which case there's
> > no danger of any other CPU calling write_super() until the mount is
> > completed, and (b) while remounting the filesystem read-write, in
> > which case the fs core has already locked the superblock, and in any
> > case write_super() wouldn't be called until the filesystem is
> > successfully changed from being mounted read-only to read-write.
>
> Currently ext4_remount releases/reqacquires lock_super around
> ext4_mark_recovery_complete, and unfortunately currently ->write_super
> can be called on a r/o filesystem (that's why we have the MS_RDONLY
> checks in all instance, I plan to clean that mess up).

That's true, but the patch also takes out the release/reacquire in
ext4_remount (which was particularly ugly, belch). So even if
write_super gets called on an r/o filesystem (why?!?), we should be
safe because remount will hold lock_super() throughout the entire
remount operation.

We could delay this cleanup until you clean the mess with write_super,
but I don't think it would be harmful in removing the
lock_super()/unlock_super() pair in ext4_mark_recovery_complete(), and
the unlock_super()/lock_super() pair in ext4_remount before then. Am
I missing something?

- Ted

2009-04-28 15:52:52

by Jan Kara

Subject: Re: [PATCH 4/5] ext4: Replace lock/unlock_super() with an explicit lock for the orphan list

Hi,

> Use a separate lock to protect the orphan list, so we can stop
> overloading the use of lock_super().
Yes, this was needed for a long time.

> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ext4_sb.h | 1 +
> fs/ext4/namei.c | 20 +++++++++++---------
> fs/ext4/super.c | 1 +
> 3 files changed, 13 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
> index 57b71fe..4bda2f7 100644
> --- a/fs/ext4/ext4_sb.h
> +++ b/fs/ext4/ext4_sb.h
> @@ -71,6 +71,7 @@ struct ext4_sb_info {
> struct inode *s_journal_inode;
> struct journal_s *s_journal;
> struct list_head s_orphan;
> + struct mutex s_orphan_lock;
> unsigned long s_commit_interval;
> u32 s_max_batch_time;
> u32 s_min_batch_time;
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 22098e1..8018e49 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1997,7 +1997,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> if (!ext4_handle_valid(handle))
> return 0;
>
> - lock_super(sb);
> + mutex_lock(&EXT4_SB(sb)->s_orphan_lock);
> if (!list_empty(&EXT4_I(inode)->i_orphan))
> goto out_unlock;
>
> @@ -2006,9 +2006,13 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
>
> /* @@@ FIXME: Observation from aviro:
> * I think I can trigger J_ASSERT in ext4_orphan_add(). We block
> - * here (on lock_super()), so race with ext4_link() which might bump
> + * here (on s_orphan_lock), so race with ext4_link() which might bump
> * ->i_nlink. For, say it, character device. Not a regular file,
> * not a directory, not a symlink and ->i_nlink > 0.
> + *
> + * tytso, 4/25/2009: I'm not sure how that could happen;
> + * shouldn't the fs core protect us from these sorts of
> + * unlink()/link() races?
> */
We always call ext4_orphan_add() under the i_mutex of the inode we are
adding (except for the migrate code), and hence i_nlink had better be
stable... I'd just remove the comment.

> J_ASSERT((S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> S_ISLNK(inode->i_mode)) || inode->i_nlink == 0);
> @@ -2045,7 +2049,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> jbd_debug(4, "orphan inode %lu will point to %d\n",
> inode->i_ino, NEXT_ORPHAN(inode));
> out_unlock:
> - unlock_super(sb);
> + mutex_unlock(&EXT4_SB(sb)->s_orphan_lock);
> ext4_std_error(inode->i_sb, err);
> return err;
> }
> @@ -2066,11 +2070,9 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> if (!ext4_handle_valid(handle))
> return 0;
>
> - lock_super(inode->i_sb);
> - if (list_empty(&ei->i_orphan)) {
> - unlock_super(inode->i_sb);
> - return 0;
> - }
> + mutex_lock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> + if (list_empty(&ei->i_orphan))
> + goto out;
>
> ino_next = NEXT_ORPHAN(inode);
> prev = ei->i_orphan.prev;
> @@ -2120,7 +2122,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> out_err:
> ext4_std_error(inode->i_sb, err);
> out:
> - unlock_super(inode->i_sb);
> + mutex_unlock(&EXT4_SB(inode->i_sb)->s_orphan_lock);
> return err;
>
> out_brelse:
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 176d43f..c23e82c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2623,6 +2623,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> sb->dq_op = &ext4_quota_operations;
> #endif
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> + mutex_init(&sbi->s_orphan_lock);
>
> sb->s_root = NULL;
Otherwise the patch looks good.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-04-28 16:02:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 5/5] ext4: Replace lock/unlock_super() with an explicit lock for resizing

> Use a separate lock to protect s_groups_count and the other block
> group descriptors which get changed via an on-line resize operation,
> so we can stop overloading the use of lock_super().
This patch looks good as well. You can add
Acked-by: Jan Kara <[email protected]>
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/ext4_sb.h | 1 +
> fs/ext4/resize.c | 35 ++++++++++++++++++-----------------
> fs/ext4/super.c | 1 +
> 3 files changed, 20 insertions(+), 17 deletions(-)
>
> diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
> index 4bda2f7..2d36223 100644
> --- a/fs/ext4/ext4_sb.h
> +++ b/fs/ext4/ext4_sb.h
> @@ -72,6 +72,7 @@ struct ext4_sb_info {
> struct journal_s *s_journal;
> struct list_head s_orphan;
> struct mutex s_orphan_lock;
> + struct mutex s_resize_lock;
> unsigned long s_commit_interval;
> u32 s_max_batch_time;
> u32 s_min_batch_time;
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index 546c7dd..e8ded13 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -193,7 +193,7 @@ static int setup_new_group_blocks(struct super_block *sb,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - lock_super(sb);
> + mutex_lock(&sbi->s_resize_lock);
> if (input->group != sbi->s_groups_count) {
> err = -EBUSY;
> goto exit_journal;
> @@ -302,7 +302,7 @@ exit_bh:
> brelse(bh);
>
> exit_journal:
> - unlock_super(sb);
> + mutex_unlock(&sbi->s_resize_lock);
> if ((err2 = ext4_journal_stop(handle)) && !err)
> err = err2;
>
> @@ -643,11 +643,12 @@ exit_free:
> * important part is that the new block and inode counts are in the backup
> * superblocks, and the location of the new group metadata in the GDT backups.
> *
> - * We do not need lock_super() for this, because these blocks are not
> - * otherwise touched by the filesystem code when it is mounted. We don't
> - * need to worry about last changing from sbi->s_groups_count, because the
> - * worst that can happen is that we do not copy the full number of backups
> - * at this time. The resize which changed s_groups_count will backup again.
> + * We do not need to take the s_resize_lock for this, because these
> + * blocks are not otherwise touched by the filesystem code when it is
> + * mounted. We don't need to worry about last changing from
> + * sbi->s_groups_count, because the worst that can happen is that we
> + * do not copy the full number of backups at this time. The resize
> + * which changed s_groups_count will backup again.
> */
> static void update_backups(struct super_block *sb,
> int blk_off, char *data, int size)
> @@ -809,7 +810,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> goto exit_put;
> }
>
> - lock_super(sb);
> + mutex_lock(&sbi->s_resize_lock);
> if (input->group != sbi->s_groups_count) {
> ext4_warning(sb, __func__,
> "multiple resizers run on filesystem!");
> @@ -840,7 +841,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> /*
> * OK, now we've set up the new group. Time to make it active.
> *
> - * Current kernels don't lock all allocations via lock_super(),
> + * We do not lock all allocations via s_resize_lock
> * so we have to be safe wrt. concurrent accesses the group
> * data. So we need to be careful to set all of the relevant
> * group descriptor data etc. *before* we enable the group.
> @@ -900,12 +901,12 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> *
> * The precise rules we use are:
> *
> - * * Writers of s_groups_count *must* hold lock_super
> + * * Writers of s_groups_count *must* hold s_resize_lock
> * AND
> * * Writers must perform a smp_wmb() after updating all dependent
> * data and before modifying the groups count
> *
> - * * Readers must hold lock_super() over the access
> + * * Readers must hold s_resize_lock over the access
> * OR
> * * Readers must perform an smp_rmb() after reading the groups count
> * and before reading any dependent data.
> @@ -948,7 +949,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> sb->s_dirt = 1;
>
> exit_journal:
> - unlock_super(sb);
> + mutex_unlock(&sbi->s_resize_lock);
> if ((err2 = ext4_journal_stop(handle)) && !err)
> err = err2;
> if (!err) {
> @@ -986,7 +987,7 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
>
> /* We don't need to worry about locking wrt other resizers just
> * yet: we're going to revalidate es->s_blocks_count after
> - * taking lock_super() below. */
> + * taking the s_resize_lock below. */
> o_blocks_count = ext4_blocks_count(es);
> o_groups_count = EXT4_SB(sb)->s_groups_count;
>
> @@ -1056,11 +1057,11 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
> goto exit_put;
> }
>
> - lock_super(sb);
> + mutex_lock(&EXT4_SB(sb)->s_resize_lock);
> if (o_blocks_count != ext4_blocks_count(es)) {
> ext4_warning(sb, __func__,
> "multiple resizers run on filesystem!");
> - unlock_super(sb);
> + mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
> ext4_journal_stop(handle);
> err = -EBUSY;
> goto exit_put;
> @@ -1070,14 +1071,14 @@ int ext4_group_extend(struct super_block *sb, struct ext4_super_block *es,
> EXT4_SB(sb)->s_sbh))) {
> ext4_warning(sb, __func__,
> "error %d on journal write access", err);
> - unlock_super(sb);
> + mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
> ext4_journal_stop(handle);
> goto exit_put;
> }
> ext4_blocks_count_set(es, o_blocks_count + add);
> ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
> sb->s_dirt = 1;
> - unlock_super(sb);
> + mutex_unlock(&EXT4_SB(sb)->s_resize_lock);
> ext4_debug("freeing blocks %llu through %llu\n", o_blocks_count,
> o_blocks_count + add);
> /* We add the blocks to the bitmap and set the group need init bit */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index c23e82c..b4735e4 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2624,6 +2624,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> #endif
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> mutex_init(&sbi->s_orphan_lock);
> + mutex_init(&sbi->s_resize_lock);
>
> sb->s_root = NULL;
>
> --
> 1.5.6.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-04-28 16:13:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/5] ext4: Remove outdated comment about lock_super()

> ext4_fill_super() is no longer called by read_super(), and it is no
> longer called with the superblock locked. The
> unlock_super()/lock_super() is no longer present, so this comment is
> entirely superfluous.
Yes, the comment seems to be out of date.

Acked-by: Jan Kara <[email protected]>

Honza
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/super.c | 8 --------
> 1 files changed, 0 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e20ff9c..e0b0c9f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2806,14 +2806,6 @@ no_journal:
> goto failed_mount4;
> };
>
> - /*
> - * akpm: core read_super() calls in here with the superblock locked.
> - * That deadlocks, because orphan cleanup needs to lock the superblock
> - * in numerous places. Here we just pop the lock - it's relatively
> - * harmless, because we are now ready to accept write_super() requests,
> - * and aviro says that's the only reason for hanging onto the
> - * superblock lock.
> - */
> EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
> ext4_orphan_cleanup(sb, es);
> EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
> --
> 1.5.6.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-04-28 16:24:01

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering

> Ext4's on-line resizing adds a new block group and then, only at the
> last step, adjusts s_groups_count. However, it's possible on SMP
> systems that another CPU could see the updated s_groups_count but
> not see the newly initialized data structures for the just-added block
> group. For this reason, it's important to insert an SMP read barrier
> after reading s_groups_count and before reading, say, any block group
> descriptors made valid by the new block group count.
>
> Unfortunately, we rather blatantly violate this locking protocol as
> documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
> happen relatively rarely, and (2) it seems rare that the filesystem
> code will immediately try to use a just-added block group before any
> memory ordering issues resolve themselves. So apparently problems
> here are relatively hard to hit, since ext3 is also vulnerable to this
> race and no one has apparently complained.
Ouch... Hmm, smp_rmb() isn't completely free, and mainly it's a bit
ugly and prone to errors (I'm afraid that the next time someone changes
the allocation code, we'll miss some barriers again)... so, maybe a
stupid idea, but wouldn't it be easier to solve on-line resize like
this: freeze the filesystem, do all the changes required for the
extend, unfreeze the filesystem?
I guess the resize code might get simpler as well with this.

Honza
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> fs/ext4/balloc.c | 6 ++++--
> fs/ext4/ialloc.c | 34 +++++++++++++++++++++-------------
> fs/ext4/mballoc.c | 49 ++++++++++++++++++++++++++++++++-----------------
> 3 files changed, 57 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 53c72ad..d1615f2 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -88,9 +88,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
> ext4_group_t block_group, struct ext4_group_desc *gdp)
> {
> int bit, bit_max;
> + ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> unsigned free_blocks, group_blocks;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
>
> + smp_rmb(); /* after reading s_groups_count first */
> if (bh) {
> J_ASSERT_BH(bh, buffer_locked(bh));
>
> @@ -123,7 +125,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
> bit_max += ext4_bg_num_gdb(sb, block_group);
> }
>
> - if (block_group == sbi->s_groups_count - 1) {
> + if (block_group == ngroups - 1) {
> /*
> * Even though mke2fs always initialize first and last group
> * if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
> @@ -131,7 +133,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
> */
> group_blocks = ext4_blocks_count(sbi->s_es) -
> le32_to_cpu(sbi->s_es->s_first_data_block) -
> - (EXT4_BLOCKS_PER_GROUP(sb) * (sbi->s_groups_count - 1));
> + (EXT4_BLOCKS_PER_GROUP(sb) * (ngroups - 1));
> } else {
> group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
> }
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index f18e0a0..52ce274 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -322,6 +322,7 @@ static int find_group_dir(struct super_block *sb, struct inode *parent,
> ext4_group_t group;
> int ret = -1;
>
> + smp_rmb(); /* after reading s_groups_count first */
> freei = percpu_counter_read_positive(&EXT4_SB(sb)->s_freeinodes_counter);
> avefreei = freei / ngroups;
>
> @@ -362,7 +363,8 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
> ext4_group_t n_fbg_groups;
> ext4_group_t i;
>
> - n_fbg_groups = (sbi->s_groups_count + flex_size - 1) >>
> + smp_rmb(); /* after reading s_groups_count first */
> + n_fbg_groups = (ngroups + flex_size - 1) >>
> sbi->s_log_groups_per_flex;
>
> find_close_to_parent:
> @@ -478,18 +480,18 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> {
> ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> - ext4_group_t ngroups = sbi->s_groups_count;
> int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
> unsigned int freei, avefreei;
> ext4_fsblk_t freeb, avefreeb;
> unsigned int ndirs;
> int max_dirs, min_inodes;
> ext4_grpblk_t min_blocks;
> - ext4_group_t i, grp, g;
> + ext4_group_t i, grp, g, ngroups = sbi->s_groups_count;
> struct ext4_group_desc *desc;
> struct orlov_stats stats;
> int flex_size = ext4_flex_bg_size(sbi);
>
> + smp_rmb(); /* after reading s_groups_count first */
> if (flex_size > 1) {
> ngroups = (ngroups + flex_size - 1) >>
> sbi->s_log_groups_per_flex;
> @@ -585,6 +587,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
> fallback:
> ngroups = sbi->s_groups_count;
> avefreei = freei / ngroups;
> + smp_rmb();
> fallback_retry:
> parent_group = EXT4_I(parent)->i_block_group;
> for (i = 0; i < ngroups; i++) {
> @@ -613,11 +616,11 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
> ext4_group_t *group, int mode)
> {
> ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
> - ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> + ext4_group_t i, last, ngroups = EXT4_SB(sb)->s_groups_count;
> struct ext4_group_desc *desc;
> - ext4_group_t i, last;
> int flex_size = ext4_flex_bg_size(EXT4_SB(sb));
>
> + smp_rmb(); /* after reading s_groups_count first */
> /*
> * Try to place the inode is the same flex group as its
> * parent. If we can't find space, use the Orlov algorithm to
> @@ -799,7 +802,7 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
> struct super_block *sb;
> struct buffer_head *inode_bitmap_bh = NULL;
> struct buffer_head *group_desc_bh;
> - ext4_group_t group = 0;
> + ext4_group_t ngroups, group = 0;
> unsigned long ino = 0;
> struct inode *inode;
> struct ext4_group_desc *gdp = NULL;
> @@ -851,12 +854,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
> ret2 = find_group_other(sb, dir, &group, mode);
>
> got_group:
> + ngroups = sbi->s_groups_count;
> + smp_rmb();
> EXT4_I(dir)->i_last_alloc_group = group;
> err = -ENOSPC;
> if (ret2 == -1)
> goto out;
>
> - for (i = 0; i < sbi->s_groups_count; i++) {
> + for (i = 0; i < ngroups; i++) {
> err = -EIO;
>
> gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
> @@ -917,7 +922,7 @@ repeat_in_this_group:
> * group descriptor metadata has not yet been updated.
> * So we just go onto the next blockgroup.
> */
> - if (++group == sbi->s_groups_count)
> + if (++group == ngroups)
> group = 0;
> }
> err = -ENOSPC;
> @@ -1158,17 +1163,18 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
> {
> unsigned long desc_count;
> struct ext4_group_desc *gdp;
> - ext4_group_t i;
> + ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
> #ifdef EXT4FS_DEBUG
> struct ext4_super_block *es;
> unsigned long bitmap_count, x;
> struct buffer_head *bitmap_bh = NULL;
>
> + smp_rmb(); /* after reading s_groups_count first */
> es = EXT4_SB(sb)->s_es;
> desc_count = 0;
> bitmap_count = 0;
> gdp = NULL;
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> + for (i = 0; i < ngroups; i++) {
> gdp = ext4_get_group_desc(sb, i, NULL);
> if (!gdp)
> continue;
> @@ -1189,8 +1195,9 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
> le32_to_cpu(es->s_free_inodes_count), desc_count, bitmap_count);
> return desc_count;
> #else
> + smp_rmb(); /* after reading s_groups_count first */
> desc_count = 0;
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> + for (i = 0; i < ngroups; i++) {
> gdp = ext4_get_group_desc(sb, i, NULL);
> if (!gdp)
> continue;
> @@ -1205,9 +1212,10 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
> unsigned long ext4_count_dirs(struct super_block * sb)
> {
> unsigned long count = 0;
> - ext4_group_t i;
> + ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
>
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> + smp_rmb(); /* after reading s_groups_count first */
> + for (i = 0; i < ngroups; i++) {
> struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
> if (!gdp)
> continue;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index f871677..ecc2d49 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -739,6 +739,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,
>
> static int ext4_mb_init_cache(struct page *page, char *incore)
> {
> + ext4_group_t ngroups;
> int blocksize;
> int blocks_per_page;
> int groups_per_page;
> @@ -757,6 +758,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>
> inode = page->mapping->host;
> sb = inode->i_sb;
> + ngroups = EXT4_SB(sb)->s_groups_count;
> + smp_rmb();
> blocksize = 1 << inode->i_blkbits;
> blocks_per_page = PAGE_CACHE_SIZE / blocksize;
>
> @@ -780,7 +783,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> for (i = 0; i < groups_per_page; i++) {
> struct ext4_group_desc *desc;
>
> - if (first_group + i >= EXT4_SB(sb)->s_groups_count)
> + if (first_group + i >= ngroups)
> break;
>
> err = -EIO;
> @@ -852,7 +855,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> struct ext4_group_info *grinfo;
>
> group = (first_block + i) >> 1;
> - if (group >= EXT4_SB(sb)->s_groups_count)
> + if (group >= ngroups)
> break;
>
> /*
> @@ -1788,9 +1791,11 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
> int block, pnum;
> int blocks_per_page;
> int groups_per_page;
> + ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> ext4_group_t first_group;
> struct ext4_group_info *grp;
>
> + smp_rmb(); /* after reading s_groups_count first */
> blocks_per_page = PAGE_CACHE_SIZE / sb->s_blocksize;
> /*
> * the buddy cache inode stores the block bitmap
> @@ -1807,7 +1812,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
> /* read all groups the page covers into the cache */
> for (i = 0; i < groups_per_page; i++) {
>
> - if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
> + if ((first_group + i) >= ngroups)
> break;
> grp = ext4_get_group_info(sb, first_group + i);
> /* take all groups write allocation
> @@ -1945,8 +1950,7 @@ err:
> static noinline_for_stack int
> ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
> {
> - ext4_group_t group;
> - ext4_group_t i;
> + ext4_group_t ngroups, group, i;
> int cr;
> int err = 0;
> int bsbits;
> @@ -1957,6 +1961,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
>
> sb = ac->ac_sb;
> sbi = EXT4_SB(sb);
> + ngroups = EXT4_SB(sb)->s_groups_count;
> + smp_rmb();
> BUG_ON(ac->ac_status == AC_STATUS_FOUND);
>
> /* first, try the goal */
> @@ -2017,11 +2023,11 @@ repeat:
> */
> group = ac->ac_g_ex.fe_group;
>
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
> + for (i = 0; i < ngroups; group++, i++) {
> struct ext4_group_info *grp;
> struct ext4_group_desc *desc;
>
> - if (group == EXT4_SB(sb)->s_groups_count)
> + if (group == ngroups)
> group = 0;
>
> /* quick check to skip empty groups */
> @@ -2320,7 +2326,7 @@ static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
>
> if (*pos < 0 || *pos >= sbi->s_groups_count)
> return NULL;
> -
> + smp_rmb();
> group = *pos + 1;
> return (void *) ((unsigned long) group);
> }
> @@ -2334,6 +2340,7 @@ static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
> ++*pos;
> if (*pos < 0 || *pos >= sbi->s_groups_count)
> return NULL;
> + smp_rmb();
> group = *pos + 1;
> return (void *) ((unsigned long) group);
> }
> @@ -2587,6 +2594,7 @@ void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)
>
> static int ext4_mb_init_backend(struct super_block *sb)
> {
> + ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> ext4_group_t i;
> int metalen;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
> @@ -2597,8 +2605,10 @@ static int ext4_mb_init_backend(struct super_block *sb)
> struct ext4_group_info **meta_group_info;
> struct ext4_group_desc *desc;
>
> + smp_rmb(); /* after reading s_groups_count first */
> +
> /* This is the number of blocks used by GDT */
> - num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
> + num_meta_group_infos = (ngroups + EXT4_DESC_PER_BLOCK(sb) -
> 1) >> EXT4_DESC_PER_BLOCK_BITS(sb);
>
> /*
> @@ -2644,7 +2654,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
> for (i = 0; i < num_meta_group_infos; i++) {
> if ((i + 1) == num_meta_group_infos)
> metalen = sizeof(*meta_group_info) *
> - (sbi->s_groups_count -
> + (ngroups -
> (i << EXT4_DESC_PER_BLOCK_BITS(sb)));
> meta_group_info = kmalloc(metalen, GFP_KERNEL);
> if (meta_group_info == NULL) {
> @@ -2655,7 +2665,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
> sbi->s_group_info[i] = meta_group_info;
> }
>
> - for (i = 0; i < sbi->s_groups_count; i++) {
> + for (i = 0; i < ngroups; i++) {
> desc = ext4_get_group_desc(sb, i, NULL);
> if (desc == NULL) {
> printk(KERN_ERR
> @@ -2781,13 +2791,15 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)
>
> int ext4_mb_release(struct super_block *sb)
> {
> + ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
> ext4_group_t i;
> int num_meta_group_infos;
> struct ext4_group_info *grinfo;
> struct ext4_sb_info *sbi = EXT4_SB(sb);
>
> + smp_rmb(); /* after reading s_groups_count first */
> if (sbi->s_group_info) {
> - for (i = 0; i < sbi->s_groups_count; i++) {
> + for (i = 0; i < ngroups; i++) {
> grinfo = ext4_get_group_info(sb, i);
> #ifdef DOUBLE_CHECK
> kfree(grinfo->bb_bitmap);
> @@ -2797,7 +2809,7 @@ int ext4_mb_release(struct super_block *sb)
> ext4_unlock_group(sb, i);
> kfree(grinfo);
> }
> - num_meta_group_infos = (sbi->s_groups_count +
> + num_meta_group_infos = (ngroups +
> EXT4_DESC_PER_BLOCK(sb) - 1) >>
> EXT4_DESC_PER_BLOCK_BITS(sb);
> for (i = 0; i < num_meta_group_infos; i++)
> @@ -4121,7 +4133,7 @@ static void ext4_mb_return_to_preallocation(struct inode *inode,
> static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
> {
> struct super_block *sb = ac->ac_sb;
> - ext4_group_t i;
> + ext4_group_t ngroups, i;
>
> printk(KERN_ERR "EXT4-fs: Can't allocate:"
> " Allocation context details:\n");
> @@ -4145,7 +4157,9 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
> printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
> ac->ac_found);
> printk(KERN_ERR "EXT4-fs: groups: \n");
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> + ngroups = EXT4_SB(ac->ac_sb)->s_groups_count;
> + smp_rmb();
> + for (i = 0; i < ngroups; i++) {
> struct ext4_group_info *grp = ext4_get_group_info(sb, i);
> struct ext4_prealloc_space *pa;
> ext4_grpblk_t start;
> @@ -4469,13 +4483,14 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>
> static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> {
> - ext4_group_t i;
> + ext4_group_t i, ngroups = EXT4_SB(sb)->s_groups_count;
> int ret;
> int freed = 0;
>
> + smp_rmb(); /* after reading s_groups_count first */
> trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
> sb->s_id, needed);
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
> + for (i = 0; i < ngroups && needed > 0; i++) {
> ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> freed += ret;
> needed -= ret;
> --
> 1.5.6.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SuSE CR Labs

2009-04-28 17:14:48

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering

On Tue, Apr 28, 2009 at 06:23:59PM +0200, Jan Kara wrote:
> Ouch... Hmm, smp_rmb() isn't completely free and mainly it's a bit
> ugly and prone to errors (I'm afraid next time someone changes the
> allocation code, we miss some barriers again)... so.. Maybe a stupid
> idea but wouldn't it be easier to solve the online resize like: freeze
> the filesystem, do all the changes required for extend, unfreeze the
> filesystem?

Eric suggested using a helper function for reading s_groups_count.
That would take care of the code maintenance headache. One problem
with using freeze/thaw is that it won't work without a journal, and we
do support filesystems without journals in ext4. (Probably the best
solution for netbooks with crapola SSD's.)

As far as smp_rmb() not being free, it's essentially free for
x86/x86_64 platforms. Is it really that costly on other
architectures?

- Ted

2009-04-29 09:28:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] ext4: Avoid races caused by on-line resizing and SMP memory reordering

On Tue 28-04-09 13:14:45, Theodore Tso wrote:
> On Tue, Apr 28, 2009 at 06:23:59PM +0200, Jan Kara wrote:
> > Ouch... Hmm, smp_rmb() isn't completely free and mainly it's a bit
> > ugly and prone to errors (I'm afraid next time someone changes the
> > allocation code, we miss some barriers again)... so.. Maybe a stupid
> > idea but wouldn't it be easier to solve the online resize like: freeze
> > the filesystem, do all the changes required for extend, unfreeze the
> > filesystem?
>
> Eric suggested a helper function when reading from s_groups_count.
> That would take care of the code maintenance headache. One problem
> with using freeze/thaw is it won't work without a journal, and we do
> support filesystems without journals in ext4. (Probably the best
> solution for netbooks with crapola SSD's.)
Well, in the non-journalling case, we could introduce a rw semaphore
(acquired for reading / released in journal_start / journal_stop, and
acquired for writing when the fs is frozen). This might be useful for
other rare cases where freezing the fs would be beneficial. But yes,
if wrapping the read into a helper function works, then that might be
the easiest way to go.

> As far as smp_rmb() not being free, it's essentially free for
> x86/x86_64 platforms. Is it really that costly on other
> architectures?
I had a feeling that it's not that expensive, but not quite free either,
on x86/x86_64 (I know even less about other archs) - it has to lock the
bus, and writes in local CPU caches have to be flushed, no? Probably we
don't care, given the size of the functions doing allocation... As an
exercise I tried to google some numbers but was not really successful,
just some comments about tens of cycles in some emails...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-05-01 13:56:03

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH v2] ext4: Avoid races caused by on-line resizing and SMP memory reordering

Ext4's on-line resizing adds a new block group and then, only at the
last step, adjusts s_groups_count. However, it's possible on SMP
systems that another CPU could see the updated s_groups_count but
not see the newly initialized data structures for the just-added block
group. For this reason, it's important to insert an SMP read barrier
after reading s_groups_count and before reading (for example) the
new block group descriptors allowed by the increased value of
s_groups_count.

Unfortunately, we rather blatantly violate this locking protocol as
documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
happen relatively rarely, and (2) it seems rare that the filesystem
code will immediately try to use a just-added block group before any
memory ordering issues resolve themselves. So apparently problems
here are relatively hard to hit, since ext3 has been vulnerable to the
same issue for years with no one apparently complaining.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---

Here's an updated version which uses a helper function to make sure
that in the future we don't miss a place where smp_rmb() is needed.

fs/ext4/balloc.c | 15 +++++++--------
fs/ext4/ext4.h | 12 ++++++++++++
fs/ext4/ialloc.c | 40 +++++++++++++++++++---------------------
fs/ext4/inode.c | 7 ++++---
fs/ext4/mballoc.c | 45 ++++++++++++++++++++++++---------------------
fs/ext4/super.c | 3 +--
6 files changed, 67 insertions(+), 55 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 53c72ad..a5ba039 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -88,6 +88,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_group_t block_group, struct ext4_group_desc *gdp)
{
int bit, bit_max;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
unsigned free_blocks, group_blocks;
struct ext4_sb_info *sbi = EXT4_SB(sb);

@@ -123,7 +124,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
bit_max += ext4_bg_num_gdb(sb, block_group);
}

- if (block_group == sbi->s_groups_count - 1) {
+ if (block_group == ngroups - 1) {
/*
* Even though mke2fs always initialize first and last group
* if some other tool enabled the EXT4_BG_BLOCK_UNINIT we need
@@ -131,7 +132,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
*/
group_blocks = ext4_blocks_count(sbi->s_es) -
le32_to_cpu(sbi->s_es->s_first_data_block) -
- (EXT4_BLOCKS_PER_GROUP(sb) * (sbi->s_groups_count - 1));
+ (EXT4_BLOCKS_PER_GROUP(sb) * (ngroups - 1));
} else {
group_blocks = EXT4_BLOCKS_PER_GROUP(sb);
}
@@ -205,18 +206,18 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
{
unsigned int group_desc;
unsigned int offset;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
struct ext4_group_desc *desc;
struct ext4_sb_info *sbi = EXT4_SB(sb);

- if (block_group >= sbi->s_groups_count) {
+ if (block_group >= ngroups) {
ext4_error(sb, "ext4_get_group_desc",
"block_group >= groups_count - "
"block_group = %u, groups_count = %u",
- block_group, sbi->s_groups_count);
+ block_group, ngroups);

return NULL;
}
- smp_rmb();

group_desc = block_group >> EXT4_DESC_PER_BLOCK_BITS(sb);
offset = block_group & (EXT4_DESC_PER_BLOCK(sb) - 1);
@@ -665,7 +666,7 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
ext4_fsblk_t desc_count;
struct ext4_group_desc *gdp;
ext4_group_t i;
- ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
#ifdef EXT4FS_DEBUG
struct ext4_super_block *es;
ext4_fsblk_t bitmap_count;
@@ -677,7 +678,6 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
bitmap_count = 0;
gdp = NULL;

- smp_rmb();
for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
@@ -700,7 +700,6 @@ ext4_fsblk_t ext4_count_free_blocks(struct super_block *sb)
return bitmap_count;
#else
desc_count = 0;
- smp_rmb();
for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0f15ef..02ec44b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1228,6 +1228,18 @@ struct ext4_group_info *ext4_get_group_info(struct super_block *sb,
return grp_info[indexv][indexh];
}

+/*
+ * Reading s_groups_count requires using smp_rmb() afterwards. See
+ * the locking protocol documented in the comments of ext4_group_add()
+ * in resize.c
+ */
+static inline ext4_group_t ext4_get_groups_count(struct super_block *sb)
+{
+ ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+
+ smp_rmb();
+ return ngroups;
+}

static inline ext4_group_t ext4_flex_group(struct ext4_sb_info *sbi,
ext4_group_t block_group)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f18e0a0..55ba419 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -316,7 +316,7 @@ error_return:
static int find_group_dir(struct super_block *sb, struct inode *parent,
ext4_group_t *best_group)
{
- ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
unsigned int freei, avefreei;
struct ext4_group_desc *desc, *best_desc = NULL;
ext4_group_t group;
@@ -353,7 +353,7 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
struct flex_groups *flex_group = sbi->s_flex_groups;
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
ext4_group_t parent_fbg_group = ext4_flex_group(sbi, parent_group);
- ext4_group_t ngroups = sbi->s_groups_count;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
int flex_size = ext4_flex_bg_size(sbi);
ext4_group_t best_flex = parent_fbg_group;
int blocks_per_flex = sbi->s_blocks_per_group * flex_size;
@@ -362,7 +362,7 @@ static int find_group_flex(struct super_block *sb, struct inode *parent,
ext4_group_t n_fbg_groups;
ext4_group_t i;

- n_fbg_groups = (sbi->s_groups_count + flex_size - 1) >>
+ n_fbg_groups = (ngroups + flex_size - 1) >>
sbi->s_log_groups_per_flex;

find_close_to_parent:
@@ -478,20 +478,21 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
struct ext4_sb_info *sbi = EXT4_SB(sb);
- ext4_group_t ngroups = sbi->s_groups_count;
+ ext4_group_t real_ngroups = ext4_get_groups_count(sb);
int inodes_per_group = EXT4_INODES_PER_GROUP(sb);
unsigned int freei, avefreei;
ext4_fsblk_t freeb, avefreeb;
unsigned int ndirs;
int max_dirs, min_inodes;
ext4_grpblk_t min_blocks;
- ext4_group_t i, grp, g;
+ ext4_group_t i, grp, g, ngroups;
struct ext4_group_desc *desc;
struct orlov_stats stats;
int flex_size = ext4_flex_bg_size(sbi);

+ ngroups = real_ngroups;
if (flex_size > 1) {
- ngroups = (ngroups + flex_size - 1) >>
+ ngroups = (real_ngroups + flex_size - 1) >>
sbi->s_log_groups_per_flex;
parent_group >>= sbi->s_log_groups_per_flex;
}
@@ -543,7 +544,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
*/
grp *= flex_size;
for (i = 0; i < flex_size; i++) {
- if (grp+i >= sbi->s_groups_count)
+ if (grp+i >= real_ngroups)
break;
desc = ext4_get_group_desc(sb, grp+i, NULL);
if (desc && ext4_free_inodes_count(sb, desc)) {
@@ -583,7 +584,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
}

fallback:
- ngroups = sbi->s_groups_count;
+ ngroups = real_ngroups;
avefreei = freei / ngroups;
fallback_retry:
parent_group = EXT4_I(parent)->i_block_group;
@@ -613,9 +614,8 @@ static int find_group_other(struct super_block *sb, struct inode *parent,
ext4_group_t *group, int mode)
{
ext4_group_t parent_group = EXT4_I(parent)->i_block_group;
- ext4_group_t ngroups = EXT4_SB(sb)->s_groups_count;
+ ext4_group_t i, last, ngroups = ext4_get_groups_count(sb);
struct ext4_group_desc *desc;
- ext4_group_t i, last;
int flex_size = ext4_flex_bg_size(EXT4_SB(sb));

/*
@@ -799,11 +799,10 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
struct super_block *sb;
struct buffer_head *inode_bitmap_bh = NULL;
struct buffer_head *group_desc_bh;
- ext4_group_t group = 0;
+ ext4_group_t ngroups, group = 0;
unsigned long ino = 0;
struct inode *inode;
struct ext4_group_desc *gdp = NULL;
- struct ext4_super_block *es;
struct ext4_inode_info *ei;
struct ext4_sb_info *sbi;
int ret2, err = 0;
@@ -818,15 +817,14 @@ struct inode *ext4_new_inode(handle_t *handle, struct inode *dir, int mode)
return ERR_PTR(-EPERM);

sb = dir->i_sb;
+ ngroups = ext4_get_groups_count(sb);
trace_mark(ext4_request_inode, "dev %s dir %lu mode %d", sb->s_id,
dir->i_ino, mode);
inode = new_inode(sb);
if (!inode)
return ERR_PTR(-ENOMEM);
ei = EXT4_I(inode);
-
sbi = EXT4_SB(sb);
- es = sbi->s_es;

if (sbi->s_log_groups_per_flex && test_opt(sb, OLDALLOC)) {
ret2 = find_group_flex(sb, dir, &group);
@@ -856,7 +854,7 @@ got_group:
if (ret2 == -1)
goto out;

- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
err = -EIO;

gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
@@ -917,7 +915,7 @@ repeat_in_this_group:
* group descriptor metadata has not yet been updated.
* So we just go onto the next blockgroup.
*/
- if (++group == sbi->s_groups_count)
+ if (++group == ngroups)
group = 0;
}
err = -ENOSPC;
@@ -1158,7 +1156,7 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
{
unsigned long desc_count;
struct ext4_group_desc *gdp;
- ext4_group_t i;
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);
#ifdef EXT4FS_DEBUG
struct ext4_super_block *es;
unsigned long bitmap_count, x;
@@ -1168,7 +1166,7 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
desc_count = 0;
bitmap_count = 0;
gdp = NULL;
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
@@ -1190,7 +1188,7 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
return desc_count;
#else
desc_count = 0;
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
@@ -1205,9 +1203,9 @@ unsigned long ext4_count_free_inodes(struct super_block *sb)
unsigned long ext4_count_dirs(struct super_block * sb)
{
unsigned long count = 0;
- ext4_group_t i;
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);

- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
struct ext4_group_desc *gdp = ext4_get_group_desc(sb, i, NULL);
if (!gdp)
continue;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b891045..3e90965 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4916,7 +4916,8 @@ static int ext4_index_trans_blocks(struct inode *inode, int nrblocks, int chunk)
*/
int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
{
- int groups, gdpblocks;
+ ext4_group_t groups, ngroups = ext4_get_groups_count(inode->i_sb);
+ int gdpblocks;
int idxblocks;
int ret = 0;

@@ -4943,8 +4944,8 @@ int ext4_meta_trans_blocks(struct inode *inode, int nrblocks, int chunk)
groups += nrblocks;

gdpblocks = groups;
- if (groups > EXT4_SB(inode->i_sb)->s_groups_count)
- groups = EXT4_SB(inode->i_sb)->s_groups_count;
+ if (groups > ngroups)
+ groups = ngroups;
if (groups > EXT4_SB(inode->i_sb)->s_gdb_count)
gdpblocks = EXT4_SB(inode->i_sb)->s_gdb_count;

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index f871677..c3af9e6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -739,6 +739,7 @@ static void ext4_mb_generate_buddy(struct super_block *sb,

static int ext4_mb_init_cache(struct page *page, char *incore)
{
+ ext4_group_t ngroups;
int blocksize;
int blocks_per_page;
int groups_per_page;
@@ -757,6 +758,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)

inode = page->mapping->host;
sb = inode->i_sb;
+ ngroups = ext4_get_groups_count(sb);
blocksize = 1 << inode->i_blkbits;
blocks_per_page = PAGE_CACHE_SIZE / blocksize;

@@ -780,7 +782,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
for (i = 0; i < groups_per_page; i++) {
struct ext4_group_desc *desc;

- if (first_group + i >= EXT4_SB(sb)->s_groups_count)
+ if (first_group + i >= ngroups)
break;

err = -EIO;
@@ -852,7 +854,7 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
struct ext4_group_info *grinfo;

group = (first_block + i) >> 1;
- if (group >= EXT4_SB(sb)->s_groups_count)
+ if (group >= ngroups)
break;

/*
@@ -1788,6 +1790,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
int block, pnum;
int blocks_per_page;
int groups_per_page;
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
ext4_group_t first_group;
struct ext4_group_info *grp;

@@ -1807,7 +1810,7 @@ int ext4_mb_get_buddy_cache_lock(struct super_block *sb, ext4_group_t group)
/* read all groups the page covers into the cache */
for (i = 0; i < groups_per_page; i++) {

- if ((first_group + i) >= EXT4_SB(sb)->s_groups_count)
+ if ((first_group + i) >= ngroups)
break;
grp = ext4_get_group_info(sb, first_group + i);
/* take all groups write allocation
@@ -1945,8 +1948,7 @@ err:
static noinline_for_stack int
ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
{
- ext4_group_t group;
- ext4_group_t i;
+ ext4_group_t ngroups, group, i;
int cr;
int err = 0;
int bsbits;
@@ -1957,6 +1959,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)

sb = ac->ac_sb;
sbi = EXT4_SB(sb);
+ ngroups = ext4_get_groups_count(sb);
BUG_ON(ac->ac_status == AC_STATUS_FOUND);

/* first, try the goal */
@@ -2017,11 +2020,11 @@ repeat:
*/
group = ac->ac_g_ex.fe_group;

- for (i = 0; i < EXT4_SB(sb)->s_groups_count; group++, i++) {
+ for (i = 0; i < ngroups; group++, i++) {
struct ext4_group_info *grp;
struct ext4_group_desc *desc;

- if (group == EXT4_SB(sb)->s_groups_count)
+ if (group == ngroups)
group = 0;

/* quick check to skip empty groups */
@@ -2315,12 +2318,10 @@ static struct file_operations ext4_mb_seq_history_fops = {
static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
{
struct super_block *sb = seq->private;
- struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_group_t group;

- if (*pos < 0 || *pos >= sbi->s_groups_count)
+ if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
return NULL;
-
group = *pos + 1;
return (void *) ((unsigned long) group);
}
@@ -2328,11 +2329,10 @@ static void *ext4_mb_seq_groups_start(struct seq_file *seq, loff_t *pos)
static void *ext4_mb_seq_groups_next(struct seq_file *seq, void *v, loff_t *pos)
{
struct super_block *sb = seq->private;
- struct ext4_sb_info *sbi = EXT4_SB(sb);
ext4_group_t group;

++*pos;
- if (*pos < 0 || *pos >= sbi->s_groups_count)
+ if (*pos < 0 || *pos >= ext4_get_groups_count(sb))
return NULL;
group = *pos + 1;
return (void *) ((unsigned long) group);
@@ -2587,6 +2587,7 @@ void ext4_mb_update_group_info(struct ext4_group_info *grp, ext4_grpblk_t add)

static int ext4_mb_init_backend(struct super_block *sb)
{
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
ext4_group_t i;
int metalen;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -2598,7 +2599,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
struct ext4_group_desc *desc;

/* This is the number of blocks used by GDT */
- num_meta_group_infos = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) -
+ num_meta_group_infos = (ngroups + EXT4_DESC_PER_BLOCK(sb) -
1) >> EXT4_DESC_PER_BLOCK_BITS(sb);

/*
@@ -2644,7 +2645,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
for (i = 0; i < num_meta_group_infos; i++) {
if ((i + 1) == num_meta_group_infos)
metalen = sizeof(*meta_group_info) *
- (sbi->s_groups_count -
+ (ngroups -
(i << EXT4_DESC_PER_BLOCK_BITS(sb)));
meta_group_info = kmalloc(metalen, GFP_KERNEL);
if (meta_group_info == NULL) {
@@ -2655,7 +2656,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
sbi->s_group_info[i] = meta_group_info;
}

- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
desc = ext4_get_group_desc(sb, i, NULL);
if (desc == NULL) {
printk(KERN_ERR
@@ -2781,13 +2782,14 @@ static void ext4_mb_cleanup_pa(struct ext4_group_info *grp)

int ext4_mb_release(struct super_block *sb)
{
+ ext4_group_t ngroups = ext4_get_groups_count(sb);
ext4_group_t i;
int num_meta_group_infos;
struct ext4_group_info *grinfo;
struct ext4_sb_info *sbi = EXT4_SB(sb);

if (sbi->s_group_info) {
- for (i = 0; i < sbi->s_groups_count; i++) {
+ for (i = 0; i < ngroups; i++) {
grinfo = ext4_get_group_info(sb, i);
#ifdef DOUBLE_CHECK
kfree(grinfo->bb_bitmap);
@@ -2797,7 +2799,7 @@ int ext4_mb_release(struct super_block *sb)
ext4_unlock_group(sb, i);
kfree(grinfo);
}
- num_meta_group_infos = (sbi->s_groups_count +
+ num_meta_group_infos = (ngroups +
EXT4_DESC_PER_BLOCK(sb) - 1) >>
EXT4_DESC_PER_BLOCK_BITS(sb);
for (i = 0; i < num_meta_group_infos; i++)
@@ -4121,7 +4123,7 @@ static void ext4_mb_return_to_preallocation(struct inode *inode,
static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
{
struct super_block *sb = ac->ac_sb;
- ext4_group_t i;
+ ext4_group_t ngroups, i;

printk(KERN_ERR "EXT4-fs: Can't allocate:"
" Allocation context details:\n");
@@ -4145,7 +4147,8 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
ac->ac_found);
printk(KERN_ERR "EXT4-fs: groups: \n");
- for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
+ ngroups = ext4_get_groups_count(sb);
+ for (i = 0; i < ngroups; i++) {
struct ext4_group_info *grp = ext4_get_group_info(sb, i);
struct ext4_prealloc_space *pa;
ext4_grpblk_t start;
@@ -4469,13 +4472,13 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)

static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
- ext4_group_t i;
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);
int ret;
int freed = 0;

trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
sb->s_id, needed);
- for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
+ for (i = 0; i < ngroups && needed > 0; i++) {
ret = ext4_mb_discard_group_preallocations(sb, i, needed);
freed += ret;
needed -= ret;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 241a81e..bcdbb6d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3539,9 +3539,8 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
if (test_opt(sb, MINIX_DF)) {
sbi->s_overhead_last = 0;
} else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
- ext4_group_t ngroups = sbi->s_groups_count, i;
+ ext4_group_t i, ngroups = ext4_get_groups_count(sb);
ext4_fsblk_t overhead = 0;
- smp_rmb();

/*
* Compute the overhead (FS structures). This is constant
--
1.6.0.4


2009-05-03 17:08:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: Avoid races caused by on-line resizing and SMP memory reordering

On Fri 01-05-09 09:55:59, Theodore Ts'o wrote:
> Ext4's on-line resizing adds a new block group and then, only at the
> last step, adjusts s_groups_count. However, it's possible on SMP
> systems that another CPU could see the updated s_groups_count and
> not see the newly initialized data structures for the just-added block
> group. For this reason, it's important to insert an SMP read barrier
> after reading s_groups_count and before reading (for example) the
> new block group descriptors allowed by the increased value of
> s_groups_count.
>
> Unfortunately, we rather blatantly violate this locking protocol
> documented in fs/ext4/resize.c. Fortunately, (1) on-line resizes
> happen relatively rarely, and (2) it seems rare that the filesystem
> code will immediately try to use a just-added block group before any
> memory ordering issues resolve themselves. So problems here are
> apparently hard to hit, which is presumably why ext3 has been
> vulnerable to the same issue for years with no one complaining.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
>
> Here's an updated version which uses a helper function to make sure
> that in the future we don't miss a place where smp_rmb() is needed.
Yeah, it looks nice. Maybe the code looks even better like this ;).

Acked-by: Jan Kara <[email protected]>

Honza

>
> printk(KERN_ERR "EXT4-fs: Can't allocate:"
> " Allocation context details:\n");
> @@ -4145,7 +4147,8 @@ static void ext4_mb_show_ac(struct ext4_allocation_context *ac)
> printk(KERN_ERR "EXT4-fs: %lu scanned, %d found\n", ac->ac_ex_scanned,
> ac->ac_found);
> printk(KERN_ERR "EXT4-fs: groups: \n");
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count; i++) {
> + ngroups = ext4_get_groups_count(sb);
> + for (i = 0; i < ngroups; i++) {
> struct ext4_group_info *grp = ext4_get_group_info(sb, i);
> struct ext4_prealloc_space *pa;
> ext4_grpblk_t start;
> @@ -4469,13 +4472,13 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
>
> static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> {
> - ext4_group_t i;
> + ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> int ret;
> int freed = 0;
>
> trace_mark(ext4_mb_discard_preallocations, "dev %s needed %d",
> sb->s_id, needed);
> - for (i = 0; i < EXT4_SB(sb)->s_groups_count && needed > 0; i++) {
> + for (i = 0; i < ngroups && needed > 0; i++) {
> ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> freed += ret;
> needed -= ret;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 241a81e..bcdbb6d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3539,9 +3539,8 @@ static int ext4_statfs(struct dentry *dentry, struct kstatfs *buf)
> if (test_opt(sb, MINIX_DF)) {
> sbi->s_overhead_last = 0;
> } else if (sbi->s_blocks_last != ext4_blocks_count(es)) {
> - ext4_group_t ngroups = sbi->s_groups_count, i;
> + ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> ext4_fsblk_t overhead = 0;
> - smp_rmb();
>
> /*
> * Compute the overhead (FS structures). This is constant
> --
> 1.6.0.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR