2008-09-15 11:40:19

by Frédéric Bohé

Subject: [PATCH] ext4: fix initialization of UNINIT bitmap blocks

From: Frederic Bohe <[email protected]>

Do not rely on buffer head's uptodate flag to initialize
uninitialized bitmap blocks.

Signed-off-by: Frederic Bohe <[email protected]>
---
This patch makes sure that uninitialized bitmap blocks get initialized.
Here are two test cases where bugs appear because of uninitialized blocks:

1- This test case leads to an uninitialized block bitmap and an error
message from mballoc during the second dd.

dd if=/dev/urandom of=/dev/md0 bs=1M count=300
mkfs.ext4 -t ext4dev /dev/md0 1G
mount -t ext4dev /dev/md0 /mnt/test
resize2fs /dev/md0 2G
dd if=/dev/zero of=/mnt/test/dummy bs=1M count=1500

Note that the first dd makes sure we have random garbage in the
uninitialized blocks. Without it, you could miss the issue, depending on
what was in those blocks before running mkfs.

2- This test case leads to uninitialized inode bitmap blocks, making it
impossible to use all the inodes of the fs.

dd if=/dev/urandom of=/dev/md0 bs=1M count=20
mkfs.ext4 -t ext4dev /dev/md0 10M
mount -t ext4dev /dev/md0 /mnt/test
resize2fs /dev/md0 20M
for i in $(seq 1 3800); do touch /mnt/test/file${i} 2>&1; done

balloc.c | 4 +++-
ialloc.c | 4 +++-
mballoc.c | 4 +++-
3 files changed, 9 insertions(+), 3 deletions(-)

Index: linux/fs/ext4/balloc.c
===================================================================
--- linux.orig/fs/ext4/balloc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux/fs/ext4/balloc.c 2008-09-15 12:58:54.000000000 +0200
@@ -318,9 +318,11 @@ ext4_read_block_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);
Index: linux/fs/ext4/ialloc.c
===================================================================
--- linux.orig/fs/ext4/ialloc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux/fs/ext4/ialloc.c 2008-09-15 11:12:16.000000000 +0200
@@ -115,9 +115,11 @@ ext4_read_inode_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
ext4_init_inode_bitmap(sb, bh, block_group, desc);
Index: linux/fs/ext4/mballoc.c
===================================================================
--- linux.orig/fs/ext4/mballoc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux/fs/ext4/mballoc.c 2008-09-15 13:01:01.000000000 +0200
@@ -785,9 +785,11 @@ static int ext4_mb_init_cache(struct pag
if (bh[i] == NULL)
goto out;

- if (bh_uptodate_or_lock(bh[i]))
+ if (buffer_uptodate(bh[i]) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
continue;

+ lock_buffer(bh[i]);
spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],

--



2008-09-15 12:16:05

by Frédéric Bohé

Subject: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

From: Frederic Bohe <[email protected]>

Do not rely on buffer head's uptodate flag to initialize
uninitialized bitmap blocks.

Signed-off-by: Frederic Bohe <[email protected]>
---
Sorry, there was a copy/paste error in the previous mail!

This patch makes sure that uninitialized bitmap blocks get initialized.
Here are two test cases where bugs appear because of uninitialized blocks:

1- This test case leads to an uninitialized block bitmap and an error
message from mballoc during the second dd.

dd if=/dev/urandom of=/dev/md0 bs=1M count=300
mkfs.ext4 -t ext4dev /dev/md0 1G
mount -t ext4dev /dev/md0 /mnt/test
resize2fs /dev/md0 2G
dd if=/dev/zero of=/mnt/test/dummy bs=1M count=1500

Note that the first dd makes sure we have random garbage in the
uninitialized blocks. Without it, you could miss the issue, depending on
what was in those blocks before running mkfs.

2- This test case leads to uninitialized inode bitmap blocks, making it
impossible to use all the inodes of the fs.

dd if=/dev/urandom of=/dev/md0 bs=1M count=20
mkfs.ext4 -t ext4dev /dev/md0 10M
mount -t ext4dev /dev/md0 /mnt/test
resize2fs /dev/md0 20M
for i in $(seq 1 3800); do touch /mnt/test/file${i} 2>&1; done

balloc.c | 4 +++-
ialloc.c | 4 +++-
mballoc.c | 4 +++-
3 files changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c
===================================================================
--- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/balloc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c 2008-09-15 14:03:04.000000000 +0200
@@ -318,9 +318,11 @@ ext4_read_block_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);
Index: linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/ialloc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c 2008-09-15 11:12:16.000000000 +0200
@@ -115,9 +115,11 @@ ext4_read_inode_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+ if (buffer_uptodate(bh) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
ext4_init_inode_bitmap(sb, bh, block_group, desc);
Index: linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/mballoc.c 2008-09-15 10:59:27.000000000 +0200
+++ linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c 2008-09-15 14:02:44.000000000 +0200
@@ -785,9 +785,11 @@ static int ext4_mb_init_cache(struct pag
if (bh[i] == NULL)
goto out;

- if (bh_uptodate_or_lock(bh[i]))
+ if (buffer_uptodate(bh[i]) &&
+ !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
continue;

+ lock_buffer(bh[i]);
spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],

--


2008-09-15 13:37:39

by Aneesh Kumar K.V

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Mon, Sep 15, 2008 at 02:16:47PM +0200, Frédéric Bohé wrote:
> From: Frederic Bohe <[email protected]>
>
> Do not rely on buffer head's uptodate flag to initialize
> uninitialized bitmap blocks.
>
> Signed-off-by: Frederic Bohe <[email protected]>
> ---
> Sorry there was a copy/paste error in the previous mail !
>
> This patch makes sure to initialize uninited bitmap blocks.
> These are two test cases where bugs appear because of uninited blocks :
>
> 1- This test case lead to uninited block bitmap and an error message
> from the mballocator during the second dd.
>
> dd if=/dev/urandom of=/dev/md0 bs=1M count=300
> mkfs.ext4 -t ext4dev /dev/md0 1G
> mount -t ext4dev /dev/md0 /mnt/test
> resize2fs /dev/md0 2G
> dd if=/dev/zero of=/mnt/test/dummy bs=1M count=1500
>
> Note that the first dd is to make sure we have random garbage in the
> uninited blocks. If not, you could miss the issue depending what was in
> those blocks before running mkfs.
>
> 2- This test case lead to uninited inode bitmap blocks, making it
> impossible to use all the inodes of the fs.
>
> dd if=/dev/urandom of=/dev/md0 bs=1M count=20
> mkfs.ext4 -t ext4dev /dev/md0 10M
> mount -t ext4dev /dev/md0 /mnt/test
> resize2fs /dev/md0 20M
> for i in $(seq 1 3800); do touch /mnt/test/file${i} 2>&1; done
>
> balloc.c | 4 +++-
> ialloc.c | 4 +++-
> mballoc.c | 4 +++-
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c
> ===================================================================
> --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/balloc.c 2008-09-15 10:59:27.000000000 +0200
> +++ linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c 2008-09-15 14:03:04.000000000 +0200
> @@ -318,9 +318,11 @@ ext4_read_block_bitmap(struct super_bloc
> block_group, bitmap_blk);
> return NULL;
> }
> - if (bh_uptodate_or_lock(bh))
> + if (buffer_uptodate(bh) &&
> + !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
> return bh;
>
> + lock_buffer(bh);
> spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> ext4_init_block_bitmap(sb, bh, block_group, desc);

Why? I guess resize should mark those buffer_heads as not uptodate so
that we reinitialize the block bitmap later. The above change will
result in calling ext4_init_block_bitmap every time we do a
read_block_bitmap on an uninit group.




> Index: linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c
> ===================================================================
> --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/ialloc.c 2008-09-15 10:59:27.000000000 +0200
> +++ linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c 2008-09-15 11:12:16.000000000 +0200
> @@ -115,9 +115,11 @@ ext4_read_inode_bitmap(struct super_bloc
> block_group, bitmap_blk);
> return NULL;
> }
> - if (bh_uptodate_or_lock(bh))
> + if (buffer_uptodate(bh) &&
> + !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
> return bh;
>
> + lock_buffer(bh);
> spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> ext4_init_inode_bitmap(sb, bh, block_group, desc);
> Index: linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/mballoc.c 2008-09-15 10:59:27.000000000 +0200
> +++ linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c 2008-09-15 14:02:44.000000000 +0200
> @@ -785,9 +785,11 @@ static int ext4_mb_init_cache(struct pag
> if (bh[i] == NULL)
> goto out;
>
> - if (bh_uptodate_or_lock(bh[i]))
> + if (buffer_uptodate(bh[i]) &&
> + !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
> continue;
>
> + lock_buffer(bh[i]);
> spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> ext4_init_block_bitmap(sb, bh[i],
>

-aneesh

2008-09-15 14:29:12

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Monday 15 September 2008 at 19:06 +0530, Aneesh Kumar K.V wrote:
> On Mon, Sep 15, 2008 at 02:16:47PM +0200, Frédéric Bohé wrote:
> > From: Frederic Bohe <[email protected]>
> >
> > Do not rely on buffer head's uptodate flag to initialize
> > uninitialized bitmap blocks.
> >
> > Signed-off-by: Frederic Bohe <[email protected]>
> > ---
> > Sorry there was a copy/paste error in the previous mail !
> >
> > This patch makes sure to initialize uninited bitmap blocks.
> > These are two test cases where bugs appear because of uninited blocks :
> >
> > 1- This test case lead to uninited block bitmap and an error message
> > from the mballocator during the second dd.
> >
> > dd if=/dev/urandom of=/dev/md0 bs=1M count=300
> > mkfs.ext4 -t ext4dev /dev/md0 1G
> > mount -t ext4dev /dev/md0 /mnt/test
> > resize2fs /dev/md0 2G
> > dd if=/dev/zero of=/mnt/test/dummy bs=1M count=1500
> >
> > Note that the first dd is to make sure we have random garbage in the
> > uninited blocks. If not, you could miss the issue depending what was in
> > those blocks before running mkfs.
> >
> > 2- This test case lead to uninited inode bitmap blocks, making it
> > impossible to use all the inodes of the fs.
> >
> > dd if=/dev/urandom of=/dev/md0 bs=1M count=20
> > mkfs.ext4 -t ext4dev /dev/md0 10M
> > mount -t ext4dev /dev/md0 /mnt/test
> > resize2fs /dev/md0 20M
> > for i in $(seq 1 3800); do touch /mnt/test/file${i} 2>&1; done
> >
> > balloc.c | 4 +++-
> > ialloc.c | 4 +++-
> > mballoc.c | 4 +++-
> > 3 files changed, 9 insertions(+), 3 deletions(-)
> >
> > Index: linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c
> > ===================================================================
> > --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/balloc.c 2008-09-15 10:59:27.000000000 +0200
> > +++ linux-2.6.27-rc5+patch_queue/fs/ext4/balloc.c 2008-09-15 14:03:04.000000000 +0200
> > @@ -318,9 +318,11 @@ ext4_read_block_bitmap(struct super_bloc
> > block_group, bitmap_blk);
> > return NULL;
> > }
> > - if (bh_uptodate_or_lock(bh))
> > + if (buffer_uptodate(bh) &&
> > + !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
> > return bh;
> >
> > + lock_buffer(bh);
> > spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> > if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> > ext4_init_block_bitmap(sb, bh, block_group, desc);
>
> Why ? I guess resize should mark those buffer_heads as not uptodate so
> that we do a reinit of block bitmap again later. The above change will
> result in calling ext4_init_block_bitmap everytime we do a
> read_block_bitmap on an uninit group

Thanks for your comment, Aneesh. I thought ext4_init_block_bitmap was
setting the EXT4_BG_BLOCK_UNINIT flag, but it seems that is not the case.
I will try to fix it on the resize side.


>
>
>
>
> > Index: linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c
> > ===================================================================
> > --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/ialloc.c 2008-09-15 10:59:27.000000000 +0200
> > +++ linux-2.6.27-rc5+patch_queue/fs/ext4/ialloc.c 2008-09-15 11:12:16.000000000 +0200
> > @@ -115,9 +115,11 @@ ext4_read_inode_bitmap(struct super_bloc
> > block_group, bitmap_blk);
> > return NULL;
> > }
> > - if (bh_uptodate_or_lock(bh))
> > + if (buffer_uptodate(bh) &&
> > + !(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)))
> > return bh;
> >
> > + lock_buffer(bh);
> > spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> > if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> > ext4_init_inode_bitmap(sb, bh, block_group, desc);
> > Index: linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c
> > ===================================================================
> > --- linux-2.6.27-rc5+patch_queue.orig/fs/ext4/mballoc.c 2008-09-15 10:59:27.000000000 +0200
> > +++ linux-2.6.27-rc5+patch_queue/fs/ext4/mballoc.c 2008-09-15 14:02:44.000000000 +0200
> > @@ -785,9 +785,11 @@ static int ext4_mb_init_cache(struct pag
> > if (bh[i] == NULL)
> > goto out;
> >
> > - if (bh_uptodate_or_lock(bh[i]))
> > + if (buffer_uptodate(bh[i]) &&
> > + !(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)))
> > continue;
> >
> > + lock_buffer(bh[i]);
> > spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
> > if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> > ext4_init_block_bitmap(sb, bh[i],
> >
>
> -aneesh
>

2008-09-18 13:43:59

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

I am not very familiar with buffer behavior, but I think I have
understood what happens. Correct me if I am wrong.
Let's look at the second test case, which was:

dd if=/dev/urandom of=/dev/md0 bs=1M count=20
mkfs.ext4 -t ext4dev /dev/md0 10M
mount -t ext4dev /dev/md0 /mnt/test
resize2fs /dev/md0 20M
for i in $(seq 1 3800); do touch /mnt/test/file${i} 2>&1; done
touch: cannot touch `/mnt/test/file3188': No space left on device
touch: cannot touch `/mnt/test/file3189': No space left on device
...
...


The issue here is that you can't use all the inodes of the second group
of the fs.

This happens because resize2fs makes a call to ext2fs_read_bitmaps. This
function reads all bitmaps while taking care not to read the
uninitialized bitmaps. This works well as long as the fs block size is
equal to the page size. But in the above test case, the fs uses 1k blocks
and we have an issue.

That's because the "read" issued by ext2fs_read_bitmaps ends up in a
call to the kernel's block_read_full_page function. So when a single
bitmap block is asked for, 4 blocks (for a 1k-block fs on x86) are
actually read (including the uninitialized ones) and their respective
buffers are set to uptodate.
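
The arithmetic behind "4 blocks" can be sketched as follows. This is only
an illustration, assuming a 4096-byte page and blocks aligned to pages from
the start of the device; the helper names are made up for this example and
are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Number of filesystem blocks that fit in one page-cache page. */
static size_t blocks_per_page(size_t page_size, size_t block_size)
{
    assert(block_size > 0 && page_size % block_size == 0);
    return page_size / block_size;
}

/*
 * First block number of the page that contains `block`.
 * Every block from this one up to (this one + blocks_per_page - 1)
 * is read together when the whole page is read.
 */
static unsigned long first_block_in_page(unsigned long block,
                                         size_t page_size, size_t block_size)
{
    size_t n = blocks_per_page(page_size, block_size);

    return block - (block % n);
}
```

With 1k blocks, asking for block 262 pulls in blocks 260 to 263; with 4k
blocks the page holds exactly one block, so no neighbour is read.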

As we rely on the buffer's uptodate flag to decide whether to initialize
the buffer, it may happen that certain bitmap blocks are not initialized
at all. Their buffers then contain the random garbage that was present on
the disk prior to the mkfs (in the above test case, the inode bitmap of
the second group is full of random bits, so I can't use all of its
inodes).

If the bitmap block corresponding to this buffer is later changed, its
UNINIT flag will be cleared and the content of the buffer written to the
disk, including the garbage.
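
A toy userspace model of the two checks may make the sequence easier to
follow. Everything here (the toy_* names, the 8-byte block, zero-filling
as a stand-in for ext4_init_*_bitmap) is an assumption made for
illustration; it is not the kernel code, only the shape of the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_BLOCK_SIZE 8    /* bytes per toy bitmap block */
#define TOY_BG_UNINIT  0x1  /* stands in for EXT4_BG_*_UNINIT */

struct toy_buffer {
    bool uptodate;                       /* models the buffer's uptodate bit */
    unsigned char data[TOY_BLOCK_SIZE];  /* cached block contents */
};

struct toy_group_desc {
    unsigned flags;                      /* models desc->bg_flags */
};

/* A read through the block device: disk contents land in the cache. */
static void toy_device_read(struct toy_buffer *bh, const unsigned char *disk)
{
    memcpy(bh->data, disk, TOY_BLOCK_SIZE);
    bh->uptodate = true;
}

/* Fill the buffer: initialize if the group is UNINIT, else read the disk. */
static void toy_fill(struct toy_buffer *bh, const struct toy_group_desc *desc,
                     const unsigned char *disk)
{
    if (desc->flags & TOY_BG_UNINIT)
        memset(bh->data, 0, TOY_BLOCK_SIZE);  /* simplified initialization */
    else
        memcpy(bh->data, disk, TOY_BLOCK_SIZE);
    bh->uptodate = true;
}

/* Old check: an uptodate buffer is returned untouched. */
static void toy_read_bitmap_old(struct toy_buffer *bh,
                                const struct toy_group_desc *desc,
                                const unsigned char *disk)
{
    if (bh->uptodate)
        return;
    toy_fill(bh, desc, disk);
}

/* Check from the patch: uptodate is trusted only if the group is not UNINIT. */
static void toy_read_bitmap_patched(struct toy_buffer *bh,
                                    const struct toy_group_desc *desc,
                                    const unsigned char *disk)
{
    if (bh->uptodate && !(desc->flags & TOY_BG_UNINIT))
        return;
    toy_fill(bh, desc, disk);
}
```

If toy_device_read() runs first on a garbage-filled disk block of an UNINIT
group, the old check leaves the garbage in the buffer while the patched
check overwrites it. The model also shows the cost of the patched check:
it refills the buffer on every call for as long as the flag stays set.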


I am a bit lost on how to fix this. Aneesh was right, I think it's an
ext2fs_read_bitmaps bug, not a kernel bug. I guess we need a userland
function to read a single block whatever the block size and page size
are. I tried using the O_DIRECT flag, but without success. Any
ideas or suggestions?

Fred



2008-09-21 03:15:55

by Theodore Ts'o

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Thu, Sep 18, 2008 at 03:45:14PM +0200, Frédéric Bohé wrote:
> The issue here is that you can't use all inode of the second group of
> the fs.
>
> This happens because resize2fs make a call to ext2fs_read_bitmaps. This
> function reads all bitmaps while paying attention not to read the
> uninited bitmap. This works well as long as the fs block size is equal
> to the page size. But in the above test case, the fs use 1k blocks and
> we have an issue.
>
> That's because the "read" function issued by ext2fs_read_bitmaps is a
> call to kernel's block_read_full_page function. So when a single bitmap
> block is asked for, 4 blocks (for 1k blocks fs on x86) are actually read
> (including the uninited ones) and their respective buffer set to
> uptodate.
>
> As we rely on the buffer's uptodate flags to initialize or not this
> buffer, it may happen that certain bitmap blocks are not initialized at
> all. So their buffer contains the random garbage that was present on the
> disk prior to the mkfs ( In the above test case, the inode bitmap of the
> second group is full a random bits so I can't use all of its inodes ).

Actually that's the problem. We shouldn't be relying on the buffer's
uptodate flag as a hint to tell mballoc to reload the buddy bitmaps.
Unfortunately I didn't catch this problem, because I didn't carefully
audit commit 5f21b0e6 before it went in, but it's seriously buggy in
trying to overload the buffer's uptodate flag for anything other
than error handling.

> I am a bit lost on how to fix this. Aneesh was right, I think it's an
> ext2fs_read_bitmaps bug, not a kernel bug. I guess we need a userland
> function to read a single block whatever the block size and page size
> are. I've made a try using O_DIRECT flag but I was unsuccessful. Any
> ideas/suggestions ?

No!!!! Think about it. It's always fair for userspace to read from
the block device. If this causes the kernel to blow up, then it's a
kernel bug, not a userspace bug. And it is a *perfect* demonstration
why overloading the uptodate flag by using it for *anything* other
than error signalling from the buffer I/O layer is wrong and horribly
fragile.

Commit 5f21b0e6 should have been split up into separate patches, so it
would have been easier to audit. (I'm guessing though that I somehow
never audited it, since my Signed-off-by wasn't on the patch, and it
should have been before I pushed it to Linus). As far as what it was
trying to do, how this probably should have been solved is that
mballoc.c should have a new function which causes the out-of-date
buddy bitmap to be removed, and to have resize.c call that function,
instead of playing games with the uptodate flag.

- Ted
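
The idea of an explicit invalidation call, instead of clearing the
uptodate flag, can be sketched with a small userspace model. This is an
assumption-laden illustration: the demo_* names and the per-group arrays
are invented here, and the real mballoc buddy cache is organized
differently.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_GROUPS 16

/* Toy stand-in for cached buddy data, one slot per block group. */
struct demo_buddy_cache {
    bool valid[DEMO_MAX_GROUPS];      /* true when the cached data may be used */
    unsigned loads[DEMO_MAX_GROUPS];  /* how many times each group was rebuilt */
};

/* Explicitly drop the cached buddy state of one group.
 * In this model, a resize path would call this after changing the group. */
static void demo_buddy_invalidate(struct demo_buddy_cache *c, unsigned group)
{
    assert(group < DEMO_MAX_GROUPS);
    c->valid[group] = false;
}

/* Make sure the group's buddy data is present, rebuilding only if invalid. */
static void demo_buddy_get(struct demo_buddy_cache *c, unsigned group)
{
    assert(group < DEMO_MAX_GROUPS);
    if (!c->valid[group]) {
        c->loads[group]++;  /* stands in for regenerating the buddy data */
        c->valid[group] = true;
    }
}
```

The point of the design is that validity is tracked by state the allocator
owns, so a read of the block device from userspace cannot change it.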

2008-09-22 08:09:04

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Saturday 20 September 2008 at 20:44 -0400, Theodore Tso wrote:
> On Thu, Sep 18, 2008 at 03:45:14PM +0200, Frédéric Bohé wrote:
> > The issue here is that you can't use all inode of the second group of
> > the fs.
> >
> > This happens because resize2fs make a call to ext2fs_read_bitmaps. This
> > function reads all bitmaps while paying attention not to read the
> > uninited bitmap. This works well as long as the fs block size is equal
> > to the page size. But in the above test case, the fs use 1k blocks and
> > we have an issue.
> >
> > That's because the "read" function issued by ext2fs_read_bitmaps is a
> > call to kernel's block_read_full_page function. So when a single bitmap
> > block is asked for, 4 blocks (for 1k blocks fs on x86) are actually read
> > (including the uninited ones) and their respective buffer set to
> > uptodate.
> >
> > As we rely on the buffer's uptodate flags to initialize or not this
> > buffer, it may happen that certain bitmap blocks are not initialized at
> > all. So their buffer contains the random garbage that was present on the
> > disk prior to the mkfs ( In the above test case, the inode bitmap of the
> > second group is full a random bits so I can't use all of its inodes ).
>
> Actually that's the problem. We shouldn't be relying on the buffer's
> uptodate flags as a hint to tell mballoc to reload the buddy bitmaps.
> Unfortunately I didn't notice this problem by not carefully auditing
> commit 5f21b0e6 before it went in, but it's seriously buggy by trying
> to overload the use of the buffer's uptodate flag for anything other
> than error handling.
>

Maybe I missed something, but I thought the bug I am talking about here
is related neither to the buddy bitmaps nor directly to mballoc. Sorry,
I was not clear enough. In fact, it happens even without using mballoc.
It is related to the uninit feature on filesystems using blocks which are
smaller than the page size. If any userland process calls the
ext2fs_read_bitmaps function (or tries to read a bitmap block directly),
you may end up with those buffers full of garbage. This affects both
block bitmap buffers and inode bitmap buffers.



> > I am a bit lost on how to fix this. Aneesh was right, I think it's an
> > ext2fs_read_bitmaps bug, not a kernel bug. I guess we need a userland
> > function to read a single block whatever the block size and page size
> > are. I've made a try using O_DIRECT flag but I was unsuccessful. Any
> > ideas/suggestions ?
>
> No!!!! Think about it. It's always fair for userspace to read from
> the block device. If this causes the kernel to blow up, then it's a
> kernel bug, not a userspace bug. And it is a *perfect* demonstration
> why overloading the uptodate flag by using it for *anything* other
> than error signalling from the buffer I/O layer is wrong and horribly
> fragile.

You are probably right, so maybe the patch I sent at the beginning of
this thread makes sense?

>
> Commit 5f21b0e6 should have been split up into separate patches, so it
> would have been easier to audit. (I'm guessing though that I somehow
> never audited it, since my Signed-off-by wasn't on the patch, and it
> should have been before I pushed it to Linus). As far as what it was
> trying to do, how this probably should have been solved is that
> mballoc.c should have a new function which causes the out-of-date
> buddy bitmap to be removed, and to have resize.c call that function,
> instead of playing games with the uptodate flag.

Thank you for your comments about this commit. I will take a look at it
later.



2008-09-22 08:51:35

by Aneesh Kumar K.V

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Mon, Sep 22, 2008 at 10:09:57AM +0200, Frédéric Bohé wrote:
> On Saturday 20 September 2008 at 20:44 -0400, Theodore Tso wrote:
> > On Thu, Sep 18, 2008 at 03:45:14PM +0200, Frédéric Bohé wrote:
> > > The issue here is that you can't use all inode of the second group of
> > > the fs.
> > >
> > > This happens because resize2fs make a call to ext2fs_read_bitmaps. This
> > > function reads all bitmaps while paying attention not to read the
> > > uninited bitmap. This works well as long as the fs block size is equal
> > > to the page size. But in the above test case, the fs use 1k blocks and
> > > we have an issue.
> > >
> > > That's because the "read" function issued by ext2fs_read_bitmaps is a
> > > call to kernel's block_read_full_page function. So when a single bitmap
> > > block is asked for, 4 blocks (for 1k blocks fs on x86) are actually read
> > > (including the uninited ones) and their respective buffer set to
> > > uptodate.
> > >
> > > As we rely on the buffer's uptodate flags to initialize or not this
> > > buffer, it may happen that certain bitmap blocks are not initialized at
> > > all. So their buffer contains the random garbage that was present on the
> > > disk prior to the mkfs ( In the above test case, the inode bitmap of the
> > > second group is full a random bits so I can't use all of its inodes ).
> >
> > Actually that's the problem. We shouldn't be relying on the buffer's
> > uptodate flags as a hint to tell mballoc to reload the buddy bitmaps.
> > Unfortunately I didn't notice this problem by not carefully auditing
> > commit 5f21b0e6 before it went in, but it's seriously buggy by trying
> > to overload the use of the buffer's uptodate flag for anything other
> > than error handling.
> >
>
> Maybe I missed something, but I thought the bug I am talking about here,
> is neither related to buddy nor directly to mballoc. Sorry, I was not
> clear enough. In fact, it happens even without using mballoc. It is
> related to uninit feature with filesystems using blocks which are
> smaller than page size. If any userland process call ext2fs_read_bitmaps
> function (or try to read a bitmap block directly), you may end up with
> those buffers full of garbage. It concerns either block bitmap buffers
> or inode bitmap buffers.
>
>
>
> > > I am a bit lost on how to fix this. Aneesh was right, I think it's an
> > > ext2fs_read_bitmaps bug, not a kernel bug. I guess we need a userland
> > > function to read a single block whatever the block size and page size
> > > are. I've made a try using O_DIRECT flag but I was unsuccessful. Any
> > > ideas/suggestions ?
> >
> > No!!!! Think about it. It's always fair for userspace to read from
> > the block device. If this causes the kernel to blow up, then it's a
> > kernel bug, not a userspace bug. And it is a *perfect* demonstration
> > why overloading the uptodate flag by using it for *anything* other
> > than error signalling from the buffer I/O layer is wrong and horribly
> > fragile.
>
> You are probably right, so maybe the patch I sent at the beginning of
> this thread makes sense ?
>

What you can do is make ext4_group_info generic for both mballoc and
oldalloc. We can then add a bg_flags field to the in-memory
ext4_group_info that would indicate whether the group is initialized or
not. Here, initialized for an UNINIT_GROUP indicates that we have done
ext4_init_block_bitmap on the buffer_head. Then, instead of depending on
the buffer_head uptodate flag, we can check the ext4_group_info bg_flags
and decide whether the block/inode bitmap needs to be initialized.

-aneesh
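
The suggestion above can be sketched in a few lines of userspace C. This is only an illustration under stated assumptions: `group_info_model`, the `GRP_*` flag names and both helpers are hypothetical and do not exist in the kernel, and the real ext4_group_info has other fields and needs locking that is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory state bits, one set per block group. */
enum {
	GRP_BLOCK_BITMAP_INIT = 1u << 0,	/* block bitmap buffer was initialized */
	GRP_INODE_BITMAP_INIT = 1u << 1,	/* inode bitmap buffer was initialized */
};

/* Stand-in for the in-memory per-group structure (not the kernel's). */
struct group_info_model {
	unsigned int bg_flags;
};

/* Record that the given bitmap of a group has been initialized in memory. */
static void group_mark_init(struct group_info_model *groups, size_t nr_groups,
			    size_t group, unsigned int which)
{
	assert(group < nr_groups);
	groups[group].bg_flags |= which;
}

/*
 * Decide whether a bitmap still has to be initialized: only when the
 * on-disk descriptor says UNINIT and the in-memory flag is not yet set.
 */
static bool group_needs_init(const struct group_info_model *groups,
			     size_t nr_groups, size_t group,
			     unsigned int which, bool desc_uninit)
{
	assert(group < nr_groups);
	return desc_uninit && !(groups[group].bg_flags & which);
}
```

In this model a caller would consult `group_needs_init()` where the current code consults the buffer's uptodate flag, and call `group_mark_init()` once the bitmap has been filled in.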

2008-09-22 09:31:42

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Monday 22 September 2008 at 14:17 +0530, Aneesh Kumar K.V wrote:
> On Mon, Sep 22, 2008 at 10:09:57AM +0200, Frédéric Bohé wrote:
> > On Saturday 20 September 2008 at 20:44 -0400, Theodore Tso wrote:
> > > On Thu, Sep 18, 2008 at 03:45:14PM +0200, Frédéric Bohé wrote:
> > > > The issue here is that you can't use all inode of the second group of
> > > > the fs.
> > > >
> > > > This happens because resize2fs make a call to ext2fs_read_bitmaps. This
> > > > function reads all bitmaps while paying attention not to read the
> > > > uninited bitmap. This works well as long as the fs block size is equal
> > > > to the page size. But in the above test case, the fs use 1k blocks and
> > > > we have an issue.
> > > >
> > > > That's because the "read" function issued by ext2fs_read_bitmaps is a
> > > > call to kernel's block_read_full_page function. So when a single bitmap
> > > > block is asked for, 4 blocks (for 1k blocks fs on x86) are actually read
> > > > (including the uninited ones) and their respective buffer set to
> > > > uptodate.
> > > >
> > > > As we rely on the buffer's uptodate flags to initialize or not this
> > > > buffer, it may happen that certain bitmap blocks are not initialized at
> > > > all. So their buffer contains the random garbage that was present on the
> > > > disk prior to the mkfs ( In the above test case, the inode bitmap of the
> > > > second group is full a random bits so I can't use all of its inodes ).
> > >
> > > Actually that's the problem. We shouldn't be relying on the buffer's
> > > uptodate flags as a hint to tell mballoc to reload the buddy bitmaps.
> > > Unfortunately I didn't notice this problem by not carefully auditing
> > > commit 5f21b0e6 before it went in, but it's seriously buggy by trying
> > > to overload the use of the buffer's uptodate flag for anything other
> > > than error handling.
> > >
> >
> > Maybe I missed something, but I thought the bug I am talking about here,
> > is neither related to buddy nor directly to mballoc. Sorry, I was not
> > clear enough. In fact, it happens even without using mballoc. It is
> > related to uninit feature with filesystems using blocks which are
> > smaller than page size. If any userland process call ext2fs_read_bitmaps
> > function (or try to read a bitmap block directly), you may end up with
> > those buffers full of garbage. It concerns either block bitmap buffers
> > or inode bitmap buffers.
> >
> >
> >
> > > > I am a bit lost on how to fix this. Aneesh was right, I think it's an
> > > > ext2fs_read_bitmaps bug, not a kernel bug. I guess we need a userland
> > > > function to read a single block whatever the block size and page size
> > > > are. I've made a try using O_DIRECT flag but I was unsuccessful. Any
> > > > ideas/suggestions ?
> > >
> > > No!!!! Think about it. It's always fair for userspace to read from
> > > the block device. If this causes the kernel to blow up, then it's a
> > > kernel bug, not a userspace bug. And it is a *perfect* demonstration
> > > why overloading the uptodate flag by using it for *anything* other
> > > than error signalling from the buffer I/O layer is wrong and horribly
> > > fragile.
> >
> > You are probably right, so maybe the patch I sent at the beginning of
> > this thread makes sense ?
> >
>
> What you can do is make ext4_group_info generic for both mballoc and
> oldalloc. We can then add bg_flag to the in memory ext4_group_info
> that would indicate whether the group is initialized or not. Here
> initialized for an UNINIT_GROUP indicate we have done
> ext4_init_block_bitmap on the buffer_head. Then
> instead of depending on the buffer_head uptodate flag we can check
> for the ext4_group_info bg_flags and decided whether the block/inode
> bitmap need to be initialized.
>

That makes sense! I agree with you: we need an additional in-memory
flag to know whether buffers are initialized or not. However, making
ext4_group_info generic will lead to unneeded memory consumption for
oldalloc. Maybe a simple independent bit array could do the trick. Is
there any advantage to reusing ext4_group_info?



2008-09-23 23:14:25

by Andreas Dilger

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Sep 22, 2008 11:32 +0200, Frédéric Bohé wrote:
> On Monday 22 September 2008 at 14:17 +0530, Aneesh Kumar K.V wrote:
> > What you can do is make ext4_group_info generic for both mballoc and
> > oldalloc. We can then add bg_flag to the in memory ext4_group_info
> > that would indicate whether the group is initialized or not. Here
> > initialized for an UNINIT_GROUP indicate we have done
> > ext4_init_block_bitmap on the buffer_head. Then
> > instead of depending on the buffer_head uptodate flag we can check
> > for the ext4_group_info bg_flags and decided whether the block/inode
> > bitmap need to be initialized.
>
> That makes sense ! I agree with you, we need an additional in-memory
> flag to know whether buffers are initialized or not. Anyway, making
> ext4_group_info generic will lead to unneeded memory consumption for
> oldalloc. Maybe a simple independent bits array could do the trick. Is
> there any advantage to re-use ext4_group_info ?

For ext4 I think 99% of users will use mballoc, and the reduction in code
complexity is itself useful. I don't think the in-memory overhead is very
much, maybe 1 MB per TB of filesystem space.

Also, if you are considering this approach (to initialize the in-memory
bitmaps at mount time) they should be written to disk even if unused.
Please also consider doing the inode table zeroing at the same time.
This would allow uninit_bg to avoid doing it at mke2fs time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
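
The orders of magnitude in this exchange can be checked with a little arithmetic. The sketch below assumes the default geometry in which one bitmap block covers a whole group (so a group holds 8 * block_size blocks) and ignores everything else; the helper names are made up for the example. Under that assumption a 1 TiB filesystem with 4 KiB blocks has 8192 groups, so "1 MB per TB" corresponds to roughly 128 bytes per group, while a one-bit-per-group array needs 1 KiB per bitmap type.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks per group when a single bitmap block maps the whole group
 * (assumed default geometry: one bit per block, 8 bits per byte). */
static uint64_t blocks_per_group(uint64_t block_size)
{
	return block_size * 8;
}

/* Number of block groups needed to cover fs_bytes, rounding up. */
static uint64_t group_count(uint64_t fs_bytes, uint64_t block_size)
{
	uint64_t group_bytes = blocks_per_group(block_size) * block_size;

	assert(group_bytes != 0);
	return (fs_bytes + group_bytes - 1) / group_bytes;
}

/* Bytes needed for a bit array holding one bit per group. */
static uint64_t bit_array_bytes(uint64_t groups)
{
	return (groups + 7) / 8;
}
```

With 1 KiB blocks the group count for the same 1 TiB rises to 131072, and the bit array to 16 KiB, which is still small next to any per-group structure.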

2008-09-24 12:38:12

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Monday 22 September 2008 at 11:32 +0200, Frédéric Bohé wrote:
> On Monday 22 September 2008 at 14:17 +0530, Aneesh Kumar K.V wrote:
> > What you can do is make ext4_group_info generic for both mballoc and
> > oldalloc. We can then add bg_flag to the in memory ext4_group_info
> > that would indicate whether the group is initialized or not. Here
> > initialized for an UNINIT_GROUP indicate we have done
> > ext4_init_block_bitmap on the buffer_head. Then
> > instead of depending on the buffer_head uptodate flag we can check
> > for the ext4_group_info bg_flags and decided whether the block/inode
> > bitmap need to be initialized.
> >
>
> That makes sense ! I agree with you, we need an additional in-memory
> flag to know whether buffers are initialized or not. Anyway, making
> ext4_group_info generic will lead to unneeded memory consumption for
> oldalloc. Maybe a simple independent bits array could do the trick. Is
> there any advantage to re-use ext4_group_info ?
>

This is an implementation of what I was talking about. Please let me know your comments.

Index: linux-2.6.27-rc6+patch_queue/fs/ext4/balloc.c
===================================================================
--- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/balloc.c 2008-09-23 15:04:39.000000000 +0200
+++ linux-2.6.27-rc6+patch_queue/fs/ext4/balloc.c 2008-09-23 15:39:38.000000000 +0200
@@ -175,6 +175,8 @@ unsigned ext4_init_block_bitmap(struct s
*/
mark_bitmap_end(group_blocks, sb->s_blocksize * 8, bh->b_data);
}
+
+ ext4_set_bit(block_group, sbi->s_block_bitmap_buffer_state);
return free_blocks - ext4_group_used_meta_blocks(sb, block_group);
}

@@ -318,9 +320,13 @@ ext4_read_block_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+
+ if (buffer_uptodate(bh) && ext4_test_bit(block_group,
+ EXT4_SB(sb)->s_block_bitmap_buffer_state))
return bh;

+ lock_buffer(bh);
+
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh, block_group, desc);
@@ -328,7 +334,9 @@ ext4_read_block_bitmap(struct super_bloc
unlock_buffer(bh);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
return bh;
- }
+ } else
+ ext4_set_bit(block_group,
+ EXT4_SB(sb)->s_block_bitmap_buffer_state);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (bh_submit_read(bh) < 0) {
put_bh(bh);
Index: linux-2.6.27-rc6+patch_queue/fs/ext4/ext4_sb.h
===================================================================
--- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/ext4_sb.h 2008-09-23 15:07:28.000000000 +0200
+++ linux-2.6.27-rc6+patch_queue/fs/ext4/ext4_sb.h 2008-09-23 15:40:57.000000000 +0200
@@ -147,6 +147,14 @@ struct ext4_sb_info {

unsigned int s_log_groups_per_flex;
struct flex_groups *s_flex_groups;
+
+ /*
+ * Flag for the state of the bitmaps buffers
+ * 0 = unknown or uninitialized
+ * 1 = initialized
+ */
+ char *s_block_bitmap_buffer_state;
+ char *s_inode_bitmap_buffer_state;
};

#endif /* _EXT4_SB */
Index: linux-2.6.27-rc6+patch_queue/fs/ext4/ialloc.c
===================================================================
--- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/ialloc.c 2008-09-23 15:09:15.000000000 +0200
+++ linux-2.6.27-rc6+patch_queue/fs/ext4/ialloc.c 2008-09-23 15:41:24.000000000 +0200
@@ -86,6 +86,7 @@ unsigned ext4_init_inode_bitmap(struct s
memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), EXT4_BLOCKS_PER_GROUP(sb),
bh->b_data);
+ ext4_set_bit(block_group, sbi->s_inode_bitmap_buffer_state);

return EXT4_INODES_PER_GROUP(sb);
}
@@ -115,9 +116,12 @@ ext4_read_inode_bitmap(struct super_bloc
block_group, bitmap_blk);
return NULL;
}
- if (bh_uptodate_or_lock(bh))
+
+ if (buffer_uptodate(bh) && ext4_test_bit(block_group,
+ EXT4_SB(sb)->s_inode_bitmap_buffer_state))
return bh;

+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
ext4_init_inode_bitmap(sb, bh, block_group, desc);
@@ -125,7 +129,9 @@ ext4_read_inode_bitmap(struct super_bloc
unlock_buffer(bh);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
return bh;
- }
+ } else
+ ext4_set_bit(block_group,
+ EXT4_SB(sb)->s_inode_bitmap_buffer_state);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
if (bh_submit_read(bh) < 0) {
put_bh(bh);
Index: linux-2.6.27-rc6+patch_queue/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/mballoc.c 2008-09-23 15:11:48.000000000 +0200
+++ linux-2.6.27-rc6+patch_queue/fs/ext4/mballoc.c 2008-09-24 14:27:55.000000000 +0200
@@ -785,9 +785,12 @@ static int ext4_mb_init_cache(struct pag
if (bh[i] == NULL)
goto out;

- if (bh_uptodate_or_lock(bh[i]))
+ if (buffer_uptodate(bh[i]) && ext4_test_bit(first_group + i,
+ EXT4_SB(sb)->s_block_bitmap_buffer_state))
continue;

+ lock_buffer(bh[i]);
+
spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
ext4_init_block_bitmap(sb, bh[i],
@@ -796,7 +799,9 @@ static int ext4_mb_init_cache(struct pag
unlock_buffer(bh[i]);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
continue;
- }
+ } else
+ ext4_set_bit(first_group + i,
+ EXT4_SB(sb)->s_block_bitmap_buffer_state);
spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
get_bh(bh[i]);
bh[i]->b_end_io = end_buffer_read_sync;
Index: linux-2.6.27-rc6+patch_queue/fs/ext4/super.c
===================================================================
--- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/super.c 2008-09-23 15:16:15.000000000 +0200
+++ linux-2.6.27-rc6+patch_queue/fs/ext4/super.c 2008-09-24 14:28:42.000000000 +0200
@@ -2219,6 +2219,20 @@ static int ext4_fill_super(struct super_
printk(KERN_ERR "EXT4-fs: not enough memory\n");
goto failed_mount;
}
+ sbi->s_block_bitmap_buffer_state = kzalloc((sbi->s_groups_count +
+ le16_to_cpu(es->s_reserved_gdt_blocks) +
+ 7) / 8, GFP_KERNEL);
+ if (sbi->s_block_bitmap_buffer_state == NULL) {
+ printk(KERN_ERR "EXT4-fs: not enough memory\n");
+ goto failed_mount;
+ }
+ sbi->s_inode_bitmap_buffer_state = kzalloc((sbi->s_groups_count +
+ le16_to_cpu(es->s_reserved_gdt_blocks) +
+ 7) / 8, GFP_KERNEL);
+ if (sbi->s_inode_bitmap_buffer_state == NULL) {
+ printk(KERN_ERR "EXT4-fs: not enough memory\n");
+ goto failed_mount;
+ }

bgl_lock_init(&sbi->s_blockgroup_lock);




2008-09-24 12:56:08

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Tuesday 23 September 2008 at 17:13 -0600, Andreas Dilger wrote:
> On Sep 22, 2008 11:32 +0200, Frédéric Bohé wrote:
> > On Monday 22 September 2008 at 14:17 +0530, Aneesh Kumar K.V wrote:
> > > What you can do is make ext4_group_info generic for both mballoc and
> > > oldalloc. We can then add bg_flag to the in memory ext4_group_info
> > > that would indicate whether the group is initialized or not. Here
> > > initialized for an UNINIT_GROUP indicate we have done
> > > ext4_init_block_bitmap on the buffer_head. Then
> > > instead of depending on the buffer_head uptodate flag we can check
> > > for the ext4_group_info bg_flags and decided whether the block/inode
> > > bitmap need to be initialized.
> >
> > That makes sense ! I agree with you, we need an additional in-memory
> > flag to know whether buffers are initialized or not. Anyway, making
> > ext4_group_info generic will lead to unneeded memory consumption for
> > oldalloc. Maybe a simple independent bits array could do the trick. Is
> > there any advantage to re-use ext4_group_info ?
>
> For ext4 I think 99% of users will use mballoc, and the reduction in code
> complexity is itself useful. I don't think the in-memory overhead is very
> much, maybe 1 MB per TB of filesystem space.
>

You are right, the ext4_group_info structure is not as big as I thought.
Do you mean that making ext4_group_info generic for both mballoc and
oldalloc will reduce the code complexity?

> Also, if you are considering this approach (to initialize the in-memory
> bitmaps at mount time) they should be written to disk even if unused.
> Please also consider doing the inode table zeroing at the same time.
> This would allow uninit_bg to avoid doing it at mke2fs time.

In fact, I was not considering doing this at mount time, but it could be
a good approach.
However, I don't understand why we should write the bitmaps to disk after
that, or why we should zero the inode table. Wouldn't we end up with a
fast mkfs and a slow mount that does all the work the older mkfs was doing?
The UNINIT feature would become less interesting.

Regards,
Frederic


2008-09-24 16:24:16

by Theodore Ts'o

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Wed, Sep 24, 2008 at 02:57:28PM +0200, Frédéric Bohé wrote:
>
> You are right ext4_group_info structure was not as big as I thought.
> Do you mean that making ext4_group_info generic for both mballoc and
> oldalloc will reduce the code complexity ?

Long-term, we want to do this, yes. There's a lot of stuff in mballoc
that we probably need to move out into generic code. I'll be sending
patches shortly that move the /proc handling code into the generic
code, and also save 2k of compiled object code in the process.

Here, I think the main argument is that, since mballoc is on by default
and the benefits of it are huge, we would save memory by using an
unused bit in ext4_group_info.

A related question is at what point should we remove the oldalloc code
altogether?

> > Also, if you are considering this approach (to initialize the in-memory
> > bitmaps at mount time) they should be written to disk even if unused.
> > Please also consider doing the inode table zeroing at the same time.
> > This would allow uninit_bg to avoid doing it at mke2fs time.
>
> In fact, I was not considering doing this at mount time, but it could be
> a good approach.
> Anyway, I don't understand why we should write bitmaps to disk after
> that, and why we should zeroing the inode table. Don't we end up with a
> fast mkfs and a slow mount doing all the stuff older mkfs was doing ?
> The UNINIT feature would become less interesting.

It would be an absolute disaster to do this at mount time, especially
if it included zeroing the inode table. Zeroing the inode table must
be done in a background kernel thread, with appropriate locks to avoid
races with the block allocation code (this is one of the places where
eliminating the old allocation code would make life easier).

I don't think we should worry about initializing the bitmaps in
advance. There's just no advantage in doing that for the bitmaps.
For the inode table we want to do this for safety's sake, but that's
not a concern for the bitmaps.

- Ted

2008-09-25 23:05:32

by Andreas Dilger

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Sep 24, 2008 12:23 -0400, Theodore Ts'o wrote:
> > Do you mean that making ext4_group_info generic for both mballoc and
> > oldalloc will reduce the code complexity ?
>
> Long-term, we want to do this, yes. There's a lot of stuff in mballoc
> that we probably need to move out into generic code. I'll sending
> patches shortly that move the /proc handling code into the generic
> code, and also saving 2k of compiled object code in the process.
>
> Here, I think main argument is since mballoc is on by default, and the
> benefits of this are huge, is that we would save memory by using an
> unused bit in ext4_group_info.

Exactly.

> A related question is at what point should we remove the oldalloc code
> altogehter?

I'd vote for sooner rather than later. We're pretty clear on the mballoc
benefits, and there is a lot of old/duplicate cruft that is confusing
(e.g. old block reservation code) that could be removed at the same time.

> > Anyway, I don't understand why we should write bitmaps to disk after
> > that, and why we should zeroing the inode table. Don't we end up with a
> > fast mkfs and a slow mount doing all the stuff older mkfs was doing ?
> > The UNINIT feature would become less interesting.
>
> It would be an absolute disaster to do this at mount time, especially
> if it included zeroing the inode table. Zeroing the inode table must
> be done in a background kernel thread,

Yes, definitely I meant "in a background thread that can be interrupted
if there is other fs activity or unmount", not synchronously with the
mount. The risk of fatal itable/GDT corruption in the first minute of
using a newly formatted filesystem is small, and the corresponding value
of any data in that filesystem would be equally small.

> with appropriate locks to avoid races with the block allocation code

Definitely...

> I don't think we should worry about initializing the bitmaps in
> advance. There's just no advantage in doing that for the bitmaps.

Well, just some small safety that there isn't complete garbage on
disk, which helps e2fsck make a better decision in case of old data
still on the disk.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
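
The "background work that can be interrupted" idea discussed above can be illustrated with a small, single-threaded userspace model. It is a sketch under stated assumptions, not kernel code: `zero_job` and its helpers are hypothetical names, a plain memory buffer stands in for the on-disk inode table, and the locking against the allocator that Ted insists on is deliberately not modelled. What it does show is zeroing in chunks, checking a stop flag between chunks, and resuming later from where the work stopped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical job descriptor for zeroing one inode table region. */
struct zero_job {
	unsigned char *table;	/* memory standing in for the on-disk table */
	size_t size;		/* total bytes to zero */
	size_t done;		/* bytes zeroed so far */
	size_t chunk;		/* bytes zeroed between two stop checks */
	atomic_bool stop;	/* set on other fs activity or unmount */
};

static void zero_job_init(struct zero_job *job, unsigned char *table,
			  size_t size, size_t chunk)
{
	assert(chunk > 0);
	job->table = table;
	job->size = size;
	job->done = 0;
	job->chunk = chunk;
	atomic_init(&job->stop, false);
}

/* Ask the job to pause (true) or allow it to continue (false). */
static void zero_job_request_stop(struct zero_job *job, bool stop)
{
	atomic_store(&job->stop, stop);
}

/*
 * Zero at most max_chunks chunks, checking the stop flag before each one.
 * Returns true once the whole table has been zeroed; progress is kept in
 * job->done so that a later call resumes where this one left off.
 */
static bool zero_job_run(struct zero_job *job, size_t max_chunks)
{
	while (job->done < job->size && max_chunks-- > 0) {
		if (atomic_load(&job->stop))
			break;
		size_t n = job->size - job->done;
		if (n > job->chunk)
			n = job->chunk;
		memset(job->table + job->done, 0, n);
		job->done += n;
	}
	return job->done == job->size;
}
```

In a real implementation the loop would run in its own thread and the stop flag would be raised by foreground I/O or by unmount; the chunk size bounds how long such a request can go unnoticed.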


2008-09-26 14:27:55

by Frédéric Bohé

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

On Wednesday 24 September 2008 at 14:38 +0200, Frédéric Bohé wrote:
> On Monday 22 September 2008 at 11:32 +0200, Frédéric Bohé wrote:
> > On Monday 22 September 2008 at 14:17 +0530, Aneesh Kumar K.V wrote:
> > > What you can do is make ext4_group_info generic for both mballoc and
> > > oldalloc. We can then add bg_flag to the in memory ext4_group_info
> > > that would indicate whether the group is initialized or not. Here
> > > initialized for an UNINIT_GROUP indicate we have done
> > > ext4_init_block_bitmap on the buffer_head. Then
> > > instead of depending on the buffer_head uptodate flag we can check
> > > for the ext4_group_info bg_flags and decided whether the block/inode
> > > bitmap need to be initialized.
> > >
> >
> > That makes sense ! I agree with you, we need an additional in-memory
> > flag to know whether buffers are initialized or not. Anyway, making
> > ext4_group_info generic will lead to unneeded memory consumption for
> > oldalloc. Maybe a simple independent bits array could do the trick. Is
> > there any advantage to re-use ext4_group_info ?
> >
>
> This is an implementation of what I was talking about. Please let me know your comments.
>
> Index: linux-2.6.27-rc6+patch_queue/fs/ext4/balloc.c
> ===================================================================
> --- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/balloc.c 2008-09-23 15:04:39.000000000 +0200
> +++ linux-2.6.27-rc6+patch_queue/fs/ext4/balloc.c 2008-09-23 15:39:38.000000000 +0200
> @@ -175,6 +175,8 @@ unsigned ext4_init_block_bitmap(struct s
> */
> mark_bitmap_end(group_blocks, sb->s_blocksize * 8, bh->b_data);
> }
> +
> + ext4_set_bit(block_group, sbi->s_block_bitmap_buffer_state);
> return free_blocks - ext4_group_used_meta_blocks(sb, block_group);
> }
>
> @@ -318,9 +320,13 @@ ext4_read_block_bitmap(struct super_bloc
> block_group, bitmap_blk);
> return NULL;
> }
> - if (bh_uptodate_or_lock(bh))
> +
> + if (buffer_uptodate(bh) && ext4_test_bit(block_group,
> + EXT4_SB(sb)->s_block_bitmap_buffer_state))
> return bh;
>
> + lock_buffer(bh);
> +
> spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> ext4_init_block_bitmap(sb, bh, block_group, desc);
> @@ -328,7 +334,9 @@ ext4_read_block_bitmap(struct super_bloc
> unlock_buffer(bh);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
> return bh;
> - }
> + } else
> + ext4_set_bit(block_group,
> + EXT4_SB(sb)->s_block_bitmap_buffer_state);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (bh_submit_read(bh) < 0) {
> put_bh(bh);
> Index: linux-2.6.27-rc6+patch_queue/fs/ext4/ext4_sb.h
> ===================================================================
> --- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/ext4_sb.h 2008-09-23 15:07:28.000000000 +0200
> +++ linux-2.6.27-rc6+patch_queue/fs/ext4/ext4_sb.h 2008-09-23 15:40:57.000000000 +0200
> @@ -147,6 +147,14 @@ struct ext4_sb_info {
>
> unsigned int s_log_groups_per_flex;
> struct flex_groups *s_flex_groups;
> +
> + /*
> + * Flag for the state of the bitmaps buffers
> + * 0 = unknown or uninitialized
> + * 1 = initialized
> + */
> + char *s_block_bitmap_buffer_state;
> + char *s_inode_bitmap_buffer_state;
> };
>
> #endif /* _EXT4_SB */
> Index: linux-2.6.27-rc6+patch_queue/fs/ext4/ialloc.c
> ===================================================================
> --- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/ialloc.c 2008-09-23 15:09:15.000000000 +0200
> +++ linux-2.6.27-rc6+patch_queue/fs/ext4/ialloc.c 2008-09-23 15:41:24.000000000 +0200
> @@ -86,6 +86,7 @@ unsigned ext4_init_inode_bitmap(struct s
> memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), EXT4_BLOCKS_PER_GROUP(sb),
> bh->b_data);
> + ext4_set_bit(block_group, sbi->s_inode_bitmap_buffer_state);
>
> return EXT4_INODES_PER_GROUP(sb);
> }
> @@ -115,9 +116,12 @@ ext4_read_inode_bitmap(struct super_bloc
> block_group, bitmap_blk);
> return NULL;
> }
> - if (bh_uptodate_or_lock(bh))
> +
> + if (buffer_uptodate(bh) && ext4_test_bit(block_group,
> + EXT4_SB(sb)->s_inode_bitmap_buffer_state))
> return bh;
>
> + lock_buffer(bh);
> spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> ext4_init_inode_bitmap(sb, bh, block_group, desc);
> @@ -125,7 +129,9 @@ ext4_read_inode_bitmap(struct super_bloc
> unlock_buffer(bh);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
> return bh;
> - }
> + } else
> + ext4_set_bit(block_group,
> + EXT4_SB(sb)->s_inode_bitmap_buffer_state);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), block_group));
> if (bh_submit_read(bh) < 0) {
> put_bh(bh);
> Index: linux-2.6.27-rc6+patch_queue/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/mballoc.c 2008-09-23 15:11:48.000000000 +0200
> +++ linux-2.6.27-rc6+patch_queue/fs/ext4/mballoc.c 2008-09-24 14:27:55.000000000 +0200
> @@ -785,9 +785,12 @@ static int ext4_mb_init_cache(struct pag
> if (bh[i] == NULL)
> goto out;
>
> - if (bh_uptodate_or_lock(bh[i]))
> + if (buffer_uptodate(bh[i]) && ext4_test_bit(first_group + i,
> + EXT4_SB(sb)->s_block_bitmap_buffer_state))
> continue;
>
> + lock_buffer(bh[i]);
> +
> spin_lock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
> if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> ext4_init_block_bitmap(sb, bh[i],
> @@ -796,7 +799,9 @@ static int ext4_mb_init_cache(struct pag
> unlock_buffer(bh[i]);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
> continue;
> - }
> + } else
> + ext4_set_bit(first_group + i,
> + EXT4_SB(sb)->s_block_bitmap_buffer_state);
> spin_unlock(sb_bgl_lock(EXT4_SB(sb), first_group + i));
> get_bh(bh[i]);
> bh[i]->b_end_io = end_buffer_read_sync;
> Index: linux-2.6.27-rc6+patch_queue/fs/ext4/super.c
> ===================================================================
> --- linux-2.6.27-rc6+patch_queue.orig/fs/ext4/super.c 2008-09-23 15:16:15.000000000 +0200
> +++ linux-2.6.27-rc6+patch_queue/fs/ext4/super.c 2008-09-24 14:28:42.000000000 +0200
> @@ -2219,6 +2219,20 @@ static int ext4_fill_super(struct super_
> printk(KERN_ERR "EXT4-fs: not enough memory\n");
> goto failed_mount;
> }
> + sbi->s_block_bitmap_buffer_state = kzalloc((sbi->s_groups_count +
> + le16_to_cpu(es->s_reserved_gdt_blocks) +
> + 7) / 8, GFP_KERNEL);
> + if (sbi->s_block_bitmap_buffer_state == NULL) {
> + printk(KERN_ERR "EXT4-fs: not enough memory\n");
> + goto failed_mount;
> + }
> + sbi->s_inode_bitmap_buffer_state = kzalloc((sbi->s_groups_count +
> + le16_to_cpu(es->s_reserved_gdt_blocks) +
> + 7) / 8, GFP_KERNEL);
> + if (sbi->s_inode_bitmap_buffer_state == NULL) {
> + printk(KERN_ERR "EXT4-fs: not enough memory\n");
> + goto failed_mount;
> + }
>
> bgl_lock_init(&sbi->s_blockgroup_lock);
>


After some testing of this implementation, I think that using a bit to
record whether we have done ext4_init_block_bitmap for the bitmaps
of a group is useless. This method works only as long as buffer heads
are not freed. Consider the case where we have already set the
in-memory "init" bit and the buffer is then re-read from disk after
being freed: we are back to the initial problem, with on-disk garbage
in the buffer head!
At the moment, the only safe way I see of knowing whether we have to
initialize the buffer head is to rely on the UNINIT flag in the
group descriptor (the way my initial patch does).
As Aneesh said, this may lead to multiple calls to
ext4_init_block_bitmap instead of one, so there may be an impact on
performance.

Frederic
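
The failure described above can be reproduced in a toy userspace model. Everything in it is an assumption made for illustration: `bitmap_model` and its helpers are invented names, an "initialized" bitmap is simplified to all zeroes, and the group descriptor's UNINIT flag is never cleared. The model compares two policies: trusting `uptodate` plus an in-memory bit (as in the v2 patch), and re-initializing whenever the descriptor still says UNINIT (as in the initial patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define BITMAP_BYTES 8

/* Toy model of one bitmap block; every name here is illustrative. */
struct bitmap_model {
	unsigned char disk[BITMAP_BYTES];	/* stale data left on disk */
	unsigned char buf[BITMAP_BYTES];	/* buffer contents while cached */
	bool uptodate;		/* models buffer_uptodate() */
	bool desc_uninit;	/* models the UNINIT flag in the descriptor */
	bool mem_init_bit;	/* the extra in-memory "initialized" bit */
};

static void model_reset(struct bitmap_model *m)
{
	memset(m->disk, 0xA5, sizeof(m->disk));	/* garbage predating mkfs */
	memset(m->buf, 0, sizeof(m->buf));
	m->uptodate = false;
	m->desc_uninit = true;
	m->mem_init_bit = false;
}

/* The buffer is freed; note that the in-memory bit survives. */
static void model_evict(struct bitmap_model *m)
{
	m->uptodate = false;
}

/* A read that bypasses the filesystem refills the buffer from disk. */
static void model_raw_read(struct bitmap_model *m)
{
	memcpy(m->buf, m->disk, sizeof(m->buf));
	m->uptodate = true;
}

/* Simplified stand-in for bitmap initialization: an all-zero bitmap. */
static void model_init_bitmap(struct bitmap_model *m)
{
	memset(m->buf, 0, sizeof(m->buf));
	m->uptodate = true;
	m->mem_init_bit = true;
}

/* Policy 1: skip initialization when uptodate and the in-memory bit is set. */
static const unsigned char *read_with_mem_bit(struct bitmap_model *m)
{
	if (m->uptodate && m->mem_init_bit)
		return m->buf;
	if (m->desc_uninit) {
		model_init_bitmap(m);
		return m->buf;
	}
	model_raw_read(m);
	m->mem_init_bit = true;
	return m->buf;
}

/* Policy 2: initialize whenever the descriptor still says UNINIT. */
static const unsigned char *read_with_desc_flag(struct bitmap_model *m)
{
	if (m->uptodate && !m->desc_uninit)
		return m->buf;
	if (m->desc_uninit) {
		model_init_bitmap(m);
		return m->buf;
	}
	model_raw_read(m);
	return m->buf;
}

/* True when the buffer holds the model's "initialized" (all-zero) bitmap. */
static bool is_clean(const unsigned char *buf)
{
	for (size_t i = 0; i < BITMAP_BYTES; i++)
		if (buf[i] != 0)
			return false;
	return true;
}
```

Running the sequence "read, evict, raw read, read" against each policy shows the difference: the first policy hands back the on-disk garbage on the second read, while the second re-initializes the buffer every time, which is the performance cost mentioned above.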


2008-09-28 22:49:38

by Theodore Ts'o

Subject: Re: [PATCH v2] ext4: fix initialization of UNINIT bitmap blocks

Ok, I've added this patch to the patch series.

- Ted