2008-03-24 17:05:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext3: Return EIO if new block is allocated from system zone.

If the block allocator gets blocks out of system zone ext3 calls
ext3_error. But if the file system is mounted with errors=continue
return with -EIO.

System zone is the block range mapping block bitmap, inode bitmap and inode
table.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Mingming Cao <[email protected]>
---
fs/ext3/balloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index da0cb2c..6ce7f7d 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -1642,7 +1642,7 @@ allocated:
"Allocating block in system zone - "
"blocks from "E3FSBLK", length %lu",
ret_block, num);
- goto out;
+ goto io_error;
}

performed_allocation = 1;
--
1.5.5.rc0.16.g02b00.dirty



2008-03-24 17:05:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext3: Use page_mkwrite vma_operations to get mmap write notification.

We would like to get notified when we are doing a write on mmap section.
The changes are needed to handle ENOSPC when writing to an mmap section
of files with holes.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/file.c | 19 ++++++++++++++++++-
fs/ext3/inode.c | 5 +++++
include/linux/ext3_fs.h | 1 +
3 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index acc4913..09e22e4 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -106,6 +106,23 @@ force_commit:
return ret;
}

+static struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -116,7 +133,7 @@ const struct file_operations ext3_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = generic_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb95670..2293506 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -3306,3 +3306,8 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)

return err;
}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ return block_page_mkwrite(vma, page, ext3_get_block);
+}
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 36c5403..715c35e 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -836,6 +836,7 @@ extern void ext3_truncate (struct inode *);
extern void ext3_set_inode_flags(struct inode *);
extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct page *page);

/* ioctl.c */
extern int ext3_ioctl (struct inode *, struct file *, unsigned int,
--
1.5.5.rc0.16.g02b00.dirty


2008-03-24 17:05:15

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext4: Export needed symbol for ZERO_PAGE usage in modules.

ext4 use ZERO_PAGE(0) to zero out blocks. We need to export
different symbols in different arch for the usage of ZERO_PAGE
in modules.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/arm/mm/mmu.c | 1 +
arch/m68k/mm/init.c | 1 +
arch/powerpc/kernel/ppc_ksyms.c | 2 ++
arch/sparc/kernel/sparc_ksyms.c | 2 ++
arch/sparc64/mm/init.c | 1 +
5 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index d41a75e..2d6d682 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -35,6 +35,7 @@ extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
* zero-initialized data and COW.
*/
struct page *empty_zero_page;
+EXPORT_SYMBOL(empty_zero_page);

/*
* The pmd table for the upper-most set of pages.
diff --git a/arch/m68k/mm/init.c b/arch/m68k/mm/init.c
index f42caa7..ee27ed2 100644
--- a/arch/m68k/mm/init.c
+++ b/arch/m68k/mm/init.c
@@ -69,6 +69,7 @@ void __init m68k_setup_node(int node)
*/

void *empty_zero_page;
+EXPORT_SYMBOL(empty_zero_page);

void show_mem(void)
{
diff --git a/arch/powerpc/kernel/ppc_ksyms.c b/arch/powerpc/kernel/ppc_ksyms.c
index 9c98424..93fb09a 100644
--- a/arch/powerpc/kernel/ppc_ksyms.c
+++ b/arch/powerpc/kernel/ppc_ksyms.c
@@ -192,3 +192,5 @@ EXPORT_SYMBOL(intercept_table);
EXPORT_SYMBOL(__mtdcr);
EXPORT_SYMBOL(__mfdcr);
#endif
+
+EXPORT_SYMBOL(empty_zero_page);
diff --git a/arch/sparc/kernel/sparc_ksyms.c b/arch/sparc/kernel/sparc_ksyms.c
index c1025e5..e902846 100644
--- a/arch/sparc/kernel/sparc_ksyms.c
+++ b/arch/sparc/kernel/sparc_ksyms.c
@@ -295,3 +295,5 @@ EXPORT_SYMBOL(do_BUG);

/* Sun Power Management Idle Handler */
EXPORT_SYMBOL(pm_idle);
+
+EXPORT_SYMBOL(empty_zero_page);
diff --git a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
index b5c3041..2d15f92 100644
--- a/arch/sparc64/mm/init.c
+++ b/arch/sparc64/mm/init.c
@@ -159,6 +159,7 @@ extern unsigned int sparc_ramdisk_image;
extern unsigned int sparc_ramdisk_size;

struct page *mem_map_zero __read_mostly;
+EXPORT_SYMBOL(mem_map_zero);

unsigned int sparc64_highest_unlocked_tlb_ent __read_mostly;

--
1.5.5.rc0.16.g02b00.dirty


2008-03-24 17:05:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH] ext2: Use page_mkwrite vma_operations to get mmap write notification.

We would like to get notified when we are doing a write on mmap
section. The changes are needed to handle ENOSPC when writing to an
mmap section of files with holes.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 21 ++++++++++++++++++++-
fs/ext2/inode.c | 5 +++++
3 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 47d88da..cc2e106 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -136,6 +136,7 @@ extern void ext2_get_inode_flags(struct ext2_inode_info *);
int __ext2_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata);
+extern int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page);

/* ioctl.c */
extern long ext2_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 5f2fa9c..d539dcf 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -18,6 +18,7 @@
* ([email protected])
*/

+#include <linux/mm.h>
#include <linux/time.h>
#include "ext2.h"
#include "xattr.h"
@@ -38,6 +39,24 @@ static int ext2_release_file (struct inode * inode, struct file * filp)
return 0;
}

+static struct vm_operations_struct ext2_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext2_page_mkwrite,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext2_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
+
/*
* We have mostly NULL's here: the current defaults are ok for
* the ext2 filesystem.
@@ -52,7 +71,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = generic_file_open,
.release = ext2_release_file,
.fsync = ext2_sync_file,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c620068..196e063 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1444,3 +1444,8 @@ int ext2_setattr(struct dentry *dentry, struct iattr *iattr)
error = ext2_acl_chmod(inode);
return error;
}
+
+int ext2_page_mkwrite(struct vm_area_struct *vma, struct page *page)
+{
+ return block_page_mkwrite(vma, page, ext2_get_block);
+}
--
1.5.5.rc0.16.g02b00.dirty


2008-03-24 22:33:03

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext3: Return EIO if new block is allocated from system zone.

On Mar 24, 2008 22:34 +0530, Aneesh Kumar K.V wrote:
> If the block allocator gets blocks out of system zone ext3 calls
> ext3_error. But if the file system is mounted with errors=continue
> return with -EIO.
>
> System zone is the block range mapping block bitmap, inode bitmap and inode
> table.

I don't understand this patch. Surely it has nothing to do with the user,
and instead the code should re-try the allocation after marking the block
in-use in the bitmap?

> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> Signed-off-by: Mingming Cao <[email protected]>
> ---
> fs/ext3/balloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> index da0cb2c..6ce7f7d 100644
> --- a/fs/ext3/balloc.c
> +++ b/fs/ext3/balloc.c
> @@ -1642,7 +1642,7 @@ allocated:
> "Allocating block in system zone - "
> "blocks from "E3FSBLK", length %lu",
> ret_block, num);
> - goto out;
> + goto io_error;
> }

I think the problem here is that this "goto" is simply an in-effective
method of error handling. The block HAS to be marked in-use in the
bitmap, or it will just continue to try and be allocated. After marking
the block in-use there should be a "goto retry-alloc" instead of error.
To be honest, I thought in 2.6 this would invoke the block bitmap checking
code to revalidate the whole bitmap at this point and then retry the alloc.

In 2.4 this similar code looks like:

if (tmp == le32_to_cpu(gdp->bg_block_bitmap) ||
tmp == le32_to_cpu(gdp->bg_inode_bitmap) ||
in_range (tmp, le32_to_cpu(gdp->bg_inode_table),
EXT3_SB(sb)->s_itb_per_group)) {
ext3_error(sb, __FUNCTION__,
"Allocating block in system zone - block = %u", tmp);

/* Note: This will potentially use up one of the handle's
* buffer credits. Normally we have way too many credits,
* so that is OK. In _very_ rare cases it might not be OK.
* We will trigger an assertion if we run out of credits,
* and we will have to do a full fsck of the filesystem -
* better than randomly corrupting filesystem metadata.
*/
ext3_set_bit(j, bh->b_data);
goto repeat;
}


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-03-24 23:01:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] ext3: Use page_mkwrite vma_operations to get mmap write notification.

On Mon, 24 Mar 2008 22:34:56 +0530
"Aneesh Kumar K.V" <[email protected]> wrote:

> We would like to get notified when we are doing a write on mmap section.
> The changes are needed to handle ENOSPC when writing to an mmap section
> of files with holes.
>

umm,

>
> diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> index acc4913..09e22e4 100644
> --- a/fs/ext3/file.c
> +++ b/fs/ext3/file.c
> @@ -106,6 +106,23 @@ force_commit:
> return ret;
> }
>
> +static struct vm_operations_struct ext3_file_vm_ops = {
> + .fault = filemap_fault,
> + .page_mkwrite = ext3_page_mkwrite,
> +};
> +
> +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> + struct address_space *mapping = file->f_mapping;
> +
> + if (!mapping->a_ops->readpage)
> + return -ENOEXEC;
> + file_accessed(file);
> + vma->vm_ops = &ext3_file_vm_ops;
> + vma->vm_flags |= VM_CAN_NONLINEAR;
> + return 0;
> +}
> +
> const struct file_operations ext3_file_operations = {
> .llseek = generic_file_llseek,
> .read = do_sync_read,
> @@ -116,7 +133,7 @@ const struct file_operations ext3_file_operations = {
> #ifdef CONFIG_COMPAT
> .compat_ioctl = ext3_compat_ioctl,
> #endif
> - .mmap = generic_file_mmap,
> + .mmap = ext3_file_mmap,
> .open = generic_file_open,
> .release = ext3_release_file,
> .fsync = ext3_sync_file,
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index eb95670..2293506 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -3306,3 +3306,8 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
>
> return err;
> }
> +
> +int ext3_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> +{
> + return block_page_mkwrite(vma, page, ext3_get_block);
> +}

This gets called within the pagefault handler.

And block_page_mkwrite() does lock_page().

But the pagefault handler can be called with a page already locked, from
generic_perform_write().

Nick, why are we not vulnerable to A-A or to AB-BA deadlocks here?

2008-03-25 09:35:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] ext3: Use page_mkwrite vma_operations to get mmap write notification.

On Tuesday 25 March 2008 10:01, Andrew Morton wrote:
> On Mon, 24 Mar 2008 22:34:56 +0530
>
> "Aneesh Kumar K.V" <[email protected]> wrote:
> > We would like to get notified when we are doing a write on mmap section.
> > The changes are needed to handle ENOSPC when writing to an mmap section
> > of files with holes.
>
> umm,
>
> > diff --git a/fs/ext3/file.c b/fs/ext3/file.c
> > index acc4913..09e22e4 100644
> > --- a/fs/ext3/file.c
> > +++ b/fs/ext3/file.c
> > @@ -106,6 +106,23 @@ force_commit:
> > return ret;
> > }
> >
> > +static struct vm_operations_struct ext3_file_vm_ops = {
> > + .fault = filemap_fault,
> > + .page_mkwrite = ext3_page_mkwrite,
> > +};
> > +
> > +static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > + struct address_space *mapping = file->f_mapping;
> > +
> > + if (!mapping->a_ops->readpage)
> > + return -ENOEXEC;
> > + file_accessed(file);
> > + vma->vm_ops = &ext3_file_vm_ops;
> > + vma->vm_flags |= VM_CAN_NONLINEAR;
> > + return 0;
> > +}
> > +
> > const struct file_operations ext3_file_operations = {
> > .llseek = generic_file_llseek,
> > .read = do_sync_read,
> > @@ -116,7 +133,7 @@ const struct file_operations ext3_file_operations = {
> > #ifdef CONFIG_COMPAT
> > .compat_ioctl = ext3_compat_ioctl,
> > #endif
> > - .mmap = generic_file_mmap,
> > + .mmap = ext3_file_mmap,
> > .open = generic_file_open,
> > .release = ext3_release_file,
> > .fsync = ext3_sync_file,
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index eb95670..2293506 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -3306,3 +3306,8 @@ int ext3_change_inode_journal_flag(struct inode
> > *inode, int val)
> >
> > return err;
> > }
> > +
> > +int ext3_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > +{
> > + return block_page_mkwrite(vma, page, ext3_get_block);
> > +}
>
> This gets called within the pagefault handler.
>
> And block_page_mkwrite() does lock_page().
>
> But the pagefault handler can be called with a page already locked, from
> generic_perform_write().
>
> Nick, why are we not vulnerable to A-A or to AB-BA deadlocks here?

Pagefault handler unlocks the page before calling page_mkwrite. This is
kind of crap, because the caller invariably has to lock the page again
to check that it has not been truncated away anyway.... I voiced my
concerns about this page_mkwrite API, but no, as always, people want to
be really clever first and have understandable locking schemes second ;)

I'm hoping to one day clean this up and fold it all into ->fault()...



2008-03-25 11:03:24

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext3: Return EIO if new block is allocated from system zone.

On Mon, Mar 24, 2008 at 04:32:51PM -0600, Andreas Dilger wrote:
> On Mar 24, 2008 22:34 +0530, Aneesh Kumar K.V wrote:
> > If the block allocator gets blocks out of system zone ext3 calls
> > ext3_error. But if the file system is mounted with errors=continue
> > return with -EIO.
> >
> > System zone is the block range mapping block bitmap, inode bitmap and inode
> > table.
>
> I don't understand this patch. Surely it has nothing to do with the user,
> and instead the code should re-try the allocation after marking the block
> in-use in the bitmap?
>
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > Signed-off-by: Mingming Cao <[email protected]>
> > ---
> > fs/ext3/balloc.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
> > index da0cb2c..6ce7f7d 100644
> > --- a/fs/ext3/balloc.c
> > +++ b/fs/ext3/balloc.c
> > @@ -1642,7 +1642,7 @@ allocated:
> > "Allocating block in system zone - "
> > "blocks from "E3FSBLK", length %lu",
> > ret_block, num);
> > - goto out;
> > + goto io_error;
> > }
>
> I think the problem here is that this "goto" is simply an in-effective
> method of error handling. The block HAS to be marked in-use in the
> bitmap, or it will just continue to try and be allocated. After marking
> the block in-use there should be a "goto retry-alloc" instead of error.
> To be honest, I thought in 2.6 this would invoke the block bitmap checking
> code to revalidate the whole bitmap at this point and then retry the alloc.
>
> In 2.4 this similar code looks like:
>
> if (tmp == le32_to_cpu(gdp->bg_block_bitmap) ||
> tmp == le32_to_cpu(gdp->bg_inode_bitmap) ||
> in_range (tmp, le32_to_cpu(gdp->bg_inode_table),
> EXT3_SB(sb)->s_itb_per_group)) {
> ext3_error(sb, __FUNCTION__,
> "Allocating block in system zone - block = %u", tmp);
>
> /* Note: This will potentially use up one of the handle's
> * buffer credits. Normally we have way too many credits,
> * so that is OK. In _very_ rare cases it might not be OK.
> * We will trigger an assertion if we run out of credits,
> * and we will have to do a full fsck of the filesystem -
> * better than randomly corrupting filesystem metadata.
> */
> ext3_set_bit(j, bh->b_data);
> goto repeat;
> }
>
>

How about this. Patch is not complete, we leak some of the blocks here.

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 6d30af2..a9f27c7 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -289,11 +289,11 @@ read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
(int)block_group, (unsigned long long)bitmap_blk);
return NULL;
}
- if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
- put_bh(bh);
- return NULL;
- }

2008-03-25 20:59:47

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext3: Return EIO if new block is allocated from system zone.

On Mar 25, 2008 16:33 +0530, Aneesh Kumar K.V wrote:
> On Mon, Mar 24, 2008 at 04:32:51PM -0600, Andreas Dilger wrote:
> > I think the problem here is that this "goto" is simply an in-effective
> > method of error handling. The block HAS to be marked in-use in the
> > bitmap, or it will just continue to try and be allocated. After marking
> > the block in-use there should be a "goto retry-alloc" instead of error.
> > To be honest, I thought in 2.6 this would invoke the block bitmap checking
> > code to revalidate the whole bitmap at this point and then retry the alloc.
> >
> > In 2.4 this similar code looks like:
> >
> > if (tmp == le32_to_cpu(gdp->bg_block_bitmap) ||
> > tmp == le32_to_cpu(gdp->bg_inode_bitmap) ||
> > in_range (tmp, le32_to_cpu(gdp->bg_inode_table),
> > EXT3_SB(sb)->s_itb_per_group)) {
> > ext3_error(sb, __FUNCTION__,
> > "Allocating block in system zone - block = %u", tmp);
> >
> > /* Note: This will potentially use up one of the handle's
> > * buffer credits. Normally we have way too many credits,
> > * so that is OK. In _very_ rare cases it might not be OK.
> > * We will trigger an assertion if we run out of credits,
> > * and we will have to do a full fsck of the filesystem -
> > * better than randomly corrupting filesystem metadata.
> > */
> > ext3_set_bit(j, bh->b_data);
> > goto repeat;
> > }
> >
> >
>
> How about this. Patch is not complete, we leak some of the blocks here.
>
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 6d30af2..a9f27c7 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -289,11 +289,11 @@ read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> (int)block_group, (unsigned long long)bitmap_blk);
> return NULL;
> }
> - if (!ext4_valid_block_bitmap(sb, desc, block_group, bh)) {
> - put_bh(bh);
> - return NULL;
> - }
> -
> + ext4_valid_block_bitmap(sb, desc, block_group, bh);
> + /*
> + * file system mounted to not panic on error.
> + * continue with corrput bitmap
> + */
> return bh;
> }

Part of the problem here is that ext4_valid_block_bitmap() only reports if
the bitmap is good or bad. It doesn't actually try to fix the problems it
finds. Instead of changing read_block_bitmap() to return an invalid bitmap
to the caller on error, it makes more sense to keep this code as-is and
change ext4_new_blocks_old() to change the bg_free_blocks_count = 0 for
this group and then continue to find a different group to allocate in:

if (free_blocks > 0) {
bitmap_bh = read_block_bitmap(sb, group_no);
if (bitmap_bh != NULL) {
grp_alloc_blk =
ext4_try_to_allocate_with_rsv(sb, handle,
group_no,
bitmap_bh,
grp_target_blk,
my_rsv, &num,
&fatal);
if (fatal)
goto out;
if (grp_alloc_blk >= 0)
goto allocated;
} else {
/* on error skip this group in future allocations */
ext4_journal_get_write_access(handle, gdp_bh);
gdp->bg_free_blocks_count = 0;
ext4_journal_dirty_metadata(handle, gdp_bh);
}
}

> @@ -1779,7 +1779,12 @@ allocated:
> "Allocating block in system zone - "
> "blocks from %llu, length %lu",
> ret_block, num);
> - goto io_error;
> + /*
> + * claim_block marked the buffer we allocated
> + * as in use. So we may want to selectively
> + * mark some of the blocks as free
> + */
> + goto retry_alloc;

However, the above is only half of the story (though the most important
half). If the bitmap has been loaded into memory read_block_bitmap()
will not check the validity of the bitmap anymore. What 2.4 did here is
actually mark the system zone bits in use and continue on. We could either
do that directly, or have a flag to ext4_valid_block_bitmap() to tell it
to set the bits, or just skip the group entirely.

Probably skipping the group entirely is safest and easiest because who
knows what else is wrong with the bitmap, and this is what is done
above (set bg_free_blocks_count = 0, write it to disk and the ext4_error()
call will have already marked the filesystem as needing a check.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.