2015-05-27 10:25:33

by Lukas Czerner

[permalink] [raw]
Subject: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

Currently on the machines with page size > block size when initializing
block group buddy cache we initialize it for all the block group bitmaps
in the page. However in the case of read error, checksum error, or if
a single bitmap is in any way corrupted we would fail to initialize all
of the bitmaps. This is problematic because we will not have access to
the other allocation groups even though those might be perfectly fine
and usable.

Fix this by reading all the bitmaps instead of error out on the first
problem and simply skip the bitmaps which were either not read properly,
or are not valid.

Signed-off-by: Lukas Czerner <[email protected]>
---
fs/ext4/mballoc.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8d1e602..7e28007 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)

/* wait for I/O completion */
for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
- if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
+ if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
err = -EIO;
- goto out;
- }
}

first_block = page->index * blocks_per_page;
@@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
/* skip initialized uptodate buddy */
continue;

+ if (!buffer_verified(bh[group - first_group]) ||
+ !buffer_uptodate(bh[group - first_group]))
+ /* Skip faulty bitmaps */
+ continue;
+ else
+ err = 0;
+
/*
* data carry information regarding this
* particular group in the format specified
--
1.8.3.1



2015-05-27 15:44:07

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

On 5/27/15 5:25 AM, Lukas Czerner wrote:
> Currently on the machines with page size > block size when initializing
> block group buddy cache we initialize it for all the block group bitmaps
> in the page. However in the case of read error, checksum error, or if
> a single bitmap is in any way corrupted we would fail to initialize all
> of the bitmaps. This is problematic because we will not have access to
> the other allocation groups even though those might be perfectly fine
> and usable.
>
> Fix this by reading all the bitmaps instead of error out on the first
> problem and simply skip the bitmaps which were either not read properly,
> or are not valid.
>
> Signed-off-by: Lukas Czerner <[email protected]>
> ---
> fs/ext4/mballoc.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8d1e602..7e28007 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>
> /* wait for I/O completion */
> for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> - if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
> + if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
> err = -EIO;
> - goto out;
> - }
> }
>
> first_block = page->index * blocks_per_page;
> @@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> /* skip initialized uptodate buddy */
> continue;
>
> + if (!buffer_verified(bh[group - first_group]) ||
> + !buffer_uptodate(bh[group - first_group]))
> + /* Skip faulty bitmaps */
> + continue;
> + else
> + err = 0;
> +

Hm, ext4_wait_block_bitmap() can fail 3 ways (buffer not update, or not verified,
but also if ext4_get_group_desc fails), but this only checks 2 of those, right?

If ext4_get_group_desc fails, we'll have a bh which is new, may or may not be
uptodate, and ... I guess it won't be verified in that case either, will it. So
that's probably ok.

In fact, is the (!verified) test enough, here? (IOWs, could it possibly be verified,
but not uptodate?)


In the bigger picture - if the filesystem is corrupt (or the device is bad) in this
way, do we really *want* to continue? What are the consequences of doing so,
and have you tested with a filesystem partially-initialized like this?

Thinking more about it ... (sorry for the stream of consciousness here) - if validation
fails, encountering this sort of error will trigger remount,ro or panic unless
errors=continue, right? But I guess that's still the default, so maybe we do expect
to continue. So I go back to the question of: have you tested with partial init
like this, to be sure we don't fall off some cliff later?

Thanks,
-Eric

2015-05-27 20:07:08

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

On May 27, 2015, at 9:44 AM, Eric Sandeen <[email protected]> wrote:
>
> On 5/27/15 5:25 AM, Lukas Czerner wrote:
>> Currently on the machines with page size > block size when initializing
>> block group buddy cache we initialize it for all the block group bitmaps
>> in the page. However in the case of read error, checksum error, or if
>> a single bitmap is in any way corrupted we would fail to initialize all
>> of the bitmaps. This is problematic because we will not have access to
>> the other allocation groups even though those might be perfectly fine
>> and usable.
>>
>> Fix this by reading all the bitmaps instead of error out on the first
>> problem and simply skip the bitmaps which were either not read properly,
>> or are not valid.
>>
>> Signed-off-by: Lukas Czerner <[email protected]>
>> ---
>> fs/ext4/mballoc.c | 11 ++++++++---
>> 1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 8d1e602..7e28007 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>>
>> /* wait for I/O completion */
>> for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
>> - if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
>> + if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
>> err = -EIO;
>> - goto out;
>> - }
>> }
>>
>> first_block = page->index * blocks_per_page;
>> @@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
>> /* skip initialized uptodate buddy */
>> continue;
>>
>> + if (!buffer_verified(bh[group - first_group]) ||
>> + !buffer_uptodate(bh[group - first_group]))
>> + /* Skip faulty bitmaps */
>> + continue;
>> + else
>> + err = 0;
>> +
>
> Hm, ext4_wait_block_bitmap() can fail 3 ways (buffer not update, or
> not verified, but also if ext4_get_group_desc fails), but this only
> checks 2 of those, right?
>
> If ext4_get_group_desc fails, we'll have a bh which is new, may or
> may not be uptodate, and ... I guess it won't be verified in that
> case either, will it. So that's probably ok.
>
> In fact, is the (!verified) test enough, here? (IOWs, could it
> possibly be verified, but not uptodate?)
>
>
> In the bigger picture - if the filesystem is corrupt (or the device
> is bad) in this way, do we really *want* to continue? What are the
> consequences of doing so, and have you tested with a filesystem
> partially-initialized like this?
>
> Thinking more about it ... (sorry for the stream of consciousness
> here) - if validation fails, encountering this sort of error will
> trigger remount,ro or panic unless errors=continue, right? But I
> guess that's still the default, so maybe we do expect to continue.
> So I go back to the question of: have you tested with partial init
> like this, to be sure we don't fall off some cliff later?

There is also the check in ext4_validate_block_bitmap() that will
fail if the bitmap is bad in some way. At least that is marking
the bitmap bad in the group descriptor so that it is skipped for
later allocations. Probably the same needs to be done here.

It also makes sense to fail the mount outright if too many of the
bitmaps are bad, or only mount it read-only, since it will otherwise
be unusable.

Cheers, Andreas






2015-05-28 08:21:52

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

On Wed, 27 May 2015, Eric Sandeen wrote:

> Date: Wed, 27 May 2015 10:44:06 -0500
> From: Eric Sandeen <[email protected]>
> To: Lukas Czerner <[email protected]>, [email protected]
> Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of
> failure on ppc64
>
> On 5/27/15 5:25 AM, Lukas Czerner wrote:
> > Currently on the machines with page size > block size when initializing
> > block group buddy cache we initialize it for all the block group bitmaps
> > in the page. However in the case of read error, checksum error, or if
> > a single bitmap is in any way corrupted we would fail to initialize all
> > of the bitmaps. This is problematic because we will not have access to
> > the other allocation groups even though those might be perfectly fine
> > and usable.
> >
> > Fix this by reading all the bitmaps instead of error out on the first
> > problem and simply skip the bitmaps which were either not read properly,
> > or are not valid.
> >
> > Signed-off-by: Lukas Czerner <[email protected]>
> > ---
> > fs/ext4/mballoc.c | 11 ++++++++---
> > 1 file changed, 8 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 8d1e602..7e28007 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> >
> > /* wait for I/O completion */
> > for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> > - if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
> > + if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
> > err = -EIO;
> > - goto out;
> > - }
> > }
> >
> > first_block = page->index * blocks_per_page;
> > @@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > /* skip initialized uptodate buddy */
> > continue;
> >
> > + if (!buffer_verified(bh[group - first_group]) ||
> > + !buffer_uptodate(bh[group - first_group]))
> > + /* Skip faulty bitmaps */
> > + continue;
> > + else
> > + err = 0;
> > +
>
> Hm, ext4_wait_block_bitmap() can fail 3 ways (buffer not update, or not verified,
> but also if ext4_get_group_desc fails), but this only checks 2 of those, right?

Yes, maybe it'll be worth it to return an error if
ext4_get_group_desc() fails as well. -EIO is not exactly what
happens there, but I guess it's the nest option anyway.

>
> If ext4_get_group_desc fails, we'll have a bh which is new, may or may not be
> uptodate, and ... I guess it won't be verified in that case either, will it. So
> that's probably ok.

We still need to get the error code from ext4_wait_block_bitmap()
though.

>
> In fact, is the (!verified) test enough, here? (IOWs, could it possibly be verified,
> but not uptodate?)

Yes, now when I come to think about this !verified should be enough
for, because it should not be verified if its not uptodate.

>
>
> In the bigger picture - if the filesystem is corrupt (or the device is bad) in this
> way, do we really *want* to continue? What are the consequences of doing so,
> and have you tested with a filesystem partially-initialized like this?

First of all we already do this if the block bitmap is corrupt
(that's why we're verifying it) or checksum does not match (it's
corrupt). What we do there is just mark the bitmap corrupt so we do
not try that anymore.

But this is different, at least in my read example. Because if read
fails there is a slight chance that it'll succeed in the future
(link comes up again or whatever), so trying it again next time
seems ok to me. It should make file system more resilient that way.
However maybe I should make this on for errors=continue case.

And yes I tested it with fialty mdadm on ppc64.


>
> Thinking more about it ... (sorry for the stream of consciousness here) - if validation
> fails, encountering this sort of error will trigger remount,ro or panic unless
> errors=continue, right? But I guess that's still the default, so maybe we do expect
> to continue. So I go back to the question of: have you tested with partial init
> like this, to be sure we don't fall off some cliff later?

Yes, errors=continue is the default so the default behaviour would
be an endless loop without this patch.

Note that there is another problem that it'll loop forever even on
x86_64 in the case that all the bitmaps are corrupt, or can not be
read. That's mostly because we do not return error codes from
ext4_mb_good_group(). I am already testing a fix for that and will
send it together with v2 of this one.

Thanks!
-Lukas

>
> Thanks,
> -Eric
>

2015-05-28 08:26:58

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

On Wed, 27 May 2015, Andreas Dilger wrote:

> Date: Wed, 27 May 2015 14:07:03 -0600
> From: Andreas Dilger <[email protected]>
> To: Eric Sandeen <[email protected]>
> Cc: Lukas Czerner <[email protected]>, [email protected]
> Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of
> failure on ppc64
>
> On May 27, 2015, at 9:44 AM, Eric Sandeen <[email protected]> wrote:
> >
> > On 5/27/15 5:25 AM, Lukas Czerner wrote:
> >> Currently on the machines with page size > block size when initializing
> >> block group buddy cache we initialize it for all the block group bitmaps
> >> in the page. However in the case of read error, checksum error, or if
> >> a single bitmap is in any way corrupted we would fail to initialize all
> >> of the bitmaps. This is problematic because we will not have access to
> >> the other allocation groups even though those might be perfectly fine
> >> and usable.
> >>
> >> Fix this by reading all the bitmaps instead of error out on the first
> >> problem and simply skip the bitmaps which were either not read properly,
> >> or are not valid.
> >>
> >> Signed-off-by: Lukas Czerner <[email protected]>
> >> ---
> >> fs/ext4/mballoc.c | 11 ++++++++---
> >> 1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> >> index 8d1e602..7e28007 100644
> >> --- a/fs/ext4/mballoc.c
> >> +++ b/fs/ext4/mballoc.c
> >> @@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> >>
> >> /* wait for I/O completion */
> >> for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> >> - if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
> >> + if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
> >> err = -EIO;
> >> - goto out;
> >> - }
> >> }
> >>
> >> first_block = page->index * blocks_per_page;
> >> @@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> >> /* skip initialized uptodate buddy */
> >> continue;
> >>
> >> + if (!buffer_verified(bh[group - first_group]) ||
> >> + !buffer_uptodate(bh[group - first_group]))
> >> + /* Skip faulty bitmaps */
> >> + continue;
> >> + else
> >> + err = 0;
> >> +
> >
> > Hm, ext4_wait_block_bitmap() can fail 3 ways (buffer not update, or
> > not verified, but also if ext4_get_group_desc fails), but this only
> > checks 2 of those, right?
> >
> > If ext4_get_group_desc fails, we'll have a bh which is new, may or
> > may not be uptodate, and ... I guess it won't be verified in that
> > case either, will it. So that's probably ok.
> >
> > In fact, is the (!verified) test enough, here? (IOWs, could it
> > possibly be verified, but not uptodate?)
> >
> >
> > In the bigger picture - if the filesystem is corrupt (or the device
> > is bad) in this way, do we really *want* to continue? What are the
> > consequences of doing so, and have you tested with a filesystem
> > partially-initialized like this?
> >
> > Thinking more about it ... (sorry for the stream of consciousness
> > here) - if validation fails, encountering this sort of error will
> > trigger remount,ro or panic unless errors=continue, right? But I
> > guess that's still the default, so maybe we do expect to continue.
> > So I go back to the question of: have you tested with partial init
> > like this, to be sure we don't fall off some cliff later?
>
> There is also the check in ext4_validate_block_bitmap() that will
> fail if the bitmap is bad in some way. At least that is marking
> the bitmap bad in the group descriptor so that it is skipped for
> later allocations. Probably the same needs to be done here.

But if read fails it does not mean that the bitmap is bad. It might
be transient error and we might succeed next time. We only want to
hard fail if there are no more block groups to try - I am working on
the patch right now.

>
> It also makes sense to fail the mount outright if too many of the
> bitmaps are bad, or only mount it read-only, since it will otherwise
> be unusable.

Have not tried the mount case yet as I usually make the device fail
after the mount. But then it becomes the question of how many bad
bitmaps is bad. Note that we'll always leave trace in the logs that
the file system is not in the best shape and we will fail eventually
if we run out of bitmaps to try.

Thanks!
-Lukas


>
> Cheers, Andreas
>
>
>
>
>
>

2015-05-28 11:34:16

by Lukas Czerner

[permalink] [raw]
Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of failure on ppc64

On Thu, 28 May 2015, Luk?? Czerner wrote:

> Date: Thu, 28 May 2015 10:21:47 +0200 (CEST)
> From: Luk?? Czerner <[email protected]>
> To: Eric Sandeen <[email protected]>
> Cc: [email protected]
> Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of
> failure on ppc64
>
> On Wed, 27 May 2015, Eric Sandeen wrote:
>
> > Date: Wed, 27 May 2015 10:44:06 -0500
> > From: Eric Sandeen <[email protected]>
> > To: Lukas Czerner <[email protected]>, [email protected]
> > Subject: Re: [PATCH] ext4: Try to initialize all groups we can in case of
> > failure on ppc64
> >
> > On 5/27/15 5:25 AM, Lukas Czerner wrote:
> > > Currently on the machines with page size > block size when initializing
> > > block group buddy cache we initialize it for all the block group bitmaps
> > > in the page. However in the case of read error, checksum error, or if
> > > a single bitmap is in any way corrupted we would fail to initialize all
> > > of the bitmaps. This is problematic because we will not have access to
> > > the other allocation groups even though those might be perfectly fine
> > > and usable.
> > >
> > > Fix this by reading all the bitmaps instead of error out on the first
> > > problem and simply skip the bitmaps which were either not read properly,
> > > or are not valid.
> > >
> > > Signed-off-by: Lukas Czerner <[email protected]>
> > > ---
> > > fs/ext4/mballoc.c | 11 ++++++++---
> > > 1 file changed, 8 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 8d1e602..7e28007 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -882,10 +882,8 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > >
> > > /* wait for I/O completion */
> > > for (i = 0, group = first_group; i < groups_per_page; i++, group++) {
> > > - if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i])) {
> > > + if (bh[i] && ext4_wait_block_bitmap(sb, group, bh[i]))
> > > err = -EIO;
> > > - goto out;
> > > - }
> > > }
> > >
> > > first_block = page->index * blocks_per_page;
> > > @@ -898,6 +896,13 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > > /* skip initialized uptodate buddy */
> > > continue;
> > >
> > > + if (!buffer_verified(bh[group - first_group]) ||
> > > + !buffer_uptodate(bh[group - first_group]))
> > > + /* Skip faulty bitmaps */
> > > + continue;
> > > + else
> > > + err = 0;
> > > +
> >
> > Hm, ext4_wait_block_bitmap() can fail 3 ways (buffer not update, or not verified,
> > but also if ext4_get_group_desc fails), but this only checks 2 of those, right?
>
> Yes, maybe it'll be worth it to return an error if
> ext4_get_group_desc() fails as well. -EIO is not exactly what
> happens there, but I guess it's the nest option anyway.
>
> >
> > If ext4_get_group_desc fails, we'll have a bh which is new, may or may not be
> > uptodate, and ... I guess it won't be verified in that case either, will it. So
> > that's probably ok.
>
> We still need to get the error code from ext4_wait_block_bitmap()
> though.
>
> >
> > In fact, is the (!verified) test enough, here? (IOWs, could it possibly be verified,
> > but not uptodate?)
>
> Yes, now when I come to think about this !verified should be enough
> for, because it should not be verified if its not uptodate.
>
> >
> >
> > In the bigger picture - if the filesystem is corrupt (or the device is bad) in this
> > way, do we really *want* to continue? What are the consequences of doing so,
> > and have you tested with a filesystem partially-initialized like this?
>
> First of all we already do this if the block bitmap is corrupt
> (that's why we're verifying it) or checksum does not match (it's
> corrupt). What we do there is just mark the bitmap corrupt so we do
> not try that anymore.
>
> But this is different, at least in my read example. Because if read
> fails there is a slight chance that it'll succeed in the future
> (link comes up again or whatever), so trying it again next time
> seems ok to me. It should make file system more resilient that way.
> However maybe I should make this on for errors=continue case.

Nevermind it already is errors=continue only case.

>
> And yes I tested it with fialty mdadm on ppc64.
>
>
> >
> > Thinking more about it ... (sorry for the stream of consciousness here) - if validation
> > fails, encountering this sort of error will trigger remount,ro or panic unless
> > errors=continue, right? But I guess that's still the default, so maybe we do expect
> > to continue. So I go back to the question of: have you tested with partial init
> > like this, to be sure we don't fall off some cliff later?
>
> Yes, errors=continue is the default so the default behaviour would
> be an endless loop without this patch.
>
> Note that there is another problem that it'll loop forever even on
> x86_64 in the case that all the bitmaps are corrupt, or can not be
> read. That's mostly because we do not return error codes from
> ext4_mb_good_group(). I am already testing a fix for that and will
> send it together with v2 of this one.
>
> Thanks!
> -Lukas
>
> >
> > Thanks,
> > -Eric
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>