2006-11-16 05:48:59

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 15 Nov 2006 14:17:01 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 14 Nov 2006, Hugh Dickins wrote:
> > On Tue, 14 Nov 2006, Andrew Morton wrote:
> > >
> > > The below might help.
> >
> > Indeed it does (with Martin's E2FSBLK warning fix),
> > seems to be running well on all machines now.
>
> i386 and ppc64 still doing builds, but after an hour on x86_64,
> an ld got stuck in a loop under ext2_try_to_allocate_with_rsv,
> alternating between ext2_rsv_window_add and rsv_window_remove.
> Send me a patch and I'll try it...
>
> ext2_try_to_allocate_with_rsv+0x288
> ext2_new_blocks+0x21e
> ext2_get_blocks+0x398
> ext2_get_block+0x46
> __block_prepare_write+0x171
> block_prepare_write+0x39
> ext2_prepare_write+0x2c
> generic_file_buffered_write+0x2b0
> __generic_file_aio_write_nolock+0x4bc
> generic_file_aio_write+0x6d
> do_sync_write+0xf9
> vfs_write+0xc8
> sys_write+0x51

OK, I have a theory.

This must have been the seventeenth damn time I've stared at
find_next_zero_bit() wondering what the damn return value is and wondering
how any even slightly non-sadistic person could write a damn function like
that and not damn well document it.

int find_next_zero_bit(const unsigned long *addr, int size, int offset)

It returns the offset of the first zero bit relative to addr.

ext3's bitmap_search_next_usable_block() assumed that find_next_zero_bit()
returns the offset of the first zero bit relative to (addr+offset).

The while loop in ext3's bitmap_search_next_usable_block() serendipitously
covered that bug up.

ext2's bitmap_search_next_usable_block() doesn't need that while loop, so
ext3's benign bug became ext2's fatal bug.

So...

--- a/fs/ext2/balloc.c~a
+++ a/fs/ext2/balloc.c
@@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
ext2_grpblk_t next;

next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
- if (next >= maxblocks)
+ if (next >= start + maxblocks)
return -1;
return next;
}
_

Anyway, I think that's the bug. Or a bug, at least. If so, the cause of
this bug is inadequate code commenting, pure and simple. And ext3 and ext4
need fixing.


2006-11-16 06:39:47

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 15 Nov 2006 21:45:34 -0800
Andrew Morton <[email protected]> wrote:

> --- a/fs/ext2/balloc.c~a
> +++ a/fs/ext2/balloc.c
> @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> ext2_grpblk_t next;
>
> next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> - if (next >= maxblocks)
> + if (next >= start + maxblocks)
> return -1;
> return next;
> }
> _
>
> Anyway, I think that's the bug.

Changed my mind. Drat.

2006-11-16 06:56:02

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

Andrew Morton wrote:
> On Wed, 15 Nov 2006 14:17:01 +0000 (GMT)
> Hugh Dickins <[email protected]> wrote:
>
>
>>On Tue, 14 Nov 2006, Hugh Dickins wrote:
>>
>>>On Tue, 14 Nov 2006, Andrew Morton wrote:
>>>
>>>>The below might help.
>>>
>>>Indeed it does (with Martin's E2FSBLK warning fix),
>>>seems to be running well on all machines now.
>>
>>i386 and ppc64 still doing builds, but after an hour on x86_64,
>>an ld got stuck in a loop under ext2_try_to_allocate_with_rsv,
>>alternating between ext2_rsv_window_add and rsv_window_remove.
>>Send me a patch and I'll try it...
>>
>>ext2_try_to_allocate_with_rsv+0x288
>>ext2_new_blocks+0x21e
>>ext2_get_blocks+0x398
>>ext2_get_block+0x46
>>__block_prepare_write+0x171
>>block_prepare_write+0x39
>>ext2_prepare_write+0x2c
>>generic_file_buffered_write+0x2b0
>>__generic_file_aio_write_nolock+0x4bc
>>generic_file_aio_write+0x6d
>>do_sync_write+0xf9
>>vfs_write+0xc8
>>sys_write+0x51
>
>
> OK, I have a theory.
>
> This must have been the seventeenth damn time I've stared at
> find_next_zero_bit() wondering what the damn return value is and wondering
> how any even slightly non-sadistic person could write a damn function like
> that and not damn well document it.
>
> int find_next_zero_bit(const unsigned long *addr, int size, int offset)
>
> It returns the offset of the first zero bit relative to addr.
>
> ext3's bitmap_search_next_usable_block() assumed that find_next_zero_bit()
> returns the offset of the first zero bit relative to (addr+offset).
>
> The while loop in ext3's bitmap_search_next_usable_block() serendipitously
> covered that bug up.
>
> ext2's bitmap_search_next_usable_block() doesn't need that while loop, so
> ext3's benign bug became ext2's fatal bug.
>
> So...
>
> --- a/fs/ext2/balloc.c~a
> +++ a/fs/ext2/balloc.c
> @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> ext2_grpblk_t next;
>
> next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> - if (next >= maxblocks)
> + if (next >= start + maxblocks)
> return -1;
> return next;
> }
> _
>
> Anyway, I think that's the bug. Or a bug, at least. If so, the cause of
> this bug is inadequate code commenting, pure and simple. And ext3 and ext4
> need fixing.
>
Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
number of the range to search, not the lengh of the range. maxblocks
get passed to ext2_find_next_zero_bit(), where it expecting to take the
_size_ of the range to search instead...

Something like this: (this is not a patch)
@@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
ext2_grpblk_t next;

- next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
+ next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1,
start);
if (next >= maxblocks)
return -1;
return next;
}


Mingming



Yes, it's quite confusing and probably we should replace it a better
name......

Mingming

2006-11-16 07:22:47

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 15 Nov 2006 22:55:43 -0800
Mingming Cao <[email protected]> wrote:

> Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> number of the range to search, not the lengh of the range. maxblocks
> get passed to ext2_find_next_zero_bit(), where it expecting to take the
> _size_ of the range to search instead...
>
> Something like this: (this is not a patch)
> @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> ext2_grpblk_t next;
>
> - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> if (next >= maxblocks)
> return -1;
> return next;
> }

yes, the `size' arg to find_next_zero_bit() represents the number of bits
to scan at `offset'.

So I think your change is correctish. But we don't want the "+ 1", do we?

If we're right then this bug could cause the code to scan off the end of the
bitmap. But it won't explain Hugh's bug, because of the if (next >= maxblocks).

btw, how come try_to_extend_reservation() uses spin_trylock?

2006-11-16 08:49:27

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 2006-11-15 at 23:22 -0800, Andrew Morton wrote:
> On Wed, 15 Nov 2006 22:55:43 -0800
> Mingming Cao <[email protected]> wrote:
>
> > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > number of the range to search, not the lengh of the range. maxblocks
> > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > _size_ of the range to search instead...
> >
> > Something like this: (this is not a patch)
> > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > ext2_grpblk_t next;
> >
> > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > if (next >= maxblocks)
> > return -1;
> > return next;
> > }
>
> yes, the `size' arg to find_next_zero_bit() represents the number of bits
> to scan at `offset'.
>
> So I think your change is correctish. But we don't want the "+ 1", do we?
>
I think we still need the "+1", maxblocks here is the ending block of
the reservation window, so the number of bits to scan =end-start+1.

> If we're right then this bug could cause the code to scan off the end of the
> bitmap. But it won't explain Hugh's bug, because of the if (next >= maxblocks).
>

Yeah.. at first I thought it might be related, then, thinked it over,
the bug only makes the bits to scan larger, so if find_next_zero_bit()
returns something off the end of bitmap, that is fine, it just
indicating that there is no free bit left in the rest of bitmap, which
is expected behavior. So bitmap_search_next_usable_block() fail is the
expected. It will move on to next block group and try to create a new
reservation window there.

That does not explain the repeated reservation window add and remove
behavior Huge has reported.

> btw, how come try_to_extend_reservation() uses spin_trylock?

Since locks are all allocated from reservation window, when ext3
multiple blocks allocation was added, we added try_to_extend_reservation
() to ext3, which trying to extend the reservation window size to at
least match the number of blocks to allocate. So we have better chance
to allocating multiple blocks from the window at a time.

Since all the multiple block allocation is based on best effort basis,
the same applied to try_to_extend_reservation(). It seems no need to
wait for the reservation tree lock if it's not avaible at that moment.

Mingming

2006-11-16 09:21:17

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006 00:49:20 -0800
Mingming Cao <[email protected]> wrote:

> On Wed, 2006-11-15 at 23:22 -0800, Andrew Morton wrote:
> > On Wed, 15 Nov 2006 22:55:43 -0800
> > Mingming Cao <[email protected]> wrote:
> >
> > > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > > number of the range to search, not the lengh of the range. maxblocks
> > > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > > _size_ of the range to search instead...
> > >
> > > Something like this: (this is not a patch)
> > > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > ext2_grpblk_t next;
> > >
> > > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > > if (next >= maxblocks)
> > > return -1;
> > > return next;
> > > }
> >
> > yes, the `size' arg to find_next_zero_bit() represents the number of bits
> > to scan at `offset'.
> >
> > So I think your change is correctish. But we don't want the "+ 1", do we?
> >
> I think we still need the "+1", maxblocks here is the ending block of
> the reservation window, so the number of bits to scan =end-start+1.
>
> > If we're right then this bug could cause the code to scan off the end of the
> > bitmap. But it won't explain Hugh's bug, because of the if (next >= maxblocks).
> >
>
> Yeah.. at first I thought it might be related, then, thinked it over,
> the bug only makes the bits to scan larger, so if find_next_zero_bit()
> returns something off the end of bitmap, that is fine, it just
> indicating that there is no free bit left in the rest of bitmap, which
> is expected behavior. So bitmap_search_next_usable_block() fail is the
> expected. It will move on to next block group and try to create a new
> reservation window there.

I wonder why it's never oopsed. Perhaps there's always a zero in there for
some reason.

> That does not explain the repeated reservation window add and remove
> behavior Huge has reported.

I spent quite some time comparing with ext3. I'm a bit stumped and I'm
suspecting that the simplistic porting the code is now OK, but something's
just wrong.

I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
gone infinite. I don't see why, but more staring is needed.

What lock protects the fields in struct ext[234]_reserve_window from being
concurrently modified by two CPUs? None, it seems. Ditto
ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
for pageout over a file hole. If we end up with a zero- or negative-sized
window then odd things might happen.

> > btw, how come try_to_extend_reservation() uses spin_trylock?
>
> Since locks are all allocated from reservation window, when ext3
> multiple blocks allocation was added, we added try_to_extend_reservation
> () to ext3, which trying to extend the reservation window size to at
> least match the number of blocks to allocate. So we have better chance
> to allocating multiple blocks from the window at a time.
>
> Since all the multiple block allocation is based on best effort basis,
> the same applied to try_to_extend_reservation(). It seems no need to
> wait for the reservation tree lock if it's not avaible at that moment.
>

I suspect that was not a good idea - it has added rather different
behaviour in the once-in-a-million case.

2006-11-16 09:39:54

by Alex Tomas

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

>>>>> Andrew Morton (AM) writes:

AM> What lock protects the fields in struct ext[234]_reserve_window from being
AM> concurrently modified by two CPUs? None, it seems. Ditto
AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
AM> for pageout over a file hole. If we end up with a zero- or negative-sized
AM> window then odd things might happen.

truncate_mutex?

thanks, Alex

2006-11-16 09:50:48

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006 01:48:09 -0800
Andrew Morton <[email protected]> wrote:

> On Thu, 16 Nov 2006 12:37:17 +0300
> Alex Tomas <[email protected]> wrote:
>
> > >>>>> Andrew Morton (AM) writes:
> >
> > AM> What lock protects the fields in struct ext[234]_reserve_window from being
> > AM> concurrently modified by two CPUs? None, it seems. Ditto
> > AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
> > AM> for pageout over a file hole. If we end up with a zero- or negative-sized
> > AM> window then odd things might happen.
> >
> > truncate_mutex?
> >
>
> yes. hmm.

by which I mean "ext2 doesn't have a truncate_mutex".

2006-11-16 09:51:32

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006 12:37:17 +0300
Alex Tomas <[email protected]> wrote:

> >>>>> Andrew Morton (AM) writes:
>
> AM> What lock protects the fields in struct ext[234]_reserve_window from being
> AM> concurrently modified by two CPUs? None, it seems. Ditto
> AM> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
> AM> for pageout over a file hole. If we end up with a zero- or negative-sized
> AM> window then odd things might happen.
>
> truncate_mutex?
>

yes. hmm.

2006-11-16 12:35:11

by Russell King

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, Nov 15, 2006 at 11:22:28PM -0800, Andrew Morton wrote:
> On Wed, 15 Nov 2006 22:55:43 -0800
> Mingming Cao <[email protected]> wrote:
>
> > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > number of the range to search, not the lengh of the range. maxblocks
> > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > _size_ of the range to search instead...
> >
> > Something like this: (this is not a patch)
> > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > ext2_grpblk_t next;
> >
> > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > if (next >= maxblocks)
> > return -1;
> > return next;
> > }
>
> yes, the `size' arg to find_next_zero_bit() represents the number of bits
> to scan at `offset'.

Are you sure? That's not the way it's implemented in many architectures.
find_next_*_bit() has always taken "address, maximum offset, starting offset"
and always has returned "next offset".

Just look at arch/i386/lib/bitops.c:

int find_next_zero_bit(const unsigned long *addr, int size, int offset)
{
unsigned long * p = ((unsigned long *) addr) + (offset >> 5);
int set = 0, bit = offset & 31, res;
...
/*
* No zero yet, search remaining full bytes for a zero
*/
res = find_first_zero_bit (p, size - 32 * (p - (unsigned long *) addr));
return (offset + set + res);
}

So for the case that "offset" is aligned to a "long" boundary, that gives us:

res = find_first_zero_bit(addr + (offset>>5),
size - 32 * (addr + (offset>>5) - addr));

or:

res = find_first_zero_bit(addr + (offset>>5), size - (offset & ~31));

So, size _excludes_ offset.

Now, considering the return value, "res" above will be relative to
"addr + (offset>>5)". However, we add "offset" on to that, so it's
relative to addr + (offset bits).

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2006-11-16 16:25:47

by Hugh Dickins

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006, Andrew Morton wrote:
> On Thu, 16 Nov 2006 00:49:20 -0800
> Mingming Cao <[email protected]> wrote:
>
> > That does not explain the repeated reservation window add and remove
> > behavior Huge has reported.
>
> I spent quite some time comparing with ext3. I'm a bit stumped and I'm
> suspecting that the simplistic porting the code is now OK, but something's
> just wrong.
>
> I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> gone infinite. I don't see why, but more staring is needed.

Just to report that similar tests on three machines have each run
for 20 hours so far without any such infinite loop reoccurring.

Well, I broke off the x86_64 for a couple of hours: wondered if I'd got
confused, forgotten to "rmmod ext2" at one stage, and saw that behaviour
with my simple "ext2fs_blk_t ret_block" patch, rather than your more
extensive patch to ext2_new_blocks() that I'd believed I was testing.
I didn't give it long enough to be conclusive, but the problem didn't
show up with that either, so I've gone back to testing with yours.

I'd have kept the hang around for longer if I'd guessed it would be
hard to reproduce, and that it would be puzzling even to you: sorry.
kdb is in, if it comes again I can peer at it more closely.

Hugh

2006-11-16 20:15:21

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 2006-11-16 at 01:13 -0800, Andrew Morton wrote:
> On Thu, 16 Nov 2006 00:49:20 -0800
> Mingming Cao <[email protected]> wrote:
>
> > On Wed, 2006-11-15 at 23:22 -0800, Andrew Morton wrote:
> > > On Wed, 15 Nov 2006 22:55:43 -0800
> > > Mingming Cao <[email protected]> wrote:
> > >
> > > > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > > > number of the range to search, not the lengh of the range. maxblocks
> > > > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > > > _size_ of the range to search instead...
> > > >
> > > > Something like this: (this is not a patch)
> > > > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > > ext2_grpblk_t next;
> > > >
> > > > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > > > if (next >= maxblocks)
> > > > return -1;
> > > > return next;
> > > > }
> > >
> > > yes, the `size' arg to find_next_zero_bit() represents the number of bits
> > > to scan at `offset'.
> > >
> > > So I think your change is correctish. But we don't want the "+ 1", do we?
> > >
> > I think we still need the "+1", maxblocks here is the ending block of
> > the reservation window, so the number of bits to scan =end-start+1.
> >
> > > If we're right then this bug could cause the code to scan off the end of the
> > > bitmap. But it won't explain Hugh's bug, because of the if (next >= maxblocks).
> > >
> >
> > Yeah.. at first I thought it might be related, then, thinked it over,
> > the bug only makes the bits to scan larger, so if find_next_zero_bit()
> > returns something off the end of bitmap, that is fine, it just
> > indicating that there is no free bit left in the rest of bitmap, which
> > is expected behavior. So bitmap_search_next_usable_block() fail is the
> > expected. It will move on to next block group and try to create a new
> > reservation window there.
>
> I wonder why it's never oopsed. Perhaps there's always a zero in there for
> some reason.
>

Why you think it should oopsed? Even if find_next_zero_bit() finds a
zero bit beyond of the end of bitmap, the check next > maxblocks will
catch this and make sure we are not taking a zero bit out of the bitmap
range, so it fails as expected.

> > That does not explain the repeated reservation window add and remove
> > behavior Huge has reported.
>
> I spent quite some time comparing with ext3. I'm a bit stumped and I'm
> suspecting that the simplistic porting the code is now OK, but something's
> just wrong.
>
> I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> gone infinite. I don't see why, but more staring is needed.
>

The loop should not go forever, it will stops when there is no window
with free bit to reserve in the given block group.

> What lock protects the fields in struct ext[234]_reserve_window from being
> concurrently modified by two CPUs? None, it seems. Ditto
> ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
> for pageout over a file hole. If we end up with a zero- or negative-sized
> window then odd things might happen.
>

Yes, trucate_mutex protect both struct ext[234]_reserve_window and ext
[234]_reserve_window_node, and struct ext[234]_block_alloc_info.
Actually I think truncate_mutex protects all data structures related to
block allocation/mapping structures.



Mingming

2006-11-16 21:27:40

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006 12:15:16 -0800
Mingming Cao <[email protected]> wrote:

> On Thu, 2006-11-16 at 01:13 -0800, Andrew Morton wrote:
> > On Thu, 16 Nov 2006 00:49:20 -0800
> > Mingming Cao <[email protected]> wrote:
> >
> > > On Wed, 2006-11-15 at 23:22 -0800, Andrew Morton wrote:
> > > > On Wed, 15 Nov 2006 22:55:43 -0800
> > > > Mingming Cao <[email protected]> wrote:
> > > >
> > > > > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > > > > number of the range to search, not the lengh of the range. maxblocks
> > > > > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > > > > _size_ of the range to search instead...
> > > > >
> > > > > Something like this: (this is not a patch)
> > > > > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > > > ext2_grpblk_t next;
> > > > >
> > > > > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > > > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > > > > if (next >= maxblocks)
> > > > > return -1;
> > > > > return next;
> > > > > }
> > > >
> > > > yes, the `size' arg to find_next_zero_bit() represents the number of bits
> > > > to scan at `offset'.
> > > >
> > > > So I think your change is correctish. But we don't want the "+ 1", do we?
> > > >
> > > I think we still need the "+1", maxblocks here is the ending block of
> > > the reservation window, so the number of bits to scan =end-start+1.
> > >
> > > > If we're right then this bug could cause the code to scan off the end of the
> > > > bitmap. But it won't explain Hugh's bug, because of the if (next >= maxblocks).
> > > >
> > >
> > > Yeah.. at first I thought it might be related, then, thinked it over,
> > > the bug only makes the bits to scan larger, so if find_next_zero_bit()
> > > returns something off the end of bitmap, that is fine, it just
> > > indicating that there is no free bit left in the rest of bitmap, which
> > > is expected behavior. So bitmap_search_next_usable_block() fail is the
> > > expected. It will move on to next block group and try to create a new
> > > reservation window there.
> >
> > I wonder why it's never oopsed. Perhaps there's always a zero in there for
> > some reason.
> >
>
> Why you think it should oopsed? Even if find_next_zero_bit() finds a
> zero bit beyond of the end of bitmap, the check next > maxblocks will
> catch this and make sure we are not taking a zero bit out of the bitmap
> range, so it fails as expected.

If it can read off the end of the buffer, it can oops. With
CONFIG_DEBUG_PAGEALLOC, especially.


> > > That does not explain the repeated reservation window add and remove
> > > behavior Huge has reported.
> >
> > I spent quite some time comparing with ext3. I'm a bit stumped and I'm
> > suspecting that the simplistic porting the code is now OK, but something's
> > just wrong.
> >
> > I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> > gone infinite. I don't see why, but more staring is needed.
> >
>
> The loop should not go forever, it will stops when there is no window
> with free bit to reserve in the given block group.

It seems to have done so in Hugh's testing, but there's some question there
now. Although I didn't check to see if there's a significant difference
between Hugh's patch and mine.


> > What lock protects the fields in struct ext[234]_reserve_window from being
> > concurrently modified by two CPUs? None, it seems. Ditto
> > ext[234]_reserve_window_node. i_mutex will cover it for write(), but not
> > for pageout over a file hole. If we end up with a zero- or negative-sized
> > window then odd things might happen.
> >
>
> Yes, trucate_mutex protect both struct ext[234]_reserve_window and ext
> [234]_reserve_window_node, and struct ext[234]_block_alloc_info.
> Actually I think truncate_mutex protects all data structures related to
> block allocation/mapping structures.

Yes. I guess ext2 needs a new mutex for this. Sad.

2006-11-20 16:19:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, 16 Nov 2006, Andrew Morton wrote:
> On Thu, 16 Nov 2006 12:15:16 -0800
> Mingming Cao <[email protected]> wrote:
> > On Thu, 2006-11-16 at 01:13 -0800, Andrew Morton wrote:
> > >
> > > I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> > > gone infinite. I don't see why, but more staring is needed.
> >
> > The loop should not go forever, it will stops when there is no window
> > with free bit to reserve in the given block group.
>
> It seems to have done so in Hugh's testing, but there's some question there
> now. Although I didn't check to see if there's a significant difference
> between Hugh's patch and mine.

After four days of running, the EM64T has at last reproduced the same
hang as it did in an hour before: stuck in ext2_try_to_allocate_with_rsv,
repeatedly ext2_rsv_window_add, ext2_try_to_allocate, rsv_window_remove
(perhaps not in that order).

But somehow I've screwed up, and before I'd extracted any new info,
kdb was spinning on its own kdb_printf_lock, and then the box completely
frozen: nothing for it but to reboot.

Grrrr.

Well: at least this resolves the doubt I raised earlier, whether
I'd been testing with the right ext2 patch. This time was definitely
with your patch not my two-liner: your original patch from Tuesday 14 Nov,
ext2-reservations-bring-ext2-reservations-code-in-line-with-latest-ext3-fix

Various other patches have come into -mm (one gone) since then, but I've
been running just with that: so far as I can see, none of the later ones
should be important in my case - the filesystem is too small for the
difference between int and ext2_fsblk_t to become important, and
neither "ld" nor "as" (the writer this time) does MAP_SHARED pageout.

I'll do a little staring at the code myself: I'm unlikely to notice
anything you've missed, but there's just a chance staring at it will
direct me to some detail I've jotted down from before.

Hugh

2006-11-20 20:54:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Mon, 20 Nov 2006, Hugh Dickins wrote:
>
> I'll do a little staring at the code myself: I'm unlikely to notice
> anything you've missed, but there's just a chance staring at it will
> direct me to some detail I've jotted down from before.

Not found anything relevant, but I keep noticing these lines
in ext2_try_to_allocate_with_rsv(), ext3 and ext4 similar:

} else if (grp_goal > 0 &&
(my_rsv->rsv_end - grp_goal + 1) < *count)
try_to_extend_reservation(my_rsv, sb,
*count-my_rsv->rsv_end + grp_goal - 1);

They're wrong, a no-op in most groups, aren't they? rsv_end is an
absolute block number, whereas grp_goal is group-relative, so the
calculation ought to bring in group_first_block? Or I'm confused.

(Whereas in my hang the grp_goal to ext2_try_to_allocate was -1
when I looked, with group 0 and num 1.)

Hugh

2006-11-21 01:36:08

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Mon, 2006-11-20 at 20:54 +0000, Hugh Dickins wrote:
> Not found anything relevant, but I keep noticing these lines
> in ext2_try_to_allocate_with_rsv(), ext3 and ext4 similar:
>
> } else if (grp_goal > 0 &&
> (my_rsv->rsv_end - grp_goal + 1) < *count)
> try_to_extend_reservation(my_rsv, sb,
> *count-my_rsv->rsv_end + grp_goal - 1);
>
> They're wrong, a no-op in most groups, aren't they? rsv_end is an
> absolute block number, whereas grp_goal is group-relative, so the
> calculation ought to bring in group_first_block? Or I'm confused.
>

You are right, thanks. Here are two patches, to fix this bug in ext3/4
and ext2 correspondingly.


Signed-Off-By: Mingming Cao <[email protected]>

---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 12 ++++++++----
linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 12 ++++++++----
2 files changed, 16 insertions(+), 8 deletions(-)

diff -puN fs/ext3/balloc.c~ext34_extend_reservation_window_fix fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext34_extend_reservation_window_fix 2006-11-20 15:58:11.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-20 15:58:11.000000000 -0800
@@ -1307,10 +1307,14 @@ ext3_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0 &&
- (my_rsv->rsv_end-grp_goal+1) < *count)
- try_to_extend_reservation(my_rsv, sb,
- *count-my_rsv->rsv_end + grp_goal - 1);
+ } else if (grp_goal > 0) {
+ int curr = my_rsv->rsv_end -
+ (grp_goal + group_first_block) + 1;
+
+ if (curr < *count)
+ try_to_extend_reservation(my_rsv, sb,
+ *count - curr);
+ }

if ((my_rsv->rsv_start > group_last_block) ||
(my_rsv->rsv_end < group_first_block)) {
diff -puN fs/ext4/balloc.c~ext34_extend_reservation_window_fix fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext34_extend_reservation_window_fix 2006-11-20 15:58:11.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-20 15:58:11.000000000 -0800
@@ -1324,10 +1324,14 @@ ext4_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0 &&
- (my_rsv->rsv_end-grp_goal+1) < *count)
- try_to_extend_reservation(my_rsv, sb,
- *count-my_rsv->rsv_end + grp_goal - 1);
+ } else if (grp_goal > 0) {
+ int curr = my_rsv->rsv_end -
+ (grp_goal + group_first_block) + 1;
+
+ if (curr < *count)
+ try_to_extend_reservation(my_rsv, sb,
+ *count - curr);
+ }

if ((my_rsv->rsv_start > group_last_block) ||
(my_rsv->rsv_end < group_first_block)) {


Sync up ext2 with ext3/4 for the extend reservation window bug.

Signed-Off-By: Mingming Cao <[email protected]>



---

linux-2.6.19-rc5-cmm/fs/ext2/balloc.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff -puN fs/ext2/balloc.c~ext2_reservation_extend_window_fix fs/ext2/balloc.c
--- linux-2.6.19-rc5/fs/ext2/balloc.c~ext2_reservation_extend_window_fix 2006-11-20 16:05:36.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext2/balloc.c 2006-11-20 16:05:36.000000000 -0800
@@ -1091,10 +1091,14 @@ ext2_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0 &&
- (my_rsv->rsv_end - grp_goal + 1) < *count)
- try_to_extend_reservation(my_rsv, sb,
- *count-my_rsv->rsv_end + grp_goal - 1);
+ } else if (grp_goal > 0) {
+ int curr = my_rsv->rsv_end -
+ (grp_goal + group_first_block) + 1;
+
+ if (curr < *count)
+ try_to_extend_reservation(my_rsv, sb,
+ *count - curr);
+ }

if ((my_rsv->rsv_start >=
group_first_block + EXT2_BLOCKS_PER_GROUP(sb))

_

2006-11-21 01:47:37

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Mon, 2006-11-20 at 16:19 +0000, Hugh Dickins wrote:
> On Thu, 16 Nov 2006, Andrew Morton wrote:
> > On Thu, 16 Nov 2006 12:15:16 -0800
> > Mingming Cao <[email protected]> wrote:
> > > On Thu, 2006-11-16 at 01:13 -0800, Andrew Morton wrote:
> > > >
> > > > I assume that the while (1) loop in ext3_try_to_allocate_with_rsv() has
> > > > gone infinite. I don't see why, but more staring is needed.
> > >
> > > The loop should not go forever, it will stops when there is no window
> > > with free bit to reserve in the given block group.
> >
> > It seems to have done so in Hugh's testing, but there's some question there
> > now. Although I didn't check to see if there's a significant difference
> > between Hugh's patch and mine.
>
> After four days of running, the EM64T has at last reproduced the same
> hang as it did in an hour before: stuck in ext2_try_to_allocate_with_rsv,
> repeatedly ext2_rsv_window_add, ext2_try_to_allocate, rsv_window_remove
> (perhaps not in that order).
>
> But somehow I've screwed up, and before I'd extracted any new info,
> kdb was spinning on its own kdb_printf_lock, and then the box completely
> frozen: nothing for it but to reboot.
>
> Grrrr.
>
> Well: at least this resolves the doubt I raised earlier, whether
> I'd been testing with the right ext2 patch. This time was definitely
> with your patch not my two-liner: your original patch from Tuesday 14 Nov,
> ext2-reservations-bring-ext2-reservations-code-in-line-with-latest-ext3-fix
>
> Various other patches have come into -mm (one gone) since then, but I've
> been running just with that: so far as I can see, none of the later ones
> should be important in my case - the filesystem is too small for the
> difference between int and ext2_fsblk_t to become important, and
> neither "ld" nor "as" (the writer this time) does MAP_SHARED pageout.
>

So there is only one writer at the moment the hang was happening?

hmm, is the filesystem relatively all being used or reserved, i.e, the
free bits are all being reserved? There is one extreme case that may
cause starvation. If filesystem free blocks are all being reserved, when
a new writer need a free block, it has to go through the entire
filesystems, try to reserve a space, which will repeatly calling
rsv_window_add and rsv_window_remove, since. Finally give up and fall
back to allocation without reservation. But this is all theory, not sure
fits your case here.

2006-11-21 05:38:53

by Hugh Dickins

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Mon, 20 Nov 2006, Mingming Cao wrote:
>
> So there is only one writer at the moment the hang was happening?

I expect there were multiple writers when the task which hangs
first entered its ext2_prepare_write (it's a make -j20 build on
that ext2 filesystem); but by the time I come to look at the hang,
there's only that one writer active - everything else would be waiting
on that part of the build to complete (well, your question makes me
realize that I didn't look down the whole "ps" listing to see what
was waiting; but the hanging task is the only one I see running on
on any cpu, each time I break in).

>
> hmm, is the filesystem relatively all being used or reserved, i.e, the
> free bits are all being reserved? There is one extreme case that may
> cause starvation. If filesystem free blocks are all being reserved, when
> a new writer need a free block, it has to go through the entire
> filesystems, try to reserve a space, which will repeatly calling
> rsv_window_add and rsv_window_remove, since. Finally give up and fall
> back to allocation without reservation. But this is all theory, not sure
> fits your case here.

I can understand that there may be a worst case like that: but I hope
it wouldn't take 20 hours to find a free block on a default 340MB ext2
filesystem! And unless something else has gone wrong, this build would
not be filling the filesystem to that extent: it's probably around 80%
full at this stage, and shouldn't get fuller than 98% in the end.

Any suggestions for what I might check, next time it happens?

Hugh

2006-11-22 00:43:15

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Mon, 2006-11-20 at 16:19 +0000, Hugh Dickins wrote:
> After four days of running, the EM64T has at last reproduced the same
> hang as it did in an hour before: stuck in
> ext2_try_to_allocate_with_rsv,
> repeatedly ext2_rsv_window_add, ext2_try_to_allocate,
> rsv_window_remove
> (perhaps not in that order).
>

Don't have much clue, still...:(

The logic of the while loop in ext2_try_to_allocate_with_rsv() looks
like:

while (1) {
/*make a new reservation window*/
ret = allocate_new_reservation();
if (ret < 0)
break /*fail*/
/*allocate a block from the reservation window */
ret = ext2_try_to_allocate_with_rsv()
if (ret > 0) {
...
break; /*success*/
}
}
If it stucks in the loop, that means next two things have to be true:

1) ext2_try_to_allocate() keeps failing to allocate a bit from the
window just reserved, every time.

2) we could endlessly create a new reservation window in this block
group.


For (1), when create a new reservation window, we do make sure there are
at least one free bit in the window, so that
ext2_try_to_allocate_with_rsv() could claim that bit. I haven't found
any thing wrong in that part yet;

For (2)the search for the next reservation window should start from the
end block of the old window, and should stop and fail if it reachs the
last block of the block group.

Will continue looking at the code, but so far everything looks just fine
to me.:(


On Tue, 2006-11-21 at 05:39 +0000, Hugh Dickins wrote:
> On Mon, 20 Nov 2006, Mingming Cao wrote:
> >
> > So there is only one writer at the moment the hang was happening?
>
> I expect there were multiple writers when the task which hangs
> first entered its ext2_prepare_write (it's a make -j20 build on
> that ext2 filesystem); but by the time I come to look at the hang,
> there's only that one writer active - everything else would be waiting
> on that part of the build to complete
> (well, your question makes me
> realize that I didn't look down the whole "ps" listing to see what
> was waiting; but the hanging task is the only one I see running on
> on any cpu, each time I break in).
>
A bit confused, did the whole system hang or just the "ld" writer hang
in ext2 block allocation? And what other writers were waiting for? Were
they trying to write to the same file?

> >
> > hmm, is the filesystem relatively all being used or reserved, i.e, the
> > free bits are all being reserved? There is one extreme case that may
> > cause starvation. If filesystem free blocks are all being reserved, when
> > a new writer need a free block, it has to go through the entire
> > filesystems, try to reserve a space, which will repeatly calling
> > rsv_window_add and rsv_window_remove, since. Finally give up and fall
> > back to allocation without reservation. But this is all theory, not sure
> > fits your case here.
>
> I can understand that there may be a worst case like that: but I hope
> it wouldn't take 20 hours to find a free block on a default 340MB ext2
> filesystem! And unless something else has gone wrong, this build would
> not be filling the filesystem to that extent: it's probably around 80%
> full at this stage, and shouldn't get fuller than 98% in the end.
>
> Any suggestions for what I might check, next time it happens?
>

It might be helpful to check the return value of ext2_try_to_allocate(),
to see if it fails. It would be nice if you could check if there any
free bit inside the reservation window.

And could you check the start and end block for every rsv_window_add and
rsv_window_remove, to see if it was keep creating and removing the same
window in the same block group?

And it might be useful to dump the whole filesystem block reservation
tree as well.

Not sure if it worth the effort to test it on ext3. The ext2 block
reservation code in 2.6.19-rc5-mm2 looks pretty much the same as ext3/4,
except the missing truncate_mutex. I understand this might take a few
days to reproduce, so this might not needed now.


Mingming
> Hugh
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2006-11-25 14:59:38

by Russell King

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Thu, Nov 16, 2006 at 12:34:48PM +0000, Russell King wrote:
> On Wed, Nov 15, 2006 at 11:22:28PM -0800, Andrew Morton wrote:
> > On Wed, 15 Nov 2006 22:55:43 -0800
> > Mingming Cao <[email protected]> wrote:
> >
> > > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > > number of the range to search, not the lengh of the range. maxblocks
> > > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > > _size_ of the range to search instead...
> > >
> > > Something like this: (this is not a patch)
> > > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > ext2_grpblk_t next;
> > >
> > > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > > if (next >= maxblocks)
> > > return -1;
> > > return next;
> > > }
> >
> > yes, the `size' arg to find_next_zero_bit() represents the number of bits
> > to scan at `offset'.
>
> Are you sure? That's not the way it's implemented in many architectures.
> find_next_*_bit() has always taken "address, maximum offset, starting offset"
> and always has returned "next offset".
>
> Just look at arch/i386/lib/bitops.c:
>
> int find_next_zero_bit(const unsigned long *addr, int size, int offset)
> {
> unsigned long * p = ((unsigned long *) addr) + (offset >> 5);
> int set = 0, bit = offset & 31, res;
> ...
> /*
> * No zero yet, search remaining full bytes for a zero
> */
> res = find_first_zero_bit (p, size - 32 * (p - (unsigned long *) addr));
> return (offset + set + res);
> }
>
> So for the case that "offset" is aligned to a "long" boundary, that gives us:
>
> res = find_first_zero_bit(addr + (offset>>5),
> size - 32 * (addr + (offset>>5) - addr));
>
> or:
>
> res = find_first_zero_bit(addr + (offset>>5), size - (offset & ~31));
>
> So, size _excludes_ offset.
>
> Now, considering the return value, "res" above will be relative to
> "addr + (offset>>5)". However, we add "offset" on to that, so it's
> relative to addr + (offset bits).

Andrew,

Please respond to the above. If what you say is correct then all
architectures need their bitops fixing to fit ext2's requirements.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-11-28 17:39:45

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze

After several days of testing ext2 with reservations, it got caught inside
ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding
on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to
find the free block guaranteed to be included (unless there's contention).

Fix the range to find_next_usable_block's memscan: the scan from "here"
(0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes
not 2 (the relevant bytes of bitmap in this case being f7 df ff - none
00, but the premature cutoff implying that the last was found 00).

Is this a problem for mainline ext2? No, because the "size" in its memscan
is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a
multiple of 8. Is this a problem for ext3 or ext4? No, because they have
an additional extN_test_allocatable test which rescues them from the error.

Signed-off-by: Hugh Dickins <[email protected]>
---
But the bigger question is, why does the my_rsv case come here to
find_next_usable_block at all? Doesn't its 64-bit boundary limit, and its
memscan, blithely ignore what the reservation prepared? It's messy too,
the complement of the memscan being that "i < 7" loop over in
ext2_try_to_allocate. I think this ought to be cleaned up,
in ext2+reservations and ext3 and ext4.

fs/ext2/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -570,7 +570,7 @@ find_next_usable_block(int start, struct
here = 0;

p = ((char *)bh->b_data) + (here >> 3);
- r = memscan(p, 0, (maxblocks - here + 7) >> 3);
+ r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3));
next = (r - ((char *)bh->b_data)) << 3;

if (next < maxblocks && next >= here)

2006-11-28 17:41:19

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 3/6] ext2 balloc: fix off-by-one against rsv_end

rsv_end is the last block within the reservation,
so alloc_new_reservation should accept start_block == rsv_end as success.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/ext2/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -950,7 +950,7 @@ retry:
* check if the first free block is within the
* free space we just reserved
*/
- if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end)
+ if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end)
return 0; /* success */
/*
* if the first free bit we found is out of the reservable space

2006-11-28 17:40:29

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 2/6] ext2 balloc: reset windowsz when full

ext2_new_blocks should reset the reservation window size to 0 when squeezing
the last blocks out of an almost full filesystem, so the retry doesn't skip
any groups with less than half that free, reporting ENOSPC too soon.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/ext2/balloc.c | 1 +
1 file changed, 1 insertion(+)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -1292,6 +1292,7 @@ retry_alloc:
*/
if (my_rsv) {
my_rsv = NULL;
+ windowsz = 0;
group_no = goal_group;
goto retry_alloc;
}

2006-11-28 17:42:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 4/6] ext2 balloc: fix off-by-one against grp_goal

grp_goal 0 is a genuine goal (unlike -1),
so ext2_try_to_allocate_with_rsv should treat it as such.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/ext2/balloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -1053,7 +1053,7 @@ ext2_try_to_allocate_with_rsv(struct sup
}
/*
* grp_goal is a group relative block number (if there is a goal)
- * 0 < grp_goal < EXT2_BLOCKS_PER_GROUP(sb)
+ * 0 <= grp_goal < EXT2_BLOCKS_PER_GROUP(sb)
* first block is a filesystem wide block number
* first block is the block number of the first block in this group
*/
@@ -1089,7 +1089,7 @@ ext2_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0) {
+ } else if (grp_goal >= 0) {
int curr = my_rsv->rsv_end -
(grp_goal + group_first_block) + 1;


2006-11-28 17:43:11

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 5/6] ext2 balloc: say rb_entry not list_entry

The reservations tree is an rb_tree not a list, so it's less confusing to
use rb_entry() than list_entry() - though they're both just container_of().

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/ext2/balloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -156,7 +156,7 @@ restart:

printk("Block Allocation Reservation Windows Map (%s):\n", fn);
while (n) {
- rsv = list_entry(n, struct ext2_reserve_window_node, rsv_node);
+ rsv = rb_entry(n, struct ext2_reserve_window_node, rsv_node);
if (verbose)
printk("reservation window 0x%p "
"start: %lu, end: %lu\n",
@@ -753,7 +753,7 @@ static int find_next_reservable_window(

prev = rsv;
next = rb_next(&rsv->rsv_node);
- rsv = list_entry(next,struct ext2_reserve_window_node,rsv_node);
+ rsv = rb_entry(next,struct ext2_reserve_window_node,rsv_node);

/*
* Reached the last reservation, we can just append to the
@@ -995,7 +995,7 @@ static void try_to_extend_reservation(st
if (!next)
my_rsv->rsv_end += size;
else {
- next_rsv = list_entry(next, struct ext2_reserve_window_node, rsv_node);
+ next_rsv = rb_entry(next, struct ext2_reserve_window_node, rsv_node);

if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size)
my_rsv->rsv_end += size;

2006-11-28 17:43:53

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 6/6] ext2 balloc: use io_error label

ext2_new_blocks has a nice io_error label for setting -EIO,
so goto that in the one place that doesn't already use it.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/ext2/balloc.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

--- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
+++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
@@ -1258,10 +1258,9 @@ retry_alloc:
if (group_no >= ngroups)
group_no = 0;
gdp = ext2_get_group_desc(sb, group_no, &gdp_bh);
- if (!gdp) {
- *errp = -EIO;
- goto out;
- }
+ if (!gdp)
+ goto io_error;
+
free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
/*
* skip this group if the number of

2006-11-28 19:26:16

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze

On Tue, 2006-11-28 at 17:40 +0000, Hugh Dickins wrote:
> After several days of testing ext2 with reservations, it got caught inside
> ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding
> on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to
> find the free block guaranteed to be included (unless there's contention).
>

Hmm, I suspect there is other issue: alloc_new_reservation should not
repeatedly allocating the same window, if ext2_try_to_allocate
repeatedly fails to find a free block in that window.
find_next_reservable_window() takes my_rsv (the old window that he
thinks there is no free block) as a guide to find a window "after" the
end block of my_rsv, so how could this happen?


> Fix the range to find_next_usable_block's memscan: the scan from "here"
> (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes
> not 2 (the relevant bytes of bitmap in this case being f7 df ff - none
> 00, but the premature cutoff implying that the last was found 00).
>

alloc_new_reservation() reserved a window with free block, when come to
the time to claim it, it scans the window again. So it seems that the
range of the the scan is too small:

p = ((char *)bh->b_data) + (here >> 3);
r = memscan(p, 0, (maxblocks - here + 7) >> 3);
next = (r - ((char *)bh->b_data)) << 3;

---------------------> next is -1
if (next < maxblocks && next >= here)
return next;

----------------------> falls to false branch

here = bitmap_search_next_usable_block(here, bh, maxblocks);
return here;

So we failed to find a free byte in the range. That's seems fine to me.
It's only a nice thing to have -- try to allocate a block in a place
where it's neighbors are all free also. If it fails, it will search the
window bit by bit. So I don't understand why it is not being recovered
by bitmap_search_next_usable_block(), which test the bitmap bit by bit?

> Is this a problem for mainline ext2? No, because the "size" in its memscan
> is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a
> multiple of 8. Is this a problem for ext3 or ext4? No, because they have
> an additional extN_test_allocatable test which rescues them from the error.
>
Hmm, if the error is it prematurely think there is no free block in the
range (bitmap on disk), then even in ext3/4, it will not bother checking
the jbd copy of the bitmap. I am not sure this is the cause that ext3/4
may not has the problem.

> But the bigger question is, why does the my_rsv case come here to
> find_next_usable_block at all?

Because grp_goal is -1?

> Doesn't its 64-bit boundary limit, and its
> memscan, blithely ignore what the reservation prepared?

I agree with you that the double check is urgly. But it's necessary:( If
there to prevent contention: other file make steal that free block we
reserved for this file, in the case filesystem is full of reservation...

> It's messy too,
> the complement of the memscan being that "i < 7" loop over in
> ext2_try_to_allocate. I think this ought to be cleaned up,
> in ext2+reservations and ext3 and ext4.
>
The "i<7" loop there is for non reservation case. Since
find_next_usable_block() could find a free byte, it's trying to avoid
filesystem holes by shifting the start of the free block for at most 7
times.


Thanks!

Mingming
> fs/ext2/balloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
> +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
> @@ -570,7 +570,7 @@ find_next_usable_block(int start, struct
> here = 0;
>
> p = ((char *)bh->b_data) + (here >> 3);
> - r = memscan(p, 0, (maxblocks - here + 7) >> 3);
> + r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3));
> next = (r - ((char *)bh->b_data)) << 3;
>
> if (next < maxblocks && next >= here)
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2006-11-28 19:36:18

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 2/6] ext2 balloc: reset windowsz when full

On Tue, 2006-11-28 at 17:40 +0000, Hugh Dickins wrote:
> ext2_new_blocks should reset the reservation window size to 0 when squeezing
> the last blocks out of an almost full filesystem, so the retry doesn't skip
> any groups with less than half that free, reporting ENOSPC too soon.
>

I realize this is a bug when I was looking at the code as well. I was
thinking we should not check for whether the window size is less than
the free blocks counter at all, when the allocation is fall back to non
reservation allocation. But your fix works as well.


Mingming

2006-11-28 19:42:20

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 3/6] ext2 balloc: fix off-by-one against rsv_end

On Tue, 2006-11-28 at 17:41 +0000, Hugh Dickins wrote:
> rsv_end is the last block within the reservation,
> so alloc_new_reservation should accept start_block == rsv_end as success.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>

Thanks, Acked.

This is not a problem for now, as the default window size is 8 blocks,
and we never shrink the window size. But it could be a issue in the
future, if the reservation window could be dynamically shrink, when it
keep failing to create a new(large) reservation window, or we keep throw
away a just-allocated-window as the application is doing very seeky
random write.


> fs/ext2/balloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
> +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
> @@ -950,7 +950,7 @@ retry:
> * check if the first free block is within the
> * free space we just reserved
> */
> - if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end)
> + if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end)
> return 0; /* success */
> /*
> * if the first free bit we found is out of the reservable space

2006-11-28 20:08:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze

On Tue, 28 Nov 2006, Mingming Cao wrote:
> On Tue, 2006-11-28 at 17:40 +0000, Hugh Dickins wrote:
> > After several days of testing ext2 with reservations, it got caught inside
> > ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding
> > on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to
> > find the free block guaranteed to be included (unless there's contention).
> >
>
> Hmm, I suspect there is other issue: alloc_new_reservation should not
> repeatedly allocating the same window, if ext2_try_to_allocate
> repeatedly fails to find a free block in that window.
> find_next_reservable_window() takes my_rsv (the old window that he
> thinks there is no free block) as a guide to find a window "after" the
> end block of my_rsv, so how could this happen?

Hmmm. I haven't studied that part of the code, but what you say sounds
sensible: that would leave more to be explained, yes. I guess it would
happen if all the rest of the bitmap were either allocated or reserved,
but I don't believe that was the case here: I have noted that the map
was all 00s from offset 0x1ae onwards, plenty unallocated; I've not
recorded the following reservations, but it seems unlikely they covered
the remaining free area (and still covered it even when the remaining
tasks got to the point of just waiting for this one).

>
> > Fix the range to find_next_usable_block's memscan: the scan from "here"
> > (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes
> > not 2 (the relevant bytes of bitmap in this case being f7 df ff - none
> > 00, but the premature cutoff implying that the last was found 00).
> >
>
> alloc_new_reservation() reserved a window with free block, when come to
> the time to claim it, it scans the window again. So it seems that the
> range of the the scan is too small:

The range of the scan is 1 byte too small in this case, yes.

>
> p = ((char *)bh->b_data) + (here >> 3);
> r = memscan(p, 0, (maxblocks - here + 7) >> 3);
> next = (r - ((char *)bh->b_data)) << 3;
>
> ---------------------> next is -1

I don't understand you: next was not -1, it was 0xd08.

> if (next < maxblocks && next >= here)
> return next;
>
> ----------------------> falls to false branch

No, it passed the "next < maxblocks && next >= here" test
(maxblocks being 0xd0e and here being 0xcfe), so returned
pointing to an allocated block - then the caller finds it
cannot set the bit.

>
> here = bitmap_search_next_usable_block(here, bh, maxblocks);
> return here;
>
> So we failed to find a free byte in the range. That's seems fine to me.
> It's only a nice thing to have -- try to allocate a block in a place
> where it's neighbors are all free also. If it fails, it will search the
> window bit by bit. So I don't understand why it is not being recovered
> by bitmap_search_next_usable_block(), which test the bitmap bit by bit?

It already returned, it doesn't reach that line.

>
> > Is this a problem for mainline ext2? No, because the "size" in its memscan
> > is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a
> > multiple of 8. Is this a problem for ext3 or ext4? No, because they have
> > an additional extN_test_allocatable test which rescues them from the error.
> >
> Hmm, if the error is it prematurely think there is no free block in the
> range (bitmap on disk), then even in ext3/4, it will not bother checking
> the jbd copy of the bitmap. I am not sure this is the cause that ext3/4
> may not has the problem.

In the ext3/4 case, it indeed won't bother to check the jbd copy
(having found this bitmap bit set), it'll fall through to the
bitmap_search_next_usable_block you indicated above,
and that should do the right thing, finding the first
free bit in the area originally reserved.

>
> > But the bigger question is, why does the my_rsv case come here to
> > find_next_usable_block at all?
>
> Because grp_goal is -1?

Well, yes, but my point is that we've got a reservation, and we're
hoping to allocate from it (even though we've given up on the "goal"),
but find_next_usable_block is not respecting it at all - liable to be
allocating out of others' reservations instead.

>
> > Doesn't its 64-bit boundary limit, and its
> > memscan, blithely ignore what the reservation prepared?
>
> I agree with you that the double check is urgly. But it's necessary:( If
> there to prevent contention: other file make steal that free block we
> reserved for this file, in the case filesystem is full of reservation...

I agree it's necessary to recheck the allocation; I disagree that the
64-bit boundary limit and memscan are appropriate when my_rsv is set.

>
> > It's messy too,
> > the complement of the memscan being that "i < 7" loop over in
> > ext2_try_to_allocate. I think this ought to be cleaned up,
> > in ext2+reservations and ext3 and ext4.
> >
> The "i<7" loop there is for non reservation case. Since
> find_next_usable_block() could find a free byte, it's trying to avoid
> filesystem holes by shifting the start of the free block for at most 7
> times.

Yes, the "i<7" loop rightly has a !my_rsv check; my point it that it's
the "other end" of the memscan, which was operating 8-bits at a time,
which lacks any my_rsv check (doesn't even know my_rsv at that level).

Hugh

2006-11-28 20:27:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Tue, 21 Nov 2006, Mingming Cao wrote:
> On Mon, 2006-11-20 at 16:19 +0000, Hugh Dickins wrote:
> > After four days of running, the EM64T has at last reproduced the same
> > hang as it did in an hour before: stuck in
> > ext2_try_to_allocate_with_rsv,
> > repeatedly ext2_rsv_window_add, ext2_try_to_allocate,
> > rsv_window_remove
> > (perhaps not in that order).
> >

At last it did happen again, on both x86_64 and ppc64.

>
> Don't have much clue, still...:(

Don't worry, KDB helped me work it out, patch follows in a moment.

>
> The logic of the while loop in ext2_try_to_allocate_with_rsv() looks
> like:

Thanks for your clarifications and tips.

> >
> A bit confused, did the whole system hang or just the "ld" writer hang
> in ext2 block allocation? And what other writers were waiting for? Were
> they trying to write to the same file?

The system was pingable, but couldn't do much else. Only the one
"ld" was active on the ext2 filesystem by this time, other tasks of
the make just waiting on it, nothing else was trying to write there.

4 cpus, well, 2*HT: why couldn't I ssh or login? I don't know,
something else to investigate, but that can be quite separate: very
unlikely to be related to the particular ext2 bug which showed it
(that filesystem was just for the test, it's not my root).
Probably stupidly obvious once I've worked it out.

>
> It might be helpful to check the return value of ext2_try_to_allocate(),
> to see if it fails. It would be nice if you could check if there any
> free bit inside the reservation window.

After screwing up last time, I was ultra-paranoid this time, and did
not dare to set any breakpoints: proceeded by repeatedly breaking in
from the keyboard, and the time I happened to catch it on return from
memscan() was revealing.

>
> And could you check the start and end block for every rsv_window_add and
> rsv_window_remove, to see if it was keep creating and removing the same
> window in the same block group?

The same every time it settled on a usable reservation, but a lot of
wasted adds and removes as it works across a fully allocated area of
the bitmap. Not very efficient, all those rb_tree insertions and
removals until it gets to a free area; but I can't judge from this
example how common that would be, may not be worth bothering about.

>
> And it might be useful to dump the whole filesystem block reservation
> tree as well.

Not necessary, I've put just the relevant numbers in the patch comment,
it helped me a lot to see the actual numbers and work it out from there.

>
> Not sure if it worth the effort to test it on ext3. The ext2 block
> reservation code in 2.6.19-rc5-mm2 looks pretty much the same as ext3/4,
> except the missing truncate_mutex. I understand this might take a few
> days to reproduce, so this might not needed now.

Turns out vanilla-ext2 and ext3 and ext4 are safe:
ext3 and ext4 slightly wrong in the same way, but safe.

I'll continue this thread with the bugfix patch 1/6; and five more
(roughly descending order of importance, the latter just cosmetic)
little fs/ext2/balloc.c patches to things I noticed on the way.
Nothing very worrying. Easier to send patches than ask questions:
please check, perhaps my "off-by-one" accusations are wrong, for
example. If you're satisfied they're right, please apply the
same to ext3 and ext4.

Although 1/6 does (I believe: too early for my tests to tell)
fix the bug in a minimal way, I rather think that area needs
cleanup - further comments below my Signed-off-by in 1/6.

Hugh

2006-11-28 21:04:59

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Tue, 2006-11-28 at 17:38 +0000, Hugh Dickins wrote:

> >
> > And could you check the start and end block for every rsv_window_add and
> > rsv_window_remove, to see if it was keep creating and removing the same
> > window in the same block group?
>
> The same every time it settled on a usable reservation, but a lot of
> wasted adds and removes as it works across a fully allocated area of
> the bitmap. Not very efficient, all those rb_tree insertions and
> removals until it gets to a free area; but I can't judge from this
> example how common that would be, may not be worth bothering about.
>

Yeah, it's not so efficient to add a window before knowing it has a free
block, as rb tree insertion is quit expensive. Actually it used to be
this way: only insert a window to the rb tree when there is a free bit
inside of it. However we need to hold the per-fs reservation lock while
scanning the bitmap:( This badly hurt the performance in some case and
the real time folks complained about it.

So we changed the code to the current way. I think it was not that
inefficient in the case it works across a fully allocated/reserved area,
since bitmap_search_next_usable_block() will skip the fully allocated
area fairly quickly, as it searchs from the beginning of the window till
the last block of the block group (not just the end of the window)


> >
> > And it might be useful to dump the whole filesystem block reservation
> > tree as well.
>
> Not necessary, I've put just the relevant numbers in the patch comment,
> it helped me a lot to see the actual numbers and work it out from there.
>
> >
> > Not sure if it worth the effort to test it on ext3. The ext2 block
> > reservation code in 2.6.19-rc5-mm2 looks pretty much the same as ext3/4,
> > except the missing truncate_mutex. I understand this might take a few
> > days to reproduce, so this might not needed now.
>
> Turns out vanilla-ext2 and ext3 and ext4 are safe:
> ext3 and ext4 slightly wrong in the same way, but safe.
>

Good to know.

> I'll continue this thread with the bugfix patch 1/6; and five more
> (roughly descending order of importance, the latter just cosmetic)
> little fs/ext2/balloc.c patches to things I noticed on the way.
> Nothing very worrying. Easier to send patches than ask questions:
> please check, perhaps my "off-by-one" accusations are wrong, for
> example. If you're satisfied they're right, please apply the
> same to ext3 and ext4.
>

Thanks, I have acked most of them, and will port them to ext3/4 soon.

2006-11-28 22:34:14

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Tue, 28 Nov 2006 13:04:53 -0800
Mingming Cao <[email protected]> wrote:

> Thanks, I have acked most of them, and will port them to ext3/4 soon.

You've acked #2 and #3. #4, #5 and #6 remain un-commented-upon and #1 is
unclear?

2006-11-28 23:30:36

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 4/6] ext2 balloc: fix off-by-one against grp_goal

On Tue, 2006-11-28 at 17:42 +0000, Hugh Dickins wrote:
> grp_goal 0 is a genuine goal (unlike -1),
> so ext2_try_to_allocate_with_rsv should treat it as such.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked.

Thanks,

Mingming
> ---
>
> fs/ext2/balloc.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
> +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
> @@ -1053,7 +1053,7 @@ ext2_try_to_allocate_with_rsv(struct sup
> }
> /*
> * grp_goal is a group relative block number (if there is a goal)
> - * 0 < grp_goal < EXT2_BLOCKS_PER_GROUP(sb)
> + * 0 <= grp_goal < EXT2_BLOCKS_PER_GROUP(sb)
> * first block is a filesystem wide block number
> * first block is the block number of the first block in this group
> */
> @@ -1089,7 +1089,7 @@ ext2_try_to_allocate_with_rsv(struct sup
> if (!goal_in_my_reservation(&my_rsv->rsv_window,
> grp_goal, group, sb))
> grp_goal = -1;
> - } else if (grp_goal > 0) {
> + } else if (grp_goal >= 0) {
> int curr = my_rsv->rsv_end -
> (grp_goal + group_first_block) + 1;
>

2006-11-28 23:30:55

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 5/6] ext2 balloc: say rb_entry not list_entry

On Tue, 2006-11-28 at 17:43 +0000, Hugh Dickins wrote:
> The reservations tree is an rb_tree not a list, so it's less confusing to
> use rb_entry() than list_entry() - though they're both just container_of().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>

Acked.

Thanks,
Mingming
> fs/ext2/balloc.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
> +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
> @@ -156,7 +156,7 @@ restart:
>
> printk("Block Allocation Reservation Windows Map (%s):\n", fn);
> while (n) {
> - rsv = list_entry(n, struct ext2_reserve_window_node, rsv_node);
> + rsv = rb_entry(n, struct ext2_reserve_window_node, rsv_node);
> if (verbose)
> printk("reservation window 0x%p "
> "start: %lu, end: %lu\n",
> @@ -753,7 +753,7 @@ static int find_next_reservable_window(
>
> prev = rsv;
> next = rb_next(&rsv->rsv_node);
> - rsv = list_entry(next,struct ext2_reserve_window_node,rsv_node);
> + rsv = rb_entry(next,struct ext2_reserve_window_node,rsv_node);
>
> /*
> * Reached the last reservation, we can just append to the
> @@ -995,7 +995,7 @@ static void try_to_extend_reservation(st
> if (!next)
> my_rsv->rsv_end += size;
> else {
> - next_rsv = list_entry(next, struct ext2_reserve_window_node, rsv_node);
> + next_rsv = rb_entry(next, struct ext2_reserve_window_node, rsv_node);
>
> if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size)
> my_rsv->rsv_end += size;

2006-11-28 23:31:14

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 6/6] ext2 balloc: use io_error label

On Tue, 2006-11-28 at 17:44 +0000, Hugh Dickins wrote:
> ext2_new_blocks has a nice io_error label for setting -EIO,
> so goto that in the one place that doesn't already use it.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>

Acked,

Mingming
> fs/ext2/balloc.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> --- 2.6.19-rc6-mm2/fs/ext2/balloc.c 2006-11-24 08:18:02.000000000 +0000
> +++ linux/fs/ext2/balloc.c 2006-11-27 19:28:41.000000000 +0000
> @@ -1258,10 +1258,9 @@ retry_alloc:
> if (group_no >= ngroups)
> group_no = 0;
> gdp = ext2_get_group_desc(sb, group_no, &gdp_bh);
> - if (!gdp) {
> - *errp = -EIO;
> - goto out;
> - }
> + if (!gdp)
> + goto io_error;
> +
> free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
> /*
> * skip this group if the number of
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2006-11-28 23:38:31

by Mingming Cao

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Tue, 2006-11-28 at 14:33 -0800, Andrew Morton wrote:
> On Tue, 28 Nov 2006 13:04:53 -0800
> Mingming Cao <[email protected]> wrote:
>
> > Thanks, I have acked most of them, and will port them to ext3/4 soon.
>
> You've acked #2 and #3. #4, #5 and #6 remain un-commented-upon and #1 is
> unclear?

sorry, just acked #4, #5 and #6. I mean to do that before I said so but
they did not go out of my outbox till now ( I was out for a few hours).

#1 looks correct to me now, just thought there are other issues. I will
comment on that one in that thread.


Mingming

2006-11-29 00:42:09

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext2 balloc: fix _with_rsv freeze

On Tue, 2006-11-28 at 20:07 +0000, Hugh Dickins wrote:
> On Tue, 28 Nov 2006, Mingming Cao wrote:
> > On Tue, 2006-11-28 at 17:40 +0000, Hugh Dickins wrote:
> > > After several days of testing ext2 with reservations, it got caught inside
> > > ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding
> > > on the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to
> > > find the free block guaranteed to be included (unless there's contention).
> > >
> >
> > Hmm, I suspect there is other issue: alloc_new_reservation should not
> > repeatedly allocating the same window, if ext2_try_to_allocate
> > repeatedly fails to find a free block in that window.
> > find_next_reservable_window() takes my_rsv (the old window that he
> > thinks there is no free block) as a guide to find a window "after" the
> > end block of my_rsv, so how could this happen?
>
> Hmmm. I haven't studied that part of the code, but what you say sounds
> sensible: that would leave more to be explained, yes. I guess it would
> happen if all the rest of the bitmap were either allocated or reserved,

But bitmap_search_next_usable_block() will fail in the case the rest of
bitmap were allocated, and find_next_reservable_space() will fail in the
case the rest of group were all reserved. alloc_new_reservation() should
not create a new window in this case.

> but I don't believe that was the case here: I have noted that the map
> was all 00s from offset 0x1ae onwards, plenty unallocated; I've not
> recorded the following reservations, but it seems unlikely they covered
> the remaining free area (and still covered it even when the remaining
> tasks got to the point of just waiting for this one).
>
> >
> > > Fix the range to find_next_usable_block's memscan: the scan from "here"
> > > (0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes
> > > not 2 (the relevant bytes of bitmap in this case being f7 df ff - none
> > > 00, but the premature cutoff implying that the last was found 00).
> > >
> >
> > alloc_new_reservation() reserved a window with free block, when come to
> > the time to claim it, it scans the window again. So it seems that the
> > range of the the scan is too small:
>
> The range of the scan is 1 byte too small in this case, yes.
>
> >
> > p = ((char *)bh->b_data) + (here >> 3);
> > r = memscan(p, 0, (maxblocks - here + 7) >> 3);
> > next = (r - ((char *)bh->b_data)) << 3;
> >
> > ---------------------> next is -1
>
> I don't understand you: next was not -1, it was 0xd08.
>
> > if (next < maxblocks && next >= here)
> > return next;
> >
> > ----------------------> falls to false branch
>
> No, it passed the "next < maxblocks && next >= here" test
> (maxblocks being 0xd0e and here being 0xcfe), so returned
> pointing to an allocated block - then the caller finds it
> cannot set the bit.
>

Apologies for the confusion. I thought ext2_try_to_allocate() failed
because we could not find a free block in the reserved window (i.e.,
find_next_usable_block() failed)

It seems in this case, find_next_usable_block() incorrectly returns a
bit it *thinks* free, but ext2_try_to_allocate() fails to claim it as
it's being marked as used.


So yes, Acked this fix. Thanks.

> >
> > here = bitmap_search_next_usable_block(here, bh, maxblocks);
> > return here;
> >
> > So we failed to find a free byte in the range. That's seems fine to me.
> > It's only a nice thing to have -- try to allocate a block in a place
> > where it's neighbors are all free also. If it fails, it will search the
> > window bit by bit. So I don't understand why it is not being recovered
> > by bitmap_search_next_usable_block(), which test the bitmap bit by bit?
>
> It already returned, it doesn't reach that line.
>
Yep.

> >
> > > Is this a problem for mainline ext2? No, because the "size" in its memscan
> > > is always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a
> > > multiple of 8. Is this a problem for ext3 or ext4? No, because they have
> > > an additional extN_test_allocatable test which rescues them from the error.
> > >
> > Hmm, if the error is it prematurely think there is no free block in the
> > range (bitmap on disk), then even in ext3/4, it will not bother checking
> > the jbd copy of the bitmap. I am not sure this is the cause that ext3/4
> > may not has the problem.
>
> In the ext3/4 case, it indeed won't bother to check the jbd copy
> (having found this bitmap bit set), it'll fall through to the
> bitmap_search_next_usable_block you indicated above,
> and that should do the right thing, finding the first
> free bit in the area originally reserved.
>
Make sense.
> >
> > > But the bigger question is, why does the my_rsv case come here to
> > > find_next_usable_block at all?
> >
> > Because grp_goal is -1?
>
> Well, yes, but my point is that we've got a reservation, and we're
> hoping to allocate from it (even though we've given up on the "goal"),
> but find_next_usable_block is not respecting it at all - liable to be
> allocating out of others' reservations instead.
>
Okay got your points. I think you are right.:) Well, right now the
window is just a (start,end) pair which pointing to a range that nobody
has reserved, that doesn't include the information about which block is
free even though we did scan the bitmap making sure there is a free
block. So when alloc_new_reservation() successfully create a new
window, so we still relies on find_next_usable() to check for which free
block to use.

It does seem redundant in the reservation case. We could adjust the new
created window start block to the first free block returned from
bitmap_search_next_usable_block(), so that the information of first free
block in the window got passed to ext2_try_to_allocate().

> >
> > > Doesn't its 64-bit boundary limit, and its
> > > memscan, blithely ignore what the reservation prepared?
> >
> > I agree with you that the double check is urgly. But it's necessary:( If
> > there to prevent contention: other file make steal that free block we
> > reserved for this file, in the case filesystem is full of reservation...
>
> I agree it's necessary to recheck the allocation; I disagree that the
> 64-bit boundary limit and memscan are appropriate when my_rsv is set.
>

Okay, I understand your points now:) Quite reasonable. I have the
similar doubts when I first touch this code when implement reservation
code. it's all about allocation policy. Looking at the
ext2_try_to_allocate(), it's the core of ext2/3 block allocation
strategy:

1) if there are some free block near by the goal block(64 bit boundary
limit), then take that block (locality)

2) otherwise, try to place a new block in a contiguous chunk of free
block(memscan for a free byte), so file could be less fragmented

3) otherwise, any free block in the given range of bitmap will work.

The same policy applies to both reservation and non reservation: goal
guided first, then searching for free chunk. the In the following
example, reservation window is (16,39), goal is block 32, blocks 17,
24-31, and 33 are free. Which block should it allocate?

|------------------------------------------------|
|1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 |
|------------------------------------------------|
16 24 32 40

In the current code with 64bit boundary limit, it will try to satisfy
locality so block #33 will be allocated. If we skip the 64bit boundary
limit and and memscan, it will go directly pick up block #17, seems less
optimized.

> >
> > > It's messy too,
> > > the complement of the memscan being that "i < 7" loop over in
> > > ext2_try_to_allocate. I think this ought to be cleaned up,
> > > in ext2+reservations and ext3 and ext4.
> > >
> > The "i<7" loop there is for non reservation case. Since
> > find_next_usable_block() could find a free byte, it's trying to avoid
> > filesystem holes by shifting the start of the free block for at most 7
> > times.
>
> Yes, the "i<7" loop rightly has a !my_rsv check; my point it that it's
> the "other end" of the memscan, which was operating 8-bits at a time,
> which lacks any my_rsv check (doesn't even know my_rsv at that level).
>

To me memscan still make sense for reservation case. In above example,
if block #33 is also allocated, since there is no free block nearby
(right side) the goal block, it probably better to allocate block #24,
so that the next allocation will be contiguous if application is doing
sequential allocation.


Mingming

2006-11-29 04:14:00

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 1/12] ext3 balloc: reset windowsz when full

Port a series ext2 balloc patches from Hugh to ext3/4. The first 6
patches are against ext3, and the rest are aginst ext4.


------------------------------------------------------
Subject: ext2 balloc: reset windowsz when full
From: Hugh Dickins <[email protected]>

ext2_new_blocks should reset the reservation window size to 0 when squeezing
the last blocks out of an almost full filesystem, so the retry doesn't skip
any groups with less than half that free, reporting ENOSPC too soon.

------------------------------------------------------
Sync up reservation fix from ext2

Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 1 +
1 file changed, 1 insertion(+)

diff -puN fs/ext3/balloc.c~ext3_reset_windowsz_in_full_fs fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3_reset_windowsz_in_full_fs 2006-11-28 19:36:41.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:41.000000000 -0800
@@ -1552,6 +1552,7 @@ retry_alloc:
*/
if (my_rsv) {
my_rsv = NULL;
+ windowsz = 0;
group_no = goal_group;
goto retry_alloc;
}

_

2006-11-29 04:14:26

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 3/12] ext3 balloc: fix off-by-one against rsv_end


------------------------------------------------------
Subject: ext2 balloc: fix off-by-one against rsv_end
From: Hugh Dickins <[email protected]>

rsv_end is the last block within the reservation, so alloc_new_reservation
should accept start_block == rsv_end as success.
------------------------------------------------------
Sync up a ext2 reservation fix in ext3

Signed-Off-By: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-rsv_end fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-rsv_end 2006-11-28 19:36:58.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:58.000000000 -0800
@@ -1148,7 +1148,7 @@ retry:
* check if the first free block is within the
* free space we just reserved
*/
- if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end)
+ if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end)
return 0; /* success */
/*
* if the first free bit we found is out of the reservable space

_

2006-11-29 04:14:28

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 4/12] ext3 balloc: say rb_entry not list_entry


------------------------------------------------------
Subject: ext2 balloc: say rb_entry not list_entry
From: Hugh Dickins <[email protected]>

The reservations tree is an rb_tree not a list, so it's less confusing to use
rb_entry() than list_entry() - though they're both just container_of().

----------------------------------------------------------

Sync up this fix in ext3

Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-balloc-say-rb_entry-not-list_entry fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-say-rb_entry-not-list_entry 2006-11-28 19:36:52.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:52.000000000 -0800
@@ -144,7 +144,7 @@ restart:

printk("Block Allocation Reservation Windows Map (%s):\n", fn);
while (n) {
- rsv = list_entry(n, struct ext3_reserve_window_node, rsv_node);
+ rsv = rb_entry(n, struct ext3_reserve_window_node, rsv_node);
if (verbose)
printk("reservation window 0x%p "
"start: %lu, end: %lu\n",
@@ -949,7 +949,7 @@ static int find_next_reservable_window(

prev = rsv;
next = rb_next(&rsv->rsv_node);
- rsv = list_entry(next,struct ext3_reserve_window_node,rsv_node);
+ rsv = rb_entry(next,struct ext3_reserve_window_node,rsv_node);

/*
* Reached the last reservation, we can just append to the
@@ -1193,7 +1193,7 @@ static void try_to_extend_reservation(st
if (!next)
my_rsv->rsv_end += size;
else {
- next_rsv = list_entry(next, struct ext3_reserve_window_node, rsv_node);
+ next_rsv = rb_entry(next, struct ext3_reserve_window_node, rsv_node);

if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size)
my_rsv->rsv_end += size;

_

2006-11-29 04:14:41

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 5/12] ext3 balloc: use io_error label


------------------------------------------------------
Subject: ext2 balloc: use io_error label
From: Hugh Dickins <[email protected]>

ext2_new_blocks has a nice io_error label for setting -EIO, so goto that in
the one place that doesn't already use it.

------------------------------------------------------
Fix it in ext3_new_blocks.

Signed-off-by: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-balloc-use-io_error-label fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-use-io_error-label 2006-11-28 19:45:51.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:45:51.000000000 -0800
@@ -1515,10 +1515,8 @@ retry_alloc:
if (group_no >= ngroups)
group_no = 0;
gdp = ext3_get_group_desc(sb, group_no, &gdp_bh);
- if (!gdp) {
- *errp = -EIO;
- goto out;
- }
+ if (!gdp)
+ goto io_error;
free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
/*
* skip this group if the number of

_

2006-11-29 04:14:07

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 2/12] ext3 balloc: fix off-by-one against grp_goal


------------------------------------------------------
Subject: ext2 balloc: fix off-by-one against grp_goal
From: Hugh Dickins <[email protected]>

grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv
should treat it as such.
------------------------------------------------------

Sync up with ext2 reservation fix in ext3

Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-grp_goal fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-off-by-one-against-grp_goal 2006-11-28 19:36:48.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:48.000000000 -0800
@@ -1271,7 +1271,7 @@ ext3_try_to_allocate_with_rsv(struct sup
}
/*
* grp_goal is a group relative block number (if there is a goal)
- * 0 < grp_goal < EXT3_BLOCKS_PER_GROUP(sb)
+ * 0 <= grp_goal < EXT3_BLOCKS_PER_GROUP(sb)
* first block is a filesystem wide block number
* first block is the block number of the first block in this group
*/
@@ -1307,7 +1307,7 @@ ext3_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0) {
+ } else if (grp_goal >= 0) {
int curr = my_rsv->rsv_end -
(grp_goal + group_first_block) + 1;


_

2006-11-29 04:14:49

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 6/12] ext2 balloc: fix _with_rsv freeze


Sync up a reservation fix from ext2 in ext3
------------------------------------------------------
Subject: ext2 balloc: fix _with_rsv freeze
From: Hugh Dickins <[email protected]>

After several days of testing ext2 with reservations, it got caught inside
ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding on
the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to find the
free block guaranteed to be included (unless there's contention).

Fix the range to find_next_usable_block's memscan: the scan from "here"
(0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes not 2
(the relevant bytes of bitmap in this case being f7 df ff - none 00, but the
premature cutoff implying that the last was found 00).

Is this a problem for mainline ext2? No, because the "size" in its memscan is
always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a multiple of
8. Is this a problem for ext3 or ext4? No, because they have an additional
extN_test_allocatable test which rescues them from the error.

--------------------------------------------------

Signed-off-by: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext3/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext3/balloc.c~ext3-balloc-fix-_with_rsv-freeze fs/ext3/balloc.c
--- linux-2.6.19-rc5/fs/ext3/balloc.c~ext3-balloc-fix-_with_rsv-freeze 2006-11-28 19:36:55.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext3/balloc.c 2006-11-28 19:36:55.000000000 -0800
@@ -730,7 +730,7 @@ find_next_usable_block(ext3_grpblk_t sta
here = 0;

p = ((char *)bh->b_data) + (here >> 3);
- r = memscan(p, 0, (maxblocks - here + 7) >> 3);
+ r = memscan(p, 0, ((maxblocks + 7) >> 3) - (here >> 3));
next = (r - ((char *)bh->b_data)) << 3;

if (next < maxblocks && next >= start && ext3_test_allocatable(next, bh))

_

2006-11-29 04:15:09

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 7/12] ext4 balloc: reset windowsz when full


------------------------------------------------------
Subject: ext2 balloc: reset windowsz when full
From: Hugh Dickins <[email protected]>

ext2_new_blocks should reset the reservation window size to 0 when squeezing
the last blocks out of an almost full filesystem, so the retry doesn't skip
any groups with less than half that free, reporting ENOSPC too soon.

------------------------------------------------------
Sync up reservation fix from ext2 in ext4

Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 1 +
1 file changed, 1 insertion(+)

diff -puN fs/ext4/balloc.c~ext4_reset_windowsz_in_full_fs fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4_reset_windowsz_in_full_fs 2006-11-28 19:37:01.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:01.000000000 -0800
@@ -1566,6 +1566,7 @@ retry_alloc:
*/
if (my_rsv) {
my_rsv = NULL;
+ windowsz = 0;
group_no = goal_group;
goto retry_alloc;
}

_

2006-11-29 04:15:32

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 11/12] ext4 balloc: use io_error label


------------------------------------------------------
Subject: ext2 balloc: use io_error label
From: Hugh Dickins <[email protected]>

ext2_new_blocks has a nice io_error label for setting -EIO, so goto that in
the one place that doesn't already use it.

------------------------------------------------------
Fix it in ext4_new_blocks.

Signed-off-by: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff -puN fs/ext4/balloc.c~ext4-balloc-use-io_error-label fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-use-io_error-label 2006-11-28 19:42:45.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:43:21.000000000 -0800
@@ -1529,10 +1529,8 @@ retry_alloc:
if (group_no >= ngroups)
group_no = 0;
gdp = ext4_get_group_desc(sb, group_no, &gdp_bh);
- if (!gdp) {
- *errp = -EIO;
- goto out;
- }
+ if (!gdp)
+ goto io_error;
free_blocks = le16_to_cpu(gdp->bg_free_blocks_count);
/*
* skip this group if the number of

_

2006-11-29 04:15:21

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 9/12] ext4 balloc: fix off-by-one against rsv_end


------------------------------------------------------
Subject: ext2 balloc: fix off-by-one against rsv_end
From: Hugh Dickins <[email protected]>

rsv_end is the last block within the reservation, so alloc_new_reservation
should accept start_block == rsv_end as success.
------------------------------------------------------
Sync up a ext2 reservation fix in ext4
Signed-Off-By: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-rsv_end fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-rsv_end 2006-11-28 19:37:15.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:15.000000000 -0800
@@ -1165,7 +1165,7 @@ retry:
* check if the first free block is within the
* free space we just reserved
*/
- if (start_block >= my_rsv->rsv_start && start_block < my_rsv->rsv_end)
+ if (start_block >= my_rsv->rsv_start && start_block <= my_rsv->rsv_end)
return 0; /* success */
/*
* if the first free bit we found is out of the reservable space

_

2006-11-29 04:15:06

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 8/12] ext4 balloc: fix off-by-one against grp_goal


Subject: ext2 balloc: fix off-by-one against grp_goal
From: Hugh Dickins <[email protected]>

grp_goal 0 is a genuine goal (unlike -1), so ext2_try_to_allocate_with_rsv
should treat it as such.
------------------------------------------------------
Sync up with ext2 reservation fix in ext4
Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-grp_goal fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-off-by-one-against-grp_goal 2006-11-28 19:37:05.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:05.000000000 -0800
@@ -1288,7 +1288,7 @@ ext4_try_to_allocate_with_rsv(struct sup
}
/*
* grp_goal is a group relative block number (if there is a goal)
- * 0 < grp_goal < EXT4_BLOCKS_PER_GROUP(sb)
+ * 0 <= grp_goal < EXT4_BLOCKS_PER_GROUP(sb)
* first block is a filesystem wide block number
* first block is the block number of the first block in this group
*/
@@ -1324,7 +1324,7 @@ ext4_try_to_allocate_with_rsv(struct sup
if (!goal_in_my_reservation(&my_rsv->rsv_window,
grp_goal, group, sb))
grp_goal = -1;
- } else if (grp_goal > 0) {
+ } else if (grp_goal >= 0) {
int curr = my_rsv->rsv_end -
(grp_goal + group_first_block) + 1;


_

2006-11-29 04:15:41

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 12/12] ext3 balloc: fix _with_rsv freeze


------------------------------------------------------
Subject: ext2 balloc: fix _with_rsv freeze
From: Hugh Dickins <[email protected]>

After several days of testing ext2 with reservations, it got caught inside
ext2_try_to_allocate_with_rsv: alloc_new_reservation repeatedly succeeding on
the window [12cff,12d0e], ext2_try_to_allocate repeatedly failing to find the
free block guaranteed to be included (unless there's contention).

Fix the range to find_next_usable_block's memscan: the scan from "here"
(0xcfe) up to (but excluding) "maxblocks" (0xd0e) needs to scan 3 bytes not 2
(the relevant bytes of bitmap in this case being f7 df ff - none 00, but the
premature cutoff implying that the last was found 00).

Is this a problem for mainline ext2? No, because the "size" in its memscan is
always EXT2_BLOCKS_PER_GROUP(sb), which mkfs.ext2 requires to be a multiple of
8. Is this a problem for ext3 or ext4? No, because they have an additional
extN_test_allocatable test which rescues them from the error.

--------------------------------------------------

Sync up a reservation fix from ext2 in ext4
Signed-off-by: Mingming Cao <[email protected]>


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext4/balloc.c~ext4-balloc-fix-_with_rsv-freeze fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-fix-_with_rsv-freeze 2006-11-28 19:37:12.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:12.000000000 -0800
@@ -747,7 +747,7 @@ find_next_usable_block(ext4_grpblk_t sta
here = 0;

p = ((char *)bh->b_data) + (here >> 3);
- r = memscan(p, 0, (maxblocks - here + 7) >> 3);
+ r = memscan(p, 0, ((maxblocks + 7) >> 3 - (here >> 3));
next = (r - ((char *)bh->b_data)) << 3;

if (next < maxblocks && next >= start && ext4_test_allocatable(next, bh))

_

2006-11-29 04:15:31

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 10/12] ext4 balloc: say rb_entry not list_entry


------------------------------------------------------
Subject: ext2 balloc: say rb_entry not list_entry
From: Hugh Dickins <[email protected]>

The reservations tree is an rb_tree not a list, so it's less confusing to use
rb_entry() than list_entry() - though they're both just container_of().

----------------------------------------------------------

Sync up this fix in ext4

Signed-off-by: Mingming Cao <[email protected]>
---


---

linux-2.6.19-rc5-cmm/fs/ext4/balloc.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff -puN fs/ext4/balloc.c~ext4-balloc-say-rb_entry-not-list_entry fs/ext4/balloc.c
--- linux-2.6.19-rc5/fs/ext4/balloc.c~ext4-balloc-say-rb_entry-not-list_entry 2006-11-28 19:37:08.000000000 -0800
+++ linux-2.6.19-rc5-cmm/fs/ext4/balloc.c 2006-11-28 19:37:08.000000000 -0800
@@ -165,7 +165,7 @@ restart:

printk("Block Allocation Reservation Windows Map (%s):\n", fn);
while (n) {
- rsv = list_entry(n, struct ext4_reserve_window_node, rsv_node);
+ rsv = rb_entry(n, struct ext4_reserve_window_node, rsv_node);
if (verbose)
printk("reservation window 0x%p "
"start: %llu, end: %llu\n",
@@ -966,7 +966,7 @@ static int find_next_reservable_window(

prev = rsv;
next = rb_next(&rsv->rsv_node);
- rsv = list_entry(next,struct ext4_reserve_window_node,rsv_node);
+ rsv = rb_entry(next,struct ext4_reserve_window_node,rsv_node);

/*
* Reached the last reservation, we can just append to the
@@ -1210,7 +1210,7 @@ static void try_to_extend_reservation(st
if (!next)
my_rsv->rsv_end += size;
else {
- next_rsv = list_entry(next, struct ext4_reserve_window_node, rsv_node);
+ next_rsv = rb_entry(next, struct ext4_reserve_window_node, rsv_node);

if ((next_rsv->rsv_start - my_rsv->rsv_end - 1) >= size)
my_rsv->rsv_end += size;

_

2006-11-29 05:51:42

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/12] ext3 balloc: reset windowsz when full

On Tue, 28 Nov 2006, Mingming Cao wrote:

> Port a series ext2 balloc patches from Hugh to ext3/4. The first 6
> patches are against ext3, and the rest are aginst ext4.

Thanks for all that, Mingming:
whichever is appropriate, all twelve
Acked-by: Hugh Dickins <[email protected]>
or
Signed-off-by: Hugh Dickins <[email protected]>

I'll think about your other mails, those that need further thought,
later on: I need to pin down more accurately the repetitious sequence of
reservations in the mistaken case - maybe it indicated further issues,
maybe not; and I need to consider our different views of the my_rsv
find_next_usable_block.

Hugh

2006-11-29 08:24:45

by Russell King

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

Yet another attempt to get a response from Andrew. It is rather
important that you DO respond to this.

On Sat, Nov 25, 2006 at 02:59:16PM +0000, Russell King wrote:
> On Thu, Nov 16, 2006 at 12:34:48PM +0000, Russell King wrote:
> > On Wed, Nov 15, 2006 at 11:22:28PM -0800, Andrew Morton wrote:
> > > On Wed, 15 Nov 2006 22:55:43 -0800
> > > Mingming Cao <[email protected]> wrote:
> > >
> > > > Hmm, maxblocks, in bitmap_search_next_usable_block(), is the end block
> > > > number of the range to search, not the lengh of the range. maxblocks
> > > > get passed to ext2_find_next_zero_bit(), where it expecting to take the
> > > > _size_ of the range to search instead...
> > > >
> > > > Something like this: (this is not a patch)
> > > > @@ -524,7 +524,7 @@ bitmap_search_next_usable_block(ext2_grp
> > > > ext2_grpblk_t next;
> > > >
> > > > - next = ext2_find_next_zero_bit(bh->b_data, maxblocks, start);
> > > > + next = ext2_find_next_zero_bit(bh->b_data, maxblocks-start + 1, start);
> > > > if (next >= maxblocks)
> > > > return -1;
> > > > return next;
> > > > }
> > >
> > > yes, the `size' arg to find_next_zero_bit() represents the number of bits
> > > to scan at `offset'.
> >
> > Are you sure? That's not the way it's implemented in many architectures.
> > find_next_*_bit() has always taken "address, maximum offset, starting offset"
> > and always has returned "next offset".
> >
> > Just look at arch/i386/lib/bitops.c:
> >
> > int find_next_zero_bit(const unsigned long *addr, int size, int offset)
> > {
> > unsigned long * p = ((unsigned long *) addr) + (offset >> 5);
> > int set = 0, bit = offset & 31, res;
> > ...
> > /*
> > * No zero yet, search remaining full bytes for a zero
> > */
> > res = find_first_zero_bit (p, size - 32 * (p - (unsigned long *) addr));
> > return (offset + set + res);
> > }
> >
> > So for the case that "offset" is aligned to a "long" boundary, that gives us:
> >
> > res = find_first_zero_bit(addr + (offset>>5),
> > size - 32 * (addr + (offset>>5) - addr));
> >
> > or:
> >
> > res = find_first_zero_bit(addr + (offset>>5), size - (offset & ~31));
> >
> > So, size _excludes_ offset.
> >
> > Now, considering the return value, "res" above will be relative to
> > "addr + (offset>>5)". However, we add "offset" on to that, so it's
> > relative to addr + (offset bits).
>
> Andrew,
>
> Please respond to the above. If what you say is correct then all
> architectures need their bitops fixing to fit ext2's requirements.
>
> --
> Russell King
> Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
> maintainer of:
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-11-29 08:31:06

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 29 Nov 2006 07:40:00 +0000
Russell King <[email protected]> wrote:

> Yet another attempt to get a response from Andrew. It is rather
> important that you DO respond to this.

You can read the code as easily as I can? I'm not really sure what
you're asking - I thought Mingming cleared things up.

2006-11-29 09:20:42

by Russell King

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, Nov 29, 2006 at 12:30:36AM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2006 07:40:00 +0000
> Russell King <[email protected]> wrote:
>
> > Yet another attempt to get a response from Andrew. It is rather
> > important that you DO respond to this.
>
> You can read the code as easily as I can?

Sigh. Please don't cut the relevant part of my _first_ email message
where it can be clearly seen that I _have_ read the code and interpreted
it _differently_ from you.

> I'm not really sure what you're asking - I thought Mingming cleared
> things up.

Which message did this happen?

What I'm looking for is confirmation of the semantics of
find_next_zero_bit(), which is a fairly simple question to ask, and
certainly does not justify this rather obtuse and difficult thread.

<extremely frustrated>

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2006-11-29 09:40:14

by Andrew Morton

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, 29 Nov 2006 09:20:24 +0000
Russell King <[email protected]> wrote:

> What I'm looking for is confirmation of the semantics of
> find_next_zero_bit()

What are the existing semantics? I see no documentation in any of the
architectures I've looked at. That's my point.

>From a quick read of fs/ext2/balloc.c

ext2_find_next_zero_bit(base, size, offset)

appears to expect that base is the start of the memory buffer, size is the
number of bits at *base and offset is the bit at which to start the search,
relative to base. If a zero bit is found it will return the offset of that
bit relative to base. It will return some number greater than `size' if no
zero-bit was found.

Whether that's how all the implementors interpreted it is anyone's guess.
Presumably the architectures all do roughly the same thing.

> <extremely frustrated>

Well likewise. It appears that nobody (and about 20 people have
implemented these things) could be bothered getting off ass and documenting
the pathetic thing.

2006-11-29 18:16:24

by Russell King

[permalink] [raw]
Subject: Re: Boot failure with ext2 and initrds

On Wed, Nov 29, 2006 at 01:39:22AM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2006 09:20:24 +0000
> Russell King <[email protected]> wrote:
>
> > What I'm looking for is confirmation of the semantics of
> > find_next_zero_bit()
>
> What are the existing semantics? I see no documentation in any of the
> architectures I've looked at. That's my point.
>
> >From a quick read of fs/ext2/balloc.c
>
> ext2_find_next_zero_bit(base, size, offset)
>
> appears to expect that base is the start of the memory buffer, size is the
> number of bits at *base and offset is the bit at which to start the search,
> relative to base. If a zero bit is found it will return the offset of that
> bit relative to base. It will return some number greater than `size' if no
> zero-bit was found.

Thank you for taking the time to agree with my analysis of x86 and
confirm that what ARM implements is also what is expected - that's
all that I was after. The reason I was after it was because you'd
said in the message I originally replied to:

| yes, the `size' arg to find_next_zero_bit() represents the number of
| bits to scan at `offset'.

which is entirely different from my understanding of what is required of
this function. Hence the confusion caused and the need to clear up
that confusion.

> Whether that's how all the implementors interpreted it is anyone's guess.
> Presumably the architectures all do roughly the same thing.

ARM does exactly the same as x86, since x86 was the only architecture
which existed in Linux when it was originally implemented.

> > <extremely frustrated>
>
> Well likewise. It appears that nobody (and about 20 people have
> implemented these things) could be bothered getting off ass and
> documenting the pathetic thing.

Back in those days it very much was "read the source, luke" and when
porting the kernel that meant the x86 code. Consider the lack of
documentation a case of just following the agreed convention at the
time.

We've since now moved on to a more mature attitude towards documentation,
and decided upon a format that it should take. So yes, it would be nice
if someone would document the entire set of kernel functions which
architectures are expected to provide. That'll probably be a full time
job for someone though, and probably needs someone to be paid to do it.
Or are the janitor folks up for it?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: