LinuxLists.cc - Potential bug in mballoc --- reusing data blocks before txn commit

2008-09-28 01:35:49

Subject: Potential bug in mballoc --- reusing data blocks before txn commit

So while I was preparing a patch to delete the old legacy block
allocator, I came across this bit of code in balloc.c:

/**
* ext4_test_allocatable()
* @nr: given allocation block group
* @bh: bufferhead contains the bitmap of the given bloc
*
* For ext4 allocations, we must not reuse any blocks which are
* allocated in the bitmap buffer's "last committed data" copy. This
* prevents deletes from freeing up the page for reuse until we have
* committed the delete transaction.
*
* If we didn't do this, then deleting something and reallocating it as
* data would allow the old block to be overwritten before the
* transaction committed (because we force data to disk before commit).
* This would lead to corruption if we crashed between overwriting the
* data and committing the delete.
*
* @@@ We may want to make this allocation behaviour conditional on
* data-writes at some point, and disable it for metadata allocations or
* sync-data inodes.
*/

This is done by searching an old copy of the bitmap found
jh->b_committed_data.

So I was more than a little bit concerned when I found that mballoc.c
wasn't referencing b_committed data *at* *all*. When I looked a bit
more closely, it looks like that mballoc is using a separate scheme,
based on linked list hanging off of sbi->s_active_transaction.
Unfortunately, it seems to only prevent released metadata blocks from
being reused:

if (metadata) {
/* blocks being freed are metadata. these blocks shouldn't
* be used until this transaction is committed */
ext4_mb_free_metadata(handle, &e4b, block_group, bit, count);
} else {

This means that if a file is deleted, but then system crashes before the
commit, some of its data blocks could be reallocated and then written
out. We could fix this by running all released blocks via
ext4_mb_free_metadata(), but since ext4_mb_free_metadata() stores blocks
that need to be protected via a block list, this could get *quite*
unwieldy, especially when deleting very large files (we could end up
with a very large linked list). Alternate solutions would involve using
a list of extents, or the original jh->b_commited_data mechanism.

I'll also note that a linked list of extents that should be freed would
also be useful for implementing the trim command for SSD's --- and that
this would be much more cleanly implemented via a callback from the jbd2
layer when a commit is finished, rather than the current
ext4_mb_poll_new_transaction() mechanism.

In any case, is there a reason why the mballoc.c is using its current
scheme, and not using kj->b_commited_data as in the original balloc.c
code? And was there a reason why you decided that it wasn't necessary
to protect freed data blocks from being reused until the transaction was
committed?

Regards,

- Ted

2008-09-29 20:21:49

by Alex Tomas

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Ts'o wrote:
> I'll also note that a linked list of extents that should be freed would
> also be useful for implementing the trim command for SSD's --- and that
> this would be much more cleanly implemented via a callback from the jbd2
> layer when a commit is finished, rather than the current
> ext4_mb_poll_new_transaction() mechanism.

yes, polling is a hack as we lost commit callback long ago.

> In any case, is there a reason why the mballoc.c is using its current
> scheme, and not using kj->b_commited_data as in the original balloc.c
> code? And was there a reason why you decided that it wasn't necessary
> to protect freed data blocks from being reused until the transaction was
> committed?

I think we don't really care about data consistency much. so I tried to save
some memory (given amount of metadata is smaller usually).

thanks, Alex

2008-09-29 20:57:15

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

On Tue, Sep 30, 2008 at 12:21:06AM +0400, Alex Tomas wrote:
> Theodore Ts'o wrote:
>> I'll also note that a linked list of extents that should be freed would
>> also be useful for implementing the trim command for SSD's --- and that
>> this would be much more cleanly implemented via a callback from the jbd2
>> layer when a commit is finished, rather than the current
>> ext4_mb_poll_new_transaction() mechanism.
>
> yes, polling is a hack as we lost commit callback long ago.

Yeah, I know Andrian Bunk strikes again.... but the right answer is
to ressurect that code and add it back.

>> In any case, is there a reason why the mballoc.c is using its current
>> scheme, and not using kj->b_commited_data as in the original balloc.c
>> code? And was there a reason why you decided that it wasn't necessary
>> to protect freed data blocks from being reused until the transaction was
>> committed?
>
> I think we don't really care about data consistency much. so I tried to save
> some memory (given amount of metadata is smaller usually).

Well, we need to keep this information for the SSD Trim command
anyway; so probably the right approach is to keep a red/black tree of
extents that need to be freed, and then when the commit callback is
called, we can update the appropriate mballoc data structures and call
the SSD trim command if necessary.

That restores the data consistency that we have with ext3, and it also
gives us the SSD trim functionality, which we need for both ext3 and
ext4. In fact, the information we need in both cases is 100% identical.

The other thing which I should check is that if we are using this
scheme, I think we shouldn't need to keep the shadow copy of the block
bitmap buffers any more. I would imagine we still need them for the
inode bitmaps, for the same reason, though.

- Ted

2008-09-29 21:04:28

by Ric Wheeler

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Tso wrote:
> On Tue, Sep 30, 2008 at 12:21:06AM +0400, Alex Tomas wrote:
>
>> Theodore Ts'o wrote:
>>
>>> I'll also note that a linked list of extents that should be freed would
>>> also be useful for implementing the trim command for SSD's --- and that
>>> this would be much more cleanly implemented via a callback from the jbd2
>>> layer when a commit is finished, rather than the current
>>> ext4_mb_poll_new_transaction() mechanism.
>>>
>> yes, polling is a hack as we lost commit callback long ago.
>>
>
> Yeah, I know Andrian Bunk strikes again.... but the right answer is
> to ressurect that code and add it back.
>
>
>>> In any case, is there a reason why the mballoc.c is using its current
>>> scheme, and not using kj->b_commited_data as in the original balloc.c
>>> code? And was there a reason why you decided that it wasn't necessary
>>> to protect freed data blocks from being reused until the transaction was
>>> committed?
>>>
>> I think we don't really care about data consistency much. so I tried to save
>> some memory (given amount of metadata is smaller usually).
>>
>
> Well, we need to keep this information for the SSD Trim command
> anyway; so probably the right approach is to keep a red/black tree of
> extents that need to be freed, and then when the commit callback is
> called, we can update the appropriate mballoc data structures and call
> the SSD trim command if necessary.
>
> That restores the data consistency that we have with ext3, and it also
> gives us the SSD trim functionality, which we need for both ext3 and
> ext4. In fact, the information we need in both cases is 100% identical.
>
> The other thing which I should check is that if we are using this
> scheme, I think we shouldn't need to keep the shadow copy of the block
> bitmap buffers any more. I would imagine we still need them for the
> inode bitmaps, for the same reason, though.
>
> - Ted
>
>

I don't disagree with any of the above, just want to point out that TRIM
has a SCSI T10 cousin that is very similar (used to implement thinly
provisioned luns). We should make sure as much as possible to make our
file system level support work for both...

ric

2008-09-29 23:00:33

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

On Mon, Sep 29, 2008 at 05:04:04PM -0400, Ric Wheeler wrote:
>
> I don't disagree with any of the above, just want to point out that TRIM
> has a SCSI T10 cousin that is very similar (used to implement thinly
> provisioned luns). We should make sure as much as possible to make our
> file system level support work for both...
>

That's shouldn't be a problem. David Woodhouse created a standard
block device interface that should be suitable both for ATA and SCSI:

int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
unsigned nr_sects, bio_end_io_t end_io);

- Ted

2008-09-29 23:06:13

by Ric Wheeler

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Tso wrote:
> On Mon, Sep 29, 2008 at 05:04:04PM -0400, Ric Wheeler wrote:
>
>> I don't disagree with any of the above, just want to point out that TRIM
>> has a SCSI T10 cousin that is very similar (used to implement thinly
>> provisioned luns). We should make sure as much as possible to make our
>> file system level support work for both...
>>
>>
>
> That's shouldn't be a problem. David Woodhouse created a standard
> block device interface that should be suitable both for ATA and SCSI:
>
> int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
> unsigned nr_sects, bio_end_io_t end_io);
>
> - Ted
>
>

I know, just wanted to point out to the broader audience that it is not
just trim ;-)

ric

2008-09-30 04:35:39

by Alex Tomas

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Tso wrote:
> Yeah, I know Andrian Bunk strikes again.... but the right answer is
> to ressurect that code and add it back.

indeed

> Well, we need to keep this information for the SSD Trim command
> anyway; so probably the right approach is to keep a red/black tree of
> extents that need to be freed, and then when the commit callback is
> called, we can update the appropriate mballoc data structures and call
> the SSD trim command if necessary.

why we need a tree? at least for the purpose of keeping blocks unavailable
we'd need just a list as at commit we free them all.

> The other thing which I should check is that if we are using this
> scheme, I think we shouldn't need to keep the shadow copy of the block
> bitmap buffers any more. I would imagine we still need them for the
> inode bitmaps, for the same reason, though.

shadow copy holds preallocated blocks

thanks, Alex

2008-09-30 13:02:50

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

On Tue, Sep 30, 2008 at 08:35:21AM +0400, Alex Tomas wrote:
> why we need a tree? at least for the purpose of keeping blocks unavailable
> we'd need just a list as at commit we free them all.

For ext4, the only reason to use a tree would be to allow us to merge
deleted extents. This might not be worth the complexity, though, I
admit it.

For ext3, we could use it to replace the the use of
bh->b_committed_data --- in which case, we would need to use a rbtree
so we can quickly look up to see which blocks can't be allocated yet.

>> The other thing which I should check is that if we are using this
>> scheme, I think we shouldn't need to keep the shadow copy of the block
>> bitmap buffers any more. I would imagine we still need them for the
>> inode bitmaps, for the same reason, though.
>
> shadow copy holds preallocated blocks

Are we talking about the same thing? I was referring to the
jh->b_committed_data, which isn't used by mballoc at all.

- Ted

2008-09-30 13:12:34

by Alex Tomas

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Tso wrote:
> On Tue, Sep 30, 2008 at 08:35:21AM +0400, Alex Tomas wrote:
>> why we need a tree? at least for the purpose of keeping blocks unavailable
>> we'd need just a list as at commit we free them all.
>
> For ext4, the only reason to use a tree would be to allow us to merge
> deleted extents. This might not be worth the complexity, though, I
> admit it.

strictly speaking, extents code should have merged them at allocation time.

>>> The other thing which I should check is that if we are using this
>>> scheme, I think we shouldn't need to keep the shadow copy of the block
>>> bitmap buffers any more. I would imagine we still need them for the
>>> inode bitmaps, for the same reason, though.
>> shadow copy holds preallocated blocks
>
> Are we talking about the same thing? I was referring to the
> jh->b_committed_data, which isn't used by mballoc at all.

oops. I meant in-core bitmap mballoc generates. if there is intention
to get rid of old allocator (balloc.c), then we don't need b_committed_data.

btw, I've just remembered why I decided don't protect data from reallocation:
in data=writeback one can get block with stale data easily. and many people
(to my knowledge) were using data=writeback as performing better.

thanks, Alex

2008-09-30 14:16:03

by Theodore Ts'o

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

On Tue, Sep 30, 2008 at 05:12:12PM +0400, Alex Tomas wrote:
>> For ext4, the only reason to use a tree would be to allow us to merge
>> deleted extents. This might not be worth the complexity, though, I
>> admit it.
>
> strictly speaking, extents code should have merged them at allocation time.

Sorry, I wasn't being clear enough. I was thinking of the scenario
where the user runs "rm -r" and deletes a directory hierarchy with
lots of small files. So the merging I was talking about was between
blocks belonging to different files, so we can send a single large
"trim" command to the disk. And since we can delete a large number of
files in 5 seconds with "rm -r", and the blocks will likely be very
close together if the allocator is doing a good job and the filesystem
is relatively unfragmented, it would also save memory if we can merge
extents belonging to different files instead of keeping them
separately on the linked list.

> oops. I meant in-core bitmap mballoc generates. if there is intention
> to get rid of old allocator (balloc.c), then we don't need b_committed_data.

Yes, I sent a patch on Sunday night proposing to do exactly that, as a
way of simplifying the code and reducing the test matrix for ext4.

> btw, I've just remembered why I decided don't protect data from reallocation:
> in data=writeback one can get block with stale data easily. and many people
> (to my knowledge) were using data=writeback as performing better.

Well, data=ordered is the default, so there would be many more people
using data=ordered. If we think there is a significant advantage in
not protecting data from reallocation besides the memory utilization,
I suppose we could make protecting data being conditional on
data=writeback. Perhaps having the additional data blocks available
to the block allocator could allow it to make better decisions. Not
sure it's worth it, though. Any thoughts?

- Ted

2008-10-01 07:17:15

by Andreas Dilger

[permalink] [raw]

Subject: Re: Potential bug in mballoc --- reusing data blocks before txn commit

Theodore Tso wrote:
> Yeah, I know Andrian Bunk strikes again.... but the right answer is
> to ressurect that code and add it back.

I submitted our patch to re-add this support yesterday.

On Sep 30, 2008 10:15 -0400, Theodore Ts'o wrote:
> On Tue, Sep 30, 2008 at 05:12:12PM +0400, Alex Tomas wrote:
> >> For ext4, the only reason to use a tree would be to allow us to merge
> >> deleted extents. This might not be worth the complexity, though, I
> >> admit it.
> >
> > strictly speaking, extents code should have merged them at allocation time.
>
> Sorry, I wasn't being clear enough. I was thinking of the scenario
> where the user runs "rm -r" and deletes a directory hierarchy with
> lots of small files. So the merging I was talking about was between
> blocks belonging to different files, so we can send a single large
> "trim" command to the disk.

I agree that this is probably most efficient. There would be an rbtree
per transaction, and would mostly be sparse.

> > btw, I've just remembered why I decided don't protect data from reallocation:
> > in data=writeback one can get block with stale data easily. and many people
> > (to my knowledge) were using data=writeback as performing better.
>
> Well, data=ordered is the default, so there would be many more people
> using data=ordered. If we think there is a significant advantage in
> not protecting data from reallocation besides the memory utilization,
> I suppose we could make protecting data being conditional on
> data=writeback. Perhaps having the additional data blocks available
> to the block allocator could allow it to make better decisions. Not
> sure it's worth it, though. Any thoughts?

I think for minimum complexity we should keep the rbtree all the time.
We would need it for the TRIM support in any case.

Having it enabled for ordered mode is fairly important, and I didn't
know that this little surprise was in the mballoc code at all. It may
explain some rare problems we've seen in the past.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.