2007-09-16 03:46:41

by Eric Sandeen

[permalink] [raw]
Subject: [RFC][PATCH] ensure do_split leaves enough free space in both blocks

For me, this fixes the problem reported by
[email protected], "ext3 dir_index causes an error"

The issue is that the do_split() function sorts the entries in the old
block by hash value, then moves half the entries to the new block
without accounting for how much space this actually moves. (IOW,
it moves half of the entry *count* not half of the entry *space*)

The patch below stores size as well when calculating the dx_map,
and then walks the hash-sorted dx_map, calculating how
many entries must be moved to more evenly split the existing
entries between the old block and the new block, guaranteeing
enough space for the new entry.

Enhancements that could be made, though I'm not sure it's worth it:
* pack the old dir block before calculating nr of entries to move,

-or-

* calculate the minimum rec_len when generating the map, vs.
just storing the current rec_len.

I'm not sure it's worth the extra calculations, I think this code
below works just fine from a correctness perspective.

How's this look, any comments?

Thanks,

-Eric

Index: linux/fs/ext3/namei.c
===================================================================
--- linux.orig/fs/ext3/namei.c
+++ linux/fs/ext3/namei.c
@@ -141,6 +141,7 @@ struct dx_map_entry
{
u32 hash;
u32 offs;
+ u32 size;
};

#ifdef CONFIG_EXT3_INDEX
@@ -685,6 +687,7 @@ static int dx_make_map (struct ext3_dir_
map_tail--;
map_tail->hash = h.hash;
map_tail->offs = (u32) ((char *) de - base);
+ map_tail->size = le16_to_cpu(de->rec_len);
count++;
cond_resched();
}
@@ -1142,7 +1159,7 @@ static struct ext3_dir_entry_2 *do_split
u32 hash2;
struct dx_map_entry *map;
char *data1 = (*bh)->b_data, *data2;
- unsigned split;
+ unsigned split, move, size, i;
struct ext3_dir_entry_2 *de = NULL, *de2;
int err = 0;

@@ -1170,8 +1187,19 @@ static struct ext3_dir_entry_2 *do_split
count = dx_make_map ((struct ext3_dir_entry_2 *) data1,
blocksize, hinfo, map);
map -= count;
- split = count/2; // need to adjust to actual middle
dx_sort_map (map, count);
+ /* Split the existing block in the middle, size-wise */
+ size = 0;
+ move = 0;
+ for (i = count-1; i >= 0; i--) {
+ /* is more than half of this entry in last half of the block? */
+ if (size + map[i].size/2 > blocksize/2)
+ break;
+ size += map[i].size;
+ move++;
+ }
+ /* map index at which we will split */
+ split = count - move;
hash2 = map[split].hash;
continued = hash2 == map[split - 1].hash;
dxtrace(printk("Split block %i at %x, %i/%i\n",


2007-09-17 05:47:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH] ensure do_split leaves enough free space in both blocks

On Sep 15, 2007 22:46 -0500, Eric Sandeen wrote:
> * calculate the minimum rec_len when generating the map, vs.
> just storing the current rec_len.

Well, we already do this when moving the entries, so in theory we
should do it when checking how many entries to move. That said,
we know we can't _increase_ the amount of space used (so no chance
of introducing a problem) but we might still end up with some imbalance.

> @@ -141,6 +141,7 @@ struct dx_map_entry
> {
> u32 hash;
> u32 offs;
> + u32 size;
> };

Hmm, there was something about the size of the dx_map_entry, because
it is actually built at the end of the target block, that we don't
want to make it too large.

Now, I'm not sure if adding an extra 32-bit field per entry would make
it too large or not, since I haven't looked at that code in ages. The
critical factor is whether max_entries = blocksize / min_rec_len would
consume more than the worst-case amount of space in the target block.

So, because thinking is hard, you might consider just changing the above
code to use "u16 offs; u16 size;" since we know those are big enough
variables, and won't increase the size of the map...

> + for (i = count-1; i >= 0; i--) {
> + /* is more than half of this entry in last half of the block? */
> + if (size + map[i].size/2 > blocksize/2)
> + break;
> + size += map[i].size;
> + move++;
> + }
> + /* map index at which we will split */
> + split = count - move;

The rest of this looks fine - I think the "1/2 of median entry" decision
is the right one as we discussed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-17 12:17:51

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] ensure do_split leaves enough free space in both blocks

Andreas Dilger wrote:

>> @@ -141,6 +141,7 @@ struct dx_map_entry
>> {
>> u32 hash;
>> u32 offs;
>> + u32 size;
>> };
>
> Hmm, there was something about the size of the dx_map_entry, because
> it is actually built at the end of the target block, that we don't
> want to make it too large.

Yep, that crossed my mind...

> Now, I'm not sure if adding an extra 32-bit field per entry would make
> it too large or not, since I haven't looked at that code in ages. The
> critical factor is whether max_entries = blocksize / min_rec_len would
> consume more than the worst-case amount of space in the target block.
>
> So, because thinking is hard, you might consider just changing the above
> code to use "u16 offs; u16 size;" since we know those are big enough
> variables, and won't increase the size of the map...

That sounds like a good plan. The other possibility is, we don't *have*
to store size in the map, with offset we can always get to the size, too.

>> + for (i = count-1; i >= 0; i--) {
>> + /* is more than half of this entry in last half of the block? */
>> + if (size + map[i].size/2 > blocksize/2)
>> + break;
>> + size += map[i].size;
>> + move++;
>> + }
>> + /* map index at which we will split */
>> + split = count - move;
>
> The rest of this looks fine - I think the "1/2 of median entry" decision
> is the right one as we discussed.

Yes, I forgot to mention that I had discussed this with you a bit
already. :) After drawing a few pictures, this seems like the right
way to go.

-Eric

2007-09-17 15:51:11

by Eric Sandeen

[permalink] [raw]
Subject: Re: [RFC][PATCH] ensure do_split leaves enough free space in both blocks

Andreas Dilger wrote:
> On Sep 15, 2007 22:46 -0500, Eric Sandeen wrote:
>> * calculate the minimum rec_len when generating the map, vs.
>> just storing the current rec_len.
>
> Well, we already do this when moving the entries, so in theory we
> should do it when checking how many entries to move. That said,
> we know we can't _increase_ the amount of space used (so no chance
> of introducing a problem) but we might still end up with some imbalance.

I think I'm going to pass on this, at least for now, to save the extra
calculation. We've lived for years with whatever split the original
code gave us, and with the exception of this overflow, we've not seen
problems. Splitting based on rec_len (vs. EXT3_DIR_REC_LEN) should
still split things fairly well, on average.

If we used EXT3_DIR_REC_LEN)(name_len) for the new tests, I think we'd
first need to add up the minimal size of *all* entries, because the new
threshold would no longer be blocksize/2. So, this would be more times
around the loop.

If anyone feels strongly that we should make this split as accurate as
possible, I'm willing to patch that up too, but for now I think I'd like
to get the more targeted fix in place, as it should solve the bug at
hand in a simpler manner.

Sound ok?

-Eric