Hi,
currently, when mmapped write is done to a file backed by ext3, the
filesystem does nothing to make sure blocks will be available when we need
to write them out. This has two nasty consequences:
1) When the flusher thread does writeback of the mmapped data, allocation
happens in the context of the flusher thread (i.e., as root). Thus a user
can effectively exceed quota limits arbitrarily or use space reserved
only for the sysadmin.
2) When the filesystem runs out of space, we just silently drop data on the
floor (the same happens when writeout is performed in the context of
the user and he hits his quota limit).
I think these problems are serious enough (especially (1), about which I was
notified lately) that we should try to fix them in ext3. The trouble is that
the fix is non-trivial: we basically have to implement delayed allocation for
ext3 - block reservation happens at page_mkwrite() time and we convert the
reservation into a real allocation during writeout. Note: Of course, we could
use a much simpler solution and just do block allocation directly at
page_mkwrite() time, but that really cripples performance for loads that do
random writes via mmap - I've tried that some time ago.
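To make the intended flow concrete, here is a toy, single-threaded model of
the scheme (illustrative userspace C only, not the patch code; the real entry
points are ext3_page_mkwrite() and the writepage path in the patches below):

#include <errno.h>

/* Toy model: one free and one reserved counter for the "filesystem". */
static long free_blocks = 1000, reserved_blocks;

/* Fault time: set space aside; fail the fault instead of losing data later. */
static int fault_reserve(long blocks)
{
        if (free_blocks - reserved_blocks < blocks)
                return -ENOSPC;         /* becomes SIGBUS at page-fault time */
        reserved_blocks += blocks;
        return 0;
}

/* Writeout time: conversion cannot fail - the space was already set aside. */
static void writeout_allocate(long blocks)
{
        reserved_blocks -= blocks;
        free_blocks -= blocks;
}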
The patches in this series (against 2.6.36-rc7) implement this. They
survived some testing with fsx-linux; I'll do more, plus some performance
testing.
So I'd like to hear what other people think about this. Reviews or
testing are welcome ;)
Honza
When we do delayed allocation of a buffer, we want to signal to the VFS that
the buffer is new (set buffer_new) so that it properly zeros out everything.
But we don't have the buffer mapped yet so we cannot really unmap the
underlying metadata in this state. Make the VFS skip unmapping of metadata
when the buffer is not yet mapped.
Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 3e7dca2..0f2ba29 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1688,8 +1688,9 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
if (buffer_new(bh)) {
/* blockdev mappings never come here */
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
}
}
bh = bh->b_this_page;
@@ -1875,8 +1876,9 @@ int block_prepare_write(struct page *page, unsigned from, unsigned to,
if (err)
break;
if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
if (PageUptodate(page)) {
clear_buffer_new(bh);
set_buffer_uptodate(bh);
@@ -2514,7 +2516,7 @@ int nobh_write_begin(struct address_space *mapping,
goto failed;
if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(bh))
+ if (buffer_new(bh) && buffer_mapped(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
--
1.6.4.2
Implement free blocks and reserved blocks counters for delayed allocation.
These counters are reliable in the sense that when they return success, the
subsequent conversion from reserved to allocated blocks always succeeds (see
comments in the code for details). This is useful for the ext3 filesystem to
implement delayed allocation, in particular for allocation in page_mkwrite.
Signed-off-by: Jan Kara <[email protected]>
---
fs/Kconfig | 4 ++
fs/Makefile | 1 +
fs/delalloc_counter.c | 109 ++++++++++++++++++++++++++++++++++++++
fs/ext3/Kconfig | 1 +
include/linux/delalloc_counter.h | 73 +++++++++++++++++++++++++
5 files changed, 188 insertions(+), 0 deletions(-)
create mode 100644 fs/delalloc_counter.c
create mode 100644 include/linux/delalloc_counter.h
diff --git a/fs/Kconfig b/fs/Kconfig
index 3d18530..4432d53 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -19,6 +19,10 @@ config FS_XIP
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"
+config DELALLOC_COUNTER
+ bool
+ default n
+
config FS_MBCACHE
# Meta block cache for Extended Attributes (ext2/ext3/ext4)
tristate
diff --git a/fs/Makefile b/fs/Makefile
index e6ec1d3..31b22e9 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -19,6 +19,7 @@ else
obj-y += no-block.o
endif
+obj-$(CONFIG_DELALLOC_COUNTER) += delalloc_counter.o
obj-$(CONFIG_BLK_DEV_INTEGRITY) += bio-integrity.o
obj-y += notify/
obj-$(CONFIG_EPOLL) += eventpoll.o
diff --git a/fs/delalloc_counter.c b/fs/delalloc_counter.c
new file mode 100644
index 0000000..0f575d5
--- /dev/null
+++ b/fs/delalloc_counter.c
@@ -0,0 +1,109 @@
+/*
+ * Per-cpu counters for delayed allocation
+ */
+#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
+#include <linux/module.h>
+#include <linux/log2.h>
+
+static long dac_error(struct delalloc_counter *c)
+{
+#ifdef CONFIG_SMP
+ return c->batch * nr_cpu_ids;
+#else
+ return 0;
+#endif
+}
+
+/*
+ * Reserve blocks for delayed allocation
+ *
+ * This code is subtle because we want to avoid synchronization of processes
+ * doing allocation in the common case when there's plenty of space in the
+ * filesystem.
+ *
+ * The code maintains the following property: Among all the calls to
+ * dac_reserve() that return 0 there exists a simple sequential ordering of
+ * these calls such that the check (free - reserved >= limit) in each call
+ * succeeds. This guarantees that we never reserve blocks we don't have.
+ *
+ * The proof of the above invariant: The function can return 0 either when the
+ * first if succeeds or when both ifs fail. To the first type of callers we
+ * assign the time of read of c->reserved in the first if, to the second type
+ * of callers we assign the time of read of c->reserved in the second if. We
+ * order callers by their assigned time and claim that this is the ordering
+ * required by the invariant. Suppose that a check (free - reserved >= limit)
+ * fails for caller C in the proposed ordering. We distinguish two cases:
+ * 1) function called by C returned zero because the first if succeeded - in
+ * this case reads of counters in the first if must have seen effects of
+ * __percpu_counter_add of all the callers before C (even their condition
+ * evaluation happened before ours). The errors accumulated in cpu-local
+ * variables are clearly < dac_error(c) and thus the condition should fail.
+ * Contradiction.
+ * 2) function called by C returned zero because the second if failed - again
+ * the read of the counters must have seen effects of __percpu_counter_add of
+ * all the callers before C and thus the condition should have succeeded.
+ * Contradiction.
+ */
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ s64 free, reserved;
+ int ret = 0;
+
+ __percpu_counter_add(&c->reserved, amount, c->batch);
+ /*
+ * This barrier makes sure that when effects of the following read of
+ * c->reserved are observable by another CPU also effects of the
+ * previous store to c->reserved are seen.
+ */
+ smp_mb();
+ if (percpu_counter_read(&c->free) - percpu_counter_read(&c->reserved)
+ - 2 * dac_error(c) >= limit)
+ return ret;
+ /*
+ * Near the limit - sum the counter to avoid returning ENOSPC too
+ * early. Note that we can still "unnecessarily" return ENOSPC when
+ * there are several racing writers. A spinlock in this section would
+ * solve it but let's ignore it for now.
+ */
+ free = percpu_counter_sum_positive(&c->free);
+ reserved = percpu_counter_sum_positive(&c->reserved);
+ if (free - reserved < limit) {
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+ ret = -ENOSPC;
+ }
+ return ret;
+}
+EXPORT_SYMBOL(dac_reserve);
+
+/* Account reserved blocks as allocated */
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, -amount, c->batch);
+ /*
+ * Make sure update of free counter is seen before update of
+ * reserved counter.
+ */
+ smp_wmb();
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+EXPORT_SYMBOL(dac_alloc_reserved);
+
+int dac_init(struct delalloc_counter *c, s64 amount)
+{
+ int err;
+
+ c->batch = 8*(1+ilog2(nr_cpu_ids));
+ err = percpu_counter_init(&c->free, amount);
+ if (!err)
+ err = percpu_counter_init(&c->reserved, 0);
+ return err;
+}
+EXPORT_SYMBOL(dac_init);
+
+void dac_destroy(struct delalloc_counter *c)
+{
+ percpu_counter_destroy(&c->free);
+ percpu_counter_destroy(&c->reserved);
+}
+EXPORT_SYMBOL(dac_destroy);
diff --git a/fs/ext3/Kconfig b/fs/ext3/Kconfig
index e8c6ba0..20418f3 100644
--- a/fs/ext3/Kconfig
+++ b/fs/ext3/Kconfig
@@ -1,6 +1,7 @@
config EXT3_FS
tristate "Ext3 journalling file system support"
select JBD
+ select DELALLOC_COUNTER
help
This is the journalling version of the Second extended file system
(often called ext3), the de facto standard Linux file system
diff --git a/include/linux/delalloc_counter.h b/include/linux/delalloc_counter.h
new file mode 100644
index 0000000..599fffc
--- /dev/null
+++ b/include/linux/delalloc_counter.h
@@ -0,0 +1,73 @@
+#ifndef _LINUX_DELALLOC_COUNTER_H
+#define _LINUX_DELALLOC_COUNTER_H
+
+#include <linux/percpu_counter.h>
+
+struct delalloc_counter {
+ struct percpu_counter free;
+ struct percpu_counter reserved;
+ int batch;
+};
+
+int dac_reserve(struct delalloc_counter *c, s32 amount, s64 limit);
+void dac_alloc_reserved(struct delalloc_counter *c, s32 amount);
+
+static inline int dac_alloc(struct delalloc_counter *c, s32 amount, s64 limit)
+{
+ int ret = dac_reserve(c, amount, limit);
+ if (!ret)
+ dac_alloc_reserved(c, amount);
+ return ret;
+}
+
+static inline void dac_free(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->free, amount, c->batch);
+}
+
+static inline void dac_cancel_reserved(struct delalloc_counter *c, s32 amount)
+{
+ __percpu_counter_add(&c->reserved, -amount, c->batch);
+}
+
+int dac_init(struct delalloc_counter *c, s64 amount);
+void dac_destroy(struct delalloc_counter *c);
+
+static inline s64 dac_get_avail(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_read(&c->free) -
+ percpu_counter_read(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_avail_sum(struct delalloc_counter *c)
+{
+ s64 ret = percpu_counter_sum(&c->free) -
+ percpu_counter_sum(&c->reserved);
+ if (ret < 0)
+ return 0;
+ return ret;
+}
+
+static inline s64 dac_get_reserved(struct delalloc_counter *c)
+{
+ return percpu_counter_read_positive(&c->reserved);
+}
+
+static inline s64 dac_get_reserved_sum(struct delalloc_counter *c)
+{
+ return percpu_counter_sum_positive(&c->reserved);
+}
+
+static inline s64 dac_get_free(struct delalloc_counter *c)
+{
+ return percpu_counter_read_positive(&c->free);
+}
+
+static inline s64 dac_get_free_sum(struct delalloc_counter *c)
+{
+ return percpu_counter_sum_positive(&c->free);
+}
+#endif
--
1.6.4.2
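To make the calling convention of the new API concrete, here is a minimal
usage sketch (illustrative only - in the next patch the counter is embedded
in ext3's sb_info and the limit comes from ext3_free_blocks_limit()):

#include <linux/delalloc_counter.h>

static int example_lifecycle(void)
{
        struct delalloc_counter c;
        int err;

        /* mount time: seed the counter with the number of free blocks */
        err = dac_init(&c, 1000);
        if (err)
                return err;
        /* page_mkwrite time: set one block aside, fail the fault on -ENOSPC
         * (limit 0 here: no blocks held back for root) */
        err = dac_reserve(&c, 1, 0);
        if (!err) {
                /* writepage time: conversion is guaranteed to succeed */
                dac_alloc_reserved(&c, 1);
                /* (a truncated, never-written page would instead call
                 *  dac_cancel_reserved(&c, 1);) */
                /* freeing the block returns it to the free pool */
                dac_free(&c, 1);
        }
        dac_destroy(&c);
        return err;
}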
We don't want to really allocate blocks at page_mkwrite() time because for
random writes via mmap it results in much more fragmented files. So we just
reserve enough free blocks in page_mkwrite() and do the real allocation from
writepage().
It's however not so simple because we do not want to overestimate the
necessary number of indirect blocks too badly in the presence of lots of
delayed-allocated buffers. Thus we track which indirect blocks already have a
reservation pending and do not reserve space for them again.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/balloc.c | 103 +++++++++-----
fs/ext3/file.c | 19 +++-
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 346 +++++++++++++++++++++++++++++++++++++++++---
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 23 +++-
include/linux/ext3_fs.h | 5 +-
include/linux/ext3_fs_i.h | 20 +++
include/linux/ext3_fs_sb.h | 3 +-
9 files changed, 458 insertions(+), 65 deletions(-)
diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index 4a32511..bf3f607 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -20,6 +20,8 @@
#include <linux/ext3_jbd.h>
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
+#include <linux/delalloc_counter.h>
+#include <linux/writeback.h>
/*
* balloc.c contains the blocks allocation and deallocation routines
@@ -633,7 +635,7 @@ do_more:
spin_lock(sb_bgl_lock(sbi, block_group));
le16_add_cpu(&desc->bg_free_blocks_count, group_freed);
spin_unlock(sb_bgl_lock(sbi, block_group));
- percpu_counter_add(&sbi->s_freeblocks_counter, count);
+ dac_free(&sbi->s_alloc_counter, count);
/* We dirtied the bitmap block */
BUFFER_TRACE(bitmap_bh, "dirtied bitmap block");
@@ -1411,23 +1413,19 @@ out:
}
/**
- * ext3_has_free_blocks()
- * @sbi: in-core super block structure.
+ * ext3_free_blocks_limit()
+ * @sb: super block
*
* Check if filesystem has at least 1 free block available for allocation.
*/
-static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
+ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb)
{
- ext3_fsblk_t free_blocks, root_blocks;
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
- free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
- root_blocks = le32_to_cpu(sbi->s_es->s_r_blocks_count);
- if (free_blocks < root_blocks + 1 && !capable(CAP_SYS_RESOURCE) &&
- sbi->s_resuid != current_fsuid() &&
- (sbi->s_resgid == 0 || !in_group_p (sbi->s_resgid))) {
- return 0;
- }
- return 1;
+ if (!capable(CAP_SYS_RESOURCE) && sbi->s_resuid != current_fsuid() &&
+ (sbi->s_resgid == 0 || !in_group_p(sbi->s_resgid)))
+ return le32_to_cpu(sbi->s_es->s_r_blocks_count) + 1;
+ return 0;
}
/**
@@ -1444,12 +1442,21 @@ static int ext3_has_free_blocks(struct ext3_sb_info *sbi)
*/
int ext3_should_retry_alloc(struct super_block *sb, int *retries)
{
- if (!ext3_has_free_blocks(EXT3_SB(sb)) || (*retries)++ > 3)
+ struct ext3_sb_info *sbi = EXT3_SB(sb);
+ ext3_fsblk_t limit;
+
+ limit = ext3_free_blocks_limit(sb);
+ if (dac_get_free(&sbi->s_alloc_counter) < limit || (*retries)++ > 3)
return 0;
jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
-
- return journal_force_commit_nested(EXT3_SB(sb)->s_journal);
+ /*
+ * There's a chance commit will free some blocks and writeback can
+ * write delayed blocks so that excessive reservation gets released.
+ */
+ if (dac_get_reserved(&sbi->s_alloc_counter))
+ writeback_inodes_sb_if_idle(sb);
+ return journal_force_commit_nested(sbi->s_journal);
}
/**
@@ -1458,6 +1465,7 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
* @inode: file inode
* @goal: given target block(filesystem wide)
* @count: target number of blocks to allocate
+ * @reserved: number of reserved blocks
* @errp: error code
*
* ext3_new_blocks uses a goal block to assist allocation. It tries to
@@ -1465,9 +1473,13 @@ int ext3_should_retry_alloc(struct super_block *sb, int *retries)
* fails, it will try to allocate block(s) from other block groups without
* any specific goal block.
*
+ * If there is some number of blocks reserved for the allocation, we first
+ * allocate non-reserved blocks and only when we have enough of them, we start
+ * using the reserved ones.
*/
ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp)
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int reserved, int *errp)
{
struct buffer_head *bitmap_bh = NULL;
struct buffer_head *gdp_bh;
@@ -1478,7 +1490,7 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t ret_block; /* filesyetem-wide allocated block */
int bgi; /* blockgroup iteration index */
int fatal = 0, err;
- int performed_allocation = 0;
+ int got_quota = 0, got_space = 0;
ext3_grpblk_t free_blocks; /* number of free blocks in a group */
struct super_block *sb;
struct ext3_group_desc *gdp;
@@ -1499,17 +1511,28 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
printk("ext3_new_block: nonexistent device");
return 0;
}
+ sbi = EXT3_SB(sb);
/*
* Check quota for allocation of this block.
*/
- err = dquot_alloc_block(inode, num);
- if (err) {
- *errp = err;
- return 0;
+ if (dquot_alloc_block(inode, num - reserved)) {
+ *errp = -EDQUOT;
+ goto out;
}
+ got_quota = 1;
+ /*
+ * We need not succeed in allocating all these blocks but we have to
+ * check & update delalloc counter before allocating blocks. That
+ * guarantees that reserved blocks are always possible to allocate...
+ */
+ if (dac_alloc(&sbi->s_alloc_counter, num - reserved,
+ ext3_free_blocks_limit(sb)) < 0) {
+ *errp = -ENOSPC;
+ goto out;
+ }
+ got_space = 1;
- sbi = EXT3_SB(sb);
es = EXT3_SB(sb)->s_es;
ext3_debug("goal=%lu.\n", goal);
/*
@@ -1524,11 +1547,6 @@ ext3_fsblk_t ext3_new_blocks(handle_t *handle, struct inode *inode,
if (block_i && ((windowsz = block_i->rsv_window_node.rsv_goal_size) > 0))
my_rsv = &block_i->rsv_window_node;
- if (!ext3_has_free_blocks(sbi)) {
- *errp = -ENOSPC;
- goto out;
- }
-
/*
* First, test whether the goal block is free.
*/
@@ -1658,8 +1676,6 @@ allocated:
goto retry_alloc;
}
- performed_allocation = 1;
-
#ifdef CONFIG_JBD_DEBUG
{
struct buffer_head *debug_bh;
@@ -1709,7 +1725,6 @@ allocated:
spin_lock(sb_bgl_lock(sbi, group_no));
le16_add_cpu(&gdp->bg_free_blocks_count, -num);
spin_unlock(sb_bgl_lock(sbi, group_no));
- percpu_counter_sub(&sbi->s_freeblocks_counter, num);
BUFFER_TRACE(gdp_bh, "journal_dirty_metadata for group descriptor");
err = ext3_journal_dirty_metadata(handle, gdp_bh);
@@ -1721,7 +1736,23 @@ allocated:
*errp = 0;
brelse(bitmap_bh);
- dquot_free_block(inode, *count-num);
+ /* Used some of the reserved blocks? */
+ if (*count - reserved < num) {
+ unsigned int used_rsv = num - (*count - reserved);
+
+ dac_alloc_reserved(&sbi->s_alloc_counter, used_rsv);
+ dquot_claim_block(inode, used_rsv);
+ } else {
+ unsigned int missing_blocks = *count - reserved - num;
+
+ /*
+ * We didn't succeed in allocating all non-reserved blocks.
+ * Update counters to fix overestimation we did at the
+ * beginning of this function
+ */
+ dac_free(&sbi->s_alloc_counter, missing_blocks);
+ dquot_free_block(inode, missing_blocks);
+ }
*count = num;
return ret_block;
@@ -1735,8 +1766,10 @@ out:
/*
* Undo the block allocation
*/
- if (!performed_allocation)
- dquot_free_block(inode, *count);
+ if (got_quota)
+ dquot_free_block(inode, *count - reserved);
+ if (got_space)
+ dac_free(&sbi->s_alloc_counter, *count - reserved);
brelse(bitmap_bh);
return 0;
}
@@ -1746,7 +1779,7 @@ ext3_fsblk_t ext3_new_block(handle_t *handle, struct inode *inode,
{
unsigned long count = 1;
- return ext3_new_blocks(handle, inode, goal, &count, errp);
+ return ext3_new_blocks(handle, inode, goal, &count, 0, errp);
}
/**
diff --git a/fs/ext3/file.c b/fs/ext3/file.c
index f55df0e..249597d 100644
--- a/fs/ext3/file.c
+++ b/fs/ext3/file.c
@@ -52,6 +52,23 @@ static int ext3_release_file (struct inode * inode, struct file * filp)
return 0;
}
+static const struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = dquot_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4ab72db..481f63c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -257,7 +257,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
avefreei = freei / ngroups;
- freeb = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
+ freeb = dac_get_avail(&sbi->s_alloc_counter);
avefreeb = freeb / ngroups;
ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5e0faf4..2ee6df7 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/mount.h>
#include "xattr.h"
#include "acl.h"
@@ -195,6 +196,7 @@ static int truncate_restart_transaction(handle_t *handle, struct inode *inode)
void ext3_evict_inode (struct inode *inode)
{
struct ext3_block_alloc_info *rsv;
+ struct ext3_inode_info *ei;
handle_t *handle;
int want_delete = 0;
@@ -205,9 +207,10 @@ void ext3_evict_inode (struct inode *inode)
truncate_inode_pages(&inode->i_data, 0);
+ ei = EXT3_I(inode);
ext3_discard_reservation(inode);
- rsv = EXT3_I(inode)->i_block_alloc_info;
- EXT3_I(inode)->i_block_alloc_info = NULL;
+ rsv = ei->i_block_alloc_info;
+ ei->i_block_alloc_info = NULL;
if (unlikely(rsv))
kfree(rsv);
@@ -239,7 +242,7 @@ void ext3_evict_inode (struct inode *inode)
* (Well, we could do this if we need to, but heck - it works)
*/
ext3_orphan_del(handle, inode);
- EXT3_I(inode)->i_dtime = get_seconds();
+ ei->i_dtime = get_seconds();
/*
* One subtle ordering requirement: if anything has gone wrong
@@ -260,10 +263,194 @@ void ext3_evict_inode (struct inode *inode)
ext3_free_inode(handle, inode);
}
ext3_journal_stop(handle);
+out_check:
+ if (ei->i_reserved_quota)
+ ext3_warning(inode->i_sb, __func__, "Releasing inode %lu with "
+ "%lu reserved blocks.\n", inode->i_ino,
+ (unsigned long)ei->i_reserved_quota);
return;
no_delete:
end_writeback(inode);
dquot_drop(inode);
+ goto out_check;
+}
+
+/*
+ * Find indirect block structure for given block offset. If the structure
+ * does not exist, return NULL and fill parentp (provided it's != NULL) with
+ * a pointer to the parent node in rb_tree.
+ */
+static struct ext3_da_indirect *ext3_find_da_indirect(struct inode *inode,
+ long i_block, struct rb_node **parentp)
+{
+ struct rb_node *n = EXT3_I(inode)->i_da_indirect.rb_node;
+ struct rb_node *parent = NULL;
+ struct ext3_da_indirect *ind;
+
+ if (i_block < EXT3_NDIR_BLOCKS)
+ return NULL;
+ i_block = (i_block - EXT3_NDIR_BLOCKS) &
+ ~(EXT3_ADDR_PER_BLOCK(inode->i_sb) - 1);
+ while (n) {
+ ind = rb_entry(n, struct ext3_da_indirect, node);
+
+ parent = n;
+ if (i_block < ind->offset)
+ n = n->rb_left;
+ else if (i_block > ind->offset)
+ n = n->rb_right;
+ else
+ return ind;
+ }
+ if (parentp)
+ *parentp = parent;
+ return NULL;
+}
+
+static struct ext3_da_indirect *ext3_add_da_indirect(struct inode *inode,
+ long i_block, struct rb_node *parent_node)
+{
+ struct ext3_da_indirect *ind;
+ struct rb_node **np;
+
+ ind = kmalloc(sizeof(struct ext3_da_indirect), GFP_NOFS);
+ if (!ind)
+ return NULL;
+
+ ind->offset = (i_block - EXT3_NDIR_BLOCKS) &
+ ~(EXT3_ADDR_PER_BLOCK(inode->i_sb) - 1);
+ ind->data_blocks = 1;
+ ind->flags = 0;
+ if (parent_node) {
+ struct ext3_da_indirect *parent = rb_entry(
+ parent_node, struct ext3_da_indirect, node);
+
+ if (ind->offset < parent->offset)
+ np = &parent_node->rb_left;
+ else
+ np = &parent_node->rb_right;
+ } else
+ np = &EXT3_I(inode)->i_da_indirect.rb_node;
+ rb_link_node(&ind->node, parent_node, np);
+ rb_insert_color(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ return ind;
+}
+
+static int ext3_calc_indirect_depth(struct inode *inode, long i_block)
+{
+ int apbb = EXT3_ADDR_PER_BLOCK_BITS(inode->i_sb);
+
+ if (i_block < EXT3_NDIR_BLOCKS)
+ return 0;
+ i_block -= EXT3_NDIR_BLOCKS;
+ if (i_block < (1 << apbb))
+ return 1;
+ i_block -= (1 << apbb);
+ if (i_block < (1 << 2*apbb))
+ return 2;
+ return 3;
+}
+
+static int ext3_reserve_blocks(struct inode *inode, unsigned int count)
+{
+ int ret;
+
+ if (dquot_reserve_block(inode, count))
+ return -EDQUOT;
+ ret = dac_reserve(&EXT3_SB(inode->i_sb)->s_alloc_counter, count,
+ ext3_free_blocks_limit(inode->i_sb));
+ if (ret < 0) {
+ dquot_release_reservation_block(inode, count);
+ return ret;
+ }
+ return 0;
+}
+
+static void ext3_cancel_rsv_blocks(struct inode *inode, unsigned int count)
+{
+ dac_cancel_reserved(&EXT3_SB(inode->i_sb)->s_alloc_counter, count);
+ dquot_release_reservation_block(inode, count);
+}
+
+/*
+ * Reserve appropriate amount of space (and quota) for future allocation.
+ * Record the fact in inode's tree of reserved indirect blocks.
+ */
+static int ext3_rsv_da_block(struct inode *inode, long i_block)
+{
+ int depth = ext3_calc_indirect_depth(inode, i_block);
+ struct rb_node *parent_node;
+ struct ext3_da_indirect *ind;
+ int ret;
+
+ /* No indirect blocks needed? */
+ if (depth == 0)
+ return ext3_reserve_blocks(inode, 1);
+
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+ ind = ext3_find_da_indirect(inode, i_block, &parent_node);
+ /* If indirect block is already reserved, we need just the data block */
+ if (ind)
+ depth = 1;
+ else
+ depth++;
+
+ ret = ext3_reserve_blocks(inode, depth);
+ if (ret < 0)
+ goto out;
+
+ if (!ind) {
+ ind = ext3_add_da_indirect(inode, i_block, parent_node);
+ if (!ind) {
+ ext3_cancel_rsv_blocks(inode, depth);
+ ret = -ENOMEM;
+ goto out;
+ }
+ } else
+ ind->data_blocks++;
+out:
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+ return ret;
+}
+
+/*
+ * Cancel reservation of delayed allocated block and corresponding metadata
+ */
+static void ext3_cancel_da_block(struct inode *inode, long i_block)
+{
+ struct ext3_da_indirect *ind;
+ int unrsv = 1;
+
+ if (i_block < EXT3_NDIR_BLOCKS) {
+ ext3_cancel_rsv_blocks(inode, 1);
+ return;
+ }
+
+ mutex_lock(&EXT3_I(inode)->truncate_mutex);
+ ind = ext3_find_da_indirect(inode, i_block, NULL);
+ if (ind && !--ind->data_blocks) {
+ if (!(ind->flags & EXT3_DA_ALLOC_FL))
+ unrsv += ext3_calc_indirect_depth(inode, i_block);
+ rb_erase(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ kfree(ind);
+ }
+ mutex_unlock(&EXT3_I(inode)->truncate_mutex);
+ ext3_cancel_rsv_blocks(inode, unrsv);
+}
+
+static void ext3_allocated_da_block(struct inode *inode,
+ struct ext3_da_indirect *ind,
+ int bh_delayed, unsigned int unrsv_blocks)
+{
+ if (!(ind->flags & EXT3_DA_ALLOC_FL)) {
+ /* Cancel unused indirect blocks reservation */
+ ext3_cancel_rsv_blocks(inode, unrsv_blocks);
+ ind->flags |= EXT3_DA_ALLOC_FL;
+ }
+ if (bh_delayed && !--ind->data_blocks) {
+ rb_erase(&ind->node, &EXT3_I(inode)->i_da_indirect);
+ kfree(ind);
+ }
}
typedef struct {
@@ -537,8 +724,10 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
/**
* ext3_alloc_blocks: multiple allocate blocks needed for a branch
+ * @goal: goal block for the allocation
* @indirect_blks: the number of blocks need to allocate for indirect
* blocks
+ * @reserved: is the data block reserved?
*
* @new_blocks: on return it will store the new block numbers for
* the indirect blocks(if needed) and the first direct block,
@@ -547,7 +736,8 @@ static int ext3_blks_to_allocate(Indirect *branch, int k, unsigned long blks,
*/
static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int indirect_blks, int blks,
- ext3_fsblk_t new_blocks[4], int *err)
+ unsigned int reserved, ext3_fsblk_t new_blocks[4],
+ int *err)
{
int target, i;
unsigned long count = 0;
@@ -568,11 +758,15 @@ static int ext3_alloc_blocks(handle_t *handle, struct inode *inode,
while (1) {
count = target;
/* allocating blocks for indirect blocks and direct blocks */
- current_block = ext3_new_blocks(handle,inode,goal,&count,err);
+ current_block = ext3_new_blocks(handle, inode, goal, &count,
+ reserved, err);
if (*err)
goto failed_out;
target -= count;
+ /* Used some reserved blocks? */
+ if (target < reserved)
+ reserved = target;
/* allocate blocks for indirect blocks */
while (index < indirect_blks && count) {
new_blocks[index++] = current_block++;
@@ -601,6 +795,8 @@ failed_out:
* @inode: owner
* @indirect_blks: number of allocated indirect blocks
* @blks: number of allocated direct blocks
+ * @reserved: is the data block reserved?
+ * @goal: goal block for the allocation
* @offsets: offsets (in the blocks) to store the pointers to next.
* @branch: place to store the chain in.
*
@@ -622,8 +818,8 @@ failed_out:
* as described above and return 0.
*/
static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
- int indirect_blks, int *blks, ext3_fsblk_t goal,
- int *offsets, Indirect *branch)
+ int indirect_blks, int *blks, unsigned int reserved,
+ ext3_fsblk_t goal, int *offsets, Indirect *branch)
{
int blocksize = inode->i_sb->s_blocksize;
int i, n = 0;
@@ -634,7 +830,7 @@ static int ext3_alloc_branch(handle_t *handle, struct inode *inode,
ext3_fsblk_t current_block;
num = ext3_alloc_blocks(handle, inode, goal, indirect_blks,
- *blks, new_blocks, &err);
+ *blks, reserved, new_blocks, &err);
if (err)
return err;
@@ -834,7 +1030,9 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
int depth;
struct ext3_inode_info *ei = EXT3_I(inode);
int count = 0;
+ unsigned int reserved = 0;
ext3_fsblk_t first_block = 0;
+ struct ext3_da_indirect *ind = NULL;
J_ASSERT(handle != NULL || create == 0);
@@ -924,16 +1122,48 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
indirect_blks = (chain + depth) - partial - 1;
/*
- * Next look up the indirect map to count the totoal number of
+ * Next look up the indirect map to count the total number of
* direct blocks to allocate for this branch.
*/
count = ext3_blks_to_allocate(partial, indirect_blks,
maxblocks, blocks_to_boundary);
- /*
- * Block out ext3_truncate while we alter the tree
- */
- err = ext3_alloc_branch(handle, inode, indirect_blks, &count, goal,
- offsets + (partial - chain), partial);
+ if (indirect_blks || buffer_delay(bh_result)) {
+ ind = ext3_find_da_indirect(inode, iblock, NULL);
+ if (ind) {
+ if (!(ind->flags & EXT3_DA_ALLOC_FL))
+ reserved = indirect_blks;
+ else if (indirect_blks)
+ ext3_warning(inode->i_sb, __func__,
+ "Block %lu of inode %lu needs "
+ "allocating %d indirect blocks but all "
+ "should be already allocated.",
+ (unsigned long)iblock, inode->i_ino,
+ indirect_blks);
+ }
+ if (buffer_delay(bh_result)) {
+ WARN_ON(maxblocks != 1 || !bh_result->b_page);
+ if (!ind && depth > 1)
+ ext3_warning(inode->i_sb, __func__,
+ "Delayed block %lu of inode %lu is "
+ "missing reservation for %d indirect "
+ "blocks.", (unsigned long)iblock,
+ inode->i_ino, indirect_blks);
+ reserved++; /* For data block */
+ }
+ }
+ err = ext3_alloc_branch(handle, inode, indirect_blks, &count, reserved,
+ goal, offsets + (partial - chain), partial);
+ if (!err) {
+ if (ind)
+ ext3_allocated_da_block(inode, ind,
+ buffer_delay(bh_result),
+ ext3_calc_indirect_depth(inode, iblock) -
+ indirect_blks);
+ if (buffer_delay(bh_result))
+ clear_buffer_delay(bh_result);
+ else
+ set_buffer_new(bh_result);
+ }
/*
* The ext3_splice_branch call will free and forget any buffers
@@ -948,8 +1178,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;
-
- set_buffer_new(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
@@ -1744,15 +1972,39 @@ ext3_readpages(struct file *file, struct address_space *mapping,
return mpage_readpages(mapping, pages, nr_pages, ext3_get_block);
}
+
+static int truncate_delayed_bh(handle_t *handle, struct buffer_head *bh)
+{
+ if (buffer_delay(bh)) {
+ struct inode *inode = bh->b_page->mapping->host;
+
+ /*
+ * We cheat here a bit since we do not add a block-in-page
+ * offset but that does not matter for identifying indirect
+ * block
+ */
+ ext3_cancel_da_block(inode, bh->b_page->index <<
+ (PAGE_CACHE_SHIFT - inode->i_blkbits));
+ clear_buffer_delay(bh);
+ }
+ return 0;
+}
+
static void ext3_invalidatepage(struct page *page, unsigned long offset)
{
- journal_t *journal = EXT3_JOURNAL(page->mapping->host);
+ struct inode *inode = page->mapping->host;
+ journal_t *journal = EXT3_JOURNAL(inode);
+ int bsize = 1 << inode->i_blkbits;
/*
* If it's a full truncate we just forget about the pending dirtying
*/
if (offset == 0)
ClearPageChecked(page);
+ if (page_has_buffers(page)) {
+ walk_page_buffers(NULL, page_buffers(page), offset + bsize - 1,
+ PAGE_CACHE_SIZE, NULL, truncate_delayed_bh);
+ }
journal_invalidatepage(journal, page, offset);
}
@@ -2044,6 +2296,7 @@ static inline int all_zeroes(__le32 *p, __le32 *q)
/**
* ext3_find_shared - find the indirect blocks for partial truncation.
* @inode: inode in question
+ * @iblock: number of the first truncated block
* @depth: depth of the affected branch
* @offsets: offsets of pointers in that branch (see ext3_block_to_path)
* @chain: place to store the pointers to partial indirect blocks
@@ -2076,8 +2329,8 @@ static inline int all_zeroes(__le32 *p, __le32 *q)
* c) free the subtrees growing from the inode past the @chain[0].
* (no partially truncated stuff there). */
-static Indirect *ext3_find_shared(struct inode *inode, int depth,
- int offsets[4], Indirect chain[4], __le32 *top)
+static Indirect *ext3_find_shared(struct inode *inode, sector_t iblock,
+ int depth, int offsets[4], Indirect chain[4], __le32 *top)
{
Indirect *partial, *p;
int k, err;
@@ -2097,8 +2350,22 @@ static Indirect *ext3_find_shared(struct inode *inode, int depth,
if (!partial->key && *partial->p)
/* Writer: end */
goto no_top;
+ /*
+ * If we don't truncate the whole indirect block and there are some
+ * delay allocated blocks in it (must be before the truncation point
+ * as ext3_invalidatepage() has been already run for others), we must
+ * keep the indirect block as reservation has been already spent on
+ * its allocation.
+ */
+ if (partial == chain + depth - 1 &&
+ ext3_find_da_indirect(inode, iblock, NULL)) {
+ p = partial;
+ goto shared_ind_found;
+ }
+
for (p=partial; p>chain && all_zeroes((__le32*)p->bh->b_data,p->p); p--)
;
+shared_ind_found:
/*
* OK, we've found the last block that must survive. The rest of our
* branch should be detached before unlocking. However, if that rest
@@ -2516,7 +2783,7 @@ void ext3_truncate(struct inode *inode)
goto do_indirects;
}
- partial = ext3_find_shared(inode, n, offsets, chain, &nr);
+ partial = ext3_find_shared(inode, last_block, n, offsets, chain, &nr);
/* Kill the top of shared branch (not detached) */
if (nr) {
if (partial == chain) {
@@ -3493,3 +3760,42 @@ int ext3_change_inode_journal_flag(struct inode *inode, int val)
return err;
}
+
+/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int ret;
+
+ /* Buffer has already blocks reserved? */
+ if (buffer_delay(bh))
+ return 0;
+
+ ret = ext3_get_blocks_handle(NULL, inode, iblock, 1, bh, 0);
+ if (ret < 0)
+ return ret;
+ if (ret > 0 || !create)
+ return 0;
+ ret = ext3_rsv_da_block(inode, iblock);
+ if (ret < 0)
+ return ret;
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ return 0;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ int retry = 0;
+ int ret;
+ struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+ do {
+ ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+ } while (ret == VM_FAULT_SIGBUS &&
+ ext3_should_retry_alloc(sb, &retry));
+ return ret;
+}
diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 0ccd7b1..91d1ae1 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -929,7 +929,7 @@ int ext3_group_add(struct super_block *sb, struct ext3_new_group_data *input)
le32_add_cpu(&es->s_r_blocks_count, input->reserved_blocks);
/* Update the free space counts */
- percpu_counter_add(&sbi->s_freeblocks_counter,
+ percpu_counter_add(&sbi->s_alloc_counter.free,
input->free_blocks_count);
percpu_counter_add(&sbi->s_freeinodes_counter,
EXT3_INODES_PER_GROUP(sb));
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5dbf4db..c5b7f39 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -431,7 +431,7 @@ static void ext3_put_super (struct super_block * sb)
for (i = 0; i < sbi->s_gdb_count; i++)
brelse(sbi->s_group_desc[i]);
kfree(sbi->s_group_desc);
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
brelse(sbi->s_sbh);
@@ -482,6 +482,11 @@ static struct inode *ext3_alloc_inode(struct super_block *sb)
ei->vfs_inode.i_version = 1;
atomic_set(&ei->i_datasync_tid, 0);
atomic_set(&ei->i_sync_tid, 0);
+#ifdef CONFIG_QUOTA
+ ei->i_reserved_quota = 0;
+#endif
+ ei->i_da_indirect = RB_ROOT;
+
return &ei->vfs_inode;
}
@@ -742,8 +747,17 @@ static ssize_t ext3_quota_read(struct super_block *sb, int type, char *data,
size_t len, loff_t off);
static ssize_t ext3_quota_write(struct super_block *sb, int type,
const char *data, size_t len, loff_t off);
+#ifdef CONFIG_QUOTA
+qsize_t *ext3_get_reserved_space(struct inode *inode)
+{
+ return &EXT3_I(inode)->i_reserved_quota;
+}
+#endif
static const struct dquot_operations ext3_quota_operations = {
+#ifdef CONFIG_QUOTA
+ .get_reserved_space = ext3_get_reserved_space,
+#endif
.write_dquot = ext3_write_dquot,
.acquire_dquot = ext3_acquire_dquot,
.release_dquot = ext3_release_dquot,
@@ -1946,8 +1960,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
"mounting ext3 over ext2?");
goto failed_mount2;
}
- err = percpu_counter_init(&sbi->s_freeblocks_counter,
- ext3_count_free_blocks(sb));
+ err = dac_init(&sbi->s_alloc_counter, ext3_count_free_blocks(sb));
if (!err) {
err = percpu_counter_init(&sbi->s_freeinodes_counter,
ext3_count_free_inodes(sb));
@@ -2036,7 +2049,7 @@ cantfind_ext3:
goto failed_mount;
failed_mount3:
- percpu_counter_destroy(&sbi->s_freeblocks_counter);
+ dac_destroy(&sbi->s_alloc_counter);
percpu_counter_destroy(&sbi->s_freeinodes_counter);
percpu_counter_destroy(&sbi->s_dirs_counter);
journal_destroy(sbi->s_journal);
@@ -2723,7 +2736,7 @@ static int ext3_statfs (struct dentry * dentry, struct kstatfs * buf)
buf->f_type = EXT3_SUPER_MAGIC;
buf->f_bsize = sb->s_blocksize;
buf->f_blocks = le32_to_cpu(es->s_blocks_count) - sbi->s_overhead_last;
- buf->f_bfree = percpu_counter_sum_positive(&sbi->s_freeblocks_counter);
+ buf->f_bfree = dac_get_avail_sum(&sbi->s_alloc_counter);
buf->f_bavail = buf->f_bfree - le32_to_cpu(es->s_r_blocks_count);
if (buf->f_bfree < le32_to_cpu(es->s_r_blocks_count))
buf->f_bavail = 0;
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 6ce1bca..e24a355 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -837,12 +837,14 @@ ext3_group_first_block_no(struct super_block *sb, unsigned long group_no)
# define NORET_AND noreturn,
/* balloc.c */
+extern ext3_fsblk_t ext3_free_blocks_limit(struct super_block *sb);
extern int ext3_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext3_bg_num_gdb(struct super_block *sb, int group);
extern ext3_fsblk_t ext3_new_block (handle_t *handle, struct inode *inode,
ext3_fsblk_t goal, int *errp);
extern ext3_fsblk_t ext3_new_blocks (handle_t *handle, struct inode *inode,
- ext3_fsblk_t goal, unsigned long *count, int *errp);
+ ext3_fsblk_t goal, unsigned long *count,
+ unsigned int reserved, int *errp);
extern void ext3_free_blocks (handle_t *handle, struct inode *inode,
ext3_fsblk_t block, unsigned long count);
extern void ext3_free_blocks_sb (handle_t *handle, struct super_block *sb,
@@ -908,6 +910,7 @@ extern void ext3_get_inode_flags(struct ext3_inode_info *);
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
diff --git a/include/linux/ext3_fs_i.h b/include/linux/ext3_fs_i.h
index f42c098..10e7703 100644
--- a/include/linux/ext3_fs_i.h
+++ b/include/linux/ext3_fs_i.h
@@ -64,6 +64,20 @@ struct ext3_block_alloc_info {
#define rsv_start rsv_window._rsv_start
#define rsv_end rsv_window._rsv_end
+
+#define EXT3_DA_ALLOC_FL 0x0001 /* Indirect block is allocated */
+/*
+ * Structure recording information about indirect block with delayed allocated
+ * data blocks beneath.
+ */
+struct ext3_da_indirect {
+ struct rb_node node;
+ __u32 offset; /* Offset of indirect block */
+ unsigned short data_blocks; /* Number of delayed allocated data
+ * blocks below this indirect block */
+ unsigned short flags;
+};
+
/*
* third extended file system inode data in memory
*/
@@ -92,6 +106,9 @@ struct ext3_inode_info {
/* block reservation info */
struct ext3_block_alloc_info *i_block_alloc_info;
+ /* RB-tree with information about delayed-allocated indirect blocks */
+ struct rb_root i_da_indirect;
+
__u32 i_dir_start_lookup;
#ifdef CONFIG_EXT3_FS_XATTR
/*
@@ -125,6 +142,9 @@ struct ext3_inode_info {
/* on-disk additional length */
__u16 i_extra_isize;
+#ifdef CONFIG_QUOTA
+ qsize_t i_reserved_quota;
+#endif
/*
* truncate_mutex is for serialising ext3_truncate() against
diff --git a/include/linux/ext3_fs_sb.h b/include/linux/ext3_fs_sb.h
index 258088a..54909d0 100644
--- a/include/linux/ext3_fs_sb.h
+++ b/include/linux/ext3_fs_sb.h
@@ -21,6 +21,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <linux/delalloc_counter.h>
#endif
#include <linux/rbtree.h>
@@ -58,7 +59,7 @@ struct ext3_sb_info {
u32 s_hash_seed[4];
int s_def_hash_version;
int s_hash_unsigned; /* 3 if hash should be signed, 0 if not */
- struct percpu_counter s_freeblocks_counter;
+ struct delalloc_counter s_alloc_counter;
struct percpu_counter s_freeinodes_counter;
struct percpu_counter s_dirs_counter;
struct blockgroup_lock *s_blockgroup_lock;
--
1.6.4.2
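To put the indirect-block accounting above into numbers (a worked example,
assuming a 4k block size, i.e. 1024 block pointers per indirect block, so one
indirect block covers 4MB of file data): the first delayed block landing
under a not-yet-tracked indirect block reserves the data block plus the whole
indirect path - 2 blocks in the singly-indirect range, up to 4 in the
triply-indirect range, since the double/triple levels are deliberately
over-reserved. Every further delayed block under the same indirect block
finds the ext3_da_indirect node in the inode's RB-tree and reserves just 1
block. So a pathological writer touching one page per 4MB chunk roughly
doubles its reservation, while a dense writer over-reserves by well under 1%.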
On Sat, Oct 09, 2010 at 02:12:26AM +0200, Jan Kara wrote:
> Implement free blocks and reserved blocks counters for delayed allocation.
> These counters are reliable in the sense that when they return success, the
> subsequent conversion from reserved to allocated blocks always succeeds (see
> comments in the code for details). This is useful for the ext3 filesystem to
> implement delayed allocation, in particular for allocation in page_mkwrite.
This doesn't really look like generic code that should go into the core
kernel. I'd just add it to ext3 directly.
On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
>
> currently, when mmapped write is done to a file backed by ext3, the
> filesystem does nothing to make sure blocks will be available when we need
> to write them out.
Hmm, you've done all of this work already, so this isn't the best time
to suggest this, but I wonder if we've explored all of the
alternatives that might allow for a less drastic set of changes to
ext3, just out of stability's sake.
How often do legitimate workloads mmap a sparse file then write into
it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
the end of the file; this I believe was lifted later on (at least I
don't see it in SUSv3 spec).
If it's not all that common, then other options are:
1) Fail an mmap with EINVAL if there is an attempt to map a file
region which is either sparse or extends beyond the end of a file.
This is probably not a great alternative, but it's a possibility.
2) Allocate all of the pages that are not allocated at mmap time.
Since ext3 doesn't have space for an uninitialized bit, we'd have to
either (2a) force a disk write out for all of the newly initialized
pages, or (2b) keep track of the allocated disk blocks in memory, but
don't actually write the block mappings to the indirect blocks until
the blocks are actually written out. (This last might be just as
complex, alas).
3) Keep a global counter of sparse blocks which are mapped at mmap()
time, and update it as blocks are allocated, or when the region is
freed at munmap() time.
#3 might be much simpler, at the end of the day. Note that there are
some Japanese customers that really freaked with ext4 just because it
was *different*, and begged a distribution not to ship ext4 because it
might destabilize their customers. Not that I think we are obliged to
listen to some of the more extremely conservative customers, but there
was something nice about telling people (well, if you want something
which is nice and stable and conservative, you can pick ext3).
Do we really have legitimate and common workloads which are allocating
blocks by writing into an mmapped region? I wasn't aware of such
beasts, but maybe they are out there...
- Ted
On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> >
> > currently, when mmapped write is done to a file backed by ext3, the
> > filesystem does nothing to make sure blocks will be available when we need
> > to write them out.
>
> Hmm, you've done all of this work already, so this isn't the best time
> to suggest this, but I wonder if we've explored all of the
> alternatives that might allow for a less drastic set of changes to
> ext3, just out of stability's sake.
Yeah, I understand that and I've also been thinking for some time about
whether I could avoid implementing block reservation, but I haven't come up
with anything really acceptable. Moreover, unless we write via mmap to a
sparse file, the code paths taken change only a little (only in when and how
we account for allocated blocks)...
> How often do legitimate workloads mmap a sparse file then write into
> it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
> the end of the file; this I believe was lifted later on (at least I
> don't see it in SUSv3 spec).
Well, mmap beyond EOF is still undefined AFAIK (although Linux
traditionally supports it) but mmap of sparse files was always supposed
to work. My favorite user of sparse-file mmap is Berkeley DB, some torrent
clients do that as well and I believe there are others. So it's not the most
common thing but it happens often enough.
> If it's not all that common, then other options are:
>
> 1) Fail an mmap with EINVAL if there is an attempt to map a file
> region which is either sparse or extends beyond the end of a file.
> This is probably not a great alternative, but it's a possibility.
This is no-go IMHO. We would surely get lots of users complaining...
> 2) Allocate all of the pages that are not allocated at mmap time.
> Since ext3 doesn't have space for an uninitialized bit, we'd have to
> either (2a) force a disk write out for all of the newly initialized
> pages, or (2b) keep track of the allocated disk blocks in memory, but
> don't actually write the block mappings to the indirect blocks until
> the blocks are actually written out. (This last might be just as
> complex, alas).
Doing allocation at mmap time does not really work - on each mmap we
would have to map blocks for the whole file which would make mmap a really
expensive operation. Doing it at page-fault as you suggest in (2a) works
(that's the second plausible option IMO) but the increased fragmentation
and thus loss of performance is rather noticeable. I don't have current
numbers but when I tried that last year Berkeley DB was like two or three
times slower.
In your (2b) suggestion, I don't see how we would avoid leaking allocated
blocks when we crash before writing allocation to indirect block. Also the
fragmentation problem which seems to be the main source of performance
issues would stay the same.
> 3) Keep a global counter of sparse blocks which are mapped at mmap()
> time, and update it as blocks are allocated, or when the region is
> freed at munmap() time.
Here again I see the problem that mapping all file blocks at mmap time
is rather expensive and so does not seem viable to me. Also the
overestimation of needed blocks could be rather huge.
> #3 might be much simpler, at the end of the day. Note that there are
> some Japanese customers that really freaked with ext4 just because it
> was *different*, and begged a distribution not to ship ext4 because it
> might destablize their customers. Not that I think we are obliged to
> listen to some of the more extremely conservative customers, but there
> was something nice about telling people (well, if you want something
> which is nice and stable and conservative, you can pick ext3).
I'm aware of this. Actually, the user observable differences should be
rather minimal. The only one I'm aware of is that you can get SIGSEGV at
page fault time because the filesystem runs out of disk space (or out of
disk quota) which seems better than throwing away the data later. Also I
don't think anybody serious runs systems close to ENOSPC regularly and if
that happens accidentally, manual intervention is usually needed anyway...
Thanks for your ideas!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, 11 Oct 2010 16:28:13 +0200
Jan Kara <[email protected]> wrote:
> On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> > On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> > >
> > > currently, when mmapped write is done to a file backed by ext3, the
> > > filesystem does nothing to make sure blocks will be available when we need
> > > to write them out.
I thought we'd actually fixed this. I guess we didn't. I think what
we did do was to ensure that a subsequent fsync()/msync() would
reliably report the data loss (has anyone tested this in the past few
years??). This is something, but it's quite lame.
> > Hmm, you've done all of this work already, so this isn't the best time
> > to suggest this, but I wonder if we've explored all of the
> > alternatives that might allow for a less drastic set of changes to
> > ext3, just out of stability's sake.
> Yeah, I understand that and I've also been thinking for some time about
> whether I could avoid implementing block reservation, but I haven't come up
> with anything really acceptable. Moreover, unless we write via mmap to a
> sparse file, the code paths taken change only a little (only in when and how
> we account for allocated blocks)...
>
> > How often do legitimate workloads mmap a sparse file then write into
> > it? As I recall, the original POSIX.1 spec didn't allow mmap beyond
> > the end of the file; this I believe was lifted later on (at least I
> > don't see it in SUSv3 spec).
> Well, mmap beyond EOF is still undefined AFAIK (although Linux
> traditionally supports it) but mmap of sparse files was always supposed
> to work. My favorite user of sparse-file mmap is Berkeley DB, some torrent
> clients do that as well and I believe there are others. So it's not the most
> common thing but it happens often enough.
Yes, people do this. With a 64-bit address space they create a
gargantuan mmap of the entire database and just populate teeny bits of
it simply with CPU stores. They'd be unhappy if the kernel started
instantiating every block within the mmap()!
> > If it's not all that common, then other options are:
> >
> > 1) Fail an mmap with EINVAL if there is an attempt to map a file
> > region which is either sparse or extends beyond the end of a file.
> > This is probably not a great alternative, but it's a possibility.
> This is no-go IMHO. We would surely get lots of users complaining...
>
> > 2) Allocate all of the pages that are not allocated at mmap time.
> > Since ext3 doesn't have space for an uninitialized bit, we'd have to
> > either (2a) force a disk write out for all of the newly initialized
> > pages, or (2b) keep track of the allocated disk blocks in memory, but
> > don't actually write the block mappings to the indirect blocks until
> > the blocks are actually written out. (This last might be just as
> > complex, alas).
> Doing allocation at mmap time does not really work - on each mmap we
> > would have to map blocks for the whole file which would make mmap a really
> expensive operation. Doing it at page-fault as you suggest in (2a) works
> (that's the second plausible option IMO) but the increased fragmentation
> and thus loss of performance is rather noticeable. I don't have current
> numbers but when I tried that last year Berkeley DB was like two or three
> times slower.
ouch.
Can we fix the layout problem? Are reservation windows of no use here?
> In your (2b) suggestion, I don't see how we would avoid leaking allocated
> blocks when we crash before writing allocation to indirect block. Also the
> fragmentation problem which seems to be the main source of performance
> issues would stay the same.
>
> > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > time, and update it as blocks are allocated, or when the region is
> > freed at munmap() time.
> Here again I see the problem that mapping all file blocks at mmap time
> is rather expensive and so does not seem viable to me. Also the
> overestimation of needed blocks could be rather huge.
When I did ext2 delayed allocation back in, err, 2001 I had
considerable trouble working out how many blocks to actually reserve
for a file block, because it also had to reserve the indirect blocks.
One file block allocation can result in reserving four disk blocks!
And iirc it was not possible with existing in-core data structures to
work out whether all four blocks needed reserving until the actual
block allocation had occurred. So I ended up reserving the worst-case
number of indirects, based upon the file offset. If the disk ran out
of "space" I'd do a forced writeback to empty all the reservations and
would then take a look to see if the disk was _really_ out of space.
Is all of this an issue with this work? If so, what approach did you
take?
> > #3 might be much simpler, at the end of the day. Note that there are
> > some Japanese customers that really freaked with ext4 just because it
> > was *different*, and begged a distribution not to ship ext4 because it
> > might destabilize their customers. Not that I think we are obliged to
> > listen to some of the more extremely conservative customers, but there
> > was something nice about telling people (well, if you want something
> > which is nice and stable and conservative, you can pick ext3).
> I'm aware of this. Actually, the user observable differences should be
> rather minimal. The only one I'm aware of is that you can get SIGSEGV at
> page fault time because the filesystem runs out of disk space (or out of
> disk quota) which seems better than throwing away the data later. Also I
> don't think anybody serious runs systems close to ENOSPC regularly and if
> that happens accidentally, manual intervention is usually needed anyway...
Gee. I remember people having issues with forcing the SEGV at
pagefault time. It _is_ a behaviour change: the application might be
about to free up some disk space, so the msync() would have succeeded
anyway.
iirc another issue was that the standards (posix?) don't anticipate
getting a SEGV in response to ENOSPC. There might have been other
concerns - it's all foggy now.
Our general answer to this overall problem is: "run msync() and check
the result". That's a bit weaselly, but it's not a _bad_ answer.
After all, there might be an EIO as well! So a good application should
be checking for both ENOSPC and EIO. Your patches only address the
ENOSPC.
On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> On Mon, 11 Oct 2010 16:28:13 +0200
> Jan Kara <[email protected]> wrote:
>
> > On Sat 09-10-10 14:03:58, Ted Ts'o wrote:
> > > On Sat, Oct 09, 2010 at 02:12:24AM +0200, Jan Kara wrote:
> > > >
> > > > currently, when mmapped write is done to a file backed by ext3, the
> > > > filesystem does nothing to make sure blocks will be available when we need
> > > > to write them out.
>
> I thought we'd actually fixed this. I guess we didn't. I think what
> we did do was to ensure that a subsequent fsync()/msync() would
> reliably report the data loss (has anyone tested this in the past few
> years??). This is something, but it's quite lame.
Yes, that's what we do these days - we set a bit in the address_space in
generic_writepages() and the nearest syncing function (in fact the first
caller of filemap_fdatawait()) will get the error. It's kind of suboptimal
that if e.g. sys_sync() runs before you manage to call fsync(), you've just
lost the chance to see a possible error. So I agree the current interface is
lame (not that I would know better, at least for EIO handling)...
> > > 2) Allocate all of the pages that are not allocated at mmap time.
> > > Since ext3 doesn't have space for an uninitialized bit, we'd have to
> > > either (2a) force a disk write out for all of the newly initialized
> > > pages, or (2b) keep track of the allocated disk blocks in memory, but
> > > don't actually write the block mappings to the indirect blocks until
> > > the blocks are actually written out. (This last might be just as
> > > complex, alas).
> > Doing allocation at mmap time does not really work - on each mmap we
> > would have to map blocks for the whole file which would make mmap a really
> > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > (that's the second plausible option IMO) but the increased fragmentation
> > and thus loss of performance is rather noticeable. I don't have current
> > numbers but when I tried that last year Berkeley DB was like two or three
> > times slower.
>
> ouch.
>
> Can we fix the layout problem? Are reservation windows of no use here?
Reservation windows do not work for this load. The reason is that the
page-fault order is completely random, so we just spend time creating and
removing tiny reservation windows because the next page fault doing
allocation is scarcely ever close enough to fall into the small window.
The logic in ext3_find_goal() ends up picking blocks close together for
blocks belonging to the same indirect block if we are lucky but they
definitely won't be sequentially ordered. For Berkeley DB the situation is
made worse by the fact that there are several database files and their
blocks end up interleaved.
So we could improve the layout but we'd have to tweak the reservation
logic and allocator and it's not completely clear to me how.
One thing to note is that currently, ext3 *is* in fact doing delayed
allocation for writes via mmap. We just never called it that and never
bothered to do proper space estimation...
> > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > time, and update it as blocks are allocated, or when the region is
> > > freed at munmap() time.
> > Here again I see the problem that mapping all file blocks at mmap time
> > is rather expensive and so does not seem viable to me. Also the
> > overestimation of needed blocks could be rather huge.
>
> When I did ext2 delayed allocation back in, err, 2001 I had
> considerable trouble working out how many blocks to actually reserve
> for a file block, because it also had to reserve the indirect blocks.
> One file block allocation can result in reserving four disk blocks!
> And iirc it was not possible with existing in-core data structures to
> work out whether all four blocks needed reserving until the actual
> block allocation had occurred. So I ended up reserving the worst-case
> number of indirects, based upon the file offset. If the disk ran out
> of "space" I'd do a forced writeback to empty all the reservations and
> would then take a look to see if the disk was _really_ out of space.
>
> Is all of this an issue with this work? If so, what approach did you
> take?
Yeah, I've spotted exactly the same problem. How I decided to solve it in
the end is that in memory we keep track of each indirect block that has a
delay-allocated buffer under it. This allows us to reserve space for each
indirect block at most once (I didn't bother with making the accounting
precise for double or triple indirect blocks, so when I need to reserve
space for an indirect block, I reserve the whole path just to be sure). This
pushes the estimation error into a rather acceptable range for reasonably
common workloads - the error can still be 50% for workloads which use just
one data block in each indirect block, but even in this case the absolute
number of blocks falsely reserved is small.
The cost is of course increased complexity of the code, the memory
spent for tracking those indirect blocks (32 bytes per indirect block), and
some time for lookups in the RB-tree of the structures. At least the nice
thing is that when there are no delay-allocated blocks, there isn't any
overhead (the tree is empty).
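Schematically, the tracking could look like this (hypothetical names, just
to illustrate the idea; the real structures are defined in the patches):

#include <linux/rbtree.h>

/* One node per indirect block with delayed allocation pending, keyed
 * by the first logical block the indirect block covers. */
struct da_indirect {
	struct rb_node node;		/* lives in a per-inode RB-tree */
	sector_t first_block;		/* first logical block covered */
	unsigned int da_buffers;	/* delay-allocated buffers under it */
};

/* A miss means this is the first delay-allocated buffer under that
 * indirect block, so the caller reserves the whole path once and
 * inserts a new node. */
static struct da_indirect *da_lookup(struct rb_root *root, sector_t first_block)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct da_indirect *di = rb_entry(n, struct da_indirect, node);

		if (first_block < di->first_block)
			n = n->rb_left;
		else if (first_block > di->first_block)
			n = n->rb_right;
		else
			return di;	/* path already reserved */
	}
	return NULL;
}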
> > > #3 might be much simpler, at the end of the day. Note that there are
> > > some Japanese customers that really freaked with ext4 just because it
> > > was *different*, and begged a distribution not to ship ext4 because it
> > > might destabilize their customers. Not that I think we are obliged to
> > > listen to some of the more extremely conservative customers, but there
> > > was something nice about telling people (well, if you want something
> > > which is nice and stable and conservative, you can pick ext3).
> > I'm aware of this. Actually, the user observable differences should be
> > rather minimal. The only one I'm aware of is that you can get SIGSEGV at
> > page fault time because the filesystem runs out of disk space (or out of
> > disk quota) which seems better than throwing away the data later. Also I
> > don't think anybody serious runs systems close to ENOSPC regularly and if
> > that happens accidentally, manual intervention is usually needed anyway...
>
> Gee. I remember people having issues with forcing the SEGV at
> pagefault time. It _is_ a behaviour change: the application might be
> about to free up some disk space, so the msync() would have succeeded
> anyway.
>
> iirc another issue was that the standards (posix?) don't anticipate
> getting a SEGV in response to ENOSPC. There might have been other
> concerns - it's all foggy now.
>
> Our general answer to this overall problem is: "run msync() and check
> the result". That's a bit weaselly, but it's not a _bad_ answer.
> After all, there might be an EIO as well! So a good application should
> be checking for both ENOSPC and EIO. Your patches only address the
> ENOSPC.
Yes, here my main concern is that the patch set is not only about ENOSPC
(I can imagine we could live with that, as we have up to now) but also about
the quota problem. To reiterate - if the allocation happens during
writeback, we don't know who originally did the write and thus whether he
was allowed to exceed the quota limit or not. Currently, since flusher
threads run as root, we always ignore quota limits and thus a user can
write an arbitrary amount of data via mmap. Sysadmins don't like that...
BTW the same problem happens with checking the reserved space for root in
ext? filesystems.
I don't see a different solution than to check quotas at page fault time,
because that is the only moment when we know the identity of the writer, and
if the quota check fails we have to refuse the fault - SIGSEGV is the only
option I know about. And when I have to do all the reservation because of
quotas anyway, ENOSPC handling is a nice bonus.
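Schematically (illustration only - a hypothetical function, the real work
is done by the patches in this series; dquot_reserve_block() charges the
faulting task's quota context):

static int sketch_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;

	/* Charge the faulting user's quota for the block now... */
	if (dquot_reserve_block(inode, 1))
		return VM_FAULT_SIGBUS;	/* over quota: refuse the fault */
	/* ...reserve space for metadata, mark the buffer delayed, etc. */
	return 0;
}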
IMHO there are three separate questions:
a) Do we want to fix the quota problem?
- I'm convinced that yes.
b) Can we solve it without the behavior change of sending SIGSEGV on error?
- I don't see how but maybe you have some bright idea...
c) When we decide some reservation scheme is unavoidable, there is the
question of how to estimate the amount of indirect blocks. My scheme is one
possibility, but there is a wider variety of tradeoffs between complexity
and accuracy. A special low effort, low impact possibility here might be to
just ignore the ENOSPC problem as we did so far, reserve only quota for the
data block on page fault, and rely on the fact that there isn't going to be
that much metadata so the user cannot exceed his quota limit by too much...
But when we already have the interface change, it seems a bit stupid not to
fix it properly and also handle ENOSPC with it.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Oct 13, 2010 at 01:14:08AM +0200, Jan Kara wrote:
> c) When we decide some reservation scheme is unavoidable, there is the
> question of how to estimate the amount of indirect blocks. My scheme is one
> possibility, but there is a wider variety of tradeoffs between complexity
> and accuracy. A special low effort, low impact possibility here might be to
> just ignore the ENOSPC problem as we did so far, reserve only quota for the
> data block on page fault, and rely on the fact that there isn't going to be
> that much metadata so the user cannot exceed his quota limit by too much...
> But when we already have the interface change, it seems a bit stupid not to
> fix it properly and also handle ENOSPC with it.
We ultimately decided to do two different things for ENOSPC versus
EDQUOT in ext4. For quota overflow we just assume that the number of
metadata blocks won't be that many, and just allow them to go over
quota. For ENOSPC, we would force writeback to see if it would free
space, and ultimately we would drop out of delayed allocation mode
when we were close to running out of space (and for non-root users we
would depend on the 5% of blocks reserved for root).
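The "close to running out of space" check is, schematically, a comparison
of free blocks against the blocks already promised to delayed allocation
(a simplified sketch in the spirit of ext4_nonda_switch(); the real
thresholds and field names differ somewhat between versions):

/* Inside fs/ext4: fall back to non-delalloc writes when delalloc
 * reservations approach the remaining free space. */
static int close_to_full(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	s64 free_blocks =
		percpu_counter_read_positive(&sbi->s_freeblocks_counter);
	s64 dirty_blocks =	/* delalloc-reserved, not yet allocated */
		percpu_counter_read_positive(&sbi->s_dirtyblocks_counter);

	return 2 * free_blocks < 3 * dirty_blocks;
}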
Yeah, that means if a root application mmaps a huge 100GB sparse
region, and we only have 2GB free in the file system, and then the
application proceeds to write to all 100GB of the mmap'ed region, there's
a chance data might get silently lost when we drop out of delalloc
mode and we then really do completely run out of space. But really,
what are we supposed to do? Unless you have the kernel break out in
hysterical laughter and reject the mmap at allocation time, I suppose
the only other thing we could do, if silently dropping data is
unacceptable, is to send the SEGV early even though we might have
a few blocks left. That way the data loss isn't silent (the
application will probably drop core and die instead), so it's no
longer our problem. :-)
- Ted
---------- Forwarded message ----------
From: Amir Goldstein <[email protected]>
Date: Wed, Oct 13, 2010 at 10:44 AM
Subject: Re: [PATCH RFC 0/3] Block reservation for ext3
To: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>, Ted Ts'o
<[email protected]>, [email protected]
On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
>
> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <[email protected]> wrote:
> >
> > > Doing allocation at mmap time does not really work - on each mmap we
> > > would have to map blocks for the whole file which would make mmap really
> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
> > > (that's the second plausible option IMO) but the increased fragmentation
> > > and thus loss of performance is rather noticeable. I don't have current
> > > numbers but when I tried that last year Berkeley DB was like two or three
> > > times slower.
> >
> > ouch.
> >
> > Can we fix the layout problem? Are reservation windows of no use here?
> Reservation windows do not work for this load. The reason is that the
> page-fault order is completely random so we just spend time creating and
> removing tiny reservation windows because the next page fault doing
> allocation is scarcely close enough to fall into the small window.
> The logic in ext3_find_goal() ends up picking blocks close together for
> blocks belonging to the same indirect block if we are lucky but they
> definitely won't be sequentially ordered. For Berkeley DB the situation is
> made worse by the fact that there are several database files and their
> blocks end up interleaved.
> So we could improve the layout but we'd have to tweak the reservation
> logic and allocator and it's not completely clear to me how.
> One thing to note is that currently, ext3 *is* in fact doing delayed
> allocation for writes via mmap. We just never called it like that and never
> bothered to do proper space estimation...
>
> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
> > > > time, and update it as blocks are allocated, or when the region is
> > > > freed at munmap() time.
> > > Here again I see the problem that mapping all file blocks at mmap time
> > > is rather expensive and so does not seem viable to me. Also the
> > > overestimation of needed blocks could be rather huge.
> >
> > When I did ext2 delayed allocation back in, err, 2001 I had
> > considerable trouble working out how many blocks to actually reserve
> > for a file block, because it also had to reserve the indirect blocks.
> > One file block allocation can result in reserving four disk blocks!
> > And iirc it was not possible with existing in-core data structures to
> > work out whether all four blocks needed reserving until the actual
> > block allocation had occurred. So I ended up reserving the worst-case
> > number of indirects, based upon the file offset. If the disk ran out
> > of "space" I'd do a forced writeback to empty all the reservations and
> > would then take a look to see if the disk was _really_ out of space.
> >
> > Is all of this an issue with this work? If so, what approach did you
> > take?
> Yeah, I've spotted exactly the same problem. How I decided to solve it in
> the end is that in memory we keep track of each indirect block that has
> delay-allocated buffer under it. This allows us to reserve space for each
> indirect block at most once (I didn't bother with making the accounting
> precise for double or triple indirect blocks so when I need to reserve
> space for indirect block, I reserve the whole path just to be sure). This
> pushes the error in estimation to rather acceptable range for reasonably
> common workloads - the error can still be 50% for workloads which use just
> one data block in each indirect block but even in this case the absolute
> number of blocks falsely reserved is small.
> The cost is of course increased complexity of the code, the memory
> spent for tracking those indirect blocks (32 bytes per indirect block), and
> some time for lookups in the RB-tree of the structures. At least the nice
> thing is that when there are no delay-allocated blocks, there isn't any
> overhead (tree is empty).
>
How about allocating *only* the indirect blocks on page fault?
IMHO it seems like a fair mixture of high quota accuracy, low
complexity of the accounting code and low file fragmentation (only the
indirect blocks may be a bit further away from the data).
In my snapshot patches I use the @create arg to get_blocks_handle() to
pass commands such as "allocate only indirect blocks".
The patch is rather simple. I can prepare it for ext3 if you like.
Amir.
On Wed, Oct 13, 2010 at 10:49 AM, Amir G.
<[email protected]> wrote:
> On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
>>
>> On Mon 11-10-10 14:59:45, Andrew Morton wrote:
>> > On Mon, 11 Oct 2010 16:28:13 +0200 Jan Kara <[email protected]> wrote:
>> >
>> > > Doing allocation at mmap time does not really work - on each mmap we
>> > > would have to map blocks for the whole file which would make mmap really
>> > > expensive operation. Doing it at page-fault as you suggest in (2a) works
>> > > (that's the second plausible option IMO) but the increased fragmentation
>> > > and thus loss of performance is rather noticeable. I don't have current
>> > > numbers but when I tried that last year Berkeley DB was like two or three
>> > > times slower.
>> >
>> > ouch.
>> >
>> > Can we fix the layout problem? Are reservation windows of no use here?
>> Reservation windows do not work for this load. The reason is that the
>> page-fault order is completely random so we just spend time creating and
>> removing tiny reservation windows because the next page fault doing
>> allocation is scarcely close enough to fall into the small window.
>> The logic in ext3_find_goal() ends up picking blocks close together for
>> blocks belonging to the same indirect block if we are lucky but they
>> definitely won't be sequentially ordered. For Berkeley DB the situation is
>> made worse by the fact that there are several database files and their
>> blocks end up interleaved.
>> So we could improve the layout but we'd have to tweak the reservation
>> logic and allocator and it's not completely clear to me how.
>> One thing to note is that currently, ext3 *is* in fact doing delayed
>> allocation for writes via mmap. We just never called it like that and never
>> bothered to do proper space estimation...
>>
>> > > > 3) Keep a global counter of sparse blocks which are mapped at mmap()
>> > > > time, and update it as blocks are allocated, or when the region is
>> > > > freed at munmap() time.
>> > > Here again I see the problem that mapping all file blocks at mmap time
>> > > is rather expensive and so does not seem viable to me. Also the
>> > > overestimation of needed blocks could be rather huge.
>> >
>> > When I did ext2 delayed allocation back in, err, 2001 I had
>> > considerable trouble working out how many blocks to actually reserve
>> > for a file block, because it also had to reserve the indirect blocks.
>> > One file block allocation can result in reserving four disk blocks!
>> > And iirc it was not possible with existing in-core data structures to
>> > work out whether all four blocks needed reserving until the actual
>> > block allocation had occurred. So I ended up reserving the worst-case
>> > number of indirects, based upon the file offset. If the disk ran out
>> > of "space" I'd do a forced writeback to empty all the reservations and
>> > would then take a look to see if the disk was _really_ out of space.
>> >
>> > Is all of this an issue with this work? If so, what approach did you
>> > take?
>> Yeah, I've spotted exactly the same problem. How I decided to solve it in
>> the end is that in memory we keep track of each indirect block that has
>> delay-allocated buffer under it. This allows us to reserve space for each
>> indirect block at most once (I didn't bother with making the accounting
>> precise for double or triple indirect blocks so when I need to reserve
>> space for indirect block, I reserve the whole path just to be sure). This
>> pushes the error in estimation to rather acceptable range for reasonably
>> common workloads - the error can still be 50% for workloads which use just
>> one data block in each indirect block but even in this case the absolute
>> number of blocks falsely reserved is small.
>> The cost is of course increased complexity of the code, the memory
>> spent for tracking those indirect blocks (32 bytes per indirect block), and
>> some time for lookups in the RB-tree of the structures. At least the nice
>> thing is that when there are no delay-allocated blocks, there isn't any
>> overhead (tree is empty).
>>
>
> How about allocating *only* the indirect blocks on page fault.
> IMHO it seems like a fair mixture of high quota accuracy, low
> complexity of the accounting code and low file fragmentation (only
> indirect may be a bit further away from data).
>
> In my snapshot patches I use the @create arg to get_blocks_handle() to
> pass commands just like "allocate only indirect blocks".
> The patch is rather simple. I can prepare it for ext3 if you like.
>
> Amir.
>
Here is the indirect allocation patch.
The following debugfs dump shows the difference between a 1G file
allocated by dd (indirect interleaved with data) and by mmap (indirect
sequential before data).
(I did not test performance and I ignored SIGBUS at the end of mmap
file allocation)
debugfs: stat dd.1G
Inode: 15 Type: regular Mode: 0644 Flags: 0x0
Generation: 4035748347 Version: 0x00000000
User: 0 Group: 0 Size: 1073741824
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 2099208
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4cb5d1a9 -- Wed Oct 13 17:35:05 2010
atime: 0x4cb5d199 -- Wed Oct 13 17:34:49 2010
mtime: 0x4cb5d1a9 -- Wed Oct 13 17:35:05 2010
Size of extra inode fields: 4
BLOCKS:
(0-11):16384-16395, (IND):16396, (12-134):16397-16519,
(135-641):17037-17543, (642-1035):17792-18185, (DIND):18186,
(IND):18187, (1036-1279):18188-18431, (1280-1284):1311235-1311239,
(1285-2059):1334197-1334971, (IND):1334972,
(2060-3083):1334973-1335996, (IND):1335997,
(3084-4107):1335998-1337021, (IND):1337022,
(4108-5131):1337023-1338046, (IND):1338047,
(5132-6155):1338048-1339071, (IND):1339072,
(6156-7179):1339073-1340096, (IND):1340097,
(7180-8203):1340098-1341121, (IND):1341122,
(8204-9227):1341123-1342146, (IND):1342147,
(9228-10251):1342148-1343171, (IND):1343172,
(10252-10566):1343173-1343487, (10567-11275):1344008-1344716,
(IND):1344717, (11276-12299):1344718-1345741, (IND):1345742,
(12300-13323):1345743-1346766, (IND):1346767,
(13324-14347):1346768-1347791, (IND):1347792,
(14348-15371):1347793-1348816, (IND):1348817,
(15372-16395):1348818-1349841, (IND):1349842,
(16396-17419):1349843-1350866, (IND):
...
debugfs:
debugfs: stat mmap.1G
Inode: 14 Type: regular Mode: 0644 Flags: 0x0
Generation: 1442185090 Version: 0x00000000
User: 0 Group: 0 Size: 1073741824
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 1968016
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x4cb5d044 -- Wed Oct 13 17:29:08 2010
atime: 0x4cb5bf81 -- Wed Oct 13 16:17:37 2010
mtime: 0x4cb5d025 -- Wed Oct 13 17:28:37 2010
Size of extra inode fields: 4
BLOCKS:
(DIND):14336, (IND):14337, (16384-16395):14360-14371, (IND):14338,
(16396-16481):14372-14457, (16482-17153):14600-15271,
(17154-17419):16520-16785, (IND):14339, (17420-17670):16786-17036,
(17671-17675):1081859-1081863, (17676-18443):1086089-1086856,
(IND):14340, (18444-19467):1086857-1087880, (IND):14341,
(19468-20491):1087881-1088904, (IND):14342,
(20492-21515):1088905-1089928, (IND):14343,
(21516-22539):1089929-1090952, (IND):14344,
(22540-23563):1090953-1091976, (IND):14345,
(23564-24587):1091977-1093000, (IND):14346,
(24588-25611):1093001-1094024, (IND):14347,
(25612-26635):1094025-1095048, (IND):14348,
(26636-27659):1095049-1096072, (IND):14349,
(27660-28683):1096073-1097096, (IND):14472,
(28684-29707):1097097-1098120, (IND):15496,
(29708-30731):1098121-1099144, (IND):17544,
(30732-31755):1099145-1100168, (IND):17545,
(31756-32779):1100169-1101192, (IND):17546,
(32780-33803):1101193-1102216, (IND):17547,
...
debugfs:
Allocate file indirect blocks on page_mkwrite().
This is a sample patch to be merged with Jan Kara's ext3 delayed allocation
patches. Some of the code was taken from Jan's patches for testing only.
This patch is for kernel 2.6.35.6.
On page_mkwrite(), we allocate indirect blocks if needed and leave the data
blocks unallocated. Jan's patches take care of reserving space for the data.
Signed-off-by: Amir Goldstein <[email protected]>
--------------------------------------------------------------------------------
diff -Nuarp a/fs/buffer.c b/fs/buffer.c
--- a/fs/buffer.c 2010-10-13 17:42:06.472252298 +0200
+++ b/fs/buffer.c 2010-10-13 17:42:14.772244244 +0200
@@ -1687,8 +1687,9 @@ static int __block_write_full_page(struc
if (buffer_new(bh)) {
/* blockdev mappings never come here */
clear_buffer_new(bh);
- unmap_underlying_metadata(bh->b_bdev,
- bh->b_blocknr);
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
+ bh->b_blocknr);
}
}
bh = bh->b_this_page;
@@ -1873,7 +1874,8 @@ static int __block_prepare_write(struct
if (err)
break;
if (buffer_new(bh)) {
- unmap_underlying_metadata(bh->b_bdev,
+ if (buffer_mapped(bh))
+ unmap_underlying_metadata(bh->b_bdev,
bh->b_blocknr);
if (PageUptodate(page)) {
clear_buffer_new(bh);
@@ -2592,7 +2594,7 @@ int nobh_write_begin_newtrunc(struct fil
goto failed;
if (!buffer_mapped(bh))
is_mapped_to_disk = 0;
- if (buffer_new(bh))
+ if (buffer_new(bh) && buffer_mapped(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
if (PageUptodate(page)) {
set_buffer_uptodate(bh);
diff -Nuarp a/fs/ext3/file.c b/fs/ext3/file.c
--- a/fs/ext3/file.c 2010-10-13 17:41:52.962541253 +0200
+++ b/fs/ext3/file.c 2010-10-13 17:42:25.083163556 +0200
@@ -52,6 +52,23 @@ static int ext3_release_file (struct ino
return 0;
}
+static const struct vm_operations_struct ext3_file_vm_ops = {
+ .fault = filemap_fault,
+ .page_mkwrite = ext3_page_mkwrite,
+};
+
+static int ext3_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct address_space *mapping = file->f_mapping;
+
+ if (!mapping->a_ops->readpage)
+ return -ENOEXEC;
+ file_accessed(file);
+ vma->vm_ops = &ext3_file_vm_ops;
+ vma->vm_flags |= VM_CAN_NONLINEAR;
+ return 0;
+}
+
const struct file_operations ext3_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
@@ -62,7 +79,7 @@ const struct file_operations ext3_file_o
#ifdef CONFIG_COMPAT
.compat_ioctl = ext3_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext3_file_mmap,
.open = dquot_file_open,
.release = ext3_release_file,
.fsync = ext3_sync_file,
diff -Nuarp a/fs/ext3/inode.c b/fs/ext3/inode.c
--- a/fs/ext3/inode.c 2010-10-13 17:41:42.722231144 +0200
+++ b/fs/ext3/inode.c 2010-10-13 17:41:12.973180756 +0200
@@ -38,6 +38,7 @@
#include <linux/bio.h>
#include <linux/fiemap.h>
#include <linux/namei.h>
+#include <linux/mount.h>
#include "xattr.h"
#include "acl.h"
@@ -562,10 +563,17 @@ static int ext3_alloc_blocks(handle_t *h
count--;
}
- if (count > 0)
+ if (index == indirect_blks)
break;
}
+ if (blks == 0) {
+ /* blks == 0 when allocating only indirect blocks */
+ new_blocks[index] = 0;
+ *err = 0;
+ return 0;
+ }
+
/* save the new block number for the first direct block */
new_blocks[index] = current_block;
@@ -676,7 +684,9 @@ failed:
for (i = 0; i <indirect_blks; i++)
ext3_free_blocks(handle, inode, new_blocks[i], 1);
- ext3_free_blocks(handle, inode, new_blocks[i], num);
+ if (num > 0)
+ /* num == 0 when allocating only indirect blocks */
+ ext3_free_blocks(handle, inode, new_blocks[i], num);
return err;
}
@@ -735,7 +745,8 @@ static int ext3_splice_branch(handle_t *
* in i_block_alloc_info, to assist find the proper goal block for next
* allocation
*/
- if (block_i) {
+ if (block_i && blks > 0) {
+ /* blks == 0 when allocating only indirect blocks */
block_i->last_alloc_logical_block = block + blks - 1;
block_i->last_alloc_physical_block =
le32_to_cpu(where[num].key) + blks - 1;
@@ -778,7 +789,9 @@ err_out:
ext3_journal_forget(handle, where[i].bh);
ext3_free_blocks(handle,inode,le32_to_cpu(where[i-1].key),1);
}
- ext3_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks);
+ if (blks > 0)
+ /* blks == 0 when allocating only indirect blocks */
+ ext3_free_blocks(handle, inode, le32_to_cpu(where[num].key), blks);
return err;
}
@@ -905,6 +918,11 @@ int ext3_get_blocks_handle(handle_t *han
/* the number of blocks need to allocate for [d,t]indirect blocks */
indirect_blks = (chain + depth) - partial - 1;
+ if (indirect_blks + maxblocks == 0) {
+ /* maxblocks == 0 when allocating only indirect blocks */
+ mutex_unlock(&ei->truncate_mutex);
+ goto cleanup;
+ }
/*
* Next look up the indirect map to count the totoal number of
@@ -929,7 +947,8 @@ int ext3_get_blocks_handle(handle_t *han
err = ext3_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
- if (err)
+ if (err || count == 0)
+ /* count == 0 when allocating only indirect blocks */
goto cleanup;
set_buffer_new(bh_result);
@@ -981,6 +1000,9 @@ static int ext3_get_block(struct inode *
started = 1;
}
+ if (create < 0)
+ /* create < 0 when allocating only indirect blocks */
+ max_blocks = 0;
ret = ext3_get_blocks_handle(handle, inode, iblock,
max_blocks, bh_result, create);
if (ret > 0) {
@@ -1827,6 +1849,43 @@ out:
}
/*
+ * Reserve block writes instead of allocation. Called only on buffer heads
+ * attached to a page (and thus for 1 block).
+ */
+static int ext3_da_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ int ret;
+
+ /* Buffer has already blocks reserved? */
+ if (buffer_delay(bh))
+ return 0;
+
+ /* passing -1 to allocate only indirect blocks */
+ ret = ext3_get_block(inode, iblock, bh, -1);
+ if (ret < 0)
+ return ret;
+ if (ret > 0 || !create)
+ return 0;
+ set_buffer_delay(bh);
+ set_buffer_new(bh);
+ return 0;
+}
+
+int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ int retry = 0;
+ int ret;
+ struct super_block *sb = vma->vm_file->f_path.mnt->mnt_sb;
+
+ do {
+ ret = block_page_mkwrite(vma, vmf, ext3_da_get_block);
+ } while (ret == VM_FAULT_SIGBUS &&
+ ext3_should_retry_alloc(sb, &retry));
+ return ret;
+}
+
+/*
* Pages can be marked dirty completely asynchronously from ext3's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
* much here because ->set_page_dirty is called under VFS locks. The page is
diff -Nuarp a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h 2010-10-13 17:49:26.892439258 +0200
+++ b/include/linux/ext3_fs.h 2010-10-13 17:49:10.662493115 +0200
@@ -909,6 +909,7 @@ extern void ext3_get_inode_flags(struct
extern void ext3_set_aops(struct inode *inode);
extern int ext3_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len);
+extern int ext3_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
/* ioctl.c */
extern long ext3_ioctl(struct file *, unsigned int, unsigned long);
On Wed 13-10-10 18:14:08, Amir G. wrote:
> On Wed, Oct 13, 2010 at 10:49 AM, Amir G.
> <[email protected]> wrote:
> > On Wed, Oct 13, 2010 at 1:14 AM, Jan Kara <[email protected]> wrote:
> >> > When I did ext2 delayed allocation back in, err, 2001 I had
> >> > considerable trouble working out how many blocks to actually reserve
> >> > for a file block, because it also had to reserve the indirect blocks.
> >> > One file block allocation can result in reserving four disk blocks!
> >> > And iirc it was not possible with existing in-core data structures to
> >> > work out whether all four blocks needed reserving until the actual
> >> > block allocation had occurred. So I ended up reserving the worst-case
> >> > number of indirects, based upon the file offset. If the disk ran out
> >> > of "space" I'd do a forced writeback to empty all the reservations and
> >> > would then take a look to see if the disk was _really_ out of space.
> >> >
> >> > Is all of this an issue with this work? If so, what approach did you
> >> > take?
> >> Yeah, I've spotted exactly the same problem. How I decided to solve it in
> >> the end is that in memory we keep track of each indirect block that has
> >> delay-allocated buffer under it. This allows us to reserve space for each
> >> indirect block at most once (I didn't bother with making the accounting
> >> precise for double or triple indirect blocks so when I need to reserve
> >> space for indirect block, I reserve the whole path just to be sure). This
> >> pushes the error in estimation to rather acceptable range for reasonably
> >> common workloads - the error can still be 50% for workloads which use just
> >> one data block in each indirect block but even in this case the absolute
> >> number of blocks falsely reserved is small.
> >> The cost is of course increased complexity of the code, the memory
> >> spent for tracking those indirect blocks (32 bytes per indirect block), and
> >> some time for lookups in the RB-tree of the structures. At least the nice
> >> thing is that when there are no delay-allocated blocks, there isn't any
> >> overhead (tree is empty).
> >>
> >
> > How about allocating *only* the indirect blocks on page fault.
> > IMHO it seems like a fair mixture of high quota accuracy, low
> > complexity of the accounting code and low file fragmentation (only
> > indirect may be a bit further away from data).
> >
> > In my snapshot patches I use the @create arg to get_blocks_handle() to
> > pass commands just like "allocate only indirect blocks".
> > The patch is rather simple. I can prepare it for ext3 if you like.
>
> Here is the indirect allocation patch.
> The following debugfs dump shows the difference between a 1G file
> allocated by dd (indirect interleaved with data) and by mmap (indirect
> sequential before data).
> (I did not test performance and I ignored SIGBUS at the end of mmap
> file allocation)
...
>
> Allocate file indirect blocks on page_mkwrite().
>
> This is a sample patch to be merged with Jan Kara's ext3 delayed allocation
> patches. Some of the code was taken from Jan's patches for testing only.
> This patch is for kernel 2.6.35.6.
>
> On page_mkwrite(), we allocate indirect blocks if needed and leave the data
> blocks unallocated. Jan's patches take care of reserving space for the data.
Thanks for the idea and the patch.
Yes, this is one of the trade-off options. But it's not that simple.
There's a problem with truncate coming after page_mkwrite but before
the allocation happens. See:
ext3_page_mkwrite() for page index 70
  -> allocates the indirect block (or just sees that it's allocated
     and does nothing)
  -> marks the buffer as delayed
...
truncate to index 80
  -> sees the indirect block has no more blocks allocated and removes it
ext3_writepage() for index 70
  -> would like to allocate a block for index 70 but the indirect block
     does not exist. Bugger.
So you have to somehow track that an indirect block has some delayed
allocation pending - and that is the most complex part of my patch. The
rest is rather simple...
Actually, there are also other ways to track that an indirect block has
delayed allocation pending. For example, if we could do a disk format
change, we could simply reserve, say, block number 1 or -1 to indicate a
delayed allocated block and everything would be much simpler. But I don't
think we really can (it would be an incompatible disk format change).
Hmm, using the radix tree dirty tag could also work for that -- when the
range covered by an indirect block has some dirty pages, we know that we
shouldn't delete it because it's going to be used soon. But it's subtle,
because we rely on the fact that the radix tree dirty tag is cleared only in
set_page_writeback(), which is after the get_block() call, while the page
dirty flag is already cleared before the writepage() (and thus get_block())
call -- I originally discarded this idea because I forgot about this tag
handling subtlety.
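Such a range check could be built from the existing pagevec API, e.g.
(a sketch; start/end are assumed to be the page range covered by the
indirect block):

#include <linux/pagevec.h>

/* Does any page in [start, end] still carry the radix tree dirty tag?
 * If so, delayed allocation is pending and the indirect block must
 * not be freed by truncate. */
static int range_has_dirty_pages(struct address_space *mapping,
				 pgoff_t start, pgoff_t end)
{
	struct pagevec pvec;
	pgoff_t index = start;
	int found = 0;

	pagevec_init(&pvec, 0);
	if (pagevec_lookup_tag(&pvec, mapping, &index,
			       PAGECACHE_TAG_DIRTY, 1)) {
		if (pvec.pages[0]->index <= end)
			found = 1;
		pagevec_release(&pvec);
	}
	return found;
}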
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR