2011-02-25 02:56:48

by Theodore Ts'o

Subject: Proposed design for big allocation blocks for ext4


I've had problems creating new pages on the ext4 wiki, so I haven't been
able to post it there. I'm waiting for the kernel.org admins to fix it,
but until they do, I'll just send it to the ext4 list for comments.

- Ted


= Problem statement =

If the vast majority of the files are large, and the system is running
under tight memory pressure, such that the block allocation bitmaps
and buddy bitmaps are frequently getting evicted from memory, and if
the filesystem's free space is moderately fragmented, then file
allocations and unlinks may require reading multiple block bitmaps
into memory. This can be a surprisingly large amount of overhead,
since reading the various block bitmaps is a synchronous operation
that tends to involve a large number of seeks.

One way of solving this problem is to use a larger block size than 4k
--- say, perhaps 256k or 1 megabyte. By reducing the number of blocks
needed to allocate the same file, and at the same time increasing the
amount of storage that can be covered by a block allocation bitmap, we
can address this problem, at the cost of increased internal
fragmentation. However, Linux has a fundamental assumption that the
page size is >= the file system block size.

So the goal is to design a system that allows us to get the benefits
of larger block sizes without the complexities caused by large block
sizes.

= Constraints =

We want to keep the changes to the ext4 code base small and easily
testable. Ideally, no changes to the core kernel should be required.
What follows from this is a requirement that the fundamental block
size (as far as the VM is concerned) must remain at 4k.

= Design =

The solution to this problem is to use an increased allocation size as
far as the block allocation bitmaps are concerned. However, the size
of the allocation bitmaps, and the granularity of blocks as far as the
extent tree blocks are concerned, are still based on the original
maximum 4k block size.

We take the field previously used for defining the fragment size in
the ext4 superblock, s_log_frag_size, and redefine it to be
s_log_alloc_size. This defines a new size, the allocation blocksize.
The allocation blocksize is only enabled if a new INCOMPAT feature
flag, EXT4_FEATURE_INCOMPAT_BIGALLOC, is set, and it must be greater
than or equal to the block size.

The allocation blocksize controls the allocation granularity of a bit
in the block bitmap. However, the maximum size of a block bitmap
remains a single block. Hence, for a file system with a 4k block
size and a 1 megabyte allocation blocksize, there can be a maximum of
32,768 (4k * 8) allocation blocks, and so a single block group can now
cover a maximum of 32GiB.
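
To make that arithmetic concrete, here is a minimal standalone sketch (my
own illustration, not ext4 code; I'm assuming s_log_alloc_size follows the
same "1024 << log" convention as s_log_block_size, which the proposal does
not spell out):

#include <stdio.h>

int main(void)
{
    unsigned int s_log_block_size = 2;   /* 4 KiB blocks: 1024 << 2  */
    unsigned int s_log_alloc_size = 10;  /* 1 MiB allocation blocks: 1024 << 10 */

    unsigned long block_size = 1024UL << s_log_block_size;
    unsigned long alloc_size = 1024UL << s_log_alloc_size;

    /* A block bitmap is still a single block: block_size * 8 bits,
     * but now one bit per allocation block. */
    unsigned long allocs_per_group = block_size * 8;
    unsigned long long group_bytes =
        (unsigned long long)allocs_per_group * alloc_size;

    printf("allocation blocks per group: %lu\n", allocs_per_group); /* 32768 */
    printf("block group covers: %llu GiB\n", group_bytes >> 30);    /* 32 */
    return 0;
}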

To minimize changes to ext4, we will constrain an allocation block such
that it must contain only fixed file system metadata (i.e., block
bitmaps, inode bitmaps, and inode table) or data blocks belonging to a
single inode. Furthermore, if an allocation block is to be used for
data blocks for an inode, it must be used contiguously starting at a
file offset which is a multiple of the allocation blocksize. The first
block of the allocation block must be mapped in the extent tree (or
indirect block), so that ext4_map_blocks() can easily find (and
continue using) a partially used allocation block.
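
As an illustration of that rule, the lookup path only needs to round the
requested logical block down to an allocation-block boundary and check
whether the first block of that range is already mapped. The following is
a self-contained sketch with made-up helper names, not the actual
ext4_map_blocks() code:

#include <stdio.h>
#include <stdbool.h>

#define BLOCKS_PER_ALLOC 256    /* 1 MiB allocation block / 4 KiB blocks */

/* Hypothetical stand-in for the extent-tree lookup: returns true if the
 * first block of the allocation block is already mapped, and reports its
 * physical block number via *pblk. */
static bool lookup_first_block(unsigned long first_lblk, unsigned long *pblk)
{
    if (first_lblk == 0) {      /* pretend lblk 0..255 map to 1000000.. */
        *pblk = 1000000;
        return true;
    }
    return false;
}

static unsigned long map_block(unsigned long lblk)
{
    unsigned long first = lblk - (lblk % BLOCKS_PER_ALLOC);
    unsigned long first_pblk;

    if (lookup_first_block(first, &first_pblk))
        /* allocation block already partially used: reuse it, keeping
         * the same offset within the allocation block */
        return first_pblk + (lblk - first);

    /* otherwise a whole new allocation block would be allocated here */
    return 0;
}

int main(void)
{
    printf("lblk 37  -> pblk %lu\n", map_block(37));   /* 1000037 */
    printf("lblk 300 -> pblk %lu (new allocation block needed)\n",
           map_block(300));
    return 0;
}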

Because we are not changing the definition of a block, the only
changes that need to be made are at the points where blocks are
allocated to an inode (or to file system metadata). This is good,
because it means the bulk of ext4 does not need to be changed.


= Kernel Changes required =

1) Globally throughout ext4: uses of EXT4_BLOCKS_PER_GROUP() need to
be audited to see if they should be EXT4_BLOCKS_PER_GROUP() or
EXT4_ALLOC_BLOCKS_PER_GROUP(). (An illustrative sketch of the
distinction follows after this list.)

2) ext4_map_blocks() and its downstream functions need to be changed so
that they understand the new allocation rules, and in particular
understand that before allocating a new block, they need to see if a
partially allocated block has already been allocated, and can be used
to fulfill the current allocation.

3) mballoc.c will need little or no changes, other than the
EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
in (1).
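
Here is the sketch referred to in item (1): an illustrative definition of
the two macros and the conversions the audited call sites would have to
choose between. This is not the real ext4 code (the real macros take a
superblock argument); constants are hard-coded for a 4k block size and a
1 MiB allocation blocksize:

#include <stdio.h>

#define BLOCK_SIZE          4096UL
#define ALLOC_BLOCK_SIZE    (1024UL * 1024)                 /* 1 MiB */
#define BLOCKS_PER_ALLOC    (ALLOC_BLOCK_SIZE / BLOCK_SIZE) /* 256 */

#define EXT4_ALLOC_BLOCKS_PER_GROUP  (BLOCK_SIZE * 8)       /* bits in one bitmap block */
#define EXT4_BLOCKS_PER_GROUP        (EXT4_ALLOC_BLOCKS_PER_GROUP * BLOCKS_PER_ALLOC)

/* conversions a call site needs once it knows which unit it wants */
#define BLOCK_TO_ALLOC(b)   ((b) / BLOCKS_PER_ALLOC)
#define ALLOC_TO_BLOCK(a)   ((a) * BLOCKS_PER_ALLOC)

int main(void)
{
    printf("allocation blocks per group: %lu\n", EXT4_ALLOC_BLOCKS_PER_GROUP); /* 32768 */
    printf("4k blocks per group:         %lu\n", EXT4_BLOCKS_PER_GROUP);       /* 8388608 */
    printf("4k block 300000 is in allocation block %lu of its group\n",
           BLOCK_TO_ALLOC(300000UL) % EXT4_ALLOC_BLOCKS_PER_GROUP);
    return 0;
}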

= E2fsprogs changes required =

# Changes to dumpe2fs, debugfs, etc., to understand the new
superblock field.

# Changes to mke2fs to understand how to allocate the inode bitmap,
block bitmap, and inode table metadata blocks.

# Changes to e2fsck pass 1 and passes 1b/c/d to understand that a
single allocation block can be shared across multiple file system
metadata blocks or a file's data blocks. (But the fact that an
allocation block can only be used contiguously vis-a-vis an inode's
logical block space will make these changes relatively
straightforward.)

= Upsides =

Block allocation and deallocation for large files should be much
faster, since a block bitmap will cover much more space, and it is
much less likely that a single file system operation will need to
touch multiple block bitmaps.

Using a larger allocation blocksize will also reduce external
fragmentation. By reducing free space fragmentation, seeking between
data blocks when files are written (and later read) will be greatly
reduced.

Directory blocks will be contiguous, which will speed up situations
where a directory is very, very large.

= Downsides =

Internal fragmentation will be expensive for small files. So this is
only useful for file systems where most files are large, or where the
file system performance is more important than the losses caused by
internal fragmentation.
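
As a rough illustration of that cost (my arithmetic, not from the
proposal): with a 1 MiB allocation blocksize, a 4 KiB file still consumes
a full allocation block on disk.

#include <stdio.h>

int main(void)
{
    unsigned long alloc_size = 1024UL * 1024;   /* 1 MiB allocation block */
    unsigned long file_size  = 4096;            /* a small 4 KiB file */

    /* space consumed on disk is rounded up to a whole allocation block */
    unsigned long allocated =
        ((file_size + alloc_size - 1) / alloc_size) * alloc_size;

    printf("on-disk: %lu bytes, wasted: %lu bytes (%.1f%% efficient)\n",
           allocated, allocated - file_size,
           100.0 * file_size / allocated);      /* ~0.4% efficient */
    return 0;
}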

Directories will also be allocated in chunks of the allocation block
size. If this is especially large (such as 1 MiB), and there are a
large number of directories, this could be quite expensive.
Applications which use multi-level directory schemes to keep
directories small, to optimize for ext2's very slow large directory
performance, could be especially vulnerable.



2011-02-25 08:21:59

by Andreas Dilger

Subject: Re: Proposed design for big allocation blocks for ext4

On 2011-02-24, at 7:56 PM, Theodore Ts'o wrote:
> = Problem statement =
>
> If the vast majority of the files are large, and the system is running
> under tight memory pressure, such that the block allocation bitmaps
> and buddy bitmaps are frequently getting evicted from the system, if
> the filesystem's free space is moderately fragmented, file allocations
> and unlinks may require reading multiple block bitmaps into memory,
> and this can be a surprisingly large amount of overhead, since the
> reading the various block bitmaps is a synchronous operation that
> tends to involve a large number of seeks.
>
> One way of solving this problem is to use a large block size than 4k
> --- say, perhaps 256k or 1 megabyte. By reducing the number of blocks
> needed to allocate the same file, and at the same time increasing the
> amount of storage that can be covered by a block allocation bitmap, we
> can address this problem, at the cost of increased internal
> fragmentation. However Linux has a fundamental assumption that the
> page size is >= the file system block size.
>
> So the goal is to design a system that allows us to get the benefits
> of larger block sizes without the complexities caused by large block
> sizes.

Hmm, interesting. This could be useful for Lustre installations, since the majority of installations are using large files, and a 1MB allocation blocksize would align very nicely with the 1MB Lustre RPC size...

I guess one minor drawback of this implementation is that it isn't quite identical to the case where there actually IS a larger blocksize (whether due to a larger PAGE_SIZE, or kernel support in software for blocksize > PAGE_SIZE).

I suppose the changes that would be needed to emulate real large blocks to map multiple pages to the same block number (with different offsets within the block) would just cause the VM and block layer to explode. page->index would still be in units of PAGE_SIZE, but bh->b_blocknr would be the same for two consecutive pages without major block layer surgery.

> = Constraints =
>
> We want to keep the changes to the ext4 code base small and easily
> testable. Ideally, no changes to the core kernel should be required.
> What follows from this is a requirement that the fundamental block
> size (as far as the VM is concerned) must remain at 4k.
>
> = Design =
>
> The solution to this problem is to use an increased allocation size as
> far as the block allocaiton bitmaps are concerned. However, the size
> of allocation bitmaps, and the granularity of blocks as far as the the
> extent tree blocks are concerned, are still based on the original
> maximum 4k block size.
>
> We assume the fields previously used for defining the fragment size in
> the ext4 superblock: s_log_frag_size and redefine it to be
> s_log_alloc_size. This defines a size, the allocation blocksize. The
> allocation blocksize is only enabled if a new INCOMPAT feature is
> defined, EXT4_FEATURE_INCOMPAT_BIGALLOC, and it must be greater than
> or equal to the block size.

Fortunately, "s_log_alloc_size" currently contains the blocksize, so this should be completely compatible with current filesystems, in that we can change the allocation code to always use s_log_alloc_size regardless of whether INCOMPAT_BIGALLOC is set or not.

> The allocation blocksize controls the allocation granularity of a bit
> in the block bitmap. However, the maximum size of a block bitmap
> remains a single block size. Hence, for a file system with a 4k block
> size and a 1 megabyte allocation blocksize, there can be a maximum of
> 32,768 (4k * 8) allocation blocks, and so a single block group can now
> cover a maximum of 32GiB.

Hmm, so does this change impact the 32-bit block number limit? It seems like it wouldn't, since the underlying blocks are still addressed using 4kB blocksize. The only difference is that each bit in the block bitmap covers, say, 256 * 4kB blocks and that is the minimum granularity of allocations.

I guess one would need to be careful with the inode ratio, since the number of inodes per group is still only 32k or less. This implies that the allocation blocksize should not be larger than the inode ratio (i.e. the average file size) otherwise there will not be enough inodes to allocate 1 allocation block per inode.

> To minimize changes to ext4, we will constrain an allocation block such
> that it must contain only fixed file system metadata (i.e., block
> bitmaps, inode bitmaps, and inode table) or data blocks belonging to a
> single inode. Furthermore, if allocation block is to be used for data
> blocks for an inode, it must be used contiguously starting at a file
> offset which is a multiple of the allocation blocksize. The first
> block of the allocation block must be mapped in the extent tree (or
> indirect block), so that the ext4_map_blocks() can easily find (and
> continue using) partially used allocation block.

For files with holes, does this mean that e.g. a sparse file with 4 data ranges of 256kB each, in a filesystem with 1MB allocation blocks would use 4*1MB of space in the filesystem, or could the sparse ranges be allocated from within a single 1MB allocation block? The easier and more straight forward implementation would imply 4*1MB blocks are consumed (i.e. identical to the case where there are really 1MB blocks), even though it would waste more space.

> Because we are not changing the definition of a block, the only
> changes that need to be made are at the intersection of allocating to
> an inode (or to file system metadata). This is good, because it means
> the bulk of ext4 does not need to be changed
>
>
> = Kernel Changes required =
>
> 1) Globally throughout ext4: uses of EXT4_BLOCKS_PER_GROUP() need to
> be audited to see if they should be EXT4_BLOCKS_PER_GROUP() or
> EXT4_ALLOC_BLOCKS_PER_GROUP().
>
> 2) ext4_map_blocks() and its downstream functions need to be changed so
> that they understand the new allocation rules, and in particular
> understand that before allocating a new block, they need to see if a
> partially allocated block has already been allocated, and can be used
> to fulfill the current allocation.
>
> 3) mballoc.c will need little or no changes, other than the
> EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
> in (1).

I'm particularly interested in what happens with directories. Some data structures like ext2_dirent_2.rec_len can only handle a blocksize of 256kB, even with the hack to use the low 2 bits of rec_len to store the MSB of the actual length.

Would directories be affected at all, since the underlying blocksize is still 4kB?

> = E2fsprogs changes required =
>
> # Changes to dumpe2fs, debugfs, etc., to understand the new
> superblock field.
>
> # Changes to mke2fs to understand how to allocate the inode, block,
> and inode table metadata blocks.

Presumably, it would always make sense to consume an entire allocation block with metadata... Inodes in the itable would fill up the whole allocation block, or the rest of the space would be wasted.

Would there be a requirement that the flex_bg factor is >= the allocation blocksize, so that e.g. block bitmaps fill an entire allocation block, or will it be possible to mix & match metadata within an allocation block?

> # Changes to e2fsck pass1 and pass 1b/c/d to understand that a single
> allocation block can be shared across multiple file system metadata
> blocks or a file's data blocks. (But the fact that allocation block
> can be used contiguously vis-a-vis an inode's logical block space will
> make these changes relatively straightforwarwd.)
>
> = Upsides =
>
> Block allocation and deallocation for large files should be much
> faster, since a block bitmap will cover much more space, and it is
> much less likely that a single file system operation will need to
> touch multiple block bitmaps.
>
> Using a larger allocation blocksize will also reduce external
> fragmentation. By reducing the free space fragmentation, when files
> are written (and later read) seeking between data blocks will be
> greatly reduced.

In essence, this is mostly just an optimization of the mballoc code to be able to allocate large chunks more quickly (i.e. equivalent to being able to get/set tens or hundreds of bits at a time). I don't think it is actually fundamentally different than just changing the mballoc code to actually get/set these tens/hundreds of bits at one time, or doing some kind of "bitmap compression" whereby it loads the block bitmap into memory and only saves the top levels of the buddy bitmap to reduce the memory used by some power-of-two factor. That could all be done in software without changing the on-disk format.

The only significant difference in the end is that it saves on reading/writing the block bitmaps, though if they are stored compressed in memory and contiguous on disk with flex_bg that may not even be a significant difference.

I think the fundamental flaws of the current mballoc code will still be present, in that it does linear scanning of the groups in order to find free space. I think we'd be better off to improve mballoc to have a more holistic approach to allocation. For example, having a top-level buddy-bitmap or tree that shows where the most free space is in a filesystem would be a big win in reducing search times for free blocks.

> Directory blocks will be contiguous, which will speed up situations
> where a directory is very, very large.

This is no different than having a tuneable to always preallocate the directory blocks. IMHO, the main factor affecting directory performance today is not the IO of reading the directory blocks, but rather the random access order of the inodes vs. readdir order (very few apps do pure readdir() operations).

> = Downsides =
>
> Internal fragmentation will be expensive for small files. So this is
> only useful for file systems where most files are large, or where the
> file system performance is more important than the losses caused by
> internal fragmentation.

One interesting counter proposal would be to leave the on-disk format the same, use flex_bg == allocation size to reduce the seek overhead, and only store in memory the top levels of the block bitmaps - for _most_ groups but not necessarily all groups. For example, with an allocation blocksize of 1MB, the block/buddy bitmap would store in memory 1 bit per 256 blocks on disk, so most flex_bgs (= 256 groups) could conveniently store the whole block bitmap in 4kB of RAM and read/write those bitmaps in a single seek/IO.

Allocations from these groups would be at a 1MB granularity, so there is no difference in fragmentation from your proposal. Files allocated in this way would have the EXT4_EOFBLOCKS_FL set, and all of the blocks up to the 1MB boundary would actually be allocated by the file. It would require set/clear of 256 bits == 32 bytes at a time on disk, so the shadow copy of modified block bitmaps in the journal would be generated on-the-fly from the compressed in-memory bitmap. Optionally, only the modified disk block bitmaps would need to be written to the journal, to avoid unnecessary IO expansion though this may not be a net win on a RAID system that needs to do stripe-aligned IO anyway.
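
As a purely illustrative sketch of the expansion step described above
(nothing like this exists in ext4 today): one compressed in-memory bit,
covering a 256-block slab, is expanded into the corresponding 32-byte run
of the full-resolution on-disk bitmap, e.g. when generating the journal
shadow copy:

#include <stdio.h>
#include <string.h>

#define BLOCKS_PER_SLAB 256                     /* 1 MiB slab / 4 KiB blocks */
#define SLAB_BYTES      (BLOCKS_PER_SLAB / 8)   /* 32 bytes of on-disk bitmap */

/* Expand one compressed in-memory bit into the full-resolution on-disk
 * block bitmap buffer. */
static void expand_slab_bit(unsigned char *disk_bitmap, unsigned int slab,
                            int in_use)
{
    memset(disk_bitmap + slab * SLAB_BYTES, in_use ? 0xff : 0x00, SLAB_BYTES);
}

int main(void)
{
    unsigned char bitmap[4096] = { 0 };   /* one 4 KiB block bitmap */

    expand_slab_bit(bitmap, 3, 1);        /* mark slab 3 (blocks 768..1023) in use */
    printf("byte 96 = 0x%02x, byte 127 = 0x%02x, byte 128 = 0x%02x\n",
           bitmap[96], bitmap[127], bitmap[128]);   /* 0xff 0xff 0x00 */
    return 0;
}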

Essentially, the groups would be split up into "1MB allocation slabs" and "4kB allocation slabs", and/or possibly sizes in-between. It would be possible to store the "slab size" in the group descriptor for each group, and possibly migrate between different slab sizes using online defragmentation to move unaligned files. If only compliant ext4 code writes to such a filesystem, then there is no chance of unaligned/fragmented allocations in the slabs. It would even be possible to migrate an existing filesystem to using this scheme (possibly using e4defrag to empty out some groups), if there was enough free space.

A major upside of this is that some groups could store the full-resolution bitmap in memory and still do 4kB-granular allocations, which would be great for small files, directories, and index blocks. We might want to segregate the file data blocks into separate slabs/groups from the index/directory blocks, to speed up e2fsck by keeping the metadata dense and separate from the data. This would also be ideal for having hybrid SSD/HDD layouts, with metadata and/or small files on RAID-1 SSD slabs, and large file data on RAID-6 HDD slabs.

This is the same as the VM implementation of hugepages, which segregates chunks of memory for large or small, and persistent vs. ephemeral memory allocations. AFAIK, that has shown to work pretty well over time.

This would need to be at worst a ROCOMPAT feature, since all that is needed is to store the "slab size" for new allocations in the group descriptor (possibly only a few bits in the bg_flags field, in log-log format: 0=1*blocksize, 1=16*blocksize, 2=256*blocksize, 3=(1 << s_log_groups_per_flex) * blocksize).

A simple forward compatible extension for "maint" kernels to be able to r/w mount a filesystem with this ROCOMPAT feature would be to clear the slab size in a group descriptor when allocating new blocks there, since freeing blocks will not affect the slab size. No other visible difference would exist in the on-disk layout.

> Directories will also be allocated in chucks of the allocation block
> size. If this is especially large (such as 1 MiB), and there are a
> large number of directories, this could be quite expensive.
> Applications which use multi-level directory schemes to keep
> directories small to optimize for ext2's very slow large directory
> performance could be especially vulnerable.

Cheers, Andreas






2011-02-25 09:16:05

by Rogier Wolff

Subject: Re: Proposed design for big allocation blocks for ext4


Hi,

I must say I haven't read all of the large amounts of text in this
discussion.

But what I understand is that you're suggesting that we implement
larger blocksizes on the device, while we have to maintain towards the
rest of the kernel that the blocksize is no larger than 4k, because
the kernel can't handle that.

Part of the reasoning why this should be like this comes from the
assumption that each block group has just one block worth of bitmap.
That is IMHO the "outdated" assumption that needs to go.

Then, especially on filesystems where many large files live, we can
emulate the "larger blocksize" at the filesystem level: We always
allocate 256 blocks in one go! This is something that can be
dynamically adjusted: You might stop doing this for the last 10% of
free disk space.

Now, you might say: How does this help with the performance problems
mentioned in the introduction? Well, reading 16 block bitmaps from 16
block groups will cost a modern harddrive on average 16 * (7ms avg
seek + 4.1ms avg rot latency + 0.04ms transfer time), or about 170 ms.

Reading 16 block bitmaps from ONE block group will cost a modern
harddrive on average: 7ms avg seek + 4.1ms rot + 16*0.04ms xfer =
11.2ms. That is an improvement of a factor of over 15...

Now, whenever you allocate blocks for a file, just zap 256 bits at
once! Again the overhead of handling 255 more bits in memory is
trivial.

I now see that Andreas already suggested something similar but still
different.

Anyway: Advantages that I see:

- the performance benefits sought for.

- a more sensible number of block groups on filesystems. (my 3T
filesystem has 21000 block groups!)

- the option of storing lots of small files without having to make
a fs-creation-time choice.

- the option of improving defrag to "make things perfect". (allocation
strategy may be: big files go in big-files-only block groups and
their tails go in small-files-only block groups. Or if you think
big files may grow, tails go in big-files-only block groups. Whatever
you choose, defrag may clean up a fragpoint and/or some unallocated
space when, after a while, it's clear that a big file will no longer
grow, and is just an archive).

Roger.


On Fri, Feb 25, 2011 at 01:21:58AM -0700, Andreas Dilger wrote:
> On 2011-02-24, at 7:56 PM, Theodore Ts'o wrote:
> > = Problem statement =

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-02-25 10:01:10

by Andreas Dilger

Subject: Re: Proposed design for big allocation blocks for ext4

On 2011-02-25, at 2:15 AM, Rogier Wolff wrote:
> I must say I haven't read all of the large amounts of text in this
> discussion.

We don't write it to be read, just for fun :-).

> But what I understand is that you're suggesting that we implement
> larger blocksizes on the device, while we have to maintain towards the
> rest of the kernel that the blocksize is no larger than 4k, because
> the kernel can't handle that.
>
> Part of reasoning why this should be like this comes from the
> assumption that each block group has just one block worth of bitmap.
> That is IMHO the "outdated" assumption that needs to go.

What you are suggesting is a feature called "flex_bg", and already is
implemented in ext4, which is why I referenced it in my email.

> Then, especially on filesystems where many large files live, we can
> emulate the "larger blocksize" at the filesystem level: We always
> allocate 256 blocks in one go! This is something that can be
> dynamically adjusted: You might stop doing this for the last 10% of
> free disk space.

That's exactly what I wrote.

> Now, you might say: How does this help with the performance problems
> mentioned in the introduction? Well. reading 16 block bitmaps from 16
> block groups will cost a modern harddrive on average 16 * (7ms avg
> seek + 4.1ms avg rot latency + 0.04ms transfer time), or about 170 ms.

That is the time to load bitmaps in a non-flex_bg filesystem, which is
the default for ext3-formatted filesystems.

> Reading 16 block bitmaps from ONE block group will cost a modern
> harddrive on average: 7ms avg seek + 4.1ms rot + 16*0.04ms xfer =
> 11.2ms. That is an improvement of a factor of over 15...

That is possible with flex_bg and a flex_bg factor of 16. That said,
I don't think the kernel explicitly fetches all 16 bitmaps today,
though it may have the benefit of a track cache on the disk. I think
the correct number above is actually 11.8ms, not 11.2ms.

In comparison, Ted's proposal would have an average access time of

7ms avg seek + 4.1ms rot + 0.04ms xfer = 11.14ms

which is not a significant savings.

> Now, whenever you allocate blocks for a file, just zap 256 bits at
> once! Again the overhead of handling 255 more bits in memory is
> trivial.
>
> I now see that andreas already suggested something similar but still
> different.

I'm not quite sure how your proposal is different, once you understand
what a flex_bg is.

> Anyway: Advantages that I see:
>
> - the performance benefits sougth for.
>
> - a more sensible number of block groups on filesystems. (my 3T
> filessytem has 21000 block groups!)
>
> - the option of storing lots of small files without having to make
> a fs-creation-time choice.
>
> - the option of improving defrag to "make things perfect". (allocation
> strategy may be: big files go in big-files-only block groups and
> their tails go in small-files-only block groups. Or if you think
> big files may grow, tails go in big-files-only block groups. Whatever
> you chose, defrag may clean up a fragpoint and or some unallocated
> space when after a while it's clear that a big file will no longer
> grow, and is just an archive).
>
> Roger.


Cheers, Andreas






2011-02-25 10:39:31

by Rogier Wolff

Subject: Re: Proposed design for big allocation blocks for ext4


On Fri, Feb 25, 2011 at 03:01:08AM -0700, Andreas Dilger wrote:

.... I was babbling....

> That is the time to load bitmaps in a non-flex_bg filesystem, which is
> the default for ext3-formatted filesystems.

OK! Already done!

Then I don't really understand what the problem is, so better ignore
me. :-)

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-02-25 12:57:09

by Theodore Ts'o

Subject: Re: Proposed design for big allocation blocks for ext4


On Feb 25, 2011, at 3:21 AM, Andreas Dilger wrote:

> I guess one would need to be careful with the inode ratio, since the number of inodes per group is still only 32k or less. This implies that the allocation blocksize should not be larger than the inode ratio (i.e. the average file size) otherwise there will not be enough inodes to allocate 1 allocation block per inode.

Yes. You would only want to do this on large data disks, where the average file size is large. For example, in the target application I have in mind, the average file size might be 8 or 16 megabytes, and then I might use a 1M allocation block size.

>
>> To minimize changes to ext4, we will constrain an allocation block such
>> that it must contain only fixed file system metadata (i.e., block
>> bitmaps, inode bitmaps, and inode table) or data blocks belonging to a
>> single inode. Furthermore, if allocation block is to be used for data
>> blocks for an inode, it must be used contiguously starting at a file
>> offset which is a multiple of the allocation blocksize. The first
>> block of the allocation block must be mapped in the extent tree (or
>> indirect block), so that the ext4_map_blocks() can easily find (and
>> continue using) partially used allocation block.
>
> For files with holes, does this mean that e.g. a sparse file with 4 data ranges of 256kB each, in a filesystem with 1MB allocation blocks would use 4*1MB of space in the filesystem, or could the sparse ranges be allocated from within a single 1MB allocation block? The easier and more straight forward implementation would imply 4*1MB blocks are consumed (i.e. identical to the case where there are really 1MB blocks), even though it would waste more space.

It is identical to the case where there are really 1MB blocks. That's why I said, "allocation blocks must be used contiguously starting at a file offset which is a multiple of the allocation block size". This allows ext4_map_blocks to find a previously used allocation block by simply looking up what block was used at the beginning of the allocation block range.

>
> I'm particularly interested in what happens with directories? Some data structures like ext2_dirent_2.rec_len can only handle a blocksize of 256kB, even with the hack to use the low 2 bits of rec_len to store the MSB of the actual length.
>
> Would directories be affected at all, since the underlying blocksize is still 4kB?

Directories don't change at all.

>
>> = E2fsprogs changes required =
>>
>> # Changes to dumpe2fs, debugfs, etc., to understand the new
>> superblock field.
>>
>> # Changes to mke2fs to understand how to allocate the inode, block,
>> and inode table metadata blocks.
>
> Presumably, it would always make sense to consume an entire allocation block with metadata... Inodes in the itable would fill up the whole allocation block, or the rest of the space would be wasted.
>
> Would there be a requirement that the flex_bg factor is >= the allocation blocksize, so that e.g. block bitmaps fill an entire allocation block, or will it be possible to mix & match metadata within an allocation block?

There will be no such *requirement*, although certain fs parameter tunings will make more sense than others.

One of the high order drivers of this design is that as we get bigger and bigger disks (i.e., the progression from 1T, 2T, 3T, 4T, etc. disks), we are effectively losing seek capacity because the bigger disks mean we have fewer spindles at a given capacity level, even as the seek time stays more or less constant. So if we waste 2-5% more space on metadata blocks because all of the metadata blocks in a flex_bg don't fit inside an allocation block, I'm not going to shed massive amounts of tears.

> In essence, this is mostly just an optimization of the mballoc code to be able to allocate large chunks more quickly (i.e. equivalent to being able to get/set tens or hundreds of bits at a time). I don't think it is actually fundamentally different than just changing the mballoc code to actually get/set these tens/hundreds of bits at one time, or doing some kind of "bitmap compression" whereby it loads the block bitmap into memory and only saves the top levels of the buddy bitmap to reduce the memory used by some power-of-two factor. That could all be done in software without changing the on-disk format.

I actually initially got started on an approach where we did a "bitmap compression" scheme, so I've been down this path already. The approach I was trying was to use an rbtree of free extents, but that doesn't make much difference to the fundamental problem which I saw, and that is that you still have to update the block allocation bitmaps in memory. And if you are running in a memory constrained environment (either because you're a cheap... err, financially constrained NAS vendor, or because you're running in tight memory containers because you are trying to pack hundreds of jobs onto a single machine), the block bitmaps are getting constantly ejected from memory.

Furthermore, if you have 16 2T disks, the block bitmaps alone will consume a gigabyte of memory, so it's really not feasible to pin that much memory just to avoid the disk seeks to read/write the block bitmaps. And of course the buddy bitmaps would consume another gigabyte, but even if you were to discard the buddy bitmaps after each allocation and recalculate them from the pinned block bitmaps, pinning a gigabyte of memory isn't really acceptable.

On the other hand, if you change the meaning of each bit in the block bitmap to mean 1M instead of 4k, the gigabyte of memory turns into 4MB (for the same 16 x 2T disks), which is much more reasonable. It's more likely to stay cached in memory, and it's much more tractable to consider pinning that in memory. And the buddy bitmaps also become much more efficient, since they would also only consume 4MB if fully populated in memory.
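
A quick back-of-the-envelope check of those numbers (my arithmetic,
assuming 2 TiB per disk, 4 KiB blocks, and 1 MiB allocation blocks):

#include <stdio.h>

int main(void)
{
    unsigned long long disk_bytes = 2ULL << 40;   /* 2 TiB per disk */
    int ndisks = 16;

    unsigned long long blocks   = disk_bytes / 4096;          /* 4 KiB blocks */
    unsigned long long bm_4k    = blocks / 8;                 /* bitmap bytes at 1 bit per 4k */
    unsigned long long clusters = disk_bytes / (1024 * 1024); /* 1 MiB allocation blocks */
    unsigned long long bm_1m    = clusters / 8;               /* bitmap bytes at 1 bit per 1M */

    printf("per disk: %llu MiB of 4k-granular bitmaps vs %llu KiB of 1M-granular bitmaps\n",
           bm_4k >> 20, bm_1m >> 10);                         /* 64 MiB vs 256 KiB */
    printf("16 disks: %llu MiB vs %llu MiB\n",
           (bm_4k * ndisks) >> 20, (bm_1m * ndisks) >> 20);   /* 1024 MiB vs 4 MiB */
    return 0;
}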

> I think the fundamental flaws of the current mballoc code will still be present, in that it does linear scanning of the groups in order to find free space. I think we'd be better off to improve mballoc to have a more holistic approach to allocation. For example, having a top-level buddy-bitmap or tree that shows where the most free space is in a filesystem would be a big win in reducing search times for free blocks.

We are already storing the largest free contiguous power of two free extent in each block group, which really helps with the linear scanning of groups, since it means we don't have to read in a block bitmap just to find that while it has a large amount of free space, it's not contiguous enough for our needs.

Of course, once the file system's free space gets fragmented enough that _no_ block group has 1M of contiguous free space, trying to allocate space for an 8M file will involve a huge number of seeks as we read in each block bitmap, trying to get 256k of free space here, and another 256k of free space there, etc.

> One interesting counter proposal would be to leave the on-disk format the same, use flex_bg == allocation size to reduce the seek overhead, and only store in memory the top levels of the block bitmaps - for _most_ groups but not necessarily all groups. For example, with an allocation blocksize of 1MB, the block/buddy bitmap would store in memory 1 bit per 256 blocks on disk, so most flex_bgs (= 256 groups) could conveniently store the whole block bitmap in 4kB of RAM and read/write those bitmaps in a single seek/IO.
>
> Allocations from these groups would be at a 1MB granularity, so there is no difference in fragmentation from your proposal. Files allocated in this way would have the EXT4_EOFBLOCKS_FL set, and all of the blocks up to the 1MB boundary would actually be allocated by the file. It would require set/clear of 256 bits == 32 bytes at a time on disk, so the shadow copy of modified block bitmaps in the journal would be generated on-the-fly from the compressed in-memory bitmap. Optionally, only the modified disk block bitmaps would need to be written to the journal, to avoid unnecessary IO expansion though this may not be a net win on a RAID system that needs to do stripe-aligned IO anyway.

No difference in file fragmentation, but without heavy modification to mballoc.c, the buddy bitmaps will still chew a huge amount of space. And forgive me, but having tried to modify mballoc.c, it needs a lot of cleanup before it's easy to modify. One of the reasons why I like my proposal is I don't have to touch mballoc.c much; there are lots of side effects in different functions, and the on-disk block bitmaps are assumed to be present and modified deep inside various call chains --- I don't think it would be that easy to make the sort of changes you are proposing.

So the main advantages of your proposal are that it gets you read-only compatibility instead of incompat, and the ability to more flexibly store small files/directories.

The disadvantages are that it doesn't address the memory inefficiency issue, and you still have to seek to update the block bitmaps; one of the extensions I have in mind for our no-journal mode is to simply pin the block bitmaps in memory and only update them at umount time. This is *not* trivial to do in mballoc.c today by just keeping the "high level" mballoc pages in memory, because mballoc assumes that you always have access to the on-disk bitmap, and there are times when the mballoc buddy bitmaps and the on-disk buddy bitmaps are allowed to be out of sync to provide locking guarantees. It's not so simple to just wave a magic wand and say, "we will only update the buddy bitmaps!", because of this implicit assumption in parts of mballoc.c that the on-disk buddy bitmap is available to it.

The other big disadvantage is that the coding complexity of what you propose is much more significant, and at this point, I have a very strong bias towards KISS.

Still, if you want to clean up and make the mballoc layer more functional, it's not going to collide with the allocation blocksize work. The two solutions aren't really mutually exclusive, but I suspect I'll be able to get there much more quickly, and with a much lower memory overhead.

Thoughts, comments?

-- Ted

2011-02-25 18:05:44

by Amir Goldstein

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 2:57 PM, Theodore Tso <[email protected]> wrote:
>
> Thoughts, comments?
>

I like your design, very KISS indeed.
I am just wondering why BIGALLOC should be INCOMPAT and not RO_COMPAT?
After all, an ro mount doesn't allocate, and RO_COMPAT features are so much nicer...

So maybe by forcing a flex_bg to align with the bigalloc group, and at the
cost of "wasting" 1 bigalloc block on-disk for the flex/bigalloc block
bitmap, the on-disk layout can stay the same and only the internal format
of the on-disk bitmap will change. And I can't see how an ro mount can
possibly be affected by the content of the block bitmaps.

Amir.

2011-02-25 19:04:40

by Theodore Ts'o

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 08:05:43PM +0200, Amir Goldstein wrote:
>
> I like your design. very KISS indeed.
> I am just wondering why should BIGALLOC be INCOMPAT and not RO_COMPAT?
> After all, ro mount doesn't allocate and RO_COMPAT features are so muc
> nicer...

I can try to make it be RO_COMPAT, but one thing my design changes is
that a block group will contain 32768 allocation blocks; so assuming
4k blocks, instead of a block group containing a maximum of 32,768 4k
blocks comprising 128 MB, a block group would now contain 32,768 1M
blocks, or 32 GiB, or 8,388,608 4k blocks.

I'm pretty sure that existing kernels have superblock sanity checks
that will barf if they see this. Still, yeah, I can try allocating
this as a ROCOMPAT feature, and later on, if people really care, they
can patch older kernels so they won't freak out when they see a
BigAlloc file system and can thus successfully mount it read-only.

(Right now existing kernels will complain when s_blocks_per_group is
greater than blocksize*8.)
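
(For reference, a standalone illustration of the kind of bound being
referred to; this paraphrases the mount-time check rather than quoting the
actual ext4_fill_super() code: the block bitmap for a group must fit in
one block, so an old kernel rejects any superblock claiming more than
blocksize*8 blocks per group.)

#include <stdio.h>

static int blocks_per_group_sane(unsigned long blocks_per_group,
                                 unsigned long blocksize)
{
    return blocks_per_group <= blocksize * 8;
}

int main(void)
{
    /* bigalloc group as proposed: 32768 x 1 MiB = 8,388,608 4k blocks */
    printf("old kernel would %s such a group descriptor\n",
           blocks_per_group_sane(8388608, 4096) ? "accept" : "reject");
    return 0;
}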

- Ted

2011-02-25 19:39:40

by Andreas Dilger

Subject: Re: Proposed design for big allocation blocks for ext4

On 2011-02-25, at 12:04 PM, Ted Ts'o wrote:
> On Fri, Feb 25, 2011 at 08:05:43PM +0200, Amir Goldstein wrote:
>>
>> I like your design. very KISS indeed.
>> I am just wondering why should BIGALLOC be INCOMPAT and not RO_COMPAT?
>> After all, ro mount doesn't allocate and RO_COMPAT features are so muc
>> nicer...
>
> I can try to make it be RO_COMPAT, but one thing my design changes is
> that a block group will contain 32768 allocation blocks; so assuming a
> 4k blocks, instead of a block group containing a maximum of 32,768 4k
> blocks comprising 128 MB, a block group would now contain 32,768 1M
> blocks, or 32 GiB, or 8,388,608 4k blocks.
>
> I'm pretty sure that existing kernels have superblock sanity checks
> that will barf if they see this. Still, yeah, I can try allocating
> this as a ROCOMPAT feature, and later on, if people really care, they
> can patch older kernels so they won't freak out when they see a
> BigAlloc file system and can thus successfully mount it read-only.
>
> (Right now existing kernels will complain when s_blocks_per_group is
> greater than blocksize*8.)

Hmm, if we stuck with a flex_bg factor G >= the allocation blocksize (2^G * blocksize), then it would appear that the main difference is (2^G - 1) unused block bitmaps per flex_bg, and the single used block bitmap per group would be compressed 2^G:1 (i.e. it wouldn't represent a valid block bitmap to unaware kernels). This would be OK for ROCOMPAT, like Amir wrote.

I guess the main difference is that there would still be 2^G more group descriptors to read/scan/write, though they would all be contiguous on disk.

To convert away from this feature on an old kernel would mean expanding each bit in flex_bg bitmap[0] by a factor 2^G and clearing the ROCOMPAT flag.

Cheers, Andreas






2011-02-25 21:24:34

by Amir Goldstein

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 9:04 PM, Ted Ts'o <[email protected]> wrote:
> On Fri, Feb 25, 2011 at 08:05:43PM +0200, Amir Goldstein wrote:
>>
>> I like your design. very KISS indeed.
>> I am just wondering why should BIGALLOC be INCOMPAT and not RO_COMPAT?
>> After all, ro mount doesn't allocate and RO_COMPAT features are so muc
>> nicer...
>
> I can try to make it be RO_COMPAT, but one thing my design changes is
> that a block group will contain 32768 allocation blocks; so assuming a
> 4k blocks, instead of a block group containing a maximum of 32,768 4k
> blocks comprising 128 MB, a block group would now contain 32,768 1M
> blocks, or 32 GiB, or 8,388,608 4k blocks.
>
> I'm pretty sure that existing kernels have superblock sanity checks
> that will barf if they see this. Still, yeah, I can try allocating
> this as a ROCOMPAT feature, and later on, if people really care, they
> can patch older kernels so they won't freak out when they see a
> BigAlloc file system and can thus successfully mount it read-only.
>
> (Right now existing kernels will complain when s_blocks_per_group is
> greater than blocksize*8.)
>

no problem. just rename s_blocks_per_group to s_bigblocks_per_group
to be compatible with old kernels.

2011-02-25 21:59:30

by Joel Becker

Subject: Re: Proposed design for big allocation blocks for ext4

On Thu, Feb 24, 2011 at 09:56:46PM -0500, Theodore Ts'o wrote:
> The solution to this problem is to use an increased allocation size as
> far as the block allocaiton bitmaps are concerned. However, the size
> of allocation bitmaps, and the granularity of blocks as far as the the
> extent tree blocks are concerned, are still based on the original
> maximum 4k block size.

Why not call it a 'cluster' like the rest of us do? The term
'blocksize' is overloaded enough already.

> Because we are not changing the definition of a block, the only
> changes that need to be made are at the intersection of allocating to
> an inode (or to file system metadata). This is good, because it means
> the bulk of ext4 does not need to be changed
>
>
> = Kernel Changes required =
>
> 1) Globally throughout ext4: uses of EXT4_BLOCKS_PER_GROUP() need to
> be audited to see if they should be EXT4_BLOCKS_PER_GROUP() or
> EXT4_ALLOC_BLOCKS_PER_GROUP().
>
> 2) ext4_map_blocks() and its downstream functions need to be changed so
> that they understand the new allocation rules, and in particular
> understand that before allocating a new block, they need to see if a
> partially allocated block has already been allocated, and can be used
> to fulfill the current allocation.
>
> 3) mballoc.c will need little or no changes, other than the
> EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
> in (1).

Be careful in your zeroing. A new allocation block might have
pages at its front that are not part of the write() or mmap(). You'll
either need to keep track that they are uninitialized, or you will have
to zero them in write_begin() (ocfs2 does the latter). We've had quite
a few tricky bugs in this area, because the standard pagecache code
handles the pages covered by the write, but the filesystem has to handle
the new pages outside the write.
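
A standalone sketch of the zeroing being described (illustrative only, not
ocfs2 or ext4 code; the page size and write offset are made up): when a
write lands partway into a freshly allocated allocation block, the pages
of that block in front of the written range have to be zeroed (or tracked
as uninitialized), since the generic pagecache path only touches the pages
covered by the write itself:

#include <stdio.h>
#include <string.h>

#define ALLOC_BLOCK_SIZE    (1024 * 1024)   /* 1 MiB allocation block */
#define FS_PAGE_SIZE        4096

/* Zero the pages of a newly allocated allocation block that precede the
 * page actually being written (what a write_begin() hook would do). */
static void zero_alloc_block_head(unsigned char *ablock, size_t write_off)
{
    size_t first_written_page = write_off - (write_off % FS_PAGE_SIZE);

    memset(ablock, 0, first_written_page);
}

int main(void)
{
    static unsigned char ablock[ALLOC_BLOCK_SIZE];

    memset(ablock, 0xaa, sizeof(ablock));   /* stale on-disk contents */
    zero_alloc_block_head(ablock, 300000);  /* write starts ~293 KiB in */

    printf("byte 0 = 0x%02x, byte 300000 = 0x%02x\n",
           ablock[0], ablock[300000]);      /* 0x00, 0xaa */
    return 0;
}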

> = Downsides =
>
> Internal fragmentation will be expensive for small files. So this is
> only useful for file systems where most files are large, or where the
> file system performance is more important than the losses caused by
> internal fragmentation.

It's a huge win for anything needing large files, like database
files or VM images. mkfs.ocfs2 has a vmimage mode just for this ;-)
Even with good allocation code and proper extents, a long-lived
filesystem with 4K clusters just gets fragmented. This leads to later
files being very discontiguous, which are slow to do I/O to. I think
this is much more important than the simple speed-of-allocation win.

> Directories will also be allocated in chucks of the allocation block
> size. If this is especially large (such as 1 MiB), and there are a
> large number of directories, this could be quite expensive.
> Applications which use multi-level directory schemes to keep
> directories small to optimize for ext2's very slow large directory
> performance could be especially vulnerable.

Anecdotal evidence suggests that directories often benefit with
clusters of 8-16K size, but suffer greatly after 128K for precisely the
reasons you describe. We usually don't recommend clusters greater than
32K for filesystems that aren't expressly for large things.

Joel

--

"I don't want to achieve immortality through my work; I want to
achieve immortality through not dying."
- Woody Allen

http://www.jlbec.org/
[email protected]

2011-02-25 23:40:07

by Theodore Ts'o

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 01:59:25PM -0800, Joel Becker wrote:
>
> Why not call it a 'cluster' like the rest of us do? The term
> 'blocksize' is overloaded enough already.

Yes, good point. Allocation cluster makes a lot more sense as a name.

> > 3) mballoc.c will need little or no changes, other than the
> > EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
> > in (1).
>
> Be careful in your zeroing. A new allocation block might have
> pages at its front that are not part of the write() or mmap(). You'll
> either need to keep track that they are uninitialized, or you will have
> to zero them in write_begin() (ocfs2 does the latter). We've had quite
> a few tricky bugs in this area, because the standard pagecache code
> handles the pags covered by the write, but the filesystem has to handle
> the new pages outside the write.

We're going to keep track of what blocks are uninitialized or not on a
4k basis. So that part of the ext4 code doesn't change.

That being said, one of my primary design mantras for ext4 is, "we're
not going to optimize for sparse files". They should work for
correctness sake, but if the file system isn't at its most performant
in the case of sparse files, I'm not going to shed any tears.

> It's a huge win for anything needing large files, like database
> files or VM images. mkfs.ocfs2 has a vmimage mode just for this ;-)
> Even with good allocation code and proper extents, a long-lived
> filesystem with 4K clusters just gets fragmented. This leads to later
> files being very discontiguous, which are slow to I/O to. I think this
> is much more important than the simple speed-of-allocation win.

Yes, very true.

> > Directories will also be allocated in chucks of the allocation block
> > size. If this is especially large (such as 1 MiB), and there are a
> > large number of directories, this could be quite expensive.
> > Applications which use multi-level directory schemes to keep
> > directories small to optimize for ext2's very slow large directory
> > performance could be especially vulnerable.
>
> Anecdotal evidence suggests that directories often benefit with
> clusters of 8-16K size, but suffer greatly after 128K for precisely the
> reasons you describe. We usually don't recommend clusters greater than
> 32K for filesystems that aren't expressly for large things.

Yes. I'm going to assume that file systems optimized for large files
are (in general) not going to have lots of directories, and even if
they do, chewing up a megabyte for a directory isn't that big of a
deal if you're talking about a 2-4TB disk.

We could add complexity to do suballocations for directories, but KISS
seems to be a much better idea for now.

- Ted

2011-02-26 00:03:15

by Joel Becker

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 06:40:02PM -0500, Ted Ts'o wrote:
> On Fri, Feb 25, 2011 at 01:59:25PM -0800, Joel Becker wrote:
> >
> > Why not call it a 'cluster' like the rest of us do? The term
> > 'blocksize' is overloaded enough already.
>
> Yes, good point. Allocation cluster makes a lot more sense as a name.

Thank you ;-)

> We're going to keep track of what blocks are uninitialized or not on a
> 4k basis. So that part of the ext4 code doesn't change.

Ok, good. We don't have that info, so we enjoy a lot of fun
with the various pagesize/blocksize/clustersize combinations.

> We could add complexity to do suballocations for directories, but KISS
> seems to be a much better idea for now.

Oh dear God no.

Joel

--

"Nobody loves me,
Nobody seems to care.
Troubles and worries, people,
You know I've had my share."

http://www.jlbec.org/
[email protected]

2011-02-26 00:31:43

by Theodore Ts'o

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 04:03:04PM -0800, Joel Becker wrote:
> > We could add complexity to do suballocations for directories, but KISS
> > seems to be a much better idea for now.
>
> Oh dear God no.

What are you saying no to? KISS or suballocations? :-)

- Ted

2011-02-26 00:33:24

by Joel Becker

Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 07:31:40PM -0500, Ted Ts'o wrote:
> On Fri, Feb 25, 2011 at 04:03:04PM -0800, Joel Becker wrote:
> > > We could add complexity to do suballocations for directories, but KISS
> > > seems to be a much better idea for now.
> >
> > Oh dear God no.
>
> What are you saying no to? KISS or suballocations? :-)

You know the answer to that ;-) Though if you ever do smoke
some crack and design suballocations, ocfs2 will happily steal them.

Joel

--

"Only a life lived for others is a life worth while."
-Albert Einstein

http://www.jlbec.org/
[email protected]