2006-10-11 13:56:01

by Theodore Ts'o

Subject: Design alternatives for fragments/file tail support in ext4


Some of the new on-disk format changes are chewing up some of the inode
fields that had previously been reserved for fragments, which has
started me thinking about whether this would box us in if we ever wanted
to implement fragments --- for example, if we start seeing 64k blocksize
filesystems and still want ext4 to be useful as a general purpose
filesystem without wasting a huge amount of space on internal
fragmentation.

There have been a couple of different approaches that have been
suggested over the years, and I think it's worth thinking about some of
them again as we start the ext4 process, and see if we can do something
better.

But before we do this, it is useful to think about why we would want
larger blocks in the first place.

1. Make the indirect/double indirect block mapping scheme more
efficient. When the block size is increased, the need to go to
double and triple indirect blocks is reduced in two ways: first, the
number of block pointers in an indirect block is increased; second,
the amount of data addressed by each block pointer is increased.
Hence, an indirect block for a filesystem with a 16k block size can
address 256 times as much data as an indirect block for a filesystem
using 1k blocks.

2. Increase the ability to read from and write to the disk in
contiguous chunks. Modern disks use internal cluster sizes of 32k
and up at this point. Hence, allocating blocks in 32k chunks at a
time is extremely helpful.

3. Make the block allocation bitmap more efficient. By using a larger
blocksize, the number of block groups required decreases, and the
amount of overhead to maintain the block allocation bitmaps per
kilobyte of data is decreased.
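The arithmetic behind reason 1 can be checked with a quick sketch (the
helper below is hypothetical, just illustrating the math, not ext2/3
code):

```python
# One indirect block holds (block_size / pointer_size) 4-byte block
# pointers, each covering block_size bytes of file data.
def indirect_reach(block_size, ptr_size=4):
    """Bytes of file data addressable through a single indirect block."""
    return (block_size // ptr_size) * block_size

# 16k blocks: 4096 pointers * 16k each; 1k blocks: 256 pointers * 1k each.
print(indirect_reach(16 * 1024) // indirect_reach(1024))   # 256
```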

Extents, when implemented, make the first reason for wanting to
implement fragments go away, assuming that we keep block sizes
relatively small (i.e., 4k or smaller). They also mostly reduce
the motivation behind the second reason, although there are probably
changes we should make to our allocation algorithms that would improve
this. (More on this point later.) However, the third point might still
be a motivation for implementing fragments, if they are relatively cheap
to implement.

So given that, let's look at the various solutions in more detail:

* Storing the tail as an extended attribute
* Tail merging
* BSD FFS/UFS fragments
* Block allocation clusters

Storing the tail as an extended attribute
=========================================

Stephen and I have discussed this in the past, and the idea is a simple
one; simply store the tail as an extended attribute. There are other
filesystems that have done this, most notably NTFS (post-Windows 2000).
However, this approach is a little unsatisfying to me, since it buys us
nothing if there are no other extended attributes contained by the
filesystem, and if we are using large inodes to accelerate extended
attributes, there won't be much space for any but the smallest tails
(i.e., if we are using 4k blocks, and 512 byte inodes, the largest tail
that we could store using this method is around 350 bytes or so.)

Using this technique would only require a flag in the inode indicating
it has a fragment, so the filesystem knows to look for the extended
attribute. In theory this could also be done by checking the i_size
field, and assuming that the last block in the file can never normally be a
hole, but this can be quite fragile.
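A rough sketch of the space budget behind that estimate (the EA
overhead constants below are illustrative assumptions, not exact
ext3/4 on-disk values):

```python
# Back-of-the-envelope budget for storing a tail as an in-inode
# extended attribute. The overhead constants are assumptions for
# illustration; the exact EA header and entry sizes differ slightly.
INODE_SIZE = 512        # mke2fs -I 512
CORE_INODE = 128        # the original fixed inode fields
EA_HEADER = 4           # in-inode EA area header (assumed)
EA_ENTRY_OVERHEAD = 16  # per-attribute entry plus padded name (assumed)

max_tail = INODE_SIZE - CORE_INODE - EA_HEADER - EA_ENTRY_OVERHEAD
print(max_tail)   # 364 -- the same ballpark as the "around 350 bytes" above
```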

Tail merging
============

Daniel Phillips had prototype code which implemented a form of tail
packing, which he called tail merging. Details of his approach can be
found here:

http://www.mail-archive.com/[email protected]/msg01372.html

The quick summary of his approach, however, is that he uses all six
bytes in the inode relating to fragments (i_faddr, l_i_frag, l_i_fsize)
to maintain a doubly linked list (using 16-bit signed offsets of the
inode number) of all of the inodes utilizing the same tail block, plus a
16-bit offset into the shared-tail block. The i_size field is used to
determine which block is the last block, as well as the size of the
fragment stored in the shared-tail block.

In order to determine which portions of the shared-tail block are in use
and which are free, the filesystem code must walk through the entire
linked list to map out which portions of the block are in use. The
kernel caches this information, as well as the location of some number
of partially filled tail blocks and one of the inodes used to locate the
"merge" ring.
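A minimal model of that ring walk (field and function names here are
illustrative, not Daniel Phillips's actual on-disk layout):

```python
# Each inode in the merge ring records where its tail lives in the
# shared block plus a signed delta to the next member's inode number.
class TailInode:
    def __init__(self, tail_offset, tail_size, next_delta):
        self.tail_offset = tail_offset   # 16-bit offset into the shared block
        self.tail_size = tail_size       # bytes of tail data stored there
        self.next_delta = next_delta     # signed 16-bit delta to next inode

def used_ranges(inodes, start_ino):
    """Walk the ring from any member, collecting (offset, length) spans;
    anything not covered by a span is free space in the tail block."""
    spans, ino = [], start_ino
    while True:
        node = inodes[ino]
        spans.append((node.tail_offset, node.tail_size))
        ino += node.next_delta           # follow the ring
        if ino == start_ino:
            break
    return sorted(spans)

ring = {
    10: TailInode(0, 200, +5),    # inode 10 -> inode 15
    15: TailInode(200, 100, -5),  # inode 15 -> back to inode 10
}
print(used_ranges(ring, 10))   # [(0, 200), (200, 100)]
```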

Daniel had prototype code implemented against ext2 in mid 2000, but this
approach never got integrated because it had a number of shortcomings:

* It was very complicated to map out the free portions of the
tail block. This complexity was an issue both for
understanding and merging the code, as well as trying to make
it play nice with ext3's journaling.
* There was no good way to find partially merged tail blocks
after the filesystem is freshly mounted.

Still, if one accepts that the empty space in shared-tail blocks will
likely not be reused after the filesystem is unmounted (short of either
(a) opportunistically finding shared blocks when an inode containing
one of the shared tails is referenced, or (b) some kind of brute-force
searching scheme), the solution is workable.

BSD FFS/UFS Fragments
=====================

This scheme is (as far as I know) not necessarily well understood by the
ext3/4 development community. It is fairly elegant, although it has
some shortcomings which I will discuss below.

In the BSD Fast Filesystem, the FFS/UFS uses a "filesystem address"
where we would use a block number. The filesystem address is
effectively a fragment number, referenced in fragment-size chunks. This
means that if you are using a 512 byte fragment size and 32-bit
filesystem addresses (as is the case with UFS1; UFS2 uses 64-bit
filesystem addresses), the maximum filesystem volume size is 2**32 *
512, or 2TB. (In fact the maximum advertised volume size for UFS1 is
1TB, so I assume they had some signed vs. unsigned issues like we did
with ext3.)

The allocation bitmap for the UFS is also stored in terms of fragments.
However, allocations are done a block at a time, and in a block-aligned
fashion. In addition, indirect blocks are always a full block. So for
example, on a filesystem with 4k blocks and 512 byte fragments, most
bytes in the allocation map would be either 0xFF or 0x00, corresponding
to a fully used or fully unused block. In their i_blocks[] array, all
blocks except for the last block (which could be a fragment) would be
stored using filesystem addresses based on 512 byte fragment numbers
that would always be a multiple of 8, since they must be stored on a
block-aligned boundary. The last block, if a fragment, might not be at
a multiple of 8, and the i_blocks field (which is always measured in
512 byte units) could be used to indicate how much of the last block
was in use by that inode.
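The addressing rules above can be sketched as follows (the constants
are the example values from the text, not tied to any real superblock):

```python
# FFS/UFS-style "filesystem addresses": every block pointer is really a
# fragment number. Full blocks are allocated block-aligned, so their
# addresses are multiples of (block size / fragment size); only a
# file's tail fragment may fall off that alignment.
BLOCK_SIZE = 4096
FRAG_SIZE = 512
FRAGS_PER_BLOCK = BLOCK_SIZE // FRAG_SIZE   # 8 fragments per block

def is_block_aligned(fs_addr):
    """Full blocks (and indirect blocks) sit on block-aligned addresses;
    a tail fragment need not."""
    return fs_addr % FRAGS_PER_BLOCK == 0

def max_volume_size(addr_bits=32):
    """UFS1-style limit: 2**32 fragment addresses * 512-byte fragments."""
    return (2 ** addr_bits) * FRAG_SIZE

print(is_block_aligned(8), is_block_aligned(13))   # True False
print(max_volume_size() == 2 * 2 ** 40)            # 2TB, as in the text: True
```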

This solution could be easily implemented by us; although if we were to
do so, we would have to change the definition of i_blocks to always be
based on the fragment size, instead of basing it on the block size, as
currently proposed by EXT4_FEATURE_RO_COMPAT_HUGE_FILE. Since current
ext3/4 filesystems have the fragment size == the block size, this would
not be difficult. This scheme also means that none of the i_faddr,
l_i_frag, or l_i_fsize fields are necessary.

Downsides of this scheme? To the extent that we use a smaller fragment
size, we limit the maximum size of files and filesystem volumes. In
addition, it is quite wasteful of space in the allocation bitmap, since
a bit is used for each fragment-sized chunk of space on disk, even
though most blocks are allocated a full block at a time.

Block allocation clusters
=========================

This is not, strictly speaking, a fragmentation solution, but if we
assume that extents take care of the indirect block efficiency concern,
and if we assume that we're not too worried about the efficiency of the
block allocation bitmap, then perhaps a solution which only affects our
allocation algorithm would be sufficient when combined with the extents
solution. This solution also has the benefit that (without extents)
it is backwards compatible with existing ext3 filesystems, as it only
changes the allocation algorithm.

The basic idea is that we store in the superblock the size of a block
allocation cluster, and that we change the allocation algorithm and the
preallocation code to always try to allocate blocks so that whenever
possible, an inode will use contiguous clusters of blocks, which are
aligned in multiples of the cluster size.

So for example, suppose we are using a 4k filesystem with a 64k cluster
size, so there are 16 blocks in an allocation cluster. The allocation
algorithm will try very hard so that logical block numbers
0,16,32,48,64, etc. of each inode gets a block number which is a
multiple of 16, so that logical blocks 0-15 will usually be contiguous,
and logical blocks 16-31 are usually contiguous, etc. Block
reservations (aka preallocation) can help to enforce this, but even
without using the reservation tree, by simply changing the allocation
algorithm to preferentially use clusters of 16 blocks which are free,
and clusters which are partially in use if there are no clusters that
are completely free, we can substantially improve the block layout to
improve extents efficiency as well as data transfer efficiency.
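The allocation policy just described can be sketched as a goal
function (the free-set representation and fallback rule below are
simplified assumptions, not the actual mballoc code):

```python
# For logical block L of a file, prefer a physical block at the same
# offset within some fully free cluster, so logical blocks 0-15, 16-31,
# etc. stay contiguous and cluster-aligned.
CLUSTER = 16   # blocks per 64k cluster on a 4k-block filesystem

def alloc_goal(logical_block, free_blocks):
    """Pick a physical block: first try a cluster-aligned placement in a
    fully free cluster; otherwise fall back to any free block (i.e., a
    partially used cluster)."""
    offset = logical_block % CLUSTER
    for base in range(0, max(free_blocks) + 1, CLUSTER):
        if all(b in free_blocks for b in range(base, base + CLUSTER)):
            return base + offset   # keep the logical offset in the cluster
    return min(free_blocks)        # no free cluster: pack into what's left

free_blocks = set(range(16, 32)) | {3, 7}   # cluster 16-31 free, plus strays
print(alloc_goal(0, free_blocks))   # 16: logical 0 lands on a cluster boundary
print(alloc_goal(5, free_blocks))   # 21: offset 5 within the same cluster
```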

Does this have a downside? Yes; unless we have special case code when
doing delayed allocation so that the "tail" --- where in this case the
tail is whatever portion of the file doesn't fit in a full cluster ---
is packed with other partially used clusters, this scheme can in the end
have even worse fragmentation issues once the filesystem starts getting
full. However, since we need delayed allocation for extents anyway, we
can simply encode this requirement into the delayed allocation scheme.

Conclusion
==========

If we are to implement fragments for ext4, given that we are already
implementing extents, I believe the best approach is either the BSD
Fragments scheme or the block allocation clusters scheme. The
advantage of the BSD Fragments scheme is that it allows the block
bitmaps to be more efficient for very large files, but at the cost of
requiring us to go to 64-bit filesystem addresses much more quickly, and
more complexity in handling the final file tail/fragment.

The block cluster scheme has the advantage that it is much simpler to
implement (just changes to the block allocation code), but it does
require that delayed allocation also be implemented, and it means that
the block allocation bitmaps have to be touched more frequently, since
the block size remains at the current 4k size. It also has the
advantage that it doesn't require the complexities and VM changes
necessary to support filesystem block sizes which are greater than the
page size.

Since currently both the x86 and x86_64 use 4k page sizes, and the ia64
system seems stuck as a niche market, using larger block sizes such as
16k or 64k is highly problematic for the dominant architecture(s) in the
market. Hence, until the VM has the capability of supporting 64k block
sizes on systems with 4k page sizes, it is very unlikely many people
would use the large blocksize changes and enhancements to the
filesystem. So the block cluster scheme is much more likely to be
immediately useful to the ext4 user community.

Finally, note that implementation of block clusters does not foreclose
implementation of the BSD fragments scheme. However, the latter would
probably only be of interest to a company which is heavily invested in
the Itanium architecture, at least in the short term.


Comments?

- Ted



2006-10-13 06:18:58

by Michael Burschik

Subject: Re: Design alternatives for fragments/file tail support in ext4

Theodore Ts'o wrote:
> Storing the tail as an extended attribute
> =========================================
>
> Stephen and I have discussed this in the past, and the idea is a simple
> one; simply store the tail as an extended attribute. There are other
> filesystems that have done this, most notably NTFS (post-Windows 2000).
> However, this approach is a little unsatisfying to me, since it buys us
> nothing if there are no other extended attributes contained by the
> filesystem, and if we are using large inodes to accelerate extended
> attributes, there won't be much space for any but the smallest tails
> (i.e., if we are using 4k blocks, and 512 byte inodes, the largest tail
> that we could store using this method is around 350 bytes or so.)
>
> Using this technique would only require a flag in the inode indicating
> it has a fragment, so the filesystem knows to look for the extended
> attribute. In theory this could also be done by checking the i_size
> field, and assuming that last block in the file can never normally be a
> hole, but this can be quite fragile.
>
>
Disclaimer: I am not a file system expert.

That said, I still think there are some advantages to this scheme. First
of all, 350 bytes is not all that bad. Many short files are
configuration files, and many of them are shorter than 350 bytes
(resolv.conf, hostname, hosts, etc.). On my system I find more than 300
files shorter than 512 bytes in /etc, and more than 700 in my home
directory. Some applications read many configuration files, so I would
expect this scheme to lead to a considerable speed improvement.
Moreover, there is much to be said in favour of configuration systems
that store exactly one value per file, such as tcb and Elektra.

Secondly, I expect systems that do not use extended attributes to become
increasingly rare. SELinux has become mainstream. It is included in Red
Hat/Fedora, Ubuntu, Debian etch (IIRC). Desktop software such as Beagle
uses extended attributes if available, and I expect this tendency to
increase. Desktop software is also notorious for needing vast numbers of
configuration files.

Regards

Michael Burschik

2006-10-13 08:10:04

by Andreas Dilger

Subject: Re: Design alternatives for fragments/file tail support in ext4

On Oct 11, 2006 09:55 -0400, Theodore Ts'o wrote:
> Block allocation clusters
> =========================
> The basic idea is that we store in the superblock the size of a block
> allocation cluster, and that we change the allocation algorithm and the
> preallocation code to always try to allocate blocks so that whenever
> possible, an inode will use contiguous clusters of blocks, which are
> aligned in multiples of the cluster size.

As mentioned in the weekly conference call - Alex has already implemented
this as part of the mballoc code that CFS uses in conjunction with extents.
There is a /proc tunable for the cluster size, which currently defaults to
1MB clusters (the Lustre RPC size) to optimize performance for RAID systems.
The allocations are aligned with the LUN so that an integer number of RAID
stripes are modified for a write. Smaller allocation chunks are packed
together.

Alex is working to update the multi-block allocator for the 2.6.18 kernel,
in conjunction with delayed allocation for ext4, and will hopefully have
a patch soon.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-10-13 10:50:05

by Theodore Ts'o

Subject: Re: Design alternatives for fragments/file tail support in ext4

On Fri, Oct 13, 2006 at 02:10:02AM -0600, Andreas Dilger wrote:
> On Oct 11, 2006 09:55 -0400, Theodore Ts'o wrote:
> > Block allocation clusters
> > =========================
> > The basic idea is that we store in the superblock the size of a block
> > allocation cluster, and that we change the allocation algorithm and the
> > preallocation code to always try to allocate blocks so that whenever
> > possible, an inode will use contiguous clusters of blocks, which are
> > aligned in multiples of the cluster size.
>
> As mentioned in the weekly conference call - Alex has already implemented
> this as part of the mballoc code that CFS uses in conjunction with extents.
> There is a /proc tunable for the cluster size, which currently defaults to
> 1MB clusters (the Lustre RPC size) to optimize performance for RAID systems.
> The allocations are aligned with the LUN so that an integer number of RAID
> stripes are modified for a write. Smaller allocation chunks are packed
> together.

I suggest this be tunable by superblock field, and not by a /proc
tunable. This is the sort of thing which might be different
per-filesystem, and the algorithm will be most effective if the
filesystem always uses the same cluster size from the time when it was
first created. I'd be happy to assign a superblock field for this
purpose, and add the appropriate tune2fs support if we have general
agreement on this point.

> Alex is working to update the multi-block allocator for the 2.6.18 kernel,
> in conjunction with delayed allocation for ext4, and will hopefully have
> a patch soon.

Great!

- Ted

2006-10-13 10:57:03

by Alex Tomas

Subject: Re: Design alternatives for fragments/file tail support in ext4

>>>>> Theodore Tso (TT) writes:

TT> I suggest this be tunable by superblock field, and not by a /proc
TT> tunable. This is the sort of thing which might be different
TT> per-filesystem, and the algorithm will be most effective if the
TT> filesystem always use the same cluster size from the time when it was
TT> first created. I'd be happy to assign a superblock field for this
TT> purpose, and add the appropriate tune2fs support if we have general
TT> agreement on this point.

that would be good. there is even a stride option to mke2fs?

thanks, Alex

2006-10-13 12:23:56

by Theodore Ts'o

Subject: Re: Design alternatives for fragments/file tail support in ext4

On Fri, Oct 13, 2006 at 02:56:46PM +0400, Alex Tomas wrote:
> >>>>> Theodore Tso (TT) writes:
>
> TT> I suggest this be tunable by superblock field, and not by a /proc
> TT> tunable. This is the sort of thing which might be different
> TT> per-filesystem, and the algorithm will be most effective if the
> TT> filesystem always use the same cluster size from the time when it was
> TT> first created. I'd be happy to assign a superblock field for this
> TT> purpose, and add the appropriate tune2fs support if we have general
> TT> agreement on this point.
>
> that would be good. there is even a stride option to mke2fs?

Yes, there is. And just as we have -E stride=stripe-size and -E
resize=max-online-resize, we can also add a -E cluster-size=bytes
parameter to mke2fs. It would also make sense to make this something
that can be defaulted in /etc/mke2fs.conf, since even for IDE or SATA
disks it probably makes sense to make the cluster size 16k or 32k or
maybe even higher. We probably need to do some benchmarks to see
whether or not this makes sense.

- Ted


2006-10-13 12:48:44

by Erik Mouw

Subject: Re: Design alternatives for fragments/file tail support in ext4

On Wed, Oct 11, 2006 at 09:55:57AM -0400, Theodore Ts'o wrote:

[...]

> 2. Increase the ability to be able to read from the disk in contiguous
> reads and writes. Modern disks use internal cluster sizes of 32k
> and up at this point. Hence, allocating blocks in 32k chunks at a
> time is extremely helpful.

It's only useful if you can start the partition at such a 32k chunk,
otherwise you end up having the disk reading two such chunks for every
chunk the filesystem reads.

Unfortunately the current Linux fdisk implementation insists on putting
partitions on track boundaries because some other OS thinks that's a
good idea, because it was indeed a good idea back when disks didn't lie
about their physical layout. Right now disks all claim to have 63
sectors/track, so we end up having partitions starting at 32k - 512
bytes. By overriding the physical layout in fdisk you can get a better
layout, but unfortunately that behaviour isn't standard.
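The misalignment works out as follows:

```python
# 63 sectors per track puts the first partition at sector 63, which is
# 512 bytes short of a 32k boundary.
SECTOR = 512
start = 63 * SECTOR          # first partition's byte offset
print(start)                 # 32256
print(32 * 1024 - start)     # 512 -- hence "32k - 512 bytes"
```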

> Storing the tail as an extended attribute
> =========================================
>
> Stephen and I have discussed this in the past, and the idea is a simple
> one; simply store the tail as an extended attribute. There are other
> filesystems that have done this, most notably NTFS (post-Windows 2000).
> However, this approach is a little unsatisfying to me, since it buys us
> nothing if there are no other extended attributes contained by the
> filesystem, and if we are using large inodes to accelerate extended
> attributes, there won't be much space for any but the smallest tails
> (i.e., if we are using 4k blocks, and 512 byte inodes, the largest tail
> that we could store using this method is around 350 bytes or so.)

That isn't very different from NTFS, which usually has 1k sized MFT
records (Master File Table records, which combine inode information
with the name of the file) that are usually filled to around 400
to 500 bytes. You can tell mkfs.ntfs (or whatever it is called in
Windows) that you want larger MFT records so you have more room for
inline files or tails.

Although some people claim that Windows uses extended attributes quite
often, we don't see that in practice when we recover data from broken
disks. The last time I saw extended attributes used was when a virus
scanner put a checksum in an extended attribute. Oh, and there was a
machine which served files for MacOS machines, so it had the resource
fork and finder info in separate attributes.

Extended attributes might be used more on Linux, especially because
distributions are moving to selinux so the room in an inode becomes a
lot smaller for storing tail data.

> Using this technique would only require a flag in the inode indicating
> it has a fragment, so the filesystem knows to look for the extended
> attribute. In theory this could also be done by checking the i_size
> field, and assuming that last block in the file can never normally be a
> hole, but this can be quite fragile.

Better be explicit about fragments, that also makes e2fsck easier.

> Tail merging
> ============

[...]

Another way (and I can't find which filesystem exactly uses it) is to
store all tails within a certain size range in a special file for that
particular range. So for example, if your filesystem has a 64k
blocksize, it has files for 512 bytes, 1k, 1.5k, 2k, and so on up to
63.5k. If a tail falls between 0 and 512 bytes, it's stored in the
special file for 512 bytes; if it's between 512 bytes and 1k, it's
stored in the 1k file, etc. The special files are normal files with a
special meaning. Of course you need to do some bookkeeping for the
contents of the special files, but you need that for every way of
managing tails you proposed.
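The bucket selection for such a scheme is simple (the round-up rule
below is an assumption about how the size ranges would be chosen):

```python
# Tails are grouped into special files by 512-byte size ranges; a tail
# goes into the file whose slot size is its size rounded up to the next
# multiple of 512.
BUCKET = 512

def tail_file_bucket(tail_size):
    """Return the slot size (and hence the special file) for a tail."""
    return (tail_size + BUCKET - 1) // BUCKET * BUCKET

print(tail_file_bucket(100))   # 512: falls in the 0-512 range
print(tail_file_bucket(700))   # 1024: falls in the 512-1k range
```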


Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands

2006-10-13 17:47:42

by Andreas Dilger

Subject: Re: Design alternatives for fragments/file tail support in ext4

On Oct 13, 2006 08:23 -0400, Theodore Tso wrote:
> On Fri, Oct 13, 2006 at 02:56:46PM +0400, Alex Tomas wrote:
> > >>>>> Theodore Tso (TT) writes:
> > TT> I suggest this be tunable by superblock field, and not by a /proc
> > TT> tunable. This is the sort of thing which might be different
> > TT> per-filesystem, and the algorithm will be most effective if the
> > TT> filesystem always use the same cluster size from the time when it was
> > TT> first created. I'd be happy to assign a superblock field for this
> > TT> purpose, and add the appropriate tune2fs support if we have general
> > TT> agreement on this point.
> >
> > that would be good. there is even a stride option to mke2fs?
>
> Yes, there is. And just as we have -E stride=stripe-size and -E
> resize=max-online-resize, we can also -E cluster-size=bytes parameter
> in mke2fs. It would also make sense to make this be something that
> can be defaulted in /etc/mke2fs.conf, since even for IDE or SATA disks
> it probably makes sense to make the cluster size be 16k or 32k or
> maybe even higher. We probably need to do some benchmarks to see
> whether or not this makes sense.

I think what Alex meant is that the "mke2fs -E stride=" value should
just be put into the superblock. This would allow tuning the mballoc code
to match the RAID alignment, and would also make life easier for resizers
so they can continue the RAID-stepping of bitmaps that mke2fs does
without having to extrapolate the stride value from the bitmap locations.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


2006-12-02 11:02:55

by Alex Tomas

Subject: Re: Design alternatives for fragments/file tail support in ext4


>>>>> Theodore Ts'o (TT) writes:

TT> Storing the tail as an extended attribute
TT> =========================================
TT> BSD FFS/UFS Fragments
TT> =====================

what if we applied the BSD FFS/UFS Fragments design to inodes?
something like: the basic inode size is still 128 bytes, but we
could unite a few contiguous inodes into a larger one and use that
space to store extended attributes, including a tail?

I foresee a couple of issues:

a) as we can't predict the tail size, we'd need to be able
to reallocate the inode or do some sort of delayed inode
allocation

b) inode space is quite limited, and that is why we may exhaust
it very quickly

thanks, Alex