LinuxLists.cc - [RFC] extents,delayed allocation,mballoc for ext3

2004-04-13 19:29:56

Subject: [RFC] extents,delayed allocation,mballoc for ext3

these patches implement several features for ext3:
- extents
- multiblock allocator
- delayed allocation (a.k.a. allocation on flush)

extents
=======
it's just a way to store inode's blockmap in well-known triples
[logical block; phys. block; length]. all the extents are stored
in B+Tree. code is splitted in two parts:
1) generic extents support
implements primitives like lookup, insert, remove, walk
2) VFS part
implements ->getblock() and ->truncate() methods

multiblock allocator
===================
the larger extents the better. the reasonable way is to ask block
allocator to allocate several blocks at once. it is possible to
scan bitmaps, but such a scanning isn't very good method. so, here
is mballoc - buddy algorithm + possibility to find contig.buddies
fast way. mballoc is backward-compatible, buddies are stored on a
disk as usual file (temporal solution until fsck support is ready)
and regenerated at mount time. also, with existing block-at-once
allocator it's impossible to write at very high rate (several
hundreds MB a sec). multiblock allocator solves this issue.

delayed allocation
==================
this is ->writepages() implementation that exploits very nice tagged
radix tree. it finds contiguous spaces and asks extents code to walk
over specified ranges of blocks. extents code calls given callback
routine that allocates blocks for listed pages, cookes a bio's and
submit them on a disk.

todo
====
1) blocks must be reserved to avoid -ENOSPC upon writeback
2) blocks must be available for allocation after committing only
3) data=order support
4) blocksize < PAGE_CACHE_SIZE support
5) option to allocator to look for +N blocks if goal is busy
6) probably preallocation for slowly-growing files
7) allocation policy tuning
8) regenerate buddies in crash case only

NOTE: don't try to use it in production. all the patches (probably
excluding extents) are pre-pre-alpha. because of size I put patches
in ftp://ftp.clusterfs.com/pub/people/alex/2.6.4-mm2/

benchmarks (hardware: 2way iPIII-1000Mhz/512MB/old scsi hdd (20MB/s)
====================================================================

I ran dd to write specified amount of data and measured time ext3
spent in allocator via get_cycles(). all the bitmaps were preloaded.

size 5kb, before: 2 allocations, 30285 cycles
size 5kb, before: 2 allocations, 27550 cycles
size 5kb, before: 2 allocations, 27307 cycles
size 5kb, before: 2 allocations, 27486 cycles
14078 cycles per block

size 5kb, after : 1 allocations, 50531 cycles
size 5kb, after : 1 allocations, 47915 cycles
size 5kb, after : 1 allocations, 51890 cycles
size 5kb, after : 1 allocations, 49094 cycles
24928 cycles per block

size 600kb, before: 151 allocations, 443282 cycles
size 600kb, before: 151 allocations, 439809 cycles
size 600kb, before: 151 allocations, 438705 cycles
size 600kb, before: 151 allocations, 506309 cycles
3026 cycles per block

size 600kb, after : 1 allocations, 55344 cycles
size 600kb, after : 1 allocations, 55094 cycles
size 600kb, after : 1 allocations, 54311 cycles
size 600kb, after : 1 allocations, 102892 cycles
446 cycles per block

size 12445kb, before: 3117 allocations, 9683780 cycles
size 12445kb, before: 3117 allocations, 9866494 cycles
size 12445kb, before: 3117 allocations, 9702287 cycles
size 12445kb, before: 3117 allocations, 10127695 cycles
3158 cycles per block

size 12445kb, after : 1 allocations, 60446 cycles
size 12445kb, after : 1 allocations, 60978 cycles
size 12445kb, after : 1 allocations, 65121 cycles
size 12445kb, after : 1 allocations, 61893 cycles
20 cycles per block

single dd writes 2GB:
before:
real 2m2.623s
user 0m0.028s
sys 0m12.236s

after:
real 1m49.696s
user 0m0.028s
sys 0m8.008s

9 copies of dd, each writes 20MB:
before:
real 1m22.151s
user 0m0.057s
sys 0m2.102s

after:
real 0m9.664s
user 0m0.061s
sys 0m1.209s

time to dbench things ...
before
Throughput 150.215 MB/sec 8 procs
Throughput 140.273 MB/sec 8 procs
Throughput 153.377 MB/sec 8 procs
Throughput 101.198 MB/sec 8 procs
Average: 136.26575

Throughput 68.8406 MB/sec 16 procs
Throughput 83.0574 MB/sec 16 procs
Throughput 48.3245 MB/sec 16 procs
Throughput 54.4254 MB/sec 16 procs
Average: 63.66197

Throughput 53.6807 MB/sec 32 procs
Throughput 66.5997 MB/sec 32 procs
Throughput 59.4454 MB/sec 32 procs
Throughput 62.9191 MB/sec 32 procs
Average: 60.66122

after:
Throughput 226.799 MB/sec 8 procs
Throughput 205.548 MB/sec 8 procs
Throughput 220.675 MB/sec 8 procs
Throughput 192.285 MB/sec 8 procs
Average: 211.32675

Throughput 178.969 MB/sec 16 procs
Throughput 182.105 MB/sec 16 procs
Throughput 196.786 MB/sec 16 procs
Throughput 180.526 MB/sec 16 procs
Average: 184.59650

Throughput 139.905 MB/sec 32 procs
Throughput 132.72 MB/sec 32 procs
Throughput 133.429 MB/sec 32 procs
Throughput 131.498 MB/sec 32 procs
Average: 134.38800

2004-04-14 04:01:14

by Matt Mackall

[permalink] [raw]

Subject: Re: [RFC] extents,delayed allocation,mballoc for ext3

On Tue, Apr 13, 2004 at 11:28:57PM +0400, [email protected] wrote:
>
> these patches implement several features for ext3:
> - extents
> - multiblock allocator
> - delayed allocation (a.k.a. allocation on flush)
>
>
> extents
> =======
> it's just a way to store inode's blockmap in well-known triples
> [logical block; phys. block; length]. all the extents are stored
> in B+Tree. code is splitted in two parts:
> 1) generic extents support
> implements primitives like lookup, insert, remove, walk
> 2) VFS part
> implements ->getblock() and ->truncate() methods

I'm going to assume that there's no way for ext3 without extents
support to mount such a filesystem, so I think this means changing the
FS name. Is there a simple migration path to extents for existing filesystems?

> multiblock allocator
> ===================
> the larger extents the better. the reasonable way is to ask block
> allocator to allocate several blocks at once. it is possible to
> scan bitmaps, but such a scanning isn't very good method. so, here
> is mballoc - buddy algorithm + possibility to find contig.buddies
> fast way. mballoc is backward-compatible, buddies are stored on a
> disk as usual file (temporal solution until fsck support is ready)
> and regenerated at mount time. also, with existing block-at-once
> allocator it's impossible to write at very high rate (several
> hundreds MB a sec). multiblock allocator solves this issue.

Similar questions here.

> NOTE: don't try to use it in production. all the patches (probably
> excluding extents) are pre-pre-alpha. because of size I put patches
> in ftp://ftp.clusterfs.com/pub/people/alex/2.6.4-mm2/

You might also mention that on-disk format issues such as endian
layout are not finalized.

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2004-04-14 12:06:16

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] extents,delayed allocation,mballoc for ext3

>>>>> Matt Mackall (MM) writes:

MM> I'm going to assume that there's no way for ext3 without extents
MM> support to mount such a filesystem, so I think this means changing the
MM> FS name. Is there a simple migration path to extents for existing filesystems?

yeah. you're right. I see no way to make it backward-compatible. in fact,
I haven't think much about name. probably you're right again and this
"ext3 on steroids" should have another name.

MM> Similar questions here.

no. this one is backward-compatible and usual ext3 will run ok.
btw, I think it possible to implement few routines that could allow
to exploit delayed allocation and multiblock allocator patches w/o
introducing extents. the most visible effect of the extents is much
faster truncate.

MM> You might also mention that on-disk format issues such as endian
MM> layout are not finalized.

yep. thanks for notice.

2004-04-14 12:11:25

by Alex Tomas

[permalink] [raw]

Subject: Re: [RFC] extents,delayed allocation,mballoc for ext3

I've just benched ext3 vs. ext3+reservation vs. ext3+delalloc vs. xfs.
it was tiobench.

Sequential Writes
File Blk Num Avg CPU
Identifier Size Size Thr Rate (CPU%) Latency Eff
---------------------------- ------ ----- --- ------ ------ --------- -----
ext3 1024 4096 4 13.34 14.76% 0.897 90
ext3-dalloc 1024 4096 4 26.39 19.26% 0.452 137
ext3-reserv 1024 4096 4 23.77 28.99% 0.529 82
xfs 1024 4096 4 27.22 20.68% 0.373 132

ext3 1024 4096 8 9.71 10.82% 2.421 90
ext3-dalloc 1024 4096 8 25.81 18.64% 0.816 138
ext3-reserv 1024 4096 8 23.62 29.49% 1.006 80
xfs 1024 4096 8 27.06 22.49% 0.763 120

ext3 1024 4096 16 6.60 7.891% 7.222 84
ext3-dalloc 1024 4096 16 24.99 19.71% 1.783 127
ext3-reserv 1024 4096 16 23.04 28.15% 1.849 82
xfs 1024 4096 16 24.84 20.58% 1.300 121

ext3 1024 4096 32 8.12 9.872% 8.111 82
ext3-dalloc 1024 4096 32 24.83 20.01% 2.995 124
ext3-reserv 1024 4096 32 22.72 29.51% 3.282 77
xfs 1024 4096 32 25.47 21.75% 2.247 117

ext3-dalloc is ext3 + extents + delayed allocation + multiblock allocator
ext3-reserv is ext3 + reservation patches by Mingming Cao

2004-04-14 17:43:30

by Mingming Cao

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3

On Wed, 2004-04-14 at 05:10, Alex Tomas wrote:
>
> I've just benched ext3 vs. ext3+reservation vs. ext3+delalloc vs. xfs.
> it was tiobench.
> ext3 1024 4096 32 8.12 9.872% 8.111 82
> ext3-dalloc 1024 4096 32 24.83 20.01% 2.995 124
> ext3-reserv 1024 4096 32 22.72 29.51% 3.282 77
> xfs 1024 4096 32 25.47 21.75% 2.247 117
>
Hi Alex,

Nice comparison! The ext3 reservation system use more cpus because we do
reservations in memory( not on disk) and we have a global lock per
filesystem to guard the operation. The current search for a new
reservation window algorithm is not perfect right now.

extents and delayed allocation probably is the right way to go for next
generation (maybe ext4). Currently I just try to fix the missing
preallocation feature in ext3, without break the disk compatibility and
involve too much changes....

Thanks for your interest.

Mingming

2004-04-19 19:47:19

by Stephen C. Tweedie

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3

Hi,

On Wed, 2004-04-14 at 13:05, Alex Tomas wrote:

> MM> I'm going to assume that there's no way for ext3 without extents
> MM> support to mount such a filesystem, so I think this means changing the
> MM> FS name. Is there a simple migration path to extents for existing filesystems?
>
> yeah. you're right. I see no way to make it backward-compatible. in fact,
> I haven't think much about name. probably you're right again and this
> "ext3 on steroids" should have another name.

We've already got feature compatibility bits that can deal with this
sort of thing. There are various other proposed incompatible features,
such as large inodes and dynamically placed metadata (eg. placing inode
tables into an inode "file"), too. Rather than invent new names for
each combination of incompatible feature set, we're probably better off
just using the feature masks.

Cheers,
Stephen

2004-04-20 22:46:46

by Matt Mackall

[permalink] [raw]

Subject: Re: [Ext2-devel] Re: [RFC] extents,delayed allocation,mballoc for ext3

On Mon, Apr 19, 2004 at 08:47:10PM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2004-04-14 at 13:05, Alex Tomas wrote:
>
> > MM> I'm going to assume that there's no way for ext3 without extents
> > MM> support to mount such a filesystem, so I think this means changing the
> > MM> FS name. Is there a simple migration path to extents for existing filesystems?
> >
> > yeah. you're right. I see no way to make it backward-compatible. in fact,
> > I haven't think much about name. probably you're right again and this
> > "ext3 on steroids" should have another name.
>
> We've already got feature compatibility bits that can deal with this
> sort of thing. There are various other proposed incompatible features,
> such as large inodes and dynamically placed metadata (eg. placing inode
> tables into an inode "file"), too. Rather than invent new names for
> each combination of incompatible feature set, we're probably better off
> just using the feature masks.

I'm aware of the existence of such features, I just think it's yet to
be demonstrated that they're actually a good idea for real deployment
by themselves. ext3+{btree,extents} is not backwards compatible in any
useful sense, unlike features such as journalling, directory hashing,
sparse superblocks, wandering journals, etc. Given that you can't
mount the new filesystem with an old kernel, not changing the name can
only result in confusion.

But I see your point about dealing with a cartesian product of
features. So if and when this stuff approaches beta, we should
probably use the feature flags _and_ change the name to something like
ext3+be (btrees, extents) or ext3+i (inode in file) to indicate the
presence of experimental, incompatible features, and when the feature
set is actually pinned down, rename it simply ext3+ or ext4 or whatever.

It might be possible to have ext4 actually be a family of filesystems
where extents or large inodes are optional, but I suspect the value of
that would be minimal and again, all such features would have to be
available in every kernel tree that claimed to support ext4.

--
Matt Mackall : http://www.selenic.com : Linux development and consulting