2004-03-30 08:50:37

by Mingming Cao

Subject: [RFC, PATCH] Reservation based ext3 preallocation

Hi, Andrew, Ted, All,

Ext3 preallocation is currently missing. This is a first cut of a
prototype for reservation-based ext3 preallocation, based on the ideas
suggested by Andrew and Ted. The implementation is incomplete, but I
would like to hear your opinions on the current design.

What I have done in this version of the prototype:
1) basic reservation structure and operations
2) reservation based ext3 block allocation
3) reservation window allocation
4) block allocation when fs reservation is turned off

For 1) Use a sorted doubly linked list for the per-filesystem
reservation list, like vm_region does. The operations on the doubly
linked list are abstracted, so later, if necessary, we could easily
replace it with a more sophisticated tree structure.

Each inode has a reservation structure inside its ext3_inode_info
structure. Each reservation structure contains (start, end, list_head,
goal_window_size).
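The structure and list operations above can be sketched at user level roughly as follows; the field and function names are illustrative, not the patch's actual identifiers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-level sketch of the per-inode reservation window
 * and the sorted per-filesystem doubly linked list described above. */
struct rsv_window {
	unsigned int start;		/* first block of the window */
	unsigned int end;		/* last block of the window */
	unsigned int goal_window_size;	/* desired size of the next window */
	struct rsv_window *prev, *next;	/* sorted doubly linked list links */
};

/* Insert 'win' after a dummy head, keeping the list sorted by start
 * block. Because callers only see operations like this one, the list
 * could later be swapped for a tree without touching them. */
static void rsv_insert(struct rsv_window *head, struct rsv_window *win)
{
	struct rsv_window *cur = head;

	while (cur->next && cur->next->start < win->start)
		cur = cur->next;
	win->next = cur->next;
	win->prev = cur;
	if (cur->next)
		cur->next->prev = win;
	cur->next = win;
}
```
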

For 2) The basic idea is: when we try to allocate a new block for an
inode, and the inode already has a reservation window, we try to
allocate from there first.

If it does not have a reservation window, we allocate a block and
make a reservation window for it. Instead of doing the block allocation
first and the reservation window allocation second, we make the
reservation window first, then allocate a block within that window. The
new reservation window has at least one free block and does not overlap
any other reservation window. This way we avoid searching the
reservation list over and over when we find a free bit on the bitmap
but are not sure whether it belongs to somebody else's reservation
window.

For 3) To allocate a new reservation window, we search the part of the
filesystem reservation list that falls into the group we are trying to
allocate a block from. A goal block guides where we want the new
reservation window to start. If we already have an old reservation
window, we discard it first, then search the part of the list after the
old window; otherwise the sub-list starts from the beginning of the
group. The new reservation window may cross a group boundary, and it
must contain at least one free block.
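A rough user-level sketch of that gap search (the names and the array representation are mine, not the patch's; the check that the gap actually contains a free block is omitted):

```c
#include <assert.h>

/* Hypothetical sketch: given the sorted reservations [start,end] that
 * overlap the region being searched, find the first gap of at least
 * 'size' blocks at or after 'goal' and ending before 'limit'. */
struct win { unsigned int start, end; };

static int find_gap(const struct win *rsv, int n,
		    unsigned int goal, unsigned int limit,
		    unsigned int size, unsigned int *out_start)
{
	unsigned int cursor = goal;
	int i;

	for (i = 0; i < n; i++) {
		if (rsv[i].end < cursor)
			continue;		/* entirely before the cursor */
		if (rsv[i].start >= cursor + size)
			break;			/* gap before this window fits */
		cursor = rsv[i].end + 1;	/* skip past this reservation */
	}
	if (cursor + size > limit)
		return -1;			/* no room left in the region */
	*out_start = cursor;
	return 0;
}
```
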

For 4) If the filesystem has reservations turned off, the code path for
new block allocation is the same as the current code -- just call
ext3_try_to_allocate() with a NULL reservation window pointer.

The above logic has been verified with a user-level simulation program.
The attached prototype patch (against the 2.6.4 kernel) compiles and
boots. I have done initial testing of the patch on a 2-way PIII 700MHz
box.

Below is the debugfs output after running a simple test. The test has 8
threads, each sequentially writing 20MB to a different file at the same
time, in the same directory, on a freshly created ext3 filesystem.
Basically, after applying the patch, the filesystem is much less
fragmented:

before(2.6.4 kernel):

Inode: 12 Type: regular Mode: 0644 Flags: 0x0 Generation: 3375236196
.....
BLOCKS:
(0):8716, (1):8718, (2):8720, (3):8722, (4):8724, (5):8726, (6):8728,
(7):8730, (8):8732, (9):8734, (10):8736, (11):8738, (IND):8741,
(12):8742, (13):8744, (14):8746, (15):8748, (16):8750, (17):8752,
(18):8754, (19):8756, (20):8758, (21):8760, (22):8762, (23):8764,
(24):8766, (25):8768, (26):8770, (27):8772, (28):8774, (29):8776,
(30):8778, (31):8780, (32):8782, (33):8784, (34):8786, (35):8788,
(36):8790, (37):8792, (38):8794, (39):8796, (40):8798, (41):8800,
(42):8802, (43):8804, (44):8806, (45):8808, (46):8810, (47):8812,
(48):8814, (49):8816, (50):8818, (51):8820, (52):8822, (53):8824,
(54):8826, (55):8828, (56):8830, (57):8832, (58):8834, (59):8836,
(60):8838, (61):8840, (62):8842, (63):8844, (64):8846, (65):8848,
(66):8850, (67):8852, (68):8854, (69):8856, (70):8858, (71):8860,
(72):8862, (73):8864, (74):8866, (75):8868, (76):8870, (77):8872,
(78):8874, (79):8876, (80):8878, (81):8880,
......
......
......

after applying the patch (reservation window size is 128 blocks):
Inode: 15 Type: regular Mode: 0644 Flags: 0x0 Generation: 2351221293
......
BLOCKS:
(0-11):24576-24587, (IND):24588, (12-1035):24592-25615, (DIND):25616,
(IND):25624,(1036-2027):25632-26623, (2028-2059):37116-37147,
(IND):37148, (2060-2151):37152-37243, (2152-2279):37372-37499,
(2280-2407):37756-37883, (2408-2535):38012-38139,
(2536-2663):38268-38395, (2664-2791):38524-38651,
(2792-2919):38780-38907, (2920-3083):43132-43295, (IND):43296,
(3084-3167):43304-43387, (3168-3551):43516-43899,
(3552-3679):44028-44155, (3680-3807):44284-44411,
(3808-3935):44540-44667, (3936-4063):44924-45051,
(4064-4107):45180-45223, (IND):45224, (4108-4183):45232-45307,
(4184-4567):45436-45819, (4568-4695):45948-46075, (4696-4823):46204-46331,
(4824-4828):46875-46879, (4829-4956):48380-48507,
(4957-4999):48636-48678

TOTAL: 5006

Things to do:
1) Dynamically increase the reservation window size for individual
files.
2) Prevent bogus early ENOSPC errors when the filesystem is fully
reserved.
3) Preserve the reservation window across file open/close for files
that are frequently appended to.
4) Play with other tree structures to replace the sorted doubly linked
list for the reservation list, if necessary.

Before working on the above todos, I would like to hear your valuable
comments, suggestions, and ideas on the current design of this
reservation based ext3 preallocation. The patch is attached and is
against the 2.6.4 kernel.

Thanks!

Mingming

diffstat ext3_reservation-7.diff
fs/ext3/balloc.c           |  485 ++++++++++++++++++++++++++++++++++++++++++--
fs/ext3/ialloc.c           |    6
fs/ext3/inode.c            |    9
fs/ext3/super.c            |    7
include/linux/ext3_fs.h    |    4
include/linux/ext3_fs_i.h  |   12 +
include/linux/ext3_fs_sb.h |    4
7 files changed, 511 insertions(+), 16 deletions(-)


Attachments:
ext3_reservation-7.diff (22.76 kB)

2004-03-30 09:45:49

by Andrew Morton

Subject: Re: [RFC, PATCH] Reservation based ext3 preallocation

Mingming Cao <[email protected]> wrote:
>
> Ext3 preallocation is currently missing.

I think this is heading the right way.

- Please use u32 for block numbers everywhere. In a number of places you
are using int, and that may go wrong if the block numbers wrap negative
(I'm not sure that ext3 supports 8TB, but it's the right thing to do).

- Using ext3_find_next_zero_bit(bitmap_bh->b_data, ...) in
alloc_new_reservation() is risky. There are some circumstances when you
have a huge number of "free" blocks in ->b_data, but they are all unfree
in ->b_committed_data. You could end up with astronomical search
complexity in there. You should search both bitmaps to find a block
which really is allocatable. Otherwise you'll have
ext3_try_to_allocate() failing 20,000 times in succession and much CPU
will be burnt.

- I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
reduce the goto spaghetti?

- Please provide a mount option which enables the feature, defaulting to
"off".

- Make sure that you have a many-small-file test. Say, untar a kernel
tree onto a clean filesystem and make sure that reading all the files in
the tree is nice and fast.

This is to check that the reservation is being discarded appropriately
on file close, and that those small files are contiguous on-disk. If we
accidentally leave gaps in between them the many-small-file bandwidth
takes a dive.

- There's a little program called `bmap' in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
can be used to dump out a file's block allocation map, to check
fragmentation.

Apart from that, looking good. Where are the benchmarks? ;)

2004-03-30 17:15:11

by Badari Pulavarty

Subject: Re: [RFC, PATCH] Reservation based ext3 preallocation

On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:

Andrew,

> - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> alloc_new_reservation() is risky. There are some circumstances when you
> have a huge number of "free" blocks in ->b_data, but they are all unfree
> in ->b_committed_data. You could end up with astronomical search
> complexity in there. You should search both bitmaps to find a block
> which really is allocatable. Otherwise you'll have
> ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> will be burnt.

Can you explain this a little more? What do b_data and b_committed_data
represent? We are assuming that b_data will always be uptodate.

Maybe we should use ext3_test_allocatable() also.
Mingming, what was the reason for using ext3_find_next_zero_bit() only?
We had this discussion earlier, but I forgot :(

> - I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
> reduce the goto spaghetti?

will do :)

>
> - Please provide a mount option which enables the feature, defaulting to
> "off".

Sure.

>
> - Make sure that you have a many-small-file test. Say, untar a kernel
> tree onto a clean filesystem and make sure that reading all the files in
> the tree is nice and fast.
>
> This is to check that the reservation is being discarded appropriately
> on file close, and that those small files are contiguous on-disk. If we
> accidentally leave gaps in between them the many-small-file bandwidth
> takes a dive.

Hmm. Ted proposed that we should keep the reservation after file close.
We weren't sure about this either. It's on our TODO list.

>
> - There's a little program called `bmap' in
> http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> can be used to dump out a file's block allocation map, to check
> fragmentation.

Thanks, will use that. We are using debugfs for now. Do you have any
tools to dump out what's in the journal? I want to understand the log
format, etc. Just curious.

>
> Apart from that, looking good. Where are the benchmarks? ;)

We are first concentrating on the tiobench regression. We see a clear
degradation with tiobench on ext3, since it creates lots of files in the
same directory. Once we are happy with tiobench, we will go for others:
kernel untars, rawiobench, etc.

Thanks,
Badari

2004-03-30 18:12:11

by Alex Tomas

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

On Tue, 2004-03-30 at 21:07, Badari Pulavarty wrote:

> Can you explain this a little more ? What does b->data and b->commited_data
> represent ? We are assuming that b->data will always be uptodate.
>

b_data represents the actual information about used/free blocks.
b_committed_data represents blocks that were freed during the current
transaction; these blocks must not be allocated. There is a good note
about this just before ext3_test_allocatable() in balloc.c.


2004-03-30 18:14:51

by Badari Pulavarty

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

On Tuesday 30 March 2004 09:12 am, Alex Tomas wrote:
> On Tue, 2004-03-30 at 21:07, Badari Pulavarty wrote:
> > Can you explain this a little more ? What does b->data and
> > b->commited_data represent ? We are assuming that b->data will always be
> > uptodate.
>
> b_data represents actual information about used/free blocks.
> b_committed_data represents blocks that freed during current
> transaction. these blocks must not be allocated. there is good
> note about this just before ext3_test_allocatable() in balloc.c

Yes. I read the note after sending the mail.

Thanks,
Badari

2004-03-30 18:18:19

by Mingming Cao

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

On Tue, 2004-03-30 at 09:07, Badari Pulavarty wrote:
> On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:
>
> > I think this is heading the right way.
Andrew, thanks for your comments and quick response! Will make the
changes you suggested.

> Andrew,
>
> > - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> > alloc_new_reservation() is risky.

> Can you explain this a little more ? What does b->data and b->commited_data
> represent ? We are assuming that b->data will always be uptodate.
>
> May be we should use ext3_test_allocatable() also.
> Mingming, what was the reason for using ext3_find_next_zero_bit() only ?
> We had this discussion earlier, but I forgot :(

I thought that just using ext3_find_next_zero_bit() would probably be
okay since, once we get a reservation window that has a possible free
block, ext3_try_to_allocate() will check both the block group bitmap and
the copy of the last-committed bitmap inside that window range anyway
before doing the real allocation. If there is no truly free block in
both bitmaps, ext3_try_to_allocate() will fail and we will look for a
new reservation window.

But as Andrew said, this may cause ext3_try_to_allocate() to be called
unnecessarily many, many times...

We could do the same thing as in find_next_usable_block():

	/*
	 * The bitmap search --- search forward alternately through the
	 * actual bitmap and the last-committed copy until we find a
	 * bit free in both
	 */
	struct journal_head *jh = bh2jh(bh);

	while (here < maxblocks) {
		next = ext3_find_next_zero_bit(bh->b_data, maxblocks, here);
		if (next >= maxblocks)
			return -1;
		if (ext3_test_allocatable(next, bh))
			return next;
		jbd_lock_bh_state(bh);
		if (jh->b_committed_data)
			here = ext3_find_next_zero_bit(jh->b_committed_data,
						       maxblocks, next);
		jbd_unlock_bh_state(bh);
	}

Maybe make this an inline function... ext2 does not need to care about
the journalling stuff.
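The alternating two-bitmap search can be simulated at user level; in this sketch, find_zero_bit() and test_bit_at() are stand-ins for ext3_find_next_zero_bit() and friends, the two arrays stand in for b_data and b_committed_data, and journal locking is omitted:

```c
#include <assert.h>

/* Bit i is "used" when bitmap[i/8] has bit (i%8) set. */
static int test_bit_at(const unsigned char *map, int i)
{
	return (map[i / 8] >> (i % 8)) & 1;
}

/* Stand-in for ext3_find_next_zero_bit(): first clear bit >= start. */
static int find_zero_bit(const unsigned char *map, int max, int start)
{
	int i;

	for (i = start; i < max; i++)
		if (!test_bit_at(map, i))
			return i;
	return max;
}

/* Search forward alternately through both bitmaps until a bit is found
 * that is free in both; return -1 if no such bit exists. */
static int find_allocatable(const unsigned char *data,
			    const unsigned char *committed, int max)
{
	int here = 0;

	while (here < max) {
		int next = find_zero_bit(data, max, here);

		if (next >= max)
			return -1;
		if (!test_bit_at(committed, next))
			return next;	/* free in both copies */
		here = find_zero_bit(committed, max, next);
	}
	return -1;
}
```
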

> > - Make sure that you have a many-small-file test. Say, untar a kernel
> > tree onto a clean filesystem and make sure that reading all the files in
> > the tree is nice and fast.
Haven't got a chance to verify that but good point. Will do.
> >
> > This is to check that the reservation is being discarded appropriately
> > on file close, and that those small files are contiguous on-disk. If we
> > accidentally leave gaps in between them the many-small-file bandwidth
> > takes a dive.
>
> Hmm. Ted proposed that we should keep reservation after file close.
> We weren't sure about this either. Its on our TODO list.

Only some files need to keep their reservation across file open/close:
those opened with the append flag, or files like /var/log that multiple
processes frequently open/write/close. Maybe a file attribute or an
open flag could be used for this purpose. For regular files, I think
the reservation should be discarded at close. Untarring a kernel tree
just opens files for write, writes, and closes them.

>
> >
> > - There's a little program called `bmap' in
> > http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> > can be used to dump out a file's block allocation map, to check
> > fragmentation.
>
Thanks again.

Mingming



2004-03-30 18:37:29

by Andrew Morton

Subject: Re: [RFC, PATCH] Reservation based ext3 preallocation

Badari Pulavarty <[email protected]> wrote:
>
> On Tuesday 30 March 2004 01:45 am, Andrew Morton wrote:
>
> Andrew,
>
> > - Using ext3_find_next_zero_bit(bitmap_bh->b_data in
> > alloc_new_reservation() is risky. There are some circumstances when you
> > have a huge number of "free" blocks in ->b_data, but they are all unfree
> > in ->b_committed_data. You could end up with astronomical search
> > complexity in there. You should search both bitmaps to find a block
> > which really is allocatable. Otherwise you'll have
> > ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> > will be burnt.
>
> Can you explain this a little more ? What does b->data and b->commited_data
> represent ? We are assuming that b->data will always be uptodate.

The comment Alex pointed to is splendid ;)

> May be we should use ext3_test_allocatable() also.

I think so.

> > - There's a little program called `bmap' in
> > http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which
> > can be used to dump out a file's block allocation map, to check
> > fragmentation.
>
> Thanks. will use that. We are using debugfs for now. Do you have any tools
> to dump out whats in journal ? I want to understand log format etc..
> Just curious.

I cannot think of any. It wouldn't surprise me if e2fsck had a debug mode
which printed out this info, but I have not looked.

> >
> > Apart from that, looking good. Where are the benchmarks? ;)
>
> We are first concentrating on tiobench regression. We see clear
> degrade with tiobench on ext3, since it creates lots of files in the
> same directory. Once we are happy with tiobench, we go for others
> kernel untars, rawiobench etc.

OK.. dbench on SMP hardware shows poor layout also.

2004-04-03 01:40:38

by Mingming Cao

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

Hi Andrew,
Here is the second version of the ext3 reservation patch: mostly bug
fixes plus the changes you suggested last time.

It's a stable version now. We have done overnight fsx tests on a 2-CPU
PIII 700MHz box and did not see any issues. We also ran tiobench and
untar tests.

> - Please use u32 for block numbers everywhere.
I tried to make the changes as you suggested and use u32 for block
numbers everywhere, then hit a problem: in several places, especially
when allocating a block or finding a free block on the bitmap, the
function returns the block number on success and -1 to indicate
failure. We can't use u32 in those cases; ext3_new_block() and
ext3_free_block() do the same thing. So I converted block numbers from
int to u32 wherever I could, and kept int where the value can also mean
failure. Hmm... is there a good way to do this?
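One common way out of the "u32 block number vs. -1 on failure" conflict, sketched here as a suggestion rather than anything the patch does, is to carry failure only in the return value and pass the block number through an out-parameter:

```c
#include <assert.h>

typedef unsigned int u32;

/* Hypothetical bitmap search: bit i is "used" when map[i/8] has bit
 * (i%8) set. On success, return 0 and store the block number (a u32,
 * never negative) in *out; on failure, return -1. */
static int find_free(const unsigned char *map, u32 max, u32 *out)
{
	u32 i;

	for (i = 0; i < max; i++) {
		if (!((map[i / 8] >> (i % 8)) & 1)) {
			*out = i;	/* block number travels separately */
			return 0;
		}
	}
	return -1;			/* failure lives only in the int */
}
```
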

> You should search both bitmaps to find a block
> which really is allocatable. Otherwise you'll have
> ext3_try_to_allocate() failing 20,000 times in succession and much CPU
> will be burnt.
Done.

> - I suspect ext3_try_to_allocate_with_rsv() could be reorganised a bit to
> reduce the goto spaghetti?
Re-organized. I merged the conditions together and put them into a
single loop. Now it does not contain any goto :-) See if the code looks
cleaner now...

> - Please provide a mount option which enables the feature, defaulting to
> "off".
Planning to do that right after this version. Besides enabling/disabling
the reservation feature, I am thinking of letting the mount option also
set the default reservation window size (in blocks) when the fs is
mounted, with just one single mount option: "prealloc_window=n". When
n=0 it means off; when n>0 it means on, and the default ext3 reservation
window size for each file is n blocks (or 8 blocks, if 0 < n < 8).
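The proposed option semantics could be sketched like this; the helper name and the constant are illustrative, with the 0-means-off and 8-block-floor rules taken from the description above:

```c
#include <assert.h>

/* Sketch of the proposed "prealloc_window=n" semantics: n=0 disables
 * reservations entirely; any positive n below the 8-block floor is
 * rounded up to 8; larger values are used as given. */
#define RSV_MIN_WINDOW 8

/* Returns the effective default window size; 0 means "feature off". */
static unsigned int rsv_window_from_option(unsigned int n)
{
	if (n == 0)
		return 0;
	return n < RSV_MIN_WINDOW ? RSV_MIN_WINDOW : n;
}
```
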

> - Make sure that you have a many-small-file test. Say, untar a kernel
> tree onto a clean filesystem and make sure that reading all the files in
> the tree is nice and fast.
Yes, I have done this: untarred a linux-2.6.4 kernel tree onto a clean
ext3 filesystem and verified that the reservation is being discarded
and that the small files are contiguous on-disk.

> Apart from that, looking good. Where are the benchmarks? ;)
A simple dd test is about 4 times faster with the patch on a 2-way PIII
700MHz box:

# cat /tmp/dd-large.sh
dd if=/dev/zero of=x1 bs=4k count=5000 &
dd if=/dev/zero of=x2 bs=4k count=5000 &
dd if=/dev/zero of=x3 bs=4k count=5000 &
dd if=/dev/zero of=x4 bs=4k count=5000 &
dd if=/dev/zero of=x5 bs=4k count=5000 &
dd if=/dev/zero of=x6 bs=4k count=5000 &
dd if=/dev/zero of=x7 bs=4k count=5000 &
dd if=/dev/zero of=x8 bs=4k count=5000 &
vanilla linux-2.6.4 kernel:
# time /tmp/dd-large.sh
real 0m0.431s
user 0m0.001s
sys 0m0.009s

linux-2.6.4 kernel + ext3 reservation patch, window size 128 blocks:
# time /tmp/dd-large.sh
real 0m0.098s
user 0m0.001s
sys 0m0.009s

We also ran tiobench sequential write tests on ext2, jfs, and ext3 with
different numbers of threads (1, 4, 8, 16, 32 and 64); block size is
4k, file size is 4000k. For ext3, we tried different reservation sizes
from 8 to 128 blocks, as well as no reservation at all. The tests were
done on an 8-CPU PIII 700MHz i386 machine with 4GB of memory, on the
linux-2.6.4 kernel + patch and the linux-2.6.4-mm1 kernel.

Attached is a graphic showing the tiobench results. It shows that
before the patch, sequential write throughput on ext3 is pretty bad
compared with ext2 and jfs; with the patch, it is much better. The
block allocation on disk for the created files is also much less
fragmented.

Planning to do dbench and other regression tests later. Just wanted to
share the current status with you.

Patch attached below.

Thanks!
Mingming

fs/ext3/balloc.c           |  581 +++++++++++++++++++++++++++++++++++++++----
fs/ext3/file.c             |    3
fs/ext3/ialloc.c           |    6
fs/ext3/inode.c            |   14 -
fs/ext3/super.c            |    7
include/linux/ext3_fs.h    |    3
include/linux/ext3_fs_i.h  |   12
include/linux/ext3_fs_sb.h |    4
8 files changed, 578 insertions(+), 52 deletions(-)


Attachments:
ext3_reservation_9.diff (26.94 kB)
Throughput.png (28.71 kB)

2004-04-03 01:48:54

by Andrew Morton

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

Mingming Cao <[email protected]> wrote:
>
> Hi Andrew,
> Here is the second version of the ext3, mostly bug fixes and made the
> changes you have suggested last time.

Great, thanks.

> Besides enable/disable the
> reservation feature, I am thinking to enable the feature that could set
> the the default reservation window size(in blocks) when the fs is
> mounted. just one single mount option:"prealloc_window=n". When n=0,
> it means turns off, when n>0, it means on, and the ext3 default
> reservation window size for each file is n blocks(or 8 blocks, if 0< n <
> 8).

hm, maybe. We should probably also provide a per-file ext3-specific ioctl
to allow specialised apps to manipulate the reservation size.

And we should grow the reservation size dynamically. I've suggested that
we double its size each time it is exhausted, up to some limit. There may
be better algorithms though.

This work doesn't help us with the slowly-growing logfile or mailbox file
problem. I guess that would require on-disk reservations, or a new
`chattr' hint or such.

2004-04-03 02:31:57

by Mingming Cao

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> to allow specialised apps to manipulate the reservation size.
>
> And we should grow the reservation size dynamically. I've suggested that
> we double its size each time it is exhausted, up to some limit. There may
> be better algorithms though.
You mean when the reservation window is exhausted, right? I think this
is probably the easiest way, maybe like the readahead window does. It's
just that sometimes the reserved window does not contain many free
blocks to allocate, and we could easily reach the upper limit.

Currently, when we try to reserve a window in a block group and there
is no window big enough, we skip this group and move on to the next
group. I was thinking maybe we should keep track of the largest
available reservable window while we are looking for a new window, so
in case we can't find one of the expected size, we at least could get
one within the group.

This would try to keep the file inside its target group, and also
reduce the possibility of a bogus early ENOSPC. It's a trade-off:
there may be plenty of space in the next group... who knows. What do
you think?

Also, on the bogus early ENOSPC: the filesystem is probably relatively
full of reservations, so a latecomer who needs a new block but doesn't
have a reservation window will fail to allocate one. In this case, the
easiest way, as you said before, is to just steal a free block from
another file's reservation window. I agree it's an extreme case and
the solution is easy; my only small concern is that this favors the
inodes that came first and made reservations, while those that come in
later and need a new block have to suffer the same pain every time:
search the whole filesystem first, then end up stealing blocks.

Anyway, maybe by that point there are too many open files for writes
(so the fs is full of reservations) and it is already in trouble.

>
> This work doesn't help us with the slowly-growing logfile or mailbox file
> problem. I guess that would require on-disk reservations, or a new
> `chattr' hint or such.

Ted has suggested preserving the reservation/preallocation for those
slowly-growing logfiles or mailbox files: probably do not discard the
reservation window for those files (the logfiles) when they are closed.
The next time such a file is opened, it will allocate blocks directly
from the old reservation window. Is that what you had in mind?



2004-04-03 02:50:36

by Andrew Morton

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

Mingming Cao <[email protected]> wrote:
>
> On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> > hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> > to allow specialised apps to manipulate the reservation size.
> >
> > And we should grow the reservation size dynamically. I've suggested that
> > we double its size each time it is exhausted, up to some limit. There may
> > be better algorithms though.
> You mean when the reservation window size is exhausted, right? I think
> this is probably the easiest way. Maybe like the readahead window does.
> Just sometimes the window reserved does not contains much free blocks to
> allocate, and we could easily reach to the upper limit.

Good point. So the reservation should be grown by "the number of blocks we
allocated in the previous window", not by "the size of the previous
window", yes?

> Currently, when try to reserve a window in a block group, if there is no
> window big enough for this, we skip this group and move on to the next
> group. I was thinking maybe we should keep track of the largest
> avaliable reservable window when we are looking for a new window, so in
> case we can't find the one with expected size, we at least could get one
> within the group.

I suspect that if you cannot get a window in the blockgroup then simply
skipping to the next blockgroup should be OK.

But I don't understand why the reservation code needs to know about
blockgroups at all, at least from a conceptual point of view.

> This will try to keep the file inside it's target group, and also reduce
> the possibility of bogus earlier ENOSPC. Just it's a trade off:there
> maybe plenty of space in the next group......who knows. What do you
> think?

Probably it's sufficient to use the inode's blockgroup's starting block as
the initial target for allocations and then just forget about blockgroups.
Simply let allocation wander further up the disk from there, with no
further consideration of blockgroups.

> Also, for the the bogus earlier ENOSPC, : the filesystem probably
> relatively full of reservations, so the late guy who need a new block
> but don't have a reservation window will failed to allocate a block. In
> this case, the easiest way as you said before is to just steal a free
> block from other file's reservation window. I agree it's the extreme
> case and the solution is easy, just a little concern that this will in
> favor of those inodes who came first and made reservations, but whose
> who come in later need a new block, every time has to suffer the same
> pain: search the whole filesystem first, then end of steal blocks every
> time.

It would be fairly weird for the entire disk to be covered by reservations,
so falling back to the current algorithm would be OK.

> Anyway maybe by that moment the there are too many open files for
> writes(so the fs is full of reservations) so it is already in trouble.
>
> >
> > This work doesn't help us with the slowly-growing logfile or mailbox file
> > problem. I guess that would require on-disk reservations, or a new
> > `chattr' hint or such.
>
> Ted has suggested to preserve the reservation/preallocation for those
> slowing growing logfile for mailbox file. Probably do not discard the
> reservation window for those files(the logfile) when they are closed.
> When it opens next time, it will allocate blocks directly from the old
> reservation window. Is that what you think?

yup, except we now have potentially millions of inodes which have active
reservations. ENOSPC and CPU consumption problems are certain.

Some combination of

- A chattr hint

- Using O_APPEND as a hint and

- Retaining an upper limit on the number of unopened inodes which have a
reservation

should fix that up. You'd need to hook into ->destroy_inode to release
reservations when inodes are reclaimed by the VM.

But this is surely phase two material.


2004-04-05 16:43:15

by Mingming Cao

Subject: Re: [Ext2-devel] Re: [RFC, PATCH] Reservation based ext3 preallocation

On Fri, 2004-04-02 at 18:50, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > On Fri, 2004-04-02 at 17:50, Andrew Morton wrote:
> > > hm, maybe. We should probably also provide a per-file ext3-specific ioctl
> > > to allow specialised apps to manipulate the reservation size.
> > >
> > > And we should grow the reservation size dynamically. I've suggested that
> > > we double its size each time it is exhausted, up to some limit. There may
> > > be better algorithms though.
> > You mean when the reservation window size is exhausted, right? I think
> > this is probably the easiest way. Maybe like the readahead window does.
> > Just sometimes the window reserved does not contains much free blocks to
> > allocate, and we could easily reach to the upper limit.
>
> Good point. So the reservation should be grown by "the number of blocks we
> allocated in the previous window", not by "the size of the previous
> window", yes?
>
Yes. Maybe we add a counter to the reservation structure to keep track
of the reservation hits. Then when a new window needs to be created, we
look at the old window's hit count to determine what the window size
should be next time.
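That heuristic might be sketched as follows; the constants and the grow-by-twice-the-hits rule are assumptions for illustration, not anything from the patch:

```c
#include <assert.h>

/* Sketch of the dynamic-growth idea discussed above: size the next
 * window from the number of blocks actually allocated ("hits") in the
 * previous one rather than blindly doubling the window size, with a
 * floor and a cap. All constants here are made up. */
#define RSV_MIN_WINDOW 8
#define RSV_MAX_WINDOW 1024

static unsigned int next_window_size(unsigned int prev_hits)
{
	unsigned int next = prev_hits * 2;	/* grow with demonstrated demand */

	if (next < RSV_MIN_WINDOW)
		next = RSV_MIN_WINDOW;		/* cold files stay small */
	if (next > RSV_MAX_WINDOW)
		next = RSV_MAX_WINDOW;		/* cap hot files */
	return next;
}
```

A window that saw few hits thus stays small, which addresses the concern above about reaching the upper limit too easily.
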

> > Currently, when try to reserve a window in a block group, if there is no
> > window big enough for this, we skip this group and move on to the next
> > group. I was thinking maybe we should keep track of the largest
> > avaliable reservable window when we are looking for a new window, so in
> > case we can't find the one with expected size, we at least could get one
> > within the group.
>
> I suspect that if you cannot get a window in the blockgroup then simply
> skipping to the next blockgroup should be OK.
>
Okay.

> But I don't understand why the reservation code needs to know about
> blockgroups at all, at least from a conceptual point of view.
>
Agree that the reservation itself is a filesystem-wide concept. The
reservation window can cross a block group boundary.

> Probably it's sufficient to use the inode's blockgroup's starting block as
> the initial target for allocations and then just forget about blockgroups.
> Simply let allocation wander further up the disk from there, with no
> further consideration of blockgroups.
I think the current code's logic is the same as what you said. The
logic of the current code is: given a goal block, try to allocate a
block starting from there within the inode's block group. If that
fails, simply move on to the next group without a goal -- the search
for a free block will start from the first block of the next group. I
was trying to keep the same logic as before. So for the reservation
code, given a goal block, we will try to allocate a new reservation
window (and then allocate a block within it) starting from the given
goal block. If that fails, we will simply do reservation window
allocation in the rest of the disk, without regard to the inode's
blockgroup.

>
> It would be fairly weird for the entire disk to be covered by reservations,
> so falling back to the current algorithm would be OK.
Okay.

> > > This work doesn't help us with the slowly-growing logfile or mailbox file
> > > problem. I guess that would require on-disk reservations, or a new
> > > `chattr' hint or such.
> >
> > Ted has suggested preserving the reservation/preallocation for those
> > slowly growing logfiles or mailbox files. Probably do not discard the
> > reservation window for those files (the logfiles) when they are closed.
> > When such a file opens next time, it will allocate blocks directly from
> > the old reservation window. Is that what you think?
>
> yup, except we now have potentially millions of inodes which have active
> reservations. ENOSPC and CPU consumption problems are certain.
>
> Some combination of
>
> - A chattr hint
>
> - Using O_APPEND as a hint and
>
> - Retaining an upper limit on the number of unopened inodes which have a
> reservation
>
> should fix that up. You'd need to hook into ->destroy_inode to release
> reservations when inodes are reclaimed by the VM.
>
> But this is surely phase two material.
Okay. Will think about this more later...

Thanks for your help!

Mingming



2004-04-14 00:46:00

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 0/4] ext3 block reservation patch set

Hello,

Here is a set of patches which implement the in-memory ext3 block
reservation (previously called reservation based ext3 preallocation).

[patch 1]ext3_rsv_cleanup.patch: Cleans up the old ext3 preallocation
code carried from ext2 but turned off.

[patch 2]ext3_rsv_base.patch: Implements the base of in-memory block
reservation and block allocation from reservation window.

[patch 3]ext3_rsv_mount.patch: Adds features on top of the
ext3_rsv_base.patch:
- deal with earlier bogus -ENOSPC error
- do block reservation only for regular files
- make the ext3 reservation feature a mount option:
new mount option added: reservation
- A pair of file ioctl commands are added for application to control
the block reservation window size.

[patch 4]ext3_rsv_dw.patch: adjust the reservation window size
dynamically:
Start from the default reservation window size; if the hit ratio of
the reservation window is more than 50%, we double the reservation
window size next time, up to a certain upper limit.

Here are some numbers collected with dbench on an 8-way PIII 700MHz:

dbench average throughputs on 4 runs
==================================================
Threads ext3 ext3+rsv(8) ext3+rsv+dw
1 103 104(0%) 105(1%)
4 144 286(98%) 256(77%)
8 118 197(66%) 210(77%)
16 113 160(41%) 177(56%)
32 61 123(101%) 150(145%)
64 41 82(100%) 85(107%)

And some numbers on tiobench sequential write:

tiobench Sequential Writes throughput (improvements)
=====================================================================
Threads ext2 ext3 ext3+rsv(8)(%) ext3+rsv(128)(%) ext3+rsv+dw(%)
1 26 23 25(8%) 26(13%) 26(13%)
4 17 4 14(250%) 24(500%) 25(525%)
8 15 7 13(85%) 23(228%) 24(242%)
16 16 13 12(-7%) 22(69%) 24(84%)
32 15 3 12(300%) 23(666%) 23(666%)
64 14 1 11(1000%) 22(2100%) 23(2200%)

Note: each run of the test was on a freshly created ext3 filesystem.

We have also run fsx tests on an 8-way with the 2.6.4 kernel plus the
patch set for a whole weekend on a freshly created ext3 filesystem, as
well as on a 4-way with the root filesystem as ext3 plus all the
changes. Other tests include 8-thread dd tests and untarring a kernel
source tree.

Besides looking at the performance numbers and verifying the
functionality, we also checked the block allocation layout for each file
generated during the test: the blocks for a file are more contiguous
with the reservation mount option on, especially when we dynamically
increase the reservation window size in the sequential write cases.

Andrew, is this something that you would consider for -mm tree?

Thanks again to Andrew, Ted and Badari for their ideas and help on this
project. I would really appreciate any comments and feedback.


Mingming


2004-04-14 00:48:32

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 1/4] ext3 block reservation patch set -- ext3 preallocation cleanup

> [patch 1]ext3_rsv_cleanup.patch: Cleans up the old ext3 preallocation
> code carried from ext2 but turned off.

diffstat ext3_rsv_cleanup.patch
 fs/ext3/balloc.c          |    3 -
 fs/ext3/file.c            |    2 -
 fs/ext3/ialloc.c          |    4 --
 fs/ext3/inode.c           |   91 ----------------------------------------------
 fs/ext3/xattr.c           |    2 -
 include/linux/ext3_fs.h   |    9 ----
 include/linux/ext3_fs_i.h |    4 --
 7 files changed, 4 insertions(+), 111 deletions(-)



Attachments:
ext3_rsv_cleanup.patch (7.35 kB)

2004-04-14 00:52:48

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 3/4] ext3 block reservation patch set --mount and ioctl feature

> [patch 3]ext3_rsv_mount.patch: Adds features on top of the
> ext3_rsv_base.patch:
> - deal with earlier bogus -ENOSPC error
> - do block reservation only for regular file
> - make the ext3 reservation feature as a mount option:
> new mount option added: reservation
> - A pair of file ioctl commands are added for application to control
> the block reservation window size.
>
diffstat ext3_rsv_mount.patch
 fs/ext3/balloc.c          |   49 ++++++++++++++++++++++++++++++++++++----------
 fs/ext3/ialloc.c          |    2 -
 fs/ext3/inode.c           |    2 -
 fs/ext3/ioctl.c           |   23 +++++++++++++++++++++
 fs/ext3/super.c           |   20 ++++++++++++++++--
 include/linux/ext3_fs.h   |   42 ++++++++++++++++++++++-----------------
 include/linux/ext3_fs_i.h |    2 -
 7 files changed, 107 insertions(+), 33 deletions(-)


Attachments:
ext3_rsv_mount.patch (11.75 kB)

2004-04-14 00:51:28

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 2/4] ext3 block reservation patch set --ext3 block reservation

> [patch 2]ext3_rsv_base.patch: Implements the base of in-memory block
> reservation and block allocation from reservation window.
-basic reservation structure and operations
-reservation based ext3 block allocation

diffstat ext3_rsv_base.patch
 fs/ext3/balloc.c           |  548 ++++++++++++++++++++++++++++++++++++++++-----
 fs/ext3/file.c             |    2
 fs/ext3/ialloc.c           |    5
 fs/ext3/inode.c            |    9
 fs/ext3/super.c            |    7
 include/linux/ext3_fs.h    |    6
 include/linux/ext3_fs_i.h  |   12
 include/linux/ext3_fs_sb.h |    4
 8 files changed, 542 insertions(+), 51 deletions(-)


Attachments:
ext3_rsv_base.patch (26.41 kB)

2004-04-14 00:54:01

by Mingming Cao

[permalink] [raw]
Subject: [PATCH 4/4] ext3 block reservation patch set -- dynamically increase reservation window

> [patch 4]ext3_rsv_dw.patch: adjust the reservation window size
> dynamically:
> Start from the default reservation window size; if the hit ratio of
> the reservation window is more than 50%, we will double the reservation
> window size next time, up to a certain upper limit.

diffstat ext3_rsv_dw.patch
fs/ext3/balloc.c | 13 ++++++++++++-
fs/ext3/ialloc.c | 3 ++-
fs/ext3/super.c | 1 +
include/linux/ext3_fs_i.h | 1 +
4 files changed, 16 insertions(+), 2 deletions(-)


Attachments:
ext3_rsv_dw.patch (2.77 kB)

2004-04-14 02:48:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

Mingming Cao <[email protected]> wrote:
>
> Here is a set of patches which implement the in-memory ext3 block
> reservation (previously called reservation based ext3 preallocation).

Great, thanks. Let's get these in the pipeline.

A few thoughts, from a five-minute read:


- The majority of in-core inodes are not open for reading, and we've
added 24 bytes to the inode just for inodes which are open for writing.

At some stage we should stop aggregating struct reserve_window into the
inode and dynamically allocate it. We can move i_next_alloc_block,
i_next_alloc_goal and possibly other fields in there too.

At which point it has the wrong name ;) Should be `write_state' or
something.

It's not clear when we should free up the write_state. I guess we
could leave it around for the remaining lifetime of the inode - that'd
still be a net win.

Is this something you can look at as a low-priority activity?

- You're performing ext3_discard_reservation() in ext3_release_file().
Note that the file may still have pending allocations at this stage: say,
open a file, map it MAP_SHARED, dirty some pages which lie over file
holes then close the file again.

Later, the VM will come along and write those dirty pages into the
file, at which point allocations need to be performed. But we have no
reservation data and, later, we may have no inode->write_state at all.

What will happen?

- Have you tested and profiled this with a huge number of open files? At
what stage do we get into search complexity problems?

- Why do we discard the file's reservation on every iput()? iput's are
relatively common operations. (see fs/fs-writeback.c)

- What locking protects rsv_alloc_hit? i_sem is not held during
VM-initiated writeout. Maybe an atomic_t there, or just say that if we
race and the number is a bit inaccurate, we don't care?

2004-04-14 16:23:29

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Tuesday 13 April 2004 07:47 pm, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> > Here is a set of patches which implement the in-memory ext3 block
> > reservation (previously called reservation based ext3 preallocation).
>
> Great, thanks. Let's get these in the pipeline.
>
> A few thoughts, from a five-minute read:
>
>
> - The majority of in-core inodes are not open for reading, and we've
> added 24 bytes to the inode just for inodes which are open for writing.
>
> At some stage we should stop aggregating struct reserve_window into the
> inode and dynamically allocate it. We can move i_next_alloc_block,
> i_next_alloc_goal and possibly other fields in there too.
>
> At which point it has the wrong name ;) Should be `write_state' or
> something.
>
> It's not clear when we should free up the write_state. I guess we
> could leave it around for the remaining lifetime of the inode - that'd
> still be a net win.
>
> Is this something you can look at as a low-priority activity?

Good point!! We will surely look at it.
>
> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
..
> - Why do we discard the file's reservation on every iput()? iput's are
> relatively common operations. (see fs/fs-writeback.c)

We just followed the old prealloc code: wherever preallocation was
dropped, we dropped the reservation. Maybe that's overkill. We will look at it.

What's the best place to drop the reservation?

> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?

It's on our TODO list. But our original thought was that we only have to
search the current block group's reservations to get a window. So, if we
have lots and lots of reservations in a single block group, the search
gets complicated. We were thinking of adding (dummy) anchors in the list
to represent the beginning of each block group, so that we can get to
the start of a block group quickly. But so far, we haven't done anything.

We are also looking at RB trees to see how we can make use of them. Our
problem is that we are interested in finding a big enough hole in the
tree to put our reservation in. We need to look closely.


> - What locking protects rsv_alloc_hit? i_sem is not held during
> VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> race and the number is a bit inaccurate, we don't care?

We need to at least change it to an atomic_t.

Mingming, I don't see any check to enforce the maximum. Am I missing something?

We really appreciate your comments.

Thanks,
Badari



2004-04-14 16:54:50

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Tuesday 13 April 2004 07:47 pm, Andrew Morton wrote:

> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
>
> Later, the VM will come along and write those dirty pages into the
> file, at which point allocations need to be performed. But we have no
> reservation data and, later, we may have no inode->write_state at all.
>
> What will happen?

Block allocations can happen after ext3_release_file()? In that case,
we would have dropped all our reservations at the time of the last file
close. But if allocations happen later, the current code will start a
new reservation window and allocate from there.

> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?

Come to think of it, the current code has a pretty bad search algorithm.
We need to fix that. We hold the spinlock for the entire search; that's
why our CPU utilization is pretty high.

Thanks,
Badari

2004-04-14 17:24:19

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Tue, 2004-04-13 at 19:47, Andrew Morton wrote:
> Mingming Cao <[email protected]> wrote:
> >
> > Here is a set of patches which implement the in-memory ext3 block
> > reservation (previously called reservation based ext3 preallocation).
>
> Great, thanks. Let's get these in the pipeline.
>
> A few thoughts, from a five-minute read:
>
>
> - The majority of in-core inodes are not open for reading, and we've
> added 24 bytes to the inode just for inodes which are open for writing.
Yes, the structure gets bigger as we add more stuff into it. It may not
be worth putting it inside the ext3_inode_info structure just for files
open for write... I agree!
>
> At some stage we should stop aggregating struct reserve_window into the
> inode and dynamically allocate it. We can move i_next_alloc_block,
> i_next_alloc_goal and possibly other fields in there too.
>
> At which point it has the wrong name ;) Should be `write_state' or
> something.
>
> It's not clear when we should free up the write_state. I guess we
> could leave it around for the remaining lifetime of the inode - that'd
> still be a net win.
We could free up the write_state at the time of ext3_discard_reservation()
(not at the time when we allocate a new reservation window),

or, later, if we preserve reservations for slowly growing files, we could
release the write_state at the time the inode is released.

> Is this something you can look at as a low-priority activity?
>
Sure!
> - You're performing ext3_discard_reservation() in ext3_release_file().
> Note that the file may still have pending allocations at this stage: say,
> open a file, map it MAP_SHARED, dirty some pages which lie over file
> holes then close the file again.
>
> Later, the VM will come along and write those dirty pages into the
> file, at which point allocations need to be performed. But we have no
> reservation data and, later, we may have no inode->write_state at all.
>
> What will happen?
>
In this case, we will allocate a new reservation window for it.
Nothing bad will happen; we probably just waste a previously allocated
reservation window... but I am not sure.

My question is: if the file is opened for the first time, mapped, and we
dirty pages in the file hole, will there really be any disk block
allocation involved? If not, we do not have a reservation window at all,
and ext3_discard_reservation() will detect that and do nothing.

> - Have you tested and profiled this with a huge number of open files? At
> what stage do we get into search complexity problems?
>
Not yet. The current search complexity is O(n): if you don't have a
reservation, you need O(n) to move the search head to the place where
you want to search for a new reservation; finding the hole size between
two reservation windows is just O(1) for a sorted doubly linked list,
and we need O(n) to look for a reservable window after that, so the
complexity is O(n) + O(1) * O(n) = O(n).
If you already have an old reservation, we remember where to start
the search, so the complexity is O(1) + O(n).

The current implementation is worse than O(n): every time an inode does
not have a reservation window, it searches from the head of the
per-filesystem reservation window list. If it fails within the group, it
moves to the next group and starts the search from the head of the list
again.

This could be fixed by forgetting about block group boundaries
altogether (removing the for loop in ext3_new_block) and making it
search for a block filesystem-wide :)

I have a concern about a red-black tree: it takes O(log n) to get to
where you want to start, but it also takes an O(log n) comparison to
find the hole size between two adjacent windows. And to find a
reservable window, we need to browse the whole red-black tree in the
worst case, so the complexity is
O(log n) + O(log n) * O(n) = O(n log n)

Am I right?

> - Why do we discard the file's reservation on every iput()? iput's are
> relatively common operations. (see fs/fs-writeback.c)
>
Yes... you are right! I intended to call ext3_discard_reservation only
when the usage count of the inode is 0. I looked at the ext2
preallocation code; it calls ext2_discard_preallocation in
ext2_put_inode(), so I thought that was the place. But it seems
ext3_put_inode() is called every time iput() is called. We should call
ext3_discard_reservation in iput_final(). We should fix this in ext2 too.

> - What locking protects rsv_alloc_hit? i_sem is not held during
> VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> race and the number is a bit inaccurate, we don't care?
>
Currently no lock protects rsv_alloc_hit. The reason is that it is just
a heuristic indicator of whether we should enlarge the reservation
window size next time. Even the hit ratio (50%) is just a rough guess,
so a little inaccuracy would not hurt much; adding another lock is
probably not worth it.

Thanks,

Mingming

2004-04-14 17:37:33

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Wed, 2004-04-14 at 09:11, Badari Pulavarty wrote:
> > - What locking protects rsv_alloc_hit? i_sem is not held during
> > VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> > race and the number is a bit inaccurate, we don't care?
>
> Mingming, I don't see any check to enforce the maximum. Am I missing something?

Nice catch! I forgot to check the maximum when growing the window
size... fixed it. :)

Thanks.


2004-04-14 23:03:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

Badari Pulavarty <[email protected]> wrote:
>
> > - Why do we discard the file's reservation on every iput()? iput's are
> > relatively common operations. (see fs/fs-writeback.c)
>
> We just followed old prealloc code. Where ever preallocation is dropped
> we dropped reservation. May be thats overkill. We will look at it.
>
> Whats the best place to drop the reservation ?

You know, I wish I had an easy answer to that, but I don't. It's a matter
of sticking a printk in there, running careful tests, making sure that
we're doing the right thing at the right time.

As we discussed earlier, it could be that in some situations we should
hold onto the reservation window after the file has been closed: the
slowly-growing mbox or logfile problem. But without causing bandwidth
regressions in the many-small-files scenario.

> > - Have you tested and profiled this with a huge number of open files? At
> > what stage do we get into search complexity problems?
>
> In our TODO list. But our original thought was, we have to search only the
> current block group reservations to get a window. So, if we have lots & lots
> of reservations in a single block group - search gets complicated. We were
> thinking of adding (dummy) anchors in the list to represent begining of each
> block group, so that we can get to the start of a block group quickly. But
> so far, we haven't done anything.

hm, I need to look at the new code more closely. I was hoping that we
could divorce the reservation windows from any knowledge of blockgroups.
Is that not the case?

> We are also looking at RB tree and see how we can make use of it. Our problem
> is, we are interested in finding out a big enough hole in the tree to put our
> reservation. We need to look closely.

This sounds awfully like get_unmapped_area().


2004-04-14 23:09:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

Mingming Cao <[email protected]> wrote:
>
> > It's not clear when we should free up the write_state. I guess we
> > could leave it around for the remaining lifetime of the inode - that'd
> > still be a net win.

> We could free up the write_state at the time of ext3_discard_allocation(),
> (not at the time when we allocate a new reservation window)
>
> or later if we preserve reservation for slow growing files, we release
> the write_state at the time the inode is released.

That sounds appropriate.

> > - You're performing ext3_discard_reservation() in ext3_release_file().
> > Note that the file may still have pending allocations at this stage: say,
> > open a file, map it MAP_SHARED, dirty some pages which lie over file
> > holes then close the file again.
> >
> > Later, the VM will come along and write those dirty pages into the
> > file, at which point allocations need to be performed. But we have no
> > reservation data and, later, we may have no inode->write_state at all.
> >
> > What will happen?
> >
> In this case, we will allocation a new reservation window for it.
> Nothing bad will happen. We probably just waste a previously allocated
> reservation window...but I am not sure.
>
> My question is, if the file is first time opened, mapped, and we dirty
> pages in the file hole, will there any really disk block allocation
> involved there?

There might be, and there might not be. It depends on timing, memory
pressure, application activity, etc.

> The current implementation is worse than O(n): every time an inode does
> not have a reservation window, it searches from the head of the
> per-filesystem reservation window list. If it fails within the group,
> it moves to the next group and starts the search from the head of the
> list again.

Same problem exists in arch_get_unmapped_area(). We have a funny little
heuristic (free_area_cache) in there to speed up the common case.

> This could be fixed by forgetting about block group boundaries
> altogether (removing the for loop in ext3_new_block) and making it
> search for a block filesystem-wide :)

I do think we should do this. Does it have any disadvantages?

> I have a concern about a red-black tree: it takes O(log n) to get to
> where you want to start, but it also takes an O(log n) comparison to
> find the hole size between two adjacent windows. And to find a
> reservable window, we need to browse the whole red-black tree in the
> worst case, so the complexity is
> O(log n) + O(log n) * O(n) = O(n log n)
>
> Am I right?

Think so. rbtrees are optimised for lookup, not for
get-me-a-suitably-sized-hole.


> > - Why do we discard the file's reservation on every iput()? iput's are
> > relatively common operations. (see fs/fs-writeback.c)
> >
> Yes... you are right! I intended to call ext3_discard_reservation only
> when the usage count of the inode is 0. I looked at the ext2
> preallocation code; it calls ext2_discard_preallocation in
> ext2_put_inode(), so I thought that was the place. But it seems
> ext3_put_inode() is called every time iput() is called. We should call
> ext3_discard_reservation in iput_final(). We should fix this in ext2 too.

Could be so.

> > - What locking protects rsv_alloc_hit? i_sem is not held during
> > VM-initiated writeout. Maybe an atomic_t there, or just say that if we
> > race and the number is a bit inaccurate, we don't care?
> >
> Currently no lock protects rsv_alloc_hit. The reason is that it is just
> a heuristic indicator of whether we should enlarge the reservation
> window size next time. Even the hit ratio (50%) is just a rough guess,
> so a little inaccuracy would not hurt much; adding another lock is
> probably not worth it.

I'd agree with that.

2004-04-14 23:29:29

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Wednesday 14 April 2004 04:02 pm, Andrew Morton wrote:


> > In our TODO list. But our original thought was, we have to search only
> > the current block group reservations to get a window. So, if we have lots
> > & lots of reservations in a single block group - search gets complicated.
> > We were thinking of adding (dummy) anchors in the list to represent
> > begining of each block group, so that we can get to the start of a block
> > group quickly. But so far, we haven't done anything.
>
> hm, I need to look at the new code more closely. I was hoping that we
> could divorce the reservation windows from any knowledge of blockgroups.
> Is that not the case?

The reservation window code kind of knows the group boundaries. The
reason we did this was that we wanted to fit it into the existing
ext3_get_newblock() code easily. ext3_get_newblock() operates on each
group and passes a bitmap for each group to work on. The current code
looks for a reservation window in the given group (since we need the
bitmap to verify that there is something allocatable in that group).
To make the reservation window ignore groups, we may need to do some
major surgery on ext3_get_newblock().


> > We are also looking at RB tree and see how we can make use of it. Our
> > problem is, we are interested in finding out a big enough hole in the
> > tree to put our reservation. We need to look closely.

> This sounds awfully like get_unmapped_area().

That was the first place I looked; I need to look at it one more time to
see if we can reuse the logic.

Thanks,
Badari

2004-04-14 23:37:38

by Mingming Cao

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set

On Wed, 2004-04-14 at 16:07, Andrew Morton wrote:

> > The current implementation is worse than O(n): every time an inode
> > does not have a reservation window, it searches from the head of the
> > per-filesystem reservation window list. If it fails within the group,
> > it moves to the next group and starts the search from the head of the
> > list again.
>
> Same problem exists in arch_get_unmapped_area(). We have a funny little
> heuristic (free_area_cache) in there to speed up the common case.
Actually, we only hit this worse-than-O(n) case when the file has just
been opened for write (without a reservation window). In the normal
case, if we have an old reservation window, we start from the old
window, instead of from the head of the whole filesystem list, to search
for the next reservable hole.

>
> > This could be fixed by forgetting about block group boundaries
> > altogether (removing the for loop in ext3_new_block) and making it
> > search for a block filesystem-wide :)
>
> I do think we should do this. Does it have any disadvantages?
>
I re-looked at the code today; my concern is that we may end up with big
changes to the existing code. Need to think about it more...



2004-04-21 23:28:21

by Mingming Cao

[permalink] [raw]
Subject: [PATCH] Lazy discard ext3 reservation window patch

Andrew,

This patch contains several changes against the ext3 reservation code in
the 2.6.5-mm6 tree:

Lazy Discard Reservation Window:
This patch does lazy discard: keep the old reservation window
temporarily until we find the new reservation window, and only do a
remove/add if the new reservation window is located somewhere different
from the old one. (The reservation code in the mm6 tree discards the old
one first, then searches for the new one.) Two reasons:
- If ext3_find_goal() does a good job, the reservation windows on the
list should not be very close to each other. So an inode's new
reservation window is likely located just next to its old one; its
position in the whole list is unchanged, so there is no need to remove
the old one and then add the new one to the list in the same location.
Just update the start block and end block.
- If we fail to find a new reservation in the goal group and move the
search on to the next group, having the old reservation around
temporarily allows us to search the list directly after the old window.
Otherwise we lose where we were and have to start from the beginning of
the list. Eventually the old window is discarded when we find a new one.

Other changes:
- Add a check to enforce the maximum when dynamically increasing the
window size.
- ext3_discard_reservation() should not be called on every iput(). It is
now moved to ext3_delete_inode(), so it is only called on the last
iput() when i_nlink is 0.
- Remove #ifdef EXT3_RESERVATION, since we made reservation a mount
option.
- Only allow applications to modify a file's reservation window size
when the fs is mounted with reservation and the operation is performed
on regular files.

This patch should apply to 2.6.5-mm6. I have tested it through many dd
tests, untar tests, dbench and tiobench.

Thanks!

Mingming


Attachments:
ext3_reservation_lazydiscard.patch (6.93 kB)

2004-04-27 15:27:56

by Mary Edie Meredith

[permalink] [raw]
Subject: Re: [PATCH 0/4] ext3 block reservation patch set


To test the benefit of Mingming's ext3 reservation
patchset, we ran tiobench on 2-way systems on STP
using 2.6.6-rc2-mm1 versus 2.6.6-rc2-mm1 patched to
force the ext3 filesystem to be built without
reservation.

The results show increased throughput for more than
one thread, not only for sequential write but also
for random write, sequential read, and random read.
Latency is also decreased in all cases.

Raw data can be found:
-2 way 2.6.6-rc2-mm1
http://khack.osdl.org/stp/292223/results/tiobench-ext3.txt
-2 way 2.6.6-rc2-mm1 noreservation default
http://khack.osdl.org/stp/292225/results/tiobench-ext3.txt

Judith compared the two runs by plotting the
results at: http://developer.osdl.org/judith/tiobench/ext3-reserve/

Here are some interesting ones:
Throughput results:
-Random write throughput 128k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.RW.128.png
-Random write throughput 4k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.RW.4.png
-Sequential write throughput 4k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.SW.4.png
-Sequential write throughput 128k
http://developer.osdl.org/judith/tiobench/ext3-reserve/through.ext3.2CPU.SW.128.png

Latency is reduced almost across the board.
-Example: Latency figures for Random write 4k:
http://developer.osdl.org/judith/tiobench/ext3-reserve/lat.ext3.2CPU.RW.4.png

Mary Edie Meredith
Open Source Development Labs
503-626-2455 x42
[email protected]

Mingming Cao wrote:
> Hello,
>
> Here is a set of patches which implement the in-memory ext3 block
> reservation (previously called reservation based ext3 preallocation).
>