2001-11-05 02:18:49

by Andrew Morton

Subject: disk throughput

I've been taking a look at one particular workload - the creation
and use of many smallish files. ie: the things we do every day
with kernel trees.

There are a number of things which cause linux to perform quite badly
with this workload. I've fixed most of them and the speedups are quite
dramatic. The changes include:

- reorganise the sync_inode() paths so we don't do
read-a-block,write-a-block,read-a-block, ...

- asynchronous preread of an ext2 inode's block when we
create the inode, so:

a) the reads will cluster into larger requests and
b) during inode writeout we don't keep stalling on
reads, preventing write clustering.

- Move ext2's bitmap loading code outside lock_super(),
so other threads can still get in and write to the
fs while the reader sleeps, thus increasing write
request merging. This benefits multithreaded workloads
(ie: dbench) and needs more thought.
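
To make the asynchronous inode-block preread above concrete, here is a
minimal sketch in 2.4-era style. It is not the actual patch:
ext2_get_group_desc(), getblk(), ll_rw_block() and the EXT2_* macros are
real 2.4 interfaces, but this particular function, and the idea of calling
it from ext2_new_inode(), are illustrative only.

#include <linux/fs.h>
#include <linux/ext2_fs.h>

/*
 * Sketch: start an asynchronous read of the inode-table block which
 * holds a freshly allocated inode, so the read can cluster with its
 * neighbours now instead of stalling inode writeout later.
 */
static void ext2_preread_inode(struct inode *inode)
{
	unsigned long block_group, offset, block;
	struct ext2_group_desc *gdp;
	struct buffer_head *gd_bh, *bh;
	struct super_block *sb = inode->i_sb;

	block_group = (inode->i_ino - 1) / EXT2_INODES_PER_GROUP(sb);
	gdp = ext2_get_group_desc(sb, block_group, &gd_bh);
	if (!gdp)
		return;

	/* which block of the group's inode table contains this inode? */
	offset = ((inode->i_ino - 1) % EXT2_INODES_PER_GROUP(sb)) *
		 EXT2_INODE_SIZE(sb);
	block = le32_to_cpu(gdp->bg_inode_table) +
		(offset >> EXT2_BLOCK_SIZE_BITS(sb));

	bh = getblk(sb->s_dev, block, sb->s_blocksize);
	if (bh) {
		if (!buffer_uptodate(bh))
			ll_rw_block(READA, 1, &bh);	/* async, don't wait */
		brelse(bh);
	}
}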

The above changes are basically a search-and-destroy mission against
the things which are preventing effective writeout request merging.
Once they're in place we also need:

- Alter the request queue size and elvtune settings


The time to create 100,000 4k files (10 per directory) has fallen
from 3:09 (3min 9second) down to 0:30. A six-fold speedup.


All well and good, and still a WIP. But by far the most dramatic
speedups come from disabling ext2's policy of placing new directories
in a different blockgroup from the parent:

--- linux-2.4.14-pre8/fs/ext2/ialloc.c Tue Oct 9 21:31:40 2001
+++ linux-akpm/fs/ext2/ialloc.c Sun Nov 4 17:40:43 2001
@@ -286,7 +286,7 @@ struct inode * ext2_new_inode (const str
repeat:
gdp = NULL; i=0;

- if (S_ISDIR(mode)) {
+ if (0 && S_ISDIR(mode)) {
avefreei = le32_to_cpu(es->s_free_inodes_count) /
sb->u.ext2_sb.s_groups_count;
/* I am not yet convinced that this next bit is necessary.


Numbers. The machine has 768 megs; the disk is IDE with a two meg cache.
The workload consists of untarring, tarring, diffing and removing kernel
trees. This filesystem is 21 gigs, and has 176 block groups.


After each test which wrote data a `sync' was performed, and was included
in the timing under the assumption that all the data will be written back
by kupdate in a few seconds, and running `sync' allows measurement of the
cost of that.

The filesystem was unmounted between each test - all tests are with
cold caches.

                                 stock      patched
untar one kernel tree, sync:      0:31         0:14
diff two trees:                   3:04         1:12
tar kernel tree to /dev/null:     0:15         0:03
remove 2 kernel trees, sync:      0:30         0:10

A significant thing here is the improvement in read performance as well
as writes. All of the other speedup changes only affect writes.

We are paying an extremely heavy price for placing directories in
different block groups from their parent. Why do we do this, and
is it worth the cost?

-


2001-11-05 03:20:26

by Mohammad A. Haque

Subject: Re: disk throughput


On Sunday, November 4, 2001, at 09:13 PM, Andrew Morton wrote:

> The time to create 100,000 4k files (10 per directory) has fallen
> from 3:09 (3min 9second) down to 0:30. A six-fold speedup.

Nice.

How long before you release a patch? I have a couple of tasks we execute
at work I'd like to throw at it.

--

=====================================================================
Mohammad A. Haque http://www.haque.net/
[email protected]

"Alcohol and calculus don't mix. Developer/Project Lead
Don't drink and derive." --Unknown http://wm.themes.org/
[email protected]
=====================================================================

2001-11-05 03:32:57

by Mike Fedyk

Subject: Re: [Ext2-devel] disk throughput

On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
> I've been taking a look at one particular workload - the creation
> and use of many smallish files. ie: the things we do every day
> with kernel trees.
>
> There are a number of things which cause linux to perform quite badly
> with this workload. I've fixed most of them and the speedups are quite
> dramatic. The changes include:
>
> - reorganise the sync_inode() paths so we don't do
> read-a-block,write-a-block,read-a-block, ...
>

Why can't the elevator do that? I'm all for better performance, but if
the elevator can do it, then it will work for other file systems too.

> - asynchronous preread of an ext2 inode's block when we
> create the inode, so:
>
> a) the reads will cluster into larger requests and
> b) during inode writeout we don't keep stalling on
> reads, preventing write clustering.
>

This may be what is inhibiting the elevator...

> The above changes are basically a search-and-destroy mission against
> the things which are preventing effective writeout request merging.
> Once they're in place we also need:
>
> - Alter the request queue size and elvtune settings
>

What settings are you suggesting? The 2.4 elevator queue size is an
order of magnitude larger than 2.2...

>
> The time to create 100,000 4k files (10 per directory) has fallen
> from 3:09 (3min 9second) down to 0:30. A six-fold speedup.
>

Nice!

>
> All well and good, and still a WIP. But by far the most dramatic
> speedups come from disabling ext2's policy of placing new directories
> in a different blockgroup from the parent:
[snip]
> A significant thing here is the improvement in read performance as well
> as writes. All of the other speedup changes only affect writes.
>
> We are paying an extremely heavy price for placing directories in
> different block groups from their parent. Why do we do this, and
> is it worth the cost?
>

My God! I'm no kernel hacker, but I would think the first thing you would
want to do is keep similar data (in this case similar because of proximity
in the dir tree) as close as possible to reduce seeking...

Is there any chance that this will go into ext3 too?

Mike

2001-11-05 03:37:07

by Andrew Morton

Subject: Re: disk throughput

"Mohammad A. Haque" wrote:
>
> On Sunday, November 4, 2001, at 09:13 PM, Andrew Morton wrote:
>
> > The time to create 100,000 4k files (10 per directory) has fallen
> > from 3:09 (3min 9second) down to 0:30. A six-fold speedup.
>
> Nice.
>
> How long before you release a patch? I have a couple of tasks we execute
> at work I'd like to throw at it.
>

Try the one-liner against ialloc.c. You'll need to rebuild
each filesystem to remove the inter-file fragmentation which
ext2 put in there though.

Fortunately I made all the partitions on my laptop the same size,
and there is a spare one. So I'm busily doing a `cp -a' of each
filesystem onto a newly created one, then mke2fs -j'ing the old one,
then moving on to the next one. It's boring.

All the other changes make basically no difference once the ialloc.c
change is made. With that workload. It's all a matter of utilisation
of device-level readahead.

2001-11-05 03:51:01

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

Mike Fedyk wrote:
>
> On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
> > I've been taking a look at one particular workload - the creation
> > and use of many smallish files. ie: the things we do every day
> > with kernel trees.
> >
> > There are a number of things which cause linux to perform quite badly
> > with this workload. I've fixed most of them and the speedups are quite
> > dramatic. The changes include:
> >
> > - reorganise the sync_inode() paths so we don't do
> > read-a-block,write-a-block,read-a-block, ...
> >
>
> Why can't the elevator do that? I'm all for better performance, but if
> the elevator can do it, then it will work for other file systems too.

If a process is feeding writes into the request layer and then
stops to do a read, we've blown it. There's nothing for the
elevator to elevate.

> > - asynchronous preread of an ext2 inode's block when we
> > create the inode, so:
> >
> > a) the reads will cluster into larger requests and
> > b) during inode writeout we don't keep stalling on
> > reads, preventing write clustering.
> >
>
> This may be what is inhibiting the elevator...

Yes.

> > The above changes are basically a search-and-destroy mission against
> > the things which are preventing effective writeout request merging.
> > Once they're in place we also need:
> >
> > - Alter the request queue size and elvtune settings
> >
>
> What settings are you suggesting? The 2.4 elevator queue size is an
> order of magnitude larger than 2.2...

The default number of requests is 128. This is in fact quite ample AS
LONG AS the filesystem is feeding decent amounts of reasonably localised
stuff into the request layer, and isn't stopping for reads all the time.
ext2 and the VFS are not. But I suspect that with the ialloc.c change,
disk readahead is covering up for it.

The meaning of the parameter to elvtune is a complete mystery, and the
code is uncommented crud (tautology). So I just used -r20000 -w20000.

This was based on observing the request queue dynamics. We frequently
fail to merge requests which really should be merged regardless of
latency. Bumping the elvtune settings fixed it all. But once the
fs starts writing data out contiguously it's all academic.

> >
> > The time to create 100,000 4k files (10 per directory) has fallen
> > from 3:09 (3min 9second) down to 0:30. A six-fold speedup.
> >
>
> Nice!

Well. I got to choose the benchmark.

> >
> > All well and good, and still a WIP. But by far the most dramatic
> > speedups come from disabling ext2's policy of placing new directories
> > in a different blockgroup from the parent:
> [snip]
> > A significant thing here is the improvement in read performance as well
> > as writes. All of the other speedup changes only affect writes.
> >
> > We are paying an extremely heavy price for placing directories in
> > different block groups from their parent. Why do we do this, and
> > is it worth the cost?
> >
>
> My God! I'm no kernel hacker, but I would think the first thing you would
> want to do is keep similar data (in this case similar because of proximity
> in the dir tree) as close as possible to reduce seeking...

Sure. ext2 goes to great lengths to avoid intra-file fragmentation,
and then goes and adds its own inter-file fragmentation.

It's worse on larger partitions, because they have more of the 128 meg
block groups.

> Is there any chance that this will go into ext3 too?
>

If it goes in ext2, yes. Depends on what the ext2 gods say - there
may be some deep design issue I'm missing here.

-

2001-11-05 04:39:45

by Mike Fedyk

Subject: Re: [Ext2-devel] disk throughput

On Sun, Nov 04, 2001 at 07:45:21PM -0800, Andrew Morton wrote:
> Mike Fedyk wrote:
> > What settings are you suggesting? The 2.4 elevator queue size is an
> > order of magnitude larger than 2.2...
>
> The default number of requests is 128. This is in fact quite ample AS
> LONG AS the filesystem is feeding decent amounts of reasonably localised
> stuff into the request layer, and isn't stopping for reads all the time.
> ext2 and the VFS are not. But I suspect that with the ialloc.c change,
> disk readahead is covering up for it.
>

Hmm...

> The meaning of the parameter to elvtune is a complete mystery, and the
> code is uncommented crud (tautology). So I just used -r20000 -w20000.
>

I saw somewhere that Andrea Arcangeli wrote it... Maybe he can help?

> This was based on observing the request queue dynamics. We frequently
> fail to merge requests which really should be merged regardless of
> latency. Bumping the elvtune settings fixed it all. But once the
> fs starts writing data out contiguously it's all academic.
>

I have had much improved interactive performance with -r 333 -w 1000, or
even -r 100 -w 300...

Setting it down to -r 0 -w 0 caused several processes (in a -j5 kernel
compile) to start waiting for disk...

> > >
> > > The time to create 100,000 4k files (10 per directory) has fallen
> > > from 3:09 (3min 9second) down to 0:30. A six-fold speedup.
> > >
> >
> > Nice!
>
> Well. I got to choose the benchmark.
>

Yep, but I'm sure the diffing two trees test will get your patch
noticed... ;)

How do the numbers look for ext2 mounted -o sync?

> > My God! I'm no kernel hacker, but I would think the first thing you would
> > want to do is keep similar data (in this case similar because of proximity
> > in the dir tree) as close as possible to reduce seeking...
>
> Sure. ext2 goes to great lengths to avoid intra-file fragmentation,
> and then goes and adds its own inter-file fragmentation.
>
> It's worse on larger partitions, because they have more of the 128 meg
> block groups.
>

Yep.

Do you think that more (and thus, smaller) block groups would help for the
larger partitions?

> > Is there any chance that this will go into ext3 too?
> >
>
> If it goes in ext2, yes.

Great!

>Depends on what the ext2 gods say - there
> may be some deep design issue I'm missing here.
>
Yes, let's see what they say. :)

Mike

2001-11-05 05:55:36

by Albert D. Cahalan

Subject: Re: [Ext2-devel] disk throughput

Mike Fedyk writes:
> On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:

>> All well and good, and still a WIP. But by far the most dramatic
>> speedups come from disabling ext2's policy of placing new directories
>> in a different blockgroup from the parent:
> [snip]
>> A significant thing here is the improvement in read performance as well
>> as writes. All of the other speedup changes only affect writes.
>>
>> We are paying an extremely heavy price for placing directories in
>> different block groups from their parent. Why do we do this, and
>> is it worth the cost?
>
> My God! I'm no kernel hacker, but I would think the first thing
> you would want to do is keep similar data (in this case similar
> because of proximity in the dir tree) as close as possible to
> reduce seeking...

By putting directories far apart, you leave room for regular
files (added at some future time) to be packed close together.

I'm sure your benchmark doesn't fill directories with files
by adding a few files every day over a period of many months.
Think about projects under your home directory though.

2001-11-05 07:07:23

by Jens Axboe

Subject: Re: [Ext2-devel] disk throughput

On Sun, Nov 04 2001, Andrew Morton wrote:
> The meaning of the parameter to elvtune is a complete mystery, and the
> code is uncommented crud (tautology). So I just used -r20000 -w20000.

It's the number of sectors that are allowed to pass a request on the
queue, because of merges or inserts before that particular request. So
you want lower than that probably, and you want READ latency to be
smaller than WRITE latency too. The default I set is 8192/16384 iirc, so
go lower than this -- -r512 -w1024 or even lower just to check the
results.
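
A rough sketch of that accounting (names are illustrative; this is not the
actual 2.4 drivers/block/elevator.c): every request gets a sector budget from
the elvtune read/write latency, and each request merged or inserted in front
of it spends part of that budget; once the budget is gone, nothing more may
pass it.

/* Illustrative only -- not the real elevator code. */
struct elv_req {
	int elevator_sequence;	/* sectors still allowed to pass this request */
};

/* a new request starts out with the elvtune latency for its direction */
static void elv_init_latency(struct elv_req *rq, int is_read,
			     int read_latency, int write_latency)
{
	rq->elevator_sequence = is_read ? read_latency : write_latency;
}

/* may 'nr_sectors' be merged or inserted in front of rq? */
static int elv_may_pass(const struct elv_req *rq, int nr_sectors)
{
	return rq->elevator_sequence > nr_sectors;
}

/* charge rq for the sectors that just went in front of it */
static void elv_account_pass(struct elv_req *rq, int nr_sectors)
{
	rq->elevator_sequence -= nr_sectors;
}

On this model, -r20000 -w20000 makes the budget so large that merging is
essentially never refused, which would explain why bumping elvtune helped
the workload above.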

> This was based on observing the request queue dynamics. We frequently
> fail to merge requests which really should be merged regardless of
> latency. Bumping the elvtune settings fixed it all. But once the
> fs starts writing data out contiguously it's all academic.

Interesting, the 2.5 design prevents this since it doesn't account
merges as a penalty (like a seek). I can do something like that for 2.4
too, I'll do a patch for you to test.

--
Jens Axboe

2001-11-05 07:15:26

by Mike Fedyk

Subject: Re: [Ext2-devel] disk throughput

On Mon, Nov 05, 2001 at 08:06:35AM +0100, Jens Axboe wrote:
> On Sun, Nov 04 2001, Andrew Morton wrote:
> > The meaning of the parameter to elvtune is a complete mystery, and the
> > code is uncommented crud (tautology). So I just used -r20000 -w20000.
>
> It's the number of sectors that are allowed to pass a request on the
> queue, because of merges or inserts before that particular request. So
> you want lower than that probably, and you want READ latency to be
> smaller than WRITE latency too. The default I set is 8192/16384 iirc, so
> go lower than this -- -r512 -w1024 or even lower just to check the
> results.
>

Does the elevator do better with powers of two?

> > This was based on observing the request queue dynamics. We frequently
> > fail to merge requests which really should be merged regardless of
> > latency. Bumping the elvtune settings fixed it all. But once the
> > fs starts writing data out contiguously it's all academic.
>
> Interesting, the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek). I can do something like that for 2.4
> too, I'll do a patch for you to test.
>

I'd be very interested in this patch. Can you post it publicly?

> --
> Jens Axboe
>

Mike

2001-11-05 07:19:16

by Jens Axboe

Subject: Re: [Ext2-devel] disk throughput

On Mon, Nov 05 2001, Jens Axboe wrote:
> Interesting, the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek). I can do something like that for 2.4
> too, I'll do a patch for you to test.

Ok here it is. It's not as efficient as the 2.5 version since it
proceeds to scan the entire queue for a merge, but it should suffice for
your testing.

--
Jens Axboe


Attachments:
elv-merge-account-1 (1.67 kB)

2001-11-05 07:19:26

by Jens Axboe

Subject: Re: [Ext2-devel] disk throughput

On Sun, Nov 04 2001, Mike Fedyk wrote:
> On Mon, Nov 05, 2001 at 08:06:35AM +0100, Jens Axboe wrote:
> > On Sun, Nov 04 2001, Andrew Morton wrote:
> > > The meaning of the parameter to elvtune is a complete mystery, and the
> > > code is uncommented crud (tautology). So I just used -r20000 -w20000.
> >
> > It's the number of sectors that are allowed to pass a request on the
> > queue, because of merges or inserts before that particular request. So
> > you want lower than that probably, and you want READ latency to be
> > smaller than WRITE latency too. The default I set is 8192/16384 iirc, so
> > go lower than this -- -r512 -w1024 or even lower just to check the
> > results.
> >
>
> Does the elevator do better with powers of two?

No, that doesn't matter.

> > > This was based on observing the request queue dynamics. We frequently
> > > fail to merge requests which really should be merged regardless of
> > > latency. Bumping the elvtune settings fixed it all. But once the
> > > fs starts writing data out contiguously it's all academic.
> >
> > Interesting, the 2.5 design prevents this since it doesn't account
> > merges as a penalty (like a seek). I can do something like that for 2.4
> > too, I'll do a patch for you to test.
> >
>
> I'd be very interested in this patch. Can you post it publicly?

Just posted :-)

--
Jens Axboe

2001-11-05 07:20:36

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

Jens Axboe wrote:
>
> On Sun, Nov 04 2001, Andrew Morton wrote:
> > The meaning of the parameter to elvtune is a complete mystery, and the
> > code is uncommented crud (tautology). So I just used -r20000 -w20000.
>
> It's the number of sectors that are allowed to pass a request on the
> queue, because of merges or inserts before that particular request. So
> you want lower than that probably, and you want READ latency to be
> smaller than WRITE latency too. The default I set is 8192/16384 iirc, so
> go lower than this -- -r512 -w1024 or even lower just to check the
> results.

Right, thanks. With the ialloc.c one-liner I didn't touch
elvtune. Defaults seem fine.

It should be the number of requests which are allowed to pass a
request, not the number of sectors!

Well, you know what I mean: Make it

1 + nr_sectors_in_request / 1000
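
In other words (an illustration of the suggested cost, not actual code), the
charge for passing a request would be roughly per-seek rather than per-sector:

/* a passing request costs about one "seek" plus a small size term,
 * instead of its full sector count */
static inline int pass_cost(int nr_sectors_in_request)
{
	return 1 + nr_sectors_in_request / 1000;
}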

> > This was based on observing the request queue dynamics. We frequently
> > fail to merge requests which really should be merged regardless of
> > latency. Bumping the elvtune settings fixed it all. But once the
> > fs starts writing data out contiguously it's all academic.
>
> Interesting, the 2.5 design prevents this since it doesn't account
> merges as a penalty (like a seek). I can do something like that for 2.4
> too, I'll do a patch for you to test.

OK.

-

2001-11-05 07:26:36

by Jens Axboe

Subject: Re: [Ext2-devel] disk throughput

On Sun, Nov 04 2001, Andrew Morton wrote:
> Jens Axboe wrote:
> >
> > On Sun, Nov 04 2001, Andrew Morton wrote:
> > > The meaning of the parameter to elvtune is a complete mystery, and the
> > > code is uncommented crud (tautology). So I just used -r20000 -w20000.
> >
> > It's the number of sectors that are allowed to pass a request on the
> > queue, because of merges or inserts before that particular request. So
> > you want lower than that probably, and you want READ latency to be
> > smaller than WRITE latency too. The default I set is 8192/16384 iirc, so
> > go lower than this -- -r512 -w1024 or even lower just to check the
> > results.
>
> Right, thanks. With the ialloc.c one-liner I didn't touch
> elvtune. Defaults seem fine.
>
> It should be the number of requests which are allowed to pass a
> request, not the number of sectors!
>
> Well, you know what I mean: Make it
>
> 1 + nr_sectors_in_request / 1000

That has been tried; performance and latency weren't good. But yes, that
is what we are really looking to account for: the number of seeks.
Approximately.

--
Jens Axboe

2001-11-05 08:09:44

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

"Albert D. Cahalan" wrote:
>
> Mike Fedyk writes:
> > On Sun, Nov 04, 2001 at 06:13:19PM -0800, Andrew Morton wrote:
>
> >> All well and good, and still a WIP. But by far the most dramatic
> >> speedups come from disabling ext2's policy of placing new directories
> >> in a different blockgroup from the parent:
> > [snip]
> >> A significant thing here is the improvement in read performance as well
> >> as writes. All of the other speedup changes only affect writes.
> >>
> >> We are paying an extremely heavy price for placing directories in
> >> different block groups from their parent. Why do we do this, and
> >> is it worth the cost?
> >
> > My God! I'm no kernel hacker, but I would think the first thing
> > you would want to do is keep similar data (in this case similar
> > because of proximity in the dir tree) as close as possible to
> > reduce seeking...
>
> By putting directories far apart, you leave room for regular
> files (added at some future time) to be packed close together.

OK, that's one possible reason. Not sure I buy it though. If
the files are created a few days after their parent directory
then the chance of their data or metadata being within device
readahead scope of any of the parent dir's blocks seems small?

> I'm sure your benchmark doesn't fill directories with files
> by adding a few files every day over a period of many months.
> Think about projects under your home directory though.

OK. I'm not really aware of a suitable benchmark for that.
postmark only works in a single directory. mongo hardly
touches the disk at all, (with 3/4 of a gig of RAM, anyway).

My original make-100,000-4k-files test creates the files
in a tree - each node has 10 leaves. For a total of 11,110
directories and 100,000 files. It originally did it
in-order, so:

mkdir(00)
mkdir(00/00)
mkdir(00/00/00)
mkdir(00/00/00/00)
creat(00/00/00/00/00)
creat(00/00/00/00/01)
...
mkdir(00/00/00/01)

etc.

So I changed it to create the 11,110 directories, and then
to go back and create the 100,000 files. This ensures that the
files' data are not contiguous with their parent directory.
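
A hedged reconstruction of that test (not the original script): a four-level
tree with 10 children per node gives 10 + 100 + 1,000 + 10,000 = 11,110
directories and 100,000 files of 4k each; all directories are created in a
first pass, all files in a second.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static char buf[4096];		/* 4k of zeroes per file */

static void walk(const char *path, int depth, int create_files)
{
	char p[256];
	int i;

	for (i = 0; i < 10; i++) {
		snprintf(p, sizeof(p), "%s/%02d", path, i);
		if (depth <= 4) {		/* levels 1-4: directories */
			if (!create_files)
				mkdir(p, 0755);
			walk(p, depth + 1, create_files);
		} else if (create_files) {	/* level 5: the 4k files */
			int fd = open(p, O_CREAT | O_WRONLY | O_TRUNC, 0644);
			if (fd >= 0) {
				write(fd, buf, sizeof(buf));
				close(fd);
			}
		}
	}
}

int main(void)
{
	walk(".", 1, 0);	/* first all 11,110 directories */
	walk(".", 1, 1);	/* then the 100,000 files */
	return 0;
}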

With the ialloc.c change, plus the other changes I mentioned
the time to create all these directories and files and then run
/bin/sync fell from 1:53 to 0:28. Fourfold.

But the time to read the data was remarkable. This is purely
due to the ialloc.c change. The command was

time (find . -type f | xargs cat > /dev/null)

stock 2.4.14-pre8: 6:32
with ialloc.c change: 0:30

And this was on an 8 gig fs. On a 32 gig fs I'd expect to see
a fifteen-fold difference due to the additional block groups.

Can you suggest a better test?

-

2001-11-05 08:47:29

by Jan Kara

Subject: Re: disk throughput

> I've been taking a look at one particular workload - the creation
> and use of many smallish files. ie: the things we do every day
> with kernel trees.
>
> There are a number of things which cause linux to perform quite badly
> with this workload. I've fixed most of them and the speedups are quite
> dramatic. The changes include:
>
> - reorganise the sync_inode() paths so we don't do
> read-a-block,write-a-block,read-a-block, ...
>
> - asynchronous preread of an ext2 inode's block when we
> create the inode, so:
>
> a) the reads will cluster into larger requests and
> b) during inode writeout we don't keep stalling on
> reads, preventing write clustering.
>
> - Move ext2's bitmap loading code outside lock_super(),
> so other threads can still get in and write to the
> fs while the reader sleeps, thus increasing write
> request merging. This benefits multithreaded workloads
> (ie: dbench) and needs more thought.
>
> The above changes are basically a search-and-destroy mission against
> the things which are preventing effective writeout request merging.
> Once they're in place we also need:
>
> - Alter the request queue size and elvtune settings
>
>
> The time to create 100,000 4k files (10 per directory) has fallen
> from 3:09 (3min 9second) down to 0:30. A six-fold speedup.
Nice :).

> All well and good, and still a WIP. But by far the most dramatic
> speedups come from disabling ext2's policy of placing new directories
> in a different blockgroup from the parent:
>
> --- linux-2.4.14-pre8/fs/ext2/ialloc.c Tue Oct 9 21:31:40 2001
> +++ linux-akpm/fs/ext2/ialloc.c Sun Nov 4 17:40:43 2001
> @@ -286,7 +286,7 @@ struct inode * ext2_new_inode (const str
> repeat:
> gdp = NULL; i=0;
>
> - if (S_ISDIR(mode)) {
> + if (0 && S_ISDIR(mode)) {
> avefreei = le32_to_cpu(es->s_free_inodes_count) /
> sb->u.ext2_sb.s_groups_count;
> /* I am not yet convinced that this next bit is necessary.
>
>
> Numbers. The machine has 768 megs; the disk is IDE with a two meg cache.
> The workload consists of untarring, tarring, diffing and removing kernel
> trees. This filesystem is 21 gigs, and has 176 block groups.
I'm not sure this speedup isn't connected just with your testcase..
If I understood the code correctly, it tries to spread files uniformly over
the fs (ie. all groups equally full). I think that if you have a filesystem
like /home, where large and small files change a lot, your change can actually
lead to more fragmentation - the groups at the beginning get full (files
are created in the same group as their parent). Then if some file gets deleted
and a new one created, the filesystem will try to stuff the new file into the
first groups, and that causes fragmentation.. But it's just an idea - some
testing would probably be more useful...

> After each test which wrote data a `sync' was performed, and was included
> in the timing under the assumption that all the data will be written back
> by kupdate in a few seconds, and running `sync' allows measurement of the
> cost of that.
>
> The filesystem was unmounted between each test - all tests are with
> cold caches.
>
>                                  stock      patched
> untar one kernel tree, sync:      0:31         0:14
> diff two trees:                   3:04         1:12
> tar kernel tree to /dev/null:     0:15         0:03
> remove 2 kernel trees, sync:      0:30         0:10
>
> A significant thing here is the improvement in read performance as well
> as writes. All of the other speedup changes only affect writes.
>
> We are paying an extremely heavy price for placing directories in
> different block groups from their parent. Why do we do this, and
> is it worth the cost?

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2001-11-05 08:51:09

by Mike Fedyk

Subject: Re: [Ext2-devel] Re: disk throughput

On Mon, Nov 05, 2001 at 09:47:01AM +0100, Jan Kara wrote:
> If I understood the code correctly, it tries to spread files uniformly over
> the fs (ie. all groups equally full). I think that if you have a filesystem
> like /home, where large and small files change a lot, your change can actually
> lead to more fragmentation - the groups at the beginning get full (files
> are created in the same group as their parent). Then if some file gets deleted
> and a new one created, the filesystem will try to stuff the new file into the
> first groups, and that causes fragmentation.. But it's just an idea - some
> testing would probably be more useful...
>

Shouldn't it choose another block group if the file won't fit in the current
one?

2001-11-05 09:01:51

by Jan Kara

Subject: Re: [Ext2-devel] Re: disk throughput

> On Mon, Nov 05, 2001 at 09:47:01AM +0100, Jan Kara wrote:
> > If I understood the code correctly, it tries to spread files uniformly over
> > the fs (ie. all groups equally full). I think that if you have a filesystem
> > like /home, where large and small files change a lot, your change can actually
> > lead to more fragmentation - the groups at the beginning get full (files
> > are created in the same group as their parent). Then if some file gets deleted
> > and a new one created, the filesystem will try to stuff the new file into the
> > first groups, and that causes fragmentation.. But it's just an idea - some
> > testing would probably be more useful...
> >
>
> Shouldn't it choose another block group if the file won't fit in the current
> one?
If I understand the code well: when ext2 creates a new file it will try to
allocate data in the same group where it got the inode. If there's no free block
(or better, 8 blocks) in the group it will try another one, etc. (but that one
will probably also be full - on average we will first try to spread the file over
half of the full groups before hitting an empty one). But I agree that this way
it will fill the holes with the file, and next time it will start in an empty
group. Anyway, this allocation strategy will IMHO not keep inodes near data...
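
A sketch of that fallback (illustrative only; the real fs/ext2/balloc.c tries
a quadratic hash over the groups before falling back to a linear scan):

static int pick_block_group(int inode_group, int ngroups,
			    int (*free_blocks_in)(int group))
{
	int i, g;

	if (free_blocks_in(inode_group) > 0)
		return inode_group;		/* same group as the inode */

	for (i = 1; i < ngroups; i++) {		/* otherwise try the others */
		g = (inode_group + i) % ngroups;
		if (free_blocks_in(g) > 0)
			return g;
	}
	return -1;				/* filesystem is full */
}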

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2001-11-05 09:15:01

by Mike Fedyk

Subject: Re: [Ext2-devel] disk throughput

On Mon, Nov 05, 2001 at 08:18:36AM +0100, Jens Axboe wrote:
> On Mon, Nov 05 2001, Jens Axboe wrote:
> > Interesting, the 2.5 design prevents this since it doesn't account
> > merges as a penalty (like a seek). I can do something like that for 2.4
> > too, I'll do a patch for you to test.
>
> Ok here it is. It's not as efficient as the 2.5 version since it
> proceeds to scan the entire queue for a merge, but it should suffice for
> your testing.
>

Does the elevator still favor blocks that are on the outside of the platter
over all others? If so, I think this should be removed in favor of a
timeout just like the other seek requests...

I've been able to put a swap partition on the outside of my drive, and get
utterly abysmal performance, while on another similar system, with swap on
the inside of the drive performance was much better during a swap storm...

Mike

2001-11-05 09:21:12

by Jens Axboe

Subject: Re: [Ext2-devel] disk throughput

On Mon, Nov 05 2001, Mike Fedyk wrote:
> On Mon, Nov 05, 2001 at 08:18:36AM +0100, Jens Axboe wrote:
> > On Mon, Nov 05 2001, Jens Axboe wrote:
> > > Interesting, the 2.5 design prevents this since it doesn't account
> > > merges as a penalty (like a seek). I can do something like that for 2.4
> > > too, I'll do a patch for you to test.
> >
> > Ok here it is. It's not as efficient as the 2.5 version since it
> > proceeds to scan the entire queue for a merge, but it should suffice for
> > your testing.
> >
>
> Does the elevator still favor blocks that are on the outside of the platter
> over all others? If so, I think this should be removed in favor of a
> timeout just like the other seek requests...

It doesn't quite favor outside LBA's (lower numbers), it's a combination
of sector and device. It's hard to do this right in 2.4 because request
sectors are not absolute but a combination of partition + offset. 2.5 will
have this fixed, generic_make_request remaps buffer heads (well actually
bio's, but same deal) before submitting so the I/O scheduler can be a
bit smarter.

> I've been able to put a swap partition on the outside of my drive, and get
> utterly abysmal performance, while on another similar system, with swap on
> the inside of the drive performance was much better during a swap storm...

That sounds very odd, swap should be much faster on the outside.

--
Jens Axboe

Subject: Re: [Ext2-devel] disk throughput



--On Monday, 05 November, 2001 12:54 AM -0500 "Albert D. Cahalan"
<[email protected]> wrote:

> By putting directories far apart, you leave room for regular
> files (added at some future time) to be packed close together.

Mmmm... but if you read ahead your directories (see lkml passim)
I'd be prepared to bet the gain here is either minimized, or
negative.

--
Alex Bligh

Subject: Re: [Ext2-devel] disk throughput

I think I was slightly too brief:

> --On Monday, 05 November, 2001 12:54 AM -0500 "Albert D. Cahalan"
> <[email protected]> wrote:
>
>> By putting directories far apart, you leave room for regular
>> files (added at some future time) to be packed close together.
>
> Mmmm... but if you read ahead your directories (see lkml passim)
> I'd be prepared to bet the gain here is either minimized, or
> negative.

I.e. if you put in code to queue caching of a subdirectory when its
entry in the parent is enumerated (either in userland or
in kernelland), then you get far less seeking as it all
comes 'up front' and the seeks can be monotonic. Even so,
the closer the directories are, the better.

But more to the point, if your directories tend to be cached
this way, you surely win by having the directories allocated
the same way as files (i.e. to be packed /all/ close together).

--
Alex Bligh

2001-11-05 12:24:10

by Matthias Andree

Subject: Re: disk throughput

On Sun, 04 Nov 2001, Andrew Morton wrote:

> Numbers. The machine has 768 megs; the disk is IDE with a two meg cache.
> The workload consists of untarring, tarring, diffing and removing kernel
> trees. This filesystem is 21 gigs, and has 176 block groups.

Does that IDE disk run with its write cache enabled or disabled?

2001-11-05 12:29:00

by Matthias Andree

Subject: Re: [Ext2-devel] disk throughput

On Mon, 05 Nov 2001, Andrew Morton wrote:

> OK. I'm not really aware of a suitable benchmark for that.
> postmark only works in a single directory. mongo hardly
> touches the disk at all, (with 3/4 of a gig of RAM, anyway).

You can always pass the kernel "mem=64M".

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin

2001-11-05 14:24:06

by Alexander Viro

Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Andrew Morton wrote:

> OK, that's one possible reason. Not sure I buy it though. If
> the files are created a few days after their parent directory
> then the chance of their data or metadata being within device
> readahead scope of any of the parent dir's blocks seems small?

The algorithm for inode allocation was written by Kirk back in
'84. You can find some analysis in the original paper (A Fast
File System for UNIX).

BTW, what you want is not "readahead scope of parent dir block".
You want inodes of files in given directory close to each other.
That matters a lot when you do stat() on directory contents,
etc. Moreover, since we attempt to keep data blocks close to
inodes, we want to keep use of cylinder groups more or less
even.

2001-11-05 20:18:31

by Andreas Dilger

Subject: Re: [Ext2-devel] disk throughput

On Nov 05, 2001 00:04 -0800, Andrew Morton wrote:
> My original make-100,000-4k-files test creates the files
> in a tree - each node has 10 leaves. For a total of 11,110
> directories and 100,000 files. It originally did it
> in-order, so:
>
> mkdir(00)
> mkdir(00/00)
> mkdir(00/00/00)
> mkdir(00/00/00/00)
> creat(00/00/00/00/00)
> creat(00/00/00/00/01)
> ...
> mkdir(00/00/00/01)
>
> etc.
>
> So I changed it to create the 11,110 directories, and then
> to go back and create the 100,000 files. This ensures that the
> files' data are not contiguous with their parent directory.
>
> With the ialloc.c change, plus the other changes I mentioned
> the time to create all these directories and files and then run
> /bin/sync fell from 1:53 to 0:28. Fourfold.
>
> And this was on an 8 gig fs. On a 32 gig fs I'd expect to see
> a fifteen-fold difference due to the additional block groups.
>
> Can you suggest a better test?

Well, just to emphasize the "block group" issues, you could try testing
with a 1kB or 2kB block filesystem. Since a group spans 8 x blocksize^2
bytes (8 MB, 32 MB and 128 MB for 1kB, 2kB and 4kB blocks), this will give
you 16x or 4x as many groups as a 4kB block filesystem, respectively.

A more "valid" test, IMHO, would be "untar the kernel, (flush buffers),
build kernel" on both the original, and your "all in one group" inode
allocation heuristics. It should be especially noticable on a 1kB
filesystem. What this will show (I think) is that while untar/read
with your method will be fast (all inodes/files contiguous on disk)
once you start trying to write to that filesystem, you will have more
fragmentation/seeking for the writes. It may be that with large-memory
systems you will cache so much you don't see a difference, hence the
(flush buffers) part, which is probably umount, mount.

An even better test would be untar kernel, patch up a few major versions,
then try to compile. The old heuristic would probably be OK, as there
is space in each group for files to grow, while your heuristic would
move files into groups other than their parent because there is no space.

In the end, though, while the old heuristic has a good theory, it _may_
be that in practice, you are _always_ seeking to get data from different
groups, rather than _theoretically_ seeking because of fragmented files.
I don't know what the answer is - probably depends on finding "valid"
benchmarks (cough).

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-05 20:28:40

by m

Subject: Re: [Ext2-devel] disk throughput

Andreas Dilger <[email protected]> writes:
> On Nov 05, 2001 00:04 -0800, Andrew Morton wrote:
[..]
> > With the ialloc.c change, plus the other changes I mentioned
> > the time to create all these directories and files and then run
> > /bin/sync fell from 1:53 to 0:28. Fourfold.
>
> In the end, though, while the old heuristic has a good theory, it _may_
> be that in practice, you are _always_ seeking to get data from different
> groups, rather than _theoretically_ seeking because of fragmented files.
> I don't know what the answer is - probably depends on finding "valid"
> benchmarks (cough).

Another heuristic to try may be to only use a different blockgroup
when the mkdir()s are separated in time. i.e. rather than
doing
if ( 0 && ..
use something like
if ((last_time + 100) < jiffies && ...
last_time = jiffies;
which would in theory use the old behaviour for sporadic mkdirs
and the new behaviour for things like 'untar' et al.
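
Filled out a little (a hedged expansion of the pseudo-code above, keeping the
100-tick threshold as the example value), that heuristic might look like:

#include <linux/sched.h>	/* jiffies */
#include <linux/timer.h>	/* time_after() */

static unsigned long last_mkdir_jiffies;

/* spread directories across groups only when mkdir()s are spaced out
 * in time; bursts (untar, cvs co) stay in the parent's group */
static int spread_this_mkdir(void)
{
	int spread = time_after(jiffies, last_mkdir_jiffies + 100);

	last_mkdir_jiffies = jiffies;
	return spread;
}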

M.

2001-11-05 21:45:17

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

[email protected] wrote:
>
> Andreas Dilger <[email protected]> writes:
> > On Nov 05, 2001 00:04 -0800, Andrew Morton wrote:
> [..]
> > > With the ialloc.c change, plus the other changes I mentioned
> > > the time to create all these directories and files and then run
> > > /bin/sync fell from 1:53 to 0:28. Fourfold.
> >
> > In the end, though, while the old heuristic has a good theory, it _may_
> > be that in practice, you are _always_ seeking to get data from different
> > groups, rather than _theoretically_ seeking because of fragmented files.
> > I don't know what the answer is - probably depends on finding "valid"
> > benchmarks (cough).
>
> Another heuristic to try may be to only use a different blockgroup
> when the mkdir()s are separated in time. i.e. rather than
> doing
> if ( 0 && ..
> use something like
> if ((last_time + 100) < jiffies && ...
> last_time = jiffies;
> which would in theory use the old behaviour for sporadic mkdirs
> and the new behaviour for things like 'untar' et al.
>

I agree - that's a pretty sane heuristic.

It would allow us to preserve the existing semantics for the
slowly-accreting case. If they're still valid.

2001-11-05 22:28:18

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

Alexander Viro wrote:
>
> On Mon, 5 Nov 2001, Andrew Morton wrote:
>
> > OK, that's one possible reason. Not sure I buy it though. If
> > the files are created a few days after their parent directory
> > then the chance of their data or metadata being within device
> > readahead scope of any of the parent dir's blocks seems small?
>
> The algorithm for inode allocation was written by Kirk back in
> '84. You can find some analysis in the original paper (A Fast
> File System for UNIX).

'84.

> BTW, what you want is not "readahead scope of parent dir block".
> You want inodes of files in given directory close to each other.
> That matters a lot when you do stat() on directory contents,
> etc. Moreover, since we attempt to keep data blocks close to
> inodes, we want to keep use of cylinder groups more or less
> even.

For some workloads we want the subdirectories close to the
parent as well. Failing to do so is horridly wrong.

What has changed since Kirk's design?

- The relative cost of seeks has increased. Device caching
and readahead and increased bandwidth make it more expensive
to seek away.

- The seek distance-versus-cost equation has changed. Take a look
at a graph of seek distance versus time. Once you've decided to
seek ten percent of the distance across the disk, a 90% seek only
takes twice as long.

- The size of devices has grown more quickly than ext2 blocksizes,
so instead of having four or eight block groups as we did in the
old days, we now have hundreds. 256 block groups on a 32 gig fs.
Sprinkling a directory tree across 256 block groups hurts a lot.


I don't think I buy the fragmentation argument, really. I suspect
that doing first-fit within the bounds of a block group will have
a long-term effect similar to performing first-fit on the entire fs.

But I don't know. This is just all bullshit handwaving speculation.
We need tests. Numbers. Does anyone have source to a filesystem
aging simulation? The Smith/Seltzer code seems to be off the air.

I just fed patch-2.4.14-ac8 into my sucky cvs import scripts and it
ran in about a fifth of the time. This is a serious matter, worth
spending time on.

-

2001-11-05 22:45:12

by Andrew Morton

Subject: Re: disk throughput

Matthias Andree wrote:
>
> On Sun, 04 Nov 2001, Andrew Morton wrote:
>
> > Numbers. The machine has 768 megs; the disk is IDE with a two meg cache.
> > The workload consists of untarring, tarring, diffing and removing kernel
> > trees. This filesystem is 21 gigs, and has 176 block groups.
>
> Does that IDE disk run with its write cache enabled or disabled?

write-behind is enabled. I'm not religious about write-behind.
Yes, there's a small chance that the disk will decide to write a
commit block in front of the data (which is at lower LBAs). But

a) It's improbable
b) if it does happen, the time window where you need to crash
is small.
c) if your kernel crashes, the data will still be written. It
has to be a power-down.

But good point - I'll test without writebehind, and with SCSI.

-

2001-11-05 22:44:11

by Andreas Dilger

Subject: Re: [Ext2-devel] disk throughput

On Nov 05, 2001 14:22 -0800, Andrew Morton wrote:
> But I don't know. This is just all bullshit handwaving speculation.
> We need tests. Numbers. Does anyone have source to a filesystem
> aging simulation? The Smith/Seltzer code seems to be off the air.

There is a guy doing fragmentation testing for reiserfs. It turns
out that (in his tests) reiserfs can get 10x slower as the filesystem
fills up because of intra-file fragmentation. I don't know enough
about reiserfs block/file allocation policy to know how this compares
to ext2 at all.

See http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/agetest.html

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-05 22:59:12

by Andrew Morton

Subject: Re: [Ext2-devel] disk throughput

Andreas Dilger wrote:
>
> On Nov 05, 2001 14:22 -0800, Andrew Morton wrote:
> > But I don't know. This is just all bullshit handwaving speculation.
> > We need tests. Numbers. Does anyone have source to a filesystem
> > aging simulation? The Smith/Seltzer code seems to be off the air.
>
> There is a guy doing fragmentation testing for reiserfs. It turns
> out that (in his tests) reiserfs can get 10x slower as the filesystem
> fills up because of intra-file fragmentation. I don't know enough
> about reiserfs block/file allocation policy to know how this compares
> to ext2 at all.
>
> See http://www.informatik.uni-frankfurt.de/~loizides/reiserfs/agetest.html
>

Wow. That looks nice. Unfortunately the tarballs are
missing several files. read.cxx, agesystem.cxx, etc. Plus
a number of dud links.

Constantin, could you please check that the agesystem3 tarball
builds OK, maybe put up a new one? Thanks.

-

2001-11-05 23:03:22

by Linus Torvalds

Subject: Re: [Ext2-devel] disk throughput

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>> if ( 0 && ..
>> use something like
>> if ((last_time + 100) < jiffies && ...
>> last_time = jiffies;
>> which would in theory use the old behaviour for sporadic mkdirs
>> and the new behaviour for things like 'untar' et al.
>>
>
>I agree - that's a pretty sane heuristic.
>
>It would allow us to preserve the existing semantics for the
>slowly-accreting case. If they're still valid.

I don't particularly like behaviour that changes over time, so I would
much rather just state clearly that the current inode allocation
strategy is obviously complete crap. Proof: simple real-world
benchmarks, along with some trivial thinking about seek latencies.

In particular, the way it works now, it will on purpose try to spread
out inodes over the whole disk. Every new directory will be allocated in
the group that has the most free inodes, which obviously on average
means that you try to fill up all groups equally.

Which makes _no_ sense. There is no advantage to trying to spread things
out, only clear disadvantages.

Instead of trying to make this time-dependent, let's just realize that
spreading things out is stupid. It might make more sense to say

- if we create a new directory
- AND the group we're currently in is getting filled up
- AND there is another group that is clearly emptier,
- THEN do we switch groups.

None of this stupid "if this group has more inodes than the average"
crap.
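
As a hedged sketch (the thresholds and the used_pct() helper are illustrative,
not actual ext2 code), that rule might look like:

#define GROUP_FULLISH_PCT	90	/* "getting filled up" */
#define CLEARLY_EMPTIER_PCT	50	/* "clearly emptier" - arbitrary */

/* stay in the parent's group unless it is nearly full and some other
 * group is clearly emptier; only then switch */
static int pick_dir_group(int parent_group, int ngroups,
			  int (*used_pct)(int group))
{
	int g, best = parent_group;

	if (used_pct(parent_group) < GROUP_FULLISH_PCT)
		return parent_group;

	for (g = 0; g < ngroups; g++)
		if (used_pct(g) < CLEARLY_EMPTIER_PCT &&
		    used_pct(g) < used_pct(best))
			best = g;

	return best;
}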

But I'd rather have the "if (0 && .." than something that depends on
time. But if people want to have something that is in the _spirit_ of
the old code, then make it something like the above, which only switches
groups if there is a need and a clear win from it.

Linus

2001-11-05 23:15:12

by Dan Hollis

Subject: Re: [Ext2-devel] disk throughput

On Mon, 5 Nov 2001, Andreas Dilger wrote:
> There is a guy doing fragmentation testing for reiserfs. It turns
> out that (in his tests) reiserfs can get 10x slower as the filesystem
> fills up because of intra-file fragmentation. I don't know enough
> about reiserfs block/file allocation policy to know how this compares
> to ext2 at all.

How does xfs fare in comparison?

-Dan
--
[-] Omae no subete no kichi wa ore no mono da. [-]

2001-11-05 23:36:33

by Alexander Viro

Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

> I don't particularly like behaviour that changes over time, so I would
> much rather just state clearly that the current inode allocation
> strategy is obviously complete crap. Proof: simple real-world
> benchmarks, along with some trivial thinking about seek latencies.
>
> In particular, the way it works now, it will on purpose try to spread
> out inodes over the whole disk. Every new directory will be allocated in
> the group that has the most free inodes, which obviously on average
> means that you try to fill up all groups equally.

> Which makes _no_ sense. There is no advantage to trying to spread things
> out, only clear disadvantages.

Wrong. Trivial example: create skeleton homedirs for 50 new users.
You _really_ don't want all of them in one cylinder group. Because they
will be slowly filling up with files, while directory structure is very likely
to stay more or less stable. You want the preferred group for the file
inode to be the same as its parent directory. _And_ you want data
close to inode if we can afford that. Worse yet, for data allocation
we use quadratic hash. Which works nicely _unless_ starting point for
all of them sits in the same group.

See where it's going? The real issue is ratio of frequencies for
directory and file creation. The "time-dependent" part is ugly, but
the thing it tries to address is very, very real. Allocation policy
for a tree created at once is different from allocation policy for
normal use.

Ideally we would need to predict how many (and how large) files
will go into a directory. We can't - we have no time machines. But
the heuristic you've mentioned is clearly broken. It will end up with
mostly empty trees squeezed into a single cylinder group and when
they start to get populated that will be pure hell.

And yes, it's more than a realistic scenario. Your strategy would make
sense if all directories were created by untarring a large archive.
Which may be fairly accurate for your boxen (or mine, for that matter -
most of the time), but it's not universal.

Benchmarks that try to stress that code tend to be something like
cvs co, tar x, yodda, yodda. _All_ of them deal only with "fast-growth"
pattern. And yes, FFS inode allocator sucks for that scenario - no
arguments here. Unfortunately, the variant you propose will suck for the
slow-growth one and that is going to hurt a lot.

The fact that Linux became a huge directory tree means that we tend
to deal with fast-growth scenario quite often. Not everyone works
on the kernel, though ;-)

2001-11-05 23:41:43

by Matthias Andree

Subject: Re: disk throughput

On Mon, 05 Nov 2001, Andrew Morton wrote:

> write-behind is enabled. I'm not religious about write-behind.
> Yes, there's a small chance that the disk will decide to write a
> commit block in front of the data (which is at lower LBAs). But

Well, you're talking about performance here. In my simple tests, the
write cache has a major impact (twofold) on linear writes (just dd to
a partition), so your mileage may vary a lot depending on the cache
configuration.

2001-11-05 23:54:03

by Linus Torvalds

Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Alexander Viro wrote:
>
> Ideally we would need to predict how many (and how large) files
> will go into directory. We can't - we have no time machines. But
> heuristics you've mentioned is clearly broken. It will end up with
> mostly empty trees squeezed into a single cylinder group and when
> they start to get populated that will be pure hell.

Why? Squeezing into a single cylinder group is _good_. Fragmentation is
bad.

Where's the hell? You move over to the next cylinder group when you have a
90% filled one ("pick percentage randomly") and there are others that are
clearly less used.

> Benchmarks that try to stress that code tend to be something like
> cvs co, tar x, yodda, yodda. _All_ of them deal only with "fast-growth"
> pattern. And yes, FFS inode allocator sucks for that scenario - no
> arguments here. Unfortunately, the variant you propose will suck for
> slow-growth one and that is going to hurt a lot.

I don't see any arguments for _why_ you claim it would hurt a lot.

Fast-growth is what we should care about from a performance standpoint -
slow growth can be sanely handled with slower-paced maintenance issues
(including approaches like "defragment the disk once a month").

By definition, "slow growth" is amenable to defragmentation. And by
definition, fast growth isn't. And if we're talking about speedups on the
order of _five times_ for something as simple as a "cvs co", I don't see
where your "pure hell" comes in.

A five-time slowdown on real work _is_ pure hell. You've not shown a
credible argument that the slow-growth behaviour would ever result in a
five-time slowdown for _anything_.

Linus

2001-11-06 00:07:25

by Linus Torvalds

Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Linus Torvalds wrote:
>
> A five-time slowdown on real work _is_ pure hell. You've not shown a
> credible argument that the slow-growth behaviour would ever result in a
> five-time slowdown for _anything_.

There might also be heuristics that explicitly _notice_ slow growth, not
necessarily as a function of time, but as a function of the tree structure
itself.

For example, spreading out (and the inherent assumption of "slow growth")
might make sense for the root directory, and possibly for a level below
that. It almost certainly does _not_ make sense for a directory created
four levels down.

Linus

2001-11-06 01:28:57

by Alexander Viro

Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

> Fast-growth is what we should care about from a performance standpoint -
> slow growth can be sanely handled with slower-paced maintenance issues
> (including approaches like "defragment the disk once a month").

Ehh... Slow growth can involve a _lot_ of IO and file creation/removal,
just no directory tree changes.

> By definition, "slow growth" is amenable to defragmentation. And by
> definition, fast growth isn't. And if we're talking about speedups on the
> order of _five times_ for something as simple as a "cvs co", I don't see
> where your "pure hell" comes in.

Take a look at the allocator for non-directory inodes.

> A five-time slowdown on real work _is_ pure hell. You've not shown a
> credible argument that the slow-growth behaviour would ever result in a
> five-time slowdown for _anything_.

find(1). Stuff that runs every damn night. We end up doing getdents() +
a bunch of lstat() on the entries we had found. And that means access to
the inodes referred to from them. Try that on a large directory with child
inodes spread all over the disk.
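
For illustration, the access pattern in question is just this (a userspace
sketch of a find-style scan; if the child inodes are scattered across block
groups, each lstat() below can cost a seek):

#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* readdir() one directory, then lstat() every entry -- the
 * getdents()+lstat() pattern that nightly find/cron jobs generate */
static long stat_children(const char *dir)
{
	DIR *d = opendir(dir);
	struct dirent *de;
	struct stat st;
	char path[4096];
	long n = 0;

	if (!d)
		return -1;
	while ((de = readdir(d)) != NULL) {
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (lstat(path, &st) == 0)	/* touches the child's inode */
			n++;
	}
	closedir(d);
	return n;
}

int main(int argc, char **argv)
{
	printf("%ld entries\n", stat_children(argc > 1 ? argv[1] : "."));
	return 0;
}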

Linus, I'm not saying that the fast-growth scenario doesn't need to be handled
or that a 5 times slowdown on cvs co is OK. We need a decent heuristic to
distinguish between the slow and fast variants. BTW, the problem is not
new or Linux-specific. IIRC, there were a couple of papers by folks who
gathered real-life traces of allocations/freeing of objects on different
boxen and had run them through several allocators. I'll try to locate that
stuff and see if we can get to their raw data. OK? I'd really like to
have something beyond "current allocator sucks for fast-growth" to deal
with - especially if we start inventing heuristics for switching between
the fast and slow modes.

2001-11-06 01:34:16

by Alexander Viro

Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

> For example, spreading out (and the inherent assumption of "slow growth")
> might make sense for the root directory, and possibly for a level below
> that. It almost certainly does _not_ make sense for a directory created
> four levels down.

Depends on the naming scheme used by the admin, for one thing. Linus, I seriously
think that getting the real-life traces to play with would be a Good Thing(tm)
- at least that would allow us to test how well heuristics of that kind would do.

2001-11-06 02:14:01

by Linus Torvalds

Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Alexander Viro wrote:
>
> Depends on the naming scheme used by the admin, for one thing. Linus, I seriously
> think that getting the real-life traces to play with would be a Good Thing(tm)
> - at least that would allow us to test how well heuristics of that kind would do.

Well, we have some real-life traces. Those particular real-life traces show
that the "switch cylinder groups on mkdir" heuristic sucks.

And you have to realize that _whatever_ we do, it will always be a
heuristic. We don't know what the right behaviour is without being able to
predict the future. Agreed?

Ok, so let's just assume that we had no heuristic in place at all, and we
were looking at improving it for slow-growth situations. I think you'd
agree that decreasing the performance of a CVS checkout five-fold would be
considered _totally_ unacceptable for a new heuristic, no?

So the current heuristic provably sucks. We have cold hard numbers, and
quite frankly, Al, there is very very little point in arguing against
numbers. It's silly. "Gimme an S, gimme a U, gimme a C, gimme a K -
S-U-C-K". The current one sucks.

So the only question on the table is not "do we stay with the current
algorithm", because I dare you, the answer to that question is very
probably a clear "NO". So there's no point in even trying to find out
_why_, in 1984, it was considered a good heuristic.

The question is: "what can we do to improve it?". Not "what arguments can
we come up with to make excuses for a sucky algorithm that clearly does
the wrong thing for real-life loads".

One such improvement has already been put on the table: remove the
algorithm, and make it purely greedy.

We know that works. And yes, we realize that it has downsides too. Which
is why some kind of hybrid is probably called for. Come up with your own
sixth sense. We know the current one is really bad for real-life loads.

I actually bet that the greedy algorithm would work wonderfully well, if
we just had run-time balancing. I bet that the main argument for
"spreading things out" is that it minimizes the risk of overflow in any
particular group, never mind performance.

And if you're stuck with the decision you make forever, minimizing risk is
a good approach.

And maybe the fundamental problem is exactly that: because we're stuck
with our decision forever, people felt that they couldn't afford to risk
doing what was very obviously the right thing.

So I still claim that we should look for short-time profit, and then try
to fix up the problems longer term. With, if required, some kind of
rebalancing.

Linus

2001-11-06 03:03:00

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

> And you have to realize that _whatever_ we do, it will always be a
> heuristic. We don't know what the right behaviour is without being able to
> predict the future. Agreed?

No arguments. While we are at it, let's just state once and forever:
	the FFS allocator sucks for the fast-growth case.
Everyone agrees with that, including me, you _and_ Kirk.

> The question is: "what can we do to improve it?". Not "what arguments can
> we come up with to make excuses for a sucky algorithm that clearly does
> the wrong thing for real-life loads".

Obviously.

> One such improvement has already been put on the table: remove the
> algorithm, and make it purely greedy.
>
> We know that works. And yes, we realize that it has downsides too. Which
> is why some kind of hybrid is probably called for. Come up with your own

Exactly.

> And maybe the fundamental problem is exactly that: because we're stuck
> with our decision forever, people felt that they couldn't afford to risk
> doing what was very obviously the right thing.
>
> So I still claim that we should look for short-time profit, and then try
> to fix up the problems longer term. With, if required, some kind of
> rebalancing.

Whatever heuristic we use, it _must_ catch the fast-growth scenario. No
arguments on that. The question is what will minimize the problems
for the other cases.

On-line defrag can actually be fairly nasty - that has been tried and
it ends up with a hell of a lot of tricky details. Especially with the
page cache in the game. And "umount once a month" is not serious - think
of a _large_ disk on a department NFS server. So I'd rather look for decent
heuristics before going for a "let's defrag it once in a while" kind of
solution.

I definitely want to start with looking through relevant work - both
for the data and for information on _failed_ attempts to solve the
problem.

/me goes to dig through that stuff

2001-11-06 03:49:46

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

> One such improvement has already been put on the table: remove the
> algorithm, and make it purely greedy.

OK, some digging had brought another one:

a) if it's first-level directory - get it the fsck out of root's cylinder
group.
b) if we just keep creating directories in a cylinder group and do not
create any files there - stop, it's no good (i.e. there's a limit on
number of back-to-back directory creations in the same group).
c) try putting it into the parent's CG, but reserve some number of inodes
and data blocks in it. If we can't - tough, get the fsck out of there.

From the first reading of the code (aside from the general yuck wrt style,
but that's actually more about the older code in there) it seems to
be reasonable. It should solve the problem with the fast-growth scenario
and it puts some effort into avoiding nastiness with the slow-growth one.
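
To make the three rules concrete, here is a minimal sketch of what such a
chooser could look like. This is an illustration only, not code from this
thread: the reservation limits and the two helpers marked "assumed" are
invented for the example, and only ext2_get_group_desc() and the bg_*
fields are real ext2 names.

static int pick_emptiest_group(struct super_block *sb);			/* assumed */
static int recent_dir_streak(struct super_block *sb, int group);	/* assumed */

#define RESERVED_INODES	 32	/* assumed headroom for rule (c) */
#define RESERVED_BLOCKS	128	/* assumed headroom for rule (c) */
#define MAX_DIR_STREAK	 16	/* assumed limit for rule (b) */

static int choose_dir_group(struct super_block *sb, struct inode *parent)
{
	int grp = parent->u.ext2_i.i_block_group;
	struct ext2_group_desc *gd;

	/* (a) never leave a first-level directory in the root's group */
	if (parent == sb->s_root->d_inode)
		return pick_emptiest_group(sb);

	/* (b) too many directories created here back-to-back without
	 * any files between them - go elsewhere */
	if (recent_dir_streak(sb, grp) > MAX_DIR_STREAK)
		return pick_emptiest_group(sb);

	/* (c) stay in the parent's group only while it can still spare
	 * some inodes and data blocks */
	gd = ext2_get_group_desc(sb, grp, NULL);
	if (gd && le16_to_cpu(gd->bg_free_inodes_count) > RESERVED_INODES &&
	    le16_to_cpu(gd->bg_free_blocks_count) > RESERVED_BLOCKS)
		return grp;

	return pick_emptiest_group(sb);
}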

2001-11-06 04:04:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Alexander Viro wrote:
>
> OK, some digging had brought another one:
>
> a) if it's first-level directory - get it the fsck out of root's cylinder
> group.

Hey, now that you've read it in a paper you like it, but when I suggest it
in email you shoot it down?

<Whiny mode on> I thought you loved me, Al. <Whiny mode off>

> b) if we just keep creating directories in a cylinder group and do not
> create any files there - stop, it's no good (i.e. there's a limit on
> number of back-to-back directory creations in the same group).

The current code actually has some vestiges that _seem_ to be trying to do
something like this: see the commented-out

if (tmp && (le16_to_cpu(tmp->bg_used_dirs_count) << 8) <
    le16_to_cpu(tmp->bg_free_inodes_count)) {

which _seems_ to want to play games with "number of directories allocated
vs nr of free inodes".

But it's commented out with "I am not yet convinced that this next bit is
necessary". I don't know if the code has ever been active, or whether it
showed other problems.

> c) try putting it into the parent's CG, but reserve some number of inodes
> and data blocks in it. If we can't - tough, get the fsck out of there.

Hmm.. Maybe this is actually closer to what we try to do above..

Linus

2001-11-06 04:21:32

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput



On Mon, 5 Nov 2001, Linus Torvalds wrote:

>
> On Mon, 5 Nov 2001, Alexander Viro wrote:
> >
> > OK, some digging had brought another one:
> >
> > a) if it's first-level directory - get it the fsck out of root's cylinder
> > group.
>
> Hey, now that you've read it in a paper you like it, but when I suggest it
> in email you shoot it down?
>
> <Whiny mode on> I thought you loved me, Al. <Whiny mode off>

Oh, come on. (a) is obvious, but obviously not enough ;-)

> > b) if we just keep creating directories in a cylinder group and do not
> > create any files there - stop, it's no good (i.e. there's a limit on
> > number of back-to-back directory creations in the same group).
>
> The current code actually has some vestiges that _seem_ to be trying to do
> something like this: see the commented-out
>
> if (tmp && (le16_to_cpu(tmp->bg_used_dirs_count) << 8) <
> le16_to_cpu(tmp->bg_free_inodes_count)) {
>
> which _seems_ to want to play games with "number of directories allocated
> vs nr of free inodes".
>
> But it's commented out with "I am not yet convinced that this next bit is
> necessary". I don't know if the code has ever been active, or whether it
> showed other problems.
>
> > c) try putting it into the parent's CG, but reserve some number of inodes
> > and data blocks in it. If we can't - tough, get the fsck out of there.
>
> Hmm.. Maybe this is actually closer to what we try to do above..

Yes, but block reservation also makes sense (otherwise we can end up
putting a directory into parent's CG only to have all children
going there _and_ getting far from their data). Which might be the
problem with original code, BTW.

OK, anyway - I've got a bunch of cleanups for ialloc.c (equivalent
transformations, split into small steps and decently tested - I've
used them for almost a year). That stuff moves choice of CG for
directories and non-directories into separate functions, so no
matter which variant we end up doing I think that it's worth doing
first - things will be cleaner after that.

2001-11-06 05:05:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Alexander Viro wrote:
>
> Oh, come on. (a) is obvious, but obviously not enough ;-)

I agree, but I think it's a fairly good starting point to build up some
heuristics.

Also note that traditional UNIX implementations will have a hard time doing
anything more fine-grained than "is the parent the root directory or not".
Information about grandparents etc. has been lost long before we get to
"ialloc()", so I bet you'll find experimental data only on that one-level
thing.

But it is probably not a binary decision in real life.

In real life, you're ok switching cylinder groups for /home vs /usr, but
you're also ok switching groups for /home/torvalds vs /home/viro. And it's
still reasonably likely that it makes sense to switch groups in
/home/torvalds between 'src' and 'public_html'.

But the deeper down in the directory hierarchy you get, the less likely it
is that it makes sense.

And yes, it's just a heuristic. But it's one where, thanks to the
dcache, we could reasonably easily do more than just a black-and-white
"root vs non-root" thing.

> Yes, but block reservation also makes sense (otherwise we can end up
> putting a directory into parent's CG only to have all children
> going there _and_ getting far from their data). Which might be the
> problem with original code, BTW.

Yes, agreed. Disk space allocations should definitely be part of the
heuristics, and that's obviously both inodes and blocks.

I do believe that there are probably more "high-level" heuristics that can
be useful, though. Looking at many common big trees, the ownership issue is
one obvious clue. Sadly the trees obviously end up being _created_ without
owner information, and the chown/chgrp happens later, but it might still
be usable for some clues.

Linus

2001-11-06 05:36:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Linus Torvalds wrote:
>
> I do believe that there are probably more "high-level" heuristics that can
> be useful, though. Looking at many common big trees, the ownership issue is
> one obvious clue. Sadly the trees obviously end up being _created_ without
> owner information, and the chown/chgrp happens later, but it might still
> be usable for some clues.
>

I didn't understand your objection to the heuristic "was the
parent directory created within the past 30 seconds?". If the
parent and child were created at the same time, chances are that
they'll be accessed at the same time?
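
As a sketch (illustration only, not a tested patch; 2.4-style CURRENT_TIME
and i_ctime assumed), the check could be as simple as:

/* Treat the mkdir as part of a fast-growing tree if the parent directory
 * was itself created within the last 30 seconds; in that case keep the
 * child in the parent's block group instead of spreading it out. */
static inline int parent_recently_created(const struct inode *dir)
{
	return (CURRENT_TIME - dir->i_ctime) < 30;	/* seconds */
}

The 30 seconds is an arbitrary knob, of course.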

And there's always the `chattr' cop-out, to alter the allocation
policy at a chosen point in the tree by administrative act.

Any change in ext2 allocation policy at this point in time really,
really worries me. If it screws up we'll have 10,000,000 linux
boxes running like dead dogs in a year. So if we _do_ make a change, I'd
suggest that it be opt-in. Call me a wimp.

-

2001-11-06 05:52:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput


On Mon, 5 Nov 2001, Andrew Morton wrote:
>
> I didn't understand your objection to the heuristic "was the
> parent directory created within the past 30 seconds?". If the
> parent and child were created at the same time, chances are that
> they'll be accessed at the same time?

The thing I don't like about it is the non-data-dependence, i.e. the
layout of the disk will actually depend on how long it took you to write
the tree.

I'm not saying it's a bad heuristic - it's probably a fine (and certainly
simple) one. But the thought that when the NFS server has problems, a
straight "cp -a" of the same tree results in different layout just because
the server was moved over from one network to another makes me go "Ewww.."

Linus

2001-11-06 07:36:18

by Mike Castle

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On Mon, Nov 05, 2001 at 09:48:40PM -0800, Linus Torvalds wrote:
> I'm not saying it's a bad heuristic - it's probably a fine (and certainly
> simple) one. But the thought that when the NFS server has problems, a
> straight "cp -a" of the same tree results in different layout just because
> the server was moved over from one network to another makes me go "Ewww.."

The layout is most likely going to be different anyway, isn't it? We don't
know what has gone on in the past to get the FS into the current state.

But now we have more information than we did when the file system was
originally built.

We don't have an extent-based interface, do we? So we can't say "I know
this file is going to be X bytes in size." But if we accomplish nearly the
same thing by a quick copy like this, what's the harm?

mrc
--
Mike Castle [email protected] http://www.netcom.com/~dalgoda/
We are all of us living in the shadow of Manhattan. -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

2001-11-06 08:15:27

by kaih

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

[email protected] (Andrew Morton) wrote on 05.11.01 in <[email protected]>:

> Linus Torvalds wrote:
> >
> > I do believe that there are probably more "high-level" heuristics that can
> > be useful, though. Looking at many common big trees, the ownership issue is
> > one obvious clue. Sadly the trees obviously end up being _created_ without
> > owner information, and the chown/chgrp happens later, but it might still
> > be usable for some clues.

Size of the parent directory might be another clue.

> I didn't understand your objection to the heuristic "was the
> parent directory created within the past 30 seconds?". If the
> parent and child were created at the same time, chances are that
> they'll be accessed at the same time?

Thought experiment:

Put stuff on a disk the usual slow way.

Backup. Mkfs. Restore.

Should the allocation pattern now be different? Why?

> And there's always the `chattr' cop-out, to alter the allocation
> policy at a chosen point in the tree by administrative act.

Much help that's going to be in the above scenario, given how tar calls
chattr ... not.

> Any change in ext2 allocation policy at this point in time really,
> really worries me. If it screws up we'll have 10,000,000 linux
> boxes running like dead dogs in a year. So if we _do_ make a change, I'd
> suggest that it be opt-in. Call me a wimp.

Well, with Alex' cleanups, switchable policies might just become possible.

MfG Kai

2001-11-06 08:33:00

by Alan

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

> > So I still claim that we should look for short-time profit, and then try
> > to fix up the problems longer term. With, if required, some kind of
> > rebalancing.
>
> Whatever heuristic we use, it _must_ catch the fast-growth scenario. No
> arguments on that. The question is what will minimize the problems
> for the other cases.

Surely the answer if you want short term write speed and long term balancing
is to use ext3 not ext2 and simply ensure that the relevant stuff goes to
the journal (which will be nicely ordered) first. That will give you some
buffering at least.

Alan

2001-11-06 08:37:41

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput



On Tue, 6 Nov 2001, Alan Cox wrote:

> Surely the answer if you want short term write speed and long term balancing
> is to use ext3 not ext2 and simply ensure that the relevant stuff goes to
> the journal (which will be nicely ordered) first. That will give you some
> buffering at least.

Alan, the problem is present in ext3 as well as in all other FFS derivatives
(well, FreeBSD tried to deal with that stuff this spring).

2001-11-06 08:54:12

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Alexander Viro wrote:
>
> On Tue, 6 Nov 2001, Alan Cox wrote:
>
> > Surely the answer if you want short term write speed and long term balancing
> > is to use ext3 not ext2 and simply ensure that the relevant stuff goes to
> > the journal (which will be nicely ordered) first. That will give you some
> > buffering at least.
>
> Alan, the problem is present in ext3 as well as in all other FFS derivatives
> (well, FreeBSD tried to deal with that stuff this spring).
>

Yep. Once we're seek-bound on metadata and data, the occasional
seek-and-squirt into the journal won't make much difference, and
the other write patterns will be the same.

Interestingly, current ext3 can do a 600 meg write in fifty
seconds, whereas ext2 takes seventy. This will be related to the
fact that ext3 just pumps all the buffers into submit_bh(),
whereas ext2 fiddles around with all the write_locked_buffers()
stuff. I think. Or the intermingling of indirects with data
is tripping ext2 up. The additional seeking is audible.

-

2001-11-06 09:16:13

by Wojtek Pilorz

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On Mon, 5 Nov 2001, Alexander Viro wrote:

[...]
>
> find(1). Stuff that runs every damn night. We end up doing getdents() +
> a bunch of lstat() on the entries we found. And that means access to the
> inodes referred to from them. Try that on a large directory with child
> inodes spread all over the disk.
>
[...]

Part of the problem is that, as far as I know, there is no sane way to
ask the kernel to execute a number of lstats or similar calls in
whatever order is convenient for the system (I do not consider
creating threads for this purpose a sane way).
If a program (say find, or tar, or anything) needs information from
5000 lstats, fstats or reads, it has to issue them one by one in some
order, without knowing which order would be best.
If we had a call to execute a vector of lstats, or opens, or reads (from
different handles), a program using such calls would let the kernel
decide the best order for the individual operations.

I remember that using similar 'vector' functions in the old days on IBM OS/MVT
gave dramatic performance improvements (maybe for different
reasons: machine memories were often half a megabyte, so opening a file
required the system to read and execute several modules loaded from system
disks; when opening 10 files, each of the modules had to be loaded once
rather than 10 times).

Would it be possible and a good idea to add such 'vector' operations to
the Linux kernel?
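
A hypothetical interface along those lines - purely a sketch, nothing like
this exists in the kernel - might look like:

/* Hypothetical batch-stat interface, for illustration only.  The kernel
 * would be free to sort the lookups so that the underlying inode reads
 * are issued in a disk-friendly order. */
struct stat_request {
	const char	*path;		/* file to lstat */
	struct stat	*buf;		/* where to store the result */
	int		 error;		/* per-entry errno, 0 on success */
};

/* returns the number of entries processed, or -1 on total failure */
int lstat_vec(struct stat_request *reqs, unsigned int count);

find or tar could then hand the kernel a whole directory's worth of
entries at once and let it order the inode reads itself.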

Best regards,

Wojtek
--------------------
Wojtek Pilorz
[email protected]


2001-11-06 09:59:14

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput



On Tue, 6 Nov 2001, Wojtek Pilorz wrote:

> Would it be possible and a good idea to add such 'vector' operations to
> the Linux kernel?

Occam's Razor and all such. Plus portability problems.

Besides, you get enough information from getdents() if you really want
to play silly games with that.

2001-11-06 10:52:00

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On November 5, 2001 11:22 pm, Andrew Morton wrote:
> - The seek distance-versus-cost equation has changed. Take a look
> at a graph of seek distance versus time. Once you've decided to
> seek ten percent of the distance across the disk, a 90% seek only
> takes twice as long.

Do you have such a graph handy?

--
Daniel

2001-11-06 16:17:44

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On Tue, 2001-11-06 at 02:52, Daniel Phillips wrote:
> On November 5, 2001 11:22 pm, Andrew Morton wrote:
> > - The seek distance-versus-cost equation has changed. Take a look
> > at a graph of seek distance versus time. Once you've decided to
> > seek ten percent of the distance across the disk, a 90% seek only
> > takes twice as long.
>
> Do you have such a graph handy?

This is one I made a while ago while doing some measurements; it's also
the one Andrew was referring to. The drive is a very ordinary WD 40G
7200RPM drive. You can clearly see the rotation time in the width of
the band, and the non-linear short-seek optimizations (the drive seems
to do something like dump a lot more current into the seek coil for
short seeks, probably knowing that it can't burn things out if it's only
short).

There's an interesting detail you can't really see from this graph.
Very short seeks (<< 1% disk size) are about 1ms. That's not seeks
within a track; that's track-track seeks. Since rotational latency is
up to 8.3ms on this drive, it's often cheaper to seek to the next track
if the data is available there rather than wait for a rotation.

J


Attachments:
seek-buf.eps.gz (25.57 kB)

2001-11-06 22:23:18

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Hi,

On Mon, Nov 05, 2001 at 02:22:41PM -0800, Andrew Morton wrote:

> For some workloads we want the subdirectories close to the
> parent as well. Failing to do so is horridly wrong.

If you apply that recursively, then _all_ the directories in a
filesystem end up in the same place. ext2 has traditionally been
extremely resistant to fragmentation degradation over time, and the
spreading out of the directory tree over the filesystem is part of
that.

> What has changed since Kirk's design?
>
> - The relative cost of seeks has increased. Device caching
> and readahead and increased bandwidth make it more expensive
> to seek away.

I'm not convinced about that. Modern disks are so fast at streaming
that _any_ non-sequential access is a major cost. Track-to-track
seeks are typically well under the average rotational cost. It's not
seeking to a distant location that's particularly expensive: any seek
is, whether to the same track or not.

> I don't think I buy the fragmentation argument, really.

Recent experiments showed that reiserfs, which starts off allocating
files quite close together, was significantly faster than ext2 on
mostly-empty filesystems but got hugely slower as you approached 90%
full or more. I don't buy the argument that you can ignore
fragmentation. There must be a balance between short-term performance
when allocating files and long-term performance when ensuring you've
got enough free space inside a directory tree to cope with new files.

Even kernel builds may show this up. If you try to keep a directory
tree compact, then you may get excellent performance when unpacking
the kernel tarball. But once you've applied a few large patch sets
and set the kernel build going, your new files, .c.orig patch backups,
and .o files will have nowhere nearby to get allocated in.

Cheers,
Stephen

2001-11-06 22:23:28

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Hi,

On Tue, Nov 06, 2001 at 07:28:02AM +1100, [email protected] wrote:

> Another heuristic to try may be to only use a different blockgroup
> when the mkdir()s are separate in time.

If I'm creating a hierarchy with "tar xvzf", why am I doing it? I
might be unpacking a source tree where all the files are going to be
used together. But I might be restoring a backup of /home. Assuming
that we want a compact unpacking is not always going to be correct.

Cheers,
Stephen

2001-11-06 23:18:18

by Alexander Viro

[permalink] [raw]
Subject: ext2/ialloc.c cleanup

Folks, promised cleanup of ialloc.c is on
ftp.math.psu.edu:pub/viro/ialloc.c,v

And yes, it's in RCS. The thing is split into _really_ small
steps (commented in the log). Each is a trivial transformation
and it should be very easy to verify correctness of any of
them.

Please, review. IMO it's cut fine enough to make the gradual merge
possible for 2.4 - look and you'll see.


2001-11-07 19:38:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

On Nov 06, 2001 18:17 -0500, Alexander Viro wrote:
> Folks, promised cleanup of ialloc.c is on
> ftp.math.psu.edu:pub/viro/ialloc.c,v
>
> And yes, it's in RCS. The thing is split into _really_ small
> steps (commented in the log). Each is a trivial transformation
> and it should be very easy to verify correctness of any of
> them.
>
> Please, review. IMO it's cut fine enough to make the gradual merge
> possible for 2.4 - look and you'll see.

Minor nits, from my changes to this same function:
1) please replace use of "i" for best block group in find_cg_*, to
something better like "group", just for clarity.
2) in find_cg_*, when you fail the quadratic search, the linear search
should skip groups that were previously checked in the quadratic search,
with slight changes to both loops:

	start = dir->u.ext2_i.i_block_group;

	/* Use quadratic hash to find group with a free inode */
	for (j = 1; j < count; j <<= 1) {
		group = start + j;
		if (group >= count)
			group -= count;
		cg = ext2_get_group_desc(sb, group, &bh);
		if (cg && le16_to_cpu(cg->bg_free_inodes_count))
			goto found;
	}

	/* That failed: try linear search for a free inode,
	 * skipping groups we checked in the previous loop.
	 */
	for (j = 3; j < count; j++) {
		if ((j & (j - 1)) == 0)
			continue;
		group = start + j;
		if (group >= count)
			group -= count;
		cg = ext2_get_group_desc(sb, group, &bh);
		if (cg && le16_to_cpu(cg->bg_free_inodes_count))
			goto found;
	}
3) I know that "cylinder groups" were used in old FFS/whatever implementation,
but all of the ext2 code/documentation refers to these as block groups.
Can you stick with that for ext2 (e.g. gdp, not cg; bg_foo, not cg_foo)?
4) sbi can be gotten by "EXT2_SB(sb)".

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-07 20:03:00

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup



On Wed, 7 Nov 2001, Andreas Dilger wrote:

> Minor nits, from my changes to this same function:
> 1) please replace use of "i" for best block group in find_cg_*, to
> something better like "group", just for clarity.

Consider that done.

> 2) in find_cg_*, when you fail the quadratic search, the linear search
> should skip groups that were previously checked in the quadratic search,
> with slight changes to both loops:

I'm not actually sure that it's a good thing. The difference between the
sequences we do is that I do
	n n+1 n+3 n+7 ... n+2 (linear)
and you do
	n n+1 n+2 n+4 n+8 ... n+3 (linear)
which has slightly worse properties. You avoid the duplicated check on n+3,
but lose a very nice property - shifting the old sequence is guaranteed
not to have many intersections with the original in the beginning (distances
between elements do not repeat). With your sequence it's no longer true.
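
For what it's worth, a trivial way to see the two probe orders side by side
(illustration only, not code from this thread; 16 groups, starting group 0):

#include <stdio.h>

/* Print the group offsets each probe order visits, so the gap pattern is
 * easy to compare: 1,2,4,8 for the cumulative walk vs 1,1,2,4 below. */
int main(void)
{
	int ngroups = 16, i, j;

	printf("cumulative (i += j):  ");
	for (i = 0, j = 1; j < ngroups; j <<= 1)
		printf("%d ", i += j);		/* 1 3 7 15 */

	printf("\nabsolute (start + j): ");
	for (j = 1; j < ngroups; j <<= 1)
		printf("%d ", j);		/* 1 2 4 8 */
	printf("\n");
	return 0;
}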

> 3) I know that "cylinder groups" were used in old FFS/whatever implementation,
> but all of the ext2 code/documentation refers to these as block groups.
> Can you stick with that for ext2 (e.g. gdp, not cg; bg_foo, not cg_foo)?

Ehh... Try to read that aloud. Maybe it's just me, but "gdp" sounds (and
looks) bad...

> 4) sbi can be gotten by "EXT2_SB(sb)".

True, consider that done.

Right now I'm doing an alternative strategy for directory allocation; as soon
as I finish that I'll put the result in the usual place.

2001-11-08 02:12:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

Alexander Viro wrote:
>
> Right now I'm doing an alternative strategy for directory allocation; as soon
> as I finish that I'll put the result in the usual place.

I'll be interested in seeing the algorithm.

Keith Smith has kindly directed me to the current location of
his testing workload. This attempts to replay ten months' worth
of real file server activity. It's described in his paper, at

http://www.eecs.harvard.edu/~keith/usenix96

It's all set up for a 512 megabyte filesystem. So I ran
it sixteen times in succession on sixteen different directories
at the top level of an eight-gig filesystem. It takes about
five hours. The filesystem ends up 76% full.

I did this on two filesystems: one with stock 2.4.14 and the other
with the `if (0 &&' change. The patched kernel was consistently
10-15% faster at running the workload.

Now, lots of researchers seem to define lots of different ways of measuring
fragmentation and they are all rather unconvincing - they don't take into
account the extreme non-linear behaviour of modern disks.

So I simply measured how long it took to read some files. Which doesn't
give any insight into the nature and sources of fragmentation, but it
is the bottom line.

And reading nine thousand files containing three hundred megs of
data from the filesystem which was populated with the `if (0 &&'
patch takes 1.5x to 2x as long as it does from stock 2.4.14.

So that idea is a non-starter.

This seems to be a pretty good fragmentation test and I'm inclined
to stick with it.

I suspect that the current algorithm, with some hysteresis to make it
less inclined to hop into a different block group, is the way to go.

=

2001-11-08 15:25:00

by Constantin Loizides

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput


> This is one I made a while ago while doing some measurements; it's also
[snip]

That's an interesting plot. I would like to do one for my disk!

How did you do it?
How do you find out about the seek distance???
Do you create one big file and then start
seeking around accessing different blocks of it?

Please, let me know,

thanks,
Constantin

2001-11-08 15:29:20

by Constantin Loizides

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

> Wow. That looks nice. Unfortunately the tarballs are
> missing several files. read.cxx, agesystem.cxx, etc. Plus
> a number of dud links.

I am sorry, I have made a new makefile recently... should have tested
it...

>
> Constantin, could you please check that the agesystem3 tarball
> builds OK, maybe put up a new one? Thanks.
>

New tars are on the web...

If you plan to do some tests using agesystem3, e.g. the agetest.tar.gz
package, don't hesitate to ask me before you waste your time,
as my documentation is not very self-explanatory!

Constantin

2001-11-08 16:46:23

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On Thu, 2001-11-08 at 07:24, Constantin Loizides wrote:
>
> > This is one I made a while ago while doing some measurements; it's also
> [snip]
>
> That's an interesting plot. I would like to do one for my disk!
>
> How did you do it?
> How do you find out about the seek distance???
> Do you create one big file and then start
> seeking around accessing different blocks of it?

It's pretty simple. I open /dev/hda (the whole device), and read random
blocks, timing how long it takes for the data to come back since the
last one. I set up a few hundred/thousand buckets, and accumulate the
measured time into the bucket for that seek distance. So:

fd = open("/dev/hda");
llseek(fd, last = rand());
read(fd);
while (!finished) {
	blk = rand();
	t0 = now();			/* timestamp before the seek */
	llseek(fd, blk);
	read(fd);
	/* accumulate the elapsed time into the bucket for this seek distance */
	bucket[abs(last - blk)] += now() - t0;
	last = blk;
}

I found the random seeking was the only way I could get reasonable
numbers out of the drive; any of the ordered patterns of the form "OK,
let's measure N block seeks" seemed to be predicted by the drive and gave
unreasonably fast results.

This technique measures seek+rotational latency for reads.
Seek-for-reading is generally faster than seek-for-writing, because they
start reading as soon as the head is approximately in the right place,
and start using the data as soon as the ECCs start checking out OK. You
certainly can't start writing until you've done the full head-settle
time (another 1-2ms).

I'll see what state my measurement code is in (technically and legally)
and see if I can release it. Anyway I'm happy to answer questions.

J

2001-11-08 20:51:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

OK, some more columns of numbers.

The 10-month simulated workload generates a total of 328 megs
of data. This workload was applied seventeen times in succession
to sequentially numbered directories at the top of an empty
8 gigabyte ext2 fs. Each top-level directory ends up containing
62 directories and 8757 files. After the seventeen directories
were created, the disk was 75% full.

To assess the fragmentation we see how long it takes to read
all the files in each top-level directory, cache-cold, using

find . -type f | xargs cat > /dev/null

We can find the "ideal" time for running the above command by copying
the 328 megs of files onto a pristine partition using an ext2 which
has the `if (0 &&' patch and then timing how long that takes to
read. It's 13 seconds, which is pretty darn good. Reading an
equivalently-sized single file takes 12.6 seconds, so we're maxing out the disk.

Tree    Stock    Stock/   Patched   Patched/
        (secs)   ideal     (secs)    Stock
  0       37       2.9        74       2.0
  1       41       3.2        89       2.2
  2       41       3.2        97       2.4
  3       38       3.0        97       2.6
  4       55       4.3       102       1.9
  5       51       4.0        98       1.9
  6       72       5.7        94       1.3
  7       56       4.4        96       1.7
  8       64       5.1       102       1.6
  9       63       5.0       100       1.6
 10       57       4.5        95       1.7
 11       83       6.6       102       1.2
 12       54       4.3       101       1.9
 13       82       6.5       104       1.3
 14       89       7.1       103       1.2
 15       88       7.0        95       1.1
 16      106       8.4        99       0.9

Mean:   63.35      4.9      96.4       1.7

So.

- Even after a single iteration of the Smith test, ext2 fragmentation
has reduced efficiency by a factor of three. And the disk was
never more than 5% full during the creation of these files.

- By the time the disk is 75% full, read efficiency is down by a factor
of over eight. Given that the most-recently created files are also
the most-frequently accessed, this hurts.

- The `if (0 &&' patch makes things 1.7x worse, but it never got worse
than the stock ext2's worst case.

Directory and inode fragmentation is not significant. The time
to run `du -s' against tree number 8 is two seconds.

Seems to be fertile ground for improvement; however, it looks
as though even a single pass of the Smith test has fragged the
disk to death, and this is perhaps not a case we want to optimise
for.

The total amount of data which a single pass of the Smith workload
passes into write() is approx 52 gigabytes. Thankfully, very little
of that actually hits disk because it's deleted before it gets written
back; but that's fine for this exercise.

So after writing 52 gigs of data and removing 51.7 gigs, an 8 gig
ext2 filesystem is badly fragmented. Does this mean that the tale
about ext2 being resistant to fragmentation is a myth?

Or is this workload simply asking too much? If the answer is
yes, then one solution is to change ext2 block allocation to
favour the fast-growth case and develop an online defrag tool.

Seems we have looked at both ends of the spectrum: the ultimate
case of fast growth and the ultimate case of slow growth. In
both cases, ext2 block allocation is running at 3x to 5x less
than ideal. Where is its sweet spot?


-

2001-11-08 22:16:59

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup



OK, here's something that might be worth testing. It isn't a final
version and it most likely needs tuning, but...

Anyway, it builds, boots and seems to be working. The worst it
can do is to make bad choices for directory allocation ;-)

Comparing this beast (and its variants with different parameters) with
the standard allocator would be interesting.

diff -urN S14/fs/ext2/ialloc.c S14-ext2/fs/ext2/ialloc.c
--- S14/fs/ext2/ialloc.c Tue Oct 9 21:47:26 2001
+++ S14-ext2/fs/ext2/ialloc.c Thu Nov 8 15:57:13 2001
@@ -17,7 +17,7 @@
#include <linux/ext2_fs.h>
#include <linux/locks.h>
#include <linux/quotaops.h>
-
+#include <linux/random.h>

/*
* ialloc.c contains the inodes allocation and deallocation routines
@@ -39,37 +39,26 @@
* Read the inode allocation bitmap for a given block_group, reading
* into the specified slot in the superblock's bitmap cache.
*
- * Return >=0 on success or a -ve error code.
+ * Return buffer_head of bitmap on success or NULL.
*/
-static int read_inode_bitmap (struct super_block * sb,
- unsigned long block_group,
- unsigned int bitmap_nr)
+static struct buffer_head *read_inode_bitmap (struct super_block * sb,
+ unsigned long block_group)
{
struct ext2_group_desc * gdp;
struct buffer_head * bh = NULL;
- int retval = 0;

gdp = ext2_get_group_desc (sb, block_group, NULL);
- if (!gdp) {
- retval = -EIO;
+ if (!gdp)
goto error_out;
- }
+
bh = bread (sb->s_dev, le32_to_cpu(gdp->bg_inode_bitmap), sb->s_blocksize);
- if (!bh) {
+ if (!bh)
ext2_error (sb, "read_inode_bitmap",
"Cannot read inode bitmap - "
"block_group = %lu, inode_bitmap = %lu",
block_group, (unsigned long) gdp->bg_inode_bitmap);
- retval = -EIO;
- }
- /*
- * On IO error, just leave a zero in the superblock's block pointer for
- * this group. The IO will be retried next time.
- */
error_out:
- sb->u.ext2_sb.s_inode_bitmap_number[bitmap_nr] = block_group;
- sb->u.ext2_sb.s_inode_bitmap[bitmap_nr] = bh;
- return retval;
+ return bh;
}

/*
@@ -83,79 +72,62 @@
* 2/ If the file system contains less than EXT2_MAX_GROUP_LOADED groups,
* this function reads the bitmap without maintaining a LRU cache.
*
- * Return the slot used to store the bitmap, or a -ve error code.
+ * Return the buffer_head of the bitmap or the ERR_PTR(error)
*/
-static int load_inode_bitmap (struct super_block * sb,
- unsigned int block_group)
+static struct buffer_head *load_inode_bitmap (struct super_block * sb,
+ unsigned int block_group)
{
- int i, j, retval = 0;
- unsigned long inode_bitmap_number;
- struct buffer_head * inode_bitmap;
+ int i, slot = 0;
+ struct ext2_sb_info *sbi = &sb->u.ext2_sb;
+ struct buffer_head *bh = sbi->s_inode_bitmap[0];

- if (block_group >= sb->u.ext2_sb.s_groups_count)
+ if (block_group >= sbi->s_groups_count)
ext2_panic (sb, "load_inode_bitmap",
"block_group >= groups_count - "
"block_group = %d, groups_count = %lu",
- block_group, sb->u.ext2_sb.s_groups_count);
- if (sb->u.ext2_sb.s_loaded_inode_bitmaps > 0 &&
- sb->u.ext2_sb.s_inode_bitmap_number[0] == block_group &&
- sb->u.ext2_sb.s_inode_bitmap[0] != NULL)
- return 0;
- if (sb->u.ext2_sb.s_groups_count <= EXT2_MAX_GROUP_LOADED) {
- if (sb->u.ext2_sb.s_inode_bitmap[block_group]) {
- if (sb->u.ext2_sb.s_inode_bitmap_number[block_group] != block_group)
- ext2_panic (sb, "load_inode_bitmap",
- "block_group != inode_bitmap_number");
- else
- return block_group;
- } else {
- retval = read_inode_bitmap (sb, block_group,
- block_group);
- if (retval < 0)
- return retval;
- return block_group;
- }
+ block_group, sbi->s_groups_count);
+
+ if (sbi->s_loaded_inode_bitmaps > 0 &&
+ sbi->s_inode_bitmap_number[0] == block_group && bh)
+ goto found;
+
+ if (sbi->s_groups_count <= EXT2_MAX_GROUP_LOADED) {
+ slot = block_group;
+ bh = sbi->s_inode_bitmap[slot];
+ if (!bh)
+ goto read_it;
+ if (sbi->s_inode_bitmap_number[slot] == slot)
+ goto found;
+ ext2_panic (sb, "load_inode_bitmap",
+ "block_group != inode_bitmap_number");
}

- for (i = 0; i < sb->u.ext2_sb.s_loaded_inode_bitmaps &&
- sb->u.ext2_sb.s_inode_bitmap_number[i] != block_group;
+ bh = NULL;
+ for (i = 0; i < sbi->s_loaded_inode_bitmaps &&
+ sbi->s_inode_bitmap_number[i] != block_group;
i++)
;
- if (i < sb->u.ext2_sb.s_loaded_inode_bitmaps &&
- sb->u.ext2_sb.s_inode_bitmap_number[i] == block_group) {
- inode_bitmap_number = sb->u.ext2_sb.s_inode_bitmap_number[i];
- inode_bitmap = sb->u.ext2_sb.s_inode_bitmap[i];
- for (j = i; j > 0; j--) {
- sb->u.ext2_sb.s_inode_bitmap_number[j] =
- sb->u.ext2_sb.s_inode_bitmap_number[j - 1];
- sb->u.ext2_sb.s_inode_bitmap[j] =
- sb->u.ext2_sb.s_inode_bitmap[j - 1];
- }
- sb->u.ext2_sb.s_inode_bitmap_number[0] = inode_bitmap_number;
- sb->u.ext2_sb.s_inode_bitmap[0] = inode_bitmap;
-
- /*
- * There's still one special case here --- if inode_bitmap == 0
- * then our last attempt to read the bitmap failed and we have
- * just ended up caching that failure. Try again to read it.
- */
- if (!inode_bitmap)
- retval = read_inode_bitmap (sb, block_group, 0);
-
- } else {
- if (sb->u.ext2_sb.s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED)
- sb->u.ext2_sb.s_loaded_inode_bitmaps++;
- else
- brelse (sb->u.ext2_sb.s_inode_bitmap[EXT2_MAX_GROUP_LOADED - 1]);
- for (j = sb->u.ext2_sb.s_loaded_inode_bitmaps - 1; j > 0; j--) {
- sb->u.ext2_sb.s_inode_bitmap_number[j] =
- sb->u.ext2_sb.s_inode_bitmap_number[j - 1];
- sb->u.ext2_sb.s_inode_bitmap[j] =
- sb->u.ext2_sb.s_inode_bitmap[j - 1];
- }
- retval = read_inode_bitmap (sb, block_group, 0);
- }
- return retval;
+ if (i < sbi->s_loaded_inode_bitmaps)
+ bh = sbi->s_inode_bitmap[i];
+ else if (sbi->s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED)
+ sbi->s_loaded_inode_bitmaps++;
+ else
+ brelse (sbi->s_inode_bitmap[--i]);
+
+ while (i--) {
+ sbi->s_inode_bitmap_number[i+1] = sbi->s_inode_bitmap_number[i];
+ sbi->s_inode_bitmap[i+1] = sbi->s_inode_bitmap[i];
+ }
+
+read_it:
+ if (!bh)
+ bh = read_inode_bitmap (sb, block_group);
+ sbi->s_inode_bitmap_number[slot] = block_group;
+ sbi->s_inode_bitmap[slot] = bh;
+ if (!bh)
+ return ERR_PTR(-EIO);
+found:
+ return bh;
}

/*
@@ -183,7 +155,6 @@
struct buffer_head * bh2;
unsigned long block_group;
unsigned long bit;
- int bitmap_nr;
struct ext2_group_desc * gdp;
struct ext2_super_block * es;

@@ -215,12 +186,10 @@
}
block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb);
- bitmap_nr = load_inode_bitmap (sb, block_group);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, block_group);
+ if (IS_ERR(bh))
goto error_return;

- bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr];
-
/* Ok, now we can actually update the inode bitmaps.. */
if (!ext2_clear_bit (bit, bh->b_data))
ext2_error (sb, "ext2_free_inode",
@@ -230,9 +199,11 @@
if (gdp) {
gdp->bg_free_inodes_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1);
- if (is_directory)
+ if (is_directory) {
gdp->bg_used_dirs_count =
cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+ sb->u.ext2_sb.s_dir_count--;
+ }
}
mark_buffer_dirty(bh2);
es->s_free_inodes_count =
@@ -259,23 +230,229 @@
* For other inodes, search forward from the parent directory's block
* group to find a free inode.
*/
+
+static int find_cg_dir(struct super_block *sb, const struct inode *parent)
+{
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ struct ext2_group_desc *cg, *best_cg = NULL;
+ struct buffer_head *bh, *best_bh = NULL;
+ int i = -1, j;
+
+ for (j = 0; j < ngroups; j++) {
+ cg = ext2_get_group_desc (sb, j, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei)
+ continue;
+ if (!best_cg ||
+ (le16_to_cpu(cg->bg_free_blocks_count) >
+ le16_to_cpu(best_cg->bg_free_blocks_count))) {
+ i = j;
+ best_cg = cg;
+ best_bh = bh;
+ }
+ }
+ if (!best_cg)
+ return -1;
+
+ best_cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(best_cg->bg_free_inodes_count) - 1);
+ best_cg->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(best_cg->bg_used_dirs_count) + 1);
+ sb->u.ext2_sb.s_dir_count++;
+ mark_buffer_dirty(best_bh);
+ return i;
+}
+
+/*
+ * Orlov's allocator for directories.
+ *
+ * We always try to spread first-level directories:
+ * If there are directories with both free inodes and free blocks counts
+ * not worse than average we return one with smallest directory count.
+ * Otherwise we simply return a random group.
+ *
+ * For the rest rules look so:
+ *
+ * It's OK to put directory into a group unless
+ * it has too many directories already (max_dirs) or
+ * it has too few free inodes left (min_inodes) or
+ * it has too few free blocks left (min_blocks) or
+ * it's already running too large debt (max_debt).
+ * Parent's group is prefered, if it doesn't satisfy these
+ * conditions we search cyclically through the rest. If none
+ * of the groups look good we just look for a group with more
+ * free inodes than average (starting at parent's group).
+ *
+ * Debt is incremented each time we allocate a directory and decremented
+ * when we allocate an inode, within 0--255.
+ */
+
+#define INODE_COST 64
+#define BLOCK_COST 256
+
+static int find_cg_dir_orlov(struct super_block *sb, const struct inode *parent)
+{
+ int parent_cg = parent->u.ext2_i.i_block_group;
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ int inodes_per_group = EXT2_INODES_PER_GROUP(sb);
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups;
+ int blocks_per_dir;
+ int ndirs = sb->u.ext2_sb.s_dir_count;
+ int max_debt, max_dirs, min_blocks, min_inodes;
+ int group = -1, i;
+ struct ext2_group_desc *cg;
+ struct buffer_head *bh;
+
+ if (parent == sb->s_root->d_inode) {
+ struct ext2_group_desc *best_cg = NULL;
+ struct buffer_head *best_bh = NULL;
+ int best_ndir = inodes_per_group;
+ int best_group = -1;
+
+ for (group = 0; group < ngroups; group++) {
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_used_dirs_count) >= best_ndir)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei)
+ continue;
+ if (le16_to_cpu(cg->bg_free_blocks_count) < avefreeb)
+ continue;
+ best_group = group;
+ best_ndir = le16_to_cpu(cg->bg_used_dirs_count);
+ best_cg = cg;
+ best_bh = bh;
+ }
+ if (best_group >= 0) {
+ cg = best_cg;
+ bh = best_bh;
+ group = best_group;
+ goto found;
+ }
+ get_random_bytes(&group, sizeof(group));
+ goto fallback;
+ }
+
+ blocks_per_dir = (le32_to_cpu(es->s_blocks_count) -
+ le32_to_cpu(es->s_free_blocks_count)) / ndirs;
+
+ max_dirs = ndirs / ngroups + inodes_per_group / 16;
+ min_inodes = avefreei - inodes_per_group / 4;
+ min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;
+
+ max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST);
+ if (max_debt * INODE_COST > inodes_per_group)
+ max_debt = inodes_per_group / INODE_COST;
+ if (max_debt > 255)
+ max_debt = 255;
+ if (max_debt == 0)
+ max_debt = 1;
+
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_cg + i) % ngroups;
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (sb->u.ext2_sb.debts[group] >= max_debt)
+ continue;
+ if (le16_to_cpu(cg->bg_used_dirs_count) >= max_dirs)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < min_inodes)
+ continue;
+ if (le16_to_cpu(cg->bg_free_blocks_count) < min_blocks)
+ continue;
+ goto found;
+ }
+
+fallback:
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_cg + i) % ngroups;
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei)
+ goto found;
+ }
+
+ return -1;
+
+found:
+ cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1);
+ cg->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_used_dirs_count) + 1);
+ sb->u.ext2_sb.s_dir_count++;
+ mark_buffer_dirty(bh);
+ return group;
+}
+
+static int find_cg_other(struct super_block *sb, const struct inode *parent)
+{
+ int parent_cg = parent->u.ext2_i.i_block_group;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ struct ext2_group_desc *cg;
+ struct buffer_head *bh;
+ int i, j;
+
+ /*
+ * Try to place the inode in its parent directory
+ */
+ i = parent_cg;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+
+ /*
+ * Use a quadratic hash to find a group with a
+ * free inode
+ */
+ for (j = 1; j < ngroups; j <<= 1) {
+ i += j;
+ if (i >= ngroups)
+ i -= ngroups;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+ }
+
+ /*
+ * That failed: try linear search for a free inode
+ */
+ i = parent_cg + 1;
+ for (j = 2; j < ngroups; j++) {
+ if (++i >= ngroups)
+ i = 0;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+ }
+
+ return -1;
+
+found:
+ cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1);
+ mark_buffer_dirty(bh);
+ return i;
+}
+
struct inode * ext2_new_inode (const struct inode * dir, int mode)
{
struct super_block * sb;
struct buffer_head * bh;
struct buffer_head * bh2;
- int i, j, avefreei;
+ int i, j;
struct inode * inode;
- int bitmap_nr;
struct ext2_group_desc * gdp;
- struct ext2_group_desc * tmp;
struct ext2_super_block * es;
int err;

- /* Cannot create files in a deleted directory */
- if (!dir || !dir->i_nlink)
- return ERR_PTR(-EPERM);
-
sb = dir->i_sb;
inode = new_inode(sb);
if (!inode)
@@ -284,140 +461,52 @@
lock_super (sb);
es = sb->u.ext2_sb.s_es;
repeat:
- gdp = NULL; i=0;
-
- if (S_ISDIR(mode)) {
- avefreei = le32_to_cpu(es->s_free_inodes_count) /
- sb->u.ext2_sb.s_groups_count;
-/* I am not yet convinced that this next bit is necessary.
- i = dir->u.ext2_i.i_block_group;
- for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) {
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- (le16_to_cpu(tmp->bg_used_dirs_count) << 8) <
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- else
- i = ++i % sb->u.ext2_sb.s_groups_count;
- }
-*/
- if (!gdp) {
- for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) {
- tmp = ext2_get_group_desc (sb, j, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count) &&
- le16_to_cpu(tmp->bg_free_inodes_count) >= avefreei) {
- if (!gdp ||
- (le16_to_cpu(tmp->bg_free_blocks_count) >
- le16_to_cpu(gdp->bg_free_blocks_count))) {
- i = j;
- gdp = tmp;
- }
- }
- }
- }
- }
+ if (S_ISDIR(mode))
+ i = find_cg_dir_orlov(sb, dir);
else
- {
- /*
- * Try to place the inode in its parent directory
- */
- i = dir->u.ext2_i.i_block_group;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp && le16_to_cpu(tmp->bg_free_inodes_count))
- gdp = tmp;
- else
- {
- /*
- * Use a quadratic hash to find a group with a
- * free inode
- */
- for (j = 1; j < sb->u.ext2_sb.s_groups_count; j <<= 1) {
- i += j;
- if (i >= sb->u.ext2_sb.s_groups_count)
- i -= sb->u.ext2_sb.s_groups_count;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- if (!gdp) {
- /*
- * That failed: try linear search for a free inode
- */
- i = dir->u.ext2_i.i_block_group + 1;
- for (j = 2; j < sb->u.ext2_sb.s_groups_count; j++) {
- if (++i >= sb->u.ext2_sb.s_groups_count)
- i = 0;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- }
+ i = find_cg_other(sb, dir);

err = -ENOSPC;
- if (!gdp)
+ if (i == -1)
goto fail;

err = -EIO;
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
- goto fail;
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
+ goto fail2;
+
+ j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
+ EXT2_INODES_PER_GROUP(sb));
+ if (j >= EXT2_INODES_PER_GROUP(sb))
+ goto bad_count;
+ ext2_set_bit (j, bh->b_data);

- bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr];
- if ((j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
- EXT2_INODES_PER_GROUP(sb))) <
- EXT2_INODES_PER_GROUP(sb)) {
- if (ext2_set_bit (j, bh->b_data)) {
- ext2_error (sb, "ext2_new_inode",
- "bit already set for inode %d", j);
- goto repeat;
- }
- mark_buffer_dirty(bh);
- if (sb->s_flags & MS_SYNCHRONOUS) {
- ll_rw_block (WRITE, 1, &bh);
- wait_on_buffer (bh);
- }
- } else {
- if (le16_to_cpu(gdp->bg_free_inodes_count) != 0) {
- ext2_error (sb, "ext2_new_inode",
- "Free inodes count corrupted in group %d",
- i);
- /* Is it really ENOSPC? */
- err = -ENOSPC;
- if (sb->s_flags & MS_RDONLY)
- goto fail;
-
- gdp->bg_free_inodes_count = 0;
- mark_buffer_dirty(bh2);
- }
- goto repeat;
+ mark_buffer_dirty(bh);
+ if (sb->s_flags & MS_SYNCHRONOUS) {
+ ll_rw_block (WRITE, 1, &bh);
+ wait_on_buffer (bh);
}
+
j += i * EXT2_INODES_PER_GROUP(sb) + 1;
if (j < EXT2_FIRST_INO(sb) || j > le32_to_cpu(es->s_inodes_count)) {
ext2_error (sb, "ext2_new_inode",
"reserved inode or inode > inodes count - "
"block_group = %d,inode=%d", i, j);
err = -EIO;
- goto fail;
+ goto fail2;
}
- gdp->bg_free_inodes_count =
- cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
- if (S_ISDIR(mode))
- gdp->bg_used_dirs_count =
- cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
- mark_buffer_dirty(bh2);
+
es->s_free_inodes_count =
cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
+
+ if (S_ISDIR(mode)) {
+ if (sb->u.ext2_sb.debts[i] < 255)
+ sb->u.ext2_sb.debts[i]++;
+ } else {
+ if (sb->u.ext2_sb.debts[i])
+ sb->u.ext2_sb.debts[i]--;
+ }
+
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
sb->s_dirt = 1;
inode->i_uid = current->fsuid;
@@ -438,14 +527,7 @@
inode->u.ext2_i.i_new_inode = 1;
inode->u.ext2_i.i_flags = dir->u.ext2_i.i_flags;
if (S_ISLNK(mode))
- inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL);
- inode->u.ext2_i.i_faddr = 0;
- inode->u.ext2_i.i_frag_no = 0;
- inode->u.ext2_i.i_frag_size = 0;
- inode->u.ext2_i.i_file_acl = 0;
- inode->u.ext2_i.i_dir_acl = 0;
- inode->u.ext2_i.i_dtime = 0;
- inode->u.ext2_i.i_prealloc_count = 0;
+ inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL|EXT2_APPEND_FL);
inode->u.ext2_i.i_block_group = i;
if (inode->u.ext2_i.i_flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -464,38 +546,59 @@
ext2_debug ("allocating inode %lu\n", inode->i_ino);
return inode;

+fail2:
+ gdp = ext2_get_group_desc (sb, i, &bh2);
+ gdp->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1);
+ if (S_ISDIR(mode)) {
+ gdp->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+ sb->u.ext2_sb.s_dir_count--;
+ }
+ mark_buffer_dirty(bh2);
fail:
unlock_super(sb);
make_bad_inode(inode);
iput(inode);
return ERR_PTR(err);
+
+bad_count:
+ ext2_error (sb, "ext2_new_inode",
+ "Free inodes count corrupted in group %d",
+ i);
+ /* Is it really ENOSPC? */
+ err = -ENOSPC;
+ if (sb->s_flags & MS_RDONLY)
+ goto fail;
+
+ gdp = ext2_get_group_desc (sb, i, &bh2);
+ gdp->bg_free_inodes_count = 0;
+ mark_buffer_dirty(bh2);
+ goto repeat;
}

unsigned long ext2_count_free_inodes (struct super_block * sb)
{
#ifdef EXT2FS_DEBUG
struct ext2_super_block * es;
- unsigned long desc_count, bitmap_count, x;
- int bitmap_nr;
- struct ext2_group_desc * gdp;
+ unsigned long desc_count = 0, bitmap_count = 0;
int i;

lock_super (sb);
es = sb->u.ext2_sb.s_es;
- desc_count = 0;
- bitmap_count = 0;
- gdp = NULL;
for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
- gdp = ext2_get_group_desc (sb, i, NULL);
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ struct buffer_head *bh;
+ unsigned x;
+
if (!gdp)
continue;
desc_count += le16_to_cpu(gdp->bg_free_inodes_count);
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
continue;

- x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr],
- EXT2_INODES_PER_GROUP(sb) / 8);
+ x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8);
printk ("group %d: stored = %d, counted = %lu\n",
i, le16_to_cpu(gdp->bg_free_inodes_count), x);
bitmap_count += x;
@@ -509,31 +612,42 @@
#endif
}

+/* Called at mount-time, super-block is locked */
+unsigned long ext2_count_dirs (struct super_block * sb)
+{
+ unsigned long count = 0;
+ int i;
+
+ for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ if (!gdp)
+ continue;
+ count += le16_to_cpu(gdp->bg_used_dirs_count);
+ }
+ return count;
+}
+
#ifdef CONFIG_EXT2_CHECK
/* Called at mount-time, super-block is locked */
void ext2_check_inodes_bitmap (struct super_block * sb)
{
- struct ext2_super_block * es;
- unsigned long desc_count, bitmap_count, x;
- int bitmap_nr;
- struct ext2_group_desc * gdp;
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ unsigned long desc_count = 0, bitmap_count = 0;
int i;

- es = sb->u.ext2_sb.s_es;
- desc_count = 0;
- bitmap_count = 0;
- gdp = NULL;
for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
- gdp = ext2_get_group_desc (sb, i, NULL);
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ struct buffer_head *bh;
+ unsigned x;
+
if (!gdp)
continue;
desc_count += le16_to_cpu(gdp->bg_free_inodes_count);
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
continue;

- x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr],
- EXT2_INODES_PER_GROUP(sb) / 8);
+ x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8);
if (le16_to_cpu(gdp->bg_free_inodes_count) != x)
ext2_error (sb, "ext2_check_inodes_bitmap",
"Wrong free inodes count in group %d, "
@@ -545,7 +659,7 @@
ext2_error (sb, "ext2_check_inodes_bitmap",
"Wrong free inodes count in super block, "
"stored = %lu, counted = %lu",
- (unsigned long) le32_to_cpu(es->s_free_inodes_count),
+ (unsigned long)le32_to_cpu(es->s_free_inodes_count),
bitmap_count);
}
#endif
diff -urN S14/fs/ext2/super.c S14-ext2/fs/ext2/super.c
--- S14/fs/ext2/super.c Tue Oct 9 21:47:26 2001
+++ S14-ext2/fs/ext2/super.c Thu Nov 8 15:29:07 2001
@@ -605,6 +605,12 @@
printk ("EXT2-fs: not enough memory\n");
goto failed_mount;
}
+ sb->u.ext2_sb.debts = kmalloc(sb->u.ext2_sb.s_groups_count, GFP_KERNEL);
+ if (!sb->u.ext2_sb.debts) {
+ printk ("EXT2-fs: not enough memory\n");
+ goto failed_mount;
+ }
+ memset(sb->u.ext2_sb.debts, 0, sb->u.ext2_sb.s_groups_count);
for (i = 0; i < db_count; i++) {
sb->u.ext2_sb.s_group_desc[i] = bread (dev, logic_sb_block + i + 1,
sb->s_blocksize);
@@ -621,6 +627,7 @@
db_count = i;
goto failed_mount2;
}
+ sb->u.ext2_sb.s_dir_count = ext2_count_dirs(sb);
for (i = 0; i < EXT2_MAX_GROUP_LOADED; i++) {
sb->u.ext2_sb.s_inode_bitmap_number[i] = 0;
sb->u.ext2_sb.s_inode_bitmap[i] = NULL;
diff -urN S14/include/linux/ext2_fs.h S14-ext2/include/linux/ext2_fs.h
--- S14/include/linux/ext2_fs.h Thu Nov 8 16:33:29 2001
+++ S14-ext2/include/linux/ext2_fs.h Thu Nov 8 16:20:16 2001
@@ -578,6 +578,7 @@
extern unsigned long ext2_count_free_inodes (struct super_block *);
extern void ext2_check_inodes_bitmap (struct super_block *);
extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
+extern unsigned long ext2_count_dirs (struct super_block *);

/* inode.c */
extern void ext2_read_inode (struct inode *);
diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h
--- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001
+++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001
@@ -56,6 +56,8 @@
int s_desc_per_block_bits;
int s_inode_size;
int s_first_ino;
+ unsigned long s_dir_count;
+ u8 *debts;
};

#endif /* _LINUX_EXT2_FS_SB */

2001-11-08 22:45:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

On Nov 08, 2001 17:16 -0500, Alexander Viro wrote:
> + get_random_bytes(&group, sizeof(group));
> + goto fallback;

Little chance that 0 <= group < ngroups. Even so, value is unused, AFAICS.

> +fallback:
> + for (i = 0; i < ngroups; i++) {
> + group = (parent_cg + i) % ngroups;
> + cg = ext2_get_group_desc (sb, group, &bh);
> + if (!cg || !cg->bg_free_inodes_count)
> + continue;
> + if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei)
> + goto found;
> + }
> +
> + return -1;


> diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h
> --- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001
> +++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001
> @@ -56,6 +56,8 @@
> int s_desc_per_block_bits;
> int s_inode_size;
> int s_first_ino;
> + unsigned long s_dir_count;
> + u8 *debts;
> };

I had thought a couple of times it may be useful to have an in-memory list
of group descriptors, each with a pointer to the on-disk struct, like
ext2_sb points to s_es.
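
A minimal sketch of what such an in-memory per-group structure could look
like (the struct and field names here are illustrative assumptions, not
taken from any posted patch):

/* Hypothetical per-group bookkeeping, one entry per block group,
 * hung off the ext2 superblock info much like s_es. */
struct ext2_group_info {
	struct ext2_group_desc	*gd;	/* points at the cached on-disk descriptor */
	struct buffer_head	*bh;	/* buffer holding that descriptor block */
	u8			debt;	/* Orlov allocator debt, 0..255 */
};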

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-08 23:09:04

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup



On Thu, 8 Nov 2001, Andreas Dilger wrote:

> On Nov 08, 2001 17:16 -0500, Alexander Viro wrote:
> > + get_random_bytes(&group, sizeof(group));

Grrr....

+ parent_cg = group % ngroups;

/me wonders how the heck it had disappeared...

Corrected patch follows:

diff -urN S14/fs/ext2/ialloc.c S14-ext2/fs/ext2/ialloc.c
--- S14/fs/ext2/ialloc.c Tue Oct 9 21:47:26 2001
+++ S14-ext2/fs/ext2/ialloc.c Thu Nov 8 18:05:00 2001
@@ -17,7 +17,7 @@
#include <linux/ext2_fs.h>
#include <linux/locks.h>
#include <linux/quotaops.h>
-
+#include <linux/random.h>

/*
* ialloc.c contains the inodes allocation and deallocation routines
@@ -39,37 +39,26 @@
* Read the inode allocation bitmap for a given block_group, reading
* into the specified slot in the superblock's bitmap cache.
*
- * Return >=0 on success or a -ve error code.
+ * Return buffer_head of bitmap on success or NULL.
*/
-static int read_inode_bitmap (struct super_block * sb,
- unsigned long block_group,
- unsigned int bitmap_nr)
+static struct buffer_head *read_inode_bitmap (struct super_block * sb,
+ unsigned long block_group)
{
struct ext2_group_desc * gdp;
struct buffer_head * bh = NULL;
- int retval = 0;

gdp = ext2_get_group_desc (sb, block_group, NULL);
- if (!gdp) {
- retval = -EIO;
+ if (!gdp)
goto error_out;
- }
+
bh = bread (sb->s_dev, le32_to_cpu(gdp->bg_inode_bitmap), sb->s_blocksize);
- if (!bh) {
+ if (!bh)
ext2_error (sb, "read_inode_bitmap",
"Cannot read inode bitmap - "
"block_group = %lu, inode_bitmap = %lu",
block_group, (unsigned long) gdp->bg_inode_bitmap);
- retval = -EIO;
- }
- /*
- * On IO error, just leave a zero in the superblock's block pointer for
- * this group. The IO will be retried next time.
- */
error_out:
- sb->u.ext2_sb.s_inode_bitmap_number[bitmap_nr] = block_group;
- sb->u.ext2_sb.s_inode_bitmap[bitmap_nr] = bh;
- return retval;
+ return bh;
}

/*
@@ -83,79 +72,62 @@
* 2/ If the file system contains less than EXT2_MAX_GROUP_LOADED groups,
* this function reads the bitmap without maintaining a LRU cache.
*
- * Return the slot used to store the bitmap, or a -ve error code.
+ * Return the buffer_head of the bitmap or the ERR_PTR(error)
*/
-static int load_inode_bitmap (struct super_block * sb,
- unsigned int block_group)
+static struct buffer_head *load_inode_bitmap (struct super_block * sb,
+ unsigned int block_group)
{
- int i, j, retval = 0;
- unsigned long inode_bitmap_number;
- struct buffer_head * inode_bitmap;
+ int i, slot = 0;
+ struct ext2_sb_info *sbi = &sb->u.ext2_sb;
+ struct buffer_head *bh = sbi->s_inode_bitmap[0];

- if (block_group >= sb->u.ext2_sb.s_groups_count)
+ if (block_group >= sbi->s_groups_count)
ext2_panic (sb, "load_inode_bitmap",
"block_group >= groups_count - "
"block_group = %d, groups_count = %lu",
- block_group, sb->u.ext2_sb.s_groups_count);
- if (sb->u.ext2_sb.s_loaded_inode_bitmaps > 0 &&
- sb->u.ext2_sb.s_inode_bitmap_number[0] == block_group &&
- sb->u.ext2_sb.s_inode_bitmap[0] != NULL)
- return 0;
- if (sb->u.ext2_sb.s_groups_count <= EXT2_MAX_GROUP_LOADED) {
- if (sb->u.ext2_sb.s_inode_bitmap[block_group]) {
- if (sb->u.ext2_sb.s_inode_bitmap_number[block_group] != block_group)
- ext2_panic (sb, "load_inode_bitmap",
- "block_group != inode_bitmap_number");
- else
- return block_group;
- } else {
- retval = read_inode_bitmap (sb, block_group,
- block_group);
- if (retval < 0)
- return retval;
- return block_group;
- }
+ block_group, sbi->s_groups_count);
+
+ if (sbi->s_loaded_inode_bitmaps > 0 &&
+ sbi->s_inode_bitmap_number[0] == block_group && bh)
+ goto found;
+
+ if (sbi->s_groups_count <= EXT2_MAX_GROUP_LOADED) {
+ slot = block_group;
+ bh = sbi->s_inode_bitmap[slot];
+ if (!bh)
+ goto read_it;
+ if (sbi->s_inode_bitmap_number[slot] == slot)
+ goto found;
+ ext2_panic (sb, "load_inode_bitmap",
+ "block_group != inode_bitmap_number");
}

- for (i = 0; i < sb->u.ext2_sb.s_loaded_inode_bitmaps &&
- sb->u.ext2_sb.s_inode_bitmap_number[i] != block_group;
+ bh = NULL;
+ for (i = 0; i < sbi->s_loaded_inode_bitmaps &&
+ sbi->s_inode_bitmap_number[i] != block_group;
i++)
;
- if (i < sb->u.ext2_sb.s_loaded_inode_bitmaps &&
- sb->u.ext2_sb.s_inode_bitmap_number[i] == block_group) {
- inode_bitmap_number = sb->u.ext2_sb.s_inode_bitmap_number[i];
- inode_bitmap = sb->u.ext2_sb.s_inode_bitmap[i];
- for (j = i; j > 0; j--) {
- sb->u.ext2_sb.s_inode_bitmap_number[j] =
- sb->u.ext2_sb.s_inode_bitmap_number[j - 1];
- sb->u.ext2_sb.s_inode_bitmap[j] =
- sb->u.ext2_sb.s_inode_bitmap[j - 1];
- }
- sb->u.ext2_sb.s_inode_bitmap_number[0] = inode_bitmap_number;
- sb->u.ext2_sb.s_inode_bitmap[0] = inode_bitmap;
-
- /*
- * There's still one special case here --- if inode_bitmap == 0
- * then our last attempt to read the bitmap failed and we have
- * just ended up caching that failure. Try again to read it.
- */
- if (!inode_bitmap)
- retval = read_inode_bitmap (sb, block_group, 0);
-
- } else {
- if (sb->u.ext2_sb.s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED)
- sb->u.ext2_sb.s_loaded_inode_bitmaps++;
- else
- brelse (sb->u.ext2_sb.s_inode_bitmap[EXT2_MAX_GROUP_LOADED - 1]);
- for (j = sb->u.ext2_sb.s_loaded_inode_bitmaps - 1; j > 0; j--) {
- sb->u.ext2_sb.s_inode_bitmap_number[j] =
- sb->u.ext2_sb.s_inode_bitmap_number[j - 1];
- sb->u.ext2_sb.s_inode_bitmap[j] =
- sb->u.ext2_sb.s_inode_bitmap[j - 1];
- }
- retval = read_inode_bitmap (sb, block_group, 0);
- }
- return retval;
+ if (i < sbi->s_loaded_inode_bitmaps)
+ bh = sbi->s_inode_bitmap[i];
+ else if (sbi->s_loaded_inode_bitmaps < EXT2_MAX_GROUP_LOADED)
+ sbi->s_loaded_inode_bitmaps++;
+ else
+ brelse (sbi->s_inode_bitmap[--i]);
+
+ while (i--) {
+ sbi->s_inode_bitmap_number[i+1] = sbi->s_inode_bitmap_number[i];
+ sbi->s_inode_bitmap[i+1] = sbi->s_inode_bitmap[i];
+ }
+
+read_it:
+ if (!bh)
+ bh = read_inode_bitmap (sb, block_group);
+ sbi->s_inode_bitmap_number[slot] = block_group;
+ sbi->s_inode_bitmap[slot] = bh;
+ if (!bh)
+ return ERR_PTR(-EIO);
+found:
+ return bh;
}

/*
@@ -183,7 +155,6 @@
struct buffer_head * bh2;
unsigned long block_group;
unsigned long bit;
- int bitmap_nr;
struct ext2_group_desc * gdp;
struct ext2_super_block * es;

@@ -215,12 +186,10 @@
}
block_group = (ino - 1) / EXT2_INODES_PER_GROUP(sb);
bit = (ino - 1) % EXT2_INODES_PER_GROUP(sb);
- bitmap_nr = load_inode_bitmap (sb, block_group);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, block_group);
+ if (IS_ERR(bh))
goto error_return;

- bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr];
-
/* Ok, now we can actually update the inode bitmaps.. */
if (!ext2_clear_bit (bit, bh->b_data))
ext2_error (sb, "ext2_free_inode",
@@ -230,9 +199,11 @@
if (gdp) {
gdp->bg_free_inodes_count =
cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1);
- if (is_directory)
+ if (is_directory) {
gdp->bg_used_dirs_count =
cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+ sb->u.ext2_sb.s_dir_count--;
+ }
}
mark_buffer_dirty(bh2);
es->s_free_inodes_count =
@@ -259,23 +230,230 @@
* For other inodes, search forward from the parent directory\'s block
* group to find a free inode.
*/
+
+static int find_cg_dir(struct super_block *sb, const struct inode *parent)
+{
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ struct ext2_group_desc *cg, *best_cg = NULL;
+ struct buffer_head *bh, *best_bh = NULL;
+ int i = -1, j;
+
+ for (j = 0; j < ngroups; j++) {
+ cg = ext2_get_group_desc (sb, j, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei)
+ continue;
+ if (!best_cg ||
+ (le16_to_cpu(cg->bg_free_blocks_count) >
+ le16_to_cpu(best_cg->bg_free_blocks_count))) {
+ i = j;
+ best_cg = cg;
+ best_bh = bh;
+ }
+ }
+ if (!best_cg)
+ return -1;
+
+ best_cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(best_cg->bg_free_inodes_count) - 1);
+ best_cg->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(best_cg->bg_used_dirs_count) + 1);
+ sb->u.ext2_sb.s_dir_count++;
+ mark_buffer_dirty(best_bh);
+ return i;
+}
+
+/*
+ * Orlov's allocator for directories.
+ *
+ * We always try to spread first-level directories:
+ * If there are directories with both free inodes and free blocks counts
+ * not worse than average we return one with smallest directory count.
+ * Otherwise we simply return a random group.
+ *
+ * For the rest rules look so:
+ *
+ * It's OK to put directory into a group unless
+ * it has too many directories already (max_dirs) or
+ * it has too few free inodes left (min_inodes) or
+ * it has too few free blocks left (min_blocks) or
+ * it's already running too large debt (max_debt).
+ * Parent's group is prefered, if it doesn't satisfy these
+ * conditions we search cyclically through the rest. If none
+ * of the groups look good we just look for a group with more
+ * free inodes than average (starting at parent's group).
+ *
+ * Debt is incremented each time we allocate a directory and decremented
+ * when we allocate an inode, within 0--255.
+ */
+
+#define INODE_COST 64
+#define BLOCK_COST 256
+
+static int find_cg_dir_orlov(struct super_block *sb, const struct inode *parent)
+{
+ int parent_cg = parent->u.ext2_i.i_block_group;
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ int inodes_per_group = EXT2_INODES_PER_GROUP(sb);
+ int avefreei = le32_to_cpu(es->s_free_inodes_count) / ngroups;
+ int avefreeb = le32_to_cpu(es->s_free_blocks_count) / ngroups;
+ int blocks_per_dir;
+ int ndirs = sb->u.ext2_sb.s_dir_count;
+ int max_debt, max_dirs, min_blocks, min_inodes;
+ int group = -1, i;
+ struct ext2_group_desc *cg;
+ struct buffer_head *bh;
+
+ if (parent == sb->s_root->d_inode) {
+ struct ext2_group_desc *best_cg = NULL;
+ struct buffer_head *best_bh = NULL;
+ int best_ndir = inodes_per_group;
+ int best_group = -1;
+
+ for (group = 0; group < ngroups; group++) {
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_used_dirs_count) >= best_ndir)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < avefreei)
+ continue;
+ if (le16_to_cpu(cg->bg_free_blocks_count) < avefreeb)
+ continue;
+ best_group = group;
+ best_ndir = le16_to_cpu(cg->bg_used_dirs_count);
+ best_cg = cg;
+ best_bh = bh;
+ }
+ if (best_group >= 0) {
+ cg = best_cg;
+ bh = best_bh;
+ group = best_group;
+ goto found;
+ }
+ get_random_bytes(&group, sizeof(group));
+ parent_cg = group % ngroups;
+ goto fallback;
+ }
+
+ blocks_per_dir = (le32_to_cpu(es->s_blocks_count) -
+ le32_to_cpu(es->s_free_blocks_count)) / ndirs;
+
+ max_dirs = ndirs / ngroups + inodes_per_group / 16;
+ min_inodes = avefreei - inodes_per_group / 4;
+ min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;
+
+ max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST);
+ if (max_debt * INODE_COST > inodes_per_group)
+ max_debt = inodes_per_group / INODE_COST;
+ if (max_debt > 255)
+ max_debt = 255;
+ if (max_debt == 0)
+ max_debt = 1;
+
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_cg + i) % ngroups;
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (sb->u.ext2_sb.debts[group] >= max_debt)
+ continue;
+ if (le16_to_cpu(cg->bg_used_dirs_count) >= max_dirs)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) < min_inodes)
+ continue;
+ if (le16_to_cpu(cg->bg_free_blocks_count) < min_blocks)
+ continue;
+ goto found;
+ }
+
+fallback:
+ for (i = 0; i < ngroups; i++) {
+ group = (parent_cg + i) % ngroups;
+ cg = ext2_get_group_desc (sb, group, &bh);
+ if (!cg || !cg->bg_free_inodes_count)
+ continue;
+ if (le16_to_cpu(cg->bg_free_inodes_count) >= avefreei)
+ goto found;
+ }
+
+ return -1;
+
+found:
+ cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1);
+ cg->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_used_dirs_count) + 1);
+ sb->u.ext2_sb.s_dir_count++;
+ mark_buffer_dirty(bh);
+ return group;
+}
+
+static int find_cg_other(struct super_block *sb, const struct inode *parent)
+{
+ int parent_cg = parent->u.ext2_i.i_block_group;
+ int ngroups = sb->u.ext2_sb.s_groups_count;
+ struct ext2_group_desc *cg;
+ struct buffer_head *bh;
+ int i, j;
+
+ /*
+ * Try to place the inode in its parent directory
+ */
+ i = parent_cg;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+
+ /*
+ * Use a quadratic hash to find a group with a
+ * free inode
+ */
+ for (j = 1; j < ngroups; j <<= 1) {
+ i += j;
+ if (i >= ngroups)
+ i -= ngroups;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+ }
+
+ /*
+ * That failed: try linear search for a free inode
+ */
+ i = parent_cg + 1;
+ for (j = 2; j < ngroups; j++) {
+ if (++i >= ngroups)
+ i = 0;
+ cg = ext2_get_group_desc (sb, i, &bh);
+ if (cg && le16_to_cpu(cg->bg_free_inodes_count))
+ goto found;
+ }
+
+ return -1;
+
+found:
+ cg->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(cg->bg_free_inodes_count) - 1);
+ mark_buffer_dirty(bh);
+ return i;
+}
+
struct inode * ext2_new_inode (const struct inode * dir, int mode)
{
struct super_block * sb;
struct buffer_head * bh;
struct buffer_head * bh2;
- int i, j, avefreei;
+ int i, j;
struct inode * inode;
- int bitmap_nr;
struct ext2_group_desc * gdp;
- struct ext2_group_desc * tmp;
struct ext2_super_block * es;
int err;

- /* Cannot create files in a deleted directory */
- if (!dir || !dir->i_nlink)
- return ERR_PTR(-EPERM);
-
sb = dir->i_sb;
inode = new_inode(sb);
if (!inode)
@@ -284,140 +462,52 @@
lock_super (sb);
es = sb->u.ext2_sb.s_es;
repeat:
- gdp = NULL; i=0;
-
- if (S_ISDIR(mode)) {
- avefreei = le32_to_cpu(es->s_free_inodes_count) /
- sb->u.ext2_sb.s_groups_count;
-/* I am not yet convinced that this next bit is necessary.
- i = dir->u.ext2_i.i_block_group;
- for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) {
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- (le16_to_cpu(tmp->bg_used_dirs_count) << 8) <
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- else
- i = ++i % sb->u.ext2_sb.s_groups_count;
- }
-*/
- if (!gdp) {
- for (j = 0; j < sb->u.ext2_sb.s_groups_count; j++) {
- tmp = ext2_get_group_desc (sb, j, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count) &&
- le16_to_cpu(tmp->bg_free_inodes_count) >= avefreei) {
- if (!gdp ||
- (le16_to_cpu(tmp->bg_free_blocks_count) >
- le16_to_cpu(gdp->bg_free_blocks_count))) {
- i = j;
- gdp = tmp;
- }
- }
- }
- }
- }
+ if (S_ISDIR(mode))
+ i = find_cg_dir_orlov(sb, dir);
else
- {
- /*
- * Try to place the inode in its parent directory
- */
- i = dir->u.ext2_i.i_block_group;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp && le16_to_cpu(tmp->bg_free_inodes_count))
- gdp = tmp;
- else
- {
- /*
- * Use a quadratic hash to find a group with a
- * free inode
- */
- for (j = 1; j < sb->u.ext2_sb.s_groups_count; j <<= 1) {
- i += j;
- if (i >= sb->u.ext2_sb.s_groups_count)
- i -= sb->u.ext2_sb.s_groups_count;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- if (!gdp) {
- /*
- * That failed: try linear search for a free inode
- */
- i = dir->u.ext2_i.i_block_group + 1;
- for (j = 2; j < sb->u.ext2_sb.s_groups_count; j++) {
- if (++i >= sb->u.ext2_sb.s_groups_count)
- i = 0;
- tmp = ext2_get_group_desc (sb, i, &bh2);
- if (tmp &&
- le16_to_cpu(tmp->bg_free_inodes_count)) {
- gdp = tmp;
- break;
- }
- }
- }
- }
+ i = find_cg_other(sb, dir);

err = -ENOSPC;
- if (!gdp)
+ if (i == -1)
goto fail;

err = -EIO;
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
- goto fail;
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
+ goto fail2;
+
+ j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
+ EXT2_INODES_PER_GROUP(sb));
+ if (j >= EXT2_INODES_PER_GROUP(sb))
+ goto bad_count;
+ ext2_set_bit (j, bh->b_data);

- bh = sb->u.ext2_sb.s_inode_bitmap[bitmap_nr];
- if ((j = ext2_find_first_zero_bit ((unsigned long *) bh->b_data,
- EXT2_INODES_PER_GROUP(sb))) <
- EXT2_INODES_PER_GROUP(sb)) {
- if (ext2_set_bit (j, bh->b_data)) {
- ext2_error (sb, "ext2_new_inode",
- "bit already set for inode %d", j);
- goto repeat;
- }
- mark_buffer_dirty(bh);
- if (sb->s_flags & MS_SYNCHRONOUS) {
- ll_rw_block (WRITE, 1, &bh);
- wait_on_buffer (bh);
- }
- } else {
- if (le16_to_cpu(gdp->bg_free_inodes_count) != 0) {
- ext2_error (sb, "ext2_new_inode",
- "Free inodes count corrupted in group %d",
- i);
- /* Is it really ENOSPC? */
- err = -ENOSPC;
- if (sb->s_flags & MS_RDONLY)
- goto fail;
-
- gdp->bg_free_inodes_count = 0;
- mark_buffer_dirty(bh2);
- }
- goto repeat;
+ mark_buffer_dirty(bh);
+ if (sb->s_flags & MS_SYNCHRONOUS) {
+ ll_rw_block (WRITE, 1, &bh);
+ wait_on_buffer (bh);
}
+
j += i * EXT2_INODES_PER_GROUP(sb) + 1;
if (j < EXT2_FIRST_INO(sb) || j > le32_to_cpu(es->s_inodes_count)) {
ext2_error (sb, "ext2_new_inode",
"reserved inode or inode > inodes count - "
"block_group = %d,inode=%d", i, j);
err = -EIO;
- goto fail;
+ goto fail2;
}
- gdp->bg_free_inodes_count =
- cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) - 1);
- if (S_ISDIR(mode))
- gdp->bg_used_dirs_count =
- cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) + 1);
- mark_buffer_dirty(bh2);
+
es->s_free_inodes_count =
cpu_to_le32(le32_to_cpu(es->s_free_inodes_count) - 1);
+
+ if (S_ISDIR(mode)) {
+ if (sb->u.ext2_sb.debts[i] < 255)
+ sb->u.ext2_sb.debts[i]++;
+ } else {
+ if (sb->u.ext2_sb.debts[i])
+ sb->u.ext2_sb.debts[i]--;
+ }
+
mark_buffer_dirty(sb->u.ext2_sb.s_sbh);
sb->s_dirt = 1;
inode->i_uid = current->fsuid;
@@ -438,14 +528,7 @@
inode->u.ext2_i.i_new_inode = 1;
inode->u.ext2_i.i_flags = dir->u.ext2_i.i_flags;
if (S_ISLNK(mode))
- inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL | EXT2_APPEND_FL);
- inode->u.ext2_i.i_faddr = 0;
- inode->u.ext2_i.i_frag_no = 0;
- inode->u.ext2_i.i_frag_size = 0;
- inode->u.ext2_i.i_file_acl = 0;
- inode->u.ext2_i.i_dir_acl = 0;
- inode->u.ext2_i.i_dtime = 0;
- inode->u.ext2_i.i_prealloc_count = 0;
+ inode->u.ext2_i.i_flags &= ~(EXT2_IMMUTABLE_FL|EXT2_APPEND_FL);
inode->u.ext2_i.i_block_group = i;
if (inode->u.ext2_i.i_flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
@@ -464,38 +547,59 @@
ext2_debug ("allocating inode %lu\n", inode->i_ino);
return inode;

+fail2:
+ gdp = ext2_get_group_desc (sb, i, &bh2);
+ gdp->bg_free_inodes_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_free_inodes_count) + 1);
+ if (S_ISDIR(mode)) {
+ gdp->bg_used_dirs_count =
+ cpu_to_le16(le16_to_cpu(gdp->bg_used_dirs_count) - 1);
+ sb->u.ext2_sb.s_dir_count--;
+ }
+ mark_buffer_dirty(bh2);
fail:
unlock_super(sb);
make_bad_inode(inode);
iput(inode);
return ERR_PTR(err);
+
+bad_count:
+ ext2_error (sb, "ext2_new_inode",
+ "Free inodes count corrupted in group %d",
+ i);
+ /* Is it really ENOSPC? */
+ err = -ENOSPC;
+ if (sb->s_flags & MS_RDONLY)
+ goto fail;
+
+ gdp = ext2_get_group_desc (sb, i, &bh2);
+ gdp->bg_free_inodes_count = 0;
+ mark_buffer_dirty(bh2);
+ goto repeat;
}

unsigned long ext2_count_free_inodes (struct super_block * sb)
{
#ifdef EXT2FS_DEBUG
struct ext2_super_block * es;
- unsigned long desc_count, bitmap_count, x;
- int bitmap_nr;
- struct ext2_group_desc * gdp;
+ unsigned long desc_count = 0, bitmap_count = 0;
int i;

lock_super (sb);
es = sb->u.ext2_sb.s_es;
- desc_count = 0;
- bitmap_count = 0;
- gdp = NULL;
for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
- gdp = ext2_get_group_desc (sb, i, NULL);
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ struct buffer_head *bh;
+ unsigned x;
+
if (!gdp)
continue;
desc_count += le16_to_cpu(gdp->bg_free_inodes_count);
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
continue;

- x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr],
- EXT2_INODES_PER_GROUP(sb) / 8);
+ x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8);
printk ("group %d: stored = %d, counted = %lu\n",
i, le16_to_cpu(gdp->bg_free_inodes_count), x);
bitmap_count += x;
@@ -509,31 +613,42 @@
#endif
}

+/* Called at mount-time, super-block is locked */
+unsigned long ext2_count_dirs (struct super_block * sb)
+{
+ unsigned long count = 0;
+ int i;
+
+ for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ if (!gdp)
+ continue;
+ count += le16_to_cpu(gdp->bg_used_dirs_count);
+ }
+ return count;
+}
+
#ifdef CONFIG_EXT2_CHECK
/* Called at mount-time, super-block is locked */
void ext2_check_inodes_bitmap (struct super_block * sb)
{
- struct ext2_super_block * es;
- unsigned long desc_count, bitmap_count, x;
- int bitmap_nr;
- struct ext2_group_desc * gdp;
+ struct ext2_super_block * es = sb->u.ext2_sb.s_es;
+ unsigned long desc_count = 0, bitmap_count = 0;
int i;

- es = sb->u.ext2_sb.s_es;
- desc_count = 0;
- bitmap_count = 0;
- gdp = NULL;
for (i = 0; i < sb->u.ext2_sb.s_groups_count; i++) {
- gdp = ext2_get_group_desc (sb, i, NULL);
+ struct ext2_group_desc *gdp = ext2_get_group_desc (sb, i, NULL);
+ struct buffer_head *bh;
+ unsigned x;
+
if (!gdp)
continue;
desc_count += le16_to_cpu(gdp->bg_free_inodes_count);
- bitmap_nr = load_inode_bitmap (sb, i);
- if (bitmap_nr < 0)
+ bh = load_inode_bitmap (sb, i);
+ if (IS_ERR(bh))
continue;

- x = ext2_count_free (sb->u.ext2_sb.s_inode_bitmap[bitmap_nr],
- EXT2_INODES_PER_GROUP(sb) / 8);
+ x = ext2_count_free (bh, EXT2_INODES_PER_GROUP(sb) / 8);
if (le16_to_cpu(gdp->bg_free_inodes_count) != x)
ext2_error (sb, "ext2_check_inodes_bitmap",
"Wrong free inodes count in group %d, "
@@ -545,7 +660,7 @@
ext2_error (sb, "ext2_check_inodes_bitmap",
"Wrong free inodes count in super block, "
"stored = %lu, counted = %lu",
- (unsigned long) le32_to_cpu(es->s_free_inodes_count),
+ (unsigned long)le32_to_cpu(es->s_free_inodes_count),
bitmap_count);
}
#endif
diff -urN S14/fs/ext2/super.c S14-ext2/fs/ext2/super.c
--- S14/fs/ext2/super.c Tue Oct 9 21:47:26 2001
+++ S14-ext2/fs/ext2/super.c Thu Nov 8 15:29:07 2001
@@ -605,6 +605,12 @@
printk ("EXT2-fs: not enough memory\n");
goto failed_mount;
}
+ sb->u.ext2_sb.debts = kmalloc(sb->u.ext2_sb.s_groups_count, GFP_KERNEL);
+ if (!sb->u.ext2_sb.debts) {
+ printk ("EXT2-fs: not enough memory\n");
+ goto failed_mount;
+ }
+ memset(sb->u.ext2_sb.debts, 0, sb->u.ext2_sb.s_groups_count);
for (i = 0; i < db_count; i++) {
sb->u.ext2_sb.s_group_desc[i] = bread (dev, logic_sb_block + i + 1,
sb->s_blocksize);
@@ -621,6 +627,7 @@
db_count = i;
goto failed_mount2;
}
+ sb->u.ext2_sb.s_dir_count = ext2_count_dirs(sb);
for (i = 0; i < EXT2_MAX_GROUP_LOADED; i++) {
sb->u.ext2_sb.s_inode_bitmap_number[i] = 0;
sb->u.ext2_sb.s_inode_bitmap[i] = NULL;
diff -urN S14/include/linux/ext2_fs.h S14-ext2/include/linux/ext2_fs.h
--- S14/include/linux/ext2_fs.h Thu Nov 8 16:33:29 2001
+++ S14-ext2/include/linux/ext2_fs.h Thu Nov 8 16:20:16 2001
@@ -578,6 +578,7 @@
extern unsigned long ext2_count_free_inodes (struct super_block *);
extern void ext2_check_inodes_bitmap (struct super_block *);
extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
+extern unsigned long ext2_count_dirs (struct super_block *);

/* inode.c */
extern void ext2_read_inode (struct inode *);
diff -urN S14/include/linux/ext2_fs_sb.h S14-ext2/include/linux/ext2_fs_sb.h
--- S14/include/linux/ext2_fs_sb.h Fri Feb 16 21:29:52 2001
+++ S14-ext2/include/linux/ext2_fs_sb.h Thu Nov 8 15:28:34 2001
@@ -56,6 +56,8 @@
int s_desc_per_block_bits;
int s_inode_size;
int s_first_ino;
+ unsigned long s_dir_count;
+ u8 *debts;
};

#endif /* _LINUX_EXT2_FS_SB */

2001-11-09 06:14:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Jeremy Fitzhardinge wrote:
>
> On Thu, 2001-11-08 at 07:24, Constantin Loizides wrote:
> >
> > > This is one I made a while ago while doing some measurements; its also
> > <[snip]
> >
> > That's an interesting plot. I would like to do one for my disk!
> >
> > How did you do it?
> > How do you find out about the seek distance???
> > Do you create one big file and then start
> > seeking around accessing different blocks of it?
>
> Its pretty simple. I open /dev/hda (the whole device), and read random
> blocks, timing how long it takes for the data to come back since the
> last one. I set up a few hundred/thousand buckets, and accumulate the
> measured time into the bucket for that seek distance. So:
>
> fd = open("/dev/hda")
> llseek(fd, last = rand());
> read(fd)
> while (!finished) {
>         blk = rand();
>         llseek(fd, blk);
>         read(fd);
>         bucket[abs(last - blk)] = time;
>         last = blk;
> }
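
Fleshed out, that loop might look something like the sketch below; the
block size, probe count, bucket width, 40G device size and use of
gettimeofday() are assumptions for illustration, not Jeremy's actual code:

/*
 * Hypothetical stand-alone seek-time histogram tool in the spirit of
 * the loop quoted above.  Reads go through the page cache; Jeremy
 * reported raw devices gave the same numbers.
 */
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define BLK      4096           /* bytes read per probe */
#define NBUCKETS 1000           /* histogram resolution */
#define NPROBES  100000

int main(void)
{
	static double total[NBUCKETS];  /* accumulated seconds per bucket */
	static long count[NBUCKETS];
	long long nblocks = 40LL * 1024 * 1024 * 1024 / BLK; /* assume ~40G disk */
	long long last, blk;
	char buf[BLK];
	int fd = open("/dev/hda", O_RDONLY);
	int i;

	if (fd < 0)
		return 1;
	last = random() % nblocks;
	pread(fd, buf, BLK, last * BLK);

	for (i = 0; i < NPROBES; i++) {
		struct timeval t0, t1;
		int bucket;

		blk = random() % nblocks;
		bucket = llabs(last - blk) * NBUCKETS / nblocks;

		gettimeofday(&t0, NULL);
		pread(fd, buf, BLK, blk * BLK);
		gettimeofday(&t1, NULL);

		total[bucket] += (t1.tv_sec - t0.tv_sec) +
				 (t1.tv_usec - t0.tv_usec) / 1e6;
		count[bucket]++;
		last = blk;
	}

	for (i = 0; i < NBUCKETS; i++)
		if (count[i])
			printf("%d %g\n", i, total[i] / count[i]);
	close(fd);
	return 0;
}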

How do you avoid artifacts from Linux caching with this test?

> I found the random seeking was the only way I could get reasonable
> numbers out of the drive; any of the ordered patterns of the form "OK,
> lets measure N block seeks" seemed to be predicted by the drive and gave
> unreasonably fast results.
>

But we *want* unreasonably fast results! We need to understand
this device-level prediction, then decide if it's sufficiently
widespread for us to go and exploit the hell out of it.

Have you any ideas as to what's happening?

-

2001-11-09 06:20:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

Thanks, Al.

First a couple of comments on the patch (looks nice, BTW):

/*
* Orlov's allocator for directories.
*
* We always try to spread first-level directories:
* If there are directories with both free inodes and free blocks counts
^^^^^^^^^^^
cylinder groups
* not worse than average we return one with smallest directory count.

(I agree with Andreas on this one. Why switch terminology?)




get_random_bytes(&group, sizeof(group));
parent_cg = group % ngroups;
goto fallback;

AFAICT, get_random_bytes() here can set `group' to a negative
value, and parent_cg can go negative, and that propagates to
`group' going negative, and getting passed to ext2_get_group_desc(),
and everything goes generally pear-shaped. Really, all this arith
should be using unsigneds.
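
A minimal sketch of the unsigned variant (my reading of the fix, not a
posted patch; get_random_bytes() is the <linux/random.h> helper already
used above):

/* Hypothetical helper: pick a random starting group with no signed
 * arithmetic anywhere, so the result is always in 0..ngroups-1. */
static unsigned int random_group(unsigned int ngroups)
{
	unsigned int group;

	get_random_bytes(&group, sizeof(group));
	return group % ngroups;
}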



From here:

max_dirs = ndirs / ngroups + inodes_per_group / 16;
min_inodes = avefreei - inodes_per_group / 4;
min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;

things start to get a bit confusing. A couple of 1-2 line comments
which explain what the variables actually represent would help to
clarify things. Also, an explanation of `debt' is needed.



Offtopic, in ext2_new_inode():

mark_buffer_dirty(bh);
if (sb->s_flags & MS_SYNCHRONOUS) {
        ll_rw_block (WRITE, 1, &bh);
        wait_on_buffer (bh);
}

Does the inode bitmap writeout actually need to be synchronous
here? The file will still be recoverable by fsck if this is
not done? If the sync _is_ needed, shouldn't we be doing it with
the group descriptors?
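
If it is needed, a sketch of extending it to the descriptor buffer might
look like the following (gd_bh is a hypothetical name for the descriptor's
buffer_head, assumed to be in scope at that point):

/* Hypothetical: give the group descriptor buffer the same synchronous
 * treatment as the inode bitmap buffer above. */
mark_buffer_dirty(gd_bh);
if (sb->s_flags & MS_SYNCHRONOUS) {
	ll_rw_block(WRITE, 1, &gd_bh);
	wait_on_buffer(gd_bh);
}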


Finally, please, please, please take the opportunity to rename
`bh', `bh2', `i' and `j' in ext2_new_inode() to something
semantically meaningful. What we have now is just insane.


We need to test carefully with ENOSPC, but it looks solid.


Performance-wise, the Orlov approach is almost as good as
the `if (0 &&' approach for fast growth. This is the "manipulate
kernel trees on an empty partition" test:

            Stock    Patched   Orlov
            (secs)   (secs)    (secs)
untar          31        14     14
diff          184        72     82.6
tar            15         3      3
rm             30        10     10.3

So the diffing was a bit slower; not much.


For the slow growth test, same as last time (cut-n-paste
from the very excellent staroffice 6 spreadsheet):

Tree   Stock   Stock/ideal   Patched   Patched/stock   Orlov   Orlov/ideal
       (secs)                (secs)                    (secs)
 0       37       2.85          74        200.00%        63       4.85
 1       41       3.15          89        217.07%        68       5.23
 2       41       3.15          97        236.59%        74       5.69
 3       38       2.92          97        255.26%        84       6.46
 4       55       4.23         102        185.45%        78       6
 5       51       3.92          98        192.16%        76       5.85
 6       72       5.54          94        130.56%        73       5.62
 7       56       4.31          96        171.43%        71       5.46
 8       64       4.92         102        159.38%        73       5.62
 9       63       4.85         100        158.73%        71       5.46
10       57       4.38          95        166.67%        74       5.69
11       83       6.38         102        122.89%        78       6
12       54       4.15         101        187.04%        76       5.85
13       82       6.31         104        126.83%        78       6
14       89       6.85         103        115.73%        77       5.92
15       88       6.77          95        107.95%        77       5.92
16      106       8.15          99         93.40%        77       5.92


We see that Orlov is more prone to fragmentation than stock 2.4.14: The
time to read the first batch of 378 megs of files is 37 seconds with
2.4.14, 63 seconds with Orlov. But it's better than the `if (0 && '
approach.


So I just don't know at this stage. Even after a single pass of the Smith
workload, we're running at 3x to 5x worse than ideal. If that's simply
the best we can do, then we need to compare stock 2.4.14 with Orlov
partway through the progress of the Smith workload to evaluate how much
more serious the fragmentation is. That's easy enough - I'll do it.


The next task is to work out why a single pass of the Smith workload
fragments the filesystem so much, and whether this can be improved.


Oh. And some preliminary testing with SCSI shows markedly different
behaviour: those 3x to 5x speedups I was demonstrating with IDE are
only 1.5x with a Quantum Atlas IV. There is some magical characteristic
of modern IDE disks which stock 2.4.14 is failing to exploit, and I
don't yet have a handle on precisely what it is.

-

2001-11-09 06:58:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

On Nov 08, 2001 22:15 -0800, Andrew Morton wrote:
> max_dirs = ndirs / ngroups + inodes_per_group / 16;
> min_inodes = avefreei - inodes_per_group / 4;
> min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;
>
> things start to get a bit confusing. A couple of 1-2 line comments
> which explain what the variables actually represent would help to
> clarify things. Also, an explanation of `debt' is needed.

Yes, my eyes glazed over at this point as well.

> Tree Stock Stock/ideal Patched Patched/stock Orlov Orlov/ideal
> (secs) (secs) (secs)

Andrew, could you stick with a single metric (either /ideal or /stock)?

> 12 54 4.15 101 187.04% 76 5.85
> 13 82 6.31 104 126.83% 78 6
> 14 89 6.85 103 115.73% 77 5.92
> 15 88 6.77 95 107.95% 77 5.92
> 16 106 8.15 99 93.40% 77 5.92

What else would be useful here is percent full for the filesystem.
Since stock ext2 preallocates up to 7 blocks for each file, this is good
for reducing fragmentation, but as the filesystem gets full it makes
fragmentation worse in the end.

> So I just don't know at this stage. Even after a single pass of the Smith
> workload, we're running at 3x to 5x worse than ideal. If that's simply
> the best we can do, then we need to compare stock 2.4.14 with Orlov
> partway through the progress of the Smith workload to evaluate how much
> more serious the fragmentation is. That's easy enough - I'll do it.
>
> The next task is to work out why a single pass of the Smith workload
> fragments the filesystem so much, and whether this can be improved.

After reading the paper, one possible cause of skewed results may be
a result of their "reverse engineering" of the filesystem layout from
the inode numbers. They state that they were not able to capture the
actual pathnames of the files in the workload, so they invented pathnames
that would put the files in the same cylinder groups based on the FFS
allocation algorithms then in use, assuming the test filesystem had an
identical layout.

It may be possible to hack the test data into ext2 by creating a filesystem
with the same number of block groups as the test FFS filesystem with the
Smith workload. It may also not be valid for our needs, as we are playing
with the actual group selection algorithm, so real pathnames may give us
a different layout.

Have you ever looked at the resulting data from a Smith test? Does it do
more than simply create 27 directories (the number of cylinder groups) and
dump files directly therein, as opposed to creating more subdirectories?
Remember that they were strictly concerned with block allocation for files,
and not cylinder group selection for directories.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2001-11-09 07:15:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

Andreas Dilger wrote:
>
> > Tree Stock Stock/ideal Patched Patched/stock Orlov Orlov/ideal
> > (secs) (secs) (secs)
>
> Andrew, could you stick with a single metric (either /ideal or /stock)?

yup.

> > 12 54 4.15 101 187.04% 76 5.85
> > 13 82 6.31 104 126.83% 78 6
> > 14 89 6.85 103 115.73% 77 5.92
> > 15 88 6.77 95 107.95% 77 5.92
> > 16 106 8.15 99 93.40% 77 5.92
>
> What else would be useful here is percent full for the filesystem.
> Since stock ext2 preallocates up to 7 blocks for each file, this is good
> for reducing fragmentation, but as the filesystem gets full it makes
> fragmentation worse in the end.

After the 17th iteration the fs was 75% full, so at iteration 8
we're at 37%, etc.

> > So I just don't know at this stage. Even after a single pass of the Smith
> > workload, we're running at 3x to 5x worse than ideal. If that's simply
> > the best we can do, then we need to compare stock 2.4.14 with Orlov
> > partway through the progress of the Smith workload to evaluate how much
> > more serious the fragmentation is. That's easy enough - I'll do it.
> >
> > The next task is to work out why a single pass of the Smith workload
> > fragments the filesystem so much, and whether this can be improved.
>
> After reading the paper, one possible cause of skewed results may be
> a result of their "reverse engineering" of the filesystem layout from
> the inode numbers. They state that they were not able to capture the
> actual pathnames of the files in the workload, so they invented pathnames
> that would put the files in the same cylinder groups based on the FFS
> allocation algorithms then in use, assuming the test filesystem had an
> identical layout.

Their attempt to put each top-level dir in its own cylinder group
_may_ work with ext2 on pass zero, but it'll come unstuck on passes
one to sixteen. I've been treating this as basically random fs
activity. Any conclusions need to be verified with Constantin's
test code.

> ...
> Have you ever looked at the resulting data from a Smith test? Does it do
> more than simply create 27 directories (the number of cylinder groups) and
> dump files directly therein, as opposed to creating more subdirectories?
> Remember that they were strictly concerned with block allocation for files,
> and not cylinder group selection for directories.
>

True. At the end of the run, we end up with 62 directories and
8757 files. So under

/mountpoint/NN/

we end up with 61 directories and 474 files. The remainder of the
files are immediately under those 61 directories. It's a wide and
shallow layout.

2001-11-09 07:12:59

by Alexander Viro

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup



On Thu, 8 Nov 2001, Andreas Dilger wrote:

> It may be possible to hack the test data into ext2 by creating a filesystem
> with the same number of block groups as the test FFS filesystem with the
> Smith workload. It may also not be valid for our needs, as we are playing
> with the actual group selection algorithm, so real pathnames may give us
> a different layout.

Umm... What was the block and fragment sizes in their tests?

2001-11-09 07:24:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [Ext2-devel] ext2/ialloc.c cleanup

Alexander Viro wrote:
>
> On Thu, 8 Nov 2001, Andreas Dilger wrote:
>
> > It may be possible to hack the test data into ext2 by creating a filesystem
> > with the same number of block groups as the test FFS filesystem with the
> > Smith workload. It may also not be valid for our needs, as we are playing
> > with the actual group selection algorithm, so real pathnames may give us
> > a different layout.
>
> Umm... What was the block and fragment sizes in their tests?

Size: 502M
Fragment size: 1k
Block size: 8k
Max cluster size: 56k

I haven't been trying to recreate the Smith tests, BTW. Just using
it as a representative workload and something which is worth
optimising for.

2001-11-09 08:49:38

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

On Thu, 2001-11-08 at 22:08, Andrew Morton wrote:
> How do you avoid artifacts from Linux caching with this test?

Randomly seeking around a 40G disk isn't going to give the caching much
to work with (that was on a 256Mbyte machine). Also, I tried it with
raw devices, but it made no difference to the measurements.

> > I found the random seeking was the only way I could get reasonable
> > numbers out of the drive; any of the ordered patterns of the form "OK,
> > lets measure N block seeks" seemed to be predicted by the drive and gave
> > unreasonably fast results.
> >
>
> But we *want* unreasonably fast results! We need to understand
> this device-level prediction, then decide if it's sufficiently
> widespread for us to go and exploit the hell out of it.
>
> Have you any ideas as to what's happening?

Not really. I wanted results which at least confirm the simple model of
how the disk works. When I did deterministic seeks, it was something
like:

for (n = 0; n < disksize/2; n += about_a_cylinder) {
        seek(n);
        read();
        seek(disksize - n);
        read();
}

I suspect that the drive was partitioning its cache and using lots of
readahead to satisfy the reads at either end of the seek. I was trying
to work around that by making the n increment large enough, but it
didn't seem to work. Maybe the drive was being extra special clever, or
maybe I miscalculated the cylinder size. Or maybe I just happened to be
hitting a rotational sweet-spot.

For this kind of test, it would be interesting to know what differences
there are between read and write. In principle the drive could start
writing a track anywhere it wants so long as it has enough to write the
whole thing. That would automatically make it get the track-track skew
right...

J

2001-11-09 09:27:36

by Pavel Machek

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Hi!

> By definition, "slow growth" is amenable to defragmentation. And by
> definition, fast growth isn't. And if we're talking about speedups on the

Heh, last time I looked, defrage2fs crashed on a >2G partition and was
clearly not maintained. Have things changed?

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-11-10 01:00:27

by Riley Williams

[permalink] [raw]
Subject: Re: [Ext2-devel] disk throughput

Hi Al, Linus.

>> For example, spreading out (and the inherent assumption of "slow
>> growth") might make sense for the root directory, and possibly for a
>> level below that. It almost certainly does _not_ make sense for a
>> directory created four levels down.

> Depends on the naming scheme used by admin, for one thing. Linus, I
> seriously think that getting the real-life traces to play with would
> be a Good Thing(tm) - at least that would allow to test how well
> would heuristics of that kind do.

Let's be realistic here - you two seem to be talking at cross-purposes,
and that will get us precisely nowhere.

Linus: If I'm right, your "root directory" refers to the root
directory of a particular partition. Correct?

Al: If I'm right, your "root directory" refers to the root
directory of the entire VFS tree. Correct?

If you each try reading the other's comments with both of the above
definitions in mind, I think you'll see why your comments are clearly
drifting apart.

Best wishes from Riley.