2003-08-30 11:33:56

by Erik Hensema

Subject: First impressions of reiserfs4


Currently I'm testing reiserfs4 on my otherwise vanilla 2.6.0-test4
machine.

At first I tried building reiser4 as a module. The kernel failed to link
due to an unresolved symbol: sys_reiser4
I tried commenting sys_reiser4 out from entry.S. Now the kernel linked
fine.

However, I can't insert the module due to unexported symbols:

reiser4: Unknown symbol balance_dirty_pages
reiser4: Unknown symbol max_sane_readahead
reiser4: Unknown symbol generic_sync_sb_inodes
reiser4: Unknown symbol truncate_mapping_pages_range
reiser4: Unknown symbol wakeup_kswapd
reiser4: Unknown symbol balance_dirty_pages_ratelimited
reiser4: Unknown symbol inodes_stat
reiser4: Unknown symbol nr_free_pagecache_pages
reiser4: Unknown symbol destroy_inode

So, I tried linking reiser4 directly into the kernel. No problems there.

As we speak I'm building Mozilla Firebird from source on a 20 GB reiser4
partition. If something interesting comes up, this list will be the first
to know :-)

I've currently got only one small problem: df can't handle the data from
the kernel it seems. I also got this problem on NFS mounted partitions:

df: `/reiser4': Value too large for defined data type
df: `/home': Value too large for defined data type

--
Erik Hensema <[email protected]>


2003-08-30 17:10:36

by Hans Reiser

Subject: Re: First impressions of reiserfs4

Erik Hensema wrote:

>Currently I'm testing reiserfs4 on my otherwise vanilla 2.6.0-test4
>machine.
>
>At first I tried building reiser4 as a module. The kernel failed to link
>due to an unresolved symbol: sys_reiser4
>
compile without sys_reiser4 and not as a module, the config is fixed in
what will be the next snapshot

>I tried commenting sys_reiser4 out from entry.S. Now the kernel linked
>fine.
>
>However, I can't insert the module due to unexported symbols:
>
>reiser4: Unknown symbol balance_dirty_pages
>reiser4: Unknown symbol max_sane_readahead
>reiser4: Unknown symbol generic_sync_sb_inodes
>reiser4: Unknown symbol truncate_mapping_pages_range
>reiser4: Unknown symbol wakeup_kswapd
>reiser4: Unknown symbol balance_dirty_pages_ratelimited
>reiser4: Unknown symbol inodes_stat
>reiser4: Unknown symbol nr_free_pagecache_pages
>reiser4: Unknown symbol destroy_inode
>
>So, I tried linking reiser4 directly into the kernel. No problems there.
>
>As we speak I'm building Mozilla Firebird from source of a 20 GB reiser4
>partition. If something interesting comes up, this list will be the first
>to know :-)
>
>I've currently got only one small problem: df can't handle the data from
>the kernel it seems. I also got this problem on NFS mounted partitions:
>
fixed in what will be the next snapshot

>
>df: `/reiser4': Value too large for defined data type
>df: `/home': Value too large for defined data type
>
>
>
thanks for your patience.

Nikita, when are you releasing the next snapshot, with the improved
performance and bug fixes?

--
Hans


2003-08-31 17:14:52

by Rogier Wolff

Subject: Re: First impressions of reiserfs4

On Sat, Aug 30, 2003 at 09:06:14PM +0400, Hans Reiser wrote:
> >df: `/reiser4': Value too large for defined data type
> >df: `/home': Value too large for defined data type

Hans,

Would it be possible to do something like: "pretend that there
are always 100 million inodes free", and then report sensible
numbers to "df -i"?

There is no installation program that will fail with: "Sorry,
you only have 100 million inodes free, this program will need
132 million after installation", and it allows me a quick way
of counting the number of actual files on the disk....

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****

2003-09-08 08:12:10

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Sun, Aug 31, 2003 at 07:14:19PM +0200, Rogier Wolff wrote:

> Would it be possible to do something like: "pretend that there
> are always 100 million inodes free", and then report sensible
> numbers to "df -i"?

This won't work. No sensible numbers would be there.

> There is no installation program that will fail with: "Sorry,
> you only have 100 million inodes free, this program will need
> 132 million after installation", and it allows me a quick way
> of counting the number of actual files on the disk....

You cannot. statfs(2) only exports the "total number of inodes on disk" and
"number of free inodes on disk" values for a filesystem. df subtracts one from the other
to get the "number of inodes in use".
Actually, we export the necessary numbers through sysfs for now. And we have a patch
in our tree that just sets the statfs(2) inode fields to zero. You should see it after
the next snapshot is released.

$ cat /sys/fs/reiser4/hdb1/oids_in_use
104875
$ cat /sys/fs/reiser4/hdb1/next_to_use
261239
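
A minimal C sketch of the subtraction described above, for illustration only.
The helper names inodes_used and path_inodes_used are hypothetical; statfs(2)
and its f_files and f_ffree fields are the real Linux interface.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/vfs.h>   /* statfs(2); Linux-specific header */

/* What "df -i" shows as IUsed: total inodes minus free inodes. */
uint64_t inodes_used(uint64_t f_files, uint64_t f_ffree)
{
    assert(f_ffree <= f_files);
    return f_files - f_ffree;
}

/* Hypothetical helper: query the filesystem holding `path` and derive
 * the in-use count. Returns 0 on success, -1 if statfs(2) fails. */
int path_inodes_used(const char *path, uint64_t *used)
{
    struct statfs st;

    if (statfs(path, &st) != 0)
        return -1;
    *used = inodes_used((uint64_t)st.f_files, (uint64_t)st.f_ffree);
    return 0;
}
```

With a hypothetical total of 100104875 and 100000000 free, inodes_used()
yields 104875, the oids_in_use value in the sysfs output above. A filesystem
that reports both fields as zero yields 0.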

Bye,
Oleg

2003-09-08 08:56:57

by Rogier Wolff

Subject: Re: First impressions of reiserfs4

On Mon, Sep 08, 2003 at 12:12:06PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Sun, Aug 31, 2003 at 07:14:19PM +0200, Rogier Wolff wrote:
>
> > Would it be possible to do something like: "pretend that there
> > are always 100 million inodes free", and then report sensible
> > numbers to "df -i"?
>
> This won't work. No sensible numbers would be there.
>
> > There is no installation program that will fail with: "Sorry,
> > you only have 100 million inodes free, this program will need
> > 132 million after installation", and it allows me a quick way
> > of counting the number of actual files on the disk....
>
> You cannot. statfs(2) only exports "Total number of inodes on disk" and
> "number of free inodes on disk" values for fs. df substracts one from another one
> to get "number of inodes in use".

So, you report "oids_in_use + 100M" as total and "100M" as free inodes
on disk. Voila!

We're using a Unix operating system which has a bunch of standard
interfaces. The fun about using those is that lots of stuff "just works"
even if it wasn't designed to do exactly what you are doing right
now. So even if "df" wasn't designed to work on NFS, it still works.

But now we're going to get a new "df" which grabs the sysfs info and
uses that. But it won't work on reiserfs5, as the interface changes
again.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****

2003-09-08 09:08:28

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Mon, Sep 08, 2003 at 10:56:41AM +0200, Rogier Wolff wrote:
> > > There is no installation program that will fail with: "Sorry,
> > > you only have 100 million inodes free, this program will need
> > > 132 million after installation", and it allows me a quick way
> > > of counting the number of actual files on the disk....
> > You cannot. statfs(2) only exports "Total number of inodes on disk" and
> > "number of free inodes on disk" values for fs. df substracts one from another one
> > to get "number of inodes in use".
> So, you report "oids_in_use + 100M" as total and "100M" as free inodes
> on disk. Voila!

Yes, we thought about that too. We need to be careful not to overflow "long int".
And the idea of a filesystem whose number of inodes varies over time sounds confusing to me, too.

> We're using a Unix operating system which has a bunch of standard
> interfaces. The fun about using those is that lots of stuff "just works"
> even if it wasn't designed to do exactly what you are doing right
> now. So even if "df" wasn't designed to work on NFS, it still works.

Yes. There is a special value of zero, which says "this field makes absolutely
no sense for this filesystem". Which is sort of our case.

> But now we're going to get a new "df" which grabs the sysfs info and
> uses that. But it won't work on reiserfs5, as the interface changes
> again.

Well, if the current interface does not let you see everything you want to,
it is time to change the interface (introduce a new one) anyway.

Bye,
Oleg

2003-09-08 09:33:21

by Rogier Wolff

Subject: Re: First impressions of reiserfs4

On Mon, Sep 08, 2003 at 01:08:26PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Mon, Sep 08, 2003 at 10:56:41AM +0200, Rogier Wolff wrote:
> > > > There is no installation program that will fail with: "Sorry,
> > > > you only have 100 million inodes free, this program will need
> > > > 132 million after installation", and it allows me a quick way
> > > > of counting the number of actual files on the disk....
> > > You cannot. statfs(2) only exports "Total number of inodes on disk" and
> > > "number of free inodes on disk" values for fs. df substracts one from another one
> > > to get "number of inodes in use".
> > So, you report "oids_in_use + 100M" as total and "100M" as free inodes
> > on disk. Voila!

> Yes, we thought about that too. Need to be careful to not overflow
> "long int".

> And idea of filesystem with variable amount of inodes over time
> sounds confusing to me, too.

SO? That's actually the case. So it's confusing. So you're confusing
people even more by telling nothing. Great.

#define LARGE_NUMBER 100000

out->total_inodes = fs->oids_in_use + LARGE_NUMBER;
if (out->total_inodes < fs->oids_in_use)   /* the addition wrapped around */
        out->total_inodes = MAXINT;        /* so cap at the maximum */
out->free_inodes = LARGE_NUMBER;

Three lines of code fixes that.
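
For illustration, a self-contained C sketch of the capping scheme proposed
here. It assumes unsigned 32-bit counters and uses UINT32_MAX in place of
MAXINT; the struct and function names are hypothetical, not reiser4 code.

```c
#include <stdint.h>

#define LARGE_NUMBER 100000u   /* headroom always reported as "free" */

struct inode_report {
    uint32_t total_inodes;
    uint32_t free_inodes;
};

/* Report oids_in_use + LARGE_NUMBER as the total; cap on wrap-around. */
struct inode_report report_inodes(uint32_t oids_in_use)
{
    struct inode_report out;

    out.total_inodes = oids_in_use + LARGE_NUMBER;  /* may wrap modulo 2^32 */
    if (out.total_inodes < oids_in_use)
        out.total_inodes = UINT32_MAX;              /* cap instead of wrapping */
    out.free_inodes = LARGE_NUMBER;
    return out;
}
```

With oids_in_use = 104875 this reports 204875 total and 100000 free, so
total minus free gives back 104875. Once the cap is hit, total minus free no
longer matches the real count, which is the trade-off being argued about.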

> Well, if current interface does not allow to see all the stuff you want to,
> time to change (introduce new one) interface, anyway.

Fine, introduce a new interface. But report as much as you can on the
old interface. Remember you can read/write/seek files using the 32bit
interface even though the new (seek-, and stat-) interface uses 64
bits.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****

2003-09-08 09:50:06

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Mon, Sep 08, 2003 at 11:33:04AM +0200, Rogier Wolff wrote:
> > > > > There is no installation program that will fail with: "Sorry,
> > > > > you only have 100 million inodes free, this program will need
> > > > > 132 million after installation", and it allows me a quick way
> > > > > of counting the number of actual files on the disk....
> > > > You cannot. statfs(2) only exports "Total number of inodes on disk" and
> > > > "number of free inodes on disk" values for fs. df substracts one from another one
> > > > to get "number of inodes in use".
> > > So, you report "oids_in_use + 100M" as total and "100M" as free inodes
> > > on disk. Voila!
> > Yes, we thought about that too. Need to be careful to not overflow
> > "long int".
> > And idea of filesystem with variable amount of inodes over time
> > sounds confusing to me, too. ]
> SO? That's actually the case. So it's confusing. So you're confusing
> people even more by telling nothing. Great.

Well, but statfs(2) does not return an "inodes in use" value, that's it.

> #define LARGE_NUMBER 100000
> out->total_inodes = fs->oids_in_use + LARGE_NUMBER;
> if (out->total_inodes < fs->oids_in_use)
> out -> total_inods = MAXINT;
> out -> free_inodes = LARGE_NUMBER;
> Three lines of code fixes that.

Yes, and you get complete crap once you hit the overflow condition?

> > Well, if current interface does not allow to see all the stuff you want to,
> > time to change (introduce new one) interface, anyway.
> Fine, introduce a new interface. But report as much as you can on the
> old interface. Remember you can read/write/seek files using the 32bit
> interface even though the new (seek-, and stat-) interface uses 64
> bits.

You need to open a file with O_LARGEFILE first, so old binaries still won't work.

Bye,
Oleg

2003-09-08 10:05:43

by Rogier Wolff

Subject: Re: First impressions of reiserfs4

On Mon, Sep 08, 2003 at 01:48:25PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Mon, Sep 08, 2003 at 11:33:04AM +0200, Rogier Wolff wrote:
> > > > > > There is no installation program that will fail with: "Sorry,
> > > > > > you only have 100 million inodes free, this program will need
> > > > > > 132 million after installation", and it allows me a quick way
> > > > > > of counting the number of actual files on the disk....
> > > > > You cannot. statfs(2) only exports "Total number of inodes on disk" and
> > > > > "number of free inodes on disk" values for fs. df substracts one from another one
> > > > > to get "number of inodes in use".
> > > > So, you report "oids_in_use + 100M" as total and "100M" as free inodes
> > > > on disk. Voila!
> > > Yes, we thought about that too. Need to be careful to not overflow
> > > "long int".
> > > And idea of filesystem with variable amount of inodes over time
> > > sounds confusing to me, too. ]
> > SO? That's actually the case. So it's confusing. So you're confusing
> > people even more by telling nothing. Great.
>
> Well, but statfs(2) does not return an "inodes in use" value, that's it.
>
> > #define LARGE_NUMBER 100000
> > out->total_inodes = fs->oids_in_use + LARGE_NUMBER;
> > if (out->total_inodes < fs->oids_in_use)
> > out -> total_inods = MAXINT;
> > out -> free_inodes = LARGE_NUMBER;
> > Three lines of code fixes that.
>
> Yes, and you get complete crap once you hit the overflow condition?

No. Not complete crap. It's a thirty two bit integer. What do you expect
when you hit the "limit"?

What will ext2 report when you have 4G inodes in use?

Just capping is the best way.

And as Reiserfs has the option of still storing (at least)
LARGE_NUMBER more files it can simply report that as well.

Anyway, I happen to (also) work for a company called
"harddisk-recovery.nl". We get to see varying types of uses for
hard disks and their contents. So far we've had two clients with more
than half a million files. One had 3.6 and the other had 4.7 million
files. Trust me, those are extreme. Oh, and we have on the order of 10
million files ourselves (but not quite that many inodes!). We're
extreme.

The tendency is that when disks get larger, so do the files. The
number of files grows much slower than the number of bytes.

So a factor of 1000 will last us some 40 years instead of the 20 that
Mr Moore predicts. Now Alan is saying that by the time we have
512Mbyte of RAM on video cards we'll all be using 64 bit anyway. Well
I predict that he's a bit optimistic in that respect. But in 40 years,
you'll have your new 64 bit statfs. You can be certain of that.

> > > Well, if current interface does not allow to see all the stuff you want to,
> > > time to change (introduce new one) interface, anyway.
> > Fine, introduce a new interface. But report as much as you can on the
> > old interface. Remember you can read/write/seek files using the 32bit
> > interface even though the new (seek-, and stat-) interface uses 64
> > bits.

> You need to open a file with O_LARGEFILE first, so old binaries
> still won't work.

1) I can still work on files smaller than 2G without problems.

2) If my shell uses O_LARGEFILE, I can redirect stdin and stdout
to large files anyway, even if the app would open without O_LARGEFILE.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****

2003-09-08 10:17:10

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Mon, Sep 08, 2003 at 12:05:31PM +0200, Rogier Wolff wrote:
> > Well, but statfs(2) does not return an "inodes in use" value, that's it.
> > > #define LARGE_NUMBER 100000
> > > out->total_inodes = fs->oids_in_use + LARGE_NUMBER;
> > > if (out->total_inodes < fs->oids_in_use)
> > > out -> total_inods = MAXINT;
> > > out -> free_inodes = LARGE_NUMBER;
> > > Three lines of code fixes that.
> > Yes, and you get complete crap once you hit the overflow condition?
> No. Not complete crap. It's a thirty two bit integer. What do you expect
> when you hit the "limit"?

I prefer there to be no limit ;)

> What will ext2 report when you have 4G inodes in use?

This is basically not possible on 32-bit architectures now (and not a problem for 64-bit ones).
We limit ourselves to blocks of at least 512 bytes.
You can only have as many inodes as there are blocks on the fs (at least that's the limit imposed on you
by mke2fs).
So a 2TB fs gives you ~4G inodes with a 512-byte blocksize.
Yes, I know there is support for bigger block devices in 2.6, but who will use
those with such small blocksizes anyway?
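
The arithmetic behind the "~4G inodes" figure can be sketched in C as follows.
It assumes the rule stated above (at most one inode per block) and reads
"2Tb" as 2 TiB; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the inode count if at most one inode per block is allowed. */
uint64_t max_inodes_for_device(uint64_t device_bytes, uint32_t block_size)
{
    assert(block_size >= 512);   /* smallest block size considered here */
    return device_bytes / block_size;
}
```

For 2 TiB and 512-byte blocks this gives 2^41 / 2^9 = 2^32 = 4294967296
inodes, one more than a 32-bit counter can hold. With 4096-byte blocks the
bound drops to 2^29 = 536870912.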

> Just capping is the best way.

Maybe.
At least people here seem to like the idea, so we will implement it.

> Anyway, I happen to (also) work for a company called
> "harddisk-recovery.nl". We get to see varying types of uses for
> harddisk and their contents. So far we've had two clients with more
> than half a million files. One had 3.6 and the other had 4.7 million
> files. Trust me, those are extreme. Oh, and we have on the order of 10
> million files ourselves (but notquite that many inodes!). We're
> extreme.

Hm.

> > > old interface. Remember you can read/write/seek files using the 32bit
> > > interface even though the new (seek-, and stat-) interface uses 64
> > > bits.
> > You need to open a file with O_LARGEFILE first, so old binaries
> > still won't work.
> 1) I can still work on files smaller than 2G without problems.

Yes, that's for sure ;)

> 2) If my shell uses O_LARGEFILE, I can redirect stdin and stdout
> to large files anyway, even if the app would open without O_LARGEFILE.

Yes, missed this case ;)

Bye,
Oleg

2003-09-08 12:59:22

by Herbert Poetzl

Subject: Re: First impressions of reiserfs4

On Mon, Sep 08, 2003 at 02:17:04PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Mon, Sep 08, 2003 at 12:05:31PM +0200, Rogier Wolff wrote:
> > > Well, but statfs(2) does not return an "inodes in use" value, that's it.
> > > > #define LARGE_NUMBER 100000
> > > > out->total_inodes = fs->oids_in_use + LARGE_NUMBER;
> > > > if (out->total_inodes < fs->oids_in_use)
> > > > out -> total_inods = MAXINT;
> > > > out -> free_inodes = LARGE_NUMBER;
> > > > Three lines of code fixes that.
> > > Yes, and you get complete crap once you hit the overflow condition?
> > No. Not complete crap. It's a thirty two bit integer. What do you expect
> > when you hit the "limit"?

what about

total_inodes = MAXINT;
free_inodes = total_inodes - oids_in_use;

this would not change from one moment to
the other, reflect the correct amount, and
stay within limits for reasonable oids_in_use

best,
Herbert

2003-09-08 20:12:55

by Andreas Dilger

Subject: Re: First impressions of reiserfs4

On Sep 08, 2003 12:12 +0400, Oleg Drokin wrote:
> Hello!
> On Sun, Aug 31, 2003 at 07:14:19PM +0200, Rogier Wolff wrote:
>
> > Would it be possible to do something like: "pretend that there
> > are always 100 million inodes free", and then report sensible
> > numbers to "df -i"?
>
> This won't work. No sensible numbers would be there.
>
> > There is no installation program that will fail with: "Sorry,
> > you only have 100 million inodes free, this program will need
> > 132 million after installation", and it allows me a quick way
> > of counting the number of actual files on the disk....
>
> You cannot. statfs(2) only exports "Total number of inodes on disk" and
> "number of free inodes on disk" values for fs. df substracts one from
> another one to get "number of inodes in use".
> Actually we export necessary numbers through sysfs for now. And we have
> patch in our tree that just sets statfs(2) inode stuff to zero. You should
> see it after next snapshot is released.

In a way, it would have been nice if "sys_statfs64()" had implemented the
values as "files in use" and "files total" instead of the older (and less
useful "files free").

However, that doesn't mean you can't return something useful to statfs().
Since the linux VFS limits us to 2^32 - 1 inodes for now, you could still
return 2^32 - 1 - num_in_use for "f_ffree" and 2^32 -1 for f_files, so that
"df -i" shows a useful number for IUsed.
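
A minimal C sketch of this fixed-total scheme, for illustration. The struct
and function names are hypothetical; only the field names f_files and f_ffree
and the 2^32 - 1 limit come from the text above. The clamp for counts above
the limit is an added assumption.

```c
#include <stdint.h>

#define VFS_MAX_INODES UINT32_MAX   /* 2^32 - 1, the limit cited above */

struct inode_counts {
    uint32_t f_files;   /* total inodes reported */
    uint32_t f_ffree;   /* free inodes reported */
};

/* Report a constant total and derive "free" from the in-use count. */
struct inode_counts fixed_total_report(uint64_t num_in_use)
{
    struct inode_counts out;

    if (num_in_use > VFS_MAX_INODES)
        num_in_use = VFS_MAX_INODES;   /* clamp so f_ffree cannot underflow */
    out.f_files = VFS_MAX_INODES;
    out.f_ffree = VFS_MAX_INODES - (uint32_t)num_in_use;
    return out;
}
```

Because df computes f_files - f_ffree, any in-use count below the limit is
shown exactly, and the reported total never changes over time.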


Sadly, the sys_statfs64() API is broken such that the filesystem can't make
a distinction between being called from sys_statfs64() and sys_statfs(),
so you have to assume the 32-bit limits even for the 64-bit API. We should
really have a new FS method which is "statfs64()" that is optionally called
from sys_statfs64() so the FS has a chance to return something different for
64-bit callers.

For Lustre, we can't be guaranteed to fit into the 32-bit f_blocks counts
with 100TB filesystems, so we scale the f_bsize until the f_blocks fits into
32 bits. However, we would like to be able to return the correct values to
sys_statfs64() if possible.
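
The block-size scaling described for Lustre could look roughly like the C
sketch below. This is not Lustre's code; the names are hypothetical and the
loop is only one plausible way to do it.

```c
#include <assert.h>
#include <stdint.h>

struct scaled_statfs {
    uint32_t f_bsize;
    uint32_t f_blocks;
};

/* Double the block size and halve the block count until the count fits
 * into 32 bits. Low bits of the count are rounded down at each step. */
struct scaled_statfs scale_blocks(uint64_t blocks, uint32_t bsize)
{
    struct scaled_statfs out;

    assert(bsize > 0);
    while (blocks > UINT32_MAX && bsize <= UINT32_MAX / 2) {
        blocks >>= 1;
        bsize <<= 1;
    }
    if (blocks > UINT32_MAX)
        blocks = UINT32_MAX;   /* block size cannot grow further: cap */
    out.f_bsize = bsize;
    out.f_blocks = (uint32_t)blocks;
    return out;
}
```

A 100 TiB filesystem with 4096-byte blocks has 26843545600 blocks. Three
doublings bring that to 3355443200 blocks of 32768 bytes, which fits into
32 bits and still multiplies out to the same total size.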

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://www-mddsp.enel.ucalgary.ca/People/adilger/

2003-09-08 22:24:47

by Mike Fedyk

Subject: Re: First impressions of reiserfs4

On Mon, Sep 08, 2003 at 02:17:04PM +0400, Oleg Drokin wrote:
> You only can have as many inodes as number of blocks on the fs (at least that's the limit imposed on you
> by mke2fs).

True, but not exactly. Each file will need one block to store even one byte
on ext2/3. But your inode tables have only about 1/4-1/2 as many inode entries as
there are blocks. This can be changed at mkfs time though.

# mke2fs -n -b 1024 -m0 /dev/md0
mke2fs 1.34-WIP (21-May-2003)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
39923712 inodes, 319388032 blocks
0 blocks (0.00%) reserved for the super user
First data block=1
38988 block groups
8192 blocks per group, 8192 fragments per group
1024 inodes per group

40M inodes and 319M blocks with a 1k block ext2/3 filesystem.

# mke2fs -n -b 4096 -m0 /dev/md0
mke2fs 1.34-WIP (21-May-2003)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
39927808 inodes, 79847008 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
2437 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group

40M inodes and 80M blocks on a 4k block ext2/3 filesystem.

So with the defaults, you'd have to have 40M files each between 4.1 - 7.9Kb
to run out of inodes, and fill the filesystem.

# mke2fs -n -b 1024 -m0 -T news /dev/md0
mke2fs 1.34-WIP (21-May-2003)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
79847424 inodes, 319388032 blocks
0 blocks (0.00%) reserved for the super user
First data block=1
38988 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group

80M inodes and 319M blocks on a 1k block ext2/3 filesystem.
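
The inode density in the three runs above can be checked with a small C
helper. The function name is hypothetical; the figures in the note are taken
straight from the mke2fs output.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of bytes of filesystem space per inode. */
double bytes_per_inode(uint64_t blocks, uint32_t block_size, uint64_t inodes)
{
    assert(inodes > 0);
    return (double)blocks * block_size / (double)inodes;
}
```

Both default runs (319388032 blocks of 1k with 39923712 inodes, and 79847008
blocks of 4k with 39927808 inodes) come out just under 8192 bytes per inode.
The -T news run (79847424 inodes) comes out just under 4096 bytes per inode.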

Now one question I have...

Is that 32k blocks per group + 32k fragments per group = 64k blocks per
group (since fragments aren't[1] implemented)?

[1] For those interested, the space allocated for fragments was a perfect
fit for tail merging support. There is a patch in alpha stages floating
around for that...

Hmm, take ext3 with htree, reiser3 & reiser4 (choose the block size 1k, 2k or 4k) with
tail merging off, 1k files per directory and all files the same size as
block size with 40M files. How would the table look as far as space efficiency
goes when comparing them? For that matter, how do JFS & XFS compare?

Mike

2003-09-09 07:04:30

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Mon, Sep 08, 2003 at 03:24:57PM -0700, Mike Fedyk wrote:
> > You only can have as many inodes as number of blocks on the fs (at least that's the limit imposed on you
> > by mke2fs).
> True, but not exactly. Each file will need one block to store even one byte
> on ext2/3. But your inode tables have about 1/4-1/2 the number of inode entries to
> blocks. This can be changed at mkfs time though.

Yes, I know this. But my experiments quickly showed that if you ask mkfs to create inode tables with
more free inodes than the device has blocks, then mkfs will only create as many free inodes
as there are free blocks on the device (I needed that when I experimented with 60 million files
on ext2/reiserfs/xfs and so on, and I only had a 20G partition.)

> Hmm, take ext3 with htree, reiser3 & reiser4 (choose the block size 1k, 2k or 4k) with

reiser4 does not have support for blocksize different from page size for now (sigh, same old problems
we finally solved for reiser3 recently).

> tail merging off, 1k files per directory and all files the same size as
> block size with 40M files. How would the table look as far as space effency

Hm. I will probably try this once.
For reiserfs:
I can tell you that 60M+ empty files (cannot remember the exact number, but I still have the script to create those)
took ~5.5G of space. Then 60M * 4k is 240G, all these blocks are referenced by leaf nodes, ~1000 pointers fit into one node,
so we will spend ~245M for block pointers (the extra 5 because there are more layers of indirection).

> look comparing them? For that matter, how do JFS & XFS compare?

Unfortunately I never had the patience to wait until XFS creates 60M files. Have not tried jfs.

Bye,
Oleg

2003-09-09 19:10:31

by Mike Fedyk

Subject: Re: First impressions of reiserfs4

On Tue, Sep 09, 2003 at 11:04:21AM +0400, Oleg Drokin wrote:
> Hello!
>
> On Mon, Sep 08, 2003 at 03:24:57PM -0700, Mike Fedyk wrote:
> > > You only can have as many inodes as number of blocks on the fs (at least that's the limit imposed on you
> > > by mke2fs).
> > True, but not exactly. Each file will need one block to store even one byte
> > on ext2/3. But your inode tables have about 1/4-1/2 the number of inode entries to
> > blocks. This can be changed at mkfs time though.
>
> Yes, I know this.

I figured you did, as the explanation was mostly for people who don't.

> But my experiments quickly shown that if you ask mkfs to create inode tables with
> free inodes that exceed blocks count for the device, then mkfs will only create as much free inodes
> as there are free blocks on the device (I was needing that when I experimented with 60 millions files
> on ext2/reiserfs/xfs and stuff and I only had 20G partition.)
>

Hmm, didn't know this, but it makes sense for ext2/3 since they use 1 block
per file/directory. It wouldn't do much good to waste more space for inode
tables than you could even theoretically use.

> > Hmm, take ext3 with htree, reiser3 & reiser4 (choose the block size 1k, 2k or 4k) with
>
> reiser4 does not have support for blocksize different from page size for now (sigh, same old problems
> we finally solved for reiser3 recently).
>

Interesting, somewhere I think I saw that it was using 512 byte blocks, but
don't ask me where I saw that or who said it.

> > tail merging off, 1k files per directory and all files the same size as
> > block size with 40M files. How would the table look as far as space effency
>
> Hm. I will probably try this once.
> For reiserfs:
> I can tell you that 60M+ empty files (cannot remember exact number, but I still have the script to create those)
> took ~5.5G of space.

With how many directories? Do you run into drive speed limitations with
that much meta-data, or are there still bottlenecks in the
journaling/hashing to deal with? How big are the reiser3/4 equivalents to
inodes? In ext2/3 they're currently 128 bytes I believe plus some static
bitmaps in the block groups. The only thing variable in ext2/3 are the
directory sizes (and they don't shrink... :( )

> Then 60M * 4k is 240G, all these blocks are referenced by leafnodes, ~1000 pointers fits into one node,
> so we will spend ~245M for block pointers (extra 5 because there are more layers of indirections).
>
> > look comparing them? For that matter, how do JFS & XFS compare?
>
> Unfortunatelly I never had the patience to wait until XFS creates 60M files. Have not tried jfs.
>

Hmm, isn't XFS slower than ext2/3 in that regard?

2003-09-11 10:29:42

by Oleg Drokin

Subject: Re: First impressions of reiserfs4

Hello!

On Tue, Sep 09, 2003 at 12:10:44PM -0700, Mike Fedyk wrote:

> > But my experiments quickly shown that if you ask mkfs to create inode tables with
> > free inodes that exceed blocks count for the device, then mkfs will only create as much free inodes
> > as there are free blocks on the device (I was needing that when I experimented with 60 millions files
> > on ext2/reiserfs/xfs and stuff and I only had 20G partition.)
> Hmm, didn't know this, but it makes sence for ext2/3 since they use 1 block
> per file/directory. It wouldn't do much good to waste more space for inode
> tables than you could even theoretically use.

Well, in fact empty files do not need this block.

> > > tail merging off, 1k files per directory and all files the same size as
> > > block size with 40M files. How would the table look as far as space effency
> > Hm. I will probably try this once.
> > For reiserfs:
> > I can tell you that 60M+ empty files (cannot remember exact number, but I still have the script to create those)
> > took ~5.5G of space.
> With how many directories? Do you run into drive speed limitations with

Ok. I looked at the script. There should be 182900000 files created. (182.9M)
100 files per dir.
the dir structure was like this (in number of dirs per level):
31/59/25/40
Files were only created at the end of hierarchy.
See the script at the end if you are interested or want to try it yourself.
(script was donated to us by somebody, only it was shell script,
also I changed variable-names to hide identity).
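
The file count follows directly from the directory fan-out given above. A
short C sketch of that multiplication (the function name is hypothetical,
and this is not the attached script):

```c
#include <stddef.h>
#include <stdint.h>

/* Files live only in the deepest directories, so the total is the product
 * of the per-level directory counts times the files per directory. */
uint64_t total_files(const uint32_t *dirs_per_level, size_t levels,
                     uint32_t files_per_dir)
{
    uint64_t leaf_dirs = 1;

    for (size_t i = 0; i < levels; i++)
        leaf_dirs *= dirs_per_level[i];
    return leaf_dirs * files_per_dir;
}
```

With levels of 31, 59, 25 and 40 directories and 100 files per directory,
this gives 31 * 59 * 25 * 40 * 100 = 182900000 files.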

> that much meta-data, or are there still bottlenecks in the
> journaling/hashing to deal with? How big are the reiser3/4 equivalents to

I do not remember what the major limitation was.
Creation took something like one hour on my dual Athlon box. This suggests
most of the overhead was still CPU. (I used a perl script to create stuff.)
Removing those files took even longer.

> inodes? In ext2/3 they're currently 128 bytes I believe plus some static
> bitmaps in the block groups. The only thing variable in ext2/3 are the
> directory sizes (and they don't shrink... :( )

Well, in reiserfs we have statdata (each object should have one); this is sort of
like an ext2 inode, only not static. Its size is 44 bytes (plus 24 bytes of item
header overhead). Each metadata block has a header of 24 bytes.
If you write to a file (not looking at the tail case yet), you create "indirect"
items in which block pointers are stored
(4 bytes per pointer; when you use all the space in a metadata block, the next block is
allocated (24 bytes of overhead + a pointer in the higher-level block) plus
a new indirect item (24 bytes of overhead again)).
Also, bitmaps are static (1 block per 128M of space in the case of a 4k blocksize),
as are the superblock, journal and journal header.
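
The sizes quoted above can be turned into a quick C calculation. This is a
sketch under assumptions: one indirect item filling a whole metadata block,
and one bitmap bit per filesystem block. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STATDATA_BYTES     44u   /* statdata body */
#define ITEM_HEADER_BYTES  24u   /* per-item header */
#define BLOCK_HEADER_BYTES 24u   /* per-metadata-block header */
#define POINTER_BYTES       4u   /* one block pointer */

/* Bytes taken by one object's statdata item, header included. */
uint32_t statdata_item_bytes(void)
{
    return STATDATA_BYTES + ITEM_HEADER_BYTES;
}

/* Block pointers that fit in a metadata block holding one indirect item. */
uint32_t pointers_per_block(uint32_t block_size)
{
    assert(block_size > BLOCK_HEADER_BYTES + ITEM_HEADER_BYTES);
    return (block_size - BLOCK_HEADER_BYTES - ITEM_HEADER_BYTES) / POINTER_BYTES;
}

/* Space tracked by one bitmap block, at one bit per filesystem block. */
uint64_t bitmap_coverage_bytes(uint32_t block_size)
{
    return (uint64_t)block_size * 8u * block_size;
}
```

For a 4k blocksize this gives 68 bytes per statdata item, 1012 pointers per
block (in line with the "~1000 pointers" estimate quoted earlier), and
134217728 bytes, i.e. 128M, covered per bitmap block.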

> > Then 60M * 4k is 240G, all these blocks are referenced by leafnodes, ~1000 pointers fits into one node,
> > so we will spend ~245M for block pointers (extra 5 because there are more layers of indirections).
> > > look comparing them? For that matter, how do JFS & XFS compare?
> > Unfortunatelly I never had the patience to wait until XFS creates 60M files. Have not tried jfs.
> Hmm, isn't XFS slower than ext2/3 in that regard?

Probably it is. I was unable to find a block device big enough to hold
all the inode stuff for ext2/3, so I do not have a comparable number.
XFS was very slow at creation (3 hours gave only ~10% progress,
and testing was stopped at that point.
Deletion of those created files also took forever.)

Bye,
Oleg


Attachments:
(No filename) (3.42 kB)
createdirs.pl (1.57 kB)

2003-09-11 17:17:54

by Mike Fedyk

Subject: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

On Thu, Sep 11, 2003 at 02:29:38PM +0400, Oleg Drokin wrote:
> Hello!
>
> On Tue, Sep 09, 2003 at 12:10:44PM -0700, Mike Fedyk wrote:
>
> > > But my experiments quickly shown that if you ask mkfs to create inode tables with
> > > free inodes that exceed blocks count for the device, then mkfs will only create as much free inodes
> > > as there are free blocks on the device (I was needing that when I experimented with 60 millions files
> > > on ext2/reiserfs/xfs and stuff and I only had 20G partition.)
> > Hmm, didn't know this, but it makes sense for ext2/3 since they use 1 block
> > per file/directory. It wouldn't do much good to waste more space for inode
> > tables than you could even theoretically use.
>
> Well, in fact empty files do not need this block.
>

True. Do you know if ext2/3 allocates the block even for empty files? So
if you create the file, it should be sparse until you write something to it,
right? Does the touch command do this?

> > > > tail merging off, 1k files per directory and all files the same size as
> > > > block size with 40M files. How would the table look as far as space efficiency
> > > Hm. I will probably try this once.
> > > For reiserfs:
> > > I can tell you that 60M+ empty files (cannot remember exact number, but I still have the script to create those)
> > > took ~5.5G of space.
> > With how many directories? Do you run into drive speed limitations with
>
> Ok. I looked at the script. There should be 182900000 files created. (182.9M)
> 100 files per dir.
> the dir structure was like this (in number of dirs per level):
> 31/59/25/40
> Files were only created at the end of hierarchy.
> See the script at the end if you are interested or want to try it yourself.
> (script was donated to us by somebody, only it was shell script,
> also I changed variable-names to hide identity).
>
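
As a quick sanity check of the quoted file count, here is a small Python sketch. It assumes that "31/59/25/40" means each directory at one level contains that many subdirectories at the next level, and that files live only in the deepest directories, as described above. The names are illustrative and this is not the createdirs.pl script itself.

```python
from math import prod

# Directory fan-out per level and files per leaf directory, as quoted above.
DIRS_PER_LEVEL = (31, 59, 25, 40)
FILES_PER_DIR = 100


def leaf_dirs(levels):
    """Number of directories at the deepest level of the hierarchy."""
    return prod(levels)


def all_dirs(levels):
    """Total number of directories across every level."""
    total, at_level = 0, 1
    for fanout in levels:
        at_level *= fanout
        total += at_level
    return total


def total_files(levels, files_per_dir):
    """Files are created only in the leaf directories."""
    return leaf_dirs(levels) * files_per_dir


if __name__ == "__main__":
    print("leaf directories:", leaf_dirs(DIRS_PER_LEVEL))
    print("all directories: ", all_dirs(DIRS_PER_LEVEL))
    print("files:           ", total_files(DIRS_PER_LEVEL, FILES_PER_DIR))
```

Under that reading, the product 31 * 59 * 25 * 40 * 100 matches the 182900000 files mentioned in the quoted text.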

Hmm, any experiments with more files per dir (maybe 500 or 1000)? I'm not
sure you're going to use a full directory block with 100 files per dir in
ext2/3.

> > inodes? In ext2/3 they're currently 128 bytes I believe plus some static
> > bitmaps in the block groups. The only thing variable in ext2/3 are the
> > directory sizes (and they don't shrink... :( )
>
> Well in reiserfs we have statdata (each object should have one), this is sort of
> like ext2 inode, only not static. It's size is 44 bytes (plus 24 bytes item
> header overhead). Each metadata block has header of 24 bytes.
> If you write to file (not looking at tail case yet), you create "indirect"
> items in which block pointers are stored.
> (4 bytes per pointer, when you use all space in metadata block, next block is
> allocated (24 bytes of overhead + pointer in higher level block) plus
> new indirect item (24 bytes of overhead again))

Are these indirects stored in the tree, or do you have many partially filled
indirect blocks?

> Also bitmaps are static (1 block per 128M of space in case of 4k blocksize)
> as are superblock, journal and journal header.
>

How many superblocks are there in reiser3? Also, the bitmap locations are
static, and allocated at mkfs time? How is that done so fast for large
filesystems?

2003-09-11 17:33:39

by Oleg Drokin

Subject: Re: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

Hello!

On Thu, Sep 11, 2003 at 10:15:13AM -0700, Mike Fedyk wrote:
> > Well, in fact empty files do not need this block.
> True. Do you know if ext2/3 allocates the block even for empty files? So

No. But I guess it is not allocated. It would be stupid to allocate it.
"du" agrees with me that no block is allocated :)

> if you create the file, it should be sparse until you write something to it,
> right? Does the touch command do this?

Yes. I tried touch, and du reports that the file takes zero bytes.

> > Ok. I looked at the script. There should be 182900000 files created. (182.9M)
> > 100 files per dir.
> > the dir structure was like this (in number of dirs per level):
> > 31/59/25/40
> > Files were only created at the end of hierarchy.
> > See the script at the end if you are interested or want to try it yourself.
> > (script was donated to us by somebody, only it was shell script,
> > also I changed variable-names to hide identity).
> Hmm, any experiments with more files per dir (maybe 500 or 1000)? I'm not

Feel free to perform ;)

> sure if you're going to use a directory full block with 100 files per dir in
> ext2/3.

There is no such problem in reiserfs, and besides, those were the requirements
of those people.
Anyway, as I said, I had a problem creating an ext3 filesystem that can hold this
many inodes, simply because I did not have a big enough block device.

> > header overhead). Each metadata block has header of 24 bytes.
> > If you write to file (not looking at tail case yet), you create "indirect"
> > items in which block pointers are stored.
> > (4 bytes per pointer, when you use all space in metadata block, next block is
> > allocated (24 bytes of overhead + pointer in higher level block) plus
> > new indirect item (24 bytes of overhead again))
> Are these indirects stored in the tree, or do you have many partially filled
> indirect blocks?

They are stored in the tree.

> > Also bitmaps are static (1 block per 128M of space in case of 4k blocksize)
> > as are superblock, journal and journal header.
> How many superblocks are there in reiser3? Also, the bitmap locations are

One superblock.

> static, and allocated at mkfs time? How is that done so fast for large
> filesystems?

Bitmap locations are indeed static and allocated at mkfs time (and at resize
time if you happen to grow the volume after creating it).

This is not fast at all. We also load those bitmaps at mount time, which
leads to a lot of complaints that "reiserfs mounts large volumes slowly".

Bye,
Oleg

2003-09-11 22:50:20

by Rogier Wolff

Subject: Re: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

On Thu, Sep 11, 2003 at 09:27:40PM +0400, Oleg Drokin wrote:
> > > as are superblock, journal and journal header.
> > How many superblocks are there in reiser3? Also, the bitmap locations are
>
> One superblock.

As we've experienced that it's possible to lose the one-and-only
superblock, I would recommend that you build a backup superblock
in the future. Of course you're going to argue that some parameters
constantly change in the superblock so that it would mean a performance
penalty to have two of them. Well, the backup superblock should
be marked and used as such: It will allow a more "easy" recovery
of the filesystem parameters, should the primary be "gone", but
it should not interfere with "normal operation". So, feel free to
only update it once every ten minutes or so. Or just initialize
it and only write it when the fs is unmounted. Or don't update it
at all. But no backup superblock, is just plain wrong.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
**** "Linux is like a wigwam - no windows, no gates, apache inside!" ****

2003-09-12 01:18:03

by Bernd Eckenfels

Subject: Re: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

In article <[email protected]> you wrote:
>> Well, in fact empty files do not need this block.
>>
>
> True. Do you know if ext2/3 allocates the block even for empty files? So
> if you create the file, it should be sparse until you write something to it,
> right? Does the touch command do this?

At least it reserves an inode, and:

> touch /bla
> ls -lis bla
62 0 -rw-rw-r-- 1 ecki ecki 0 Sep 12 03:13 bla
> echo -n 1 >> /bla
> ls -lis bla
62 1 -rw-rw-r-- 1 ecki ecki 1 Sep 12 03:13 bla

looks like it reserves no data blocks until first written.
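
The same check can be sketched in Python with `os.stat`, which on Unix-like systems reports `st_blocks` in 512-byte units (whereas `ls -s` above typically counts 1k blocks). The helper names are made up for illustration, and the block counts you see depend on the filesystem; this only shows how to look, not what any particular filesystem must report.

```python
import os
import tempfile


def size_and_blocks(path):
    """Return (size in bytes, allocated 512-byte blocks) for a file."""
    st = os.stat(path)
    return st.st_size, st.st_blocks


def demo(directory=None):
    """Create an empty file, then append one byte, reporting both states.

    `directory` selects the filesystem to test; None means the default
    temporary directory.
    """
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        empty = size_and_blocks(path)       # freshly created, like `touch`
        with open(path, "ab") as f:
            f.write(b"1")                   # like `echo -n 1 >> file`
            f.flush()
            os.fsync(f.fileno())            # push the data to the filesystem
        one_byte = size_and_blocks(path)
    finally:
        os.unlink(path)
    return empty, one_byte


if __name__ == "__main__":
    empty, one_byte = demo()
    print("empty file: size=%d blocks=%d" % empty)
    print("one byte:   size=%d blocks=%d" % one_byte)
```

Pointing `demo()` at a directory on the filesystem of interest (for example an ext3 or XFS mount) lets you compare their behaviour directly.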

On XFS btw it starts with 4 blocks (2k?)

> ls -lis ~ecki/bla
7641042 4 -rw-rw-r-- 1 ecki ecki 1 Sep 12 03:13 bla

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-09-12 01:20:42

by Bernd Eckenfels

Subject: Re: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

In article <[email protected]> you wrote:
> at all. But no backup superblock, is just plain wrong.

This totally depends on the capabilities of the fsck and in-kernel journal
replay code, and whether they can reconstruct the data in there.

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2003-09-12 04:48:19

by Mike Fedyk

Subject: Re: Reiser3/4 & Ext2/3 was: First impressions of reiserfs4

On Fri, Sep 12, 2003 at 03:20:37AM +0200, Bernd Eckenfels wrote:
> In article <[email protected]> you wrote:
> > at all. But no backup superblock, is just plain wrong.
>
> this totally depends on the capabilities of the fsck and in-kernel journal
> replay code, if they can reconstruct the data in there.

And if you have no superblock, how does it know where the journal is?