LinuxLists.cc - BIG files & file systems

2002-07-31 19:15:35

Subject: BIG files & file systems

Hi,

I've just been told that some "limitations" of the following kind will
remain:
page index = unsigned long
ino_t = unsigned long

Lustre has definitely been asked to support much larger files than
16TB. Also file systems with a trillion files have been requested by
one of our supporters (you don't want to know who, besides I've no
idea how many bits go in a trillion, but it's more than 32).

I understand why people don't want to sprinkle the kernel with u64's,
and arguably we can wait a year or two and use 64 bit architectures,
so I'm probably not going to kick up a fuss about it.

However, I thought I'd let you know that there are organizations that
_really_ want to have such big files and file systems and get quite
dismayed about "small integers". And we will fail to deliver on a
requirement to write a 50TB file because of this.

My first Linux machine was a 25MHz i386 with a 40MB disk....

- Peter -
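
As a rough check of the numbers in this message, the arithmetic can be worked through directly. The sketch below assumes 4 KB pages and a 32-bit unsigned long (typical of 32-bit x86, though neither is stated above) and reads "a trillion" as 10^12; the helper name bits_needed is only illustrative.

```python
PAGE_SIZE = 4096          # assumed page size in bytes
INDEX_BITS = 32           # assumed width of unsigned long on a 32-bit machine


def bits_needed(count):
    """Smallest number of bits that can give each of `count` items a distinct value."""
    return max(1, (count - 1).bit_length())


# Largest file addressable with a 32-bit page index.
max_file_bytes = (2 ** INDEX_BITS) * PAGE_SIZE

# Number of pages in a 50 TB file (decimal terabytes), rounded up.
pages_in_50tb = -(-50 * 10 ** 12 // PAGE_SIZE)

print(max_file_bytes == 16 * 2 ** 40)      # the 16 TB ceiling is 16 * 2^40 bytes
print(bits_needed(pages_in_50tb))          # page-index bits for a 50 TB file
print(bits_needed(10 ** 12))               # inode-number bits for 10^12 files
```

Under these assumptions, both requirements need an index wider than 32 bits, which is the point being made about "small integers".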

2002-07-31 19:23:13

by Christoph Hellwig

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> Hi,
>
> I've just been told that some "limitations" of the following kind will
> remain:
> page index = unsigned long
> ino_t = unsigned long
>
> Lustre has definitely been asked to support much larger files than
> 16TB. Also file systems with a trillion files have been requested by
> one of our supporters (you don't want to know who, besides I've no
> idea how many bits go in a trillion, but it's more than 32).

What about using 64bit machines? ..

2002-07-31 20:00:47

by Matti Aarnio

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 08:26:38PM +0100, Christoph Hellwig wrote:
> On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> > Hi,
> >
> > I've just been told that some "limitations" of the following kind will
> > remain:
> >
> > page index = unsigned long
> > ino_t = unsigned long
> >
> > Lustre has definitely been asked to support much larger files than
> > 16TB. Also file systems with a trillion files have been requested by
> > one of our supporters (you don't want to know who, besides I've no
> > idea how many bits go in a trillion, but it's more than 32).
>
> What about using 64bit machines? ..

It depends on many things:
- Block layer (unsigned long)
- Page indexes (unsigned long)
- Filesystem format dependent limits
  - EXT2/EXT3: u32_t FILESYSTEM block index; presuming EXT2/EXT3
    is supported only up to 4 kB block sizes, that gives
    you a very hard limit of 16 terabytes (16 * "10^12")
  - ReiserFS: u32_t block indexes presently, u64_t in future;
    block size ranges ? Max size is limited by the
    maximum supported file size, likely 2^63, which is
    roughly 8 * "10^18", or circa 500 000 times larger
    than the EXT2/EXT3 format maximum.
  - ClusterFS (Braam et al.): 64 bit block indexes ?
    System file size limitation, same as with ReiserFS.

(Just to illustrate a few..)

/Matti Aarnio
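
The figures in this list can be reproduced with a few lines of arithmetic. This is only a sketch: the 4 kB block size and the 2^63 file-size ceiling are taken from the message above, and the function name max_bytes is made up for the example.

```python
def max_bytes(index_bits, unit_size):
    """Bytes addressable by an `index_bits`-wide index over units of `unit_size` bytes."""
    return (2 ** index_bits) * unit_size


ext_fs_limit = max_bytes(32, 4096)     # 32-bit block index, 4 kB blocks
signed_64_limit = 2 ** 63              # file-size ceiling quoted for ReiserFS

print(ext_fs_limit)                        # 2^44 bytes
print(ext_fs_limit / 10 ** 12)             # the same limit in decimal terabytes
print(signed_64_limit // ext_fs_limit)     # ratio between the two limits
```

The ratio comes out at 2^19, which matches the "circa 500 000 times" estimate; the quoted "10^12" and "10^18" are the usual loose shorthand for 2^40 and 2^60.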

2002-07-31 20:08:59

by Christoph Hellwig

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 11:04:12PM +0300, Matti Aarnio wrote:
> It depends on many things:
> - Block layer (unsigned long)
> - Page indexes (unsigned long)

That grows with sizeof(unsigned long) on 64bit machines. And for the
filesystem internals, just use one that is designed for storage devices
that big (e.g. jfs or xfs, certainly not ext2/3 or reiserfs3).

2002-07-31 21:04:16

by Jan Harkes

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> Hi,
>
> I've just been told that some "limitations" of the following kind will
> remain:
> page index = unsigned long
> ino_t = unsigned long

The number of files is not limited by ino_t, just look at the
iget5_locked operation in fs/inode.c. It is possible to have your own
n-bit file identifier, and simply provide your own comparison function.
The ino_t then becomes the 'hash-bucket' in which the actual inode is
looked up.

For the page_index, maybe at some point someone manages to cleanly mix
large pages (2MB?) with the current 4KB pages. Very large files could
then use the page_index as an index into these large pages which should
allow for 9PB files (or something close to that).

Jan
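
The lookup scheme described here (a narrow number selects a hash bucket, and a filesystem-supplied comparison picks the right inode within it) can be sketched in user-space terms. This is a toy model, not the kernel's data structure or the iget5_locked signature; InodeCache, lookup and the XOR folding are all assumptions made for illustration.

```python
from collections import defaultdict


class InodeCache:
    """Toy in-memory inode cache keyed by a 32-bit hash value."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def iget(self, hashval, test, create):
        """Return the cached object in bucket `hashval` for which
        test(obj) is true, creating and caching one if none matches."""
        bucket = self._buckets[hashval & 0xFFFFFFFF]
        for obj in bucket:
            if test(obj):
                return obj
        obj = create()
        bucket.append(obj)
        return obj


def lookup(cache, fid):
    """Find or create the inode for a wide identifier `fid` (any int)."""
    # Fold the wide identifier down to 32 bits; distinct fids may collide.
    hashval = (fid ^ (fid >> 32) ^ (fid >> 64)) & 0xFFFFFFFF
    return cache.iget(hashval,
                      test=lambda inode: inode["fid"] == fid,
                      create=lambda: {"fid": fid, "ino": hashval})
```

Two identifiers that fold to the same 32-bit value end up in the same bucket but remain distinct objects, because the comparison is made on the full identifier rather than on the bucket number.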

2002-07-31 21:10:21

by Alexander Viro

Subject: Re: BIG files & file systems

On Wed, 31 Jul 2002, Jan Harkes wrote:

> On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> > Hi,
> >
> > I've just been told that some "limitations" of the following kind will
> > remain:
> > page index = unsigned long
> > ino_t = unsigned long
>
> The number of files is not limited by ino_t, just look at the
> iget5_locked operation in fs/inode.c. It is possible to have your own
> n-bit file identifier, and simply provide your own comparison function.
> The ino_t then becomes the 'hash-bucket' in which the actual inode is
> looked up.

You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
and friends will break in all sorts of amusing ways. And there's
nothing the kernel can do about that - applications expect 32bit st_ino
(compare them as 32bit values, etc.)
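
To make the failure mode concrete, here is a minimal sketch of the kind of hard-link detection an archiver performs, keyed on (st_dev, st_ino). The function name, the ino_bits parameter and the sample entries are invented for the example; real tools differ in detail.

```python
def find_hardlinks(entries, ino_bits=64):
    """Group paths that an archiver would treat as the same file.

    `entries` is an iterable of (path, st_dev, st_ino) tuples.  Inode
    numbers are truncated to `ino_bits` bits before comparison, to mimic
    an application that stores st_ino in a narrower integer.
    """
    mask = (1 << ino_bits) - 1
    seen = {}
    for path, dev, ino in entries:
        seen.setdefault((dev, ino & mask), []).append(path)
    return [paths for paths in seen.values() if len(paths) > 1]


entries = [
    ("a", 1, 7),
    ("b", 1, 7),               # genuine hard link to "a"
    ("c", 1, 7 + 2 ** 32),     # different file, same low 32 bits
]
print(find_hardlinks(entries))               # full-width comparison
print(find_hardlinks(entries, ino_bits=32))  # truncated comparison
```

With the truncated comparison, the unrelated file "c" is grouped with "a" and "b", so an archiver working this way would store it as a link instead of as its own data.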

2002-08-01 03:47:54

by Jan Harkes

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote:
> On Wed, 31 Jul 2002, Jan Harkes wrote:
> > On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> > > I've just been told that some "limitations" of the following kind will
> > > remain:
> > > page index = unsigned long
> > > ino_t = unsigned long
> >
> > The number of files is not limited by ino_t, just look at the
> > iget5_locked operation in fs/inode.c. It is possible to have your own
> > n-bit file identifier, and simply provide your own comparison function.
> > The ino_t then becomes the 'hash-bucket' in which the actual inode is
> > looked up.
>
> You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> and friends will break in all sorts of amusing ways. And there's
> nothing kernel can do about that - applications expect 32bit st_ino
> (compare them as 32bit values, etc.)

Which is why "tar and friends" are to different extents already broken
on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
(i.e. anything that currently uses iget5_locked instead of iget to grab
the inode).

Jan

2002-08-01 11:58:49

by Mark Mielke

Subject: Re: BIG files & file systems

On Wed, Jul 31, 2002 at 11:51:19PM -0400, Jan Harkes wrote:
> On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote:
> > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> > and friends will break in all sorts of amusing ways. And there's
> > nothing kernel can do about that - applications expect 32bit st_ino
> > (compare them as 32bit values, etc.)
> Which is why "tar and friends" are to different extents already broken
> on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
> (i.e. anything that currently uses iget5_locked instead of iget to grab
> the inode).

In theory? Maybe.

In practice, a lot more than just "tar and friends" assume that inodes
are unique...

mark (who recently, *continues* to write code that makes this assumption,
although, granted, most of the checks are 'file caching'-type
checks, and it isn't likely that a file will be the same size, the
same inode, the same device, and the same path...)

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-08-01 11:58:33

by David Woodhouse

Subject: Re: BIG files & file systems

[email protected] said:
> (you don't want to know who, besides I've no idea how many bits go in
> a trillion, but it's more than 32).

It all gets a little confusing after 'million'. Either you mean a US
'trillion', which is a European 'billion'; 10^12. Or you mean a European
'trillion', which is a US 'quintillion'; 10^18.

In general, it's best to stick to the numeric form if it's greater than
10^6. With the possible exception of using 'milliard' for 10^9, which may
cause the recipient to have to look up the word, but won't cause it to be
misinterpreted as 10^12 by non-usians.

http://216.239.33.100/search?q=cache:rwJFJLB7ZnoC:www.reportercentral.com/reference/vocabulary/numbernames.html

--
dwmw2

2002-08-01 20:32:30

by Andrew Morton

Subject: Re: BIG files & file systems

"Peter J. Braam" wrote:
>
> Hi,
>
> I've just been told that some "limitations" of the following kind will
> remain:
> page index = unsigned long
> ino_t = unsigned long
>
> Lustre has definitely been asked to support much larger files than
> 16TB. Also file systems with a trillion files have been requested by
> one of our supporters (you don't want to know who, besides I've no
> idea how many bits go in a trillion, but it's more than 32).
>
> I understand why people don't want to sprinkle the kernel with u64's,
> and arguably we can wait a year or two and use 64 bit architectures,
> so I'm probably not going to kick up a fuss about it.
>
> However, I thought I'd let you know that there are organizations that
> _really_ want to have such big files and file systems and get quite
> dismayed about "small integers". And we will fail to deliver on a
> requirement to write a 50TB file because of this.

I don't know about the ino_t thing, but as far as the pagecache
indices go it's simply a matter of

- s/unsigned long/pgoff_t/ in a zillion places
- modify the radix tree code a bit
- implement CONFIG_LL_PAGECACHE_INDEX
- make it all work
- convince Linus

Linus's objections are threefold: it expands struct page, 64 bit
arith is slow and gcc tends to get it wrong. And I would add "most
developers won't test 64-bit pgoff_t, and it'll get broken regularly".

The expansion of struct page and the performance impact is just a
cost which you'll have to balance against the benefits. For a few
people, 32-bit pagecache index is a showstopper and they'll accept that
tradeoff.

Sprinkling `pgoff_t' everywhere is, IMO, not a bad thing - it aids code
readability because it tells you what the variable is used for.

As for broken gcc, well, the proponents of 64-bit pgoff_t would have
to work to identify the correct gcc version and generally get gcc
doing the right thing.
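
To put a rough number on the "expands struct page" cost, one can estimate the extra memory if each per-page descriptor grows by a fixed amount. The sketch assumes 4 KB pages and 4 additional bytes per page (a 32-bit index widened to 64 bits, ignoring padding and alignment); both figures are assumptions, not values taken from the kernel source.

```python
def extra_page_struct_bytes(ram_bytes, page_size=4096, extra_per_page=4):
    """Additional memory used if every per-page descriptor grows by
    `extra_per_page` bytes (padding and alignment ignored)."""
    return (ram_bytes // page_size) * extra_per_page


# Overhead in MiB for a machine with 4 GiB of RAM.
print(extra_page_struct_bytes(4 * 2 ** 30) // 2 ** 20)
```

The memory cost scales linearly with RAM size; the slower 64-bit arithmetic on 32-bit CPUs is a separate cost that this estimate does not capture.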

2002-08-02 00:07:12

by Steve Lord

Subject: Re: BIG files & file systems

On Wed, 2002-07-31 at 22:51, Jan Harkes wrote:
> On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote:
> > On Wed, 31 Jul 2002, Jan Harkes wrote:
> > > On Wed, Jul 31, 2002 at 01:16:20PM -0600, Peter J. Braam wrote:
> > > > I've just been told that some "limitations" of the following kind will
> > > > remain:
> > > > page index = unsigned long
> > > > ino_t = unsigned long
> > >
> > > The number of files is not limited by ino_t, just look at the
> > > iget5_locked operation in fs/inode.c. It is possible to have your own
> > > n-bit file identifier, and simply provide your own comparison function.
> > > The ino_t then becomes the 'hash-bucket' in which the actual inode is
> > > looked up.
> >
> > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> > and friends will break in all sorts of amusing ways. And there's
> > nothing kernel can do about that - applications expect 32bit st_ino
> > (compare them as 32bit values, etc.)
>
> Which is why "tar and friends" are to different extents already broken
> on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
> (i.e. anything that currently uses iget5_locked instead of iget to grab
> the inode).

Why are they broken? In the case of XFS at least you still get a unique
and stable inode number back - and it fits in 32 bits too.

Steve

2002-08-02 12:13:34

by Chris Mason

Subject: Re: BIG files & file systems

On Thu, 2002-08-01 at 20:09, Stephen Lord wrote:

> > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> > > and friends will break in all sorts of amusing ways. And there's
> > > nothing kernel can do about that - applications expect 32bit st_ino
> > > (compare them as 32bit values, etc.)
> >
> > Which is why "tar and friends" are to different extents already broken
> > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
> > (i.e. anything that currently uses iget5_locked instead of iget to grab
> > the inode).
>
> Why are they broken? In the case of XFS at least you still get a unique
> and stable inode number back - and it fits in 32 bits too.

reiserfs is not broken here. It has unique stable 32 bit inode numbers,
but looking up the file on disk requires 64 bits of information.

-chris

2002-08-02 12:28:11

by Anton Altaparmakov

Subject: Re: BIG files & file systems

At 13:17 02/08/02, Chris Mason wrote:
>On Thu, 2002-08-01 at 20:09, Stephen Lord wrote:
>
> > > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> > > > and friends will break in all sorts of amusing ways. And there's
> > > > nothing kernel can do about that - applications expect 32bit st_ino
> > > > (compare them as 32bit values, etc.)
> > >
> > > Which is why "tar and friends" are to different extents already broken
> > > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
> > > (i.e. anything that currently uses iget5_locked instead of iget to grab
> > > the inode).
> >
> > Why are they broken? In the case of XFS at least you still get a unique
> > and stable inode number back - and it fits in 32 bits too.
>
>reiserfs is not broken here. It has unique stable 32 bit inode numbers,
>but looking up the file on disk requires 64 bits of information.

ntfs is not broken here, either. It also uses unique stable 32 bit inode
numbers, but inside the driver (not visible to user space at all at
present), we use additional, fake inodes. But tar and friends will never
see those so there is no problem...

Anton

--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cantab.net> (replace at with @)
Linux NTFS Maintainer / IRC: #ntfs on irc.openprojects.net
WWW: http://linux-ntfs.sf.net/ & http://www-stu.christs.cam.ac.uk/~aia21/

2002-08-02 13:53:00

by Jan Harkes

Subject: Re: BIG files & file systems

On Thu, Aug 01, 2002 at 07:09:37PM -0500, Stephen Lord wrote:
> On Wed, 2002-07-31 at 22:51, Jan Harkes wrote:
> > On Wed, Jul 31, 2002 at 05:13:46PM -0400, Alexander Viro wrote:
> > > You _do_ need unique ->st_ino from stat(2), though - otherwise tar(1)
> > > and friends will break in all sorts of amusing ways. And there's
> > > nothing kernel can do about that - applications expect 32bit st_ino
> > > (compare them as 32bit values, etc.)
> >
> > Which is why "tar and friends" are to different extents already broken
> > on various filesystems like Coda, NFS, NTFS, ReiserFS, and probably XFS.
> > (i.e. anything that currently uses iget5_locked instead of iget to grab
> > the inode).
>
> Why are they broken? In the case of XFS at least you still get a unique
> and stable inode number back - and it fits in 32 bits too.

I was simply assuming that any filesystem that is using iget5 and
doesn't use the simpler iget helper has some reason why it cannot find
an inode given just the 32-bit ino_t.

This is definitely true for Coda, we have 96-bit file identifiers.
Actually my development tree currently uses 128-bit, it is aware of
multiple administrative realms and distinguishes between objects with
FID 0x7f000001.0x1.0x1 in different administrative domains. There is a
hash-function that tries to map these large FIDs into the 32-bit ino_t
space with as few collisions as possible.

NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but
seems to need access to the directory to find them. So I don't quickly
see how it would guarantee uniqueness. NTFS actually doesn't seem to use
iget5 yet, but it has multiple streams per object which would probably
end up using the same ino_t.

I haven't looked at XFS, but as all in-tree filesystems that use
iget5_locked have potential ino_t collisions, I was assuming XFS
would fit in the same category.

Userspace applications should have an option to ignore hardlinks.
Very large filesystems either don't care because there is plenty of
space, don't support them across boundaries that are not visible to the
application, or could be dealing with them automatically (COW
links). Besides, if I really have a trillion files, I don't want 'tar
and friends' to try to keep track of all those inode numbers (and device
numbers) in memory.

The other solution is that applications can actually use more of the
information from the inode to avoid confusion, like st_nlink and
st_mtime, which are useful when the filesystem is still mounted rw as
well. And to make it even better, st_uid, st_gid, st_size, st_blocks and
st_ctime, and a MD5/SHA checksum. Although this obviously would become
even worse for the trillion file backup case.

Jan
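
How often a hash of wide identifiers into a 32-bit ino_t collides can be estimated with the usual birthday approximation. The sketch assumes the hash spreads identifiers uniformly over 2^32 values; the actual Coda hash function is not shown in this thread, so this is only an idealised model.

```python
import math


def expected_colliding_pairs(n_files, hash_bits=32):
    """Expected number of file pairs sharing a hash value, assuming the
    hash spreads identifiers uniformly over 2**hash_bits values."""
    return n_files * (n_files - 1) / 2 / 2 ** hash_bits


def collision_probability(n_files, hash_bits=32):
    """Approximate probability that at least one pair collides."""
    return 1.0 - math.exp(-expected_colliding_pairs(n_files, hash_bits))


for n in (10_000, 100_000, 1_000_000):
    print(n, round(collision_probability(n), 4))
```

Under this model, at least one collision becomes more likely than not somewhere below a hundred thousand files, long before the 32-bit space is anywhere near full.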

2002-08-02 14:11:18

by Steve Lord

Subject: Re: BIG files & file systems

On Fri, 2002-08-02 at 08:56, Jan Harkes wrote:
>
> I was simply assuming that any filesystem that is using iget5 and
> doesn't use the simpler iget helper has some reason why it cannot find
> an inode given just the 32-bit ino_t.

In XFS's case (remember, the iget5 code is based on XFS changes) it is
more a matter of the code to read the inode sometimes needing to pass
other info down to the read_inode part of the filesystem, so we want to
do that internally. XFS can have 64 bit inode numbers, but you need more
than 1 Tbyte in an fs to get that big (inode numbers are a disk
address). We also have code which keeps them in the bottom 1 Tbyte
which is turned on by default on Linux.

>
> This is definitely true for Coda, we have 96-bit file identifiers.
> Actually my development tree currently uses 128-bit, it is aware of
> multiple administrative realms and distinguishes between objects with
> FID 0x7f000001.0x1.0x1 in different administrative domains. There is a
> hash-function that tries to map these large FIDs into the 32-bit ino_t
> space with as few collisions as possible.
>
> NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but
> seems to need access to the directory to find them. So I don't quickly
> see how it would guarantee uniqueness. NTFS actually doesn't seem to use
> iget5 yet, but it has multiple streams per object which would probably
> end up using the same ino_t.
>
> Userspace applications should have an option to ignore hardlinks.
> Very large filesystems either don't care because there is plenty of
> space, don't support them across boundaries that are not visible to the
> application, or could be dealing with them automatically (COW
> links). Besides, if I really have a trillion files, I don't want 'tar
> and friends' to try to keep track of all those inode numbers (and device
> numbers) in memory.
>
> The other solution is that applications can actually use more of the
> information from the inode to avoid confusion, like st_nlink and
> st_mtime, which are useful when the filesystem is still mounted rw as
> well. And to make it even better, st_uid, st_gid, st_size, st_blocks and
> st_ctime, and a MD5/SHA checksum. Although this obviously would become
> even worse for the trillion file backup case.

If apps would have to change then I would vote for allowing larger
inodes out of the kernel in an extended version of stat and getdents.
I was going to say 64 bit versions, but if even 64 is not enough for
you, it is getting a little hard to handle.

Steve

> Jan
--

Steve Lord voice: +1-651-683-3511
Principal Engineer, Filesystem Software email: [email protected]

2002-08-02 15:07:34

by Hans Reiser

Subject: Re: BIG files & file systems

There are a number of interfaces that need expansion in 2.5. Telldir
and seekdir would be much better if they took as argument some
filesystem specific opaque cookie (e.g. filename). Using a byte offset
to reference a directory entry that was found with a filename is an
implementation specific artifact that obviously only works for a
ufs/s5fs/ext2 type of filesystem, and is just wrong.

4 billion files is not enough to store the government's XML databases in.

Hans

Steve Lord wrote:

>On Fri, 2002-08-02 at 08:56, Jan Harkes wrote:
>
>
>>I was simply assuming that any filesystem that is using iget5 and
>>doesn't use the simpler iget helper has some reason why it cannot find
>>an inode given just the 32-bit ino_t.
>>
>>
>
>In XFS's case (remember, the iget5 code is based on XFS changes) it is
>more a matter of the code to read the inode sometimes needing to pass
>other info down to the read_inode part of the filesystem, so we want to
>do that internally. XFS can have 64 bit inode numbers, but you need more
>than 1 Tbyte in an fs to get that big (inode numbers are a disk
>address). We also have code which keeps them in the bottom 1 Tbyte
>which is turned on by default on Linux.
>
>
>
>>This is definitely true for Coda, we have 96-bit file identifiers.
>>Actually my development tree currently uses 128-bit, it is aware of
>>multiple administrative realms and distinguishes between objects with
>>FID 0x7f000001.0x1.0x1 in different administrative domains. There is a
>>hash-function that tries to map these large FIDs into the 32-bit ino_t
>>space with as few collisions as possible.
>>
>>NFS has a >32-bit filehandle. ReiserFS might have unique inodes, but
>>seems to need access to the directory to find them. So I don't quickly
>>see how it would guarantee uniqueness. NTFS actually doesn't seem to use
>>iget5 yet, but it has multiple streams per object which would probably
>>end up using the same ino_t.
>>
>>Userspace applications should have an option to ignore hardlinks.
>>Very large filesystems either don't care because there is plenty of
>>space, don't support them across boundaries that are not visible to the
>>application, or could be dealing with them automatically (COW
>>links). Besides, if I really have a trillion files, I don't want 'tar
>>and friends' to try to keep track of all those inode numbers (and device
>>numbers) in memory.
>>
>>The other solution is that applications can actually use more of the
>>information from the inode to avoid confusion, like st_nlink and
>>st_mtime, which are useful when the filesystem is still mounted rw as
>>well. And to make it even better, st_uid, st_gid, st_size, st_blocks and
>>st_ctime, and a MD5/SHA checksum. Although this obviously would become
>>even worse for the trillion file backup case.
>>
>>
>
>If apps would have to change then I would vote for allowing larger
>inodes out of the kernel in an extended version of stat and getdents.
>I was going to say 64 bit versions, but if even 64 is not enough for
>you, it is getting a little hard to handle.
>
>Steve
>
>
>
>>Jan
>>
>>

--
Hans

2002-08-02 15:36:45

by Trond Myklebust

Subject: Re: BIG files & file systems

>>>>> " " == Hans Reiser <[email protected]> writes:

> There are a number of interfaces that need expansion in 2.5.
> Telldir and seekdir would be much better if they took as
> argument some filesystem specific opaque cookie
> (e.g. filename). Using a byte offset to reference a directory
> entry that was found with a filename is an implementation
> specific artifact that obviously only works for a ufs/s5fs/ext2
> type of filesystem, and is just wrong.

> 4 billion files is not enough to store the government's XML
> databases in.

That's more of a glibc-specific bug. Most other libc implementations
appear to be quite capable of providing a userspace 'readdir()' which
doesn't ever use the lseek() syscall.

Note however that NFS compatibility *does* provide a limitation here:
the cookies that are passed between client and server are limited to
32 bits (NFSv2) or 64 bits (NFSv3/v4), so you'll be wanting to provide
some hack to get around this...

Cheers,
Trond

2002-08-02 16:58:38

by Hans Reiser

Subject: Re: BIG files & file systems

Trond Myklebust wrote:

> > 4 billion files is not enough to store the government's XML
> > databases in.
>
>That's more of a glibc-specific bug. Most other libc implementations
>appear to be quite capable of providing a userspace 'readdir()' which
>doesn't ever use the lseek() syscall.
>
Interesting. Thanks for the info.

--
Hans

2002-08-02 17:21:43

by Nikita Danilov

Subject: Re: BIG files & file systems

Hans Reiser writes:
> Trond Myklebust wrote:
>
> > > 4 billion files is not enough to store the government's XML
> > > databases in.
> >
> >That's more of a glibc-specific bug. Most other libc implementations
> >appear to be quite capable of providing a userspace 'readdir()' which
> >doesn't ever use the lseek() syscall.
> >
> Interesting. Thanks for the info.

But there still is a problem with applications (if any) calling
seekdir/telldir directly...

>
> --
> Hans
>

Nikita.

>

2002-08-02 17:23:03

by Albert D. Cahalan

Subject: Re: BIG files & file systems

Matti Aarnio writes:

> It depends on many things:
> - Block layer (unsigned long)
> - Page indexes (unsigned long)
> - Filesystem format dependent limits
> - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
> is supported only up to 4 kB block sizes, that gives
> you a very hard limit.. of 16 terabytes (16 * "10^12")

You first hit the triple-indirection limit at 4 TB.
http://www.cs.uml.edu/~acahalan/linux/ext2.gif

> - ReiserFS: u32_t block indexes presently, u64_t in future;
> block size ranges ? Max size is limited by the
> maximum supported file size, likely 2^63, which is
> roughly 8 * "10^18", or circa 500 000 times larger
> than EXT2/EXT3 format maximum.

The top 4 st_size bits get stolen, so it's 60-bit sizes.
You also get the 32-bit block limit at 16 TB.
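
The 4 TB figure follows from the classic ext2 block map: 12 direct pointers plus one single-, one double- and one triple-indirect block, with 4-byte block pointers. A small sketch of that calculation, with a function name chosen only for this example:

```python
def ext2_max_file_blocks(block_size, pointer_size=4, direct=12):
    """Blocks reachable through the classic ext2 inode layout: `direct`
    direct pointers plus one single-, one double- and one
    triple-indirect block."""
    per_block = block_size // pointer_size      # pointers per indirect block
    return direct + per_block + per_block ** 2 + per_block ** 3


block_size = 4096
max_bytes = ext2_max_file_blocks(block_size) * block_size
print(max_bytes / 2 ** 40)      # file size limit in units of 2^40 bytes
```

With 4 kB blocks the result is a little over 4 * 2^40 bytes, so a single file runs out of block-map reach well before the filesystem hits the 32-bit block limit at 16 TB.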

2002-08-02 17:44:04

by Trond Myklebust

Subject: Re: BIG files & file systems

>>>>> " " == Nikita Danilov <[email protected]> writes:

> But there still is a problem with applications (if any) calling
> seekdir/telldir directly...

Agreed. Note however that the semantics for seekdir/telldir as
specified by SUSv2 are much weaker than those in our current
getdents()+lseek().

The Open Group documentation for seekdir states:

On systems that conform to the Single UNIX Specification, Version 2,
a subsequent call to readdir() may not be at the desired position if
the value of loc was not obtained from an earlier call to telldir(),
or if a call to rewinddir() occurred between the call to telldir()
and the call to seekdir().

IOW assigning a unique offset to each and every entry in the directory
is overkill (unless the user is calling telldir() for all those
entries).

Cheers,
Trond

2002-08-02 18:07:07

by Nikita Danilov

Subject: Re: BIG files & file systems

Trond Myklebust writes:
> >>>>> " " == Nikita Danilov <[email protected]> writes:
>
> > But there still is a problem with applications (if any) calling
> > seekdir/telldir directly...
>
> Agreed. Note however that the semantics for seekdir/telldir as
> specified by SUSv2 are much weaker than those in our current
> getdents()+lseek().
>
> >From the Opengroup documentation for seekdir, it states that:
>
> On systems that conform to the Single UNIX Specification, Version 2,
> a subsequent call to readdir() may not be at the desired position if
> the value of loc was not obtained from an earlier call to telldir(),
> or if a call to rewinddir() occurred between the call to telldir()
> and the call to seekdir().
>
> IOW assigning a unique offset to each and every entry in the directory
> is overkill (unless the user is calling telldir() for all those
> entries).

Are you implying some kind of ->telldir() file operation that notifies
the file-system that the user intends to later restart readdir from the
"current" position, and changing glibc to call sys_telldir/sys_seekdir
instead of lseek? This would allow file-systems like reiser4, which
cannot restart readdir from 32 bits' worth of data, to at least allocate
something in the kernel on the call to ->telldir() and free it in
->release().

>
> Cheers,
> Trond

Nikita.
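
The idea floated here (hand out a small cookie on telldir, keep the real position on the filesystem side, and free everything on release) can be modelled in a few lines. This is a user-space toy, not a proposed kernel interface: the DirHandle class and its method names are hypothetical, and the entry name stands in for whatever wide key a filesystem would really need to remember.

```python
class DirHandle:
    """Toy open-directory handle with telldir/seekdir-style cookies.

    Positions are opaque to the caller; here the "real" position is the
    name of the next entry, standing in for a filesystem key that does
    not fit in 32 bits.
    """

    def __init__(self, names):
        self._names = sorted(names)
        self._next = 0                 # index of the next entry to return
        self._cookies = {}             # small integer -> saved entry name
        self._last_cookie = 0

    def readdir(self):
        """Return the next entry name, or None at end of directory."""
        if self._next >= len(self._names):
            return None
        name = self._names[self._next]
        self._next += 1
        return name

    def telldir(self):
        """Remember the current position and hand back a small cookie."""
        self._last_cookie += 1
        at_end = self._next >= len(self._names)
        self._cookies[self._last_cookie] = None if at_end else self._names[self._next]
        return self._last_cookie

    def seekdir(self, cookie):
        """Restart iteration at a position saved by telldir()."""
        name = self._cookies[cookie]   # KeyError for cookies we never issued
        self._next = len(self._names) if name is None else self._names.index(name)

    def release(self):
        """Drop all saved positions, as on the final close."""
        self._cookies.clear()
```

Only positions that were actually requested through telldir() cost any memory, which fits the weaker SUSv2 semantics quoted above: a cookie is meaningful only if it came from an earlier telldir() on the same open directory.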

2002-08-02 18:29:26

by Hans Reiser

Subject: Re: BIG files & file systems

Nikita Danilov wrote:

>Trond Myklebust writes:
> > >>>>> " " == Nikita Danilov <[email protected]> writes:
> >
> > > But there still is a problem with applications (if any) calling
> > > seekdir/telldir directly...
> >
> > Agreed. Note however that the semantics for seekdir/telldir as
> > specified by SUSv2 are much weaker than those in our current
> > getdents()+lseek().
> >
> > >From the Opengroup documentation for seekdir, it states that:
> >
> > On systems that conform to the Single UNIX Specification, Version 2,
> > a subsequent call to readdir() may not be at the desired position if
> > the value of loc was not obtained from an earlier call to telldir(),
> > or if a call to rewinddir() occurred between the call to telldir()
> > and the call to seekdir().
> >
> > IOW assigning a unique offset to each and every entry in the directory
> > is overkill (unless the user is calling telldir() for all those
> > entries).
>
Forgive the really dumb question, but does this mean we can just store
the last entry returned to readdir in the directory metadata, and
completely ignore the value of loc?

>
>Are you implying some kind of ->telldir() file operation that notifies
>file-system that user has intention to later restart readdir from the
>"current" position and changing glibc to call sys_telldir/sys_seekdir in
>stead of lseek? This will allow file-systems like reiser4 that cannot
>restart readdir from 32bitsful of data to, at least, allocate something
>in kernel on call to ->telldir() and free in ->release().
>
> >
> > Cheers,
> > Trond
>
>Nikita.
>
>
>
>

--
Hans

2002-08-02 18:44:42

by Nikita Danilov

Subject: Re: BIG files & file systems

Hans Reiser writes:
> Nikita Danilov wrote:
>
> >Trond Myklebust writes:
> > > >>>>> " " == Nikita Danilov <[email protected]> writes:
> > >
> > > > But there still is a problem with applications (if any) calling
> > > > seekdir/telldir directly...
> > >
> > > Agreed. Note however that the semantics for seekdir/telldir as
> > > specified by SUSv2 are much weaker than those in our current
> > > getdents()+lseek().
> > >
> > > >From the Opengroup documentation for seekdir, it states that:
> > >
> > > On systems that conform to the Single UNIX Specification, Version 2,
> > > a subsequent call to readdir() may not be at the desired position if
> > > the value of loc was not obtained from an earlier call to telldir(),
> > > or if a call to rewinddir() occurred between the call to telldir()
> > > and the call to seekdir().
> > >
> > > IOW assigning a unique offset to each and every entry in the directory
> > > is overkill (unless the user is calling telldir() for all those
> > > entries).
> >
> Forgive the really dumb question, but does this mean we can just store
> the last entry returned to readdir in the directory metadata, and
> completely ignore the value of loc?

If the application is using readdir, then yes: glibc internally maps
readdir onto getdents plus at most one lseek on the directory for
"adjustment" purposes (if I remember correctly, the problem is that the
kernel's struct dirent has an extra field, so glibc cannot tell in
advance how many entries will fit into the supplied user buffer).

But if the application uses seekdir(3)/telldir(3) directly, then no.

>
> >
> >Are you implying some kind of ->telldir() file operation that notifies
> >file-system that user has intention to later restart readdir from the
> >"current" position and changing glibc to call sys_telldir/sys_seekdir in
> >stead of lseek? This will allow file-systems like reiser4 that cannot
> >restart readdir from 32bitsful of data to, at least, allocate something
> >in kernel on call to ->telldir() and free in ->release().
> >
> > >
> > > Cheers,
> > > Trond
> >

Nikita.

> >
> >
> >
> >
>
>
> --
> Hans
>
>
>

2002-08-02 18:57:02

by Hans Reiser

Subject: Re: BIG files & file systems

Nikita Danilov wrote:

>Hans Reiser writes:
> > Nikita Danilov wrote:
> >
> > >Trond Myklebust writes:
> > > > >>>>> " " == Nikita Danilov <[email protected]> writes:
> > > >
> > > > > But there still is a problem with applications (if any) calling
> > > > > seekdir/telldir directly...
> > > >
> > > > Agreed. Note however that the semantics for seekdir/telldir as
> > > > specified by SUSv2 are much weaker than those in our current
> > > > getdents()+lseek().
> > > >
> > > > >From the Opengroup documentation for seekdir, it states that:
> > > >
> > > > On systems that conform to the Single UNIX Specification, Version 2,
> > > > a subsequent call to readdir() may not be at the desired position if
> > > > the value of loc was not obtained from an earlier call to telldir(),
> > > > or if a call to rewinddir() occurred between the call to telldir()
> > > > and the call to seekdir().
> > > >
> > > > IOW assigning a unique offset to each and every entry in the directory
> > > > is overkill (unless the user is calling telldir() for all those
> > > > entries).
> > >
> > Forgive the really dumb question, but does this mean we can just store
> > the last entry returned to readdir in the directory metadata, and
> > completely ignore the value of loc?
>
>If application is using readdir, then yes: glibc internally maps readdir
>into getdents plus at most one lseek on directory for "adjustment"
>purposes (if I remember correctly, problem is that kernel struct dirent
>has extra field and glibc cannot tell in advance how many of them will
>fit into supplied user buffer).
>
>But if application uses seekdir(3)/telldir(3) directly---then no.
>
It sounds like we could store the loc to key mapping in the file handle
(a (partial) key is what reiser4 needs to find a directory entry). I am
trying to understand if we need to store more than one loc to key
mapping in the file handle, or if one is enough. What do people use
telldir()/seekdir() for in practice?
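
A minimal sketch of the "loc to key mapping in the file handle" idea, written in Python rather than kernel C. Everything in it (the DirHandle class, its method names, the use of strings as keys) is hypothetical and only models the bookkeeping; it is not reiser4 or glibc code. It assumes, following the SUSv2 wording quoted above, that seekdir() only has to honour values previously returned by telldir() on the same open directory, so the handle needs one table entry per telldir() call rather than one per directory entry.

```python
class DirHandle:
    """Toy model of an open directory whose entries are located by key."""

    def __init__(self, keys):
        self._keys = sorted(keys)   # directory entries, ordered by key
        self._pos = 0               # index of the next entry readdir() returns
        self._cookies = {}          # loc (small int) -> key, filled by telldir()
        self._next_loc = 1

    def readdir(self):
        """Return the next key, or None at the end of the directory."""
        if self._pos >= len(self._keys):
            return None
        key = self._keys[self._pos]
        self._pos += 1
        return key

    def telldir(self):
        """Hand out a small cookie and remember which key it stands for."""
        loc = self._next_loc
        self._next_loc += 1
        at_end = self._pos >= len(self._keys)
        self._cookies[loc] = None if at_end else self._keys[self._pos]
        return loc

    def seekdir(self, loc):
        """Restart from a cookie; only values handed out by telldir() work."""
        key = self._cookies[loc]    # KeyError stands in for "undefined"
        self._pos = len(self._keys) if key is None else self._keys.index(key)

    def release(self):
        """Drop the mapping when the handle is closed."""
        self._cookies.clear()


d = DirHandle(["a", "b", "c"])
d.readdir()            # consumes "a"
loc = d.telldir()      # cookie for the position just before "b"
d.readdir()
d.readdir()
d.seekdir(loc)
print(d.readdir())     # resumes at "b"
d.release()
```

In this model the number of stored mappings grows with the number of telldir() calls, which is exactly the open question: one slot is enough only if applications never hold more than one saved position at a time. The toy also ignores entries being deleted between telldir() and seekdir().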

>
> >
> > >
> > >Are you implying some kind of ->telldir() file operation that notifies
> > >file-system that user has intention to later restart readdir from the
> > >"current" position and changing glibc to call sys_telldir/sys_seekdir in
> > >stead of lseek? This will allow file-systems like reiser4 that cannot
> > >restart readdir from 32bitsful of data to, at least, allocate something
> > >in kernel on call to ->telldir() and free in ->release().
> > >
> > > >
> > > > Cheers,
> > > > Trond
> > >
>
>Nikita.
>
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Hans
> >
> >
> >
>
>
>
>

--
Hans

2002-08-02 22:12:52

by Randy.Dunlap

Subject: Re: BIG files & file systems

On Fri, 2 Aug 2002, Albert D. Cahalan wrote:

| Matti Aarnio writes:
|
| > It depends on many things:
| > - Block layer (unsigned long)
| > - Page indexes (unsigned long)
| > - Filesystem format dependent limits
| > - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
| > is supported only up to 4 kB block sizes, that gives
| > you a very hard limit.. of 16 terabytes (16 * "10^12")
|
| You first hit the triple-indirection limit at 4 TB.
| http://www.cs.uml.edu/~acahalan/linux/ext2.gif
|
| > - ReiserFS: u32_t block indexes presently, u64_t in future;
| > block size ranges ? Max size is limited by the
| > maximum supported file size, likely 2^63, which is
| > roughly 8 * "10^18", or circa 500 000 times larger
| > than EXT2/EXT3 format maximum.
|
| The top 4 st_size bits get stolen, so it's 60-bit sizes.
| You also get the 32-bit block limit at 16 TB.
| -

For a LinuxWorld presentation in August, I have asked each of the
4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
filesystem/filesize limits are. Here's what they have told me.

                      ext3fs    reiserfs   JFS      XFS
max filesize:         16 TB#    1 EB       4 PB$    8 TB%
max filesystem size:  2 TB      17.6 TB*   4 PB$    2 TB!

Notes:
#: think sparse files
*: 4 KB blocks
$: 16 TB on 32-bit architectures
%: 4 KB pages
!: block device limit

--
~Randy

2002-08-03 03:22:51

by Albert D. Cahalan

Subject: Re: BIG files & file systems

Randy.Dunlap writes:
> On Fri, 2 Aug 2002, Albert D. Cahalan wrote:
>> Matti Aarnio writes:

>>> - Filesystem format dependent limits
>>> - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
>>> is supported only up to 4 kB block sizes, that gives
>>> you a very hard limit.. of 16 terabytes (16 * "10^12")
>>
>> You first hit the triple-indirection limit at 4 TB.
>> http://www.cs.uml.edu/~acahalan/linux/ext2.gif
>>
>>> - ReiserFS: u32_t block indexes presently, u64_t in future;
>>> block size ranges ? Max size is limited by the
>>> maximum supported file size, likely 2^63, which is
>>> roughly 8 * "10^18", or circa 500 000 times larger
>>> than EXT2/EXT3 format maximum.
>>
>> The top 4 st_size bits get stolen, so it's 60-bit sizes.
>> You also get the 32-bit block limit at 16 TB.
>
> For a LinuxWorld presentation in August, I have asked each of the
> 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> filesystem/filesize limits are. Here's what they have told me.
>
> ext3fs reiserfs JFS XFS
> max filesize: 16 TB# 1 EB 4 PB$ 8 TB%
> max filesystem size: 2 TB 17.6 TB* 4 PB$ 2 TB!
>
> Notes:
> #: think sparse files
> *: 4 KB blocks
> $: 16 TB on 32-bit architectures
> %: 4 KB pages
> !: block device limit

Please fix that before you give your presentation.
Sparse files won't save you from the triple-indirection limit.
This has me suspicious of the other numbers as well.

Ext2 gives you 0xc blocks addressed right off the inode.
Then with one 4 kB block of block pointers, you can get
to another 0x400 (1024) blocks. With a block of pointers to
blocks of pointers, you may address another 0x100000 blocks.
Finally, triple indirection gives you a block of pointers
to blocks of pointers to blocks of pointers, for another
0x40000000 data blocks. That's a total of:

0x4010040c blocks
0x4010040c000 bytes
4.4e12 bytes and change
4402 GB (decimal gigabytes)
4.4 TB (decimal terabytes)

Of course you can't really use 4.4 TB on 32-bit Linux,
so there is a sort of dishonesty in making this claim.
I can get to 2.2 TB, which disturbingly would wrap any
code using signed 32-bit math on units of 512 bytes.
The exact limits are:

0x000001ffffffefff max offset
0x000001fffffff000 max size
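
The arithmetic above can be checked with a short script. This is only a sketch in Python: the function name and its parameters are made up for illustration, and it assumes 4-byte block pointers and 12 direct blocks, as in the description above.

```python
def ext2_max_file_blocks(block_size, ptr_size=4, direct=12):
    """Data blocks reachable via direct, single, double and triple indirection."""
    ptrs = block_size // ptr_size          # block pointers per indirect block
    return direct + ptrs + ptrs**2 + ptrs**3


BLOCK_SIZE = 4096
blocks = ext2_max_file_blocks(BLOCK_SIZE)
size = blocks * BLOCK_SIZE

print(hex(blocks))             # 0x4010040c
print(hex(size))               # 0x4010040c000
print(round(size / 1e12, 1))   # 4.4 (decimal TB)

# The quoted kernel maximum size is 2^41 bytes minus one 4096-byte block,
# where 2^41 bytes is 2^32 units of 512 bytes.
KERNEL_MAX_SIZE = 0x000001fffffff000
print(KERNEL_MAX_SIZE == 2**32 * 512 - BLOCK_SIZE)   # True
print(round(KERNEL_MAX_SIZE / 1e12, 1))              # 2.2 (decimal TB)
```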

2002-08-05 13:03:30

by Steve Lord

Subject: Re: BIG files & file systems

On Fri, 2002-08-02 at 17:14, Randy.Dunlap wrote:
> On Fri, 2 Aug 2002, Albert D. Cahalan wrote:
>
> | Matti Aarnio writes:
> |
> | > It depends on many things:
> | > - Block layer (unsigned long)
> | > - Page indexes (unsigned long)
> | > - Filesystem format dependent limits
> | > - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
> | > is supported only up to 4 kB block sizes, that gives
> | > you a very hard limit.. of 16 terabytes (16 * "10^12")
> |
> | You first hit the triple-indirection limit at 4 TB.
> | http://www.cs.uml.edu/~acahalan/linux/ext2.gif
> |
> | > - ReiserFS: u32_t block indexes presently, u64_t in future;
> | > block size ranges ? Max size is limited by the
> | > maximum supported file size, likely 2^63, which is
> | > roughly 8 * "10^18", or circa 500 000 times larger
> | > than EXT2/EXT3 format maximum.
> |
> | The top 4 st_size bits get stolen, so it's 60-bit sizes.
> | You also get the 32-bit block limit at 16 TB.
> | -
>
> For a LinuxWorld presentation in August, I have asked each of the
> 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> filesystem/filesize limits are. Here's what they have told me.
>
> ext3fs reiserfs JFS XFS
> max filesize: 16 TB# 1 EB 4 PB$ 8 TB%
> max filesystem size: 2 TB 17.6 TB* 4 PB$ 2 TB!
>
> Notes:
> #: think sparse files
> *: 4 KB blocks
> $: 16 TB on 32-bit architectures
> %: 4 KB pages
> !: block device limit

Randy,

If those are the numbers you are presenting then make it clear that
for XFS those are the limits imposed by the Linux kernel. The
core of XFS itself can support files and filesystems of 9 Exabytes.
I do not think all the filesystems are reporting their numbers in
the same way.

Steve

2002-08-05 13:39:20

by Hans Reiser

Subject: Re: BIG files & file systems

Stephen Lord wrote:

>
>
>>For a LinuxWorld presentation in August, I have asked each of the
>>4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
>>filesystem/filesize limits are. Here's what they have told me.
>>
>> ext3fs reiserfs JFS XFS
>>max filesize: 16 TB# 1 EB 4 PB$ 8 TB%
>>max filesystem size: 2 TB 17.6 TB* 4 PB$ 2 TB!
>>
>>Notes:
>>#: think sparse files
>>*: 4 KB blocks
>>$: 16 TB on 32-bit architectures
>>%: 4 KB pages
>>!: block device limit
>>
>>
>
>Randy,
>
>If those are the numbers you are presenting then make it clear that
>for XFS those are the limits imposed by the the Linux kernel. The
>core of XFS itself can support files and filesystems of 9 Exabytes.
>I do not think all the filesystems are reporting their numbers in
>the same way.
>
>Steve
>
>
>
>
You might also mention that I think the limits imposed by Linux are the
only meaningful ones, as we would change our limits as soon as Linux
did, and it was Linux that selected our limits for us. We would have
changed already if Linux didn't make it pointless to change it on Intel.
Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
64 bit CPUs are widely deployed, and I am simply guessing this will be
not very far into reiser4's lifecycle. Really, the couple of #defines
that constitute these size limits, plus some surrounding code, are not
such a big thing to change (except that it constitutes a disk format
change).

--
Hans

2002-08-05 13:57:25

by Randy.Dunlap

Subject: Re: BIG files & file systems

On Mon, 5 Aug 2002, Hans Reiser wrote:

| Stephen Lord wrote:
| >
| >>For a LinuxWorld presentation in August, I have asked each of the
| >>4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
| >>filesystem/filesize limits are. Here's what they have told me.
| >>
| >>                      ext3fs    reiserfs   JFS      XFS=
| >>max filesize:         16 TB#    1 EB       4 PB$    8 TB%
| >>max filesystem size:  2 TB      17.6 TB*   4 PB$    2 TB!
| >>
| >>Notes:
| >>#: think sparse files
| >>*: 4 KB blocks
| >>$: 16 TB on 32-bit architectures
| >>%: 4 KB pages
| >>!: block device limit
=: all limits are kernel limits (probably true for JFS and reiser
also)

Albert, your graph shows that the triple-indirect limit is
at 8 EB, right?

| >Randy,
| >
| >If those are the numbers you are presenting then make it clear that
| >for XFS those are the limits imposed by the the Linux kernel. The
| >core of XFS itself can support files and filesystems of 9 Exabytes.
| >I do not think all the filesystems are reporting their numbers in
| >the same way.
| >
| >Steve

Yes, that info was missing from this text-mode info, but it's
already on the slide. I will be sure to make it more obvious,
and to make the numbers more consistent.

| You might also mention that I think the limits imposed by Linux are the
| only meaningful ones, as we would change our limits as soon as Linux
| did, and it was Linux that selected our limits for us. We would have
| changed already if Linux didn't make it pointless to change it on Intel.
| Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
| 64 bit CPUs are widely deployed, and I am simply guessing this will be
| not very far into reiser4's lifecycle. Really, the couple of #defines
| that constitute these size limits, plus some surrounding code, are not
| such a big thing to change (except that it constitutes a disk format
| change).

Right. I'll make the point in general that Linux internals are the
reasons for many of these limits.

Thanks,
--
~Randy

2002-08-05 14:19:45

by Randy.Dunlap

Subject: Re: BIG files & file systems

On Mon, 5 Aug 2002, Randy.Dunlap wrote:

| Albert, your graph shows that the triple-indirect limit is
| at 8 EB, right?

Yes, but your text (email) explanation puts it at around
4.4 TB. Got it.

Thanks.
--
~Randy

2002-08-05 17:27:46

by Albert D. Cahalan

Subject: Re: BIG files & file systems

Randy.Dunlap writes:
> On Mon, 5 Aug 2002, Randy.Dunlap wrote:

>> Albert, your graph shows that the triple-indirect limit is
>> at 8 EB, right?

No, that's the API limit. We use signed 64-bit byte
offsets in our API. (it's just under 8 EiB, which
is about 9.2 EB)

I do see one flaw on my graph. That horizontal line
at 1 TiB ought to be at 2 TiB apparently. It's for
the kernel limit, perhaps only on 32-bit hardware.
This changes the limit with 4096-byte blocks from
1 TiB to 2 TiB, so the filesystem's 4.4 TB is still
out of reach.

> Yes, but your text (email) explanation puts it at around
> 4.4 TB. Got it.

If we had quadruple indirection, then we'd hit a 17.6 TB
limit (16 TiB) due to the 32-bit block numbers. With an
8192-byte block size, we'd hit the block number limit
at 35 TB (32 TiB) before hitting the triple-indirection
limit. Of course none of this gets you past the kernel
limit at around 2.2 TB.

I believe we allow 8192-byte blocks on the Alpha.
You might want to look into that. IA-64 maybe too.
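
A rough way to see which limit bites first for a given block size is sketched below in Python. The function names are invented for this illustration, and the model assumes 4-byte block pointers, 12 direct blocks and 32-bit block numbers, as discussed above; it deliberately leaves out the kernel limit at around 2.2 TB.

```python
def block_number_limit(block_size, bits=32):
    """Bytes addressable with `bits`-bit block numbers."""
    return 2**bits * block_size


def triple_indirect_limit(block_size, ptr_size=4, direct=12):
    """Bytes reachable through direct up to triple-indirect pointers."""
    ptrs = block_size // ptr_size
    return (direct + ptrs + ptrs**2 + ptrs**3) * block_size


def first_limit(block_size):
    """Name and size in bytes of whichever of the two limits is hit first."""
    candidates = {
        "triple indirection": triple_indirect_limit(block_size),
        "32-bit block numbers": block_number_limit(block_size),
    }
    name = min(candidates, key=candidates.get)
    return name, candidates[name]


for bs in (4096, 8192):
    name, size = first_limit(bs)
    print(f"{bs}-byte blocks: {name} at {size / 1e12:.1f} TB")

# Largest signed 64-bit byte offset, i.e. the API limit mentioned above.
API_LIMIT = 2**63 - 1
print(f"API limit: {API_LIMIT / 1e18:.1f} EB")
```

Under these assumptions, 4096-byte blocks run into triple indirection first, while 8192-byte blocks run into the 32-bit block number limit first.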

2002-08-06 00:13:15

by jw schultz

Subject: Re: BIG files & file systems

On Mon, Aug 05, 2002 at 05:42:18PM +0400, Hans Reiser wrote:
> You might also mention that I think the limits imposed by Linux are the
> only meaningful ones, as we would change our limits as soon as Linux
> did, and it was Linux that selected our limits for us. We would have
> changed already if Linux didn't make it pointless to change it on Intel.
> Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
> 64 bit CPUs are widely deployed, and I am simply guessing this will be
> not very far into reiser4's lifecycle. Really, the couple of #defines
> that constitute these size limits, plus some surrounding code, are not
> such a big thing to change (except that it constitutes a disk format
> change).

Hans,

My recollection is that reiser4 isn't released yet. Why not
set the reiser4 disk format with 64-bit block numbers from
the start? 32-bit archs could write zeros to, and otherwise
ignore, the upper 32 bits, and refuse to mount if the filesystem
size would cause overflow. That way you avoid an on-disk format
change mid-cycle. That seems a lot less overhead than
coping with different datatypes.

Of course, if you'd rather support another on-disk format
to squeeze a bit more data onto small drives, I can understand.
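
A sketch of the mount-time check suggested above, in Python for brevity. The field layout (a single little-endian 64-bit block count), the function names and the kernel_block_bits parameter are all assumptions made for this example; none of them come from the actual reiser4 format.

```python
import struct


def pack_block_count(count):
    """Store the block count as a little-endian 64-bit field (hypothetical layout)."""
    return struct.pack("<Q", count)


def mount_check(raw, kernel_block_bits):
    """Return the block count, or refuse if block numbers would overflow."""
    (count,) = struct.unpack("<Q", raw)
    # Block numbers run from 0 to count - 1, so count may be at most 2^bits.
    if count > 2**kernel_block_bits:
        raise OSError("filesystem too large for this kernel's block numbers")
    return count


small = pack_block_count(1_000_000)
print(small[4:] == b"\x00" * 4)      # upper 32 bits are zero on a small filesystem
print(mount_check(small, 32))        # mounts fine with 32-bit block numbers

huge = pack_block_count(2**40)
try:
    mount_check(huge, 32)
except OSError as err:
    print("refused:", err)           # a 32-bit kernel declines to mount
print(mount_check(huge, 64))         # a 64-bit kernel accepts the same image
```

The on-disk field is the same width on every architecture; only the check applied at mount time differs.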

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-08-06 07:21:28

by Albert D. Cahalan

Subject: Re: BIG files & file systems

Andreas Dilger writes:

> I would also have to add another footnote to this, if people start
> talking about limits on 64-bit and >4kB page size systems. ext2/3 can
> support multiple block sizes (limited by the hardware page size), and
> actually supporting larger block sizes has only been restricted for
> cross-platform compatibility reasons.

This looks pretty silly if you think about it. We support
both 8 kB UFS and 64 kB FAT16 already.

> Having 16kB block size would allow a maximum of 64TB for a single
> filesystem. The per-file limit would be over 256TB.

Um, yeah, 64 TB of data with 192 TB of holes!
I really don't think you should count a file
that won't fit on your filesystem.

It's one thing to say ext2 is ready for when
the block devices grow. It's another thing to
talk about files that can't possibly fit without
changing the filesystem layout.

> In reality, we will probably implement extent-based allocation for
> ext3 when we start getting into filesystems that large, which has been
> discussed among the ext2/ext3 developers already.

It's nice to have a simple filesystem. If you turn ext2/ext3
into an XFS/JFS competitor, then what is left? Just minix fs?

2002-08-06 06:49:41

by Andreas Dilger

Subject: Re: BIG files & file systems

On Aug 02, 2002 23:26 -0400, Albert D. Cahalan wrote:
> Randy.Dunlap writes:
> > On Fri, 2 Aug 2002, Albert D. Cahalan wrote:
> >> Matti Aarnio writes:
>
> >>> - Filesystem format dependent limits
> >>> - EXT2/EXT3: u32_t FILESYSTEM block index, presuming the EXT2/EXT3
> >>> is supported only up to 4 kB block sizes, that gives
> >>> you a very hard limit.. of 16 terabytes (16 * "10^12")
> >>
> >> You first hit the triple-indirection limit at 4 TB.
> >> http://www.cs.uml.edu/~acahalan/linux/ext2.gif
> >>
> >>> - ReiserFS: u32_t block indexes presently, u64_t in future;
> >>> block size ranges ? Max size is limited by the
> >>> maximum supported file size, likely 2^63, which is
> >>> roughly 8 * "10^18", or circa 500 000 times larger
> >>> than EXT2/EXT3 format maximum.
> >>
> >> The top 4 st_size bits get stolen, so it's 60-bit sizes.
> >> You also get the 32-bit block limit at 16 TB.
> >
> > For a LinuxWorld presentation in August, I have asked each of the
> > 4 journaling filesystems (ext3, reiserfs, JFS, and XFS) what their
> > filesystem/filesize limits are. Here's what they have told me.
> >
> > ext3fs reiserfs JFS XFS
> > max filesize: 16 TB# 1 EB 4 PB$ 8 TB%
> > max filesystem size: 2 TB 17.6 TB* 4 PB$ 2 TB!

I think you need a "!" behind the 2TB limit for ext3 max filesystem
size. The actual filesystem limit for 4kB block size is 16TB*
(2^32 blocks). More on this below.

> > Notes:
> > #: think sparse files
> > *: 4 KB blocks
> > $: 16 TB on 32-bit architectures
> > %: 4 KB pages
> > !: block device limit
>
> Please fix that before you give your presentation.
> Sparse files won't save you from the triple-indirection limit.
> This has me suspicious of the other numbers as well.
>
> Ext2 gives you 0xc blocks addressed right off the inode.
> Then with one 4 kB block of block pointers, you can get
> to another 0x400 (1024) blocks. With a block of pointers to
> blocks of pointers, you may address another 0x100000 blocks.
> Finally, triple indirection gives you a block of pointers
> to blocks of pointers to blocks of pointers, for another
> 0x40000000 data blocks. That's a total of:
>
> 0x4010040c blocks
> 0x4010040c000 bytes
> 4.4e12 bytes and change
> 4402 GB (decimal gigabytes)
> 4.4 TB (decimal terabytes)
>
> Of course you can't really use 4.4 TB on 32-bit Linux,
> so there is a sort of dishonesty in making this claim.
> I can get to 2.2 TB, which disturbingly would wrap any
> code using signed 32-bit math on units of 512 bytes.
> The exact limits are:
>
> 0x000001ffffffefff max offset
> 0x000001fffffff000 max size

I would also have to add another footnote to this, if people start
talking about limits on 64-bit and >4kB page size systems. ext2/3 can
support multiple block sizes (limited by the hardware page size), and
actually supporting larger block sizes has only been restricted for
cross-platform compatibility reasons.

Now that larger page sizes are becoming more common, the support for up
to 16kB block sizes has already been added into e2fsprogs, and will only
need a 1-line change in the kernel to be supported. The choice of 16kB
pages as the limit is somewhat arbitrary also, and could be increased
again in the future, as needed.

Having 16kB block size would allow a maximum of 64TB for a single
filesystem. The per-file limit would be over 256TB.

In reality, we will probably implement extent-based allocation for
ext3 when we start getting into filesystems that large, which has been
discussed among the ext2/ext3 developers already. We could also go to
a clustered filesystem like Lustre, which can span a large number of
separate filesystems (and hosts).

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-08-06 07:51:54

by Andreas Dilger

Subject: Re: BIG files & file systems

On Aug 06, 2002 03:24 -0400, Albert D. Cahalan wrote:
> Andreas Dilger writes:
> > Having 16kB block size would allow a maximum of 64TB for a single
> > filesystem. The per-file limit would be over 256TB.
>
> Um, yeah, 64 TB of data with 192 TB of holes!
> I really don't think you should count a file
> that won't fit on your filesystem.

Well, no worse than the original posting which had reiserfs supporting
something-EB files and 16TB filesystems. Don't think I didn't consider
this at the time of posting.

> > In reality, we will probably implement extent-based allocation for
> > ext3 when we start getting into filesystems that large, which has been
> > discussed among the ext2/ext3 developers already.
>
> It's nice to have a simple filesystem. If you turn ext2/ext3
> into an XFS/JFS competitor, then what is left? Just minix fs?

Note that I said ext3 in the above sentence, and not ext2. I'm not in
favour of adding all of the high-end features (htree, extents, etc) into
ext2 at all. It makes absolutely no sense to have a multi-TB filesystem
running ext2, and then the fsck time takes a day. It is desirable to
put some minimum support into ext2 for newer features when it makes
sense and does not complicate the code, but not for everything.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-08-06 09:25:00

by Matti Aarnio

Subject: Re: BIG files & file systems

The original question was "on a 64-bit machine, what are the limits?",
so this "will do, won't do" about 32-bit systems is out of scope.

And a prolonged will/won't is out of scope in every case...

Maybe somebody wants to write a concise encyclopedic article
about the issue for the LKML-FAQ? http://www.tux.org/lkml/

/Matti Aarnio

2002-08-06 09:45:31

by Hans Reiser

Subject: Re: BIG files & file systems

jw schultz wrote:

>On Mon, Aug 05, 2002 at 05:42:18PM +0400, Hans Reiser wrote:
>
>
>>You might also mention that I think the limits imposed by Linux are the
>>only meaningful ones, as we would change our limits as soon as Linux
>>did, and it was Linux that selected our limits for us. We would have
>>changed already if Linux didn't make it pointless to change it on Intel.
>>Reiser4 will have 64 bit blocknumbers that will be semi-pointless until
>>64 bit CPUs are widely deployed, and I am simply guessing this will be
>>not very far into reiser4's lifecycle. Really, the couple of #defines
>>that constitute these size limits, plus some surrounding code, are not
>>such a big thing to change (except that it constitutes a disk format
>>change).
>>
>>
>
>Hans,
>
>My recollection is that reiser4 isn't released yet. Why not
>set the reiser4 disk format with 64 bit blocknumbers from
>dot? 32 bit archs could write zeros and otherwise ignore
>the upper 32 bits and refuse to mount if filesystem size
>would cause overflow. That way you avoid on-disk format
>change mid cycle. That seems a lot less overhead than
>coping with different datatypes.
>
>Of course if you'd rather support another on-disk format
>to squeeze a bit more data onto small drives i can understand.
>
>
>
We are using 64 bit blocknumbers in reiser4, and letting linux limit
them. Perhaps my writing style was rather lacking in clarity.....

Linux is going to use some hacks in 2.5 that will let it go moderately
above the 2.4 limits. 64 bit blocknumbers seem the most flexible thing
in the face of what will be ever evolving hacks followed by the
introduction of 64 bit CPUs into the mainstream.

--
Hans