2005-03-09 19:05:37

by Dan Stromberg

Subject: huge filesystems


The group I work in has been experimenting with GFS and Lustre, and I did
some NBD/ENBD experimentation on my own, described at
http://dcs.nac.uci.edu/~strombrg/nbd.html

My question is, what is the current status of huge filesystems - i.e.,
filesystems that exceed 2 terabytes, and hopefully also exceed 16
terabytes?

Am I correct in assuming that the usual linux buffer cache only goes to 16
terabytes?

Does the FUSE API (or similar) happen to allow surpassing either the 2T or
16T limits?

What about the "LBD" patches - what limits are involved there, and have
they been rolled into a Linus kernel, or one or more vendor kernels?

Thanks!



2005-03-09 22:06:29

by Miklos Szeredi

Subject: Re: huge filesystems

> The group I work in has been experimenting with GFS and Lustre, and I did
> some NBD/ENBD experimentation on my own, described at
> http://dcs.nac.uci.edu/~strombrg/nbd.html
>
> My question is, what is the current status of huge filesystems - i.e.,
> filesystems that exceed 2 terabytes, and hopefully also exceed 16
> terabytes?
>
> Am I correct in assuming that the usual linux buffer cache only goes to 16
> terabytes?

The page cache limit is PAGE_CACHE_SHIFT + BITS_PER_LONG - 1 bits. On i386
that's 12 + 32 - 1 = 43 bits, or 8 TB. On 64-bit architectures the size of
off_t is the only limit.
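
Spelled out in userspace terms (a sketch of the arithmetic only, not kernel
code):

#include <stdio.h>
#include <unistd.h>

/* Evaluate page_shift + bits_per_long - 1 for the machine this runs on. */
int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);           /* PAGE_SIZE, usually 4096 */
    int page_shift = 0;
    while ((1L << page_shift) < page_size)
        page_shift++;

    int bits_per_long = (int)(sizeof(long) * 8);      /* 32 on i386, 64 on x86_64 */
    int limit_bits = page_shift + bits_per_long - 1;  /* 12 + 32 - 1 = 43 on i386 */

    if (limit_bits < 64)
        printf("page cache limit: 2^%d bytes = %llu TiB\n",
               limit_bits, (unsigned long long)1 << (limit_bits - 40));
    else
        printf("page cache limit: 2^%d bytes; off_t is the real limit\n",
               limit_bits);
    return 0;
}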

> Does the FUSE API (or similar) happen to allow surpassing either the 2T or
> 16T limits?

The API certainly doesn't have any limits. The page cache limit holds
for FUSE too, though with the direct-io mount option the page cache is
not used, so the limit could be removed as well. I'll fix that.
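
To illustrate the mechanism (a minimal libfuse sketch, not code from FUSE
itself -- the one-file filesystem and the 32 TiB size are made up for the
example): the per-file equivalent of the direct-io mount option is setting
direct_io in the open handler, which keeps that file out of the page cache
entirely. Build it with: gcc huge.c `pkg-config fuse --cflags --libs`, which
also makes off_t 64 bits.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* One read-only file, /huge, reported as 32 TiB of zeroes. */
static const off_t HUGE_SIZE = (off_t)1 << 45;

static int huge_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/huge") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = HUGE_SIZE;
    } else {
        return -ENOENT;
    }
    return 0;
}

static int huge_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, "/huge") != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    fi->direct_io = 1;    /* bypass the page cache for this file */
    return 0;
}

static int huge_read(const char *path, char *buf, size_t size, off_t off,
                     struct fuse_file_info *fi)
{
    (void)path; (void)fi;
    if (off >= HUGE_SIZE)
        return 0;
    if ((off_t)size > HUGE_SIZE - off)
        size = (size_t)(HUGE_SIZE - off);
    memset(buf, 0, size);    /* contents don't matter, only the offsets do */
    return (int)size;
}

static struct fuse_operations huge_ops = {
    .getattr = huge_getattr,
    .open    = huge_open,
    .read    = huge_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &huge_ops, NULL);
}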

Thanks,
Miklos

2005-03-10 06:19:23

by Chris Wedgwood

Subject: Re: huge filesystems

On Wed, Mar 09, 2005 at 10:53:48AM -0800, Dan Stromberg wrote:

> My question is, what is the current status of huge filesystems - i.e.,
> filesystems that exceed 2 terabytes, and hopefully also exceed 16
> terabytes?

people can and do have >2T filesystems now. some people on x86 have
hit the 16TB limit and others go larger still with 64-bit CPUs

> Am I correct in assuming that the usual linux buffer cache only goes
> to 16 terabytes?

for 32-bit CPUs

> What about the "LBD" patches - what limits are involved there, and
> have they been rolled into a Linus kernel, or one or more vendor
> kernels?

LBD is in 2.6.x and is required for >2TB but sometimes that means >1TB
or even smaller depending on the drivers

many drivers simply won't go above 2T even with CONFIG_LBD so you need
to poke about and see what works for you (or use md/raid to glue
together multiple 2TB volumes)
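
one quick way to poke about (a throwaway userspace sketch, nothing standard)
is to ask the kernel how big it thinks the device is; if a driver tops out
below the raw capacity it shows up here:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>     /* BLKGETSIZE64 */

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s /dev/XXX\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    uint64_t bytes = 0;
    if (ioctl(fd, BLKGETSIZE64, &bytes) < 0) {  /* size in bytes as the kernel sees it */
        perror("BLKGETSIZE64");
        close(fd);
        return 1;
    }
    close(fd);

    printf("%s: %llu bytes (%.2f TB)\n", argv[1],
           (unsigned long long)bytes, bytes / 1e12);
    return 0;
}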

2005-03-14 16:42:11

by Andreas Dilger

Subject: Re: huge filesystems

On Mar 09, 2005 10:53 -0800, Dan Stromberg wrote:
> The group I work in has been experimenting with GFS and Lustre, and I did
> some NBD/ENBD experimentation on my own, described at
> http://dcs.nac.uci.edu/~strombrg/nbd.html
>
> My question is, what is the current status of huge filesystems - i.e.,
> filesystems that exceed 2 terabytes, and hopefully also exceed 16
> terabytes?

Lustre has run with filesystems up to 400TB (where it hits a Lustre limit
that should be removed shortly for a 900TB filesystem being deployed).
The caveat is that Lustre is made up of individual block devices and
filesystems of only 2TB or less in size.

> Am I correct in assuming that the usual linux buffer cache only goes to 16
> terabytes?

That is the block device limit, and also the file limit for 32-bit systems,
imposed by the size of a single VM mapping, 2^32 * PAGE_SIZE.

> Does the FUSE API (or similar) happen to allow surpassing either the 2T or
> 16T limits?

Some 32-bit systems (PPC?) may allow larger PAGE_SIZE and will have a
larger limit for a single VM mapping. For 64-bit platforms there is no
2^32 limit for page->index and this also removes the 16TB limit.

> What about the "LBD" patches - what limits are involved there, and have
> they been rolled into a Linus kernel, or one or more vendor kernels?

These are part of stock 2.6 kernels. The caveat here is that there have
been some problems reported (with ext3 at least) for filesystems > 2TB
so I don't think it has really been tested very much.

Cheers, Andreas
--
Andreas Dilger
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/

2005-03-14 17:06:24

by jmerkey

Subject: Re: huge filesystems


I am running the DSFS file system as a 7 TB file system on 2.6.9. There are
a host of problems with the current VFS, and I have gotten around most of
them by **NOT** using the linux page cache interface. The VFS I am using
creates a virtual representation of the files and its own cache. You need a
lot of memory - 2GB roughly. The only way to do it at present is to use the
address-split patch that reserves 1GB for user address space and leaves 3GB
for the kernel, in order to create the caches.

You also need to ignore the broken fdisk partitioning tools. Just create one
partition if you create disks larger than 2 TB with 3Ware, and ignore the
values in the table. I check for a single partition on these devices and
rely on the capacity parameter reported from the bdev handle and just use
the extra space.
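
(Roughly, "rely on the capacity reported from the bdev handle" can look like
this in 2.6-era kernel code -- a sketch only, with an illustrative function
name rather than actual dsfs source:)

#include <linux/fs.h>
#include <linux/genhd.h>

/* Size the filesystem from the whole disk as the block layer reports it,
 * ignoring whatever the partition table claims. */
static loff_t dsfs_device_bytes(struct block_device *bdev)
{
        /* get_capacity() returns the whole disk's size in 512-byte sectors */
        return (loff_t)get_capacity(bdev->bd_disk) << 9;
}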

Linux proper has a long way to go on this topic, but it is possible -- I am
doing this today on 2.6.9.

Jeff


2005-03-15 03:23:03

by Andrew Morton

Subject: Re: huge filesystems

jmerkey <[email protected]> wrote:
>
> I am running the DSFS file system as a 7 TB file system on 2.6.9.

On a 32-bit CPU?

> There are a host of problems with the current VFS,

I don't recall you reporting any of them. How can we expect to fix
anything if we aren't told about it?

> ad I have gotten around most of them
> by **NOT** using the linux page cache interface.

Well that won't fly.


The VFS should support devices up to 16TB on 32-bit CPUs. If you know of
scenarios in which it fails to do that, please send a bug report.

2005-03-15 03:49:21

by jmerkey

Subject: Re: huge filesystems

Andrew Morton wrote:

>jmerkey <[email protected]> wrote:
>
>>I am running the DSFS file system as a 7 TB file system on 2.6.9.
>
>On a 32-bit CPU?

Yep.

>>There are a host of problems with the current VFS,
>
>I don't recall you reporting any of them. How can we expect to fix
>anything if we aren't told about it?
I report them when I can't get around them myself. I've been able to get
around most of them.

>>and I have gotten around most of them
>>by **NOT** using the linux page cache interface.
>
>Well that won't fly.
For this application it will.

>The VFS should support devices up to 16TB on 32-bit CPUs. If you know of
>scenarios in which it fails to do that, please send a bug report.
Based on the changes I've made to it locally for my version of 2.6.9, it now
goes to 1 zettabyte (1024 exabytes). The largest one I've configured so far
with actual storage is 128 TB, though. I had to drop the page cache and
replace it -- for now.


Jeff


2005-03-15 04:03:35

by Andrew Morton

Subject: Re: huge filesystems

jmerkey <[email protected]> wrote:
>
> >I don't recall you reporting any of them. How can we expect to fix
> >anything if we aren't told about it?
> >
> I report them when I can't get around them myself. I've been able to get
> around most of them.

Jeff, that's all take and no give.

Please give: what problems have you observed in the current VFS for devices
and files less than 16TB?

2005-03-15 04:49:53

by jmerkey

Subject: Re: huge filesystems

Andrew Morton wrote:

>jmerkey <[email protected]> wrote:
>
>> >I don't recall you reporting any of them. How can we expect to fix
>> >anything if we aren't told about it?
>> >
>> I report them when I can't get around them myself. I've been able to get
>> around most of them.
>
>Jeff, that's all take and no give.
>
>Please give: what problems have you observed in the current VFS for devices
>and files less than 16TB?

1. Scaling issues with readdir() with huge numbers of files (not even huge
really -- 87000 files in a dir takes a while for readdir() to return
results). I average 2-3 million files per directory on 2.6.9. It can take
up to a minute for readdir() to return from the initial read of one of
these directories through the VFS.

2. NFS performance and stability issues with mapping NFS on top of dsfs.
All sorts of performance problems with system slowdowns -- in some cases I
can copy a file from system to system on a floppy faster than I can copy it
over 100 Mbit ethernet.

3. RCU and interrupt state problems with concurrent network I/O and VFS
interaction. In lots of places I reordered the code in these sections to
use more coarse-grained locking.

4. BIO multiple chained requests have never worked correctly, so I have to
submit 4K per BIO always. The design and concept behind BIOs was great --
the implementation has a lot of problems. When I submit a chain larger than
32 MB of 4K pages, the system loses state and the BIOs don't get returned
or completed, and I see some bizarre error returns from submission. Jens'
classic response is always "Merkey, you don't understand the interface" --
I have the code, I understand it quite well, and it does not work as
advertised at these big sizes.

5. Files larger than 2TB work fine through the VFS provided I force mmap to
use the internal interface. Files larger than 4 TB also seem to work fine.
I have also tested with files larger than 7TB; they also seem to work fine.
I have not tested individual files larger than 10 TB yet, but this will be
happening in a month or so based on the units we are selling. When I enable
page cache mmap through the VFS, the system gets into trouble with these
five memory pools from hell (slab and the various allocators in Linux -- I
would think one byte-level allocator would be enough) and has problems
under low memory conditions. I don't use the buffer cache because I post
these huge coalesced sector runs to disk and need memory in contiguous
chunks, so the page cache/buffer cache don't optimize well in dsfs. I am
achieving over 700 megabytes per second to disk with custom hardware and
the architecture I am using -- 6% processor utilization on 2.6.9.

6. fdisk does not support drives larger than 2TB, so I have to hack the
partition tables and fake out dsfs with 3TB and 4TB drives created with
RAID 0 controllers and hardware. This needs to get fixed.

I will always give back changes to GPL code if folks ask for them -- buy an
appliance through OSDL (they are really cool) and request the GPL changes
to Linux, and I'll provide them as requested.

Order one from http://www.soleranetworks.com. Ask for Troy.

Jeff




2005-03-15 06:04:53

by Andreas Dilger

Subject: Re: huge filesystems

On Mar 14, 2005 21:37 -0700, jmerkey wrote:
> 1. Scaling issues with readdir() with huge numbers of files (not even huge
> really -- 87000 files in a dir takes a while for readdir() to return
> results). I average 2-3 million files per directory on 2.6.9. It can take
> up to a minute for readdir() to return from the initial read of one of
> these directories through the VFS.

Actually, unless I'm mistaken the problem is that "ls" (even when you
ask it not to sort entries) is doing readdir on the whole directory
before returning any results. We see this with Lustre and very large
directories. Run strace on "ls" and it is doing masses of readdirs, but
no output to stdout. Lustre readdir works OK on directories up to 10M
files, but ls sucks.

$ strace ls /usr/lib 2>&1 > /dev/null
:
:
open("/usr/lib", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0755, st_size=57344, ...}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, /* 120 entries */, 4096) = 4096
getdents64(3, /* 65 entries */, 4096) = 2568
getdents64(3, /* 111 entries */, 4096) = 4088
:
:
getdents64(3, /* 59 entries */, 4096) = 2152
getdents64(3, /* 10 entries */, 4096) = 496
getdents64(3, /* 0 entries */, 4096) = 0
close(3) = 0
write(1, "Acrobat5\nalchemist\nanaconda\nanac"..., 4096) = 4096
write(1, "libbonobo-2.a\nlibbonobo-2.so\nlib"..., 4096) = 4096
write(1, "ibgdbm.so\nlibgdbm.so.2\nlibgdbm.s"..., 4096) = 4096
write(1, "nica_qmxxx.la\nlibgphoto_konica_q"..., 4096) = 4096
write(1, ".so\nlibIDL-2.so.0\nlibIDL-2.so.0."..., 4096) = 4096
write(1, "libkpilot.so.0\nlibkpilot.so.0.0."..., 4096) = 4096
write(1, "ove.so.0\nlibospgrove.so.0.0.0\nli"..., 4096) = 4096
write(1, ".6\nlibsoundserver_idl.la\nlibsoun"..., 4096) = 4096
write(1, "lparse.so.0\nlibxmlparse.so.0.1.0"..., 1294) = 1294


> 6. fdisk does not support drives larger than 2TB, so I have to hack the
> partition tables and fake out dsfs with 3TB and 4TB drives created with
> RAID 0 controllers and hardware. This needs to get fixed.

Use a different partition format (e.g. EFI or devicemapper) or none at all.
That is better than just ignoring the whole thing and some user thinking
"gee, I have all this free space here, maybe I'll make another partition".

Cheers, Andreas
--
Andreas Dilger
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/



2005-03-15 09:28:43

by Barry K. Nathan

Subject: Re: huge filesystems

On Mon, Mar 14, 2005 at 11:41:37AM -0500, Andreas Dilger wrote:
> > What about the "LBD" patches - what limits are involved there, and have
> > they been rolled into a Linus kernel, or one or more vendor kernels?
>
> These are part of stock 2.6 kernels. The caveat here is that there have
> been some problems reported (with ext3 at least) for filesystems > 2TB
> so I don't think it has really been tested very much.

FWIW Red Hat appears to officially support 8TB ext3 filesystems on
Red Hat Enterprise Linux 4:

"Ext3 scalability: Dynamic file system expansion and file system sizes
up to 8TB are now supported."

http://www.redhat.com/software/rhel/features/

-Barry K. Nathan <[email protected]>

2005-03-19 11:09:58

by Eric W. Biederman

Subject: Re: huge filesystems

Andreas Dilger <[email protected]> writes:

> On Mar 14, 2005 21:37 -0700, jmerkey wrote:
> > 1. Scaling issues with readdir() with huge numbers of files (not even
> > huge really -- 87000 files in a dir takes a while for readdir() to
> > return results). I average 2-3 million files per directory on 2.6.9. It
> > can take up to a minute for readdir() to return from the initial read
> > of one of these directories through the VFS.
>
> Actually, unless I'm mistaken the problem is that "ls" (even when you
> ask it not to sort entries) is doing readdir on the whole directory
> before returning any results. We see this with Lustre and very large
> directories. Run strace on "ls" and it is doing masses of readdirs, but
> no output to stdout. Lustre readdir works OK on directories up to 10M
> files, but ls sucks.

The classic test is: does 'echo *', which does the readdir but not the
stat, come back quickly?

Anyway most of the readdir work is in the filesystem so I don't see
how the VFS would be involved....

Eric