2006-10-04 16:56:56

by Andreas Dilger

Subject: Directories > 2GB

For ext4 we are exploring the possibility of directories being larger
than 2GB in size. For ext3/ext4 the 2GB limit is about 50M files, and
the 2-level htree limit is about 25M files (a kernel code limit, not a
disk format limit).
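
(Back of the envelope, assuming an average dirent of roughly 40 bytes:
2^31 / 40 is a bit over 50M entries.  For the htree case, with roughly
500 8-byte index entries per 4KB block and two index levels, that is
~250k leaf blocks at ~100 entries each, i.e. ~25M.)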

Amusingly (or not) some users of very large filesystems hit this limit
with their HPC batch jobs because they have 10,000 or 128,000 processes
creating files in a directory on an hourly basis (job restart files,
data dumps for visualization, etc) and it is not always easy to change
the apps.

My question (esp. for XFS folks) is whether anyone has looked at this problem
before, and what kind of problems they might have hit in userspace and in
the kernel due to "large" directory sizes (i.e. > 2GB). It appears at
first glance that 64-bit systems will do OK because off_t is a long
(for telldir output), but that 32-bit systems would need to use O_LARGEFILE
when opening the directory in order to read the full directory
contents. It might also be possible to return -EFBIG only in the case
that telldir is used beyond 2GB (the LFS spec doesn't really talk about
large directories at all).
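
For illustration, an (untested) sketch of how a 32-bit application
could walk such a directory by calling getdents64 directly, so the
position travels in a 64-bit d_off instead of being squeezed through
telldir()'s long return value:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* glibc doesn't expose the raw getdents64 record, so declare it here */
struct linux_dirent64 {
        uint64_t       d_ino;
        int64_t        d_off;           /* 64-bit offset of the next entry */
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
};

int main(int argc, char **argv)
{
        char buf[65536];
        long nread, pos;
        int fd;

        if (argc < 2)
                return 1;

        /* O_LARGEFILE so a >2GB directory can be opened on 32-bit */
        fd = open(argv[1], O_RDONLY | O_DIRECTORY | O_LARGEFILE);
        if (fd < 0)
                return 1;

        while ((nread = syscall(SYS_getdents64, fd, buf, sizeof(buf))) > 0) {
                pos = 0;
                while (pos < nread) {
                        struct linux_dirent64 *d =
                                (struct linux_dirent64 *)(buf + pos);
                        puts(d->d_name);
                        pos += d->d_reclen;
                }
        }
        close(fd);
        return nread < 0 ? 1 : 0;
}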

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



2006-10-04 17:51:20

by Dave Kleikamp

Subject: Re: Directories > 2GB

On Wed, 2006-10-04 at 10:56 -0600, Andreas Dilger wrote:
> For ext4 we are exploring the possibility of directories being larger
> than 2GB in size. For ext3/ext4 the 2GB limit is about 50M files, and
> the 2-level htree limit is about 25M files (this is a kernel code and not
> disk format limit).
>
> Amusingly (or not) some users of very large filesystems hit this limit
> with their HPC batch jobs because they have 10,000 or 128,000 processes
> creating files in a directory on an hourly basis (job restart files,
> data dumps for visualization, etc) and it is not always easy to change
> the apps.
>
> My question (esp. for XFS folks) is if anyone has looked at this problem
> before, and what kind of problems they might have hit in userspace and in
> the kernel due to "large" directory sizes (i.e. > 2GB). It appears at
> first glance that 64-bit systems will do OK because off_t is a long
> (for telldir output), but that 32-bit systems would need to use O_LARGEFILE
> when opening the file in order to be able to read the full directory
> contents. It might also be possible to return -EFBIG only in the case
> that telldir is used beyond 2GB (the LFS spec doesn't really talk about
> large directories at all).

ext3 directory entries are always multiples of 4 bytes in length. So
the lowest 2 bits of the offset are always zero. Right? Why not shift
the returned offset and f_pos 2 bits right?
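
Something like the following (purely illustrative, not the actual ext3
readdir/llseek code):

#include <linux/types.h>

/*
 * Illustrative only: since dirents are 4-byte aligned, the
 * f_pos/telldir cookie can carry the byte offset shifted right by 2.
 * A positive 31-bit cookie then covers 2^31 * 4 bytes = 8GB of
 * directory instead of 2GB.
 */
static inline loff_t dir_byte_to_fpos(loff_t byte_off)
{
        return byte_off >> 2;
}

static inline loff_t dir_fpos_to_byte(loff_t f_pos)
{
        return f_pos << 2;
}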

JFS uses an index into an array for the position (which isn't even in
the directory traversal order) so it can handle about 2G files in a
directory (although deleted entries aren't reused).
--
David Kleikamp
IBM Linux Technology Center

2006-10-09 21:53:27

by Steve Lord

Subject: Re: Directories > 2GB

Andreas Dilger wrote:
> For ext4 we are exploring the possibility of directories being larger
> than 2GB in size. For ext3/ext4 the 2GB limit is about 50M files, and
> the 2-level htree limit is about 25M files (this is a kernel code and not
> disk format limit).
>
> Amusingly (or not) some users of very large filesystems hit this limit
> with their HPC batch jobs because they have 10,000 or 128,000 processes
> creating files in a directory on an hourly basis (job restart files,
> data dumps for visualization, etc) and it is not always easy to change
> the apps.
>
> My question (esp. for XFS folks) is if anyone has looked at this problem
> before, and what kind of problems they might have hit in userspace and in
> the kernel due to "large" directory sizes (i.e. > 2GB). It appears at
> first glance that 64-bit systems will do OK because off_t is a long
> (for telldir output), but that 32-bit systems would need to use O_LARGEFILE
> when opening the file in order to be able to read the full directory
> contents. It might also be possible to return -EFBIG only in the case
> that telldir is used beyond 2GB (the LFS spec doesn't really talk about
> large directories at all).
>

My first thought is to run screaming for the hills when users want this.
In a previous life we had a customer in the US Gov who decided to
put all their 700 million files in one directory. Then they had an
unreported double-disk raid failure (the raid vendor's fault). The
filesystem repair ran for 7 days and a heck of a lot of files
ended up in lost+found. Fortunately they had a huge amount of
memory and process address space available to run the repair.

Anyone who does this and has any sense does not allow any sort of
scanning of the namespace (i.e. anything using readdir). You tend
to run out of process address space before you have read the
directory.

You might want to think about keeping the directory a little
more contiguous than individual disk blocks. XFS does have
code in it to allocate the directory in chunks larger than
a single file system block. It does not get used on Linux
because the code was written under the assumption that you can
see the whole chunk as a single piece of memory, which does not
work too well in the Linux kernel.

Steve

2006-10-10 01:55:35

by David Chinner

Subject: Re: Directories > 2GB

On Mon, Oct 09, 2006 at 04:53:02PM -0500, Steve Lord wrote:
> You might want to think about keeping the directory a little
> more contiguous than individual disk blocks. XFS does have
> code in it to allocate the directory in chunks larger than
> a single file system block. It does not get used on linux
> because the code was written under the assumption you can
> see the whole chunk as a single piece of memory which does not
> work to well in the linux kernel.

This code is enabled and seems to work in Linux. I don't know if it
passes xfsqa, so I don't know how reliable this feature is. To check
it, all I did was run a quick test on an x86_64 kernel (4k page
size) using 16k directory blocks (4 pages):

# mkfs.xfs -f -n size=16384 /dev/ubd/1
.....
# xfs_db -r -c "sb 0" -c "p dirblklog" /dev/ubd/1
dirblklog = 2
# mount /dev/ubd/1 /mnt/xfs
# for i in `seq 1 100000`; do touch /mnt/xfs/fred.$i; done
# umount /mnt/xfs
# mount /mnt/xfs
# ls /mnt/xfs |wc -l
100000
# rm -rf /mnt/xfs/*
# ls /mnt/xfs |wc -l
0
# umount /mnt/xfs
#
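
(dirblklog is the log2 of the directory block size in filesystem
blocks, so 2 here means 4k << 2 = 16k directory blocks - i.e. the
mkfs option took effect.)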

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-10 02:15:36

by Steve Lord

Subject: Re: Directories > 2GB

Hi Dave,

My recollection is that it used to default to on; it was disabled
because it needs to map the buffer into a single contiguous chunk
of kernel memory. This was placing a lot of pressure on the memory
remapping code, so we made it not default to on, as reworking the
code to deal with non-contiguous memory was looking like a major
effort.

Steve


David Chinner wrote:
> On Mon, Oct 09, 2006 at 04:53:02PM -0500, Steve Lord wrote:
>> You might want to think about keeping the directory a little
>> more contiguous than individual disk blocks. XFS does have
>> code in it to allocate the directory in chunks larger than
>> a single file system block. It does not get used on linux
>> because the code was written under the assumption you can
>> see the whole chunk as a single piece of memory which does not
>> work to well in the linux kernel.
>
> This code is enabled and seems to work in Linux. I don't know if it
> passes xfsqa so I don't know how reliable this feature is. TO check
> it all I did was run a quick test on a x86_64 kernel (4k page
> size) using 16k directory blocks (4 pages):
>
> # mkfs.xfs -f -n size=16384 /dev/ubd/1
> .....
> # xfs_db -r -c "sb 0" -c "p dirblklog" /dev/ubd/1
> dirblklog = 2
> # mount /dev/ubd/1 /mnt/xfs
> # for i in `seq 1 100000`; do touch /mnt/xfs/fred.$i; done
> # umount /mnt/xfs
> # mount /mnt/xfs
> # ls /mnt/xfs |wc -l
> 100000
> # rm -rf /mnt/xfs/*
> # ls /mnt/xfs |wc -l
> 0
> # umount /mnt/xfs
> #
>
> Cheers,
>
> Dave.


2006-10-10 09:19:10

by Christoph Hellwig

Subject: Re: Directories > 2GB

On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
> Hi Dave,
>
> My recollection is that it used to default to on, it was disabled
> because it needs to map the buffer into a single contiguous chunk
> of kernel memory. This was placing a lot of pressure on the memory
> remapping code, so we made it not default to on as reworking the
> code to deal with non contig memory was looking like a major
> effort.

Exactly. The code works but tends to go OOM pretty fast, at least
when the dir blocksize is bigger than the page size. I should
give the code a spin on my ppc box with 64k pages to see if it works
better there.


2006-10-10 23:31:56

by David Chinner

Subject: Re: Directories > 2GB

On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote:
> On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
> > Hi Dave,
> >
> > My recollection is that it used to default to on, it was disabled
> > because it needs to map the buffer into a single contiguous chunk
> > of kernel memory. This was placing a lot of pressure on the memory
> > remapping code, so we made it not default to on as reworking the
> > code to deal with non contig memory was looking like a major
> > effort.
>
> Exactly. The code works but tends to go OOM pretty fast at least
> when the dir blocksize code is bigger than the page size. I should
> give the code a spin on my ppc box with 64k pages if it works better
> there.

The pagebuf code doesn't use high-order allocations anymore; it uses
scatter lists and remapping to allow physically discontiguous pages
in a multi-page buffer. That is, the pages are sourced via
find_or_create_page() from the address space of the backing device,
and then mapped via vmap() to provide a virtually contiguous mapping
of the multi-page buffer.
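
In outline the pattern is roughly this (a compressed sketch, not the
real pagebuf/xfs_buf code, which also handles page locking, error
unwinding and buffer flags):

#include <linux/pagemap.h>
#include <linux/vmalloc.h>

/*
 * Sketch only: build a virtually contiguous view of a multi-page
 * buffer from physically discontiguous page cache pages.
 */
static void *map_multipage_buf(struct address_space *mapping,
                               pgoff_t first, unsigned int npages,
                               struct page **pages)
{
        unsigned int i;

        for (i = 0; i < npages; i++) {
                /* one page at a time - no high-order allocation */
                pages[i] = find_or_create_page(mapping, first + i,
                                               GFP_KERNEL);
                if (!pages[i])
                        return NULL;    /* real code unlocks and releases */
        }
        /* stitch the pages into one virtually contiguous mapping */
        return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
}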

So I don't think this problem exists anymore...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-11 16:49:10

by Steve Lord

Subject: Re: Directories > 2GB

David Chinner wrote:
> On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote:
>> On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
>>> Hi Dave,
>>>
>>> My recollection is that it used to default to on, it was disabled
>>> because it needs to map the buffer into a single contiguous chunk
>>> of kernel memory. This was placing a lot of pressure on the memory
>>> remapping code, so we made it not default to on as reworking the
>>> code to deal with non contig memory was looking like a major
>>> effort.
>> Exactly. The code works but tends to go OOM pretty fast at least
>> when the dir blocksize code is bigger than the page size. I should
>> give the code a spin on my ppc box with 64k pages if it works better
>> there.
>
> The pagebuf code doesn't use high-order allocations anymore; it uses
> scatter lists and remapping to allow physically discontiguous pages
> in a multi-page buffer. That is, the pages are sourced via
> find_or_create_page() from the address space of the backing device,
> and then mapped via vmap() to provide a virtually contigous mapping
> of the multi-page buffer.
>
> So I don't think this problem exists anymore...

I was not referring to high-order allocations here, but to the overhead
of doing address space remapping every time a directory is accessed.

Steve


2006-10-12 00:26:47

by David Chinner

Subject: Re: Directories > 2GB

On Wed, Oct 11, 2006 at 11:49:10AM -0500, Steve Lord wrote:
> David Chinner wrote:
> >On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote:
> >>On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
> >>>Hi Dave,
> >>>
> >>>My recollection is that it used to default to on, it was disabled
> >>>because it needs to map the buffer into a single contiguous chunk
> >>>of kernel memory. This was placing a lot of pressure on the memory
> >>>remapping code, so we made it not default to on as reworking the
> >>>code to deal with non contig memory was looking like a major
> >>>effort.
> >>Exactly. The code works but tends to go OOM pretty fast at least
> >>when the dir blocksize code is bigger than the page size. I should
> >>give the code a spin on my ppc box with 64k pages if it works better
> >>there.
> >
> >The pagebuf code doesn't use high-order allocations anymore; it uses
> >scatter lists and remapping to allow physically discontiguous pages
> >in a multi-page buffer. That is, the pages are sourced via
> >find_or_create_page() from the address space of the backing device,
> >and then mapped via vmap() to provide a virtually contigous mapping
> >of the multi-page buffer.
> >
> >So I don't think this problem exists anymore...
>
> I was not referring to high order allocations here, but the overhead
> of doing address space remapping every time a directory is accessed.

Ah - ok. Contiguous vs non-contiguous memory and OOM are usually
discussed in the context of higher-order allocations failing. FWIW,
I've not noticed any extra overhead - the CPU usage seems to grow
roughly linearly with the increase in directory operations done as a
result of the higher throughput for the same number of I/Os. I'll have
a look at the VM stats, though, next time I run a comparison to see
how bad this is.

Thanks for the clarification, Steve.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-10-16 18:17:03

by Andi Kleen

Subject: Re: Directories > 2GB

Steve Lord <[email protected]> writes:

> David Chinner wrote:
> > On Tue, Oct 10, 2006 at 10:19:04AM +0100, Christoph Hellwig wrote:
> >> On Mon, Oct 09, 2006 at 09:15:28PM -0500, Steve Lord wrote:
> >>> Hi Dave,
> >>>
> >>> My recollection is that it used to default to on, it was disabled
> >>> because it needs to map the buffer into a single contiguous chunk
> >>> of kernel memory. This was placing a lot of pressure on the memory
> >>> remapping code, so we made it not default to on as reworking the
> >>> code to deal with non contig memory was looking like a major
> >>> effort.
> >> Exactly. The code works but tends to go OOM pretty fast at least
> >> when the dir blocksize code is bigger than the page size. I should
> >> give the code a spin on my ppc box with 64k pages if it works better
> >> there.
> > The pagebuf code doesn't use high-order allocations anymore; it uses
> > scatter lists and remapping to allow physically discontiguous pages
> > in a multi-page buffer. That is, the pages are sourced via
> > find_or_create_page() from the address space of the backing device,
> > and then mapped via vmap() to provide a virtually contigous mapping
> > of the multi-page buffer.
> > So I don't think this problem exists anymore...
>
> I was not referring to high order allocations here, but the overhead
> of doing address space remapping every time a directory is accessed.

# grep -i vmalloc /proc/meminfo
VmallocTotal: 34359738367 kB

At least on 64-bit systems it would be reasonable to keep a large
number of directories mapped this way over a longer time. vmap() space is
cheap there.

-Andi