2011-10-26 23:36:41

by Andreas Dilger

Subject: bigalloc and max file size

Ted,
we were having a discussion about bigalloc and the maximum file size
(as limited by the 2^32 logical block number in struct ext4_extent).

Currently the maximum file size is blocksize * 2^32, 16TB for 4kB blocks.

Since it is not possible to allocate sub-blocks in the bigalloc code,
what about storing the "chunk number" in the extent logical block?

This would allow us to create files up to chunksize * 2^32. With
a bigalloc chunk size of 1MB we could have a maximum file size of
2^(20 + 32) = 2^52 = 4PB, which is within spitting distance of the
maximum filesystem size of 2^60 bytes (4kB blocks * 2^48 blocks)
with the current extent format, and beyond reasonable limits today.

This essentially allows creating files as large as the filesystem size
without having to change the extent format, which is a good thing.
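
To make the arithmetic concrete, here is a throwaway sketch (it just restates
the limits above, assuming 4kB blocks and a 1MB chunk):

#include <stdio.h>

int main(void)
{
	unsigned long long blocksize = 4096;          /* 4kB block */
	unsigned long long chunksize = 1ULL << 20;    /* 1MB bigalloc chunk */
	unsigned long long max_lblk  = 1ULL << 32;    /* 32-bit logical block number */

	/* today: extent logical blocks are filesystem blocks -> 16TB */
	printf("current max file size:     %llu bytes\n", blocksize * max_lblk);

	/* proposed: extent logical "blocks" count bigalloc chunks -> 4PB */
	printf("chunk-based max file size: %llu bytes\n", chunksize * max_lblk);

	/* max filesystem size with 48-bit physical block numbers -> 1EB */
	printf("max filesystem size:       %llu bytes\n", blocksize * (1ULL << 48));
	return 0;
}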

Is this implemented in bigalloc? If not, it would be great to do
this before landing bigalloc in the upstream kernel, since it is
basically free to do, can already fall under the INCOMPAT_BIGALLOC
feature flag, and avoids issues in the near future. I don't think
the e2fsprogs with bigalloc support is released yet either, so it
should still be OK to change the on-disk format?

Cheers, Andreas







2011-10-27 01:05:56

by Tao Ma

Subject: Re: bigalloc and max file size

Hi Andreas,
On 10/27/2011 07:36 AM, Andreas Dilger wrote:
> Ted,
> we were having a discussion about bigalloc and the maximum file size
> (as limited by the 2^32 logical block number in struct ext4_extent).
>
> Currently the maximum file size is blocksize * 2^32, 16TB for 4kB blocks.
>
> Since it is not possible to allocate sub-blocks in the bigalloc code,
> what about storing the "chunk number" in the extent logical block?
> This would allow us to create files up to chunksize * 2^32. With
> a bigalloc chunk size of 1MB we could have a maximum file size of
> 2^(20 + 32) = 2^52 = 4PB, which is within spitting distance of the
> maximum filesystem size of 2^60 bytes (4kB blocks * 2^48 blocks)
> with the current extent format, and beyond reasonable limits today.
>
> This essentially allows creating files as large as the filesystem size
> without having to change the extent format, which is a good thing.
>
> Is this implemented in bigalloc? If not, it would be great to do
> this before landing bigalloc in the upstream kernel, since it is
> basically free to do, can already fall under the INCOMPAT_BIGALLOC
> feature flag, and avoids issues in the near future. I don't think
> the e2fsprogs with bigalloc support is released yet either, so it
> should still be OK to change the on-disk format?
That is what we need here as well. ;) We have asked Ted about it, and Ted has
expressed his concerns. Please search the mailing list for the subject
"Question about BIGALLOC".

ocfs2 has implemented a similar mechanism for file extents and it
seems to work well so far. But ocfs2 doesn't have delayed allocation, so the
situation may not be the same here. Anyway, one of my colleagues, Robin, is
trying to change the extent length from blocks to clusters. I'd like him to
post his progress here.

Thanks
Tao

2011-10-27 06:40:13

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 26, 2011, at 7:36 PM, Andreas Dilger wrote:

> Ted,
> we were having a discussion about bigalloc and the maximum file size
> (as limited by the 2^32 logical block number in struct ext4_extent).
>
> Currently the maximum file size is blocksize * 2^32, 16TB for 4kB blocks.
>
> Since it is not possible to allocate sub-blocks in the bigalloc code,
> what about storing the "chunk number" in the extent logical block?

This adds all sorts of complexity, and it's why we can't simply increase the block size above a page size. Basically, avoiding all of the complex changes needed was the whole point of the bigalloc feature.

Effectively, if we stored the "chunk number" in the extent tree blocks, it's equivalent to setting the block size to the larger chunk size. Among other things, it means that we would have to support setting or clearing the uninitialized bit in chunks larger than a page, and that dirtying a 4k page in a sparse file would require clearing 64k or 1 megabyte (whatever the chunk size might be), either on disk or in memory.

In any case, it's not a simple change that we can make before the merge window. If someone wants to work on that, we could get the same effect simply by supporting block size > page size, with all of the complexity and pain which has prevented us from supporting that for the last ten years…

-- Ted

2011-10-27 11:48:41

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 27, 2011, at 5:38 AM, Andreas Dilger wrote:

> Writing 64kB is basically the minimum useful unit of IO to a modern disk drive, namely if you are doing any writes then zeroing 64kB isn't going to be noticeably slower than 4kB or 16kB.

That may be true if the cluster size is 64k, but if the cluster size is 1MB, the requirement to zero out 1MB chunks each time a 4k block is written would be painful.

>
>> In any case, it's not a simple change that we can make before the merge window.
>
> Are you saying that bigalloc is already pushed for this merge window? It sounds like there is someone else working on this issue already, and I'd like to give them and me a chance to resolve it before the on-disk format of bigalloc is cast in stone.

Yes, it's already in the ext4 and e2fsprogs trees, and it's due to be pushed to Linus this week. E2fsprogs with bigalloc support just entered Debian testing, so it's really too late to change the bigalloc format without a new feature flag.

> This is all a bit hand wavy, since I admit I haven't yet dug into this code, but I don't think it has exactly the same issues as large blocks, since fundamentally there are not multiple pages that address the same block number, so the filesystem can properly address the right logical blocks in the filesystem.

That's a good point, but we could do that with a normal 64k block file system. The block number which we use on-disk can be in multiples of 64k, but the "block number" that we use in the bmap function and in the bh_blocknr field attached to the pages could be in units of 4k pages.

This is also a bit hand-wavy, but if we also can handle 64k directory blocks, then we could mount 64k block file systems as used in IA64/Power HPC systems on x86 systems, which would be really cool.
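
In code form the unit juggling is just a shift. A purely illustrative sketch (the helper names below are made up for this email, not actual ext4 functions):

/* Illustrative only -- hypothetical helpers, not real ext4 code.
 * fs_bits = log2(on-disk block size), e.g. 16 for 64k blocks;
 * pg_bits = log2(page size), e.g. 12 for 4k pages.
 */
typedef unsigned long long blk64_t;

static inline blk64_t fsblk_to_pageunit(blk64_t fs_blk, int fs_bits, int pg_bits)
{
	/* on-disk 64k block number -> 4k-unit number for bmap / b_blocknr */
	return fs_blk << (fs_bits - pg_bits);
}

static inline blk64_t pageunit_to_fsblk(blk64_t pg_blk, int fs_bits, int pg_bits)
{
	/* 4k-unit number -> containing on-disk 64k block */
	return pg_blk >> (fs_bits - pg_bits);
}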

-- Ted



2011-10-27 21:42:23

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 27, 2011, at 11:08 AM, Andreas Dilger wrote:

>>
>> That may be true if the cluster size is 64k, but if the cluster size is 1MB, the requirement to zero out 1MB chunks each time a 4k block is written would be painful.
>
> But it should be up to the admin not to configure the filesystem in a foolish way like this. One wouldn't expect good performance with real 1MB block size and random 4kB writes either, so don't do that.

Yes, but with the current bigalloc scheme, we don't have to zero out the whole 1MB cluster, and there are reasons why 1MB cluster sizes make sense in some situations. Your variation would require the whole 1MB cluster to be zeroed, with the attendant performance hit, but I see that as a criticism of your proposed change, not of the intelligence of the system administrator. :-)

> It's taken 3+ years to get an e2fsprogs release out with 64-bit blocksize support, and we can't wait a couple of weeks to see if there is an easy way to make bigalloc useful for large file sizes? Don't you think this would be a lot easier to fix now compared to e.g. having to create a new extent format or adding yet another feature that would allow the extents to specify either logical blocks or logical chunks?

Well, in addition to the e2fsprogs 1.42-WIP being in Debian testing (as well as other community distros like Arch and Gentoo), there's also the situation that we're in the middle of the merge window, and I have a whole stack of patches on top of the bigalloc patches, some of which would have to be reworked if the bigalloc patches were to be yanked out. So removing the bigalloc patches before I push to Linus is going to be a bit of a bother (as well as violating our newly in-place rule that commits in between the dev and master branch heads can be mutated, but commits that are in the master branch are considered non-rewindable).

One could argue that I could add a patch which disabled the bigalloc feature, and then make changes in the next merge window, but to be completely honest I have my own selfish reason for not wanting to do that, which is that the bigalloc patches have also been integrated into Google's internal kernels already, and changing the bigalloc format without a new flag would make things complicated for me. Given that we decided to lock down the extent leaf format (even though I had wanted to make changes to it, for example to support a full 64-bit block number) in deference to the fact that it was in ClusterFS deployed kernels, there is precedent for taking into account the status of formats used in non-mainline kernels by the original authors of the feature.

>> This is also a bit hand-wavy, but if we also can handle 64k directory blocks, then we could mount 64k block file systems as used in IA64/Power HPC systems on x86 systems, which would be really cool.
>
> At that point, would there be any value in using bigalloc at all? The one benefit I can see is that bigalloc would help the most common case of linear file writes (if the extent still stores the length in blocks instead of chunks) because it could track the last block written and only have to zero out the last block.

Well, it would also have the benefit of sparse, random 4k writes into a file system with a large cluster size (going back to the discussion in the first paragraph).

In general, the current bigalloc approach is more suited for very large cluster sizes (>> 64k), whereas using a block size > page size approach makes more sense in the 4k-64k range, especially since it provides better cross-architecture compatibility with large block size file systems that are already in existence today. Note too that the large block size approach completely tops out at 256k because of the dirent length encoding issue, whereas with bigalloc we can support cluster sizes even larger than 1MB if that were to be useful for some storage scenarios.
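
For reference, a sketch of why the dirent encoding tops out at 256k (illustrative only, not the actual ext4_rec_len_from_disk() implementation): directory entries are 4-byte aligned, so the low two bits of the 16-bit rec_len can be borrowed as extra high bits, which gives 18 usable bits.

/* Sketch only -- same idea as the real encoding, simplified:
 * rec_len is 16 bits, entries are 4-byte aligned, so the low two
 * bits can carry two extra high bits.  18 bits total means record
 * lengths (and hence directory block sizes) only up to 2^18 = 256k.
 */
static inline unsigned int rec_len_from_disk(unsigned short dlen)
{
	unsigned int len = dlen;

	return (len & ~3U) | ((len & 3U) << 16);
}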

Regards,

-- Ted


2011-10-28 03:31:42

by Tao Ma

Subject: Re: bigalloc and max file size

Hi Ted,
On 10/28/2011 05:42 AM, Theodore Tso wrote:
>
> On Oct 27, 2011, at 11:08 AM, Andreas Dilger wrote:
>
>>>
>>> That may be true if the cluster size is 64k, but if the cluster size is 1MB, the requirement to zero out 1MB chunks each time a 4k block is written would be painful.
>>
>> But it should be up to the admin not to configure the filesystem in a foolish way like this. One wouldn't expect good performance with real 1MB block size and random 4kB writes either, so don't do that.
>
> Yes, but with the current bigalloc scheme, we don't have to zero out the whole 1MB cluster, and there are reasons why 1MB cluster sizes makes sense in some situations. Your variation would require the whole 1MB cluster to be zeroed, with the attendant performance hit, but I see that as a criticism of your proposed change, not of the intelligence of the system administrator. :-)
>
>> It's taken 3+ years to get an e2fsprogs release out with 64-bit blocksize support, and we can't wait a couple of weeks to see if there is an easy way to make bigalloc useful for large file sizes? Don't you think this would be a lot easier to fix now compared to e.g. having to create a new extent format or adding yet another feature that would allow the extents to specify either logical blocks or logical chunks?
>
> Well, in addition to the e2fsprogs 1.42-WIP being in Debian testing (as well as other community distro's like Arch and Gentoo), there's also the situation that we're in the middle of the merge window, and I have a whole stack of patches on top of the Bigalloc patches, some of which would have to be reworked if the bigalloc patches were to be yanked out. So removing the bigalloc patches before I push to Linus is going to be a bit of a bother (as well as violating our newly in-place rule that commits in between the dev and master branch heads could be mutated, but commits that are in the master branch were considered non-rewindable).
>
> One could argue that I could add a patch which disabled the bigalloc patch, and then make changes in the next merge window, but to be completely honest I have my own selfish reason for not wanting to do that, which is the bigalloc patches have also been integrated into Google's internal kernels already, and changing the bigalloc format without a new flag would make things complicated for me. Given that we decided to lock down the extent leaf format (even though I had wanted to make changes to it, for example to support a full 64-bit block number) in deference to the fact that it was in ClusterFS deployed kernels, there is precedent for taking into account the status of formats used in non-mainline kernels by the original authors of the feature.
>
>>> This is also a bit hand-wavy, but if we also can handle 64k directory blocks, then we could mount 64k block file systems as used in IA64/Power HPC systems on x86 systems, which would be really cool.
>>
>> At that point, would there be any value in using bigalloc at all? The one benefit I can see is that bigalloc would help the most common case of linear file writes (if the extent still stores the length in blocks instead of chunks) because it could track the last block written and only have to zero out the last block.
>
> Well, it would also have the benefit of sparse, random 4k writes into a file system with a large cluster size (going back to the discussion in the first paragraph).
>
> In general, the current bigalloc approach is more suited for very large cluster sizes (>> 64k), whereas using a block size > page size approach makes more sense in the 4k-64k range, especially since it provides better cross-architecture compatibility with large block size file systems that are already in existence today. Note too that the large block size approach completely tops out at 256k because of the dirent length encoding issue, where as with bigalloc we can support cluster sizes even larger than 1MB if that were to be useful for some storage scenarios.
Forgot to say: if we change the extent length to be in clusters, there is
also a good side effect. ;) Current bigalloc has a severe performance
regression in the following test case:
mount -t ext4 /dev/sdb1 /mnt/ext4
cp linux-3.0.tar.gz /mnt/ext4
cd /mnt/ext4
tar zxvf linux-3.0.tar.gz
umount /mnt/ext4

Now it takes more than 60 seconds in my SAS environment, while the old
solution takes only 20s. With the new cluster-based extent length, it is
also around 20s.

btw, Robin's work on the kernel part is almost finished except for delayed
allocation (the e2fsprogs part hasn't been started yet), and he told me
that a V1 will be sent out early next week.

Thanks
Tao

2011-10-30 05:28:15

by Coly Li

Subject: Re: bigalloc and max file size

On 2011-10-28 05:42, Theodore Tso wrote:
>
> On Oct 27, 2011, at 11:08 AM, Andreas Dilger wrote:
[snip]
> One could argue that I could add a patch which disabled the bigalloc patch, and then make changes in the next merge window, but to be completely honest I have my own selfish reason for not wanting to do that, which is the bigalloc patches have also been integrated into Google's internal kernels already, and changing the bigalloc format without a new flag would make things complicated for me. Given that we decided to lock down the extent leaf format (even though I had wanted to make changes to it, for example to support a full 64-bit block number) in deference to the fact that it was in ClusterFS deployed kernels, there is precedent for taking into account the status of formats used in non-mainline kernels by the original authors of the feature.
>
Hi Ted,

Forgive me if this is off topic.
In our test, allocating directories w/ bigalloc and w/o inline-data may occupy most of the disk space. Since ext4
inline-data is not merged yet, I am just wondering how Google uses bigalloc without the inline-data patch set?

Thanks.

--
Coly Li

2011-10-31 00:56:09

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 30, 2011, at 1:37 AM, Coly Li wrote:

> Forgive me if this is out of topic.
> In our test, allocating directories W/ bigalloc and W/O inline-data may occupy most of disk space. By now Ext4
> inline-data is not merged yet, I just wondering how Google uses bigalloc without inline-data patch set ?

It depends on how many directories you have (i.e., how deep your directory structure is) and how many small files you have in the file system as to whether bigalloc w/o inline-data has an acceptable overhead or not.

As I've noted before, for at least the last 7-8 years, and probably a decade, average seek times for 7200rpm drives have remained constant at 10ms, even as disk capacities have grown from 200GB in 2004 to 3TB in 2011. Yes, you can spin the platters faster, but the energy requirements go up with the square of the revolutions per minute, while the seek times only improve linearly; and so platter speeds don't get any faster than 15000rpm due to diminishing returns, and in fact some "green" drives only go at 5400rpm or even slower (interestingly enough, they tend not to advertise either the platter speed or the average seek time; funny, that…)

At 10ms per seek, that means that if the HDD isn't doing _anything_ else, it can do at most 100 seeks per second. Hence, if you have a workload where latency is at a premium, as disk capacities grow, disks are effectively getting slower for a given data set size. For example, in 2004, if you wanted to serve 5TB of data, you'd need 25 200GB disks, so you had 2500 random read/write operations per second at your disposal. In 2011, with 3TB disks, you'd have an order of magnitude fewer random operations when you only need to use 2 HDDs. (Yes, you could use flash, or flash-backed cache, but if the working set is really large this can get very expensive, so it's not a solution suitable for all situations.)

Another way of putting it is that if latency really matters and you have a random read/write workload, capacity management can become more about seeks than about the actual number of gigabytes. Hence, "wasting" space by using a larger cluster size may be a win if you are doing a large number of block allocations/deallocations, and memory pressure keeps on throwing the block bitmaps out of memory, so you have to keep seeking to read them back into memory. By using a large cluster size, we reduce fragmentation, and we reduce the number of block bitmaps, which makes them more likely to stay in memory.
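
To put rough numbers on the bitmap point (back-of-the-envelope only, assuming 4k blocks and a 1MB cluster size):

#include <stdio.h>

int main(void)
{
	unsigned long long blocksize = 4096;
	unsigned long long bits      = blocksize * 8;   /* bits in one bitmap block */
	unsigned long long cluster   = 1ULL << 20;      /* 1MB bigalloc cluster */

	/* how much space a single allocation bitmap block covers */
	printf("4k blocks:    %llu MB per bitmap block\n", bits * blocksize >> 20);  /* 128 */
	printf("1MB clusters: %llu GB per bitmap block\n", bits * cluster >> 30);    /* 32 */
	return 0;
}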

Furthermore, reducing the number of bitmap blocks makes it more tenable to pin them in memory, if there is a desire to guarantee that they stay in memory. (Dave Chinner was telling me that XFS manages its own metadata block lifespan, with its own shrinkers, instead of leaving it to the VM to decide when cached metadata gets ejected from memory. That might be worth doing at some point in ext4, but of course that would add complexity as well.)

The bottom line is that if you are seek constrained, wasting space by using a large cluster size may not be a huge concern. And if nearly all of your files are larger than 1MB, with many significantly larger, in-line data isn't going to help you a lot.

On the other hand, it may be that using a 128-byte inode is a bigger win than using a larger inode size and storing the data in the inode table. Using a small inode size reduces metadata I/O by doubling the number of inodes per block compared to a 256-byte inode, never mind a 1k or 4k inode. Hence, if you don't need extended attributes or ACLs or sub-second timestamp resolution, you might want to consider using 128-byte inodes as possibly being a bigger win than in-line data. All of this requires benchmarking with your specific workload, of course.

I'm not against your patch set, however; I just haven't had time to look at it at all (nor the secure delete patch set, etc.). Between organizing the kernel summit, the kernel.org compromise, and some high priority bugs at $WORK, things have just been too busy. Sorry for that; I'll get to them after the merge window and post-merge bug fixing is under control.

-- Ted

2011-10-31 09:25:25

by Coly Li

Subject: Re: bigalloc and max file size

On 2011-10-31 03:49, Theodore Tso wrote:
>
> On Oct 30, 2011, at 1:37 AM, Coly Li wrote:
>
>> Forgive me if this is out of topic.
>> In our test, allocating directories W/ bigalloc and W/O inline-data may occupy most of disk space. By now Ext4
>> inline-data is not merged yet, I just wondering how Google uses bigalloc without inline-data patch set ?
>
> It depends on how many directories you have (i.e, how deep your directory structure is) and how many small files you have in the file system as to whether bigalloc w/o inline-data has an acceptable overhead or not.
[snip]
> I'm not against your patch set, however; I just haven't had time to look at them, at all (nor the secure delete patch set, etc.) . Between organizing the kernel summit, the kernel.org compromise, and some high priority bugs at $WORK, things have just been too busy. Sorry for that; I'll get to them after the merge window and post-merge bug fixing is under control.

Hi Ted,

In our test, bigalloc without inline-data does not work very well with deep directory structures, e.g. Hadoop or Squid,
because small directories occupy all the disk space. That's why I asked the question. Thanks for your patient reply; it
makes sense to me :-)

Back to our topic: ext4 doesn't have many on-disk incompatible flag bits left now. If we get the current bigalloc code merged
now, we will have to use another incompatible bit when we merge the cluster/chunk-based extent patch set. Furthermore, we
observe a performance regression without cluster-based extents on file system umount (as Tao mentioned in this thread).
IMHO, without inline-data and cluster-based extents, the current bigalloc code is a little bit imperfect for many users.

Bigalloc is a very useful feature; can we consider making it better before it gets merged?

Thanks.
--
Coly Li

2011-10-31 10:15:27

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 27, 2011, at 11:31 PM, Tao Ma wrote:

> Forget to say, if we increase the extent length to be cluster, there are
> also a good side effect. ;) Current bigalloc has a severe performance
> regression in the following test case:
> mount -t ext4 /dev/sdb1 /mnt/ext4
> cp linux-3.0.tar.gz /mnt/ext4
> cd /mnt/ext4
> tar zxvf linux-3.0.tar.gz
> umount /mnt/ext4

I've been traveling, so I haven't had a chance to test this, but it makes no sense that changing the encoding of the extent length would change the performance of the forced writeback caused by umount. There may be a performance bug that we should fix, or one that may have been fixed by accident with the extent encoding change.

Have you investigated why this got better when you changed the meaning of the extent length field? It makes no sense that such a format change would have such an impact…

-- Ted

2011-10-31 10:22:28

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 31, 2011, at 5:35 AM, Coly Li wrote:

>
> Back to our topic, Ext4 doesn't have too much on-disk incompatible flag-bits now. If we get current bigalloc code merged now, we have to use another incompatible bit when we merge cluster/chunk based extent patch set.

What is the appeal to you of the cluster/chunk-based extent patch set? I'm not sure I understand why it's so interesting to you in the first place. Ext4's RAID support isn't particularly good, and its sweet spot really is for single disk file systems. And for cluster file systems, such as when you might build Hadoop on top of ext4, there's no real advantage of using RAID arrays as opposed to having single file systems on each disk. In fact, due to the speed of being able to check multiple disk spindles in parallel, it's advantageous to build cluster file systems on single disk file systems.

I'm just curious what your use case is, because that tends to drive design decisions in subtle ways.

Regards,

-- Ted


2011-10-31 10:27:47

by Tao Ma

Subject: Re: bigalloc and max file size

On 10/31/2011 06:15 PM, Theodore Tso wrote:
>
> On Oct 27, 2011, at 11:31 PM, Tao Ma wrote:
>
>> Forget to say, if we increase the extent length to be cluster, there are
>> also a good side effect. ;) Current bigalloc has a severe performance
>> regression in the following test case:
>> mount -t ext4 /dev/sdb1 /mnt/ext4
>> cp linux-3.0.tar.gz /mnt/ext4
>> cd /mnt/ext4
>> tar zxvf linux-3.0.tar.gz
>> umount /mnt/ext4
>
> I've been traveling, so I haven't had a chance to test this, but it makes no sense that changing the encoding of the extent length would change the performance of the forced writeback caused by umount. There may be a performance bug that we should fix, or one that may have been fixed by accident with the extent encoding change.
>
> Have you investigated why this got better when you changed the meaning of the extent length field? It makes no sense that such a format change would have such an impact…
OK, so let me explain why the cluster-based extent length works.

In the new bigalloc case, with a chunk size of 64k and the linux-3.0
source, every file is allocated a chunk, but the writes aren't contiguous
if we only write the first 4k bytes of each chunk. In this case, writeback
and the block layer below can't merge all the requests sent by ext4, and in
our test case the total I/O count is around 20000. With the cluster-based
extent length, we have to zero the whole cluster, so from the upper layer's
point of view we write more bytes, but from the block layer's point of view
the writes are contiguous and can be merged into big ones. In our test, it
only does around 2000 I/Os. So it helps this test case.
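
Just to illustrate the merging effect, a throwaway calculation (the file count and merged request size below are hypothetical, chosen only to mirror the rough ratio we measured, not taken from the actual trace):

#include <stdio.h>

int main(void)
{
	unsigned long long files   = 20000;         /* hypothetical small-file count */
	unsigned long long cluster = 64 * 1024;     /* 64k bigalloc cluster */
	unsigned long long merged  = 512 * 1024;    /* hypothetical merged request size */

	/* sparse case: one isolated 4k write per file, nothing to merge */
	printf("sparse 4k writes:     ~%llu requests\n", files);

	/* whole-cluster case: neighbouring clusters are physically contiguous,
	 * so the block layer can merge them into larger requests */
	printf("whole-cluster writes: ~%llu requests\n", files * cluster / merged);
	return 0;
}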

Thanks
Tao

2011-10-31 16:07:28

by Andreas Dilger

Subject: Re: bigalloc and max file size

On 2011-10-31, at 4:22 AM, Theodore Tso <[email protected]> wrote:

> For cluster file systems, such as when you might build Hadoop on top of ext4, there's no real advantage of using RAID arrays as opposed to having single file systems on each disk. In fact, due to the specd of being able to check multiple disk spindles in parallel, it's advantageous to build cluster file systems on single disk file systems.

For Lustre at least there are a number of reasons why it uses large RAID devices to store the data instead of many small devices:
- fewer devices that need to be managed. Lustre runs on systems with more than 13000 drives, and having to manage connection state for that many internal devices is a lot of overhead.

2011-10-31 16:22:28

by Theodore Ts'o

Subject: Re: bigalloc and max file size

On Mon, Oct 31, 2011 at 10:08:20AM -0600, Andreas Dilger wrote:
> On 2011-10-31, at 4:22 AM, Theodore Tso <[email protected]> wrote:
> > For cluster file systems, such as when you might build Hadoop on top
> > of ext4, there's no real advantage of using RAID arrays as opposed
> > to having single file systems on each disk. In fact, due to the
> > specd of being able to check multiple disk spindles in parallel,
> > it's advantageous to build cluster file systems on single disk
> > file systems.
>
> For Lustre at least there are a number of reasons why it uses large
> RAID devices to store the data instead of many small devices: -
> fewer devices that need to be managed. Lustre runs on systems with
> more than 13000 drives, and having to manage connection state for
> that many internal devices is a lot of overhead.

Well, per the discussion on the ext4 call, with Lustre hardware
multiple RAID LUNs get used, so while they might have tens of
petabytes of data, it is still split across a thousand hardware LUNs
or so. So there is a middle ground between "put all of your 13000
devices on a single hardware RAID LUN", and "use 13000 file systems".
And in that middle ground, it seems surprising that someone would be
bumping into the 1EB file system limit offered by ext4.

I'm curious why TaoBao is so interested in changing the extent
encoding for bigalloc file systems. Currently we can support up to 1
EB worth of physical block numbers, and 16TB of logical block numbers.
Are you concerned about bumping into the 1 EB file system limit? Or
the 16 TB file size limit? Or something else?

Regards,

- Ted

2011-10-31 16:33:08

by Andreas Dilger

Subject: Re: bigalloc and max file size

On 2011-10-31, at 10:08 AM, Andreas Dilger <[email protected]> wrote:
> On 2011-10-31, at 4:22 AM, Theodore Tso <[email protected]> wrote:
>
>> For cluster file systems, such as when you might build Hadoop on top of ext4, there's no real advantage of using RAID arrays as opposed to having single file systems on each disk. In fact, due to the specd of being able to check multiple disk spindles in parallel, it's advantageous to build cluster file systems on single disk file systems.
>
> For Lustre at least there are a number of reasons why it uses large RAID devices to store the data instead of many small devices:
> - fewer devices that need to be managed at the Filesystem level. Lustre runs on systems with more than 13000 drives, and having to manage connection state for that many internal devices is a lot of overhead.

Doh, hit send too soon...

- reduced complexity of filesystem allocation decisions with fewer large LUNs vs many smaller LUNs
- reduced free space and file fragmentation with fewer large LUNs, since the block allocator for each LUN has more blocks to choose from
- sysadmin of so many unique devices is difficult, while clustering them into RAID sets with hardware management features (we call this blinkenlights) makes this tractable compared to software RAID on generic hardware.
- performance management of the RAID hardware can detect and mask individual drives that are slow compared to others in that RAID set, which is much harder if each drive is treated individually

These reasons don't apply to all cluster filesystems, but I thought I'd chime in on why we use large LUNs even though we could also handle more smaller LUNs.

Cheers, Andreas


2011-10-31 17:30:07

by Coly Li

Subject: Re: bigalloc and max file size

On 2011-11-01 00:22, Ted Ts'o wrote:
> On Mon, Oct 31, 2011 at 10:08:20AM -0600, Andreas Dilger wrote:
>> On 2011-10-31, at 4:22 AM, Theodore Tso <[email protected]> wrote:
[snip]
> I'm curious why TaoBao is so interested in changing the extent
> encoding for bigalloc file systems. Currently we can support up to 1
> EB worth of physical block numbers, and 16TB of logical block numbers.
> Are you concerned about bumping into the 1 EB file system limit? Or
> the 16 TB file size limit? Or something else?
>
In some applications, we allocate a big file which occupies most of the space of a file system, while the file system is
built on (expensive) SSD. In such a configuration, we want fewer blocks allocated for the inode table and bitmaps. If the
max extent length could be much bigger, there is a chance to have many fewer block groups, which leaves more blocks for
regular files. The current bigalloc code already does well, but there is still a chance to do better. The sys-admin team
believes cluster-based extents can help ext4 consume almost as little metadata as a raw disk does, both on disk and in
memory, and gain almost as many available data blocks as a raw disk does, too. This is a small number on one single SSD,
but in our cluster environment, this effort can help save a noticeable amount of capex.

Furthermore, consider HDFS with 128MB data block files and a file system formatted with 1MB bigalloc clusters. In the worst
case, only one extent block read is needed to access a 128MB data block file. (However, this point is about a chunk size of
more than 64K, and is not compulsory for cluster-based extents.)
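
For reference, a rough sketch of the per-extent coverage arithmetic (assuming 4KB blocks and the 32768-block cap on an initialized extent's length):

#include <stdio.h>

int main(void)
{
	unsigned long long max_len   = 32768;        /* max length of an initialized extent */
	unsigned long long blocksize = 4096;
	unsigned long long cluster   = 1ULL << 20;   /* 1MB bigalloc cluster */

	/* one extent today covers at most 32768 * 4KB = 128MB of a file */
	printf("block-based extent covers up to:   %llu MB\n", max_len * blocksize >> 20);

	/* in cluster units it could cover 32768 * 1MB = 32GB */
	printf("cluster-based extent covers up to: %llu GB\n", max_len * cluster >> 30);
	return 0;
}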

With inline-data and cluster-based extents added to bigalloc, we get closer to the above goal.

P.S. As I finished typing this email, I found that Andreas also explained similar reasons in his email, much more simply and
clearly :-)
--
Coly Li

2011-10-31 18:54:24

by Sunil Mushran

Subject: Re: bigalloc and max file size

On 10/31/2011 03:27 AM, Tao Ma wrote:
> On 10/31/2011 06:15 PM, Theodore Tso wrote:
>> On Oct 27, 2011, at 11:31 PM, Tao Ma wrote:
>>
>>> Forget to say, if we increase the extent length to be cluster, there are
>>> also a good side effect. ;) Current bigalloc has a severe performance
>>> regression in the following test case:
>>> mount -t ext4 /dev/sdb1 /mnt/ext4
>>> cp linux-3.0.tar.gz /mnt/ext4
>>> cd /mnt/ext4
>>> tar zxvf linux-3.0.tar.gz
>>> umount /mnt/ext4
>> I've been traveling, so I haven't had a chance to test this, but it makes no sense that changing the encoding of the extent length would change the performance of the forced writeback caused by umount. There may be a performance bug that we should fix, or one that may have been fixed by accident with the extent encoding change.
>>
>> Have you investigated why this got better when you changed the meaning of the extent length field? It makes no sense that such a format change would have such an impact…
> OK, so let me explain why the big cluster length works.
>
> In the new bigalloc case if chunk size=64k, and with the linux-3.0
> source, every file will be allocated a chunk, but they aren't contiguous
> if we only write the 1st 4k bytes. In this case, writeback and the block
> layer below can't merge all the requests sent by ext4. And in our test
> case, the total io will be around 20000. While with the cluster size, we
> have to zero the whole cluster. From the upper point of view. we have to
> write more bytes. But from the block layer, the write is contiguous and
> it can merge them to be a big one. In our test, it will only do around
> 2000 ios. So it helps the test case.

Am I missing something, or can you not zero the entire cluster, because
block_write_full_page() drops pages past i_size?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5693486bad2bc2ac585a2c24f7e2f3964b478df9

2011-10-31 19:09:27

by Andreas Dilger

Subject: Re: bigalloc and max file size

On 2011-10-31, at 12:53 PM, Sunil Mushran wrote:
> On 10/31/2011 03:27 AM, Tao Ma wrote:
>> OK, so let me explain why the big cluster length works.
>>
>> In the new bigalloc case if chunk size=64k, and with the linux-3.0
>> source, every file will be allocated a chunk, but they aren't contiguous
>> if we only write the 1st 4k bytes. In this case, writeback and the block
>> layer below can't merge all the requests sent by ext4. And in our test
>> case, the total io will be around 20000. While with the cluster size, we
>> have to zero the whole cluster. From the upper point of view. we have to
>> write more bytes. But from the block layer, the write is contiguous and
>> it can merge them to be a big one. In our test, it will only do around
>> 2000 ios. So it helps the test case.
>
> Am I missing something but you cannot zero the entire cluster because
> block_write_full_page() drops pages past i_size.
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=5693486bad2bc2ac585a2c24f7e2f3964b478df9

With ext4_ext_zeroout->blkdev_issue_zeroout() it submits the zeroing
request directly to the block layer (with cloned ZERO_PAGE pages) and
skips the VM entirely.

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.




2011-10-31 19:38:42

by Theodore Ts'o

Subject: Re: bigalloc and max file size

On Tue, Nov 01, 2011 at 01:39:34AM +0800, Coly Li wrote:
> In some application, we allocate a big file which occupies most space of a file system, while the file system built on
> (expensive) SSD. In such configuration, we want less blocks allocated for inode table and bitmap. If the max extent
> length could be much big, there is chance to have much less block groups, which results more blocks for regular file.
> Current bigalloc code does well already, but there is still chance to do better. The sys-admin team believe
> cluster-based-extent can help Ext4 to consume as less meta data memory as raw disk does, and gain as more available data
> blocks as raw disks does, too. This is a small number on one single SSD, but in our cluster environment, this effort can
> help to save a recognized amount of capex.

OK, but you're not running into the 16TB file size limitation, are
you? That would be a lot of SSDs. I assume the issue then is you
want to minimize the number of extents, limited by the 15-bit extent
length field?

What cluster size are you thinking about? And how do you plan to
initialize it? Via fallocate, or by explicitly writing zeros to the
whole file (so all of the blocks are marked as initialized)? Is it
going to be a sparse file?

- Ted

2011-10-31 20:00:57

by Theodore Ts'o

Subject: Re: bigalloc and max file size

On Mon, Oct 31, 2011 at 06:27:25PM +0800, Tao Ma wrote:
> In the new bigalloc case if chunk size=64k, and with the linux-3.0
> source, every file will be allocated a chunk, but they aren't contiguous
> if we only write the 1st 4k bytes. In this case, writeback and the block
> layer below can't merge all the requests sent by ext4. And in our test
> case, the total io will be around 20000. While with the cluster size, we
> have to zero the whole cluster. From the upper point of view. we have to
> write more bytes. But from the block layer, the write is contiguous and
> it can merge them to be a big one. In our test, it will only do around
> 2000 ios. So it helps the test case.

This is a test case, then, where there are a lot of sub-64k files, and so
the system administrator would be ill-advised to use a 64k bigalloc
cluster size in the first place. So I don't really consider that a
strong argument; in fact, if the block device is an SSD or a
thin-provisioned device with an allocation size smaller than the
cluster size, the behaviour you describe would in fact be detrimental,
not a benefit.

In the case of a hard drive where seeks are expensive relative to
small writes, this is something which we could do (zero out the whole
cluster) with the current bigalloc file system format. I could
imagine trying to turn this on automatically with a heuristic, but
since we can't know the underlying allocation size of a
thin-provisioned block device, that would be tricky at best...

Regards,

- Ted

2011-11-01 01:01:04

by Coly Li

Subject: Re: bigalloc and max file size

On 2011-11-01 03:38, Ted Ts'o wrote:
> On Tue, Nov 01, 2011 at 01:39:34AM +0800, Coly Li wrote:
>> In some application, we allocate a big file which occupies most space of a file system, while the file system built on
>> (expensive) SSD. In such configuration, we want less blocks allocated for inode table and bitmap. If the max extent
>> length could be much big, there is chance to have much less block groups, which results more blocks for regular file.
>> Current bigalloc code does well already, but there is still chance to do better. The sys-admin team believe
>> cluster-based-extent can help Ext4 to consume as less meta data memory as raw disk does, and gain as more available data
>> blocks as raw disks does, too. This is a small number on one single SSD, but in our cluster environment, this effort can
>> help to save a recognized amount of capex.
>
> OK, but you're not running into the 16TB file size limitation, are
> you?
No, we are not running into that problem.

> That would be a lot of SSD's.
Yes, IMHO that's a lot of SSDs.

> I assume the issue then is you
> want to minimize the number of extents, limited by the 15-bit extent
> length field?
Not only extents; we also want to minimize inode table blocks and bitmap blocks.

>
> What cluster size are you thinking about?
Currently we are testing a 1MB cluster size. The ideal extreme configuration (for one use case) is that there is only one
block group in the whole file system. (In this use case) we are willing to try the biggest possible cluster size if we are able to.

> And how do you plan to
> initialize it? Via fallocate, or by explicitly writing zeros to the
> whole file (so all of the blocks are marked as initialized)? Is it
> going to be a sparse file?
In the application I mentioned above, the file is allocated by fallocate(2). Writing to the file is an appending write in
8MB chunks; when writing reaches the end of the file, it goes back to the beginning of the file and continues writing.

Thanks.
--
Coly Li

2011-11-01 04:06:16

by Tao Ma

Subject: Re: bigalloc and max file size

On 11/01/2011 04:00 AM, Ted Ts'o wrote:
> On Mon, Oct 31, 2011 at 06:27:25PM +0800, Tao Ma wrote:
>> In the new bigalloc case if chunk size=64k, and with the linux-3.0
>> source, every file will be allocated a chunk, but they aren't contiguous
>> if we only write the 1st 4k bytes. In this case, writeback and the block
>> layer below can't merge all the requests sent by ext4. And in our test
>> case, the total io will be around 20000. While with the cluster size, we
>> have to zero the whole cluster. From the upper point of view. we have to
>> write more bytes. But from the block layer, the write is contiguous and
>> it can merge them to be a big one. In our test, it will only do around
>> 2000 ios. So it helps the test case.
>
> This is test case then where there are lot of sub-64k files, and so
> the system administrator would be ill-advised to use a 64k bigalloc
> cluster size in the first place. So don't really consider that a
> strong argument; in fact, if the block device is a SSD or a
> thin-provisioned device with an allocation size smaller than the
> cluster size, the behaviour you describe would in fact be detrimental,
> not a benefit.
OK, actually the above test case is more natural if we replace umount
with sync, and I guess this is a common case for a normal desktop
user. Even without sync, the disk utilization will be very high. As SSDs
aren't yet popular in normal users' environments, I would imagine more
people will complain about it when bigalloc gets merged.
>
> In the case of a hard drive where seeks are expensive relative to
> small writes, this is something which we could do (zero out the whole
> cluster) with the current bigalloc file system format. I could
> imagine trying to turn this on automatically with a hueristic, but
> since we can't know the underlying allocation size of a
> thin-provisioned block device, that would be tricky at best...
OK, if we decide to leave the extent length in units of blocks, we can
do something tricky like cfq does and read the rotational flag of the
underlying device. It is a bit of a pain, but we have to handle it, as I
mentioned above.

Thanks
Tao

2011-11-01 11:48:02

by Theodore Ts'o

Subject: Re: bigalloc and max file size


On Oct 31, 2011, at 9:10 PM, Coly Li wrote:
>
>> I assume the issue then is you
>> want to minimize the number of extents, limited by the 15-bit extent
>> length field?
> Not only extents, but also minimize inode table blocks, bitmap blocks.


So this makes no sense to me. Bigalloc doesn't have any effect on the number of inode table blocks, and while it certainly shrinks the number of block allocation bitmap blocks, changing the extent tree format has no effect on the number of bitmap blocks.

Perhaps I'm not making myself clear. I was trying to figure out the basis for the desire to use units of clusters for the extent length in the extent tree block. Is that the reason you were so interested in this change of the bigalloc format? So you could have a smaller extent tree?

>>
>> What cluster size are you thinking about?
> Currently we test 1MB cluster size. The extreme ideal configuration (of one use case) is, there is only one block group
> on the whole file system. (In this use case) we are willing to try biggest possible cluster size if we are able to.

This is where you have a single file which is nearly as big as the entire file system? In that case, why are you using an ext4 file system at all? Why not just use a raw partition instead, plus an auxiliary partition for the smaller files?

I'm not being critical; I'm just trying to understand your use case and constraints.

Regards,

-- Ted


2011-11-01 12:12:24

by Coly Li

Subject: Re: bigalloc and max file size

On 2011-11-01 19:47, Theodore Tso wrote:
>
> On Oct 31, 2011, at 9:10 PM, Coly Li wrote:
>>
>>> I assume the issue then is you
>>> want to minimize the number of extents, limited by the 15-bit extent
>>> length field?
>> Not only extents, but also minimize inode table blocks, bitmap blocks.
>
>
> So this makes no sense to me. Bigalloc doesn't have any effect on the number of inode table blocks, and while it certainly shrinks the number block allocation bitmap blocks, changing the extent tree format has no effect on the number of bitmap blocks.
>

In mkfs.ext4, even with -N 16, there are still many more inodes allocated, because for each block group some inode table
blocks have to be allocated. With bigalloc, the same file system size may have fewer block groups, which results in fewer
inode table blocks.
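
A back-of-the-envelope sketch of the effect (the 1TB size, 4KB block size and 256-byte inode size are assumptions for illustration only; mke2fs details vary):

#include <stdio.h>

int main(void)
{
	unsigned long long fs_bytes  = 1ULL << 40;   /* assume a 1TB file system */
	unsigned long long blocksize = 4096;
	unsigned long long inodesize = 256;

	/* a block group covers 8 * blocksize allocation units (one bitmap block) */
	unsigned long long grp_4k = fs_bytes / (8 * blocksize * blocksize);      /* 4KB blocks */
	unsigned long long grp_1m = fs_bytes / (8 * blocksize * (1ULL << 20));   /* 1MB clusters */

	/* each group carries at least one inode table block (16 inodes here),
	 * no matter how small -N is */
	printf("4KB blocks:   %llu groups, >= %llu inodes\n",
	       grp_4k, grp_4k * (blocksize / inodesize));
	printf("1MB clusters: %llu groups, >= %llu inodes\n",
	       grp_1m, grp_1m * (blocksize / inodesize));
	return 0;
}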

>>>
>>> What cluster size are you thinking about?
>> Currently we test 1MB cluster size. The extreme ideal configuration (of one use case) is, there is only one block group
>> on the whole file system. (In this use case) we are willing to try biggest possible cluster size if we are able to.
>
> This is where you have a single file which is nearly as big as the entire file system? In that case, why are you using an ext4 file system at all? Why not just use a raw partition instead, plus an auxiliary partition for the smaller files?
>

To make management simple and easy, the sys-admin team tends to use standard Linux tools to manage the data, so a raw disk
format is the last choice. Yes, your suggestion makes sense, and there is an open source project using raw disk for a
similar purpose, but it takes quite a long time to deploy another stable system to online servers ...

> I'm not being critical; I'm just trying to understand your use case and constraints.
>

Let me try to explain in one line why we are interested in bigalloc: we try our best to minimize all kinds of metadata
space, both on disk and in memory.

Thanks.
--
Coly Li