2019-10-25 11:51:52

by Artem Blagodarenko

Subject: 1024TB Ext4 partition

Hello,

Lustre FS successfully uses LDISKFS (ext4) partitions with sizes near 512TB.
This 512TB is the current "verified limit", meaning that we do not expect any
trouble in production with partitions of that size.

Our new challenge is 1024TB, because the hardware now allows us to assemble such a partition.
The question is: do you know of any possible issues with ext4 (and e2fsprogs) for partitions this large?

I know about these possible problems:
1. E2fsck is too slow, but a parallel e2fsck project is being developed by Li Xi.
2. Reading the block groups takes a lot of time. We have fixes for special cases like e2label.
Bigalloc also allows the metadata size to be decreased, but sometimes meta_bg is preferable.
3. An aged filesystem and an allocator that scans all groups to find a good group. There is a solution, but it has some issues.
4. The 32-bit inode counter. Not a problem for Lustre FS users, who prefer to use DNE for inode scaling,
but somebody may want to store a lot of inodes on the same partition. That project was not finished;
it looks like nobody requires it now. (A rough inode-count estimate is sketched after this list.)
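
To put item 4 in perspective, a rough back-of-the-envelope sketch (treating
1024TB as 1PiB; the bytes-per-inode ratios below are only illustrative, not
values taken from any particular mke2fs.conf):

    FS_SIZE = 2**50                 # 1024TB, treated as 1PiB here
    INODE_LIMIT = 2**32             # 32-bit inode count field in the superblock

    for bytes_per_inode in (16 << 10, 64 << 10, 256 << 10, 1 << 20):
        wanted = FS_SIZE // bytes_per_inode
        usable = min(wanted, INODE_LIMIT)
        print("%4d KiB/inode: wanted %d inodes, usable %d"
              % (bytes_per_inode >> 10, wanted, usable))

    # Anything finer than 256 KiB per inode runs into the 32-bit counter,
    # so the counter, not the disk, limits the inode count at this size.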


Could you please point me to other possible problems?

Thanks.
Artem Blagodarenko.



2019-11-01 01:37:32

by Andreas Dilger

Subject: Re: 1024TB Ext4 partition

On Oct 24, 2019, at 7:23 AM, Благодаренко Артём <[email protected]> wrote:
>
> Lustre FS successfully uses LDISKFS (ext4) partitions with sizes near 512TB.
> This 512TB is the current "verified limit", meaning that we do not expect
> any trouble in production with partitions of that size.
>
> Our new challenge is 1024TB, because the hardware now allows us to assemble
> such a partition. The question is: do you know of any possible issues with
> ext4 (and e2fsprogs) for partitions this large?

Hi Artem, thanks for bringing this up on the list. We've also seen that
with large declustered parity RAID arrays there is a need for 40+ disks
in the array to get good rebuild performance. With 12TB disks this pushes
the LUN size to 430TB, and next year beyond 512TB with 16TB disks, so this
is an upcoming issue.
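
A quick sanity check of those sizes (the ~10% parity/spare overhead below is
only an assumption; the real overhead depends on the array geometry):

    def usable_tb(disks, disk_tb, overhead=0.10):
        # Approximate usable LUN capacity after parity/spare overhead.
        return disks * disk_tb * (1 - overhead)

    print(usable_tb(40, 12))   # ~432 TB with 12TB disks
    print(usable_tb(40, 16))   # ~576 TB with 16TB disks, already past 512TB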

> I know about these possible problems:
> 1. E2fsck is too slow, but a parallel e2fsck project is being developed by Li Xi.

Right. This is so far only adding parallelism to the pass1 inode table scan.
In our testing the majority of the time is spent in pass1 (3879s of 3959s for
a 25% full 1PB fs, see https://jira.whamcloud.com/browse/LU-8465 for more
details), so this phase is both the easiest to optimize and gives the most
overall improvement.

> 2. Reading the block groups takes a lot of time. We have fixes for special
> cases like e2label. Bigalloc also allows the metadata size to be decreased,
> but sometimes meta_bg is preferable.

For large filesystems (over 256TB) I think meta_bg is always required, as the
GDT is larger than a single block group. However, I guess with bigalloc it
is possible to avoid meta_bg, since the block group size also increases by a
factor of the chunk size. That means a 1PiB filesystem could avoid meta_bg
if it uses a bigalloc chunk size of 16KB or larger.
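
The arithmetic behind that, as a rough sketch (assuming 4KB blocks, 64-byte
group descriptors with the 64bit feature, and 32768 blocks or clusters per
group; it ignores reserved GDT blocks):

    BLOCK = 4096                            # bytes per block
    DESC = 64                               # bytes per group descriptor (64bit)
    PER_GROUP = 32768                       # blocks, or clusters with bigalloc

    def needs_meta_bg(fs_bytes, cluster=BLOCK):
        groups = fs_bytes // (PER_GROUP * cluster)
        gdt_blocks = (groups * DESC + BLOCK - 1) // BLOCK
        group_blocks = PER_GROUP * (cluster // BLOCK)
        return gdt_blocks > group_blocks    # GDT no longer fits in one group

    PiB = 2**50
    print(needs_meta_bg(256 * 2**40))          # False: 256TiB just fits
    print(needs_meta_bg(257 * 2**40))          # True:  anything larger needs meta_bg
    print(needs_meta_bg(PiB))                  # True:  1PiB with 4KB clusters
    print(needs_meta_bg(PiB, cluster=16384))   # False: 16KB bigalloc clusters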

> 3. An aged filesystem and an allocator that scans all groups to find a good
> group. There is a solution, but it has some issues.

It would be nice to see this allocator work included in upstream ext4.

> 4. The 32-bit inode counter. Not a problem for Lustre FS users, who prefer
> to use DNE (distributed namespace) for inode scaling, but somebody may want
> to store a lot of inodes on the same partition. That project was not
> finished; it looks like nobody requires it now.
>
> Could you please point me to other possible problems?

While it is not specifically a problem with ext4, as Lustre OSTs get larger
they will create a large number of objects in each directory, which hurts
performance as each htree level is added (at about 100k, 1M, and 10M entries).
To avoid the need for such large directories, it would be useful to reduce
the number of objects the MDS creates per directory, which can be done
in Lustre (https://jira.whamcloud.com/browse/LU-11912) by creating a series
of directories over time.
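
As a purely illustrative sketch of that idea (not the actual LU-11912
implementation, whose policy lives on the MDS side and will differ), new
objects could be bucketed into one directory per time window:

    import time

    def dir_for_new_object(base, window_secs=3600):   # window length is an assumption
        # One directory per time window; older directories stop growing, so
        # they can age out of cache once their objects are deleted.
        bucket = int(time.time()) // window_secs
        return "%s/t%d" % (base, bucket)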

Splitting up object creation by age also has the benefit of segregating the
objects so that a whole directory can drop out of cache once its objects age,
rather than just distributing all objects over a larger number of directories.
Using a larger number of directories would only increase the cache footprint
and result in more random IO (i.e. one create/unlink per directory leaf block).


This would also benefit from the directory shrinking code that was posted
a couple of times to the list, but has not yet landed. As old directories
have their objects deleted, they eventually shrink from millions of objects
down to a few hundred, and with this feature the directory blocks will
shrink as well.

Cheers, Andreas





