2017-03-21 20:08:03

by Manish Katiyar

Subject: ext4 scaling limits ?

Hi,

I was looking at the e2fsck code to see if there are any limits on running
e2fsck on large ext4 filesystems. From the code it looks like all the
required metadata is kept only in memory while e2fsck is running, and
is flushed to disk only when the corresponding corrections are made
(except in the undo file case).
There doesn't seem to be a code path where we have to periodically
flush some of the tracking metadata while e2fsck is running just because
we have accumulated too much in-core tracking data and may run out of
memory (it looks like the code will simply return a failure if
ext2fs_get_mem() returns a failure).
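
For illustration, the allocation pattern I'm referring to looks roughly
like this (just a sketch, not an actual e2fsck call site; the function
name here is made up):

#include <string.h>
#include <ext2fs/ext2fs.h>

/*
 * Rough sketch only: ext2fs_get_mem() is a thin malloc() wrapper that
 * returns EXT2_ET_NO_MEMORY on failure, and callers propagate that
 * error upward rather than spilling tracking data to disk.
 */
static errcode_t alloc_tracking_map(unsigned long nblocks, char **map)
{
	errcode_t retval;

	retval = ext2fs_get_mem(nblocks, map);	/* ~1 byte per block */
	if (retval)
		return retval;		/* no fallback; fsck just gives up */
	memset(*map, 0, nblocks);
	return 0;
}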

I'd appreciate it if someone could confirm that my understanding is correct.

Thanks -
Manish


2017-03-21 21:48:13

by Andreas Dilger

Subject: Re: ext4 scaling limits ?

While it is true that e2fsck does not free memory during operation, in
practice this is not a problem. Even for large filesystems (say 32-48TB)
it will only use around 8-12GB of RAM so that is very reasonable for a
server today.

The rough estimate that I use for e2fsck is 1 byte of RAM per block.
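
For example, assuming 4 KB blocks, a 48 TB filesystem works out to:

    48 TB / 4 KB per block          = ~12 * 10^9 blocks
    12 * 10^9 blocks * 1 byte/block = ~12 GB of RAM

which matches the 8-12GB range above (32 TB giving roughly 8 GB).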

Cheers, Andreas

> On Mar 21, 2017, at 16:07, Manish Katiyar <[email protected]> wrote:
>
> Hi,
>
> I was looking at the e2fsck code to see if there are any limits on running
> e2fsck on large ext4 filesystems. From the code it looks like all the
> required metadata is kept only in memory while e2fsck is running, and
> is flushed to disk only when the corresponding corrections are made
> (except in the undo file case).
> There doesn't seem to be a code path where we have to periodically
> flush some of the tracking metadata while e2fsck is running just because
> we have accumulated too much in-core tracking data and may run out of
> memory (it looks like the code will simply return a failure if
> ext2fs_get_mem() returns a failure).
>
> I'd appreciate it if someone could confirm that my understanding is correct.
>
> Thanks -
> Manish

2017-03-21 22:18:03

by Reindl Harald

Subject: Re: ext4 scaling limits ?



On 21.03.2017 at 22:48, Andreas Dilger wrote:
> While it is true that e2fsck does not free memory during operation, in
> practice this is not a problem. Even for large filesystems (say 32-48TB)
> it will only use around 8-12GB of RAM so that is very reasonable for a
> server today.

No, it's not reasonable, even today, to expose your physical machine's
entire RAM to just one of many virtual machines that is only running a
Samba server for a 50 TB "datagrave" with a handful of users.

In reality it should not be a problem to attach even 100 TB of storage
to a VM with 1-2 GB of RAM.

> The rough estimate that I use for e2fsck is 1 byte of RAM per block.
>
> Cheers, Andreas

2017-03-21 23:29:19

by Manish Katiyar

Subject: Re: ext4 scaling limits ?

On Tue, Mar 21, 2017 at 2:59 PM, Reindl Harald <[email protected]> wrote:
>
>
> On 21.03.2017 at 22:48, Andreas Dilger wrote:
>>
>> While it is true that e2fsck does not free memory during operation, in
>> practice this is not a problem. Even for large filesystems (say 32-48TB)
>> it will only use around 8-12GB of RAM so that is very reasonable for a
>> server today.
>
>
> No, it's not reasonable, even today, to expose your physical machine's
> entire RAM to just one of many virtual machines that is only running a
> Samba server for a 50 TB "datagrave" with a handful of users.
>
> In reality it should not be a problem to attach even 100 TB of storage
> to a VM with 1-2 GB of RAM.
>

Thanks, Andreas, for confirming.

If I understand correctly, the theoretical limit is really (RAM +
available swap space), right? It should only hurt if we aren't able to
page anything out to swap?


Thanks -
Manish




2017-03-23 03:35:13

by Theodore Ts'o

Subject: Re: ext4 scaling limits ?

On Tue, Mar 21, 2017 at 05:48:11PM -0400, Andreas Dilger wrote:
> While it is true that e2fsck does not free memory during operation, in
> practice this is not a problem. Even for large filesystems (say 32-48TB)
> it will only use around 8-12GB of RAM so that is very reasonable for a
> server today.

E2fsck does free memory during operation; see the comments in front of
pass 2 and pass 3 for example:

* Pass 2 also collects the following information:
* - The inode numbers of the subdirectories for each directory.
*
* Pass 2 relies on the following information from previous passes:
* - The directory information collected in pass 1.
* - The inode_used_map bitmap
* - The inode_bad_map bitmap
* - The inode_dir_map bitmap
*
* Pass 2 frees the following data structures
* - The inode_bad_map bitmap
* - The inode_reg_map bitmap

* Pass 3 frees the following data structures:
* - The dirinfo directory information cache.

It's not a *lot* of memory, especially given that bitmaps are stored
in a much more compact, extent-mapped format, but it does free some
memory.

It is fair to say that e2fsck is optimized to run as quickly as
possible, and to cache information so that we are not rereading file
system metadata from disk. This was done using some of the
suggestions from the 1989 Usenix ATC paper:

Bina, E. J., and P. A. Emrath (1989): "A faster fsck for BSD UNIX,"
Proceedings of the Winter 1989 USENIX Technical Conference, 173-185.


On Tue, 21 Mar 2017 22:59:18 +0100 Reindl Harald <[email protected]> said:
>On 21.03.2017 at 22:48, Andreas Dilger wrote:
>> While it is true that e2fsck does not free memory during operation, in
>> practice this is not a problem. Even for large filesystems (say 32-48TB)
>> it will only use around 8-12GB of RAM so that is very reasonable for a
>> server today.
>
>No, it's not reasonable, even today, to expose your physical machine's entire
>RAM to just one of many virtual machines that is only running a Samba server
>for a 50 TB "datagrave" with a handful of users.
>
>In reality it should not be a problem to attach even 100 TB of storage to a VM
>with 1-2 GB of RAM.

Reindl, sorry, but today, if you have an out-of-balance server with a
huge amount of storage and a tiny amount of memory, it *will* be a
problem.
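
Just to put numbers on it, using Andreas's 1-byte-per-block estimate
and assuming 4 KB blocks:

    100 TB / 4 KB per block         = ~25 * 10^9 blocks
    25 * 10^9 blocks * 1 byte/block = ~25 GB of RAM

which is well over an order of magnitude more than a 1-2 GB VM has.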

If you are desperate, you *may* be able to use the scratch files
feature documented in e2fsck.conf. This was mainly implemented for
users of desktop NAS boxes which tried to connect a huge disk to a
tiny ARM server, whose manufacturers didn't bother to check whether
they had provisioned enough memory to repair a broken file system.
(I know they didn't, because the developers didn't reach out to me;
their users did.) The scratch files feature is a way to use on-disk
databases in place of the in-memory data structures, but it is
S-L-O-W. But hey, you get what you pay for, and if you are too much
of a cheapskate to provision a system with enough memory, you (or
your paying customers) will suffer the consequences.
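
If you do want to try it, it is enabled with something like the
following in /etc/e2fsck.conf (the directory below is only an example
path; see the e2fsck.conf man page for the full set of [scratch_files]
options):

    [scratch_files]
        directory = /var/cache/e2fsck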

If you don't like this answer, feel free to write your own e2fsck
which is 5-6 times slower because it is constantly rereading metadata
from disk.

Or submit patches, but if they slow down fsck times on reasonably
configured servers, I reserve the right to reject them as inflicting
pain on existing users of ext4 who correctly sized their servers.

- Ted