2019-12-26 10:25:36

by xiaohui li

Subject: the side effect of enlarging the max mount count in the ext4 superblock

Hi Ted,

We have found that the full e2fsck check costs a lot of time during the
Android boot phase: it can take 120 seconds on an ext4 read-write
partition with a large storage capacity and serious fragmentation (many
used extents).

So we want to reduce how often the full e2fsck check has to run.

Condition 1:
We have found that when ext4 encounters a metadata error or an
inconsistency, it sets an error flag in its superblock. When the next
e2fsck check begins, it looks for this error flag in the partition's
superblock and automatically does the full check if the flag is set.

Condition 2:
Meanwhile, in the Android code, when an ext4 partition fails to mount,
a full e2fsck check is also run afterwards.

Given the two conditions above, under which the full e2fsck check is
triggered automatically, the full check does not have to run
periodically whenever the partition's mount count exceeds the max mount
count set in the ext4 superblock: when an ext4 data or metadata error
happens, the full check will be triggered automatically during the next
e2fsck run anyway.
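The triggering logic described above can be sketched as follows; this
is a minimal model, and the parameter names are illustrative rather
than the actual ext4 superblock fields (the real fields are s_state,
s_mnt_count, and s_max_mnt_count):

```python
# Minimal model of when a full e2fsck check is triggered, per the two
# conditions above plus the periodic mount-count check. Parameter
# names are illustrative, not the on-disk field names.

def full_check_needed(error_flag: bool, mount_failed: bool,
                      mount_count: int, max_mnt_count: int) -> bool:
    if error_flag:      # condition 1: kernel recorded an inconsistency
        return True
    if mount_failed:    # condition 2: Android forces a check on mount failure
        return True
    # periodic check: disabled when max_mnt_count is 0 (or negative)
    return max_mnt_count > 0 and mount_count >= max_mnt_count
```

With max_mnt_count set to 0, only the two error-driven conditions
remain, which is exactly the behavior the question below is about.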

So I wonder why EXT4_DFL_MAX_MNT_COUNT is set to 20 in fs/ext4/ext4.h
rather than to a larger value. Is there any situation in which a file
system data error or stability problem happens, but ext4 cannot detect
it and cannot set the error flag in the superblock, so that no full
check is triggered during the next e2fsck run? If such a situation
exists, the periodic full e2fsck check would still be needed.

Many thanks to you and anyone else who can give me advice on the
questions above.


2019-12-26 13:10:00

by Theodore Ts'o

Subject: Re: the side effect of enlarging the max mount count in the ext4 superblock

On Thu, Dec 26, 2019 at 06:25:01PM +0800, xiaohui li wrote:
> So I wonder why EXT4_DFL_MAX_MNT_COUNT is set to 20 in
> fs/ext4/ext4.h rather than to a larger value.

It sounds like you're still using the old make_ext4fs program that is
in the older versions of AOSP? More recently, AOSP uses mke2fs to
create the file system, in combination with e2fsdroid. And newer
versions of mke2fs set the max mount count to 0, which means the file
system is not automatically checked after some number of reboots. This
is for the reason you stated: for larger storage devices, a forced
e2fsck run can take a long time, and if it's not necessary we can skip it.

> Is there any situation in which a file system data error or
> stability problem happens, but ext4 cannot detect it and cannot set
> the error flag in the superblock, so that no full check is triggered
> during the next e2fsck run? If such a situation exists, the periodic
> full e2fsck check would still be needed.

The reason why we used to set the max mount count to 20 is that there
are indeed many kinds of file system inconsistencies which the kernel
can not detect at runtime or when it tries to mount the file system,
and that can lead to data loss or corruption. So setting a max mount
count of 20 was a way of trying to catch that early, hopefully before
*too* much data was lost.

Metadata inconsistencies should *not* be happening normally. Typical
causes of inconsistencies are kernel bugs or media problems (e.g.,
eMMC, HDD, SSD failures of some sort; sometimes because they don't do
the right thing on power drops).

Unfortunately, many Android devices, especially the cheaper models,
are using older SoCs, with older kernels, which are missing a lot of
bug fixes. Even if those bugs have been fixed upstream, a kernel
coming from an old Board Support Package may not have the fixes. This
is one of the reasons my personal advice to friends is to get a
higher-end Pixel and not one of the cheaper, low-quality Android
devices coming out of Asia. (Sorry.)

If you're using one of those older, crappier BSP kernels, one way you
can find out how bad it is, is to see how many tests fail when you run
something like android-xfstests[1]. In some cases, especially with an
older kernel (for example, a 3.10 or 3.18 kernel), running file system
stress tests can cause the kernel to crash.

[1] https://thunk.org/android-xfstests

If you are using high quality eMMC flash (as opposed to the cheapest
possible grade of flash, to maximize profits), and you have tested your
flash to make sure it handles power drops correctly (e.g., that the
FTL metadata never gets corrupted on a power drop, and all data
written before a FLUSH CACHE command is retained after a power drop),
and you are using a kernel which is regularly updated with the latest
security and bug fixes, then there is no need to set the max mount
count to a non-zero value.

If you are not in that ideal state, then the question really boils
down to "do you feel lucky?". Although that's probably true with or
without the max mount count set to 20. :-)

Cheers,

- Ted

2019-12-29 07:04:38

by xiaohui li

Subject: Re: the side effect of enlarging the max mount count in the ext4 superblock

Hi Ted,

Thank you, and sorry for my late reply.

Could the e2fsck tool be divided into two parts? One part would only
do the full consistency check, looking for inconsistencies while the
ext4 filesystem is frozen or very little I/O is going on. The other
part would do the actual repair work when an inconsistency is found.

But I wonder whether problems would arise when doing the full
consistency check online, without unmounting the ext4 filesystem. Even
if very little I/O is going on, the check cannot be done reliably,
because some file data may still be in memory rather than on disk. So
from my viewpoint the consistency check can only be started once the
ext4 filesystem has been frozen; at least at that moment, as much file
data as possible has been written back to disk.

Is the idea above right? Thanks to anyone who can give suggestions on
it. If it is, I will investigate the timing and frequency of ext4
filesystem freezes on Android.




2019-12-29 14:38:18

by Theodore Ts'o

Subject: Re: the side effect of enlarging the max mount count in the ext4 superblock

On Sun, Dec 29, 2019 at 02:58:21PM +0800, xiaohui li wrote:
>
> Could the e2fsck tool be divided into two parts? One part would only
> do the full consistency check, looking for inconsistencies while the
> ext4 filesystem is frozen or very little I/O is going on. The other
> part would do the actual repair work when an inconsistency is found.

Alas, that's not really practical. In order to repair a particular
part of the file system, you need to know what the correct value
should be. And calculating the correct value sometimes requires
global knowledge of the entire file system state.

For example, consider the inode field i_links_count. For regular
files, the value of this field is the number of references from
directory entries (in other words, links) that point at a particular
inode. If the correct value is 2 (there are two directory entries
which reference this inode), but it is incorrectly set to 1, then when
the first directory entry is removed with an unlink system call, the
i_links_count will go to zero, and the kernel will free the inode and
its blocks, leaving those blocks to be used by other inodes. But
there is still a valid reference to that inode, and the potential
result is that one or more files will get corrupted, because blocks
can end up being claimed by different inodes.

So there are a couple of things to learn from this. First,
determining whether or not a field is corrupted is 99.999% of the
effort. Once you know the correct value, the repair part is trivial.
So separating the consistency check and the repair effort doesn't
make much sense.

Second, when we are considering the i_links_count for a particular
inode, we have no idea where in the directory tree structure the
directory entries which reference that inode might be located. So we
have to examine all of the blocks of all directories in order to
determine the value of each inode's i_links_count. And of course, if
the contents of the directory blocks are changing while you are trying
to calculate the i_links_count for all of the inodes, this makes the
job incredibly difficult. Effectively, it also requires reading all
of the metadata blocks, and making sure that they are consistent with
each other, and this requires a lot of memory and a lot of I/O
bandwidth.
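The global scan Ted describes can be illustrated with a toy model
(purely illustrative: a "file system" here is just a dict of
directories; real e2fsck walks on-disk directory blocks in pass 2 and
compares the counts in pass 4):

```python
from collections import Counter

# Toy model of the i_links_count check: every directory entry in every
# directory must be visited before any single inode's true link count
# is known -- which is why the check needs global file system state.

def find_bad_link_counts(directories, stored_counts):
    """Return {inode: (stored, actual)} for every mismatched inode."""
    actual = Counter()
    for entries in directories.values():
        for inode in entries.values():
            actual[inode] += 1
    return {ino: (stored, actual[ino])
            for ino, stored in stored_counts.items()
            if stored != actual[ino]}

dirs = {
    "/":    {"a": 11, "b": 12},
    "/sub": {"hardlink-to-a": 11},
}
stored = {11: 1, 12: 1}   # inode 11 really has 2 links
print(find_bad_link_counts(dirs, stored))  # {11: (1, 2)}
```

Note that knowing the *repair* (set inode 11's count to 2) is trivial
once the scan is done; the expensive part is the scan itself, which is
exactly the point above.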

> But I wonder whether problems would arise when doing the full
> consistency check online, without unmounting the ext4 filesystem.
> Even if very little I/O is going on, the check cannot be done
> reliably, because some file data may still be in memory rather than
> on disk. So from my viewpoint the consistency check can only be
> started once the ext4 filesystem has been frozen; at least at that
> moment, as much file data as possible has been written back to disk.

So we can do this already. It's called e2scrub[1]. It requires using
dm_snapshot, so we can create a frozen copy of the file system, and
then we check that frozen file system.

[1] https://manpages.debian.org/testing/e2fsprogs/e2scrub.8.en.html

This has tradeoffs. The first, and most important, is that if any
problems are found, you need to unmount the file system, and then
rerun e2fsck on the actual file system (as opposed to the frozen copy)
to actually effectuate the repair.

So if you have a large 100TB RAID array, which takes hours to run
fsck, first of all, you need to reserve enough space in the snapshot
partition to save an original copy of all blocks written to the file
system while the e2fsck is running. This could potentially be a large
amount of storage. Secondly, if a problem is found, now what?
Current e2scrub sends an e-mail to the system administrator,
requesting that the sysadmin schedule downtime so the system can be
rebooted, and e2fsck run on the unmounted file system so it can be
fixed. If it took hours to detect that the file system was corrupted,
it will take hours to repair the file system, and the system will be
out of service during that time.

I'm not convinced this would work terribly well on an Android device.
E2scrub was designed for enterprise servers that might be running for
years without a reboot, and the goal was to allow a periodic sanity
check (say, every few months) to make sure there weren't any problems
that had accumulated due to cosmic rays flipping bits in the DRAM
(although hopefully all enterprise servers are using ECC memory), etc.

One thing that we could do to optimize things a bit is to enhance
dm_snapshot so that it only makes copies of the original block if the
I/O indicates that it is a metadata block. This would reduce the
amount of space needed to be reserved for the snapshot volume, and it
would reduce the overhead of dm_snapshot while the fsck is running.
This isn't something that has been done, because e2scrub isn't all
that commonly used, and most uses of dm_snapshot want the data blocks
snapshotted as well as the metadata blocks.
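The space saving from such a metadata-only snapshot can be illustrated
with a toy copy-on-write model (a hypothetical illustration of the
proposed dm_snapshot enhancement, not the actual device-mapper
implementation, which has no such filter today):

```python
# Toy copy-on-write model: count how many original blocks a write
# workload forces the snapshot to preserve, with and without a
# metadata-only filter. (Hypothetical sketch of the dm_snapshot
# enhancement discussed above.)

def cow_blocks_needed(writes, metadata_only=False):
    """writes: iterable of (block_number, is_metadata) tuples.
    The first write to each block copies the original block once."""
    copied = set()
    for block, is_metadata in writes:
        if metadata_only and not is_metadata:
            continue  # data overwrite: original block not preserved
        copied.add(block)
    return len(copied)

workload = [(1, True), (2, False), (3, False), (2, False), (4, True)]
print(cow_blocks_needed(workload))                      # 4
print(cow_blocks_needed(workload, metadata_only=True))  # 2
```

Since e2fsck only reads metadata from the frozen copy, skipping the
copy-out of overwritten data blocks loses nothing the checker needs,
while shrinking the reserved snapshot space and the runtime overhead.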

So if you are looking for a project, one thing you could perhaps do is
to approach the device mapper developers at [email protected],
and try to add this feature to dm_snapshot. It might be, though, that
getting your Android devices to use the latest kernels and using the
highest quality flash might be a better approach in the long run.

Cheers,

- Ted

2020-01-02 03:18:52

by xiaohui li

Subject: Re: the side effect of enlarging the max mount count in the ext4 superblock

Thank you very much, Ted.

I understand that I have to use offline fsck, because it saves me a
lot of time, and because e2scrub is not really practical for my
Android storage use case. It may be more useful in distributed
storage systems.
