2014-03-05 12:33:36

by Zhang, Hongchao

[permalink] [raw]
Subject: an issue of ext4

Hi,

in ext4_fill_super, the variables related to statfs should be initialized after journal recovery is completed.
otherwise, if a large number of blocks were being allocated before the filesystem crashed, then the blocks
and inode counters may become negative during use and report incorrect values to statfs call.

Thanks
Hongchao


Attachments:
statfs-issue.patch (1.70 kB)
statfs-issue.patch

2014-03-05 12:51:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: an issue of ext4

On Wed, Mar 05, 2014 at 12:33:32PM +0000, Zhang, Hongchao wrote:
>
> in ext4_fill_super, the variables related to statfs should be
> initialized after journal recovery is completed. otherwise, if a
> large number of blocks were being allocated before the filesystem
> crashed, then the blocks and inode counters may become negative
> during use and report incorrect values to statfs call.

The ext4_statfs() doesn't use the free blocks and inodes count from
the superblock. For scalability reasons, we no longer update the
journal values in the superblock while they are in use, but rather
compute them from the sum of the values from the blockgroup
descriptors, and then track them via percpu counters.

Hence, the problem you described isn't an issue in practice. What we
are doing does mean that values in the superblock are not accurate
while the file system is mounted and if the system crashes before an
file system can be cleanly unmounted --- we do update these values
when the file system is unmounted. However, using tools such as
dumpe2fs to determine the free blocks/inodes available while the file
system is mounted or after an unclean shutdown was not considered an
important use case, especially compared to the scalability concerns of
supporting high cpu core count systems.

Cheers,

- Ted

2014-03-06 23:57:16

by Dilger, Andreas

[permalink] [raw]
Subject: Re: an issue of ext4

On 2014/03/05, 5:51 AM, "Theodore Ts'o" <[email protected]> wrote:

>On Wed, Mar 05, 2014 at 12:33:32PM +0000, Zhang, Hongchao wrote:
>>
>> in ext4_fill_super, the variables related to statfs should be
>> initialized after journal recovery is completed. otherwise, if a
>> large number of blocks were being allocated before the filesystem
>> crashed, then the blocks and inode counters may become negative
>> during use and report incorrect values to statfs call.
>
>The ext4_statfs() doesn't use the free blocks and inodes count from
>the superblock. For scalability reasons, we no longer update the
>journal values in the superblock while they are in use, but rather
>compute them from the sum of the values from the blockgroup
>descriptors, and then track them via percpu counters.

Ted,
This doesn't relate to using the summary counters in the superblock.

The problem is that the percpu counters are initialized from the
group descriptors (or block and inode bitmaps if EXT4_DEBUG is on)
at mount time _before_ the journal has been replayed. That means
journal replay can still change the group descriptors (or bitmaps)
after the counters are initialized, and statfs(), allocators, etc.
will use the wrong values for the rest of the mount.

If the journal is large, and there is heavy allocation happening
before the reboot then the counters can be significantly incorrect.

However, looking more closely at the upstream kernel, I see that this
code was changed by Dmitry Monakhov in v2.6.34-rc7-16-g84061e0 to
move the counters after journal init (almost the same as Hongchao's
patch does) but then you submitted a patch v2.6.37-rc1-3-gce7e010
to initialize the percpu counters are both before and after the
journal is loaded. It isn't clear from your commit comment why
the patch to load them both before and after was needed?

It seems we hit this problem in the RHEL6 (which is missing both of
these changes), and your patch made upstream look like the original
unpatched code was loading the counters only before the journal is
replayed, so Hongchao's patch still applied to upstream.

So I guess upstream is OK, with the exception that it isn't clear
why commit ce7e010 was made. Need to ask Eric to backport 84061e0
and ce7e010 to RHEL6 I guess, and use those patches in place of
our own in the meantime.

Cheers, Andreas
--
Andreas Dilger

Lustre Software Architect
Intel High Performance Data Division



2014-03-07 01:53:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: an issue of ext4

On Thu, Mar 06, 2014 at 11:57:15PM +0000, Dilger, Andreas wrote:
> However, looking more closely at the upstream kernel, I see that this
> code was changed by Dmitry Monakhov in v2.6.34-rc7-16-g84061e0 to
> move the counters after journal init (almost the same as Hongchao's
> patch does) but then you submitted a patch v2.6.37-rc1-3-gce7e010
> to initialize the percpu counters are both before and after the
> journal is loaded. It isn't clear from your commit comment why
> the patch to load them both before and after was needed?

I can't remembe the code path that was referencing one or more of the
percpu counters, it must have been some non-common corner case, since
we didn't notice the issue right away, but I believe that's why the
change was made. My bad for not including more details in the commit
description....

- Ted