2013-12-09 18:01:54

by Theodore Ts'o

Subject: FAST paper on ffsck

Andreas brought up on today's conference call Kirk McKusick's recent
changes[1] to try to improve fsck times for FFS, in response to the
recent FAST paper covering fsck speed ups for ext3, "ffsck: The Fast
Filesystem Checker"[2]

[1] http://www.mckusick.com/publications/faster_fsck.pdf
[2] https://www.usenix.org/system/files/conference/fast13/fast13-final52_0.pdf

All of the changes which Kirk outlined are ones which we had done
several years ago, in the early days of ext4 development. I talked
about some of these in the blog entries "Fast ext4 fsck times"[3] and
"Fast ext4 fsck times, revisited"[4].

[3] http://thunk.org/tytso/blog/2008/08/08/fast-ext4-fsck-times/
[4] http://thunk.org/tytso/blog/2009/02/26/fast-ext4-fsck-times-revisited/

(Apologies for the really bad formatting; I recovered my blog from
backups a few months ago, installed onto a brand-new Wordpress
installation --- since the old one was security bug ridden and
horribly obsolete --- and I haven't had a chance to fix up some of the
older blog entries that had explicit HTML for tables to work with the
new theme.)

One further observation from reading the ffsck paper: their method of
introducing heavy file system fragmentation resulted in a file system
where most of the files had external extent tree blocks; that is, the
extent trees had a depth > 1. I have not observed this in file systems
under normal load. Most files are written once and never rewritten;
those that are rewritten (e.g., database files) are not the common
case, and even they generally aren't written in a random-append
workload where hundreds of files in the same directory are appended to
in random order. So looking at a couple of file systems' fsck -v
output, I find results such as this:

Extent depth histogram: 1229346/569/3
Extent depth histogram: 332256/141
Extent depth histogram: 23253/456

... where the first number is the number of inodes for which all of
the extent information is stored in the inode itself, the second
number is the number of inodes with a single level of external extent
tree blocks, and so on. (So in the first histogram above, all but a
few hundred of the roughly 1.2 million extent-mapped inodes need no
external extent tree blocks at all.)

As a result, I'm not seeing the fsck time degradation resulting from
file system aging, because at least with my workloads, the file system
isn't getting fragmented enough to result in a large number of inodes
with external extent tree blocks.

We could implement schemes to optimize fsck performance for heavily
fragmented file systems; a few of them could be done purely as e2fsck
optimizations, while others would require file system format changes.
However, it's not clear to me that it's worth it.

If folks would like to help run some experiments, it would be useful
to run a test e2fsck on a partition: "e2fsck -Fnfvtt /dev/sdb1" and
look at the extent depth histogram and the I/O rates for the various
e2fsck passes (see below for an example).

If you have examples where the file system has a very large number of
inodes with extent tree depths > 1, it would be useful to see these
numbers, along with a description of how old the file system is and
what sort of workload might have contributed to its aging.
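
If it helps, something like the following (an untested sketch, not
part of e2fsprogs) will read the e2fsck output on stdin and report
what fraction of the extent-mapped inodes need any external extent
tree blocks:

/*
 * Untested sketch: parse "Extent depth histogram:" lines from e2fsck
 * output on stdin; field N is the count of inodes whose extent tree
 * has depth N, so anything past the first field needs external blocks.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[512];

    while (fgets(line, sizeof(line), stdin)) {
        char *p = strstr(line, "Extent depth histogram:");
        if (!p)
            continue;
        p += strlen("Extent depth histogram:");

        unsigned long long counts[8] = { 0 };
        int levels = 0;
        while (levels < 8) {
            counts[levels++] = strtoull(p, &p, 10);
            if (*p != '/')
                break;
            p++;                        /* skip the '/' separator */
        }

        unsigned long long total = 0, external = 0;
        for (int i = 0; i < levels; i++) {
            total += counts[i];
            if (i > 0)
                external += counts[i];
        }
        if (total)
            printf("%llu / %llu extent-mapped inodes (%.2f%%) need "
                   "external extent tree blocks\n",
                   external, total, 100.0 * external / total);
    }
    return 0;
}

Compile it to whatever name you like and pipe the e2fsck run above
through it, or feed it saved logs.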

Thanks, regards,

- Ted

e2fsck 1.42.8 (20-Jun-2013)
Pass 1: Checking inodes, blocks, and sizes
Pass 1: Memory used: 668k/7692k (575k/94k), time: 0.92/ 0.42/ 0.02
Pass 1: I/O read: 11MB, write: 0MB, rate: 11.95MB/s
Pass 2: Checking directory structure
Pass 2: Memory used: 784k/15196k (466k/319k), time: 0.44/ 0.03/ 0.00
Pass 2: I/O read: 10MB, write: 0MB, rate: 22.76MB/s
Pass 3: Checking directory connectivity
Peak memory: Memory used: 784k/15196k (466k/319k), time: 1.60/ 0.63/ 0.02
Pass 3: Memory used: 784k/15196k (439k/346k), time: 0.00/ 0.00/ 0.00
Pass 3: I/O read: 1MB, write: 0MB, rate: 2793.30MB/s
Pass 4: Checking reference counts
Pass 4: Memory used: 784k/188k (432k/353k), time: 0.63/ 0.63/ 0.00
Pass 4: I/O read: 0MB, write: 0MB, rate: 0.00MB/s
Pass 5: Checking group summary information
Pass 5: Memory used: 784k/188k (426k/359k), time: 4.95/ 0.16/ 0.10
Pass 5: I/O read: 19MB, write: 0MB, rate: 3.84MB/s

13825 inodes used (0.03%, out of 47906816)
1425 non-contiguous files (10.3%)
11 non-contiguous directories (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 12986/831
141525383 blocks used (73.85%, out of 191627264)
0 bad blocks
4 large files

11537 regular files
2279 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
13816 files
Memory used: 784k/188k (426k/359k), time: 7.19/ 1.42/ 0.12
I/O read: 39MB, write: 0MB, rate: 5.43MB/s

Note: the reason this file system has so many files with external
extent tree blocks is that there are some video files large enough
that, even when contiguous, they require an external extent block.
(With 4k blocks, a single extent maps at most 32768 blocks and the
inode body holds four extents, so a contiguous file much over 512MB
needs at least one external extent block.) For example:

File size of 01 Yankee White.m4v is 499375730 (121918 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 19802112.. 19802112: 1:
1: 2.. 315: 19802114.. 19802427: 314: 19802113:
2: 543.. 14335: 19802655.. 19816447: 13793: 19802428:
3: 14336.. 47103: 19830784.. 19863551: 32768: 19816448:
4: 47104.. 73727: 19896320.. 19922943: 26624: 19863552:
5: 73728.. 79871: 19955712.. 19961855: 6144: 19922944:
6: 79872.. 112639: 19994624.. 20027391: 32768: 19961856:
7: 112640.. 121917: 20060160.. 20069437: 9278: 20027392: eof
01 Yankee White.m4v: 8 extents found
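
As an aside, the extent counts above come from filefrag, which uses
the FIEMAP ioctl under the hood; if you want to gather the same count
programmatically, a minimal sketch (error handling mostly omitted)
looks roughly like this:

/*
 * Minimal FIEMAP sketch: ask the kernel how many extents back a file,
 * roughly what "filefrag" reports as "N extents found".
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* With fm_extent_count == 0 the kernel just counts the extents. */
    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_start = 0;
    fm.fm_length = FIEMAP_MAX_OFFSET;
    fm.fm_flags = FIEMAP_FLAG_SYNC;
    fm.fm_extent_count = 0;

    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) {
        perror("FS_IOC_FIEMAP");
        close(fd);
        return 1;
    }

    printf("%s: %u extents found\n", argv[1], fm.fm_mapped_extents);
    close(fd);
    return 0;
}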

BTW, looking at the output of filefrag -v on large files, it does look
like there is some work we can do to improve the block allocation
heuristics. These files were written without the benefit of fallocate,
but with delayed allocation, and apparently we aren't automatically
figuring out that we should be in stream mode from the get-go. This
pattern is reproduced in most of the files in the directory:

File size of 02 Hung Out to Dry.m4v is 552382434 (134859 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 19816448.. 19816448: 1:
1: 2.. 314: 19816450.. 19816762: 313: 19816449:
2: 542.. 14335: 19816990.. 19830783: 13794: 19816763:
3: 14336.. 47103: 19863552.. 19896319: 32768: 19830784:
4: 47104.. 79871: 19961856.. 19994623: 32768: 19896320:
5: 79872.. 112639: 20027392.. 20060159: 32768: 19994624:
6: 112640.. 134858: 20070400.. 20092618: 22219: 20060160: eof
02 Hung Out to Dry.m4v: 7 extents found

File size of 03 Sea Dog.m4v is 553146161 (135046 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 20092928.. 20092928: 1:
1: 2.. 159: 20092930.. 20093087: 158: 20092929:
2: 161.. 306: 20093089.. 20093234: 146: 20093088:
3: 534.. 14335: 20093462.. 20107263: 13802: 20093235:
4: 14336.. 47103: 20121600.. 20154367: 32768: 20107264:
5: 47104.. 79871: 20187136.. 20219903: 32768: 20154368:
6: 79872.. 112639: 20252672.. 20285439: 32768: 20219904:
7: 112640.. 135045: 20318208.. 20340613: 22406: 20285440: eof
03 Sea Dog.m4v: 8 extents found

File size of 04 The Immortals.m4v is 516091162 (125999 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 20107264.. 20107264: 1:
1: 2.. 162: 20107266.. 20107426: 161: 20107265:
2: 164.. 312: 20107428.. 20107576: 149: 20107427:
3: 540.. 14335: 20107804.. 20121599: 13796: 20107577:
4: 14336.. 47103: 20154368.. 20187135: 32768: 20121600:
5: 47104.. 79871: 20219904.. 20252671: 32768: 20187136:
6: 79872.. 112639: 20285440.. 20318207: 32768: 20252672:
7: 112640.. 125998: 20340736.. 20354094: 13359: 20318208: eof
04 The Immortals.m4v: 8 extents found

Looking at all of these files, actually, if we had managed to allocate
them using contiguous 32768-block extents, these 45-minute TV episodes
would have just fit inside the inode's four in-inode extent slots
(01 Yankee White.m4v, for example, is 121918 blocks, which rounds up
to exactly four 32768-block extents).


2013-12-12 05:30:47

by Dave Chinner

Subject: Re: FAST paper on ffsck

On Mon, Dec 09, 2013 at 01:01:49PM -0500, Theodore Ts'o wrote:
> Andreas brought up on today's conference call Kirk McKusick's recent
> changes[1] to try to improve fsck times for FFS, in response to the
> recent FAST paper covering fsck speed ups for ext3, "ffsck: The Fast
> Filesystem Checker"[2]
>
> [1] http://www.mckusick.com/publications/faster_fsck.pdf
> [2] https://www.usenix.org/system/files/conference/fast13/fast13-final52_0.pdf

Interesting - it's all about trying to lay out data to get
sequential disk access patterns during scanning (i.e. minimise disk
seeks) to reduce fsck runtime. Fine in principle, but I think that
it's a dead end you don't want to go down.

Why? Because it's the exact opposite of what you need for SSD based
filesystems. What fsck really needs is to be able to saturate the
IOPS capability of the underlying device rather than optimising for
bandwidth, and that means driving deep IO queue depths.

e.g. I've dropped xfs_repair times on a 100TB test filesystem with 50
million inodes from 25 minutes to 5 minutes simply by adding gobs of
additional concurrency and ignoring sequential IO optimisations.
It's driving bandwidth rates of 200-250MB/s simply due to the IOPS
rate it is achieving, not because I'm optimising IO patterns for
sequential IO.

In fact, it dispatches so much IO now that the limitation is not the
60,000 IOPS that it is pulling from the underlying SSDs, but
mmap_sem contention caused by 30-odd threads doing concurrent memory
allocation to cache and store all the information that is being read
from disk...
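
To make the shape of that concrete (this is not xfs_repair code, just
a toy sketch with made-up parameters), "driving deep queue depths"
basically means something like the following: a pool of worker
threads striding across the device with O_DIRECT preads, instead of
one thread reading it front to back.

/*
 * Toy sketch only: NTHREADS workers stride across a device issuing
 * O_DIRECT preads so that many IOs are in flight at once.  A single
 * thread reading sequentially would leave an SSD's queue nearly empty.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define NTHREADS 32
#define CHUNK    (256 * 1024)           /* bytes per read */

struct work {
    int   fd;
    off_t start;                        /* first offset this worker reads */
    off_t end;                          /* device size in bytes */
    off_t stride;                       /* NTHREADS * CHUNK */
};

static void *reader(void *arg)
{
    struct work *w = arg;
    void *buf;

    if (posix_memalign(&buf, 4096, CHUNK))  /* O_DIRECT wants alignment */
        return NULL;
    for (off_t off = w->start; off < w->end; off += w->stride)
        if (pread(w->fd, buf, CHUNK, off) <= 0)
            break;
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <device> <size-in-bytes>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    off_t size = strtoll(argv[2], NULL, 10);

    if (fd < 0) {
        perror("open");
        return 1;
    }

    pthread_t tid[NTHREADS];
    struct work w[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        w[i] = (struct work){ .fd = fd, .start = (off_t)i * CHUNK,
                              .end = size,
                              .stride = (off_t)NTHREADS * CHUNK };
        pthread_create(&tid[i], NULL, reader, &w[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    close(fd);
    return 0;
}

Real repair tools obviously read the specific metadata they need
rather than striding blindly, but the queueing behaviour is the point.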

Cheers,

Dave.
--
Dave Chinner
[email protected]


2014-01-29 18:57:47

by Darrick J. Wong

Subject: Re: FAST paper on ffsck

On Mon, Dec 09, 2013 at 01:01:49PM -0500, Theodore Ts'o wrote:
> If you have examples where the file system has a very large number of
> inodes with extent tree depths > 1, it would be useful to see these
> numbers, along with a description of how old the file system is and
> what sort of workload might have contributed to its aging.
>

I don't know about "very large", but here's what I see on the server that I
share with some friends. Afaik it's used mostly for VM images and test
kernels... and other parallel-write-once files. ;) This FS has been running
since Nov. 2012. That said, I think the VM images were created without
fallocate; some of these files have tens of thousands of tiny extents.
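
For what it's worth, preallocating the images up front avoids most of
that; a minimal sketch (a hypothetical helper, just a thin wrapper
around fallocate(2)) would be:

/*
 * Sketch: preallocate an image file with fallocate(2) so the blocks
 * are reserved (as unwritten extents on ext4) in large chunks up
 * front, instead of accreting tiny extents as the guest dirties them.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <image> <size-in-bytes>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* mode 0: allocate the range and extend i_size to cover it */
    if (fallocate(fd, 0, 0, strtoll(argv[2], NULL, 10)) < 0) {
        perror("fallocate");
        return 1;
    }
    close(fd);
    return 0;
}

Most VM image tools have an option to do the equivalent at image
creation time.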

5386404 inodes used (4.44%, out of 121307136)
22651 non-contiguous files (0.4%)
7433 non-contiguous directories (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 5526723/1334/16
202583901 blocks used (41.75%, out of 485198848)
0 bad blocks
34 large files

5207070 regular files
313009 directories
576 character device files
192 block device files
11 fifos
1103023 links
94363 symbolic links (86370 fast symbolic links)
73 sockets
------------
6718317 files

On my main dev box, which is entirely old photos, mp3s, VM images, and kernel
builds, I see:

2155348 inodes used (2.94%, out of 73211904)
14923 non-contiguous files (0.7%)
1528 non-contiguous directories (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 2147966/685/3
85967035 blocks used (29.36%, out of 292834304)
0 bad blocks
6 large files

1862617 regular files
284915 directories
370 character device files
59 block device files
6 fifos
609215 links
7454 symbolic links (6333 fast symbolic links)
24 sockets
------------
2764660 files

Sadly, since I've left the LTC I no longer have access to tux1, which had a
rather horrifically fragmented ext3. Its backup server, which created a Time
Machine-like series of "snapshots" with rsync --link-dest, took days to fsck,
despite being ext4.

--D


2014-01-29 19:21:08

by Azat Khuzhin

Subject: Re: FAST paper on ffsck

On Wed, Jan 29, 2014 at 10:57 PM, Darrick J. Wong
<[email protected]> wrote:
> On Mon, Dec 09, 2013 at 01:01:49PM -0500, Theodore Ts'o wrote:
>> If you have examples where the file system has a very large number of
>> inodes with extent tree depths > 1, it would be useful to see these
>> numbers, along with a description of how old the file system is and
>> what sort of workload might have contributed to its aging.
>>


Workload: there are _many_ files that are never deleted; they are only
appended to, rewritten in full, or created once, with a lifetime of 1-2
years:

8988871 inodes used (2.09%, out of 429817856)
1012499 non-contiguous files (1.7%)
2039 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 8616444/372389/30
# the ~99% blocks-in-use figure is misleading; I had shrunk the fs to
# its minimal size before running this
428752124 blocks used (99.76%, out of 429788930)
0 bad blocks
50 large files

5988792 regular files
3000070 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
8988862 files





--
Respectfully
Azat Khuzhin

2014-01-29 19:40:25

by Theodore Ts'o

Subject: Re: FAST paper on ffsck

On Wed, Jan 29, 2014 at 11:21:07PM +0400, Azat Khuzhin wrote:
>
> Workload: there are _many_ files that are never deleted; they are only
> appended to, rewritten in full, or created once, with a lifetime of 1-2
> years:
>
> 8988871 inodes used (2.09%, out of 429817856)
> 1012499 non-contiguous files (1.7%)
> 2039 non-contiguous directories (0.0%)
> # of inodes with ind/dind/tind blocks: 0/0/0
> Extent depth histogram: 8616444/372389/30
> # the ~99% blocks-in-use figure is misleading; I had shrunk the fs to
> # its minimal size before running this


Shrinking the file system is known to result in really horrible
fragmentation. Part of this is because resize2fs has a really stupid
block allocator, but if you're going to shrink the file system to
minimal size, the results can be truly catastrophic from a file
fragmentation point of view. (Imagine my horror when I was told that
Fedora was creating bootable CD-ROMs by creating a large file system
image and then using resize2fs -M to shrink it to minimal size. Not
only would the file layout be definitely non-optimal, but worse,
CD-ROM drives are not known for fast seek times!)

We can probably make resize2fs smarter in the case where we are
shrinking the file system slightly (say, to make room for LVM/thinp
metadata when converting a whole-disk file system to one which is
being managed via LVM). But I'm not sure it's ever going to be worth
making resize2fs -M generate an optimal, minimally fragmented file
system image.

- Ted

2014-01-29 19:45:46

by Darrick J. Wong

Subject: Re: FAST paper on ffsck

On Thu, Dec 12, 2013 at 04:30:47PM +1100, Dave Chinner wrote:
> Interesting - it's all about trying to lay out data to get
> sequential disk access patterns during scanning (i.e. minimise disk
> seeks) to reduce fsck runtime. Fine in principle, but I think that
> it's a dead end you don't want to go down.
>
> Why? Because it's the exact opposite of what you need for SSD based
> filesystems. What fsck really needs is to be able to saturate the
> IOPS capability of the underlying device rather than optimising for
> bandwidth, and that means driving deep IO queue depths.
>
> e.g. I've dropped xfs_repair times on a 100TB test filesystem with 50
> million inodes from 25 minutes to 5 minutes simply by adding gobs of
> additional concurrency and ignoring sequential IO optimisations.
> It's driving bandwidth rates of 200-250MB/s simply due to the IOPS
> rate it is achieving, not because I'm optimising IO patterns for
> sequential IO.
>
> In fact, it dispatches so much IO now that the limitation is not the
> 60,000 IOPS that it is pulling from the underlying SSDs, but
> mmap_sem contention caused by 30-odd threads doing concurrent memory
> allocation to cache and store all the information that is being read
> from disk...

I've created a couple of experimental patches to speed up e2fsck. The first
patch creates a new IO manager that mmap()s the device and simply memcpy()s
buffers in and out to do IO. The second patch spawns a bunch of threads that
split up the work of scanning each block group in the hopes of faulting in all
the metadata off the disk ahead of the main e2fsck thread.
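
For anyone trying to picture the first patch, the idea is roughly the
sketch below -- the names are made up (this is not the real e2fsprogs
io_manager interface) and it only covers the read side:

/*
 * Rough sketch of an mmap-backed block reader: map the whole device
 * read-only and satisfy block reads with memcpy() from the mapping,
 * letting the kernel's readahead and page cache do the IO scheduling.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

struct mmap_io {
    unsigned char      *map;
    unsigned long long  size;       /* device size in bytes */
    unsigned int        blocksize;
};

static int mmap_io_open(struct mmap_io *io, const char *dev,
                        unsigned int blocksize)
{
    int fd = open(dev, O_RDONLY);

    if (fd < 0)
        return -1;
    if (ioctl(fd, BLKGETSIZE64, &io->size) < 0) {
        close(fd);
        return -1;
    }
    io->blocksize = blocksize;
    io->map = mmap(NULL, io->size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping keeps the device open */
    return io->map == MAP_FAILED ? -1 : 0;
}

static int mmap_io_read_blk(struct mmap_io *io, unsigned long long blk,
                            void *buf, unsigned int count)
{
    unsigned long long off = blk * (unsigned long long)io->blocksize;

    if (off + (unsigned long long)count * io->blocksize > io->size)
        return -1;
    memcpy(buf, io->map + off, (size_t)count * io->blocksize);
    return 0;
}

int main(int argc, char **argv)
{
    struct mmap_io io;
    unsigned char block[4096];

    if (argc != 2 || mmap_io_open(&io, argv[1], 4096) < 0)
        return 1;
    if (mmap_io_read_blk(&io, 0, block, 1) == 0)
        printf("read block 0 of %s (%llu bytes total)\n", argv[1], io.size);
    return 0;
}

The prefetch threads from the second patch would then presumably just
need to touch the right ranges of the mapping ahead of the main pass
so the metadata is already faulted in when the later passes get to it.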

The upside is that on a cold system, the patches reduce e2fsck running time on
HDD RAIDs and SSDs by 30-40%. On a warm system there's not much advantage.
The downside is that fsck tends to crash when it writes anything out. I've
been meaning to send this out after I fix the write crash, but I've been
occupied with other things at work. :/

There's also a horrible case: on a disk where can_queue = 1, the disk
mostly just thrashes like mad and the run takes several times longer
than a regular fsck. I could be wrong about the cause; I don't know if
it's really can_queue = 1 or simply having only one disk head.

<shrug> I'll clean 'em up and send an RFC.

--D

2014-01-29 20:09:08

by Azat Khuzhin

Subject: Re: FAST paper on ffsck

On Wed, Jan 29, 2014 at 11:40 PM, Theodore Ts'o <[email protected]> wrote:
> Shrinking the file system is known to result in really horrible
> fragmentation. Part of this is because resize2fs has a really stupid
> block allocator, but if you're going to shrink the file system to
> minimal size, the results can be truly catastrophic from a file
> fragmentation point of view. (Imagine my horror when I was told that
> Fedora was creating bootable CD-ROMs by creating a large file system
> image and then using resize2fs -M to shrink it to minimal size. Not
> only would the file layout be definitely non-optimal, but worse,
> CD-ROM drives are not known for fast seek times!)

Thanks, good to know.
Anyway, that fs is no longer in use; it existed only for a one-time
migration, and on the new fs most of the files have already been
rewritten.




--
Respectfully
Azat Khuzhin

2014-01-30 03:14:08

by Darrick J. Wong

Subject: Re: FAST paper on ffsck

On Wed, Jan 29, 2014 at 10:57:41AM -0800, Darrick J. Wong wrote:
> Sadly, since I've left the LTC I no longer have access to tux1, which had a
> rather horrifically fragmented ext3. Its backup server, which created a Time
> Machine-like series of "snapshots" with rsync --link-dest, took days to fsck,
> despite being ext4.

Well, I got a partial report -- the fs containing ISO images produced this fsck
output. Not terribly helpful, alas.

561392 inodes used (0.21%)
14007 non-contiguous inodes (2.5%)
# of inodes with ind/dind/tind blocks: 93077/7341/74
440877945 blocks used (82.12%)
0 bad blocks
382 large files
492651 regular files
36414 directories
270 character device files
760 block device files
3 fifos
2514 links
31930 symbolic links (31398 fast symbolic links)
4 sockets
--------
564546 files

--D