From: Jaco Kroon <jaco@uls.co.za>
Subject: fragmentation optimization
Date: Sat, 23 Sep 2017 09:49:25 +0200
Message-ID: <6f6eb640-cd31-7080-a575-b7c6c4dd9e3f@uls.co.za>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
To: linux-ext4@vger.kernel.org, Theodore Ts'o <tytso@mit.edu>
Content-Language: en-US
Sender: linux-ext4-owner@vger.kernel.org

Hi Ted, Everyone,

During our last discussions you mentioned the following (2017/08/16 5:06 
SAST/GMT+2):

"One other thought.  There is an ext4 block allocator optimization
"feature" which is biting us here.  At the moment we have an
optimization where if there is small "hole" in the logical block
number space, we leave a "hole" in the physical blocks allocated to
the file."

You went on to give the example of object files being written by 
binutils (ld specifically).

As per the data I provided previously, rsync (with --sparse) is 
generating a lot of "holes" for us because of this optimization.  As a 
result I end up with a rather insane amount of fragmentation:

Blocksize: 4096 bytes
Total blocks: 13153337344
Free blocks: 1272662587 (9.7%)

Min. free extent: 4 KB
Max. free extent: 17304 KB
Avg. free extent: 44 KB
Num. free extent: 68868260

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
     4K...    8K-  :      28472490      28472490    2.24%
     8K...   16K-  :      27005860      55030426    4.32%
    16K...   32K-  :       2595993      14333888    1.13%
    32K...   64K-  :       2888720      32441623    2.55%
    64K...  128K-  :       2745121      62071861    4.88%
   128K...  256K-  :       2303439     103166554    8.11%
   256K...  512K-  :       1518463     134776388   10.59%
   512K... 1024K-  :        902691     163108612   12.82%
     1M...    2M-  :        314858     105445496    8.29%
     2M...    4M-  :         97174      64620009    5.08%
     4M...    8M-  :         22501      28760501    2.26%
     8M...   16M-  :           945       2069807    0.16%
    16M...   32M-  :             5         21155    0.00%

Based on watching how rsync works[1], I strongly suspect that it writes 
each file sequentially, from start to end.  Regarding the above 
"feature" you went on to say:

"However, it obviously doesn't do the right thing for rsync --sparse,
and these days, thanks to delayed allocation, so long as binutils can
finish writing the blocks within 30 seconds, it doesn't matter if GNU
ld writes the blocks in a completely random order, since we will only
attempt to do the writeback to the disk after all of the holes in the
.o file have been filled in.  So perhaps we should turn off this ext4
block allocator optimization if delayed allocation is enabled (which
is the default these days)."

You mentioned a few pros and cons of this approach, and also that it 
won't help my existing filesystem.  I suspect, however, that it might 
help in combination with an e4defrag sweep (if that takes a few weeks 
in the background, that's fine by me).  I also suspect that disabling 
the optimization would avoid creating new holes, and since my files 
persist for anywhere from a week to a year, performance should slowly 
improve over time as old files are replaced.

I'm also relatively comfortable making the 30s write limit even longer 
(as you pointed out, the files causing the problems are typically 
300GB+, even though on average my files are very small), provided that 
this doesn't introduce additional filesystem corruption risk.  Keep in 
mind that I run anywhere from 10 to 20 concurrent rsync instances at 
any point in time.
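
For reference, a minimal Python sketch for inspecting the relevant 
timers before changing anything.  It assumes that the 30s limit above 
corresponds to the vm.dirty_expire_centisecs sysctl (the thread does 
not name the tunable, so this is an assumption), and it also reads 
vm.dirty_writeback_centisecs, the flusher wake-up interval.  The 
function names are purely illustrative.

```python
def read_centisecs(path):
    """Read a sysctl file holding centiseconds and return seconds."""
    with open(path) as f:
        return int(f.read().strip()) / 100.0


def writeback_timers(base="/proc/sys/vm"):
    """Return the dirty-page expiry and flusher wake-up intervals in seconds.

    Assumption: the "30 seconds" discussed in this thread is
    vm.dirty_expire_centisecs; the thread itself does not say so.
    """
    return {
        "dirty_expire_s": read_centisecs(base + "/dirty_expire_centisecs"),
        "dirty_writeback_s": read_centisecs(base + "/dirty_writeback_centisecs"),
    }
```

The base directory is a parameter only so that the function can be 
pointed at a copy of the files for testing; on a live system the 
default path applies.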

I would like to attempt such a patch, so if you (or someone else) could 
point me to where I should start working on this, I would really 
appreciate the help.

Another approach for me may be to simply switch off --sparse, 
especially since I'm now unsure of its benefit.  I'm guessing that I 
could do a sweep of all inodes to determine how much space it is really 
saving.
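
A rough sketch of such a sweep in Python, assuming it is acceptable to 
walk the mounted tree with stat calls instead of reading the inode 
tables directly.  It compares each regular file's apparent size 
(st_size) with its allocated space (st_blocks, in 512-byte units) and 
sums the difference wherever the file is larger than its allocation.  
The function name is hypothetical, and the result is only an estimate: 
block rounding at the tail of a file can hide a small part of the 
saving.

```python
import os
import stat


def sparse_savings(root):
    """Return (apparent_bytes, allocated_bytes, saved_bytes) for files under root."""
    apparent = 0
    allocated = 0
    saved = 0
    seen = set()  # (device, inode) pairs, so hard links are counted once
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            on_disk = st.st_blocks * 512  # st_blocks counts 512-byte units
            apparent += st.st_size
            allocated += on_disk
            # Only count files whose size exceeds their allocation, so that
            # small files rounded up to a full block do not offset the total.
            saved += max(0, st.st_size - on_disk)
    return apparent, allocated, saved
```

Comparing saved_bytes with allocated_bytes gives an idea of whether 
--sparse is worth the fragmentation it causes.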

Kind Regards,
Jaco

[1] My observed behaviour when syncing a file (without --inplace, which 
is in my opinion a bad idea in general unless you're severely space 
constrained, and then I honestly don't know how this situation would be 
affected) is that rsync creates a new file, and the size of this file 
then grows slowly (note: not disk usage, but size as reported by ls) 
until it reaches the full size of the file being transferred.  At that 
point rsync uses rename(2) to replace the old file with the new one 
(which is the right approach).
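
The write pattern described in [1] can be sketched in Python as 
follows.  This is an illustration of the observed behaviour under 
stated assumptions, not rsync's actual implementation: the ".tmp" 
suffix, the per-block zero check and the function name are all 
hypothetical.  The file is written strictly from start to end, and 
all-zero blocks are skipped with a seek, which is what leaves holes in 
the logical block space.

```python
import os

BLOCK = 4096  # same as the filesystem block size reported above


def sparse_copy_then_rename(src, dst, block=BLOCK):
    """Copy src to a temporary file, leaving holes for zero blocks, then rename."""
    tmp = dst + ".tmp"  # hypothetical temporary name
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        while True:
            chunk = fin.read(block)
            if not chunk:
                break
            if not any(chunk):
                # All-zero block: advance the offset without writing,
                # so the filesystem sees a hole here.
                fout.seek(len(chunk), os.SEEK_CUR)
            else:
                fout.write(chunk)
        # Seeking alone does not extend the file, so set the final size
        # explicitly in case the source ends in zeros.
        fout.truncate(fout.tell())
    # Replace the old file in one step, as with rename(2).
    os.replace(tmp, dst)
```

While the copy runs, the size reported by ls for the temporary file 
grows with the write offset, whereas disk usage only grows for the 
blocks that were actually written.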