As asked by Alex, I included in the test results the file fragmentation
level and the number of I/Os done during the file deletion.
Here are the results obtained with a not very fragmented 100-GB file:
                |   ext3      ext4 + extents     xfs
------------------------------------------------------------
nb of fragments |    796           798            15
elapsed time    | 2m0.306s     0m11.127s       0m0.553s
                |
blks read       |  206600         6416           352
blks written    |   13592        13064           104
------------------------------------------------------------
And with a more fragmented 100-GB file:
                |   ext3      ext4 + extents     xfs
------------------------------------------------------------
nb of fragments |   20297         19841           234
elapsed time    | 2m18.914s    0m27.429s       0m0.892s
                |
blks read       |  225624        25432           592
blks written    |   52120        50664           872
------------------------------------------------------------
More details on our web site:
http://www.bullopensource.org/ext4/20070404/FileDeletion.html
Valérie
On Fri, Apr 27, 2007 at 03:41:19PM +0200, Valerie Clement wrote:
> As asked by Alex, I included in the test results the file fragmentation
> level and the number of I/Os done during the file deletion.
>
> Here are the results obtained with a not very fragmented 100-GB file:
>
>                 |   ext3      ext4 + extents     xfs
> ------------------------------------------------------------
> nb of fragments |    796           798            15
> elapsed time    | 2m0.306s     0m11.127s       0m0.553s
>                 |
> blks read       |  206600         6416           352
> blks written    |   13592        13064           104
> ------------------------------------------------------------
The metablockgroups feature should help the file fragmentation level
with extents. It's easy enough to enable this for ext4 (we just need
to remove some checks in ext4_check_descriptors), so we should just do
it.
- Ted
Valerie Clement wrote:
> As asked by Alex, I included in the test results the file fragmentation
> level and the number of I/Os done during the file deletion.
>
> Here are the results obtained with a not very fragmented 100-GB file:
>
>                 |   ext3      ext4 + extents     xfs
> ------------------------------------------------------------
> nb of fragments |    796           798            15
> elapsed time    | 2m0.306s     0m11.127s       0m0.553s
>                 |
> blks read       |  206600         6416           352
> blks written    |   13592        13064           104
> ------------------------------------------------------------
hmm. if I did the math right, then, in theory, a 100GB file could be
placed using ~850 extents: 100 * 1024 / 120, where 120 MB is the amount
of data one can allocate in a regular block group. 850 extents would
require 3 leaf blocks (340 extents/block) + 1 index block. we'd
need to read these 4 blocks + all ~850 involved bitmaps + some
group descriptor blocks. so we probably need to tune balloc.
then we'd improve the remove time by a factor of six (6400 blocks to
read vs. ~900-1000 blocks to read)?
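Alex's estimate can be reproduced with a quick sketch (assumptions for
illustration only: 4 KB blocks, ~120 MB of usable data per 128 MB block
group, 340 extent entries per leaf block):

```python
# Back-of-envelope check of the estimate above: how many extents a
# 100 GB file needs if each extent covers one block group's data,
# and how many extent-tree blocks those extents occupy.
import math

file_mb = 100 * 1024            # 100 GB expressed in MB
data_per_group_mb = 120         # usable data per regular block group (assumed)
extents = math.ceil(file_mb / data_per_group_mb)

extents_per_leaf = 340          # 12-byte entries in a 4 KB block, minus header
leaf_blocks = math.ceil(extents / extents_per_leaf)
index_blocks = 1                # one index block above the leaves

# Freeing each extent touches its group's block bitmap, so roughly
# one bitmap read per extent, plus the extent-tree blocks themselves.
reads = extents + leaf_blocks + index_blocks
print(extents, leaf_blocks, reads)   # 854 3 858
```

Adding a handful of group descriptor blocks lands this in the
~900-1000 range mentioned above.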
thanks, Alex
On Apr 27, 2007 14:33 -0400, Theodore Tso wrote:
> > Here are the results obtained with a not very fragmented 100-GB file:
> >
> >                 |   ext3      ext4 + extents     xfs
> > ------------------------------------------------------------
> > nb of fragments |    796           798            15
> > elapsed time    | 2m0.306s     0m11.127s       0m0.553s
> >                 |
> > blks read       |  206600         6416           352
> > blks written    |   13592        13064           104
> > ------------------------------------------------------------
>
> The metablockgroups feature should help the file fragmentation level
> with extents. It's easy enough to enable this for ext4 (we just need
> to remove some checks in ext4_check_descriptors), so we should just do
> it.
While I agree in this case that the META_BG feature would help here
(100GB / 128MB is in fact the ~800 fragments shown), I don't think that
is the major performance hit.
The fact that we need to read 6000 blocks and write 13000 blocks is the
more serious part. I assume that since there are only 800 fragments
there should be only 800 extents. We can fit (4096 / 12 - 1) = 340
extents into each block, and 4 index entries into the inode itself, so
this should allow all 800 extents in only 3 leaf blocks. It would be
useful to know where those 6416 block reads are going in the extent case.
I suspect that is because the "tail first" truncation mechanism of ext3
causes it to zero out FAR more blocks than needed. With extents and a
default 128MB journal we should be able to truncate + unlink a file with
only writes to the inode and the 800 bitmap + gdt blocks. The reads should
also be limited to the bitmap blocks and extent indexes (gdt being read at
mount time).
What is needed is for truncate to walk the inode block tree (extents or
indirect blocks) and count the bitmap + gdt blocks that will be dirtied,
and then try to do the whole truncate under a single transaction. That
avoids any need for truncate to be "restartable", and then there is no
need to zero out the indirect blocks from the end one at a time.
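Purely as an illustration of that counting pass (the function name, the
32768-block group size, and the 32-descriptors-per-gdt-block figure are
assumptions of this sketch, not real ext4 code):

```python
# Sketch: walk the extents, collect the distinct block groups touched,
# and size one transaction from the bitmap + gdt blocks to be dirtied.
def estimate_truncate_credits(extents, blocks_per_group):
    groups = set()
    for start, length in extents:            # extent = (start block, length)
        first = start // blocks_per_group
        last = (start + length - 1) // blocks_per_group
        groups.update(range(first, last + 1))
    bitmaps = len(groups)                    # one block bitmap per group
    gdt = len({g // 32 for g in groups})     # assumed 32 descriptors/gdt block
    return bitmaps + gdt + 1                 # + the inode itself

# e.g. 800 one-group extents (128 MB groups = 32768 4 KB blocks):
ext = [(g * 32768, 32768) for g in range(800)]
print(estimate_truncate_credits(ext, 32768))   # 826
```

With the count known up front, the whole unlink fits in one transaction
of that many credits instead of restartable tail-first passes.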
Doing the bitmap read/write will definitely be more efficient with META_BG,
but that doesn't explain the other 19k blocks undergoing IO.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Apr 27, 2007 15:41 +0200, Valerie Clement wrote:
> As asked by Alex, I included in the test results the file fragmentation
> level and the number of I/Os done during the file deletion.
>
> Here are the results obtained with a not very fragmented 100-GB file:
>
>                 |   ext3      ext4 + extents     xfs
> ------------------------------------------------------------
> nb of fragments |    796           798            15
> elapsed time    | 2m0.306s     0m11.127s       0m0.553s
>                 |
> blks read       |  206600         6416           352
> blks written    |   13592        13064           104
> ------------------------------------------------------------
>
>
> And with a more fragmented 100-GB file:
>
>                 |   ext3      ext4 + extents     xfs
> ------------------------------------------------------------
> nb of fragments |   20297         19841           234
> elapsed time    | 2m18.914s    0m27.429s       0m0.892s
>                 |
> blks read       |  225624        25432           592
> blks written    |   52120        50664           872
> ------------------------------------------------------------
>
>
> More details on our web site:
> http://www.bullopensource.org/ext4/20070404/FileDeletion.html
Ah, one thing that is only mentioned in the URL is that the "IO count" is
in units of 512-byte sectors. In the case of XFS doing logical journaling
this avoids a huge amount of double writes to the journal and then to the
filesystem. I still think ext4 could do better than it currently does.
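Under that assumption, the ext4 read figure from the first table
converts as follows (4 KB filesystem blocks assumed):

```python
# "blks read" in the tables is in 512-byte sectors, per the URL above.
sectors_read = 6416                       # ext4 + extents, first table
fs_blocks = sectors_read * 512 // 4096    # convert sectors to 4 KB blocks
print(fs_blocks)                          # 802
```

which is suggestively close to the ~800 fragments, i.e. roughly one
block's worth of reads per group touched.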
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Andreas Dilger wrote:
> Ah, one thing that is only mentioned in the URL is that the "IO count" is
> in units of 512-byte sectors. In the case of XFS doing logical journaling
> this avoids a huge amount of double writes to the journal and then to the
> filesystem. I still think ext4 could do better than it currently does.
I thought about this in the context of huge directories, where the
working set of blocks is very large and doesn't fit in the journal,
causing frequent commits. two ideas I was thinking of are: 1) journal
the "change" where possible; 2) compress the whole transaction to be
written to the journal.
thanks, Alex