Date: Tue, 13 Oct 2009 15:10:02 +0200
From: Laurent CORBES <laurent.corbes@smartjog.com>
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: Ext3 sequential read performance drop 2.6.29 -> 2.6.30, 2.6.31, ...
Message-ID: <20091013151002.53efae58@smartjog.com>
In-Reply-To: <20091013120955.6bd5844b@smartjog.com>
References: <20091013120955.6bd5844b@smartjog.com>

Some updates, and linux-fsdevel added to the loop:

> While benchmarking some systems I discovered a big sequential read
> performance drop when using ext3 on fairly big files. The drop seems to
> have been introduced in 2.6.30. I'm testing with 2.6.28.6 -> 2.6.29.6 ->
> 2.6.30.4 -> 2.6.31.3.
>
> I'm running a software raid6 (256k chunk) on six 750GB 7200rpm disks.
> Here are the raw numbers for one disk and for the raid device:
>
> $ dd if=/dev/sda of=/dev/null bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 98.7483 seconds, 109 MB/s
>
> $ dd if=/dev/md7 of=/dev/null bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 34.8744 seconds, 308 MB/s
>
> Across the different kernels the changes here are not significant
> (~1MB/s on the raw disk, ~5MB/s on the raid device). Writing a 10GB file
> on the filesystem is also almost constant at ~100MB/s:
>
> $ dd if=/dev/zero of=/mnt/space/benchtmp//dd.out bs=1M count=10240
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 102.547 seconds, 105 MB/s
>
> However, while reading this file back there is a huge performance drop
> between 2.6.29.6 and 2.6.30.4/2.6.31.3:

I added slabtop info before and after the runs for 2.6.28.6 and 2.6.31.3.
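For reference, each before/after capture can be driven along the lines of
the sketch below. This is only an illustration, not the exact commands used
for the numbers in this mail: the log directory and file names are made up,
and it assumes slabtop from procps with its -o/--once flag.

#!/bin/sh
# Illustrative sketch only -- not the exact script behind the numbers below.
OUT=/root/bench-logs                        # hypothetical log directory
mkdir -p "$OUT"

slabtop -o > "$OUT/slabtop.before"          # slab state before the reads

for i in $(seq 1 10); do                    # ~10 runs, averaged afterwards
    dd if=/mnt/space/benchtmp/dd.out of=/dev/null bs=1M 2>> "$OUT/dd.times"
    sync
    echo 3 > /proc/sys/vm/drop_caches       # drop caches between runs
    # (the captures in this mail were taken right after a reboot instead)
done

slabtop -o > "$OUT/slabtop.after"           # slab state after the reads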
Each run is done just after a system reboot.

Slab state before the 2.6.28.6 read:

 Active / Total Objects (% used)    : 83612 / 90199 (92.7%)
 Active / Total Slabs (% used)      : 4643 / 4643 (100.0%)
 Active / Total Caches (% used)     : 93 / 150 (62.0%)
 Active / Total Size (% used)       : 16989.63K / 17858.85K (95.1%)
 Minimum / Average / Maximum Object : 0.01K / 0.20K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 20820  20688  99%    0.12K    694       30      2776K dentry
 12096  12029  99%    0.04K    144       84       576K sysfs_dir_cache
  8701   8523  97%    0.03K     77      113       308K size-32
  6036   6018  99%    0.32K    503       12      2012K inode_cache
  4757   4646  97%    0.05K     71       67       284K buffer_head
  4602   4254  92%    0.06K     78       59       312K size-64
  4256   4256 100%    0.47K    532        8      2128K ext3_inode_cache
  3864   3607  93%    0.08K     84       46       336K vm_area_struct
  2509   2509 100%    0.28K    193       13       772K radix_tree_node
  2130   1373  64%    0.12K     71       30       284K filp
  1962   1938  98%    0.41K    218        9       872K shmem_inode_cache
  1580   1580 100%    0.19K     79       20       316K skbuff_head_cache
  1524   1219  79%    0.01K      6      254        24K anon_vma
  1450   1450 100%    2.00K    725        2      2900K size-2048
  1432   1382  96%    0.50K    179        8       716K size-512
  1260   1198  95%    0.12K     42       30       168K size-128

> 2.6.28.6:
> sj-dev-7:/mnt/space/Benchmark# dd if=dd.out of=/dev/null bs=1M
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 43.8288 seconds, 245 MB/s

Slab state after the 2.6.28.6 read:

 Active / Total Objects (% used)    : 78853 / 90405 (87.2%)
 Active / Total Slabs (% used)      : 5079 / 5084 (99.9%)
 Active / Total Caches (% used)     : 93 / 150 (62.0%)
 Active / Total Size (% used)       : 17612.24K / 19391.84K (90.8%)
 Minimum / Average / Maximum Object : 0.01K / 0.21K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 17589  17488  99%    0.28K   1353       13      5412K radix_tree_node
 12096  12029  99%    0.04K    144       84       576K sysfs_dir_cache
  9840   5659  57%    0.12K    328       30      1312K dentry
  8701   8568  98%    0.03K     77      113       308K size-32
  5226   4981  95%    0.05K     78       67       312K buffer_head
  4602   4366  94%    0.06K     78       59       312K size-64
  4264   4253  99%    0.47K    533        8      2132K ext3_inode_cache
  3726   3531  94%    0.08K     81       46       324K vm_area_struct
  2130   1364  64%    0.12K     71       30       284K filp
  1962   1938  98%    0.41K    218        9       872K shmem_inode_cache
  1580   1460  92%    0.19K     79       20       316K skbuff_head_cache
  1548   1406  90%    0.32K    129       12       516K inode_cache
  1524   1228  80%    0.01K      6      254        24K anon_vma
  1450   1424  98%    2.00K    725        2      2900K size-2048
  1432   1370  95%    0.50K    179        8       716K size-512
  1260   1202  95%    0.12K     42       30       168K size-128

> 2.6.29.6:
> sj-dev-7:/mnt/space/Benchmark# dd if=dd.out of=/dev/null bs=1M
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 42.745 seconds, 251 MB/s
>
> 2.6.30.4:
> $ dd if=/mnt/space/benchtmp//dd.out of=/dev/null bs=1M
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 48.621 seconds, 221 MB/s

Slab state before the 2.6.31.3 read:

 Active / Total Objects (% used)    : 88438 / 97670 (90.5%)
 Active / Total Slabs (% used)      : 5451 / 5451 (100.0%)
 Active / Total Caches (% used)     : 93 / 155 (60.0%)
 Active / Total Size (% used)       : 19564.52K / 20948.54K (93.4%)
 Minimum / Average / Maximum Object : 0.01K / 0.21K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 21547  21527  99%    0.13K    743       29      2972K dentry
 12684  12636  99%    0.04K    151       84       604K sysfs_dir_cache
  8927   8639  96%    0.03K     79      113       316K size-32
  6721   6720  99%    0.33K    611       11      2444K inode_cache
  4425   4007  90%    0.06K     75       59       300K size-64
  4240   4237  99%    0.48K    530        8      2120K ext3_inode_cache
  4154   4089  98%    0.05K     62       67       248K buffer_head
  3910   3574  91%    0.08K     85       46       340K vm_area_struct
  2483   2449  98%    0.28K    191       13       764K radix_tree_node
  2280   1330  58%    0.12K     76       30       304K filp
  2240   2132  95%    0.19K    112       20       448K skbuff_head_cache
  2198   2198 100%    2.00K   1099        2      4396K size-2048
  1935   1910  98%    0.43K    215        9       860K shmem_inode_cache
  1770   1738  98%    0.12K     59       30       236K size-96
  1524   1278  83%    0.01K      6      254        24K anon_vma
  1056    936  88%    0.50K    132        8       528K size-512
> 2.6.31.3:
> sj-dev-7:/mnt/space/Benchmark# dd if=dd.out of=/dev/null bs=1M
> 10240+0 records in
> 10240+0 records out
> 10737418240 bytes (11 GB) copied, 51.4148 seconds, 209 MB/s

Slab state after the 2.6.31.3 read:

 Active / Total Objects (% used)    : 81843 / 97478 (84.0%)
 Active / Total Slabs (% used)      : 5759 / 5763 (99.9%)
 Active / Total Caches (% used)     : 92 / 155 (59.4%)
 Active / Total Size (% used)       : 19486.81K / 22048.45K (88.4%)
 Minimum / Average / Maximum Object : 0.01K / 0.23K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 17589  17426  99%    0.28K   1353       13      5412K radix_tree_node
 12684  12636  99%    0.04K    151       84       604K sysfs_dir_cache
 10991   6235  56%    0.13K    379       29      1516K dentry
  8927   8624  96%    0.03K     79      113       316K size-32
  4824   4819  99%    0.05K     72       67       288K buffer_head
  4425   3853  87%    0.06K     75       59       300K size-64
  3910   3527  90%    0.08K     85       46       340K vm_area_struct
  3560   3268  91%    0.48K    445        8      1780K ext3_inode_cache
  2288   1394  60%    0.33K    208       11       832K inode_cache
  2280   1236  54%    0.12K     76       30       304K filp
  2240   2183  97%    0.19K    112       20       448K skbuff_head_cache
  2216   2191  98%    2.00K   1108        2      4432K size-2048
  1935   1910  98%    0.43K    215        9       860K shmem_inode_cache
  1770   1719  97%    0.12K     59       30       236K size-96
  1524   1203  78%    0.01K      6      254        24K anon_vma
  1056    921  87%    0.50K    132        8       528K size-512

> ... Things get worse over time ...
>
> Numbers are averages over ~10 runs each.
>
> I first checked the stride/stripe alignment of the ext3 fs, which is
> quite important with raid6. I rechecked it and everything seems fine
> from my understanding of the formula: raid6 with a 256k chunk ->
> stride = 64; 4 data disks -> stripe-width = 256?
>
> In both cases I'm using the cfq IO scheduler, with no special tuning.
>
> For information, the test server is a Dell PowerEdge R710 with a SAS
> 6iR controller, 4GB of RAM and 6*750GB SATA disks. I get the same
> behavior on a PE2950 with Perc6i, 2GB of RAM and 6*750GB SATA disks.
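Spelling out the stride/stripe-width arithmetic quoted above with the 4KiB
block size (see the dumpe2fs output below): it works out to the same values,
so the alignment itself looks right. The mke2fs line here is only a sketch
of how those values would be passed, not the command this filesystem was
actually created with; the last command is the usual sysfs check for the
active IO scheduler.

# stride       = chunk size / fs block size   = 256KiB / 4KiB = 64
# stripe-width = stride * data disks (6 - 2)  = 64 * 4        = 256

# Illustration only -- not the command this filesystem was built with:
mke2fs -j -b 4096 -E stride=64,stripe-width=256 /dev/md7

# Active elevator on a member disk is shown in brackets (cfq here):
cat /sys/block/sda/queue/scheduler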
> Here is some misc information about the setup:
>
> sj-dev-7:/mnt/space/Benchmark# cat /proc/mdstat
> md7 : active raid6 sdf7[5] sde7[4] sdd7[3] sdc7[2] sdb7[1] sda7[0]
>       2923443200 blocks level 6, 256k chunk, algorithm 2 [6/6] [UUUUUU]
>       bitmap: 0/175 pages [0KB], 2048KB chunk
>
> sj-dev-7:/mnt/space/Benchmark# dumpe2fs -h /dev/md7
> dumpe2fs 1.40-WIP (14-Nov-2006)
> Filesystem volume name:
> Last mounted on:
> Filesystem UUID:          9c29f236-e4f2-4db4-bf48-ea613cd0ebad
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal resize_inode dir_index filetype
>                           needs_recovery sparse_super large_file
> Filesystem flags:         signed directory hash
> Default mount options:    (none)
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              713760
> Block count:              730860800
> Reserved block count:     0
> Free blocks:              705211695
> Free inodes:              713655
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Reserved GDT blocks:      849
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         32
> Inode blocks per group:   1
> Filesystem created:       Thu Oct  1 15:45:01 2009
> Last mount time:          Mon Oct 12 13:17:45 2009
> Last write time:          Mon Oct 12 13:17:45 2009
> Mount count:              10
> Maximum mount count:      30
> Last checked:             Thu Oct  1 15:45:01 2009
> Check interval:           15552000 (6 months)
> Next check after:         Tue Mar 30 15:45:01 2010
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               128
> Journal inode:            8
> Default directory hash:   tea
> Directory Hash Seed:      378d4fd2-23c9-487c-b635-5601585f0da7
> Journal backup:           inode blocks
> Journal size:             128M

Thanks all.

--
Laurent Corbes - laurent.corbes@smartjog.com
SmartJog SAS | Phone: +33 1 5868 6225 | Fax: +33 1 5868 6255 | www.smartjog.com
27 Blvd Hippolyte Marquès, 94200 Ivry-sur-Seine, France
A TDF Group company