Date: Tue, 15 Sep 2009 09:30:21 +0200 (CEST)
From: Tobias Oetiker
To: linux-kernel@vger.kernel.org
Subject: unfair io behaviour for high load interactive use still present in 2.6.31

Experts,

We run several busy NFS file servers with Areca HW RAID + LVM2 + ext3. We find that read bandwidth falls dramatically, and response times climb to several seconds, as soon as the system comes under heavy write strain.

With the release of 2.6.31 and all the io fixes that went into it, I was hoping for a solution and set out to do some tests ...

I have seen this io problem posted on LKML a few times; normally it is simulated by running several concurrent dd processes, one reading and one writing. It seems that cfq with a low slice_async can deal pretty well with competing dds. Unfortunately our use case is not users running dd, but rather a lot of processes accessing many small to medium sized files for reading and writing.

I have written a test program that unpacks Linux 2.6.30.5 a few times into a file system, flushes the cache, and then tars it up again while unpacking some more tars in parallel.
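In outline, the workload looks something like this (a scaled-down sketch only, not the actual test script; it substitutes a small generated tree for the Linux 2.6.30.5 tarball so it can be run anywhere, and the fsp_demo directory name is made up for the example):

```shell
#!/bin/sh
# Scaled-down sketch of the workload: unpack a tarball several times in
# parallel, flush the cache, then tar a tree back up while more unpacking
# runs. Uses a small generated tree instead of a kernel source tarball.
set -e
WORK=./fsp_demo
rm -rf "$WORK"
mkdir -p "$WORK/src"

# build a small source tree and tar it up once
for i in $(seq 1 50); do
    head -c 4096 /dev/urandom > "$WORK/src/file$i"
done
tar -C "$WORK" -cf "$WORK/src.tar" src

# writers: unpack the tarball several times in parallel
for n in 1 2 3; do
    mkdir -p "$WORK/out$n"
    tar -C "$WORK/out$n" -xf "$WORK/src.tar" &
done
wait

# flush the page cache so the reader really has to hit the disk
# (needs root; silently skipped otherwise)
sync
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true

# reader: tar one tree back up while another writer runs in parallel
tar -C "$WORK/out3" -xf "$WORK/src.tar" &
tar -C "$WORK/out1" -cf "$WORK/readback.tar" src
wait
ls -l "$WORK/readback.tar"
```

The real runs use twelve kernel trees per volume across three logical volumes, so scale the loop counts up accordingly.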
While this is running I use iostat to watch the activity on the block devices. Since I am interested in the interactive performance of the system, the await and rMB/s columns are of special interest to me:

  iostat -m -x 5

Even with a low 'resolution' of 5 seconds, the performance figures jump all over the place. This concurs with the user experience when working with the system interactively.

I tried to optimize the configuration, systematically turning all the knobs I know of (/proc/sys/vm, /sys/block/*/queue/scheduler, data=journal, data=ordered, external journal on an SSD device) one at a time. I found that cfq helps the read performance quite a lot as far as total run time is concerned, but the jerky nature of the measurements does not change, and the read performance still drops dramatically as soon as it is in competition with writers.

I would love to get some hints on how to make such a setup perform without these huge performance fluctuations.

While testing, I saw that iostat reports huge wMB/s numbers and ridiculously low rMB/s numbers. Looking at the actual amount of data in the tar files as well as the run time of the tar processes, the MB/s numbers reported by iostat do seem strange. The read numbers are too low and the write numbers are too high:

  12 * 1.3 GB reading in 270s = 14 MB/s sustained
  12 * 0.3 GB writing in 180s = 1.8 MB/s sustained

Since I am looking at relative performance figures this does not matter so much, but it is still a bit disconcerting.
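To watch just those two columns, a small awk filter over the iostat output works (a sketch; the field positions assume the sysstat `iostat -m -x` layout shown in the excerpts below, with rMB/s in field 6 and await in field 10):

```shell
# Pull rMB/s (field 6) and await (field 10) out of `iostat -m -x` device
# lines so the jitter is easy to eyeball or plot.
# Field layout: Device rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
extract_await() {
    awk '/^dm-/ { printf "%s read=%s MB/s await=%s ms\n", $1, $6, $10 }'
}

# live use would be:  iostat -m -x dm-5 5 | extract_await
# demo on one captured line:
echo "dm-5 0.00 0.00 63.80 2543.00 0.25 9.93 8.00 1383.63 539.28 0.36 94.64" \
    | extract_await
# -> dm-5 read=0.25 MB/s await=539.28 ms
```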
My test script is available at http://tobi.oetiker.ch/fspunisher/

Below is an excerpt from iostat while the test is in full swing:

* 2.6.31 (8 cpu x86_64, 24 GB Ram)
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes on a single physical volume

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 26.60 7036.60 0.10 27.49 8.00 669.04 91.58 0.09 66.96
dm-5 0.00 0.00 63.80 2543.00 0.25 9.93 8.00 1383.63 539.28 0.36 94.64
dm-5 0.00 0.00 78.00 5084.60 0.30 19.86 8.00 1007.41 195.12 0.15 77.36
dm-5 0.00 0.00 44.00 5588.00 0.17 21.83 8.00 516.27 91.69 0.17 95.44
dm-5 0.00 0.00 0.00 6014.20 0.00 23.49 8.00 1331.42 66.25 0.13 76.48
dm-5 0.00 0.00 28.80 4491.40 0.11 17.54 8.00 1000.37 412.09 0.17 78.24
dm-5 0.00 0.00 36.60 6486.40 0.14 25.34 8.00 765.12 128.07 0.11 72.16
dm-5 0.00 0.00 33.40 5095.60 0.13 19.90 8.00 431.38 43.78 0.17 85.20

For comparison, these are the numbers seen when running the test with just the writing enabled:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 4.00 12047.40 0.02 47.06 8.00 989.81 79.55 0.07 79.20
dm-5 0.00 0.00 3.40 12399.00 0.01 48.43 8.00 977.13 72.15 0.05 58.08
dm-5 0.00 0.00 3.80 13130.00 0.01 51.29 8.00 1130.48 95.11 0.04 58.48
dm-5 0.00 0.00 2.40 5109.20 0.01 19.96 8.00 427.75 47.41 0.16 79.92
dm-5 0.00 0.00 3.20 0.00 0.01 0.00 8.00 290.33 148653.75 282.50 90.40
dm-5 0.00 0.00 3.40 5103.00 0.01 19.93 8.00 168.75 33.06 0.13 67.84

And also with just the reading:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 463.80 0.00 1.81 0.00 8.00 3.90 8.41 2.16 100.00
dm-5 0.00 0.00 434.20 0.00 1.70 0.00 8.00 3.89 8.95 2.30 100.00
dm-5 0.00 0.00 540.80 0.00 2.11 0.00 8.00 3.88 7.18 1.85 100.00
dm-5 0.00 0.00 591.60 0.00 2.31 0.00 8.00 3.84 6.50 1.68 99.68
dm-5 0.00 0.00 793.20 0.00 3.10 0.00 8.00 3.81 4.80 1.26 100.00
dm-5 0.00 0.00 592.80 0.00 2.32 0.00 8.00 3.84 6.47 1.68 99.60
dm-5 0.00 0.00 578.80 0.00 2.26 0.00 8.00 3.85 6.66 1.73 100.00
dm-5 0.00 0.00 771.00 0.00 3.01 0.00 8.00 3.81 4.93 1.30 99.92

I also tested 2.6.31 for what happens when I run the same 'load' on a single lvm logical volume. Interestingly enough, it is even worse than when doing it on three volumes:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 19.20 4566.80 0.07 17.84 8.00 2232.12 486.87 0.22 99.92
dm-5 0.00 0.00 6.80 9410.40 0.03 36.76 8.00 1827.47 187.99 0.10 92.00
dm-5 0.00 0.00 4.00 0.00 0.02 0.00 8.00 685.26 185618.40 249.60 99.84
dm-5 0.00 0.00 4.20 4968.20 0.02 19.41 8.00 1426.45 286.86 0.20 99.84
dm-5 0.00 0.00 10.60 9886.00 0.04 38.62 8.00 167.57 5.74 0.09 88.72
dm-5 0.00 0.00 5.00 0.00 0.02 0.00 8.00 1103.98 242774.88 199.68 99.84
dm-5 0.00 0.00 38.20 14794.60 0.15 57.79 8.00 1171.75 74.25 0.06 87.20

I also tested 2.6.31 with the io-controller v9 patches.
It seems to help a bit with the read rate, but the figures still jump all over the place:

* 2.6.31 with io-controller patches v9 (8 cpu x86_64, 24 GB Ram)
* fairness set to 1 on all block devices
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes on a single physical volume

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 412.00 1640.60 1.61 6.41 8.00 1992.54 1032.27 0.49 99.84
dm-5 0.00 0.00 362.40 576.40 1.42 2.25 8.00 456.13 612.67 1.07 100.00
dm-5 0.00 0.00 211.80 1004.40 0.83 3.92 8.00 1186.20 995.00 0.82 100.00
dm-5 0.00 0.00 44.20 719.60 0.17 2.81 8.00 788.56 574.81 1.31 99.76
dm-5 0.00 0.00 0.00 1274.80 0.00 4.98 8.00 1584.07 1317.32 0.78 100.00
dm-5 0.00 0.00 0.00 946.91 0.00 3.70 8.00 989.09 911.30 1.05 99.72
dm-5 0.00 0.00 7.20 2526.00 0.03 9.87 8.00 2085.57 201.72 0.37 92.88

For completeness' sake I did the tests on 2.6.24 as well; not much different.
* 2.6.24 (8 cpu x86_64, 24 GB Ram)
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes situated on a single physical volume

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
---------------------------------------------------------------------------------------------------------
dm-5 0.00 0.00 144.60 5498.00 0.56 21.48 8.00 1215.82 215.48 0.14 76.80
dm-5 0.00 0.00 30.00 9831.40 0.12 38.40 8.00 1729.16 142.37 0.09 88.60
dm-5 0.00 0.00 27.60 4126.40 0.11 16.12 8.00 2245.24 618.77 0.21 86.00
dm-5 0.00 0.00 2.00 3981.20 0.01 15.55 8.00 1069.07 268.40 0.23 91.60
dm-5 0.00 0.00 40.60 13.20 0.16 0.05 8.00 3.98 74.83 15.02 80.80
dm-5 0.00 0.00 5.60 5085.20 0.02 19.86 8.00 2586.65 508.10 0.18 94.00
dm-5 0.00 0.00 20.80 5344.60 0.08 20.88 8.00 985.51 148.96 0.17 92.60

cheers
tobi

-- 
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch tobi@oetiker.ch ++41 62 775 9902 / sb: -9900