Subject: Re: [RFC] Block IO Controller V2 - some results
From: "Alan D. Brunelle"
To: linux-kernel@vger.kernel.org
Cc: vgoyal@redhat.com, jens.axboe@oracle.com
Date: Mon, 16 Nov 2009 15:51:00 -0500
Message-ID: <1258404660.3533.150.camel@cail>

Hi Vivek:

I'm finding some things that don't quite seem right - executive summary:

o I think the apportionment algorithm doesn't work consistently well for
  writes.

o I think there are problems with significant performance loss when doing
  random I/Os.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Test configuration: HP dl585 (32-way quad-core AMD Opteron processors +
128GB RAM + 4 FC HBAs + 4 MSA 1000's (each exporting 3 multi-disk striped
LUNs)). Running with Jens Axboe's remotes/origin/for-2.6.33 branch (at
commit 8721c81f6480e2c9acbf92078383953f825d1057) w/out and w/ your V2
patch.

The test: 12 Ext3 file systems (1 per disk), each file system has eight
8GB files on it. Doing simple fio runs in various modes and I/O
directions: random or sequential, read or write or read/write (80%/20%).
Using 2, 4 or 8 processes per file system (each process working on a
different file). Here is a sample fio command file:

[global]
ioengine=sync
size=8g
overwrite=0
runtime=120
bs=256k
readwrite=write

[/mnt/sdl/data.7]
filename=/mnt/sdl/data.7

I'm then using cgroups that have IO weights as follows:

/cgroup/test0/blkio.weight 100
/cgroup/test1/blkio.weight 200
/cgroup/test2/blkio.weight 300
/cgroup/test3/blkio.weight 400
/cgroup/test4/blkio.weight 500
/cgroup/test5/blkio.weight 600
/cgroup/test6/blkio.weight 700
/cgroup/test7/blkio.weight 800

There were 12 x N total processes running in the system for each test,
and each file system would have N processes working on a different file
in that file system. The N processes were assigned to increasing test
groups: process 0 is in test0's group and works on file 0 in a file
system; process 1 is in test1's group and works on file 1 in a file
system; and so on. Before each test I drop caches and umount/mount each
file system anew.
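For what it's worth, here is a rough sketch of the kind of per-run setup
this implies. It is illustrative only: the job-file names (job.sdl.N) are
made up, it assumes the blkio controller is already mounted at /cgroup as
the weight listing above suggests, and it is not the actual test script:

#!/bin/sh
# create the eight weighted groups (blkio controller assumed at /cgroup)
for i in 0 1 2 3 4 5 6 7; do
        mkdir -p /cgroup/test$i
        echo $(( (i + 1) * 100 )) > /cgroup/test$i/blkio.weight
done

# fresh caches and file systems before each run
sync
echo 3 > /proc/sys/vm/drop_caches
umount /mnt/sdl && mount /dev/sdl /mnt/sdl   # ...repeated for all 12 disks

# one fio process per group (N=2 shown; use 0..3 or 0..7 for N=4/N=8);
# the inner shell moves itself into the group and then execs fio, so fio
# and any job processes it forks are accounted to that group
for i in 0 1; do
        sh -c "echo \$\$ > /cgroup/test$i/tasks; exec fio job.sdl.$i" &
done
wait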
In the following tables:

  'base'        - means a kernel generated from Jens' branch (-no- patching)
  'ioc off'     - means a kernel generated w/ your patches added but -no-
                  other settings (no CGROUP stuff mounted or enabled)
  'ioc no idle' - means the ioc kernel w/ CGROUP stuff enabled -but-
                  /sys/block/sd*/queue/iosched/cgroup_idle = 0
  'ioc idle'    - means the ioc kernel w/ CGROUP stuff enabled -and-
                  /sys/block/sd*/queue/iosched/cgroup_idle = 1

  Mode:  random or sequential
  RdWr:  rd==read, wr==write, rdwr==80% read & 20% write
  N:     number of processes per disk
  testX: processes sharing a task group (when enabled)

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The first thing to do is to check for correctness: when the I/O controller
is enabled, do we see correctly apportioned I/O? At the tail end of the
e-mail I've placed three (3) tables showing the state where -no-
differences should be seen between the various "task" groups in terms of
performance ("level playing field"), and sure enough no differences were
seen. These were done basically as a "control" set of tests - the script
being used didn't have any inherent biases in it.[1]

This table shows the cases where we should see a difference based upon
weights:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle    rnd  rd   2   2.8   6.3
ioc idle    rnd  rd   4   0.7   1.5   2.5   3.5
ioc idle    rnd  rd   8   0.2   0.4   0.5   0.8   0.9   1.2   1.4   1.7
ioc idle    rnd  wr   2  38.2 192.7
ioc idle    rnd  wr   4   1.0  17.7  38.1 204.5
ioc idle    rnd  wr   8   0.3   0.6   0.9   1.5   2.2  16.3  16.6 208.3
ioc idle    rnd  rdwr 2   4.9  11.3
ioc idle    rnd  rdwr 4   0.9   2.4   4.3   6.2
ioc idle    rnd  rdwr 8   0.2   0.5   0.8   1.1   1.4   1.8   2.2   2.7
ioc idle    seq  rd   2 221.0 386.4
ioc idle    seq  rd   4  69.8 128.1 183.2 226.8
ioc idle    seq  rd   8  21.4  40.0  55.6  70.8  85.2  98.3 111.6 121.9
ioc idle    seq  wr   2 398.6 391.6
ioc idle    seq  wr   4 219.0 214.5 214.1 214.5
ioc idle    seq  wr   8 107.6 106.8 104.7 102.5  99.5  99.5 100.5 100.8
ioc idle    seq  rdwr 2 196.8 340.9
ioc idle    seq  rdwr 4  64.0 109.6 148.7 183.5
ioc idle    seq  rdwr 8  22.6  36.6  48.8  61.1  70.3  78.5  84.9  94.3

In general, we do see throughput increasing with weight in the correct
order, but I don't think the proportions are done correctly in all cases.
In the random tests, for example, the read distribution looks pretty
decent, but random writes are all off - for some reason the most heavily
weighted group is getting a disproportionately large percentage of the
I/O bandwidth. For the sequential loads, the reads look "OK" - not quite
correctly fair when we have 8 processes running against the devices, but
on the whole things look ok. Sequential writes are not working well at
all: a relatively flat distribution. I _think_ this is pointing to some
real problems in the write cases for both random & sequential I/Os.
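As a sanity check on the apportionment, the expected per-group split is
easy to compute from the weights. The little helper below is only a
sketch (the script name and the assumption that the table values are
per-group MB/s as reported by fio are mine, and it merely replays numbers
already shown above): given a measured aggregate throughput and the group
weights, it prints what a perfectly weight-proportional split would be.

#!/bin/sh
# expected_share.sh (illustrative name): print the per-group bandwidth a
# perfectly weight-proportional split of TOTAL would give.
#   usage: expected_share.sh TOTAL WEIGHT0 WEIGHT1 ...
total=$1; shift
echo "$@" | awk -v total="$total" '{
        for (i = 1; i <= NF; i++) sum += $i
        for (i = 1; i <= NF; i++)
                printf "weight %3d -> expected %6.1f\n", $i, total * $i / sum
}'

Fed the 'ioc idle rnd wr 2' row (aggregate 38.2 + 192.7 = 230.9, weights
100 and 200) it predicts 77.0 and 153.9 - a 2-to-1 split - whereas the
observed split of 38.2 / 192.7 is roughly 5-to-1. The sequential-read
rows, by contrast, come out much closer to their expected shares.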
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

The next thing to look at is the "penalty" for the additional code: how
much bandwidth we lose for the capability added. Here we see the sum of
the system's throughput for the various tests:

---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N        base     ioc off ioc no idle    ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd  rd   2        17.3        17.1         9.4         9.1
rnd  rd   4        27.1        27.1         8.1         8.2
rnd  rd   8        37.1        37.1         6.8         7.1
rnd  wr   2       296.5       243.7       290.2       230.9
rnd  wr   4       287.3       280.7       270.4       261.3
rnd  wr   8       272.5       273.1       237.7       246.5
rnd  rdwr 2        27.4        27.7        16.1        16.2
rnd  rdwr 4        38.3        39.3        13.5        13.9
rnd  rdwr 8        62.0        61.5        10.0        10.7
seq  rd   2       610.2       608.1       610.7       607.4
seq  rd   4       608.4       601.5       609.3       608.0
seq  rd   8       605.7       603.7       605.0       604.8
seq  wr   2       840.3       850.2       836.8       790.2
seq  wr   4       886.8       891.6       868.2       862.2
seq  wr   8       865.1       887.1       832.1       822.0
seq  rdwr 2       536.2       550.0       538.1       537.7
seq  rdwr 4       595.3       605.7       512.9       505.8
seq  rdwr 8       617.3       628.5       526.6       497.1

The sequential runs look very good - not much variance across the board.
The random results look horrible, especially when reads are involved: the
first two columns (base & ioc off) are very similar, however note the
significant drop in overall system performance once the io-controller
CGROUP stuff gets involved - the more processes involved, the more
performance is lost.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

I'm going to spend some time drilling down into three specific tests:

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc idle    rnd  wr   2  38.2 192.7
ioc idle    seq  wr   2 398.6 391.6

The first of these I can use to see why random writes are so
disproportionately apportioned - it should be 2-to-1 but we are seeing
something like 5-to-1 - and the second to look at why sequential writes
are flat.

and:

---- ---- - ----------- ----------- ----------- -----------
Mode RdWr N        base     ioc off ioc no idle    ioc idle
---- ---- - ----------- ----------- ----------- -----------
rnd  rd   2        17.3        17.1         9.4         9.1

Here I will try to find out why we are seeing such a loss in system
performance.
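As a rough first cut at quantifying that loss, the random-read rows of
the system-throughput table above can be reduced to percentages. This is
only a sketch - it replays the 'base' and 'ioc idle' numbers already
shown, nothing new is measured:

# base vs 'ioc idle' aggregate throughput for rnd/rd, from the table above
for row in "2 17.3 9.1" "4 27.1 8.2" "8 37.1 7.1"; do
        set -- $row
        awk -v n="$1" -v base="$2" -v ioc="$3" 'BEGIN {
                printf "N=%s: %.0f%% of base throughput lost\n", n, (base - ioc) / base * 100
        }'
done

That works out to roughly 47%, 70% and 81% of the base throughput lost
for N=2, 4 and 8 respectively - consistent with the observation that the
more processes are involved, the more performance is lost.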
Regards,
Alan D. Brunelle
Hewlett-Packard / Linux Kernel Technology Team

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

[1] Three tables showing the I/O load distribution when there was no I/O
controller code, when it was present but turned off, and when cgroup_idle
was turned off. All looks sane - with the exception of the ioc-enabled
kernel with no-idle set: for random writes there appear to be some
differences, though not an appreciable amount.

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
base        rnd  rd   2   8.6   8.6
base        rnd  rd   4   6.8   6.8   6.8   6.7
base        rnd  rd   8   4.7   4.6   4.6   4.6   4.6   4.6   4.6   4.6
base        rnd  wr   2 150.4 146.1
base        rnd  wr   4  75.2  74.8  68.1  69.2
base        rnd  wr   8  36.2  39.3  29.6  35.9  32.9  37.0  29.6  32.2
base        rnd  rdwr 2  13.7  13.7
base        rnd  rdwr 4   9.6   9.6   9.6   9.6
base        rnd  rdwr 8   7.8   7.8   7.7   7.8   7.8   7.7   7.7   7.8
base        seq  rd   2 306.2 304.0
base        seq  rd   4 150.1 152.4 151.9 154.0
base        seq  rd   8  77.2  75.9  75.9  73.9  77.0  75.7  75.0  74.9
base        seq  wr   2 420.2 420.1
base        seq  wr   4 220.5 222.5 221.9 221.9
base        seq  wr   8 108.2 108.8 107.8 107.7 108.7 108.5 108.1 107.2
base        seq  rdwr 2 268.4 267.8
base        seq  rdwr 4 148.9 150.6 147.8 148.0
base        seq  rdwr 8  78.0  77.7  76.3  76.0  79.1  77.9  74.3  77.9

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc off     rnd  rd   2   8.6   8.6
ioc off     rnd  rd   4   6.8   6.8   6.7   6.7
ioc off     rnd  rd   8   4.7   4.6   4.6   4.7   4.6   4.6   4.6   4.6
ioc off     rnd  wr   2 112.6 131.1
ioc off     rnd  wr   4  64.9  67.8  79.9  68.1
ioc off     rnd  wr   8  35.1  39.5  31.5  32.0  36.1  34.5  30.8  33.5
ioc off     rnd  rdwr 2  13.8  13.8
ioc off     rnd  rdwr 4   9.8   9.8   9.9   9.8
ioc off     rnd  rdwr 8   7.7   7.7   7.7   7.7   7.7   7.7   7.7   7.7
ioc off     seq  rd   2 303.1 305.0
ioc off     seq  rd   4 150.8 151.6 149.0 150.2
ioc off     seq  rd   8  77.0  76.3  74.5  74.0  77.9  75.5  74.0  74.6
ioc off     seq  wr   2 424.6 425.5
ioc off     seq  wr   4 223.0 222.4 223.9 222.3
ioc off     seq  wr   8 110.8 112.0 111.3 109.6 111.7 111.3 110.8 109.7
ioc off     seq  rdwr 2 274.3 275.8
ioc off     seq  rdwr 4 151.3 154.8 149.0 150.6
ioc off     seq  rdwr 8  81.1  80.6  77.8  74.8  81.0  78.5  77.0  77.7

----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
Test        Mode RdWr N test0 test1 test2 test3 test4 test5 test6 test7
----------- ---- ---- - ----- ----- ----- ----- ----- ----- ----- -----
ioc no idle rnd  rd   2   4.7   4.7
ioc no idle rnd  rd   4   2.0   2.0   2.0   2.0
ioc no idle rnd  rd   8   0.9   0.9   0.8   0.8   0.8   0.8   0.9   0.9
ioc no idle rnd  wr   2 144.8 145.4
ioc no idle rnd  wr   4  73.2  65.9  65.5  65.8
ioc no idle rnd  wr   8  35.5  52.5  26.2  31.0  25.5  19.3  25.1  22.6
ioc no idle rnd  rdwr 2   8.1   8.1
ioc no idle rnd  rdwr 4   3.4   3.4   3.4   3.4
ioc no idle rnd  rdwr 8   1.3   1.3   1.3   1.2   1.2   1.3   1.2   1.3
ioc no idle seq  rd   2 304.1 306.6
ioc no idle seq  rd   4 152.1 154.5 149.8 153.0
ioc no idle seq  rd   8  75.8  75.8  75.2  75.1  75.5  75.3  75.7  76.5
ioc no idle seq  wr   2 418.6 418.2
ioc no idle seq  wr   4 217.7 217.7 215.4 217.4
ioc no idle seq  wr   8 105.5 105.8 105.8 103.4 102.9 103.1 102.7 102.8
ioc no idle seq  rdwr 2 269.2 269.0
ioc no idle seq  rdwr 4 130.0 126.4 127.8 128.6
ioc no idle seq  rdwr 8  67.2  66.6  65.4  65.0  65.3  64.8  65.7  66.5