Date: Thu, 23 Apr 2009 13:28:23 +0200
From: Jens Axboe
To: Corrado Zoccolo
Cc: Linux-Kernel
Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler
Message-ID: <20090423112823.GB4593@kernel.dk>
In-Reply-To: <4e5e476b0904221407v7f43c058l8fc61198a2e4bb6e@mail.gmail.com>

On Wed, Apr 22 2009, Corrado Zoccolo wrote:
> Hi,
> the deadline I/O scheduler currently classifies all I/O requests into
> only two classes: reads (always considered high priority) and writes
> (always lower).
> The attached patch, intended to reduce latencies for synchronous writes
> and high I/O priority requests, introduces more levels of priority:
> * real-time reads: highest priority and shortest deadline; can starve
>   other levels
> * synchronous operations (either best-effort reads or RT/BE writes):
>   mid priority; starvation of lower levels is prevented as usual
> * asynchronous operations (async writes and all IDLE-class requests):
>   lowest priority and longest deadline
>
> The patch also introduces some new heuristics:
> * for non-rotational devices, reads (within a given priority level)
>   are issued in FIFO order, to improve the latency perceived by readers

Danger danger... I smell nasty heuristics.
> * minimum batch timespan (time quantum): partners with fifo_batch to
>   improve throughput by sending more consecutive requests together. A
>   given number of requests will not always take the same time (due to
>   the amount of seeking needed), therefore fifo_batch must be tuned for
>   worst cases, while in best cases longer batches would give a
>   throughput boost.
> * the batch start request is chosen fifo_batch/3 requests before the
>   expired one, to improve fairness for requests with lower start
>   sectors, which otherwise have a higher probability of missing a
>   deadline than mid-sector requests.

This is a huge patch, I'm not going to be reviewing this. Make this a
patchset, with each patch doing one little change separately. Then it's
easier to review, and much easier to pick the parts that can go in
directly and leave out the ones that either need more work or are not
going to be merged.

> I did a few performance comparisons:
> * HDD, ext3 partition with data=writeback, tiotest with 32 threads,
>   each writing 80MB of data

It doesn't seem to make a whole lot of difference, does it?

> ** deadline-original
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs        | 103.0 s  | 24.848 MB/s  | 10.6 %   | 522.2 % |
> | Random Write 125 MBs  |  98.8 s  |  1.265 MB/s  | -1.6 %   |  16.1 % |
> | Read 2560 MBs         | 166.2 s  | 15.400 MB/s  |  4.2 %   |  82.7 % |
> | Random Read 125 MBs   | 193.3 s  |  0.647 MB/s  | -0.8 %   |  14.5 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        4.122 ms |    17922.920 ms |  0.07980 |   0.00061 |
> | Random Write |        0.599 ms |     1245.200 ms |  0.00000 |   0.00000 |
> | Read         |        8.032 ms |     1125.759 ms |  0.00000 |   0.00000 |
> | Random Read  |      181.968 ms |      972.657 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |       10.044 ms |    17922.920 ms |  0.03804 |   0.00029 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> ** deadline-patched
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs        | 105.3 s  | 24.301 MB/s  | 10.5 %   | 514.8 % |
> | Random Write 125 MBs  |  95.9 s  |  1.304 MB/s  | -1.8 %   |  17.3 % |
> | Read 2560 MBs         | 165.1 s  | 15.507 MB/s  |  2.7 %   |  61.9 % |
> | Random Read 125 MBs   | 110.6 s  |  1.130 MB/s  |  0.8 %   |  12.2 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write        |        4.131 ms |    17456.831 ms |  0.08041 |   0.00275 |
> | Random Write |        2.780 ms |     5073.180 ms |  0.07500 |   0.00000 |
> | Read         |        7.748 ms |      936.499 ms |  0.00000 |   0.00000 |
> | Random Read  |      104.849 ms |      695.192 ms |  0.00000 |   0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total        |        8.168 ms |    17456.831 ms |  0.04008 |   0.00131 |
> `--------------+-----------------+-----------------+----------+-----------'

The main difference here seems to be random read performance; the rest
are pretty close and could just be noise. Random write is much worse
from a latency viewpoint. Is this just one run, or did you average
several?

For something like this, you also need to consider workloads consisting
of processes with different IO patterns running at the same time. With
this tiotest run, you only test sequential readers competing, then
random readers, etc.

So, please, split the big patch into lots of little separate pieces.
Benchmark each one separately, so they each carry their own
justification.
> * fsync-tester results, on HDD, empty ext3 partition, mounted with
>   data=writeback
> ** deadline-original:
> fsync time: 0.7963
> fsync time: 4.5914
> fsync time: 4.2347
> fsync time: 1.1670
> fsync time: 0.8164
> fsync time: 1.9783
> fsync time: 4.9726
> fsync time: 2.4929
> fsync time: 2.5448
> fsync time: 3.9627
> ** cfq 2.6.30-rc2
> fsync time: 0.0288
> fsync time: 0.0528
> fsync time: 0.0299
> fsync time: 0.0397
> fsync time: 0.5720
> fsync time: 0.0409
> fsync time: 0.0876
> fsync time: 0.0294
> fsync time: 0.0485
> ** deadline-patched
> fsync time: 0.0772
> fsync time: 0.0381
> fsync time: 0.0604
> fsync time: 0.2923
> fsync time: 0.2488
> fsync time: 0.0924
> fsync time: 0.0144
> fsync time: 1.4824
> fsync time: 0.0789
> fsync time: 0.0565
> fsync time: 0.0550
> fsync time: 0.0421

At least this test looks a lot better!

-- 
Jens Axboe