2009-04-22 21:07:45

by Corrado Zoccolo

Subject: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

Hi,
The deadline I/O scheduler currently classifies all I/O requests into only 2
classes: reads (always considered high priority) and writes (always
lower).
The attached patch, intended to reduce latencies for synchronous writes
and high I/O priority requests, introduces more priority levels:
* real-time reads: highest priority and shortest deadline; can starve
the other levels
* synchronous operations (either best-effort reads or RT/BE writes):
mid priority; starvation of the lower level is prevented as usual
* asynchronous operations (async writes and all IDLE-class requests):
lowest priority and longest deadline
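(Purely illustrative, not code from the patch: a minimal sketch of how a
request could be mapped onto the three levels above. The level names and
the helper are made up here; the IOPRIO_CLASS_* constants come from
linux/ioprio.h, and the real patch derives the class from the submitting
task's I/O context.)

        enum dd_level { DD_RT_READ, DD_SYNC, DD_ASYNC };

        /* ioprio_class is IOPRIO_CLASS_RT/BE/IDLE; is_read/is_sync describe the request */
        static enum dd_level deadline_level(int ioprio_class, int is_read, int is_sync)
        {
                if (ioprio_class == IOPRIO_CLASS_IDLE)
                        return DD_ASYNC;        /* all IDLE-class requests: longest deadline */
                if (is_read && ioprio_class == IOPRIO_CLASS_RT)
                        return DD_RT_READ;      /* RT reads: may starve the other levels */
                if (is_sync)
                        return DD_SYNC;         /* BE reads and RT/BE synchronous writes */
                return DD_ASYNC;                /* plain async writes */
        }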

The patch also introduces some new heuristics:
* for non-rotational devices, reads (within a given priority level)
are issued in FIFO order, to improve the latency perceived by readers
* minimum batch timespan (time quantum): partners with fifo_batch to
improve throughput by sending more consecutive requests together. A
given number of requests will not always take the same time (due to the
amount of seeking needed), so fifo_batch must be tuned for the worst
case, while in the best case longer batches would give a
throughput boost.
* the batch start request is chosen fifo_batch/3 requests before the
expired one, to improve fairness for requests with lower start sectors,
which otherwise have a higher probability of missing a deadline than
mid-sector requests.
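(To illustrate just the last heuristic: a rough sketch, with a hypothetical
helper name, of walking back up to fifo_batch/3 positions in the
sector-sorted tree from the request whose deadline expired. The actual
logic lives in the dispatch path of the attached patch.)

        static struct request *
        deadline_batch_start(struct deadline_data *dd, struct request *expired)
        {
                struct request *rq = expired;
                int backjump = dd->fifo_batch / 3;

                /* start the batch a few positions before the expired request,
                 * so low-sector requests are not systematically served last */
                while (backjump--) {
                        struct rb_node *prev = rb_prev(&rq->rb_node);

                        if (!prev)
                                break;
                        rq = rb_entry_rq(prev);
                }
                return rq;
        }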

I did a few performance comparisons:
* HDD, ext3 partition with data=writeback, tiotest with 32 threads,
each writing 80MB of data

** deadline-original
Tiotest results for 32 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2560 MBs | 103.0 s | 24.848 MB/s | 10.6 % | 522.2 % |
| Random Write 125 MBs | 98.8 s | 1.265 MB/s | -1.6 % | 16.1 % |
| Read 2560 MBs | 166.2 s | 15.400 MB/s | 4.2 % | 82.7 % |
| Random Read 125 MBs | 193.3 s | 0.647 MB/s | -0.8 % | 14.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 4.122 ms | 17922.920 ms | 0.07980 | 0.00061 |
| Random Write | 0.599 ms | 1245.200 ms | 0.00000 | 0.00000 |
| Read | 8.032 ms | 1125.759 ms | 0.00000 | 0.00000 |
| Random Read | 181.968 ms | 972.657 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 10.044 ms | 17922.920 ms | 0.03804 | 0.00029 |
`--------------+-----------------+-----------------+----------+-----------'

** cfq (2.6.30-rc2)
Tiotest results for 32 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2560 MBs | 132.4 s | 19.342 MB/s | 8.5 % | 400.4 % |
| Random Write 125 MBs | 107.8 s | 1.159 MB/s | -1.6 % | 16.8 % |
| Read 2560 MBs | 107.6 s | 23.788 MB/s | 5.4 % | 95.7 % |
| Random Read 125 MBs | 158.4 s | 0.789 MB/s | 0.9 % | 7.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 5.362 ms | 21081.012 ms | 0.09811 | 0.00244 |
| Random Write | 23.310 ms | 31865.095 ms | 0.13437 | 0.06250 |
| Read | 5.048 ms | 3694.001 ms | 0.15167 | 0.00000 |
| Random Read | 146.523 ms | 2880.409 ms | 0.52187 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 8.916 ms | 31865.095 ms | 0.13435 | 0.00262 |
`--------------+-----------------+-----------------+----------+-----------'

** deadline-patched
Tiotest results for 32 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2560 MBs | 105.3 s | 24.301 MB/s | 10.5 % | 514.8 % |
| Random Write 125 MBs | 95.9 s | 1.304 MB/s | -1.8 % | 17.3 % |
| Read 2560 MBs | 165.1 s | 15.507 MB/s | 2.7 % | 61.9 % |
| Random Read 125 MBs | 110.6 s | 1.130 MB/s | 0.8 % | 12.2 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 4.131 ms | 17456.831 ms | 0.08041 | 0.00275 |
| Random Write | 2.780 ms | 5073.180 ms | 0.07500 | 0.00000 |
| Read | 7.748 ms | 936.499 ms | 0.00000 | 0.00000 |
| Random Read | 104.849 ms | 695.192 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 8.168 ms | 17456.831 ms | 0.04008 | 0.00131 |
`--------------+-----------------+-----------------+----------+-----------'

* SD card, nilfs2 partition, tiotest with 16 threads, each writing 80MB of data
** cfq(2.6.30-rc2)
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 1280 MBs | 217.8 s | 5.878 MB/s | 3.7 % | 92.2 % |
| Random Write 62 MBs | 18.2 s | 3.432 MB/s | -2.3 % | 28.7 % |
| Read 1280 MBs | 114.7 s | 11.156 MB/s | 7.3 % | 76.6 % |
| Random Read 62 MBs | 3.4 s | 18.615 MB/s | -5.4 % | 274.2 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 9.943 ms | 10223.581 ms | 0.14252 | 0.00488 |
| Random Write | 12.287 ms | 5097.196 ms | 0.25625 | 0.00000 |
| Read | 5.352 ms | 1550.162 ms | 0.00000 | 0.00000 |
| Random Read | 3.051 ms | 1507.837 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 7.649 ms | 10223.581 ms | 0.07391 | 0.00233 |
`--------------+-----------------+-----------------+----------+-----------'

** deadline-patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 1280 MBs | 220.9 s | 5.794 MB/s | 4.0 % | 93.9 % |
| Random Write 62 MBs | 20.5 s | 3.044 MB/s | -2.2 % | 24.9 % |
| Read 1280 MBs | 113.2 s | 11.304 MB/s | 6.8 % | 72.8 % |
| Random Read 62 MBs | 2.9 s | 21.896 MB/s | 5.1 % | 293.8 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 10.078 ms | 13303.036 ms | 0.14160 | 0.00031 |
| Random Write | 14.350 ms | 5265.088 ms | 0.40000 | 0.00000 |
| Read | 5.455 ms | 434.495 ms | 0.00000 | 0.00000 |
| Random Read | 2.685 ms | 12.652 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 7.801 ms | 13303.036 ms | 0.07682 | 0.00015 |
`--------------+-----------------+-----------------+----------+-----------'

* fsync-tester results, on HDD, empty ext3 partition, mounted with
data=writeback
** deadline-original:
fsync time: 0.7963
fsync time: 4.5914
fsync time: 4.2347
fsync time: 1.1670
fsync time: 0.8164
fsync time: 1.9783
fsync time: 4.9726
fsync time: 2.4929
fsync time: 2.5448
fsync time: 3.9627
** cfq 2.6.30-rc2
fsync time: 0.0288
fsync time: 0.0528
fsync time: 0.0299
fsync time: 0.0397
fsync time: 0.5720
fsync time: 0.0409
fsync time: 0.0876
fsync time: 0.0294
fsync time: 0.0485
** deadline-patched
fsync time: 0.0772
fsync time: 0.0381
fsync time: 0.0604
fsync time: 0.2923
fsync time: 0.2488
fsync time: 0.0924
fsync time: 0.0144
fsync time: 1.4824
fsync time: 0.0789
fsync time: 0.0565
fsync time: 0.0550
fsync time: 0.0421
** deadline-patched, ionice -c1:
fsync time: 0.2569
fsync time: 0.0500
fsync time: 0.0681
fsync time: 0.2863
fsync time: 0.0140
fsync time: 0.0171
fsync time: 0.1198
fsync time: 0.0530
fsync time: 0.0503
fsync time: 0.0462
fsync time: 0.0484
fsync time: 0.0328
fsync time: 0.0562
fsync time: 0.0451
fsync time: 0.0576
fsync time: 0.0444
fsync time: 0.0469
fsync time: 0.0368
fsync time: 0.2865

Corrado

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


Attachments:
deadline-iosched.c.patch (17.07 kB)

2009-04-23 11:18:40

by Paolo Ciarrocchi

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On 4/22/09, Corrado Zoccolo <[email protected]> wrote:
> Hi,
> deadline I/O scheduler currently classifies all I/O requests in only 2
> classes, reads (always considered high priority) and writes (always
> lower).
> The attached patch, intended to reduce latencies for syncronous writes
> and high I/O priority requests, introduces more levels of priorities:
> * real time reads: highest priority and shortest deadline, can starve
> other levels
> * syncronous operations (either best effort reads or RT/BE writes),
> mid priority, starvation for lower level is prevented as usual
> * asyncronous operations (async writes and all IDLE class requests),
> lowest priority and longest deadline
>


numbers are impressive.

do you observe better latencies in normal desktop usage as well?

ciao,
--
Paolo
http://paolo.ciarrocchi.googlepages.com/
http://mypage.vodafone.it/

2009-04-23 11:28:34

by Jens Axboe

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Wed, Apr 22 2009, Corrado Zoccolo wrote:
> Hi,
> deadline I/O scheduler currently classifies all I/O requests in only 2
> classes, reads (always considered high priority) and writes (always
> lower).
> The attached patch, intended to reduce latencies for syncronous writes
> and high I/O priority requests, introduces more levels of priorities:
> * real time reads: highest priority and shortest deadline, can starve
> other levels
> * syncronous operations (either best effort reads or RT/BE writes),
> mid priority, starvation for lower level is prevented as usual
> * asyncronous operations (async writes and all IDLE class requests),
> lowest priority and longest deadline
>
> The patch also introduces some new heuristics:
> * for non-rotational devices, reads (within a given priority level)
> are issued in FIFO order, to improve the latency perceived by readers

Danger danger... I smell nasty heuristics.

> * minimum batch timespan (time quantum): partners with fifo_batch to
> improve throughput, by sending more consecutive requests together. A
> given number of requests will not always take the same time (due to
> amount of seek needed), therefore fifo_batch must be tuned for worst
> cases, while in best cases, having longer batches would give a
> throughput boost.
> * batch start request is chosen fifo_batch/3 requests before the
> expired one, to improve fairness for requests with lower start sector,
> that otherwise have higher probability to miss a deadline than
> mid-sector requests.

This is a huge patch; I'm not going to be reviewing it as-is. Make this a
patchset, with each little change done separately. Then it's easier to
review, and much easier to pick the parts that can go in directly and
leave out the ones that either need more work or are not going to be
merged.

> I did few performance comparisons:
> * HDD, ext3 partition with data=writeback, tiotest with 32 threads,
> each writing 80MB of data

It doesn't seem to make a whole lot of difference, does it?

> ** deadline-original
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs | 103.0 s | 24.848 MB/s | 10.6 % | 522.2 % |
> | Random Write 125 MBs | 98.8 s | 1.265 MB/s | -1.6 % | 16.1 % |
> | Read 2560 MBs | 166.2 s | 15.400 MB/s | 4.2 % | 82.7 % |
> | Random Read 125 MBs | 193.3 s | 0.647 MB/s | -0.8 % | 14.5 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 4.122 ms | 17922.920 ms | 0.07980 | 0.00061 |
> | Random Write | 0.599 ms | 1245.200 ms | 0.00000 | 0.00000 |
> | Read | 8.032 ms | 1125.759 ms | 0.00000 | 0.00000 |
> | Random Read | 181.968 ms | 972.657 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 10.044 ms | 17922.920 ms | 0.03804 | 0.00029 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> ** deadline-patched
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs | 105.3 s | 24.301 MB/s | 10.5 % | 514.8 % |
> | Random Write 125 MBs | 95.9 s | 1.304 MB/s | -1.8 % | 17.3 % |
> | Read 2560 MBs | 165.1 s | 15.507 MB/s | 2.7 % | 61.9 % |
> | Random Read 125 MBs | 110.6 s | 1.130 MB/s | 0.8 % | 12.2 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 4.131 ms | 17456.831 ms | 0.08041 | 0.00275 |
> | Random Write | 2.780 ms | 5073.180 ms | 0.07500 | 0.00000 |
> | Read | 7.748 ms | 936.499 ms | 0.00000 | 0.00000 |
> | Random Read | 104.849 ms | 695.192 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 8.168 ms | 17456.831 ms | 0.04008 | 0.00131 |
> `--------------+-----------------+-----------------+----------+-----------'

The main difference here seems to be random read performance; the rest are
pretty close and could just be noise. Random write is much worse from a
latency viewpoint. Is this just one run, or did you average several?

For something like this, you also need to consider workloads that
consist of processes with different IO patterns running at the same
time. With this tiotest run, you only test sequential readers competing,
then random readers, etc.

So, please, split the big patch into lots of little separate pieces.
Benchmark each one separately, so they each carry their own
justification.

> * fsync-tester results, on HDD, empty ext3 partition, mounted with
> data=writeback
> ** deadline-original:
> fsync time: 0.7963
> fsync time: 4.5914
> fsync time: 4.2347
> fsync time: 1.1670
> fsync time: 0.8164
> fsync time: 1.9783
> fsync time: 4.9726
> fsync time: 2.4929
> fsync time: 2.5448
> fsync time: 3.9627
> ** cfq 2.6.30-rc2
> fsync time: 0.0288
> fsync time: 0.0528
> fsync time: 0.0299
> fsync time: 0.0397
> fsync time: 0.5720
> fsync time: 0.0409
> fsync time: 0.0876
> fsync time: 0.0294
> fsync time: 0.0485
> ** deadline-patched
> fsync time: 0.0772
> fsync time: 0.0381
> fsync time: 0.0604
> fsync time: 0.2923
> fsync time: 0.2488
> fsync time: 0.0924
> fsync time: 0.0144
> fsync time: 1.4824
> fsync time: 0.0789
> fsync time: 0.0565
> fsync time: 0.0550
> fsync time: 0.0421

At least this test looks a lot better!

--
Jens Axboe

2009-04-23 12:02:10

by Aaron Carroll

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

Corrado Zoccolo wrote:
> Hi,
> deadline I/O scheduler currently classifies all I/O requests in only 2
> classes, reads (always considered high priority) and writes (always
> lower).
> The attached patch, intended to reduce latencies for syncronous writes

This can be achieved by switching to sync/async rather than read/write. No
one has shown results where this makes an improvement. Let us know if
you have a good example.

> and high I/O priority requests, introduces more levels of priorities:
> * real time reads: highest priority and shortest deadline, can starve
> other levels
> * syncronous operations (either best effort reads or RT/BE writes),
> mid priority, starvation for lower level is prevented as usual
> * asyncronous operations (async writes and all IDLE class requests),
> lowest priority and longest deadline
>
> The patch also introduces some new heuristics:
> * for non-rotational devices, reads (within a given priority level)
> are issued in FIFO order, to improve the latency perceived by readers

This might be a good idea. Can you make this a separate patch?
Is there a good reason not to do the same for writes?

> * minimum batch timespan (time quantum): partners with fifo_batch to
> improve throughput, by sending more consecutive requests together. A
> given number of requests will not always take the same time (due to
> amount of seek needed), therefore fifo_batch must be tuned for worst
> cases, while in best cases, having longer batches would give a
> throughput boost.
> * batch start request is chosen fifo_batch/3 requests before the
> expired one, to improve fairness for requests with lower start sector,
> that otherwise have higher probability to miss a deadline than
> mid-sector requests.

I don't like the rest of it. I use deadline because it's a simple,
no surprises, no bullshit scheduler with reasonably good performance
in all situations. Is there some reason why CFQ won't work for you?

> I did few performance comparisons:
> * HDD, ext3 partition with data=writeback, tiotest with 32 threads,
> each writing 80MB of data
>
> ** deadline-original
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs | 103.0 s | 24.848 MB/s | 10.6 % | 522.2 % |
> | Random Write 125 MBs | 98.8 s | 1.265 MB/s | -1.6 % | 16.1 % |
> | Read 2560 MBs | 166.2 s | 15.400 MB/s | 4.2 % | 82.7 % |
> | Random Read 125 MBs | 193.3 s | 0.647 MB/s | -0.8 % | 14.5 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 4.122 ms | 17922.920 ms | 0.07980 | 0.00061 |
> | Random Write | 0.599 ms | 1245.200 ms | 0.00000 | 0.00000 |
> | Read | 8.032 ms | 1125.759 ms | 0.00000 | 0.00000 |
> | Random Read | 181.968 ms | 972.657 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 10.044 ms | 17922.920 ms | 0.03804 | 0.00029 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> ** cfq (2.6.30-rc2)
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs | 132.4 s | 19.342 MB/s | 8.5 % | 400.4 % |
> | Random Write 125 MBs | 107.8 s | 1.159 MB/s | -1.6 % | 16.8 % |
> | Read 2560 MBs | 107.6 s | 23.788 MB/s | 5.4 % | 95.7 % |
> | Random Read 125 MBs | 158.4 s | 0.789 MB/s | 0.9 % | 7.7 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 5.362 ms | 21081.012 ms | 0.09811 | 0.00244 |
> | Random Write | 23.310 ms | 31865.095 ms | 0.13437 | 0.06250 |
> | Read | 5.048 ms | 3694.001 ms | 0.15167 | 0.00000 |
> | Random Read | 146.523 ms | 2880.409 ms | 0.52187 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 8.916 ms | 31865.095 ms | 0.13435 | 0.00262 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> ** deadline-patched
> Tiotest results for 32 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 2560 MBs | 105.3 s | 24.301 MB/s | 10.5 % | 514.8 % |
> | Random Write 125 MBs | 95.9 s | 1.304 MB/s | -1.8 % | 17.3 % |
> | Read 2560 MBs | 165.1 s | 15.507 MB/s | 2.7 % | 61.9 % |
> | Random Read 125 MBs | 110.6 s | 1.130 MB/s | 0.8 % | 12.2 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 4.131 ms | 17456.831 ms | 0.08041 | 0.00275 |
> | Random Write | 2.780 ms | 5073.180 ms | 0.07500 | 0.00000 |
> | Read | 7.748 ms | 936.499 ms | 0.00000 | 0.00000 |
> | Random Read | 104.849 ms | 695.192 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 8.168 ms | 17456.831 ms | 0.04008 | 0.00131 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> * SD card, nilfs2 partition, tiotest with 16 threads, each writing 80MB of data
> ** cfq(2.6.30-rc2)
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 1280 MBs | 217.8 s | 5.878 MB/s | 3.7 % | 92.2 % |
> | Random Write 62 MBs | 18.2 s | 3.432 MB/s | -2.3 % | 28.7 % |
> | Read 1280 MBs | 114.7 s | 11.156 MB/s | 7.3 % | 76.6 % |
> | Random Read 62 MBs | 3.4 s | 18.615 MB/s | -5.4 % | 274.2 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 9.943 ms | 10223.581 ms | 0.14252 | 0.00488 |
> | Random Write | 12.287 ms | 5097.196 ms | 0.25625 | 0.00000 |
> | Read | 5.352 ms | 1550.162 ms | 0.00000 | 0.00000 |
> | Random Read | 3.051 ms | 1507.837 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 7.649 ms | 10223.581 ms | 0.07391 | 0.00233 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> ** deadline-patched:
> Tiotest results for 16 concurrent io threads:
> ,----------------------------------------------------------------------.
> | Item | Time | Rate | Usr CPU | Sys CPU |
> +-----------------------+----------+--------------+----------+---------+
> | Write 1280 MBs | 220.9 s | 5.794 MB/s | 4.0 % | 93.9 % |
> | Random Write 62 MBs | 20.5 s | 3.044 MB/s | -2.2 % | 24.9 % |
> | Read 1280 MBs | 113.2 s | 11.304 MB/s | 6.8 % | 72.8 % |
> | Random Read 62 MBs | 2.9 s | 21.896 MB/s | 5.1 % | 293.8 % |
> `----------------------------------------------------------------------'
> Tiotest latency results:
> ,-------------------------------------------------------------------------.
> | Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
> +--------------+-----------------+-----------------+----------+-----------+
> | Write | 10.078 ms | 13303.036 ms | 0.14160 | 0.00031 |
> | Random Write | 14.350 ms | 5265.088 ms | 0.40000 | 0.00000 |
> | Read | 5.455 ms | 434.495 ms | 0.00000 | 0.00000 |
> | Random Read | 2.685 ms | 12.652 ms | 0.00000 | 0.00000 |
> |--------------+-----------------+-----------------+----------+-----------|
> | Total | 7.801 ms | 13303.036 ms | 0.07682 | 0.00015 |
> `--------------+-----------------+-----------------+----------+-----------'
>
> * fsync-tester results, on HDD, empty ext3 partition, mounted with
> data=writeback
> ** deadline-original:
> fsync time: 0.7963
> fsync time: 4.5914
> fsync time: 4.2347
> fsync time: 1.1670
> fsync time: 0.8164
> fsync time: 1.9783
> fsync time: 4.9726
> fsync time: 2.4929
> fsync time: 2.5448
> fsync time: 3.9627
> ** cfq 2.6.30-rc2
> fsync time: 0.0288
> fsync time: 0.0528
> fsync time: 0.0299
> fsync time: 0.0397
> fsync time: 0.5720
> fsync time: 0.0409
> fsync time: 0.0876
> fsync time: 0.0294
> fsync time: 0.0485
> ** deadline-patched
> fsync time: 0.0772
> fsync time: 0.0381
> fsync time: 0.0604
> fsync time: 0.2923
> fsync time: 0.2488
> fsync time: 0.0924
> fsync time: 0.0144
> fsync time: 1.4824
> fsync time: 0.0789
> fsync time: 0.0565
> fsync time: 0.0550
> fsync time: 0.0421
> ** deadline-patched, ionice -c1:
> fsync time: 0.2569
> fsync time: 0.0500
> fsync time: 0.0681
> fsync time: 0.2863
> fsync time: 0.0140
> fsync time: 0.0171
> fsync time: 0.1198
> fsync time: 0.0530
> fsync time: 0.0503
> fsync time: 0.0462
> fsync time: 0.0484
> fsync time: 0.0328
> fsync time: 0.0562
> fsync time: 0.0451
> fsync time: 0.0576
> fsync time: 0.0444
> fsync time: 0.0469
> fsync time: 0.0368
> fsync time: 0.2865
>
> Corrado
>

2009-04-23 12:14:15

by Jens Axboe

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Thu, Apr 23 2009, Aaron Carroll wrote:
> Corrado Zoccolo wrote:
> > Hi,
> > deadline I/O scheduler currently classifies all I/O requests in only 2
> > classes, reads (always considered high priority) and writes (always
> > lower).
> > The attached patch, intended to reduce latencies for syncronous writes
>
> Can be achieved by switching to sync/async rather than read/write. No
> one has shown results where this makes an improvement. Let us know if
> you have a good example.
>
> > and high I/O priority requests, introduces more levels of priorities:
> > * real time reads: highest priority and shortest deadline, can starve
> > other levels
> > * syncronous operations (either best effort reads or RT/BE writes),
> > mid priority, starvation for lower level is prevented as usual
> > * asyncronous operations (async writes and all IDLE class requests),
> > lowest priority and longest deadline
> >
> > The patch also introduces some new heuristics:
> > * for non-rotational devices, reads (within a given priority level)
> > are issued in FIFO order, to improve the latency perceived by readers
>
> This might be a good idea. Can you make this a separate patch?
> Is there a good reason not to do the same for writes?
>
> > * minimum batch timespan (time quantum): partners with fifo_batch to
> > improve throughput, by sending more consecutive requests together. A
> > given number of requests will not always take the same time (due to
> > amount of seek needed), therefore fifo_batch must be tuned for worst
> > cases, while in best cases, having longer batches would give a
> > throughput boost.
> > * batch start request is chosen fifo_batch/3 requests before the
> > expired one, to improve fairness for requests with lower start sector,
> > that otherwise have higher probability to miss a deadline than
> > mid-sector requests.
>
> I don't like the rest of it. I use deadline because it's a simple,
> no surprises, no bullshit scheduler with reasonably good performance
> in all situations. Is there some reason why CFQ won't work for you?

Fully agree with that, deadline is not going to be changed radically.
Doing sync/async instead of read/write would indeed likely bring the
latency results down on its own; what impact the rest has is unknown.

If CFQ performs poorly for some situations, we fix that.


--
Jens Axboe

2009-04-23 15:57:50

by Corrado Zoccolo

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

Hi Jens,

On Thu, Apr 23, 2009 at 1:28 PM, Jens Axboe <[email protected]> wrote:
> On Wed, Apr 22 2009, Corrado Zoccolo wrote:
>> The patch also introduces some new heuristics:
>> * for non-rotational devices, reads (within a given priority level)
>> are issued in FIFO order, to improve the latency perceived by readers
>
> Danger danger... I smell nasty heuristics.

Ok, I wanted to sneak this heuristic in :), but I can probably drop
it from the initial submission.
The fact is that many people around are using the noop scheduler for SSDs,
to get the ultimate performance out of their hardware, and I wanted to
give them a better alternative. CFQ doesn't honor the non-rotational flag
when tag queuing is not supported, so it will not be an alternative in
such cases.

>
>> * minimum batch timespan (time quantum): partners with fifo_batch to
>> improve throughput, by sending more consecutive requests together. A
>> given number of requests will not always take the same time (due to
>> amount of seek needed), therefore fifo_batch must be tuned for worst
>> cases, while in best cases, having longer batches would give a
>> throughput boost.
>> * batch start request is chosen fifo_batch/3 requests before the
>> expired one, to improve fairness for requests with lower start sector,
>> that otherwise have higher probability to miss a deadline than
>> mid-sector requests.
>
> This is a huge patch, I'm not going to be reviewing this. Make this a
> patchset, each doing that little change separately. Then it's easier to
> review, and much easier to pick the parts that can go in directly and
> leave the ones that either need more work or are not going be merged
> out.
>
Ok.
I think I can split it into:
* add the new heuristics (so those can be evaluated independently of the
read/write vs. sync/async change)
* read/write becomes sync/async
* add I/O priorities

>> I did few performance comparisons:
>> * HDD, ext3 partition with data=writeback, tiotest with 32 threads,
>> each writing 80MB of data
>
> It doesn't seem to make a whole lot of difference, does it?

The intent is not to boost throughput, but to reduce sync latency.
The heuristics were added to avoid throughput regression.

> Main difference here seems to be random read performance, the rest are
> pretty close and could just be noise. Random write is much worse, from a
> latency view point. Is this just one run, or did you average several?

The random writes issued by tiotest are async writes, so their latency
is not an issue, and having higher latencies here actually helps
improve throughput.

>
> For something like this, you also need to consider workloads that
> consist of processes with different IO patterns running at the same
> time. With this tiotest run, you only test sequential readers competing,
> then random readers, etc.

Sure. Do you have any suggestions?
I have another workload, the boot of my netbook, on which the
patched I/O scheduler saves 1s out of 12s.
All the other I/O schedulers, including noop, perform about the same, and
worse (for noop, I think I lose on writes).

> So, please, split the big patch into lots of little separate pieces.
> Benchmark each one separately, so they each carry their own
> justification.

Does a theoretical proof of the unfairness also count as justification for
the fifo_batch/3 back-jump?

>> * fsync-tester results, on HDD, empty ext3 partition, mounted with
>
> At least this test looks a lot better!

This is why the subject says "reducing latencies ..." :)
Maybe I should have started the mails with this test, instead of the
other ones, which show that there was no throughput regression (something
that in principle could happen when trying to reduce latencies).

Thanks,
Corrado


--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-04-23 16:10:55

by Corrado Zoccolo

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Thu, Apr 23, 2009 at 1:52 PM, Aaron Carroll <[email protected]> wrote:
> Corrado Zoccolo wrote:
>> Hi,
>> deadline I/O scheduler currently classifies all I/O requests in only 2
>> classes, reads (always considered high priority) and writes (always
>> lower).
>> The attached patch, intended to reduce latencies for syncronous writes
>
> Can be achieved by switching to sync/async rather than read/write. No
> one has shown results where this makes an improvement. Let us know if
> you have a good example.

Yes, this is exactly what my patch does, and the numbers for
fsync-tester are much better than baseline deadline, almost comparable
with cfq.

>
>> and high I/O priority requests, introduces more levels of priorities:
>> * real time reads: highest priority and shortest deadline, can starve
>> other levels
>> * syncronous operations (either best effort reads or RT/BE writes),
>> mid priority, starvation for lower level is prevented as usual
>> * asyncronous operations (async writes and all IDLE class requests),
>> lowest priority and longest deadline
>>
>> The patch also introduces some new heuristics:
>> * for non-rotational devices, reads (within a given priority level)
>> are issued in FIFO order, to improve the latency perceived by readers
>
> This might be a good idea.
I think Jens doesn't like it very much.
> Can you make this a separate patch?
I have an earlier attempt, much simpler, at:
http://lkml.indiana.edu/hypermail/linux/kernel/0904.1/00667.html
> Is there a good reason not to do the same for writes?
Well, in that case you could just use noop.
I found that this scheme outperforms noop. Random writes, in fact,
perform quite badly on most SSDs (unless you use a logging FS like
nilfs2, which transforms them into sequential writes), so having all
the deadline I/O scheduler machinery to merge write requests is much
better. As I said, my patched I/O scheduler outperforms noop in my
normal usage.


>> * minimum batch timespan (time quantum): partners with fifo_batch to
>> improve throughput, by sending more consecutive requests together. A
>> given number of requests will not always take the same time (due to
>> amount of seek needed), therefore fifo_batch must be tuned for worst
>> cases, while in best cases, having longer batches would give a
>> throughput boost.
>> * batch start request is chosen fifo_batch/3 requests before the
>> expired one, to improve fairness for requests with lower start sector,
>> that otherwise have higher probability to miss a deadline than
>> mid-sector requests.
>
> I don't like the rest of it. I use deadline because it's a simple,
> no surprises, no bullshit scheduler with reasonably good performance
> in all situations. Is there some reason why CFQ won't work for you?

I actually like CFQ, and use it almost everywhere, switching to
deadline only when submitting a heavy-duty workload (having a SysRq
combination to switch I/O schedulers could sometimes be very handy).

However, on SSDs it's not optimal, so I'm developing this to overcome
those limitations.
In the meantime, I also wanted to overcome deadline's limitations, i.e.
the high latencies on fsync/fdatasync.

Corrado

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-04-24 00:11:31

by Aaron Carroll

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

Hi Corrado,

Corrado Zoccolo wrote:
> On Thu, Apr 23, 2009 at 1:52 PM, Aaron Carroll <[email protected]> wrote:
>> Corrado Zoccolo wrote:
>>> Hi,
>>> deadline I/O scheduler currently classifies all I/O requests in only 2
>>> classes, reads (always considered high priority) and writes (always
>>> lower).
>>> The attached patch, intended to reduce latencies for syncronous writes
>> Can be achieved by switching to sync/async rather than read/write. No
>> one has shown results where this makes an improvement. Let us know if
>> you have a good example.
>
> Yes, this is exactly what my patch does, and the numbers for
> fsync-tester are much better than baseline deadline, almost comparable
> with cfq.

The patch does a bunch of other things too. I can't tell what is due to
the read/write -> sync/async change, and what is due to the rest of it.

>>> and high I/O priority requests, introduces more levels of priorities:
>>> * real time reads: highest priority and shortest deadline, can starve
>>> other levels
>>> * syncronous operations (either best effort reads or RT/BE writes),
>>> mid priority, starvation for lower level is prevented as usual
>>> * asyncronous operations (async writes and all IDLE class requests),
>>> lowest priority and longest deadline
>>>
>>> The patch also introduces some new heuristics:
>>> * for non-rotational devices, reads (within a given priority level)
>>> are issued in FIFO order, to improve the latency perceived by readers
>> This might be a good idea.
> I think Jens doesn't like it very much.

Let's convince him :)

I think a nice way to do this would be to make fifo_batch=1 the default
for nonrot devices. Of course this will affect writes too...
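(A sketch of what that default could look like in deadline_init_queue(),
assuming the driver has already set the non-rotational flag by the time
the elevator is initialized, which is not guaranteed:)

        /* sketch: shrink the batch for non-rotational devices,
         * keep the stock default (16) otherwise */
        if (blk_queue_nonrot(q))
                dd->fifo_batch = 1;
        else
                dd->fifo_batch = fifo_batch;

fifo_batch would stay tunable at runtime through the existing sysfs
attribute, so this would only change the starting point.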

One problem here is the definition of nonrot. E.g. if H/W RAID drivers
start setting that flag, it will kill performance. Sorting is important
for arrays of rotational disks.

>> Can you make this a separate patch?
> I have an earlier attempt, much simpler, at:
> http://lkml.indiana.edu/hypermail/linux/kernel/0904.1/00667.html
>> Is there a good reason not to do the same for writes?
> Well, in that case you could just use noop.

Noop doesn't merge as well as deadline, nor does it provide read/write
differentiation. Is there a performance/QoS argument for not doing it?

> I found that this scheme outperforms noop. Random writes, in fact,
> perform quite bad on most SSDs (unless you use a logging FS like
> nilfs2, that transforms them into sequential writes), so having all
> the deadline ioscheduler machinery to merge write requests is much
> better. As I said, my patched IO scheduler outperforms noop on my
> normal usage.

You still get the merging... we are only talking about the issue
order here.

>>> * minimum batch timespan (time quantum): partners with fifo_batch to
>>> improve throughput, by sending more consecutive requests together. A
>>> given number of requests will not always take the same time (due to
>>> amount of seek needed), therefore fifo_batch must be tuned for worst
>>> cases, while in best cases, having longer batches would give a
>>> throughput boost.
>>> * batch start request is chosen fifo_batch/3 requests before the
>>> expired one, to improve fairness for requests with lower start sector,
>>> that otherwise have higher probability to miss a deadline than
>>> mid-sector requests.
>> I don't like the rest of it. I use deadline because it's a simple,
>> no surprises, no bullshit scheduler with reasonably good performance
>> in all situations. Is there some reason why CFQ won't work for you?
>
> I actually like CFQ, and use it almost everywhere, and switch to
> deadline only when submitting an heavy-duty workload (having a SysRq
> combination to switch I/O schedulers could sometimes be very handy).
>
> However, on SSDs it's not optimal, so I'm developing this to overcome
> those limitations.

Is this due to the stall on each batch switch?

> In the meantime, I wanted to overcome also deadline limitations, i.e.
> the high latencies on fsync/fdatasync.

Did you try dropping the expiry times and/or batch size?


-- Aaron

>
> Corrado
>

2009-04-24 06:13:25

by Corrado Zoccolo

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

Hi Aaron
On Fri, Apr 24, 2009 at 1:30 AM, Aaron Carroll <[email protected]> wrote:
> Hi Corrado,
>
> Corrado Zoccolo wrote:
>>
>> On Thu, Apr 23, 2009 at 1:52 PM, Aaron Carroll <[email protected]>
>> wrote:
>>>
>>> Corrado Zoccolo wrote:
>>>>
>>>> Hi,
>>>> deadline I/O scheduler currently classifies all I/O requests in only 2
>>>> classes, reads (always considered high priority) and writes (always
>>>> lower).
>>>> The attached patch, intended to reduce latencies for syncronous writes
>>>
>>> Can be achieved by switching to sync/async rather than read/write.  No
>>> one has shown results where this makes an improvement.  Let us know if
>>> you have a good example.
>>
>> Yes, this is exactly what my patch does, and the numbers for
>> fsync-tester are much better than baseline deadline, almost comparable
>> with cfq.
>
> The patch does a bunch of other things too.  I can't tell what is due to
> the read/write -> sync/async change, and what is due to the rest of it.

Ok, I got it. I'm splitting it into smaller patches.

>>>> and high I/O priority requests, introduces more levels of priorities:
>>>> * real time reads: highest priority and shortest deadline, can starve
>>>> other levels
>>>> * syncronous operations (either best effort reads or RT/BE writes),
>>>> mid priority, starvation for lower level is prevented as usual
>>>> * asyncronous operations (async writes and all IDLE class requests),
>>>> lowest priority and longest deadline
>>>>
>>>> The patch also introduces some new heuristics:
>>>> * for non-rotational devices, reads (within a given priority level)
>>>> are issued in FIFO order, to improve the latency perceived by readers
>>>
>>> This might be a good idea.
>>
>> I think Jens doesn't like it very much.
>
> Let's convince him :)
>
> I think a nice way to do this would be to make fifo_batch=1 the default
> for nonrot devices.  Of course this will affect writes too...

fifo_batch has various implications, also concerning the alternation
between reads and writes.
Moreover, values that are too low negatively affect merging.
In deadline, merging of writeback requests often happens because the
scheduler is busy handling unrelated requests for some time, so incoming
requests have time to accumulate.

>
> One problem here is the definition of nonrot.  E.g. if H/W RAID drivers
> start setting that flag, it will kill performance.  Sorting is important for
> arrays of rotational disks.
>
The flag should have well-defined semantics.
In RAID, I think it could be set on the aggregate device, while the
single disks would be rotational or not depending on their technology.
This could work very well, since each disk would sort only its own
requests, and its scheduler would not waste time on other disks'
requests.
A random-read workload with reads smaller than the RAID stripe would
shine with this.
Clearly, for writes, since multiple disks are touched, the sorting
must be performed at the aggregate-device level to have some
opportunity of reducing data transfers: this corresponds to what my
patch does.

>>> Can you make this a separate patch?
>>
>> I have an earlier attempt, much simpler, at:
>> http://lkml.indiana.edu/hypermail/linux/kernel/0904.1/00667.html
>>>
>>> Is there a good reason not to do the same for writes?
>>
>> Well, in that case you could just use noop.
>
> Noop doesn't merge as well as deadline, nor does is provide read/write
> differentiation.  Is there a performance/QoS argument for not doing it?

I think only experimentation can tell. But the RAID argument above
could make a case.

>> I found that this scheme outperforms noop. Random writes, in fact,
>> perform quite bad on most SSDs (unless you use a logging FS like
>> nilfs2, that transforms them into sequential writes), so having all
>> the deadline ioscheduler machinery to merge write requests is much
>> better. As I said, my patched IO scheduler outperforms noop on my
>> normal usage.
>
> You still get the merging... we are only talking about the issue
> order here.
>

Ditto, more experimentation is needed.

>>>> * minimum batch timespan (time quantum): partners with fifo_batch to
>>>> improve throughput, by sending more consecutive requests together. A
>>>> given number of requests will not always take the same time (due to
>>>> amount of seek needed), therefore fifo_batch must be tuned for worst
>>>> cases, while in best cases, having longer batches would give a
>>>> throughput boost.
>>>> * batch start request is chosen fifo_batch/3 requests before the
>>>> expired one, to improve fairness for requests with lower start sector,
>>>> that otherwise have higher probability to miss a deadline than
>>>> mid-sector requests.
>>>
>>> I don't like the rest of it.  I use deadline because it's a simple,
>>> no surprises, no bullshit scheduler with reasonably good performance
>>> in all situations.  Is there some reason why CFQ won't work for you?
>>
>> I actually like CFQ, and use it almost everywhere, and switch to
>> deadline only when submitting an heavy-duty workload (having a SysRq
>> combination to switch I/O schedulers could sometimes be very handy).
>>
>> However, on SSDs it's not optimal, so I'm developing this to overcome
>> those limitations.
>
> Is this due to the stall on each batch switch?

Possibly (CFQ is too complex to start hacking on without some
experience with something simpler).
AFAIK, the idling should be disabled when nonrot=1, but actually only if
the device supports tag queuing.
I think, however, that the whole machinery of CFQ is too heavy for
non-rotational devices, where a simple FIFO scheme, adjusted with
priorities, can achieve fair handling of requests.

>
>> In the meantime, I wanted to overcome also deadline limitations, i.e.
>> the high latencies on fsync/fdatasync.
>
> Did you try dropping the expiry times and/or batch size?

Yes. Expiry times are soft, so they are often not met.
Dropping the batch size causes a bandwidth drop, making expiry times
be missed even more often due to longer queues (I'm speaking of rotational
devices here, since the latencies affect them as well).

>
>
>   -- Aaron
>
>>
>> Corrado
>>
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-04-24 06:39:56

by Jens Axboe

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Thu, Apr 23 2009, Corrado Zoccolo wrote:
> >> * minimum batch timespan (time quantum): partners with fifo_batch to
> >> improve throughput, by sending more consecutive requests together. A
> >> given number of requests will not always take the same time (due to
> >> amount of seek needed), therefore fifo_batch must be tuned for worst
> >> cases, while in best cases, having longer batches would give a
> >> throughput boost.
> >> * batch start request is chosen fifo_batch/3 requests before the
> >> expired one, to improve fairness for requests with lower start sector,
> >> that otherwise have higher probability to miss a deadline than
> >> mid-sector requests.
> >
> > I don't like the rest of it. I use deadline because it's a simple,
> > no surprises, no bullshit scheduler with reasonably good performance
> > in all situations. Is there some reason why CFQ won't work for you?
>
> I actually like CFQ, and use it almost everywhere, and switch to
> deadline only when submitting an heavy-duty workload (having a SysRq
> combination to switch I/O schedulers could sometimes be very handy).
>
> However, on SSDs it's not optimal, so I'm developing this to overcome
> those limitations.

I find your solution quite confusing - the statement is that CFQ
isn't optimal on SSD, so you modify deadline? ;-)

Most of the "CFQ doesn't work well on SSD" statements are simply wrong.
Now, you seem to have done some testing, so when you say that, you
probably have results that tell you this is the case. But let's attempt
to fix that issue, then!

One thing you pointed out is that CFQ doesn't treat the device as a
"real" SSD unless it does queueing. This is very much on purpose, for
two reasons:

1) I have never seen a non-queueing SSD that actually performs well in
read-vs-write situations, so CFQ still does idling for those.
2) It's a problem that is going away. SSDs coming out today and
in the future WILL definitely do queuing. We can attribute most of
the crap behaviour to the lacking jmicron flash controller, which
also has a crappy SATA interface.

What I am worried about in the future is even faster SSD devices. CFQ is
already down a percent or two when we are doing 100k IOPS and such, and
this problem will only get worse. So I'm very much interested in speeding
up CFQ for such devices, which I think will mainly mean slimming down the
IO path and bypassing much of the (unneeded) complexity for them. The last
thing I want is to have to tell people to use deadline or noop on SSD
devices.

> In the meantime, I wanted to overcome also deadline limitations, i.e.
> the high latencies on fsync/fdatasync.

This is very much something you could pull out of the patchset and we
could include without much questioning.

--
Jens Axboe

2009-04-24 16:08:10

by Corrado Zoccolo

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Fri, Apr 24, 2009 at 8:39 AM, Jens Axboe <[email protected]> wrote:
> I find your solution quite confusing - the statement is that it CFQ
> isn't optimal on SSD, so you modify deadline? ;-)

Well, I find CFQ too confusing to start with, so I chose a simpler one.
If I can prove something with deadline, maybe you will decide to
implement it in CFQ as well ;)

>
> Most of the "CFQ doesn't work well on SSD" statements are mostly wrong.
> Now, you seem to have done some testing, so when you say that you
> probably have actually done some testing that tells you that this is the
> case. But lets attempt to fix that issue, then!
>
> One thing you pointed out is that CFQ doesn't treat the device as a
> "real" SSD unless it does queueing. This is very much on purpose, for
> two reasons:
>
> 1) I have never seen a non-queueing SSD that actually performs well for
> reads-vs-write situations, so CFQ still does idling for those.

Does CFQ idle only when switching between reads and writes, or also
when switching between reads from one process and reads from
another?
I think I'll have to instrument CFQ a bit to understand how it works.
Is there a better way than scattering printks all around?

> 2) It's a problem that is going away. SSD that are coming out today and
> in the future WILL definitely do queuing. We can attribute most of
> the crap behaviour to the lacking jmicron flash controller, which
> also has a crappy SATA interface.

I think SD cards will still be around a lot, and I don't expect them
to have queuing, so some support for them might still be needed.

> What I am worried about in the future is even faster SSD devices. CFQ is
> already down a percent or two when we are doing 100k iops and such, this
> problem will only get worse. So I'm very much interested in speeding up
> CFQ for such devices, which I think will mainly be slimming down the IO
> path and bypassing much of the (unneeded) complexity for them. The last
> thing I want is to have to tell people to use deadline or noop on SSD
> devices.
>

Totally agree. Having the main I/O scheduler perform well in most
scenarios is surely needed.
But this could be achieved in various ways.
What if the main I/O scheduler had various strategies in its toolbox,
and could switch between them based on the workload or the type of
hardware?
FIFO scheduling for reads could be one such strategy, used only when
the conditions are right for it.
Another possibility is to use auto-tuning strategies, but those are
more difficult to devise and test.

>> In the meantime, I wanted to overcome also deadline limitations, i.e.
>> the high latencies on fsync/fdatasync.
>
> This is very much something you could pull out of the patchset and we
> could include without much questioning.
>

Ok, this is the first patch of the series, and contains the code cleanup
needed before changing read/write to sync/async. No behavioral change
is introduced by this patch.

I found where the random read performance is gained, but I didn't
include it in this patch, because it requires the sync/async separation
in order not to negatively impact sync write latencies.

If the following new code, which replicates the existing behaviour:

        if (!dd->next_rq
            || rq_data_dir(dd->next_rq) != data_dir
            || deadline_check_fifo(dd, data_dir)) {
                /*
                 * A deadline has expired, the last request was in the other
                 * direction, or we have run out of higher-sectored requests.

is changed to:

        if (!dd->next_rq
            || rq_data_dir(dd->next_rq) > data_dir
            || deadline_check_fifo(dd, data_dir)) {
                /*
                 * A deadline has expired, the last request was less important
                 * (where WRITE is less important than READ),
                 * or we have run out of higher-sectored requests.

you get both higher random read throughput and higher write latencies:
rq_data_dir() is 0 for READ and 1 for WRITE, so the > test keeps an
in-progress read batch running even when the write-starvation logic has
selected writes, and restarts the batch essentially only when the sorted
continuation is a write while reads were selected, or when a deadline
expires.

Corrado

> --
> Jens Axboe
>
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


Attachments:
deadline-patch-cleanup (4.76 kB)

2009-04-24 21:37:27

by Corrado Zoccolo

Subject: Re: Reduce latencies for synchronous writes and high I/O priority requests in deadline IO scheduler

On Fri, Apr 24, 2009 at 6:07 PM, Corrado Zoccolo <[email protected]> wrote:
> Ok, this is the first patch of the series, and contains code cleanup
> needed before changing read/write to sync/async. No behavioral change
> is introduced by this patch.
>

And this one is the second patch, which changes read/write to sync vs. async.
fsync-tester results, with

        while : ; do time sh -c "dd if=/dev/zero of=bigfile bs=8M count=256 ; sync; rm bigfile"; done

running in the background:

        ./fsync-tester
fsync time: 0.6383
fsync time: 0.1835
fsync time: 0.1744
fsync time: 0.1103
fsync time: 0.1535
fsync time: 0.1545
fsync time: 0.1491
fsync time: 0.1524
fsync time: 0.1609
fsync time: 0.1168
fsync time: 0.1458
fsync time: 0.1328
fsync time: 0.1655
fsync time: 0.1731
fsync time: 0.1356
fsync time: 0.1746
fsync time: 0.1166
fsync time: 0.1609
fsync time: 0.1370
fsync time: 0.1379
fsync time: 0.2096
fsync time: 0.1438
fsync time: 0.1652
fsync time: 0.1612
fsync time: 0.1438

compared with original deadline:
fsync time: 0.7963
fsync time: 4.5914
fsync time: 4.2347
fsync time: 1.1670
fsync time: 0.8164
fsync time: 1.9783
fsync time: 4.9726
fsync time: 2.4929
fsync time: 2.5448
fsync time: 3.9627

Usual tiotest, 32 threads, 80MB each:
Tiotest results for 32 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2560 MBs | 103.6 s | 24.705 MB/s | 9.7 % | 495.6 % |
| Random Write 125 MBs | 95.9 s | 1.304 MB/s | -1.9 % | 23.8 % |
| Read 2560 MBs | 164.3 s | 15.580 MB/s | 3.5 % | 70.1 % |
| Random Read 125 MBs | 129.7 s | 0.964 MB/s | -0.0 % | 16.2 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 4.040 ms | 8949.999 ms | 0.07614 | 0.00000 |
| Random Write | 1.252 ms | 5439.023 ms | 0.03125 | 0.00000 |
| Read | 7.920 ms | 792.899 ms | 0.00000 | 0.00000 |
| Random Read | 123.807 ms | 910.263 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 8.613 ms | 8949.999 ms | 0.03703 | 0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

compared with original deadline:
Tiotest results for 32 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2560 MBs | 103.0 s | 24.848 MB/s | 10.6 % | 522.2 % |
| Random Write 125 MBs | 98.8 s | 1.265 MB/s | -1.6 % | 16.1 % |
| Read 2560 MBs | 166.2 s | 15.400 MB/s | 4.2 % | 82.7 % |
| Random Read 125 MBs | 193.3 s | 0.647 MB/s | -0.8 % | 14.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 4.122 ms | 17922.920 ms | 0.07980 | 0.00061 |
| Random Write | 0.599 ms | 1245.200 ms | 0.00000 | 0.00000 |
| Read | 8.032 ms | 1125.759 ms | 0.00000 | 0.00000 |
| Random Read | 181.968 ms | 972.657 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 10.044 ms | 17922.920 ms | 0.03804 | 0.00029 |
`--------------+-----------------+-----------------+----------+-----------'

We see the improvement on random reads (not as pronounced as with the
full set of heuristics, but still noticeable) and a smaller increase in
random write latency, which does no harm now that we distinguish sync
from async requests.
I can test which of the other heuristics provide the remaining few % of
improvement on random reads, if you are interested.
Otherwise, I'm more inclined to submit another patch adding
io-priorities to this series.

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


Attachments:
deadline-patch-sync-async (10.74 kB)

2009-04-26 12:43:55

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Reduce latencies for syncronous writes and high I/O priority requests in deadline IO scheduler

Hi Jens,
I found fio, a very handy and complete tool for block I/O performance
testing (kudos to the author), and started doing some thorough testing
of the patch, since I couldn't tune tiotest's behaviour and had only a
surface understanding of what was going on.
The test configuration is attached for reference. Each test is run
after dropping the caches. The suffix .2 or .3 indicates the value of
the {writes,async}_starved tunable.

My findings are interesting:
* there is a definite improvement for many readers performing random
reads alongside one sequential writer (I think this is what tiotest was
showing, given the unclear separation - i.e. no fsync - between tiotest
phases). This workload simulates boot on a single-disk machine, with
random reads representing fault-ins for binaries and libraries, and
sequential writes representing log updates.

* the improvement is not present if the number of readers is small
(e.g. 4). In that case performance is similar to original deadline,
which is far below cfq. The problem appears to be caused by the
unfairness towards low-numbered sectors, and shows up only when the
random readers have overlapping reading regions. Let's assume 4
readers, as in my test.
The workload evolves as follows: a read batch is started from the
request that is first in FIFO order. The probability that the batch
starts at the first read in disk order is 1/4, and the probability
that that read will be the first of the next (second) batch is 7/24
(assuming the first reader doesn't post a new request yet). This means
there is an 11/24 probability that we need more than 2 batches to
service all the initial read requests (and then we will service the
starved writer: increasing writes_starved in fact improves the reader
bandwidth). If the reading regions overlap, after the writer is
serviced the FIFO will still be randomly ordered, so the same pattern
repeats.
A perfect scheduler, instead, for each batch in which fewer than
fifo_batch requests are available, should schedule all the available
read requests, i.e. start from the first in disk order instead of the
first in FIFO order (unless a deadline has expired); see the toy
sketch after the next point. This lets the readers progress much
faster. Do you want me to test such a heuristic?
** I think there is also another theoretical bad case in deadline's
behaviour, i.e. when all deadlines expire. In that case it switches to
pure FIFO batch scheduling, whereas scheduling all requests in disk
order would allow a faster recovery. Do you think we should handle
this case as well?
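
Purely to make the proposal concrete, here is a toy, user-space sketch
of both batch-start ideas (nothing is taken from the real
deadline-iosched code; every name and number is illustrative):

#include <stddef.h>
#include <stdio.h>

/* Toy request: only the fields the heuristic needs. */
struct toy_rq {
        unsigned long long sector;      /* start sector of the request */
        long deadline;                  /* expiry time of the request */
};

static const struct toy_rq *lowest_sector(const struct toy_rq *reqs,
                                          size_t nr)
{
        const struct toy_rq *best = &reqs[0];

        for (size_t i = 1; i < nr; i++)
                if (reqs[i].sector < best->sector)
                        best = &reqs[i];
        return best;
}

/* 'reqs' is the pending read queue in FIFO order. */
static const struct toy_rq *pick_batch_start(const struct toy_rq *reqs,
                                             size_t nr, size_t fifo_batch,
                                             long now)
{
        size_t expired = 0;

        if (nr == 0)
                return NULL;

        for (size_t i = 0; i < nr; i++)
                if (reqs[i].deadline <= now)
                        expired++;

        /*
         * Fewer requests than a full batch and nothing urgent: one
         * forward sweep from the lowest sector serves every reader.
         * The suggestion above is that when *all* deadlines have
         * expired, disk order again recovers faster than strict FIFO.
         */
        if ((nr < fifo_batch && expired == 0) || expired == nr)
                return lowest_sector(reqs, nr);

        /* Otherwise keep the FIFO-driven choice, so no single request
         * can be starved indefinitely. */
        return &reqs[0];
}

int main(void)
{
        const struct toy_rq pending[] = {
                { 9000, 500 }, { 1000, 510 }, { 5000, 520 },
        };

        printf("batch starts at sector %llu\n",
               pick_batch_start(pending, 3, 16, /*now=*/100)->sector);
        return 0;
}

The FIFO fallback keeps the existing starvation guarantee; the
disk-order start is used only when it cannot delay an urgent request.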

* I think that, now that we differentiate between sync and async
writes, we can painlessly increase the async_starved tunable (a toy
sketch of the starvation accounting follows this point). This will
provide better performance for mixed workloads such as random readers
mixed with a sequential writer. In particular, the 32 readers / 1
writer test shows impressive performance: full write bandwidth is
achieved, while reader bandwidth outperforms all other schedulers,
including cfq (which instead completely starves the writer).
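
For context, a toy model of the starvation accounting such a tunable
controls (a sketch assuming the same scheme deadline already uses for
writes_starved; all names are illustrative, not the real elevator
code):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: sync requests are preferred, but once async requests
 * have been passed over 'async_starved' times in a row, one async
 * batch is issued. */
struct toy_sched {
        int starved;            /* batches issued while async waited */
        int async_starved;      /* tunable: max batches async may wait */
};

/* Returns true if the next batch should be sync, false if async. */
static bool pick_sync(struct toy_sched *s, bool sync_pending,
                      bool async_pending)
{
        if (sync_pending &&
            (!async_pending || s->starved < s->async_starved)) {
                s->starved++;
                return true;
        }
        s->starved = 0;
        return false;
}

int main(void)
{
        struct toy_sched s = { .starved = 0, .async_starved = 3 };

        /* Both queues busy: async gets one batch out of every four. */
        for (int i = 0; i < 8; i++)
                printf("batch %d: %s\n", i,
                       pick_sync(&s, true, true) ? "sync" : "async");
        return 0;
}

Raising the tunable simply lets more sync batches run back to back
before the pending async writes get their turn, which is why it favours
the random readers without starving the writer completely.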

* on my machine there is a regression on sequential write (2 parallel
sequential writers, instead, give better performance, and 1 sequential
writer mixed with many random readers maxes out the write bandwidth).
Interestingly, this regression disappears when I spread some printks
around. It is therefore a timing issue that causes fewer merges to
happen (I think this could be fixed by allowing async writes to be
dispatched only after an initial delay; a rough sketch of that idea
follows the fio output below):
# run with printks #

seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [ 0/ 0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=4838
write: io=1010MiB, bw=30967KiB/s, iops=7560, runt= 34193msec
clat (usec): min=7, max=4274K, avg=114.53, stdev=13822.22
cpu : usr=1.44%, sys=9.47%, ctx=1299, majf=0, minf=154
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/258510, short=0/0
lat (usec): 10=45.00%, 20=52.36%, 50=2.18%, 100=0.05%, 250=0.32%
lat (usec): 500=0.03%, 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.01%, 10=0.01%, 100=0.01%, 250=0.02%
lat (msec): 500=0.01%, 2000=0.01%, >=2000=0.01%

Run status group 0 (all jobs):
WRITE: io=1010MiB, aggrb=30967KiB/s, minb=30967KiB/s,
maxb=30967KiB/s, mint=34193msec, maxt=34193msec

Disk stats (read/write):
sda: ios=35/8113, merge=0/250415, ticks=2619/4277418,
in_queue=4280032, util=96.52%

# run without printks #
seqwrite: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=psync, iodepth=1
Starting 1 process
Jobs: 1 (f=1): [F] [100.0% done] [ 0/ 0 kb/s] [eta 00m:00s]
seqwrite: (groupid=0, jobs=1): err= 0: pid=5311
write: io=897076KiB, bw=26726KiB/s, iops=6524, runt= 34371msec
clat (usec): min=7, max=1801K, avg=132.11, stdev=6407.61
cpu : usr=1.14%, sys=7.84%, ctx=1272, majf=0, minf=318
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/224269, short=0/0
lat (usec): 10=49.04%, 20=49.05%, 50=1.17%, 100=0.07%, 250=0.51%
lat (usec): 500=0.02%, 750=0.01%, 1000=0.01%
lat (msec): 2=0.01%, 4=0.02%, 10=0.01%, 20=0.01%, 50=0.01%
lat (msec): 100=0.05%, 250=0.03%, 500=0.01%, 2000=0.01%

Run status group 0 (all jobs):
WRITE: io=897076KiB, aggrb=26726KiB/s, minb=26726KiB/s,
maxb=26726KiB/s, mint=34371msec, maxt=34371msec

Disk stats (read/write):
sda: ios=218/7041, merge=0/217243, ticks=16638/4254061,
in_queue=4270696, util=98.92%
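
As an aside, the "initial delay" idea could look something like this
toy, user-space sketch (purely illustrative; none of these names exist
in the real scheduler): an async write becomes eligible for dispatch
only after a short hold-off, unless its deadline has expired, so
contiguous writes from the same stream get a chance to arrive and be
merged first.

#include <stdbool.h>
#include <stdio.h>

/* Toy request: only the fields the hold-off check needs. */
struct toy_rq {
        long queued_at;         /* time the request entered the queue */
        long deadline;          /* time by which it must be dispatched */
};

/* Hold async writes back for this long so contiguous requests from
 * the same writer can arrive and be merged before dispatch. */
#define ASYNC_HOLDOFF   4

static bool async_write_eligible(const struct toy_rq *rq, long now)
{
        if (rq->deadline <= now)        /* never delay an expired request */
                return true;
        return now - rq->queued_at >= ASYNC_HOLDOFF;
}

int main(void)
{
        struct toy_rq rq = { .queued_at = 100, .deadline = 600 };

        for (long now = 100; now <= 106; now++)
                printf("t=%ld eligible=%d\n", now,
                       async_write_eligible(&rq, now));
        return 0;
}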

Corrado

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


Attachments:
test2.fio (5.64 kB)
deadline-iosched-orig.2 (39.81 kB)
deadline-iosched-patched.2 (39.72 kB)
deadline-iosched-orig.3 (39.76 kB)
deadline-iosched-patched.3 (39.79 kB)
cfq (39.74 kB)
Download all attachments