2009-09-15 07:40:17

by Tobias Oetiker

Subject: unfair io behaviour for high load interactive use still present in 2.6.31

Experts,

We run several busy NFS file servers with Areca HW Raid + LVM2 + ext3

We find that read bandwidth drops dramatically and response times
climb to several seconds as soon as the system comes under heavy
write strain.

With the release of 2.6.31 and all the io fixes that went in there,
I was hoping for a solution and set out to do some tests ...

I have seen the io problem posted on LKML a few times, and normally
it is simulated by running several concurrent dd processes one
reading and one writing. It seems that cfq with a low slice_async
can deal pretty well with competing dds.

Unfortunately our use case is not users running dd but rather a lot
of processes accessing many small to medium sized files for reading
and writing.

I have written a test program that unpacks Linux 2.6.30.5 a few
times into a file system, flushes the cache and then tars it up
again while unpacking some more tars in parallel. While this is
running I use iostat to watch the activity on the block devices. As
I am interested in interactive performance of the system, the await
column as well as the rMB/s column are of special interest to me:

iostat -m -x 5

Even with a low 'resolution' of 5 seconds, the performance figures
are jumping all over the place. This concurs with the user
experience when working with the system interactively.
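For the curious: to track the fluctuations over time I pull the interesting columns out of the iostat output with a small helper along these lines (a sketch only; it assumes the exact 12-column layout of `iostat -m -x` device lines as shown further down):

```python
# Sketch: extract rMB/s and await from `iostat -m -x` device lines.
# Assumed column layout (12 fields per device line):
# Device rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util

def parse_iostat_line(line):
    """Return (rmb_per_s, await_ms) for a device line, or None for
    headers, separators and anything else that does not parse."""
    fields = line.split()
    if len(fields) != 12:
        return None
    try:
        return float(fields[5]), float(fields[9])
    except ValueError:
        return None

if __name__ == "__main__":
    sample = ("dm-5 0.00 0.00 26.60 7036.60 0.10 27.49"
              " 8.00 669.04 91.58 0.09 66.96")
    print(parse_iostat_line(sample))  # -> (0.1, 91.58)
```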

I tried to optimize the configuration systematically turning all
the knobs I know of (/proc/sys/vm, /sys/block/*/queue/scheduler,
data=journal, data=ordered, external journal on an ssd device) one
at a time. I found that cfq helps the read performance quite a lot
as far as total run-time is concerned, but the jerky nature of the
measurements does not change, and also the read performance keeps
dropping dramatically as soon as it is in competition with writers.
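For illustration, a typical round of this knob-twiddling looked something like the following (the device, volume and mount point names are examples only, as are the values; none of it settled the jitter):

```shell
# Examples of the tunables mentioned above (illustrative values only;
# sdc stands in for the physical device behind the LVM volumes).

# I/O scheduler and cfq slice tuning
echo cfq > /sys/block/sdc/queue/scheduler
echo 20  > /sys/block/sdc/queue/iosched/slice_async

# VM writeback tuning
echo 5  > /proc/sys/vm/dirty_background_ratio
echo 10 > /proc/sys/vm/dirty_ratio

# switch the ext3 journaling mode (example volume / mount point)
mount -o remount,data=ordered /dev/vg0/data /srv/data
```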

I would love to get some hints on how to make such a setup perform
without these huge performance fluctuations.

While testing, I saw that iostat reports huge wMB/s numbers and
ridiculously low rMB/s numbers. Looking at the actual amount of
data in the tar files as well as the run time of the tar processes
the MB/s numbers reported by iostat do seem strange. The
read numbers are too low and the write numbers are too high.

12 * 1.3 GB reading in 270s = 14 MB/s sustained
12 * 0.3 GB writing in 180s = 1.8 MB/s sustained

Since I am looking at relative performance figures this does not
matter so much, but it is still a bit disconcerting.

My test script is available at http://tobi.oetiker.ch/fspunisher/

Below is an excerpt from iostat while the test is in full swing:

* 2.6.31 (8 cpu x86_64, 24 GB Ram)
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes
on a single physical volume

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5             0.00     0.00   26.60 7036.60     0.10    27.49     8.00   669.04   91.58   0.09  66.96
dm-5             0.00     0.00   63.80 2543.00     0.25     9.93     8.00  1383.63  539.28   0.36  94.64
dm-5             0.00     0.00   78.00 5084.60     0.30    19.86     8.00  1007.41  195.12   0.15  77.36
dm-5             0.00     0.00   44.00 5588.00     0.17    21.83     8.00   516.27   91.69   0.17  95.44
dm-5             0.00     0.00    0.00 6014.20     0.00    23.49     8.00  1331.42   66.25   0.13  76.48
dm-5             0.00     0.00   28.80 4491.40     0.11    17.54     8.00  1000.37  412.09   0.17  78.24
dm-5             0.00     0.00   36.60 6486.40     0.14    25.34     8.00   765.12  128.07   0.11  72.16
dm-5             0.00     0.00   33.40 5095.60     0.13    19.90     8.00   431.38   43.78   0.17  85.20

For comparison, these are the numbers seen when running the test
with just the writing enabled:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5             0.00     0.00    4.00 12047.40    0.02    47.06     8.00   989.81   79.55   0.07  79.20
dm-5             0.00     0.00    3.40 12399.00    0.01    48.43     8.00   977.13   72.15   0.05  58.08
dm-5             0.00     0.00    3.80 13130.00    0.01    51.29     8.00  1130.48   95.11   0.04  58.48
dm-5             0.00     0.00    2.40 5109.20     0.01    19.96     8.00   427.75   47.41   0.16  79.92
dm-5             0.00     0.00    3.20    0.00     0.01     0.00     8.00   290.33 148653.75 282.50  90.40
dm-5             0.00     0.00    3.40 5103.00     0.01    19.93     8.00   168.75   33.06   0.13  67.84

And also with just the reading:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5             0.00     0.00  463.80    0.00     1.81     0.00     8.00     3.90    8.41   2.16 100.00
dm-5             0.00     0.00  434.20    0.00     1.70     0.00     8.00     3.89    8.95   2.30 100.00
dm-5             0.00     0.00  540.80    0.00     2.11     0.00     8.00     3.88    7.18   1.85 100.00
dm-5             0.00     0.00  591.60    0.00     2.31     0.00     8.00     3.84    6.50   1.68  99.68
dm-5             0.00     0.00  793.20    0.00     3.10     0.00     8.00     3.81    4.80   1.26 100.00
dm-5             0.00     0.00  592.80    0.00     2.32     0.00     8.00     3.84    6.47   1.68  99.60
dm-5             0.00     0.00  578.80    0.00     2.26     0.00     8.00     3.85    6.66   1.73 100.00
dm-5             0.00     0.00  771.00    0.00     3.01     0.00     8.00     3.81    4.93   1.30  99.92

I also tested what happens on 2.6.31 when I run the same 'load' on a single lvm
logical volume. Interestingly enough, it is even worse than when running on three
volumes:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5             0.00     0.00   19.20 4566.80     0.07    17.84     8.00  2232.12  486.87   0.22  99.92
dm-5             0.00     0.00    6.80 9410.40     0.03    36.76     8.00  1827.47  187.99   0.10  92.00
dm-5             0.00     0.00    4.00    0.00     0.02     0.00     8.00   685.26 185618.40 249.60  99.84
dm-5             0.00     0.00    4.20 4968.20     0.02    19.41     8.00  1426.45  286.86   0.20  99.84
dm-5             0.00     0.00   10.60 9886.00     0.04    38.62     8.00   167.57    5.74   0.09  88.72
dm-5             0.00     0.00    5.00    0.00     0.02     0.00     8.00  1103.98 242774.88 199.68  99.84
dm-5             0.00     0.00   38.20 14794.60    0.15    57.79     8.00  1171.75   74.25   0.06  87.20

I also tested 2.6.31 with the io-controller v9 patches. They seem to help a
bit with the read rate, but the figures still jump all over the place:

* 2.6.31 with io-controller patches v9 (8 cpu x86_64, 24 GB Ram)
* fairness set to 1 on all block devices
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes
on a single physical volume

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5              0.00     0.00  412.00 1640.60     1.61     6.41     8.00  1992.54 1032.27   0.49  99.84
dm-5              0.00     0.00  362.40  576.40     1.42     2.25     8.00   456.13  612.67   1.07 100.00
dm-5              0.00     0.00  211.80 1004.40     0.83     3.92     8.00  1186.20  995.00   0.82 100.00
dm-5              0.00     0.00   44.20  719.60     0.17     2.81     8.00   788.56  574.81   1.31  99.76
dm-5              0.00     0.00    0.00 1274.80     0.00     4.98     8.00  1584.07 1317.32   0.78 100.00
dm-5              0.00     0.00    0.00  946.91     0.00     3.70     8.00   989.09  911.30   1.05  99.72
dm-5              0.00     0.00    7.20 2526.00     0.03     9.87     8.00  2085.57  201.72   0.37  92.88


For completeness' sake I ran the tests on 2.6.24 as well; the results are not much different.

* 2.6.24 (8 cpu x86_64, 24 GB Ram)
* scheduler = cfq
* iostat -m -x dm-5 5
* running in parallel on 3 lvm logical volumes
situated on a single physical volume

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5              0.00     0.00  144.60 5498.00     0.56    21.48     8.00  1215.82  215.48   0.14  76.80
dm-5              0.00     0.00   30.00 9831.40     0.12    38.40     8.00  1729.16  142.37   0.09  88.60
dm-5              0.00     0.00   27.60 4126.40     0.11    16.12     8.00  2245.24  618.77   0.21  86.00
dm-5              0.00     0.00    2.00 3981.20     0.01    15.55     8.00  1069.07  268.40   0.23  91.60
dm-5              0.00     0.00   40.60   13.20     0.16     0.05     8.00     3.98   74.83  15.02  80.80
dm-5              0.00     0.00    5.60 5085.20     0.02    19.86     8.00  2586.65  508.10   0.18  94.00
dm-5              0.00     0.00   20.80 5344.60     0.08    20.88     8.00   985.51  148.96   0.17  92.60


cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900


2009-09-15 14:53:27

by Corrado Zoccolo

Subject: Re: unfair io behaviour for high load interactive use still present in 2.6.31

On Tue, Sep 15, 2009 at 9:30 AM, Tobias Oetiker <[email protected]> wrote:
> Experts,
>
> We run several busy NFS file servers with Areca HW Raid + LVM2 + ext3
>
> We find that the read bandwidth falls dramatically as well as the
> response times going up to several seconds as soon as the system
> comes under heavy write strain.
>
> With the release of 2.6.31 and all the io fixes that went in there,
> I was hoping for a solution and set out to do some tests ...
>
> I have seen the io problem posted on LKML a few times, and normally
> it is simulated by running several concurrent dd processes one
> reading and one writing. It seems that cfq with a low slice_async
> can deal pretty well with competing dds.
>
> Unfortunately our use case is not users running dd but rather a lot
> of processes accessing many small to medium sized files for reading
> and writing.
>
> I have written a test program that unpacks Linux 2.6.30.5 a few
> times into a file system, flushes the cache and then tars it up
> again while unpacking some more tars in parallel. While this is
> running I use iostat to watch the activity on the block devices. As
> I am interested in interactive performance of the system, the await
> row as well as the rMB/s row are of special interest to me
>
>  iostat -m -x 5
>
> Even with a low 'resolution' of 5 seconds, the performance figures
> are jumping all over the place. This concurs with the user
> experience when working with the system interactively.
>
> I tried to optimize the configuration systematically turning all
> the knobs I know of (/proc/sys/vm, /sys/block/*/queue/scheduler,
> data=journal, data=ordered, external journal on a ssd device) one
> at a time. I found that cfq helps the read performance quite a lot
> as far as total run-time is concerned, but the jerky nature of the
> measurements does not change, and also the read performance keeps
> dropping dramatically as soon as it is in competition with writers.
>
> I would love to get some hints on how to make such a setup perform
> without these huge performance fluctuations.
>
> While testing, I saw that iostat reports huge wMB/s numbers and
> ridiculously low rMB/s numbers. Looking at the actual amount of
> data in the tar files as well as the run time of the tar processes
> the numbers MB/s numbers reported by iostat do seem strange. The
> read numbers are too low and the write numbers are too high.
>
>  12 * 1.3 GB reading in 270s = 14 MB/s sustained
>  12 * 0.3 GB writing in 180s = 1.8 MB/s sustained
>
> Since I am looking at relative performance figures this does not
> matter so much, but it is still a bit disconcerting.
>
> My tests script is available on http://tobi.oetiker.ch/fspunisher/
>
> Below is an excerpt from iostat while the test is in full swing:
>
> * 2.6.31 (8 cpu x86_64, 24 GB Ram)
> * scheduler = cfq
> * iostat -m -x dm-5 5
> * running in parallel on 3 lvm logical volumes
>  on a single physical volume
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5             0.00     0.00   26.60 7036.60     0.10    27.49     8.00   669.04   91.58   0.09  66.96
> dm-5             0.00     0.00   63.80 2543.00     0.25     9.93     8.00  1383.63  539.28   0.36  94.64
> dm-5             0.00     0.00   78.00 5084.60     0.30    19.86     8.00  1007.41  195.12   0.15  77.36
> dm-5             0.00     0.00   44.00 5588.00     0.17    21.83     8.00   516.27   91.69   0.17  95.44
> dm-5             0.00     0.00    0.00 6014.20     0.00    23.49     8.00  1331.42   66.25   0.13  76.48
> dm-5             0.00     0.00   28.80 4491.40     0.11    17.54     8.00  1000.37  412.09   0.17  78.24
> dm-5             0.00     0.00   36.60 6486.40     0.14    25.34     8.00   765.12  128.07   0.11  72.16
> dm-5             0.00     0.00   33.40 5095.60     0.13    19.90     8.00   431.38   43.78   0.17  85.20
>
> For comparison, these are the numbers seen when running the test
> with just the writing enabled:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5             0.00     0.00    4.00 12047.40    0.02    47.06     8.00   989.81   79.55   0.07  79.20
> dm-5             0.00     0.00    3.40 12399.00    0.01    48.43     8.00   977.13   72.15   0.05  58.08
> dm-5             0.00     0.00    3.80 13130.00    0.01    51.29     8.00  1130.48   95.11   0.04  58.48
> dm-5             0.00     0.00    2.40 5109.20     0.01    19.96     8.00   427.75   47.41   0.16  79.92
> dm-5             0.00     0.00    3.20    0.00     0.01     0.00     8.00   290.33 148653.75 282.50  90.40
> dm-5             0.00     0.00    3.40 5103.00     0.01    19.93     8.00   168.75   33.06   0.13  67.84
>
> And also with just the reading:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5             0.00     0.00  463.80    0.00     1.81     0.00     8.00     3.90    8.41   2.16 100.00
> dm-5             0.00     0.00  434.20    0.00     1.70     0.00     8.00     3.89    8.95   2.30 100.00
> dm-5             0.00     0.00  540.80    0.00     2.11     0.00     8.00     3.88    7.18   1.85 100.00
> dm-5             0.00     0.00  591.60    0.00     2.31     0.00     8.00     3.84    6.50   1.68  99.68
> dm-5             0.00     0.00  793.20    0.00     3.10     0.00     8.00     3.81    4.80   1.26 100.00
> dm-5             0.00     0.00  592.80    0.00     2.32     0.00     8.00     3.84    6.47   1.68  99.60
> dm-5             0.00     0.00  578.80    0.00     2.26     0.00     8.00     3.85    6.66   1.73 100.00
> dm-5             0.00     0.00  771.00    0.00     3.01     0.00     8.00     3.81    4.93   1.30  99.92
>

From the numbers, it appears that your controller (or the disks) is
lying about writes. They get cached (service time around 0.1ms) and
reported as completed. When the cache is full, or a flush is issued,
you get >200ms delay.
Reads, instead, have a more reasonable service time of around 2ms.

Can you disable write cache, and see if the response times for writes
become reasonable?
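For a directly attached disk this can be done e.g. as follows (the device name is an example; with an Areca HW RAID the cache is usually toggled through the controller's own CLI or BIOS instead):

```shell
# Example: turn off the volatile write cache on a plain disk.
hdparm -W 0 /dev/sdc          # ATA: disable write caching
sdparm --clear=WCE /dev/sdc   # SCSI: clear the Write Cache Enable bit
```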

> I also tested 2.6.31 for what happens when I run the same 'load' on a single lvm
> logical volume. Interestingly enough it is even worse than when doing it on three
> volumes:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5             0.00     0.00   19.20 4566.80     0.07    17.84     8.00  2232.12  486.87   0.22  99.92
> dm-5             0.00     0.00    6.80 9410.40     0.03    36.76     8.00  1827.47  187.99   0.10  92.00
> dm-5             0.00     0.00    4.00    0.00     0.02     0.00     8.00   685.26 185618.40 249.60  99.84
> dm-5             0.00     0.00    4.20 4968.20     0.02    19.41     8.00  1426.45  286.86   0.20  99.84
> dm-5             0.00     0.00   10.60 9886.00     0.04    38.62     8.00   167.57    5.74   0.09  88.72
> dm-5             0.00     0.00    5.00    0.00     0.02     0.00     8.00  1103.98 242774.88 199.68  99.84
> dm-5             0.00     0.00   38.20 14794.60    0.15    57.79     8.00  1171.75   74.25   0.06  87.20
>
> I also tested 2.6.31 with io-controller v9 patches. It seems to help a
> bit with the read rate, but the figures still jumps all over the place
>
> * 2.6.31 with io-controller patches v9 (8 cpu x86_64, 24 GB Ram)
> * fairness set to 1 on all block devices
> * scheduler = cfq
> * iostat -m -x dm-5 5
> * running in parallel on 3 lvm logical volumes
>  on a single physical volume
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5              0.00     0.00  412.00 1640.60     1.61     6.41     8.00  1992.54 1032.27   0.49  99.84
> dm-5              0.00     0.00  362.40  576.40     1.42     2.25     8.00   456.13  612.67   1.07 100.00
> dm-5              0.00     0.00  211.80 1004.40     0.83     3.92     8.00  1186.20  995.00   0.82 100.00
> dm-5              0.00     0.00   44.20  719.60     0.17     2.81     8.00   788.56  574.81   1.31  99.76
> dm-5              0.00     0.00    0.00 1274.80     0.00     4.98     8.00  1584.07 1317.32   0.78 100.00
> dm-5              0.00     0.00    0.00  946.91     0.00     3.70     8.00   989.09  911.30   1.05  99.72
> dm-5              0.00     0.00    7.20 2526.00     0.03     9.87     8.00  2085.57  201.72   0.37  92.88
>
>
> For completeness' sake I ran the tests on 2.6.24 as well; the results are not much different.
>
> * 2.6.24 (8 cpu x86_64, 24 GB Ram)
> * scheduler = cfq
> * iostat -m -x dm-5 5
> * running in parallel on 3 lvm logical volumes
>  situated on a single physical volume
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-5              0.00     0.00  144.60 5498.00     0.56    21.48     8.00  1215.82  215.48   0.14  76.80
> dm-5              0.00     0.00   30.00 9831.40     0.12    38.40     8.00  1729.16  142.37   0.09  88.60
> dm-5              0.00     0.00   27.60 4126.40     0.11    16.12     8.00  2245.24  618.77   0.21  86.00
> dm-5              0.00     0.00    2.00 3981.20     0.01    15.55     8.00  1069.07  268.40   0.23  91.60
> dm-5              0.00     0.00   40.60   13.20     0.16     0.05     8.00     3.98   74.83  15.02  80.80
> dm-5              0.00     0.00    5.60 5085.20     0.02    19.86     8.00  2586.65  508.10   0.18  94.00
> dm-5              0.00     0.00   20.80 5344.60     0.08    20.88     8.00   985.51  148.96   0.17  92.60
>
>
> cheers
> tobi
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-09-15 21:07:58

by Tobias Oetiker

Subject: Re: unfair io behaviour for high load interactive use still present in 2.6.31

Today Corrado Zoccolo wrote:

> On Tue, Sep 15, 2009 at 9:30 AM, Tobias Oetiker <[email protected]> wrote:
> > Below is an excerpt from iostat while the test is in full swing:
> >
> > * 2.6.31 (8 cpu x86_64, 24 GB Ram)
> > * scheduler = cfq
> > * iostat -m -x dm-5 5
> > * running in parallel on 3 lvm logical volumes
> >  on a single physical volume
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-5             0.00     0.00   26.60 7036.60     0.10    27.49     8.00   669.04   91.58   0.09  66.96
> > dm-5             0.00     0.00   63.80 2543.00     0.25     9.93     8.00  1383.63  539.28   0.36  94.64
> > dm-5             0.00     0.00   78.00 5084.60     0.30    19.86     8.00  1007.41  195.12   0.15  77.36
> > dm-5             0.00     0.00   44.00 5588.00     0.17    21.83     8.00   516.27   91.69   0.17  95.44
> > dm-5             0.00     0.00    0.00 6014.20     0.00    23.49     8.00  1331.42   66.25   0.13  76.48
> > dm-5             0.00     0.00   28.80 4491.40     0.11    17.54     8.00  1000.37  412.09   0.17  78.24
> > dm-5             0.00     0.00   36.60 6486.40     0.14    25.34     8.00   765.12  128.07   0.11  72.16
> > dm-5             0.00     0.00   33.40 5095.60     0.13    19.90     8.00   431.38   43.78   0.17  85.20
> >
> > For comparison, these are the numbers seen when running the test
> > with just the writing enabled:
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-5             0.00     0.00    4.00 12047.40    0.02    47.06     8.00   989.81   79.55   0.07  79.20
> > dm-5             0.00     0.00    3.40 12399.00    0.01    48.43     8.00   977.13   72.15   0.05  58.08
> > dm-5             0.00     0.00    3.80 13130.00    0.01    51.29     8.00  1130.48   95.11   0.04  58.48
> > dm-5             0.00     0.00    2.40 5109.20     0.01    19.96     8.00   427.75   47.41   0.16  79.92
> > dm-5             0.00     0.00    3.20    0.00     0.01     0.00     8.00   290.33 148653.75 282.50  90.40
> > dm-5             0.00     0.00    3.40 5103.00     0.01    19.93     8.00   168.75   33.06   0.13  67.84
> >
>
> From the numbers, it appears that your controller (or the disks) is
> lying about writes. They get cached (service time around 0.1ms) and
> reported as completed. When the cache is full, or a flush is issued,
> you get >200ms delay.
> Reads, instead, have a more reasonable service time of around 2ms.
>
> Can you disable write cache, and see if the response times for writes
> become reasonable?

Sure can ... the write-only results look like this. As you predicted, the
wMB/s has gone down to more realistic levels, but the await has also gone way up.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-18             0.00     0.00    0.00 2566.80     0.00    10.03     8.00  2224.74  737.62   0.39 100.00
dm-18             0.00     0.00    9.60  679.00     0.04     2.65     8.00   400.41 1029.73   1.35  92.80
dm-18             0.00     0.00    0.00 2080.80     0.00     8.13     8.00   906.58  456.45   0.48 100.00
dm-18             0.00     0.00    0.00 2349.20     0.00     9.18     8.00  1351.17  491.44   0.43 100.00
dm-18             0.00     0.00    3.80  665.60     0.01     2.60     8.00   906.72 1098.75   1.39  93.20
dm-18             0.00     0.00    0.00 1811.20     0.00     7.07     8.00  1008.23  725.34   0.55 100.00
dm-18             0.00     0.00    0.00 2632.60     0.00    10.28     8.00  1651.18  640.61   0.38 100.00


In the read/write test the write rate stays down, but the rMB/s is even worse and the await is also way up,
so I guess the bad performance cannot be blamed on the cache in the controller ...

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-18             0.00     0.00    0.00 1225.80     0.00     4.79     8.00  1050.49  807.38   0.82 100.00
dm-18             0.00     0.00    0.00 1721.80     0.00     6.73     8.00  1823.67  807.20   0.58 100.00
dm-18             0.00     0.00    0.00 1128.00     0.00     4.41     8.00   617.94  832.52   0.89 100.00
dm-18             0.00     0.00    0.00  838.80     0.00     3.28     8.00   873.04 1056.37   1.19 100.00
dm-18             0.00     0.00   39.60  347.80     0.15     1.36     8.00   590.27 1880.05   2.57  99.68
dm-18             0.00     0.00    0.00 1626.00     0.00     6.35     8.00   983.85  452.72   0.62 100.00
dm-18             0.00     0.00    0.00 1117.00     0.00     4.36     8.00  1047.16 1117.78   0.90 100.00
dm-18             0.00     0.00    0.00 1319.20     0.00     5.15     8.00   840.64  573.39   0.76 100.00

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-09-16 07:29:06

by Corrado Zoccolo

Subject: Re: unfair io behaviour for high load interactive use still present in 2.6.31

Hi Tobias,
On Tue, Sep 15, 2009 at 11:07 PM, Tobias Oetiker <[email protected]> wrote:
> Today Corrado Zoccolo wrote:
>
>> On Tue, Sep 15, 2009 at 9:30 AM, Tobias Oetiker <[email protected]> wrote:
>> > Below is an excerpt from iostat while the test is in full swing:
>> >
>> > * 2.6.31 (8 cpu x86_64, 24 GB Ram)
>> > * scheduler = cfq
>> > * iostat -m -x dm-5 5
>> > * running in parallel on 3 lvm logical volumes
>> >  on a single physical volume
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > ---------------------------------------------------------------------------------------------------------
>> > dm-5             0.00     0.00   26.60 7036.60     0.10    27.49     8.00   669.04   91.58   0.09  66.96
>> > dm-5             0.00     0.00   63.80 2543.00     0.25     9.93     8.00  1383.63  539.28   0.36  94.64
>> > dm-5             0.00     0.00   78.00 5084.60     0.30    19.86     8.00  1007.41  195.12   0.15  77.36
>> > dm-5             0.00     0.00   44.00 5588.00     0.17    21.83     8.00   516.27   91.69   0.17  95.44
>> > dm-5             0.00     0.00    0.00 6014.20     0.00    23.49     8.00  1331.42   66.25   0.13  76.48
>> > dm-5             0.00     0.00   28.80 4491.40     0.11    17.54     8.00  1000.37  412.09   0.17  78.24
>> > dm-5             0.00     0.00   36.60 6486.40     0.14    25.34     8.00   765.12  128.07   0.11  72.16
>> > dm-5             0.00     0.00   33.40 5095.60     0.13    19.90     8.00   431.38   43.78   0.17  85.20
>> >
>> > For comparison, these are the numbers seen when running the test
>> > with just the writing enabled:
>> >
>> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > ---------------------------------------------------------------------------------------------------------
>> > dm-5             0.00     0.00    4.00 12047.40    0.02    47.06     8.00   989.81   79.55   0.07  79.20
>> > dm-5             0.00     0.00    3.40 12399.00    0.01    48.43     8.00   977.13   72.15   0.05  58.08
>> > dm-5             0.00     0.00    3.80 13130.00    0.01    51.29     8.00  1130.48   95.11   0.04  58.48
>> > dm-5             0.00     0.00    2.40 5109.20     0.01    19.96     8.00   427.75   47.41   0.16  79.92
>> > dm-5             0.00     0.00    3.20    0.00     0.01     0.00     8.00   290.33 148653.75 282.50  90.40
>> > dm-5             0.00     0.00    3.40 5103.00     0.01    19.93     8.00   168.75   33.06   0.13  67.84
>> >
>>
>> From the numbers, it appears that your controller (or the disks) is
>> lying about writes. They get cached (service time around 0.1ms) and
>> reported as completed. When the cache is full, or a flush is issued,
>> you get >200ms delay.
>> Reads, instead, have a more reasonable service time of around 2ms.
>>
>> Can you disable write cache, and see if the response times for writes
>> become reasonable?
>
> sure can ... the write-only results look like this. As you have predicted,
> the wMB has gone down to more realistic levels ... but also the await has gone way up.
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-18             0.00     0.00    0.00 2566.80     0.00    10.03     8.00  2224.74  737.62   0.39 100.00
> dm-18             0.00     0.00    9.60  679.00     0.04     2.65     8.00   400.41 1029.73   1.35  92.80
> dm-18             0.00     0.00    0.00 2080.80     0.00     8.13     8.00   906.58  456.45   0.48 100.00
> dm-18             0.00     0.00    0.00 2349.20     0.00     9.18     8.00  1351.17  491.44   0.43 100.00
> dm-18             0.00     0.00    3.80  665.60     0.01     2.60     8.00   906.72 1098.75   1.39  93.20
> dm-18             0.00     0.00    0.00 1811.20     0.00     7.07     8.00  1008.23  725.34   0.55 100.00
> dm-18             0.00     0.00    0.00 2632.60     0.00    10.28     8.00  1651.18  640.61   0.38 100.00
>

Good.
The high await is normal for writes, especially since you get so many
queued requests.
Can you post the output of "grep -r . /sys/block/_device_/queue/" and
iostat for your real devices?
This should not affect reads, which will preempt writes under cfq.

>
> In the read/write test the write rate stays down, but the rMB/s is even worse and the await is also way up,
> so I guess the bad performance cannot be blamed on the cache in the controller ...
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> ---------------------------------------------------------------------------------------------------------
> dm-18             0.00     0.00    0.00 1225.80     0.00     4.79     8.00  1050.49  807.38   0.82 100.00
> dm-18             0.00     0.00    0.00 1721.80     0.00     6.73     8.00  1823.67  807.20   0.58 100.00
> dm-18             0.00     0.00    0.00 1128.00     0.00     4.41     8.00   617.94  832.52   0.89 100.00
> dm-18             0.00     0.00    0.00  838.80     0.00     3.28     8.00   873.04 1056.37   1.19 100.00
> dm-18             0.00     0.00   39.60  347.80     0.15     1.36     8.00   590.27 1880.05   2.57  99.68
> dm-18             0.00     0.00    0.00 1626.00     0.00     6.35     8.00   983.85  452.72   0.62 100.00
> dm-18             0.00     0.00    0.00 1117.00     0.00     4.36     8.00  1047.16 1117.78   0.90 100.00
> dm-18             0.00     0.00    0.00 1319.20     0.00     5.15     8.00   840.64  573.39   0.76 100.00
>

This is interesting: it seems no reads are happening at all here.
I suspect that this, and your observation that the real read throughput
is much higher, can be explained by your readers mostly reading from
the page cache.
Can you describe in more detail how the test workload is structured,
especially regarding the cache?
How do you ensure that some tars are just readers and some are just
writers? I suppose you pre-fault-in the input for the writers, and you
direct the output to /dev/null for the readers? Are all readers reading
from different directories?

Few things to check:
* are the CPUs saturated during the test?
* are the readers mostly in state 'D', 'S', or 'R'?
* did you try the 'as' (anticipatory) I/O scheduler?
* how big are your volumes?
* what is the average load on the system?
* with the io-controller patches, what happens if you put readers in one
domain and writers in the other?
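The first two checks can be done while the test runs, e.g. (assuming the reader/writer processes are simply named 'tar'):

```shell
# Load average / CPU saturation at a glance
uptime

# Process states of the tar processes:
# 'D' = blocked on I/O, 'S' = sleeping, 'R' = runnable
ps -eo pid,stat,comm | awk '$3 == "tar" { print $1, $2 }'
```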

Are you willing to test some patches? I'm working on changes to reduce
read latency that may be interesting to you.

Corrado

> cheers
> tobi
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-09-16 07:54:15

by Tobias Oetiker

Subject: Re: unfair io behaviour for high load interactive use still present in 2.6.31

Hi Corrado,

Today Corrado Zoccolo wrote:

> Hi Tobias,
> On Tue, Sep 15, 2009 at 11:07 PM, Tobias Oetiker <[email protected]> wrote:
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-18             0.00     0.00    0.00 2566.80     0.00    10.03     8.00  2224.74  737.62   0.39 100.00
> > dm-18             0.00     0.00    9.60  679.00     0.04     2.65     8.00   400.41 1029.73   1.35  92.80
> > dm-18             0.00     0.00    0.00 2080.80     0.00     8.13     8.00   906.58  456.45   0.48 100.00
> > dm-18             0.00     0.00    0.00 2349.20     0.00     9.18     8.00  1351.17  491.44   0.43 100.00
> > dm-18             0.00     0.00    3.80  665.60     0.01     2.60     8.00   906.72 1098.75   1.39  93.20
> > dm-18             0.00     0.00    0.00 1811.20     0.00     7.07     8.00  1008.23  725.34   0.55 100.00
> > dm-18             0.00     0.00    0.00 2632.60     0.00    10.28     8.00  1651.18  640.61   0.38 100.00
> >
>
> Good.
> The high await is normal for writes, especially since you get so many
> queued requests.
> can you post the output of "grep -r . /sys/block/_device_/queue/" and
> iostat for your real devices?
> This should not affect reads, that will preempt writes with cfq.

/sys/block/sdc/queue/nr_requests:128
/sys/block/sdc/queue/read_ahead_kb:128
/sys/block/sdc/queue/max_hw_sectors_kb:2048
/sys/block/sdc/queue/max_sectors_kb:512
/sys/block/sdc/queue/scheduler:noop anticipatory deadline [cfq]
/sys/block/sdc/queue/hw_sector_size:512
/sys/block/sdc/queue/logical_block_size:512
/sys/block/sdc/queue/physical_block_size:512
/sys/block/sdc/queue/minimum_io_size:512
/sys/block/sdc/queue/optimal_io_size:0
/sys/block/sdc/queue/rotational:1
/sys/block/sdc/queue/nomerges:0
/sys/block/sdc/queue/rq_affinity:0
/sys/block/sdc/queue/iostats:1
/sys/block/sdc/queue/iosched/quantum:4
/sys/block/sdc/queue/iosched/fifo_expire_sync:124
/sys/block/sdc/queue/iosched/fifo_expire_async:248
/sys/block/sdc/queue/iosched/back_seek_max:16384
/sys/block/sdc/queue/iosched/back_seek_penalty:2
/sys/block/sdc/queue/iosched/slice_sync:100
/sys/block/sdc/queue/iosched/slice_async:40
/sys/block/sdc/queue/iosched/slice_async_rq:2
/sys/block/sdc/queue/iosched/slice_idle:8

but as I said in my original post, I have done extensive tests,
twiddling all the knobs I know of, and the only thing that never
changed was that as soon as writers come into play, the readers
get starved pretty thoroughly.
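One concrete example of the knob-twiddling, as a sketch (the values
are illustrative, not a recommendation; wrapped in a function so it
can be pointed at any queue directory, e.g. /sys/block/sdc/queue):

```shell
#!/bin/sh
# Shrink the cfq async (write) slice so writers give up the disk
# sooner, and give the readers more readahead.  Values are examples
# only; needs root when pointed at a real sysfs queue directory.
tune_cfq() {
    q=$1                                   # e.g. /sys/block/sdc/queue
    echo 10  > "$q/iosched/slice_async"    # shorter write slices
    echo 1   > "$q/iosched/slice_async_rq" # fewer async requests per slice
    echo 512 > "$q/read_ahead_kb"          # more readahead for readers
}
```

Run as e.g. `tune_cfq /sys/block/sdc/queue` as root.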

> >
> > in the read/write test the write rate stays down, but the rMB/s is even worse and the await is also way up,
> > so I guess the bad performance is not to blame on the cache in the controller ...
> >
> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> > ---------------------------------------------------------------------------------------------------------
> > dm-18             0.00     0.00    0.00 1225.80     0.00     4.79     8.00  1050.49  807.38   0.82 100.00
> > dm-18             0.00     0.00    0.00 1721.80     0.00     6.73     8.00  1823.67  807.20   0.58 100.00
> > dm-18             0.00     0.00    0.00 1128.00     0.00     4.41     8.00   617.94  832.52   0.89 100.00
> > dm-18             0.00     0.00    0.00  838.80     0.00     3.28     8.00   873.04 1056.37   1.19 100.00
> > dm-18             0.00     0.00   39.60  347.80     0.15     1.36     8.00   590.27 1880.05   2.57  99.68
> > dm-18             0.00     0.00    0.00 1626.00     0.00     6.35     8.00   983.85  452.72   0.62 100.00
> > dm-18             0.00     0.00    0.00 1117.00     0.00     4.36     8.00  1047.16 1117.78   0.90 100.00
> > dm-18             0.00     0.00    0.00 1319.20     0.00     5.15     8.00   840.64  573.39   0.76 100.00
> >
>
> This is interesting. it seems no reads are happening at all here.
> I suspect that this and your observation that the real read throughput
> is much higher, can be explained because your readers mostly read from
> the page cache.


> Can you describe better how the test workload is structured,
> especially regarding cache?
> How can you tell that some tars are just readers, and some are just
> writers? I suppose you pre-fault-in the input for writers, and you
> direct output to /dev/null for readers? Are all readers reading from
> different directories?

they are NOT ... what I do is this (http://tobi.oetiker.ch/fspunisher):

* get a copy of the linux kernel source and place it on tmpfs

* unpack 4 copies of the linux kernel for each reader process

* sync and echo 3 >/proc/sys/vm/drop_caches (this should lose all
cache)

* start the readers, each tarring its private copy of the kernel
source as prepared above, writing their output to /dev/null.

* start an equal number of tars unpacking the kernel archive from
tmpfs into separate directories, next to the readers' source
directories.

The goal of this exercise is to simulate independent writers and
readers while excluding the cache as much as possible, since I want
to see how the system deals with accessing the actual disk device
and not the local cache.
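The steps above, as a runnable sketch (my reconstruction, NOT the
actual fspunisher script; paths and defaults are illustrative
assumptions, and it dry-runs by default, printing the commands
instead of executing them):

```shell
#!/bin/sh
# Sketch of the workload described above: N parallel readers tar
# pre-unpacked kernel trees to /dev/null while N writers unpack the
# same tarball from tmpfs into fresh directories.
SRC=${SRC:-/dev/shm/linux-2.6.30.5.tar}  # kernel tarball staged on tmpfs
WORK=${WORK:-/mnt/test}                  # file system under test
DRY=${DRY:-1}                            # dry-run by default: print, don't run

run() { if [ "$DRY" = 1 ]; then echo "$*"; else eval "$*"; fi; }

workload() {
    n=${1:-4}                            # number of reader/writer pairs
    # flush the page cache so the readers must hit the disk, not RAM
    run "sync"
    run "echo 3 > /proc/sys/vm/drop_caches"
    i=1
    while [ "$i" -le "$n" ]; do
        # reader: tar its private, pre-unpacked tree to /dev/null
        run "tar cf /dev/null $WORK/read-$i &"
        # writer: unpack the tarball from tmpfs into a fresh directory
        run "mkdir -p $WORK/write-$i && tar xf $SRC -C $WORK/write-$i &"
        i=$((i + 1))
    done
    run "wait"
}
```

Something like `DRY=0 workload 12` (as root, because of drop_caches)
would then start 12 reader/writer pairs for real.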
>
> A few things to check:
> * are the cpus saturated during the test?
nope

> * are the readers mostly in state 'D', or 'S', or 'R'?

they are almost always in D (same as the writers), sleeping for I/O

> * did you try 'as' I/O scheduler?

yes, this helps the readers a bit, but it does not help the
'smoothness' of the system operation

> * how big are your volumes?

100 G

> * what is the average load on the system?

24 (or however many tars I start, since they all want to run all the
time and are all waiting for I/O)

> * with i/o controller patches, what happens if you put readers in one
> domain and writers in the other?

will try ... note though that this would have to happen
automatically, since I cannot know beforehand who writes and who
reads ...
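One way the automatic sorting could look (a sketch only: the cgroup
paths are my assumption about the io-controller patches' interface,
not something from the thread; the per-process I/O counters come from
/proc/<pid>/io):

```shell
#!/bin/sh
# Classify a process as mostly-reader or mostly-writer from its
# cumulative I/O counters in /proc/<pid>/io.
classify_pid() {
    r=$(awk '/^read_bytes:/  {print $2}' "/proc/$1/io" 2>/dev/null)
    w=$(awk '/^write_bytes:/ {print $2}' "/proc/$1/io" 2>/dev/null)
    if [ "${w:-0}" -gt "${r:-0}" ]; then echo writer; else echo reader; fi
}

# Sort every tar process into a readers or writers cgroup.
# /cgroup/readers and /cgroup/writers are hypothetical mount points;
# this would have to be re-run periodically, since a process's
# behaviour can change over time.
sort_tars() {
    for pid in $(pgrep tar 2>/dev/null); do
        g=$(classify_pid "$pid")
        echo "$pid" > "/cgroup/${g}s/tasks" 2>/dev/null
    done
}
```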

> Are you willing to test some patches? I'm working on patches to reduce
> read latency, that may be interesting to you.

by all means ... to repeat my goal: I want to see the readers not
starved by the writers. I also have a hunch that the handling of
metadata updates/access in the filesystem plays a role here.
In the dd tests metadata does not matter, but in real life I find
that often the problem is getting at the file, not reading it
once I have it ... this is especially tricky to test, since
metadata is cached aggressively and that can quickly distort the
results if not handled carefully.
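A crude way to separate the two costs (my sketch, not something from
the thread): time one pass over a tree that only stats the files,
then a second pass that also reads them. For meaningful numbers the
page cache has to be dropped before each pass (as root:
sync; echo 3 > /proc/sys/vm/drop_caches).

```shell
#!/bin/sh
# Time the metadata-only cost of walking a tree (stat) separately
# from the cost of reading the file contents.
meta_vs_data() {
    tree=$1
    t0=$(date +%s)
    find "$tree" -type f -exec stat -c %s {} + > /dev/null
    t1=$(date +%s)
    echo "stat-only: $((t1 - t0))s"
    t0=$(date +%s)
    find "$tree" -type f -exec cat {} + > /dev/null
    t1=$(date +%s)
    echo "stat+read: $((t1 - t0))s"
}
```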

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-09-16 12:40:36

by Tobias Oetiker

[permalink] [raw]
Subject: Re: unfair io behaviour for high load interactive use still present in 2.6.31

Hi Corrado,

Today Corrado Zoccolo wrote:

> Hi Tobias,
> On Tue, Sep 15, 2009 at 11:07 PM, Tobias Oetiker <[email protected]> wrote:
> > Today Corrado Zoccolo wrote:
> >
> >> On Tue, Sep 15, 2009 at 9:30 AM, Tobias Oetiker <[email protected]> wrote:
> >> > Below is an excerpt from iostat while the test is in full swing:
> >> >
> >> > * 2.6.31 (8 cpu x86_64, 24 GB Ram)
> >> > * scheduler = cfq
> >> > * iostat -m -x dm-5 5
> >> > * running in parallel on 3 lvm logical volumes
> >> >  on a single physical volume
> >> >
> >> > Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> >> > ---------------------------------------------------------------------------------------------------------
> * with i/o controller patches, what happens if you put readers in one
> domain and writers in the other?

With the io-controller patches and the readers and writers in two
groups, the rMB/s and wMB/s come out roughly the same. There are two
problems, unfortunately:

a) the interactive performance is still not good, since directory
access still seems to get starved.

b) processes normally do both, read AND write, so the distinction
would have to be made at the io-controller level.


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
---------------------------------------------------------------------------------------------------------
dm-5              0.00     0.00  549.80    0.00     2.15     0.00     8.00   577.75    6.82   1.82 100.00
dm-5              0.00     0.00  191.20 1333.60     0.75     5.21     8.00  1077.45  953.15   0.66 100.00
dm-5              0.00     0.00  457.20 1144.00     1.79     4.47     8.00  1820.56  836.86   0.62 100.00
dm-5              0.00     0.00  453.00 1328.80     1.77     5.19     8.00  1326.74 1168.99   0.56 100.00
dm-5              0.00     0.00  353.09  353.49     1.38     1.38     8.00  1405.80 1993.23   1.41  99.80
dm-5              0.00     0.00  280.40 1296.80     1.10     5.07     8.00  1248.95  747.99   0.63 100.00
dm-5              0.00     0.00  307.00 4557.20     1.20    17.80     8.00   985.92  211.25   0.21 100.00
dm-5              0.00     0.00    6.00    0.00     0.02     0.00     8.00   450.42 19116.00 166.67 100.00
dm-5              0.00     0.00  154.00  624.20     0.60     2.44     8.00   926.52 1119.77   1.29 100.00


cheers
tobi



> Are you willing to test some patches? I'm working on patches to reduce
> read latency, that may be interesting to you.
>
> Corrado
>
> > cheers
> > tobi
> >
> > --
> > Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> > http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900