2008-06-28 12:42:36

by Martin Lucina

Subject: Higher than expected disk write(2) latency

Hi,

we're getting some rather high figures for write(2) latency when testing
synchronous writing to disk. The test I'm running writes 2000 blocks of
contiguous data to a raw device, using O_DIRECT and various block sizes
down to a minimum of 512 bytes.

The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
using ata_piix. Write caching has been explicitly disabled on the
drive, and there is no other activity that should affect the test
results (all system filesystems are on a separate drive). The system is
running Debian etch, with a 2.6.24 kernel.

Observed results:

size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
write: avg=8.388851 max=24.998846 min=8.335624 ms
8 ms: 1992 cases
9 ms: 2 cases
10 ms: 1 cases
14 ms: 1 cases
16 ms: 3 cases
24 ms: 1 cases

size=512, N=2000, took=4.401289 s, thput=1 mb/s seekc=1
write: avg=8.364283 max=16.692206 min=2.010072 ms
2 ms: 1 cases
7 ms: 1 cases
8 ms: 1995 cases
16 ms: 3 cases

Measurement of the write(2) time is performed using the TSC, so any
latency there is negligible.

The datasheet for the drive being used gives the following figures:

Average latency (msec): 4.16
Track-to-track seek time (msec typical): <1.2 (write)
Average seek, write (msec typical): 9.5

If these figures are to be believed, then why are we seeing latencies of
8.3 msec? Is this normal? Or are we just being overly optimistic in
our performance expectations?

What we find suspicious is that the latency we see is so close to the
Average seek latency specified for the drive, almost as if the drive was
performing a seek on every write.

For comparison, here are the results of the same test with the disk
write cache *enabled*:

size=1024, N=2000, took=0.296284 s, thput=55 mb/s seekc=1
write: avg=0.147745 max=0.606990 min=0.117246 ms
0 ms: 2000 cases

size=512, N=2000, took=0.304614 s, thput=26 mb/s seekc=1
write: avg=0.152089 max=0.533234 min=0.125370 ms
0 ms: 2000 cases

We also ran the same test on a different system with recent SAS disks
connected via a HP/Compaq CCISS controller. I don't have the exact
details of the drives used, since I don't know how to get them out of
the cciss driver, but the latencies we got were around 4 msec. Whilst
this is better than the "commodity" hardware used in the tests above, it
still seems excessive.

Any advice would be appreciated.

Thanks,

-mato


2008-06-28 13:11:58

by Roger Heflin

Subject: Re: Higher than expected disk write(2) latency

Martin Lucina wrote:
> Hi,
>
> we're getting some rather high figures for write(2) latency when testing
> synchronous writing to disk. The test I'm running writes 2000 blocks of
> contiguous data to a raw device, using O_DIRECT and various block sizes
> down to a minimum of 512 bytes.
>
> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
> using ata_piix. Write caching has been explicitly disabled on the
> drive, and there is no other activity that should affect the test
> results (all system filesystems are on a separate drive). The system is
> running Debian etch, with a 2.6.24 kernel.
>
> Observed results:
>
> size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
> write: avg=8.388851 max=24.998846 min=8.335624 ms
> 8 ms: 1992 cases
> 9 ms: 2 cases
> 10 ms: 1 cases
> 14 ms: 1 cases
> 16 ms: 3 cases
> 24 ms: 1 cases
>
> size=512, N=2000, took=4.401289 s, thput=1 mb/s seekc=1
> write: avg=8.364283 max=16.692206 min=2.010072 ms
> 2 ms: 1 cases
> 7 ms: 1 cases
> 8 ms: 1995 cases
> 16 ms: 3 cases
>
> Measurement of the write(2) time is performed using the TSC, so any
> latency there is negligible.
>
> The datasheet for the drive being used gives the following figures:
>
> Average latency (msec): 4.16
> Track-to-track seek time (msec typical): <1.2 (write)
> Average seek, write (msec typical): 9.5
>
> If these figures are to be believed, then why are we seeing latencies of
> 8.3 msec? Is this normal? Or are we just being overly optimistic in
> our performance expectations?

Consider this, 60/7200rpm=8.3ms for one rotation.

You write sectors n and n+1; it takes some amount of time for that first
set of sectors to come under the head, and when it does, you write them
and immediately return. Immediately after that you attempt to write
sectors n+2 and n+3, which just a moment ago passed under the head, so
you have to wait an *ENTIRE* revolution for those sectors to come under
the head again, another ~8.3 ms, and you repeat this with each block
written. If the sectors were randomly placed in the rotation (i.e. a 50%
chance of the disk being off by half a rotation or less), you would see
a 4.15 ms average rotational delay in your test; but sequential sync
writes leave the target sector about as far from the head as possible
(it just passed under the head).

>
> What we find suspicious is that the latency we see is so close to the
> Average seek latency specified for the drive, almost as if the drive was
> performing a seek on every write.
>
> For comparison, here are the results of the same test with the disk
> write cache *enabled*:
>
> size=1024, N=2000, took=0.296284 s, thput=55 mb/s seekc=1
> write: avg=0.147745 max=0.606990 min=0.117246 ms
> 0 ms: 2000 cases
>
> size=512, N=2000, took=0.304614 s, thput=26 mb/s seekc=1
> write: avg=0.152089 max=0.533234 min=0.125370 ms
> 0 ms: 2000 cases

Write cache allows a return without writing to the actual disk, so you don't
have to wait; on top of that, it queues up all of the writes that are
together on the track and does them in one pass of the head over all of the
sectors.

>
> We also ran the same test on a different system with recent SAS disks
> connected via a HP/Compaq CCISS controller. I don't have the exact
> details of the drives used, since I don't know how to get them out of
> the cciss driver, but the latencies we got were around 4 msec. Whilst
> this is better than the "commodity" hardware used in the tests above, it
> still seems excessive.

Almost the same case as for the 7200 rpm disk, but I bet these SAS drives are
15k drives? If so 60/15000=4ms.

>
> Any advice would be appreciated.
>
> Thanks,
>

Roger

2008-06-28 14:47:42

by David Newall

Subject: Re: Higher than expected disk write(2) latency

Martin Lucina wrote:
> Measurement of the write(2) time is performed using the TSC, so any
> latency there is negligible.

Measuring each write is lavish and bound to impact the results. To
measure sustained throughput, get the time only at start and end, and
divide the difference by the number of writes.

2008-06-29 12:03:24

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency

David Newall wrote:
> Martin Lucina wrote:
>> Measurement of the write(2) time is performed using the TSC, so any
>> latency there is negligible.
>
> Measuring each write is lavish and bound to impact the results. To
> measure sustained throughput, get the time only at start and end, and
> divide the difference by the number of writes.

Hi,

We're using RDTSC to measure time, which takes ~30 nanoseconds to
complete. I would say this is negligible given that the latency of a
write is five orders of magnitude higher.

Martin

2008-06-30 06:41:18

by Robert Hancock

Subject: Re: Higher than expected disk write(2) latency

Martin Lucina wrote:
> Hi,
>
> we're getting some rather high figures for write(2) latency when testing
> synchronous writing to disk. The test I'm running writes 2000 blocks of
> contiguous data to a raw device, using O_DIRECT and various block sizes
> down to a minimum of 512 bytes.
>
> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
> using ata_piix. Write caching has been explicitly disabled on the
> drive, and there is no other activity that should affect the test
> results (all system filesystems are on a separate drive). The system is
> running Debian etch, with a 2.6.24 kernel.
>
> Observed results:

Well, write performance on SATA with no command queueing and write cache
disabled is always going to be pretty atrocious, since the drive has no
opportunity to reorder the writes at all, and it also can't receive the
next write command until the previous one completes so that it may have
to wait for another rotation in order to perform each write. In this
case I don't think command queueing even helps you though, as only one
write from the app is ever outstanding at a time. I suspect the only
thing that would really help this workload is a RAID controller with a
battery-backed write cache (that way those tiny O_DIRECT writes don't
all have to hit the physical disk).

2008-06-30 18:11:16

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency

Hi Roger,

>> If these figures are to be believed, then why are we seeing latencies of
>> 8.3 msec? Is this normal? Or are we just being overly optimistic in
>> our performance expectations?
>
> Consider this, 60/7200rpm=8.3ms for one rotation.
>
> You write sectors n and n+1; it takes some amount of time for that first
> set of sectors to come under the head, and when it does, you write them
> and immediately return. Immediately after that you attempt to write
> sectors n+2 and n+3, which just a moment ago passed under the head, so
> you have to wait an *ENTIRE* revolution for those sectors to come under
> the head again, another ~8.3 ms, and you repeat this with each block
> written. If the sectors were randomly placed in the rotation (i.e. a 50%
> chance of the disk being off by half a rotation or less), you would see
> a 4.15 ms average rotational delay in your test; but sequential sync
> writes leave the target sector about as far from the head as possible
> (it just passed under the head).

Fair enough. That explains the behaviour. Would AIO help here? If we
are able to enqueue the next write before the first one is finished, the
disk can start writing it immediately without waiting for a revolution.

>> We also ran the same test on a different system with recent SAS disks
>> connected via a HP/Compaq CCISS controller. I don't have the exact
>> details of the drives used, since I don't know how to get them out of
>> the cciss driver, but the latencies we got were around 4 msec. Whilst
>> this is better than the "commodity" hardware used in the tests above, it
>> still seems excessive.
>
> Almost the same case as for the 7200 rpm disk, but I bet these SAS
> drives are 15k drives? If so 60/15000=4ms.

Bingo!

Thanks.
Martin

2008-06-30 19:03:05

by Roger Heflin

Subject: Re: Higher than expected disk write(2) latency

Martin Sustrik wrote:
> Hi Roger,
>
>>> If these figures are to be believed, then why are we seeing latencies of
>>> 8.3 msec? Is this normal? Or are we just being overly optimistic in
>>> our performance expectations?
>>
>> Consider this, 60/7200rpm=8.3ms for one rotation.
>>
>> You write sectors n and n+1; it takes some amount of time for that
>> first set of sectors to come under the head, and when it does, you
>> write them and immediately return. Immediately after that you attempt
>> to write sectors n+2 and n+3, which just a moment ago passed under the
>> head, so you have to wait an *ENTIRE* revolution for those sectors to
>> come under the head again, another ~8.3 ms, and you repeat this with
>> each block written. If the sectors were randomly placed in the
>> rotation (i.e. a 50% chance of the disk being off by half a rotation
>> or less), you would see a 4.15 ms average rotational delay in your
>> test; but sequential sync writes leave the target sector about as far
>> from the head as possible (it just passed under the head).
>
> Fair enough. That explains the behaviour. Would AIO help here? If we
> are able to enqueue next write before the first one is finished, it can
> start writing it immediately without waiting for a revolution.

If you could get them queued at the disk level, the things to watch would be
whether the disk can queue requests at all (and whether the controller and
driver support it), how many requests the disk can queue up, and how large
each of those requests can be. If they aren't queued at the disk, there is
the chance that the machine cannot get the data to the disk fast enough for
that next sector.

I have always avoided fully sync operations, as things *ALWAYS* got really
slow because of all of the requirements needed to make sure the data always
reached the disk correctly across an unexpected crash. Typically, with the
type of applications I dealt with, if the machine crashed the data currently
being output was known to be incomplete and generally useless, so things
were rerun.

Depending on your application you could always get a small fast solid state
device (no seek or RPM issues), and use it to keep a journal that could be
replayed on an unexpected crash...and then just use various syncs to force
things to disk at various points.

>
>>> We also ran the same test on a different system with recent SAS disks
>>> connected via a HP/Compaq CCISS controller. I don't have the exact
>>> details of the drives used, since I don't know how to get them out of
>>> the cciss driver, but the latencies we got were around 4 msec. Whilst
>>> this is better than the "commodity" hardware used in the tests above, it
>>> still seems excessive.
>>
>> Almost the same case as for the 7200 rpm disk, but I bet these SAS
>> drives are 15k drives? If so 60/15000=4ms.
>
> Bingo!

Note that in my experience SAS drives deal with higher concurrency a lot
better than SATA drives. One would expect a SAS drive to scale about 2x
better than a SATA drive (faster RPM), but the test results indicate they
were considerably better than expected when hit with more concurrent
streams.

Roger

2008-06-30 22:20:34

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency

Hi Roger,

>> Fair enough. That explains the behaviour. Would AIO help here? If we
>> are able to enqueue next write before the first one is finished, it
>> can start writing it immediately without waiting for a revolution.
>
> If you could get them queued at the disk level, things that would need
> to be watched were if the disk can queue things up (and all
> controllers/drivers support it), and how many things the disk can queue
> up, and how large each of those things can be, if they aren't queued at
> the disk, there is the chance that the machine cannot get the data to
> the disk fast enough for that next sector.

We'll try AIO and see what the impact is.

> Depending on your application you could always get a small fast solid
> state device (no seek or RPM issues), and use it to keep a journal that
> could be replayed on an unexpected crash...and then just use various
> syncs to force things to disk at various points.

Yes, that's one thing we want to do. However, we cannot assume that
every user will have an SSD, so we should try to get the best latency
possible even on a standard hard drive.

Martin

2008-07-01 00:12:16

by Bernd Eckenfels

Subject: Re: Higher than expected disk write(2) latency

In article <[email protected]> you wrote:
> Yes, that's one thing we want to do. However, we cannot assume that
> every user will have SSD, thus we should try to get the best latencies
> possible even on standard HD.

In that case a controller- (or SAN-) based write-back cache, as well as
striping, helps a bit.

Gruss
Bernd

2008-07-02 16:48:27

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency

Hi Roger,

>> Fair enough. That explains the behaviour. Would AIO help here? If we
>> are able to enqueue next write before the first one is finished, it
>> can start writing it immediately without waiting for a revolution.
>
> If you could get them queued at the disk level, things that would need
> to be watched were if the disk can queue things up (and all
> controllers/drivers support it), and how many things the disk can queue
> up, and how large each of those things can be, if they aren't queued at
> the disk, there is the chance that the machine cannot get the data to
> the disk fast enough for that next sector.
>
> I have always avoided fully sync operations as things *ALWAYS* got
> really really slow because of all of the requirements need to make sure
> that it always got the data to disk correctly on a unexpected crash, and
> typically the type of applications I dealt with, if the machine crashed
> the currently outputting data was known to be incomplete and generally
> useless, so things were reran.
>
> Depending on your application you could always get a small fast solid
> state device (no seek or RPM issues), and use it to keep a journal that
> could be replayed on an unexpected crash...and then just use various
> syncs to force things to disk at various points.

We've tried AIO and the results are quite disappointing. If you open the
file with O_SYNC, the latencies are the same as with sync I/O - each
write takes 8.3 ms (7200 rpm disk).

If you use O_ASYNC the latencies are nice (160 us mean); however, the
first one is ~900 us, meaning the data were not physically written to
the disk before the AIO confirmation was sent. (Moving the head to the
right position would take much more than 900 us.)

Still, my feeling is that our use case is pretty straightforward, i.e.
write the data to disk with any optimisations you are able to do, and
notify me when the data are physically written to the medium.

Isn't there a way to achieve this kind of behaviour?

Martin

2008-07-02 18:16:20

by Jeff Moyer

Subject: Re: Higher than expected disk write(2) latency

Martin Sustrik <[email protected]> writes:

> Hi Roger,
>
>>> Fair enough. That explains the behaviour. Would AIO help here? If
>>> we are able to enqueue next write before the first one is finished,
>>> it can start writing it immediately without waiting for a
>>> revolution.
>>
>> If you could get them queued at the disk level, things that would
>> need to be watched were if the disk can queue things up (and all
>> controllers/drivers support it), and how many things the disk can
>> queue up, and how large each of those things can be, if they aren't
>> queued at the disk, there is the chance that the machine cannot get
>> the data to the disk fast enough for that next sector.
>>
>> I have always avoided fully sync operations as things *ALWAYS* got
>> really really slow because of all of the requirements need to make
>> sure that it always got the data to disk correctly on a unexpected
>> crash, and typically the type of applications I dealt with, if the
>> machine crashed the currently outputting data was known to be
>> incomplete and generally useless, so things were reran.
>>
>> Depending on your application you could always get a small fast
>> solid state device (no seek or RPM issues), and use it to keep a
>> journal that could be replayed on an unexpected crash...and then
>> just use various syncs to force things to disk at various points.
>
> We've tried AIO and the results are quite disappointing. If you open
> the file with O_SYNC, the latencies are the same as with sync I/O -
> each write takes 8.3ms (7200rpm disk).

I thought you were doing I/O to the underlying block device. If so,
there's no need to open with O_SYNC. You do, however, need to open the
device with O_DIRECT and align your buffers (and buffer lengths)
properly.

Which AIO interface are you using, libaio or librt? How many I/Os are
you queueing to the device? You may want to take a look at aio-stress.c
as a way to test your device (this uses libaio, the in-kernel AIO
interface).

Cheers,

Jeff

2008-07-02 18:20:52

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency


> I thought you were doing I/O to the underlying block device. If so,
> there's no need to open with O_SYNC. You do, however, need to open the
> device with O_DIRECT and align your buffers (and buffer lengths)
> properly.

Yes, we are using O_DIRECT + aligning the buffers. However, using O_SYNC
vs. O_ASYNC seems to matter (the latency differs). Maybe this is because
we are using librt?

> Which AIO interface are you using, libaio or librt? How many I/Os are
> you queueing to the device? You may want to take a look at aio-stress.c
> as a way to test your device (this uses libaio, the in-kernel AIO
> interface).

We are using librt, queueing 10240 writes, 512 bytes each.

We'll retry with libaio. Thanks for the advice.

Martin

2008-07-02 21:34:17

by Roger Heflin

Subject: Re: Higher than expected disk write(2) latency

Martin Sustrik wrote:
> Hi Roger,
>
>>> Fair enough. That explains the behaviour. Would AIO help here? If we
>>> are able to enqueue next write before the first one is finished, it
>>> can start writing it immediately without waiting for a revolution.
>>
>> If you could get them queued at the disk level, things that would need
>> to be watched were if the disk can queue things up (and all
>> controllers/drivers support it), and how many things the disk can
>> queue up, and how large each of those things can be, if they aren't
>> queued at the disk, there is the chance that the machine cannot get
>> the data to the disk fast enough for that next sector.
>>
>> I have always avoided fully sync operations as things *ALWAYS* got
>> really really slow because of all of the requirements need to make
>> sure that it always got the data to disk correctly on a unexpected
>> crash, and typically the type of applications I dealt with, if the
>> machine crashed the currently outputting data was known to be
>> incomplete and generally useless, so things were reran.
>>
>> Depending on your application you could always get a small fast solid
>> state device (no seek or RPM issues), and use it to keep a journal
>> that could be replayed on an unexpected crash...and then just use
>> various syncs to force things to disk at various points.
>
> We've tried AIO and the results are quite disappointing. If you open the
> file with O_SYNC, the latencies are the same as with sync I/O - each
> write takes 8.3ms (7200rpm disk).
>
> If you use O_ASYNC the latencies are nice (160us mean), however, the
> first one is ~900us meaning that the data were not physically written to
> the disk before AIO confirmation is sent. (Moving head to right position
> would take much more than 900us.)
>
> Still, my feeling is that our use case is pretty straightforward, i.e.
> write data to the disk with any optimisations you are able to do and
> notify me when the data are physically written to the medium.
>
> Isn't there a way to achieve this kind of behaviour?
>
> Martin
>

A lot depends on what your application requirements are.

Back a long time ago, before disks had cache, RLL and MFM drives used a trick
called interleave: instead of writing to sectors n, n+1, n+2, an interleave
of 2 would write to n, n+2, n+4, since once the machine got the message that
n was written, it had enough time to set up and send the next write to
sector n+2 before the head got there. The question with your hardware would
be how many sectors need to be skipped to be able to write immediately. This
would change with hardware changes; the file would need to be pre-allocated,
and some testing might have to be done against the given file to quantify
the underlying disk reality (and that reality may change if the disk
reallocates sectors). It also assumes you can live with the lower sequential
read performance: an interleave of 2 had 1/2 the read performance, an
interleave of 3 had 1/3. Though with proper sector picking and the read
cache on the disk, the interleave may not kill read performance if one goes
n, n+2, ..., n+2x, n+1, n+3, ... correctly, knowing that n+2x and n+1 are
close in seek time. I have no idea what sort of interleave one would need
with modern hardware.

It would take a fairly intricate program to sort out what reality is, but it
would seem possible to figure it out exactly and work with it. The first
seek might take a while, but even that could be played with by continually
writing marker sectors, so one has some idea of what is going to be under
the head real soon. Of course, even if you get all of this correct, there
are still unknowns: the disk not always doing what one expects (they do have
a mind of their own), and whether you can get the correct sector supplied
through everything to the disk fast enough every time.

Roger

2008-07-04 03:34:26

by David Dillow

Subject: Re: Higher than expected disk write(2) latency


On Wed, 2008-07-02 at 20:20 +0200, Martin Sustrik wrote:
> We are using librt, queueing 10240 writes, 512 bytes each.
>
> We'll retry with libaio. Thanks for advice.

I've gotten good results going direct to the Linux AIO calls, but before
you go too far from your existing code base, are you using a single file
descriptor with the librt aio_*() routines? If so, that may be part of
your problem, as the library will serialize requests to the same fd. I
just pre-dup()'d the descriptor for each of the (max 32) requests I'd
have in flight for a toy stress-tester I wrote. With 10240 requests,
that may be a bit excessive, but it'd point out if that was part of the
problem fairly quickly.

2008-07-10 05:33:35

by Andrew Morton

Subject: Re: Higher than expected disk write(2) latency

On Sat, 28 Jun 2008 14:11:32 +0200 Martin Lucina <[email protected]> wrote:

> we're getting some rather high figures for write(2) latency when testing
> synchronous writing to disk. The test I'm running writes 2000 blocks of
> contiguous data to a raw device, using O_DIRECT and various block sizes
> down to a minimum of 512 bytes.
>
> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
> using ata_piix. Write caching has been explicitly disabled on the
> drive, and there is no other activity that should affect the test
> results (all system filesystems are on a separate drive). The system is
> running Debian etch, with a 2.6.24 kernel.
>
> Observed results:
>
> size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
> write: avg=8.388851 max=24.998846 min=8.335624 ms
> 8 ms: 1992 cases
> 9 ms: 2 cases
> 10 ms: 1 cases
> 14 ms: 1 cases
> 16 ms: 3 cases
> 24 ms: 1 cases

stoopid question 1: are you writing to a regular file, or to /dev/sda? If
the former then metadata fetches will introduce glitches.

stoopid question 2: does the same effect happen with reads?

2008-07-10 08:12:37

by Martin Sustrik

Subject: Re: Higher than expected disk write(2) latency

Hi Andrew,

>> we're getting some rather high figures for write(2) latency when testing
>> synchronous writing to disk. The test I'm running writes 2000 blocks of
>> contiguous data to a raw device, using O_DIRECT and various block sizes
>> down to a minimum of 512 bytes.
>>
>> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
>> using ata_piix. Write caching has been explicitly disabled on the
>> drive, and there is no other activity that should affect the test
>> results (all system filesystems are on a separate drive). The system is
>> running Debian etch, with a 2.6.24 kernel.
>>
>> Observed results:
>>
>> size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
>> write: avg=8.388851 max=24.998846 min=8.335624 ms
>> 8 ms: 1992 cases
>> 9 ms: 2 cases
>> 10 ms: 1 cases
>> 14 ms: 1 cases
>> 16 ms: 3 cases
>> 24 ms: 1 cases
>
> stoopid question 1: are you writing to a regular file, or to /dev/sda? If
> the former then metadata fetches will introduce glitches.

Not a file, just a raw device.

> stoopid question 2: does the same effect happen with reads?

Dunno. The read is not critical for us. However, I would expect the same
behaviour (see below).

We've got a satisfying explanation of the behaviour from Roger Heflin:

"You write sectors n and n+1; it takes some amount of time for that first
set of sectors to come under the head, and when it does, you write them
and immediately return. Immediately after that you attempt to write
sectors n+2 and n+3, which just a moment ago passed under the head, so
you have to wait an *ENTIRE* revolution for those sectors to come under
the head again, another ~8.3 ms, and you repeat this with each block
written. If the sectors were randomly placed in the rotation (i.e. a 50%
chance of the disk being off by half a rotation or less), you would see a
4.15 ms average rotational delay in your test; but sequential sync writes
leave the target sector about as far from the head as possible (it just
passed under the head)."

Now, the obvious solution was to use AIO to be able to enqueue write
requests even before the head reaches the end of the sector - thus there
would be no need for superfluous disk revolutions.

We've actually measured this scenario with kernel AIO (libaio1) and this
is what we've got (see attached graph).

The x axis represents individual write operations, the y axis represents
time. Crosses are operation enqueue times (when write requests were
issued), circles are notification times (when the app was notified that
the write request was processed).

What we see is that AIO performs rather badly while we are still
enqueueing more writes (it misses the right position on the disk and has
to do superfluous disk revolutions); however, once we stop enqueueing new
write requests, those already in the queue are processed swiftly.

My guess (I am not a kernel hacker) would be that sync operations on the
AIO queue are slowing down retrieval from the queue, and thus we miss the
right place on the disk almost all the time. Once the app stops
enqueueing new write requests there's no contention on the queue and we
are able to catch up with the speed of disk rotation.

If this is the case, the solution would be straightforward: when
dequeueing from the AIO queue, dequeue *all* the requests in the queue and
place them into another, non-synchronised queue. Getting an element from
a non-sync queue is a matter of a few nanoseconds, so we should be able
to process it before the head misses the right point on the disk. Once
the non-sync queue is empty, we get *all* the requests from the AIO queue
again. Etc.

Does anyone have an opinion on this matter?

Thanks.
Martin


Attachments:
aio.png (3.54 kB)

2008-07-10 08:20:47

by Andrew Morton

Subject: Re: Higher than expected disk write(2) latency

On Thu, 10 Jul 2008 10:12:12 +0200 Martin Sustrik <[email protected]> wrote:

> Hi Andrew,
>
> >> we're getting some rather high figures for write(2) latency when testing
> >> synchronous writing to disk. The test I'm running writes 2000 blocks of
> >> contiguous data to a raw device, using O_DIRECT and various block sizes
> >> down to a minimum of 512 bytes.
> >>
> >> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
> >> using ata_piix. Write caching has been explicitly disabled on the
> >> drive, and there is no other activity that should affect the test
> >> results (all system filesystems are on a separate drive). The system is
> >> running Debian etch, with a 2.6.24 kernel.
> >>
> >> Observed results:
> >>
> >> size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
> >> write: avg=8.388851 max=24.998846 min=8.335624 ms
> >> 8 ms: 1992 cases
> >> 9 ms: 2 cases
> >> 10 ms: 1 cases
> >> 14 ms: 1 cases
> >> 16 ms: 3 cases
> >> 24 ms: 1 cases
> >
> > stoopid question 1: are you writing to a regular file, or to /dev/sda? If
> > the former then metadata fetches will introduce glitches.
>
> Not a file, just a raw device.
>
> > stoopid question 2: does the same effect happen with reads?
>
> Dunno. The read is not critical for us. However, I would expect the same
> behaviour (see below).
>
> We've got a satisfying explanation of the behaviour from Roger Heflin:
>
> "You write sectors n and n+1: it takes some amount of time for that
> first set of sectors to come under the head; when it does, you write
> them and return immediately. Immediately after that you attempt to
> write sectors n+2 and n+3, which just a moment ago passed under the
> head, so you have to wait an *ENTIRE* revolution for those sectors to
> come under the head again, another ~8.3 ms, and you repeat this for
> each block written. If the sector were randomly placed in the rotation
> (i.e. a 50% chance of the disk being off by half a rotation or less),
> you would see a 4.15 ms average rotational delay in your test; but in
> the case of sequential sync writes the next sector is about as far
> from the head as possible (it just passed under it)."
>
> Now, the obvious solution was to use AIO to enqueue write requests
> even before the head reaches the end of the sector - thus there would
> be no need for superfluous disk revolutions.
>
> We've actually measured this scenario with kernel AIO (libaio1) and
> this is what we've got (see attached graph).
>
> The x axis represents individual write operations, the y axis
> represents time. Crosses are operation enqueue times (when write
> requests were issued); circles are times of notifications (when the
> app was notified that the write request had been processed).
>
> What we see is that AIO performs rather badly while we are still
> enqueueing more writes (it misses the right position on the disk and
> has to do superfluous disk revolutions); however, once we stop
> enqueueing new write requests, those already in the queue are
> processed swiftly.
>
> My guess (I am not a kernel hacker) would be that sync operations on
> the AIO queue are slowing down retrieval from the queue, and thus we
> miss the right place on the disk almost all the time. Once the app
> stops enqueueing new write requests there's no contention on the
> queue and we are able to catch up with the speed of disk rotation.
>
> If this is the case, the solution would be straightforward: when
> dequeueing from the AIO queue, dequeue *all* the requests in the queue
> and place them into another, non-synchronised queue. Getting an
> element from a non-synchronised queue is a matter of a few
> nanoseconds, so we should be able to process it before the head misses
> the right point on the disk. Once the non-synchronised queue is empty,
> we get *all* the requests from the AIO queue again, and so on.
>
> Does anyone have an opinion on this matter?

Not immediately, but the fine folks on the linux-aio list might be able to
help out. If you have some simple testcase code which you can share then
that would help things along.

2008-07-10 09:04:24

by Alan

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

> What we see is that AIO performs rather badly while we are still
> enqueueing more writes (it misses the right position on the disk and
> has to do superfluous disk revolutions); however, once we stop
> enqueueing new write requests, those already in the queue are
> processed swiftly.

Which disk scheduler are you using - some of the disk schedulers
intentionally delay writes to try and get better block merging.

Alan

2008-07-10 13:17:40

by Martin Sustrik

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

Hi Alan,

>> What we see is that AIO performs rather badly while we are still
>> enqueueing more writes (it misses the right position on the disk and
>> has to do superfluous disk revolutions); however, once we stop
>> enqueueing new write requests, those already in the queue are
>> processed swiftly.
>
> Which disk scheduler are you using - some of the disk schedulers
> intentionally delay writes to try and get better block merging.

It's CFQ. Does it delay writes? And if so, what should we use instead?

Thanks.
Martin

2008-07-10 13:24:48

by Andrew Morton

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

On Thu, 10 Jul 2008 15:17:47 +0200 Martin Sustrik <[email protected]> wrote:

> Hi Alan,
>
> >> What we see is that AIO performs rather badly while we are still
> >> enqueueing more writes (it misses the right position on the disk
> >> and has to do superfluous disk revolutions); however, once we stop
> >> enqueueing new write requests, those already in the queue are
> >> processed swiftly.
> >
> > Which disk scheduler are you using - some of the disk schedulers
> > intentionally delay writes to try and get better block merging.
>
> It's CFQ. Does it delay writes? And if so, what should we use instead?
>

noop is the simplest scheduler. deadline is the simplest real scheduler,
and deadline doesn't have any delaying logic.

If CFQ or anticipatory _are_ putting delays into this workload, that'd be
a bug.
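
For reference, the active scheduler can be inspected and switched
per-device at runtime through sysfs (substitute the device under test
for sda); this is the standard block-layer interface on kernels of this
vintage:

```shell
# List the compiled-in schedulers; the active one is shown in brackets,
# e.g.: noop anticipatory deadline [cfq]
cat /sys/block/sda/queue/scheduler

# Switch the device under test to deadline (noop works the same way)
echo deadline > /sys/block/sda/queue/scheduler
```

No reboot is needed; the change takes effect immediately and applies
only to that device.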

2008-07-10 13:30:39

by Chris Mason

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

On Thu, 2008-07-10 at 01:14 -0700, Andrew Morton wrote:
> On Thu, 10 Jul 2008 10:12:12 +0200 Martin Sustrik <[email protected]> wrote:
>
> > Hi Andrew,
> >
> > >> we're getting some rather high figures for write(2) latency when testing
> > >> synchronous writing to disk. The test I'm running writes 2000 blocks of
> > >> contiguous data to a raw device, using O_DIRECT and various block sizes
> > >> down to a minimum of 512 bytes.
> > >>
> > >> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
> > >> using ata_piix. Write caching has been explicitly disabled on the
> > >> drive, and there is no other activity that should affect the test
> > >> results (all system filesystems are on a separate drive). The system is
> > >> running Debian etch, with a 2.6.24 kernel.

Is NCQ enabled on the drive? The basic way to fix this is to have
multiple requests in flight, which isn't going to happen on sata with
the cache off and with ncq off.

So, AIO + NCQ should be fast, as long as your application has multiple
requests in flight through aio (or multiple threads).
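
Chris's "multiple threads" variant can be sketched in plain Python: each
thread issues positional writes independently, so several requests are
outstanding at once. An ordinary temp file stands in for the raw device
here (a real test would open the device with O_DIRECT and aligned
buffers); BLOCK, NBLOCKS and DEPTH are illustrative parameters, not
anything from the thread:

```python
import os
import tempfile
import threading

BLOCK = 512    # bytes per write, matching the test's smallest block size
NBLOCKS = 64   # total blocks to write
DEPTH = 4      # concurrent writers, i.e. requests kept in flight

# An ordinary temp file stands in for the raw device used in the test.
fd, path = tempfile.mkstemp()

def writer(tid):
    # Each thread owns an interleaved slice of the blocks, so at any
    # moment up to DEPTH write requests are outstanding at once.
    buf = bytes([tid]) * BLOCK
    for blk in range(tid, NBLOCKS, DEPTH):
        os.pwrite(fd, buf, blk * BLOCK)  # positional write, no shared offset

threads = [threading.Thread(target=writer, args=(t,)) for t in range(DEPTH)]
for t in threads:
    t.start()
for t in threads:
    t.join()
os.close(fd)
```

With the write cache off, the drive (via NCQ) can then reorder and
service whichever outstanding request is closest to the head, instead of
waiting a full revolution between back-to-back synchronous writes.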

-chris

2008-07-10 13:41:57

by Martin Lucina

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

[email protected] said:
> Is NCQ enabled on the drive? The basic way to fix this is to have
> multiple requests in flight, which isn't going to happen on sata with
> the cache off and with ncq off.

ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7: Hitachi HDS721616PLA380, P22OAB3A, max UDMA/133
ata1.00: 321672960 sectors, multi 0: LBA48 NCQ (depth 31/32)
ata1.00: configured for UDMA/133

I'm assuming the above indicates NCQ is enabled.

We'll try running the tests tomorrow with different i/o schedulers and
report back.

-mato

2008-07-10 14:02:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

On Thu, 10 Jul 2008 15:41:46 +0200
Martin Lucina <[email protected]> wrote:

> [email protected] said:
> > Is NCQ enabled on the drive? The basic way to fix this is to have
> > multiple requests in flight, which isn't going to happen on sata
> > with the cache off and with ncq off.
>
> ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata1.00: ATA-7: Hitachi HDS721616PLA380, P22OAB3A, max UDMA/133
> ata1.00: 321672960 sectors, multi 0: LBA48 NCQ (depth 31/32)
> ata1.00: configured for UDMA/133
>
> I'm assuming the above indicates NCQ is enabled.
>
> We'll try running the tests tomorrow with different i/o schedulers and
> report back.


another thing to try is using AHCI mode instead; AHCI tends to give
higher performance

(and.. for fun try to run latencytop during a run, to see if maybe
there are unsuspected delay causes)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-07-10 14:19:21

by Chris Mason

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

On Thu, 2008-07-10 at 07:01 -0700, Arjan van de Ven wrote:
> On Thu, 10 Jul 2008 15:41:46 +0200
> Martin Lucina <[email protected]> wrote:
>
> > [email protected] said:
> > > Is NCQ enabled on the drive? The basic way to fix this is to have
> > > multiple requests in flight, which isn't going to happen on sata
> > > with the cache off and with ncq off.
> >
> > ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> > ata1.00: ATA-7: Hitachi HDS721616PLA380, P22OAB3A, max UDMA/133
> > ata1.00: 321672960 sectors, multi 0: LBA48 NCQ (depth 31/32)
> > ata1.00: configured for UDMA/133
> >
> > I'm assuming the above indicates NCQ is enabled.
> >
> > We'll try running the tests tomorrow with different i/o schedulers and
> > report back.
>
>
> another thing to try is using AHCI mode instead; AHCI tends to give
> higher performance
>
> (and.. for fun try to run latencytop during a run, to see if maybe
> there are unsuspected delay causes)
>
Any latencies during submit (at least for kernel aio) are probably from
get_request_wait, but the numbers so far look like they are only seeing
one request in flight at a time.

-chris

2008-07-11 15:20:26

by Martin Sustrik

[permalink] [raw]
Subject: Re: Higher than expected disk write(2) latency

Hi all,

> Which disk scheduler are you using - some of the disk schedulers
> intentionally delay writes to try and get better block merging.

We've run the test with different I/O schedulers. We've found that the
poor performance we saw before was due to the AIO queue being too short.
Now that the queue is big enough to hold all the write requests in the
test, the results are much better. Have a look here for the test code
and the results in the form of graphs:

http://www.zeromq.org/results:aio

There are still at least two problems I see:

1. Enqueueing of writes seems to block every now and then, although the
AIO queue is large enough to hold all the requests in the test.

2. We've observed that messages are batched for writes, approximately 30
per batch. The latency impact of messages 2-30 is almost zero; however,
the impact of the first message is much higher than expected. I would
expect a latency of one disk revolution (8.3 ms) plus, optionally, the
latency of a head movement (say 10 us). What we are seeing are latencies
of 30-60 ms.
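
The revolution and rotational-latency figures quoted throughout this
thread can be sanity-checked with a little arithmetic (a sketch,
assuming a 7200 RPM spindle, which is what the 8.3 ms figure implies):

```python
# Rotational timing for a 7200 RPM drive.
RPM = 7200
rev_ms = 60_000.0 / RPM        # one full revolution in milliseconds

# Sequential sync writes: the next sector just passed under the head,
# so each write waits almost a full revolution (~8.33 ms), matching
# the observed 8.3-8.4 ms per write with the cache disabled.
worst_case_ms = rev_ms

# Randomly placed sectors: the head waits half a revolution on average
# (~4.17 ms), matching the datasheet's "average latency" of 4.16 ms.
avg_rotational_ms = rev_ms / 2

print(f"revolution: {rev_ms:.2f} ms, "
      f"average rotational latency: {avg_rotational_ms:.2f} ms")
```

By this arithmetic the 30-60 ms first-message latency corresponds to
roughly 4-7 missed revolutions, which is what makes it look like more
than a single rotational miss is involved.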

Martin