Subject: SSD read latency negatively impacted by large writes (independent of choice of I/O scheduler)
From: Zubin Dittia
To: linux-kernel@vger.kernel.org
Date: Fri, 30 Oct 2009 16:21:39 -0700
Message-ID: <47c554d90910301621y1f19a96bx454f539adec1ae35@mail.gmail.com>

I've been doing some testing with an Intel X25-E SSD and noticed that large writes can severely affect read latency, regardless of which I/O scheduler or scheduler parameters are in use (this is with kernel 2.6.28-16 from Ubuntu jaunty 9.04).

The test was very simple: I had two threads running. The first was in a tight loop reading a different 4 KB block each iteration (and recording the latency of each read) from the SSD block device file. While the first thread was doing this, a second thread did a single big 5 MB write to the device. What I noticed is that about 30 seconds after the write (which is when the data is actually written back to the device from the buffer cache), I see a very large spike in read latency: from 200 microseconds to 25 milliseconds. This seems to imply that the writes issued by the scheduler are not being broken up into sufficiently small chunks with reads interspersed; instead, the whole sequential write appears to be issued at once, starving reads during that period. I've seen the same behavior with SSDs from another vendor as well, and there the latency impact was even worse (80 ms).

Playing around with different I/O schedulers and their parameters doesn't seem to help at all. The same behavior shows up when using O_DIRECT (except that the latency hit is immediate instead of 30 seconds later, as one would expect). The only way I was able to reduce the worst-case read latency was by using O_DIRECT and breaking up the large write into multiple smaller writes, with one system call per smaller write. My theory is that the time between write system calls was enough to allow reads to squeeze in between the writes. But, as would be expected, this hurts sequential write throughput because of the overhead of the extra system calls.

My question is: have others seen this behavior? Are there any tunables that could help (perhaps a parameter that would cap the size of a write that can be outstanding to the device at any given time)? If not, would it make sense to implement a new I/O scheduler (or hack an existing one) that does this?

Thanks,
-Zubin
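
For illustration, here is a minimal sketch of the kind of two-thread test described above. It is not the original test code: the device path, region size, and reporting threshold are placeholder assumptions, and it writes to the raw block device, so it will destroy data on that device.

/*
 * Reader thread: times 4 KB reads from the block device in a tight
 * loop.  Main thread: issues a single 5 MB buffered write, then keeps
 * the process alive through the writeback window.
 *
 * Build:  gcc -O2 -o latprobe latprobe.c -lpthread
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define DEV      "/dev/sdX"           /* placeholder: the SSD under test */
#define BLK      4096                 /* 4 KB read size */
#define REGION   (1024ULL << 20)      /* read within the first 1 GB */
#define BIGWRITE (5 << 20)            /* single 5 MB write */

static double now_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

static void *reader(void *arg)
{
        int fd = open(DEV, O_RDONLY);
        char *buf = malloc(BLK);
        if (fd < 0 || !buf) { perror("reader"); exit(1); }

        for (;;) {
                /* a different 4 KB block each time, so the read (usually)
                 * misses the page cache and hits the device */
                off_t off = (off_t)(rand() % (int)(REGION / BLK)) * BLK;
                double t0 = now_ms();
                if (pread(fd, buf, BLK, off) != BLK) { perror("pread"); exit(1); }
                double dt = now_ms() - t0;
                if (dt > 1.0)         /* report anything over 1 ms */
                        printf("slow read: %.2f ms at offset %lld\n",
                               dt, (long long)off);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;
        pthread_create(&tid, NULL, reader, NULL);
        sleep(5);                     /* let the reader establish a baseline */

        /* single large buffered write; writeback reaches the device later */
        int wfd = open(DEV, O_WRONLY);
        char *wbuf = calloc(1, BIGWRITE);
        if (wfd < 0 || !wbuf) { perror("writer"); exit(1); }
        if (write(wfd, wbuf, BIGWRITE) != BIGWRITE) { perror("write"); exit(1); }
        close(wfd);

        sleep(60);                    /* keep sampling through writeback */
        return 0;
}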
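
And here is a sketch of the workaround mentioned above: opening the device with O_DIRECT and splitting the large write into smaller aligned chunks, one write() system call per chunk, so pending reads can be serviced in between. The 128 KB chunk size and the device path are assumptions, not values from the original test; O_DIRECT requires the buffer, offset, and length to be aligned to the device's logical block size.

/* Build:  gcc -O2 -o chunked_write chunked_write.c */
#define _GNU_SOURCE                   /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DEV   "/dev/sdX"              /* placeholder: the SSD under test */
#define TOTAL (5 << 20)               /* 5 MB total, as in the test above */
#define CHUNK (128 << 10)             /* 128 KB per write() -- tunable */

int main(void)
{
        int fd = open(DEV, O_WRONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }
        memset(buf, 0, CHUNK);

        for (off_t off = 0; off < TOTAL; off += CHUNK) {
                /* one system call per chunk; the gap between calls is what
                 * gives queued reads a chance to be serviced */
                if (pwrite(fd, buf, CHUNK, off) != CHUNK) {
                        perror("pwrite");
                        return 1;
                }
        }
        close(fd);
        return 0;
}

The trade-off is exactly the one described in the mail: smaller chunks improve worst-case read latency but cost sequential write throughput, so the chunk size has to be tuned per device.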