Kernel hackers,
I have an application that is writing large amounts of very fragmented data to
harddrives. That is, I could write megabytes of data in blocks of a few bytes
scattered around a multi-gigabyte file.
Obviously, doing this causes the harddrive to seek a lot and takes a while.
From what I understand, if I allow Linux to cache the writes, it will fill up
the kernel's write cache, and then consequently the disk drive's DMA queue. As
a result of that, the harddrive can pick the correct order to do these writes,
significantly reducing seek times.
However, there's a major cost in allowing the write cache to fill: fsync takes
*ages*. What's worse is that while fsync is proceeding, it seems *all* disk
operations in the OS are blocked. This is really terrible for performance of
my application: my application might want to do some reads (i.e. from another
thread) from the disk preempting the fsync temporarily. It's also really
terrible for me, because then my workstation becomes unresponsive for several
minutes.
My general question is how to mitigate this. Is it possible to get a signal
for when a file is out of the disk cache? Or can I ask Linux approximately how
much data is in the write queue for that specific file, and just do a
sleep()-loop, checking until it goes down to something manageable, at which
point I do the fsync? Or does aio support this scenario well, and if so, from
what version of Linux? (I've determined that there are some scenarios in which
it does, but it still requires O_DIRECT, apparently, which is weird considering
how I've heard Linux kernel hackers feel about that particular flag.)
And yes, I *know* fsync is a poor method to determine if data is actually
committed to something non-volatile. :)
Thanks for the help,
Charles
> the kernel's write cache, and then consequently the disk drive's DMA queue. As
> a result of that, the harddrive can pick the correct order to do these writes,
> significantly reducing seek times.
Well, that depends a lot on the data; if it's very scattered and random it
may not help much.
> And yes, I *know* fsync is a poor method to determine if data is actually
> committed to something non-volatile. :)
fsync/fdatasync should at least make sure it hit the disk. If barriers
are enabled the rest too.
What file system are you using - some of the file systems have serious
limits in this area around fsync and ordering, and you may be hitting those.
The ultimate answer is probably an SSD of course 8)
Alan
On Friday, April 01, 2011 1:10:49 pm Alan Cox wrote:
> > the kernel's write cache, and then consequently the disk drive's DMA
> > queue. As a result of that, the harddrive can pick the correct order to
> > do these writes, significantly reducing seek times.
>
> Well, that depends a lot on the data; if it's very scattered and random it
> may not help much.
Generally 64KiB - I don't know disk geometry, but I guess that's a lot smaller
than a cylinder.
But I don't really care about throughput (much); I care more about my
application not blocking everything while the fsync happens.
>
> > And yes, I *know* fsync is a poor method to determine if data is actually
> > committed to something non-volatile. :)
>
> fsync/fdatasync should at least make sure it hit the disk. If barriers
> are enabled the rest too.
>
> What file system are you using - some of the file systems have serious
> limits in this area around fsync and ordering, and you may be hitting those.
I've seen this on ext3, ext4, and XFS. Reiser3 not so much. I've also
convinced my users to use a write-back cache. They're Enterprise customers and
have loads of money to spend on hardware, such as Warp Cores.
>
> The ultimate answer is probably an SSD of course 8)
Well, that would solve the throughput, but it doesn't solve the blocking!
Charles
> I've seen this on ext3, ext4, and XFS. Reiser3 not so much. I've also
> convinced my users to use a write-back cache. They're Enterprise customers and
> have loads of money to spend on hardware, such as Warp Cores.
ext3 certainly suffers badly from it, that is known.
> > The ultimate answer is probably an SSD of course 8)
> Well, that would solve the throughput, but it doesn't solve the blocking!
Best place to go might be linux-fsdevel - especially if you can generate
a simulation of the problem workload (i.e. one that just does the I/O
patterns).
On Fri, Apr 01, 2011 at 12:59:53PM -0700, Charles Samuels wrote:
>
> I have an application that is writing large amounts of very
> fragmented data to harddrives. That is, I could write megabytes of
> data in blocks of a few bytes scattered around a multi-gigabyte
> file.
Doctor, doctor, it hurts when I do this.... Any way you can avoid
doing this? What is your application doing at the high level?
> Obviously, doing this causes the harddrive to seek a lot and takes a
> while. From what I understand, if I allow Linux to cache the
> writes, it will fill up the kernel's write cache, and then
> consequently the disk drive's DMA queue. As a result of that, the
> harddrive can pick the correct order to do these writes,
> significantly reducing seek times.
This is one way to avoid some of the seeks, yes.
> However, there's a major cost in allowing the write cache to fill:
> fsync takes *ages*. What's worse is that while fsync is proceeding,
> it seems *all* disk operations in the OS are blocked. This is really
> terrible for performance of my application: my application might
> want to do some reads (i.e. from another thread) from the disk
> preempting the fsync temporarily. It's also really terrible for me,
> because then my workstation becomes unresponsive for several
> minutes.
Who or what is calling fsync()? Is it being called by your
application because you want to initiate writeout? Or is it being
called by some completely unrelated process?
If it is being called by the application, one thing you can do is to
use the Linux-specific system call sync_file_range(). You can use
this to do asynchronous data flushes of the file, and control which
range of bytes are written out, which can also help avoid flooding the
disk with too many write requests.
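Something along these lines, as a rough sketch (the fd and the byte range
are placeholders; use whatever ranges your application has just dirtied):

/* Rough sketch: start asynchronous writeback of one dirty range of an
 * open fd without waiting for it to complete.  The offset and length
 * are illustrative only. */
#define _GNU_SOURCE
#include <fcntl.h>

static int start_async_flush(int fd, off64_t offset, off64_t nbytes)
{
        /* SYNC_FILE_RANGE_WRITE queues writeback for dirty pages in
         * [offset, offset + nbytes) and returns without waiting. */
        return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}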
If the fsync() is being called by some other process, then yeah, you
have a problem. What I'd suggest is using a separate partition and
file system for this application of yours which needs to write
megabytes and megabytes of data at random locations in this
multi-gigabyte file of yours....
- Ted
Hi,
Thanks for the reply.
On Sunday, April 03, 2011 7:02:35 pm Ted Ts'o wrote:
> On Fri, Apr 01, 2011 at 12:59:53PM -0700, Charles Samuels wrote:
> > I have an application that is writing large amounts of very
> > fragmented data to harddrives. That is, I could write megabytes of
> > data in blocks of a few bytes scattered around a multi-gigabyte
> > file.
>
> Doctor, doctor, it hurts when I do this.... Any way you can avoid
> doing this? What is your application doing at the high level?
Not really, I need the on-disk data organized in this pattern, so that the
reads are optimized nicely. It's a database application.
>
> > Obviously, doing this causes the harddrive to seek a lot and takes a
> > while. From what I understand, if I allow Linux to cache the
> > writes, it will fill up the kernel's write cache, and then
> > consequently the disk drive's DMA queue. As a result of that, the
> > harddrive can pick the correct order to do these writes,
> > significantly reducing seek times.
>
> This is one way to avoid some of the seeks, yes.
What's another way? Other than not doing it :)
> Who or what is calling fsync()? Is it being called by your
> application because you want to initiate writeout? Or is it being
> called by some completely unrelated process?
It's being called by my own process. When fsync finishes, I update another file
with some offset counters, fsync that, and with some luck, my writes are
transactional.
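(Roughly this shape, with made-up names; data_fd is the big file, meta_fd
holds the offset counters, and error handling is trimmed:)

/* Rough sketch of the commit sequence described above; the names are
 * made up and error handling is trimmed. */
#include <unistd.h>

static int commit(int data_fd, int meta_fd, const void *counters, size_t len)
{
        if (fsync(data_fd) < 0)                /* data must be durable first   */
                return -1;
        if (pwrite(meta_fd, counters, len, 0) != (ssize_t)len)
                return -1;                     /* then publish the new offsets */
        return fsync(meta_fd);                 /* and make the offsets durable */
}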
> If it is being called by the application, one thing you can do is to
> use the Linux-specific system call sync_file_range(). You can use
> this to do asynchronous data flushes of the file, and control which
> range of bytes are written out, which can also help avoid flooding the
> disk with too many write requests.
What would be a good use of sync_file_range? It looks pretty useful, but I
don't know how to make good use of it.
For example, SYNC_FILE_RANGE_WRITE, wouldn't Linux start this pretty much
immediately? And wouldn't I rather not give it a suggestion about what order
to do the writes in?
Would calling sync_file_range with a flag that allows blocking have a
performance benefit compared to fsync? Specifically, can I expect Linux to not
totally block all reads and writes to other files?
Charles
On Mon, 4 Apr 2011, Charles Samuels wrote:
> Hi,
>
> Thanks for the reply.
>
> On Sunday, April 03, 2011 7:02:35 pm Ted Ts'o wrote:
>> On Fri, Apr 01, 2011 at 12:59:53PM -0700, Charles Samuels wrote:
>>> I have an application that is writing large amounts of very
>>> fragmented data to harddrives. That is, I could write megabytes of
>>> data in blocks of a few bytes scattered around a multi-gigabyte
>>> file.
>>
>> Doctor, doctor, it hurts when I do this.... Any way you can avoid
>> doing this? What is your application doing at the high level?
> Not really, I need the on-disk data organized in this pattern, so that the
> reads are optimized nicely. It's a database application.
>
>>
>>> Obviously, doing this causes the harddrive to seek a lot and takes a
>>> while. From what I understand, if I allow Linux to cache the
>>> writes, it will fill up the kernel's write cache, and then
>>> consequently the disk drive's DMA queue. As a result of that, the
>>> harddrive can pick the correct order to do these writes,
>>> significantly reducing seek times.
>>
>> This is one way to avoid some of the seeks, yes.
>
> What's another way? Other than not doing it :)
>
>> Who or what is calling fsync()? Is it being called by your
>> application because you want to initiate writeout? Or is it being
>> called by some completely unrelated process?
>
> It's being called by my own process. When fsync finishes, I update another file
> with some offset counters, fsync that, and with some luck, my writes are
> transactional.
Get yourself a raid controller with a battery-backed cache on it. Then
your application can consider the data 'safe' once it's written to the
raid controller (and the fsync will return at that point); the raid
controller and the disks can then write the data in whatever order they
want, and you won't care.
This is a standard requirement for high-performance databases. Without
this they run into the exact problem you are experiencing.
This battery backed cache can be on the raid card in your machine, or in
the disk array that you are connecting to.
David Lang
On Mon, Apr 04, 2011 at 10:50:12AM -0700, Charles Samuels wrote:
>
> > Who or what is calling fsync()? Is it being called by your
> > application because you want to initiate writeout? Or is it being
> > called by some completely unrelated process?
>
> It's being called by my own process. When fsync finishes, I update
> another file with some offset counters, fsync that, and with some
> luck, my writes are transactional.
OK, how often are you calling fsync()? Is this something where you
are trying to get transactional guarantees by calling fsync() between
each transaction? And if so, how big are your transactions? If you
are trying to call fsync() 10+ times/second, then your only hope
really is going to be a battery-backed RAID controller card, as David
Lang has already suggested.
> What would be a good use of sync_file_range? It looks pretty useful,
> but I don't know how to make good use of it. For example,
> SYNC_FILE_RANGE_WRITE, wouldn't Linux start this pretty much
> immediately?
No, not necessarily. Generally Linux will pause for a bit to
hopefully allow writes to coalesce.
The reason why I suggested sync_file_range() is because you mentioned
that you tried waiting until there was a large amount of data in the
page cache, and then you called fsync() and that was taking forever.
From that I assumed you didn't necessarily have ACID or transactional
requirements.
The advantage of using sync_file_range() is that instead of forcing a
blocking write for *all* of the data pages, you can do it on only part
of your data pages. This keeps the writing from interfering with
subsequent reads taking place to your database.
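As a sketch (the chunk size is arbitrary and purely illustrative):

/* Sketch: flush a file's dirty pages a chunk at a time so that only a
 * bounded range is ever being waited on.  CHUNK is an arbitrary
 * example value. */
#define _GNU_SOURCE
#include <fcntl.h>

#define CHUNK (8 * 1024 * 1024)        /* flush 8 MiB at a time, for example */

static int flush_in_chunks(int fd, off64_t start, off64_t end)
{
        off64_t off;

        for (off = start; off < end; off += CHUNK) {
                off64_t n = (end - off < CHUNK) ? end - off : CHUNK;

                /* Wait for any writeback already in flight for this
                 * range, start writeback of its dirty pages, and wait
                 * for that writeback to finish; only this range blocks,
                 * not the whole file. */
                if (sync_file_range(fd, off, n,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                        return -1;
        }
        return 0;
}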
All of this goes by the boards if you need data integrity guarantees,
of course; in that case you need to call fsync() after each atomic
transaction update...
- Ted