Richard Hipp writes:
>
> We would really, really love to have some kind of write-barrier that is
> lighter than fsync(). If there is some method other than fsync() for
> forcing a write-barrier on Linux that we don't know about, please enlighten
> us.
Could you list the requirements of such a lightweight barrier?
I.e., what would it need to do minimally, and what's different from
fsync/fdatasync?
-Andi
--
[email protected] -- Speaking for myself only
I am not quite sure whether I should ask this question here, but in terms
of lightweight barriers/fsync, could anyone tell me why the device
driver / OS provides a barrier interface rather than some other
abstraction? I am sorry if this sounds like a stupid question
or if it has been discussed before....
I mean, most of the time we only need some ordering of writes; not a
complete order, but a partial, very simple topological order. And a
barrier seems to be a heavyweight solution for achieving this:
you have to finish all writes before the barrier, then start all
writes issued after the barrier. That is an ordering which is much
stronger than what we need, isn't it?
As most of the time the ordering we need does not involve too many blocks
(certainly a lot fewer than all the cached blocks in the system or in
the disk's cache), that topological order isn't likely to be very
complicated, and I imagine it could be implemented efficiently in a
modern device, which already has complicated caching/garbage
collection/whatever going on internally. In particular, it seems not
too hard to implement on top of SCSI's ordered/simple task mode?
(I believe Windows does this to an extent, but I am not quite sure.)
Thanks a lot
Suli
On Wed, Oct 10, 2012 at 12:17 PM, Andi Kleen <[email protected]> wrote:
> Richard Hipp writes:
>>
>> We would really, really love to have some kind of write-barrier that is
>> lighter than fsync(). If there is some method other than fsync() for
>> forcing a write-barrier on Linux that we don't know about, please enlighten
>> us.
>
> Could you list the requirements of such a lightweight barrier?
> I.e., what would it need to do minimally, and what's different from
> fsync/fdatasync?
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only
> _______________________________________________
> sqlite-users mailing list
> [email protected]
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
On Thu, Oct 11, 2012 at 11:32:27AM -0500, 杨苏立 Yang Su Li wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of lightweight barriers/fsync, could anyone tell me why the device
> driver / OS provides a barrier interface rather than some other
> abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....
It does not. Except for the legacy mount option naming there is no such
thing as a barrier in Linux these days.
杨苏立 Yang Su Li, on 10/11/2012 12:32 PM wrote:
> I am not quite sure whether I should ask this question here, but in terms
> of lightweight barriers/fsync, could anyone tell me why the device
> driver / OS provides a barrier interface rather than some other
> abstraction? I am sorry if this sounds like a stupid question
> or if it has been discussed before....
>
> I mean, most of the time we only need some ordering of writes; not a
> complete order, but a partial, very simple topological order. And a
> barrier seems to be a heavyweight solution for achieving this:
> you have to finish all writes before the barrier, then start all
> writes issued after the barrier. That is an ordering which is much
> stronger than what we need, isn't it?
>
> As most of the time the ordering we need does not involve too many blocks
> (certainly a lot fewer than all the cached blocks in the system or in
> the disk's cache), that topological order isn't likely to be very
> complicated, and I imagine it could be implemented efficiently in a
> modern device, which already has complicated caching/garbage
> collection/whatever going on internally. In particular, it seems not
> too hard to implement on top of SCSI's ordered/simple task mode?
Yes, SCSI has full support for ordered/simple commands designed exactly for that
task: to keep a steady flow of commands even when some of them are ordered.
It also has the necessary facilities to handle command errors without unexpected
reordering of subsequent commands (ACA, etc.). Those allow you to get full storage
performance by fully "filling the pipe", to use networking terms. I can easily
imagine real-life configurations where it can bring 2+ times more performance than
with queue flushing.
In fact, AFAIK, AIX requires storage to support ordered commands and ACA.
Implementation should be relatively easy as well, because all transports naturally
have the link as the point of serialization, so all you need in a multithreaded
environment is to carry a sequence number from the point where each ORDERED command
is created to the point where it is sent to the link, and make sure that no SIMPLE
commands can ever cross ORDERED commands. You can see how it is implemented in SCST
in an elegant and lockless manner (for SIMPLE commands).
But historically, for some reason, Linux storage developers were stuck with
the "barriers" concept, which is obviously not the same as ORDERED commands, and
hence had a lot of trouble with its ambiguous semantics. As far as I can tell the
reason for that was a lack of sufficiently deep SCSI understanding (how to handle
errors, the belief that ACA is something legacy from parallel SCSI times, etc.).
Hopefully the storage developers will eventually realize the value behind ordered
commands and learn the corresponding SCSI facilities to deal with them. It's quite
easy to demonstrate this value if you know where to look and don't blindly
refuse the possibility. I have already tried to explain it a couple of times,
but was not successful.
Before that happens, people will keep returning again and again with the same
simple questions: why must the queue be flushed for any ordered operation?
Isn't it obvious overkill?
Vlad
On Tue, Oct 23, 2012 at 2:53 PM, Vladislav Bolkhovitin
<[email protected]> wrote:
>> As most of the time the ordering we need does not involve too many blocks
>> (certainly a lot fewer than all the cached blocks in the system or in
>> the disk's cache), that topological order isn't likely to be very
>> complicated, and I imagine it could be implemented efficiently in a
>> modern device, which already has complicated caching/garbage
>> collection/whatever going on internally. In particular, it seems not
>> too hard to implement on top of SCSI's ordered/simple task mode?
If you have multiple layers involved (e.g., SQLite then the
filesystem, and if the filesystem is spread over multiple storage
devices), and if transactions are not bounded, and on top of that if
there are other concurrent writers to the same filesystem (even if not
the same files) then the set of blocks to write and internal ordering
can get complex. In practice filesystems try to break these up into
large self-consistent chunks and write those -- ZFS does this, for
example -- and this is aided by the lack of transactional semantics in
the filesystem.
For SQLite with a VFS that talks [i]SCSI directly, things could be
much more manageable, as there's only one write transaction in progress
at any given time. But that's not realistic, except, perhaps, in some
embedded systems.
> Yes, SCSI has full support for ordered/simple commands designed exactly for
> that task: [...]
>
> [...]
>
> But historically, for some reason, Linux storage developers were stuck with
> the "barriers" concept, which is obviously not the same as ORDERED commands,
> and hence had a lot of trouble with its ambiguous semantics. As far as I can
> tell the reason for that was a lack of sufficiently deep SCSI understanding
> (how to handle errors, the belief that ACA is something legacy from parallel
> SCSI times, etc.).
Barriers are a very simple abstraction, so there's that.
> Hopefully the storage developers will eventually realize the value behind
> ordered commands and learn the corresponding SCSI facilities to deal with them.
> It's quite easy to demonstrate this value if you know where to look and don't
> blindly refuse the possibility. I have already tried to explain it a
> couple of times, but was not successful.
Exposing ordering of lower-layer operations to filesystem applications
is a non-starter. About the only reasonable thing to do with a
filesystem is add barrier operations. I know, you're talking about
lower layer capabilities, and SQLite could talk to that layer
directly, but let's face it: it's not likely to.
> Before that happens, people will keep returning again and again with the same
> simple questions: why must the queue be flushed for any ordered operation?
> Isn't it obvious overkill?
That [cache flushing] is not what's being asked for here. Just a
light-weight barrier. My proposal works without having to add new
system calls: a) use a COW format, b) have background threads doing
fsync()s, c) in each transaction's root block note the last
known-committed (from a completed fsync()) transaction's root block,
d) have an array of well-known uberblocks large enough to accommodate
as many transactions as possible without having to wait for any one
fsync() to complete, e) do not reclaim space from any one past
transaction until at least one subsequent transaction is fully
committed. This obtains ACI- transaction semantics (survives power
failures but without durability for the last N transactions at
power-failure time) without requiring changes to the OS at all, and
with support for delayed D (durability) notification.
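To make (a)-(e) concrete, the on-disk layout I have in mind is roughly the
following sketch (names and sizes are invented for illustration only, not taken
from any real DB format):

#include <stdint.h>

#define N_UBERBLOCKS 128   /* enough slots that we never wait on any one fsync() */

/* Root block of one COW transaction: everything it references was written
 * to fresh, never-overwritten locations, per point (a). */
struct txn_root {
    uint64_t txn_id;
    uint64_t tree_root_blkno;    /* top of this transaction's COW tree */
    uint64_t last_durable_txn;   /* newest txn whose background fsync() has
                                    already returned (points (b) and (c)) */
    uint64_t checksum;           /* lets recovery reject a torn root block */
};

/* Well-known array of root slots, per point (d); slot = txn_id % N_UBERBLOCKS.
 * Space referenced by an old transaction is only reclaimed once some later
 * transaction is known durable, per point (e), so recovery can always walk
 * back to a fully committed root. */
struct uberblock_array {
    struct txn_root slots[N_UBERBLOCKS];
};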
Nico
--
On Wed, 24 Oct 2012, Nico Williams wrote:
>> Before that happens, people will keep returning again and again with the same
>> simple questions: why must the queue be flushed for any ordered operation?
>> Isn't it obvious overkill?
>
> That [cache flushing] is not what's being asked for here. Just a
> light-weight barrier. My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed. This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.
I'm doing some work with rsyslog and its disk-based queues and there is a
similar issue there. The good news is that we can have a version that is
Linux-specific (rsyslog is used on other OSs, but there is an existing
queue implementation that they can use; if the faster one is Linux-only
but significantly faster, that's just a win for Linux).
Like what is being described for sqlite, losing the tail end of the
messages is not a big problem under normal conditions. But there is a need
to be sure that what is there is complete up to the point where it's lost.
This is similar in concept to the write-ahead logs done for databases (without
the absolute durability requirement).
1. New messages arrive and get added to the end of the queue file.
2. A thread updates the queue to indicate that it is in the process
of delivering a block of messages.
3. The thread updates the queue to indicate that the block of messages has
been delivered.
4. Garbage collection happens to delete the old messages to free up space
(if queues go into files, this can just be a matter of limiting the file size,
spilling to multiple files, and deleting an old file once it is completely
marked as delivered). A rough sketch of the sort of per-message record this
implies follows below.
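Something like this, purely illustrative (the names are invented; this is not
rsyslog's actual on-disk format):

#include <stdint.h>

enum rec_state {
    REC_QUEUED     = 0,   /* step 1: appended to the end of the queue file */
    REC_DELIVERING = 1,   /* step 2: a worker is delivering this block     */
    REC_DELIVERED  = 2    /* step 3: done, eligible for garbage collection */
};

struct queue_rec {
    uint32_t len;         /* length of msg[]                    */
    uint32_t state;       /* one of rec_state, updated in place */
    char     msg[];       /* the log message itself             */
};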
I am not fully understanding how what you are describing (COW, separate
fsync threads, etc.) would be implemented on top of existing filesystems.
Most of what you are describing seems like it requires access to the
underlying storage to implement.
Could you give a more detailed explanation?
David Lang
On Wed, Oct 24, 2012 at 5:03 PM, <[email protected]> wrote:
> I'm doing some work with rsyslog and its disk-based queues and there is a
> similar issue there. The good news is that we can have a version that is
> Linux-specific (rsyslog is used on other OSs, but there is an existing queue
> implementation that they can use; if the faster one is Linux-only but
> significantly faster, that's just a win for Linux).
>
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is a need
> to be sure that what is there is complete up to the point where it's lost.
>
> This is similar in concept to the write-ahead logs done for databases (without
> the absolute durability requirement).
>
> [...]
>
> I am not fully understanding how what you are describing (COW, separate
> fsync threads, etc.) would be implemented on top of existing filesystems.
> Most of what you are describing seems like it requires access to the
> underlying storage to implement.
>
> Could you give a more detailed explanation?
COW is "copy on write", which is actually a bit of a misnomer -- all
COW means is that blocks aren't over-written, instead new blocks are
written. In particular this means that inodes, indirect blocks, data
blocks, and so on, that are changed are actually written to new
locations, and the on-disk format needs to handle this indirection.
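As a sketch of what that means in practice (the helper names here are invented,
this isn't any particular filesystem's or DB's code):

#include <stdint.h>

struct blkptr { uint64_t blkno; };   /* on-disk pointer to a block */

uint64_t alloc_block(void);          /* hypothetical: returns a fresh, unused block */
void     write_block(uint64_t blkno, const void *data);   /* hypothetical */

/* A COW "update" of a data block: the old copy is never touched; the new copy
 * goes to a fresh location, and the caller then COW-updates the parent that
 * points at it, all the way up to the root. */
struct blkptr cow_write(const void *new_data)
{
    struct blkptr p = { alloc_block() };
    write_block(p.blkno, new_data);
    return p;    /* stored into the (also COW-written) parent/indirect block */
}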
As for fsync() and background threads... fsync() is synchronous, but in
this scheme we want it to happen asynchronously and then we want to
update each transaction with a pointer to the last transaction that is
known stable given an fsync()'s return.
Nico
--
On Wed, 24 Oct 2012, Nico Williams wrote:
> On Wed, Oct 24, 2012 at 5:03 PM, <[email protected]> wrote:
>> I'm doing some work with rsyslog and its disk-based queues and there is a
>> similar issue there. The good news is that we can have a version that is
>> Linux-specific (rsyslog is used on other OSs, but there is an existing queue
>> implementation that they can use; if the faster one is Linux-only but
>> significantly faster, that's just a win for Linux).
>>
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is a need
>> to be sure that what is there is complete up to the point where it's lost.
>>
>> This is similar in concept to the write-ahead logs done for databases (without
>> the absolute durability requirement).
>>
>> [...]
>>
>> I am not fully understanding how what you are describing (COW, separate
>> fsync threads, etc.) would be implemented on top of existing filesystems.
>> Most of what you are describing seems like it requires access to the
>> underlying storage to implement.
>>
>> Could you give a more detailed explanation?
>
> COW is "copy on write", which is actually a bit of a misnomer -- all
> COW means is that blocks aren't over-written, instead new blocks are
> written. In particular this means that inodes, indirect blocks, data
> blocks, and so on, that are changed are actually written to new
> locations, and the on-disk format needs to handle this indirection.
So how can you do this, and keep the writes in order (especially between
two files), without being the filesystem?
> As for fsync() and background threads... fsync() is synchronous, but in
> this scheme we want it to happen asynchronously and then we want to
> update each transaction with a pointer to the last transaction that is
> known stable given an fsync()'s return.
If you could specify ordering between two writes, I could see a process
along the lines of:
Append new message to file1.
Append tiny status updates to file2.
Every million messages, move to new files. Once the last message has been
processed for the old set of files, delete them.
Since file2 is small, you can reconstruct state fairly cheaply.
But unless you are a filesystem, how can you make sure that the message
data is written to file1 before you write the metadata about the message
to file2?
Right now it seems that there is no way for an application to do this
other than doing an fsync(file1) before writing the metadata to file2.
And there is no way for the application to tell the filesystem to write
the data in file2 in order (to make sure that you can't have block 3 written
and then the system crash before block 2 is written), so the application
needs to do frequent fsync(file2) calls.
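That is, the only portable tool today is the heavyweight pattern below (a sketch;
error handling omitted):

#include <unistd.h>

/* Write the message data and wait for it to be stable before recording any
 * metadata that refers to it; fsync() is doing double duty as an ordering
 * barrier here, which is exactly the expensive part. */
void write_msg_then_meta(int fd1, const void *msg, size_t msg_len,
                         int fd2, const void *meta, size_t meta_len)
{
    write(fd1, msg, msg_len);      /* the message data                     */
    fsync(fd1);                    /* wait until it is stable on disk...   */
    write(fd2, meta, meta_len);    /* ...before the metadata describing it */
}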
If you need complete durability of your data, there are well documented
ways of enforcing it (including the lwn.net article
http://lwn.net/Articles/457667/ )
But if you don't need the guarantee that your data is on disk now, and you
just need it ordered so that if you crash you are guaranteed only to
lose data off the tail of your file, there doesn't seem to be any way
to do this other than using the fsync() hammer and waiting for the overhead
of forcing the data to disk now.
Or, as I type this, it occurs to me that you may be saying that every time
you want to do an ordering guarantee, spawn a new thread to do the fsync
and then just keep processing. The fsync will happen at some point, and
the writes will not be re-ordered across the fsync, but you can keep
going, writing more data while the fsyncs are pending.
Then if you have a filesystem and I/O subsystem that can consolidate the
fsyncs from all the different threads together into one I/O operation
without having to flush the entire I/O queue for each one, you can get
acceptable performance, with ordering. If the system crashes, data that
hasn't had its fsync() complete will be the only thing that is lost.
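In other words, something like this sketch (mark_durable() is just a stand-in
for whatever bookkeeping records the highest message number known to be on disk):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

extern void mark_durable(long seq);   /* app-specific bookkeeping, not shown */

struct sync_req { int fd; long seq; };

static void *fsync_worker(void *arg)
{
    struct sync_req *r = arg;
    if (fsync(r->fd) == 0)      /* blocks, but only this helper thread */
        mark_durable(r->seq);   /* everything up to seq is now on stable storage */
    free(r);
    return NULL;
}

/* Called after appending message number 'seq'; the caller keeps writing. */
static void schedule_fsync(int fd, long seq)
{
    struct sync_req *r = malloc(sizeof(*r));
    pthread_attr_t attr;
    pthread_t tid;

    r->fd = fd;
    r->seq = seq;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_create(&tid, &attr, fsync_worker, r);
    pthread_attr_destroy(&attr);
}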
David Lang
On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
> Yes, SCSI has full support for ordered/simple commands designed
> exactly for that task: to keep a steady flow of commands even
> when some of them are ordered.....
SCSI does, yes --- *if* the device actually implements Tagged Command
Queuing (TCQ). Not all devices do.
More importantly, SATA drives do *not* have this capability, and when
you compare the price of SATA drives to uber-expensive "enterprise
drives", it's not surprising that most people don't actually use
SCSI/SAS drives that have implemented TCQ. SATA's Native Command
Queuing (NCQ) is not equivalent; this allows the drive to reorder
requests (in particular read requests) so they can be serviced more
efficiently, but it does *not* allow the OS to specify a partial,
relative ordering of requests.
Yes, you can turn off writeback caching, but that has pretty huge
performance costs; and there is the FUA bit, but that's just an
unconditional high priority bypass of the writeback cache, which is
useful in some cases, but which again, does not give the ability for
the OS to specify a partial order, while letting the drive reorder
other requests for efficiency/performance's sake, since the drive has
a lot more information about the optimal way to reorder requests based
on the current location of the drive head and where certain blocks may
have been remapped due to bad block sparing, etc.
> Hopefully the storage developers will eventually realize the value
> behind ordered commands and learn the corresponding SCSI facilities to
> deal with them.
Eventually, drive manufacturers will realize that trying to price
gouge people who want advanced features such as TCQ or DIF/DIX is the
best way to guarantee that most people won't bother to purchase them,
and hence the features will remain largely unused....
- Ted
On Wed, Oct 24, 2012 at 8:04 PM, <[email protected]> wrote:
> On Wed, 24 Oct 2012, Nico Williams wrote:
>> COW is "copy on write", which is actually a bit of a misnomer -- all
>> COW means is that blocks aren't over-written, instead new blocks are
>> written. In particular this means that inodes, indirect blocks, data
>> blocks, and so on, that are changed are actually written to new
>> locations, and the on-disk format needs to handle this indirection.
>
> so how can you do this, and keep the writes in order (especially between two
> files) without being the filesystem?
By trusting fsync(). And if you don't care about immediate Durability
you can run the fsync() in a background thread and mark the associated
transaction as completed in the next transaction to be written after
the fsync() completes.
>> As for fsync() and background threads... fsync() is synchronous, but in
>> this scheme we want it to happen asynchronously and then we want to
>> update each transaction with a pointer to the last transaction that is
>> known stable given an fsync()'s return.
>
> If you could specify ordering between two writes, I could see a process
> along the lines of
>
> [...]
fsync() deals with just one file. fsync()s of different files are
another story. That said, as long as the format of the two files is
COW then you can still compose transactions involving two files. The
key is that the file contents themselves must be COW-structured.
Incidentally, here's a single-file bag of b-trees that uses a COW
format: MDB, which can be found in
git://git.openldap.org/openldap.git, in the mdb.master branch.
> Or, as I type this, it occurs to me that you may be saying that every time
> you want to do an ordering guarantee, spawn a new thread to do the fsync and
> then just keep processing. The fsync will happen at some point, and the
> writes will not be re-ordered across the fsync, but you can keep going,
> writing more data while the fsync's are pending.
Yes, but only if the file's format is COWish.
The point is that COW saves the day. A file-based DB needs to be COW.
And the filesystem needs to be as well.
Note that write ahead logging approximates COW well enough most of the time.
> Then if you have a filesystem and I/O subsystem that can consolidate the
> fsyncs from all the different threads together into one I/O operation
> without having to flush the entire I/O queue for each one, you can get
> acceptable performance, with ordering. If the system crashes, data that
> hasn't had its fsync() complete will be the only thing that is lost.
With the above caveat, yes.
Nico
--
On Wed, Oct 24, 2012 at 03:03:00PM -0700, [email protected] wrote:
> Like what is being described for sqlite, losing the tail end of the
> messages is not a big problem under normal conditions. But there is
> a need to be sure that what is there is complete up to the point
> where it's lost.
>
> this is similar in concept to write-ahead-logs done for databases
> (without the absolute durability requirement)
If that's what you require, and you are using ext3/4, using data
journalling might meet your requirements. It's something you can
enable on a per-file basis, via chattr +j; you don't have to force all
file systems to use data journaling via the data=journal mount
option.
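For what it's worth, chattr +j just sets the per-inode journal-data flag, so an
application can also turn it on for its own files via the FS_IOC_GETFLAGS /
FS_IOC_SETFLAGS ioctls; a rough sketch (ext3/4 only, and setting the flag may
require privilege):

#include <sys/ioctl.h>
#include <linux/fs.h>

/* The in-program equivalent of "chattr +j" on an already-open file.
 * Error handling is kept to the bare minimum for brevity. */
static int enable_data_journalling(int fd)
{
    int flags;

    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -1;
    flags |= FS_JOURNAL_DATA_FL;    /* the 'j' attribute */
    return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}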
The potential downsides that you may or may not care about for this
particular application:
(a) This will definitely have a performance impact, especially if you
are doing lots of small (less than 4k) writes, since the data blocks
will get run through the journal, and will only later get written to their
final location on disk.
(b) You don't get atomicity if the write spans a 4k block boundary.
All of the bytes before i_size will be written, so you don't have to
worry about "holes"; but the last message written to the log file
might be truncated.
(c) There will be a performance impact, since the contents of data
blocks will be written at least twice (once to the journal, and once
to the final location on disk). If you do lots of small, sub-4k
writes, the performance might be even worse, since data blocks might
be written multiple times to the journal.
- Ted
On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>
> By trusting fsync(). And if you don't care about immediate Durability
> you can run the fsync() in a background thread and mark the associated
> transaction as completed in the next transaction to be written after
> the fsync() completes.
The challenge is when you have entangled metadata updates. That is,
you update file A, and file B, and file A and B might share metadata.
In order to sync file A, you also have to update part of the metadata
for the updates to file B, which means calculating the dependencies of
what you have to drag in can get very complicated. You can keep track
of what bits of the metadata you have to undo and then redo before
writing out the metadata for fsync(A), but that basically means you
have to implement soft updates, and all of the complexity this
implies: http://lwn.net/Articles/339337/
If you can keep all of the metadata separate, this can be somewhat
mitigated, but usually the block allocation records (regardless of
whether you use a tree, or a bitmap, or some other data structure)
tend to have entanglement problems.
It certainly is not impossible; RDBMS's have implemented this. On the
other hand, they generally aren't as fast as file systems for
non-transactional workloads, and people really care about performance
on those sorts of workloads for file systems. (About a decade ago,
Oracle tried to claim that you could run file system workloads using
an Oracle database as a back-end. Everyone laughed at them, and the
idea died a quick, merciful death.)
Still, if you want to try to implement such a thing, by all means,
give it a try. But I think you'll find that creating a file system
that can compete with existing file systems for performance, and
*then* also supports a transactional model, is going to be quite a
challenge.
- Ted
On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync(). And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
>
> The challenge is when you have entangled metadata updates. That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated. You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/
>
> If you can keep all of the metadata separate, this can be somewhat
> mitigated, but usually the block allocation records (regardless of
> whether you use a tree, or a bitmap, or some other data structure)
> tend to have entanglement problems.
Hmm, two thoughts occur to me.
1. To avoid entanglement, put the two files in separate directories.
2. Take advantage of entanglement to enforce ordering:
thread 1 (repeated): write a new message to file 1, spawn a new thread to
fsync
thread 2: write to file 2 that messages 1-5 are being worked on
thread 2 (later): write to file 2 that messages 1-5 are done
When thread 1 spawns the new thread to do the fsync, the system will be
forced to write the data to file 2 as of the time it does the fsync.
This should make it so that you never have data written to file2 that
refers to data that hasn't been written to file1 yet.
> It certainly is not impossible; RDBMS's have implemented this. On the
> other hand, they generally aren't as fast as file systems for
> non-transactional workloads, and people really care about performance
> on those sorts of workloads for file systems.
The RDBMS's have implemented stronger guarantees than what we need.
A few years ago I was investigating this for logging. With the reliable
(RDBMS-style) but inefficient disk queue that rsyslog has, writing to a
high-end Fusion-io SSD, ext2 resulted in ~8K logs/sec, ext3 resulted in ~2K
logs/sec, and JFS/XFS resulted in ~4K logs/sec (ext4 wasn't considered
stable enough at the time to be tested).
> Still, if you want to try to implement such a thing, by all means,
> give it a try. But I think you'll find that creating a file system
> that can compete with existing file systems for performance, and
> *then* also supports a transactional model, is going to be quite a
> challenge.
The question is trying to figure out a way to get ordering right with existing
filesystems (preferably without using something too tied to a single
filesystem implementation), not to try and create a new one.
The frustrating thing is that when people point out how things like sqlite
are so horribly slow, the reply seems to be "well, that's what you get for
doing so many fsyncs, don't do that", while when there is a 'problem' like the
KDE "config loss" problem a few years ago, the response is "well, that's
what you get for not doing fsync".
Both responses are correct, from a purely technical point of view.
But what's missing is any way to get the result of ordered I/O that will
let you do something pretty fast, but with the guarantee that, if you
lose data in a crash, the only loss you are risking is that your most
recent data may be missing (either for one file, or using multiple files
if that's what it takes).
Since this topic came up again, I figured I'd poke a bit and try to either
get educated on how to do this "right" or try and see if there's something
that could be added to the kernel to make it possible for userspace
programs to do this.
What I think userspace really needs is something like a barrier function
call: "for this fd, don't re-order writes as they go down through the
stack".
If the hardware is going to reorder things once they hit the hardware, this
is going to hurt performance (how much depends on a lot of stuff),
but the filesystems are able to make their journals work, so there should
be some way to let userspace do some sort of similar ordering.
David Lang
On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> On Wed, Oct 24, 2012 at 03:03:00PM -0700, [email protected] wrote:
>> Like what is being described for sqlite, losing the tail end of the
>> messages is not a big problem under normal conditions. But there is
>> a need to be sure that what is there is complete up to the point
>> where it's lost.
>>
>> this is similar in concept to write-ahead-logs done for databases
>> (without the absolute durability requirement)
>
> If that's what you require, and you are using ext3/4, using data
> journalling might meet your requirements. It's something you can
> enable on a per-file basis, via chattr +j; you don't have to force all
> file systems to use data journaling via the data=journal mount
> option.
>
> The potential downsides that you may or may not care about for this
> particular application:
>
> (a) This will definitely have a performance impact, especially if you
> are doing lots of small (less than 4k) writes, since the data blocks
> will get run through the journal, and will only later get written to their
> final location on disk.
>
> (b) You don't get atomicity if the write spans a 4k block boundary.
> All of the bytes before i_size will be written, so you don't have to
> worry about "holes"; but the last message written to the log file
> might be truncated.
>
> (c) There will be a performance impact, since the contents of data
> blocks will be written at least twice (once to the journal, and once
> to the final location on disk). If you do lots of small, sub-4k
> writes, the performance might be even worse, since data blocks might
> be written multiple times to the journal.
I'll have to dig into this option. In the case of rsyslog it sounds
like it could work (not as good as a filesystem-independent way of doing
things, but better than full fsyncs).
Truncated messages are not great, but they are a detectable and
acceptable risk.
While the average message size is much smaller than 4K (on my network it's
~250 bytes), the metadata that's broken out expands this somewhat, and we
can afford to waste disk space if it makes things safer or more efficient.
If we do update in place with flags for each message, each message will
need to be written up to three times (on receipt, being processed, finished
processing). With high message burst rates, I'm worried that we would fill
up the journal; is there a good way to deal with this?
I believe that ext4 can put the journal on a different device from the
filesystem; would this help a lot?
If you were to put the journal for an ext4 filesystem on a RAM disk, you
would lose the data recovery protection of the journal, but could you use
this trick to get ordered data writes onto the filesystem?
David Lang
> > Hopefully the storage developers will eventually realize the value
> > behind ordered commands and learn the corresponding SCSI facilities to
> > deal with them.
>
> Eventually, drive manufacturers will realize that trying to price
> gouge people who want advanced features such as TCQ or DIF/DIX is the
> best way to guarantee that most people won't bother to purchase them,
> and hence the features will remain largely unused....
I doubt they care. The profit on high end features from the people who
really need them I would bet far exceeds any other benefit of giving it to
others. Welcome to capitalism 8)
Plus, spinning rust for those end users is on the way out, SATA-to-flash
is a bit of a hack and people are already putting a lot of focus onto
things like NVM Express.
Alan
On Thu, Oct 25, 2012 at 02:03:25PM +0100, Alan Cox wrote:
>
> I doubt they care. The profit on high end features from the people who
> really need them I would bet far exceeds any other benefit of giving it to
> others. Welcome to capitalism 8)
Yes, but it's a question of pricing. If they had priced it at just a
wee bit higher, then there would have been incentive to add support
for TCQ so it could actually be used in various Linux file systems,
since there would have been lots of users of it. But as it is, the
folks who are purchasing huge, vast numbers of these drives --- such as
at the large cloud providers: Amazon, Facebook, Rackspace, et al. ---
will choose to purchase large numbers of commodity drives, and then
find ways to work around the missing functionality in userspace. For
example, DIF/DIX would be nice, and if it were available for cheap, I
could imagine it being used. But you can accomplish the same thing in
userspace, and in fact at Google I've implemented a special
not-for-mainline patch which spikes out stable writes (required for
DIF/DIX) because it has significant performance overhead, and DIF/DIX
has zero benefit if you're not willing to shell out $$$ for hardware
that supports it.
Maybe the HDD manufacturers have been able to price gouge a small
number of enterprise I/T shops with more dollars than sense, but
personally, I'm not convinced they picked an optimal pricing
strategy....
Put another way, I accept that Toyota should price a Lexus ES more
than a Camry, but if it's priced at, say, 3x the price of a Camry
instead of 20% more, they might find that precious few people are willing
to pay that kind of money for what is essentially the same car with
minor luxury tweaks added to it.
> Plus, spinning rust for those end users is on the way out, SATA-to-flash
> is a bit of a hack and people are already putting a lot of focus onto
> things like NVM Express.
Yeah.... I don't buy that. One, flash is still too expensive. Two,
the capital costs to build enough silicon foundries to replace the
current production volume of HDD's are way too expensive for any
company to afford (the cloud providers are buying *huge* numbers of
HDD's) --- and that's assuming companies wouldn't choose to use those
foundries for products with larger margins --- such as, for example,
CPU/GPU chips. :-) And third and finally, if you study the long-term
trends in terms of Data Retention Time (going down), Program and Read
Disturb (going up), and Write Endurance (going down) as a function of
feature size and/or time, you'd be wise to treat flash as nothing more
than a short-term cache, and not as a long-term stable store.
If end users completely give up on HDD's, and store all of their
precious family pictures on flash storage, after a couple of years,
they are likely going to be very disappointed....
Speaking personally, I wouldn't want to have anything on flash for
more than a few months at *most* before I made sure I had another copy
saved on spinning rust platters for long-term retention.
- Ted
On Wed, Oct 24, 2012 at 11:58:49PM -0700, [email protected] wrote:
> The frustrating thing is that when people point out how things like
> sqlite are so horribly slow, the reply seems to be "well, that's
> what you get for doing so many fsyncs, don't do that", when there is
> a 'problem' like the KDE "config loss" problem a few years ago, the
> response is "well, that's what you get for not doing fsync"
Sure... but the answer is to only do the fsync's when you need to.
For example, if GNOME and KDE are rewriting the entire registry file
each time the application changes a single registry key, then sure: if
you rewrite the entire registry file, and then fsync after each
rewrite before you replace the file, you will be safe. And if the
application needs to update dozens or hundreds of registry keys (or
does so every time the window gets moved or resized), then yes, it will be
slow. But the application didn't have to do that! It could have
updated all the registry keys in memory, and then updated the registry
file periodically instead.
Similarly, Firefox didn't need to do a sqlite commit after every
single time its history file was written, causing a third of a
megabyte of write traffic each time you clicked on a web page. It
could have batched its updates to the history file, since most of the
time, you don't care about making sure the web history is written to
stable store before you're allowed to click on a web page and visit
the next web page.
Or does rsyslog *really* need to issue an fsync after each log
message? Or could it batch updates so that every N seconds, it
flushes writes to the disk?
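To be concrete, batching at the application layer can be as simple as the
following sketch (invented names, not rsyslog's actual code):

#include <time.h>
#include <unistd.h>

static time_t last_flush;
static const int flush_interval = 5;   /* seconds; could be made configurable */

/* Append a pre-formatted log record, but only force the file to disk when
 * the current batch is old enough.  Error handling omitted for brevity. */
void log_append(int fd, const char *buf, size_t len)
{
    write(fd, buf, len);               /* cheap: lands in the page cache */

    time_t now = time(NULL);
    if (now - last_flush >= flush_interval) {
        fsync(fd);                     /* one flush covers the whole batch */
        last_flush = now;
    }
}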
(And this is a problem with most Android applications as well.
Apparently the framework API's are such that it's easier for an
application to treat each sqlite statement as an atomic update, so
many/most application writers don't use explicit transaction
boundaries, so updates don't get batched even though it would be more
efficient if they did so.)
Sometimes, the answer is not to try to create exotic database-like
functionality in the file system --- the answer is to be more
intelligent at the application layer. Not only will the application
be more portable, it will also in the end be more efficient, since
even with the most exotic database technologies, the most efficient
transactional commit is the unneeded commit that you optimize away at
the application layer.
- Ted
On Thu, 25 Oct 2012, Theodore Ts'o wrote:
> Or does rsyslog *really* need to issue an fsync after each log
> message? Or could it batch updates so that every N seconds, it
> flushes writes to the disk?
In part this depends on how paranoid the admin is. By default rsyslog
doesn't do fsyncs, but admins can configure it to do so and can configure
the batch size.
However, what I'm talking about here is not normal message traffic, it's
the case where the admin has decided that they don't want to use the
normal in-memory queues, they want to have the queues be on disk so that if
the system crashes the queued data will still be there to be processed
after the crash. (In addition, this can get used to cover cases where you
want queue sizes larger than your available RAM.)
In this case, the extreme, and only at the explicit direction of the
admin, is to fsync after every message.
The norm is that it's acceptable to lose the last few messages, but
losing a chunk out of the middle of the queue file can cause a whole lot
more to be lost, passing the threshold of what's acceptable.
> Sometimes, the answer is not to try to create exotic database-like
> functionality in the file system --- the answer is to be more
> intelligent at the application layer. Not only will the application
> be more portable, it will also in the end be more efficient, since
> even with the most exotic database technologies, the most efficient
> transactional commit is the unneeded commit that you optimize away at
> the application layer.
I agree, this is why I'm trying to figure out the recommended way to do
this without needing to do full commits.
Since in most cases it's acceptable to lose the last few chunks written,
if we had some way of specifying ordering, without having to specify
"write this NOW", the solution would be pretty obvious.
David Lang
On Thu, Oct 25, 2012 at 11:03:13AM -0700, [email protected] wrote:
> I agree, this is why I'm trying to figure out the recommended way to
> do this without needing to do full commits.
>
> Since in most cases it's acceptable to lose the last few chunks
> written, if we had some way of specifying ordering, without having
> to specify "write this NOW", the solution would be pretty obvious.
Well, using data journalling with ext3/4 may do what you want. If you
don't do any fsync, the changes will get written every 5 seconds when
the automatic journal sync happens (and sub-4k writes will also get
coalesced to a 5-second granularity). Even with plain text files,
it's pretty easy to tell whether or not the final record was
partially written after a crash; just look for a trailing
newline.
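For newline-terminated records the recovery-time check can be as simple as this
sketch:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if the last record in the file ends with a newline (i.e. it was
 * written completely), 0 if the tail is truncated, -1 on error. */
int last_record_complete(const char *path)
{
    struct stat st;
    char last = '\n';                  /* an empty file counts as complete */
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0 && st.st_size > 0)
        pread(fd, &last, 1, st.st_size - 1);
    close(fd);
    return last == '\n';
}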
Better yet, if you are writing to multiple log files with data
journalling, all of the writes will happen at the same time, and they
will be streamed to the file system journal, minimizing random writes
for at least the journal writes.
- Ted
Nico Williams, on 10/24/2012 05:17 PM wrote:
>> Yes, SCSI has full support for ordered/simple commands designed exactly for
>> that task: [...]
>>
>> [...]
>>
>> But historically, for some reason, Linux storage developers were stuck with
>> the "barriers" concept, which is obviously not the same as ORDERED commands,
>> and hence had a lot of trouble with its ambiguous semantics. As far as I can
>> tell the reason for that was a lack of sufficiently deep SCSI understanding
>> (how to handle errors, the belief that ACA is something legacy from parallel
>> SCSI times, etc.).
>
> Barriers are a very simple abstraction, so there's that.
It isn't simple at all. If you think for some time about barriers from the storage
point of view, you will soon realize how bad and ambiguous they are.
>> Before that happens, people will keep returning again and again with the same
>> simple questions: why must the queue be flushed for any ordered operation?
>> Isn't it obvious overkill?
>
> That [cache flushing]
It isn't cache flushing, it's _queue_ flushing. You can call it queue draining, if
you like.
Often there's a big difference where it's done: on the system side, or on the
storage side.
Actually, the performance improvements from NCQ in many cases come not because it
allows the drive to reorder requests, as is commonly thought, but because it
allows the drive's internal processing stages to stay busy without any
idle time. Drives often have a long internal pipeline, hence the need to keep
every stage of it always busy, and hence why using ORDERED commands is important
for performance.
> is not what's being asked for here. Just a
> light-weight barrier. My proposal works without having to add new
> system calls: a) use a COW format, b) have background threads doing
> fsync()s, c) in each transaction's root block note the last
> known-committed (from a completed fsync()) transaction's root block,
> d) have an array of well-known uberblocks large enough to accommodate
> as many transactions as possible without having to wait for any one
> fsync() to complete, e) do not reclaim space from any one past
> transaction until at least one subsequent transaction is fully
> committed. This obtains ACI- transaction semantics (survives power
> failures but without durability for the last N transactions at
> power-failure time) without requiring changes to the OS at all, and
> with support for delayed D (durability) notification.
I believe what you really want is to be able to send to the storage a sequence of
your favorite operations (FS operations, async IO operations, etc.) like:
Write-back caching disabled:
data op11, ..., data op1N, ORDERED data op1, data op21, ..., data op2M, ...
Write-back caching enabled:
data op11, ..., data op1N, ORDERED sync cache, ORDERED FUA data op1, data op21,
..., data op2M, ...
Right?
(ORDERED means that it is guaranteed that this ordered command will never, under
any circumstances, be executed before any previous command has completed, nor
after any subsequent command has completed.)
Vlad
Theodore Ts'o, on 10/25/2012 01:14 AM wrote:
> On Tue, Oct 23, 2012 at 03:53:11PM -0400, Vladislav Bolkhovitin wrote:
>> Yes, SCSI has full support for ordered/simple commands designed
>> exactly for that task: to keep a steady flow of commands even
>> when some of them are ordered.....
>
> SCSI does, yes --- *if* the device actually implements Tagged Command
> Queuing (TCQ). Not all devices do.
>
> More importantly, SATA drives do *not* have this capability, and when
> you compare the price of SATA drives to uber-expensive "enterprise
> drives", it's not surprising that most people don't actually use
> SCSI/SAS drives that have implemented TCQ.
What differs in our positions is that you are considering storage as something
you can connect to your desktop, while in my view storage is something which
stores data and serves it the best possible way with the best performance.
Hence, for you the least common denominator of all storage features is the most
important, while for me getting the best out of what is possible from storage is
the most important.
In my view storage should offload from the host system as much as possible: data
movements, ordered-operation requirements, atomic operations, deduplication,
snapshots, reliability measures (e.g. RAID), load balancing, etc.
It's the same as with 2D/3D video acceleration hardware. If you want the best
performance from your system, you should offload from it as much as possible: in
the case of video, to the video hardware; in the case of storage, to the storage.
The same as with video, for storage better offload means better performance. At
hundreds of thousands of IOPS it's clearly visible.
Price doesn't matter here, because it's a completely different topic.
> SATA's Native Command
> Queuing (NCQ) is not equivalent; this allows the drive to reorder
> requests (in particular read requests) so they can be serviced more
> efficiently, but it does *not* allow the OS to specify a partial,
> relative ordering of requests.
And so? If SATA can't do it, does that mean that nobody else can do it either? I
know plenty of non-SATA devices which can meet the ordering requirements you need.
Vlad
Theodore Ts'o, on 10/25/2012 09:50 AM wrote:
> Yeah.... I don't buy that. One, flash is still too expensive. Two,
> the capital costs to build enough silicon foundries to replace the
> current production volume of HDD's are way too expensive for any
> company to afford (the cloud providers are buying *huge* numbers of
> HDD's) --- and that's assuming companies wouldn't choose to use those
> foundries for products with larger margins --- such as, for example,
> CPU/GPU chips. :-) And third and finally, if you study the long-term
> trends in terms of Data Retention Time (going down), Program and Read
> Disturb (going up), and Write Endurance (going down) as a function of
> feature size and/or time, you'd be wise to treat flash as nothing more
> than a short-term cache, and not as a long-term stable store.
>
> If end users completely give up on HDD's, and store all of their
> precious family pictures on flash storage, after a couple of years,
> they are likely going to be very disappointed....
>
> Speaking personally, I wouldn't want to have anything on flash for
> more than a few months at *most* before I made sure I had another copy
> saved on spinning rust platters for long-term retention.
Here I agree with you.
Vlad
On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
> What differs in our positions is that you are considering storage
> as something you can connect to your desktop, while in my view
> storage is something which stores data and serves it the best
> possible way with the best performance.
I don't get paid to make Linux storage work well for gold-plated
storage, and as far as I know, none of the purveyors of said gold
plated software systems are currently employing Linux file system
developers to make Linux file systems work well on said gold-plated
hardware.
As for what I might do on my own time, for fun, I can't afford said
gold-plated hardware, and personally I get a lot more satisfaction if
I know there will be a large number of people who benefit from my work
(it was really cool when I found out that millions and millions of
Android devices were going to be using ext4 :-), as opposed to a very
small number of people who have paid $$$ to storage vendors who don't
feel it's worthwhile to pay core Linux file system developers to
leverage their hardware. Earlier, you were bemoaning why Linux file
system developers weren't paying attention to using said fancy SCSI
features. Perhaps now you'll understand better why it's not happening?
> Price doesn't matter here, because it's a completely different topic.
It matters if you think I'm going to do it on my own time, out of my
own budget. And if you think my employer is going to choose to use
said hardware, price definitely matters. I consider engineering to be
the art of making tradeoffs, and price is absolutely one of the things
that we need to trade off against other goals.
It's rare that you get to design something where performance matters
above all else. Maybe it's that way if you're paid by folks whose job
it is to destabilize the world's financial markets by pushing the poles
into the right half plane (i.e., high frequency trading :-). But for
the rest of the world, price absolutely matters.
- Ted
P.S. All of the storage I have access to at home is SATA. If someone
would like to change that and ship me free hardware, as long as it
doesn't require three-phase power (or require some exotic interconnect
which is ghastly expensive and which you are also not going to provide
me for free), do contact me off-line. :-)
Theodore Ts'o, on 10/27/2012 12:44 AM wrote:
> On Fri, Oct 26, 2012 at 09:54:53PM -0400, Vladislav Bolkhovitin wrote:
>> What differs in our positions is that you are considering storage
>> as something you can connect to your desktop, while in my view
>> storage is something which stores data and serves it the best
>> possible way with the best performance.
>
> I don't get paid to make Linux storage work well for gold-plated
> storage, and as far as I know, none of the purveyors of said gold
> plated software systems are currently employing Linux file system
> developers to make Linux file systems work well on said gold-plated
> hardware.
I don't want to flame on this topic, but you are not right here. As far as I can
see, a big chunk of Linux storage and file system developers are/were employed by
the "gold-plated storage" manufacturers, starting with FusionIO, SGI and Oracle.
You know, RedHat has recently also stepped into this market; at least I saw
their advertisement at SDC 2012. So you can add all the RedHat employees here.
> As for what I might do on my own time, for fun, I can't afford said
> gold-plated hardware, and personally I get a lot more satisfaction if
> I know there will be a large number of people who benefit from my work
> (it was really cool when I found out that millions and millions of
> Android devices were going to be using ext4 :-), as opposed to a very
> small number of people who have paid $$$ to storage vendors who don't
> feel it's worthwhile to pay core Linux file system developers to
> leverage their hardware. Earlier, you were bemoaning why Linux file
> system developers weren't paying attention to using said fancy SCSI
> features. Perhaps now you'll understand better why it's not happening?
>
>> Price doesn't matter here, because it's a completely different topic.
>
> It matters if you think I'm going to do it on my own time, out of my
> own budget. And if you think my employer is going to choose to use
> said hardware, price definitely matters. I consider engineering to be
> the art of making tradeoffs, and price is absolutely one of the things
> that we need to trade off against other goals.
>
> It's rare that you get to design something where performance matters
> above all else. Maybe it's that way if you're paid by folks whose job
> it is to destabilize the world's financial markets by pushing the poles
> into the right half plane (i.e., high frequency trading :-). But for
> the rest of the world, price absolutely matters.
I fully understand your position. But "affordable" and "useful" are completely
orthogonal things. The "high end" features are very useful if you want to get
high performance. Those who can afford them will use them, which might be
your favorite bank, for instance; hence they will be indirectly working for you.
Of course, you don't have to work on those features, especially for free, but you
similarly don't then have to call them useless only because they are not
affordable enough to be put in a desktop [1].
Our discussion started not from "value for money", but from a constant demand to
perform ordered commands without full queue draining, which has been ignored by the
Linux storage developers for YEARS as not useful, right?
Vlad
[1] If you or somebody else wants to put something supporting all the necessary
features to perform ORDERED commands, including ACA, into a desktop, you can look
at modern SAS SSDs. I can't call the price of those devices "high-end".
[Dropping sqlite-users. Note that I'm not subscribed to any of the
other lists cc'ed.]
On Thu, Oct 25, 2012 at 1:02 AM, Theodore Ts'o <[email protected]> wrote:
> On Thu, Oct 25, 2012 at 12:18:47AM -0500, Nico Williams wrote:
>>
>> By trusting fsync(). And if you don't care about immediate Durability
>> you can run the fsync() in a background thread and mark the associated
>> transaction as completed in the next transaction to be written after
>> the fsync() completes.
You are all missing some context which I would have added had I
noticed the cc'ing of additional lists.
D.R. Hipp asked for a light-weight barrier API from the OS/filesystem,
the SQLite use-case being to implement fast ACI_ semantics, without
durability (i.e., that it be OK to lose the last few transactions, but
not to end up with a corrupt DB, and maintaining atomicity,
consistency, and isolation).
I noted that with a journalled/COW DB file format[0] one could run an
fsync() in a "background" thread to act as a barrier, and then note in
each transaction the last preceding transaction known to have reached
disk (because fsync() returned and the bg thread marked the
transaction in question as durable). Then refrain from garbage
collecting any transactions not marked as durable. Now, there are
some caveats, the main one being that this fails if the filesystem or
hardware lie about fsync() / cache flushes. Other caveats include
that fsync() used this way can have more impact on filesystem
performance than a true light-weight barrier[1], that the filesystem
itself might not be powerfail-safe, and maybe a few others. But the
point is that fsync() can be used in such a way that one need not wait
for a transaction to reach rotating rust stably and still retain
powerfail safety without durability for the last few transactions.
[0] Like the BSD4.4 log structured filesystem, ZFS, Howard Chu's MDB,
and many others. Note that ZFS has a pool-import time option to
recover from power failures by ignoring any not completely verifiable
transactions and rolling back to the last verifiable one.
[1] Think of what ZFS does when there's no ZIL and an fsync() comes
along: ZFS will either block the fsync() thread until the current
transaction closes or else close the current transaction and possibly
write a much smaller transaction, thus losing out on making writes as
large and contiguous as possible.
> The challenge is when you have entangled metadata updates. That is,
> you update file A, and file B, and file A and B might share metadata.
> In order to sync file A, you also have to update part of the metadata
> for the updates to file B, which means calculating the dependencies of
> what you have to drag in can get very complicated. You can keep track
> of what bits of the metadata you have to undo and then redo before
> writing out the metadata for fsync(A), but that basically means you
> have to implement soft updates, and all of the complexity this
> implies: http://lwn.net/Articles/339337/
I believe that my suggestion composes for multi-file DB file formats,
as long as the sum total forms a COWish on-disk format. Of course,
adding more fsync()s, even if run in bg threads, may impact system
performance even more (see above). Also, if one has a COWish DB then
why use more than one file? If the answer were "to spread contents
across devices" one might ask "why not trust the filesystem/volume
manager to do that?", but hey.
I'm not actually proposing that people try to compose this ACI_
technique though...
Nico
--
> I don't want to flame on this topic, but you are not right here. As far as I can
> see, a big chunk of Linux storage and file system developers are/were employed by
> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
>
> You know, RedHat from recent times also stepped to this market, at least I saw
> their advertisement on SDC 2012. So, you can add here all RedHat employees.
Booleans generally should be reserved for logic operators. Most of the
Linux companies work on both low and high end storage. The two are not
mutually exclusive nor do they divide neatly by market. Many big clouds
use cheap low end drives by the crate, some high end desktops are using
SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
not sure personally there is much point
(and I used to have fibrechannel on my Thinkpad 600 when docked 8))
> Our discussion started not from "value-for-money", but from a constant demand to
> perform ordered commands without full queue draining, which has been ignored by the
> Linux storage developers for YEARS as not useful, right?
Send patches with benchmarks demonstrating it is useful. It's really
quite simple. Code talks.
Alan
Alan Cox, on 10/31/2012 05:54 AM wrote:
>> I don't want to flame on this topic, but you are not right here. As far as I can
>> see, a big chunk of Linux storage and file system developers are/were employed by
>> the "gold-plated storage" manufacturers, starting from FusionIO, SGI and Oracle.
>>
>> You know, RedHat from recent times also stepped to this market, at least I saw
>> their advertisement on SDC 2012. So, you can add here all RedHat employees.
>
> Booleans generally should be reserved for logic operators. Most of the
> Linux companies work on both low and high end storage. The two are not
> mutually exclusive nor do they divide neatly by market. Many big clouds
> use cheap low end drives by the crate, some high end desktops are using
> SAS although given you can get six 2.5" hotplug drives in a 5.25" bay I'm
> not sure personally there is much point
That doesn't contradict the point that high-performance storage vendors are also
funding Linux kernel storage development.
> Send patches with benchmarks demonstrating it is useful. It's really
> quite simple. Code talks.
How about the fact that the preliminary infrastructure to send ORDERED commands
instead of draining the queue was recently deleted from the kernel, because "there's
no difference where to drain the queue, on the kernel or the storage side"?
Vlad
> How about the fact that the preliminary infrastructure to send ORDERED commands
> instead of draining the queue was recently deleted from the kernel, because "there's
> no difference where to drain the queue, on the kernel or the storage side"?
Send patches.
Alan
Alan Cox, on 11/01/2012 05:24 PM wrote:
>> How about the fact that the preliminary infrastructure to send ORDERED commands
>> instead of draining the queue was recently deleted from the kernel, because "there's
>> no difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.
OK, then we have good progress!
Vlad
Alan Cox wrote:
>> How about the fact that the preliminary infrastructure to send ORDERED commands
>> instead of draining the queue was recently deleted from the kernel, because "there's
>> no difference where to drain the queue, on the kernel or the storage side"?
>
> Send patches.
Isn't any type of kernel-side ordering an exercise in futility, since
a) the kernel has no knowledge of the disk's actual geometry
b) most drives will internally re-order requests anyway
c) cheap drives won't support barriers
Even assuming the drives honored all your requests without lying, how would
you really want this behavior exposed? From the userland perspective, there
are very few apps that care. Probably only transactional databases, really.
As a DB author, I'm not sure I'd be keen on this as an open() or fcntl()
option. Databases that really care would be on dedicated filesystems and/or
devices, so per-file control would be tedious. You would most likely want to
say "all writes to this string of devices should be order-preserving" and
forget about it. With that guarantee, a careful writer can have perfectly
intact data structures all the time, without ever slowing down for a fsync.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
They will but only as permitted by the commands queued, so you have some
control depending upon the interface capabilities.
> c) cheap drives won't support barriers
Barriers are pretty much universal as you need them for power off !
> Even assuming the drives honored all your requests without lying, how would
> you really want this behavior exposed? From the userland perspective, there
> are very few apps that care. Probably only transactional databases, really.
And file systems internally sometimes. A file system is after all a
transactional database of sorts.
Alan
On Thu 2012-10-25 14:29:48, Theodore Ts'o wrote:
> On Thu, Oct 25, 2012 at 11:03:13AM -0700, [email protected] wrote:
> > I agree, this is why I'm trying to figure out the recommended way to
> > do this without needing to do full commits.
> >
> > Since in most cases it's acceptable to lose the last few chunks
> > written, if we had some way of specifying ordering, without having
> > to specify "write this NOW", the solution would be pretty obvious.
>
> Well, using data journalling with ext3/4 may do what you want. If you
> don't do any fsync, the changes will get written every 5 seconds when
> the automatic journal sync happens (and sub-4k writes will also get
Hmm. But that would need setting journalling mode per-file, no?
Like, make it journal data for all the databases, but keep normal mode
for rest of system...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Mon, Nov 05, 2012 at 09:03:48PM +0100, Pavel Machek wrote:
> > Well, using data journalling with ext3/4 may do what you want. If you
> > don't do any fsync, the changes will get written every 5 seconds when
> > the automatic journal sync happens (and sub-4k writes will also get
>
> Hmm. But that would need setting journalling mode per-file, no?
>
> Like, make it journal data for all the databases, but keep normal mode
> for rest of system...
You can do that, using "chattr +j file.db". It's apparently not a
well known feature of ext3/4....
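For completeness, the same per-file flag can be toggled from C through the
inode-flag ioctls that chattr itself uses; a rough sketch, assuming the kernel
headers expose FS_IOC_GETFLAGS/FS_IOC_SETFLAGS and FS_JOURNAL_DATA_FL and that the
filesystem honours the flag:

    /* Rough equivalent of "chattr +j file.db" on ext3/ext4. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>          /* FS_IOC_GETFLAGS, FS_JOURNAL_DATA_FL */

    int enable_data_journalling(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return -1; }

        int flags = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
            perror("FS_IOC_GETFLAGS"); close(fd); return -1;
        }
        flags |= FS_JOURNAL_DATA_FL;   /* journal file data, not just metadata */
        if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
            perror("FS_IOC_SETFLAGS"); close(fd); return -1;
        }
        close(fd);
        return 0;
    }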
- Ted
Howard Chu, on 11/01/2012 08:38 PM wrote:
> Alan Cox wrote:
>>> How about the fact that the preliminary infrastructure to send ORDERED commands
>>> instead of draining the queue was recently deleted from the kernel, because "there's
>>> no difference where to drain the queue, on the kernel or the storage side"?
>>
>> Send patches.
>
> Isn't any type of kernel-side ordering an exercise in futility, since
> a) the kernel has no knowledge of the disk's actual geometry
> b) most drives will internally re-order requests anyway
> c) cheap drives won't support barriers
This is why it is so important for performance to use all of the storage's
capabilities, particularly ORDERED commands, instead of trying to pretend to be
smarter than the storage by draining the queue.
Vlad
Alan Cox, on 11/02/2012 08:33 AM wrote:
>> b) most drives will internally re-order requests anyway
>
> They will but only as permitted by the commands queued, so you have some
> control depending upon the interface capabilities.
>
>> c) cheap drives won't support barriers
>
> Barriers are pretty much universal as you need them for power off !
I'm afraid no storage (drives, if you like this term more) currently supports
barriers, and as far as I know the history of storage, none ever has.
Instead, what storage does support in this area are:
1. Cache flushing facilities: FUA, SYNCHRONIZE CACHE, etc.
2. Commands ordering facilities: commands attributes (ORDERED, SIMPLE, etc.), ACA,
etc.
3. Atomic commands, e.g. scattered writes, which allow writing data to several
separate, non-adjacent blocks in an atomic manner, i.e. they guarantee that either
all blocks are written or none at all. This is relatively new functionality, natural
for flash storage with its COW internals.
Obviously, with such atomic write commands, an application or a file system doesn't
need any journaling anymore. FusionIO reported that after they modified MySQL to
use them, they saw a 50% performance increase.
Note that those 3 facilities are ORTHOGONAL, i.e. they can be used independently,
even on the same request. That is the root cause of why the barrier concept is so
evil: if you specify a barrier, how can you say what actual action you really
want from the storage: a cache flush? Ordered writes? Or both?
This is why the relatively recent removal of barriers from the Linux kernel
(http://lwn.net/Articles/400541/) was a big step forward. The next logical step
should be to allow the ORDERED attribute on requests to be accelerated by the
storage's ORDERED commands, if it supports them; if not, fall back to the existing
queue draining.
Actually, I'm wondering why the barrier concept is so sticky in the Linux world. A
simple Google search shows that only Linux uses this concept for storage, and two
years have passed since barriers were removed from the kernel, yet people still
discuss barriers as if they were here.
Vlad
> > Barriers are pretty much universal as you need them for power off !
>
> I'm afraid no storage (drives, if you like this term more) currently supports
> barriers, and as far as I know the history of storage, none ever has.
The ATA cache flush is a write barrier, and given you have no NV cache
visible to the controller it's the same thing.
> Instead, what storage does support in this area are:
Yes - the devil is in the detail once you go beyond simple capabilities.
Alan
On Tue, Nov 13, 2012 at 11:40 AM, Alan Cox <[email protected]> wrote:
>> > Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid no storage (drives, if you like this term more) currently supports
>> barriers, and as far as I know the history of storage, none ever has.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.
>
>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.
Right: barriers are trivial to program with. Ordered writes less so.
One could declare all writes to be ordered with respect to each other,
but this will almost certainly hurt performance (at least with disks,
though probably not SSDs) as opposed to barriers, which order one
group of internally-unordered writes relative to another. And
declaring groups of internally-unordered writes where the groups are
ordered with respect to each other... is practically the same as
barriers.
There's a lot to be said for simplicity... as long as the system is
not so simple as to not work at all.
My p.o.v. is that a filesystem write barrier is effectively the same
as fsync() with the ability to return sooner (before writes hit stable
storage) when the filesystem and hardware support on-disk layouts and
primitives which can be used to order writes preceding and succeeding
the barrier.
Nico
--
Alan Cox, on 11/13/2012 12:40 PM wrote:
>>> Barriers are pretty much universal as you need them for power off !
>>
>> I'm afraid no storage (drives, if you like this term more) currently supports
>> barriers, and as far as I know the history of storage, none ever has.
>
> The ATA cache flush is a write barrier, and given you have no NV cache
> visible to the controller it's the same thing.
A cache flush is a cache flush. You can call it a barrier if you want to continue
confusing yourself and others.
>> Instead, what storage does support in this area are:
>
> Yes - the devil is in the detail once you go beyond simple capabilities.
None of those details raises anything unsolvable. For instance, I already
described in this thread a simple way the requested order of commands can be
carried through the stack, and I implemented that algorithm in SCST.
Vlad
Nico Williams, on 11/13/2012 02:13 PM wrote:
> declaring groups of internally-unordered writes where the groups are
> ordered with respect to each other... is practically the same as
> barriers.
Which barriers? Barriers meaning cache flush, barriers meaning command ordering,
or barriers meaning both?
There's no such thing as a "barrier". It is a fully artificial abstraction. After
all, at the bottom of your stack, you will have to translate it either to a cache
flush, or to command order enforcement, or to both.
Are you going to invent 3 types of barriers?
> There's a lot to be said for simplicity... as long as the system is
> not so simple as to not work at all.
>
> My p.o.v. is that a filesystem write barrier is effectively the same
> as fsync() with the ability to return sooner (before writes hit stable
> storage) when the filesystem and hardware support on-disk layouts and
> primitives which can be used to order writes preceding and succeeding
> the barrier.
Your mistake is that you are considering barriers as something real, which can do
something real for you, while it is just an artificial abstraction apparently
invented by people with limited knowledge of how storage works, hence with a very
foggy vision of how barriers are supposed to be processed by it. A simple wrong
answer.
Generally, you can invent any abstraction convenient for you, but the farther your
abstractions are from the reality of your hardware, the less you will get from it,
and with bigger effort.
There are no barriers in Linux, and there are not going to be. Accept it. And start
instead thinking about the offload capabilities your storage can offer you.
Vlad
On Wed, 14 Nov 2012, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning commands
> order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial abstraction. After
> all, at the bottom of your stack, you will have to translate it either to
> cache flush, or commands order enforcement, or both.
When people talk about barriers, they are talking about order enforcement.
> Your mistake is that you are considering barriers as something real, which
> can do something real for you, while it is just a artificial abstraction
> apparently invented by people with limited knowledge how storage works, hence
> having very foggy vision how barriers supposed to be processed by it. A
> simple wrong answer.
>
> Generally, you can invent any abstraction convenient for you, but farther
> your abstractions from reality of your hardware => less you will get from it
> with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And start
> instead thinking about offload capabilities your storage can offer to you.
the hardware capabilities are not directly accessible from userspace (and they
probably shouldn't be)
barriers keep getting mentioned because they are an easy concept to understand:
"do this set of stuff before doing any of this other set of stuff, but I don't
care when any of this gets done", and they fit well with the requirements of the
users.
Users readily accept that if the system crashes, they will lose the most recent
stuff that they did, but they get annoyed when things get corrupted to the point
that they lose the entire file.
This includes things like modifying one option and a crash resulting in the
config file being blank. Yes, you can do the "write to temp file, sync file,
sync directory, rename file" dance, but the fact that to do so the user must sit
and wait for the syncs to take place can be a problem. It would be far better to
be able to say "write to temp file, and after it's on disk, rename the file" and
not have the user wait. The user doesn't really care if the changes hit disk
immediately, or several seconds (or even 10s of seconds) later, as long as there
is not any possibility of the rename hitting disk before the file contents.
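For reference, one common form of that dance looks roughly like the sketch below
(error handling trimmed); each fsync() is a point where the caller sits and waits:

    /* Classic atomic-replace: the new contents are forced to disk before the
     * rename, and the rename itself is forced out by syncing the directory,
     * so a crash leaves either the old file or the complete new one. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int replace_file(const char *dir, const char *name, const char *data)
    {
        char tmp[4096], dst[4096];
        snprintf(tmp, sizeof tmp, "%s/.%s.tmp", dir, name);
        snprintf(dst, sizeof dst, "%s/%s", dir, name);

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, strlen(data)) < 0 || fsync(fd) < 0) {   /* wait #1 */
            close(fd);
            return -1;
        }
        close(fd);

        if (rename(tmp, dst) < 0) return -1;

        int dfd = open(dir, O_RDONLY | O_DIRECTORY);
        if (dfd < 0) return -1;
        int rc = fsync(dfd);                                        /* wait #2 */
        close(dfd);
        return rc;
    }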
The fact that this could be implemented in multiple ways in the existing
hardware does not mean that there need to be multiple ways exposed to userspace,
it just means that the cost of doing the operation will vary depending on the
hardware that you have. This also means that if new hardware introduces a new
way of implementing this, that improvement can be passed on to the users without
needing application changes.
David Lang
On 14/11/2012 8:17 PM, Vladislav Bolkhovitin wrote:
> Nico Williams, on 11/13/2012 02:13 PM wrote:
>> declaring groups of internally-unordered writes where the groups are
>> ordered with respect to each other... is practically the same as
>> barriers.
>
> Which barriers? Barriers meaning cache flush or barriers meaning
> commands order, or barriers meaning both?
>
> There's no such thing as "barrier". It is fully artificial
> abstraction. After all, at the bottom of your stack, you will have to
> translate it either to cache flush, or commands order enforcement, or
> both.
Isn't that why we *have* "the stack" in the first place? So apps
*don't* have to worry about how the OS implements an artificial (=
high-level and portable) abstraction on a given device?
>
> Are you going to invent 3 types of barriers?
One will do, it just needs to be a good one.
Maybe I'm missing something here, so I'm going to back up a bit and
recap what I understand.
The filesystem abstracts the concept of encoding patterns of bits in
some physical media (data), and making it easy to find and retrieve
those bits later (metadata, incl. file name). When users read(), they
expect to see whatever they most recently sent to write(). They also
expect that what they write will still be there later, in spite of any
failure that leaves the disk itself intact.
Operating systems cheat by not actually writing to disk -- for
performance reasons -- and users are (mostly, usually) OK with that,
because the performance gains are so attractive and things usually work
out anyway. Disks cheat too, in the same way and for the same reason.
The cheating works great most of the time, but breaks down -- badly --
if we actually care about what is on disk after a crash (or if we use a
network filesystem). Enough people do care that fsync() was added to the
toolbox. It is defined to transfer "all modified in-core data of the
file referred to by the file descriptor fd to the disk device" and
"blocks until the device reports that the transfer has completed"
(quoting from the fsync(2) man page). Translation: "Stop cheating. Make
sure the stuff I already wrote actually got written. And tell the disk
to stop cheating, too."
Problem is, this definition is asymmetric: it says what happens to
writes issued before the fsync, but nothing about those issued after the
fsync starts and before it returns [1]. The reader has to assume
fsync() makes no promises whatsoever about these later writes: making
fsync capture them exposes callers of fsync() to DoS attacks, and blocking them
from reaching disk until all outstanding fsync calls complete would add
complexity the spec doesn't currently demand, leading to understandable
reluctance by kernel devs to code it up. Unfortunately, we're left with
the filesystem equivalent of what we in the database world call
"eventual consistency" -- easy to implement, nice and fast, but very
difficult to write reliable code against unless you're willing to pay
the cost of being fully synchronous, all the time. Having tried that for
a few years, many people are "returning" to better-specified concurrency
models, trading some amount of performance for comfort that the app will
at least work predictably when things go wrong in strange and
unanticipated ways.
The request, then, is to tighten up fsync semantics in two conceptually
straightforward ways [2]: First, guarantee that later writes to an fd do
not hit disk until earlier calls to fsync() complete. Second, make the
call asynchronous. That's all.
Note that both changes are necessary. The improved ordering semantic is
useless by itself, because it's still not safe to request a blocking
fsync from one thread and then let other threads continue issuing
writes: there's a race between broadcasting that fsync has begun and
issuing the actual syscall that begins it. An asynchronous fsync is also
useless by itself, because it only benefits uncoordinated writes (which
evidently don't care what data actually reaches disk anyway).
The easiest way to implement this fsync would involve three things:
1. Schedule writes for all dirty pages in the fs cache that belong to
the affected file, wait for the device to report success, issue a cache
flush to the device (or request ordering commands, if available) to make
it tell the truth, and wait for the device to report success. AFAIK this
already happens, but without taking advantage of any request ordering
commands.
2. The requesting thread returns as soon as the kernel has identified
all data that will be written back. This is new, but pretty similar to
what AIO already does.
3. No write is allowed to enqueue any requests at the device that
involve the same file, until all outstanding fsync complete [3]. This is
new.
The performance hit for #1 can be reduced significantly if the storage
hardware at hand happens to support some form of request ordering. The
amount of reduction could vary greatly depending on how sophisticated
such request ordering is, and how much effort the kernel and/or device
driver are willing to put into it. In any case, fsync should already do
this [4].
The performance hit for #3 can be minimized by buffering small or
otherwise convenient writes in the fs cache and letting the call return
immediately, as usual. The corresponding pages just have to be marked in
some way to prevent them from being written back too soon. Sequence
numbers work well for this sort of thing. Big requests may have to
block, but they probably would have anyway, if the buffer cache couldn't
absorb them. As with #1, fancy command ordering capabilities in the
underlying device just allow additional performance optimizations.
A carefully-written app (e.g. free of I/O races) would do pretty well
with this extended fsync, certainly far better than the current state of
the art allows.
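To make the intended usage concrete, here is how a log writer might use such a
primitive. The ofsync() call below is purely hypothetical; it stands in for the
asynchronous, ordering-preserving fsync described above:

    #include <stddef.h>
    #include <unistd.h>

    /* HYPOTHETICAL: no kernel provides this today.  Assumed semantics: returns
     * once the write-back set has been identified, and no later write to this
     * file may reach the device before that set is stable. */
    int ofsync(int fd);

    static int append_record(int log_fd, const void *rec, size_t len)
    {
        if (write(log_fd, rec, len) < 0)
            return -1;
        /* Everything written so far must hit disk before anything written
         * after this point; we don't care exactly when that happens. */
        return ofsync(log_fd);
    }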
Note that this still offers no protection for reads: no matter how many
times a thread issues fsync(), it still risks reading non-durable data
because reads are not ordered wrt either writes or fsync. That's not the
problem we're trying to solve, though.
Please feel free to point out where I've gone wrong, but this just
doesn't look like as complex or crazy an idea as you make it out to be.
[1] Maybe POSIX.1-2001 is more specific, but it's not publicly available
that I could see.
[2] I'm fully aware that implementing the request might require
significant -- perhaps even unreasonably complex -- changes to the way
the kernel currently does things (though I do doubt it). That's not a
good excuse to claim the idea itself is unreasonably complex or
ill-specified. Just say that it's not a good fit for the current code base.
[3] Another concern is whether fsync calls operate on the file or a
particular fd. What if a process opens the same file multiple times, or
multiple processes have fds pointing to the same file (whether by open
or fork)? I would argue for file-level barriers, because it leads to a
vastly simpler design (the fs cache doesn't track which process wrote
what via what fd). Besides, no app that cares about what ends up on disk
will allow uncoordinated writes anyway, so why do extra work just to
ensure I/O races stay fast?
[4] Really, device support for request ordering commands is a bit of a
red herring: the only way it helps significantly is if (a) the storage
device has a massive cache compared to the fs cache, (b) it allows I/O
scheduling to reduce latency of reads and/or writes (which fsync should
do already, and which matters little for flash), and (c) a logging
filesystem is not being used (else it's all sequential writes anyway).
In other words, it can help performance a bit but has little other
impact on what is essentially a software matter.
>
>> There's a lot to be said for simplicity... as long as the system is
>> not so simple as to not work at all.
>>
>> My p.o.v. is that a filesystem write barrier is effectively the same
>> as fsync() with the ability to return sooner (before writes hit stable
>> storage) when the filesystem and hardware support on-disk layouts and
>> primitives which can be used to order writes preceding and succeeding
>> the barrier.
>
> Your mistake is that you are considering barriers as something real,
> which can do something real for you, while it is just a artificial
> abstraction apparently invented by people with limited knowledge how
> storage works, hence having very foggy vision how barriers supposed to
> be processed by it. A simple wrong answer.
Storage: Accepts writes and ostensibly makes them available via reads
even after power failures. Reorders requests nearly arbitrarily and lies
about whether writes actually took effect, unless you issue appropriate
cache flushing and/or request ordering commands (and sometimes even
then, if it was a cheap consumer drive).
OS: Accepts writes and ostensibly makes them available via reads even
after power failures, reboots, etc. Reorders requests nearly arbitrarily
and lies about whether writes actually took effect, unless you issue a
stop-the-world, one-sided write barrier lovingly known as fsync
(assuming the actual disk listens when you tell it to stop cheating).
Wish: a two-sided write barrier that not only ensures previously-issued
writes complete before it reports success, but also prevents
later-issued writes from completing while it is in progress, giving a
reasonably simple way to enforce some ordering of writes in the system.
Can be implemented entirely in software, as the latter has full control
over which requests it chooses to schedule at the device, and also
decides whether to block the requesting thread or not. Can be made
virtually as fast as current writes, by maintaining a little extra
information in the fs cache.
Please, enlighten me: in what way does my limited knowledge of storage,
or my foggy vision of what is desired, make this feature impossible to
implement or useless if implemented?
>
> Generally, you can invent any abstraction convenient for you, but
> farther your abstractions from reality of your hardware => less you
> will get from it with bigger effort.
>
> There are no barriers in Linux and not going to be. Accept it. And
> start instead thinking about offload capabilities your storage can
> offer to you.
Apologies if this comes off as flame-bait, but I start to wonder whose
abstraction is broken here...
What I understand the above to mean is: "Linux file system abstractions
are too far from the reality of storage hardware, so it takes lots of
effort to accomplish little [in the way of enforcing write ordering].
Accept it. And start thinking instead about talking directly to a
storage controller that offers proper write barriers."
I hope I misread what you said, because that's a depressing thing to
hear from your OS.
Ryan
On 11/15/2012 11:06 AM, Ryan Johnson wrote:
> The easiest way to implement this fsync would involve three things:
> 1. Schedule writes for all dirty pages in the fs cache that belong to
> the affected file, wait for the device to report success, issue a cache
> flush to the device (or request ordering commands, if available) to make
> it tell the truth, and wait for the device to report success. AFAIK this
> already happens, but without taking advantage of any request ordering
> commands.
> 2. The requesting thread returns as soon as the kernel has identified
> all data that will be written back. This is new, but pretty similar to
> what AIO already does.
> 3. No write is allowed to enqueue any requests at the device that
> involve the same file, until all outstanding fsync complete [3]. This is
> new.
This sounds interesting as a way to expose some useful semantics to
userspace.
I assume we'd need to come up with a new syscall or something since it
doesn't match the behaviour of posix fsync().
Chris
David Lang wrote:
> barriers keep getting mentioned because they are a easy concept to understand.
> "do this set of stuff before doing any of this other set of stuff, but I don't
> care when any of this gets done" and they fit well with the requirements of the
> users.
>
> Users readily accept that if the system crashes, they will loose the most recent
> stuff that they did,
*some* users may accept that. *None* should.
> but they get annoyed when things get corrupted to the point
> that they loose the entire file.
>
> this includes things like modifying one option and a crash resulting in the
> config file being blank. Yes, you can do the 'write to temp file, sync file,
> sync directory, rename file" dance, but the fact that to do so the user must sit
> and wait for the syncs to take place can be a problem. It would be far better to
> be able to say "write to temp file, and after it's on disk, rename the file" and
> not have the user wait. The user doesn't really care if the changes hit disk
> immediately, or several seconds (or even 10s of seconds) later, as long as there
> is not any possibility of the rename hitting disk before the file contents.
>
> The fact that this could be implemented in multiple ways in the existing
> hardware does not mean that there need to be multiple ways exposed to userspace,
> it just means that the cost of doing the operation will vary depending on the
> hardware that you have. This also means that if new hardware introduces a new
> way of implementing this, that improvement can be passed on to the users without
> needing application changes.
There are a couple industry failures here:
1) the drive manufacturers sell drives that lie, and consumers accept it
because they don't know better. We programmers, who know better, have failed
to raise a stink and demand that this be fixed.
A) Drives should not lose data on power failure. If a drive accepts a write
request and says "OK, done" then that data should get written to stable
storage, period. Whether it requires capacitors or some other onboard power
supply, or whatever, they should just do it. Keep in mind that today, most of
the difference between enterprise drives and consumer desktop drives is just a
firmware change, that hardware is already identical. Nobody should accept a
product that doesn't offer this guarantee. It's inexcusable.
B) it should go without saying - drives should reliably report back to the
host, when something goes wrong. E.g., if a write request has been accepted,
cached, and reported complete, but then during the actual write an ECC failure
is detected in the cacheline, the drive needs to tell the host "oh by the way,
block XXX didn't actually make it to disk like I told you it did 10ms ago."
If the entire software industry were to simply state "your shit stinks and
we're not going to take it any more" the hard drive industry would have no
choice but to fix it. And in most cases it would be a zero-cost fix for them.
Once you have drives that are actually trustworthy, actually reliable (which
doesn't mean they never fail, it only means they tell the truth about
successes or failures), most of these other issues disappear. Most of the need
for barriers disappear.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 11/16/2012 10:06 AM, Howard Chu wrote:
> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to understand.
>> "do this set of stuff before doing any of this other set of stuff, but I don't
>> care when any of this gets done" and they fit well with the requirements of the
>> users.
>>
>> Users readily accept that if the system crashes, they will loose the most recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.
>
>> but they get annoyed when things get corrupted to the point
>> that they loose the entire file.
>>
>> this includes things like modifying one option and a crash resulting in the
>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>> sync directory, rename file" dance, but the fact that to do so the user must sit
>> and wait for the syncs to take place can be a problem. It would be far better to
>> be able to say "write to temp file, and after it's on disk, rename the file" and
>> not have the user wait. The user doesn't really care if the changes hit disk
>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>> is not any possibility of the rename hitting disk before the file contents.
>>
>> The fact that this could be implemented in multiple ways in the existing
>> hardware does not mean that there need to be multiple ways exposed to userspace,
>> it just means that the cost of doing the operation will vary depending on the
>> hardware that you have. This also means that if new hardware introduces a new
>> way of implementing this, that improvement can be passed on to the users without
>> needing application changes.
>
> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it
> because they don't know better. We programmers, who know better, have failed
> to raise a stink and demand that this be fixed.
> A) Drives should not lose data on power failure. If a drive accepts a write
> request and says "OK, done" then that data should get written to stable
> storage, period. Whether it requires capacitors or some other onboard power
> supply, or whatever, they should just do it. Keep in mind that today, most of
> the difference between enterprise drives and consumer desktop drives is just a
> firmware change, that hardware is already identical. Nobody should accept a
> product that doesn't offer this guarantee. It's inexcusable.
> B) it should go without saying - drives should reliably report back to the
> host, when something goes wrong. E.g., if a write request has been accepted,
> cached, and reported complete, but then during the actual write an ECC failure
> is detected in the cacheline, the drive needs to tell the host "oh by the way,
> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>
> If the entire software industry were to simply state "your shit stinks and
> we're not going to take it any more" the hard drive industry would have no
> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>
> Once you have drives that are actually trustworthy, actually reliable (which
> doesn't mean they never fail, it only means they tell the truth about
> successes or failures), most of these other issues disappear. Most of the need
> for barriers disappear.
>
I think that you are arguing a fairly silly point.
If you want that behaviour, you have had it for more than a decade - simply
disable the write cache on your drive and you are done.
If you - as a user - want to run faster and use applications that are coded to
handle data integrity properly (fsync, fdatasync, etc), leave the write cache
enabled and use file system barriers.
Everyone has to trade off cost versus something else and this is a very, very
long standing trade off that drive manufacturers have made.
The more money you pay for your storage, the less likely this is to be an issue
(high end SSD's, enterprise class arrays, etc don't have volatile write caches
and most SAS drives perform reasonably well with the write cache disabled).
Regards,
Ric
Ric Wheeler wrote:
> On 11/16/2012 10:06 AM, Howard Chu wrote:
>> David Lang wrote:
>>> barriers keep getting mentioned because they are a easy concept to understand.
>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>> care when any of this gets done" and they fit well with the requirements of the
>>> users.
>>>
>>> Users readily accept that if the system crashes, they will loose the most recent
>>> stuff that they did,
>>
>> *some* users may accept that. *None* should.
>>
>>> but they get annoyed when things get corrupted to the point
>>> that they loose the entire file.
>>>
>>> this includes things like modifying one option and a crash resulting in the
>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>> sync directory, rename file" dance, but the fact that to do so the user must sit
>>> and wait for the syncs to take place can be a problem. It would be far better to
>>> be able to say "write to temp file, and after it's on disk, rename the file" and
>>> not have the user wait. The user doesn't really care if the changes hit disk
>>> immediately, or several seconds (or even 10s of seconds) later, as long as there
>>> is not any possibility of the rename hitting disk before the file contents.
>>>
>>> The fact that this could be implemented in multiple ways in the existing
>>> hardware does not mean that there need to be multiple ways exposed to userspace,
>>> it just means that the cost of doing the operation will vary depending on the
>>> hardware that you have. This also means that if new hardware introduces a new
>>> way of implementing this, that improvement can be passed on to the users without
>>> needing application changes.
>>
>> There are a couple industry failures here:
>>
>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>> because they don't know better. We programmers, who know better, have failed
>> to raise a stink and demand that this be fixed.
>> A) Drives should not lose data on power failure. If a drive accepts a write
>> request and says "OK, done" then that data should get written to stable
>> storage, period. Whether it requires capacitors or some other onboard power
>> supply, or whatever, they should just do it. Keep in mind that today, most of
>> the difference between enterprise drives and consumer desktop drives is just a
>> firmware change, that hardware is already identical. Nobody should accept a
>> product that doesn't offer this guarantee. It's inexcusable.
>> B) it should go without saying - drives should reliably report back to the
>> host, when something goes wrong. E.g., if a write request has been accepted,
>> cached, and reported complete, but then during the actual write an ECC failure
>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>
>> If the entire software industry were to simply state "your shit stinks and
>> we're not going to take it any more" the hard drive industry would have no
>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>
>> Once you have drives that are actually trustworthy, actually reliable (which
>> doesn't mean they never fail, it only means they tell the truth about
>> successes or failures), most of these other issues disappear. Most of the need
>> for barriers disappear.
>>
>
> I think that you are arguing a fairly silly point.
Seems to me that you're arguing that we should accept inferior technology.
Who's really being silly?
> If you want that behaviour, you have had it for more than a decade - simply
> disable the write cache on your drive and you are done.
You seem to believe it's nonsensical for someone to want both fast and
reliable writes, or that it's unreasonable for a storage device to offer the
same, cheaply. And yet it is clearly trivial to provide all of the above.
> If you - as a user - want to run faster and use applications that are coded to
> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
> enabled and use file system barriers.
Applications aren't supposed to need to worry about such details, that's why
we have operating systems.
Drives should tell the truth. In event of an error detected after the fact,
the drive should report the error back to the host. There's nothing
nonsensical there.
When a drive's cache is enabled, the host should maintain a queue of written
pages, of a length equal to the size of the drive's cache. If a drive says
"hey, block XXX failed" the OS can reissue the write from its own queue. No
muss, no fuss, no performance bottlenecks. This is what Real Computers did
before the age of VAX Unix.
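Conceptually (purely as an illustration, not a description of any existing driver),
the host-side bookkeeping being described here is a bounded ring of recently written
blocks that can answer a deferred error report:

    #include <stddef.h>
    #include <string.h>

    #define BLOCK_SIZE 4096

    struct pending_write {
        unsigned long long lba;
        int valid;
        unsigned char data[BLOCK_SIZE];
    };

    /* Capacity would be sized to cover the drive's volatile write cache.  A
     * real implementation would also retire entries once the drive confirms
     * the blocks reached stable media. */
    struct reissue_ring {
        struct pending_write *slot;
        size_t capacity;
        size_t next;
    };

    static void remember_write(struct reissue_ring *r,
                               unsigned long long lba, const void *data)
    {
        struct pending_write *p = &r->slot[r->next++ % r->capacity];
        p->lba = lba;
        p->valid = 1;
        memcpy(p->data, data, BLOCK_SIZE);
    }

    /* On "block XXX didn't actually make it", find the copy to resubmit. */
    static const void *find_for_reissue(const struct reissue_ring *r,
                                        unsigned long long lba)
    {
        for (size_t i = 0; i < r->capacity; i++)
            if (r->slot[i].valid && r->slot[i].lba == lba)
                return r->slot[i].data;
        return NULL;
    }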
> Everyone has to trade off cost versus something else and this is a very, very
> long standing trade off that drive manufacturers have made.
With the cost of storage falling as rapidly as it has in recent years, this is
a stupid tradeoff.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/
On 11/16/2012 10:54 AM, Howard Chu wrote:
> Ric Wheeler wrote:
>> On 11/16/2012 10:06 AM, Howard Chu wrote:
>>> David Lang wrote:
>>>> barriers keep getting mentioned because they are a easy concept to understand.
>>>> "do this set of stuff before doing any of this other set of stuff, but I don't
>>>> care when any of this gets done" and they fit well with the requirements of
>>>> the
>>>> users.
>>>>
>>>> Users readily accept that if the system crashes, they will loose the most
>>>> recent
>>>> stuff that they did,
>>>
>>> *some* users may accept that. *None* should.
>>>
>>>> but they get annoyed when things get corrupted to the point
>>>> that they loose the entire file.
>>>>
>>>> this includes things like modifying one option and a crash resulting in the
>>>> config file being blank. Yes, you can do the 'write to temp file, sync file,
>>>> sync directory, rename file" dance, but the fact that to do so the user
>>>> must sit
>>>> and wait for the syncs to take place can be a problem. It would be far
>>>> better to
>>>> be able to say "write to temp file, and after it's on disk, rename the
>>>> file" and
>>>> not have the user wait. The user doesn't really care if the changes hit disk
>>>> immediately, or several seconds (or even 10s of seconds) later, as long as
>>>> there
>>>> is not any possibility of the rename hitting disk before the file contents.
>>>>
>>>> The fact that this could be implemented in multiple ways in the existing
>>>> hardware does not mean that there need to be multiple ways exposed to
>>>> userspace,
>>>> it just means that the cost of doing the operation will vary depending on the
>>>> hardware that you have. This also means that if new hardware introduces a new
>>>> way of implementing this, that improvement can be passed on to the users
>>>> without
>>>> needing application changes.
>>>
>>> There are a couple industry failures here:
>>>
>>> 1) the drive manufacturers sell drives that lie, and consumers accept it
>>> because they don't know better. We programmers, who know better, have failed
>>> to raise a stink and demand that this be fixed.
>>> A) Drives should not lose data on power failure. If a drive accepts a write
>>> request and says "OK, done" then that data should get written to stable
>>> storage, period. Whether it requires capacitors or some other onboard power
>>> supply, or whatever, they should just do it. Keep in mind that today, most of
>>> the difference between enterprise drives and consumer desktop drives is just a
>>> firmware change, that hardware is already identical. Nobody should accept a
>>> product that doesn't offer this guarantee. It's inexcusable.
>>> B) it should go without saying - drives should reliably report back to the
>>> host, when something goes wrong. E.g., if a write request has been accepted,
>>> cached, and reported complete, but then during the actual write an ECC failure
>>> is detected in the cacheline, the drive needs to tell the host "oh by the way,
>>> block XXX didn't actually make it to disk like I told you it did 10ms ago."
>>>
>>> If the entire software industry were to simply state "your shit stinks and
>>> we're not going to take it any more" the hard drive industry would have no
>>> choice but to fix it. And in most cases it would be a zero-cost fix for them.
>>>
>>> Once you have drives that are actually trustworthy, actually reliable (which
>>> doesn't mean they never fail, it only means they tell the truth about
>>> successes or failures), most of these other issues disappear. Most of the need
>>> for barriers disappear.
>>>
>>
>> I think that you are arguing a fairly silly point.
>
> Seems to me that you're arguing that we should accept inferior technology.
> Who's really being silly?
No, just suggesting that you either pay for the expensive stuff or learn how to
use cost effective, high capacity storage like the rest of the world.
I don't disagree that having non-volatile write caches would be nice, but
everyone has learned how to deal with volatile write caches at the low end of the
market.
>
>> If you want that behaviour, you have had it for more than a decade - simply
>> disable the write cache on your drive and you are done.
>
> You seem to believe it's nonsensical for someone to want both fast and
> reliable writes, or that it's unreasonable for a storage device to offer the
> same, cheaply. And yet it is clearly trivial to provide all of the above.
I look forward to seeing your products in the market.
Until you have more than "I want" and "I think" on your storage system design
resume, I suggest you spend the money to get the parts with non-volatile write
caches or fix your code.
Ric
>> If you - as a user - want to run faster and use applications that are coded to
>> handle data integrity properly (fsync, fdatasync, etc), leave the write cache
>> enabled and use file system barriers.
>
> Applications aren't supposed to need to worry about such details, that's why
> we have operating systems.
>
> Drives should tell the truth. In event of an error detected after the fact,
> the drive should report the error back to the host. There's nothing
> nonsensical there.
>
> When a drive's cache is enabled, the host should maintain a queue of written
> pages, of a length equal to the size of the drive's cache. If a drive says
> "hey, block XXX failed" the OS can reissue the write from its own queue. No
> muss, no fuss, no performance bottlenecks. This is what Real Computers did
> before the age of VAX Unix.
>
>> Everyone has to trade off cost versus something else and this is a very, very
>> long standing trade off that drive manufacturers have made.
>
> With the cost of storage falling as rapidly as it has in recent years, this is
> a stupid tradeoff.
>
On Fri, 16 Nov 2012, Howard Chu wrote:
> David Lang wrote:
>> barriers keep getting mentioned because they are a easy concept to
>> understand.
>> "do this set of stuff before doing any of this other set of stuff, but I
>> don't
>> care when any of this gets done" and they fit well with the requirements of
>> the
>> users.
>>
>> Users readily accept that if the system crashes, they will loose the most
>> recent
>> stuff that they did,
>
> *some* users may accept that. *None* should.
When users are given a choice of having all their work be very slow, or having it
be fast but, in the unlikely event of a crash, losing their most recent
changes, they are willing to lose their most recent changes.
If you think about it, this is not much different from the fact that you lose
all changes since the last time you saved the thing you are working on. Many
programs save state periodically so that if the application crashes the user
hasn't lost everything, but any application that tried to save after every
single change would be so slow that nobody would use it.
There is always going to be a window after a user hits 'save' where the data can
be lost, because it's not yet on disk.
> There are a couple industry failures here:
>
> 1) the drive manufacturers sell drives that lie, and consumers accept it
> because they don't know better. We programmers, who know better, have failed
> to raise a stink and demand that this be fixed.
> A) Drives should not lose data on power failure. If a drive accepts a write
> request and says "OK, done" then that data should get written to stable
> storage, period. Whether it requires capacitors or some other onboard power
> supply, or whatever, they should just do it. Keep in mind that today, most of
> the difference between enterprise drives and consumer desktop drives is just
> a firmware change, that hardware is already identical. Nobody should accept a
> product that doesn't offer this guarantee. It's inexcusable.
This option is available to you. However, if you have enabled write caching and
reordering, you have explicitly told the system to be faster at the expense of
losing data under some conditions. The fact that you then lose data under
those conditions should not surprise you.
The idea that you must have enough power to write all the pending data to disk
is problematic as that then severely limits the amount of cache that you have.
> B) it should go without saying - drives should reliably report back to the
> host, when something goes wrong. E.g., if a write request has been accepted,
> cached, and reported complete, but then during the actual write an ECC
> failure is detected in the cacheline, the drive needs to tell the host "oh by
> the way, block XXX didn't actually make it to disk like I told you it did
> 10ms ago."
The issue isn't a drive having a write error, it's the system shutting down
(or crashing) before the data is written; no OS-level tricks will help you here.
The real problem here isn't the drive claiming the data has been written when it
hasn't; the real problem is that the application has said 'write this data' to
the OS, and the OS has not done so yet.
The OS delays the writes for many legitimate reasons (the disk may be busy, it
can get things done more efficiently by combining and reordering the writes, etc.).
Unless the system crashes, this is not a problem: the data will eventually be
written out, and on system shutdown everything is good.
But if the system crashes, some of this postponed work doesn't get done, and
that can be a problem.
Applications can do fsync if they want to be sure that their data is safe on
disk NOW, but they currently have no way of saying "I want to make sure that A
happens before B, but I don't care if A happens now or 10 seconds from now".
That is the gap that it would be useful to provide a mechanism to deal with, and
it doesn't matter whether your disk system lies or not; there still isn't a way
to deal with this today.
David Lang
David Lang, on 11/15/2012 07:07 AM wrote:
>> There's no such thing as "barrier". It is fully artificial abstraction. After
>> all, at the bottom of your stack, you will have to translate it either to cache
>> flush, or commands order enforcement, or both.
>
> When people talk about barriers, they are talking about order enforcement.
Not correct. When people talk about barriers, they mean different things. For
instance, Alan Cox a few e-mails ago meant a cache flush.
That's the problem with the barrier concept: barriers are ambiguous. There's no
barrier which can fit all requirements.
> the hardware capabilities are not directly accessable from userspace (and they
> probably shouldn't be)
The discussion is not about directly exposing storage hardware capabilities to
user space. The discussion is about replacing the fully inadequate barrier
abstraction with a set of other, adequate abstractions.
For instance:
1. Cache flush primitives:
1.1. FUA
1.2. Non-immediate cache flush, i.e. don't return until all data hit non-volatile
media
1.3. Immediate cache flush, i.e. return ASAP after the cache sync started,
possibly before all data hit non-volatile media.
2. ORDERED attribute for requests. It provides the following behavior rules:
A. All requests without this attribute can be executed in parallel and be freely
reordered.
B. No ORDERED command can complete before all previously issued commands, whether
ORDERED or not, have completed.
Those abstractions can naturally fit all storage capabilities. For instance:
- On simple write-through-cache hardware not supporting ordering commands, (1)
translates to a NOP and (2) to queue draining.
- On fully featured HW, both (1) and (2) translate to the appropriate storage
capabilities.
On FTL storage, (B) can be further optimized by doing the data transfers for ORDERED
commands in parallel, but committing them in the requested order.
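Spelled out as a data structure, those two orthogonal facilities amount to a few
per-request attributes. This is only a sketch of the idea, not an existing kernel
interface:

    /* HYPOTHETICAL request attributes mirroring (1) and (2) above. */
    enum cache_sync_kind {
        SYNC_NONE = 0,
        SYNC_FUA,              /* 1.1: this write goes straight through to media  */
        SYNC_FLUSH,            /* 1.2: complete only after cached data is on media */
        SYNC_FLUSH_IMMEDIATE,  /* 1.3: start the cache sync, complete early        */
    };

    struct io_request {
        unsigned long long   lba;
        unsigned int         nr_blocks;
        enum cache_sync_kind sync;
        int                  ordered;  /* 2: may not complete before all earlier
                                          requests; plain requests can be freely
                                          reordered and run in parallel */
    };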
> barriers keep getting mentioned because they are a easy concept to understand.
Well, the concept of a flat Earth with the Sun rotating around it is also easy to
understand. So why isn't it used?
Vlad
Chris Friesen, on 11/15/2012 05:35 PM wrote:
>> The easiest way to implement this fsync would involve three things:
>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>> the affected file, wait for the device to report success, issue a cache
>> flush to the device (or request ordering commands, if available) to make
>> it tell the truth, and wait for the device to report success. AFAIK this
>> already happens, but without taking advantage of any request ordering
>> commands.
>> 2. The requesting thread returns as soon as the kernel has identified
>> all data that will be written back. This is new, but pretty similar to
>> what AIO already does.
>> 3. No write is allowed to enqueue any requests at the device that
>> involve the same file, until all outstanding fsync complete [3]. This is
>> new.
>
> This sounds interesting as a way to expose some useful semantics to userspace.
>
> I assume we'd need to come up with a new syscall or something since it doesn't
> match the behaviour of posix fsync().
This is how I would export the cache sync and request ordering abstractions to
user space:
For async I/O (io_submit() and friends) I would extend struct iocb with flags which
would allow setting the required capabilities, i.e. whether this request is FUA, a
full cache sync, immediate [1] or not, ORDERED or not, or any combination of these,
per each iocb.
For regular read()/write() I would add one more flag to the "flags" parameter of
sync_file_range(): whether this sync is immediate or not.
To enforce the ordering rules I would add one more command to fcntl(). It would make
the latest submitted write on this fd ORDERED.
All together, those should provide the requested functionality in a simple,
effective, unambiguous and backward-compatible manner.
Vlad
1. See my other today's e-mail about what is immediate cache sync.
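As a rough illustration of how that proposal could look to applications (every flag
and command below is hypothetical; none of them exist in any kernel or libc):

    /* HYPOTHETICAL interface sketching the proposal above. */
    #define IOCB_FLAG_FUA       (1u << 8)   /* per-iocb: write through the cache */
    #define IOCB_FLAG_SYNC      (1u << 9)   /* per-iocb: full cache sync         */
    #define IOCB_FLAG_SYNC_IMM  (1u << 10)  /* per-iocb: immediate cache sync    */
    #define IOCB_FLAG_ORDERED   (1u << 11)  /* per-iocb: ORDERED wrt earlier I/O */

    #define SYNC_FILE_RANGE_IMMEDIATE  (1u << 3)  /* new sync_file_range() flag  */

    #define F_SET_ORDERED  1024  /* new fcntl() command: mark the most recently
                                    submitted write on this fd as ORDERED */

    /* Usage with plain write():
     *
     *     write(fd, buf, len);
     *     fcntl(fd, F_SET_ORDERED, 0);   // that write may not be reordered past
     *                                    // anything submitted after it
     */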
Vladislav Bolkhovitin, on 11/17/2012 12:02 AM wrote:
>>> The easiest way to implement this fsync would involve three things:
>>> 1. Schedule writes for all dirty pages in the fs cache that belong to
>>> the affected file, wait for the device to report success, issue a cache
>>> flush to the device (or request ordering commands, if available) to make
>>> it tell the truth, and wait for the device to report success. AFAIK this
>>> already happens, but without taking advantage of any request ordering
>>> commands.
>>> 2. The requesting thread returns as soon as the kernel has identified
>>> all data that will be written back. This is new, but pretty similar to
>>> what AIO already does.
>>> 3. No write is allowed to enqueue any requests at the device that
>>> involve the same file, until all outstanding fsync complete [3]. This is
>>> new.
>>
>> This sounds interesting as a way to expose some useful semantics to userspace.
>>
>> I assume we'd need to come up with a new syscall or something since it doesn't
>> match the behaviour of posix fsync().
>
> This is how I would export the cache sync and request ordering abstractions to
> user space:
>
> For async I/O (io_submit() and friends) I would extend struct iocb with flags which
> would allow setting the required capabilities, i.e. whether this request is FUA, a
> full cache sync, immediate [1] or not, ORDERED or not, or any combination of these,
> per each iocb.
>
> For regular read()/write() I would add one more flag to the "flags" parameter of
> sync_file_range(): whether this sync is immediate or not.
>
> To enforce the ordering rules I would add one more command to fcntl(). It would make
> the latest submitted write on this fd ORDERED.
Correction. To avoid possible races, it would be better for the new fcntl() command
to specify that the N subsequent read()/write()/sync() calls are ORDERED.
For instance, in the simplest case of N=1, the next write() after the fcntl() would
be handled as ORDERED.
(Unfortunately, it doesn't look like the old read()/write() interface leaves room
for a more elegant solution.)
Vlad
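To make the corrected fcntl() idea concrete, here is a hypothetical usage sketch.
F_SET_ORDERED and its numeric value are invented for this illustration; no such
command exists in Linux, so a present-day kernel would fail the call with EINVAL.

#include <fcntl.h>
#include <unistd.h>

#define F_SET_ORDERED 1050   /* hypothetical command number */

ssize_t ordered_append(int fd, const void *rec, size_t len)
{
    /* Mark the next single submitted operation on this fd (N = 1) as
     * ORDERED: it must not reach the media before earlier writes do. */
    if (fcntl(fd, F_SET_ORDERED, 1) == -1)
        return -1;

    return write(fd, rec, len);
}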
Vlad,
You keep saying that programmers don't understand "barriers". You've
provided no evidence of this. Meanwhile memory barriers are generally
well understood, and every programmer I know understands that a
"barrier" is a synchronization primitive that says that all operations
of a certain type will have completed prior to the barrier returning
control to its caller.
For some filesystems it is possible to configure fsync() to act as a
barrier: for example, ZFS can be told to perform no synchronous
operations for a given dataset, in which case fsync() devolves into a
simple barrier. (Cue Simon to tell us that some hardware and some
OSes, and some filesystems simply cannot implement fsync(), with or
without synchronicity.)
So just give us a barrier. Yes, I know, it's tricky to implement, but
it'd be OK to return EOPNOTSUPP, and let the app do something else
(e.g., call fsync() instead, tell the user to expect instability, tell
the user to get a better system, ...).
As for implementation, it helps to have a journalled or log-structured
filesystem. It also helps to have hardware synchronization primitives
that don't suck, but these aren't entirely necessary: ZFS, for
example, can recover [*] from N incomplete transactions[**], and still
provides fsync() as a barrier given its on-disk structure and the ZIL.
Note that ZFS recovery from incomplete transactions should never be
necessary where the HW has proper cache flush support, but the
recovery functionality was added precisely because of lousy hardware.
[*] At volume import time, such as at boot-time.
[**] Granted, this requires user input, but if the user didn't care it
could be made automatic.
Nico
--
Nico Williams, on 11/26/2012 03:05 PM wrote:
> Vlad,
>
> You keep saying that programmers don't understand "barriers". You've
> provided no evidence of this. Meanwhile memory barriers are generally
> well understood, and every programmer I know understands that a
> "barrier" is a synchronization primitive that says that all operations
> of a certain type will have completed prior to the barrier returning
> control to its caller.
Well, your understanding of memory barriers is wrong, and you are illustrating
that the memory barriers concept is not so well understood in practice.
Simplifying: memory barrier instructions are not a "cache flush" of this CPU, as is
often thought. They set the order in which reads or writes from other CPUs become
visible on this CPU. And nothing else. Locally, on each CPU, reads and writes are
always seen in order. So, (1) on a single-CPU system memory barrier instructions
don't make any sense, and (2) they must come at least in pairs, one on each CPU
participating in the interaction; otherwise it's a clear sign of a mistake.
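For reference, that pairing requirement shows up in the classic message-passing
pattern. A minimal C11 sketch (standard atomics only, nothing hypothetical):

#include <stdatomic.h>
#include <stdbool.h>

static int payload;        /* plain data being published       */
static atomic_bool ready;  /* flag that publishes the payload  */

void producer(void)        /* runs on CPU/thread A */
{
    payload = 42;                                      /* write the data */
    atomic_store_explicit(&ready, true,
                          memory_order_release);       /* publish it     */
}

int consumer(void)         /* runs on CPU/thread B */
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                  /* spin until the flag is observed */
    return payload;        /* guaranteed to read 42           */
}

Drop either the release store or the acquire load and the consumer may observe a
stale payload; that is exactly the "barriers come in pairs" point above.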
There's nothing similar in storage, because storage has strong consistency
requirements even if it is distributed. All those clouds and hadoops with weak
consistency requirements are outside of this discussion, although even they don't
have anything similar to memory barriers.
As I already wrote, the concept of a flat Earth with the Sun revolving around it is
also very simple to understand. Are you still using that concept?
> So just give us a barrier.
As with the flat Earth, I'd strongly suggest you start using an adequate concept of
what you want to achieve, starting from what I proposed a few e-mails ago in this
thread.
If you look at it, it offers exactly what you want, only named correctly.
Vlad