2012-10-11 16:38:49

by Nico Williams

Subject: Re: [sqlite] light weight write barriers

On Wed, Oct 10, 2012 at 12:48 PM, Richard Hipp <[email protected]> wrote:
>> Could you list the requirements of such a light weight barrier?
>> i.e. what would it need to do minimally, what's different from
>> fsync/fdatasync ?
>
> For SQLite, the write barrier needs to involve two separate inodes. The
> requirement is this:

...

> Note also that when fsync() works as advertised, SQLite transactions are
> ACID. But when fsync() is reduced to a write-barrier, we lose the D
> (durable) and transactions are only ACI. In our experience, nobody really
> cares very much about durable across a power-loss. People are mainly
> interested in Atomic, Consistent, and Isolated. If you take a power loss
> and then after reboot you find the 10 seconds of work prior to the power
> loss is missing, nobody much cares about that as long as all of the prior
> work is still present and consistent.

There is something you can do: combine a COW on-disk format, designed so
that partially committed transactions can be detected and rolled back to
the last known-good root, with backgrounded fsync()s (i.e., issued in a
separate thread, without the writer waiting for the fsync() to
complete).
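
A minimal sketch of the backgrounded fsync() idea, in Python for brevity.
The function names (commit_without_waiting, demo) are made up for
illustration and are not part of SQLite; the point is only that the
writer hands the fsync() to another thread and keeps going.

```python
import os
import tempfile
import threading


def commit_without_waiting(fd, payload):
    """Write payload, then start fsync() in a background thread.

    Hypothetical helper: the returned thread can be join()ed by a caller
    that does want durability before proceeding.
    """
    os.write(fd, payload)  # small payload; a real writer would loop on short writes
    flusher = threading.Thread(target=os.fsync, args=(fd,), daemon=True)
    flusher.start()
    return flusher


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "db")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            flusher = commit_without_waiting(fd, b"txn-1")
            # The writer is free to start the next transaction here.
            flusher.join()  # only so the fd is not closed under the fsync()
            return os.path.getsize(path)
        finally:
            os.close(fd)
```

Without the on-disk format that can detect a partially committed
transaction, this alone gives no ordering guarantee; it only removes the
wait.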

Nico
--


2012-10-11 16:48:55

by Nico Williams

Subject: Re: [sqlite] light weight write barriers

To expand a bit, the on-disk format needs to allow the roots of the
last N transactions to remain reachable at all times. At open time you
look for the latest transaction and verify that it has been written[0]
completely. If it has, use it; otherwise look for the preceding
transaction, verify it, and so on.
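
The open-time search could be sketched roughly as follows. The slot
layout (transaction id, root block number, CRC32) and the names
pack_slot and find_latest_valid_root are assumptions made up for this
example, and the check here covers only the root record itself, not the
blocks it points to.

```python
import struct
import zlib

# Assumed slot layout: txn_id (u64), root_block (u64), crc32 of those 16 bytes.
SLOT = struct.Struct("<QQI")


def pack_slot(txn_id, root_block):
    body = struct.pack("<QQ", txn_id, root_block)
    return body + struct.pack("<I", zlib.crc32(body))


def find_latest_valid_root(header, n_slots):
    """Return (txn_id, root_block) of the newest intact slot, or None."""
    candidates = []
    for i in range(n_slots):
        raw = header[i * SLOT.size:(i + 1) * SLOT.size]
        if len(raw) < SLOT.size:
            continue  # truncated slot: treat as not written
        txn_id, root_block, crc = SLOT.unpack(raw)
        if zlib.crc32(raw[:16]) == crc:
            candidates.append((txn_id, root_block))
    # Tuples compare by txn_id first, so max() picks the latest transaction.
    return max(candidates, default=None)
```

With two slots holding transactions 7 and 8, a torn write to slot 8
makes the function fall back to transaction 7.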

N needs to be at least 2: the last and the preceding transactions. No
blocks that a retained transaction still uses, or might need (e.g., for
power failure recovery), should be freed or reused. For high read
concurrency you can allow connections to lock a past transaction so
that no blocks needed to access the DB at that state are freed.
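
One way to picture the reclamation rule, as an in-memory sketch only.
The class BlockReclaimer and its methods are hypothetical, and a real
implementation would track reachability on disk rather than in Python
sets.

```python
class BlockReclaimer:
    """Tracks which blocks may be reused under the 'keep last N' rule."""

    def __init__(self, keep_last=2):
        self.keep_last = keep_last
        self.blocks_by_txn = {}  # txn_id -> set of blocks reachable from its root
        self.pins = {}           # txn_id -> number of readers locking it

    def commit(self, txn_id, reachable_blocks):
        self.blocks_by_txn[txn_id] = set(reachable_blocks)

    def pin(self, txn_id):
        self.pins[txn_id] = self.pins.get(txn_id, 0) + 1

    def unpin(self, txn_id):
        self.pins[txn_id] -= 1
        if self.pins[txn_id] == 0:
            del self.pins[txn_id]

    def reclaim(self):
        """Forget unprotected transactions; return blocks now safe to reuse."""
        ordered = sorted(self.blocks_by_txn)
        protected = set(ordered[-self.keep_last:]) | set(self.pins)
        live = set()
        for txn_id in protected:
            live |= self.blocks_by_txn.get(txn_id, set())
        freed = set()
        for txn_id in ordered:
            if txn_id not in protected:
                # Blocks shared with a protected transaction stay allocated.
                freed |= self.blocks_by_txn.pop(txn_id) - live
        return freed
```

A reader that pins transaction 1 keeps its blocks allocated even after
transactions 2 and 3 commit; once it unpins, only the blocks that no
retained transaction shares become reusable.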

This all goes back to 1980s DB and filesystem concepts. See, for
example, the 4.4BSD Log-Structured File System (LFS). (I mention this in case
there are concerns about patents, though IANAL and I make no
particular assertions here other than that there is plenty of old
prior art and expired patents that can probably be used to obtain
sufficient certainty as to the patent law risks in the approach
described herein.)

[0] E.g., check a transaction block manifest and check that those
blocks were written correctly; or traverse the tree looking for
differences to the previous transaction; this may require checking
block contents checksums.
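
The manifest variant of [0] might look like the sketch below. The
manifest shape (a list of block number and expected CRC32 pairs) and the
read_block callback are assumptions for illustration; CRC32 stands in
for whatever block checksum the format actually uses.

```python
import zlib


def transaction_is_complete(manifest, read_block):
    """Check every block listed in a transaction's manifest.

    manifest:   iterable of (block_no, expected_crc32) pairs
    read_block: callable returning the block's bytes, or None if unreadable
    """
    for block_no, expected_crc in manifest:
        data = read_block(block_no)
        if data is None or zlib.crc32(data) != expected_crc:
            return False  # missing or torn block: transaction not usable
    return True
```

If this returns False for the latest transaction, the opener moves on to
the preceding one.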

Nico
--