2014-11-03 23:33:13

by Theodore Ts'o

Subject: Some thoughts about providing data block checksumming for ext4

I've been thinking a lot about the best way to provide data block
checksumming for ext4 in an efficient way, and as I promised on
today's ext4 concall, I want to detail them in the hopes that it will
spark some more interest in actually implementing this feature,
perhaps in a more general way than just for ext4.

I've included in this writeup a strawman design to implement data
block checksumming as a device mapper module.

Comments appreciated!

- Ted

The checksum consistency problem
=================================

Copy-on-write file systems such as btrfs and zfs have a big
advantage when it comes to providing data block checksums because they
never overwrite an existing data block. In contrast, update-in-place
file systems such as ext4 and xfs, if they want to provide data block
checksums, must be able to update the checksum and the data block
atomically, or else if the system fails at an inconvenient point in
time, the previously existing block in a file would have an
inconsistent checksum and contents.

In the case of ext4, we can solve this by writing the data blocks
through the journal, alongside the metadata block containing the
checksum. However, this results in the performance cost of doing
double writes, as in data=journal mode. We can do slightly better by
skipping this if the block in question is a newly allocated block,
since there is no guarantee that data will be safe until an fsync()
call, and in the case of a newly allocated block, there are no
previous contents at risk.

But there is a way we can do even better! If we can manage to
compress the block even by a tiny amount, so that a 4k block can be
stored in 4092 bytes (which means we need to be able to compress the
block by 0.1%), we can store the checksum inline with the data, which
can then be atomically updated assuming a modern drive with a 4k
sector size (even a 512e disk will work fine, assuming the partition
is properly 4k aligned). If the block is not sufficiently
compressible, then we will need to store the checksum out-of-line, but
in practice, this should be relatively rare. (The most common case of
incompressible file formats are things like media files and
already-compressed packages, and these files are generally not updated
in a random-write workload.)

In order to distinguish between a compressed+checksum block and a
non-compressed block with an out-of-line checksum, we can use a CRC-24
checksum. In the compressed+checksum case, we store a zero in the
first byte of the block, followed by a 3 byte checksum, followed by
the compressed contents of the block. In the case where the block can
not be compressed, we save the high nibble of the block plus the 3
byte CRC-24 checksum in the out-of-line metadata block, and then we
set the high nibble of the block to 0xF so that there is no
possibility that a block with an original initial byte of zero will be
confused with a compressed+checksum block. (Why the high nibble and
not just the first byte of the block? We have other planned uses for
those 4 bits; more later in this paper.)
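
To make the on-disk convention concrete, here is a rough C sketch of
the encode step. This is only illustrative: crc24() and compress4k()
are hypothetical helpers (any CRC-24 and any sufficiently fast
compressor would do), and the exact framing is part of the strawman,
not a final format.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 4096
    #define HDR_BYTES  4     /* 1 marker byte + 3 byte CRC-24 */

    /* Hypothetical helpers: */
    extern uint32_t crc24(const void *buf, size_t len);   /* low 24 bits used */
    extern int compress4k(const void *in, void *out, size_t out_max);
                         /* returns compressed length, or -1 if it won't fit */

    /*
     * Encode one 4k block for writing.  Returns 1 if the checksum fits
     * inline (the compressed case); returns 0 if the caller must store
     * the 4-byte ool_entry in the out-of-line checksum block.
     */
    static int encode_block(const uint8_t data[BLOCK_SIZE],
                            uint8_t out[BLOCK_SIZE], uint8_t ool_entry[4])
    {
        uint32_t csum = crc24(data, BLOCK_SIZE);   /* over the original data */
        int clen = compress4k(data, out + HDR_BYTES, BLOCK_SIZE - HDR_BYTES);

        if (clen >= 0) {
            out[0] = 0;                 /* first byte 0 => compressed+checksum */
            out[1] = csum >> 16;
            out[2] = csum >> 8;
            out[3] = csum;
            /* zero-fill the tail so we never leak stale data */
            memset(out + HDR_BYTES + clen, 0, BLOCK_SIZE - HDR_BYTES - clen);
            return 1;
        }

        /* Incompressible: displace the high nibble and the CRC out of line. */
        memcpy(out, data, BLOCK_SIZE);
        out[0] |= 0xF0;                  /* high nibble 0xF marker */
        ool_entry[0] = data[0] & 0xF0;   /* saved high nibble; the low nibble is
                                            reserved for the checksum block's
                                            own metadata (see below) */
        ool_entry[1] = csum >> 16;
        ool_entry[2] = csum >> 8;
        ool_entry[3] = csum;
        return 0;
    }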

Storing the data block checksums in ext4
========================================

There are two ways that have been discussed for storing data block
checksums in ext4. The first approach is to dedicate a checksum block
every 1024 blocks, which would be sufficient to store a 4 byte
checksum for each block (assuming a 4k block size). This approach has
the advantage of being very simple. However, it becomes very
difficult to upgrade an existing file system to one that supports data
block checksums without doing the equivalent of a backup/restore
operation.

The second approach is to store the checksums in a per-inode structure
which is indexed by logical block number. This approach makes it much
simpler to upgrade an existing file system. In addition, if not all
files need to be data integrity protected, it is more efficient, since
we only store checksums for the files that need them. The case where
this might become important is where we are using a cryptographic
Message Authentication Code (MAC) instead of a checksum. This is
because a MAC is significantly larger than a 4 byte checksum, and not
all of the files in the file system might be encrypted and thus need
cryptographic data integrity protection in order to protect against
certain chosen plaintext attacks. In that case, using a per-inode
structure only for those file blocks which require protection might
make a lot of sense. (And if we pursue cryptographic data integrity
guarantees for the ext4 encryption project, we will probably need to
go down this route.) The massive disadvantage of this scheme is that
it is significantly more complicated to implement.

However, if we are going to simply intersperse the metadata blocks
alongside the data blocks, there is no real need to do this work in
the file system. Instead, we can do this work in a device mapper
plugin. This has the advantage that it moves the complexity outside
of the file system, and allows any update-in-place file system
(including xfs, jfs, etc.) to gain the benefits of data block
checksumming. So in the next section of this paper I will outline a
strawman design of such a dm plugin.

Doing data block checksumming as a device-mapper plugin
=======================================================

Since we need to give this a name, for now I'm going to call this
proposed plugin "dm-protected". (If anyone would like to suggest a
better name, I'm all ears.)

The Non-Critical Write flag
---------------------------

First, let us define an optional extension to the Linux block layer
which allows us to provide a certain optimization when writing
non-compressible files such as audio/video media files, which are
typically written in a streaming fashion and which are generally not
updated in place after they are initially written. As this
optimization is purely optional, this feature might not be implemented
initially, and a file system does not have to take advantage of this
extension if it is implemented.

If present, this extension allows the file system to pass a hint to
the block device that a particular data block write is the first time
that a newly allocated block is being written. As such, it is not
critically important that the checksum be atomically updated when the
data block is written, in the case where the data block can not be
compressed such that the checksum can fit inline with the compressed
data.

XXX I'm not sure "non-critical" is the best name for this flag. It
may be renamed if we can think of a better, more descriptive name.

Layout of the dm-protected device
---------------------------------

The layout of the dm-protected device is a 4k checksum block followed
by 1024 data blocks. Hence, given a logical 4k block number (LBN) L,
the checksum block associated with that LBN is located at physical
block number (PBN):

PBN_checksum = (L / 1024) * 1025

where '/' is a C-style integer division operation.

The PBN where the data for LBN L is stored can be calculated as
follows:

PBN_L = L + (L / 1024) + 1

The checksum block is used when we need to store an out-of-line
checksum for a particular block in its "checksum group": we treat the
contents of the checksum block as an array of 4 byte entries, and the
entry for a particular LBN is found at index (L % 1024).
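
In C, the mapping is just (a sketch, assuming 4k blocks and the
1024-block checksum groups described above):

    #include <stdint.h>

    #define GROUP_DATA_BLOCKS 1024ULL                  /* data blocks per group */
    #define GROUP_SIZE        (GROUP_DATA_BLOCKS + 1)  /* plus its checksum block */

    /* PBN of the data block backing logical block L */
    static uint64_t lbn_to_data_pbn(uint64_t L)
    {
        return L + L / GROUP_DATA_BLOCKS + 1;
    }

    /* PBN of the checksum block covering logical block L */
    static uint64_t lbn_to_csum_pbn(uint64_t L)
    {
        return (L / GROUP_DATA_BLOCKS) * GROUP_SIZE;
    }

    /* Index of L's 4-byte entry within its checksum block */
    static unsigned int lbn_to_csum_slot(uint64_t L)
    {
        return L % GROUP_DATA_BLOCKS;
    }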

For redundancy purposes we calculate a metadata checksum of the
checksum block assuming that the low nibble of the first byte in each
entry is zero, and we then use those low nibbles to store the first
LBN covered by this checksum block plus the metadata checksum of the
checksum block. We encode the first LBN so we can identify the
checksum block when it is copied into the Active Area (described
below).
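
One possible encoding of that, sketched in C (the exact nibble layout
and the use of crc32c here are assumptions of the sketch, not part of
the design):

    #include <stdint.h>
    #include <stddef.h>

    #define ENTRIES 1024

    extern uint32_t crc32c(const void *buf, size_t len);  /* hypothetical helper */

    /*
     * "Seal" a checksum block before it is written: zero the low nibble
     * of byte 0 of every entry, scatter the 64-bit first LBN of the group
     * into the first 16 of those nibbles, checksum the whole 4k block,
     * and scatter the 32-bit result into the next 8 nibbles.
     */
    static void seal_csum_block(uint8_t blk[ENTRIES * 4], uint64_t first_lbn)
    {
        uint32_t crc;
        int i;

        for (i = 0; i < ENTRIES; i++)
            blk[i * 4] &= 0xF0;                    /* low nibbles := 0 */

        for (i = 0; i < 16; i++)                   /* nibbles 0..15: first LBN */
            blk[i * 4] |= (first_lbn >> (60 - 4 * i)) & 0x0F;

        crc = crc32c(blk, ENTRIES * 4);            /* covers the LBN nibbles,
                                                      but not the crc itself */

        for (i = 0; i < 8; i++)                    /* nibbles 16..23: metadata crc */
            blk[(16 + i) * 4] |= (crc >> (28 - 4 * i)) & 0x0F;
    }

Verification is the reverse: pull the 8 crc nibbles out, zero them,
recompute crc32c, and compare; the 16 LBN nibbles identify which
checksum group a copy in the Active Area belongs to.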


Writing to the dm-protected device
-----------------------------------

As described earlier, when we write to the dm-protected device, the
plugin will attempt to compress the contents of the data block. If
it is successful at reducing the required storage size by at least 4
bytes, then it will write the block, with its checksum inline, in
place.

If the data block is not compressible, and this is a non-critical
write, then we update the checksum in the checksum block for that
particular LBN range, and we write out the data block immediately;
then, after a 5 second delay (in case there are subsequent
non-compressible, non-critical writes, as there probably will be when
a large media file is being written), we write out the modified
checksum block.
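
A sketch of the resulting write-path dispatch, reusing the
hypothetical encode_block() and LBN/PBN helpers from the earlier
fragments (the remaining branch, the transactional path, is what the
rest of this section describes; all the helper names below are made
up for illustration):

    /* More hypothetical helpers: */
    extern void write_block(uint64_t pbn, const uint8_t buf[4096]);
    extern void update_csum_entry_in_memory(uint64_t lbn, const uint8_t entry[4]);
    extern void schedule_csum_block_writeback(uint64_t csum_pbn);  /* ~5s delay */
    extern void transactional_write(uint64_t lbn, const uint8_t buf[4096],
                                    const uint8_t entry[4]);

    static void dmp_write(uint64_t lbn, const uint8_t data[4096], int non_critical)
    {
        uint8_t out[4096], ool_entry[4];

        if (encode_block(data, out, ool_entry)) {
            /* Checksum fits inline: one atomic 4k write, nothing else to do. */
            write_block(lbn_to_data_pbn(lbn), out);
        } else if (non_critical) {
            /* First write of a newly allocated block: write the data now,
             * and flush the updated checksum block lazily. */
            update_csum_entry_in_memory(lbn, ool_entry);
            write_block(lbn_to_data_pbn(lbn), out);
            schedule_csum_block_writeback(lbn_to_csum_pbn(lbn));
        } else {
            /* Overwrite of existing, incompressible data: must go through
             * the Active Area commit protocol described below. */
            transactional_write(lbn, out, ool_entry);
        }
    }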

If the data block is not compressible, and the write is not marked as
non-critical, then we need to worry about making sure the data
block(s) and the checksum block are written out transactionally. To
do this, we write the current contents of the checksum block to a free
slot in the Active Area (AA) using FUA; the AA is a 64 block area
which is used to store copies of checksum blocks whose data blocks are
actively being modified. We then calculate the checksums for the
modified data blocks in the checksum group, and update the checksum
block in memory, but we do not allow any of the data blocks to be
written out until one of the following happens and triggers a commit
of the checksum group:

*) a 5 second timer has expired
*) we have run out of free slots in the Active Area
*) we are under significant memory pressure and we need to release some of
the pinned buffers for the data blocks in the checksum group
*) the file system has requested a FLUSH CACHE operation

A commit of the checksum group consists of the following:

1) An update of the checksum block using a FUA write
2) Writing all of the pinned data blocks in the checksum group to disk
3) Sending a FLUSH CACHE request to the underlying storage
4) Allowing the slot in the Active Area to be used for some other checksum block
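
In the same sketchy pseudo-C as above, continuing the earlier
fragments (the struct layout and helper names are, again, only for
illustration):

    struct csum_group {
        uint64_t csum_pbn;          /* PBN of this group's checksum block */
        uint64_t first_lbn;         /* first LBN covered by the group */
        uint8_t  csum_block[4096];  /* in-memory copy, updated as writes arrive */
        int      aa_slot;           /* slot in the Active Area, or -1 */
        /* ... plus the list of pinned, not-yet-written data blocks ... */
    };

    /* Hypothetical helpers: */
    extern struct csum_group *lbn_to_group(uint64_t lbn);
    extern void fua_write_block(uint64_t pbn, const void *buf);   /* FUA write */
    extern void flush_cache(void);                                /* FLUSH CACHE */
    extern int  aa_alloc_slot(void);          /* may force a commit if AA is full */
    extern uint64_t aa_slot_pbn(int slot);
    extern void aa_release_slot(int slot);
    extern void pin_data_block(struct csum_group *cg, uint64_t lbn,
                               const uint8_t *buf);
    extern void write_pinned_data_blocks(struct csum_group *cg);

    static void transactional_write(uint64_t lbn, const uint8_t data[4096],
                                    const uint8_t entry[4])
    {
        struct csum_group *cg = lbn_to_group(lbn);
        uint8_t *slot = &cg->csum_block[4 * (lbn % 1024)];

        if (cg->aa_slot < 0) {
            /* Stash the current contents of the checksum block in the
             * Active Area with FUA; the embedded first-LBN tag (see
             * seal_csum_block() above) identifies the group on recovery. */
            cg->aa_slot = aa_alloc_slot();
            seal_csum_block(cg->csum_block, cg->first_lbn);
            fua_write_block(aa_slot_pbn(cg->aa_slot), cg->csum_block);
        }
        slot[0] = (slot[0] & 0x0F) | (entry[0] & 0xF0);  /* keep reserved nibble */
        memcpy(slot + 1, entry + 1, 3);
        pin_data_block(cg, lbn, data);      /* held back until the commit below */
    }

    /* Triggered by the 5 second timer, AA exhaustion, memory pressure,
     * or a FLUSH CACHE request from the file system. */
    static void commit_csum_group(struct csum_group *cg)
    {
        seal_csum_block(cg->csum_block, cg->first_lbn);  /* re-seal the metadata */
        fua_write_block(cg->csum_pbn, cg->csum_block);   /* 1) FUA checksum block */
        write_pinned_data_blocks(cg);                    /* 2) pinned data blocks */
        flush_cache();                                   /* 3) FLUSH CACHE */
        aa_release_slot(cg->aa_slot);                    /* 4) slot is reusable */
        cg->aa_slot = -1;
    }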

Recovery after a power fail
---------------------------

If the dm-protected device was not cleanly shut down, then we need to
examine all of the checksum blocks in the Active Area. For each
checksum block in the AA, the checksums for all of its data blocks
should match either the checksum found in the AA, or the checksum
found in the checksum block in the checksum group. Once we have
determined which checksum corresponds to the data block after the
unclean shutdown, we can update the checksum block and clear the copy
found in the AA.
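
Roughly, continuing the sketch (block_matches_entry() is a
hypothetical helper that checks a data block against a 4-byte entry,
treating inline-checksummed, i.e. compressed, blocks as always
matching):

    /* Hypothetical helpers: */
    extern void read_block(uint64_t pbn, uint8_t buf[4096]);
    extern int  block_matches_entry(const uint8_t data[4096],
                                    const uint8_t entry[4]);
    extern void report_corruption(uint64_t lbn);

    /* Run for every checksum-block copy found in the Active Area after
     * an unclean shutdown; first_lbn comes from the copy's seal. */
    static void recover_group(const uint8_t aa_copy[4096], uint64_t first_lbn)
    {
        uint8_t cur[4096], data[4096];
        uint64_t csum_pbn = lbn_to_csum_pbn(first_lbn);
        unsigned int i;

        read_block(csum_pbn, cur);

        for (i = 0; i < 1024; i++) {
            read_block(lbn_to_data_pbn(first_lbn + i), data);
            if (block_matches_entry(data, &cur[4 * i]))
                continue;                   /* the new checksum is the valid one */
            if (block_matches_entry(data, &aa_copy[4 * i])) {
                /* The data block was never rewritten: roll the entry back. */
                cur[4 * i] = (cur[4 * i] & 0x0F) | (aa_copy[4 * i] & 0xF0);
                memcpy(&cur[4 * i + 1], &aa_copy[4 * i + 1], 3);
            } else {
                report_corruption(first_lbn + i);  /* neither checksum matches */
            }
        }

        seal_csum_block(cur, first_lbn);    /* recompute the embedded metadata */
        fua_write_block(csum_pbn, cur);
        /* ... then this Active Area slot can be cleared ... */
    }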

On a clean shutdown of the dm-protected device, we can clear the
Active Area, and so the recovery procedure will not be needed the next
time the dm-protected device is initialized.

Integration with other DM modules
=================================

If the dm-protected device is layered on a dm-raid 1 setup, then if
there is a checksum failure the dm-protected device should attempt to
fetch the alternate copy of the data from the other mirror.

Of course, the dm-protected module could also be layered on top of a
dm-crypt, dm-thin module, or LVM setup.

Conclusion
==========

In this paper, we have examined some of the problems of providing data
block checksumming in ext4, and have proposed a solution which
implements this functionality as a device-mapper plugin. For many
file types, it is expected that using a very fast compression
algorithm (we only need to compress the block by less than 0.1%) will
allow us to provide data block checksumming with almost no I/O
overhead and only a very modest amount of CPU overhead.

For those file types which contain a large number of incompressible
blocks, if they do not need to be updated in place, we can also
minimize the overhead by avoiding the need to do a transactional
update of the data block and the checksum block.

In those cases where we do need to do a transactional update of the
checksum block relative to the data blocks, we have outlined a very
simple logging scheme which is both efficient and relatively easy to
implement.




2014-11-04 21:20:48

by Andreas Dilger

Subject: Re: Some thoughts about providing data block checksumming for ext4


On Nov 3, 2014, at 4:33 PM, Theodore Ts'o <[email protected]> wrote:

> I've been thinking a lot about the best way to provide data block
> checksumming for ext4 in an efficient way, and as I promised on
> today's ext4 concall, I want to detail them in the hopes that it will
> spark some more interest in actually implementing this feature,
> perhaps in a more general way than just for ext4.
>
> I've included in this writeup a strawman design to implement data
> block checksuming as a device mapper module.
>
> Comments appreciated!
>
> - Ted
>
> The checksum consistency problem
> =================================
>
> Copy-on-write file systems such as btrfs and zfs have a big
> advantage when it comes to providing data block checksums because they
> never overwrite an existing data block. In contrast, update-in-place
> file systems such as ext4 and xfs, if they want to provide data block
> checksums, must be able to update checksum and the data block
> atomically, or else if the system fails at an inconvenient point in
> time, the previously existing block in a file would have an
> inconsistent checksum and contents.
>
> In the case of ext4, we can solve this by the data blocks through the
> journal, alongside the metadata block containing the checksum.
> However, this results in the performance cost of doing double writes
> as in data=journal mode. We can do slightly better by skipping this
> if the block in question is a newly allocated block, since there is no
> guarantee that data will be safe until an fsync() call, and in the
> case of a newly allocated block, there is no previous contents which
> is at risk.
>
> But there is a way we can do even better! If we can manage to
> compress the block even by a tiny amount, so that 4k block can be
> stored in 4092 bytes (which means we need to be able to compress the
> block by 0.1%), we can store the checksum inline with the data, which
> can then be atomically updated assuming a modern drive with a 4k
> sector size (even a 512e disk will work fine, assuming the partition
> is properly 4k aligned). If the block is not sufficiently
> compressible, then we will need to store the checksum out-of-line, but
> in practice, this should be relatively rare. (The most common case of
> incompressible file formats are things like media files and
> already-compressed packages, and these files are generally not updated
> in a random-write workload.)

My main concern here would be the potential performance impact. This
wouldn't ever reduce the amount of data actually written to any block
(presumably the end of the block would be zero-filled to avoid leaking
data), so the compress + checksum would mean every data block must have
every byte processed by the filesystem.

It's bad enough even having to do something once for each block (hence
mballoc and bios to allocate and submit many blocks at once), so if this
has to compress and checksum (or vice versa) every block it could get
expensive. Ideally the compress+checksum operations would be combined,
so that only a single pass would be needed for all of the data.

> In order to distinguish between these a compressed+checksum and
> non-compressed+out-of-line checksum block, we can use a CRC-24
> checksum. In the compressed+checksum case, we store a zero in the
> first byte of the block, followed by a 3 byte checksum, followed by
> the compressed contents of the block. In the case where block can not
> be compressed, we save the high nibble of the block plus the 3 byte
> CRC-24 checksum in the out-of-line metadata block, and then we set the
> high nibble of the block to be 0xF so that there is no possibility
> that a block with an original initial byte of zero will be confused
> with a compressed+checksum block. (Why the high nibble and not the
> just the first byte of the block? We have other planned uses for
> those 4 bits; more later in this paper.)
>
> Storing the data block checksums in ext4
> ========================================
>
> There are two ways that have been discussed for storing data block
> checksums in ext4. The first approach is to dedicate every a checksum
> block every 1024 blocks, which would be sufficient to store a 4 byte
> checksum (assuming a 4k block). This approach has the advantage of
> being very simple. However, it becomes very difficult to upgrade an
> existing file system to one that supports data block checksums without
> doing the equivalet of a backup/restore operation.
>
> The second approach is to store the checksums in a per-inode structure
> which is indexed by logical block number. This approach makes is much
> simpler to upgrade an existing file system. In addition, if not all
> files need to be data integrity protected, it is less efficient. The

s/less efficient/more efficient/ to checksum only some of the files?

> case where this might become important is in the case where we are
> using a cryptographic Message Authentication Code (MAC) instead of a
> checksum. This is because a MAC is significantly larger than 4 byte
> checksum, and not all of the files in the file system might be
> encrypted and thus need cryptographic data integrity protection in
> order to protect against certain chosen plaintext attacks. In that
> case, only using a per-inode structure in those cases for those file
> blocks which require protection might make a lot of sense. (And if we
> pursue cryptographic data integrity guarantees for the ext4 encryption
> project, we will probably need to go down this route). The massive
> disadvantage of this scheme is that it is significantly more
> complicated to implement.

If e.g. SHA-256 is needed, then compress-by-32-bytes with the
inline checksum might be a lot harder than compress-by-4-bytes, but
not necessarily impossible for 4KB blocks unless they are already
compressed files.

> However, if we are going to simply intersperse the metadata blocks
> alongside the data blocks, there is no real need to do this work in
> the file system. Instead, we can actually do this work in a device
> mapper plugin instead. This has the advantage that it moves the
> complexity outside of the file system, and allows any update-in-place
> file system (including xfs, jfs, etc.) to gain the benefits data block
> checksumming. So in the next section of this paper I will outline a
> strawman design of such a dm plugin.

I think it is easier to determine at the filesystem level if the data
blocks are overwriting existing blocks or not, without the overhead
of having to send per-unlink/truncate trim commands down to a DM device.
Having this implemented in ext4 allows a lot more flexibility in how
and when to store the checksum (e.g. per-file checksum flags that are
inherited, store the checksum for small incompressible files in the inode
or in extent blocks, etc).

> Doing data block checksumming as a device-mapper plugin
> =======================================================
>
> Since we need to give this a name, for now I'm going to call this
> proposed plugin "dm-protected". (If anyone would like to suggest a
> better name, I'm all ears.)

"dm-checksum" would be better, since "protected" falsely implies that
the data is somehow protected against loss or corruption, when it only
really allows detecting the corruption and not fixing it.

> The Non-Critical Write flag
> ---------------------------
>
> First, let us define an optional extension to the Linux block layer
> which allows to provide a certain optimization when writing
> non-compressible files such as audio/video media files, which are
> typically written in a streaming fashion and which are generally not
> updated in place after they are initially written. As this
> optimization is purely optional, this feature might not be implemented
> initially, and a file system does not have to take advantage of this
> extension if it is implemented.
>
> If present, this extension allows the file system to pass a hint to
> the block device that a particular data block write is the first time
> that a newly allocated block is being written. As such, it is not
> critically important that the checksum be atomically updated when the
> data block is written, in the case where the data block can not be
> compressed such that the checksum can fit inline with the compressed
> data.
>
> XXX I'm not sure "non-critical" is the best name for this flag. It
> may be renamed if we can think of a better describe name.

Something like "write-once" or "idempotent" or similar, since that
makes it clear how this is used. I think anyone who is checksumming
their data would consider that it is "critical".

> Layout of the pm-protected device
> ---------------------------------
>
> The layout of the the dm-protected device is a 4k checksum block
> followed by 1024 data blocks. Hence, given a logical 4k block number
> (LBN) L, the checksum block associated with that LBN is located at
> physical block number (PBN):
>
> PBN_checksum = (L + 1) / 1024
>
> where '/' is an C-style integer division operation.

>
> The PBN where the data for stored at LBN can be calculated as follows:
>
> PBN_L = L + (L / 1024) + 1
>
> The checksum block is used when we need to store an out-of-line
> checksum for a particular block in its "checksum group", where we
> treat the contents of checksum block as a 4 byte integer array, and
> where the entry for a particular LBN can be found by indexing into (L
> % 1024).
>
> For redundancy purposes we calculate the metadata checksum of the
> checksum block assuming that low nibble of the first byte in each
> entry is entry, and we use the low nibbles of first byte in each entry

s/each entry is entry/each entry is zero/ ?

> to store store the first LBN for which this block is used plus the
> metdata checksum of the checksum block. We encoding the first LBN for
> the checksum block so we can identify the checksum block when it is
> copied into the Active Area (described below).
>
>
> Writing to the dm-protected device
> -----------------------------------
>
> As described earlier, when we write to the dm-protected device, the
> plugin will attempt to compress the contents of the data block. If it
> is successful at reducing the required storage size by 4 bytes, then
> it will write the block in place.
>
> If the data block is not compressible, and this is a non-critical
> write, then we update the checksum in the checksum block for that
> particular LBN range, and we write out the data block immediately, and
> then after a 5 second delay (in case there are subsequent
> non-compressible, non-critial writes, as there will probably be when
> large media file is written), we write out the modified checksum
> block.

The good news is that (IMHO) these two uses are largely exclusive.
Files that are incompressible (e.g. media) are typically write-once,
while databases and other apps that overwrite files in place do not
typically compress the data blocks.

> If the data block is not compressible, and the write is not marked as
> non-critcal, then we need to worry about making sure the data block(s)
> and the checksum block are written out transactionally. To do this, we
> write the current contents of the checksum block to a free block in
> the Active Area (AA) using FUA, which is 64 block area which is used to
> store a copy of checksum blocks for which their blocks are actively
> being modified. We then calculate the checksum for the modified data
> blocks in the checksum group, and update the checksum block in memory,
> but we do not allow any of the data blocks to be written out until one
> of the following has happened and we need to trigger a commit of the
> checksum group:
>
> *) a 5 second timer has expired
> *) we have run out of free slots in the Active Area
> *) we are under significant memory pressure and we need to release some of
> the pinned buffers for the data blocks in the checksum group
> *) the file system has requested a FLUSH CACHE operation

Why introduce a new mechanism when this could be done using data=journal
writes for incompressible data? This is essentially just implementing
jbd2 journaling with a bunch of small journals (AAs), and we could save
a lot of code complexity by re-using the existing jbd2 code to do it.

Using data=journal, if there is a crash after the commit to the journal,
the data blocks and checksums will be checkpointed to the filesystem
again if needed, or be discarded without modifying the original data
blocks if the transaction didn't commit.

The journal would only need to get involved if data blocks couldn't be
compressed, and if overwriting existing data (presumably a rare case,
but this couldn't be optimized in a dm-layer device unless it was sparse
and was tracking block usage/trim).

Using the existing jbd2 code in this case could also take advantage of
optimizations like putting the journal on a separate disk, or on flash
for fast write/commit, while the checkpoint can be done in the background
asynchronously. We could potentially allow multiple data journals per
device (multiple AAs) if there was a good reason to do so, since any
dependencies between blocks can be avoided, unlike with namespace ops.

> A commit of the checksum group consists of the following:
>
> 1) An update of the checksum block using a FUA write
> 2) Writing all of the pinned data blocks in the checksum group to disk
> 3) Sending a FLUSH CACHE request to the underlying storage
> 4) Allowing the slot in the Active Area to be used for some other checksum block
>
> Recovery after a power fail
> ---------------------------
>
> If the dm-protected device was not cleanly shut down, then we need to
> examine all of the checksum blocks in the Active Area. For each
> checksum block in the AA, the checksums for all of their data blocks
> should machine either the checksum found in the AA, or the checksum

s/machine/match/ ?

> found in the checksum block in the checksum group. Once we have which
> checksum corresponds to the data block after the unclean shutdown, we
> can update the checksum block and clear the copy found in the AA.

This is essentially journal checksums, which also already exist.

> On a clean shutdown of the dm-protected device, we can clear the
> Active Area, and so the recovery procedure will not be needed the next
> time the dm-protected device is initialized.

This is normal journal checkpoint and cleanup.

> Integration with other DM modules
> =================================
>
> If the dm-protected device is layered on dm-raid 1 setup, then if
> there is a checksum failure the dm-protected device should attempt to
> fetch the alternate copy of the device.
>
> Of course, the the dm-protected module could be layered on top of a
> dm-crypt, dm-thin module, or LVM setup.
>
> Conclution
> ==========
>
> In this paper, we have examined some of the problems of providing data
> block checksumming in ext4, and have proposed a solution which
> implements this functionality as a device-mapper plugin. For many
> file types, it is expected that using a very fast compression
> algorithm (we only need to compress the block by less than 0.1%) will
> allow us to provide data block checksumming with almost no I/O
> overhead and only a very modest amount of CPU overhead.
>
> For those file types which contain a large number of incompressible
> block, if they do not need to be updated-in-place, we can also
> minimize the overhead by avoiding the need to do a transactional
> update of the data block and the checksum block.
>
> In those cases where we do need to do a transactional update of the
> checksum block relative to the data blocks, we have outlined a very
> simple logging scheme which is both efficient and relatively easy to
> implement.
>
>


Cheers, Andreas







2014-11-04 21:39:59

by Mikulas Patocka

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4



On Mon, 3 Nov 2014, Theodore Ts'o wrote:

> But there is a way we can do even better! If we can manage to
> compress the block even by a tiny amount, so that 4k block can be
> stored in 4092 bytes (which means we need to be able to compress the
> block by 0.1%), we can store the checksum inline with the data, which
> can then be atomically updated assuming a modern drive with a 4k
> sector size (even a 512e disk will work fine, assuming the partition
> is properly 4k aligned). If the block is not sufficiently

There is still a large number of drives with 512-byte sectors in use.
So should we rather use 512-byte blocks?

> If the data block is not compressible, and the write is not marked as
> non-critcal, then we need to worry about making sure the data block(s)
> and the checksum block are written out transactionally. To do this, we
> write the current contents of the checksum block to a free block in
> the Active Area (AA) using FUA, which is 64 block area which is used to
> store a copy of checksum blocks for which their blocks are actively
> being modified. We then calculate the checksum for the modified data
> blocks in the checksum group, and update the checksum block in memory,
> but we do not allow any of the data blocks to be written out until one
> of the following has happened and we need to trigger a commit of the
> checksum group:
>
> *) a 5 second timer has expired
> *) we have run out of free slots in the Active Area
> *) we are under significant memory pressure and we need to release some of
> the pinned buffers for the data blocks in the checksum group
> *) the file system has requested a FLUSH CACHE operation
>
> A commit of the checksum group consists of the following:
>
> 1) An update of the checksum block using a FUA write
> 2) Writing all of the pinned data blocks in the checksum group to disk
> 3) Sending a FLUSH CACHE request to the underlying storage
> 4) Allowing the slot in the Active Area to be used for some other checksum block

Filesystems assume that a 512-byte write is performed atomically. If
you split the sector into the 4-bit nibble and the rest and write them
to different locations, you must make sure that either both are
modified or neither is.

I don't see how you are going to do that - other than writing the full
sector to the active area - but that results in double writing, just
like data=journal.

Note that the checksum function can have collisions, so the checksum value
doesn't tell you which 4-bit nibble belongs to the data that is in the
sector.

You could use a cryptographic hash as the checksum function - then you
can reasonably assume that there are no collisions - but a
cryptographic hash is too slow to calculate.

> Recovery after a power fail
> ---------------------------
>
> If the dm-protected device was not cleanly shut down, then we need to
> examine all of the checksum blocks in the Active Area. For each
> checksum block in the AA, the checksums for all of their data blocks
> should machine either the checksum found in the AA, or the checksum
> found in the checksum block in the checksum group.

... and if the checksum of the block matches BOTH the checksum in the AA
and the checksum in the checksum group (because of checksum function
collision), you don't know which 4-bit nibble belongs to the data in the
block.

> Once we have which
> checksum corresponds to the data block after the unclean shutdown, we
> can update the checksum block and clear the copy found in the AA.
>
> On a clean shutdown of the dm-protected device, we can clear the
> Active Area, and so the recovery procedure will not be needed the next
> time the dm-protected device is initialized.

Mikulas

2014-11-04 22:06:33

by Mikulas Patocka

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4



> > Recovery after a power fail
> > ---------------------------
> >
> > If the dm-protected device was not cleanly shut down, then we need to
> > examine all of the checksum blocks in the Active Area. For each
> > checksum block in the AA, the checksums for all of their data blocks
> > should machine either the checksum found in the AA, or the checksum
> > found in the checksum block in the checksum group.
>
> ... and if the checksum of the block matches BOTH the checksum in the AA
> and the checksum in the checksum group (because of checksum function
> collision), you don't know which 4-bit nibble belongs to the data in the
> block.

Though, I realize that you could avoid this problem by selecting an
appropriate checksum function - one that never results in a collision
if the 4-bit nibble differs.

Mikulas

2014-11-04 23:58:10

by Theodore Ts'o

Subject: Re: Some thoughts about providing data block checksumming for ext4

On Tue, Nov 04, 2014 at 02:20:44PM -0700, Andreas Dilger wrote:
>
> My main concern here would be the potential performance impact. This
> wouldn't ever reduce the amount of data actually written to any block
> (presumably the end of the block would be zero-filled to avoid leaking
> data), so the compress + checksum would mean every data block must have
> every byte processed by the filesystem.

Sure, if you are enabling data block checksumming, every byte of every
data block has to be processed by the file system anyway. That's kind
of inherent in checksumming the data block, after all! And we can use
a very fast compression algorithm since we only need to compress the
block slightly, and of course, it certainly makes sense to combine the
compression and checksumming operations.

> I think it is easier to determine at the filesystem level if the data
> blocks are overwriting existing blocks or not, without the overhead
> of having to send per-unlink/truncate trim commands down to a DM device.

As I discussed later, we simply need to pass a hint to indicate
whether a block write is overwriting pre-existing data or not.

> Having this implemented in ext4 allows a lot more flexibility in how
> and when to store the checksum (e.g. per-file checksum flags that are
> inherited, store the checksum for small incompressible files in the inode
> or in extent blocks, etc).

This is a tradeoff between complexity and flexibility, yes. If we
think users will want to checksum all of the files in the file system,
then using a dm plugin approach will be much simpler, since we won't
need all sorts of file-system level complexity (i.e., a per-inode
data structure, using some kind of b-tree or indirect block scheme).
I don't think there are enough small incompressible files that it's
worth the complexity to try to store the checksum in the inode or
extent block.

> "dm-checksum" would be better, since "protected" falsely implies that
> the data is somehow protected against loss or corruption, when it only
> really allows detecting the corruption and not fixing it.

I like dm-checksum better, thanks!! I'll update this in my next rev of
this design doc.

> Something like "write-once" or "idempotent" or similar, since that
> makes it clear how this is used. I think anyone who is checksumming
> their data would consider that it is "critical".

It's not really write-once, though. It's more like "first write".

> > For redundancy purposes we calculate the metadata checksum of the
> > checksum block assuming that low nibble of the first byte in each
> > entry is entry, and we use the low nibbles of first byte in each entry
>
> s/each entry is entry/each entry is zero/ ?

Yes, thanks.

> The good news is that (IMHO) these two uses are largely exclusive.
> Files that are incompressible (e.g. media) are typically write-once,
> while databases and other apps that overwrite files in place do not
> typically compress the data blocks.

Yes, agreed. That's one of the reasons why I think this design is
promising....

> Why introduce a new mechanism when this could be done using data=journal
> writes for incompressible data? This is essentially just implementing
> jbd2 journaling with a bunch of small journals (AAs), and we could save
> a lot of code complexity by re-using the existing jbd2 code to do it.

Three reasons: first, it may not be so simple to integrate jbd2 into
a device-mapper plugin. Second, I think the design I've outlined is
far more efficient than jbd2.

> Using data=journal, if there is a crash after the commit to the journal,
> the data blocks and checksums will be checkpointed to the filesystem
> again if needed, or be discarded without modifying the original data
> blocks if the transaction didn't commit.

Third, there are some real headaches if we use data=journal because we
need to keep track of whether a block has been previously written to
the journal, since we will need to write a revoke block if a
subsequent update to the block can use the compression+checksum
format. This is extra complexity which is going to be annoying to
track, and the extra revoke blocks add additional performance
overhead.

Cheers,

- Ted

2014-11-05 00:27:59

by Mikulas Patocka

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4



On Tue, 4 Nov 2014, Mikulas Patocka wrote:

>
>
> > > Recovery after a power fail
> > > ---------------------------
> > >
> > > If the dm-protected device was not cleanly shut down, then we need to
> > > examine all of the checksum blocks in the Active Area. For each
> > > checksum block in the AA, the checksums for all of their data blocks
> > > should machine either the checksum found in the AA, or the checksum
> > > found in the checksum block in the checksum group.
> >
> > ... and if the checksum of the block matches BOTH the checksum in the AA
> > and the checksum in the checksum group (because of checksum function
> > collision), you don't know which 4-bit nibble belongs to the data in the
> > block.
>
> Though, I realize that you could avoid this problem by selecting the
> appropriate checksum function - that never results in collision if the
> 4-bit nibble differs.

Hmm, that is still not sufficient.

Suppose that "a" and "b" is sector content without the 4-bit nibble and
"x" and "y" are two different nibbles.

Now, we have this situation:

a + x -> checksum1
b + x -> checksum1
a + y -> checksum2
b + y -> checksum2

Suppose that we do crash recovery and we have (x,checksum1) in the
checksum block and (y,checksum2) in the active area - we can't really tell
which one is valid.

So you really need cryptographic hashes instead of checksums to avoid the
collisions.

Mikulas

2014-11-05 02:33:41

by Theodore Ts'o

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4

On Tue, Nov 04, 2014 at 04:39:55PM -0500, Mikulas Patocka wrote:
>
>
> On Mon, 3 Nov 2014, Theodore Ts'o wrote:
>
> > But there is a way we can do even better! If we can manage to
> > compress the block even by a tiny amount, so that 4k block can be
> > stored in 4092 bytes (which means we need to be able to compress the
> > block by 0.1%), we can store the checksum inline with the data, which
> > can then be atomically updated assuming a modern drive with a 4k
> > sector size (even a 512e disk will work fine, assuming the partition
> > is properly 4k aligned). If the block is not sufficiently
>
> There is still large number of drives with 512-byte sectors in use. So
> we'd rather use 512-byte block?

There are a lot of systems (including Oracle IIRC) that use 4k blocks
and checksums, and accept the fact that even though writes are sent in
chunks of 4k, it's possible (although in general fairly rare) to have
"torn writes" after a power failure.

I'd much rather design for the future and not try to tie ourselves in
knots about the possibility of some torn writes on 512 byte sector
disks. Many other file systems and databases have made similar
assumptions (and in fact have for years; I remember stories about
Oracle and another enterprise database having to deal with torn
writes eight years ago).

- Ted

2014-11-05 21:37:13

by Milan Broz

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4

On 11/05/2014 01:27 AM, Mikulas Patocka wrote:

> So you really need cryptographic hashes instead of checksums to avoid the
> collisions.

I am not sure if it was mentioned, but also see how integrity is
implemented in the FreeBSD GELI system by playing with sector sizes:

http://svnweb.freebsd.org/base/head/sys/geom/eli/g_eli_integrity.c?view=co

Also, for encrypted devices (either at the file level or the block
level) I think there are still requests for implementing real crypto
authenticated modes (like GCM), which obviously need similar space for
an auth tag. (I think ZFS uses it this way.)

Milan


2014-11-06 12:55:22

by Theodore Ts'o

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4

On Wed, Nov 05, 2014 at 10:37:09PM +0100, Milan Broz wrote:
>
> Also, for encrypted devices (either on file level or block level) I think
> there are still requests for implementing real crypto authenticated modes (like GCM)
> which obviously need similar space for auth tag. (I think ZFS uses it this way.)

Yes, although it depends on your threat model. You need to worry
about known or chosen plaintext attack modes if, for example, you were
implementing the chrome browser, where the attacker might be able to
play MITM and replace web pages which would then get encrypted in the
browser cache, and where the attacker can continuously read and/or
replace blocks (say, because of some really stupid design where you
are using an unprotected iSCSI connection). Or if you assume the
attacker can remove the hard drive, twiddle some blocks, and then
surreptitiously replace the hard drive many times. In those cases,
yes, you need to worry about data integrity, because a system that
doesn't include a MAC --- such as what dm-crypt provides --- is simply
not enough.

Basically, a dm-crypt style block device encryption is only good if
your threat model is "the attacker steals the laptop and I want to
keep the contents of the storage device safe".

Michael Halcrow discussed this in this year's Linux Security Symposium:

http://kernsec.org/files/lss2014/Halcrow_EXT4_Encryption.pdf

Cheers,

- Ted

2014-11-26 23:47:14

by Darrick J. Wong

Subject: Re: Some thoughts about providing data block checksumming for ext4

Sigh...

Well, I wrote up a preliminary version of dm-checksum and then
realized that I've pretty much just built a crappier version of
dm-dedupe, but without the dedupe part. Given that it stores
checksums in a btree which claims to be robust through failures and
gives us automatic deduplication, I wonder if we could achieve our
aims by modifying dm-dedupe to verify the checksums on the read path?

I guess it would be interesting to see how bad the performance hit is
with the online dedupe part enabled or disabled. dm-dedupe v2 went
out on the mailing list last August, which I missed. :(

Unless... there's a specific reason nobody mentioned dm-dedupe here?

--D

2014-11-27 00:07:27

by Mike Snitzer

Subject: Re: Some thoughts about providing data block checksumming for ext4

On Wed, Nov 26 2014 at 6:47pm -0500,
Darrick J. Wong <[email protected]> wrote:

> Sigh...
>
> Well, I wrote up a preliminary version of dm-checksum and then
> realized that I've pretty much just built a crappier version of
> dm-dedupe, but without the dedupe part. Given that it stores
> checksums in a btree which claims to be robust through failures and
> gives us automatic deduplication, I wonder if it we could achieve our
> aims by modifying dm-dedupe to verify the checksums on the read path?
>
> I guess it would be interesting to see how bad the performance hit is
> with the online dedupe part enabled or disabled. dm-dedupe v2 went
> out on the mailing list last August, which I missed. :(
>
> Unless... there's a specific reason nobody mentioned dm-dedupe here?

As you may have seen in the dm-dedup thread, we need to actively
review/test that target (if your initial review focus is on extending it
to _optionally_ verify the checksums on the read path then so be it).

See: https://www.redhat.com/archives/dm-devel/2014-November/msg00114.html
Specifically, the git branch that builds on v2 based on my initial
review of v2:

git://git.fsl.cs.stonybrook.edu/scm/git/linux-dmdedup
branch: dm-dedup-devel

Your help on getting dm-dedup upstream would be very much appreciated.

Thanks,
Mike

2014-11-27 00:39:31

by Darrick J. Wong

Subject: Re: Some thoughts about providing data block checksumming for ext4

On Wed, Nov 26, 2014 at 07:07:22PM -0500, Mike Snitzer wrote:
> On Wed, Nov 26 2014 at 6:47pm -0500,
> Darrick J. Wong <[email protected]> wrote:
>
> > Sigh...
> >
> > Well, I wrote up a preliminary version of dm-checksum and then
> > realized that I've pretty much just built a crappier version of
> > dm-dedupe, but without the dedupe part. Given that it stores
> > checksums in a btree which claims to be robust through failures and
> > gives us automatic deduplication, I wonder if it we could achieve our
> > aims by modifying dm-dedupe to verify the checksums on the read path?
> >
> > I guess it would be interesting to see how bad the performance hit is
> > with the online dedupe part enabled or disabled. dm-dedupe v2 went
> > out on the mailing list last August, which I missed. :(
> >
> > Unless... there's a specific reason nobody mentioned dm-dedupe here?
>
> As you may have seen in the dm-dedup thread, we need to actively
> review/test that target

It was in fact today's exchange on that thread that made me slap myself on
the forehead and utter "D'oh!".

> (if your initial review focus is on extending it
> to _optionally_ verify the checksums on the read path then so be it).

Yes, sorry, I meant to say "optionally to verify" in there. Adding a minor
feature like that might be a good check to make sure I actually understand
what's going on. :)

> See: https://www.redhat.com/archives/dm-devel/2014-November/msg00114.html
> Specifically, the git branch that builds on v2 based on my initial
> review of v2:
>
> git://git.fsl.cs.stonybrook.edu/scm/git/linux-dmdedup
> branch: dm-dedup-devel
>
> Your help on getting dm-dedup upstream would be very much appreciated.

<-- reading the OLS paper, as a start. What happens to the metadata btree if
someone sets the chunk size to 4KB? Will it become ungainly huge? The thing
that I wrote simply wrote a block's worth of checksums inline with the data,
which required a certain amount of slicing and dicing of bios but wasn't too
horrible with performance.

--D

>
> Thanks,
> Mike

2015-01-23 16:46:43

by Vasily Tarasov

Subject: Re: [dm-devel] Some thoughts about providing data block checksumming for ext4

Hi Darrick,

Somehow I missed your posts in November.

You're right: the smaller the chunk size, the larger the metadata -
the standard dedup problem. The hope is that one recuperates the loss
of metadata space through a better data deduplication ratio, which
grows as the chunk size decreases.

I find having an "external" B-Tree somewhat more reasonable than
in-line metadata in dm-dedup (the situation is different for dm-checksum,
though). In dm-dedup we have to allocate blocks on-demand, keep
reference counters, and garbage collect unused blocks, etc. DM's
persistent-data library on which we rely helps with this enormously.

Vasily


On Wed, Nov 26, 2014 at 7:39 PM, Darrick J. Wong
<[email protected]> wrote:
> On Wed, Nov 26, 2014 at 07:07:22PM -0500, Mike Snitzer wrote:
>> On Wed, Nov 26 2014 at 6:47pm -0500,
>> Darrick J. Wong <[email protected]> wrote:
>>
>> > Sigh...
>> >
>> > Well, I wrote up a preliminary version of dm-checksum and then
>> > realized that I've pretty much just built a crappier version of
>> > dm-dedupe, but without the dedupe part. Given that it stores
>> > checksums in a btree which claims to be robust through failures and
>> > gives us automatic deduplication, I wonder if it we could achieve our
>> > aims by modifying dm-dedupe to verify the checksums on the read path?
>> >
>> > I guess it would be interesting to see how bad the performance hit is
>> > with the online dedupe part enabled or disabled. dm-dedupe v2 went
>> > out on the mailing list last August, which I missed. :(
>> >
>> > Unless... there's a specific reason nobody mentioned dm-dedupe here?
>>
>> As you may have seen in the dm-dedup thread, we need to actively
>> review/test that target
>
> It was in fact today's exchange on that thread that made me slap myself on
> the forehead and utter "D'oh!".
>
>> (if your initial review focus is on extending it
>> to _optionally_ verify the checksums on the read path then so be it).
>
> Yes, sorry, I meant to say "optionally to verify" in there. Adding a minor
> feature like that might be a good check to make sure I actually understand
> what's going on. :)
>
>> See: https://www.redhat.com/archives/dm-devel/2014-November/msg00114.html
>> Specifically, the git branch that builds on v2 based on my initial
>> review of v2:
>>
>> git://git.fsl.cs.stonybrook.edu/scm/git/linux-dmdedup
>> branch: dm-dedup-devel
>>
>> Your help on getting dm-dedup upstream would be very much appreciated.
>
> <-- reading the OLS paper, as a start. What happens to the metadata btree if
> someone sets the chunk size to 4KB? Will it become ungainly huge? The thing
> that I wrote simply wrote a block's worth of checksums inline with the data,
> which required a certain amount of slicing and dicing of bios but wasn't too
> horrible with performance.
>
> --D
>
>>
>> Thanks,
>> Mike