For purposes of data deduplication and data synchronisation, it would
be a powerful tool to expose file data checksums.
Since e.g. BTRFS uses the crc32c algorithm [1], it's possible to compute
the file's overall CRC by accumulating the CRCs of all its extents.
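As a rough sketch of why accumulation works: a CRC is linear over GF(2), so the CRC of a concatenation can be derived from the CRCs of the parts plus the length of the second part, without rereading the data. The Python below is illustrative only. It assumes each stored CRC uses the common crc32c convention (initial value and final XOR of 0xFFFFFFFF); whether the on-disk checksums follow that convention is an assumption, and the names crc32c_combine and _shift_zero_bytes are made up for this example.

```python
CRC32C_POLY = 0x82F63B78  # Castagnoli polynomial, bit-reflected form


def crc32c(data, crc=0):
    """Bitwise crc32c; pass a previous result as `crc` to continue a stream."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _mat_times(mat, vec):
    """Multiply a 32x32 GF(2) matrix (one int per input bit) by a vector."""
    total = 0
    i = 0
    while vec:
        if vec & 1:
            total ^= mat[i]
        vec >>= 1
        i += 1
    return total


def _mat_square(mat):
    return [_mat_times(mat, column) for column in mat]


def _shift_zero_bytes(crc, nbytes):
    """Advance a raw CRC register as if `nbytes` zero bytes were fed in."""
    # Operator for a single zero bit: bit 0 maps to the polynomial,
    # every other bit i maps to bit i - 1.
    op = [CRC32C_POLY] + [1 << i for i in range(31)]
    for _ in range(3):  # 1 bit -> 2 -> 4 -> 8 bits, i.e. one byte
        op = _mat_square(op)
    while nbytes:  # square-and-multiply over the byte count
        if nbytes & 1:
            crc = _mat_times(op, crc)
        op = _mat_square(op)
        nbytes >>= 1
    return crc


def crc32c_combine(crc_a, crc_b, len_b):
    """Return crc32c(A + B) given crc32c(A), crc32c(B) and len(B)."""
    return _shift_zero_bytes(crc_a, len_b) ^ crc_b


# Hypothetical extents of one file, in ascending offset order.
extents = [b"first extent ", b"second extent ", b"third"]
file_crc = 0
for chunk in extents:
    file_crc = crc32c_combine(file_crc, crc32c(chunk), len(chunk))
```

The combine step costs a handful of 32x32 bit-matrix operations per extent regardless of extent size, which is what would make a whole-file CRC cheap to derive from stored per-extent values.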
For now, exposing this via an IOCTL may be sufficient, though any
ideas for introducing it in a more standard way? (it's a pity that
when stat64 was introduced, reserved fields weren't added)
Thanks,
Daniel
[1] http://www.research.ibm.com/haifa/satran/ips/Vince-Luben-crc32c-01.pdf
--
Daniel J Blueman
Daniel J Blueman <[email protected]> writes:
> For purposes of data deduplication and data synchronisation, it would
> be a powerful tool to expose file data checksums.
>
> Since e.g. BTRFS uses the crc32c algorithm [1], it's possible to compute
> the file's overall CRC by accumulating the CRCs of all its extents.
>
> For now, exposing this via an IOCTL may be sufficient, though any
> ideas for introducing it in a more standard way? (it's a pity that
> when stat64 was introduced, reserved fields weren't added)
The problem of doing it in any "standard way" is that it would
hard code the way the file system does checksums in the applications.
So the file system could never change it without breaking
user space.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Jan 27, 2010 at 12:30 PM, Andi Kleen <[email protected]> wrote:
> Daniel J Blueman <[email protected]> writes:
>
>> For purposes of data deduplication and data synchronisation, it would
>> be a powerful tool to expose file data checksums.
>>
>> Since e.g. BTRFS uses the crc32c algorithm [1], it's possible to compute
>> the file's overall CRC by accumulating the CRCs of all its extents.
>>
>> For now, exposing this via an IOCTL may be sufficient, though any
>> ideas for introducing it in a more standard way? (it's a pity that
>> when stat64 was introduced, reserved fields weren't added)
>
> The problem of doing it in any "standard way" is that it would
> hard code the way the file system does checksums in the applications.
> So the file system could never change it without breaking
> user space.
I guess the filesystem would need to express this in the resulting
data structure, e.g.:
- type 1 corresponds to using the crc32c algorithm with starting seed
N and accumulating ascending over data extents, padding with modulus
remainder or sparse holes with 0
- type 2 etc
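To make the "type" idea concrete, here is a small Python sketch of a self-describing result. Everything in it is hypothetical: the names ChecksumType, FileChecksum and checksum_type1 are invented for illustration, and "padding ... with 0" is read here as zero-filling sparse holes (including a trailing one). It is not an existing interface.

```python
from dataclasses import dataclass
from enum import IntEnum


class ChecksumType(IntEnum):
    # Hypothetical "type 1": crc32c, given seed, ascending extents,
    # sparse holes treated as zeros.
    CRC32C_ASCENDING_ZERO_HOLES = 1


@dataclass(frozen=True)
class FileChecksum:
    type: ChecksumType  # tells user space how `value` was produced
    seed: int
    value: int


def crc32c(data, crc=0):
    """Bitwise crc32c; pass a previous result as `crc` to continue."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def checksum_type1(extents, file_size, seed=0):
    """extents: non-overlapping (offset, data) pairs; gaps are holes."""
    crc = seed
    pos = 0
    for offset, data in sorted(extents, key=lambda e: e[0]):
        crc = crc32c(bytes(offset - pos), crc)  # zero-fill the hole
        crc = crc32c(data, crc)
        pos = offset + len(data)
    crc = crc32c(bytes(file_size - pos), crc)  # trailing hole, if any
    return FileChecksum(ChecksumType.CRC32C_ASCENDING_ZERO_HOLES, seed, crc)


def comparable(a, b):
    """Two results are only meaningful to compare if produced the same way."""
    return a.type == b.type and a.seed == b.seed


result = checksum_type1([(0, b"abc"), (5, b"de")], file_size=8)
```

An application would check the type (and seed) first and fall back to reading the data itself when it meets a type it does not understand, so the filesystem stays free to add a type 2 later.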
The next question is: does filesystem (e.g. BTRFS) compression come
before or after checksumming?
--
Daniel J Blueman
On Wed, Jan 27, 2010 at 01:23:28PM +0000, Daniel J Blueman wrote:
> On Wed, Jan 27, 2010 at 12:30 PM, Andi Kleen <[email protected]> wrote:
> > Daniel J Blueman <[email protected]> writes:
> >
> >> For purposes of data deduplication and data synchronisation, it would
> >> be a powerful tool to expose file data checksums.
> >>
> >> Since e.g. BTRFS uses the crc32c algorithm [1], it's possible to compute
> >> the file's overall CRC by accumulating the CRCs of all its extents.
> >>
> >> For now, exposing this via an IOCTL may be sufficient, though any
> >> ideas for introducing it in a more standard way? (it's a pity that
> >> when stat64 was introduced, reserved fields weren't added)
> >
> > The problem of doing it in any "standard way" is that it would
> > hard code the way the file system does checksums in the applications.
> > So the file system could never change it without breaking
> > user space.
At the end of the day the checksums are also hard coded on disk. We
can't add a new way without continuing to support the old one.
>
> I guess the filesystem would need to express this in the resulting
> data structure, e.g.:
> - type 1 corresponds to using the crc32c algorithm with starting seed
> N and accumulating ascending over data extents, padding with modulus
> remainder or sparse holes with 0
> - type 2 etc
Yes, if they were exported to userland we'd need to export version info.
>
> The next question is: does filesystem (e.g. BTRFS) compression come
> before or after checksumming?
The checksums are based on what is on disk, so they are done on the
compressed data.
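A consequence worth spelling out for the dedup/sync use case: identical file contents can have different on-disk bytes, and therefore different on-disk checksums, if they were compressed differently. The Python sketch below only illustrates that effect. It uses zlib compression and zlib.crc32 (plain CRC-32, not crc32c) purely as stand-ins available in the standard library; the variable names are made up.

```python
import zlib

# The same logical file contents, stored twice.
payload = b"the same logical file contents " * 64

# Stand-ins for two on-disk representations of that data.
on_disk_fast = zlib.compress(payload, 1)
on_disk_best = zlib.compress(payload, 9)

# A checksum taken over what is on disk covers the compressed bytes...
crc_fast = zlib.crc32(on_disk_fast)
crc_best = zlib.crc32(on_disk_best)

# ...so it says nothing directly about the uncompressed contents.
crc_logical = zlib.crc32(payload)

same_contents = zlib.decompress(on_disk_fast) == zlib.decompress(on_disk_best)
same_on_disk = on_disk_fast == on_disk_best
print(same_contents, same_on_disk, hex(crc_fast), hex(crc_best), hex(crc_logical))
```

So an exported checksum of this kind could only be compared between files known to share the same compression settings, which is one more thing the version info would have to capture.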
-chris