On Tuesday May 26, [email protected] wrote:
> On Tue, May 26, 2009 at 12:33:01PM +0200, Goswin von Brederlow wrote:
> > Alberto Bertogli <[email protected]> writes:
> > > On Mon, May 25, 2009 at 02:22:23PM +0200, Goswin von Brederlow wrote:
> > >> Alberto Bertogli <[email protected]> writes:
> > >> > I'm writing this device mapper target that stores checksums on writes and
> > >> > verifies them on reads.
> > >>
> > >> How does that behave on crashes? Will checksums be out of sync with data?
> > >> Will pending blocks recalculate their checksum?
> > >
> > > To guarantee consistency, two imd sectors (named M1 and M2) are kept for
> > > every 62 data sectors, and the following procedure is used to update them
> > > when a write to a given sector is required:
> > >
> > > - Read both M1 and M2.
> > > - Find out (using information stored in their headers) which one is newer.
> > > Let's assume M1 is newer than M2.
> > > - Update the M2 buffer to mark it's newer, and update the new data's CRC.
> > > - Submit the write to M2, and then the write to the data, using a barrier
> > > to make sure the metadata is updated _after_ the data.
> >
> > Consider that the disk writes the data and then the system
> > crashes. Now you have the old checksum but the new data. The checksum
> > is out of sync.
> >
> > Don't you mean that M2 is written _before_ the data? That way you have
> > the old checksum in M1 and the new in M2. One of them will match
> > > depending on whether the data gets written before a crash or not. That
> > would be more consistent with your read operation below.
>
> Yes, the comment is wrong, thanks for noticing. That is how it's implemented.
>
>
> > > Accordingly, the read operations are handled as follows:
> > >
> > > - Read both the data, M1 and M2.
> > > - Find out which one is newer. Let's assume M1 is newer than M2.
> > > - Calculate the data's CRC, and compare it to the one found in M1. If they
> > > match, the reading is successful. If not, compare it to the one found in
> > > M2. If they match, the reading is successful; otherwise, fail. If
> > > the read involves multiple sectors, it is possible that some of the
> > > correct CRCs are in M1 and some in M2.
> > >
> > >
> > > The barrier will be (it's not done yet) replaced with serialized writes for
> > > cases where the underlying block device does not support them, or when the
> > > integrity metadata resides on a different block device than the data.
> > >
> > >
> > > This scheme assumes writes to a single sector are atomic in the presence of
> > > normal crashes, which I'm not sure if it's something sane to assume in
> > > practise. If it's not, then the scheme can be modified to cope with that.
> >
> > What happens if you have multiple writes to the same sector? (assuming
> > > you meant "before" above)
> >
> > - user writes to sector
> > - queue up write for M1 and data1
> > - M1 writes
> > - user writes to sector
> > - queue up writes for M2 and data2
> > - data1 is thrown away as data2 overwrites it
> > - M2 writes
> > - system crashes
> >
> > Now both M1 and M2 have a different checksum than the old data left on
> > disk.
> >
> > Can this happen?
>
> No, parallel writes that affect the same metadata sectors will not be allowed.
> At the moment there is a rough lock which does not allow simultaneous updates
> at all, I plan to make that more fine-grained in the future.

Can I suggest a variation on the above which, I think, can cause a
problem.

 - user writes data-A' to sector-A (which currently contains data-A)
 - queue up write for M1 and data-A'
 - M1 is written correctly.
 - power fails (before data-A' is written)
 reboot
 - read sector-A, find data-A which matches checksum on M2, so
   success.

So everything is working perfectly so far...

 - write sector-B (in same 62-sector range as sector-A).
 - queue up write for M2 and data-B
 - those writes complete
 - read sector-A. find data-A, which doesn't match M1 (that has
   data-A') and doesn't match M2 (which is mostly a copy of M1),
   so the read fails.

i.e. you get a situation where writing one sector can cause another
sector to spontaneously fail.

NeilBrown

On Sun, Jun 28, 2009 at 10:34:17AM +1000, Neil Brown wrote:
> On Tuesday May 26, [email protected] wrote:
> > On Tue, May 26, 2009 at 12:33:01PM +0200, Goswin von Brederlow wrote:
> > > > This scheme assumes writes to a single sector are atomic in the presence of
> > > > normal crashes, which I'm not sure if it's something sane to assume in
> > > > practise. If it's not, then the scheme can be modified to cope with that.
> > >
> > > What happens if you have multiple writes to the same sector? (assuming
> > > > you meant "before" above)
> > >
> > > - user writes to sector
> > > - queue up write for M1 and data1
> > > - M1 writes
> > > - user writes to sector
> > > - queue up writes for M2 and data2
> > > - data1 is thrown away as data2 overwrites it
> > > - M2 writes
> > > - system crashes
> > >
> > > Now both M1 and M2 have a different checksum than the old data left on
> > > disk.
> > >
> > > Can this happen?
> >
> > No, parallel writes that affect the same metadata sectors will not be allowed.
> > At the moment there is a rough lock which does not allow simultaneous updates
> > at all, I plan to make that more fine-grained in the future.
>
> Can I suggest a variation on the above which, I think, can cause a
> problem.
>
> - user writes data-A' to sector-A (which currently contains data-A)
> - queue up write for M1 and data-A'
> - M1 is written correctly.
> - power fails (before data-A' is written)
> reboot
> - read sector-A, find data-A which matches checksum on M2, so
> success.
>
> So everything is working perfectly so far...
>
> - write sector-B (in same 62-sector range as sector-A).
> - queue up write for M2 and data-B
> - those writes complete
> - read sector-A. find data-A, which doesn't match M1 (that has
> data-A') and doesn't match M2 (which is mostly a copy of M1),
> so the read fails.

The thing is that M2 is not a copy of M1. When updating M2 for data-B, the
procedure is not "copy M1, update sector-B's checksum, write" but "read M2,
update sector-B's checksum, write". So as long as there are no writes to
sector-A, M1 will have the incorrect checksum and M2 will have the correct
one, regardless of writes to the other sectors.
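
To make the difference between the two update policies concrete, here is a
small user-space sketch. It is only an illustration, not the kernel code: the
dict-based model of the disk and of M1/M2, the function names, and the use of
zlib's CRC32 are all assumptions made for the example.

```python
import zlib


def crc(payload: bytes) -> int:
    # CRC32 stands in for whatever checksum the target really uses (assumption).
    return zlib.crc32(payload)


def read_ok(disk, m1, m2, sector) -> bool:
    """A read succeeds if the data's CRC matches the entry in M1 or in M2."""
    actual = crc(disk[sector])
    return actual in (m1[sector], m2[sector])


def replay(copy_from_newer: bool) -> bool:
    """Replay the scenario; return whether the final read of sector A succeeds."""
    disk = {"A": b"data-A", "B": b"data-B"}
    m1 = {s: crc(d) for s, d in disk.items()}  # older metadata copy
    m2 = dict(m1)                              # newer metadata copy

    # Write of data-A': M1 (the older copy) is updated, then power fails
    # before the data itself reaches the disk.
    m1["A"] = crc(b"data-A'")
    assert read_ok(disk, m1, m2, "A")          # M2 still matches data-A

    # Write to sector B: M2 is now the older copy, so it gets updated.
    if copy_from_newer:
        m2 = dict(m1)                          # hypothetical "copy M1" policy
    m2["B"] = crc(b"data-B'")                  # "read M2, update B, write"
    disk["B"] = b"data-B'"

    return read_ok(disk, m1, m2, "A")


print(replay(copy_from_newer=True))    # False: the stale CRC for A was copied into M2
print(replay(copy_from_newer=False))   # True: M2 kept the correct CRC for A
```

With the "copy M1" policy the read of sector A fails after the write to
sector B; with the in-place policy described above it keeps working.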

However, a troubling scenario based on yours could be:

 - M2 has the right checksum but is older, M1 has the wrong checksum but is
   newer.
 - user writes data-A'' to sector-A
 - queue up write for M2 (chosen because it is older)
 - M2 is written correctly
 - power fails before data-A'' is written

At that point, data-A is written at sector-A, but both M1 and M2 have
incorrect checksums for it.
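
The sequence above can be replayed with a toy model. This is a sketch under
stated assumptions: the `Meta` class, its `seq` generation counter (standing in
for the "which one is newer" header information) and CRC32 are hypothetical
names chosen for the example, not the real on-disk format.

```python
import zlib
from dataclasses import dataclass


def crc(payload: bytes) -> int:
    return zlib.crc32(payload)


@dataclass
class Meta:
    seq: int    # hypothetical generation counter: higher means newer
    crcs: dict  # sector name -> stored CRC


def read_ok(disk, m1, m2, sector) -> bool:
    """A read succeeds if the data's CRC matches the entry in M1 or in M2."""
    actual = crc(disk[sector])
    return actual == m1.crcs[sector] or actual == m2.crcs[sector]


def troubling_scenario():
    # Starting point: data-A is on disk, M2 is right but older,
    # M1 is newer but holds the CRC of data-A' (which never reached the disk).
    disk = {"A": b"data-A"}
    m1 = Meta(seq=2, crcs={"A": crc(b"data-A'")})
    m2 = Meta(seq=1, crcs={"A": crc(b"data-A")})
    before = read_ok(disk, m1, m2, "A")

    # User writes data-A'': the older copy (M2) is picked and written first.
    older = m1 if m1.seq < m2.seq else m2
    older.crcs["A"] = crc(b"data-A''")
    older.seq = max(m1.seq, m2.seq) + 1

    # Power fails here, so the disk still contains data-A.
    after = read_ok(disk, m1, m2, "A")
    return before, after


print(troubling_scenario())  # (True, False)
```

The read works before the second write and fails after the crash, because the
only correct checksum for data-A was overwritten.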

I'll try to come up with a better scheme that copes with this kind of
scenario and post an updated patch.

Thanks a lot,
		Alberto

Alberto Bertogli <[email protected]> writes:
> On Sun, Jun 28, 2009 at 10:34:17AM +1000, Neil Brown wrote:
>> On Tuesday May 26, [email protected] wrote:
>> > On Tue, May 26, 2009 at 12:33:01PM +0200, Goswin von Brederlow wrote:
>> > > > This scheme assumes writes to a single sector are atomic in the presence of
>> > > > normal crashes, which I'm not sure if it's something sane to assume in
>> > > > practise. If it's not, then the scheme can be modified to cope with that.
>> > >
>> > > What happens if you have multiple writes to the same sector? (assuming
>> > > you meant "before" above)
>> > >
>> > > - user writes to sector
>> > > - queue up write for M1 and data1
>> > > - M1 writes
>> > > - user writes to sector
>> > > - queue up writes for M2 and data2
>> > > - data1 is thrown away as data2 overwrites it
>> > > - M2 writes
>> > > - system crashes
>> > >
>> > > Now both M1 and M2 have a different checksum than the old data left on
>> > > disk.
>> > >
>> > > Can this happen?
>> >
>> > No, parallel writes that affect the same metadata sectors will not be allowed.
>> > At the moment there is a rough lock which does not allow simultaneous updates
>> > at all, I plan to make that more fine-grained in the future.
>>
>> Can I suggest a variation on the above which, I think, can cause a
>> problem.
>>
>> - user writes data-A' to sector-A (which currently contains data-A)
>> - queue up write for M1 and data-A'
>> - M1 is written correctly.
>> - power fails (before data-A' is written)
>> reboot
>> - read sector-A, find data-A which matches checksum on M2, so
>> success.
>>
>> So everything is working perfectly so far...
>>
>> - write sector-B (in same 62-sector range as sector-A).
>> - queue up write for M2 and data-B
>> - those writes complete
>> - read sector-A. find data-A, which doesn't match M1 (that has
>> data-A') and doesn't match M2 (which is mostly a copy of M1),
>> so the read fails.
>
> The thing is that M2 is not a copy of M1. When updating M2 for data-B, the
> procedure is not "copy M1, update sector-B's checksum, write" but "read M2,
> update sector-B's checksum, write". So as long as there are no writes to
> sector-A, M1 will have the incorrect checksum and M2 will have the correct
> one, regardless of writes to the other sectors.
>
> However, a troubling scenario based on yours could be:
>
> - M2 has the right checksum but is older, M1 has the wrong checksum but is
> newer.
> - user writes data-A'' to sector-A
> - queue up write for M2 (chosen because it is older)
> - M2 is written correctly
> - power fails before data-A'' is written
>
> At that point, data-A is written at sector-A, but both M1 and M2 have
> incorrect checksums for it.
>
> I'll try to come up with a better scheme that copes with this kind of
> scenario and post an updated patch.
>
> Thanks a lot,
> Alberto

When the newer block has the wrong checksum you first need to correct
that. If you find a wrong checksum on read that is easy to do. But you
won't detect this on writes.

One solution I can think of is this:

- user writes to sector A
- compare checksum of sector A in M1 and M2
  if checksums differ:
  - read sector A and calculate checksum
  - if M1 has the right checksum update M2
  - wait
- write new checksum to M1
- wait
- write data to sector A
- wait
- write new checksum to M2
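
As a rough check of the steps above, the following sketch injects a crash
after every device write and tests whether sector A is still readable. It is a
toy model under assumptions: each "wait" is modeled as one write completing
before the next starts, the repair step is read as belonging to the "if
checksums differ" branch, and all names and the CRC32 choice are hypothetical.

```python
import zlib


def crc(payload: bytes) -> int:
    return zlib.crc32(payload)


def plan_write(disk, m1, m2, sector, new_data, repair=True):
    """Return the ordered device writes for one user write to `sector`."""
    steps = []
    if repair and m1[sector] != m2[sector]:
        current = crc(disk[sector])        # read sector and calculate checksum
        if m1[sector] == current:          # M1 is right, so M2 must be stale
            steps.append(lambda: m2.__setitem__(sector, current))
    steps.append(lambda: m1.__setitem__(sector, crc(new_data)))   # new CRC to M1
    steps.append(lambda: disk.__setitem__(sector, new_data))      # data
    steps.append(lambda: m2.__setitem__(sector, crc(new_data)))   # new CRC to M2
    return steps


def survives_all_crash_points(stale, repair=True) -> bool:
    """Crash after 0..4 completed writes; True if sector A is always readable."""
    for completed in range(5):
        disk = {"A": b"data-A"}
        m1 = {"A": crc(b"data-A")}
        m2 = {"A": crc(b"data-A")}
        if stale == "M1":
            m1["A"] = crc(b"lost-write")   # M1 holds a CRC for data that never landed
        elif stale == "M2":
            m2["A"] = crc(b"lost-write")
        for action in plan_write(disk, m1, m2, "A", b"data-A-new", repair)[:completed]:
            action()
        if crc(disk["A"]) not in (m1["A"], m2["A"]):
            return False
    return True


print(survives_all_crash_points(stale=None))                 # True
print(survives_all_crash_points(stale="M1"))                 # True
print(survives_all_crash_points(stale="M2"))                 # True
print(survives_all_crash_points(stale="M2", repair=False))   # False
```

In this model at least one of M1 and M2 matches the on-disk data at every
crash point, while skipping the repair step reproduces the failure when M2 is
the stale copy.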

Regards,
        Goswin