2018-10-23 22:31:01

by Alasdair G Kergon

Subject: Re: [RFC] dm-bow working prototype

On Tue, Oct 23, 2018 at 02:23:28PM -0700, Paul Lawrence wrote:
> It is planned to use this driver to enable restoration of a failed
> update attempt on Android devices using ext4.

Could you say a bit more about the reason for this new dm target so we
can understand better what parameters you are trying to optimise and
within what new constraints? What are the failure modes that you need
to handle better by using this? (We can guess some answers, but it
would be better if you can lay them out so we don't need to make
assumptions.)

Alasdair



2018-10-24 18:44:50

by Paul Lawrence

Subject: Re: [RFC] dm-bow working prototype

Android has had the concept of A/B updates since Android N, which
means that if an update is unable to boot for any reason three times,
we revert to the older system. However, if the failure occurs after
the new system has started modifying userdata, we will be attempting
to start an older system with a newer userdata, which is an
unsupported state. Thus, to make A/B able to fully deliver on its
promise of safe updates, we need to be able to revert userdata in the
event of a failure.

For those cases where the file system on userdata supports
snapshots/checkpoints, we should clearly use them. However, there are
many Android devices using filesystems that do not support checkpoints,
so we need a generic solution. Here we had two options. One was to use
overlayfs to manage the changes, then on merge have a script that copies
the files to the underlying fs. This was rejected on the grounds of
compatibility concerns and the difficulty of managing the merge across
reboots, though it is definitely a plausible strategy. The second was
to work at the block
layer.

At the block layer, dm-snap would have given us a ready-made solution,
except that there is no sufficiently large spare partition on Android
devices. But in general there is free space on userdata, just scattered
over the device, and of course likely to get modified as soon as
userdata is written to. We also decided that the merge phase was a high
risk component of any design. Since the normal path is that the update
succeeds, we anticipate merges happening 99% of the time, and we want to
guarantee their success even in the event of unexpected failure during
the merge. Thus we decided we preferred a strategy where the device is
in the committed state at all times, and rollback requires work, to one
where the device remains in the original state but the merge is complex.
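
To make that tradeoff concrete, here is a toy user-space model of the
backed-up-on-write idea, in the spirit of the description above. It is
not the dm-bow code: the block count, block size and in-memory rollback
log are all invented for the illustration. Overwrites of live blocks
copy the old data into a free block first, writes into free space pass
straight through, committing just drops the log, and rolling back
replays it in reverse.

/*
 * Toy user-space model of the backup-on-write idea.  NOT the dm-bow
 * code: block count, block size and the in-memory log are made up.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 16
#define BLKSZ    8

static char disk[NBLOCKS][BLKSZ];                /* the "device"           */
static int  in_use[NBLOCKS] = { 1, 1, 1, 1 };    /* blocks 0-3 hold data   */

struct backup { int orig, copy; };               /* one rollback log entry */
static struct backup rlog[NBLOCKS];
static int nlog;

static int alloc_free_block(void)
{
        for (int i = 0; i < NBLOCKS; i++) {
                if (!in_use[i]) {
                        in_use[i] = 1;
                        return i;
                }
        }
        return -1;                               /* out of free space      */
}

/* Overwrites of live data are copied out first; writes into free space
 * just pass through - that is the space and performance win over a
 * conventional snapshot store. */
static void bow_write(int blk, const char *data)
{
        if (in_use[blk]) {
                int copy = alloc_free_block();
                if (copy >= 0) {
                        memcpy(disk[copy], disk[blk], BLKSZ);
                        rlog[nlog++] = (struct backup){ blk, copy };
                }
        }
        in_use[blk] = 1;
        memcpy(disk[blk], data, BLKSZ);
}

/* Commit: the device already holds the new data, so "merging" is just
 * dropping the log (and forgetting which blocks held backups). */
static void commit(void)
{
        for (int i = 0; i < nlog; i++)
                in_use[rlog[i].copy] = 0;        /* backup space reusable  */
        nlog = 0;
}

/* Rollback: replay the log in reverse, copying the saved data back and
 * releasing the backup blocks.  (The toy does not bother un-marking the
 * blocks that new data landed in; the old filesystem never considered
 * them allocated anyway.) */
static void rollback(void)
{
        while (nlog > 0) {
                nlog--;
                memcpy(disk[rlog[nlog].orig], disk[rlog[nlog].copy], BLKSZ);
                in_use[rlog[nlog].copy] = 0;
        }
}

int main(void)
{
        memcpy(disk[2], "olddata", BLKSZ);

        bow_write(2, "newdata");   /* overwrite: old contents backed up   */
        bow_write(9, "fresh!!");   /* previously free block: no backup    */

        rollback();                /* pretend the update failed to boot   */
        printf("block 2 after rollback: %s\n", disk[2]);   /* "olddata"   */

        bow_write(2, "take2!!");   /* try again...                        */
        commit();                  /* ...and this time it boots: nothing  */
                                   /* to merge, just drop the log         */
        return 0;
}

The real driver of course has to keep this state on the device itself
so that a rollback still works after the failed reboot; the toy model
skips all of that.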


On 10/23/2018 03:18 PM, Alasdair G Kergon wrote:
> On Tue, Oct 23, 2018 at 02:23:28PM -0700, Paul Lawrence wrote:
>> It is planned to use this driver to enable restoration of a failed
>> update attempt on Android devices using ext4.
> Could you say a bit more about the reason for this new dm target so we
> can understand better what parameters you are trying to optimise and
> within what new constraints? What are the failure modes that you need
> to handle better by using this? (We can guess some answers, but it
> would be better if you can lay them out so we don't need to make
> assumptions.)
>
> Alasdair
>


2018-10-24 19:26:40

by Mikulas Patocka

Subject: Re: [dm-devel] [RFC] dm-bow working prototype



On Wed, 24 Oct 2018, Paul Lawrence wrote:

> Android has had the concept of A/B updates since Android N, which means
> that if an update is unable to boot for any reason three times, we revert to
> the older system. However, if the failure occurs after the new system has
> started modifying userdata, we will be attempting to start an older system
> with a newer userdata, which is an unsupported state. Thus to make A/B able to
> fully deliver on its promise of safe updates, we need to be able to revert
> userdata in the event of a failure.
>
> For those cases where the file system on userdata supports
> snapshots/checkpoints, we should clearly use them. However, there are many
> Android devices using filesystems that do not support checkpoints, so we need
> a generic solution. Here we had two options. One was to use overlayfs to
> manage the changes, then on merge have a script that copies the files to the
> underlying fs. This was rejected on the grounds of compatibility concerns and
> managing the merge through reboots, though it is definitely a plausible
> strategy. The second was to work at the block layer.
>
> At the block layer, dm-snap would have given us a ready-made solution, except
> that there is no sufficiently large spare partition on Android devices. But in
> general there is free space on userdata, just scattered over the device, and
> of course likely to get modified as soon as userdata is written to. We also
> decided that the merge phase was a high risk component of any design. Since
> the normal path is that the update succeeds, we anticipate merges happening
> 99% of the time, and we want to guarantee their success even in the event of
> unexpected failure during the merge. Thus we decided we preferred a strategy
> where the device is in the committed state at all times, and rollback requires
> work, to one where the device remains in the original state but the merge is
> complex.

What about allocating a big file, using the FIEMAP ioctl to find the
physical locations of the file, creating a dm device with many linear
targets to map the big file and using it as a snapshot store? I think it
would be way easier than re-implementing the snapshot functionality in a
new target.

You can mount the whole filesystem using the "origin" target and you can
attach a "snapshot" target that uses the mapped big file as its snapshot
store - all writes will be placed directly to the device and the old data
will be copied to the snapshot store in the big file.
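
For reference, a minimal user-space sketch of the mapping step
(assumptions: the file is fully preallocated rather than sparse, a
single ioctl call returns all of its extents, and 512-byte sectors;
error handling is trimmed). It asks the kernel for the file's physical
extents with FS_IOC_FIEMAP and prints one "linear" table line per
extent, which can be fed to "dmsetup create" to build the device
backing the snapshot store:

/* fiemap_to_table.c: print a dm "linear" table mapping a preallocated
 * file's physical extents.  Sketch only: assumes the file is fully
 * allocated (e.g. via fallocate), not sparse or compressed, and that
 * all extents fit in one FIEMAP call.
 *
 * Usage:  fiemap_to_table <file> <block-device> | dmsetup create <name>
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

#define MAX_EXTENTS 512
#define SECTOR      512ULL

int main(int argc, char **argv)
{
        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <block-device>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fiemap *fm = calloc(1, sizeof(*fm) +
                                   MAX_EXTENTS * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;

        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush delayed allocation */
        fm->fm_extent_count = MAX_EXTENTS;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FS_IOC_FIEMAP");
                return 1;
        }

        /* One "linear" target per physical extent, all sizes in 512-byte
         * sectors: <logical start> <length> linear <device> <physical start>
         */
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
                struct fiemap_extent *e = &fm->fm_extents[i];

                printf("%llu %llu linear %s %llu\n",
                       (unsigned long long)(e->fe_logical / SECTOR),
                       (unsigned long long)(e->fe_length / SECTOR),
                       argv[2],
                       (unsigned long long)(e->fe_physical / SECTOR));
        }

        free(fm);
        close(fd);
        return 0;
}

The table formats for the snapshot targets themselves are documented in
Documentation/device-mapper/snapshot.txt: "snapshot-origin <origin>",
"snapshot <origin> <COW device> <P|N> <chunksize>" and "snapshot-merge"
with the same arguments for merging back.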

If you decide that rollback is no longer needed, you just unload the
snapshot target and delete the big file. If you decide that you want to
rollback, you can use the snapshot merge functionality (or you can write a
userspace utility that does offline merge).

Mikulas

2018-10-25 00:02:07

by Alasdair G Kergon

Subject: Re: [dm-devel] [RFC] dm-bow working prototype

On Wed, Oct 24, 2018 at 03:24:29PM -0400, Mikulas Patocka wrote:
> What about allocating a big file, using the FIEMAP ioctl to find the

For reference, dmfilemapd in the lvm2 tree (in daemons/) uses FIEMAP
(in libdm/libdm-stats.c) for monitoring I/O by file.

> If you decide that rollback is no longer needed, you just unload the
> snapshot target and delete the big file. If you decide that you want to
> rollback, you can use the snapshot merge functionality (or you can write a
> userspace utility that does offline merge).

There's some old code from Mark McLoughlin for userspace snapshot merging here:
https://people.gnome.org/~markmc/code/merge-dm-snapshot.c

Alasdair


2018-10-25 10:21:58

by Bryn M. Reeves

Subject: Re: [dm-devel] [RFC] dm-bow working prototype

On Wed, Oct 24, 2018 at 03:24:29PM -0400, Mikulas Patocka wrote:
>
>
> On Wed, 24 Oct 2018, Paul Lawrence wrote:
>
> > Android has had the concept of A/B updates since Android N, which means
> > that if an update is unable to boot for any reason three times, we revert to
> > the older system. However, if the failure occurs after the new system has
> > started modifying userdata, we will be attempting to start an older system
> > with a newer userdata, which is an unsupported state. Thus to make A/B able to
> > fully deliver on its promise of safe updates, we need to be able to revert
> > userdata in the event of a failure.
> >
> > For those cases where the file system on userdata supports
> > snapshots/checkpoints, we should clearly use them. However, there are many
> > Android devices using filesystems that do not support checkpoints, so we need
> > a generic solution. Here we had two options. One was to use overlayfs to
> > manage the changes, then on merge have a script that copies the files to the
> > underlying fs. This was rejected on the grounds of compatibility concerns and
> > managing the merge through reboots, though it is definitely a plausible
> > strategy. The second was to work at the block layer.
> >
> > At the block layer, dm-snap would have given us a ready-made solution, except
> > that there is no sufficiently large spare partition on Android devices. But in
> > general there is free space on userdata, just scattered over the device, and
> > of course likely to get modified as soon as userdata is written to. We also
> > decided that the merge phase was a high risk component of any design. Since
> > the normal path is that the update succeeds, we anticipate merges happening
> > 99% of the time, and we want to guarantee their success even in the event of
> > unexpected failure during the merge. Thus we decided we preferred a strategy
> > where the device is in the committed state at all times, and rollback requires
> > work, to one where the device remains in the original state but the merge is
> > complex.
>
> What about allocating a big file, using the FIEMAP ioctl to find the
> physical locations of the file, creating a dm device with many linear
> targets to map the big file and using it as a snapshot store? I think it
> would be way easier than re-implementing the snapshot functionality in a
> new target.

libdevmapper already has code to handle enumerating physical file
extents via the dm-stats file mapping support. It should be fairly easy
to adapt this to create dm tables rather than dm-stats regions.

See dm_stats_create_regions_from_fd() and _stats_map_file_regions().

Bryn.


2018-10-25 17:26:32

by Paul Lawrence

Subject: Re: [dm-devel] [RFC] dm-bow working prototype

Thank you for the suggestion. I spent part of yesterday experimenting
with this idea, and it is certainly very promising. However, it does
have some disadvantages as compared to dm-bow, if I am understanding the
setup correctly:

1) Since dm-snap has no concept of the free space on the underlying
file system, any write into free space will trigger a backup, so it
uses twice the space dm-bow would. Changing existing data will create
a backup with both drivers, but since we have to reserve the space for
the backups up-front with dm-snap, we would likely only have half the
space available for them. (For example, if an update rewrites 2GB of
existing data and writes 3GB of new files, dm-snap's store has to
absorb all 5GB of first writes, whereas dm-bow only has to back up the
2GB of overwritten data.) Either way, it seems that dm-bow is likely
to double the amount of changes we could make.

(Might it be possible to dynamically resize the backup file if it is
mostly used up? This would fix the problem of only having half the space
for changing existing data. The documentation seems to indicate that you
can increase the size of the snapshot partition, and it seems like it
should be possible to grow the underlying file without triggering a lot
of writes. OTOH, this would have to happen in userspace, which creates
other issues.)

2) Similarly, since writes into free space do not trigger a backup in
dm-bow, dm-bow is likely to have a lower performance overhead in many
circumstances. On the flip side, dm-bow's backup is in free space and
will collide with other writes, so this advantage will shrink as free
space fills up. But by choosing a suitable algorithm for how we use
free space, we might be able to retain most of it.

I intend to put together a fully working prototype of your suggestion
next, to compare it properly against dm-bow. But I do believe there is
value in
tracking free space and utilizing it in any such solution.


On 10/24/2018 12:24 PM, Mikulas Patocka wrote:
>
> On Wed, 24 Oct 2018, Paul Lawrence wrote:
>
>> Android has had the concept of A/B updates since Android N, which means
>> that if an update is unable to boot for any reason three times, we revert to
>> the older system. However, if the failure occurs after the new system has
>> started modifying userdata, we will be attempting to start an older system
>> with a newer userdata, which is an unsupported state. Thus to make A/B able to
>> fully deliver on its promise of safe updates, we need to be able to revert
>> userdata in the event of a failure.
>>
>> For those cases where the file system on userdata supports
>> snapshots/checkpoints, we should clearly use them. However, there are many
>> Android devices using filesystems that do not support checkpoints, so we need
>> a generic solution. Here we had two options. One was to use overlayfs to
>> manage the changes, then on merge have a script that copies the files to the
>> underlying fs. This was rejected on the grounds of compatibility concerns and
>> managing the merge through reboots, though it is definitely a plausible
>> strategy. The second was to work at the block layer.
>>
>> At the block layer, dm-snap would have given us a ready-made solution, except
>> that there is no sufficiently large spare partition on Android devices. But in
>> general there is free space on userdata, just scattered over the device, and
>> of course likely to get modified as soon as userdata is written to. We also
>> decided that the merge phase was a high risk component of any design. Since
>> the normal path is that the update succeeds, we anticipate merges happening
>> 99% of the time, and we want to guarantee their success even in the event of
>> unexpected failure during the merge. Thus we decided we preferred a strategy
>> where the device is in the committed state at all times, and rollback requires
>> work, to one where the device remains in the original state but the merge is
>> complex.
> What about allocating a big file, using the FIEMAP ioctl to find the
> physical locations of the file, creating a dm device with many linear
> targets to map the big file and using it as a snapshot store? I think it
> would be way easier than re-implementing the snapshot functionality in a
> new target.
>
> You can mount the whole filesystem using the "origin" target and you can
> attach a "snapshot" target that uses the mapped big file as its snapshot
> store - all writes will be placed directly to the device and the old data
> will be copied to the snapshot store in the big file.
>
> If you decide that rollback is no longer needed, you just unload the
> snapshot target and delete the big file. If you decide that you want to
> rollback, you can use the snapshot merge functionality (or you can write a
> userspace utility that does offline merge).
>
> Mikulas


2018-10-26 20:03:54

by Mikulas Patocka

Subject: Re: [dm-devel] [RFC] dm-bow working prototype



On Thu, 25 Oct 2018, Paul Lawrence wrote:

> Thank you for the suggestion. I spent part of yesterday experimenting with
> this idea, and it is certainly very promising. However, it does have some
> disadvantages as compared to dm-bow, if I am understanding the setup
> correctly:
>
> 1) Since dm-snap has no concept of the free space on the underlying file
> system any write into the free space will trigger a backup, so using twice the
> space of dm-bow. Changing existing data will create a backup with both
> drivers, but since we have to reserve the space for the backups up-front with
> dm-snap, we would likely only have half the space for that. Either way, it
> seems that dm-bow is likely to double the amount of changes we could make.
>
> (Might it be possible to dynamically resize the backup file if it is mostly
> used up? This would fix the problem of only having half the space for changing

Yes - the snapshot store can be extended while the snapshot is active.

> existing data. The documentation seems to indicate that you can increase the
> size of the snapshot partition, and it seems like it should be possible to
> grow the underlying file without triggering a lot of writes. OTOH this would
> have to happen in userspace which creates other issues.)
>
> 2) Similarly, since writes into free space do not trigger a backup in dm-bow,
> dm-bow is likely to have a lower performance overhead in many circumstances.
> On the flip side, dm-bow's backup is in free space and will collide with other
> writes, so this advantage will reduce as free space fills up. But by choosing
> a suitable algorithm for how we use free space we might be able to retain most
> of this advantage.
>
> I intend to put together a fully working prototype of your suggestion next to
> better compare with dm-bow. But I do believe there is value in tracking free
> space and utilizing it in any such solution.

The snapshot target could be hacked so that it remembers space trimmed
with REQ_OP_DISCARD and won't reallocate these blocks.

But I suspect that running discard over the whole device would degrade
performance more than copying some unneeded data.

How much data do you intend to back up with this solution?

Mikulas

2018-10-29 16:53:06

by Paul Lawrence

Subject: Re: [dm-devel] [RFC] dm-bow working prototype


> The snapshot target could be hacked so that it remembers space trimmed
> with REQ_OP_DISCARD and won't reallocate these blocks.
>
> But I suspect that running discard over the whole device would degrade
> performance more than copying some unneeded data.
>
> How much data do you intend to backup with this solution?
>
>
We are space-constrained: we will have to free up space for the backup
before we apply the update, so we have to predict its size, and keeping
usage as low as possible is therefore very important.

Also, we've discussed the resizing requirement of the dm-snap solution,
and that part is not attractive at all: it seems it would be impossible
to guarantee that the resizing happens in a timely fashion during the
(very busy) update cycle.

Thanks everyone for the insights, especially into how dm-snap works,
which I hadn't fully appreciated. At the moment, and for the above
reasons, we intend to continue with the dm-bow solution, but do want to
keep this discussion open. If anyone is going to be at Linux Plumbers,
I'll be presenting this work and would love to chat about it more.

2018-11-15 23:17:51

by Mikulas Patocka

Subject: Re: [dm-devel] [RFC] dm-bow working prototype



On Mon, 29 Oct 2018, Paul Lawrence wrote:

>
> > The snapshot target could be hacked so that it remembers space trimmed
> > with REQ_OP_DISCARD and won't reallocate these blocks.
> >
> > But I suspect that running discard over the whole device would degrade
> > performance more than copying some unneeded data.
> >
> > How much data do you intend to backup with this solution?
> >
> >
> We are space-constrained - we will have to free up space for the backup before
> we apply the update, so we have to predict the size and keeping usage as low
> as possible is thus very important.
>
> Also, we've discussed the resizing requirement of the dm-snap solution and
> that part is not attractive at all - it seems it would be impossible to
> guarantee that the resizing happens in a timely fashion during the (very busy)
> update cycle.
>
> Thanks everyone for the insights, especially into how dm-snap works, which I
> hadn't fully appreciated. At the moment, and for the above reasons, we intend
> to continue with the dm-bow solution, but do want to keep this discussion
> open. If anyone is going to be at Linux Plumbers, I'll be presenting this work
> and would love to chat about it more.

dm-snapshot took 9 years (2004-2013) to fix its last data corruption
bug (commit e9c6a182649f4259db704ae15a91ac820e63b0ca).

And with the new target duplicating the snapshot functionality, it may
be the same.

Mikulas

2018-12-02 10:08:24

by Sandeep Patil

Subject: Re: [dm-devel] [RFC] dm-bow working prototype

Hi Mikulas,

On Thu, Nov 15, 2018 at 06:15:34PM -0500, Mikulas Patocka wrote:
>
>
> On Mon, 29 Oct 2018, Paul Lawrence wrote:
>
> >
> > > The snapshot target could be hacked so that it remembers space trimmed
> > > with REQ_OP_DISCARD and won't reallocate these blocks.
> > >
> > > But I suspect that running discard over the whole device would degrade
> > > performance more than copying some unneeded data.
> > >
> > > How much data do you intend to backup with this solution?
> > >
> > >
> > We are space-constrained - we will have to free up space for the backup before
> > we apply the update, so we have to predict the size and keeping usage as low
> > as possible is thus very important.
> >
> > Also, we've discussed the resizing requirement of the dm-snap solution and
> > that part is not attractive at all - it seems it would be impossible to
> > guarantee that the resizing happens in a timely fashion during the (very busy)
> > update cycle.
> >
> > Thanks everyone for the insights, especially into how dm-snap works, which I
> > hadn't fully appreciated. At the moment, and for the above reasons, we intend
> > to continue with the dm-bow solution, but do want to keep this discussion
> > open. If anyone is going to be at Linux Plumbers, I'll be presenting this work
> > and would love to chat about it more.
>
> dm-snapshot took 9 years to fix the last data corruption bug (2004-2013 -
> the commit e9c6a182649f4259db704ae15a91ac820e63b0ca).
>
> And with the new target duplicating the snapshot functionality, it may be
> the same.
>

Thanks for that. We are just as sensitive to not duplicating
functionality and, of course, to the reliability of the implementation.

We did spend a considerable amount of time trying to make dm-snapshot
work for us (including the approach suggested here now). However, the
additional space needed to make dm-snapshot work in this situation is
unfortunate and won't work for Android, especially given that we would
be taking that space away from the user all in one go.

Anyway, I wanted to ask: is there any way we can make dm-snapshot work
the way dm-bow does? Patches are fine; we can work on that :).

I think Paul is planning to send a v2 with more description and a fix
for the block-size issue that caused problems for others trying it out.

FWIW, dm-bow itself takes a mutex for each write, which stalls for
longer when the write is to a block that already holds data (as opposed
to a write into a free block). We are hoping to iterate on that problem
once the general idea is acceptable to everyone.

Thanks for your help.

- ssp


> Mikulas