On Wed, Jun 8, 2011 at 7:19 PM, Mike Snitzer <[email protected]> wrote:
> On Wed, Jun 8, 2011 at 11:59 AM, Amir G. <[email protected]> wrote:
>> On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
>>> Amir said:
>
>>>> The question of whether the world needs ext4 snapshots is
>>>> perfectly valid, but going back to the food analogy, I think it's
>>>> a case of "the proof of the pudding is in the eating".
>>>> I have no doubt that if ext4 snapshots are merged, many people will use it.
>>>
>>> Well, I would like to have your confidence. Why do you think so ? They
>>> will use it for what ? Doing backups ? We can do this easily with LVM
>>> without any risk of compromising existing filesystem at all. On desktop
>>
>> LVM snapshots are not meant to be long lived snapshots.
>> As temporary snapshots they are fine, but with ext4 snapshots
>> you can easily retain monthly/weekly snapshots without the
>> need to allocate the space for it in advance and without the
>> 'vanish' quality of LVM snapshots.
>
> In that old sf.net wiki you say:
> Why use Next3 snapshots and not LVM snapshots?
> * Performance: only small overhead to write performance with snapshots
>
> Fair claim against current LVM snapshot (but not multisnap).
>
> In this thread you're being very terse on the performance hit you
> assert multisnap has that ext4 snapshots does not. Can you please be
> more specific?
>
> In your most recent post it seems you're focusing on "LVM snapshots"
> and attributing the deficiencies of old-style LVM snapshots
> (non-shared exception store causing N-way copy-out) to dm-multisnap?
>
> Again, nobody will dispute that the existing dm-snapshot target has
> poor performance that requires snapshots be short-lived. But
> multisnap does _not_ suffer from those performance problems.
>
> Mike
>
Hi Mike,
I am glad that you joined the debate and I am going to start a fresh
thread for that occasion, to give your question the proper attention.
In my old next3.sf.net wiki, which I do update from time to time,
I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
* Performance: only small overhead to write performance with snapshots
* Scalability: no extra overhead per snapshot
* Maintenance: no need to pre-allocate disk space for snapshots
* Persistence: snapshots don't vanish when disk is full
As far as I know, the only thing that has changed from dm-snap
to dm-multisnap is the Scalability.
Did you resolve the Maintenance and Persistence issues?
With regard to Performance, Ext4 snapshots are inherently different
from LVM snapshots and have near zero write-performance overhead,
as the following benchmark, which I presented at LSF, demonstrates:
http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
There are several reasons for the near zero overhead:
1. Metadata buffers are always in cache when performing COW,
so there is no extra read I/O, and the write I/O of the copied pages is handled
by the journal (when flushing the snapshot file's dirty pages).
2. Data blocks are never copied
The move-on-write technique is used to re-allocate data blocks on rewrite
instead of copying them.
This is not something that can be done when the snapshot is stored on
external storage, but it can be done when the snapshot file lives in the fs.
3. Blocks allocated after the last snapshot take are never copied
nor reallocated on rewrite.
Ext4 snapshots use the fs block bitmap to know which blocks were allocated
at the time the last snapshot was taken, so new blocks are simply out of the game.
For example, in the workload of a fresh kernel build and daily snapshots,
the creation and deletion of temp files causes no extra I/O overhead whatsoever.
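(To make point 3 above concrete, here is a minimal userspace sketch of the
decision being described. It is illustrative only, with made-up names and
sizes, and is not the ext4 snapshots code: a rewrite consults the block
bitmap captured at the last snapshot take, blocks that existed then are
moved-on-write, and blocks allocated later are simply rewritten in place.)

/* illustrative sketch only, not the ext4 snapshots implementation */
#include <stdio.h>
#include <stdint.h>

#define NBLOCKS 16

static int  test_bit(const uint8_t *map, unsigned b) { return (map[b / 8] >> (b % 8)) & 1; }
static void set_bit(uint8_t *map, unsigned b)        { map[b / 8] |= (uint8_t)(1u << (b % 8)); }

static uint8_t bitmap_at_snapshot[NBLOCKS / 8]; /* copy taken at snapshot time   */
static uint8_t bitmap_now[NBLOCKS / 8];         /* live allocation bitmap        */
static unsigned next_free = 8;                  /* pretend blocks 8..15 are free */

static void rewrite_block(unsigned b)
{
    if (test_bit(bitmap_at_snapshot, b)) {
        /* visible in the last snapshot: move-on-write to a new block;
         * the old block now belongs to the snapshot file */
        unsigned newb = next_free++;
        set_bit(bitmap_now, newb);
        printf("block %2u: shared with snapshot -> remapped to block %u\n", b, newb);
    } else {
        /* allocated after the snapshot: rewrite in place, no snapshot work */
        printf("block %2u: new since last snapshot -> rewritten in place\n", b);
    }
}

int main(void)
{
    for (unsigned b = 0; b < 4; b++) {          /* blocks 0..3 predate the snapshot */
        set_bit(bitmap_at_snapshot, b);
        set_bit(bitmap_now, b);
    }
    set_bit(bitmap_now, 4);                     /* blocks 4..5 are newer, e.g. temp files */
    set_bit(bitmap_now, 5);

    rewrite_block(1);                           /* triggers move-on-write */
    rewrite_block(4);                           /* no extra I/O at all    */
    return 0;
}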
So, yes, I know: I need to run a benchmark of Ext4 snapshots vs. LVM multisnap
and post the results. When I get around to it, I'll do it.
But I really don't think that performance is how the 2 solutions
should be compared.
The way I see it, LVM snapshots are a complementary solution and they
have several advantages over Ext4 snapshots, like:
* Work with any FS
* Writable snapshots and snapshots of snapshots
* Merge a snapshot back to the main vol
We actually have one Google Summer of Code project that is going to export
an Ext4 snapshot to an LVM snapshot, in order to implement the "revert
to snapshot" functionality, which Ext4 snapshots lack.
I'll be happy to answer more questions regarding Ext4 snapshots.
Thanks,
Amir.
On 06/08/2011 11:26 AM, Amir G. wrote:
> 2. Data blocks are never copied
> The move-on-write technique is used to re-allocate data blocks on rewrite
> instead of copying them.
> This is not something that can be done when the snapshot is stored on
> external storage, but it can done when the snapshot file lives in the fs.
But does that not lead to fragmentation? And if I am understanding this,
the fragmentation will not resolve after dropping the snapshot. So while
you do save the overhead on write, you make the user pay on all future
reads (that need to hit the disk).
On Thu, Jun 9, 2011 at 2:49 AM, Sunil Mushran <[email protected]> wrote:
> On 06/08/2011 11:26 AM, Amir G. wrote:
>>
>> 2. Data blocks are never copied
>> The move-on-write technique is used to re-allocate data blocks on rewrite
>> instead of copying them.
>> This is not something that can be done when the snapshot is stored on
>> external storage, but it can done when the snapshot file lives in the fs.
>
> But does that not lead to fragmentation. And if I am understanding this,
> the fragmentation will not resolve after dropping the snapshot. So while
> you do save the overhead on write, you make the user pay on all future
> reads (that need to hit the disk).
Hi Sunil,
I am undertaking a project which aims to reduce fragmentation when the
file is rewritten. When the snapshot is deleted, the fragmentation
introduced by the snapshot will be removed once the file is rewritten. If
necessary, I think defragmentation can also be done when the file is
read.
But it is not ready yet, so it is not part of the code which is
going to be merged.
Does that answer your question?
--
Best Wishes
Yongqiang Yang
On Wed, Jun 08, 2011 at 09:26:11PM +0300, Amir G. wrote:
> In my old next3.sf.net wiki, which I do update from time to time,
> I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
> * Performance: only small overhead to write performance with snapshots
> * Scalability: no extra overhead per snapshot
> * Maintenance: no need to pre-allocate disk space for snapshots
> * Persistence: snapshots don't vanish when disk is full
>
> As far as I know, the only thing that has changed from dm-snap
> to dm-multisnap is the Scalability.
I don't think you have looked at dm-multisnap at all, have you? It
addresses all your points and many more. Take a look at the code which
is in the multisnap branch of https://github.com/jthornber/linux-2.6/,
there's also some slides on it from Linuxtag at:
https://github.com/jthornber/storage-papers/blob/master/thinp-snapshots-2011/thinp-and-multisnap.otp?raw=true
On Thu, Jun 9, 2011 at 1:52 PM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Jun 08, 2011 at 09:26:11PM +0300, Amir G. wrote:
>> In my old next3.sf.net wiki, which I do update from time to time,
>> I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
>> * Performance: only small overhead to write performance with snapshots
>> * Scalability: no extra overhead per snapshot
>> * Maintenance: no need to pre-allocate disk space for snapshots
>> * Persistence: snapshots don't vanish when disk is full
>>
>> As far as I know, the only thing that has changed from dm-snap
>> to dm-multisnap is the Scalability.
>
> I don't think you have looked at dm-multisnap at all, have you?
You are on to me. I never tried it out or looked at the patches.
I did read about the multisnap shared storage target, but didn't know
about the thinp target.
Funny, I talked with Alasdair at LSF and asked him about the status
of multisnap and he only said there have been several implementations,
but didn't mention the thin provisioning target... or maybe I
misunderstood him.
So I guess that addresses the Maintenance and Persistence issues.
I'm sure it cannot thin provision the entire space for fs and snapshots
like ext4 snapshots do, but I'll have to read about it some more.
With regard to Performance, I will just have to run benchmarks, won't I ;-)
> It addresses all your points and many more. Take a look at the code which
> is in the multisnap branch of https://github.com/jthornber/linux-2.6/,
> there's also some slides on it from Linuxtag at:
>
> https://github.com/jthornber/storage-papers/blob/master/thinp-snapshots-2011/thinp-and-multisnap.otp?raw=true
>
Thanks for the pointers!
These didn't come up in my googling.
Amir.
On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
> On Wed, Jun 8, 2011 at 7:19 PM, Mike Snitzer <[email protected]> wrote:
>> On Wed, Jun 8, 2011 at 11:59 AM, Amir G. <[email protected]> wrote:
>>> On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
>>>> Amir said:
>>
>>>>> The question of whether the world needs ext4 snapshots is
>>>>> perfectly valid, but going back to the food analogy, I think it's
>>>>> a case of "the proof of the pudding is in the eating".
>>>>> I have no doubt that if ext4 snapshots are merged, many people will use it.
>>>>
>>>> Well, I would like to have your confidence. Why do you think so ? They
>>>> will use it for what ? Doing backups ? We can do this easily with LVM
>>>> without any risk of compromising existing filesystem at all. On desktop
>>>
>>> LVM snapshots are not meant to be long lived snapshots.
>>> As temporary snapshots they are fine, but with ext4 snapshots
>>> you can easily retain monthly/weekly snapshots without the
>>> need to allocate the space for it in advance and without the
>>> 'vanish' quality of LVM snapshots.
>>
>> In that old sf.net wiki you say:
>> Why use Next3 snapshots and not LVM snapshots?
>> * Performance: only small overhead to write performance with snapshots
>>
>> Fair claim against current LVM snapshot (but not multisnap).
>>
>> In this thread you're being very terse on the performance hit you
>> assert multisnap has that ext4 snapshots does not. Can you please be
>> more specific?
>>
>> In your most recent post it seems you're focusing on "LVM snapshots"
>> and attributing the deficiencies of old-style LVM snapshots
>> (non-shared exception store causing N-way copy-out) to dm-multisnap?
>>
>> Again, nobody will dispute that the existing dm-snapshot target has
>> poor performance that requires snapshots be short-lived. But
>> multisnap does _not_ suffer from those performance problems.
>>
>> Mike
>>
>
> Hi Mike,
>
> I am glad that you joined the debate and I am going to start a fresh
> thread for that occasion, to give your question the proper attention.
>
> In my old next3.sf.net wiki, which I do update from time to time,
> I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
> * Performance: only small overhead to write performance with snapshots
> * Scalability: no extra overhead per snapshot
> * Maintenance: no need to pre-allocate disk space for snapshots
> * Persistence: snapshots don't vanish when disk is full
>
> As far as I know, the only thing that has changed from dm-snap
> to dm-multisnap is the Scalability.
>
> Did you resolve the Maintenance and Persistence issues?
>
> With Regards to Performance, Ext4 snapshots are inherently different
> then LVM snapshots and have near zero overhead to write performance
> as the following benchmark, which I presented on LSF, demonstrates:
> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
>
> There are several reasons for the near zero overhead:
>
> 1. Metadata buffers are always in cache when performing COW,
> so there is no extra read I/O and write I/O of the copied pages is handled
> by the journal (when flushing the snapshot file dirty pages).
>
> 2. Data blocks are never copied
> The move-on-write technique is used to re-allocate data blocks on rewrite
> instead of copying them.
> This is not something that can be done when the snapshot is stored on
> external storage, but it can done when the snapshot file lives in the fs.
>
> 3. New (= after last snapshot take) allocated blocks are never copied
> nor reallocated on rewrite.
> Ext4 snapshots uses the fs block bitmap, to know which blocks were allocated
> at the time the last snapshot was taken, so new blocks are just out of the game.
> For example, in the workload of a fresh kernel build and daily snapshots,
> the creation and deletion of temp files causes no extra I/O overhead whatsoever.
>
> So, yes, I know. I need to run a benchmark of Ext4 snapshots vs. LVM multisnap
> and post the results. When I'll get around to it I'll do it.
> But I really don't think that performance is how the 2 solutions
> should be compared.
>
> The way I see it, LVM snapshots are a complementary solution and they
> have several advantages over Ext4 snapshots, like:
> * Work with any FS
> * Writable snapshots and snapshots of snapshots
> * Merge a snapshot back to the main vol
>
> We actually have one Google summer of code project that is going to export
> an Ext4 snapshot to an LVM snapshot, in order to implement the "revert
> to snapshot"
> functionality, which Ext4 snapshots is lacking.
>
> I'll be happy to answer more question regarding Ext4 snapshots.
>
> Thanks,
> Amir.
>
Hi Mike,
At the beginning of this thread I wrote that "competition is good
because it makes us modest", so now I have to live up to that standard
and apologize for not learning the new LVM implementation properly
before passing judgment.
In my defense, I could not find any design papers or benchmarks on multisnap
until Christoph pointed me to some (and I was too lazy to read the code...).
Anyway, it was never my intention to bad-mouth LVM. I think LVM is a very useful
tool, and the new multisnap and thinp targets look very promising.
For the sake of letting everyone understand the differences and trade-offs
between LVM and ext4 snapshots, so ext4 snapshots can get a fair trial,
I need to ask you some questions about the implementation, which I could
not figure out by myself from reading the documents.
1. Crash resistance
How does multisnap handle system crashes?
Ext4 snapshots are journaled along with data, so they are fully
resistant to crashes.
Do you need to keep origin target writes pending in batches and issue FUA/flush
requests for the metadata and data store devices?
2. Performance
In the presentation from LinuxTag, there are 2 "meaningless benchmarks".
I suppose they are meaningless because the metadata is a linear mapping
and therefore all disk writes and reads are sequential.
Do you have any "real world" benchmarks?
I am guessing that without filesystem-level knowledge in the thin
provisioned target, files and filesystem metadata are not really laid
out on the hard drive as the filesystem designer intended.
Wouldn't that cause a large seek overhead on spinning media?
3. ENOSPC
Ext4 snapshots will get into readonly mode on an unexpected ENOSPC situation.
That is not perfect, and the best practice is to avoid getting into an
ENOSPC situation in the first place.
But most applications do know how to deal with ENOSPC and EROFS gracefully.
Do you have any "real life" experience of how applications deal with
having their write requests blocked in an ENOSPC situation?
Or what is the outcome if someone presses the reset button because of an
unexplained (to him) system halt?
4. Cache size
At the time, I examined using ZFS on an embedded system with 512MB RAM.
I wasn't able to find any official requirements, but there were
several reports around the net saying that running ZFS with less than
1GB RAM is a performance killer.
Do you have any information about recommended cache sizes to prevent
the metadata store from being a performance bottleneck?
Thank you!
Amir.
CC'ing lvm-devel and fsdevel
On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
> On Wed, Jun 8, 2011 at 7:19 PM, Mike Snitzer <[email protected]> wrote:
>> On Wed, Jun 8, 2011 at 11:59 AM, Amir G. <[email protected]> wrote:
>>> On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
>>>> Amir said:
>>
>>>>> The question of whether the world needs ext4 snapshots is
>>>>> perfectly valid, but going back to the food analogy, I think it's
>>>>> a case of "the proof of the pudding is in the eating".
>>>>> I have no doubt that if ext4 snapshots are merged, many people will use it.
>>>>
>>>> Well, I would like to have your confidence. Why do you think so ? They
>>>> will use it for what ? Doing backups ? We can do this easily with LVM
>>>> without any risk of compromising existing filesystem at all. On desktop
>>>
>>> LVM snapshots are not meant to be long lived snapshots.
>>> As temporary snapshots they are fine, but with ext4 snapshots
>>> you can easily retain monthly/weekly snapshots without the
>>> need to allocate the space for it in advance and without the
>>> 'vanish' quality of LVM snapshots.
>>
>> In that old sf.net wiki you say:
>> Why use Next3 snapshots and not LVM snapshots?
>> * Performance: only small overhead to write performance with snapshots
>>
>> Fair claim against current LVM snapshot (but not multisnap).
>>
>> In this thread you're being very terse on the performance hit you
>> assert multisnap has that ext4 snapshots does not. Can you please be
>> more specific?
>>
>> In your most recent post it seems you're focusing on "LVM snapshots"
>> and attributing the deficiencies of old-style LVM snapshots
>> (non-shared exception store causing N-way copy-out) to dm-multisnap?
>>
>> Again, nobody will dispute that the existing dm-snapshot target has
>> poor performance that requires snapshots be short-lived. But
>> multisnap does _not_ suffer from those performance problems.
>>
>> Mike
>>
>
> Hi Mike,
>
> I am glad that you joined the debate and I am going to start a fresh
> thread for that occasion, to give your question the proper attention.
>
> In my old next3.sf.net wiki, which I do update from time to time,
> I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
> * Performance: only small overhead to write performance with snapshots
> * Scalability: no extra overhead per snapshot
> * Maintenance: no need to pre-allocate disk space for snapshots
> * Persistence: snapshots don't vanish when disk is full
>
> As far as I know, the only thing that has changed from dm-snap
> to dm-multisnap is the Scalability.
>
> Did you resolve the Maintenance and Persistence issues?
>
> With Regards to Performance, Ext4 snapshots are inherently different
> then LVM snapshots and have near zero overhead to write performance
> as the following benchmark, which I presented on LSF, demonstrates:
> http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
>
> There are several reasons for the near zero overhead:
>
> 1. Metadata buffers are always in cache when performing COW,
> so there is no extra read I/O and write I/O of the copied pages is handled
> by the journal (when flushing the snapshot file dirty pages).
>
> 2. Data blocks are never copied
> The move-on-write technique is used to re-allocate data blocks on rewrite
> instead of copying them.
> This is not something that can be done when the snapshot is stored on
> external storage, but it can done when the snapshot file lives in the fs.
>
> 3. New (= after last snapshot take) allocated blocks are never copied
> nor reallocated on rewrite.
> Ext4 snapshots uses the fs block bitmap, to know which blocks were allocated
> at the time the last snapshot was taken, so new blocks are just out of the game.
> For example, in the workload of a fresh kernel build and daily snapshots,
> the creation and deletion of temp files causes no extra I/O overhead whatsoever.
>
> So, yes, I know. I need to run a benchmark of Ext4 snapshots vs. LVM multisnap
> and post the results. When I'll get around to it I'll do it.
> But I really don't think that performance is how the 2 solutions
> should be compared.
>
> The way I see it, LVM snapshots are a complementary solution and they
> have several advantages over Ext4 snapshots, like:
> * Work with any FS
> * Writable snapshots and snapshots of snapshots
> * Merge a snapshot back to the main vol
>
> We actually have one Google summer of code project that is going to export
> an Ext4 snapshot to an LVM snapshot, in order to implement the "revert
> to snapshot"
> functionality, which Ext4 snapshots is lacking.
>
> I'll be happy to answer more question regarding Ext4 snapshots.
>
> Thanks,
> Amir.
>
Hi Mike,
At the beginning of this thread I wrote that "competition is good
because it makes us modest", so now I have to live up to that standard
and apologize for not learning the new LVM implementation properly
before passing judgment.
In my defense, I could not find any design papers or benchmarks on multisnap
until Christoph pointed me to some (and I was too lazy to read the code...).
Anyway, it was never my intention to bad-mouth LVM. I think LVM is a very useful
tool, and the new multisnap and thinp targets look very promising.
For the sake of letting everyone understand the differences and trade-offs
between LVM and ext4 snapshots, so ext4 snapshots can get a fair trial,
I need to ask you some questions about the implementation, which I could
not figure out by myself from reading the documents.
1. Crash resistance
How does multisnap handle system crashes?
Ext4 snapshots are journaled along with data, so they are fully
resistant to crashes.
Do you need to keep origin target writes pending in batches and issue FUA/flush
requests for the metadata and data store devices?
2. Performance
In the presentation from LinuxTag, there are 2 "meaningless benchmarks".
I suppose they are meaningless because the metadata is a linear mapping
and therefore all disk writes and reads are sequential.
Do you have any "real world" benchmarks?
I am guessing that without filesystem-level knowledge in the thin
provisioned target, files and filesystem metadata are not really laid
out on the hard drive as the filesystem designer intended.
Wouldn't that cause a large seek overhead on spinning media?
3. ENOSPC
Ext4 snapshots will get into readonly mode on an unexpected ENOSPC situation.
That is not perfect, and the best practice is to avoid getting into an
ENOSPC situation in the first place.
But most applications do know how to deal with ENOSPC and EROFS gracefully.
Do you have any "real life" experience of how applications deal with
having their write requests blocked in an ENOSPC situation?
Or what is the outcome if someone presses the reset button because of an
unexplained (to him) system halt?
4. Cache size
At the time, I examined using ZFS on an embedded system with 512MB RAM.
I wasn't able to find any official requirements, but there were
several reports around the net saying that running ZFS with less than
1GB RAM is a performance killer.
Do you have any information about recommended cache sizes to prevent
the metadata store from being a performance bottleneck?
Thank you!
Amir.
On Fri, 10 Jun 2011, Amir G. wrote:
> CC'ing lvm-devel and fsdevel
>
>
> On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
> > On Wed, Jun 8, 2011 at 7:19 PM, Mike Snitzer <[email protected]> wrote:
> >> On Wed, Jun 8, 2011 at 11:59 AM, Amir G. <[email protected]> wrote:
> >>> On Wed, Jun 8, 2011 at 6:38 PM, Lukas Czerner <[email protected]> wrote:
> >>>> Amir said:
> >>
> >>>>> The question of whether the world needs ext4 snapshots is
> >>>>> perfectly valid, but going back to the food analogy, I think it's
> >>>>> a case of "the proof of the pudding is in the eating".
> >>>>> I have no doubt that if ext4 snapshots are merged, many people will use it.
> >>>>
> >>>> Well, I would like to have your confidence. Why do you think so ? They
> >>>> will use it for what ? Doing backups ? We can do this easily with LVM
> >>>> without any risk of compromising existing filesystem at all. On desktop
> >>>
> >>> LVM snapshots are not meant to be long lived snapshots.
> >>> As temporary snapshots they are fine, but with ext4 snapshots
> >>> you can easily retain monthly/weekly snapshots without the
> >>> need to allocate the space for it in advance and without the
> >>> 'vanish' quality of LVM snapshots.
> >>
> >> In that old sf.net wiki you say:
> >> Why use Next3 snapshots and not LVM snapshots?
> >> * Performance: only small overhead to write performance with snapshots
> >>
> >> Fair claim against current LVM snapshot (but not multisnap).
> >>
> >> In this thread you're being very terse on the performance hit you
> >> assert multisnap has that ext4 snapshots does not. Can you please be
> >> more specific?
> >>
> >> In your most recent post it seems you're focusing on "LVM snapshots"
> >> and attributing the deficiencies of old-style LVM snapshots
> >> (non-shared exception store causing N-way copy-out) to dm-multisnap?
> >>
> >> Again, nobody will dispute that the existing dm-snapshot target has
> >> poor performance that requires snapshots be short-lived. But
> >> multisnap does _not_ suffer from those performance problems.
> >>
> >> Mike
> >>
> >
> > Hi Mike,
> >
> > I am glad that you joined the debate and I am going to start a fresh
> > thread for that occasion, to give your question the proper attention.
> >
> > In my old next3.sf.net wiki, which I do update from time to time,
> > I listed 4 advantages of Ext4 (then next3) snapshots over LVM:
> > * Performance: only small overhead to write performance with snapshots
> > * Scalability: no extra overhead per snapshot
> > * Maintenance: no need to pre-allocate disk space for snapshots
> > * Persistence: snapshots don't vanish when disk is full
> >
> > As far as I know, the only thing that has changed from dm-snap
> > to dm-multisnap is the Scalability.
> >
> > Did you resolve the Maintenance and Persistence issues?
> >
> > With Regards to Performance, Ext4 snapshots are inherently different
> > then LVM snapshots and have near zero overhead to write performance
> > as the following benchmark, which I presented on LSF, demonstrates:
> > http://global.phoronix-test-suite.com/index.php?k=profile&u=amir73il-4632-11284-26560
> >
> > There are several reasons for the near zero overhead:
> >
> > 1. Metadata buffers are always in cache when performing COW,
> > so there is no extra read I/O and write I/O of the copied pages is handled
> > by the journal (when flushing the snapshot file dirty pages).
> >
> > 2. Data blocks are never copied
> > The move-on-write technique is used to re-allocate data blocks on rewrite
> > instead of copying them.
> > This is not something that can be done when the snapshot is stored on
> > external storage, but it can done when the snapshot file lives in the fs.
> >
> > 3. New (= after last snapshot take) allocated blocks are never copied
> > nor reallocated on rewrite.
> > Ext4 snapshots uses the fs block bitmap, to know which blocks were allocated
> > at the time the last snapshot was taken, so new blocks are just out of the game.
> > For example, in the workload of a fresh kernel build and daily snapshots,
> > the creation and deletion of temp files causes no extra I/O overhead whatsoever.
> >
> > So, yes, I know. I need to run a benchmark of Ext4 snapshots vs. LVM multisnap
> > and post the results. When I'll get around to it I'll do it.
> > But I really don't think that performance is how the 2 solutions
> > should be compared.
> >
> > The way I see it, LVM snapshots are a complementary solution and they
> > have several advantages over Ext4 snapshots, like:
> > * Work with any FS
> > * Writable snapshots and snapshots of snapshots
> > * Merge a snapshot back to the main vol
> >
> > We actually have one Google summer of code project that is going to export
> > an Ext4 snapshot to an LVM snapshot, in order to implement the "revert
> > to snapshot"
> > functionality, which Ext4 snapshots is lacking.
> >
> > I'll be happy to answer more question regarding Ext4 snapshots.
> >
> > Thanks,
> > Amir.
> >
>
Adding ejt to the discussion.
>
> Hi Mike,
>
> In the beginning of this thread I wrote that "competition is good
> because it makes us modest",
> so now I have to live up to this standard and apologize for not
> learning the new LVM
> implementation properly before passing judgment.
>
> To my defense, I could not find any design papers and benchmarks on multisnap
> until Christoph had pointed me to some (and was too lazy to read the code...)
>
> Anyway, it was never my intention to bad mouth LVM. I think LVM is a very useful
> tool and the new multisnap and thinp targets look very promising.
>
> For the sake of letting everyone understand the differences and trade
> offs between
> LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
> to ask you
> some questions about the implementation, which I could not figure out by myself
> from reading the documents.
>
> 1. Crash resistance
> How is multisnap handling system crashes?
> Ext4 snapshots are journaled along with data, so they are fully
> resistant to crashes.
> Do you need to keep origin target writes pending in batches and issue FUA/flush
> request for the metadata and data store devices?
>
> 2. Performance
> In the presentation from LinuxTag, there are 2 "meaningless benchmarks".
> I suppose they are meaningless because the metadata is linear mapping
> and therefor all disk writes and read are sequential.
> Do you have any "real world" benchmarks?
> I am guessing that without the filesystem level knowledge in the thin
> provisioned target,
> files and filesystem metadata are not really laid out on the hard
> drive as the filesystem
> designer intended.
> Wouldn't that be causing a large seek overhead on spinning media?
>
> 3. ENOSPC
> Ext4 snapshots will get into readonly mode on unexpected ENOSPC situation.
> That is not perfect and the best practice is to avoid getting to
> ENOSPC situation.
> But most application do know how to deal with ENOSPC and EROFS gracefully.
> Do you have any "real life" experience of how applications deal with
> blocking the
> write request in ENOSPC situation?
> Or what is the outcome if someone presses the reset button because of an
> unexplained (to him) system halt?
>
> 4. Cache size
> At the time, I examined using ZFS on an embedded system with 512MB RAM.
> I wasn't able to find any official requirements, but there were
> several reports around
> the net saying that running ZFS with less that 1GB RAM is a performance killer.
> Do you have any information about recommended cache sizes to prevent
> the metadata store from being a performance bottleneck?
>
> Thank you!
> Amir.
>
--
On Fri, Jun 10, 2011 at 11:01:41AM +0200, Lukas Czerner wrote:
> On Fri, 10 Jun 2011, Amir G. wrote:
>
> > CC'ing lvm-devel and fsdevel
> >
> >
> > On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
> > For the sake of letting everyone understand the differences and trade
> > offs between
> > LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
> > to ask you
> > some questions about the implementation, which I could not figure out by myself
> > from reading the documents.
First up let me say that I'm not intending to support writeable
_external_ origins with multisnap. This will come as a surprise to
many people, but I don't think we can resolve the dual requirements to
efficiently update many, many snapshots when a write occurs _and_ make
those snapshots quick to delete (when you're encouraging people to
take lots of snapshots, the performance of delete becomes a real issue).
One benefit of this decision is that there is no copying from an
external origin into the multisnap data store.
For internal snapshots (a snapshot of a thin provisioned volume, or
recursive snapshot), copy-on-write does occur. If you keep the
snapshot block size small, however, you find that this copying can
often be elided since the new data completely overwrites the old.
This avoidance of copying, and the use of FUA/FLUSH to schedule
commits, means that performance is much better than the old snaps. It
won't be as fast as ext4 snapshots, it can't be: we don't know what the
bios contain, unlike ext4. But I think the performance will be good
enough that many people will be happy with this more general solution
rather than committing to a particular file system. There will be use
cases where snapshotting at the fs level is the only option.
> > 1. Crash resistance
> > How is multisnap handling system crashes?
> > Ext4 snapshots are journaled along with data, so they are fully
> > resistant to crashes.
> > Do you need to keep origin target writes pending in batches and issue FUA/flush
> > request for the metadata and data store devices?
FUA/flush allows us to treat multisnap devices as if they are devices
with a write cache. When a FUA/FLUSH bio comes in we ensure we commit
metadata before allowing the bio to continue. A crash will lose data
that is in the write cache, same as any real block device with a write
cache.
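(As an aside, the ordering described in the previous paragraph can be
pictured with the toy userspace sketch below: ordinary writes only stage
mappings in memory, and a FUA/FLUSH forces a metadata commit before it
completes, exactly like a disk with a volatile write cache. Names and
structure are invented for illustration; this is not the dm-multisnap code.)

#include <stdio.h>
#include <stdbool.h>

static bool metadata_dirty;     /* in-core mapping changes not yet on disk */

static void remap_write(unsigned block)
{
    metadata_dirty = true;      /* new mapping recorded only in memory */
    printf("write to block %u: mapping staged in-core\n", block);
}

static void commit_metadata(void)
{
    if (metadata_dirty) {
        printf("commit metadata transaction to the metadata device\n");
        metadata_dirty = false;
    }
}

static void handle_flush_or_fua(void)
{
    commit_metadata();          /* the bio is held until this has hit the media */
    printf("FUA/FLUSH completed: acknowledged data is now crash-safe\n");
}

int main(void)
{
    remap_write(10);
    remap_write(11);
    handle_flush_or_fua();      /* e.g. the filesystem above issued an fsync() */
    return 0;
}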
> > 2. Performance
> > In the presentation from LinuxTag, there are 2 "meaningless benchmarks".
> > I suppose they are meaningless because the metadata is linear mapping
> > and therefor all disk writes and read are sequential.
> > Do you have any "real world" benchmarks?
Not that I'm happy with. For me 'real world' means a realistic use of
snapshots. We've not had this ability to create lots of snapshots
before in Linux, so I'm not sure how people are going to use it. I'll
get round to writing some benchmarks for certain scenarios eventually
(eg. incremental backups), but atm there are more pressing issues.
I mainly called those benchmarks meaningless because they didn't
address how fragmented the volumes become over time. This
fragmentation is a function of io pattern, and the shape of the
snapshot tree. In the same way I think filesystem benchmarks that
write lots of files to a freshly formatted volume are also pretty
meaningless. What most people are interested in is how the system
will be performing after they've used it for six months, not the first
five minutes.
> > I am guessing that without the filesystem level knowledge in the thin
> > provisioned target,
> > files and filesystem metadata are not really laid out on the hard
> > drive as the filesystem
> > designer intended.
> > Wouldn't that be causing a large seek overhead on spinning media?
You're absolutely right.
> > 3. ENOSPC
> > Ext4 snapshots will get into readonly mode on unexpected ENOSPC situation.
> > That is not perfect and the best practice is to avoid getting to
> > ENOSPC situation.
> > But most application do know how to deal with ENOSPC and EROFS gracefully.
> > Do you have any "real life" experience of how applications deal with
> > blocking the
> > write request in ENOSPC situation?
If you run out of space userland needs to extend the data volume. The
multisnap-pool target notifies userland (ie. dmeventd) before it
actually runs out. If userland hasn't resized the volume before it
runs out of space then the ios will be paused. This pausing is really
no different from suspending a dm device, something LVM has been doing
for 10 years. So yes, we have experience of pausing io under
applications, and the 'notify userland' mechanism is already proven.
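(A toy model of that behaviour, with all numbers and names invented for
illustration: warn userland before the pool runs out, and pause rather
than fail io if it does.)

#include <stdio.h>
#include <stdbool.h>

static unsigned long free_blocks = 120;          /* pretend pool size */
static const unsigned long low_water_mark = 100; /* made-up threshold */
static bool userland_notified;

static void handle_write(unsigned long nblocks)
{
    if (free_blocks < nblocks) {
        printf("pool exhausted: queue the io until the data volume is extended\n");
        return;
    }
    free_blocks -= nblocks;
    printf("write of %lu blocks mapped, %lu free blocks left\n", nblocks, free_blocks);

    if (free_blocks <= low_water_mark && !userland_notified) {
        printf("low-water mark crossed: notify userland (dmeventd) to extend the volume\n");
        userland_notified = true;
    }
}

int main(void)
{
    handle_write(30);   /* crosses the low-water mark -> userland notified */
    handle_write(80);   /* still fits in the remaining space               */
    handle_write(50);   /* would overflow -> io paused, not failed         */
    return 0;
}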
> > Or what is the outcome if someone presses the reset button because of an
> > unexplained (to him) system halt?
See my answer above on crash resistance.
> > 4. Cache size
> > At the time, I examined using ZFS on an embedded system with 512MB RAM.
> > I wasn't able to find any official requirements, but there were
> > several reports around
> > the net saying that running ZFS with less that 1GB RAM is a performance killer.
> > Do you have any information about recommended cache sizes to prevent
> > the metadata store from being a performance bottleneck?
The ideal cache size depends on your io patterns. It also depends on
the data block size you've chosen. The cache is divided into 4k
blocks, and each block holds ~256 mapping entries.
Unlike ZFS our metadata is very simple.
Those little micro benchmarks (dd and bonnie++) running on a little 4G
data volume perform nicely with only a 64k cache. So in the worst
case I was envisaging a few meg for the cache, rather than a few
hundred meg.
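(A quick back-of-the-envelope check of those numbers: the 64k data block
size below is an assumption made for illustration, the ~256 mappings per
4k metadata block is the figure quoted above, and btree internal nodes
are ignored.)

#include <stdio.h>

int main(void)
{
    const unsigned long long mappings_per_metadata_block = 256;      /* from the text   */
    const unsigned long long data_block_size = 64ULL * 1024;         /* assumed 64k     */
    const unsigned long long cache_size      = 64ULL * 1024;         /* the 64k example */
    const unsigned long long cache_blocks    = cache_size / 4096;    /* 4k cache blocks */

    unsigned long long data_covered =
        cache_blocks * mappings_per_metadata_block * data_block_size;

    printf("a %lluk cache holds %llu metadata blocks, covering mappings for "
           "%llu MiB of data volume\n",
           cache_size / 1024, cache_blocks, data_covered >> 20);
    return 0;
}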
- Joe
On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 11:01:41AM +0200, Lukas Czerner wrote:
>> On Fri, 10 Jun 2011, Amir G. wrote:
>>
>> > CC'ing lvm-devel and fsdevel
>> >
>> >
>> > On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
>> > For the sake of letting everyone understand the differences and trade
>> > offs between
>> > LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
>> > to ask you
>> > some questions about the implementation, which I could not figure out by myself
>> > from reading the documents.
>
> First up let me say that I'm not intending to support writeable
> _external_ origins with multisnap. This will come as a suprise to
> many people, but I don't think we can resolve the dual requirements to
> efficiently update many, many snapshots when a write occurs _and_ make
> those snapshots quick to delete (when you're encouraging people to
> take lots of snapshots performance of delete becomes a real issue).
>
OK, that is an interesting point for people to understand.
There is a distinct trade-off at hand.
LVM multisnap gives you lots of features and can be used with
any filesystem.
The cost you are paying for all the wonderful features it provides is
a fragmented origin, which, we both agree, is likely to have performance
costs as the filesystem ages.
Ext4 snapshots, on the other hand, are very limited in features
(i.e. only readonly snapshots of the origin), but the origin's on-disk layout
remains un-fragmented and optimized for spinning media and the underlying
RAID array storage.
Ext4 snapshots also cause fragmentation of files in random write
workloads, but this is a problem that can be, and is being, fixed.
> One benefit of this decision is that there is no copying from an
> external origin into the multisnap data store.
>
> For internal snapshots (a snapshot of a thin provisioned volume, or
> recursive snapshot), copy-on-write does occur. If you keep the
> snapshot block size small, however, you find that this copying can
> often be elided since the new data completely overwrites the old.
>
> This avoidance of copying, and the use of FUA/FLUSH to schedule
> commits means that performance is much better than the old snaps. It
> wont be as fast as ext4 snapshots, it can't be, we don't know what the
> bios contain, unlike ext4. But I think the performance will be good
> enough that many people will be happy with this more general solution
> rather than committing to a particular file system. There will be use
> cases where snapshotting at the fs level is the only option.
>
I have to agree with you. I do not think that the performance factor
is going to be a show stopper for most people.
I do think that LVM performance will be good enough
and that many people will be happy with the more general solution.
Especially those who can afford an SSD in their system.
The question is, are there enough people in the 'real world', with
enough varying use cases, that many will also find the ext4 snapshots feature set
good enough and will want to enjoy better and consistent read/write performance
to the origin, which does not degrade as the filesystem ages?
Clearly, we will need to come up with some 'real world' benchmarks before
we can provide an intelligent answer to that question.
>> > 1. Crash resistance
>> > How is multisnap handling system crashes?
>> > Ext4 snapshots are journaled along with data, so they are fully
>> > resistant to crashes.
>> > Do you need to keep origin target writes pending in batches and issue FUA/flush
>> > request for the metadata and data store devices?
>
> FUA/flush allows us to treat multisnap devices as if they are devices
> with a write cache. When a FUA/FLUSH bio comes in we ensure we commit
> metadata before allowing the bio to continue. A crash will lose data
> that is in the write cache, same as any real block device with a write
> cache.
>
Now, here I am confused.
Reducing the problem to a write-cache-enabled device sounds valid,
but I am not yet convinced it is enough.
In ext4 snapshots I had to deal with 'internal ordering' between the I/O
of origin data and snapshot metadata and data.
That means that every single I/O to the origin which overwrites shared data
must hit the media *after* the original data has been copied to the snapshot
and the snapshot metadata and data are secure on the media.
In ext4 this is done with the help of JBD2, which anyway holds back metadata
writes until commit.
It could be that this problem is only relevant to an _external_ origin, which
is not supported for multisnap, but frankly, as I said, I am too confused
to figure out whether there is an ordering problem for an _internal_ origin or not.
>> > 2. Performance
>> > In the presentation from LinuxTag, there are 2 "meaningless benchmarks".
>> > I suppose they are meaningless because the metadata is linear mapping
>> > and therefor all disk writes and read are sequential.
>> > Do you have any "real world" benchmarks?
>
> Not that I'm happy with. For me 'real world' means a realistic use of
> snapshots. We've not had this ability to create lots of snapshots
> before in Linux, so I'm not sure how people are going to use it. I'll
> get round to writing some benchmarks for certain scenarios eventually
> (eg. incremental backups), but atm there are more pressing issues.
>
> I mainly called those benchmarks meaningless because they didn't
> address how fragmented the volumes become over time. This
> fragmentation is a function of io pattern, and the shape of the
> snapshot tree. In the same way I think filesystem benchmarks that
> write lots of files to a freshly formatted volume are also pretty
> meaningless. What most people are interested in is how the system
> will be performing after they've used it for six months, not the first
> five minutes.
>
>> > I am guessing that without the filesystem level knowledge in the thin
>> > provisioned target,
>> > files and filesystem metadata are not really laid out on the hard
>> > drive as the filesystem
>> > designer intended.
>> > Wouldn't that be causing a large seek overhead on spinning media?
>
> You're absolutely right.
>
>> > 3. ENOSPC
>> > Ext4 snapshots will get into readonly mode on unexpected ENOSPC situation.
>> > That is not perfect and the best practice is to avoid getting to
>> > ENOSPC situation.
>> > But most application do know how to deal with ENOSPC and EROFS gracefully.
>> > Do you have any "real life" experience of how applications deal with
>> > blocking the
>> > write request in ENOSPC situation?
>
> If you run out of space userland needs to extend the data volume. the
> multisnap-pool target notifies userland (ie. dmeventd) before it
> actually runs out. If userland hasn't resized the volume before it
> runs out of space then the ios will be paused. This pausing is really
> no different from suspending a dm device, something LVM has been doing
> for 10 years. So yes, we have experience of pausing io under
> applications, and the 'notify userland' mechanism is already proven.
>
>> > Or what is the outcome if someone presses the reset button because of an
>> > unexplained (to him) system halt?
>
> See my answer above on crash resistance.
>
>> > 4. Cache size
>> > At the time, I examined using ZFS on an embedded system with 512MB RAM.
>> > I wasn't able to find any official requirements, but there were
>> > several reports around
>> > the net saying that running ZFS with less that 1GB RAM is a performance killer.
>> > Do you have any information about recommended cache sizes to prevent
>> > the metadata store from being a performance bottleneck?
>
> The ideal cache size depends on your io patterns. It also depends on
> the data block size you've chosen. The cache is divided into 4k
> blocks, and each block holds ~256 mapping entries.
>
> Unlike ZFS our metadata is very simple.
>
> Those little micro benchmarks (dd and bonnie++) running on a little 4G
> data volume perform nicely with only a 64k cache. So in the worst
> case I was envisaging a few meg for the cache, rather than a few
> hundred meg.
>
> - Joe
>
Thanks for your elaborate answers!
Amir.
On Fri, Jun 10, 2011 at 05:15:37PM +0300, Amir G. wrote:
> On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
> > FUA/flush allows us to treat multisnap devices as if they are devices
> > with a write cache. When a FUA/FLUSH bio comes in we ensure we commit
> > metadata before allowing the bio to continue. A crash will lose data
> > that is in the write cache, same as any real block device with a write
> > cache.
> >
>
> Now, here I am confused.
> Reducing the problem to write cache enabled device sounds valid,
> but I am not yet convinced it is enough.
> In ext4 snapshots I had to deal with 'internal ordering' between I/O
> of origin data and snapshot metadata and data.
> That means that every single I/O to origin, which overwrites shared data,
> must hit the media *after* the original data has been copied to snapshot
> and the snapshot metadata and data are secure on media.
> In ext4 this is done with the help of JBD2, which anyway holds back metadata
> writes until commit.
> It could be that this problem is only relevant to _extenal_ origin, which
> are not supported for multisnap, but frankly, as I said, I am too confused
> to figure out if there is yet an ordering problem for _internal_ origin or not.
Ok, let me talk you through my solution. The relevant code is here if
you want to sing along:
https://github.com/jthornber/linux-2.6/blob/multisnap/drivers/md/dm-multisnap.c
We use a standard copy-on-write btree to store the mappings for the
devices (note I'm talking about copy-on-write of the metadata here,
not the data). When you take an internal snapshot you clone the root
node of the origin btree. After this there is no concept of an
origin or a snapshot. They are just two device trees that happen to
point to the same data blocks.
When we get a write in we decide if it's to a shared data block using
some timestamp magic. If it is, we have to break sharing.
Let's say we write to a shared block in what was the origin. The
steps are:
i) plug io further to this physical block. (see bio_prison code).
ii) quiesce any read io to that shared data block. Obviously
including all devices that share this block. (see deferred_set code)
iii) copy the data block to a newly allocated block. This step can be
missed out if the io covers the block. (schedule_copy).
iv) insert the new mapping into the origin's btree
(process_prepared_mappings). This act of inserting breaks some
sharing of btree nodes between the two devices. Breaking sharing only
effects the btree of that specific device. Btrees for the other
devices that share the block never change. The btree for the origin
device as it was after the last commit is untouched, ie. we're using
persistent data structures in the functional programming sense.
v) unplug io to this physical block, including the io that triggered
the breaking of sharing.
Steps (ii) and (iii) occur in parallel.
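(The sequence above, condensed into a runnable userspace sketch. The
function bodies only narrate the steps; bio_prison, deferred_set and
schedule_copy are referenced in comments, and the real work lives in the
dm-multisnap code linked above.)

#include <stdio.h>

static void plug_io(unsigned pblock)            /* step i, cf. bio_prison      */
{ printf("i)   plug further io to physical block %u\n", pblock); }

static void quiesce_reads(unsigned pblock)      /* step ii, cf. deferred_set   */
{ printf("ii)  wait for in-flight reads of block %u on all sharing devices\n", pblock); }

static unsigned copy_block(unsigned pblock, int write_covers_block)  /* step iii */
{
    unsigned newb = pblock + 100;               /* stand-in for the allocator  */
    if (write_covers_block)
        printf("iii) copy skipped: the write overwrites the whole block\n");
    else
        printf("iii) copy block %u -> newly allocated block %u (schedule_copy)\n",
               pblock, newb);
    return newb;
}

static void insert_mapping(unsigned lblock, unsigned newb)           /* step iv  */
{ printf("iv)  insert %u -> %u into this device's btree, COWing shared nodes\n",
         lblock, newb); }

static void unplug_io(unsigned pblock)                               /* step v   */
{ printf("v)   release held io to block %u, including the triggering write\n", pblock); }

int main(void)
{
    unsigned logical = 7, physical = 42;

    plug_io(physical);
    quiesce_reads(physical);                    /* ii and iii run in parallel  */
    unsigned newb = copy_block(physical, 0);
    insert_mapping(logical, newb);
    unplug_io(physical);
    return 0;
}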
The main difference to what you described is the metadata _doesn't_
need to be committed before the io continues. We get away with this
because the io is always written to a _new_ block. If there's a
crash, then:
- The origin mapping will point to the old origin block (the shared
one). This will contain the data as it was before the io that
triggered the breaking of sharing came in.
- The snap mapping still points to the old block. As it would after
the commit.
The downside of this scheme is the timestamp magic isn't perfect, and
will continue to think that the data block in the snapshot device is
shared even after the write to the origin has broken sharing. I
suspect data blocks will typically be shared by many different
devices, so we're breaking sharing n + 1 times, rather than n, where n
is the number of devices that reference this data block. At the
moment I think the benefits far, far outweigh the disadvantages.
- Joe
On Fri, Jun 10, 2011 at 6:01 PM, Joe Thornber <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 05:15:37PM +0300, Amir G. wrote:
>> On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
>> > FUA/flush allows us to treat multisnap devices as if they are devices
>> > with a write cache. When a FUA/FLUSH bio comes in we ensure we commit
>> > metadata before allowing the bio to continue. A crash will lose data
>> > that is in the write cache, same as any real block device with a write
>> > cache.
>> >
>>
>> Now, here I am confused.
>> Reducing the problem to write cache enabled device sounds valid,
>> but I am not yet convinced it is enough.
>> In ext4 snapshots I had to deal with 'internal ordering' between I/O
>> of origin data and snapshot metadata and data.
>> That means that every single I/O to origin, which overwrites shared data,
>> must hit the media *after* the original data has been copied to snapshot
>> and the snapshot metadata and data are secure on media.
>> In ext4 this is done with the help of JBD2, which anyway holds back metadata
>> writes until commit.
>> It could be that this problem is only relevant to _extenal_ origin, which
>> are not supported for multisnap, but frankly, as I said, I am too confused
>> to figure out if there is yet an ordering problem for _internal_ origin or not.
>
> Ok, let me talk you through my solution. The relevant code is here if
> you want to sing along:
> https://github.com/jthornber/linux-2.6/blob/multisnap/drivers/md/dm-multisnap.c
>
> We use a standard copy-on-write btree to store the mappings for the
> devices (note I'm talking about copy-on-write of the metadata here,
> not the data). When you take an internal snapshot you clone the root
> node of the origin btree. After this there is no concept of an
> origin or a snapshot. They are just two device trees that happen to
> point to the same data blocks.
>
> When we get a write in we decide if it's to a shared data block using
> some timestamp magic. If it is, we have to break sharing.
>
> Let's say we write to a shared block in what was the origin. The
> steps are:
>
> i) plug io further to this physical block. (see bio_prison code).
>
> ii) quiesce any read io to that shared data block. Obviously
> including all devices that share this block. (see deferred_set code)
>
> iii) copy the data block to a newly allocate block. This step can be
> missed out if the io covers the block. (schedule_copy).
>
> iv) insert the new mapping into the origin's btree
> (process_prepared_mappings). This act of inserting breaks some
> sharing of btree nodes between the two devices. Breaking sharing only
> effects the btree of that specific device. Btrees for the other
> devices that share the block never change. The btree for the origin
> device as it was after the last commit is untouched, ie. we're using
> persistent data structures in the functional programming sense.
>
> v) unplug io to this physical block, including the io that triggered
> the breaking of sharing.
>
> Steps (ii) and (iii) occur in parallel.
>
> The main difference to what you described is the metadata _doesn't_
> need to be committed before the io continues. We get away with this
> because the io is always written to a _new_ block. If there's a
> crash, then:
>
> - The origin mapping will point to the old origin block (the shared
> one). This will contain the data as it was before the io that
> triggered the breaking of sharing came in.
>
> - The snap mapping still points to the old block. As it would after
> the commit.
>
OK. Now I am convinced that there is no I/O ordering issue,
since you are never overwriting shared data in-place.
Now I am also convinced that the origin will be so heavily fragmented
that the solution will not be practical for performance-sensitive
applications; specifically, applications that use spinning
media storage and require consistent and predictable performance.
I do have a crazy idea, though, for how to combine the power of the
multisnap features with the speed of a raw ext4 fs.
In the early days of the next3 snapshots design I tried to mimic
the generic JBD APIs and added generic snapshot APIs
to ext3, so that some day an external snapshot store
implementation could use this API.
Over time, as the internal snapshot store implementation grew
to use many internal fs optimizations, I neglected the option of
ever supporting an external snapshot store.
Now that I think about it, it doesn't look so far-fetched after all.
The concept is that multisnap can register as a 'snapshot store
provider' and get called by ext4 directly (not via device mapper)
to copy a metadata buffer on write (snapshot_get_write_access),
get ownership over fs data blocks on delete and rewrite
(snapshot_get_delete/move_access) and to commit/flush the store.
ext4 will keep track of blocks which are owned by the external
snapshot store (in the exclude bitmap) and provide a callback
API for the snapshot store to free those blocks on snapshot
delete.
The ext4 snapshot APIs are already working that way with
the internal store implementation (the store is a sparse file).
There is also the step of creating the initial metadata btree
when creating the multisnap volume with an __external__ origin.
This is just a simple translation of the ext4 block bitmap to
a btree. After that, changes to the __external__ btree can
be made on changes to the ext4 block bitmap - an API already
being used by the internal implementation (snapshot_get_bitmap_access).
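(To make the shape of such a hook table concrete: the C sketch below is
purely hypothetical. The operation names echo the ones mentioned above,
but the struct, the signatures and the exported free-blocks helper are
invented for illustration and exist in neither ext4 nor device-mapper.)

#include <stdint.h>

struct super_block;     /* stand-ins for the kernel types */
struct buffer_head;

struct snapshot_store_ops {
    /* COW a metadata buffer before the fs modifies it
     * (cf. snapshot_get_write_access) */
    int (*get_write_access)(struct super_block *sb, struct buffer_head *bh);

    /* take ownership of fs data blocks on delete/rewrite instead of
     * copying them (cf. snapshot_get_delete/move_access) */
    int (*get_move_access)(struct super_block *sb, uint64_t block, uint64_t count);
    int (*get_delete_access)(struct super_block *sb, uint64_t block, uint64_t count);

    /* mirror block-bitmap changes into the store's metadata
     * (cf. snapshot_get_bitmap_access) */
    int (*get_bitmap_access)(struct super_block *sb, uint64_t group);

    /* make the store durable at journal commit/flush time */
    int (*commit)(struct super_block *sb);
};

/* hypothetical callback the fs would export so the store can return blocks
 * tracked in the exclude bitmap when a snapshot is deleted */
int ext4_free_excluded_blocks(struct super_block *sb, uint64_t block, uint64_t count);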
What do you think?
Does this plan sound too crazy?
Do you think it is doable for multisnap to support this kind of
__external__ origin?
Amir.
On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
> On Fri, Jun 10, 2011 at 11:01:41AM +0200, Lukas Czerner wrote:
>> On Fri, 10 Jun 2011, Amir G. wrote:
>>
>> > CC'ing lvm-devel and fsdevel
>> >
>> >
>> > On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
>> > For the sake of letting everyone understand the differences and trade
>> > offs between
>> > LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
>> > to ask you
>> > some questions about the implementation, which I could not figure out by myself
>> > from reading the documents.
>
> First up let me say that I'm not intending to support writeable
> > _external_ origins with multisnap. This will come as a suprise to
> many people, but I don't think we can resolve the dual requirements to
> efficiently update many, many snapshots when a write occurs _and_ make
> those snapshots quick to delete (when you're encouraging people to
> take lots of snapshots performance of delete becomes a real issue).
>
If I understand this article correctly:
http://people.redhat.com/mpatocka/papers/shared-snapshots.pdf
It says that _external_ origin write updates can be efficient when the
snapshots are readonly (or never written to).
Could you not support readonly snapshots of an _external_ origin?
You could even support writable snapshots that would degrade write
performance to the origin temporarily.
It could be useful if one wants to "try out" mounting a temporary
writable snapshot when the origin is not even mounted.
After the try-out, the temporary snapshot can be deleted
and origin write performance would go back to normal.
Is that correct?
Amir.
On Sat, Jun 11, 2011 at 08:41:38AM +0300, Amir G. wrote:
> On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
> > On Fri, Jun 10, 2011 at 11:01:41AM +0200, Lukas Czerner wrote:
> >> On Fri, 10 Jun 2011, Amir G. wrote:
> >>
> >> > CC'ing lvm-devel and fsdevel
> >> >
> >> >
> >> > On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
> >> > For the sake of letting everyone understand the differences and trade
> >> > offs between
> >> > LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
> >> > to ask you
> >> > some questions about the implementation, which I could not figure out by myself
> >> > from reading the documents.
> >
> > First up let me say that I'm not intending to support writeable
> > _external_ origins with multisnap. This will come as a suprise to
> > many people, but I don't think we can resolve the dual requirements to
> > efficiently update many, many snapshots when a write occurs _and_ make
> > those snapshots quick to delete (when you're encouraging people to
> > take lots of snapshots performance of delete becomes a real issue).
> >
>
> If I understand this article correctly:
> http://people.redhat.com/mpatocka/papers/shared-snapshots.pdf
> It says that writes to an _external_ origin can be handled efficiently as
> long as the snapshots are readonly (or simply never written to).
>
> Could you not support readonly snapshots of an _external_ origin?
Yes, that is the intention, and very little work to add. We just do
something different if the metadata lookup returns -ENODATA. Above I
said I didn't intend to support _writeable_ external snaps. Readable
ones are a must, for instance for supporting virtual machine base
images.
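To illustrate with a toy userspace sketch (made-up names, not the
multisnap code): a metadata miss signalled by -ENODATA just means the
block has never been provisioned in the data volume, so the read is
redirected to the read-only external origin instead:

#include <errno.h>
#include <stdio.h>
#include <stdint.h>

typedef uint64_t blk_t;

/* stub metadata lookup: pretend only block 7 has ever been provisioned */
static int metadata_lookup(blk_t virt, blk_t *data)
{
        if (virt == 7) {
                *data = 1234;           /* mapped inside the data volume */
                return 0;
        }
        return -ENODATA;                /* never written since creation */
}

static void remap_read(blk_t virt)
{
        blk_t data;
        int r = metadata_lookup(virt, &data);

        if (r == -ENODATA)
                printf("block %llu -> external origin (unmapped)\n",
                       (unsigned long long)virt);
        else
                printf("block %llu -> data volume block %llu\n",
                       (unsigned long long)virt, (unsigned long long)data);
}

int main(void)
{
        remap_read(7);                  /* already copied: data volume */
        remap_read(42);                 /* untouched: falls through to origin */
        return 0;
}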
> You could even support writable snapshots, which would only degrade write
> performance to the origin temporarily.
> That can be useful if one wants to "try out" mounting a temporary
> writable snapshot when the origin is not even mounted.
> After the "try-out", the temporary snapshot can be deleted
> and origin write performance would go back to normal.
Not sure what you're getting at here. All snapshots are writeable.
Of course you can take a snapshot of an external origin and then use
this as your temporary origin for experiments. If the origin is
itself a dm device then LVM can shuffle tables around to make this
transparent.
The user may want to commit to their experiment at a later time by
merging back to the external origin. This involves copying, but no
more than a copy-on-write scheme. Arguably it's better to do the copy
only once we know they want to commit to it.
- Joe
On Sat, Jun 11, 2011 at 07:01:36AM +0300, Amir G. wrote:
> OK. Now I am convinced that there is no I/O ordering issue,
> since you are never overwriting shared data in-place.
>
> Now I am also convinced that the origin will be so heavily fragmented
> that the solution will not be practical for performance-sensitive
> applications. Specifically, applications that use spinning media
> storage and require consistent and predictable performance.
I am also convinced multisnap won't be suitable for every use case. I
want to be very careful to only advocate it for people with suitable
tasks. Over time I'm sure we'll broaden the range of suitable apps, for
example by tinkering with the allocator or doing some preemptive
defrag. It would be disappointing for everyone to write it off just
because it isn't suitable for, say, high-performance database apps.
The very simple allocator I'm using at the moment will try to place
new blocks together. My hope is that past I/O patterns will be similar
to future ones, so while the volumes will be fragmented, blocks for
the typical I/O access patterns will still be close together. Much more
experimentation is needed.
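As a toy illustration of that policy only (this is not the multisnap
allocator, just a made-up next-fit example): keep a cursor where the
last allocation ended and hand out the next free block from there, so
blocks allocated together tend to end up together on disk:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_BLOCKS 64

static bool used[NR_BLOCKS];
static uint32_t cursor;                 /* where the last allocation ended */

static int alloc_block(uint32_t *out)
{
        for (uint32_t i = 0; i < NR_BLOCKS; i++) {
                uint32_t b = (cursor + i) % NR_BLOCKS;

                if (used[b])
                        continue;
                used[b] = true;
                cursor = b + 1;         /* next allocation starts here */
                *out = b;
                return 0;
        }
        return -1;                      /* out of space */
}

int main(void)
{
        uint32_t b;

        /* blocks requested back-to-back come out contiguous */
        for (int i = 0; i < 5; i++)
                if (alloc_block(&b) == 0)
                        printf("allocated block %u\n", b);
        return 0;
}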
These are very early days for multisnap; the code is still changing.
Only a few people have run it. For instance, Lukas tested it on
Thursday and got some unexpectedly poor results. I'm sure there'll be a
quick fix for it (e.g. wrong cache size, or too much disk seeking due to
the metadata and data volumes being at opposite ends of the spindle),
but this shows that I need more people to play with it.
> I do have a crazy idea, though, how to combine the power of the
> multisnap features with the speed of a raw ext4 fs.
I need to think this through over the weekend. The metadata interface
is pretty clean, so you could start by looking at that. However, I do
find this suggestion surprising. My priority is block-level
snapshots; if I can expose interfaces for you such that we share code,
then that would be great.
- Joe
--On 11 June 2011 08:49:08 +0100 Joe Thornber <[email protected]> wrote:
> I am also convinced multisnap won't be suitable for every use case.
I'm surprised by one thing ext4 snapshots doesn't seem to do: I would have
thought the "killer feature" for doing snapshots in the fs rather than in
the block layer would be the ability to snapshot - and more importantly
roll back - only parts of the directory hierarchy.
(I've only read the URLs Amir sent, so apologies if I've missed this)
--
Alex Bligh
On Sat, Jun 11, 2011 at 11:18 AM, Alex Bligh <[email protected]> wrote:
>
>
> --On 11 June 2011 08:49:08 +0100 Joe Thornber <[email protected]> wrote:
>
>> I am also convinced multisnap won't be suitable for every use case.
>
> I'm surprised by one thing ext4 snapshots doesn't seem to do: I would have
> thought the "killer feature" for doing snapshots in the fs rather than in
> the block layer would be the ability to snapshot - and more importantly
> roll back - only parts of the directory hierarchy.
>
> (I've only read the URLs Amir sent, so apologies if I've missed this)
>
No need for apologies.
There is no per-directory snapshot or rollback with ext4 snapshots.
It is possible to configure a part of the directory hierarchy to be excluded
from future snapshots, but not to delete it selectively from past snapshots.
Amir.
On Sat, Jun 11, 2011 at 10:35 AM, Joe Thornber <[email protected]> wrote:
> On Sat, Jun 11, 2011 at 08:41:38AM +0300, Amir G. wrote:
>> On Fri, Jun 10, 2011 at 1:11 PM, Joe Thornber <[email protected]> wrote:
>> > On Fri, Jun 10, 2011 at 11:01:41AM +0200, Lukas Czerner wrote:
>> >> On Fri, 10 Jun 2011, Amir G. wrote:
>> >>
>> >> > CC'ing lvm-devel and fsdevel
>> >> >
>> >> >
>> >> > On Wed, Jun 8, 2011 at 9:26 PM, Amir G. <[email protected]> wrote:
>> >> > For the sake of letting everyone understand the differences and trade
>> >> > offs between
>> >> > LVM and ext4 snapshots, so ext4 snapshots can get a fair trial, I need
>> >> > to ask you
>> >> > some questions about the implementation, which I could not figure out by myself
>> >> > from reading the documents.
>> >
>> > First up let me say that I'm not intending to support writeable
>> > _external_ origins with multisnap. This will come as a surprise to
>> > many people, but I don't think we can resolve the dual requirements to
>> > efficiently update many, many snapshots when a write occurs _and_ make
>> > those snapshots quick to delete (when you're encouraging people to
>> > take lots of snapshots performance of delete becomes a real issue).
>> >
>>
>> If I understand this article correctly:
>> http://people.redhat.com/mpatocka/papers/shared-snapshots.pdf
>> It says that writes to an _external_ origin can be handled efficiently as
>> long as the snapshots are readonly (or simply never written to).
>>
>> Could you not support readonly snapshots of an _external_ origin?
>
> Yes, that is the intention, and very little work to add. We just do
> something different if the metadata lookup returns -ENODATA. Above I
> said I didn't intend to support _writeable_ external snaps. Readable
> ones are a must, for instance for supporting virtual machine base
> images.
>
>> You could even support writable snapshots, which would only degrade write
>> performance to the origin temporarily.
>> That can be useful if one wants to "try out" mounting a temporary
>> writable snapshot when the origin is not even mounted.
>> After the "try-out", the temporary snapshot can be deleted
>> and origin write performance would go back to normal.
>
> Not sure what you're getting at here. All snapshots are writeable.
>
I meant _readonly_ snapshots of a _writable_ _external_ origin,
which is what ext4 snapshots provide.
All snapshots are chained on a list that points to the origin, and
only the latest (active) snapshot's metadata gets updated on origin writes.
When a lookup in an older snapshot returns -ENODATA, you go up the list
to the next newer snapshot and eventually to the origin.
Those _incremental_ snapshots cannot be _writable_, because older
snapshots may implicitly share blocks with newer snapshots, but it should
be possible to make _writable_ clones of these snapshots.
Not sure what the implications are for deleting snapshots, because I am
not familiar with all the implementation details of multisnap.
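In toy C (made-up names, purely to illustrate the lookup order - this is
not ext4 or multisnap code), the chained lookup I am describing looks
roughly like this:

#include <errno.h>
#include <stdio.h>
#include <stdint.h>

typedef uint64_t blk_t;

struct snapshot {
        const char *name;
        struct snapshot *newer;         /* toward the active snapshot/origin */
        int (*lookup)(struct snapshot *s, blk_t virt, blk_t *phys);
};

/* walk the chain until some snapshot owns a copy of the block; if none
 * does, the block was never overwritten and still lives in the origin */
static blk_t resolve(struct snapshot *s, blk_t virt, blk_t origin_block)
{
        for (; s; s = s->newer) {
                blk_t phys;

                if (s->lookup(s, virt, &phys) != -ENODATA) {
                        printf("block %llu served from %s\n",
                               (unsigned long long)virt, s->name);
                        return phys;
                }
        }
        printf("block %llu served from origin\n", (unsigned long long)virt);
        return origin_block;
}

static int miss(struct snapshot *s, blk_t virt, blk_t *phys)
{
        (void)s; (void)virt; (void)phys;
        return -ENODATA;                /* stub: this snapshot holds nothing */
}

int main(void)
{
        struct snapshot newest = { "snap2", NULL, miss };
        struct snapshot oldest = { "snap1", &newest, miss };

        resolve(&oldest, 10, 10);       /* falls through to the origin */
        return 0;
}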
> Of course you can take a snapshot of an external origin and then use
> this as your temporary origin for experiments. If the origin is
> itself a dm device then LVM can shuffle tables around to make this
> transparent.
>
> The user may want to commit to their experiment at a later time by
> merging back to the external origin. This involves copying, but no
> more than a copy-on-write scheme. Arguably it's better to do the copy
> only once we know they want to commit to it.
>
> - Joe
>
On Sat, Jun 11, 2011 at 12:58:26PM +0300, Amir G. wrote:
> I meant _readonly_ snapshots of a _writable_ _external_ origin,
> which is what ext4 snapshots provide.
> All snapshots are chained on a list that points to the origin, and
> only the latest (active) snapshot's metadata gets updated on origin writes.
> When a lookup in an older snapshot returns -ENODATA, you go up the list
> to the next newer snapshot and eventually to the origin.
>
> Those _incremental_ snapshots cannot be _writable_, because older
> snapshots may implicitly share blocks with newer snapshots, but it should
> be possible to make _writable_ clones of these snapshots.
> Not sure what the implications are for deleting snapshots, because I am
> not familiar with all the implementation details of multisnap.
I deliberately ruled out chaining schemes like this because I want to
support large numbers of snapshots. I believe Daniel Phillips
described a chaining scheme a while ago, and someone else implemented
it last year. From a cursory glance through the code they posted on
dm-devel, it appeared to need a large in-memory hash table to cache
all those chained lookups.
- Joe