2012-01-21 02:45:27

by Robin Dong

Subject: Question about writable ext4-snapshot

Hello, Amir

I have recently been evaluating ext4-snapshot (on github) for TAOBAO. A
snapshot of an ext4 fs is currently READONLY, but we need to write data
into the snapshot. We also want to use ext4-snapshot to do online fsck on
Hadoop clusters, but our Hadoop clusters currently use no-journal ext4.
So we have some questions:

1. Will it be possible to implement a writable ext4-snapshot?
2. Will it be possible to snapshot a no-journal ext4 fs?
3. What are the difficult points in implementing the above?

Any reply will be appreciated.

--
Best Regards
Robin Dong


2012-01-21 04:24:06

by Theodore Ts'o

Subject: Re: Question about writable ext4-snapshot


On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:

> Hello, Amir
>
> I have recently been evaluating ext4-snapshot (on github) for TAOBAO. A
> snapshot of an ext4 fs is currently READONLY, but we need to write data
> into the snapshot. We also want to use ext4-snapshot to do online fsck on
> Hadoop clusters, but our Hadoop clusters currently use no-journal ext4.
> So we have some questions:
>
> 1. Will it be possible to implement a writable ext4-snapshot?
> 2. Will it be possible to snapshot a no-journal ext4 fs?
> 3. What are the difficult points in implementing the above?

Something else to consider is the device mapper thin-provisioning approach. This approach does the snapshotting at the device-mapper layer, which means it is separate from the file system. It relies on the discard request issued when a file is unlinked to know when blocks can be released from the snapshot. It also uses a granularity much smaller than that of the traditional LVM-style snapshots.

This code will still need a few months to mature (the thin-provisioning code just got merged into 3.2, but discard support isn't done yet, and the userspace support is lagging). But in the long run, this might be a very attractive way of providing multiple levels of writeable snapshots, in a clean and relatively simple way.
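
To make the mechanism concrete, here is a minimal toy model of snapshotting below the file system, with discard-driven space release. It is only a sketch: the names `ThinPool` and `ThinVolume` are hypothetical, blocks are reference-counted in memory, and nothing here reflects the actual dm-thin on-disk format or API.

```python
class ThinPool:
    """Toy pool of physical blocks shared by thin volumes (hypothetical model)."""

    def __init__(self):
        self.refcount = {}   # physical block -> number of volumes mapping it
        self.data = {}       # physical block -> contents
        self.next_block = 0

    def alloc(self, contents):
        blk = self.next_block
        self.next_block += 1
        self.refcount[blk] = 1
        self.data[blk] = contents
        return blk

    def release(self, blk):
        # Space is reclaimed only when the last mapping goes away.
        self.refcount[blk] -= 1
        if self.refcount[blk] == 0:
            del self.refcount[blk]
            del self.data[blk]

    def used_blocks(self):
        return len(self.refcount)


class ThinVolume:
    """Toy thin volume: a logical -> physical block mapping into a pool."""

    def __init__(self, pool, mapping=None):
        self.pool = pool
        self.mapping = dict(mapping or {})

    def read(self, lblk):
        blk = self.mapping.get(lblk)
        return None if blk is None else self.pool.data[blk]

    def write(self, lblk, contents):
        old = self.mapping.get(lblk)
        if old is not None and self.pool.refcount[old] == 1:
            # Block is not shared with any snapshot: overwrite in place.
            self.pool.data[old] = contents
            return
        if old is not None:
            # Block is shared: break sharing by writing to a fresh block.
            self.pool.release(old)
        self.mapping[lblk] = self.pool.alloc(contents)

    def discard(self, lblk):
        # What a discard request (e.g. after unlink) would trigger.
        blk = self.mapping.pop(lblk, None)
        if blk is not None:
            self.pool.release(blk)

    def snapshot(self):
        # A snapshot is just another volume sharing the same blocks,
        # so it is writable and can itself be snapshotted.
        for blk in self.mapping.values():
            self.pool.refcount[blk] += 1
        return ThinVolume(self.pool, self.mapping)
```

In this model a block is freed only after every volume that maps it has discarded it, which is why the file system has to pass discards down for space to be released.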

-- Ted



2012-01-21 04:37:01

by Andreas Dilger

Subject: Re: Question about writable ext4-snapshot

On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:

> Hello, Amir
>
> I have recently been evaluating ext4-snapshot (on github) for TAOBAO. A
> snapshot of an ext4 fs is currently READONLY, but we need to write data
> into the snapshot. We also want to use ext4-snapshot to do online fsck on
> Hadoop clusters, but our Hadoop clusters currently use no-journal ext4.

When you write about online e2fsck, what do you mean exactly? It is already possible with LVM to create a read-only snapshot of a device and run read-only e2fsck. This works because the LVM snapshot is hooked to ext4 to freeze the filesystem and flush the journal before the snapshot is taken.

At this point, if the fsck is clean then the original filesystem is clean also. This is the most common case. In the uncommon case of errors detected on the snapshot, the filesystem would need to be taken offline to fix any problems.

By running the online fsck on the snapshot, one can be certain that the filesystem is clean, and reset the automatic checking date/mount counters.
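
The sequence described above can be sketched as follows. This is an illustration, not a tested tool: the function names, the snapshot name `fsck_snap`, and the `1G` snapshot size are assumptions, and the commands need root and an LVM-backed ext4 volume. It relies on `e2fsck -n` opening the device read-only and exiting with status 0 when no errors are found.

```python
import subprocess


def online_fsck_commands(vg, lv, snap_name="fsck_snap", snap_size="1G"):
    """Build the command lines for a snapshot-based read-only check."""
    origin = f"/dev/{vg}/{lv}"
    snap = f"/dev/{vg}/{snap_name}"
    return {
        # Classic LVM snapshot of the mounted volume.
        "create": ["lvcreate", "-s", "-L", snap_size, "-n", snap_name, origin],
        # Forced, read-only check of the snapshot.
        "check": ["e2fsck", "-f", "-n", snap],
        # Reset mount count and last-checked time on the origin.
        "reset": ["tune2fs", "-C", "0", "-T", "now", origin],
        "remove": ["lvremove", "-f", snap],
    }


def run_online_fsck(vg, lv, run=subprocess.run, **kwargs):
    """Return True if the snapshot checked clean; `run` is injectable for dry runs."""
    cmds = online_fsck_commands(vg, lv, **kwargs)
    run(cmds["create"], check=True)
    try:
        clean = run(cmds["check"]).returncode == 0
        if clean:
            run(cmds["reset"], check=True)
    finally:
        # Always drop the snapshot, even if the check found errors.
        run(cmds["remove"], check=True)
    return clean
```

A non-zero e2fsck status leaves the counters untouched, which matches the case where the filesystem must be taken offline for repair.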

If you are thinking about online repair, that would be much more complex, but may still be possible for some cases.

Cheers, Andreas

2012-01-21 16:09:51

by Amir Goldstein

Subject: Re: Question about writable ext4-snapshot

On Sat, Jan 21, 2012 at 6:24 AM, Theodore Tso <[email protected]> wrote:
>
> On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:
>
>> Hello, Amir
>>
>> I have recently been evaluating ext4-snapshot (on github) for TAOBAO. A
>> snapshot of an ext4 fs is currently READONLY, but we need to write data
>> into the snapshot. We also want to use ext4-snapshot to do online fsck on
>> Hadoop clusters, but our Hadoop clusters currently use no-journal ext4.
>> So we have some questions:
>>
>> 1. Will it be possible to implement a writable ext4-snapshot?
>> 2. Will it be possible to snapshot a no-journal ext4 fs?
>> 3. What are the difficult points in implementing the above?
>

Hello Robin,

1. Writable snapshots (snapshot clones) are actually quite simple to implement
(a sparse file containing all changes from a read-only snapshot).
The real challenge is how to support snapshots of these clones and how to
implement space reclaim efficiently (time-wise) when deleting snapshots.
Indeed, the LVM thin-provisioning target handles space reclaim very efficiently.

2. I think it is possible, but I never looked into it, so there may
be challenges that I haven't foreseen.
The obvious problem is that snapshots will not be reliable after a crash.
JBD ensures that metadata is not overwritten on disk before it is
copied to the snapshot, but without a journal, after a crash, metadata
could already have been written, and you lose the original data that was
supposed to be copied to the snapshot.

3. I think I have already answered that question above, but the actual
difficulty really depends on your specific needs.
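
As a rough illustration of point 1, a writable clone can be modeled as an overlay that stores only changed blocks on top of a read-only snapshot. This is a toy sketch, not the ext4-snapshot code: `ReadOnlySnapshot` and `WritableClone` are hypothetical names, and a dict stands in for the sparse file.

```python
class ReadOnlySnapshot:
    """Frozen view of the filesystem blocks at snapshot time (toy model)."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, n):
        # Unmapped blocks read back as a zero byte, like holes in a sparse file.
        return self._blocks.get(n, b"\0")


class WritableClone:
    """Clone = read-only snapshot + sparse store of changed blocks."""

    def __init__(self, snapshot):
        self.snapshot = snapshot
        self.delta = {}  # stands in for the sparse file; holds only changes

    def write(self, n, data):
        # Writes never touch the snapshot, only the delta.
        self.delta[n] = data

    def read(self, n):
        # Changed blocks come from the delta, everything else from the snapshot.
        if n in self.delta:
            return self.delta[n]
        return self.snapshot.read(n)


snap = ReadOnlySnapshot({0: b"super", 1: b"inode"})
clone = WritableClone(snap)
clone.write(1, b"INODE")
```

The sketch deliberately leaves out the hard parts named above: snapshots of clones and reclaiming space quickly when a snapshot is deleted.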

> Something else to consider is the device mapper thin-provisioning approach. This approach does the snapshotting at the device-mapper layer, which means it is separate from the file system. It relies on the discard request issued when a file is unlinked to know when blocks can be released from the snapshot. It also uses a granularity much smaller than that of the traditional LVM-style snapshots.
>
> This code will still need a few months to mature (the thin-provisioning code just got merged into 3.2, but discard support isn't done yet, and the userspace support is lagging). But in the long run, this might be a very attractive way of providing multiple levels of writeable snapshots, in a clean and relatively simple way.
>

There are some lengthy threads about LVM thinp vs. Ext4 snapshots here:
http://thread.gmane.org/gmane.comp.file-systems.ext4/25968/focus=26056
and here:
http://thread.gmane.org/gmane.comp.file-systems.ext4/26041

At the end of the day, the thinp target is a very powerful tool, but it
does not fit all use cases. In particular, it fragments the on-disk layout
of ext4 metadata, and benchmark results showing how this affects performance
were never published.

Also, thinp needs to store quite a lot of metadata for the mapping of
all thinp blocks, and in order to keep this metadata durable without hurting
write performance you will almost certainly need to store it on an SSD.
That is not a bad solution for a high-end server, but I am not sure
everyone can afford it.
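
For a sense of scale, the mapping metadata can be estimated with a back-of-the-envelope calculation. The sketch below assumes the rule of thumb from the kernel's thin-provisioning documentation (about 48 bytes per data block, with a 2 MB floor); the function name is hypothetical and real usage depends on how many snapshots share or diverge.

```python
def thinp_metadata_bytes(data_dev_bytes, data_block_bytes,
                         bytes_per_block=48, minimum=2 * 1024**2):
    """Rough metadata size estimate for a thin pool (assumed rule of thumb)."""
    # Number of data blocks, rounded up.
    n_blocks = -(-data_dev_bytes // data_block_bytes)
    return max(bytes_per_block * n_blocks, minimum)


TIB = 1024**4
MIB = 1024**2
KIB = 1024

small_blocks = thinp_metadata_bytes(1 * TIB, 64 * KIB)   # fine granularity
large_blocks = thinp_metadata_bytes(1 * TIB, 1 * MIB)    # coarse granularity
```

Under these assumptions, a 1 TiB pool needs about 768 MiB of metadata with 64 KiB blocks and about 48 MiB with 1 MiB blocks, so finer snapshot granularity is paid for in metadata volume.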

Amir.

2012-01-22 03:31:32

by Robin Dong

Subject: Re: Question about writable ext4-snapshot

2012/1/22 Amir Goldstein <[email protected]>:
> On Sat, Jan 21, 2012 at 6:24 AM, Theodore Tso <[email protected]> wrote:
>>
>> On Jan 20, 2012, at 9:45 PM, Robin Dong wrote:
>>
>>> Hello, Amir
>>>
>>> I have recently been evaluating ext4-snapshot (on github) for TAOBAO. A
>>> snapshot of an ext4 fs is currently READONLY, but we need to write data
>>> into the snapshot. We also want to use ext4-snapshot to do online fsck on
>>> Hadoop clusters, but our Hadoop clusters currently use no-journal ext4.
>>> So we have some questions:
>>>
>>> 1. Will it be possible to implement a writable ext4-snapshot?
>>> 2. Will it be possible to snapshot a no-journal ext4 fs?
>>> 3. What are the difficult points in implementing the above?
>>
>
> Hello Robin,
>
> 1. Writable snapshots (snapshot clones) are actually quite simple to implement
> (a sparse file containing all changes from a read-only snapshot).
> The real challenge is how to support snapshots of these clones and how to
> implement space reclaim efficiently (time-wise) when deleting snapshots.
> Indeed, the LVM thin-provisioning target handles space reclaim very efficiently.
>
> 2. I think it is possible, but I never looked into it, so there may
> be challenges that I haven't foreseen.
> The obvious problem is that snapshots will not be reliable after a crash.
> JBD ensures that metadata is not overwritten on disk before it is
> copied to the snapshot, but without a journal, after a crash, metadata
> could already have been written, and you lose the original data that was
> supposed to be copied to the snapshot.
>
> 3. I think I have already answered that question above, but the actual
> difficulty really depends on your specific needs.
>
>> Something else to consider is the device mapper thin-provisioning approach. This approach does the snapshotting at the device-mapper layer, which means it is separate from the file system. It relies on the discard request issued when a file is unlinked to know when blocks can be released from the snapshot. It also uses a granularity much smaller than that of the traditional LVM-style snapshots.
>>
>> This code will still need a few months to mature (the thin-provisioning code just got merged into 3.2, but discard support isn't done yet, and the userspace support is lagging). But in the long run, this might be a very attractive way of providing multiple levels of writeable snapshots, in a clean and relatively simple way.
>>
>
> There are some lengthy threads about LVM thinp vs. Ext4 snapshots here:
> http://thread.gmane.org/gmane.comp.file-systems.ext4/25968/focus=26056
> and here:
> http://thread.gmane.org/gmane.comp.file-systems.ext4/26041
>
> At the end of the day, the thinp target is a very powerful tool, but it
> does not fit all use cases. In particular, it fragments the on-disk layout
> of ext4 metadata, and benchmark results showing how this affects performance
> were never published.
>
> Also, thinp needs to store quite a lot of metadata for the mapping of
> all thinp blocks, and in order to keep this metadata durable without hurting
> write performance you will almost certainly need to store it on an SSD.
> That is not a bad solution for a high-end server, but I am not sure
> everyone can afford it.
>
> Amir.

Thanks for all your suggestions!
I will evaluate both thin provisioning and ext4-snapshot later.

--
Best Regards
Robin Dong

2012-01-23 03:21:48

by Theodore Ts'o

Subject: Re: Question about writable ext4-snapshot

On Sun, Jan 22, 2012 at 11:31:31AM +0800, Robin Dong wrote:
> > At the end of the day, the thinp target is a very powerful tool, but
> > it does not fit all use cases. In particular, it fragments the
> > on-disk layout of ext4 metadata, and benchmark results showing how this
> > affects performance were never published.

Amir,

Well, to be fair, your approach to snapshotting also causes
fragmentation. If a file or a directory in the base image gets
modified while there is a read-only snapshot, the inode in the base
image gets fragmented as a result.

It is true that thin provisioning in general tends to defeat the block
placement algorithms used by a file system, but it will be possible to
create snapshots of non-thinp volumes, which will address this issue.
Hopefully in the next 3-6 months, these things will be implemented
enough so that we can benchmark them and see for certain how well or
poorly this approach will work out. I'm sure there will be a certain
number of tradeoffs for both approaches.
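
The fragmentation effect is easy to see in a toy model: if a block of a contiguous file is rewritten while a snapshot still pins the old block, the new data has to live elsewhere and the file's extent count goes up. The block numbers below are made up and `count_extents` is a hypothetical helper, not an ext4 interface.

```python
def count_extents(physical_blocks):
    """Count runs of physically contiguous blocks in a file's block list."""
    if not physical_blocks:
        return 0
    extents = 1
    for prev, cur in zip(physical_blocks, physical_blocks[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


# A file laid out contiguously in the base image: one extent.
layout = list(range(100, 108))
before = count_extents(layout)

# Logical block 3 is rewritten while a snapshot holds physical block 103,
# so the new contents land at some other free block (500 is arbitrary).
layout[3] = 500
after = count_extents(layout)
```

Here a single redirected block turns one extent into three, and the same arithmetic applies whether the redirection happens inside the file system or in a thin-provisioning layer underneath it.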

Regards,

- Ted

2012-01-23 20:08:51

by Amir Goldstein

Subject: Re: Question about writable ext4-snapshot

On Mon, Jan 23, 2012 at 5:21 AM, Ted Ts'o <[email protected]> wrote:
> On Sun, Jan 22, 2012 at 11:31:31AM +0800, Robin Dong wrote:
>> > At the end of the day, the thinp target is a very powerful tool, but
>> > it does not fit all use cases. In particular, it fragments the
>> > on-disk layout of ext4 metadata, and benchmark results showing how this
>> > affects performance were never published.
>
> Amir,
>
> Well, to be fair, your approach to snapshotting also causes
> fragmentation. If a file or a directory in the base image gets
> modified while there is a read-only snapshot, the inode in the base
> image gets fragmented as a result.

Yes, that's true, to some extent. Directory inodes, however, do not
get fragmented. All journaled metadata is copied aside on JBD hooks.
My claim was about fragmentation of ext4 metadata, but fragmentation
of data is also a problem in both approaches.
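
Why the copy-aside depends on journal ordering can be shown with a small crash model. This is a simplified sketch with hypothetical names: each operation is assumed to reach disk atomically and in list order, and a crash keeps only a prefix of the operations.

```python
def surviving_state(ops, crash_after):
    """Replay the first `crash_after` operations that reached disk before a crash."""
    disk = {"origin": "old", "snapshot": None}
    for op in ops[:crash_after]:
        if op == "copy_to_snapshot":
            # The copy saves whatever the origin block holds at that moment.
            disk["snapshot"] = disk["origin"]
        elif op == "overwrite_origin":
            disk["origin"] = "new"
    return disk


def old_data_preserved(disk):
    # The snapshot is consistent if the pre-snapshot contents are still
    # readable, either in place or from the copy.
    return "old" in (disk["origin"], disk["snapshot"])


# With the journal enforcing order, the copy always lands first.
ordered = ["copy_to_snapshot", "overwrite_origin"]
# Without that guarantee, the overwrite may reach disk first.
unordered = ["overwrite_origin", "copy_to_snapshot"]

ordered_ok = all(old_data_preserved(surviving_state(ordered, n)) for n in range(3))
unordered_ok = all(old_data_preserved(surviving_state(unordered, n)) for n in range(3))
```

In the ordered case every crash point leaves the old contents recoverable; in the unordered case a crash after the first operation loses them, which is the no-journal risk.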

>
> It is true that thin provisioning in general tends to defeat the block
> placement algorithms used by a file system, but it will be possible to
> create snapshots of non-thinp volumes, which will address this issue.
> Hopefully in the next 3-6 months, these things will be implemented
> enough so that we can benchmark them and see for certain how well or
> poorly this approach will work out. I'm sure there will be a certain
> number of tradeoffs for both approaches.
>
> Regards,
>
> - Ted