LinuxLists.cc - [LSF/FS TOPIC] Ext4 snapshots status update

2011-02-03 22:33:39

Subject: [LSF/FS TOPIC] Ext4 snapshots status update

Hi,

I have been working on porting Next3 snapshots to Ext4
with a group of 4 CS students.

This is where it all happens:
https://github.com/amir73il/ext4-snapshots

Ext4 snapshots prototype is already working!
Would you like to see a demo?

I would like to present the progress of our work
and discuss the remaining issues and how they should be best addressed.

Additionally, I would like to discuss the need for a unified snapshots API that
would serve both Ext4 and Btrfs.
This could be useful for someone that wants to implement a generic
snapshots management system.

Amir.

2011-02-04 00:20:52

by Joel Becker

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
> I have been working on porting Next3 snapshots to Ext4
> with a group of 4 CS students.
>
> This is where it all happens:
> https://github.com/amir73il/ext4-snapshots
>
> Ext4 snapshots prototype is already working!
> Would you like to see a demo?
>
> I would like to present the progress of our work
> and discuss the remaining issues and how they should be best addressed.
>
> Additionally, I would like to discuss the need for a unified snapshots API that
> would serve both Ext4 and Btrfs.
> This could be useful for someone that wants to implement a generic
> snapshots management system.

ocfs2 definitely wants to be a part of that discussion, as we
already do snapshots and thin clones. ocfs2 snapshots are actually just
thin clones marked readonly. They can be placed anywhere in the
filesystem and are first-class inodes. They do not live in a hidden
space like Next3 snapshots seem to.

I've already got a design for a front-end snapshot program that
implements a policy on top of this generic behavior. This design would
cover both first-class and hidden style snapshots, because it assumes
snapshots are in a distinct namespace. I haven't gotten around to
implementing it yet, but btrfs and other snapshottable filesystems were
part of the design goal.

Joel

--

"Sometimes one pays most for the things one gets for nothing."
- Albert Einstein

http://www.jlbec.org/
[email protected]

2011-02-04 05:52:40

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>> I have been working on porting Next3 snapshots to Ext4
>> with a group of 4 CS students.
>>
>> This is where it all happens:
>> https://github.com/amir73il/ext4-snapshots
>>
>> Ext4 snapshots prototype is already working!
>> Would you like to see a demo?
>>
>> I would like to present the progress of our work
>> and discuss the remaining issues and how they should be best addressed.
>>
>> Additionally, I would like to discuss the need for a unified snapshots API that
>> would serve both Ext4 and Btrfs.
>> This could be useful for someone that wants to implement a generic
>> snapshots management system.
>
> ocfs2 definitely wants to be a part of that discussion, as we
> already do snapshots and thin clones. ocfs2 snapshots are actually just
> thin clones marked readonly. They can be placed anywhere in the
> filesystem and are first-class inodes. They do not live in a hidden
> space like Next3 snapshots seem to.
> I've already got a design for a front-end snapshot program that
> implements a policy on top of this generic behavior. This design would
> cover both first-class and hidden style snapshots, because it assumes
> snapshots are in a distinct namespace. I haven't gotten around to
> implementing it yet, but btrfs and other snapshottable filesystems were
> part of the design goal.
>

Cool. I'd love to see a draft (if you already posted it, I missed it).
I wonder how one deals with different filesystems with different snapshot
capabilities, i.e. file level vs. sub/volume level. Which one is the case
for ocfs2?

> Joel
>
> --
>
> "Sometimes one pays most for the things one gets for nothing."
> - Albert Einstein
>
> http://www.jlbec.org/
> [email protected]
>

2011-03-23 20:19:39

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>> I have been working on porting Next3 snapshots to Ext4
>> with a group of 4 CS students.
>>
>> This is where it all happens:
>> https://github.com/amir73il/ext4-snapshots
>>
>> Ext4 snapshots prototype is already working!
>> Would you like to see a demo?
>>
>> I would like to present the progress of our work
>> and discuss the remaining issues and how they should be best addressed.
>>
>> Additionally, I would like to discuss the need for a unified snapshots API that
>> would serve both Ext4 and Btrfs.
>> This could be useful for someone that wants to implement a generic
>> snapshots management system.
>
> ocfs2 definitely wants to be a part of that discussion, as we
> already do snapshots and thin clones. ocfs2 snapshots are actually just
> thin clones marked readonly. They can be placed anywhere in the
> filesystem and are first-class inodes. They do not live in a hidden
> space like Next3 snapshots seem to.
> I've already got a design for a front-end snapshot program that
> implements a policy on top of this generic behavior. This design would
> cover both first-class and hidden style snapshots, because it assumes
> snapshots are in a distinct namespace. I haven't gotten around to
> implementing it yet, but btrfs and other snapshottable filesystems were
> part of the design goal.
>
> Joel
>

Hi Joel,

Any chance of getting a copy of that design of yours, to get a head start
for LSF?

Here are some other generic snapshot-related topics we may want to discuss:

1. Collaborating on the use of the inode flags COW_FL and NOCOW_FL, suggested by Chris.
2. How to deal with an mmap write to a COW file when you get ENOSPC.
3. Adding a buffer_remap() flag for the buffered I/O code, meaning that there is
an existing mapping to initialize a page on a partial write, but we still need
to call get_block() to get a (possibly) new mapping.

I'll be staying for the Collaboration Summit, so I've got plenty of time
if you'd like to exchange some ideas.

Amir.

2011-03-30 00:35:04

by Joel Becker

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
> > On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
> > I've already got a design for a front-end snapshot program that
> > implements a policy on top of this generic behavior. This design would
> > cover both first-class and hidden style snapshots, because it assumes
> > snapshots are in a distinct namespace. I haven't gotten around to
> > implementing it yet, but btrfs and other snapshottable filesystems were
> > part of the design goal.
>
> Any chance of getting a copy of that design of yours, to get a head start
> for LSF?

Yeah, I owe it to you. It wasn't a written-down thing, it was a
hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
to braindump here to get us going. First, I'll speak to your points.

> Here are some other generic snapshot related topics we may want to discuss:
>
> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.

I'm unsure where these fit, perhaps because I missed the
discussion between Chris and you. ocfs2 has the inode flag
OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
This is ocfs2's structure for maintaining extent reference counts. Is
your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
like: "Set this flag on the inode and it will prevent CoW."

> 2. How to deal with mmap write to COW file, when you get ENOSPC.

We just fail the write with VM_FAULT_SIGBUS like mmap write to a
hole. It's what happens for most other CoW filesystems today. If
you're using CoW, you should be aware of what to expect.

> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
> an existing mapping to initialize a page on partial write, but still need
> to call get_block() to get a (possibly) new mapping.

Since ocfs2 doesn't allocate in get_block(), this doesn't affect
us. We notice the refcounted extent in write_begin() and CoW it right
there. Same place we clean up unwritten extents.

--snip--

Now, about my snapshot thoughts as promised. My understanding
of the snapshots you have implemented in ext4 is that they are like some
SAN snapshots; they are hidden objects not visible unless you use
special access. They are particular to a given inode and are children
of that inode. What happens when you remove the visible inode? Do the
snapshots disappear? Do you have limitations on how many snapshots a
particular inode can have? These questions plagued us when we originally
set out to design inode snapshots for ocfs2.

Once we settled on a mechanism for CoW among ocfs2 inodes, we
quickly decided that a snapshot should be visible in the namespace.
This gave rise to the reflink(2) call, though that name is deprecated in
favor of fastcopy(2). Currently our API is OCFS2_IOC_REFLINK (see,
legacy!), but we eventually want to get the system call upstream. In
ocfs2-land, we decided to keep policy out of the kernel.
OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
source in CoW fashion, but once it returns, that new inode is a peer of
the source. There is no parent->child relationship.

Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
changed yet), a "snapshot" is just:

snapshot: reflink source target.snap && chmod 0444 target.snap

You can add "chattr +i target.snap" in there if you like.

Since there is no "snapshot namespace" stuff for ocfs2 in the
kernel, it was our intention to propose a snapshot(8) binary that works
like mkfs/fsck; snapshot(8) just calls snapshot.<fstype>(8). Our
plan was to place snapshot policy in snapshot.ocfs2(8). This
implementation would handle managing the <mountpoint>/.snapshot/...
namespace behind the user:
    $ cd /mnt/ocfs2
    $ snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.<timestamp>
    <timestamp>
    $ snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
    test
    $ snapshot list file1
    Snapshots for file1:
        <timestamp>
        test

Something like that.
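
A minimal sketch of the naming scheme in the session above, assuming a
snapshot of file1 lands at <mountpoint>/.snapshot/file1.<label> and that the
label defaults to a timestamp. The function names snapshot_path and
list_snapshots, and the timestamp format, are hypothetical; this is an
illustration, not the actual snapshot(8) design.

```python
import os
import time


def snapshot_path(mountpoint, filename, label=None, now=None):
    """Return the path a snapshot of `filename` would get under .snapshot/."""
    if label is None:
        # Assumed timestamp format; the thread only says "<timestamp>".
        label = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return os.path.join(mountpoint, ".snapshot", f"{filename}.{label}")


def list_snapshots(mountpoint, filename):
    """Return the labels of existing snapshots of `filename`, sorted."""
    snapdir = os.path.join(mountpoint, ".snapshot")
    prefix = filename + "."
    try:
        names = os.listdir(snapdir)
    except FileNotFoundError:
        # No .snapshot directory yet means no snapshots were taken.
        return []
    return sorted(n[len(prefix):] for n in names if n.startswith(prefix))
```

Keeping the naming in userspace like this matches the stated goal of leaving
policy out of the kernel: the filesystem only has to provide the CoW copy.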
A different snapshot model like ext4 could have snapshot.ext4(8)
call the kernel or whatever mechanism was appropriate. A filesystem
from a NAS filer could use filer-specific calls.

Beyond that, I wanted snapshot(8) to handle scheduling of
snapshots. The usual daily/weekly stuff should be easy to schedule
generically.

That's my brain dump. I could enumerate proposed command
syntaxes, but I don't think that's necessary.
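
The mkfs/fsck-style dispatch described above could look roughly like this.
It is only a sketch: the helper name pattern snapshot.<fstype> comes from
the message, while find_fstype, helper_for, and the use of
/proc/mounts-style text to look up the filesystem type are assumptions made
for illustration.

```python
import os


def find_fstype(path, mounts_text):
    """Return the fstype of the longest mount point that contains `path`.

    `mounts_text` is expected in /proc/mounts format:
    "<device> <mountpoint> <fstype> <options> ...".
    """
    path = os.path.normpath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mountpoint, fstype = os.path.normpath(fields[1]), fields[2]
        inside = (mountpoint == "/" or path == mountpoint
                  or path.startswith(mountpoint + "/"))
        # ">=" lets a later entry for the same mount point win.
        if inside and len(mountpoint) >= best_len:
            best_len, best_type = len(mountpoint), fstype
    return best_type


def helper_for(path, mounts_text):
    """Build the name of the per-filesystem helper, e.g. "snapshot.ocfs2"."""
    fstype = find_fstype(path, mounts_text)
    if fstype is None:
        raise LookupError(f"no mount point found for {path}")
    return f"snapshot.{fstype}"
```

A front end built on this would read the mount table, call helper_for() on
the target path, and exec the resulting helper with the user's arguments
unchanged, so that each filesystem keeps its own snapshot mechanism.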

Joel

--

"Depend on the rabbit's foot if you will, but remember, it didn't
help the rabbit."
- R. E. Shay

http://www.jlbec.org/
[email protected]

2011-03-30 04:16:45

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
>> > On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>> > I've already got a design for a front-end snapshot program that
>> > implements a policy on top of this generic behavior. This design would
>> > cover both first-class and hidden style snapshots, because it assumes
>> > snapshots are in a distinct namespace. I haven't gotten around to
>> > implementing it yet, but btrfs and other snapshottable filesystems were
>> > part of the design goal.
>>
>> Any chance of getting a copy of that design of yours, to get a head start
>> for LSF?
>
> Yeah, I owe it to you. It wasn't a written-down thing, it was a
> hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
> to braindump here to get us going. First, I'll speak to your points.
>
>> Here are some other generic snapshot related topics we may want to discuss:
>>
>> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>
> I'm unsure where these fit, perhaps because I missed the
> discussion between Chris and you. ocfs2 has the inode flag
> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
> This is ocfs2's structure for maintaining extent reference counts. Is
> your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
> like: "Set this flag on the inode and it will prevent CoW."

I don't have a use for COW_FL, since my snapshots are volume-level snapshots.
I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
blocks in the volume.
Maybe Chris or Josef can elaborate on the flags' intended use in btrfs.

>
>> 2. How to deal with mmap write to COW file, when you get ENOSPC.
>
> We just fail the write with VM_FAULT_SIGBUS like mmap write to a
> hole. It's what happens for most other CoW filesystems today. If
> you're using CoW, you should be aware of what to expect.
>

"you", meaning a CoW fs developer? A CoW fs administrator? Or an application
developer, who has no idea what fs the application will be on?
I know it is easy for us to say "there is no solution", but I have
actually implemented a block reservation technique that may be useful
in this case...
It's hammered-out-in-my-head, so let's save me the brain dump and I'll tell
you about it in person...

>> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
>> an existing mapping to initialize a page on partial write, but still need
>> to call get_block() to get a (possibly) new mapping.
>
> Since ocfs2 doesn't allocate in get_block(), this doesn't affect
> us. We notice the refcounted extent in write_begin() and CoW it right
> there. Same place we clean up unwritten extents.
>

Yes, I was going to write a specialized block_write_begin() for CoW,
but I like to use existing generic code when possible and block_write_begin()
is only a few lines of code short of what I need, so maybe we can all use it?

> --snip--
>
> Now, about my snapshot thoughts as promised. My understanding
> of the snapshots you have implemented in ext4 is that they are like some
> SAN snapshots; they are hidden objects not visible unless you use
> special access. They are particular to a given inode and are children
> of that inode. What happens when you remove the visible inode? Do the
> snapshots disappear? Do you have limitations on how many snapshots a
> particular inode can have? These questions plagued us when we originally
> set out to design inode snapshots for ocfs2.

ext4 snapshots are volume-level (read-only) snapshots.
The snapshot inodes are both the "place-holder" of private snapshot blocks
and the (loopdev) mount point to access the volume snapshot.
This is why I wondered if inode-level snapshots and volume/subvolume-level
snapshots can share the same API.
BTW, does btrfs have inode-level snapshots as well?

> Once we settled on a mechanism for CoW among ocfs2 inodes, we
> quickly decided that a snapshot should be visible in the namespace.
> This gave rise to the reflink(2) call, though that name is deprecated in
> favor of fastcopy(2). Currently our API is OCFS2_IOC_REFLINK (see,
> legacy!), but we eventually want to get the system call upstream. In
> ocfs2-land, we decided to keep policy out of the kernel.
> OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
> source in CoW fashion, but once it returns, that new inode is a peer of
> the source. There is no parent->child relationship.
> Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
> changed yet), a "snapshot" is just:
>
>     snapshot: reflink source target.snap && chmod 0444 target.snap
>
> You can add "chattr +i target.snap" in there if you like.
> Since there is no "snapshot namespace" stuff for ocfs2 in the
> kernel, it was our intention to propose a snapshot(8) binary that works
> like mkfs/fsck; snapshot(8) just calls snapshot.<fstype>(8). Our
> plan was to place snapshot policy in snapshot.ocfs2(8). This
> implementation would handle managing the <mountpoint>/.snapshot/...
> namespace behind the user:
>
>     $ cd /mnt/ocfs2
>     $ snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.<timestamp>
>     <timestamp>
>     $ snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
>     test
>     $ snapshot list file1
>     Snapshots for file1:
>         <timestamp>
>         test
>
> Something like that.
> A different snapshot model like ext4 could have snapshot.ext4(8)
> call the kernel or whatever mechanism was appropriate. A filesystem
> from a NAS filer could use filer-specific calls.
> Beyond that, I wanted snapshot(8) to handle scheduling of
> snapshots. The usual daily/weekly stuff should be easy to schedule
> generically.
> That's my brain dump. I could enumerate proposed command
> syntaxes, but I don't think that's necessary.
>

No need for that. The snapshot(8) API sounds good.
Let's sit together at LSF with btrfs representatives and finalize this API.
For ext4, I just need the 'file' arg to be optional.
I would like to include some API to attach a snapshot to a namespace
(mount it, in my case) and to see how the inode-level snapshots namespace
and the volume-level snapshots namespace can appear the same to the end user.

I suppose further discussion on the subject should exclude the lsf ml,
which appears to be very hectic these days, so anyone who would like to
join this thread, please say so now.

Thanks,
Amir.

2011-03-30 05:52:10

by Tao Ma

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

Hi Amir,
On 03/30/2011 12:16 PM, Amir Goldstein wrote:
> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
>> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
>>>> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>>>> I've already got a design for a front-end snapshot program that
>>>> implements a policy on top of this generic behavior. This design would
>>>> cover both first-class and hidden style snapshots, because it assumes
>>>> snapshots are in a distinct namespace. I haven't gotten around to
>>>> implementing it yet, but btrfs and other snapshottable filesystems were
>>>> part of the design goal.
>>>
>>> Any chance of getting a copy of that design of yours, to get a head start
>>> for LSF?
>>
>> Yeah, I owe it to you. It wasn't a written-down thing, it was a
>> hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
>> to braindump here to get us going. First, I'll speak to your points.
>>
>>> Here are some other generic snapshot related topics we may want to discuss:
>>>
>>> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>>
>> I'm unsure where these fit, perhaps because I missed the
>> discussion between Chris and you. ocfs2 has the inode flag
>> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
>> This is ocfs2's structure for maintaining extent reference counts. Is
>> your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
>> like: "Set this flag on the inode and it will prevent CoW."
>
> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
> blocks in the volume.
> Maybe Chris or Josef can elaborate on the flags' intended use in btrfs.
>
>>
>>> 2. How to deal with mmap write to COW file, when you get ENOSPC.
>>
>> We just fail the write with VM_FAULT_SIGBUS like mmap write to a
>> hole. It's what happens for most other CoW filesystems today. If
>> you're using CoW, you should be aware of what to expect.
>>
>
> "you", meaning a CoW fs developer? a CoW fs administrator? or an application
> developer, who has no idea what fs the application will be on?
> I know it is easy for us to say "there is no solution", but I have
> actually implemented
> a block reservation technique that may be useful in this case...
> it's hammered-out-in-my-head, so let's save me the brain dump and I'll tell
> you about it in person...
>
>
>>> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
>>> an existing mapping to initialize a page on partial write, but still need
>>> to call get_block() to get a (possibly) new mapping.
>>
>> Since ocfs2 doesn't allocate in get_block(), this doesn't affect
>> us. We notice the refcounted extent in write_begin() and CoW it right
>> there. Same place we clean up unwritten extents.
>>
>
> Yes, I was going to write a specialized block_write_begin() for CoW,
> but I like to use existing generic code when possible and block_write_begin()
> is only a few lines of code short of what I need, so maybe we can all use it?
>
>
>> --snip--
>>
>> Now, about my snapshot thoughts as promised. My understanding
>> of the snapshots you have implemented in ext4 is that they are like some
>> SAN snapshots; they are hidden objects not visible unless you use
>> special access. They are particular to a given inode and are children
>> of that inode. What happens when you remove the visible inode? Do the
>> snapshots disappear? Do you have limitations on how many snapshots a
>> particular inode can have? These questions plagued us when we originally
>> set out to design inode snapshots for ocfs2.
>
> ext4 snapshots are volume level (readonly) snapshots.
> the snapshot inodes are both the "place-holder" of private snapshot blocks
> and the (loopdev) mount point to access the volume snapshot.
> This is why I wondered if inode level snapshots and volume/subvolume
> level snapshots can share the same API.
> BTW, does btrfs have inode level snapshots as well?
>
>> Once we settled on a mechanism for CoW among ocfs2 inodes, we
>> quickly decided that a snapshot should be visible in the namespace.
>> This gave rise to the reflink(2) call, though that name is deprecated in
>> favor of fastcopy(2). Currently our API is OCFS2_IOC_REFLINK (see,
>> legacy!), but we eventually want to get the system call upstream. In
>> ocfs2-land, we decided to keep policy out of the kernel.
>> OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
>> source in CoW fashion, but once it returns, that new inode is a peer of
>> the source. There is no parent->child relationship.
>> Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
>> changed yet), a "snapshot" is just:
>>
>> snapshot: reflink source target.snap && chmod 0444 target.snap
>>
>> You can add "chattr +i target.snap" in there if you like.
>> Since there is no "snapshot namespace" stuff for ocfs2 in the
>> kernel, it was our intention to propose a snapshot(8) binary that works
>> like mkfs/fsck; snapshot(8) just calls snapshot.<fstype>(8). Our
>> plan was to place snapshot policy in snapshot.ocfs2(8). This
>> implementation would handle managing the <mountpoint>/.snapshot/...
>> namespace behind the user:
>>
>>     $ cd /mnt/ocfs2
>>     $ snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.<timestamp>
>>     <timestamp>
>>     $ snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
>>     test
>>     $ snapshot list file1
>>     Snapshots for file1:
>>         <timestamp>
>>         test
>>
>> Something like that.
>> A different snapshot model like ext4 could have snapshot.ext4(8)
>> call the kernel or whatever mechanism was appropriate. A filesystem
>> from a NAS filer could use filer-specific calls.
>> Beyond that, I wanted snapshot(8) to handle scheduling of
>> snapshots. The usual daily/weekly stuff should be easy to schedule
>> generically.
>> That's my brain dump. I could enumerate proposed command
>> syntaxes, but I don't think that's necessary.
>>
>
> No need for that. snapshot(8) API sounds good.
> Let's sit together in LSF with btrfs representatives and finalize this API.
> For ext4, I just need for the 'file' arg to be optional.
> I would like to include some API to attach a snapshot to a namespace
> (mount it in my case) and to see how the inode level snapshots namespace
> and volume level snapshots namespace will appear the same to the end-user.
>
> I suppose further discussion on the subject should exclude lsf ml,
> which appear to be very hectic these days, so anyone who likes to join this
> thread, please say so now.

I implemented the reflink support in ocfs2, so please cc me when you
open a private thread about this topic. Thanks.

Regards,
Tao

2011-03-30 06:05:38

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 7:52 AM, Tao Ma <[email protected]> wrote:
> Hi Amir,
> On 03/30/2011 12:16 PM, Amir Goldstein wrote:
>> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
>>> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>>>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
>>>>> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>>>>> I've already got a design for a front-end snapshot program that
>>>>> implements a policy on top of this generic behavior. This design would
>>>>> cover both first-class and hidden style snapshots, because it assumes
>>>>> snapshots are in a distinct namespace. I haven't gotten around to
>>>>> implementing it yet, but btrfs and other snapshottable filesystems were
>>>>> part of the design goal.
>>>>
>>>> Any chance of getting a copy of that design of yours, to get a head start
>>>> for LSF?
>>>
>>> Yeah, I owe it to you. It wasn't a written-down thing, it was a
>>> hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
>>> to braindump here to get us going. First, I'll speak to your points.
>>>
>>>> Here are some other generic snapshot related topics we may want to discuss:
>>>>
>>>> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>>>
>>> I'm unsure where these fit, perhaps because I missed the
>>> discussion between Chris and you. ocfs2 has the inode flag
>>> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
>>> This is ocfs2's structure for maintaining extent reference counts. Is
>>> your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
>>> like: "Set this flag on the inode and it will prevent CoW."
>>
>> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
>> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
>> blocks in the volume.
>> Maybe Chris or Josef can elaborate on the flags' intended use in btrfs.
>>
>>>
>>>> 2. How to deal with mmap write to COW file, when you get ENOSPC.
>>>
>>> We just fail the write with VM_FAULT_SIGBUS like mmap write to a
>>> hole.

OK. "private" thread is opened.
Just wanted to clarify that there are 2 differences I notice between an mmap
write to a hole and an mmap write to a COWed file with ENOSPC:

1. A "good" application can avoid mmap write to a hole.

2. When filling a hole, the mkwrite callback is used (in ext4) to
reserve disk space for delayed allocation when a page becomes writable.
With COW, a page may already be writable when the flush encounters COW
with ENOSPC. That flush can even happen after the application has exited,
so the data will be dropped on the floor silently (like in ext3).
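
Point 1 can be illustrated from the application side. This is a sketch,
assuming a Unix system where Python's os.posix_fallocate is available; the
helper name open_preallocated_map is made up for the example. Preallocating
before mapping means the mapped range has no holes, so an allocation failure
shows up as an error from the preallocation call instead of a SIGBUS on a
later store. It does nothing for point 2: if the blocks are shared with a
snapshot, a later write may still need a CoW allocation and hit ENOSPC.

```python
import mmap
import os


def open_preallocated_map(path, size):
    """Create `path`, reserve `size` bytes of disk space, and mmap it."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises OSError (e.g. ENOSPC) here, at a point where the
        # application can still handle it, rather than SIGBUS later.
        os.posix_fallocate(fd, 0, size)
        return mmap.mmap(fd, size, mmap.MAP_SHARED,
                         mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        os.close(fd)  # the mapping keeps its own reference to the file
```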

>>> It's what happens for most other CoW filesystems today. If
>>> you're using CoW, you should be aware of what to expect.
>>>
>>
>> "you", meaning a CoW fs developer? a CoW fs administrator? or an application
>> developer, who has no idea what fs the application will be on?
>> I know it is easy for us to say "there is no solution", but I have
>> actually implemented
>> a block reservation technique that may be useful in this case...
>> it's hammered-out-in-my-head, so let's save me the brain dump and I'll tell
>> you about it in person...
>>
>>
>>>> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
>>>> an existing mapping to initialize a page on partial write, but still need
>>>> to call get_block() to get a (possibly) new mapping.
>>>
>>> Since ocfs2 doesn't allocate in get_block(), this doesn't affect
>>> us. We notice the refcounted extent in write_begin() and CoW it right
>>> there. Same place we clean up unwritten extents.
>>>
>>
>> Yes, I was going to write a specialized block_write_begin() for CoW,
>> but I like to use existing generic code when possible and block_write_begin()
>> is only a few lines of code short of what I need, so maybe we can all use it?
>>
>>
>>> --snip--
>>>
>>> Now, about my snapshot thoughts as promised. My understanding
>>> of the snapshots you have implemented in ext4 is that they are like some
>>> SAN snapshots; they are hidden objects not visible unless you use
>>> special access. They are particular to a given inode and are children
>>> of that inode. What happens when you remove the visible inode? Do the
>>> snapshots disappear? Do you have limitations on how many snapshots a
>>> particular inode can have? These questions plagued us when we originally
>>> set out to design inode snapshots for ocfs2.
>>
>> ext4 snapshots are volume level (readonly) snapshots.
>> the snapshot inodes are both the "place-holder" of private snapshot blocks
>> and the (loopdev) mount point to access the volume snapshot.
>> This is why I wondered if inode level snapshots and volume/subvolume
>> level snapshots can share the same API.
>> BTW, does btrfs have inode level snapshots as well?
>>
>>> Once we settled on a mechanism for CoW among ocfs2 inodes, we
>>> quickly decided that a snapshot should be visible in the namespace.
>>> This gave rise to the reflink(2) call, though that name is deprecated in
>>> favor of fastcopy(2). Currently our API is OCFS2_IOC_REFLINK (see,
>>> legacy!), but we eventually want to get the system call upstream. In
>>> ocfs2-land, we decided to keep policy out of the kernel.
>>> OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
>>> source in CoW fashion, but once it returns, that new inode is a peer of
>>> the source. There is no parent->child relationship.
>>> Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
>>> changed yet), a "snapshot" is just:
>>>
>>>     snapshot: reflink source target.snap && chmod 0444 target.snap
>>>
>>> You can add "chattr +i target.snap" in there if you like.
>>> Since there is no "snapshot namespace" stuff for ocfs2 in the
>>> kernel, it was our intention to propose a snapshot(8) binary that works
>>> like mkfs/fsck; snapshot(8) just calls snapshot.<fstype>(8). Our
>>> plan was to place snapshot policy in snapshot.ocfs2(8). This
>>> implementation would handle managing the <mountpoint>/.snapshot/...
>>> namespace behind the user:
>>>
>>>     $ cd /mnt/ocfs2
>>>     $ snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.<timestamp>
>>>     <timestamp>
>>>     $ snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
>>>     test
>>>     $ snapshot list file1
>>>     Snapshots for file1:
>>>         <timestamp>
>>>         test
>>>
>>> Something like that.
>>> A different snapshot model like ext4 could have snapshot.ext4(8)
>>> call the kernel or whatever mechanism was appropriate. A filesystem
>>> from a NAS filer could use filer-specific calls.
>>> Beyond that, I wanted snapshot(8) to handle scheduling of
>>> snapshots. The usual daily/weekly stuff should be easy to schedule
>>> generically.
>>> That's my brain dump. I could enumerate proposed command
>>> syntaxes, but I don't think that's necessary.
>>>
>>
>> No need for that. snapshot(8) API sounds good.
>> Let's sit together in LSF with btrfs representatives and finalize this API.
>> For ext4, I just need for the 'file' arg to be optional.
>> I would like to include some API to attach a snapshot to a namespace
>> (mount it in my case) and to see how the inode level snapshots namespace
>> and volume level snapshots namespace will appear the same to the end-user.
>>
>> I suppose further discussion on the subject should exclude lsf ml,
>> which appear to be very hectic these days, so anyone who likes to join this
>> thread, please say so now.
> I implemented the reflink support in ocfs2, so please cc me when you
> open a private thread about this topic. Thanks.
>
> Regards,
> Tao
>
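
[Editor's sketch] The snapshot(8) front end described in the quoted brain dump can be illustrated with a few lines of Python. This is only a guess at the shape of the idea, not the ocfs2 developers' design: the function names, the timestamp format, and the rule for parsing labels are all assumptions made for illustration. It shows the mkfs-style dispatch to snapshot.<fstype> and the <mountpoint>/.snapshot/<file>.<label> naming used in the example session.

```python
import posixpath
import time


def helper_name(fstype):
    # snapshot(8) would hand off to snapshot.<fstype>(8), the way mkfs does.
    return "snapshot." + fstype


def snapshot_path(mountpoint, filename, label=None, now=None):
    # Hypothetical naming rule: <mountpoint>/.snapshot/<file>.<label>.
    # When no label is given, fall back to a UTC timestamp (format assumed).
    if label is None:
        label = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    return posixpath.join(mountpoint, ".snapshot", filename + "." + label)


def list_snapshots(filename, existing):
    # Given the names found in .snapshot/, return the labels for one file.
    prefix = filename + "."
    return sorted(n[len(prefix):] for n in existing if n.startswith(prefix))
```

A file whose own name contains a dot would make the label parsing ambiguous, so a real tool would need a stricter naming scheme or separate metadata.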

2011-03-30 10:33:23

by Joel Becker

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 08:05:38AM +0200, Amir Goldstein wrote:
> Just wanted to clarify there are 2 differences I notice between mmap
> write to a hole
> and mmap write to COWed file with ENOSPC:
>
> 1. A "good" application can avoid mmap write to a hole.
>
> 2. when initiating a hole, the mkwrite callback is in used (in ext4) to
> reserve disk space for delayed allocation when a page becomes writable.
> with COW a page may already be writable when the flush encounters COW
> with ENOSPC. that flush can even happen after the application has exited,
> so the data will be dropped on the floor silently (like in ext3).

ocfs2 doesn't have delayed allocation yet, so we try and fail
the allocation in write_begin() right under mkwrite().

Joel

--

The Graham Corollary:

The longer a socially-moderated news website exists, the
probability of an old Paul Graham link appearing at the top
approaches certainty.

http://www.jlbec.org/

2011-03-30 10:46:24

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 12:33 PM, Joel Becker <[email protected]> wrote:
> On Wed, Mar 30, 2011 at 08:05:38AM +0200, Amir Goldstein wrote:
>> Just wanted to clarify there are 2 differences I notice between mmap
>> write to a hole
>> and mmap write to COWed file with ENOSPC:
>>
>> 1. A "good" application can avoid mmap write to a hole.
>>
>> 2. when initiating a hole, the mkwrite callback is in used (in ext4) to
>> reserve disk space for delayed allocation when a page becomes writable.
>> with COW a page may already be writable when the flush encounters COW
>> with ENOSPC. that flush can even happen after the application has exited,
>> so the data will be dropped on the floor silently (like in ext3).
>
>        ocfs2 doesn't have delayed allocation yet, so we try and fail
> the allocation in write_begin() right under mkwrite().
>

And what if the page is already writable?
Do you go over all inode pages and make them RO after fastcopy?
For volume-level snapshots this isn't a sensible option.

Amir.
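
[Editor's sketch] The difference being argued here can be modelled with a toy block counter. This is not ext4 or ocfs2 code; the class and method names are hypothetical, and the model ignores everything except free-space accounting. It contrasts reserving space when a page becomes writable (so ENOSPC reaches the application) with discovering the need for a COW block only at writeback time.

```python
import errno


class ToyFs:
    def __init__(self, free_blocks):
        self.free = free_blocks   # blocks not yet allocated
        self.reserved = 0         # blocks promised to dirty pages

    def page_mkwrite(self):
        # Reserve at fault time: a failure here is reported to the writer.
        if self.free - self.reserved < 1:
            raise OSError(errno.ENOSPC, "no space left for reservation")
        self.reserved += 1

    def flush_reserved(self):
        # Writeback of a page that reserved space earlier cannot fail.
        self.reserved -= 1
        self.free -= 1

    def flush_cow_unreserved(self):
        # The page was already writable when the snapshot was taken, so no
        # reservation exists. If the COW allocation fails, the writer may
        # already have exited and the data is silently lost.
        if self.free - self.reserved < 1:
            return False
        self.free -= 1
        return True
```

In this model the only way to make the second case safe is to force a new write fault after the snapshot, which is what the question about making pages read-only is getting at.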

2011-03-30 11:51:36

by Chris Mason

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

Excerpts from Amir Goldstein's message of 2011-03-30 00:16:45 -0400:
> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
> > On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
> >> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
> >> > On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
> >> > I've already got a design for a front-end snapshot program that
> >> > implements a policy on top this generic behavior. This design would
> >> > cover both first-class and hidden style snapshots, because it assume
> >> > snapshots are in a distinct namespace. I haven't gotten around to
> >> > implementing it yet, but btrfs and other snapshottable filesystems were
> >> > part of the design goal.
> >>
> >> Any chance of getting a copy of that design of yours, to get a head start
> >> for LSF?
> >
> > Yeah, I owe it to you. It wasn't a written-down thing, it was a
> > hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
> > to braindump here to get us going. First, I'll speak to your points.
> >
> >> Here are some other generic snapshot related topics we may want to discuss:
> >>
> >> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
> >
> > I'm unsure where these fit, perhaps because I missed the
> > discussion between Chris and you. ocfs2 has the inode flag
> > OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
> > This is ocfs2's structure for maintaining extent reference counts. Is
> > your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
> > like: "Set this flag on the inode and it will prevent CoW."
>
> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
> blocks in the volume.
> Maybe Chris or Josef can elaborate of the flags intended use in btrfs.

NOCOW_FL in btrfs means to directly overwrite blocks (and not do crcs)
unless the block has another reference. If there is another reference,
we COW once to honor the snapshot and then continue in NOCOW mode.

I'm kind of worried about your NOCOW island idea, maybe we can talk more
about that next week. It seems like it will lead to a lot of admin
surprises.

-chris
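
[Editor's sketch] The rule Chris describes (overwrite in place unless the block has another reference, in which case COW once and then carry on in NOCOW mode) can be written as a small reference-count model. This is an illustration only, not btrfs code; the Volume class and its methods are invented for the example.

```python
class Volume:
    def __init__(self):
        self.refs = {}        # block number -> reference count
        self.next_block = 0

    def alloc(self):
        blk = self.next_block
        self.next_block += 1
        self.refs[blk] = 1
        return blk

    def snapshot_ref(self, blk):
        # A snapshot takes an extra reference on an existing block.
        self.refs[blk] += 1

    def nocow_write(self, blk):
        # Return the block number that receives the write.
        if self.refs[blk] == 1:
            return blk            # sole owner: overwrite in place
        self.refs[blk] -= 1       # shared: leave the old block to the snapshot
        return self.alloc()       # COW once; the new block has one owner
```

After the single COW the file owns its new block exclusively, so later writes go back to overwriting in place until another snapshot takes a reference.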

2011-03-30 12:08:52

by Amir Goldstein

Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, Mar 30, 2011 at 1:50 PM, Chris Mason <[email protected]> wrote:
> Excerpts from Amir Goldstein's message of 2011-03-30 00:16:45 -0400:
>> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
>> > On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>> >> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
>> >> > On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>> >> >        I've already got a design for a front-end snapshot program that
>> >> > implements a policy on top this generic behavior.  This design would
>> >> > cover both first-class and hidden style snapshots, because it assume
>> >> > snapshots are in a distinct namespace.  I haven't gotten around to
>> >> > implementing it yet, but btrfs and other snapshottable filesystems were
>> >> > part of the design goal.
>> >>
>> >> Any chance of getting a copy of that design of yours, to get a head start
>> >> for LSF?
>> >
>> >        Yeah, I owe it to you.  It wasn't a written-down thing, it was a
>> > hammered-out-in-our-heads thing among some ocfs2 developers.  I'm going
>> > to braindump here to get us going.  First, I'll speak to your points.
>> >
>> >> Here are some other generic snapshot related topics we may want to discuss:
>> >>
>> >> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>> >
>> >        I'm unsure where these fit, perhaps because I missed the
>> > discussion between Chris and you.  ocfs2 has the inode flag
>> > OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
>> > This is ocfs2's structure for maintaining extent reference counts.  Is
>> > your COW_FL the same?  Or is it a permission flag?  NOCOW_FL sounds
>> > like: "Set this flag on the inode and it will prevent CoW."
>>
>> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
>> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
>> blocks in the volume.
>> Maybe Chris or Josef can elaborate of the flags intended use in btrfs.
>
> NOCOW_FL in btrfs means to directly overwrite blocks (and not do crcs)
> unless the block has another reference.  If there is another reference,
> we COW once to honor the snapshot and then continue in NOCOW mode.
>
> I'm kind of worried about your NOCOW island idea, maybe we can talk more
> about that next week.  It seems like it will lead to a lot of admin
> surprises.
>

Yes, that's something to talk about.
My desire for NOCOW comes from lack of sub volume granularity
in ext4 snapshots.

My NOCOW design states that NOCOW flag cannot be toggled on a regular file.
Like a snapshot file, a NOCOW file must be born and die NOCOW, to avoid
admin surprises. NOCOW directories (which ARE COWed) are where NOCOW
files are born.

Using this scheme, an admin can exclude->include->exclude directory sub trees
from snapshots.

Amir.
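
[Editor's sketch] The NOCOW rules stated in this message can be modelled as follows. This is not the ext4 snapshots implementation; the Node class and helper functions are hypothetical. It also assumes that a new subdirectory inherits its parent's flag, which the message does not say explicitly.

```python
class Node:
    def __init__(self, is_dir, nocow):
        self.is_dir = is_dir
        self.nocow = nocow


def create(parent, is_dir=False):
    # A new inode is "born" with its parent directory's NOCOW state.
    # (Inheritance by subdirectories is an assumption of this sketch.)
    if not parent.is_dir:
        raise NotADirectoryError("parent must be a directory")
    return Node(is_dir, parent.nocow)


def set_nocow(node, value):
    # Only directories may be toggled; a regular file keeps the state it
    # was created with for its whole life.
    if not node.is_dir:
        raise PermissionError("NOCOW cannot be toggled on a regular file")
    node.nocow = value
```

Under these rules an excluded directory can contain an included subdirectory, which in turn can contain an excluded one, and toggling a directory later never changes the state of files already created inside it.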

2011-04-01 00:10:26

by Myklebust, Trond

Subject: Re: [Lsf-pc] [LSF/FS TOPIC] Ext4 snapshots status update

On Wed, 2011-03-30 at 14:08 +0200, Amir Goldstein wrote:
> On Wed, Mar 30, 2011 at 1:50 PM, Chris Mason <[email protected]> wrote:
> > Excerpts from Amir Goldstein's message of 2011-03-30 00:16:45 -0400:
> >> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]> wrote:
> >> > On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
> >> >> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker <[email protected]> wrote:
> >> >> > On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
> >> >> > I've already got a design for a front-end snapshot program that
> >> >> > implements a policy on top this generic behavior. This design would
> >> >> > cover both first-class and hidden style snapshots, because it assume
> >> >> > snapshots are in a distinct namespace. I haven't gotten around to
> >> >> > implementing it yet, but btrfs and other snapshottable filesystems were
> >> >> > part of the design goal.
> >> >>
> >> >> Any chance of getting a copy of that design of yours, to get a head start
> >> >> for LSF?
> >> >
> >> > Yeah, I owe it to you. It wasn't a written-down thing, it was a
> >> > hammered-out-in-our-heads thing among some ocfs2 developers. I'm going
> >> > to braindump here to get us going. First, I'll speak to your points.
> >> >
> >> >> Here are some other generic snapshot related topics we may want to discuss:
> >> >>
> >> >> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
> >> >
> >> > I'm unsure where these fit, perhaps because I missed the
> >> > discussion between Chris and you. ocfs2 has the inode flag
> >> > OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
> >> > This is ocfs2's structure for maintaining extent reference counts. Is
> >> > your COW_FL the same? Or is it a permission flag? NOCOW_FL sounds
> >> > like: "Set this flag on the inode and it will prevent CoW."
> >>
> >> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
> >> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
> >> blocks in the volume.
> >> Maybe Chris or Josef can elaborate of the flags intended use in btrfs.
> >
> > NOCOW_FL in btrfs means to directly overwrite blocks (and not do crcs)
> > unless the block has another reference. If there is another reference,
> > we COW once to honor the snapshot and then continue in NOCOW mode.
> >
> > I'm kind of worried about your NOCOW island idea, maybe we can talk more
> > about that next week. It seems like it will lead to a lot of admin
> > surprises.
> >
>
> Yes, that's something to talk about.
> My desire for NOCOW comes from lack of sub volume granularity
> in ext4 snapshots.
>
> My NOCOW design states that NOCOW flag cannot be toggled on a regular file.
> like a snapshot file, a NOCOW file must be born and die NOCOW, to avoid
> admin surprises. NOCOW directories (which ARE COWed) are where NOCOW
> files are born.
>
> Using this scheme, an admin can exclude->include->exclude directory sub trees
> from snapshots.

OK. I'd like to schedule a general talk about the state of snapshots and
future improvements. I'm assuming you would like to lead the debate.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
http://www.netapp.com

2011-04-01 03:58:07

by Amir Goldstein

Subject: Re: [Lsf-pc] [LSF/FS TOPIC] Ext4 snapshots status update

Sent from my iPhone

On 31/03/2011, at 17:10, Trond Myklebust <[email protected]>
wrote:

> On Wed, 2011-03-30 at 14:08 +0200, Amir Goldstein wrote:
>> On Wed, Mar 30, 2011 at 1:50 PM, Chris Mason
>> <[email protected]> wrote:
>>> Excerpts from Amir Goldstein's message of 2011-03-30 00:16:45 -0400:
>>>> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker <[email protected]>
>>>> wrote:
>>>>> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>>>>>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker
>>>>>> <[email protected]> wrote:
>>>>>>> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>>>>>>> I've already got a design for a front-end snapshot
>>>>>>> program that
>>>>>>> implements a policy on top this generic behavior. This design
>>>>>>> would
>>>>>>> cover both first-class and hidden style snapshots, because it
>>>>>>> assume
>>>>>>> snapshots are in a distinct namespace. I haven't gotten
>>>>>>> around to
>>>>>>> implementing it yet, but btrfs and other snapshottable
>>>>>>> filesystems were
>>>>>>> part of the design goal.
>>>>>>
>>>>>> Any chance of getting a copy of that design of yours, to get a
>>>>>> head start
>>>>>> for LSF?
>>>>>
>>>>> Yeah, I owe it to you. It wasn't a written-down thing, it
>>>>> was a
>>>>> hammered-out-in-our-heads thing among some ocfs2 developers.
>>>>> I'm going
>>>>> to braindump here to get us going. First, I'll speak to your
>>>>> points.
>>>>>
>>>>>> Here are some other generic snapshot related topics we may want
>>>>>> to discuss:
>>>>>>
>>>>>> 1. Collaborating the use of inode flags COW_FL, NOCOW_FL,
>>>>>> suggested by Chris.
>>>>>
>>>>> I'm unsure where these fit, perhaps because I missed the
>>>>> discussion between Chris and you. ocfs2 has the inode flag
>>>>> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to
>>>>> the inode.
>>>>> This is ocfs2's structure for maintaining extent reference
>>>>> counts. Is
>>>>> your COW_FL the same? Or is it a permission flag? NOCOW_FL
>>>>> sounds
>>>>> like: "Set this flag on the inode and it will prevent CoW."
>>>>
>>>> I don't have a use for COW_FL, since my snapshots are volume
>>>> level snapshots.
>>>> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
>>>> blocks in the volume.
>>>> Maybe Chris or Josef can elaborate of the flags intended use in
>>>> btrfs.
>>>
>>> NOCOW_FL in btrfs means to directly overwrite blocks (and not do
>>> crcs)
>>> unless the block has another reference. If there is another
>>> reference,
>>> we COW once to honor the snapshot and then continue in NOCOW mode.
>>>
>>> I'm kind of worried about your NOCOW island idea, maybe we can
>>> talk more
>>> about that next week. It seems like it will lead to a lot of admin
>>> surprises.
>>>
>>
>> Yes, that's something to talk about.
>> My desire for NOCOW comes from lack of sub volume granularity
>> in ext4 snapshots.
>>
>> My NOCOW design states that NOCOW flag cannot be toggled on a
>> regular file.
>> like a snapshot file, a NOCOW file must be born and die NOCOW, to
>> avoid
>> admin surprises. NOCOW directories (which ARE COWed) are where NOCOW
>> files are born.
>>
>> Using this scheme, an admin can exclude->include->exclude directory
>> sub trees
>> from snapshots.
>
> OK. I'd like to schedule a general talk about the state of snapshots
> and
> future improvements. I'm assuming you would like to lead the debate.
>
Sure, I can do that.

Thanks
Amir

> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>