From: Amir Goldstein
Subject: Re: [LSF/FS TOPIC] Ext4 snapshots status update
Date: Wed, 30 Mar 2011 08:05:38 +0200
To: Tao Ma
Cc: Joel Becker, linux-fsdevel, Ext4 Developers List, Theodore Tso, Chris Mason, Josef Bacik
In-Reply-To: <4D92C508.7010404@tao.ma>
References: <20110204002043.GA15658@noexit> <20110330003429.GA32669@noexit> <4D92C508.7010404@tao.ma>
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Wed, Mar 30, 2011 at 7:52 AM, Tao Ma wrote:
> Hi Amir,
> On 03/30/2011 12:16 PM, Amir Goldstein wrote:
>> On Wed, Mar 30, 2011 at 2:34 AM, Joel Becker wrote:
>>> On Wed, Mar 23, 2011 at 10:19:38PM +0200, Amir Goldstein wrote:
>>>> On Fri, Feb 4, 2011 at 2:20 AM, Joel Becker wrote:
>>>>> On Fri, Feb 04, 2011 at 12:33:39AM +0200, Amir Goldstein wrote:
>>>>>        I've already got a design for a front-end snapshot program that
>>>>> implements a policy on top of this generic behavior.  This design would
>>>>> cover both first-class and hidden style snapshots, because it assumes
>>>>> snapshots are in a distinct namespace.  I haven't gotten around to
>>>>> implementing it yet, but btrfs and other snapshottable filesystems were
>>>>> part of the design goal.
>>>>
>>>> Any chance of getting a copy of that design of yours, to get a head start
>>>> for LSF?
>>>
>>>        Yeah, I owe it to you.  It wasn't a written-down thing, it was a
>>> hammered-out-in-our-heads thing among some ocfs2 developers.  I'm going
>>> to braindump here to get us going.  First, I'll speak to your points.
>>>
>>>> Here are some other generic snapshot related topics we may want to discuss:
>>>>
>>>> 1. Collaborating on the use of inode flags COW_FL, NOCOW_FL, suggested by Chris.
>>>
>>>        I'm unsure where these fit, perhaps because I missed the
>>> discussion between Chris and you.  ocfs2 has the inode flag
>>> OCFS2_REFCOUNTED_FL to signify a refcount tree is attached to the inode.
>>> This is ocfs2's structure for maintaining extent reference counts.  Is
>>> your COW_FL the same?  Or is it a permission flag?  NOCOW_FL sounds
>>> like: "Set this flag on the inode and it will prevent CoW."
>>
>> I don't have a use for COW_FL, since my snapshots are volume level snapshots.
>> I intend to use NOCOW_FL to mark an inode as an "island" of NOCOW
>> blocks in the volume.
>> Maybe Chris or Josef can elaborate on the flags' intended use in btrfs.
>>
>>>
>>>> 2. How to deal with mmap write to COW file, when you get ENOSPC.
>>>
>>>        We just fail the write with VM_FAULT_SIGBUS like mmap write to a
>>> hole.

OK. "private" thread is opened.
Just wanted to clarify that there are 2 differences I notice between
mmap write to a hole and mmap write to a COWed file with ENOSPC:
1. A "good" application can avoid mmap write to a hole.
2. When initiating a hole, the mkwrite callback is used (in ext4) to reserve
   disk space for delayed allocation when a page becomes writable.
   With COW, a page may already be writable when the flush encounters COW
   with ENOSPC. That flush can even happen after the application has exited,
   so the data will be dropped on the floor silently (like in ext3).

>>> It's what happens for most other CoW filesystems today.  If
>>> you're using CoW, you should be aware of what to expect.
>>>
>>
>> "you", meaning a CoW fs developer? a CoW fs administrator? or an application
>> developer, who has no idea what fs the application will be on?
>> I know it is easy for us to say "there is no solution", but I have
>> actually implemented a block reservation technique that may be useful
>> in this case...
>> it's hammered-out-in-my-head, so let's save me the brain dump and I'll tell
>> you about it in person...
>>
>>
>>>> 3. Adding buffer_remap() flag for buffered I/O code, meaning, there is
>>>> an existing mapping to initialize a page on partial write, but we still need
>>>> to call get_block() to get a (possibly) new mapping.
>>>
>>>        Since ocfs2 doesn't allocate in get_block(), this doesn't affect
>>> us.  We notice the refcounted extent in write_begin() and CoW it right
>>> there.  Same place we clean up unwritten extents.
>>>
>>
>> Yes, I was going to write a specialized block_write_begin() for CoW,
>> but I like to use existing generic code when possible and block_write_begin()
>> is only a few lines of code short of what I need, so maybe we can all use it?
>>
>>
>>> --snip--
>>>
>>>        Now, about my snapshot thoughts as promised.  My understanding
>>> of the snapshots you have implemented in ext4 is that they are like some
>>> SAN snapshots; they are hidden objects not visible unless you use
>>> special access.  They are particular to a given inode and are children
>>> of that inode.  What happens when you remove the visible inode?  Do the
>>> snapshots disappear?  Do you have limitations on how many snapshots a
>>> particular inode can have?  These questions plagued us when we originally
>>> set out to design inode snapshots for ocfs2.
>>
>> ext4 snapshots are volume level (readonly) snapshots.
>> The snapshot inodes are both the "place-holder" of private snapshot blocks
>> and the (loopdev) mount point to access the volume snapshot.
>> This is why I wondered if inode level snapshots and volume/subvolume
>> level snapshots can share the same API.
>> BTW, does btrfs have inode level snapshots as well?
>>
>>>        Once we settled on a mechanism for CoW among ocfs2 inodes, we
>>> quickly decided that a snapshot should be visible in the namespace.
>>> This gave rise to the reflink(2) call, though that name is deprecated in
>>> favor of fastcopy(2).  Currently our API is OCFS2_IOC_REFLINK (see,
>>> legacy!), but we eventually want to get the system call upstream.  In
>>> ocfs2-land, we decided to keep policy out of the kernel.
>>> OCFS2_IOC_REFLINK creates a new inode that shares all the extents of the
>>> source in CoW fashion, but once it returns, that new inode is a peer of
>>> the source.  There is no parent->child relationship.
>>>        Thus, for ocfs2 (and forgive the legacy names, the binary hasn't
>>> changed yet), a "snapshot" is just:
>>>
>>>    snapshot: reflink source target.snap && chmod 0444 target.snap
>>>
>>> You can add "chattr +i target.snap" in there if you like.
>>>        Since there is no "snapshot namespace" stuff for ocfs2 in the
>>> kernel, it was our intention to propose a snapshot(8) binary that works
>>> like mkfs/fsck; snapshot(8) just calls snapshot.(8).  Our
>>> plan was to place snapshot policy in snapshot.ocfs2(8).  This
>>> implementation would handle managing the /.snapshot/...
>>> namespace behind the user:
>>>
>>>    ? cd /mnt/ocfs2
>>>    ? snapshot file1  # Creates /mnt/ocfs2/.snapshot/file1.
>>>
>>>    ? snapshot file1 test  # Creates /mnt/ocfs2/.snapshot/file1.test
>>>    test
>>>    ? snapshot list file1
>>>    Snapshots for file1:
>>>
>>>        test
>>>
>>> Something like that.
>>>        A different snapshot model like ext4 could have snapshot.ext4(8)
>>> call the kernel or whatever mechanism was appropriate.  A filesystem
>>> from a NAS filer could use filer-specific calls.
>>>        Beyond that, I wanted snapshot(8) to handle scheduling of
>>> snapshots.  The usual daily/weekly stuff should be easy to schedule
>>> generically.
>>>        That's my brain dump.  I could enumerate proposed command
>>> syntaxes, but I don't think that's necessary.
>>>
>>
>> No need for that. snapshot(8) API sounds good.
>> Let's sit together in LSF with btrfs representatives and finalize this API.
>> For ext4, I just need the 'file' arg to be optional.
>> I would like to include some API to attach a snapshot to a namespace
>> (mount it in my case) and to see how the inode level snapshots namespace
>> and volume level snapshots namespace will appear the same to the end-user.
>>
>> I suppose further discussion on the subject should exclude lsf ml,
>> which appears to be very hectic these days, so anyone who would like to join
>> this thread, please say so now.
> I implemented the reflink support in ocfs2, so please cc me when you
> open a private thread about this topic. Thanks.
>
> Regards,
> Tao