Hi All,
For over a year, Next3 has been developed in-house by CTERA networks,
as part of its NAS appliances.
Now that the appliances are out in the market, Next3 project can
finally be shared with the world.
Main Next3 features:
- Backward and forward compatible with Ext3
- Incremental, volume level, read-only snapshots
- Snapshots use available file system disk space
- Snapshot deletion frees up disk space
- Retains Ext3 stability including journaling and fsck
- Minimal performance overhead (in average usage scenarios)
- No upper limit on number or size of snapshots
Please visit Next3 wiki page:
http://sourceforge.net/apps/mediawiki/next3/
Next3 project is looking for code reviewers, beta testers and public attention.
Would love to read your comments on Next3-users mailing list:
https://lists.sourceforge.net/lists/listinfo/next3-users
Amir.
On Sun, Apr 18, 2010 at 6:41 PM, Amir G. wrote:
> Hi All,
>
> For over a year, Next3 has been developed in-house by CTERA networks,
> as part of its NAS appliances.
> Now that the appliances are out in the market, Next3 project can
> finally be shared with the world.
>
> Main Next3 features:
> - Backward and forward compatible with Ext3
> - Incremental, volume level, read-only snapshots
> - Snapshots use available file system disk space
> - Snapshot deletion frees up disk space
> - Retains Ext3 stability including journaling and fsck
> - Minimal performance overhead (in average usage scenarios)
> - No upper limit on number or size of snapshots
>
> Please visit Next3 wiki page:
> http://sourceforge.net/apps/mediawiki/next3/
>
> Next3 project is looking for code reviewers, beta testers and public attention.
>
> Would love to read your comments on Next3-users mailing list:
> https://lists.sourceforge.net/lists/listinfo/next3-users
>
> Amir.
>
Hi Ted,
Next3 project was introduced 2 weeks ago. I hope you've had a chance
to visit the wiki page.
Since the snapshot support patches do not fall into standard patch
categories (feature/bug fix),
I would like to request your advice on the best way to move forward
with code review.
There are 2 patch series available for download from:
http://sourceforge.net/projects/next3/files/Latest%20patch%20series
e2fsprogs-1.41.11-next3-020510_patches.tar.gz contains a "small and
simple" patch series (1350 lines/8 patches),
which adds Next3 snapshots awareness to e2fsprogs.
next3_snapshot-020510_patches.tar.gz contains a "large and complex"
patch series (7000 lines/38 patches),
which adds Next3 built-in snapshots support to an Ext3 clone.
In my opinion, accepting the Next3 patches to e2fsprogs makes sense
for 3 reasons:
1. It is "small and simple" (mostly harmless).
2. It will help finalize the Next3 on-disk format, so the sooner the better.
3. It will be of service for all those who would like to start
using/testing Next3 and still want to be up-to-date with latest
e2fsprogs.
Please let me know what you think.
Your blessing means a lot to me (and, I should hope, to potential reviewers).
And now for a few personal requests from the list:
- For those of you who didn't visit the wiki yet, please find the time
to visit, it is very friendly.
- For those of you who have read the wiki and found it interesting (or
not), please send me some comments to feed my ego (or not).
- For those of you who will be willing to pick up one of the code
review gloves above (small or large size), please let me know if you
need anything else from me and I will provide.
Thanks in advance,
Amir.
"Amir G." <[email protected]> writes:
>
> Since the snapshot support patches do not fall into standard patch
> categories (feature/bug fix),
What is it then?
> I would like to request your advice on the best way to move forward
> with code review.
The typical way to get code review is to post patches to the mailing
list.
However do it in smaller batches if you have a lot of them.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, May 03, 2010 at 12:47:58PM +0300, Amir G. wrote:
>
> Next3 project was introduced 2 weeks ago. I hope you've had a chance
> to visit the wiki page.
I took a quick look, but to be honest, I've been swamped lately, and
with the merge window close at hand, it was something I was going to
put off for another 2-3 weeks. I didn't realize you were in need of
some immediate comments so that you could finalize the on-disk format.
So the high bit on that front is that it looks like at least some of
the fields used, bit positions grabbed, etc., overlap with those used
by ext4. Ext4 is where new development takes place in the ext2/3/4
series. So enhancements such as Next3 will probably not be received
with great welcome into ext3. And as far as e2fsprogs (which handles
the ext2, ext3, and ext4 file systems) is concerned, I don't want to
deal with the complexity of certain fields that mean one thing for
ext4, and something else for Next3.
I'll try to carve out time to look at the Next3 patches in greater
detail this week.
Best regards,
- Ted
On Tue, May 4, 2010 at 9:55 PM, Andi Kleen wrote:
> "Amir G." writes:
>>
>> Since the snapshot support patches do not fall into standard patch
>> categories (feature/bug fix),
>
> What is it then?
It is in fact a rather big feature, that is, built-in snapshot support for Ext3.
However, since I did not expect such a big feature to be added to Ext3
in its current state of development,
I chose to introduce the feature as a new f/s which was branched from Ext3.
>
>> I would like to request your advice on the best way to move forward
>> with code review.
>
> The typical way to get code review is to post patches to the mailing
> list.
>
> However do it in smaller batches if you have a lot of them.
>
Very well, I will do that. I just figured there's not much sense in
reviewing the patches before
reading the basic design concepts in the wiki.
Thanks,
Amir.
On Wed, May 5, 2010 at 12:42 AM, <[email protected]> wrote:
> On Mon, May 03, 2010 at 12:47:58PM +0300, Amir G. wrote:
>>
>> Next3 project was introduced 2 weeks ago. I hope you've had a chance
>> to visit the wiki page.
>
> I took a quick look, but to be honest, I've been swamped lately, and
> with the merge window close at hand, it was something I was going to
> put off for another 2-3 weeks. I didn't realize you were in need of
> some immediate comments so that you could finalize the on-disk format.
>
Indeed, the on-disk format changes, that is, the first patch in each of
the e2fsprogs and Next3 patch series,
are the most urgent to review for the purpose of finalizing the Next3 on-disk format.
> So the high bit on that front is that it looks like at least some of
> the fields used, bit positions grabbed, etc., overlap with those used
> by ext4.
I would like to explain a few things regarding on-disk format changes:
1. Whenever it was possible, I tried to grab fields from the end of
the structs (i.e., super_block, s_features, i_flags) to stay as far
away as possible from ongoing changes in Ext4. If you find it better,
I could move these fields to a different location you assign them to.
2. I plead guilty as charged to grabbing i_flags & 0x00F00000 for
snapshot file non-persistent status flags, which overlaps with recent
Ext4 flags. However, I do not store these flags on-disk; they are
only used by lsattr -X, to display in-memory snapshot status along
with the on-disk snapshot status stored in i_flags & 0x1F000000.
3. I plead guilty as charged to grabbing l_i_reserved1 for the snapshot
on-disk list (i_next_snapshot), which overlaps with the Ext4 on-disk
i_version. However, since i_version can take any arbitrary value, this
doesn't "break" the Ext4 on-disk format.
The following wiki section lists the on disk format changes in detail:
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Code_documentation#Reserved_fields_and_bits_in_on-disk_structures
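To make the bit layout in points 1 and 2 concrete, here is a minimal Python sketch. Only the two mask values come from the message above; the constant and function names are hypothetical, and the real implementation is C code in the Next3 patches, which may differ.

```python
# Mask values quoted in the message above; the names are illustrative only.
SNAPSHOT_ONDISK_MASK = 0x1F000000  # persistent snapshot status kept in i_flags
SNAPSHOT_INMEM_MASK = 0x00F00000   # non-persistent status, shown by lsattr -X


def flags_for_disk(i_flags: int) -> int:
    """Drop the in-memory-only snapshot bits before an inode is written out."""
    return i_flags & ~SNAPSHOT_INMEM_MASK


def snapshot_status(i_flags: int) -> tuple[int, int]:
    """Split i_flags into (on-disk status bits, in-memory status bits)."""
    return i_flags & SNAPSHOT_ONDISK_MASK, i_flags & SNAPSHOT_INMEM_MASK
```

The two masks do not intersect, so stripping the in-memory bits never touches the persistent snapshot status.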
> Ext4 is where new development takes place in the ext2/3/4
> series. So enhancements such as Next3 will probably not be received
> with great welcome into ext3.
Yes, of course, I realize that. This is the reason I chose to
introduce Next3 as a new f/s,
which was branched from Ext3 and not as a new feature to Ext3.
Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task,
because extent mapped files break the design concepts of Next3 snapshots.
> And as far as e2fsprogs (which handles
> the ext2, ext3, and ext4 file systems) is concerned, I don't want to
> deal with the complexity of certain fields that mean one thing for
> ext4, and something else for Next3.
>
I was kind of expecting you to say that and I understand why that can
be a problem for you.
Let's try to address this issue in the code review and find
alternative solutions.
> I'll try to carve out time to look at the Next3 patches in greater
> detail this week.
>
That would be great.
Thanks,
Amir.
"Amir G." <[email protected]> writes:
>
> Yes, of course, I realize that. This is the reason I chose to
> introduce Next3 as a new f/s,
> which was branched from Ext3 and not as a new feature to Ext3.
> Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task,
> because extent mapped files break the design concepts of Next3 snapshots.
As I understand it the ext4 code base still supports not having
extents enabled in the super block (although I'm not sure how well
that variant is tested in practice)
So in theory you could have a feature that requires disabling extents.
It might not make users very happy though.
-Andi
--
[email protected] -- Speaking for myself only.
On Fri, May 7, 2010 at 5:12 PM, Andi Kleen wrote:
> "Amir G." writes:
>>
>> Yes, of course, I realize that. This is the reason I chose to
>> introduce Next3 as a new f/s,
>> which was branched from Ext3 and not as a new feature to Ext3.
>> Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task,
>> because extent mapped files break the design concepts of Next3 snapshots.
>
> As I understand it the ext4 code base still supports not having
> extents enabled in the super block (although I'm not sure how well
> that variant is tested in practice)
>
> So in theory you could have a feature that requires disabling extents.
>
> It might not make users very happy though.
>
In theory, it is possible to have 2 modes for Ext4 (extents or snapshots)
and some would argue that it makes sense to do that.
But I think that making that decision can be deferred to a later time,
after people have gained experience with Next3 and have decided if they
would like to have
the snapshot feature merged into Ext4 or not.
Besides, it would take me a considerable amount of time to merge the
snapshot feature into Ext4,
and Next3 is ready to be used now.
Amir.
On 05/07/2010 03:22 PM, Amir G. wrote:
> On Fri, May 7, 2010 at 5:12 PM, Andi Kleen wrote:
>
>> "Amir G." writes:
>>
>>> Yes, of course, I realize that. This is the reason I chose to
>>> introduce Next3 as a new f/s,
>>> which was branched from Ext3 and not as a new feature to Ext3.
>>> Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task,
>>> because extent mapped files break the design concepts of Next3 snapshots.
>>>
>> As I understand it the ext4 code base still supports not having
>> extents enabled in the super block (although I'm not sure how well
>> that variant is tested in practice)
>>
>> So in theory you could have a feature that requires disabling extents.
>>
>> It might not make users very happy though.
>>
>>
> In theory, it is possible to have 2 modes for Ext4 (extents or snapshots)
> and some would argue that it makes sense to do that.
> But I think that making that decision can be deferred to a later time,
> after people have gained experience with Next3 and have decided if they
> would like to have
> the snapshot feature merged into Ext4 or not.
>
> Besides, it would take me a considerable amount of time to merge the
> snapshot feature into Ext4,
> and Next3 is ready to be used now.
>
> Amir.
> --
>
I think that the counter argument would be that moving features into
ext3 is probably the wrong thing to do.
I don't think that anyone is in a huge hurry given that we have LVM
based snapshots with ext3 and btrfs snapshots around the corner.
Probably this is most interesting when done to the latest version of the
ext family.
Best regards,
Ric
On Fri, May 7, 2010 at 11:25 PM, Ric Wheeler wrote:
> On 05/07/2010 03:22 PM, Amir G. wrote:
>>
>> In theory, it is possible to have 2 modes for Ext4 (extents or snapshots)
>> and some would argue that it makes sense to do that.
>> But I think that making that decision can be deferred to a later time,
>> after people have gained experience with Next3 and have decided if they
>> would like to have
>> the snapshot feature merged into Ext4 or not.
>>
>> Besides, it would take me a considerable amount of time to merge the
>> snapshot feature into Ext4,
>> and Next3 is ready to be used now.
>>
>> Amir.
>> --
>>
>
> I think that the counter argument would be that moving features into ext3 is
> probably the wrong thing to do.
>
> I don't think that anyone is in a huge hurry given that we have LVM based
> snapshots with ext3 and btrfs snapshots around the corner. Probably this is
> most interesting when done to the latest version of the ext family.
>
This is a valid argument, but it is important for me to clarify a few
issues regarding the statements above:
1. No features are added to Ext3, so there is no concern for the
stability of Ext3.
The feature is added as a new f/s, with the slight overhead of
duplicate code in the
kernel tree and an extra loadable module in the system.
2. From the user's point of view, there is not much difference between
"mount -t next3"
and "mount -t ext4 -o snapshots", because in both cases it would not
be possible to
mount ext4 with extents support on that volume before discarding snapshots and
it will be possible to mount ext4 with extents support after
discarding snapshots.
3. Next3 snapshots are much more scalable, durable, and efficient than
LVM snapshots.
These are some of the benefits of built-in snapshots support.
4. I do not want to restart the discussion about when btrfs will be
production ready.
As for Next3 stability, I think that with the help of the community,
Next3 can be production ready within a matter of months,
because the Next3 code religiously attempts to retain the stability of
its ancestor Ext3.
I dare you to prove me wrong ;-)
Amir.
On May 8, 2010, at 1:43 AM, Amir G. wrote:
> 1. No features are added to Ext3, so there is no concern for the
> stability of Ext3.
> The feature is added as a new f/s, with the slight overhead of
> duplicate code in the
> kernel tree and an extra loadable module in the system.
This is where it's important to understand exactly what is meant by a ***file system***. Are you referring to the format, or the implementation? The way I've always treated it, and the way I believe most of the ext2/3/4 developers have treated it, is that what users call ext2, ext3, and ext4 are different _implementations_ of the same _file_ _system_ [format]. That is to say, ext4 simply happens to be a fuller, more complete implementation of the same file system as ext2 and ext3. Ext2 doesn't support certain features such as journaling and directory indexing; ext3 doesn't support some advanced features such as delayed allocation and extents, and requires that the journal always be present. Ext4 is a superset of ext2 plus ext3 plus delayed allocation, extents, a multi-block allocator, and a few other new features. But they are all the same file system.
Nor are they the only implementations of that file system. The BSD file systems have a compatible (although feature-restricted) implementation, which was independently implemented. So does the GNU HURD. And there are others. And note that all of these folks all use the same userspace utilities, e2fsprogs, for all of these various implementations: BSD, GNU HURD, and the Linux implementations of ext2, ext3, and ext4 all use the same set of tools: mke2fs, e2fsck, tune2fs, debugfs, and so on.
The same thing is true for NTFS. There are features in NTFS that you will find in Windows Vista that don't exist in Windows NT. But everybody treats them as the same file system, even though they have more advanced features in newer versions of the operating system.
The "ext" in ext2 stands for "extended", as in "the second extended file system" for Linux. It perhaps would be better if we had used the term "extensible", since that's the main thing about ext2/3/4 that has given it so much staying power. We've been able to add, in a very carefully backwards and forwards compatible way, new features to the file system format. This is why I object to Next3 using some fields that overlap with ext4. It means that e2fsprogs, which supports _one_ and _only_ _one_ file system format, will now need to support two file system formats. And that's not something I want to do.
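The mechanism that makes this kind of extension possible is the set of three feature bitmaps in the superblock (compat, ro_compat and incompat). A rough Python sketch of the mount-time decision follows; the class and function names are hypothetical and the bit values in the usage example are made up, so treat it as an illustration of the idea rather than the kernel's code.

```python
from typing import NamedTuple


class FeatureSets(NamedTuple):
    compat: int     # unknown bits here are harmless to an older implementation
    ro_compat: int  # unknown bits here allow read-only mounts only
    incompat: int   # unknown bits here mean the implementation must not mount


def mount_mode(supported: FeatureSets, on_disk: FeatureSets) -> str:
    """Decide how an implementation may mount a volume: "rw", "ro" or "refuse"."""
    if on_disk.incompat & ~supported.incompat:
        return "refuse"
    if on_disk.ro_compat & ~supported.ro_compat:
        return "ro"
    return "rw"
```

For example, with an implementation that only knows bit 0x1 in each set, `mount_mode(FeatureSets(1, 1, 1), FeatureSets(1, 3, 1))` returns "ro", because the volume carries a read-only-compatible feature the implementation does not understand.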
Put another way, it should be possible to add your "Next3" snapshots to ext4. Even if today no one has the time and energy to do the work, it is something that should be _theoretically_ possible. In another e-mail message, you've made the claim: "Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task, because extent mapped files break the design concepts of Next3 snapshots." But aside from stealing fields already assigned to various features supported by ext4, this isn't true! I don't see anything that is fundamentally incompatible between Next3 and extent-mapped files. (Unless you mean that the snapshot file might not be as efficiently stored using extent-mapped files, but [a] it's not clear the lack of efficiency will matter, since most files are contiguously stored, and there can be over 380 extents in an extent tree leaf block, and [b] we could always use an indirect block mapped file for the snapshot file --- ext4 is fully backwards compatible with ext2, so you can use an old-style direct/indirect block mapped file for the snapshot if you really wanted.)
-- Ted
On 05/08/2010 01:43 AM, Amir G. wrote:
> On Fri, May 7, 2010 at 11:25 PM, Ric Wheeler wrote:
>
>> On 05/07/2010 03:22 PM, Amir G. wrote:
>>
>>> In theory, it is possible to have 2 modes for Ext4 (extents or snapshots)
>>> and some would argue that it makes sense to do that.
>>> But I think that making that decision can be deferred to a later time,
>>> after people have gained experience with Next3 and have decided if they
>>> would like to have
>>> the snapshot feature merged into Ext4 or not.
>>>
>>> Besides, it would take me a considerable amount of time to merge the
>>> snapshot feature into Ext4,
>>> and Next3 is ready to be used now.
>>>
>>> Amir.
>>> --
>>>
>>>
>> I think that the counter argument would be that moving features into ext3 is
>> probably the wrong thing to do.
>>
>> I don't think that anyone is in a huge hurry given that we have LVM based
>> snapshots with ext3 and btrfs snapshots around the corner. Probably this is
>> most interesting when done to the latest version of the ext family.
>>
>>
> This is a valid argument, but it is important for me to clarify a few
> issues regarding the statements above:
>
> 1. No features are added to Ext3, so there is no concern for the
> stability of Ext3.
> The feature is added as a new f/s, with the slight overhead of
> duplicate code in the
> kernel tree and an extra loadable module in the system.
>
> 2. From the user's point of view, there is not much difference between
> "mount -t next3"
> and "mount -t ext4 -o snapshots", because in both cases it would not
> be possible to
> mount ext4 with extents support on that volume before discarding snapshots and
> it will be possible to mount ext4 with extents support after
> discarding snapshots.
>
> 3. Next3 snapshots are much more scalable, durable, and efficient than
> LVM snapshots.
> These are some of the benefits of built-in snapshots support.
>
> 4. I do not want to restart the discussion about when btrfs will be
> production ready.
> As for Next3 stability, I think that with the help of the community,
> Next3 can be production ready within a matter of months,
> because the Next3 code religiously attempts to retain the stability of
> its ancestor Ext3.
>
> I dare you to prove me wrong ;-)
>
> Amir.
>
As Ted mentioned in his reply, the big concern is that you are forking
ext3 instead of adding a new feature to the end of the ext* family of
file systems.
Since we have multiple snapshot mechanisms in place already (not just
btrfs & lvm, but don't forget all of the builtin array snapshots), I
think that we are not in a hurry to get this done quickly. I would
strongly prefer we get this rebased onto the latest ext4 and resubmitted.
As far as proof goes, I think that the unfortunate burden of proof is on
your shoulders to prove to us that we should take and maintain those new
features given the often conflicting priorities :-)
Thanks!
Ric
On Sat, May 8, 2010 at 1:48 PM, Theodore Tso wrote:
>
> On May 8, 2010, at 1:43 AM, Amir G. wrote:
>
>> 1. No features are added to Ext3, so there is no concern for the
>> stability of Ext3.
>> The feature is added as a new f/s, with the slight overhead of
>> duplicate code in the
>> kernel tree and an extra loadable module in the system.
>
> This is where it's important to understand exactly what is meant by a ***file system***. Are you referring to the format, or the implementation? The way I've always treated it, and it's the way I believe most of the ext234 developers have treated it is, that what users call ext2, ext3, and ext4 are different _implementations_ of the same _file_ _system_ [format]. That is to say, ext4 simply happens to be a fuller, more complete implementation of the same file system as ext2 and ext3. Ext2 doesn't support certain features such as journaling and directory indexing; ext3 doesn't support some advanced features such as delayed allocation and extents, and requires that the journal always be present. Ext4 is a superset of ext2 plus ext3 plus delayed allocation, extents, a multi-block allocator, and a few other new features. But they are all the same file system.
>
Next3 is another implementation of the extended f/s format.
Next3 is a superset of ext3 plus snapshots.
>
> The "ext" in ext2 stands for "extended", as in the "the second extended file system" for Linux. It perhaps would be better if we had used the term "extensible", since that's the main thing about ext2/3/4 that has given it so much staying power. We've been able to add, in very carefully backwards and forwards compatible way, new features to the file system format. This is why I object to why Next3 uses some fields that overlaps with ext4. It means that e2fsprogs, which supports _one_ and _only_ _one_ file system format, will now need to support two file system formats. And that's not something I want to do.
Next3 is backward and forward compatible with ext3.
The Next3 patch to e2fsprogs doesn't treat it as a different file system format.
All overlapping field issues can be resolved.
>
> Put another away, it should be possible to add your "Next3" snapshots to ext4. Even if today, no one has the time and energy to do the work, it is something that should be _theoretically_ possible.
It is _practically_ possible to support the snapshot features/fields
in e2fsprogs today
and to add the support for the same snapshot features/fields to Ext4 later.
> In another e-mail message, you've made the claim: "Unfortunately, merging Next3 snapshots feature into Ext4 is not an easy task, because extent mapped files break the design concepts of Next3 snapshots."
> But aside from stealing fields already assigned to various features supported by ext4, this isn't true! I don't see anything that fundamentally incompatible with Next3 and extent-mapped files. (Unless you mean that the snapshot file might not be as efficiently stored using extent-mapped files, but [a] it's not clear the lack of efficiency will matter, since most files are contiguously stored, and there can be over 380 extents in a extent tree leaf block, and [b] we could always use an indirect block mapped file for the snapshot file --- ext4 is fully backwards compatible with ext2, so you can use an old-style direct/indirect block mapped file for the snapshot if you really wanted.)
>
It makes me very happy that you've studied Next3 enough to be able to
make this almost correct observation.
I do plan to use indirect mapped snapshot files when I merge them to Ext4.
The only place where extent mapped files break the snapshots design is
move-on-write, when writing in-place to an extent mapped file.
Should the extent be broken into 2 extents + a single block, and the
block then moved to the snapshot?
Should the block be copied-on-write instead of moved-on-write, paying
the performance penalty?
There is an important design decision to make here.
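The first option can be sketched as follows. This is a minimal Python illustration, not Next3 code: the `Extent` tuple, the `move_on_write` function and the block numbers are all hypothetical, and a real extent tree has to deal with tree depth, journaling and allocation, which are ignored here.

```python
from typing import List, NamedTuple, Tuple


class Extent(NamedTuple):
    logical: int   # first logical block in the file
    physical: int  # first physical block on disk
    length: int    # number of contiguous blocks


def move_on_write(ext: Extent, logical_block: int,
                  new_physical: int) -> Tuple[List[Extent], int]:
    """Remap one block of `ext` to `new_physical`.

    Returns the file's replacement extents and the old physical block,
    which is left untouched so that the snapshot can take ownership of it.
    """
    offset = logical_block - ext.logical
    if not 0 <= offset < ext.length:
        raise ValueError("block is outside the extent")
    pieces = []
    if offset > 0:  # blocks before the written one keep their mapping
        pieces.append(Extent(ext.logical, ext.physical, offset))
    pieces.append(Extent(logical_block, new_physical, 1))
    tail = ext.length - offset - 1
    if tail > 0:    # blocks after the written one keep their mapping
        pieces.append(Extent(logical_block + 1, ext.physical + offset + 1, tail))
    return pieces, ext.physical + offset
```

Writing to block 3 of an 8-block extent yields three extents (3 blocks, 1 block, 4 blocks). With copy-on-write the extent would stay whole, at the cost of copying the old data into the snapshot before the write.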
Amir.
On Sat, May 08, 2010 at 06:07:40PM +0200, Amir G. wrote:
>
> Next3 is another implementation of the extended f/s format.
> Next3 is a superset of ext3 plus snapshots.
As long as Next3 uses fields which have already been assigned to ext4, this
is a claim that you can not make correctly. Because, you see, the
ext4 is also an implementation of the extended f/s format, and those
field assignments have already been made.
> All overlapping field issues can be resolved.
As long as you are willing to say that, then sure, let's work towards
that goal.
> It makes me very happy that you've studied Next3 enough to be able to
> make this almost correct observation.
> I do plan to use indirect mapped snapshot files when I merge them to Ext4.
> The only place that extent mapped files break the snapshots design is
> when doing move-on-write
> when writing in-place to extent mapped file.
> Should the extent be broken into 2 extents + single block and then
> move the block to snapshot?
> Should the block be copied-on-write instead of moved-on-write and pay
> the performance penalty?
If you do the "move-on-write" trick, you just have to split the extent
and do a COW of the extent tree and/or the inode. So for a single
block, the performance hit is the same, yes? But in the long-run, it's
probably more efficient to do "move-on-write".
> There is an important design decision to make here.
Technically speaking, it's possible to do it both ways, yes? I'm not
sure why you consider this such an important design decision. We can
even play games where for some files we might do copy-on-write, and
for some files, we do move-on-write. It's always possible to check
the COW bitmaps to decide what had happened.
In any case, if this is all you have to do, I'm not sure why you said
it was fundamentally impossible to support extents with the Next3
design.
Best regards,
- Ted
On Sat, May 8, 2010 at 7:25 PM, <[email protected]> wrote:
> On Sat, May 08, 2010 at 06:07:40PM +0200, Amir G. wrote:
>>
>> Next3 is another implementation of the extended f/s format.
>> Next3 is a superset of ext3 plus snapshots.
>
> As long as Next3 uses fields which have already been assigned to ext4, this
> is a claim that you can not make correctly. Because, you see, the
> ext4 is also an implementation of the extended f/s format, and those
> field assignments have already been made.
>
>> All overlapping field issues can be resolved.
>
> As long as you are willing to say that, then sure, let's work towards
> that goal.
>
Let me state my case then:
Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
You see, Next3 only tampers with i_version of snapshot files.
And by tamper I mean: set it to next snapshot inode number on snapshot take.
And snapshot files are not modifiable by users (only by the f/s itself).
So if the f/s decides to assign an arbitrary value to i_version of
snapshot files,
it doesn't break the extended f/s format, does it?
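As a rough illustration of the on-disk list, here is a Python sketch. The assumptions are mine: I model the list head as a single value (presumably it would live in the superblock), I use 0 as the terminator, and I assume a newly taken snapshot is linked in at the head. The class and method names are hypothetical.

```python
class SnapshotList:
    """Singly linked list of snapshot inodes, chained through i_next_snapshot."""

    def __init__(self):
        self.head = 0            # assumed list head; 0 means "no snapshots"
        self.next_snapshot = {}  # inode number -> value stored in i_next_snapshot

    def take(self, ino: int) -> None:
        """Link a newly taken snapshot inode in front of the existing list."""
        self.next_snapshot[ino] = self.head
        self.head = ino

    def walk(self):
        """Yield snapshot inode numbers, following the chain from the head."""
        ino = self.head
        while ino:
            yield ino
            ino = self.next_snapshot[ino]
```

Only snapshot inodes ever carry a link value, which is why regular files keep whatever meaning the field has for them.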
Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only,
some of which currently overlap flags recently assigned to Ext4 (you beat me to it).
There is a big waste in i_flag bits space, for example, the 4 bits
reserved for compression,
which are not in use by non-compressed files.
Snapshot files are never compressed, so I wouldn't mind reusing those
4 bits for snapshot flags.
Overloading auxiliary bits with different meanings depending on some
other bit does not make this a different f/s format.
It simply makes use of expensive space more efficiently.
>
> If you do the "move-on-write" trick, you just have to split the extent
> and do a COW of the extent tree and/or the inode. So for a single
> block, the performance hit is the same, yes? But in the long-run, it's
> probably more efficient to do "move-on-write".
>
All metadata is COWed, inside the JBD hooks, so the extent tree and
inode are taken care of.
It is the data blocks which are being moved-on-write for efficiency.
The problem with splitting the extent is that when an application does
a lot of in-place writes to an extent mapped file,
the file will eventually end up broken down into tiny extents or
single blocks, and that is a problem, right?
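A toy Python model of that fragmentation, under pessimistic assumptions that are mine rather than Next3's: the file starts as one contiguous extent, every first write to a block after a snapshot is moved-on-write, and the (hypothetical) allocator never hands out physically adjacent blocks. Function names and block numbers are illustrative only.

```python
def count_extents(block_map):
    """Count runs of consecutive logical blocks that are physically contiguous."""
    extents = 1 if block_map else 0
    for prev, cur in zip(block_map, block_map[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def extents_after_writes(file_blocks, written_blocks):
    """Return the extent count after in-place writes to the given logical blocks."""
    block_map = list(range(1000, 1000 + file_blocks))  # one contiguous extent
    next_free = 1_000_000  # worst-case allocator: new blocks are never adjacent
    moved = set()
    for blk in written_blocks:
        if blk in moved:
            continue  # block is already private to the live file; write in place
        block_map[blk] = next_free
        next_free += 2
        moved.add(blk)
    return count_extents(block_map)
```

In this model a single write to the middle of an 8-block file gives 3 extents, and writing every block once gives 8 single-block extents, which is the degenerate case described above.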
>> There is an important design decision to make here.
>
> Technically speaking, it's possible to do it both ways, yes? I'm not
> sure why you consider this such an important design decision. We can
> even play games where for some files we might do copy-on-write, and
> for some files, we do move-on-write. It's always possible to check
> the COW bitmaps to decide what had happened.
>
Definitely yes! I never thought it would really have to come down to a
"decision",
because there is a trade-off at hand.
Even in Next3, without extents, it makes sense to have a choice of
write performance vs. fragmentation per file.
The few applications that use random in-place write (db, virtual disk)
would probably want to avoid the fragmentation.
> In any case, if this is all you have to do, I'm not sure why you said
> it was fundamentally impossible to support extents with the Next3
> design.
>
Wait just a minute! I said "not an easy task" and "break the design
concepts", but I never said (as far as I recall) "fundamentally
impossible". Well, perhaps "breaking the design concepts" was too
harsh :-)
I quote from Next3 wiki FAQ:
"Can Next3 snapshot support be applied to Ext4?
Most of the snapshot code can work on Ext4 as is, but the
move-on-write technique used for regular files data blocks will
require additional work before it can be applied to extent mapped
files."
I would have to say that "considerable amount of time" is the main
obstacle for the merge task.
So my humble and biased suggestion is:
let's start working with Next3, get to know its strengths and weaknesses
and then design the nExt4 merge together.
Amir.
On Sat, May 8, 2010 at 2:51 PM, Ric Wheeler wrote:
> On 05/08/2010 01:43 AM, Amir G. wrote:
>
> As Ted mentioned in his reply, the big concern is that you are forking ext3
> instead of adding a new feature to the end of the ext* family of file
> systems.
>
That is a valid concern, but this is where Next3 stands today.
There is no intention of replacing Ext4 with Next4 as the leading ext* f/s.
The branch from Ext3 was made at the time to speed up the development
process of the snapshot feature.
> Since we have multiple snapshot mechanisms in place already (not just btrfs
> & lvm, but don't forget all of the builtin array snapshots), I think that we
> are not in a hurry to get this done quickly. I would strongly prefer we get
> this rebased onto the latest ext4 and resubmitted.
>
If I were you I would also prefer to get snapshots in ext4, and even
snapshots alongside extent mapped files,
but unfortunately, I cannot promise to deliver either anytime soon.
I can only promise my support to anyone who wishes to participate in
the merge task.
> As far as proof goes, I think that the unfortunate burden of proof is on
> your shoulders to prove to us that we should take and maintain those new
> features given the often conflicting priorities :-)
>
What can I say, a Windows file server can display previous file
versions without using a costly storage array.
Can a Red Hat server do that? Can LVM snapshots be used to do that?
The CTERA NAS appliances do that using Next3.
Amir.
On May 8, 2010, at 3:40 PM, Amir G. wrote:
>
> Let me state my case then:
You need to justify ALL new fields which are used by Next3, not just the ones which overlap ext4, since they are precious resources, not to be squandered for just one new file system feature/extension.
For example, if you have flags that are only used in-memory, they shouldn't be allocated out of i_flags, but should instead use the i_state field in the ext4_inode_info structure, referenced via EXT4_I(inode).i_state. That's what it is there for. (And i_flags is overloaded already, since the bit positions are used by the VFS layer as well, so that's something where we do need to be sensitive to how bit positions get used.)
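To make the distinction concrete, here is a minimal user-space sketch. The struct, function and flag names are hypothetical stand-ins, not the actual ext4 or Next3 definitions; the only point taken from the paragraph above is that dynamic state lives in a separate in-memory word and never touches the on-disk flag word.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define DEMO_FL_SNAPFILE     0x00100000u /* persistent: written to disk   */
#define DEMO_STATE_SNAP_OPEN 0x00000001u /* in-memory only: never on disk */

struct demo_disk_inode {
    uint32_t i_flags;            /* on-disk flag word */
};

struct demo_inode_info {
    struct demo_disk_inode raw;  /* the part that gets written back */
    uint32_t i_state;            /* dynamic state, lost on unmount */
};

/* Set an in-memory state bit without touching the on-disk flags. */
static void demo_set_state(struct demo_inode_info *ei, uint32_t bit)
{
    ei->i_state |= bit;
}

/* Test an in-memory state bit. */
static bool demo_test_state(const struct demo_inode_info *ei, uint32_t bit)
{
    return (ei->i_state & bit) != 0;
}
```

With this split, no on-disk bit position is consumed by state that does not need to survive an unmount.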
>
> Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
> You see, Next3 only tampers with i_version of snapshot files.
> And by tamper I mean: set it to next snapshot inode number on snapshot take.
What do you mean by "next snapshot number"? How is it used? Why do you need it? Given that all snapshot inodes must be stored out of a snapshot directory (why?) can't the snapshot directory name be used to establish some kind of ordering? Is ordering significant, or do you just need it to find them all? If it's just to find them all, why not just use a linked list which is stored in memory; does it really need to be in the on-disk structure at all?
> Next3 also uses 9 i_flags bits (0x1FF00000), in snapshot file inodes only,
> some currently overlapping flags recently assigned to Ext4 (you beat me to it).
> There is a big waste in i_flag bits space, for example, the 4 bits
> reserved for compression,
The compression people were amazingly profligate with their flags, yes. It's one of the reasons why I push back now, and ask people to justify *every* single bit assignment or field usage.
For example: i_snapshot_blocks_count. Is that really necessary? Why can it not be computed by looking at i_size of the snapshot inode? Or by checking to see if the superblock has been COW'ed? If it hasn't, then the s_blocks_count in the fs superblock must be what would have been in i_snapshot_blocks_count. If the sb has been COW'ed, s_blocks_count in the COW'ed sb must be that value. Why allocate and waste a full 32-bit field out of _every_ inode in the file system if it's possible to get that value via other places?
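A minimal sketch of that derivation, assuming a simplified superblock struct and a NULL pointer to mean "the superblock has not been COW'ed into the snapshot yet". Both the struct and the function name are hypothetical, not Next3 code.

```c
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-in for a superblock; only the field discussed here. */
struct demo_sb {
    uint32_t s_blocks_count;
};

/*
 * Return the block count the file system had when the snapshot was taken.
 * cowed_sb is NULL when the superblock has not been copied into the
 * snapshot; in that case the live value is still the snapshot-time value.
 */
static uint32_t demo_snapshot_blocks_count(const struct demo_sb *live_sb,
                                           const struct demo_sb *cowed_sb)
{
    return cowed_sb != NULL ? cowed_sb->s_blocks_count
                            : live_sb->s_blocks_count;
}
```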
I have a similar question about bg_cow_bitmap. Is that really necessary? The COW bitmap is just a copy of the base file system's block allocation bitmap, right? (The wiki documentation and the design PDF aren't completely clear on that point, but that seems to be what it is.) So why do you need to allocate a field out of the bg descriptor for it?
It's not clear why you need an exclude inode, if you are also storing the address of the exclude bitmap blocks in the bg descriptor. One or the other, but not both...
If s_snapshot_inum and s_snapshot_id refer to the "active" snapshot, why do they need to be in the on-disk structure? Why not just have the first item of the linked list whose head is s_last_snapshot be the "active" snapshot (if this needs to be in the on-disk state at all); wouldn't the active snapshot be the most recent one anyway? Also, as far as i_next_snapshot is concerned, why not just use i_dtime for the linked list? That's what we do with the orphaned inode list, so we have code that maintains a linked list using i_dtime already. So that way you don't need _any_ new inode fields.
I'm not convinced by all of the fs feature compatibility flags you've defined, either. In general, if you suspect one part of the file system, you need to check everything. So do you really need separate ro_compat bits for "fix_snapshot" and "fix_exclude"? And why do you need "IS_SNAPSHOT"?
As far as COMPAT_BIG_JOURNAL, we have feature flags in the journal, and that probably is better placed there. And I assume COMPAT_EXCLUDE_INODE is really "COMPAT_HAS_SNAPSHOTS"?
BTW, if you are free at 11:00 US/Eastern on Monday, maybe you could join the ext4 weekly conference call, and we could try to hash some of this out on the telephone call? I'm not sure where you are geographically based, so I don't know if this would be a hard time for you to make or not.
-- Ted
On Sun, May 9, 2010 at 5:25 AM, Theodore Tso wrote:
>
> For example, if you have flags that are only used in-memory, they shouldn't be allocated out of i_flags, but should instead use the i_state field in the ext4_inode_info structure, referenced via EXT4_I(inode).i_state. That's what it is there for.
>
I have 4 non-persistent flags. I will move them to i_state.
I've kept them in i_flags out of laziness, since I use lsattr -X
to read non-persistent snapshot flags along with persistent snapshot flags.
>>
>> Next3 uses 1 assigned field (i_version), but it does not "abuse" it.
>> You see, Next3 only tampers with i_version of snapshot files.
>> And by tamper I mean: set it to next snapshot inode number on snapshot take.
>
> What do you mean by "next snapshot number"? How is it used? Why do you need it? Given that all snapshot inodes must be stored out of a snapshot directory (why?) can't the snapshot directory name be used to establish some kind of ordering? Is ordering significant, or do you just need it to find them all? If it's just to find them all, why not just use a linked list which is stored in memory; does it really need to be in the on-disk structure at all?
>
i_version is used to chain the snapshot list on-disk, similar to the orphan list.
I used i_dtime in the past, but I was concerned that a bug would
result in cleanup of all snapshots,
so I started using i_version instead.
I can revert to using i_dtime (snapshot files are non-truncatable,
non-unlinkable) instead of i_version.
The snapshot file directory entry name is arbitrary and may be used by
a "snapshot management system" as it wishes,
to organize and display snapshots.
As far as the snapshot sub-system is concerned, the on-disk snapshot
list is the only reference to the snapshot files.
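As a rough illustration of this kind of chaining, here is a small user-space sketch of a singly linked list whose head lives in the superblock and whose links are inode numbers stored in a per-inode field. The struct layout, the fixed-size inode table and the newest-first ordering are assumptions made for the example, not the actual Next3 layout.

```c
#include <stdint.h>
#include <stddef.h>

#define DEMO_NINODES 16u

struct demo_list_inode {
    uint32_t i_next_snapshot;   /* inode number of the next entry, 0 = end */
};

struct demo_list_fs {
    uint32_t s_last_snapshot;   /* list head, 0 = empty list */
    struct demo_list_inode inodes[DEMO_NINODES];
};

/* Push a snapshot inode onto the head of the list. Returns 0 or -1. */
static int demo_snapshot_list_add(struct demo_list_fs *fs, uint32_t ino)
{
    if (ino == 0 || ino >= DEMO_NINODES)
        return -1;              /* inode 0 is reserved as the terminator */
    fs->inodes[ino].i_next_snapshot = fs->s_last_snapshot;
    fs->s_last_snapshot = ino;
    return 0;
}

/* Walk the chain from the head and count the entries. */
static size_t demo_snapshot_list_len(const struct demo_list_fs *fs)
{
    size_t n = 0;
    uint32_t ino;

    for (ino = fs->s_last_snapshot; ino != 0;
         ino = fs->inodes[ino].i_next_snapshot)
        n++;
    return n;
}
```

Because the links are plain inode numbers, every snapshot can be found from the superblock head alone, independent of any directory entry name.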
>
> For example: i_snapshot_blocks_count. Is that really necessary? Why can it not be computed by looking at i_size of the snapshot inode? Or by checking to see if the superblock has been COW'ed? If it hasn't, then the s_blocks_count in the fs superblock must be what would have been in i_snapshot_blocks_count. If the sb has been COW'ed, s_blocks_count in the COW'ed sb must be that value. Why allocate and waste a full 32-bit field out of _every_ inode in the file system if it's possible to get that value via other places?
>
Very well, I can read snapshot_blocks_count from the COWed superblock (it
is always COWed on snapshot take) and release i_snapshot_blocks_count.
> I have a similar question about bg_cow_bitmap. Is that really necessary? The COW bitmap is just a copy of the base file system's block allocation bitmap, right? (The wiki documentation and the design PDF aren't completely clear on that point, but that seems to be what it is.) So why do you need to allocate a field out of the bg descriptor for it?
>
> It's not clear why you need an exclude inode, if you are also storing the address of the exclude bitmap blocks in the bg descriptor. One or the other, but not both...
>
bg_cow_bitmap/bg_exclude_bitmap are used by Next3 as a non-persistent
cache for the addresses of the bitmap blocks,
which can be read from active_snapshot/exclude_inode.
In other words, instead of allocating a per-group in-memory structure, I
used the 2 unused fields in the in-memory group descriptor.
The only side effect for the ext* on-disk format is that those fields
are no longer 0 after mounting a volume with Next3.
Is that a problem? Can the CSUM feature resolve that problem?
In e2fsprogs, I only reference those fields for debugging purposes
(dumpe2fs displays them).
Also, create_exclude_inode and resize2fs set the bg_exclude_bitmap, but
they don't have to,
because Next3 re-reads all exclude_bitmap block addresses from the exclude
inode at mount time.
So please feel free to reject those field assignments. I can include
them in a separate debug-only patch.
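For comparison, the alternative of a separate per-group in-memory structure could look roughly like the sketch below. Everything here is hypothetical (struct, function, and the array that stands in for the block addresses read from the exclude inode); it only shows that the cache can be rebuilt at mount time without touching group descriptor fields.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-group in-memory record; not a Next3 or ext3 structure. */
struct demo_group_cache {
    uint32_t cow_bitmap_blk;      /* 0 means "not cached yet" */
    uint32_t exclude_bitmap_blk;
};

/*
 * Build the cache at mount time. exclude_blocks[] stands in for the block
 * addresses that would be read from the exclude inode, one per group.
 * Returns NULL on allocation failure; the caller frees the result.
 */
static struct demo_group_cache *
demo_build_group_cache(const uint32_t *exclude_blocks, size_t ngroups)
{
    struct demo_group_cache *cache = calloc(ngroups, sizeof(*cache));
    size_t g;

    if (cache == NULL)
        return NULL;
    for (g = 0; g < ngroups; g++)
        cache[g].exclude_bitmap_blk = exclude_blocks[g];
    return cache;
}
```

The cost is a small allocation per mounted volume; the benefit is that the on-disk descriptors stay untouched.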
> If s_snapshot_inum and s_snapshot_id refer to the "active" snapshot, why do they need to be in the on-disk structure? Why not just have the first item of the linked list whose head is s_last_snapshot be the "active" snapshot (if this needs to be in the on-disk state at all); wouldn't the active snapshot be the most recent one anyway?
Good question. Again, there used to be only 1 field, s_last_snapshot,
but I split it into 2 fields to recover from a crash in the middle
of snapshot take.
A half-taken snapshot is set as s_last_snapshot, but only a ready
snapshot is set as s_snapshot_inum.
Next3 will clean up a half-taken snapshot at mount time.
tune2fs -O ^has_snapshot will clean up (discard) all snapshot files,
including the half-taken snapshot.
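A rough sketch of how the two fields can drive recovery at mount time. It makes a simplifying assumption that is not stated above: a list head that differs from the active snapshot is treated as half-taken. The struct and function are hypothetical, not the Next3 implementation.

```c
#include <stdint.h>

#define DEMO_RECOV_NINODES 16u

struct demo_recov_fs {
    uint32_t s_last_snapshot;            /* head of the on-disk list     */
    uint32_t s_snapshot_inum;            /* last snapshot that was ready */
    uint32_t next[DEMO_RECOV_NINODES];   /* per-inode link, 0 = end      */
};

/*
 * Mount-time check: if the list head is not the active snapshot, assume
 * the crash happened in the middle of snapshot take and unlink the head.
 * Returns the inode number that was discarded, or 0 if nothing was done.
 */
static uint32_t demo_recover_on_mount(struct demo_recov_fs *fs)
{
    uint32_t head = fs->s_last_snapshot;

    if (head == 0 || head >= DEMO_RECOV_NINODES ||
        head == fs->s_snapshot_inum)
        return 0;
    fs->s_last_snapshot = fs->next[head];
    fs->next[head] = 0;
    return head;
}
```

With a single field there would be no way to tell a ready head from one that was linked in just before the crash.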
> Also, as far as i_next_snapshot is concerned, why not just use i_dtime for the linked list? That's what we do with the orphaned inode list, so we have code that maintains a linked list using i_dtime already. So that way you don't need _any_ new inode fields.
>
I don't know if you noticed, but I reused the code of
add/del_orphan_list() to manipulate the snapshot list...
And as I said a few lines back, I can revert to using i_dtime instead
of i_next_snapshot.
> I'm not convinced by all of the fs feature compatibility flags you've defined, either. In general, if you suspect one part of the file system, you need to check everything. So do you really need separate ro_compat bits for "fix_snapshot" and "fix_exclude"?
No, not really. These are informational only. I haven't even used
fix_snapshot yet.
> And why do you need "IS_SNAPSHOT"?
>
I "fix" the COWed superblock to make it look like ext2 superblock and
set the is_snapshot feature, so fsck would know it is checking a Next3
snapshot image (and not report wrong block counts).
> As far as COMPAT_BIG_JOURNAL, we have feature flags in the journal, and that probably is better placed there.
I will look into that.
> And I assume COMPAT_EXCLUDE_INODE is really "COMPAT_HAS_SNAPSHOTS"?
Logically, it means that the exclude inode/bitmap is allocated.
Currently, the only feature that uses the exclude bitmap is the
snapshot feature,
so I don't mind bundling them together.
But I do recommend running mke2fs -O exclude_inode if you intend to switch
from Ext3 to Next3 at some point.
It will guarantee that exclude bitmap blocks are allocated alongside their
corresponding block bitmap.
>
> BTW, if you are free at 11:00 US/Eastern on Monday, maybe you could join the ext4 weekly conference call, and we could try to hash some of this out on the telephone call? I'm not sure where you are geographically based, so I don't know if this would be a hard time for you to make or not.
>
I would be happy to join the weekly call. I am located in Israel. How
does this work exactly?
Should I call in? To which number? Do I need credentials?
Thanks,
Amir.
On Sun, May 9, 2010 at 1:56 PM, Amir G. wrote:
I have started making some changes to on-disk format based on the
points we seem to agree upon.
I would like to register only 1 ro_compat feature (has_snapshot) and 1
compat feature (exclude_inode).
the rest of the "informational features" I would like to move to
s_flags, including NEXT3_FLAGS_BIG_JOURNAL.
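For readers less familiar with the three ext* feature classes, the sketch below shows the general mount-time rule: unknown compat bits are ignored, unknown ro_compat bits force a read-only mount, and unknown incompat bits make the mount fail. The enum, the function and the bit value used for has_snapshot are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical bit value; the real assignment was still pending. */
#define DEMO_RO_COMPAT_HAS_SNAPSHOT 0x0080u

enum demo_mount_mode {
    DEMO_MOUNT_RW,       /* every feature is understood             */
    DEMO_MOUNT_RO_ONLY,  /* unknown ro_compat feature: no writes    */
    DEMO_MOUNT_REFUSE    /* unknown incompat feature: do not mount  */
};

/*
 * Decide how a driver may mount a volume, given the feature words from
 * the superblock and the sets of bits the driver understands. Compat
 * bits are not passed in because they never restrict mounting.
 */
static enum demo_mount_mode
demo_check_features(uint32_t sb_ro_compat, uint32_t sb_incompat,
                    uint32_t known_ro_compat, uint32_t known_incompat)
{
    if (sb_incompat & ~known_incompat)
        return DEMO_MOUNT_REFUSE;
    if (sb_ro_compat & ~known_ro_compat)
        return DEMO_MOUNT_RO_ONLY;
    return DEMO_MOUNT_RW;
}
```

Under this rule, registering has_snapshot as ro_compat means a driver that does not know about snapshots can still read the volume but cannot write to it.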
A Next3 big journal can be created with the option -J big or by
mkfs.next3/mkn3fs.
I will vacate all the fields that Next3 can do without (i.e.,
{s_snapshot_blocks_count,bg_{cow,exclude}_bitmap}).
I will move the non-persistent snapshot flags to i_state.
I would like to stick with i_version list chaining and not revert to
using i_dtime. Awaiting further discussion on that topic.
Awaiting permanent assignments for the rest of the fields/flags.
Per your request, I have added the information above to the Next3 wiki.
Please find the TODO items in red and WIP items in green (implemented
and not published):
http://sourceforge.net/apps/mediawiki/next3/index.php?title=Code_documentation#Reserved_fields_and_bits_in_on-disk_structures
Also, if you could please drop a line about your view of how to
progress with Next3
(something along the lines of what you said in the conference call), that
would be nice.
Some people may have gotten the impression that the fork from Ext3 is
a show stopper for you,
see: http://lwn.net/SubscriberLink/387231/1310b1360769c12b/
Thanks,
Amir.