
2008-07-02 17:22:02

Subject: [PATCH] Ext4 Documentation updates.

From: Jose R. Santos <[email protected]>

Ext4 Documentation updates.

Some of the information in Documentation/filesystems/ext4.txt is out
of date and in need of an update.

Signed-off-by: Jose R. Santos <[email protected]>
--

Documentation/filesystems/ext4.txt | 83 +++++++++++++++++++++---------------
1 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 0c5086d..b0c3bb2 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -15,23 +15,32 @@ Mailing list: [email protected]

- Grab updated e2fsprogs from
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/
- This is a patchset on top of e2fsprogs-1.39, which can be found at
- ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/

- - It's still mke2fs -j /dev/hda1
+ or grab the latest git repository from:
+ git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
+
+ - Create a new filesystem and set the "test_fs" extended option:

- - mount /dev/hda1 /wherever -t ext4dev
+ # mke2fs -j -E test_fs /dev/hda1

- - To enable extents,
+ Or set the test_fs flag on an existing ext3 filesystem:

- mount /dev/hda1 /wherever -t ext4dev -o extents
+ # debugfs -w /dev/sda5
+ debugfs 1.41-WIP (17-Jun-2008)
+ debugfs: set_super_value s_flags 4
+ debugfs: quit
+
+ - Mounting:
+
+ # mount /dev/hda1 /wherever -t ext4dev
+
+ - To disable extents:
+
+ # mount /dev/hda1 /wherever -t ext4dev -o noextents

- The filesystem is compatible with the ext3 driver until you add a file
which has extents (ie: `mount -o extents', then create a file).

- NOTE: The "extents" mount flag is temporary. It will soon go away and
- extents will be enabled by the "-o extents" flag to mke2fs or tune2fs
-
- When comparing performance with other filesystems, remember that
ext3/4 by default offers higher data integrity guarantees than most. So
when comparing with a metadata-only journalling filesystem, use `mount -o
@@ -44,41 +53,46 @@ Mailing list: [email protected]

2.1 Currently available

-* ability to use filesystems > 16TB
+* ability to use filesystems > 16TB (e2fsprogs support not available yet)
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
* internal redunancy in tree
+* improved file allocation (multi-block alloc, delayed alloc)
+* fix 32000 subdirectory limit
+* nsec timestamps for mtime, atime, ctime, create time
+* inode version field on disk (NFSv4, Lustre)
+* reduced e2fsck time via uninit_bg feature
+* journal checksumming for robustness, performance
+* persistent file preallocation (e.g for streaming media, databases)
+* ability to pack bitmaps and inode tables into larger virtual groups via the
+ flex_bg feature
+* large file support

-2.1 Previously available, soon to be enabled by default by "mkefs.ext4":
+2.2 Previously available, soon to be enabled by default by "mkefs.ext4":

* dir_index and resize inode will be on by default
* large inodes will be used by default for fast EAs, nsec timestamps, etc

-2.2 Candidate features for future inclusion
+2.3 Candidate features for future inclusion

-There are several under discussion, whether they all make it in is
-partly a function of how much time everyone has to work on them:
+* Online defrag (patches available but not well tested)
+* Inode allocation using large virtual block groups via flex_bg (patch
+ available; fragmentation issues due to prolong fs use still unknown)
+* reduced mke2fs time via uninit_bg feature (capability to do this is
+ available in e2fsprogs but a kernel thread to do lazy zeroing of unused
+ inode table blocks after filesystem is first mounted is required for
+ safety)

-* improved file allocation (multi-block alloc, delayed alloc; basically done)
-* fix 32000 subdirectory limit (patch exists, needs some e2fsck work)
-* nsec timestamps for mtime, atime, ctime, create time (patch exists,
- needs some e2fsck work)
-* inode version field on disk (NFSv4, Lustre; prototype exists)
-* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
-* journal checksumming for robustness, performance (prototype exists)
-* persistent file preallocation (e.g for streaming media, databases)
+There are several others under discussion, whether they all make it in is
+partly a function of how much time everyone has to work on them. Features like
+metadata checksumming have been discussed and planned for a bit but no patches
+exist yet so I'm not sure they're in the near-term roadmap.

-Features like metadata checksumming have been discussed and planned for
-a bit but no patches exist yet so I'm not sure they're in the near-term
-roadmap.
+The big performance win will come with mballoc, delalloc and flex_bg
+grouping of bitmaps and inode tables. Some test results available here:

-The big performance win will come with mballoc and delalloc. CFS has
-been using mballoc for a few years already with Lustre, and IBM + Bull
-did a lot of benchmarking on it. The reason it isn't in the first set of
-patches is partly a manageability issue, and partly because it doesn't
-directly affect the on-disk format (outside of much better allocation)
-so it isn't critical to get into the first round of changes. I believe
-Alex is working on a new set of patches right now.
+ - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
+ - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html

3. Options
==========
@@ -224,7 +238,7 @@ stripe=n Number of filesystem blocks that mballoc will try
disks * RAID chunk size in file system blocks.

Data Mode
----------
+=========
There are 3 different data modes:

* writeback mode
@@ -256,7 +270,8 @@ kernel source: <file:fs/ext4/>
<file:fs/jbd2/>

programs: http://e2fsprogs.sourceforge.net/
- http://ext2resize.sourceforge.net

useful links: http://fedoraproject.org/wiki/ext3-devel
http://www.bullopensource.org/ext4/
+ http://ext4.wiki.kernel.org/index.php/Main_Page
+ http://fedoraproject.org/wiki/Features/Ext4

2008-07-02 21:46:16

by Mingming Cao

Subject: Re: [PATCH] Ext4 Documentation updates.

On Wed, 2008-07-02 at 12:22 -0500, Jose R. Santos wrote:
> From: Jose R. Santos <[email protected]>
>
> Ext4 Documentation updates.
>
> Some of the information in Documentation/filesystems/ext4.txt is out
> of date and in need of an update.
>
> Signed-off-by: Jose R. Santos <[email protected]>
> --

Thanks, I added it to the ext4 patch queue before the new ordered mode
patches.

Here is another documentation update patch that adds documentation for
the new ordered mode and delayed allocation; it should go after the
delayed allocation patches. It also adds a few other updates.

---
Documentation/filesystems/ext4.txt | 37 +++++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 10 deletions(-)

Index: linux-2.6.26-rc8/Documentation/filesystems/ext4.txt
===================================================================
--- linux-2.6.26-rc8.orig/Documentation/filesystems/ext4.txt 2008-07-02 13:49:28.000000000 -0700
+++ linux-2.6.26-rc8/Documentation/filesystems/ext4.txt 2008-07-02 14:06:34.000000000 -0700
@@ -57,7 +57,7 @@ Mailing list: [email protected]
* extent format reduces metadata overhead (RAM, IO for access, transactions)
* extent format more robust in face of on-disk corruption due to magics,
* internal redunancy in tree
-* improved file allocation (multi-block alloc, delayed alloc)
+* improved file allocation (multi-block alloc)
* fix 32000 subdirectory limit
* nsec timestamps for mtime, atime, ctime, create time
* inode version field on disk (NFSv4, Lustre)
@@ -67,8 +67,15 @@ Mailing list: [email protected]
* ability to pack bitmaps and inode tables into larger virtual groups via the
flex_bg feature
* large file support
+* delayed allocation
+* large block (up to pagesize) support
+* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
+ the ordering)
+* Inode allocation using large virtual block groups via flex_bg (patch
+ available; fragmentation issues due to prolong fs use still unknown)
+

-2.2 Previously available, soon to be enabled by default by "mkefs.ext4":
+2.2 Previously available, now is enabled by default by "mkefs.ext4":

* dir_index and resize inode will be on by default
* large inodes will be used by default for fast EAs, nsec timestamps, etc
@@ -76,8 +83,6 @@ Mailing list: [email protected]
2.3 Candidate features for future inclusion

* Online defrag (patches available but not well tested)
-* Inode allocation using large virtual block groups via flex_bg (patch
- available; fragmentation issues due to prolong fs use still unknown)
* reduced mke2fs time via uninit_bg feature (capability to do this is
available in e2fsprogs but a kernel thread to do lazy zeroing of unused
inode table blocks after filesystem is first mounted is required for
@@ -236,7 +241,9 @@ stripe=n Number of filesystem blocks th
to use for allocation size and alignment. For RAID5/6
systems this should be the number of data
disks * RAID chunk size in file system blocks.
-
+delalloc (*) Deferring block allocation until write-out time.
+nodelalloc Disable delayed allocation. Blocks are allocation
+ when data is copied from user to page cache.
Data Mode
=========
There are 3 different data modes:
@@ -250,10 +257,19 @@ typically provide the best ext4 performa

* ordered mode
In data=ordered mode, ext4 only officially journals metadata, but it logically
-groups metadata and data blocks into a single unit called a transaction. When
-it's time to write the new metadata out to disk, the associated data blocks
-are written first. In general, this mode performs slightly slower than
-writeback but significantly faster than journal mode.
+groups metadata information related to data changes with the data blocks into a
+single unit called a transaction. When it's time to write the new metadata
+out to disk, the associated data blocks are written first. In general,
+this mode performs slightly slower than writeback but significantly faster than journal mode.
+
+In ext4/JBD2 this ordered mode implementation is different than ext3/JBD
+ordered mode. First it get rid of using buffer heads to enforce the ordering
+between metadata change with the related data chage. Instead, in the new
+ordering mode, it keeps track of per transaction journalled inode list, and
+flush all the dirty pages for those inodes, when committing that transaction.
+Second, the new ordered mode reverse the lock ordering of the page lock and
+transaction lock, to fixing the locking issue in the new mode, and also provide
+easy support for delayed allocation over the new ordered mode

* journal mode
data=journal mode provides full data and metadata journaling. All new data is
@@ -261,7 +277,8 @@ written to the journal first, and then t
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
-outperforms all others modes.
+outperforms all others modes. Right now this mode does not have
+delayed allocation support.

References
==========

2008-07-04 01:19:46

by Theodore Ts'o

Subject: Re: [PATCH] Ext4 Documentation updates.

On Wed, Jul 02, 2008 at 12:22:00PM -0500, Jose R. Santos wrote:
> From: Jose R. Santos <[email protected]>
>
> Ext4 Documentation updates.
>
> Some of the information in Documentation/filesystems/ext4.txt is out
> of date and in need of an update.
>
> Signed-off-by: Jose R. Santos <[email protected]>

Here are some changes I have added to further update/correct/clarify
the ext4.txt file:

- Ted

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 4424266..574c96a 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -13,40 +13,45 @@ Mailing list: [email protected]
1. Quick usage instructions:
===========================

- - Grab updated e2fsprogs from
+ - Compile and install the latest version of e2fsprogs (at least
+ 1.41-WIP-0617) from:
+
ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/

or grab the latest git repository from:
+
git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git

- - Create a new filesystem and set the "test_fs" extended option:
+ - Create a new filesystem using the ext4dev filesystem type:

- # mke2fs -j -E test_fs /dev/hda1
+ # mke2fs -t ext4dev /dev/hda1

- Or set the test_fs flag on an existing ext3 filesystem:
+ Or configure an existing ext3 filesystem to support extents and set
+ the test_fs flag to indicate that it's ok for an in-development
+ filesystem to touch this filesystem:

- # debugfs -w /dev/sda5
- debugfs 1.41-WIP (17-Jun-2008)
- debugfs: set_super_value s_flags 4
- debugfs: quit
+ # tune2fs -O extents -E test_fs /dev/hda1

- - Mounting:
+ If the filesystem was created with 128 byte inodes, it can be
+ converted to use 256 byte for greater efficiency via:

- # mount /dev/hda1 /wherever -t ext4dev
+ # tune2fs -I 256 /dev/hda1

- - To disable extents:
+ (Note: we currently do not have tools to convert an ext4dev
+ filesystem back to ext3; so please do not do try this on production
+ filesystems.)

- # mount /dev/hda1 /wherever -t ext4dev -o noextents
+ - Mounting:

- - The filesystem is compatible with the ext3 driver until you add a file
- which has extents (ie: `mount -o extents', then create a file).
+ # mount -t ext4dev /dev/hda1 /wherever

- When comparing performance with other filesystems, remember that
- ext3/4 by default offers higher data integrity guarantees than most. So
- when comparing with a metadata-only journalling filesystem, use `mount -o
- data=writeback'. And you might as well use `mount -o nobh' too along
- with it. Making the journal larger than the mke2fs default often helps
- performance with metadata-intensive workloads.
+ ext3/4 by default offers higher data integrity guarantees than most.
+ So when comparing with a metadata-only journalling filesystem, such
+ as ext3, use `mount -o data=writeback'. And you might as well use
+ `mount -o nobh' too along with it. Making the journal larger than
+ the mke2fs default often helps performance with metadata-intensive
+ workloads.

2. Features
===========
@@ -67,17 +72,13 @@ Mailing list: [email protected]
* ability to pack bitmaps and inode tables into larger virtual groups via the
flex_bg feature
* large file support
-
-2.2 Previously available, soon to be enabled by default by "mkefs.ext4":

2008-07-04 01:31:04

by Theodore Ts'o

Subject: Re: [PATCH] Ext4 Documentation updates.

On Wed, Jul 02, 2008 at 02:45:55PM -0700, Mingming Cao wrote:
> +In ext4/JBD2 this ordered mode implementation is different than ext3/JBD
> +ordered mode. First it get rid of using buffer heads to enforce the ordering
> +between metadata change with the related data chage. Instead, in the new
> +ordering mode, it keeps track of per transaction journalled inode list, and
> +flush all the dirty pages for those inodes, when committing that transaction.
> +Second, the new ordered mode reverse the lock ordering of the page lock and
> +transaction lock, to fixing the locking issue in the new mode, and also provide
> +easy support for delayed allocation over the new ordered mode

This is implementation detail that doesn't belong in
Documentation/filesystems/ext4.txt; a user won't care about this kind
of detail.

However, it is *perfect* for the (as-yet-undocumented) patch
comment for the new ordered mode patch in the series. I rewrote it
for grammatical correctness and clarity thusly, for the patch
delalloc-new-ordered-mode.patch:

This provides a new ordered mode implementation which gets rid of using
buffer heads to enforce the ordering between a metadata change and the
related data change. Instead, in the new ordering mode, it keeps track
of all of the inodes touched by each transaction on a list, and when
that transaction is committed, it flushes all of the dirty pages for
those inodes. In addition, the new ordered mode reverses the lock
ordering of the page lock and transaction lock, which provides easier
support for delayed allocation.

- Ted
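
For readers following the thread, the commit sequence described in the patch comment above can be illustrated with a toy model. This is only a sketch in Python, not kernel code: the names Inode, Transaction, add_update, commit and disk_log are all hypothetical and do not correspond to anything in JBD2 or ext4. It shows just the one property the description relies on, namely that each transaction keeps a list of the inodes it touched and that all dirty data pages of those inodes are written out before the transaction's metadata.

```python
# Toy model of the ordered-mode commit sequence described above.
# Every name here is hypothetical; this is not how JBD2 is implemented.


class Inode:
    def __init__(self, number):
        self.number = number
        self.dirty_pages = []  # data still sitting in the "page cache"

    def write(self, data):
        self.dirty_pages.append(data)


class Transaction:
    def __init__(self, disk_log):
        self.disk_log = disk_log  # ordered record of what reached "disk"
        self.inodes = []          # inodes touched by this transaction
        self.metadata = []        # pending metadata changes

    def add_update(self, inode, metadata_change):
        # Track each inode once, no matter how often it is touched.
        if inode not in self.inodes:
            self.inodes.append(inode)
        self.metadata.append(metadata_change)

    def commit(self):
        # Step 1: flush all dirty data pages of every tracked inode.
        for inode in self.inodes:
            for page in inode.dirty_pages:
                self.disk_log.append(("data", inode.number, page))
            inode.dirty_pages.clear()
        # Step 2: only afterwards write the metadata changes.
        for change in self.metadata:
            self.disk_log.append(("metadata", change))
        self.metadata.clear()
        self.inodes.clear()


def demo():
    log = []
    txn = Transaction(log)
    a, b = Inode(11), Inode(12)
    a.write("hello")
    txn.add_update(a, "extend inode 11")
    b.write("world")
    txn.add_update(b, "extend inode 12")
    txn.commit()
    return log
```

Calling demo() returns the log in write order, so the data entries for both inodes appear ahead of any metadata entry. The model leaves out locking, the journal itself and delayed allocation entirely.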