This patch series adds support for fast commits, a simplified version
of the scheme proposed by Park and Shin in their paper
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. JBD2 falls back to a traditional (full) commit only if the
file system cannot perform such a fast commit.

Because JBD2 operates at block granularity, for every file system
metadata update it writes all the changed blocks to the journal at
commit time. This is inefficient because updates to some of the
blocks that JBD2 commits are derivable from other blocks. For
example, if a new extent is added to an inode, then the corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived from just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase the performance of file system commits significantly.
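
To make the "derivable blocks" argument concrete, here is a minimal
user-space sketch of how a block bitmap and a free block counter could be
rebuilt from nothing but an extent record. The names (toy_group,
toy_extent, toy_replay_add_extent) and the 64-block group size are
assumptions made for illustration; this is not the ext4 implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified in-memory model of one block group. */
#define BLOCKS_PER_GROUP 64

struct toy_group {
	uint8_t  bitmap[BLOCKS_PER_GROUP / 8]; /* one bit per block, 1 = in use */
	uint32_t free_blocks;                  /* counter kept in the group descriptor */
};

struct toy_extent {
	uint32_t start; /* first physical block, relative to the group */
	uint32_t len;   /* number of blocks in the extent */
};

/*
 * Replay one "add extent" record: set the bitmap bits covered by the
 * extent and adjust the free block counter. A bit that is already set is
 * skipped, so replaying the same record twice leaves the state unchanged.
 */
void toy_replay_add_extent(struct toy_group *g, const struct toy_extent *ex)
{
	assert(ex->start + ex->len <= BLOCKS_PER_GROUP);

	for (uint32_t b = ex->start; b < ex->start + ex->len; b++) {
		uint8_t mask = (uint8_t)(1u << (b % 8));

		if (g->bitmap[b / 8] & mask)
			continue; /* already accounted for */
		g->bitmap[b / 8] |= mask;
		g->free_blocks--;
	}
}
```

Only the extent has to be logged in this model; the bitmap and the counter
are recomputed at replay time, which is why a fast commit can be much
smaller than the set of blocks a full commit would write.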

Fast commits introduced in this patch series have two main contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits.
(2) Adding support in ext4 to use JBD2's new interfaces and implement
    fast commits.

Ext4 supports two modes of fast commits: (1) fast commits with hard
consistency guarantees and (2) fast commits with soft consistency
guarantees.

When hard consistency is enabled, fast commit guarantees that all the
updates will be committed. After a successful replay of fast commit
blocks in hard consistency mode, the entire file system would be in
the same state as it was when fsync() returned before the crash. This
guarantee is similar to what jbd2 gives with full commits.

With soft consistency, the file system only guarantees consistency for
the inode in question. In this mode, the file system tries to write as
little data to the backend as possible at commit time. To be precise,
the file system records all the data updates for the inode in question
and the directory updates that are required for guaranteeing
consistency of the inode in question.
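
The difference between the two modes can be pictured with a deliberately
simplified filter over pending updates. Everything here (toy_mode,
toy_update, toy_updates_to_commit) is hypothetical, and it assumes that
"updates required for the inode in question" are just the updates tagged
with that inode number; the dependency tracking in the patches is more
involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_mode { TOY_HARD, TOY_SOFT };

struct toy_update {
	uint32_t ino; /* inode this pending update belongs to */
};

/*
 * Count how many pending updates a fast commit triggered by an fsync() on
 * target_ino would write. Hard mode writes everything that is pending;
 * soft mode (in this simplified model) writes only the updates that
 * belong to the inode being synced.
 */
size_t toy_updates_to_commit(const struct toy_update *pending, size_t n,
			     uint32_t target_ino, enum toy_mode mode)
{
	size_t count = 0;

	assert(pending || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (mode == TOY_HARD || pending[i].ino == target_ino)
			count++;
	}
	return count;
}
```

With pending updates for inodes 12, 13, 12 and 14, an fsync() on inode 12
would write four updates in hard mode and two in soft mode under this
model.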

In our evaluations, fast commits with hard consistency performed
better than fast commits with soft consistency. That's because with
hard consistency, a fast commit often ends up committing other inodes
together, while with soft consistency commits get serialized. Future
work can look at creating a hybrid approach between the two extremes
implemented in this patchset.

Testing
-------

e2fsprogs was updated to set the fast commit feature flag and to ignore
fast commit blocks during e2fsck:

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, the following runs of
xfstests were performed:

- kvm-xfstests.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful and the smoke tests didn't introduce
any additional failures.

Performance Evaluation
----------------------

Ext4 file system performance was tested with full commits, with fast
commits with soft consistency and with fast commits with hard
consistency. The fs_mark benchmark showed a performance improvement of
up to 50%, depending on the file size. Soft fast commits performed
slightly worse than hard fast commits, but they ended up writing
slightly fewer blocks to disk.

Changes since V3:

- Removed invocation of fast commits from the jbd2 thread.
- Removed sub transaction ID from journal_t.
- Added rename, truncate, punch hole support.
- Added soft consistency mode and hard consistency mode.
- More bug fixes and refactoring.
- Added better debugging support: more tracepoints and debug mount
  options.

Harshad Shirwadkar(20):
ext4: add debug mount option to test fast commit replay
ext4: add fast commit replay path
ext4: disable certain features in replay path
ext4: add idempotent helpers to manipulate bitmaps
ext4: fast commit recovery path preparation
jbd2: add fast commit recovery path support
ext4: main commit routine for fast commits
jbd2: add new APIs for commit path of fast commits
ext4: add fast commit on-disk format structs and helpers
ext4: add fast commit track points
ext4: break ext4_unlink() and ext4_link()
ext4: add inode tracking and ineligible marking routines
ext4: add directory entry tracking routines
ext4: add generic diff tracking routines and range tracking
jbd2: fast commit main commit path changes
jbd2: disable fast commits if journal is empty
jbd2: add fast commit block tracker variables
ext4, jbd2: add fast commit initialization routines
ext4: add handling for extended mount options
ext4: update docs for fast commit feature
Documentation/filesystems/ext4/journal.rst | 127 ++-
Documentation/filesystems/journalling.rst | 18 +
fs/ext4/acl.c | 1 +
fs/ext4/balloc.c | 10 +-
fs/ext4/ext4.h | 127 +++
fs/ext4/ext4_jbd2.c | 1484 +++++++++++++++++++++++++++-
fs/ext4/ext4_jbd2.h | 71 ++
fs/ext4/extents.c | 5 +
fs/ext4/extents_status.c | 24 +
fs/ext4/fsync.c | 2 +-
fs/ext4/ialloc.c | 165 +++-
fs/ext4/inline.c | 3 +
fs/ext4/inode.c | 77 +-
fs/ext4/ioctl.c | 9 +-
fs/ext4/mballoc.c | 157 ++-
fs/ext4/mballoc.h | 2 +
fs/ext4/migrate.c | 1 +
fs/ext4/namei.c | 172 ++--
fs/ext4/super.c | 72 +-
fs/ext4/xattr.c | 6 +
fs/jbd2/commit.c | 61 ++
fs/jbd2/journal.c | 217 +++-
fs/jbd2/recovery.c | 67 +-
include/linux/jbd2.h | 83 +-
include/trace/events/ext4.h | 208 +++-
25 files changed, 3037 insertions(+), 132 deletions(-)
---
Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
Documentation/filesystems/journalling.rst | 18 +++
2 files changed, 139 insertions(+), 6 deletions(-)
diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..f94e66f2f8c4 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.
The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk.
All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.
@@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.
+Fast Commits
+~~~~~~~~~~~~
+
+Ext4 also implements fast commits and integrates them with JBD2 journalling.
+Fast commits store metadata changes made to the file system as inode-level
+diffs. In other words, each fast commit block identifies updates made to
+a particular inode, and collectively they represent the total changes made to
+the file system.
+
+A fast commit is valid only if there is no full commit after that particular
+fast commit. Because of this feature, fast commit blocks can be reused by
+the following transactions.
+
+Each fast commit block stores updates to one particular inode. Each update in
+a fast commit block is one of two types:
+- Data updates (add range / delete range)
+- Directory entry updates (Add / remove links)
+
+Fast commit blocks must be replayed in the order in which they appear on disk.
+That's because directory entry updates are written to fast commit blocks
+in the order in which they were applied to the file system before the crash.
+Changing the replay order of directory entry updates may result in an
+inconsistent file system. Note that only directory entry updates need
+ordering; data updates apply to only one inode and so do not require
+ordered replay. Also, fast commits guarantee that the file system is in a
+consistent state after the replay of each fast commit block, as long as the
+replay order has been followed.
+
+Note that directory inode updates are never directly recorded in fast commits.
+Just like other file system level metadata, updates to directories are always
+implied based on directory entry updates stored in fast commit blocks.
+
+Based on which directory entry updates are committed with an inode, fast
+commits have two modes of operation:
+
+- Hard Consistency (default)
+- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
+
+When hard consistency is enabled, fast commit guarantees that all the updates
+will be committed. After a successful replay of fast commit blocks
+in hard consistency mode, the entire file system would be in the same state as
+it was when fsync() returned before the crash. This guarantee is similar to
+what jbd2 gives with full commits.
+
+With soft consistency, the file system only guarantees consistency for the
+inode in question. In this mode, the file system tries to write as little data
+to the backend as possible at commit time. To be precise, the file system
+records all the data updates for the inode in question and directory updates
+that are required for guaranteeing consistency of the inode in question.
+
Layout
~~~~~~
Generally speaking, the journal has this format:
.. list-table::
- :widths: 16 48 16
+ :widths: 16 48 16 18
:header-rows: 1
* - Superblock
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
- One transaction
-
+ -
Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
@@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
superblock.
.. list-table::
- :widths: 12 12 12 32 12
+ :widths: 12 12 12 32 12 12
:header-rows: 1
* - 1024 bytes of padding
@@ -85,11 +137,13 @@ superblock.
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
-
-
- One transaction
-
+ -
Block Header
~~~~~~~~~~~~
@@ -609,3 +663,64 @@ bytes long (but uses a full block):
- h\_commit\_nsec
- Nanoseconds component of the above timestamp.
+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - journal\_header\_s
+ - (open coded)
+ - Common block header.
+ * - 0xC
+ - \_\_le32
+ - fc\_magic
+ - Magic value which should be set to 0xE2540090. This identifies
+ that this block is a fast commit block.
+ * - 0x10
+ - \_\_u8
+ - fc\_features
+ - Features used by this fast commit block.
+ * - 0x11
+ - \_\_le16
+ - fc\_num\_tlvs
+ - Number of TLVs contained in this fast commit block
+ * - 0x13
+ - \_\_le32
+ - \_\_fc\_len
+ - Length of the fast commit block in terms of number of blocks
+ * - 0x17
+ - \_\_le32
+ - fc\_ino
+ - Inode number of the inode that will be recovered using this fast commit
+ * - 0x2B
+ - struct ext4\_inode
+ - inode
+ - On-disk copy of the inode at the commit time
+ * - <Variable based on inode size>
+ - struct ext4\_fc\_tl
+ - Array of struct ext4\_fc\_tl
+ - The actual delta with the last commit. Starting at this offset,
+ there is an array of TLVs that indicates which extents
+ should be present in the corresponding inode. Currently, the
+ following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
+ should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
+ that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
+ (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
+ (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
+ (dentry for a file that is being created for the first time).
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..1cb116ab27ab 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,24 @@ called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.
+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. Fast commits are
+asynchronous in nature, i.e. file systems can call their own commit
+functions at any time. In order to avoid racing with the kjournald
+thread and other fast commits that may be happening in
+parallel, file systems should first call
+:c:func:`jbd2_start_async_fc()`. File systems can call
+:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
+commits. Once a fast commit is completed, the file system should call
+:c:func:`jbd2_stop_async_fc()` to unblock other
+committers and the kjournald thread. After performing either a fast
+or a full commit, JBD2 calls ``journal->j_fc_cleanup_callback`` to allow
+file systems to clean up their internal fast commit
+related data structures. At replay time, JBD2 passes each
+fast commit block to the file system via
+``journal->j_fc_replay_callback``. Ext4 uses this fast commit
+mechanism to improve journal commit performance.
+
JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
--
2.24.1.735.g03f4e72817-goog
Add fc_do_one_pass() to invoke the file system specific replay
callback, passing it each fast commit block discovered in the journal
so that the file system can replay it.
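
As a rough user-space model of the loop this patch adds, the sketch below
walks a fast commit area, stops at the first block whose magic or sequence
number does not match, and hands every valid block to a callback together
with its offset in the area. The types and names (toy_block,
toy_replay_cb, toy_fc_replay) are illustrative assumptions; the kernel
code reads buffer heads via jread() and converts the big-endian header
fields, which this model skips.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAGIC 0xC03B3998u /* same value as JBD2_MAGIC_NUMBER */

/* Hypothetical, already-decoded view of a fast commit block header. */
struct toy_block {
	uint32_t magic;
	uint32_t seq;
};

typedef int (*toy_replay_cb)(const struct toy_block *blk, int off, void *arg);

/*
 * Walk the fast commit area in on-disk order. A magic or sequence
 * mismatch marks the end of the valid fast commits and is not an error.
 * Returns 0 on success or the first non-zero value returned by cb.
 */
int toy_fc_replay(const struct toy_block *area, size_t nblocks,
		  uint32_t expected_seq, toy_replay_cb cb, void *arg)
{
	assert(area || nblocks == 0);

	for (size_t i = 0; i < nblocks; i++) {
		if (area[i].magic != TOY_MAGIC || area[i].seq != expected_seq)
			break;
		if (cb) {
			int err = cb(&area[i], (int)i, arg);

			if (err)
				return err;
		}
	}
	return 0;
}

/* Example callback: count the blocks that were handed over. */
int toy_count_cb(const struct toy_block *blk, int off, void *arg)
{
	(void)blk;
	(void)off;
	++*(int *)arg;
	return 0;
}
```

Stopping at the first mismatch reflects the rule that fast commit blocks
are only meaningful for the transaction they were appended to; stale
blocks left over from an older transaction carry a different sequence
number and are never passed to the file system.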
Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/jbd2/recovery.c | 67 +++++++++++++++++++++++++++++++++++++++++---
include/linux/jbd2.h | 13 +++++++++
2 files changed, 76 insertions(+), 4 deletions(-)
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..09f069e59c36 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -35,7 +35,6 @@ struct recovery_info
int nr_revoke_hits;
};
-enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass);
static int scan_revoke_records(journal_t *, struct buffer_head *,
@@ -225,10 +224,63 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
/* Make sure we wrap around the log correctly! */
#define wrap(journal, var) \
do { \
- if (var >= (journal)->j_last) \
- var -= ((journal)->j_last - (journal)->j_first); \
+ unsigned long _wrap_last = \
+ jbd2_has_feature_fast_commit(journal) ? \
+ (journal)->j_last_fc : (journal)->j_last; \
+ \
+ if (var >= _wrap_last) \
+ var -= (_wrap_last - (journal)->j_first); \
} while (0)
+static int fc_do_one_pass(journal_t *journal,
+ struct recovery_info *info, enum passtype pass)
+{
+ unsigned int expected_commit_id = info->end_transaction;
+ unsigned long next_fc_block;
+ struct buffer_head *bh;
+ unsigned int seq;
+ journal_header_t *jhdr;
+ int err = 0;
+
+ next_fc_block = journal->j_first_fc;
+
+ while (next_fc_block <= journal->j_last_fc) {
+ jbd_debug(3, "Fast commit replay: next block %lu\n",
+ next_fc_block);
+ err = jread(&bh, journal, next_fc_block);
+ if (err) {
+ jbd_debug(3, "Fast commit replay: read error\n");
+ break;
+ }
+
+ jhdr = (journal_header_t *)bh->b_data;
+ seq = be32_to_cpu(jhdr->h_sequence);
+ if (be32_to_cpu(jhdr->h_magic) != JBD2_MAGIC_NUMBER ||
+ seq != expected_commit_id) {
+ jbd_debug(3, "Fast commit replay: magic / commitid error [%d / %d / %d]\n",
+ be32_to_cpu(jhdr->h_magic), seq,
+ expected_commit_id);
+ break;
+ }
+ jbd_debug(3, "Processing fast commit blk with seq %u\n",
+ seq);
+ if (journal->j_fc_replay_callback) {
+ err = journal->j_fc_replay_callback(
+ journal, bh, pass,
+ next_fc_block -
+ journal->j_first_fc);
+ if (err)
+ break;
+ }
+ next_fc_block++;
+ }
+
+ if (err)
+ jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
+
+ return err;
+}
+
/**
* jbd2_journal_recover - recovers a on-disk journal
* @journal: the journal to recover
@@ -470,7 +522,7 @@ static int do_one_pass(journal_t *journal,
break;
jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
- next_commit_ID, next_log_block, journal->j_last);
+ next_commit_ID, next_log_block, journal->j_last_fc);
/* Skip over each chunk of the transaction looking
* either the next descriptor block or the final commit
@@ -768,6 +820,9 @@ static int do_one_pass(journal_t *journal,
if (err)
goto failed;
continue;
+ case JBD2_FC_BLOCK:
+ pr_warn("Unexpectedly found fast commit block.\n");
+ continue;
default:
jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
@@ -799,6 +854,10 @@ static int do_one_pass(journal_t *journal,
success = -EIO;
}
}
+
+ if (jbd2_has_feature_fast_commit(journal) && pass != PASS_REVOKE)
+ success = fc_do_one_pass(journal, info, pass);
+
if (block_error && success == 0)
success = -EIO;
return success;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d450dcb93e51..0b49c8ff0563 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -757,6 +757,8 @@ jbd2_time_diff(unsigned long start, unsigned long end)
#define JBD2_NR_BATCH 64
+enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
+
/**
* struct journal_s - The journal_s type is the concrete type associated with
* journal_t.
@@ -1220,6 +1222,17 @@ struct journal_s
* after every commit operation.
*/
void (*j_fc_cleanup_callback)(struct journal_s *journal);
+
+ /*
+ * @j_fc_replay_callback:
+ *
+ * File-system specific function that performs replay of a fast
+ * commit. JBD2 calls this function for each fast commit block found in
+ * the journal.
+ */
+ int (*j_fc_replay_callback)(struct journal_s *journal,
+ struct buffer_head *bh,
+ enum passtype pass, int off);
};
#define jbd2_might_wait_for_commit(j) \
--
2.24.1.735.g03f4e72817-goog
Prepare for the ext4 fast commit recovery path changes. Make a few
existing functions visible outside their files. Split
ext4_get_inode_loc() and add a wrapper, ext4_get_fc_inode_loc(), to
allow reading an inode from disk without a corresponding VFS inode.
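
The reason an inode number alone is enough can be seen from the location
arithmetic, sketched here in user space. The struct and function names
(toy_geometry, toy_iloc, toy_locate_inode) are assumptions for
illustration; the kernel function additionally validates the inode number
against the superblock and adds the start of the group's inode table,
both of which are left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the superblock geometry. */
struct toy_geometry {
	uint32_t inodes_per_group;
	uint32_t inode_size; /* bytes */
	uint32_t block_size; /* bytes */
};

struct toy_iloc {
	uint32_t group;          /* block group that holds the inode */
	uint32_t block_in_table; /* block index inside that group's inode table */
	uint32_t offset;         /* byte offset of the inode inside that block */
};

/* Inode numbers start at 1, hence the (ino - 1) in every step. */
struct toy_iloc toy_locate_inode(const struct toy_geometry *geo, uint32_t ino)
{
	uint32_t inodes_per_block = geo->block_size / geo->inode_size;
	uint32_t index;
	struct toy_iloc loc;

	assert(ino >= 1);
	index = (ino - 1) % geo->inodes_per_group;

	loc.group = (ino - 1) / geo->inodes_per_group;
	loc.block_in_table = index / inodes_per_block;
	loc.offset = (index % inodes_per_block) * geo->inode_size;
	return loc;
}
```

Nothing in the computation depends on an in-memory inode, which is what
lets the replay code look up an inode by the number stored in a fast
commit block.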
Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 7 +++++++
fs/ext4/inode.c | 32 ++++++++++++++++++--------------
fs/ext4/ioctl.c | 6 +++---
fs/ext4/namei.c | 2 +-
include/trace/events/ext4.h | 8 ++++----
5 files changed, 33 insertions(+), 22 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9c2f67c64b4f..f2603deefe51 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2651,6 +2651,8 @@ extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
/* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei);
int ext4_inode_is_fast_symlink(struct inode *inode);
struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2699,6 +2701,8 @@ extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *, int);
extern int ext4_change_inode_journal_flag(struct inode *, int);
extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
+extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+ struct ext4_iloc *iloc);
extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
@@ -2734,12 +2738,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
+extern void ext4_reset_inode_seed(struct inode *inode);
/* migrate.c */
extern int ext4_ext_migrate(struct inode *);
extern int ext4_ind_migrate(struct inode *inode);
/* namei.c */
+extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+ struct inode *inode);
extern int ext4_dirblock_csum_verify(struct inode *inode,
struct buffer_head *bh);
extern int ext4_orphan_add(handle_t *, struct inode *);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c4cde431d5fa..e902000dac51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -103,8 +103,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
return provided == calculated;
}
-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
- struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
{
__u32 csum;
@@ -4548,22 +4548,21 @@ int ext4_truncate(struct inode *inode)
* data in memory that is needed to recreate the on-disk version of this
* inode.
*/
-static int __ext4_get_inode_loc(struct inode *inode,
+static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
struct ext4_iloc *iloc, int in_mem)
{
struct ext4_group_desc *gdp;
struct buffer_head *bh;
- struct super_block *sb = inode->i_sb;
ext4_fsblk_t block;
struct blk_plug plug;
int inodes_per_block, inode_offset;
iloc->bh = NULL;
- if (inode->i_ino < EXT4_ROOT_INO ||
- inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+ if (ino < EXT4_ROOT_INO ||
+ ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
return -EFSCORRUPTED;
- iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
+ iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
if (!gdp)
return -EIO;
@@ -4572,7 +4571,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
* Figure out the offset within the block group inode table
*/
inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
- inode_offset = ((inode->i_ino - 1) %
+ inode_offset = ((ino - 1) %
EXT4_INODES_PER_GROUP(sb));
block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
@@ -4671,15 +4670,14 @@ static int __ext4_get_inode_loc(struct inode *inode,
* has in-inode xattrs, or we don't have this inode in memory.
* Read the block from disk.
*/
- trace_ext4_load_inode(inode);
+ trace_ext4_load_inode(sb, ino);
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
blk_finish_plug(&plug);
wait_on_buffer(bh);
if (!buffer_uptodate(bh)) {
- EXT4_ERROR_INODE_BLOCK(inode, block,
- "unable to read itable block");
+ ext4_error(sb, "unable to read itable block");
brelse(bh);
return -EIO;
}
@@ -4692,10 +4690,16 @@ static int __ext4_get_inode_loc(struct inode *inode,
int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
{
/* We have all inode data except xattrs in memory here. */
- return __ext4_get_inode_loc(inode, iloc,
+ return __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc,
!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
}
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+ struct ext4_iloc *iloc)
+{
+ return __ext4_get_inode_loc(sb, ino, iloc, 0);
+}
+
static bool ext4_should_use_dax(struct inode *inode)
{
if (!test_opt(inode->i_sb, DAX))
@@ -4845,7 +4849,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
ei = EXT4_I(inode);
iloc.bh = NULL;
- ret = __ext4_get_inode_loc(inode, &iloc, 0);
+ ret = __ext4_get_inode_loc(sb, inode->i_ino, &iloc, 0);
if (ret < 0)
goto bad_inode;
raw_inode = ext4_raw_inode(&iloc);
@@ -5423,7 +5427,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
} else {
struct ext4_iloc iloc;
- err = __ext4_get_inode_loc(inode, &iloc, 0);
+ err = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, &iloc, 0);
if (err)
return err;
/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 2bc655b2164e..59ff5f90ed2a 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
i_size_write(inode2, isize);
}
-static void reset_inode_seed(struct inode *inode)
+void ext4_reset_inode_seed(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -199,8 +199,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
inode->i_generation = prandom_u32();
inode_bl->i_generation = prandom_u32();
- reset_inode_seed(inode);
- reset_inode_seed(inode_bl);
+ ext4_reset_inode_seed(inode);
+ ext4_reset_inode_seed(inode_bl);
ext4_discard_preallocations(inode);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index b732c0bb1d51..48fea5ce8530 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2710,7 +2710,7 @@ struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
return ext4_next_entry(de, blocksize);
}
-static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+int ext4_init_new_dir(handle_t *handle, struct inode *dir,
struct inode *inode)
{
struct buffer_head *dir_block = NULL;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index e47059a02fec..8da371b38332 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1738,9 +1738,9 @@ TRACE_EVENT(ext4_ext_load_extent,
);
TRACE_EVENT(ext4_load_inode,
- TP_PROTO(struct inode *inode),
+ TP_PROTO(struct super_block *sb, unsigned long ino),
- TP_ARGS(inode),
+ TP_ARGS(sb, ino),
TP_STRUCT__entry(
__field( dev_t, dev )
@@ -1748,8 +1748,8 @@ TRACE_EVENT(ext4_load_inode,
),
TP_fast_assign(
- __entry->dev = inode->i_sb->s_dev;
- __entry->ino = inode->i_ino;
+ __entry->dev = sb->s_dev;
+ __entry->ino = ino;
),
TP_printk("dev %d,%d ino %ld",
--
2.24.1.735.g03f4e72817-goog
Add a debug mount option to simulate errors while replaying. If
fc_debug_max_replay is set, ext4 replays at most that many fast
commit blocks and fails the replay of any further block.
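
The cut-off can be modelled in a few lines of user-space C. The names
(toy_replay_ctx, toy_replay_one) are hypothetical; only the comparison
against the block offset and the -EINVAL return mirror the hunk in this
patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical replay context; max_replay == 0 means "no limit". */
struct toy_replay_ctx {
	int max_replay;
	int replayed;
};

/*
 * Replay the fast commit block at offset off (0-based position in the
 * fast commit area). Blocks at or beyond the limit are rejected, which
 * simulates a replay that is cut short partway through.
 */
int toy_replay_one(struct toy_replay_ctx *ctx, int off)
{
	assert(off >= 0);

	if (ctx->max_replay && off >= ctx->max_replay)
		return -EINVAL;
	ctx->replayed++;
	return 0;
}
```

With max_replay set to 2, blocks 0 and 1 are replayed and block 2 fails,
so a test can check the file system state after a partial replay.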
Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/ext4_jbd2.c | 6 ++++++
fs/ext4/super.c | 12 +++++++++++-
3 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e12900d77673..62d72e7005ad 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1652,6 +1652,9 @@ struct ext4_sb_info {
struct list_head s_fc_dentry_q;
struct ext4_fc_replay_state s_fc_replay_state;
spinlock_t s_fc_lock;
+#ifdef EXT4_FC_DEBUG
+ int s_fc_debug_max_replay;
+#endif
struct ext4_fc_stats s_fc_stats;
};
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index ef36a973ed8b..1f8ba23912ba 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -1520,6 +1520,12 @@ static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
return sbi->s_fc_replay_state.fc_replay_error;
}
+#ifdef EXT4_FC_DEBUG
+ if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) {
+ pr_warn("Dropping fc block %d because max_replay set\n", off);
+ return -EINVAL;
+ }
+#endif
sbi->s_mount_state |= EXT4_FC_REPLAY;
fc_hdr = (struct ext4_fc_commit_hdr *)
((__u8 *)bh->b_data + sizeof(journal_header_t));
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bfd19a127188..117ccde1a7c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1475,7 +1475,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
- Opt_no_fc, Opt_fc_soft_consistency
+ Opt_no_fc, Opt_fc_soft_consistency, Opt_fc_debug_max_replay
};
static const match_table_t tokens = {
@@ -1560,6 +1560,9 @@ static const match_table_t tokens = {
{Opt_noinit_itable, "noinit_itable"},
{Opt_no_fc, "no_fc"},
{Opt_fc_soft_consistency, "fc_soft_consistency"},
+#ifdef EXT4_FC_DEBUG
+ {Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
+#endif
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_nombcache, "nombcache"},
@@ -1779,6 +1782,9 @@ static const struct mount_opts {
MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
{Opt_fc_soft_consistency, EXT4_MOUNT2_JOURNAL_FC_SOFT_CONSISTENCY,
MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
+#ifdef EXT4_FC_DEBUG
+ {Opt_fc_debug_max_replay, 0, MOPT_GTE0},
+#endif
{Opt_err, 0, 0}
};
@@ -1931,6 +1937,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_li_wait_mult = arg;
} else if (token == Opt_max_dir_size_kb) {
sbi->s_max_dir_size_kb = arg;
+#ifdef EXT4_FC_DEBUG
+ } else if (token == Opt_fc_debug_max_replay) {
+ sbi->s_fc_debug_max_replay = arg;
+#endif
} else if (token == Opt_stripe) {
sbi->s_stripe = arg;
} else if (token == Opt_resuid) {
--
2.24.1.735.g03f4e72817-goog
The replay path reuses the regular file system code paths to reapply
committed changes. However, because it runs before the file system is
fully initialized, and because performance is not critical during
replay, we can and need to disable certain file system features in the
replay path. More specifically, we disable most of the extent status
tree, mballoc and some places where we mark the file system with errors.
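
The pattern used throughout this patch is an early return guarded by a
mount-state flag. Below is a minimal user-space sketch of that pattern
applied to a made-up cache; TOY_FC_REPLAY, toy_sb and the two helpers are
stand-ins for illustration, not the ext4 symbols.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FC_REPLAY 0x1u /* hypothetical stand-in for EXT4_FC_REPLAY */

struct toy_sb {
	unsigned int mount_state;
	int cache_entries; /* stand-in for an in-memory cache such as the es tree */
};

/* Insert into the cache; during replay report success but do nothing. */
int toy_cache_insert(struct toy_sb *sb)
{
	assert(sb);

	if (sb->mount_state & TOY_FC_REPLAY)
		return 0;
	sb->cache_entries++;
	return 0;
}

/* Look up in the cache; during replay always miss so callers go to disk. */
bool toy_cache_lookup(const struct toy_sb *sb)
{
	assert(sb);

	if (sb->mount_state & TOY_FC_REPLAY)
		return false;
	return sb->cache_entries > 0;
}
```

Returning "success" from writers and "not found" from readers keeps the
callers unchanged while making sure the cache is neither populated nor
trusted before the file system is fully initialized.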
Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/balloc.c | 7 +++++-
fs/ext4/ext4_jbd2.c | 2 +-
fs/ext4/extents_status.c | 24 +++++++++++++++++++
fs/ext4/ialloc.c | 52 +++++++++++++++++++++++++++-------------
fs/ext4/inode.c | 15 ++++++++----
fs/ext4/mballoc.c | 22 +++++++++++------
6 files changed, 92 insertions(+), 30 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 14787065d030..2cf39a6b9f7a 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -360,7 +360,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
- struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+ struct ext4_group_info *grp;
+
+ if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
+ grp = ext4_get_group_info(sb, block_group);
if (buffer_verified(bh))
return 0;
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index f371f1f0f914..7c6cdbc63aa6 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -77,7 +77,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
return ERR_PTR(err);
journal = EXT4_SB(sb)->s_journal;
- if (!journal)
+ if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
return ext4_get_nojournal();
return jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS,
type, line);
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index d996b44d2265..69c16ac7416e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -311,6 +311,9 @@ void ext4_es_find_extent_range(struct inode *inode,
ext4_lblk_t lblk, ext4_lblk_t end,
struct extent_status *es)
{
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return;
+
trace_ext4_es_find_extent_range_enter(inode, lblk);
read_lock(&EXT4_I(inode)->i_es_lock);
@@ -361,6 +364,9 @@ bool ext4_es_scan_range(struct inode *inode,
{
bool ret;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return false;
+
read_lock(&EXT4_I(inode)->i_es_lock);
ret = __es_scan_range(inode, matching_fn, lblk, end);
read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -404,6 +410,9 @@ bool ext4_es_scan_clu(struct inode *inode,
{
bool ret;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return false;
+
read_lock(&EXT4_I(inode)->i_es_lock);
ret = __es_scan_clu(inode, matching_fn, lblk);
read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -812,6 +821,9 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
int err = 0;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
lblk, len, pblk, status, inode->i_ino);
@@ -873,6 +885,9 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return;
+
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, pblk, status);
@@ -908,6 +923,9 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
struct rb_node *node;
int found = 0;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
trace_ext4_es_lookup_extent_enter(inode, lblk);
es_debug("lookup extent in block %u\n", lblk);
@@ -1419,6 +1437,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
int err = 0;
int reserved = 0;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
trace_ext4_es_remove_extent(inode, lblk, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
lblk, len, inode->i_ino);
@@ -1969,6 +1990,9 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
struct extent_status newes;
int err = 0;
+ if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
lblk, inode->i_ino);
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index f1a1432f9ffa..24a2d7171cd4 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,7 +82,12 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
- struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+ struct ext4_group_info *grp;
+
+ if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+ return 0;
+
+ grp = ext4_get_group_info(sb, block_group);
if (buffer_verified(bh))
return 0;
@@ -287,15 +292,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
/* Don't bother if the inode bitmap is corrupt. */
- grp = ext4_get_group_info(sb, block_group);
if (IS_ERR(bitmap_bh)) {
fatal = PTR_ERR(bitmap_bh);
bitmap_bh = NULL;
goto error_return;
}
- if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
- fatal = -EFSCORRUPTED;
- goto error_return;
+ if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+ grp = ext4_get_group_info(sb, block_group);
+ if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
+ fatal = -EFSCORRUPTED;
+ goto error_return;
+ }
}
BUFFER_TRACE(bitmap_bh, "get_write_access");
@@ -871,7 +878,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
struct inode *ret;
ext4_group_t i;
ext4_group_t flex_group;
- struct ext4_group_info *grp;
+ struct ext4_group_info *grp = NULL;
int encrypt = 0;
/* Cannot create files in a deleted directory */
@@ -1009,15 +1016,21 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
if (ext4_free_inodes_count(sb, gdp) == 0)
goto next_group;
- grp = ext4_get_group_info(sb, group);
- /* Skip groups with already-known suspicious inode tables */
- if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
- goto next_group;
+ if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+ grp = ext4_get_group_info(sb, group);
+ /*
+ * Skip groups with already-known suspicious inode
+ * tables
+ */
+ if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
+ goto next_group;
+ }
brelse(inode_bitmap_bh);
inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
/* Skip groups with suspicious inode tables */
- if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) ||
+ if (((!(sbi->s_mount_state & EXT4_FC_REPLAY))
+ && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
IS_ERR(inode_bitmap_bh)) {
inode_bitmap_bh = NULL;
goto next_group;
@@ -1036,7 +1049,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
goto next_group;
}
- if (!handle) {
+ if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
BUG_ON(nblocks <= 0);
handle = __ext4_journal_start_sb(dir->i_sb, line_no,
handle_type, nblocks,
@@ -1140,9 +1153,15 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
/* Update the relevant bg descriptor fields */
if (ext4_has_group_desc_csum(sb)) {
int free;
- struct ext4_group_info *grp = ext4_get_group_info(sb, group);
-
- down_read(&grp->alloc_sem); /* protect vs itable lazyinit */
+ struct ext4_group_info *grp = NULL;
+
+ if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+ grp = ext4_get_group_info(sb, group);
+ down_read(&grp->alloc_sem); /*
+ * protect vs itable
+ * lazyinit
+ */
+ }
ext4_lock_group(sb, group); /* while we modify the bg desc */
free = EXT4_INODES_PER_GROUP(sb) -
ext4_itable_unused_count(sb, gdp);
@@ -1158,7 +1177,8 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
if (ino > free)
ext4_itable_unused_set(sb, gdp,
(EXT4_INODES_PER_GROUP(sb) - ino));
- up_read(&grp->alloc_sem);
+ if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
+ up_read(&grp->alloc_sem);
} else {
ext4_lock_group(sb, group);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e902000dac51..f8090f5ce5e4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -527,7 +527,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
return -EFSCORRUPTED;
/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
+ ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
map->m_pblk = ext4_es_pblock(&es) +
map->m_lblk - es.es_lblk;
@@ -969,7 +970,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
int create = map_flags & EXT4_GET_BLOCKS_CREATE;
int err;
- J_ASSERT(handle != NULL || create == 0);
+ J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ || handle != NULL || create == 0);
map.m_lblk = block;
map.m_len = 1;
@@ -985,7 +987,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
return ERR_PTR(-ENOMEM);
if (map.m_flags & EXT4_MAP_NEW) {
J_ASSERT(create != 0);
- J_ASSERT(handle != NULL);
+ J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+ || (handle != NULL));
/*
* Now that we do not always journal data, we should
@@ -4896,8 +4899,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
}
if (!ext4_inode_csum_verify(inode, raw_inode, ei)) {
- ext4_error_inode(inode, function, line, 0,
- "iget: checksum invalid");
+
+ if (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+ ext4_error_inode(inode, function, line, 0,
+ "iget: checksum invalid");
ret = -EFSBADCRC;
goto bad_inode;
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 05ca9001f8fa..4f2d611e5a75 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1449,14 +1449,17 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
blocknr += EXT4_C2B(sbi, block);
- ext4_grp_locked_error(sb, e4b->bd_group,
- inode ? inode->i_ino : 0,
- blocknr,
- "freeing already freed block "
- "(bit %u); block bitmap corrupt.",
- block);
- ext4_mark_group_bitmap_corrupted(sb, e4b->bd_group,
+ if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+ ext4_grp_locked_error(sb, e4b->bd_group,
+ inode ? inode->i_ino : 0,
+ blocknr,
+ "freeing already freed block "
+ "(bit %u); block bitmap corrupt.",
+ block);
+ ext4_mark_group_bitmap_corrupted(
+ sb, e4b->bd_group,
EXT4_GROUP_INFO_BBITMAP_CORRUPT);
+ }
mb_regenerate_buddy(e4b);
goto done;
}
@@ -4088,6 +4091,9 @@ void ext4_discard_preallocations(struct inode *inode)
return;
}
+ if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+ return;
+
mb_debug(1, "discard preallocation for inode %lu\n", inode->i_ino);
trace_ext4_discard_preallocations(inode);
@@ -4561,6 +4567,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
sb = ar->inode->i_sb;
sbi = EXT4_SB(sb);
+ WARN_ON(sbi->s_mount_state & EXT4_FC_REPLAY);
+
trace_ext4_request_blocks(ar);
/* Allow to use superuser reservation for quota file */
--
2.24.1.735.g03f4e72817-goog
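
The replay guards in the hunks above all share one shape: test the
replay bit in s_mount_state with a bitwise AND and return early, so the
extent status tree, preallocations and bitmap checks are left alone
while fast commit blocks are replayed. Below is a minimal user-space
sketch of that shape. The names demo_* and the bit value DEMO_FC_REPLAY
are hypothetical placeholders, not taken from the series; the real
EXT4_FC_REPLAY value lives in ext4.h and is not shown in this patch.
The second helper shows why the operator matters: with a bitwise OR the
test is non-zero for every mount state, so an assertion built on it can
never fire.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the EXT4_FC_REPLAY bit (value is assumed). */
#define DEMO_FC_REPLAY 0x0020u

/* Intended test: true only while the replay bit is set. */
static bool demo_in_replay(unsigned int mount_state)
{
	return (mount_state & DEMO_FC_REPLAY) != 0;
}

/*
 * Same test written with | instead of &: the result is non-zero for
 * every mount_state, so it cannot distinguish replay from normal use.
 */
static bool demo_in_replay_always_true(unsigned int mount_state)
{
	return (mount_state | DEMO_FC_REPLAY) != 0;
}

/*
 * Shape of the guard used by the ext4_es_*() hunks: do nothing and
 * report success during replay, otherwise perform the update.
 * tree_updates is a hypothetical counter standing in for the real work.
 */
static int demo_es_insert(unsigned int mount_state, int *tree_updates)
{
	if (demo_in_replay(mount_state))
		return 0;

	(*tree_updates)++;
	return 0;
}
```

Calling demo_es_insert() with DEMO_FC_REPLAY set leaves the counter
untouched, while calling it with a mount state of 0 increments it once.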
Hi Harshad,
cc Ted

Sorry, but I have some thoughts about this fast commit work that I
would like to share with you.

The v4 fast commit series has about 20 patches, which is a lot.
I wonder whether fast commit really needs to be this complex.

Maybe I have not understood how difficult the fast commit work is,
so I would appreciate a more detailed description of how the patches
in v4 relate to each other, and especially of why so many patches
are needed.

From my point of view, the purpose of fast commit is to fix the high
cost of fsync on ext4. We should first solve the actual customer
problems that exist in ext4 file systems.

So, in my opinion, the first release of fast commit only needs to
reduce the time cost of fsync caused by the jbd2 ordering shortcoming
described in the iJournaling paper.
It does not need to do so many other unnecessary things.

If I have free time, I will continue reviewing these patches.
Thank you for your reply.

On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
<[email protected]> wrote:
>
> This patch series adds support for fast commits which is a simplified
> version of the scheme proposed by Park and Shin, in their paper,
> "iJournaling: Fine-Grained Journaling for Improving the Latency of
> Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> give the client file system an opportunity to perform a faster
> commit. Only if the file system cannot perform such a commit
> operation, then JBD2 should fall back to traditional commits.
>
> Because JBD2 operates at block granularity, for every file system
> metadata update all the changed blocks are written to the
> journal at commit time. This is inefficient because updates to some
> blocks that JBD2 commits are derivable from some other blocks. For
> example, if a new extent is added to an inode, then corresponding
> updates to the inode table, the block bitmap, the group descriptor and
> the superblock can be derived based on just the extent information and
> the corresponding inode information. So, if we take this relationship
> between blocks into account and replay the journalled blocks smartly,
> we could increase performance of file system commits significantly.
>
> Fast commits introduced in this patch have two main contributions:
>
> (1) Making JBD2 fast commit aware, so that clients of JBD2 can
> implement fast commits
>
> (2) Add support in ext4 to use JBD2's new interfaces and implement
> fast commits.
>
> Ext4 supports two modes of fast commits: 1) fast commits with hard
> consistency guarantees 2) fast commits with soft consistency guarantees
>
> When hard consistency is enabled, fast commit guarantees that all the
> updates will be committed. After a successful replay of fast commits
> blocks in hard consistency mode, the entire file system would be in
> the same state as that when fsync() returned before crash. This
> guarantee is similar to what jbd2 gives with full commits.
>
> With soft consistency, file system only guarantees consistency for the
> inode in question. In this mode, file system will try to write as little
> data to the backend as possible during the commit time. To be precise,
> file system records all the data updates for the inode in question and
> directory updates that are required for guaranteeing consistency of the
> inode in question.
>
> In our evaluations, fast commits with hard consistency performed
> better than fast commits with soft consistency. That's because with
> hard consistency, a fast commit often ends up committing other inodes
> together, while with soft consistency commits get serialized. Future
> work can look at creating hybrid approach between the two extremes
> that are there in this patchset.
>
> Testing
> -------
>
> e2fsprogs was updated to set fast commit feature flag and to ignore
> fast commit blocks during e2fsck.
>
> https://github.com/harshadjs/e2fsprogs.git
>
> After applying all the patches in this series, following runs of
> xfstests were performed:
>
> - kvm-xfstests.sh -g log -c 4k
> - kvm-xfstests.sh smoke
>
> All the log tests were successful and smoke tests didn't introduce any
> additional failures.
>
> Performance Evaluation
> ----------------------
>
> Ext4 file system performance was tested with full commits, with fast
> commits with soft consistency and with fast commits with hard
> consistency. fs_mark benchmark showed that depending on the file size,
> performance improvement was seen up to 50%. Soft fast commits performed
> slightly worse than hard fast commits. But soft fast commits ended up
> writing slightly fewer blocks to disk.
>
> Changes since V3:
>
> - Removed invocation of fast commits from the jbd2 thread.
>
> - Removed sub transaction ID from journal_t.
>
> - Added rename, truncate, punch hole support.
>
> - Added soft consistency mode and hard consistency mode.
>
> - More bug fixes and refactoring.
>
> - Added better debugging support: more tracepoints and debug mount
> options.
>
> Harshad Shirwadkar(20):
> ext4: add debug mount option to test fast commit replay
> ext4: add fast commit replay path
> ext4: disable certain features in replay path
> ext4: add idempotent helpers to manipulate bitmaps
> ext4: fast commit recovery path preparation
> jbd2: add fast commit recovery path support
> ext4: main commit routine for fast commits
> jbd2: add new APIs for commit path of fast commits
> ext4: add fast commit on-disk format structs and helpers
> ext4: add fast commit track points
> ext4: break ext4_unlink() and ext4_link()
> ext4: add inode tracking and ineligible marking routines
> ext4: add directory entry tracking routines
> ext4: add generic diff tracking routines and range tracking
> jbd2: fast commit main commit path changes
> jbd2: disable fast commits if journal is empty
> jbd2: add fast commit block tracker variables
> ext4, jbd2: add fast commit initialization routines
> ext4: add handling for extended mount options
> ext4: update docs for fast commit feature
>
> Documentation/filesystems/ext4/journal.rst | 127 ++-
> Documentation/filesystems/journalling.rst | 18 +
> fs/ext4/acl.c | 1 +
> fs/ext4/balloc.c | 10 +-
> fs/ext4/ext4.h | 127 +++
> fs/ext4/ext4_jbd2.c | 1484 +++++++++++++++++++++++++++-
> fs/ext4/ext4_jbd2.h | 71 ++
> fs/ext4/extents.c | 5 +
> fs/ext4/extents_status.c | 24 +
> fs/ext4/fsync.c | 2 +-
> fs/ext4/ialloc.c | 165 +++-
> fs/ext4/inline.c | 3 +
> fs/ext4/inode.c | 77 +-
> fs/ext4/ioctl.c | 9 +-
> fs/ext4/mballoc.c | 157 ++-
> fs/ext4/mballoc.h | 2 +
> fs/ext4/migrate.c | 1 +
> fs/ext4/namei.c | 172 ++--
> fs/ext4/super.c | 72 +-
> fs/ext4/xattr.c | 6 +
> fs/jbd2/commit.c | 61 ++
> fs/jbd2/journal.c | 217 +++-
> fs/jbd2/recovery.c | 67 +-
> include/linux/jbd2.h | 83 +-
> include/trace/events/ext4.h | 208 +++-
> 25 files changed, 3037 insertions(+), 132 deletions(-)
> ---
> Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> Documentation/filesystems/journalling.rst | 18 +++
> 2 files changed, 139 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..f94e66f2f8c4 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> disk before the metadata are written to disk through the journal.
>
> The journal inode is typically inode 8. The first 68 bytes of the
> -journal inode are replicated in the ext4 superblock. The journal itself
> -is normal (but hidden) file within the filesystem. The file usually
> -consumes an entire block group, though mke2fs tries to put it in the
> -middle of the disk.
> +journal inode are replicated in the ext4 superblock. The journal
> +itself is normal (but hidden) file within the filesystem. The file
> +usually consumes an entire block group, though mke2fs tries to put it
> +in the middle of the disk.
>
> All fields in jbd2 are written to disk in big-endian order. This is the
> opposite of ext4.
> @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> The maximum size of a journal embedded in an ext4 filesystem is 2^32
> blocks. jbd2 itself does not seem to care.
>
> +Fast Commits
> +~~~~~~~~~~~~
> +
> +Ext4 also implements fast commits and integrates them with JBD2 journalling.
> +Fast commits store metadata changes made to the file system as inode level
> +diff. In other words, each fast commit block identifies updates made to
> +a particular inode and collectively they represent total changes made to
> +the file system.
> +
> +A fast commit is valid only if there is no full commit after that particular
> +fast commit. Because of this feature, fast commit blocks can be reused by
> +the following transactions.
> +
> +Each fast commit block stores updates to 1 particular inode. Updates in each
> +fast commit block are one of the 2 types:
> +- Data updates (add range / delete range)
> +- Directory entry updates (Add / remove links)
> +
> +Fast commit blocks must be replayed in the order in which they appear on disk.
> +That's because directory entry updates are written in fast commit blocks
> +in the order in which they are applied on the file system before crash.
> +Changing the order of replaying for directory entry updates may result
> +in inconsistent file system. Note that only directory entry updates need
> +ordering; data updates, since they apply to only one inode, do not require
> +ordered replay. Also, fast commits guarantee that file system is in consistent
> +state after replay of each fast commit block as long as order of replay has
> +been followed.
> +
> +Note that directory inode updates are never directly recorded in fast commits.
> +Just like other file system level metadata, updates to directories are always
> +implied based on directory entry updates stored in fast commit blocks.
> +
> +Based on which directory entry updates are committed with an inode, fast
> +commits have two modes of operation:
> +
> +- Hard Consistency (default)
> +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> +
> +When hard consistency is enabled, fast commit guarantees that all the updates
> +will be committed. After a successful replay of fast commits blocks
> +in hard consistency mode, the entire file system would be in the same state as
> +that when fsync() returned before crash. This guarantee is similar to what
> +jbd2 gives.
> +
> +With soft consistency, file system only guarantees consistency for the
> +inode in question. In this mode, file system will try to write as little data
> +to the backend as possible during the commit time. To be precise, file system
> +records all the data updates for the inode in question and directory updates
> +that are required for guaranteeing consistency of the inode in question.
> +
> Layout
> ~~~~~~
>
> Generally speaking, the journal has this format:
>
> .. list-table::
> - :widths: 16 48 16
> + :widths: 16 48 16 18
> :header-rows: 1
>
> * - Superblock
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> - One transaction
> -
> + -
>
> Notice that a transaction begins with either a descriptor and some data,
> or a block revocation list. A finished transaction always ends with a
> @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> superblock.
>
> .. list-table::
> - :widths: 12 12 12 32 12
> + :widths: 12 12 12 32 12 12
> :header-rows: 1
>
> * - 1024 bytes of padding
> @@ -85,11 +137,13 @@ superblock.
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> -
> -
> - One transaction
> -
> + -
>
> Block Header
> ~~~~~~~~~~~~
> @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> - h\_commit\_nsec
> - Nanoseconds component of the above timestamp.
>
> +Fast Commit Block
> +~~~~~~~~~~~~~~~~~
> +
> +The fast commit block indicates an append to the last commit block
> +that was written to the journal. One fast commit block records updates
> +to one inode. So, typically you would find as many fast commit blocks
> +as the number of inodes that got changed since the last commit. A fast
> +commit block is valid only if there is no commit block present with
> +transaction ID greater than that of the fast commit block. If such a
> +block is present, then there is no need to replay the fast commit
> +block.
> +
> +.. list-table::
> + :widths: 8 8 24 40
> + :header-rows: 1
> +
> + * - Offset
> + - Type
> + - Name
> + - Descriptor
> + * - 0x0
> + - journal\_header\_s
> + - (open coded)
> + - Common block header.
> + * - 0xC
> + - \_\_le32
> + - fc\_magic
> + - Magic value which should be set to 0xE2540090. This identifies
> + that this block is a fast commit block.
> + * - 0x10
> + - \_\_u8
> + - fc\_features
> + - Features used by this fast commit block.
> + * - 0x11
> + - \_\_le16
> + - fc\_num\_tlvs
> + - Number of TLVs contained in this fast commit block
> + * - 0x13
> + - \_\_le32
> + - \_\_fc\_len
> + - Length of the fast commit block in terms of number of blocks
> + * - 0x17
> + - \_\_le32
> + - fc\_ino
> + - Inode number of the inode that will be recovered using this fast commit
> + * - 0x2B
> + - struct ext4\_inode
> + - inode
> + - On-disk copy of the inode at the commit time
> + * - <Variable based on inode size>
> + - struct ext4\_fc\_tl
> + - Array of struct ext4\_fc\_tl
> + - The actual delta with the last commit. Starting at this offset,
> + there is an array of TLVs that indicate which extents
> + should be present in the corresponding inode. Currently, the
> + following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> + should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> + that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> + (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> + (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> + (dentry for a file that is being created for the first time).
> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..1cb116ab27ab 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> ``transaction->t_private_list`` for attaching entries to a transaction
> that need processing when the transaction commits.
>
> +JBD2 also allows client file systems to implement file system specific
> +commits which are called ``fast commits``. Fast commits are
> +asynchronous in nature i.e. file systems can call their own commit
> +functions at any time. In order to avoid the race with kjournald
> +thread and other possible fast commits that may be happening in
> +parallel, file systems should first call
> +:c:func:`jbd2_start_async_fc()`. File system can call
> +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> +commits. Once a fast commit is completed, file system should call
> +:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
> +committers and the kjournald thread. After performing either a fast
> +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> +file systems to perform cleanups for their internal fast commit
> +related data structures. At the replay time, JBD2 passes each and
> +every fast commit block to the file system via
> +``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
> +mechanism to improve journal commit performance.
> +
> JBD2 also provides a way to block all transaction updates via
> :c:func:`jbd2_journal_lock_updates()` /
> :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> --
> 2.24.1.735.g03f4e72817-goog
>
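
The "Fast Commit Block" table quoted above can be cross-checked with a
small packed struct. This is only a sketch under stated assumptions:
the struct and function names (demo_*) are hypothetical, the common
header is modelled as the three 32-bit fields of journal_header_s, and
__attribute__((packed)) is a GCC/Clang extension used here to reproduce
the unaligned offsets from the table. Byte order is ignored. The listed
fields end at 0x1B, while the table places the inode copy at 0x2B; the
sketch does not try to explain that difference. The helper at the end
restates the validity rule from the text (a fast commit block is valid
only if no commit block with a greater transaction ID exists) and
ignores transaction ID wraparound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Magic value quoted in the documentation table. */
#define DEMO_FC_MAGIC 0xE2540090u

/* Common jbd2 block header: three 32-bit fields, 12 bytes in total. */
struct demo_journal_header {
	uint32_t h_magic;
	uint32_t h_blocktype;
	uint32_t h_sequence;
};

/* Hypothetical mirror of the fixed-size fields listed in the table. */
struct demo_fc_hdr {
	struct demo_journal_header hdr; /* 0x00: common block header */
	uint32_t fc_magic;              /* 0x0C: DEMO_FC_MAGIC */
	uint8_t  fc_features;           /* 0x10: feature flags */
	uint16_t fc_num_tlvs;           /* 0x11: number of TLVs */
	uint32_t fc_len;                /* 0x13: length in blocks */
	uint32_t fc_ino;                /* 0x17: inode being recovered */
} __attribute__((packed));

/* Compile-time checks against the offsets given in the table. */
_Static_assert(offsetof(struct demo_fc_hdr, fc_magic) == 0x0C, "fc_magic");
_Static_assert(offsetof(struct demo_fc_hdr, fc_features) == 0x10, "fc_features");
_Static_assert(offsetof(struct demo_fc_hdr, fc_num_tlvs) == 0x11, "fc_num_tlvs");
_Static_assert(offsetof(struct demo_fc_hdr, fc_len) == 0x13, "fc_len");
_Static_assert(offsetof(struct demo_fc_hdr, fc_ino) == 0x17, "fc_ino");

/*
 * Validity rule from the text: replay a fast commit block only if the
 * newest full commit does not have a greater transaction ID.
 * Wraparound of transaction IDs is not handled in this sketch.
 */
static bool demo_fc_block_valid(uint32_t fc_tid, uint32_t newest_commit_tid)
{
	return newest_commit_tid <= fc_tid;
}
```

For example, demo_fc_block_valid(10, 9) holds, while a full commit with
transaction ID 11 makes a fast commit block with ID 10 stale.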
Hi Harshad,

I apologize for the blunt and inappropriate wording in my last email.
What I meant to say is that an iterative development approach might
work better for getting these patches applied.
The first release cannot do everything. It is good enough if it
delivers just one major piece of ext4 fast commit functionality.
I understand that the development work for fast commit is difficult
and complex.
So I am very grateful for your work on ext4 fast commit. :)

On Thu, Jan 9, 2020 at 12:29 PM xiaohui li
<[email protected]> wrote:
>
> hi Harshad
> cc ted
>
> sorry, but i have some idea about this fast commit which i want to
> share with you.
>
> there are nearly 20 patches about this v4 fast commit , so many patches.
> I wonder if necessary to make this fast commit function so complexly.
>
> maybe i have not understand the difficulty of the fast commit coding work.
> so I appreciate it very much if you give some more detailed
> descriptions about the patches correlationship of v4 fast commit,
> especially the reason why need have so many patches.
>
> from my viewpoint, the purpose of doing this fast commit function is
> to resolve the ext4 fsync time-cost-so-much problem.
> firstly we need to resolve some actual customer problems which exist
> in ext4 filesystems when doing this fast commit function.
>
> so the first release version of fast commit is just only to accomplish
> the goal of reducing the time cost of fsync because of jbd2 order
> shortcoming described in ijournal paper from my opinion.
> it need not do so many other unnecessary things.
>
> if i have free time , I will review these patches continually.
> thank you for your reply.
>
>
>
>
>
> On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
> <[email protected]> wrote:
> >
> > This patch series adds support for fast commits which is a simplified
> > version of the scheme proposed by Park and Shin, in their paper,
> > "iJournaling: Fine-Grained Journaling for Improving the Latency of
> > Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> > give the client file system an opportunity to perform a faster
> > commit. Only if the file system cannot perform such a commit
> > operation, then JBD2 should fall back to traditional commits.
> >
> > Because JBD2 operates at block granularity, for every file system
> > metadata update all the changed blocks are written to the
> > journal at commit time. This is inefficient because updates to some
> > blocks that JBD2 commits are derivable from some other blocks. For
> > example, if a new extent is added to an inode, then corresponding
> > updates to the inode table, the block bitmap, the group descriptor and
> > the superblock can be derived based on just the extent information and
> > the corresponding inode information. So, if we take this relationship
> > between blocks into account and replay the journalled blocks smartly,
> > we could increase performance of file system commits significantly.
> >
> > Fast commits introduced in this patch have two main contributions:
> >
> > (1) Making JBD2 fast commit aware, so that clients of JBD2 can
> > implement fast commits
> >
> > (2) Add support in ext4 to use JBD2's new interfaces and implement
> > fast commits.
> >
> > Ext4 supports two modes of fast commits: 1) fast commits with hard
> > consistency guarantees 2) fast commits with soft consistency guarantees
> >
> > When hard consistency is enabled, fast commit guarantees that all the
> > updates will be committed. After a successful replay of fast commits
> > blocks in hard consistency mode, the entire file system would be in
> > the same state as that when fsync() returned before crash. This
> > guarantee is similar to what jbd2 gives with full commits.
> >
> > With soft consistency, file system only guarantees consistency for the
> > inode in question. In this mode, file system will try to write as little
> > data to the backend as possible during the commit time. To be precise,
> > file system records all the data updates for the inode in question and
> > directory updates that are required for guaranteeing consistency of the
> > inode in question.
> >
> > In our evaluations, fast commits with hard consistency performed
> > better than fast commits with soft consistency. That's because with
> > hard consistency, a fast commit often ends up committing other inodes
> > together, while with soft consistency commits get serialized. Future
> > work can look at creating hybrid approach between the two extremes
> > that are there in this patchset.
> >
> > Testing
> > -------
> >
> > e2fsprogs was updated to set fast commit feature flag and to ignore
> > fast commit blocks during e2fsck.
> >
> > https://github.com/harshadjs/e2fsprogs.git
> >
> > After applying all the patches in this series, following runs of
> > xfstests were performed:
> >
> > - kvm-xfstests.sh -g log -c 4k
> > - kvm-xfstests.sh smoke
> >
> > All the log tests were successful and smoke tests didn't introduce any
> > additional failures.
> >
> > Performance Evaluation
> > ----------------------
> >
> > Ext4 file system performance was tested with full commits, with fast
> > commits with soft consistency and with fast commits with hard
> > consistency. fs_mark benchmark showed that depending on the file size,
> > performance improvement was seen up to 50%. Soft fast commits performed
> > slightly worse than hard fast commits. But soft fast commits ended up
> > writing slightly fewer blocks to disk.
> >
> > Changes since V3:
> >
> > - Removed invocation of fast commits from the jbd2 thread.
> >
> > - Removed sub transaction ID from journal_t.
> >
> > - Added rename, truncate, punch hole support.
> >
> > - Added soft consistency mode and hard consistency mode.
> >
> > - More bug fixes and refactoring.
> >
> > - Added better debugging support: more tracepoints and debug mount
> > options.
> >
> > Harshad Shirwadkar(20):
> > ext4: add debug mount option to test fast commit replay
> > ext4: add fast commit replay path
> > ext4: disable certain features in replay path
> > ext4: add idempotent helpers to manipulate bitmaps
> > ext4: fast commit recovery path preparation
> > jbd2: add fast commit recovery path support
> > ext4: main commit routine for fast commits
> > jbd2: add new APIs for commit path of fast commits
> > ext4: add fast commit on-disk format structs and helpers
> > ext4: add fast commit track points
> > ext4: break ext4_unlink() and ext4_link()
> > ext4: add inode tracking and ineligible marking routines
> > ext4: add directory entry tracking routines
> > ext4: add generic diff tracking routines and range tracking
> > jbd2: fast commit main commit path changes
> > jbd2: disable fast commits if journal is empty
> > jbd2: add fast commit block tracker variables
> > ext4, jbd2: add fast commit initialization routines
> > ext4: add handling for extended mount options
> > ext4: update docs for fast commit feature
> >
> > Documentation/filesystems/ext4/journal.rst | 127 ++-
> > Documentation/filesystems/journalling.rst | 18 +
> > fs/ext4/acl.c | 1 +
> > fs/ext4/balloc.c | 10 +-
> > fs/ext4/ext4.h | 127 +++
> > fs/ext4/ext4_jbd2.c | 1484 +++++++++++++++++++++++++++-
> > fs/ext4/ext4_jbd2.h | 71 ++
> > fs/ext4/extents.c | 5 +
> > fs/ext4/extents_status.c | 24 +
> > fs/ext4/fsync.c | 2 +-
> > fs/ext4/ialloc.c | 165 +++-
> > fs/ext4/inline.c | 3 +
> > fs/ext4/inode.c | 77 +-
> > fs/ext4/ioctl.c | 9 +-
> > fs/ext4/mballoc.c | 157 ++-
> > fs/ext4/mballoc.h | 2 +
> > fs/ext4/migrate.c | 1 +
> > fs/ext4/namei.c | 172 ++--
> > fs/ext4/super.c | 72 +-
> > fs/ext4/xattr.c | 6 +
> > fs/jbd2/commit.c | 61 ++
> > fs/jbd2/journal.c | 217 +++-
> > fs/jbd2/recovery.c | 67 +-
> > include/linux/jbd2.h | 83 +-
> > include/trace/events/ext4.h | 208 +++-
> > 25 files changed, 3037 insertions(+), 132 deletions(-)
> > ---
> > Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> > Documentation/filesystems/journalling.rst | 18 +++
> > 2 files changed, 139 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > index ea613ee701f5..f94e66f2f8c4 100644
> > --- a/Documentation/filesystems/ext4/journal.rst
> > +++ b/Documentation/filesystems/ext4/journal.rst
> > @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > disk before the metadata are written to disk through the journal.
> >
> > The journal inode is typically inode 8. The first 68 bytes of the
> > -journal inode are replicated in the ext4 superblock. The journal itself
> > -is normal (but hidden) file within the filesystem. The file usually
> > -consumes an entire block group, though mke2fs tries to put it in the
> > -middle of the disk.
> > +journal inode are replicated in the ext4 superblock. The journal
> > +itself is normal (but hidden) file within the filesystem. The file
> > +usually consumes an entire block group, though mke2fs tries to put it
> > +in the middle of the disk.
> >
> > All fields in jbd2 are written to disk in big-endian order. This is the
> > opposite of ext4.
> > @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> > The maximum size of a journal embedded in an ext4 filesystem is 2^32
> > blocks. jbd2 itself does not seem to care.
> >
> > +Fast Commits
> > +~~~~~~~~~~~~
> > +
> > +Ext4 also implements fast commits and integrates them with JBD2 journalling.
> > +Fast commits store metadata changes made to the file system as inode-level
> > +diffs. In other words, each fast commit block identifies updates made to
> > +a particular inode, and collectively they represent the total changes made
> > +to the file system.
> > +
> > +A fast commit is valid only if there is no full commit after that particular
> > +fast commit. Because of this feature, fast commit blocks can be reused by
> > +the following transactions.
> > +
> > +Each fast commit block stores updates to one particular inode. Updates in each
> > +fast commit block are one of two types:
> > +- Data updates (add range / delete range)
> > +- Directory entry updates (Add / remove links)
> > +
> > +Fast commit blocks must be replayed in the order in which they appear on disk.
> > +That's because directory entry updates are written in fast commit blocks
> > +in the order in which they were applied to the file system before the crash.
> > +Changing the replay order for directory entry updates may result in an
> > +inconsistent file system. Note that only directory entry updates need
> > +ordering; data updates, since they apply to only one inode, do not require
> > +ordered replay. Also, fast commits guarantee that the file system is in a
> > +consistent state after the replay of each fast commit block, as long as the
> > +replay order has been followed.
> > +
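The ordering requirement for directory entry updates can be seen with a toy model. The sketch below is only an illustration under stated assumptions: `toy_update`, `toy_replay()` and the `TOY_*` names are hypothetical and do not come from the patch, and a directory is reduced to an array of booleans that say whether a name is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two kinds of directory entry update. */
enum toy_op { TOY_ADD_DENTRY, TOY_DEL_DENTRY };

struct toy_update {
	enum toy_op op;
	int name;	/* slot of the directory entry in the toy directory */
};

#define TOY_DIR_SLOTS 8

/* Apply the updates front to back, i.e. in the order they were logged. */
static void toy_replay(bool dir[TOY_DIR_SLOTS],
		       const struct toy_update *u, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(u[i].name >= 0 && u[i].name < TOY_DIR_SLOTS);
		/* An add marks the name present, a delete marks it absent. */
		dir[u[i].name] = (u[i].op == TOY_ADD_DENTRY);
	}
}
```

Replaying "add name 3, then delete name 3" in the logged order leaves slot 3 empty, while replaying the same two updates in swapped order leaves it occupied, which is the kind of divergence the documentation warns about.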
> > +Note that directory inode updates are never directly recorded in fast commits.
> > +Just like other file system level metadata, updates to directories are always
> > +implied based on directory entry updates stored in fast commit blocks.
> > +
> > +Based on which directory entry updates are committed with an inode, fast
> > +commits have two modes of operation:
> > +
> > +- Hard Consistency (default)
> > +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> > +
> > +When hard consistency is enabled, fast commit guarantees that all the updates
> > +will be committed. After a successful replay of fast commit blocks
> > +in hard consistency mode, the entire file system would be in the same state as
> > +it was when fsync() returned before the crash. This guarantee is similar to
> > +what jbd2 gives with full commits.
> > +
> > +With soft consistency, the file system only guarantees consistency for the
> > +inode in question. In this mode, the file system will try to write as little
> > +data to the backend as possible at commit time. To be precise, the file system
> > +records all the data updates for the inode in question and the directory
> > +updates that are required for guaranteeing consistency of that inode.
> > +
> > Layout
> > ~~~~~~
> >
> > Generally speaking, the journal has this format:
> >
> > .. list-table::
> > - :widths: 16 48 16
> > + :widths: 16 48 16 18
> > :header-rows: 1
> >
> > * - Superblock
> > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > revocations] commmit\_block
> > - [more transactions...]
> > + - [Fast commits...]
> > * -
> > - One transaction
> > -
> > + -
> >
> > Notice that a transaction begins with either a descriptor and some data,
> > or a block revocation list. A finished transaction always ends with a
> > @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> > superblock.
> >
> > .. list-table::
> > - :widths: 12 12 12 32 12
> > + :widths: 12 12 12 32 12 12
> > :header-rows: 1
> >
> > * - 1024 bytes of padding
> > @@ -85,11 +137,13 @@ superblock.
> > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > revocations] commmit\_block
> > - [more transactions...]
> > + - [Fast commits...]
> > * -
> > -
> > -
> > - One transaction
> > -
> > + -
> >
> > Block Header
> > ~~~~~~~~~~~~
> > @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> > - h\_commit\_nsec
> > - Nanoseconds component of the above timestamp.
> >
> > +Fast Commit Block
> > +~~~~~~~~~~~~~~~~~
> > +
> > +The fast commit block indicates an append to the last commit block
> > +that was written to the journal. One fast commit block records updates
> > +to one inode. So, typically you would find as many fast commit blocks
> > +as the number of inodes that got changed since the last commit. A fast
> > +commit block is valid only if there is no commit block present with
> > +transaction ID greater than that of the fast commit block. If such a
> > +block is present, then there is no need to replay the fast commit
> > +block.
> > +
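The validity rule in the paragraph above can be written as a small predicate. This is a minimal sketch, not code from the patch: `fc_block_is_valid()` and `tid_newer()` are hypothetical names, and the wraparound-safe comparison is an assumption on my side, written in the spirit of the signed-difference comparison jbd2 uses for transaction IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "x is newer than y" for 32-bit transaction IDs. */
static bool tid_newer(uint32_t x, uint32_t y)
{
	return (int32_t)(x - y) > 0;
}

/*
 * Hypothetical helper. fc_tid is the transaction ID carried by the fast
 * commit block; last_full_tid is the newest transaction ID for which a
 * full commit block was found in the journal. The fast commit block is
 * worth replaying only if no full commit is newer than it.
 */
static bool fc_block_is_valid(uint32_t fc_tid, uint32_t last_full_tid)
{
	return !tid_newer(last_full_tid, fc_tid);
}
```

For example, a fast commit block with ID 10 is still valid when the newest full commit is also 10, and stops being valid once a full commit with ID 11 exists.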
> > +.. list-table::
> > + :widths: 8 8 24 40
> > + :header-rows: 1
> > +
> > + * - Offset
> > + - Type
> > + - Name
> > + - Descriptor
> > + * - 0x0
> > + - journal\_header\_s
> > + - (open coded)
> > + - Common block header.
> > + * - 0xC
> > + - \_\_le32
> > + - fc\_magic
> > + - Magic value which should be set to 0xE2540090. This identifies
> > + that this block is a fast commit block.
> > + * - 0x10
> > + - \_\_u8
> > + - fc\_features
> > + - Features used by this fast commit block.
> > + * - 0x11
> > + - \_\_le16
> > + - fc\_num\_tlvs
> > + - Number of TLVs contained in this fast commit block
> > + * - 0x13
> > + - \_\_le32
> > + - \_\_fc\_len
> > + - Length of the fast commit block in terms of number of blocks
> > + * - 0x17
> > + - \_\_le32
> > + - fc\_ino
> > + - Inode number of the inode that will be recovered using this fast commit
> > + * - 0x2B
> > + - struct ext4\_inode
> > + - inode
> > + - On-disk copy of the inode at the commit time
> > + * - <Variable based on inode size>
> > + - struct ext4\_fc\_tl
> > + - Array of struct ext4\_fc\_tl
> > + - The actual delta with the last commit. Starting at this offset,
> > + there is an array of TLVs that indicates which extents
> > + should be present in the corresponding inode. Currently, the
> > + following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> > + should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> > + that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> > + (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> > + (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> > + (dentry for a file that is being created for the first time).
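The offsets in the table can be cross-checked with a packed struct. The sketch below is an assumption-laden mirror of the table, not the struct from the patch: `struct fc_block_prefix` is a hypothetical name, the fields are assumed to be laid out back to back with no padding, `__attribute__((packed))` is a GCC/Clang extension, and byte-order conversion of the `__le` fields is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_MAGIC 0xE2540090u	/* magic value quoted in the table */

/* Hypothetical mirror of the table; names follow its Name column. */
struct fc_block_prefix {
	uint8_t  header[12];	/* journal_header_s: three 32-bit fields */
	uint32_t fc_magic;	/* 0x0C */
	uint8_t  fc_features;	/* 0x10 */
	uint16_t fc_num_tlvs;	/* 0x11 */
	uint32_t fc_len;	/* 0x13, listed as __fc_len in the table */
	uint32_t fc_ino;	/* 0x17 */
} __attribute__((packed));

/* Compile-time checks against the offsets listed in the table. */
static_assert(offsetof(struct fc_block_prefix, fc_magic) == 0x0C, "fc_magic");
static_assert(offsetof(struct fc_block_prefix, fc_features) == 0x10, "fc_features");
static_assert(offsetof(struct fc_block_prefix, fc_num_tlvs) == 0x11, "fc_num_tlvs");
static_assert(offsetof(struct fc_block_prefix, fc_len) == 0x13, "fc_len");
static_assert(offsetof(struct fc_block_prefix, fc_ino) == 0x17, "fc_ino");

/* Offset of the first byte after fc_ino under the packing assumption. */
static size_t fc_offset_after_ino(void)
{
	return sizeof(struct fc_block_prefix);
}
```

Under this packing assumption the first byte after `fc_ino` is at 0x1B (0x17 + 4), whereas the table lists the inode copy at 0x2B. Either there are 16 bytes that the table does not show, or one of the two offsets needs another look.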
> > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > index 58ce6b395206..1cb116ab27ab 100644
> > --- a/Documentation/filesystems/journalling.rst
> > +++ b/Documentation/filesystems/journalling.rst
> > @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> > ``transaction->t_private_list`` for attaching entries to a transaction
> > that need processing when the transaction commits.
> >
> > +JBD2 also allows client file systems to implement file system specific
> > +commits, which are called ``fast commits``. Fast commits are
> > +asynchronous in nature, i.e. file systems can call their own commit
> > +functions at any time. In order to avoid racing with the kjournald
> > +thread and other fast commits that may be happening in
> > +parallel, file systems should first call
> > +:c:func:`jbd2_start_async_fc()`. The file system can then call
> > +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> > +commits. Once a fast commit is completed, the file system should call
> > +:c:func:`jbd2_stop_async_fc()` to signal completion and unblock other
> > +committers and the kjournald thread. After performing either a fast
> > +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> > +file systems to clean up their internal fast commit
> > +related data structures. At replay time, JBD2 passes each
> > +fast commit block to the file system via
> > +``journal->j_fc_replay_cb``. Ext4 uses this fast commit
> > +mechanism to improve journal commit performance.
> > +
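The calling sequence described above (start, map buffers, stop, cleanup) can be modelled in user space. Everything in this sketch is a stand-in: the `toy_*` functions only imitate the roles of `jbd2_start_async_fc()`, `jbd2_map_fc_buf()` and `jbd2_stop_async_fc()`, their signatures are invented for illustration and are not the ones in the patch, and where exactly JBD2 invokes the cleanup callback is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_journal {
	bool fc_running;	/* a fast commit is in flight */
	int fc_bufs_free;	/* buffers reserved for fast commits */
	int cleanups;		/* how many times the cleanup callback ran */
	void (*fc_cleanup_cb)(struct toy_journal *j);
};

static void toy_cleanup(struct toy_journal *j)
{
	j->cleanups++;
}

/* Stand-in for jbd2_start_async_fc(): claim the fast commit slot. */
static int toy_start_async_fc(struct toy_journal *j)
{
	if (j->fc_running)
		return -1;	/* another committer is active */
	j->fc_running = true;
	return 0;
}

/* Stand-in for jbd2_map_fc_buf(): take one reserved buffer. */
static int toy_map_fc_buf(struct toy_journal *j)
{
	assert(j->fc_running);
	if (j->fc_bufs_free == 0)
		return -1;	/* fast commit area exhausted */
	j->fc_bufs_free--;
	return 0;
}

/* Stand-in for jbd2_stop_async_fc(): release the fast commit slot. */
static void toy_stop_async_fc(struct toy_journal *j)
{
	assert(j->fc_running);
	j->fc_running = false;
}

/* Returns 0 after a fast commit, 1 if the caller must do a full commit. */
static int toy_fs_commit(struct toy_journal *j, int nblocks)
{
	int i, fallback = 0;

	if (toy_start_async_fc(j))
		return 1;
	for (i = 0; i < nblocks; i++) {
		if (toy_map_fc_buf(j)) {
			fallback = 1;
			break;
		}
		/* A real file system would fill and submit the buffer here. */
	}
	toy_stop_async_fc(j);
	/* The toy runs the cleanup callback itself once the commit is over. */
	if (j->fc_cleanup_cb)
		j->fc_cleanup_cb(j);
	return fallback;
}
```

The point of the model is the bracketing: every buffer is mapped between the start and stop calls, and any failure inside the bracket turns into a fallback to a full commit rather than a partial fast commit.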
> > JBD2 also provides a way to block all transaction updates via
> > :c:func:`jbd2_journal_lock_updates()` /
> > :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > --
> > 2.24.1.735.g03f4e72817-goog
> >
Hi Xiaohui,

No worries. I think the reason we have so many patches is that this
feature requires an on-disk format change, and it's better to get
everything in at once so that the testing matrix doesn't
become too big :)

- Harshad

On Fri, Jan 10, 2020 at 1:50 AM xiaohui li
<[email protected]> wrote:
>
> hi Harshad:
>
> I apologize for some direct and improper wording in my last email.
>
> What I wanted to say in my last email is that an iterative
> software development method may work better for applying these patches.
> The first release version cannot do everything. It is good
> enough if it finishes just one major function of ext4 fast
> commit.
>
> I know that the development work on this fast commit function is
> difficult and complex,
> so I am very grateful for your work on ext4 fast commit development. :)
>
> On Thu, Jan 9, 2020 at 12:29 PM xiaohui li
> <[email protected]> wrote:
> >
> > hi Harshad
> > cc ted
> >
> > Sorry, but I have some ideas about this fast commit which I want to
> > share with you.
> >
> > There are nearly 20 patches in this v4 fast commit series, which is a lot.
> > I wonder whether it is necessary to make this fast commit function so complex.
> >
> > Maybe I have not understood the difficulty of the fast commit coding work,
> > so I would appreciate it very much if you could give a more detailed
> > description of how the v4 fast commit patches relate to each other,
> > especially the reason why so many patches are needed.
> >
> > From my viewpoint, the purpose of this fast commit function is
> > to resolve the problem that ext4 fsync takes so much time.
> > First, we need to resolve the actual customer problems that exist
> > in ext4 filesystems when doing this fast commit function.
> >
> > So, in my opinion, the first release version of fast commit only needs to
> > accomplish the goal of reducing the time cost of fsync caused by the jbd2
> > ordering shortcoming described in the iJournaling paper.
> > It need not do so many other unnecessary things.
> >
> > If I have free time, I will continue reviewing these patches.
> > Thank you for your reply.
> >
> >
> >
> >
> >
> > On Tue, Dec 24, 2019 at 4:14 PM Harshad Shirwadkar
> > <[email protected]> wrote:
> > >
On Thu, Jan 09, 2020 at 12:29:01PM +0800, xiaohui li wrote:
> Maybe I have not understood the difficulty of the fast commit coding work,
> so I would appreciate it very much if you could give a more detailed
> description of how the v4 fast commit patches relate to each other,
> especially the reason why so many patches are needed.
>
> From my viewpoint, the purpose of this fast commit function is
> to resolve the problem that ext4 fsync takes so much time.
> First, we need to resolve the actual customer problems that exist
> in ext4 filesystems when doing this fast commit function.
>
> So, in my opinion, the first release version of fast commit only needs to
> accomplish the goal of reducing the time cost of fsync caused by the jbd2
> ordering shortcoming described in the iJournaling paper.
> It need not do so many other unnecessary things.

As Harshad has mentioned, one of the reasons why an incremental
approach does not make sense is that once we release a version of fast
commit into a mainline kernel, we have to worry about what happens if
users start trying to use it, and we have to provide backwards
compatibility for it. So if we were to break up fast commit into 5
parts, then we would have to allocate 5 feature bits, and we would
have to support each version of fast commit --- essentially forever.

As for why we are doing this, we absolutely have a specific use
case in mind, and that's to improve ext4's performance when used on an
NFS server. The NFS protocol requires that any file system operation
requested by a client is persisted before the server sends an
acknowledgement back to the client. For workloads that are heavy
with metadata updates, avoiding the need to do a full jbd2 commit for
every NFS RPC request which modifies metadata will make a big difference to
the NFS server's performance.

This is why we are interested in making things like renames fast
commit eligible, and not just the smaller set of system calls needed
by (for example) SQLite.

Regards,

- Ted