2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 00/12] ext4: add support fast commit

This patch series adds support for fast commits, a simplified version
of the scheme proposed by Park and Shin in their paper "iJournaling:
Fine-Grained Journaling for Improving the Latency of Fsync System
Call"[1]. The basic idea of fast commits is to make JBD2 give the
client file system an opportunity to perform a faster commit. JBD2
falls back to a traditional commit only if the file system cannot
perform such a fast commit.
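
The fast-path-then-fallback idea can be sketched in user space as
follows. This is only an illustration: toy_journal, fc_commit_fn and
toy_commit are hypothetical names, not the interfaces added by this
series, and the "full commit" is reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical client callback: returns 0 if a fast commit succeeded. */
typedef int (*fc_commit_fn)(void *fs_private);

enum commit_kind { COMMIT_FAST, COMMIT_FULL };

struct toy_journal {
	fc_commit_fn fc_commit;  /* NULL if the client has no fast path */
	void *fs_private;        /* opaque client data passed to the callback */
	unsigned fast_commits;   /* number of commits served by the fast path */
	unsigned full_commits;   /* number of fallbacks to a full commit */
};

/* Give the client a chance to fast commit; otherwise do a full commit. */
enum commit_kind toy_commit(struct toy_journal *j)
{
	assert(j != NULL);
	if (j->fc_commit != NULL && j->fc_commit(j->fs_private) == 0) {
		j->fast_commits++;
		return COMMIT_FAST;
	}
	/* Stand-in for the traditional block-granularity commit. */
	j->full_commits++;
	return COMMIT_FULL;
}

/* Two example clients: one that can always fast commit, one that cannot. */
int fc_always_ok(void *fs_private) { (void)fs_private; return 0; }
int fc_never_ok(void *fs_private) { (void)fs_private; return -1; }
```

A journal whose callback is fc_always_ok takes the fast path on every
toy_commit() call. With fc_never_ok, or with no callback at all, every
call ends up in the full commit branch.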

Because JBD2 operates at block granularity, for every file system
metadata update it commits all the changed blocks to the journal at
commit time. This is inefficient because updates to some blocks that
JBD2 commits are derivable from some other blocks. For example, if a
new extent is added to an inode, then corresponding updates to the
inode table, the block bitmap, the group descriptor and the superblock
can be derived based on just the extent information and the
corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase performance of file system commits significantly.
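
To make the "derivable blocks" point concrete, here is a small
user-space model. It assumes a toy layout (4 groups of 64 blocks) and
hypothetical names (toy_fs, toy_extent, toy_replay_extent). It shows
how replaying a single logged extent can recompute the block bitmap,
the per-group free count, the superblock free count and the inode's
block count, without logging any of those blocks. It is not the replay
code from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCKS_PER_GROUP 64u  /* assumption: tiny groups for illustration */
#define TOY_GROUPS 4u

struct toy_extent {
	uint32_t lblk;  /* first logical block (not needed for the derived state) */
	uint32_t pblk;  /* first physical block */
	uint32_t len;   /* number of blocks in the extent */
};

struct toy_fs {
	uint8_t bitmap[TOY_GROUPS][TOY_BLOCKS_PER_GROUP / 8];
	uint32_t group_free[TOY_GROUPS];  /* group descriptor free counts */
	uint64_t sb_free;                 /* superblock free block count */
	uint64_t inode_blocks;            /* blocks charged to the inode */
};

void toy_fs_init(struct toy_fs *fs)
{
	assert(fs != NULL);
	memset(fs, 0, sizeof(*fs));
	for (uint32_t g = 0; g < TOY_GROUPS; g++)
		fs->group_free[g] = TOY_BLOCKS_PER_GROUP;
	fs->sb_free = (uint64_t)TOY_GROUPS * TOY_BLOCKS_PER_GROUP;
}

/*
 * Replay one logged extent. Only the extent was "journalled"; every
 * dependent update is derived here. Returns 0, or -1 if out of range.
 */
int toy_replay_extent(struct toy_fs *fs, const struct toy_extent *ex)
{
	assert(fs != NULL && ex != NULL);
	if ((uint64_t)ex->pblk + ex->len >
	    (uint64_t)TOY_GROUPS * TOY_BLOCKS_PER_GROUP)
		return -1;

	for (uint32_t i = 0; i < ex->len; i++) {
		uint32_t blk = ex->pblk + i;
		uint32_t g = blk / TOY_BLOCKS_PER_GROUP;
		uint32_t bit = blk % TOY_BLOCKS_PER_GROUP;
		uint8_t mask = (uint8_t)(1u << (bit % 8));

		if (fs->bitmap[g][bit / 8] & mask)
			continue;  /* already marked: replaying twice is harmless */
		fs->bitmap[g][bit / 8] |= mask;
		fs->group_free[g]--;
		fs->sb_free--;
		fs->inode_blocks++;
	}
	return 0;
}
```

Replaying an extent of 8 blocks starting at physical block 60 touches
two groups in this toy layout, yet the only thing that had to be
recorded was the extent itself.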

The fast commit work in this patch series makes two main contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
implement fast commits

(2) Adding support in ext4 to use JBD2's new interfaces and implement
fast commits

Testing
-------

e2fsprogs was updated to set the fast commit feature flag and to
ignore fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, the following xfstests
runs were performed:

- kvm-xfstests.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful and smoke tests didn't introduce any
additional failures.

Performance Evaluation
----------------------

To evaluate fast commit performance we used the fs_mark benchmark. We
modified fs_mark to issue an fsync() call after every write operation.

https://github.com/harshadjs/fs_mark.git

The following are the results that we got:

Write performance in MB/s with 4 parallel threads. Columns (X) are
file sizes and rows (Y) are write unit sizes.

Without Fast Commit:

|-----+------+------+------|
|     |  32k | 128k | 256k |
|-----+------+------+------|
| 4k  | 0.27 | 0.25 | 0.24 |
| 8k  | 0.45 | 0.51 | 0.46 |
| 32k | 2.15 | 2.23 | 2.28 |
|-----+------+------+------|

With Fast Commit:

|-----+------+------+------|
|     |  32k | 128k | 256k |
|-----+------+------+------|
| 4k  | 0.74 | 1.42 | 1.94 |
| 8k  | 1.52 | 1.88 | 2.48 |
| 32k | 1.8  | 4.29 | 7.38 |
|-----+------+------+------|

On average, fast commits increased file system write performance by
280% on the modified fs_mark benchmark.
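
The cover letter does not say how the average was computed. One
plausible reading, and it is only an assumption, is the unweighted
mean of the per-cell ratios between the two tables. The sketch below
computes that mean from the numbers above; it gives a ratio of about
3.9, which is in the same range as the quoted figure. The function
names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 3  /* write unit sizes: 4k, 8k, 32k */
#define COLS 3  /* file sizes: 32k, 128k, 256k */

/* MB/s without fast commit, copied from the first table. */
static const double without_fc[ROWS][COLS] = {
	{ 0.27, 0.25, 0.24 },
	{ 0.45, 0.51, 0.46 },
	{ 2.15, 2.23, 2.28 },
};

/* MB/s with fast commit, copied from the second table. */
static const double with_fc[ROWS][COLS] = {
	{ 0.74, 1.42, 1.94 },
	{ 1.52, 1.88, 2.48 },
	{ 1.80, 4.29, 7.38 },
};

/* Unweighted mean of fast[r][c] / base[r][c] over all cells. */
double mean_ratio(const double base[ROWS][COLS],
		  const double fast[ROWS][COLS])
{
	double sum = 0.0;

	for (size_t r = 0; r < ROWS; r++) {
		for (size_t c = 0; c < COLS; c++) {
			assert(base[r][c] > 0.0);
			sum += fast[r][c] / base[r][c];
		}
	}
	return sum / (ROWS * COLS);
}

/* Mean ratio for the fs_mark tables in this cover letter. */
double fs_mark_mean_ratio(void)
{
	return mean_ratio(without_fc, with_fc);
}
```

Note that the per-cell ratios vary widely: the 4k write / 256k file
cell improves by roughly 8x, while the 32k write / 32k file cell is
slightly slower with fast commit (1.8 vs 2.15 MB/s).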

Harshad Shirwadkar (13):
docs: Add fast commit documentation
ext4: fast-commit recovery path changes
ext4: fast-commit commit path changes
ext4: fast-commit commit range tracking
ext4: track changed files for fast commit
ext4: add fields that are needed to track changed files
jbd2: fast-commit recovery path changes
jbd2: fast-commit commit path new APIs
jbd2: fast-commit commit path changes
jbd2: fast commit setup and enable
jbd2: add fast commit fields to journal_s structure
ext4: add handling for extended mount options
ext4: add support fast commit

Documentation/filesystems/ext4/journal.rst | 78 ++
Documentation/filesystems/journalling.rst | 15
fs/ext4/acl.c | 1
fs/ext4/balloc.c | 7
fs/ext4/ext4.h | 87 +++
fs/ext4/ext4_jbd2.c | 92 +++
fs/ext4/ext4_jbd2.h | 29 +
fs/ext4/extents.c | 44 +
fs/ext4/fsync.c | 2
fs/ext4/ialloc.c | 1
fs/ext4/inline.c | 17
fs/ext4/inode.c | 62 +-
fs/ext4/ioctl.c | 3
fs/ext4/mballoc.c | 83 ++
fs/ext4/mballoc.h | 2
fs/ext4/migrate.c | 1
fs/ext4/namei.c | 14
fs/ext4/super.c | 538 ++++++++++++++++++-
fs/ext4/xattr.c | 1
fs/jbd2/checkpoint.c | 2
fs/jbd2/commit.c | 85 ++-
fs/jbd2/journal.c | 230 +++++++-
fs/jbd2/recovery.c | 70 ++
fs/jbd2/transaction.c | 6
fs/ocfs2/alloc.c | 2
fs/ocfs2/journal.c | 4
fs/ocfs2/super.c | 2
include/linux/jbd2.h | 106 +++
include/trace/events/ext4.h | 59 ++
include/trace/events/jbd2.h | 9
30 files changed, 1561 insertions(+), 91 deletions(-)
--
2.23.0.rc1.153.gdeed80330f-goog


2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 07/12] ext4: add fields that are needed to track changed files

Ext4's fast commit feature tracks changed files and maintains them in
a queue. We also remember for each file the logical block range that
needs to be committed. This patch adds these fields to ext4_inode_info
and ext4_sb_info and also adds initialization calls.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: Converted s_fc_lock from mutex to spinlock to improve parallelism
performance.
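
Note on the per-file logical block range: this patch only adds the
fc_lblk_start / fc_lblk_end fields; the code that maintains them is in
a different patch of the series. As a rough illustration of what
"remembering the range that needs to be committed" can mean, the
user-space sketch below keeps one covering range per file and widens
it on every write. The toy_fc_range type and helpers are hypothetical,
and a single inclusive covering range is an assumption, not
necessarily what ext4 does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_fc_range {
	bool valid;      /* false until the first block is recorded */
	uint64_t start;  /* first logical block to commit */
	uint64_t end;    /* last logical block to commit (inclusive) */
};

/* Forget the range, e.g. after a commit has been written out. */
void toy_fc_range_reset(struct toy_fc_range *r)
{
	assert(r != NULL);
	r->valid = false;
	r->start = 0;
	r->end = 0;
}

/* Widen the range so that it covers blocks [lblk, lblk + len - 1]. */
void toy_fc_range_add(struct toy_fc_range *r, uint64_t lblk, uint64_t len)
{
	assert(r != NULL && len > 0);
	uint64_t last = lblk + len - 1;

	if (!r->valid) {
		r->start = lblk;
		r->end = last;
		r->valid = true;
		return;
	}
	if (lblk < r->start)
		r->start = lblk;
	if (last > r->end)
		r->end = last;
}

/* Number of blocks the next fast commit would have to cover. */
uint64_t toy_fc_range_len(const struct toy_fc_range *r)
{
	assert(r != NULL);
	return r->valid ? r->end - r->start + 1 : 0;
}
```

Recording writes at blocks 10..13, then 2..4, then 12..21 leaves a
single range 2..21. The trade-off of a covering range is that untouched
blocks in the gaps are committed too, in exchange for constant-size
per-inode state.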
---
fs/ext4/ext4.h | 34 ++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.c | 13 +++++++++++++
fs/ext4/ext4_jbd2.h | 2 ++
fs/ext4/inode.c | 1 +
fs/ext4/super.c | 7 +++++++
5 files changed, 57 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index becbda38b7db..0d15d4539dda 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -921,6 +921,27 @@ enum {
I_DATA_SEM_QUOTA,
};

+/*
+ * Ext4 fast commit inode specific information
+ */
+struct ext4_fast_commit_inode_info {
+ /* TID / SUB-TID when old_i_size and i_size were recorded */
+ tid_t fc_tid;
+ tid_t fc_subtid;
+
+ /*
+ * Start of logical block range that needs to be committed in this fast
+ * commit
+ */
+ loff_t fc_lblk_start;
+
+ /*
+ * End of logical block range that needs to be committed in this fast
+ * commit
+ */
+ loff_t fc_lblk_end;
+};
+

/*
* fourth extended file system inode data in memory
@@ -955,6 +976,9 @@ struct ext4_inode_info {

struct list_head i_orphan; /* unlinked but open inodes */

+ struct list_head i_fc_list; /* inodes that need fast commit */
+ struct ext4_fast_commit_inode_info i_fc;
+
/*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
@@ -1529,6 +1553,16 @@ struct ext4_sb_info {
/* Barrier between changing inodes' journal flags and writepages ops. */
struct percpu_rw_semaphore s_journal_flag_rwsem;
struct dax_device *s_daxdev;
+
+ /* Ext4 fast commit stuff */
+ bool fc_replay; /* Fast commit replay in progress */
+ struct list_head s_fc_q; /* Inodes that need fast commit. */
+ __u32 s_fc_q_cnt; /* Number of inodes in the fc queue */
+ bool s_fc_eligible; /*
+ * Are changes after the last commit
+ * eligible for fast commit?
+ */
+ spinlock_t s_fc_lock;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7c70b08d104c..75b6db808837 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -330,3 +330,16 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
mark_buffer_dirty(bh);
return err;
}
+
+void ext4_init_inode_fc_info(struct inode *inode)
+{
+ handle_t *handle = ext4_journal_current_handle();
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ memset(&ei->i_fc, 0, sizeof(ei->i_fc));
+ if (ext4_handle_valid(handle)) {
+ ei->i_fc.fc_tid = handle->h_transaction->t_tid;
+ ei->i_fc.fc_subtid = handle->h_transaction->t_journal->j_subtid;
+ }
+ INIT_LIST_HEAD(&ei->i_fc_list);
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ef8fcf7d0d3b..2305c1acd415 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -459,4 +459,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}

+void ext4_init_inode_fc_info(struct inode *inode);
+
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..f230a888eddd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4996,6 +4996,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
for (block = 0; block < EXT4_N_BLOCKS; block++)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);
+ ext4_init_inode_fc_info(&ei->vfs_inode);

/*
* Set transaction id's of transactions that have to be committed
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6bab59ae81f7..0b833e9b61c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1100,6 +1100,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_datasync_tid = 0;
atomic_set(&ei->i_unwritten, 0);
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+ ext4_init_inode_fc_info(&ei->vfs_inode);
return &ei->vfs_inode;
}

@@ -1139,6 +1140,7 @@ static void init_once(void *foo)
init_rwsem(&ei->i_data_sem);
init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
+ ext4_init_inode_fc_info(&ei->vfs_inode);
}

static int __init init_inodecache(void)
@@ -4301,6 +4303,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);

+ INIT_LIST_HEAD(&sbi->s_fc_q);
+ sbi->s_fc_q_cnt = 0;
+ sbi->s_fc_eligible = true;
+ spin_lock_init(&sbi->s_fc_lock);
+
sb->s_root = NULL;

needs_recovery = (es->s_last_orphan != 0 ||
--
2.23.0.rc1.153.gdeed80330f-goog

2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

This patch adds new helper APIs that ext4 needs for fast
commits. These new fast commit APIs are used by subsequent fast commit
patches to implement fast commits. The following new APIs are added:

/*
* Returns when either a full commit or a fast commit
* completes
*/
int jbd2_fc_complete_commit(journal_t *journal, tid_t tid,
tid_t subtid)

/* Send all the data buffers related to an inode */
int jbd2_submit_inode_data(journal_t *journal,
struct jbd2_inode *jinode)

/* Map one fast commit buffer for use by the file system */
int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)

/* Wait on fast commit buffers to complete IO */
int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: 1) Fixed error reported by kbuild test robot. Removed duplicate
EXPORT_SYMBOL() call. Also, added EXPORT_SYMBOL() for the new
APIs introduced.
2) Changed jbd2_submit_fc_bufs() to jbd2_wait_on_fc_bufs(). This
lets the client file system submit JBD2 buffers at its own
convenience.
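
For reviewers, here is a user-space model of the slot bookkeeping
behind jbd2_map_fc_buf() and jbd2_wait_on_fc_bufs(): blocks are handed
out sequentially from a fixed fast commit area, and waiting walks the
most recently mapped slots in reverse order. The toy_fc_area type and
both helpers are hypothetical stand-ins. There is no locking and no
buffer I/O here, and the range check in toy_wait_order() is an
addition of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_fc_area {
	unsigned long first_fc;  /* first block of the fast commit area */
	unsigned long last_fc;   /* upper bound, exclusive as in the bounds check */
	unsigned long fc_off;    /* next free slot, relative to first_fc */
};

/* Hand out the next free block of the area, or -EINVAL when it is full. */
int toy_map_fc_block(struct toy_fc_area *a, unsigned long *blocknr_out)
{
	assert(a != NULL && blocknr_out != NULL);
	if (a->first_fc + a->fc_off >= a->last_fc)
		return -EINVAL;
	*blocknr_out = a->first_fc + a->fc_off;
	a->fc_off++;
	return 0;
}

/*
 * Fill order[] with the slot indices of the last num_blks mapped
 * buffers, newest first, i.e. the order in which they would be waited
 * on. Returns the number of entries written, or -EINVAL.
 */
int toy_wait_order(const struct toy_fc_area *a, unsigned long num_blks,
		   unsigned long *order)
{
	assert(a != NULL && order != NULL);
	if (num_blks > a->fc_off)
		return -EINVAL;

	unsigned long slot = a->fc_off;

	for (unsigned long n = 0; n < num_blks; n++)
		order[n] = --slot;
	return (int)num_blks;
}
```

With an area covering blocks 100..102, three calls to
toy_map_fc_block() return 100, 101 and 102, and a fourth fails.
Waiting on the last two buffers then visits slot 2 before slot 1.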
---
fs/jbd2/commit.c | 32 +++++++++++++++
fs/jbd2/journal.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 6 +++
3 files changed, 136 insertions(+)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 9281814606e7..db62a53436e3 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -202,6 +202,38 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
return ret;
}

+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+ struct address_space *mapping;
+ loff_t dirty_start;
+ loff_t dirty_end;
+ int ret;
+
+ if (!jinode)
+ return 0;
+
+ if (!(jinode->i_flags & JI_WRITE_DATA))
+ return 0;
+
+ dirty_start = jinode->i_dirty_start;
+ dirty_end = jinode->i_dirty_end;
+
+ mapping = jinode->i_vfs_inode->i_mapping;
+ jinode->i_flags |= JI_COMMIT_RUNNING;
+
+ trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
+ ret = journal_submit_inode_data_buffers(mapping, dirty_start,
+ dirty_end);
+
+ jinode->i_flags &= ~JI_COMMIT_RUNNING;
+ /* Protect JI_COMMIT_RUNNING flag */
+ smp_mb();
+ wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
+
+ return ret;
+}
+EXPORT_SYMBOL(jbd2_submit_inode_data);
+
/*
* Submit all the data buffers of inode associated with the transaction to
* disk.
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index ab05e47ed2d4..1e15804b2c3c 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -811,6 +811,33 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
}
EXPORT_SYMBOL(jbd2_complete_transaction);

+int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid)
+{
+ int need_to_wait = 1;
+
+ read_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction &&
+ journal->j_running_transaction->t_tid == tid) {
+ /* Check if fast commit was already done */
+ if (journal->j_subtid > subtid)
+ need_to_wait = 0;
+ if (journal->j_commit_request != tid) {
+ /* transaction not yet started, so request it */
+ read_unlock(&journal->j_state_lock);
+ jbd2_log_start_commit(journal, tid, false);
+ goto wait_commit;
+ }
+ } else if (!(journal->j_committing_transaction &&
+ journal->j_committing_transaction->t_tid == tid))
+ need_to_wait = 0;
+ read_unlock(&journal->j_state_lock);
+ if (!need_to_wait)
+ return 0;
+wait_commit:
+ return __jbd2_log_wait_commit(journal, tid, subtid);
+}
+EXPORT_SYMBOL(jbd2_fc_complete_commit);
+
/*
* Log buffer allocation routines:
*/
@@ -831,6 +858,77 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
return jbd2_journal_bmap(journal, blocknr, retp);
}

+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
+{
+ unsigned long long pblock;
+ unsigned long blocknr;
+ int ret = 0;
+ struct buffer_head *bh;
+ int fc_off;
+ journal_header_t *jhdr;
+
+ write_lock(&journal->j_state_lock);
+
+ if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
+ fc_off = journal->j_fc_off;
+ blocknr = journal->j_first_fc + fc_off;
+ journal->j_fc_off++;
+ } else {
+ ret = -EINVAL;
+ }
+ write_unlock(&journal->j_state_lock);
+
+ if (ret)
+ return ret;
+
+ ret = jbd2_journal_bmap(journal, blocknr, &pblock);
+ if (ret)
+ return ret;
+
+ bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+ if (!bh)
+ return -ENOMEM;
+
+ lock_buffer(bh);
+ jhdr = (journal_header_t *)bh->b_data;
+ jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+ jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
+ jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
+
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ journal->j_fc_wbuf[fc_off] = bh;
+
+ *bh_out = bh;
+
+ return 0;
+}
+EXPORT_SYMBOL(jbd2_map_fc_buf);
+
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
+{
+ struct buffer_head *bh;
+ int i, j_fc_off;
+
+ read_lock(&journal->j_state_lock);
+ j_fc_off = journal->j_fc_off;
+ read_unlock(&journal->j_state_lock);
+
+ /*
+ * Wait in reverse order to minimize chances of us being woken up before
+ * all IOs have completed
+ */
+ for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
+ bh = journal->j_fc_wbuf[i];
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ return -EIO;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(jbd2_wait_on_fc_bufs);
+
/*
* Conversion of logical to physical block numbers for the journal
*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 535f88dff653..5362777d06f8 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -124,6 +124,7 @@ typedef struct journal_s journal_t; /* Journal control structure */
#define JBD2_SUPERBLOCK_V1 3
#define JBD2_SUPERBLOCK_V2 4
#define JBD2_REVOKE_BLOCK 5
+#define JBD2_FC_BLOCK 6

/*
* Standard header for all descriptor blocks:
@@ -1582,6 +1583,7 @@ int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
int jbd2_log_do_checkpoint(journal_t *journal);
int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
+int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid);

void __jbd2_log_wait_for_space(journal_t *journal);
extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
@@ -1732,6 +1734,10 @@ static inline tid_t jbd2_get_latest_transaction(journal_t *journal)
return tid;
}

+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out);
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks);
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+
#ifdef __KERNEL__

#define buffer_trace_init(bh) do {} while (0)
--
2.23.0.rc1.153.gdeed80330f-goog

2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 08/12] ext4: track changed files for fast commit

For fast commit, we need to remember all the files that have changed
since the last fast commit / full commit. For changes that are fast
commit incompatible, we mark the file system as fast commit
ineligible. This patch adds code to either remember files that have
changed or to mark ext4 as fast commit ineligible. We inspect every
ext4_mark_inode_dirty() call and decide whether that particular file
change is fast commit compatible or not.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: Using spinlocks instead of mutexes for s_fc_lock.
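
The decision described in the commit message, enqueue the inode or
mark the whole file system ineligible, can be modelled in user space
as below. The toy_* names are hypothetical; s_fc_lock is omitted, and
the singly linked queue only stands in for the list_head based queue
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_inode {
	unsigned long ino;
	bool queued;             /* stands in for !list_empty(&i_fc_list) */
	struct toy_inode *next;  /* next inode in the fast commit queue */
};

struct toy_sb {
	struct toy_inode *fc_q;  /* inodes that need a fast commit */
	unsigned fc_q_cnt;       /* number of inodes in the queue */
	bool fc_eligible;        /* false once an incompatible change is seen */
};

void toy_sb_init(struct toy_sb *sb)
{
	assert(sb != NULL);
	sb->fc_q = NULL;
	sb->fc_q_cnt = 0;
	sb->fc_eligible = true;
}

/* An incompatible change: the next commit has to be a full commit. */
void toy_fc_mark_ineligible(struct toy_sb *sb)
{
	assert(sb != NULL);
	sb->fc_eligible = false;
}

/*
 * A compatible change: remember the inode, at most once.
 * Returns false if tracking is pointless because a full commit is
 * already required.
 */
bool toy_fc_enqueue(struct toy_sb *sb, struct toy_inode *inode)
{
	assert(sb != NULL && inode != NULL);
	if (!sb->fc_eligible)
		return false;
	if (!inode->queued) {
		inode->next = sb->fc_q;
		sb->fc_q = inode;
		inode->queued = true;
		sb->fc_q_cnt++;
	}
	return true;
}
```

Enqueueing the same inode twice leaves the count at one, and after
toy_fc_mark_ineligible() further enqueue attempts are ignored, which
mirrors the early return on !s_fc_eligible in ext4_fc_enqueue_inode().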
---
fs/ext4/acl.c | 1 +
fs/ext4/ext4_jbd2.c | 46 +++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 25 ++++++++++++++++++++++++
fs/ext4/extents.c | 17 +++++++++++++++--
fs/ext4/ialloc.c | 1 +
fs/ext4/inline.c | 12 ++++++++++++
fs/ext4/inode.c | 30 +++++++++++++++++++++++++++--
fs/ext4/ioctl.c | 3 +++
fs/ext4/migrate.c | 1 +
fs/ext4/namei.c | 14 +++++++++++++-
fs/ext4/super.c | 15 +++++++++++++++
fs/ext4/xattr.c | 1 +
12 files changed, 161 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 8c7bbf3e566d..e84be9c315db 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -257,6 +257,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
inode->i_mode = mode;
inode->i_ctime = current_time(inode);
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
}
out_stop:
ext4_journal_stop(handle);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 75b6db808837..d77b9f1e9dab 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -343,3 +343,49 @@ void ext4_init_inode_fc_info(struct inode *inode)
}
INIT_LIST_HEAD(&ei->i_fc_list);
}
+
+void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (!ext4_should_fast_commit(inode->i_sb))
+ return;
+
+ spin_lock(&sbi->s_fc_lock);
+ if (!sbi->s_fc_eligible) {
+ spin_unlock(&sbi->s_fc_lock);
+ return;
+ }
+ if (list_empty(&EXT4_I(inode)->i_fc_list)) {
+ list_add(&EXT4_I(inode)->i_fc_list, &sbi->s_fc_q);
+ sbi->s_fc_q_cnt++;
+ }
+ spin_unlock(&sbi->s_fc_lock);
+
+ if (!ext4_handle_valid(handle))
+ return;
+
+ if (ei->i_fc.fc_tid == handle->h_transaction->t_tid &&
+ ei->i_fc.fc_subtid ==
+ handle->h_transaction->t_journal->j_subtid)
+ return;
+
+ ei->i_fc.fc_lblk_start = i_size_read(inode);
+ ei->i_fc.fc_lblk_end = i_size_read(inode);
+ ei->i_fc.fc_subtid = handle->h_transaction->t_journal->j_subtid;
+ ei->i_fc.fc_tid = handle->h_transaction->t_tid;
+}
+
+void ext4_fc_del(struct inode *inode)
+{
+ if (!ext4_should_fast_commit(inode->i_sb))
+ return;
+
+ if (list_empty(&EXT4_I(inode)->i_fc_list))
+ return;
+
+ spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+ list_del_init(&EXT4_I(inode)->i_fc_list);
+ spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2305c1acd415..a27cc3a5c676 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -459,6 +459,31 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
return 1;
}

+static inline int ext4_should_fast_commit(struct super_block *sb)
+{
+ if (!ext4_has_feature_fast_commit(sb))
+ return 0;
+ if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
+ return 0;
+ if (test_opt(sb, QUOTA))
+ return 0;
+ return 1;
+}
+
void ext4_init_inode_fc_info(struct inode *inode);
+extern void ext4_fc_enqueue_inode(handle_t *handle,
+ struct inode *inode);
+extern void ext4_fc_del(struct inode *inode);
+
+static inline void
+ext4_fc_mark_ineligible(struct super_block *sb)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ spin_lock(&sbi->s_fc_lock);
+ sbi->s_fc_eligible = false;
+ spin_unlock(&sbi->s_fc_lock);
+}
+

#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..eb77e306a82b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -163,6 +163,7 @@ int __ext4_ext_dirty(const char *where, unsigned int line, handle_t *handle,
} else {
/* path points to leaf/index in inode body */
err = ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
}
return err;
}
@@ -1371,6 +1372,7 @@ static int ext4_ext_create_new_leaf(handle_t *handle, struct inode *inode,
struct ext4_ext_path *curp;
int depth, i, err = 0;

+ ext4_fc_mark_ineligible(inode->i_sb);
repeat:
i = depth = ext_depth(inode);

@@ -3714,6 +3716,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
err = ext4_zeroout_es(inode, &zero_ex1);
if (!err)
err = ext4_zeroout_es(inode, &zero_ex2);
+ } else {
+ ext4_fc_mark_ineligible(inode->i_sb);
}
return err ? err : allocated;
}
@@ -3856,7 +3860,7 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
struct ext4_ext_path *path,
unsigned int len)
{
- int i, depth;
+ int i, ret, depth;
struct ext4_extent_header *eh;
struct ext4_extent *last_ex;

@@ -3898,7 +3902,10 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
return 0;
out:
ext4_clear_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
- return ext4_mark_inode_dirty(handle, inode);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
+
+ return ret;
}

static int
@@ -4607,6 +4614,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
inode->i_ino, map.m_lblk,
map.m_len, ret);
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
ret2 = ext4_journal_stop(handle);
break;
}
@@ -4624,6 +4632,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
ext4_set_inode_flag(inode,
EXT4_INODE_EOFBLOCKS);
}
+ ext4_fc_enqueue_inode(handle, inode);
ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);
ret2 = ext4_journal_stop(handle);
@@ -4786,6 +4795,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
}
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);

/* Zero out partial block at the edges of the range */
ret = ext4_zero_partial_blocks(handle, inode, offset, len);
@@ -4957,6 +4967,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
"ext4_ext_map_blocks returned %d",
inode->i_ino, map.m_lblk,
map.m_len, ret);
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
if (credits)
ret2 = ext4_journal_stop(handle);
@@ -5485,6 +5496,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
if (IS_SYNC(inode))
ext4_handle_sync(handle);
inode->i_mtime = inode->i_ctime = current_time(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
ext4_update_inode_fsync_trans(handle, inode, 1);

@@ -5599,6 +5611,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
inode->i_size += len;
EXT4_I(inode)->i_disksize += len;
inode->i_mtime = inode->i_ctime = current_time(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
ret = ext4_mark_inode_dirty(handle, inode);
if (ret)
goto out_stop;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 764ff4c56233..97a9882a3363 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1175,6 +1175,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
ei->i_datasync_tid = handle->h_transaction->t_tid;
}

+ ext4_fc_mark_ineligible(sb);
err = ext4_mark_inode_dirty(handle, inode);
if (err) {
ext4_std_error(sb, err);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 88cdf3c90bd1..190968996bc6 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -435,6 +435,8 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
if (error)
goto out;

+ ext4_fc_mark_ineligible(inode->i_sb);
+
memset((void *)ext4_raw_inode(&is.iloc)->i_block,
0, EXT4_MIN_INLINE_DATA_SIZE);
memset(ei->i_data, 0, EXT4_MIN_INLINE_DATA_SIZE);
@@ -759,6 +761,8 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,

ext4_write_unlock_xattr(inode, &no_expand);
brelse(iloc.bh);
+ ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+ inode);
mark_inode_dirty(inode);
out:
return copied;
@@ -974,6 +978,8 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
* ordering of page lock and transaction start for journaling
* filesystems.
*/
+ ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+ inode);
mark_inode_dirty(inode);

return copied;
@@ -1165,6 +1171,7 @@ static int ext4_finish_convert_inline_dir(handle_t *handle,
if (err)
return err;
set_buffer_verified(dir_block);
+ ext4_fc_mark_ineligible(inode->i_sb);
return ext4_mark_inode_dirty(handle, inode);
}

@@ -1216,6 +1223,8 @@ static int ext4_convert_inline_data_nolock(handle_t *handle,
goto out_restore;
}

+ ext4_fc_mark_ineligible(inode->i_sb);
+
data_bh = sb_getblk(inode->i_sb, map.m_pblk);
if (!data_bh) {
error = -ENOMEM;
@@ -1709,6 +1718,8 @@ int ext4_delete_inline_entry(handle_t *handle,
if (err)
goto out;

+ ext4_fc_enqueue_inode(handle, dir);
+
ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
out:
ext4_write_unlock_xattr(dir, &no_expand);
@@ -1986,6 +1997,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)

if (err == 0) {
inode->i_mtime = inode->i_ctime = current_time(inode);
+ ext4_fc_enqueue_inode(handle, inode);
err = ext4_mark_inode_dirty(handle, inode);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f230a888eddd..379e911b48c4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -262,6 +262,7 @@ void ext4_evict_inode(struct inode *inode)
* cleaned up.
*/
ext4_orphan_del(NULL, inode);
+ ext4_fc_del(inode);
sb_end_intwrite(inode->i_sb);
goto no_delete;
}
@@ -279,6 +280,8 @@ void ext4_evict_inode(struct inode *inode)
if (ext4_inode_is_fast_symlink(inode))
memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
inode->i_size = 0;
+ ext4_fc_del(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
err = ext4_mark_inode_dirty(handle, inode);
if (err) {
ext4_warning(inode->i_sb,
@@ -303,6 +306,7 @@ void ext4_evict_inode(struct inode *inode)
stop_handle:
ext4_journal_stop(handle);
ext4_orphan_del(NULL, inode);
+ ext4_fc_del(inode);
sb_end_intwrite(inode->i_sb);
ext4_xattr_inode_array_free(ea_inode_array);
goto no_delete;
@@ -326,6 +330,8 @@ void ext4_evict_inode(struct inode *inode)
* having errors), but we can't free the inode if the mark_dirty
* fails.
*/
+ ext4_fc_del(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
if (ext4_mark_inode_dirty(handle, inode))
/* If that failed, just do the required in-core inode clear. */
ext4_clear_inode(inode);
@@ -1436,8 +1442,10 @@ static int ext4_write_end(struct file *file,
* ordering of page lock and transaction start for journaling
* filesystems.
*/
- if (i_size_changed || inline_data)
+ if (i_size_changed || inline_data) {
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
+ }

if (pos + len > inode->i_size && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
@@ -1550,6 +1558,7 @@ static int ext4_journalled_write_end(struct file *file,
pagecache_isize_extended(inode, old_size, pos);

if (size_changed || inline_data) {
+ ext4_fc_enqueue_inode(handle, inode);
ret2 = ext4_mark_inode_dirty(handle, inode);
if (!ret)
ret = ret2;
@@ -2077,6 +2086,7 @@ static int __ext4_journalled_writepage(struct page *page,

if (inline_data) {
ret = ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
} else {
ret = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
do_journal_get_write_access);
@@ -2604,6 +2614,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
EXT4_I(inode)->i_disksize = disksize;
up_write(&EXT4_I(inode)->i_data_sem);
err2 = ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
if (err2)
ext4_error(inode->i_sb,
"Failed to mark inode %lu dirty",
@@ -3205,6 +3216,7 @@ static int ext4_da_write_end(struct file *file,
* bu greater than i_disksize.(hint delalloc)
*/
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
}
}

@@ -3614,8 +3626,12 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ret = PTR_ERR(handle);
goto orphan_del;
}
- if (ext4_update_inode_size(inode, offset + written))
+
+ if (ext4_update_inode_size(inode, offset + written)) {
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
+ }
+
/*
* We may need to truncate allocated but not written blocks beyond EOF.
*/
@@ -3851,6 +3867,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
* ignore it.
*/
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
}
}
err = ext4_journal_stop(handle);
@@ -4372,6 +4389,8 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
goto out_dio;
}

+ ext4_fc_mark_ineligible(inode->i_sb);
+
ret = ext4_zero_partial_blocks(handle, inode, offset,
length);
if (ret)
@@ -4525,6 +4544,7 @@ int ext4_truncate(struct inode *inode)
if (inode->i_size & (inode->i_sb->s_blocksize - 1))
ext4_block_truncate_page(handle, mapping, inode->i_size);

+ ext4_fc_mark_ineligible(inode->i_sb);
/*
* We add the inode to the orphan list, so that if this
* truncate spans multiple transactions, and we crash, we will
@@ -5593,6 +5613,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
if (attr->ia_valid & ATTR_GID)
inode->i_gid = attr->ia_gid;
error = ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);
ext4_journal_stop(handle);
}

@@ -5653,6 +5674,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
inode->i_mtime = current_time(inode);
inode->i_ctime = inode->i_mtime;
}
+ ext4_fc_enqueue_inode(handle, inode);
down_write(&EXT4_I(inode)->i_data_sem);
EXT4_I(inode)->i_disksize = attr->ia_size;
rc = ext4_mark_inode_dirty(handle, inode);
@@ -5697,6 +5719,8 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)

if (!error) {
setattr_copy(inode, attr);
+ ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+ inode);
mark_inode_dirty(inode);
}

@@ -6109,6 +6133,7 @@ void ext4_dirty_inode(struct inode *inode, int flags)
goto out;

ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_enqueue_inode(handle, inode);

ext4_journal_stop(handle);
out:
@@ -6194,6 +6219,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
if (IS_ERR(handle))
return PTR_ERR(handle);

+ ext4_fc_mark_ineligible(inode->i_sb);
err = ext4_mark_inode_dirty(handle, inode);
ext4_handle_sync(handle);
ext4_journal_stop(handle);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 442f7ef873fc..c676fa118414 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -987,6 +987,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
err = mnt_want_write_file(filp);
if (err)
return err;
+ ext4_fc_mark_ineligible(sb);
err = swap_inode_boot_loader(sb, inode);
mnt_drop_write_file(filp);
return err;
@@ -997,6 +998,8 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
int err = 0, err2 = 0;
ext4_group_t o_group = EXT4_SB(sb)->s_groups_count;

+ ext4_fc_mark_ineligible(sb);
+
if (copy_from_user(&n_blocks_count, (__u64 __user *)arg,
sizeof(__u64))) {
return -EFAULT;
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index b1e4d359f73b..b995690d73ce 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -513,6 +513,7 @@ int ext4_ext_migrate(struct inode *inode)
* work to orphan_list_cleanup()
*/
ext4_orphan_del(NULL, tmp_inode);
+ ext4_fc_del(inode);
retval = PTR_ERR(handle);
goto out;
}
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 129029534075..e77ff130c045 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2022,6 +2022,7 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
ext4_update_dx_flag(dir);
inode_inc_iversion(dir);
ext4_mark_inode_dirty(handle, dir);
+ ext4_fc_mark_ineligible(dir->i_sb);
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
err = ext4_handle_dirty_dirblock(handle, dir, bh);
if (err)
@@ -2140,8 +2141,10 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
* out all the changes we did so far. Otherwise we can end up
* with corrupted filesystem.
*/
- if (retval)
+ if (retval) {
ext4_mark_inode_dirty(handle, dir);
+ ext4_fc_mark_ineligible(dir->i_sb);
+ }
dx_release(frames);
brelse(bh2);
return retval;
@@ -2208,6 +2211,7 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
ext4_clear_inode_flag(dir, EXT4_INODE_INDEX);
dx_fallback++;
ext4_mark_inode_dirty(handle, dir);
+ ext4_fc_mark_ineligible(dir->i_sb);
}
blocks = dir->i_size >> sb->s_blocksize_bits;
for (block = 0; block < blocks; block++) {
@@ -2553,6 +2557,7 @@ static int ext4_add_nondir(handle_t *handle,
int err = ext4_add_entry(handle, dentry, inode);
if (!err) {
ext4_mark_inode_dirty(handle, inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
d_instantiate_new(dentry, inode);
return 0;
}
@@ -2661,6 +2666,7 @@ static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
err = ext4_orphan_add(handle, inode);
if (err)
goto err_unlock_inode;
+ ext4_fc_enqueue_inode(handle, inode);
mark_inode_dirty(inode);
unlock_new_inode(inode);
}
@@ -2773,6 +2779,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
err = ext4_init_new_dir(handle, dir, inode);
if (err)
goto out_clear_inode;
+ ext4_fc_mark_ineligible(inode->i_sb);
err = ext4_mark_inode_dirty(handle, inode);
if (!err)
err = ext4_add_entry(handle, dentry, inode);
@@ -3114,6 +3121,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
inode->i_size = 0;
ext4_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = current_time(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
ext4_dec_count(handle, dir);
ext4_update_dx_flag(dir);
@@ -3192,6 +3200,7 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
goto end_unlink;
dir->i_ctime = dir->i_mtime = current_time(dir);
ext4_update_dx_flag(dir);
+ ext4_fc_mark_ineligible(dir->i_sb);
ext4_mark_inode_dirty(handle, dir);
drop_nlink(inode);
if (!inode->i_nlink)
@@ -3387,6 +3396,7 @@ static int ext4_link(struct dentry *old_dentry,

err = ext4_add_entry(handle, dentry, inode);
if (!err) {
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
/* this can happen only for tmpfile being
* linked the first time
@@ -3991,6 +4001,8 @@ static int ext4_rename2(struct inode *old_dir, struct dentry *old_dentry,
if (err)
return err;

+ ext4_fc_mark_ineligible(old_dir->i_sb);
+
if (flags & RENAME_EXCHANGE) {
return ext4_cross_rename(old_dir, old_dentry,
new_dir, new_dentry);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0b833e9b61c1..c7bb52bdaf6e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1129,6 +1129,16 @@ static void ext4_destroy_inode(struct inode *inode)
true);
dump_stack();
}
+ if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+#ifdef EXT4FS_DEBUG
+ if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
+ pr_warn("%s: INODE %ld in FC List with FC allowed",
+ __func__, inode->i_ino);
+ dump_stack();
+ }
+#endif
+ ext4_fc_del(inode);
+ }
}

static void init_once(void *foo)
@@ -1181,6 +1191,7 @@ void ext4_clear_inode(struct inode *inode)
EXT4_I(inode)->jinode = NULL;
}
fscrypt_put_encryption_info(inode);
+ ext4_fc_del(inode);
}

static struct inode *ext4_nfs_get_inode(struct super_block *sb,
@@ -1325,6 +1336,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
* S_DAX may be disabled
*/
ext4_set_inode_flags(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
res = ext4_mark_inode_dirty(handle, inode);
if (res)
EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
@@ -5795,6 +5807,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
EXT4_I(inode)->i_flags |= EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL;
inode_set_flags(inode, S_NOATIME | S_IMMUTABLE,
S_NOATIME | S_IMMUTABLE);
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
unlock_inode:
@@ -5902,6 +5915,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
EXT4_I(inode)->i_flags &= ~(EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL);
inode_set_flags(inode, 0, S_NOATIME | S_IMMUTABLE);
inode->i_mtime = inode->i_ctime = current_time(inode);
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
ext4_journal_stop(handle);
out_unlock:
@@ -6008,6 +6022,7 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
if (inode->i_size < off + len) {
i_size_write(inode, off + len);
EXT4_I(inode)->i_disksize = inode->i_size;
+ ext4_fc_mark_ineligible(inode->i_sb);
ext4_mark_inode_dirty(handle, inode);
}
return len;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..19bc4046658c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1406,6 +1406,7 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
inode_unlock(ea_inode);

ext4_mark_inode_dirty(handle, ea_inode);
+ ext4_fc_enqueue_inode(handle, ea_inode);

out:
brelse(bh);
--
2.23.0.rc1.153.gdeed80330f-goog

2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 04/12] jbd2: fast-commit commit path changes

This patch adds core fast-commit commit path changes. This patch also
modifies existing JBD2 APIs to allow usage of fast commits. If fast
commits are enabled and journal->j_do_full_commit is not set, the
commit routine tries the file system specific fast commit first. Only
if that fails does it fall back to the full commit. Commit start and wait
APIs now take an additional argument which indicates if fast commits
are allowed or not.

In this patch we also add a new entry to journal->stats which counts
the number of fast commits performed.
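
The control flow described above can be modelled in user space. The
following is a minimal sketch, not the kernel code: struct journal_model,
commit_once(), fc_ok() and fc_fail() are hypothetical names used only for
illustration, and the model assumes the fast commit feature is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the journal fields the commit path uses. */
struct journal_model {
	bool do_full_commit;             /* models journal->j_do_full_commit */
	unsigned int subtid;             /* models journal->j_subtid */
	unsigned long num_full_commits;  /* models j_stats.ts_tid */
	unsigned long num_fast_commits;  /* models j_stats.ts_num_fast_commits */
	int (*fc_commit)(struct journal_model *j); /* returns 0 on success */
};

/*
 * Try the file system's fast commit first; fall back to a full commit if
 * fast commits are not allowed, a full commit was requested, no callback
 * is registered, or the callback fails.
 * Returns true if a fast commit was performed.
 */
static bool commit_once(struct journal_model *j, bool fc_allowed)
{
	assert(j != NULL);

	if (!j->do_full_commit && fc_allowed && j->fc_commit &&
	    j->fc_commit(j) == 0) {
		j->subtid++;             /* next sub-transaction */
		j->num_fast_commits++;
		return true;
	}

	/* Full commit: sub-transaction numbering starts over. */
	j->subtid = 0;
	j->do_full_commit = false;
	j->num_full_commits++;
	return false;
}

/* Example callbacks: one that always succeeds, one that always fails. */
static int fc_ok(struct journal_model *j) { (void)j; return 0; }
static int fc_fail(struct journal_model *j) { (void)j; return -1; }
```

With fc_ok registered, commit_once() counts a fast commit and bumps the
sub-transaction ID. With fc_fail registered, or with do_full_commit set,
it counts a full commit and resets the sub-transaction ID to 0.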

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: JBD2 commit routine passes stats to the fast commit callback. Also,
added a new entry to journal->stats and its tracking.
---
fs/ext4/super.c | 2 +-
fs/jbd2/checkpoint.c | 2 +-
fs/jbd2/commit.c | 47 +++++++++++++++++++++++--
fs/jbd2/journal.c | 81 +++++++++++++++++++++++++++++++++++--------
fs/jbd2/transaction.c | 6 ++--
fs/ocfs2/alloc.c | 2 +-
fs/ocfs2/super.c | 2 +-
include/linux/jbd2.h | 9 +++--
8 files changed, 124 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 81c3ec165822..6bab59ae81f7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5148,7 +5148,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
!jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
needs_barrier = true;

- if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
+ if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
if (wait)
ret = jbd2_log_wait_commit(sbi->s_journal,
target);
diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index a1909066bde6..6297978ae3bc 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -277,7 +277,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)

if (batch_count)
__flush_batch(journal, &batch_count);
- jbd2_log_start_commit(journal, tid);
+ jbd2_log_start_commit(journal, tid, true);
/*
* jbd2_journal_commit_transaction() may want
* to take the checkpoint_mutex if JBD2_FLUSHED
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 132fb92098c7..9281814606e7 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -351,8 +351,12 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
*
* The primary function for committing a transaction to the log. This
* function is called by the journal thread to begin a complete commit.
+ *
+ * fc is input / output parameter. If fc is non-null and is set to true, this
+ * function tries to perform fast commit. If the fast commit is successfully
+ * performed, *fc is set to true.
*/
-void jbd2_journal_commit_transaction(journal_t *journal)
+void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
{
struct transaction_stats_s stats;
transaction_t *commit_transaction;
@@ -380,6 +384,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
tid_t first_tid;
int update_tail;
int csum_size = 0;
+ bool full_commit;
LIST_HEAD(io_bufs);
LIST_HEAD(log_bufs);

@@ -413,6 +418,40 @@ void jbd2_journal_commit_transaction(journal_t *journal)
J_ASSERT(journal->j_running_transaction != NULL);
J_ASSERT(journal->j_committing_transaction == NULL);

+ read_lock(&journal->j_state_lock);
+ full_commit = journal->j_do_full_commit;
+ read_unlock(&journal->j_state_lock);
+
+ /* Let file-system try its own fast commit */
+ if (jbd2_has_feature_fast_commit(journal)) {
+ if (!full_commit && fc && *fc == true &&
+ journal->j_fc_commit_callback &&
+ !journal->j_fc_commit_callback(
+ journal, journal->j_running_transaction->t_tid,
+ journal->j_subtid, &stats.run)) {
+ jbd_debug(3, "fast commit success.\n");
+ if (journal->j_fc_cleanup_callback)
+ journal->j_fc_cleanup_callback(journal);
+ write_lock(&journal->j_state_lock);
+ journal->j_subtid++;
+ if (fc)
+ *fc = true;
+ write_unlock(&journal->j_state_lock);
+ goto update_overall_stats;
+ }
+ if (journal->j_fc_cleanup_callback)
+ journal->j_fc_cleanup_callback(journal);
+ write_lock(&journal->j_state_lock);
+ journal->j_fc_off = 0;
+ journal->j_subtid = 0;
+ journal->j_do_full_commit = false;
+ write_unlock(&journal->j_state_lock);
+ }
+
+ jbd_debug(3, "fast commit not performed, trying full.\n");
+ if (fc)
+ *fc = false;
+
commit_transaction = journal->j_running_transaction;

trace_jbd2_start_commit(journal, commit_transaction);
@@ -1129,8 +1168,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
/*
* Calculate overall stats
*/
+update_overall_stats:
spin_lock(&journal->j_history_lock);
- journal->j_stats.ts_tid++;
+ if (fc && *fc == true)
+ journal->j_stats.ts_num_fast_commits++;
+ else
+ journal->j_stats.ts_tid++;
journal->j_stats.ts_requested += stats.ts_requested;
journal->j_stats.run.rs_wait += stats.run.rs_wait;
journal->j_stats.run.rs_request_delay += stats.run.rs_request_delay;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 59ad709154a3..ab05e47ed2d4 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -160,7 +160,13 @@ static void commit_timeout(struct timer_list *t)
*
* 1) COMMIT: Every so often we need to commit the current state of the
* filesystem to disk. The journal thread is responsible for writing
- * all of the metadata buffers to disk.
+ * all of the metadata buffers to disk. If fast commits are allowed,
+ * journal thread passes the control to the file system and file system
+ * is then responsible for writing metadata buffers to disk (in whichever
+ * format it wants). If fast commit succeeds, journal thread won't perform
+ * a normal commit. In case the fast commit fails, journal thread performs
+ * full commit as normal.
+ *
*
* 2) CHECKPOINT: We cannot reuse a used section of the log file until all
* of the data in that part of the log has been rewritten elsewhere on
@@ -172,6 +178,7 @@ static int kjournald2(void *arg)
{
journal_t *journal = arg;
transaction_t *transaction;
+ bool fc_flag = true, fc_flag_save;

/*
* Set up an interval timer which can be used to trigger a commit wakeup
@@ -209,9 +216,14 @@ static int kjournald2(void *arg)
jbd_debug(1, "OK, requests differ\n");
write_unlock(&journal->j_state_lock);
del_timer_sync(&journal->j_commit_timer);
- jbd2_journal_commit_transaction(journal);
+ fc_flag_save = fc_flag;
+ jbd2_journal_commit_transaction(journal, &fc_flag);
write_lock(&journal->j_state_lock);
- goto loop;
+ if (!fc_flag) {
+ /* fast commit not performed */
+ fc_flag = fc_flag_save;
+ goto loop;
+ }
}

wake_up(&journal->j_wait_done_commit);
@@ -235,16 +247,18 @@ static int kjournald2(void *arg)

prepare_to_wait(&journal->j_wait_commit, &wait,
TASK_INTERRUPTIBLE);
- if (journal->j_commit_sequence != journal->j_commit_request)
+ if (!fc_flag &&
+ journal->j_commit_sequence != journal->j_commit_request)
should_sleep = 0;
transaction = journal->j_running_transaction;
if (transaction && time_after_eq(jiffies,
- transaction->t_expires))
+ transaction->t_expires))
should_sleep = 0;
if (journal->j_flags & JBD2_UNMOUNT)
should_sleep = 0;
if (should_sleep) {
write_unlock(&journal->j_state_lock);
+ jbd_debug(1, "%s sleeps\n", __func__);
schedule();
write_lock(&journal->j_state_lock);
}
@@ -259,7 +273,10 @@ static int kjournald2(void *arg)
transaction = journal->j_running_transaction;
if (transaction && time_after_eq(jiffies, transaction->t_expires)) {
journal->j_commit_request = transaction->t_tid;
+ fc_flag = false;
jbd_debug(1, "woke because of timeout\n");
+ } else {
+ fc_flag = true;
}
goto loop;

@@ -517,11 +534,17 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
return 0;
}

-int jbd2_log_start_commit(journal_t *journal, tid_t tid)
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit)
{
int ret;

write_lock(&journal->j_state_lock);
+ /*
+ * If someone has already requested a full commit,
+ * we have to honor it.
+ */
+ if (!journal->j_do_full_commit)
+ journal->j_do_full_commit = full_commit;
ret = __jbd2_log_start_commit(journal, tid);
write_unlock(&journal->j_state_lock);
return ret;
@@ -556,7 +579,7 @@ static int __jbd2_journal_force_commit(journal_t *journal)
tid = transaction->t_tid;
read_unlock(&journal->j_state_lock);
if (need_to_start)
- jbd2_log_start_commit(journal, tid);
+ jbd2_log_start_commit(journal, tid, true);
ret = jbd2_log_wait_commit(journal, tid);
if (!ret)
ret = 1;
@@ -603,11 +626,14 @@ int jbd2_journal_force_commit(journal_t *journal)
* if a transaction is going to be committed (or is currently already
* committing), and fills its tid in at *ptid
*/
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool full_commit)
{
int ret = 0;

write_lock(&journal->j_state_lock);
+ if (!journal->j_do_full_commit)
+ journal->j_do_full_commit = full_commit;
+
if (journal->j_running_transaction) {
tid_t tid = journal->j_running_transaction->t_tid;

@@ -675,7 +701,7 @@ EXPORT_SYMBOL(jbd2_trans_will_send_data_barrier);
* Wait for a specified commit to complete.
* The caller may not hold the journal lock.
*/
-int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
+int __jbd2_log_wait_commit(journal_t *journal, tid_t tid, tid_t subtid)
{
int err = 0;

@@ -702,12 +728,25 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
}
#endif
while (tid_gt(tid, journal->j_commit_sequence)) {
- jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
- tid, journal->j_commit_sequence);
+ if ((!journal->j_do_full_commit) &&
+ !tid_geq(subtid, journal->j_subtid))
+ break;
+ jbd_debug(1, "JBD2: want full commit %u %s %u, ",
+ tid, journal->j_do_full_commit ?
+ "and ignoring fast commit request for " :
+ "or want fast commit",
+ journal->j_subtid);
+ jbd_debug(1, "j_commit_sequence=%u, j_subtid=%u\n",
+ journal->j_commit_sequence, journal->j_subtid);
read_unlock(&journal->j_state_lock);
wake_up(&journal->j_wait_commit);
- wait_event(journal->j_wait_done_commit,
- !tid_gt(tid, journal->j_commit_sequence));
+ if (journal->j_do_full_commit)
+ wait_event(journal->j_wait_done_commit,
+ !tid_gt(tid, journal->j_commit_sequence));
+ else
+ wait_event(journal->j_wait_done_commit,
+ !tid_gt(tid, journal->j_commit_sequence) ||
+ !tid_geq(subtid, journal->j_subtid));
read_lock(&journal->j_state_lock);
}
read_unlock(&journal->j_state_lock);
@@ -717,6 +756,13 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
return err;
}

+int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
+{
+ journal->j_do_full_commit = true;
+ return __jbd2_log_wait_commit(journal, tid, 0);
+}
+
+
/* Return 1 when transaction with given tid has already committed. */
int jbd2_transaction_committed(journal_t *journal, tid_t tid)
{
@@ -751,7 +797,7 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
if (journal->j_commit_request != tid) {
/* transaction not yet started, so request it */
read_unlock(&journal->j_state_lock);
- jbd2_log_start_commit(journal, tid);
+ jbd2_log_start_commit(journal, tid, true);
goto wait_commit;
}
} else if (!(journal->j_committing_transaction &&
@@ -996,6 +1042,8 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
"each up to %u blocks\n",
s->stats->ts_tid, s->stats->ts_requested,
s->journal->j_max_transaction_buffers);
+ seq_printf(seq, "%lu fast commits performed\n",
+ s->stats->ts_num_fast_commits);
if (s->stats->ts_tid == 0)
return 0;
seq_printf(seq, "average: \n %ums waiting for transaction\n",
@@ -1020,6 +1068,9 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
s->stats->run.rs_blocks / s->stats->ts_tid);
seq_printf(seq, " %lu logged blocks per transaction\n",
s->stats->run.rs_blocks_logged / s->stats->ts_tid);
+ seq_printf(seq, " %lu logged blocks per commit\n",
+ s->stats->run.rs_blocks_logged /
+ (s->stats->ts_tid + s->stats->ts_num_fast_commits));
return 0;
}

@@ -1741,7 +1792,7 @@ int jbd2_journal_destroy(journal_t *journal)

/* Force a final log commit */
if (journal->j_running_transaction)
- jbd2_journal_commit_transaction(journal);
+ jbd2_journal_commit_transaction(journal, NULL);

/* Force any old transactions to disk */

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 990e7b5062e7..87f6627d78aa 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -154,7 +154,7 @@ static void wait_transaction_locked(journal_t *journal)
need_to_start = !tid_geq(journal->j_commit_request, tid);
read_unlock(&journal->j_state_lock);
if (need_to_start)
- jbd2_log_start_commit(journal, tid);
+ jbd2_log_start_commit(journal, tid, true);
jbd2_might_wait_for_commit(journal);
schedule();
finish_wait(&journal->j_wait_transaction_locked, &wait);
@@ -708,7 +708,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
need_to_start = !tid_geq(journal->j_commit_request, tid);
read_unlock(&journal->j_state_lock);
if (need_to_start)
- jbd2_log_start_commit(journal, tid);
+ jbd2_log_start_commit(journal, tid, true);

rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
handle->h_buffer_credits = nblocks;
@@ -1822,7 +1822,7 @@ int jbd2_journal_stop(handle_t *handle)
jbd_debug(2, "transaction too old, requesting commit for "
"handle %p\n", handle);
/* This is non-blocking */
- jbd2_log_start_commit(journal, transaction->t_tid);
+ jbd2_log_start_commit(journal, transaction->t_tid, true);

/*
* Special case: JBD2_SYNC synchronous updates require us
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 0c335b51043d..df41c43573b7 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6117,7 +6117,7 @@ int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
goto out;
}

- if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
+ if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
jbd2_log_wait_commit(osb->journal->j_journal, target);
ret = 1;
}
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 8b2f39506648..60ecc51759ae 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -410,7 +410,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
}

if (jbd2_journal_start_commit(osb->journal->j_journal,
- &target)) {
+ &target, true)) {
if (wait)
jbd2_log_wait_commit(osb->journal->j_journal,
target);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 153840b422cc..535f88dff653 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -742,6 +742,7 @@ struct transaction_run_stats_s {

struct transaction_stats_s {
unsigned long ts_tid;
+ unsigned long ts_num_fast_commits;
unsigned long ts_requested;
struct transaction_run_stats_s run;
};
@@ -1364,7 +1365,8 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);

/* Commit management */
-extern void jbd2_journal_commit_transaction(journal_t *);
+extern void jbd2_journal_commit_transaction(journal_t *journal,
+ bool *full_commit);

/* Checkpoint list management */
void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy);
@@ -1571,9 +1573,10 @@ extern void jbd2_clear_buffer_revoked_flags(journal_t *journal);
* transitions on demand.
*/

-int jbd2_log_start_commit(journal_t *journal, tid_t tid);
+int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit);
int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
-int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_journal_start_commit(journal_t *journal, tid_t *tid,
+ bool full_commit);
int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
--
2.23.0.rc1.153.gdeed80330f-goog

2019-08-09 03:47:02

by harshad shirwadkar

Subject: [PATCH v2 10/12] ext4: fast-commit commit path changes

This patch implements the actual commit path for fast commit. Based on
inodes tracked and their respective logical ranges remembered, this
patch adds code to create a fast commit block that stores extents
added to the inode. We use new JBD2 interfaces added in previous
patches in this series. The fast commit blocks that are created have
extents that _should_ be present in the file. It doesn't yet support
removing extents, which makes operations such as truncate and delete
incompatible with fast commits.
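
To illustrate how extents are laid out as tag-length-value records inside
a fast commit block, here is a minimal user-space sketch. It is not the
on-disk format from this patch: struct fc_extent, its 14-byte encoding,
and the helper names are assumptions made for illustration. Only the tag
value mirrors EXT4_FC_TAG_EXT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_TAG_EXT  0x1  /* mirrors EXT4_FC_TAG_EXT */
#define FC_TL_SIZE  4    /* tag (2 bytes) + length (2 bytes) */
#define FC_EXT_SIZE 14   /* hypothetical: lblk (4) + len (2) + pblk (8) */

/* Hypothetical simplified extent; NOT struct ext4_extent. */
struct fc_extent {
	uint32_t lblk;  /* first logical block */
	uint16_t len;   /* number of blocks */
	uint64_t pblk;  /* first physical block */
};

/* Store a value in little-endian byte order, regardless of host order. */
static void put_le(uint8_t *p, uint64_t v, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Append one extent TLV at *cur and advance *cur past it.
 * Returns 0 on success, -1 if the record does not fit before end.
 */
static int fc_append_extent(uint8_t **cur, const uint8_t *end,
			    const struct fc_extent *ex)
{
	uint8_t *p;

	assert(cur != NULL && *cur != NULL && end != NULL && ex != NULL);
	p = *cur;
	assert(p <= end);

	if ((size_t)(end - p) < FC_TL_SIZE + FC_EXT_SIZE)
		return -1;

	put_le(p, FC_TAG_EXT, 2);       /* tag */
	put_le(p + 2, FC_EXT_SIZE, 2);  /* length of the value */
	put_le(p + 4, ex->lblk, 4);     /* value: the extent itself */
	put_le(p + 8, ex->len, 2);
	put_le(p + 10, ex->pblk, 8);

	*cur = p + FC_TL_SIZE + FC_EXT_SIZE;
	return 0;
}
```

A caller would invoke fc_append_extent() once per mapped extent and treat
a -1 return the way the patch treats a full block, that is, by giving up
on the fast commit for this inode.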

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: 1) Use jbd2_wait_on_fc_bufs() instead of jbd2_fc_submit_bufs(). This
also implies that fast commit callback now submits relevant bhs by
itself.
2) Added tracepoints for commit path.
3) Several changes to fast commit on disk format:
- Removed fc_tid from the fast commit header. That's because the TID
can be obtained from the journal header that precedes the fast commit
header.
- Removed fc_len since it's always 1.
- Added fc_flags fields. We set "last" flag for the last block in a
sub-transaction. This allows us to maintain atomicity of
sub-transactions.
- Added fc_features to indicate what fast commit features are used by
this fast commit block. In future, we plan to add support for
handling of file create and file truncate. fc_features can be used
by future patches to indicate incompatibility of those fast commit
blocks.
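
One way a reader of the journal could use the "last" flag to keep
sub-transactions atomic is sketched below. This is an assumption about
how replay might work, not code from this series: struct fc_block_hdr and
fc_replayable_blocks() are hypothetical, and only the flag value mirrors
EXT4_FC_FL_LAST.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_FL_LAST 0x01  /* mirrors EXT4_FC_FL_LAST */

/* Hypothetical, reduced view of a fast commit block header. */
struct fc_block_hdr {
	uint32_t subtid;  /* sub-transaction this block belongs to */
	uint8_t flags;    /* FC_FL_LAST on the final block of a sub-transaction */
};

/*
 * Return how many leading blocks belong to complete sub-transactions.
 * Blocks of a sub-transaction whose "last" block never reached the disk
 * are excluded, so a partially written sub-transaction is never replayed.
 */
static size_t fc_replayable_blocks(const struct fc_block_hdr *blocks, size_t n)
{
	size_t complete = 0;

	assert(blocks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* A new subtid after an unterminated one means a torn write. */
		if (i > 0 && blocks[i].subtid != blocks[i - 1].subtid &&
		    !(blocks[i - 1].flags & FC_FL_LAST))
			break;
		if (blocks[i].flags & FC_FL_LAST)
			complete = i + 1;
	}
	return complete;
}
```

Given three blocks where only the second carries the flag, the function
reports two replayable blocks and the trailing block of the unfinished
sub-transaction is ignored.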
---
fs/ext4/ext4.h | 37 ++++++
fs/ext4/extents.c | 8 +-
fs/ext4/fsync.c | 2 +-
fs/ext4/inode.c | 5 +-
fs/ext4/super.c | 259 +++++++++++++++++++++++++++++++++++-
include/trace/events/ext4.h | 37 ++++++
6 files changed, 340 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0d15d4539dda..210bd4c86d4f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2276,6 +2276,43 @@ struct mmpd_data {
*/
#define EXT4_MMP_MAX_CHECK_INTERVAL 300UL

+/* Magic of fast commit header */
+#define EXT4_FC_MAGIC 0xE2540090
+
+#define EXT4_FC_FL_LAST 0x00000001
+
+#define ext4_fc_is_last(__fc_hdr) (((__fc_hdr)->fc_flags) & \
+ EXT4_FC_FL_LAST)
+
+#define ext4_fc_mark_last(__fc_hdr) (((__fc_hdr)->fc_flags) |= \
+ EXT4_FC_FL_LAST)
+
+struct ext4_fc_commit_hdr {
+ /* Fast commit magic, should be EXT4_FC_MAGIC */
+ __le32 fc_magic;
+ /* Sub transaction ID */
+ __le32 fc_subtid;
+ /* Features used by this fast commit block */
+ __u8 fc_features;
+ /* Flags for this block. */
+ __u8 fc_flags;
+ /* Number of TLVs in this fast commit block */
+ __le16 fc_num_tlvs;
+ /* Inode number */
+ __le32 fc_ino;
+ /* ext4 inode on disk copy */
+ struct ext4_inode inode;
+ /* Csum(hdr+contents) */
+ __le32 fc_csum;
+};
+
+#define EXT4_FC_TAG_EXT 0x1 /* Extent */
+
+struct ext4_fc_tl {
+ __le16 fc_tag;
+ __le16 fc_len;
+};
+
/*
* Function prototypes
*/
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index eb77e306a82b..66f7f4fb1612 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4899,10 +4899,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
if (ret)
goto out;

- if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
- ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
- EXT4_I(inode)->i_sync_tid);
- }
+ if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal)
+ ret = jbd2_fc_complete_commit(
+ EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+ journal_current_handle()->h_journal->j_subtid);
out:
inode_unlock(inode);
trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..4f783f9723c5 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -151,7 +151,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (journal->j_flags & JBD2_BARRIER &&
!jbd2_trans_will_send_data_barrier(journal, commit_tid))
needs_barrier = true;
- ret = jbd2_complete_transaction(journal, commit_tid);
+ ret = jbd2_fc_complete_commit(journal, commit_tid, journal->j_subtid);
if (needs_barrier) {
issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f79b185c013e..dd5d39a48363 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5476,8 +5476,9 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
return 0;

- err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
- EXT4_I(inode)->i_sync_tid);
+ err = jbd2_fc_complete_commit(
+ EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+ EXT4_SB(inode->i_sb)->s_journal->j_subtid);
} else {
struct ext4_iloc iloc;

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c7bb52bdaf6e..1191ebbb55c5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -437,6 +437,260 @@ static bool system_going_down(void)
|| system_state == SYSTEM_RESTART;
}

+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+ struct buffer_head *orig_bh = bh->b_private;
+
+ BUFFER_TRACE(bh, "");
+ if (uptodate) {
+ ext4_debug("%s: Block %lld up-to-date",
+ __func__, bh->b_blocknr);
+ set_buffer_uptodate(bh);
+ } else {
+ ext4_debug("%s: Block %lld not up-to-date",
+ __func__, bh->b_blocknr);
+ clear_buffer_uptodate(bh);
+ }
+ if (orig_bh) {
+ clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+ /* Protect BH_Shadow bit in b_state */
+ smp_mb__after_atomic();
+ wake_up_bit(&orig_bh->b_state, BH_Shadow);
+ }
+ unlock_buffer(bh);
+}
+
+static int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
+ struct inode *inode, tid_t tid, tid_t subtid,
+ int is_last)
+{
+ loff_t old_blk_size, cur_lblk_off, new_blk_size;
+ struct super_block *sb = journal->j_private;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_fc_commit_hdr *fc_hdr;
+ struct ext4_map_blocks map;
+ struct ext4_iloc iloc;
+ struct ext4_fc_tl tl;
+ struct ext4_extent extent;
+ __u32 dummy_csum = 0, csum;
+ __u8 *start, *cur, *end;
+ __u16 num_tlvs = 0;
+ int ret;
+
+ if (tid != ei->i_fc.fc_tid || subtid != ei->i_fc.fc_subtid) {
+ jbd_debug(3,
+ "File not modified. Modified %d:%d, expected %d:%d",
+ ei->i_fc.fc_tid, ei->i_fc.fc_subtid, tid, subtid);
+ return 0;
+ }
+
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return -ECANCELED;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ end = (__u8 *)bh->b_data + journal->j_blocksize;
+
+ old_blk_size = (ei->i_fc.fc_lblk_start + sb->s_blocksize - 1) >>
+ inode->i_blkbits;
+ new_blk_size = ei->i_fc.fc_lblk_end >> inode->i_blkbits;
+
+ jbd_debug(3, "Committing as tid = %d, subtid = %d on buffer %lld\n",
+ tid, subtid, bh->b_blocknr);
+
+ ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_end;
+
+ fc_hdr = (struct ext4_fc_commit_hdr *)
+ ((__u8 *)bh->b_data + sizeof(journal_header_t));
+ fc_hdr->fc_magic = cpu_to_le32(EXT4_FC_MAGIC);
+ fc_hdr->fc_subtid = cpu_to_le32(subtid);
+ fc_hdr->fc_ino = cpu_to_le32(inode->i_ino);
+ fc_hdr->fc_features = 0;
+ fc_hdr->fc_flags = 0;
+
+ if (is_last)
+ ext4_fc_mark_last(fc_hdr);
+
+ memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));
+ cur = (__u8 *)(fc_hdr + 1);
+ start = cur;
+ csum = 0;
+ cur_lblk_off = old_blk_size;
+ while (cur_lblk_off <= new_blk_size) {
+ map.m_lblk = cur_lblk_off;
+ map.m_len = new_blk_size - cur_lblk_off + 1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (!ret) {
+ cur_lblk_off += map.m_len;
+ continue;
+ }
+
+ if (map.m_flags & EXT4_MAP_UNWRITTEN)
+ return -ECANCELED;
+ extent.ee_block = cpu_to_le32(map.m_lblk);
+ cur_lblk_off += map.m_len;
+ if (cur + sizeof(struct ext4_extent) +
+ sizeof(struct ext4_fc_tl) >= end)
+ return -ENOSPC;
+
+ tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_EXT);
+ tl.fc_len = cpu_to_le16(sizeof(struct ext4_extent));
+ extent.ee_len = cpu_to_le16(map.m_len);
+ ext4_ext_store_pblock(&extent, map.m_pblk);
+ if (map.m_flags & EXT4_MAP_UNWRITTEN)
+ ext4_ext_mark_unwritten(&extent);
+ else
+ ext4_ext_mark_initialized(&extent);
+ memcpy(cur, &tl, sizeof(struct ext4_fc_tl));
+ cur += sizeof(struct ext4_fc_tl);
+ memcpy(cur, &extent, sizeof(struct ext4_extent));
+ cur += sizeof(struct ext4_extent);
+ num_tlvs++;
+ }
+
+ fc_hdr->fc_num_tlvs = cpu_to_le16(num_tlvs);
+ csum = ext4_chksum(sbi, csum, (__u8 *)fc_hdr,
+ offsetof(struct ext4_fc_commit_hdr, fc_csum));
+ csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+ csum = ext4_chksum(sbi, csum, start, cur - start);
+ fc_hdr->fc_csum = cpu_to_le32(csum);
+
+ jbd_debug(3, "Created FC block for inode %ld with [%d, %d]",
+ inode->i_ino, tid, subtid);
+
+ return 1;
+}
+
+static void ext4_journal_fc_cleanup_cb(journal_t *journal)
+{
+ struct super_block *sb = journal->j_private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_inode_info *iter;
+ struct inode *inode;
+
+ spin_lock(&sbi->s_fc_lock);
+ while (!list_empty(&sbi->s_fc_q)) {
+ iter = list_first_entry(&sbi->s_fc_q,
+ struct ext4_inode_info, i_fc_list);
+ list_del_init(&iter->i_fc_list);
+ inode = &iter->vfs_inode;
+ }
+ INIT_LIST_HEAD(&sbi->s_fc_q);
+ sbi->s_fc_q_cnt = 0;
+ spin_unlock(&sbi->s_fc_lock);
+}
+
+/*
+ * Fast-commit commit callback. There is contention between sbi->s_fc_lock and
+ * i_data_sem. Locking order is - i_data_sem then s_fc_lock
+ */
+static int ext4_journal_fc_commit_cb(journal_t *journal, tid_t tid,
+ tid_t subtid,
+ struct transaction_run_stats_s *stats)
+{
+ struct super_block *sb = journal->j_private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct list_head *pos, *tmp;
+ struct ext4_inode_info *iter;
+ struct jbd2_inode *jinode;
+ int num_bufs = 0, ret;
+
+ memset(stats, 0, sizeof(*stats));
+
+ trace_ext4_journal_fc_commit_cb_start(sb);
+ sbi = sbi;
+ spin_lock(&sbi->s_fc_lock);
+ if (!sbi->s_fc_eligible) {
+ sbi->s_fc_eligible = true;
+ spin_unlock(&sbi->s_fc_lock);
+ trace_ext4_journal_fc_commit_cb_stop(sb, 0);
+ return -ECANCELED;
+ }
+
+ stats->rs_flushing = jiffies;
+ /* Submit data buffers first */
+ list_for_each(pos, &sbi->s_fc_q) {
+ iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+ jinode = iter->jinode;
+ ret = jbd2_submit_inode_data(journal, jinode);
+ if (ret) {
+ spin_unlock(&sbi->s_fc_lock);
+ trace_ext4_journal_fc_commit_cb_stop(sb, 0);
+ return ret;
+ }
+ }
+ stats->rs_logging = jiffies;
+ stats->rs_flushing = jbd2_time_diff(stats->rs_flushing,
+ stats->rs_logging);
+
+ list_for_each_safe(pos, tmp, &sbi->s_fc_q) {
+ struct inode *inode;
+ struct buffer_head *bh;
+ int is_last;
+
+ iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+ inode = &iter->vfs_inode;
+
+ is_last = list_is_last(pos, &sbi->s_fc_q);
+ spin_unlock(&sbi->s_fc_lock);
+
+ ret = jbd2_map_fc_buf(journal, &bh);
+ if (ret)
+ return -ENOMEM;
+
+ /*
+ * Release s_fc_lock here since fc_write_inode calls
+ * ext4_map_blocks which needs i_data_sem.
+ */
+ ret = ext4_fc_write_inode(journal, bh, inode, tid, subtid,
+ is_last);
+ if (ret < 0) {
+ trace_ext4_journal_fc_commit_cb_stop(sb, 0);
+ return ret;
+ }
+ lock_buffer(bh);
+ clear_buffer_dirty(bh);
+ set_buffer_uptodate(bh);
+ bh->b_end_io = ext4_end_buffer_io_sync;
+ submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+
+ spin_lock(&sbi->s_fc_lock);
+
+ num_bufs += ret;
+ }
+
+ stats->rs_logging = jbd2_time_diff(stats->rs_logging, jiffies);
+ if (num_bufs == 0) {
+ spin_unlock(&sbi->s_fc_lock);
+ trace_ext4_journal_fc_commit_cb_stop(sb, 0);
+ stats->rs_blocks_logged = num_bufs;
+ return 0;
+ }
+
+ /*
+ * Before returning, check if s_fc_eligible was modified since we
+ * started.
+ */
+ if (!sbi->s_fc_eligible) {
+ spin_unlock(&sbi->s_fc_lock);
+ trace_ext4_journal_fc_commit_cb_stop(sb, 0);
+ return -ECANCELED;
+ }
+
+ spin_unlock(&sbi->s_fc_lock);
+
+ jbd_debug(3, "%s: Journal blocks ready for fast commit\n", __func__);
+
+ stats->rs_blocks_logged = num_bufs;
+
+ trace_ext4_journal_fc_commit_cb_stop(sb, num_bufs);
+
+ return jbd2_wait_on_fc_bufs(journal, num_bufs);
+}
+
/* Deal with the reporting of failure conditions on a filesystem such as
* inconsistencies detected or read IO failures.
*
@@ -4723,7 +4977,10 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
journal->j_commit_interval = sbi->s_commit_interval;
journal->j_min_batch_time = sbi->s_min_batch_time;
journal->j_max_batch_time = sbi->s_max_batch_time;
-
+ if (ext4_should_fast_commit(sb)) {
+ journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
+ journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
+ }
write_lock(&journal->j_state_lock);
if (test_opt(sb, BARRIER))
journal->j_flags |= JBD2_BARRIER;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..8ef67b61d54a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,43 @@ TRACE_EVENT(ext4_error,
__entry->function, __entry->line)
);

+TRACE_EVENT(ext4_journal_fc_commit_cb_start,
+ TP_PROTO(struct super_block *sb),
+
+ TP_ARGS(sb),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ ),
+
+ TP_printk("fast_commit started on dev %d,%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_journal_fc_commit_cb_stop,
+ TP_PROTO(struct super_block *sb, int nblks),
+
+ TP_ARGS(sb, nblks),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(int, nblks)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->nblks = nblks;
+ ),
+
+ TP_printk("fast_commit done on dev %d,%d, nblks %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->nblks)
+);
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
--
2.23.0.rc1.153.gdeed80330f-goog
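
The commit callback in the patch above either performs a fast commit or
asks JBD2 to fall back to a full commit. The following is a minimal
userspace sketch of that decision, not the kernel code: struct fc_model
and both functions are hypothetical names, and the sketch assumes one
journal block per queued inode, whereas the patch adds up the return
value of ext4_fc_write_inode().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ext4 state the callback looks at. */
struct fc_model {
	bool eligible;     /* models sbi->s_fc_eligible */
	int queued_inodes; /* models the number of entries on sbi->s_fc_q */
};

/*
 * Models the shape of ext4_journal_fc_commit_cb(): an ineligible state is
 * re-armed for the next round and the fast commit is cancelled; otherwise
 * one block is "logged" per queued inode (assumption of this sketch).
 */
int fc_commit_model(struct fc_model *m, int *blocks_logged)
{
	*blocks_logged = 0;
	if (!m->eligible) {
		m->eligible = true;
		return -ECANCELED;
	}
	*blocks_logged = m->queued_inodes;
	return 0;
}

/*
 * Models the caller: returns 1 when the fast commit succeeded and 0 when
 * a full commit is needed instead.
 */
int commit_model(struct fc_model *m, int *blocks_logged)
{
	return fc_commit_model(m, blocks_logged) == 0;
}
```

In this model, a state that was marked ineligible costs exactly one
full commit: the callback resets the flag, so the next call takes the
fast path again.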

2019-08-09 03:47:16

by harshad shirwadkar

Subject: [PATCH v2 06/12] jbd2: fast-commit recovery path changes

This patch adds fast-commit recovery path changes for JBD2. If the
recovery phase finds a valid fast commit block, JBD2 calls a file
system specific routine to handle that block.

We also clear the fast commit flag in jbd2_mark_journal_empty(), which
is called after successful recovery as well as after successful
checkpointing. This keeps the JBD2 journal compatible with older
kernels when it contains no fast commit blocks.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: Fixed checkpatch error.
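
The replay loop described in the commit message can be sketched in
userspace as follows. This is an illustration only: struct model_block
and fc_scan_model() are hypothetical names, the header fields are
assumed to be already converted from big-endian, and the file system
callback is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAGIC 0xc03b3998U /* value of JBD2_MAGIC_NUMBER */

/* Hypothetical, already byte-swapped view of a journal block header. */
struct model_block {
	uint32_t magic;    /* h_magic */
	uint32_t sequence; /* h_sequence */
};

/*
 * Walks the fast commit area from its first block and returns how many
 * leading blocks carry the journal magic and the expected commit ID.
 * The scan stops at the first block that fails either check.
 */
size_t fc_scan_model(const struct model_block *area, size_t nblocks,
		     uint32_t expected_commit_id)
{
	size_t i;

	for (i = 0; i < nblocks; i++) {
		if (area[i].magic != MODEL_MAGIC ||
		    area[i].sequence != expected_commit_id)
			break;
	}
	return i;
}
```

Blocks after the first mismatch are never looked at, so a stale block
left over from an earlier transaction ends the replay.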
---
fs/jbd2/journal.c | 12 ++++++++++
fs/jbd2/recovery.c | 59 +++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1e15804b2c3c..ae4584a60cc3 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1604,6 +1604,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
{
journal_superblock_t *sb = journal->j_superblock;
+ bool had_fast_commit = false;

BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
lock_buffer(journal->j_sb_buffer);
@@ -1617,6 +1618,14 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)

sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
sb->s_start = cpu_to_be32(0);
+ if (jbd2_has_feature_fast_commit(journal)) {
+ /*
+ * When journal is clean, no need to commit fast commit flag and
+ * make file system incompatible with older kernels.
+ */
+ jbd2_clear_feature_fast_commit(journal);
+ had_fast_commit = true;
+ }

jbd2_write_superblock(journal, write_op);

@@ -1624,6 +1633,9 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
write_lock(&journal->j_state_lock);
journal->j_flags |= JBD2_FLUSHED;
write_unlock(&journal->j_state_lock);
+
+ if (had_fast_commit)
+ jbd2_set_feature_fast_commit(journal);
}


diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..3a6cd1497504 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -225,8 +225,12 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
/* Make sure we wrap around the log correctly! */
#define wrap(journal, var) \
do { \
- if (var >= (journal)->j_last) \
- var -= ((journal)->j_last - (journal)->j_first); \
+ unsigned long _wrap_last = \
+ jbd2_has_feature_fast_commit(journal) ? \
+ (journal)->j_last_fc : (journal)->j_last; \
+ \
+ if (var >= _wrap_last) \
+ var -= (_wrap_last - (journal)->j_first); \
} while (0)

/**
@@ -413,6 +417,49 @@ static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
return tag->t_checksum == cpu_to_be16(csum32);
}

+static int fc_do_one_pass(journal_t *journal,
+ struct recovery_info *info, enum passtype pass)
+{
+ unsigned int expected_commit_id = info->end_transaction;
+ unsigned long next_fc_block;
+ struct buffer_head *bh;
+ unsigned int seq;
+ journal_header_t *jhdr;
+ int err = 0;
+
+ next_fc_block = journal->j_first_fc;
+
+ while (next_fc_block != journal->j_last_fc) {
+ jbd_debug(3, "Fast commit replay: next block %lu\n",
+ next_fc_block);
+ err = jread(&bh, journal, next_fc_block);
+ if (err)
+ break;
+
+ jhdr = (journal_header_t *)bh->b_data;
+ seq = be32_to_cpu(jhdr->h_sequence);
+ if (be32_to_cpu(jhdr->h_magic) != JBD2_MAGIC_NUMBER ||
+ seq != expected_commit_id) {
+ break;
+ }
+ jbd_debug(3, "Processing fast commit blk with seq %u\n",
+ seq);
+ if (pass == PASS_REPLAY &&
+ journal->j_fc_replay_callback) {
+ err = journal->j_fc_replay_callback(journal,
+ bh);
+ if (err)
+ break;
+ }
+ next_fc_block++;
+ }
+
+ if (err)
+ jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
+
+ return err;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -470,7 +517,7 @@ static int do_one_pass(journal_t *journal,
break;

jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
- next_commit_ID, next_log_block, journal->j_last);
+ next_commit_ID, next_log_block, journal->j_last_fc);

/* Skip over each chunk of the transaction looking
* either the next descriptor block or the final commit
@@ -768,6 +815,8 @@ static int do_one_pass(journal_t *journal,
if (err)
goto failed;
continue;
+ case JBD2_FC_BLOCK:
+ continue;

default:
jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
@@ -799,6 +848,10 @@ static int do_one_pass(journal_t *journal,
success = -EIO;
}
}
+
+ if (jbd2_has_feature_fast_commit(journal) && pass == PASS_REPLAY)
+ fc_do_one_pass(journal, info, pass);
+
if (block_error && success == 0)
success = -EIO;
return success;
--
2.23.0.rc1.153.gdeed80330f-goog
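
The modified wrap() macro in the patch above picks its upper bound
based on the fast commit feature flag. A small userspace model of the
same arithmetic, written as a function so it can be tested, might look
like this. struct journal_model and wrap_model() are hypothetical
names, and the block numbers used with them are made up.

```c
#include <stdbool.h>

/* Hypothetical subset of journal_t that the wrap() macro reads. */
struct journal_model {
	unsigned long first;   /* j_first */
	unsigned long last;    /* j_last */
	unsigned long last_fc; /* j_last_fc */
	bool fast_commit;      /* result of jbd2_has_feature_fast_commit() */
};

/*
 * Same arithmetic as the patched wrap() macro: once var reaches the
 * chosen upper bound, it is moved back by the size of the log area.
 */
unsigned long wrap_model(const struct journal_model *j, unsigned long var)
{
	unsigned long wrap_last = j->fast_commit ? j->last_fc : j->last;

	if (var >= wrap_last)
		var -= wrap_last - j->first;
	return var;
}
```

With first = 1, last = 100 and last_fc = 228, block 100 wraps to 1 when
the feature is off, but stays at 100 when the feature is on, because
the bound is then 228.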

2019-08-09 03:47:53

by harshad shirwadkar

Subject: [PATCH v2 12/12] docs: Add fast commit documentation

This patch adds necessary documentation to
Documentation/filesystems/journalling.rst and
Documentation/filesystems/ext4/journal.rst.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
Documentation/filesystems/ext4/journal.rst | 96 ++++++++++++++++++++--
Documentation/filesystems/journalling.rst | 15 ++++
2 files changed, 105 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..d6e4a698e208 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is a normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk. The last 128 blocks in the journal are
+reserved for fast commits. Fast commits store metadata changes to
+inodes in an incremental fashion. A fast commit is valid only if there
+is no full commit after that particular fast commit. That makes fast
+commit space reusable after every full commit.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.
@@ -48,16 +52,18 @@ Layout
Generally speaking, the journal has this format:

.. list-table::
- :widths: 16 48 16
+ :widths: 16 48 16 18
:header-rows: 1

* - Superblock
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
- One transaction
-
+ -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
@@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
superblock.

.. list-table::
- :widths: 12 12 12 32 12
+ :widths: 12 12 12 32 12 12
:header-rows: 1

* - 1024 bytes of padding
@@ -85,11 +91,13 @@ superblock.
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
-
-
- One transaction
-
+ -

Block Header
~~~~~~~~~~~~
@@ -609,3 +617,79 @@ bytes long (but uses a full block):
- h\_commit\_nsec
- Nanoseconds component of the above timestamp.

+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+Multiple fast commit blocks can be part of one sub-transaction. To
+indicate where a sub-transaction ends, the fc\_flags field of its
+last block is marked with the "LAST" (0x1) flag. A sub-transaction
+is valid only if all of the following conditions
+are met:
+
+1) SUBTID of all blocks is either equal to or greater than SUBTID of
+ the previous fast commit block.
+2) For every sub-transaction, last block is marked with LAST flag.
+3) There are no invalid blocks in between.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - journal\_header\_s
+ - (open coded)
+ - Common block header.
+ * - 0xC
+ - \_\_le32
+ - fc\_magic
+ - Magic value which should be set to 0xE2540090. This identifies
+ that this block is a fast commit block.
+ * - 0x10
+ - \_\_le32
+ - fc\_subtid
+ - Sub-transaction ID for this commit block
+ * - 0x14
+ - \_\_u8
+ - fc\_features
+ - Features used by this fast commit block.
+ * - 0x15
+ - \_\_u8
+ - fc\_flags
+ - Flags. 0x1 (LAST) indicates that this is the last block in a sub-transaction.
+ * - 0x16
+ - \_\_le16
+ - fc\_num\_tlvs
+ - Number of TLVs contained in this fast commit block
+ * - 0x18
+ - \_\_le32
+ - \_\_fc\_len
+ - Length of the fast commit block in terms of number of blocks
+ * - 0x2c
+ - \_\_le32
+ - fc\_ino
+ - Inode number of the inode that will be recovered using this fast commit
+ * - 0x30
+ - struct ext4\_inode
+ - inode
+ - On-disk copy of the inode at the commit time
+ * - 0x34
+ - struct ext4\_fc\_tl
+ - Array of struct ext4\_fc\_tl
+ - The actual delta with the last commit. Starting at this offset,
+ there is an array of TLVs that indicates which all extents
+ should be present in the corresponding inode. Currently, the
+ only tag that is supported is EXT4\_FC\_TAG\_EXT. That tag
+ indicates that the corresponding value is an extent.
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..2e0d550b546c 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,21 @@ called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.

+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. File systems that wish
+to use this feature should first set
+``journal->j_fc_commit_callback``. That function is called before
+performing a commit. The file system can call :c:func:`jbd2_map_fc_buf()`
+to get buffers reserved for fast commits. If the callback returns 0,
+JBD2 assumes that the file system performed a fast commit and backs off
+from performing a commit. Otherwise, JBD2 falls back to a normal full
+commit. After performing either a fast or a full commit, JBD2 calls
+``journal->j_fc_cleanup_callback`` to allow file systems to clean up
+their internal fast commit related data structures. At replay
+time, JBD2 passes each and every fast commit block to the file system
+via ``journal->j_fc_replay_callback``. Ext4 uses this fast
+commit mechanism to improve journal commit performance.
+
JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
--
2.23.0.rc1.153.gdeed80330f-goog
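
The three sub-transaction validity conditions documented above can be
sketched as a userspace check. This is one possible reading of the
rules, not the ext4 replay code: struct fc_block_model and
fc_valid_prefix() are hypothetical names, the per-block "valid" flag
stands for checks (magic, checksum) assumed to be done elsewhere, and
discarding a trailing sub-transaction that has no LAST block is an
assumption of this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FC_FLAG_LAST 0x1 /* "LAST" bit of fc_flags */

/* Hypothetical, already decoded view of one fast commit block. */
struct fc_block_model {
	bool valid;      /* block passed its own sanity checks */
	uint32_t subtid; /* fc_subtid */
	uint8_t flags;   /* fc_flags */
};

/*
 * Returns the number of leading blocks that form complete sub-transactions.
 * Scanning stops at the first violation of any of the three conditions.
 */
size_t fc_valid_prefix(const struct fc_block_model *b, size_t n)
{
	size_t i, complete = 0;

	for (i = 0; i < n; i++) {
		if (!b[i].valid)
			break; /* condition 3: no invalid blocks in between */
		if (i > 0 && b[i].subtid < b[i - 1].subtid)
			break; /* condition 1: SUBTID never decreases */
		if (i > 0 && b[i].subtid != b[i - 1].subtid &&
		    !(b[i - 1].flags & FC_FLAG_LAST))
			break; /* condition 2: previous sub-transaction lacks LAST */
		if (b[i].flags & FC_FLAG_LAST)
			complete = i + 1;
	}
	return complete;
}
```

Only the blocks counted by the return value would be handed to replay
in this model; anything after them belongs to an incomplete or broken
sub-transaction.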

2019-08-09 03:48:46

by harshad shirwadkar

Subject: [PATCH v2 09/12] ext4: fast-commit commit range tracking

With this patch, we track the logical range of file offsets that needs
to be committed using fast commit. This allows us to find the file
extents that need to be committed at commit time.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2: Since s_fc_lock is now a spinlock, updated calls appropriately.
---
fs/ext4/ext4_jbd2.c | 33 +++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 2 ++
fs/ext4/inline.c | 5 ++++-
fs/ext4/inode.c | 18 +++++++++++++++++-
4 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index d77b9f1e9dab..2897cbf4cc03 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -389,3 +389,36 @@ void ext4_fc_del(struct inode *inode)
list_del_init(&EXT4_I(inode)->i_fc_list);
spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
}
+
+void ext4_fc_update_commit_range(handle_t *handle, struct inode *inode,
+ loff_t start, loff_t end)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (!ext4_should_fast_commit(inode->i_sb))
+ return;
+
+ if (!ext4_handle_valid(handle))
+ return;
+
+ if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb))
+ ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+
+ if (!EXT4_SB(inode->i_sb)->s_fc_eligible)
+ return;
+
+ if (ei->i_fc.fc_tid == handle->h_transaction->t_tid &&
+ ei->i_fc.fc_subtid ==
+ handle->h_transaction->t_journal->j_subtid) {
+ ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_start < start ?
+ ei->i_fc.fc_lblk_start : start;
+ ei->i_fc.fc_lblk_end = ei->i_fc.fc_lblk_end > end ?
+ ei->i_fc.fc_lblk_end : end;
+ return;
+ }
+
+ ei->i_fc.fc_lblk_start = start;
+ ei->i_fc.fc_lblk_end = end;
+ ei->i_fc.fc_subtid = handle->h_transaction->t_journal->j_subtid;
+ ei->i_fc.fc_tid = handle->h_transaction->t_tid;
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index a27cc3a5c676..1badb142dc2a 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -485,5 +485,7 @@ ext4_fc_mark_ineligible(struct super_block *sb)
spin_unlock(&sbi->s_fc_lock);
}

+void ext4_fc_update_commit_range(handle_t *handle, struct inode *inode,
+ loff_t start, loff_t end);

#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 190968996bc6..de61c15e1b17 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -967,8 +967,11 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
* But it's important to update i_size while still holding page lock:
* page writeout could otherwise come in and zero beyond i_size.
*/
- if (pos+copied > inode->i_size)
+ if (pos+copied > inode->i_size) {
+ ext4_fc_update_commit_range(ext4_journal_current_handle(),
+ inode, inode->i_size, pos + copied);
i_size_write(inode, pos+copied);
+ }
unlock_page(page);
put_page(page);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 379e911b48c4..f79b185c013e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1549,6 +1549,8 @@ static int ext4_journalled_write_end(struct file *file,
SetPageUptodate(page);
}
size_changed = ext4_update_inode_size(inode, pos + copied);
+ ext4_fc_update_commit_range(handle, inode, pos, pos + copied);
+
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
unlock_page(page);
@@ -2610,8 +2612,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
i_size = i_size_read(inode);
if (disksize > i_size)
disksize = i_size;
- if (disksize > EXT4_I(inode)->i_disksize)
+ if (disksize > EXT4_I(inode)->i_disksize) {
+ ext4_fc_update_commit_range(handle, inode,
+ EXT4_I(inode)->i_disksize,
+ disksize);
EXT4_I(inode)->i_disksize = disksize;
+ }
up_write(&EXT4_I(inode)->i_data_sem);
err2 = ext4_mark_inode_dirty(handle, inode);
ext4_fc_enqueue_inode(handle, inode);
@@ -3220,6 +3226,8 @@ static int ext4_da_write_end(struct file *file,
}
}

+ ext4_fc_update_commit_range(handle, inode, pos, pos + copied);
+
if (write_mode != CONVERT_INLINE_DATA &&
ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
ext4_has_inline_data(inode))
@@ -3627,6 +3635,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
goto orphan_del;
}

+ ext4_fc_update_commit_range(handle, inode, offset, offset + written);
if (ext4_update_inode_size(inode, offset + written)) {
ext4_mark_inode_dirty(handle, inode);
ext4_fc_enqueue_inode(handle, inode);
@@ -3751,6 +3760,8 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
ext4_update_i_disksize(inode, inode->i_size);
ext4_journal_stop(handle);
}
+ ext4_fc_update_commit_range(journal_current_handle(), inode, offset,
+ offset + count);

BUG_ON(iocb->private == NULL);

@@ -3869,6 +3880,8 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
ext4_mark_inode_dirty(handle, inode);
ext4_fc_enqueue_inode(handle, inode);
}
+ ext4_fc_update_commit_range(handle, inode, offset,
+ offset + end);
}
err = ext4_journal_stop(handle);
if (ret == 0)
@@ -5327,6 +5340,9 @@ static int ext4_do_update_inode(handle_t *handle,
cpu_to_le16(ei->i_file_acl >> 32);
raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl);
if (ei->i_disksize != ext4_isize(inode->i_sb, raw_inode)) {
+ ext4_fc_update_commit_range(handle, inode,
+ ext4_isize(inode->i_sb, raw_inode),
+ ei->i_disksize);
ext4_isize_set(raw_inode, ei->i_disksize);
need_datasync = 1;
}
--
2.23.0.rc1.153.gdeed80330f-goog
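
The range tracking in ext4_fc_update_commit_range() above boils down to
"widen the range within the same (tid, subtid), otherwise start a new
range". A userspace sketch of just that rule follows. struct
fc_range_model and fc_update_range_model() are hypothetical names, and
the eligibility and handle checks from the patch are left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the i_fc fields used by the patch. */
struct fc_range_model {
	uint32_t tid;    /* fc_tid */
	uint32_t subtid; /* fc_subtid */
	int64_t start;   /* fc_lblk_start */
	int64_t end;     /* fc_lblk_end */
};

/*
 * Within the same transaction and sub-transaction the tracked range only
 * grows; a new (tid, subtid) pair replaces the range outright.
 */
void fc_update_range_model(struct fc_range_model *r, uint32_t tid,
			   uint32_t subtid, int64_t start, int64_t end)
{
	if (r->tid == tid && r->subtid == subtid) {
		if (start < r->start)
			r->start = start;
		if (end > r->end)
			r->end = end;
		return;
	}
	r->start = start;
	r->end = end;
	r->tid = tid;
	r->subtid = subtid;
}
```

Two writes at [100, 200) and [50, 150) in the same sub-transaction
leave a tracked range of [50, 200); a write in the next sub-transaction
resets the range to that write alone.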

2019-08-09 03:48:46

by harshad shirwadkar

Subject: [PATCH v2 01/12] ext4: add handling for extended mount options

We are running out of mount option bits. This patch adds handling for
using s_mount_opt2 and also adds the ability to turn the fast commit
feature on or off. In order to use fast commits, a new version of
e2fsprogs needs to set the fast commit feature flag. This also makes
sure that we have fast commit compatible e2fsprogs before starting to
use the feature. The mount flag "no_fastcommit", introduced in this
patch, can be passed to disable the feature at mount time.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---
Changelog:

V2: No changes since V1
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/super.c | 27 ++++++++++++++++++++++-----
include/linux/jbd2.h | 5 ++++-
3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..becbda38b7db 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1146,6 +1146,8 @@ struct ext4_inode_info {
#define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
specified journal checksum */

+#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
+
#define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
~EXT4_MOUNT_##opt
#define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
@@ -1643,6 +1645,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
#define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
#define EXT4_FEATURE_COMPAT_SPARSE_SUPER2 0x0200
+#define EXT4_FEATURE_COMPAT_FAST_COMMIT 0x0400

#define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
#define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
@@ -1743,6 +1746,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr, EXT_ATTR)
EXT4_FEATURE_COMPAT_FUNCS(resize_inode, RESIZE_INODE)
EXT4_FEATURE_COMPAT_FUNCS(dir_index, DIR_INDEX)
EXT4_FEATURE_COMPAT_FUNCS(sparse_super2, SPARSE_SUPER2)
+EXT4_FEATURE_COMPAT_FUNCS(fast_commit, FAST_COMMIT)

EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super, SPARSE_SUPER)
EXT4_FEATURE_RO_COMPAT_FUNCS(large_file, LARGE_FILE)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..e376ac040cce 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1455,6 +1455,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
+ Opt_no_fastcommit
};

static const match_table_t tokens = {
@@ -1537,6 +1538,7 @@ static const match_table_t tokens = {
{Opt_init_itable, "init_itable=%u"},
{Opt_init_itable, "init_itable"},
{Opt_noinit_itable, "noinit_itable"},
+ {Opt_no_fastcommit, "no_fastcommit"},
{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
{Opt_test_dummy_encryption, "test_dummy_encryption"},
{Opt_nombcache, "nombcache"},
@@ -1659,6 +1661,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
#define MOPT_NO_EXT3 0x0200
#define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
#define MOPT_STRING 0x0400
+#define MOPT_2 0x0800

static const struct mount_opts {
int token;
@@ -1751,6 +1754,8 @@ static const struct mount_opts {
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+ {Opt_no_fastcommit, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+ MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
{Opt_err, 0, 0}
};

@@ -1858,8 +1863,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
set_opt2(sb, EXPLICIT_DELALLOC);
} else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
- } else
+ } else if (m->mount_opt) {
return -1;
+ }
}
if (m->flags & MOPT_CLEAR_ERR)
clear_opt(sb, ERRORS_MASK);
@@ -2027,10 +2033,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
WARN_ON(1);
return -1;
}
- if (arg != 0)
- sbi->s_mount_opt |= m->mount_opt;
- else
- sbi->s_mount_opt &= ~m->mount_opt;
+ if (m->flags & MOPT_2) {
+ if (arg != 0)
+ sbi->s_mount_opt2 |= m->mount_opt;
+ else
+ sbi->s_mount_opt2 &= ~m->mount_opt;
+ } else {
+ if (arg != 0)
+ sbi->s_mount_opt |= m->mount_opt;
+ else
+ sbi->s_mount_opt &= ~m->mount_opt;
+ }
}
return 1;
}
@@ -3733,6 +3746,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
#ifdef CONFIG_EXT4_FS_POSIX_ACL
set_opt(sb, POSIX_ACL);
#endif
+ if (ext4_has_feature_fast_commit(sb))
+ set_opt2(sb, JOURNAL_FAST_COMMIT);
+
/* don't forget to enable journal_csum when metadata_csum is enabled. */
if (ext4_has_metadata_csum(sb))
set_opt(sb, JOURNAL_CHECKSUM);
@@ -4334,6 +4350,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
clear_opt(sb, JOURNAL_CHECKSUM);
clear_opt(sb, DATA_FLAGS);
+ clear_opt2(sb, JOURNAL_FAST_COMMIT);
sbi->s_journal = NULL;
needs_recovery = 0;
goto no_journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index df03825ad1a1..b7eed49b8ecd 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -288,6 +288,7 @@ typedef struct journal_superblock_s
#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
#define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008
#define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010
+#define JBD2_FEATURE_INCOMPAT_FAST_COMMIT 0x00000020

/* See "journal feature predicate functions" below */

@@ -298,7 +299,8 @@ typedef struct journal_superblock_s
JBD2_FEATURE_INCOMPAT_64BIT | \
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
- JBD2_FEATURE_INCOMPAT_CSUM_V3)
+ JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
+ JBD2_FEATURE_INCOMPAT_FAST_COMMIT)

#ifdef __KERNEL__

@@ -1235,6 +1237,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(64bit, 64BIT)
JBD2_FEATURE_INCOMPAT_FUNCS(async_commit, ASYNC_COMMIT)
JBD2_FEATURE_INCOMPAT_FUNCS(csum2, CSUM_V2)
JBD2_FEATURE_INCOMPAT_FUNCS(csum3, CSUM_V3)
+JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit, FAST_COMMIT)

/*
* Journal flag definitions
--
2.23.0.rc1.153.gdeed80330f-goog
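
The MOPT_2 handling added to handle_mount_opt() above selects which
option word a set or clear operation applies to. A userspace model of
that selection is sketched below. struct sb_model, apply_opt_model()
and the MODEL_* constants are hypothetical; the constants only mirror
the values visible in the patch, and argument parsing is omitted.

```c
#include <stdint.h>

#define MODEL_MOPT_SET   0x0001
#define MODEL_MOPT_CLEAR 0x0002
#define MODEL_MOPT_2     0x0800 /* option bit lives in the second word */
#define MODEL_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010

/* Hypothetical stand-in for the two option words in ext4_sb_info. */
struct sb_model {
	uint32_t mount_opt;  /* s_mount_opt */
	uint32_t mount_opt2; /* s_mount_opt2 */
};

/*
 * Sets or clears one option bit, using the MOPT_2 flag to pick the word,
 * as the patched tail of handle_mount_opt() does.
 */
void apply_opt_model(struct sb_model *sb, uint32_t bit, unsigned int flags)
{
	uint32_t *word = (flags & MODEL_MOPT_2) ? &sb->mount_opt2
						: &sb->mount_opt;

	if (flags & MODEL_MOPT_CLEAR)
		*word &= ~bit;
	else
		*word |= bit;
}
```

Starting from a superblock where the fast commit bit was set by
default, applying "no_fastcommit" (CLEAR plus MOPT_2) clears the bit in
the second word and leaves the first word untouched.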

2019-08-09 03:48:46

by harshad shirwadkar

Subject: [PATCH v2 11/12] ext4: fast-commit recovery path changes

This patch adds core fast-commit recovery path changes. Each fast
commit block stores modified extents for a particular file. Replay
code maps blocks in each such extent to the actual file one-by-one. We
also update the corresponding file system metadata to account for the
newly mapped blocks. To make this possible, ext4_inode_csum_set() and
ext4_inode_blocks(), which were previously static, are now made
visible outside inode.c.

Signed-off-by: Harshad Shirwadkar <[email protected]>

---

Changelog:

V2:
1) Fixed warning reported by Kbuild.

2) Implement scan pass.
- we look for "last" blocks to maintain atomicity of
subtransactions.
- Implement CRC checksum verification.
- If scan pass detects error, we don't perform replay pass.

3) Calling j_fc_replay_callback for SCAN pass as well. So added
passtype and fast commit block offset parameters to
j_fc_replay_callback.

Added tracepoint for replay SCAN pass
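
The free-cluster accounting that replay needs when it marks blocks as
used (ext4_mb_mark_used() in this patch) can be illustrated with a toy
64-bit bitmap. This is a sketch under simplifying assumptions: one
group, off + len <= 64, no locking and no checksums.
mark_used_model() is a hypothetical name.

```c
#include <stdint.h>

/*
 * Sets len bits starting at off and returns the new free count. Bits
 * that were already set are counted and added back, so marking the same
 * range twice does not subtract it twice.
 */
unsigned int mark_used_model(uint64_t *bitmap, unsigned int off,
			     unsigned int len, unsigned int free_count)
{
	unsigned int i, already = 0;

	for (i = 0; i < len; i++) {
		uint64_t bit = (uint64_t)1 << (off + i);

		if (*bitmap & bit)
			already++;
		*bitmap |= bit;
	}
	return free_count - len + already;
}
```

Counting the already-set bits is what makes the operation safe to
repeat, which matters if the same fast commit block is replayed again
after an interrupted recovery.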
---
fs/ext4/balloc.c | 7 +-
fs/ext4/ext4.h | 12 ++
fs/ext4/extents.c | 19 +--
fs/ext4/inode.c | 8 +-
fs/ext4/mballoc.c | 83 +++++++++++++
fs/ext4/mballoc.h | 2 +
fs/ext4/super.c | 225 ++++++++++++++++++++++++++++++++++++
fs/jbd2/commit.c | 6 +-
fs/jbd2/recovery.c | 11 +-
include/linux/jbd2.h | 5 +-
include/trace/events/ext4.h | 22 ++++
include/trace/events/jbd2.h | 9 +-
12 files changed, 386 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 0b202e00d93f..75c3025c7089 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -360,7 +360,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
struct buffer_head *bh)
{
ext4_fsblk_t blk;
- struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+ struct ext4_group_info *grp;
+
+ if (EXT4_SB(sb)->fc_replay)
+ return 0;
+
+ grp = ext4_get_group_info(sb, block_group);

if (buffer_verified(bh))
return 0;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 210bd4c86d4f..ca1fbd77a934 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1378,6 +1378,13 @@ struct ext4_super_block {
#define ext4_has_strict_mode(sbi) \
(sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)

+struct ext4_fc_replay_state {
+ int fc_replay_error;
+ int fc_replay_expected_off;
+ int fc_replay_expected_tid;
+ int fc_replay_current_subtid;
+};
+
/*
* fourth extended-fs super-block data in memory
*/
@@ -1562,6 +1569,7 @@ struct ext4_sb_info {
* Are changes after the last commit
* eligible for fast commit?
*/
+ struct ext4_fc_replay_state s_fc_replay_state;
spinlock_t s_fc_lock;
};

@@ -2588,6 +2596,10 @@ extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);

/* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei);
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei);
int ext4_inode_is_fast_symlink(struct inode *inode);
struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 66f7f4fb1612..59fe596ce97d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2894,7 +2894,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
int depth = ext_depth(inode);
struct ext4_ext_path *path = NULL;
struct partial_cluster partial;
- handle_t *handle;
+ handle_t *handle = NULL;
int i = 0, err = 0;

partial.pclu = 0;
@@ -2904,9 +2904,11 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
ext_debug("truncate since %u to %u\n", start, end);

/* probably first extent we're gonna free will be last in block */
- handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (!sbi->fc_replay) {
+ handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ }

again:
trace_ext4_ext_remove_space(inode, start, end, depth);
@@ -2926,7 +2928,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
/* find extent for or closest extent to this block */
path = ext4_find_extent(inode, end, NULL, EXT4_EX_NOCACHE);
if (IS_ERR(path)) {
- ext4_journal_stop(handle);
+ if (!sbi->fc_replay)
+ ext4_journal_stop(handle);
return PTR_ERR(path);
}
depth = ext_depth(inode);
@@ -3012,7 +3015,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
path = kcalloc(depth + 1, sizeof(struct ext4_ext_path),
GFP_NOFS);
if (path == NULL) {
- ext4_journal_stop(handle);
+ if (!sbi->fc_replay)
+ ext4_journal_stop(handle);
return -ENOMEM;
}
path[0].p_maxdepth = path[0].p_depth = depth;
@@ -3142,7 +3146,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
path = NULL;
if (err == -EAGAIN)
goto again;
- ext4_journal_stop(handle);
+ if (!sbi->fc_replay)
+ ext4_journal_stop(handle);

return err;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd5d39a48363..21c9b5197c72 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -103,8 +103,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
return provided == calculated;
}

-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
- struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
{
__u32 csum;

@@ -4801,8 +4801,8 @@ void ext4_set_inode_flags(struct inode *inode)
S_ENCRYPTED|S_CASEFOLD);
}

-static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
- struct ext4_inode_info *ei)
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+ struct ext4_inode_info *ei)
{
blkcnt_t i_blocks ;
struct inode *inode = &(ei->vfs_inode);
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a3e2767bdf2f..70551fa91237 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2915,6 +2915,89 @@ void ext4_exit_mballoc(void)
}


+void ext4_mb_mark_used(struct super_block *sb, ext4_fsblk_t block,
+ int len)
+{
+ struct buffer_head *bitmap_bh = NULL;
+ struct ext4_group_desc *gdp;
+ struct buffer_head *gdp_bh;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ ext4_group_t group;
+ ext4_fsblk_t cluster;
+ ext4_grpblk_t blkoff;
+ int i, clen, err;
+ int already_allocated_count;
+
+ cluster = EXT4_B2C(sbi, block);
+ clen = EXT4_B2C(sbi, len);
+
+ ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+ bitmap_bh = ext4_read_block_bitmap(sb, group);
+ if (IS_ERR(bitmap_bh)) {
+ err = PTR_ERR(bitmap_bh);
+ bitmap_bh = NULL;
+ goto out_err;
+ }
+
+ err = -EIO;
+ gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+ if (!gdp)
+ goto out_err;
+
+ if (!ext4_data_block_valid(sbi, block, len)) {
+ ext4_error(sb, "Allocating blks %llu-%llu which overlap mdata",
+ cluster, cluster+clen);
+ /* File system mounted not to panic on error
+ * Fix the bitmap and return EFSCORRUPTED
+ * We leak some of the blocks here.
+ */
+ ext4_lock_group(sb, group);
+ ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+ ext4_unlock_group(sb, group);
+ err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+ if (!err)
+ err = -EFSCORRUPTED;
+ goto out_err;
+ }
+
+ ext4_lock_group(sb, group);
+ already_allocated_count = 0;
+ for (i = 0; i < clen; i++)
+ if (mb_test_bit(blkoff + i, bitmap_bh->b_data))
+ already_allocated_count++;
+
+ ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+ if (ext4_has_group_desc_csum(sb) &&
+ (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+ gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+ ext4_free_group_clusters_set(sb, gdp,
+ ext4_free_clusters_after_init(sb,
+ group, gdp));
+ }
+ clen = ext4_free_group_clusters(sb, gdp) - clen +
+ already_allocated_count;
+ ext4_free_group_clusters_set(sb, gdp, clen);
+ ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+ ext4_group_desc_csum_set(sb, group, gdp);
+
+ ext4_unlock_group(sb, group);
+
+ if (sbi->s_log_groups_per_flex) {
+ ext4_group_t flex_group = ext4_flex_group(sbi, group);
+
+ atomic64_sub(len,
+ &sbi->s_flex_groups[flex_group].free_clusters);
+ }
+
+ err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+ if (err)
+ goto out_err;
+ err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+
+out_err:
+ brelse(bitmap_bh);
+}
+
/*
* Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
* Returns 0 if success or error code
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..1881710041b6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -215,4 +215,6 @@ ext4_mballoc_query_range(
ext4_mballoc_query_range_fn formatter,
void *priv);

+void ext4_mb_mark_used(struct super_block *sb, ext4_fsblk_t block,
+ int len);
#endif
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1191ebbb55c5..3b535eb624a7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -408,6 +408,224 @@ static int block_device_ejected(struct super_block *sb)
return bdi->dev == NULL;
}

+static void ext4_fc_add_block(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, int unwritten)
+{
+ struct ext4_extent ex;
+ struct ext4_ext_path *path = NULL;
+ struct ext4_map_blocks map;
+ int ret;
+
+ map.m_lblk = lblk;
+ map.m_len = 0x1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret > 0) {
+ if (pblk != map.m_pblk)
+ jbd_debug(1, "Bad mapping found while replaying fc\n");
+ return;
+ }
+
+ ex.ee_block = cpu_to_le32(lblk);
+ ext4_ext_store_pblock(&ex, pblk);
+ ex.ee_len = cpu_to_le16(0x1);
+ if (unwritten)
+ ext4_ext_mark_unwritten(&ex);
+
+ path = ext4_find_extent(inode, lblk, NULL, 0);
+ if (path) {
+ down_write(&EXT4_I(inode)->i_data_sem);
+ ret = ext4_ext_insert_extent(NULL, inode, &path, &ex, 0);
+ ext4_mb_mark_used(inode->i_sb, ext4_ext_pblock(&ex), 0x1);
+ up_write((&EXT4_I(inode)->i_data_sem));
+ kfree(path);
+ }
+}
+
+static int ext4_journal_fc_replay_scan(struct super_block *sb,
+ struct buffer_head *bh, int off)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_fc_replay_state *state;
+ struct ext4_fc_commit_hdr *fc_hdr;
+ struct ext4_fc_tl *tl;
+ __u32 csum, dummy_csum = 0;
+ __u8 *start;
+ tid_t fc_subtid;
+ int i;
+
+ state = &sbi->s_fc_replay_state;
+ fc_hdr = (struct ext4_fc_commit_hdr *)
+ ((__u8 *)bh->b_data + sizeof(journal_header_t));
+
+ fc_subtid = le32_to_cpu(fc_hdr->fc_subtid);
+
+ if (le32_to_cpu(fc_hdr->fc_magic) != EXT4_FC_MAGIC) {
+ state->fc_replay_error = -ENOENT;
+ goto out_err;
+ }
+
+ if (off != state->fc_replay_expected_off) {
+ state->fc_replay_error = -EFSCORRUPTED;
+ goto out_err;
+ }
+
+ if (le16_to_cpu(fc_hdr->fc_features)) {
+ state->fc_replay_error = -EOPNOTSUPP;
+ goto out_err;
+ }
+
+ /* Check if we already concluded that this fast commit is not useful */
+ if (state->fc_replay_error && state->fc_replay_error != -EPROTO)
+ goto out_err;
+
+ if (state->fc_replay_expected_off == 0) {
+ /* This is a first block */
+ state->fc_replay_current_subtid = fc_subtid;
+ /*
+ * We set replay error by default until we find an end
+ * block for a particular subtid
+ */
+ state->fc_replay_error = -EPROTO;
+ }
+
+ if (state->fc_replay_error != 0) {
+ if (state->fc_replay_current_subtid != fc_subtid) {
+ state->fc_replay_error = -EFSCORRUPTED;
+ goto out_err;
+ }
+ } else {
+ /*
+ * We encountered _last_ block for previous subtid. So we should
+ * only find a bigger subtid here.
+ */
+ if (fc_subtid <= state->fc_replay_current_subtid) {
+ state->fc_replay_error = -EFSCORRUPTED;
+ goto out_err;
+ }
+ state->fc_replay_current_subtid = fc_subtid;
+ }
+
+ /*
+ * We can replay fast commit blocks only if we find a _last_ block for
+ * all subtids.
+ */
+ if (ext4_fc_is_last(fc_hdr))
+ state->fc_replay_error = 0;
+
+ csum = ext4_chksum(sbi, 0, fc_hdr,
+ offsetof(struct ext4_fc_commit_hdr, fc_csum));
+ csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+
+ tl = (struct ext4_fc_tl *)(fc_hdr + 1);
+ start = (__u8 *)tl;
+ for (i = 0; i < le16_to_cpu(fc_hdr->fc_num_tlvs); i++) {
+ if (le16_to_cpu(tl->fc_tag) != EXT4_FC_TAG_EXT)
+ goto out_err;
+ tl = (struct ext4_fc_tl *)((__u8 *)tl +
+ le16_to_cpu(tl->fc_len) +
+ sizeof(*tl));
+ }
+ csum = ext4_chksum(sbi, csum, start, (__u8 *)tl - start);
+ if (csum != le32_to_cpu(fc_hdr->fc_csum)) {
+ state->fc_replay_error = -EFSBADCRC;
+ goto out_err;
+ }
+
+ state->fc_replay_expected_off++;
+ return 0;
+
+out_err:
+ trace_ext4_journal_fc_replay_scan(sb, state->fc_replay_error, off);
+ return state->fc_replay_error;
+}
+
+static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
+ enum passtype pass, int off)
+{
+ struct super_block *sb = journal->j_private;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ struct ext4_fc_commit_hdr *fc_hdr;
+ struct ext4_fc_tl *tl;
+ struct ext4_iloc iloc;
+ struct ext4_extent *ex;
+ struct inode *inode;
+ int ret;
+
+ if (pass == PASS_SCAN)
+ return ext4_journal_fc_replay_scan(sb, bh, off);
+
+ if (sbi->s_fc_replay_state.fc_replay_error)
+ return sbi->s_fc_replay_state.fc_replay_error;
+
+ sbi->fc_replay = true;
+ fc_hdr = (struct ext4_fc_commit_hdr *)
+ ((__u8 *)bh->b_data + sizeof(journal_header_t));
+
+ jbd_debug(3, "%s: Got FC block for inode %d at [%d,%d]", __func__,
+ le32_to_cpu(fc_hdr->fc_ino),
+ be32_to_cpu(((journal_header_t *)bh->b_data)->h_sequence),
+ le32_to_cpu(fc_hdr->fc_subtid));
+
+ inode = ext4_iget(sb, le32_to_cpu(fc_hdr->fc_ino), EXT4_IGET_NORMAL);
+ if (IS_ERR(inode))
+ return 0;
+
+ ret = ext4_get_inode_loc(inode, &iloc);
+ if (ret)
+ return ret;
+
+ inode_lock(inode);
+ tl = (struct ext4_fc_tl *)(fc_hdr + 1);
+ while (le16_to_cpu(tl->fc_tag) == EXT4_FC_TAG_EXT) {
+ int i;
+
+ ex = (struct ext4_extent *)(tl + 1);
+ tl = (struct ext4_fc_tl *)((__u8 *)tl +
+ le16_to_cpu(tl->fc_len) +
+ sizeof(*tl));
+ /*
+ * We add block by block because part of extent may already have
+ * been added by a previous fast commit replay.
+ */
+ for (i = 0; i < ext4_ext_get_actual_len(ex); i++)
+ ext4_fc_add_block(inode, le32_to_cpu(ex->ee_block) + i,
+ ext4_ext_pblock(ex) + i,
+ ext4_ext_is_unwritten(ex));
+ }
+
+ /*
+ * Unless inode contains inline data, copy everything except
+ * i_blocks. i_blocks would have been set alright by ext4_fc_add_block
+ * call above.
+ */
+ if (ext4_has_inline_data(inode)) {
+ memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
+ sizeof(struct ext4_inode));
+ } else {
+ memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
+ offsetof(struct ext4_inode, i_block));
+ memcpy(&ext4_raw_inode(&iloc)->i_generation,
+ &fc_hdr->inode.i_generation,
+ sizeof(struct ext4_inode) -
+ offsetof(struct ext4_inode, i_generation));
+ }
+
+ ext4_reserve_inode_write(NULL, inode, &iloc);
+ inode_unlock(inode);
+ sbi->fc_replay = false;
+
+ ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
+ ret = ext4_handle_dirty_metadata(NULL, inode, iloc.bh);
+ iput(inode);
+ if (!ret)
+ ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+
+ brelse(iloc.bh);
+
+ return ret;
+}
+
+
static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
{
struct super_block *sb = journal->j_private;
@@ -4981,6 +5199,13 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
}
+
+ /*
+ * Set the replay callback even if fast commit is disabled, because the
+ * journal may still contain fast commit blocks that need to be replayed
+ * after fast commit has been turned off.
+ */
+ journal->j_fc_replay_callback = ext4_journal_fc_replay_cb;
write_lock(&journal->j_state_lock);
if (test_opt(sb, BARRIER))
journal->j_flags |= JBD2_BARRIER;
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index db62a53436e3..1875cdc839fb 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -469,6 +469,10 @@ void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
if (fc)
*fc = true;
write_unlock(&journal->j_state_lock);
+ trace_jbd2_run_stats(journal->j_fs_dev->bd_dev,
+ journal->j_running_transaction
+ ->t_tid,
+ &stats.run, true);
goto update_overall_stats;
}
if (journal->j_fc_cleanup_callback)
@@ -1156,7 +1160,7 @@ void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
stats.run.rs_handle_count =
atomic_read(&commit_transaction->t_handle_count);
trace_jbd2_run_stats(journal->j_fs_dev->bd_dev,
- commit_transaction->t_tid, &stats.run);
+ commit_transaction->t_tid, &stats.run, false);
stats.ts_requested = (commit_transaction->t_requested) ? 1 : 0;

commit_transaction->t_state = T_COMMIT_CALLBACK;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 3a6cd1497504..ba049a31febc 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -35,7 +35,6 @@ struct recovery_info
int nr_revoke_hits;
};

-enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass);
static int scan_revoke_records(journal_t *, struct buffer_head *,
@@ -444,10 +443,10 @@ static int fc_do_one_pass(journal_t *journal,
}
jbd_debug(3, "Processing fast commit blk with seq %d",
seq);
- if (pass == PASS_REPLAY &&
- journal->j_fc_replay_callback) {
- err = journal->j_fc_replay_callback(journal,
- bh);
+ if (journal->j_fc_replay_callback) {
+ err = journal->j_fc_replay_callback(
+ journal, bh, pass,
+ next_fc_block - journal->j_first_fc);
if (err)
break;
}
@@ -849,7 +848,7 @@ static int do_one_pass(journal_t *journal,
}
}

- if (jbd2_has_feature_fast_commit(journal) && pass == PASS_REPLAY)
+ if (jbd2_has_feature_fast_commit(journal) && pass != PASS_REVOKE)
fc_do_one_pass(journal, info, pass);

if (block_error && success == 0)
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5362777d06f8..000363d994bb 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -759,6 +759,8 @@ jbd2_time_diff(unsigned long start, unsigned long end)

#define JBD2_NR_BATCH 64

+enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
+
/**
* struct journal_s - The journal_s type is the concrete type associated with
* journal_t.
@@ -1240,7 +1242,8 @@ struct journal_s
* the journal.
*/
int (*j_fc_replay_callback)(struct journal_s *journal,
- struct buffer_head *bh);
+ struct buffer_head *bh,
+ enum passtype pass, int off);
/**
* @j_fc_cleanup_callback:
*
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 8ef67b61d54a..9aef10c8e16d 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,28 @@ TRACE_EVENT(ext4_error,
__entry->function, __entry->line)
);

+TRACE_EVENT(ext4_journal_fc_replay_scan,
+ TP_PROTO(struct super_block *sb, int error, int off),
+
+ TP_ARGS(sb, error, off),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(int, error)
+ __field(int, off)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = sb->s_dev;
+ __entry->error = error;
+ __entry->off = off;
+ ),
+
+ TP_printk("FC scan pass on dev %d,%d: error %d, off %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->error, __entry->off)
+);
+
TRACE_EVENT(ext4_journal_fc_commit_cb_start,
TP_PROTO(struct super_block *sb),

diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 2310b259329f..af78bacdae83 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -233,9 +233,9 @@ TRACE_EVENT(jbd2_handle_stats,

TRACE_EVENT(jbd2_run_stats,
TP_PROTO(dev_t dev, unsigned long tid,
- struct transaction_run_stats_s *stats),
+ struct transaction_run_stats_s *stats, bool fc),

- TP_ARGS(dev, tid, stats),
+ TP_ARGS(dev, tid, stats, fc),

TP_STRUCT__entry(
__field( dev_t, dev )
@@ -249,6 +249,7 @@ TRACE_EVENT(jbd2_run_stats,
__field( __u32, handle_count )
__field( __u32, blocks )
__field( __u32, blocks_logged )
+ __field( bool, fc )
),

TP_fast_assign(
@@ -263,11 +264,13 @@ TRACE_EVENT(jbd2_run_stats,
__entry->handle_count = stats->rs_handle_count;
__entry->blocks = stats->rs_blocks;
__entry->blocks_logged = stats->rs_blocks_logged;
+ __entry->fc = fc;
),

- TP_printk("dev %d,%d tid %lu wait %u request_delay %u running %u "
+ TP_printk("%s commit, dev %d,%d tid %lu wait %u request_delay %u running %u "
"locked %u flushing %u logging %u handle_count %u "
"blocks %u blocks_logged %u",
+ __entry->fc ? "fast" : "full",
MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
jiffies_to_msecs(__entry->wait),
jiffies_to_msecs(__entry->request_delay),
--
2.23.0.rc1.153.gdeed80330f-goog
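
For readers following the scan pass in ext4_journal_fc_replay_scan()
above, the offset and subtid bookkeeping can be modelled in plain C.
This is only a sketch, not the kernel code: the status codes, the
struct and the function name are made up for illustration, and the
magic, feature and checksum checks are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model-only status codes; the patch uses -EPROTO and -EFSCORRUPTED. */
enum fc_status { FC_OK = 0, FC_INCOMPLETE = -1, FC_CORRUPT = -2 };

struct fc_scan_state {
	int expected_off;        /* next block offset the scan expects */
	uint32_t current_subtid; /* subtid of the commit being scanned */
	int error;               /* FC_OK only once a "last" block was seen */
};

/* Feed one fast commit block to the scan; returns 0 or an error code. */
static int fc_scan_block(struct fc_scan_state *st, int off,
			 uint32_t subtid, bool is_last)
{
	if (off != st->expected_off) {
		st->error = FC_CORRUPT;
		return st->error;
	}

	/* A hard error recorded for an earlier block ends the scan. */
	if (st->error && st->error != FC_INCOMPLETE)
		return st->error;

	if (st->expected_off == 0) {
		/* First block: incomplete until a last block shows up. */
		st->current_subtid = subtid;
		st->error = FC_INCOMPLETE;
	}

	if (st->error != FC_OK) {
		/* Still inside one commit: the subtid must not change. */
		if (subtid != st->current_subtid) {
			st->error = FC_CORRUPT;
			return st->error;
		}
	} else {
		/* Previous commit was closed: the subtid must increase. */
		if (subtid <= st->current_subtid) {
			st->error = FC_CORRUPT;
			return st->error;
		}
		st->current_subtid = subtid;
	}

	if (is_last)
		st->error = FC_OK;

	st->expected_off++;
	return 0;
}
```

Starting from a zeroed state, feeding the blocks (off 0, subtid 5, not
last), (off 1, subtid 5, last) and (off 2, subtid 6, last) returns 0
each time and leaves the error at FC_OK. A further block at off 3 that
repeats subtid 6 is rejected as FC_CORRUPT.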

2019-08-09 03:48:47

by harshad shirwadkar

Subject: [PATCH v2 03/12] jbd2: fast commit setup and enable

This patch allows file systems to turn fast commits on. Doing so
restricts the normal journalling space to the total number of journal
blocks minus JBD2_FAST_COMMIT_BLOCKS. No fast commits are performed yet;
this patch only opens the interface for turning them on.

Signed-off-by: Harshad Shirwadkar <[email protected]>
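
The split described above can be illustrated with a small userspace
sketch that mirrors the arithmetic in the journal_reset() hunk below.
The struct, the function name and the value of FC_BLOCKS_ASSUMED are
assumptions made for this example; the real value of
JBD2_FAST_COMMIT_BLOCKS is defined elsewhere in the series.

```c
#include <stdbool.h>

/* Assumed value, for illustration only. */
#define FC_BLOCKS_ASSUMED 128u

struct journal_layout {
	unsigned long first, last;       /* normal journalling area */
	unsigned long first_fc, last_fc; /* fast commit area, 0 if unused */
	unsigned long free_blocks;       /* free blocks in the normal area */
};

/* Carve the fast commit area off the tail of the journal when enabled. */
static struct journal_layout layout_journal(unsigned long first,
					    unsigned long last,
					    bool fast_commit)
{
	struct journal_layout l = { .first = first };

	if (fast_commit) {
		l.last_fc = last;
		l.last = last - FC_BLOCKS_ASSUMED;
		l.first_fc = l.last + 1;
	} else {
		l.last = last;
	}
	l.free_blocks = l.last - l.first;
	return l;
}
```

With first = 1, last = 1024 and the assumed 128 reserved blocks, the
normal area ends at block 896 and the fast commit area covers blocks
897 to 1024. With fast commits off, the normal area keeps all blocks up
to 1024.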

---

Changelog:

V2: No changes since V1
---
fs/ext4/super.c | 3 ++-
fs/jbd2/journal.c | 39 ++++++++++++++++++++++++++++++++-------
fs/ocfs2/journal.c | 4 ++--
include/linux/jbd2.h | 2 +-
4 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e376ac040cce..81c3ec165822 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4933,7 +4933,8 @@ static int ext4_load_journal(struct super_block *sb,
if (save)
memcpy(save, ((char *) es) +
EXT4_S_ERR_START, EXT4_S_ERR_LEN);
- err = jbd2_journal_load(journal);
+ err = jbd2_journal_load(journal,
+ test_opt2(sb, JOURNAL_FAST_COMMIT));
if (save)
memcpy(((char *) es) + EXT4_S_ERR_START,
save, EXT4_S_ERR_LEN);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 953990eb70a9..59ad709154a3 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1159,12 +1159,15 @@ static journal_t *journal_init_common(struct block_device *bdev,
journal->j_blk_offset = start;
journal->j_maxlen = len;
n = journal->j_blocksize / sizeof(journal_block_tag_t);
- journal->j_wbufsize = n;
+ journal->j_wbufsize = n - JBD2_FAST_COMMIT_BLOCKS;
journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
GFP_KERNEL);
if (!journal->j_wbuf)
goto err_cleanup;

+ journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
+ journal->j_fc_wbufsize = JBD2_FAST_COMMIT_BLOCKS;
+
bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
if (!bh) {
pr_err("%s: Cannot get buffer for journal superblock\n",
@@ -1297,11 +1300,19 @@ static int journal_reset(journal_t *journal)
}

journal->j_first = first;
- journal->j_last = last;

- journal->j_head = first;
- journal->j_tail = first;
- journal->j_free = last - first;
+ if (jbd2_has_feature_fast_commit(journal)) {
+ journal->j_last_fc = last;
+ journal->j_last = last - JBD2_FAST_COMMIT_BLOCKS;
+ journal->j_first_fc = journal->j_last + 1;
+ journal->j_fc_off = 0;
+ } else {
+ journal->j_last = last;
+ }
+
+ journal->j_head = journal->j_first;
+ journal->j_tail = journal->j_first;
+ journal->j_free = journal->j_last - journal->j_first;

journal->j_tail_sequence = journal->j_transaction_sequence;
journal->j_commit_sequence = journal->j_transaction_sequence - 1;
@@ -1626,9 +1637,17 @@ static int load_superblock(journal_t *journal)
journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
journal->j_tail = be32_to_cpu(sb->s_start);
journal->j_first = be32_to_cpu(sb->s_first);
- journal->j_last = be32_to_cpu(sb->s_maxlen);
journal->j_errno = be32_to_cpu(sb->s_errno);

+ if (jbd2_has_feature_fast_commit(journal)) {
+ journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
+ journal->j_last = journal->j_last_fc - JBD2_FAST_COMMIT_BLOCKS;
+ journal->j_first_fc = journal->j_last + 1;
+ journal->j_fc_off = 0;
+ } else {
+ journal->j_last = be32_to_cpu(sb->s_maxlen);
+ }
+
return 0;
}

@@ -1641,7 +1660,7 @@ static int load_superblock(journal_t *journal)
* a journal, read the journal from disk to initialise the in-memory
* structures.
*/
-int jbd2_journal_load(journal_t *journal)
+int jbd2_journal_load(journal_t *journal, bool enable_fc)
{
int err;
journal_superblock_t *sb;
@@ -1684,6 +1703,12 @@ int jbd2_journal_load(journal_t *journal)
return -EFSCORRUPTED;
}

+ if (enable_fc)
+ jbd2_journal_set_features(journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
+ else
+ jbd2_journal_clear_features(journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
/* OK, we've finished with the dynamic journal bits:
* reinitialise the dynamic contents of the superblock in memory
* and reset them on disk. */
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 930e3d388579..3b4d91b16e8e 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -1057,7 +1057,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed)

osb = journal->j_osb;

- status = jbd2_journal_load(journal->j_journal);
+ status = jbd2_journal_load(journal->j_journal, false);
if (status < 0) {
mlog(ML_ERROR, "Failed to load journal!\n");
goto done;
@@ -1642,7 +1642,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
goto done;
}

- status = jbd2_journal_load(journal);
+ status = jbd2_journal_load(journal, false);
if (status < 0) {
mlog_errno(status);
if (!igrab(inode))
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 9a750b732241..153840b422cc 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1476,7 +1476,7 @@ extern int jbd2_journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern void jbd2_journal_clear_features
(journal_t *, unsigned long, unsigned long, unsigned long);
-extern int jbd2_journal_load (journal_t *journal);
+extern int jbd2_journal_load(journal_t *journal, bool enable_fc);
extern int jbd2_journal_destroy (journal_t *);
extern int jbd2_journal_recover (journal_t *journal);
extern int jbd2_journal_wipe (journal_t *, int);
--
2.23.0.rc1.153.gdeed80330f-goog

2019-08-09 19:42:20

by Andreas Dilger

Subject: Re: [PATCH v2 01/12] ext4: add handling for extended mount options

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> We are running out of mount option bits. This patch adds handling for
> using s_mount_opt2 and also adds the ability to turn the fast commit
> feature on or off. In order to use fast commits, a new version of
> e2fsprogs needs to set the fast commit feature flag. This also makes
> sure that we have fast commit compatible e2fsprogs before starting to
> use the feature. The mount flag "no_fastcommit", introduced in this
> patch, can be passed to disable the feature at mount time.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Reviewed-by: Andreas Dilger <[email protected]>

>
> ---
> Changelog:
>
> V2: No changes since V1
> ---
> fs/ext4/ext4.h | 4 ++++
> fs/ext4/super.c | 27 ++++++++++++++++++++++-----
> include/linux/jbd2.h | 5 ++++-
> 3 files changed, 30 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bf660aa7a9e0..becbda38b7db 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1146,6 +1146,8 @@ struct ext4_inode_info {
> #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM 0x00000008 /* User explicitly
> specified journal checksum */
>
> +#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT 0x00000010 /* Journal fast commit */
> +
> #define clear_opt(sb, opt) EXT4_SB(sb)->s_mount_opt &= \
> ~EXT4_MOUNT_##opt
> #define set_opt(sb, opt) EXT4_SB(sb)->s_mount_opt |= \
> @@ -1643,6 +1645,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
> #define EXT4_FEATURE_COMPAT_RESIZE_INODE 0x0010
> #define EXT4_FEATURE_COMPAT_DIR_INDEX 0x0020
> #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2 0x0200
> +#define EXT4_FEATURE_COMPAT_FAST_COMMIT 0x0400
>
> #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER 0x0001
> #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE 0x0002
> @@ -1743,6 +1746,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr, EXT_ATTR)
> EXT4_FEATURE_COMPAT_FUNCS(resize_inode, RESIZE_INODE)
> EXT4_FEATURE_COMPAT_FUNCS(dir_index, DIR_INDEX)
> EXT4_FEATURE_COMPAT_FUNCS(sparse_super2, SPARSE_SUPER2)
> +EXT4_FEATURE_COMPAT_FUNCS(fast_commit, FAST_COMMIT)
>
> EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super, SPARSE_SUPER)
> EXT4_FEATURE_RO_COMPAT_FUNCS(large_file, LARGE_FILE)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 4079605d437a..e376ac040cce 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1455,6 +1455,7 @@ enum {
> Opt_dioread_nolock, Opt_dioread_lock,
> Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> + Opt_no_fastcommit
> };
>
> static const match_table_t tokens = {
> @@ -1537,6 +1538,7 @@ static const match_table_t tokens = {
> {Opt_init_itable, "init_itable=%u"},
> {Opt_init_itable, "init_itable"},
> {Opt_noinit_itable, "noinit_itable"},
> + {Opt_no_fastcommit, "no_fastcommit"},
> {Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
> {Opt_test_dummy_encryption, "test_dummy_encryption"},
> {Opt_nombcache, "nombcache"},
> @@ -1659,6 +1661,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
> #define MOPT_NO_EXT3 0x0200
> #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
> #define MOPT_STRING 0x0400
> +#define MOPT_2 0x0800
>
> static const struct mount_opts {
> int token;
> @@ -1751,6 +1754,8 @@ static const struct mount_opts {
> {Opt_max_dir_size_kb, 0, MOPT_GTE0},
> {Opt_test_dummy_encryption, 0, MOPT_GTE0},
> {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
> + {Opt_no_fastcommit, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
> + MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
> {Opt_err, 0, 0}
> };
>
> @@ -1858,8 +1863,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> set_opt2(sb, EXPLICIT_DELALLOC);
> } else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
> set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
> - } else
> + } else if (m->mount_opt) {
> return -1;
> + }
> }
> if (m->flags & MOPT_CLEAR_ERR)
> clear_opt(sb, ERRORS_MASK);
> @@ -2027,10 +2033,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> WARN_ON(1);
> return -1;
> }
> - if (arg != 0)
> - sbi->s_mount_opt |= m->mount_opt;
> - else
> - sbi->s_mount_opt &= ~m->mount_opt;
> + if (m->flags & MOPT_2) {
> + if (arg != 0)
> + sbi->s_mount_opt2 |= m->mount_opt;
> + else
> + sbi->s_mount_opt2 &= ~m->mount_opt;
> + } else {
> + if (arg != 0)
> + sbi->s_mount_opt |= m->mount_opt;
> + else
> + sbi->s_mount_opt &= ~m->mount_opt;
> + }
> }
> return 1;
> }
> @@ -3733,6 +3746,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> #ifdef CONFIG_EXT4_FS_POSIX_ACL
> set_opt(sb, POSIX_ACL);
> #endif
> + if (ext4_has_feature_fast_commit(sb))
> + set_opt2(sb, JOURNAL_FAST_COMMIT);
> +
> /* don't forget to enable journal_csum when metadata_csum is enabled. */
> if (ext4_has_metadata_csum(sb))
> set_opt(sb, JOURNAL_CHECKSUM);
> @@ -4334,6 +4350,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
> clear_opt(sb, JOURNAL_CHECKSUM);
> clear_opt(sb, DATA_FLAGS);
> + clear_opt2(sb, JOURNAL_FAST_COMMIT);
> sbi->s_journal = NULL;
> needs_recovery = 0;
> goto no_journal;
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index df03825ad1a1..b7eed49b8ecd 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -288,6 +288,7 @@ typedef struct journal_superblock_s
> #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
> #define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008
> #define JBD2_FEATURE_INCOMPAT_CSUM_V3 0x00000010
> +#define JBD2_FEATURE_INCOMPAT_FAST_COMMIT 0x00000020
>
> /* See "journal feature predicate functions" below */
>
> @@ -298,7 +299,8 @@ typedef struct journal_superblock_s
> JBD2_FEATURE_INCOMPAT_64BIT | \
> JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
> JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
> - JBD2_FEATURE_INCOMPAT_CSUM_V3)
> + JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
> + JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
>
> #ifdef __KERNEL__
>
> @@ -1235,6 +1237,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(64bit, 64BIT)
> JBD2_FEATURE_INCOMPAT_FUNCS(async_commit, ASYNC_COMMIT)
> JBD2_FEATURE_INCOMPAT_FUNCS(csum2, CSUM_V2)
> JBD2_FEATURE_INCOMPAT_FUNCS(csum3, CSUM_V3)
> +JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit, FAST_COMMIT)
>
> /*
> * Journal flag definitions
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 20:02:45

by Andreas Dilger

Subject: Re: [PATCH v2 03/12] jbd2: fast commit setup and enable

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> This patch allows file systems to turn fast commits on. Doing so
> restricts the normal journalling space to the total number of journal
> blocks minus JBD2_FAST_COMMIT_BLOCKS. No fast commits are performed yet;
> this patch only opens the interface for turning them on.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>
>
> ---
>
> Changelog:
>
> V2: No changes since V1
> ---
> fs/ext4/super.c | 3 ++-
> fs/jbd2/journal.c | 39 ++++++++++++++++++++++++++++++++-------
> fs/ocfs2/journal.c | 4 ++--
> include/linux/jbd2.h | 2 +-
> 4 files changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index e376ac040cce..81c3ec165822 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4933,7 +4933,8 @@ static int ext4_load_journal(struct super_block *sb,
> if (save)
> memcpy(save, ((char *) es) +
> EXT4_S_ERR_START, EXT4_S_ERR_LEN);
> - err = jbd2_journal_load(journal);
> + err = jbd2_journal_load(journal,
> + test_opt2(sb, JOURNAL_FAST_COMMIT));
> if (save)
> memcpy(((char *) es) + EXT4_S_ERR_START,
> save, EXT4_S_ERR_LEN);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 953990eb70a9..59ad709154a3 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1159,12 +1159,15 @@ static journal_t *journal_init_common(struct block_device *bdev,
> journal->j_blk_offset = start;
> journal->j_maxlen = len;
> n = journal->j_blocksize / sizeof(journal_block_tag_t);
> - journal->j_wbufsize = n;
> + journal->j_wbufsize = n - JBD2_FAST_COMMIT_BLOCKS;

The reservation of the JBD2_FAST_COMMIT_BLOCKS should only be done in
the case of the FAST_COMMIT feature being enabled. Otherwise it can
hurt performance for filesystems where this feature is not enabled.
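
A minimal sketch of what such a conditional reservation could look
like, written as standalone C rather than as a change to
journal_init_common(). The struct, the function name and the value of
FC_BLOCKS_ASSUMED are hypothetical, and whether the feature bit is
already known at that point in journal_init_common() is not shown in
the patch, so it is passed in as a plain flag here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed value, for illustration only. */
#define FC_BLOCKS_ASSUMED 128u

struct wbuf_split {
	size_t wbufsize;    /* slots left for normal commits */
	size_t fc_wbufsize; /* slots reserved for fast commits */
	size_t fc_offset;   /* index of the first fast commit slot */
};

/* Reserve fast commit slots only when the feature is enabled. */
static struct wbuf_split split_wbuf(size_t n, bool fast_commit)
{
	struct wbuf_split s = { .wbufsize = n, .fc_wbufsize = 0,
				.fc_offset = n };

	if (fast_commit && n > FC_BLOCKS_ASSUMED) {
		s.wbufsize = n - FC_BLOCKS_ASSUMED;
		s.fc_wbufsize = FC_BLOCKS_ASSUMED;
		s.fc_offset = s.wbufsize;
	}
	return s;
}
```

With n = 341 slots, a journal without the feature keeps all 341 slots
for normal commits, while one with the feature keeps 213 and reserves
the last 128.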

Cheers, Andreas

> journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
> GFP_KERNEL);
> if (!journal->j_wbuf)
> goto err_cleanup;
>
> + journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
> + journal->j_fc_wbufsize = JBD2_FAST_COMMIT_BLOCKS;
> +
> bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
> if (!bh) {
> pr_err("%s: Cannot get buffer for journal superblock\n",
> @@ -1297,11 +1300,19 @@ static int journal_reset(journal_t *journal)
> }
>
> journal->j_first = first;
> - journal->j_last = last;
>
> - journal->j_head = first;
> - journal->j_tail = first;
> - journal->j_free = last - first;
> + if (jbd2_has_feature_fast_commit(journal)) {
> + journal->j_last_fc = last;
> + journal->j_last = last - JBD2_FAST_COMMIT_BLOCKS;
> + journal->j_first_fc = journal->j_last + 1;
> + journal->j_fc_off = 0;
> + } else {
> + journal->j_last = last;
> + }
> +
> + journal->j_head = journal->j_first;
> + journal->j_tail = journal->j_first;
> + journal->j_free = journal->j_last - journal->j_first;
>
> journal->j_tail_sequence = journal->j_transaction_sequence;
> journal->j_commit_sequence = journal->j_transaction_sequence - 1;
> @@ -1626,9 +1637,17 @@ static int load_superblock(journal_t *journal)
> journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
> journal->j_tail = be32_to_cpu(sb->s_start);
> journal->j_first = be32_to_cpu(sb->s_first);
> - journal->j_last = be32_to_cpu(sb->s_maxlen);
> journal->j_errno = be32_to_cpu(sb->s_errno);
>
> + if (jbd2_has_feature_fast_commit(journal)) {
> + journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
> + journal->j_last = journal->j_last_fc - JBD2_FAST_COMMIT_BLOCKS;
> + journal->j_first_fc = journal->j_last + 1;
> + journal->j_fc_off = 0;
> + } else {
> + journal->j_last = be32_to_cpu(sb->s_maxlen);
> + }
> +
> return 0;
> }
>
> @@ -1641,7 +1660,7 @@ static int load_superblock(journal_t *journal)
> * a journal, read the journal from disk to initialise the in-memory
> * structures.
> */
> -int jbd2_journal_load(journal_t *journal)
> +int jbd2_journal_load(journal_t *journal, bool enable_fc)
> {
> int err;
> journal_superblock_t *sb;
> @@ -1684,6 +1703,12 @@ int jbd2_journal_load(journal_t *journal)
> return -EFSCORRUPTED;
> }
>
> + if (enable_fc)
> + jbd2_journal_set_features(journal, 0, 0,
> + JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> + else
> + jbd2_journal_clear_features(journal, 0, 0,
> + JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> /* OK, we've finished with the dynamic journal bits:
> * reinitialise the dynamic contents of the superblock in memory
> * and reset them on disk. */
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index 930e3d388579..3b4d91b16e8e 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -1057,7 +1057,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed)
>
> osb = journal->j_osb;
>
> - status = jbd2_journal_load(journal->j_journal);
> + status = jbd2_journal_load(journal->j_journal, false);
> if (status < 0) {
> mlog(ML_ERROR, "Failed to load journal!\n");
> goto done;
> @@ -1642,7 +1642,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
> goto done;
> }
>
> - status = jbd2_journal_load(journal);
> + status = jbd2_journal_load(journal, false);
> if (status < 0) {
> mlog_errno(status);
> if (!igrab(inode))
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 9a750b732241..153840b422cc 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1476,7 +1476,7 @@ extern int jbd2_journal_set_features
> (journal_t *, unsigned long, unsigned long, unsigned long);
> extern void jbd2_journal_clear_features
> (journal_t *, unsigned long, unsigned long, unsigned long);
> -extern int jbd2_journal_load (journal_t *journal);
> +extern int jbd2_journal_load(journal_t *journal, bool enable_fc);
> extern int jbd2_journal_destroy (journal_t *);
> extern int jbd2_journal_recover (journal_t *journal);
> extern int jbd2_journal_wipe (journal_t *, int);
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 20:23:52

by Andreas Dilger

Subject: Re: [PATCH v2 04/12] jbd2: fast-commit commit path changes

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> This patch adds core fast-commit commit path changes. This patch also
> modifies existing JBD2 APIs to allow usage of fast commits. If fast
> commits are enabled and journal->j_do_full_commit is not set, the
> commit routine tries the file system specific fast commit first. Only
> if it fails, it falls back to the full commit. Commit start and wait
> APIs now take an additional argument which indicates if fast commits
> are allowed or not.
>
> In this patch we also add a new entry to journal->stats which counts
> the number of fast commits performed.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

It would be better to rename the existing function to something like
jbd2_log_start_commit_full() and add wrappers jbd2_log_start_commit()
and jbd2_log_start_commit_fast(), to avoid changing all of the
callsites to add the same parameter:

int jbd2_log_start_commit_fast(journal_t *journal, tid_t tid)
{
	return jbd2_log_start_commit_full(journal, tid, false);
}
EXPORT_SYMBOL(jbd2_log_start_commit_fast);

int jbd2_log_start_commit(journal_t *journal, tid_t tid)
{
	return jbd2_log_start_commit_full(journal, tid, true);
}
EXPORT_SYMBOL(jbd2_log_start_commit);

That makes it much clearer what is being done at the few callsites that
need the "fast" variant, unlike a bare "true" or "false" argument whose
meaning is not obvious at the call site.
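
To make the idea concrete, here is a minimal userspace sketch of the
wrapper pattern. It is only an illustration: mock_journal_t, mock_tid_t
and the return convention (1 for a new request, 0 for a repeated one)
are stand-ins invented for this example, not the real JBD2 definitions.
The one behavior taken from the patch is that a pending full-commit
request is never downgraded by a later fast request.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; these are not the real JBD2 types. */
typedef unsigned int mock_tid_t;

typedef struct mock_journal {
	bool       j_do_full_commit;  /* sticky "full commit requested" flag */
	mock_tid_t j_commit_request;  /* tid of the most recent commit request */
} mock_journal_t;

/* Renamed worker: the only function that takes the bool. */
static int jbd2_log_start_commit_full(mock_journal_t *journal, mock_tid_t tid,
				      bool full_commit)
{
	assert(journal);

	/* If someone already asked for a full commit, honor it. */
	if (!journal->j_do_full_commit)
		journal->j_do_full_commit = full_commit;

	/* Mock convention: 0 if this tid was already requested, else 1. */
	if (journal->j_commit_request == tid)
		return 0;
	journal->j_commit_request = tid;
	return 1;
}

/* Callers that can tolerate a fast commit say so by name. */
int jbd2_log_start_commit_fast(mock_journal_t *journal, mock_tid_t tid)
{
	return jbd2_log_start_commit_full(journal, tid, false);
}

/* Existing callers keep the old two-argument form and get a full commit. */
int jbd2_log_start_commit(mock_journal_t *journal, mock_tid_t tid)
{
	return jbd2_log_start_commit_full(journal, tid, true);
}
```

With this shape the existing callsites stay untouched, and only the
callers that want the fast path need to change.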

Cheers, Andreas

> ---
>
> Changelog:
>
> V2: JBD2 commit routine passes stats to the fast commit callback. Also,
> added a new entry to journal->stats and its tracking.
> ---
> fs/ext4/super.c | 2 +-
> fs/jbd2/checkpoint.c | 2 +-
> fs/jbd2/commit.c | 47 +++++++++++++++++++++++--
> fs/jbd2/journal.c | 81 +++++++++++++++++++++++++++++++++++--------
> fs/jbd2/transaction.c | 6 ++--
> fs/ocfs2/alloc.c | 2 +-
> fs/ocfs2/super.c | 2 +-
> include/linux/jbd2.h | 9 +++--
> 8 files changed, 124 insertions(+), 27 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 81c3ec165822..6bab59ae81f7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5148,7 +5148,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
> !jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
> needs_barrier = true;
>
> - if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
> + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
> if (wait)
> ret = jbd2_log_wait_commit(sbi->s_journal,
> target);
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index a1909066bde6..6297978ae3bc 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -277,7 +277,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)
>
> if (batch_count)
> __flush_batch(journal, &batch_count);
> - jbd2_log_start_commit(journal, tid);
> + jbd2_log_start_commit(journal, tid, true);
> /*
> * jbd2_journal_commit_transaction() may want
> * to take the checkpoint_mutex if JBD2_FLUSHED
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 132fb92098c7..9281814606e7 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -351,8 +351,12 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
> *
> * The primary function for committing a transaction to the log. This
> * function is called by the journal thread to begin a complete commit.
> + *
> + * fc is input / output parameter. If fc is non-null and is set to true, this
> + * function tries to perform fast commit. If the fast commit is successfully
> + * performed, *fc is set to true.
> */
> -void jbd2_journal_commit_transaction(journal_t *journal)
> +void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
> {
> struct transaction_stats_s stats;
> transaction_t *commit_transaction;
> @@ -380,6 +384,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> tid_t first_tid;
> int update_tail;
> int csum_size = 0;
> + bool full_commit;
> LIST_HEAD(io_bufs);
> LIST_HEAD(log_bufs);
>
> @@ -413,6 +418,40 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> J_ASSERT(journal->j_running_transaction != NULL);
> J_ASSERT(journal->j_committing_transaction == NULL);
>
> + read_lock(&journal->j_state_lock);
> + full_commit = journal->j_do_full_commit;
> + read_unlock(&journal->j_state_lock);
> +
> + /* Let file-system try its own fast commit */
> + if (jbd2_has_feature_fast_commit(journal)) {
> + if (!full_commit && fc && *fc == true &&
> + journal->j_fc_commit_callback &&
> + !journal->j_fc_commit_callback(
> + journal, journal->j_running_transaction->t_tid,
> + journal->j_subtid, &stats.run)) {
> + jbd_debug(3, "fast commit success.\n");
> + if (journal->j_fc_cleanup_callback)
> + journal->j_fc_cleanup_callback(journal);
> + write_lock(&journal->j_state_lock);
> + journal->j_subtid++;
> + if (fc)
> + *fc = true;
> + write_unlock(&journal->j_state_lock);
> + goto update_overall_stats;
> + }
> + if (journal->j_fc_cleanup_callback)
> + journal->j_fc_cleanup_callback(journal);
> + write_lock(&journal->j_state_lock);
> + journal->j_fc_off = 0;
> + journal->j_subtid = 0;
> + journal->j_do_full_commit = false;
> + write_unlock(&journal->j_state_lock);
> + }
> +
> + jbd_debug(3, "fast commit not performed, trying full.\n");
> + if (fc)
> + *fc = false;
> +
> commit_transaction = journal->j_running_transaction;
>
> trace_jbd2_start_commit(journal, commit_transaction);
> @@ -1129,8 +1168,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> /*
> * Calculate overall stats
> */
> +update_overall_stats:
> spin_lock(&journal->j_history_lock);
> - journal->j_stats.ts_tid++;
> + if (fc && *fc == true)
> + journal->j_stats.ts_num_fast_commits++;
> + else
> + journal->j_stats.ts_tid++;
> journal->j_stats.ts_requested += stats.ts_requested;
> journal->j_stats.run.rs_wait += stats.run.rs_wait;
> journal->j_stats.run.rs_request_delay += stats.run.rs_request_delay;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 59ad709154a3..ab05e47ed2d4 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -160,7 +160,13 @@ static void commit_timeout(struct timer_list *t)
> *
> * 1) COMMIT: Every so often we need to commit the current state of the
> * filesystem to disk. The journal thread is responsible for writing
> - * all of the metadata buffers to disk.
> + * all of the metadata buffers to disk. If fast commits are allowed,
> + * journal thread passes the control to the file system and file system
> + * is then responsible for writing metadata buffers to disk (in whichever
> + * format it wants). If fast commit succeeds, journal thread won't perform
> + * a normal commit. In case the fast commit fails, journal thread performs
> + * full commit as normal.
> + *
> *
> * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
> * of the data in that part of the log has been rewritten elsewhere on
> @@ -172,6 +178,7 @@ static int kjournald2(void *arg)
> {
> journal_t *journal = arg;
> transaction_t *transaction;
> + bool fc_flag = true, fc_flag_save;
>
> /*
> * Set up an interval timer which can be used to trigger a commit wakeup
> @@ -209,9 +216,14 @@ static int kjournald2(void *arg)
> jbd_debug(1, "OK, requests differ\n");
> write_unlock(&journal->j_state_lock);
> del_timer_sync(&journal->j_commit_timer);
> - jbd2_journal_commit_transaction(journal);
> + fc_flag_save = fc_flag;
> + jbd2_journal_commit_transaction(journal, &fc_flag);
> write_lock(&journal->j_state_lock);
> - goto loop;
> + if (!fc_flag) {
> + /* fast commit not performed */
> + fc_flag = fc_flag_save;
> + goto loop;
> + }
> }
>
> wake_up(&journal->j_wait_done_commit);
> @@ -235,16 +247,18 @@ static int kjournald2(void *arg)
>
> prepare_to_wait(&journal->j_wait_commit, &wait,
> TASK_INTERRUPTIBLE);
> - if (journal->j_commit_sequence != journal->j_commit_request)
> + if (!fc_flag &&
> + journal->j_commit_sequence != journal->j_commit_request)
> should_sleep = 0;
> transaction = journal->j_running_transaction;
> if (transaction && time_after_eq(jiffies,
> - transaction->t_expires))
> + transaction->t_expires))
> should_sleep = 0;
> if (journal->j_flags & JBD2_UNMOUNT)
> should_sleep = 0;
> if (should_sleep) {
> write_unlock(&journal->j_state_lock);
> + jbd_debug(1, "%s sleeps\n", __func__);
> schedule();
> write_lock(&journal->j_state_lock);
> }
> @@ -259,7 +273,10 @@ static int kjournald2(void *arg)
> transaction = journal->j_running_transaction;
> if (transaction && time_after_eq(jiffies, transaction->t_expires)) {
> journal->j_commit_request = transaction->t_tid;
> + fc_flag = false;
> jbd_debug(1, "woke because of timeout\n");
> + } else {
> + fc_flag = true;
> }
> goto loop;
>
> @@ -517,11 +534,17 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
> return 0;
> }
>
> -int jbd2_log_start_commit(journal_t *journal, tid_t tid)
> +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit)
> {
> int ret;
>
> write_lock(&journal->j_state_lock);
> + /*
> + * If someone has already requested a full commit,
> + * we have to honor it.
> + */
> + if (!journal->j_do_full_commit)
> + journal->j_do_full_commit = full_commit;
> ret = __jbd2_log_start_commit(journal, tid);
> write_unlock(&journal->j_state_lock);
> return ret;
> @@ -556,7 +579,7 @@ static int __jbd2_journal_force_commit(journal_t *journal)
> tid = transaction->t_tid;
> read_unlock(&journal->j_state_lock);
> if (need_to_start)
> - jbd2_log_start_commit(journal, tid);
> + jbd2_log_start_commit(journal, tid, true);
> ret = jbd2_log_wait_commit(journal, tid);
> if (!ret)
> ret = 1;
> @@ -603,11 +626,14 @@ int jbd2_journal_force_commit(journal_t *journal)
> * if a transaction is going to be committed (or is currently already
> * committing), and fills its tid in at *ptid
> */
> -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
> +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool full_commit)
> {
> int ret = 0;
>
> write_lock(&journal->j_state_lock);
> + if (!journal->j_do_full_commit)
> + journal->j_do_full_commit = full_commit;
> +
> if (journal->j_running_transaction) {
> tid_t tid = journal->j_running_transaction->t_tid;
>
> @@ -675,7 +701,7 @@ EXPORT_SYMBOL(jbd2_trans_will_send_data_barrier);
> * Wait for a specified commit to complete.
> * The caller may not hold the journal lock.
> */
> -int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> +int __jbd2_log_wait_commit(journal_t *journal, tid_t tid, tid_t subtid)
> {
> int err = 0;
>
> @@ -702,12 +728,25 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> }
> #endif
> while (tid_gt(tid, journal->j_commit_sequence)) {
> - jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
> - tid, journal->j_commit_sequence);
> + if ((!journal->j_do_full_commit) &&
> + !tid_geq(subtid, journal->j_subtid))
> + break;
> + jbd_debug(1, "JBD2: want full commit %u %s %u, ",
> + tid, journal->j_do_full_commit ?
> + "and ignoring fast commit request for " :
> + "or want fast commit",
> + journal->j_subtid);
> + jbd_debug(1, "j_commit_sequence=%u, j_subtid=%u\n",
> + journal->j_commit_sequence, journal->j_subtid);
> read_unlock(&journal->j_state_lock);
> wake_up(&journal->j_wait_commit);
> - wait_event(journal->j_wait_done_commit,
> - !tid_gt(tid, journal->j_commit_sequence));
> + if (journal->j_do_full_commit)
> + wait_event(journal->j_wait_done_commit,
> + !tid_gt(tid, journal->j_commit_sequence));
> + else
> + wait_event(journal->j_wait_done_commit,
> + !tid_gt(tid, journal->j_commit_sequence) ||
> + !tid_geq(subtid, journal->j_subtid));
> read_lock(&journal->j_state_lock);
> }
> read_unlock(&journal->j_state_lock);
> @@ -717,6 +756,13 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> return err;
> }
>
> +int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> +{
> + journal->j_do_full_commit = true;
> + return __jbd2_log_wait_commit(journal, tid, 0);
> +}
> +
> +
> /* Return 1 when transaction with given tid has already committed. */
> int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> {
> @@ -751,7 +797,7 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> if (journal->j_commit_request != tid) {
> /* transaction not yet started, so request it */
> read_unlock(&journal->j_state_lock);
> - jbd2_log_start_commit(journal, tid);
> + jbd2_log_start_commit(journal, tid, true);
> goto wait_commit;
> }
> } else if (!(journal->j_committing_transaction &&
> @@ -996,6 +1042,8 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
> "each up to %u blocks\n",
> s->stats->ts_tid, s->stats->ts_requested,
> s->journal->j_max_transaction_buffers);
> + seq_printf(seq, "%lu fast commits performed\n",
> + s->stats->ts_num_fast_commits);
> if (s->stats->ts_tid == 0)
> return 0;
> seq_printf(seq, "average: \n %ums waiting for transaction\n",
> @@ -1020,6 +1068,9 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
> s->stats->run.rs_blocks / s->stats->ts_tid);
> seq_printf(seq, " %lu logged blocks per transaction\n",
> s->stats->run.rs_blocks_logged / s->stats->ts_tid);
> + seq_printf(seq, " %lu logged blocks per commit\n",
> + s->stats->run.rs_blocks_logged /
> + (s->stats->ts_tid + s->stats->ts_num_fast_commits));
> return 0;
> }
>
> @@ -1741,7 +1792,7 @@ int jbd2_journal_destroy(journal_t *journal)
>
> /* Force a final log commit */
> if (journal->j_running_transaction)
> - jbd2_journal_commit_transaction(journal);
> + jbd2_journal_commit_transaction(journal, NULL);
>
> /* Force any old transactions to disk */
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 990e7b5062e7..87f6627d78aa 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -154,7 +154,7 @@ static void wait_transaction_locked(journal_t *journal)
> need_to_start = !tid_geq(journal->j_commit_request, tid);
> read_unlock(&journal->j_state_lock);
> if (need_to_start)
> - jbd2_log_start_commit(journal, tid);
> + jbd2_log_start_commit(journal, tid, true);
> jbd2_might_wait_for_commit(journal);
> schedule();
> finish_wait(&journal->j_wait_transaction_locked, &wait);
> @@ -708,7 +708,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
> need_to_start = !tid_geq(journal->j_commit_request, tid);
> read_unlock(&journal->j_state_lock);
> if (need_to_start)
> - jbd2_log_start_commit(journal, tid);
> + jbd2_log_start_commit(journal, tid, true);
>
> rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> handle->h_buffer_credits = nblocks;
> @@ -1822,7 +1822,7 @@ int jbd2_journal_stop(handle_t *handle)
> jbd_debug(2, "transaction too old, requesting commit for "
> "handle %p\n", handle);
> /* This is non-blocking */
> - jbd2_log_start_commit(journal, transaction->t_tid);
> + jbd2_log_start_commit(journal, transaction->t_tid, true);
>
> /*
> * Special case: JBD2_SYNC synchronous updates require us
> diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> index 0c335b51043d..df41c43573b7 100644
> --- a/fs/ocfs2/alloc.c
> +++ b/fs/ocfs2/alloc.c
> @@ -6117,7 +6117,7 @@ int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
> goto out;
> }
>
> - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
> + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
> jbd2_log_wait_commit(osb->journal->j_journal, target);
> ret = 1;
> }
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 8b2f39506648..60ecc51759ae 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -410,7 +410,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
> }
>
> if (jbd2_journal_start_commit(osb->journal->j_journal,
> - &target)) {
> + &target, true)) {
> if (wait)
> jbd2_log_wait_commit(osb->journal->j_journal,
> target);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 153840b422cc..535f88dff653 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -742,6 +742,7 @@ struct transaction_run_stats_s {
>
> struct transaction_stats_s {
> unsigned long ts_tid;
> + unsigned long ts_num_fast_commits;
> unsigned long ts_requested;
> struct transaction_run_stats_s run;
> };
> @@ -1364,7 +1365,8 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
>
> /* Commit management */
> -extern void jbd2_journal_commit_transaction(journal_t *);
> +extern void jbd2_journal_commit_transaction(journal_t *journal,
> + bool *full_commit);
>
> /* Checkpoint list management */
> void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy);
> @@ -1571,9 +1573,10 @@ extern void jbd2_clear_buffer_revoked_flags(journal_t *journal);
> * transitions on demand.
> */
>
> -int jbd2_log_start_commit(journal_t *journal, tid_t tid);
> +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit);
> int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
> -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
> +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid,
> + bool full_commit);
> int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
> int jbd2_transaction_committed(journal_t *journal, tid_t tid);
> int jbd2_complete_transaction(journal_t *journal, tid_t tid);
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 20:38:43

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> This patch adds new helper APIs that ext4 needs for fast
> commits. These new fast commit APIs are used by subsequent fast commit
> patches to implement fast commits. Following new APIs are added:
>
> /*
> * Returns when either a full commit or a fast commit
> * completes
> */
> int jbd2_fc_complete_commit(journal_t *journal, tid_t tid,
> tid_t subtid)
>
> /* Send all the data buffers related to an inode */
> int jbd2_submit_inode_data(journal_t *journal,
> struct jbd2_inode *jinode)
>
> /* Map one fast commit buffer for use by the file system */
> int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
>
> /* Wait on fast commit buffers to complete IO */
> int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Reviewed-by: Andreas Dilger <[email protected]>
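
For readers following along, the block selection that jbd2_map_fc_buf()
performs in the patch below can be sketched in userspace as follows.
This is a simplified model under stated assumptions: struct mock_fc_area
and mock_fc_next_block() are hypothetical names, there is no locking,
and no buffer head is involved. Only the bounds check and the offset
bump mirror the quoted code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the fast commit area [first_fc, last_fc). */
struct mock_fc_area {
	unsigned long first_fc;  /* first block of the fast commit area */
	unsigned long last_fc;   /* one past the last usable block */
	unsigned long fc_off;    /* blocks handed out since the last reset */
};

/*
 * Hand out the next free block in the area.
 * Returns 0 and stores the block number in *blocknr, or -EINVAL once
 * the area is exhausted (the same error code the patch returns).
 */
int mock_fc_next_block(struct mock_fc_area *a, unsigned long *blocknr)
{
	assert(a && blocknr);

	if (a->first_fc + a->fc_off >= a->last_fc)
		return -EINVAL;

	*blocknr = a->first_fc + a->fc_off;
	a->fc_off++;
	return 0;
}
```

In the commit path changes of patch 04/12, the offset is reset to zero
when a full commit is performed, which is what makes the area usable
again; in this model that corresponds to setting fc_off back to 0.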

> ---
>
> Changelog:
>
> V2: 1) Fixed error reported by kbuild test robot. Removed duplicate
> EXPORT_SYMBOL() call. Also, added EXPORT_SYMBOL() for the new
> APIs introduced.
> 2) Changed jbd2_submit_fc_bufs() to jbd2_wait_on_fc_bufs(). This
> lets the client file system submit JBD2 buffers at its own
> convenience.
> ---
> fs/jbd2/commit.c | 32 +++++++++++++++
> fs/jbd2/journal.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
> include/linux/jbd2.h | 6 +++
> 3 files changed, 136 insertions(+)
>
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 9281814606e7..db62a53436e3 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -202,6 +202,38 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
> return ret;
> }
>
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> +{
> + struct address_space *mapping;
> + loff_t dirty_start;
> + loff_t dirty_end;
> + int ret;
> +
> + if (!jinode)
> + return 0;
> +
> + if (!(jinode->i_flags & JI_WRITE_DATA))
> + return 0;
> +
> + dirty_start = jinode->i_dirty_start;
> + dirty_end = jinode->i_dirty_end;
> +
> + mapping = jinode->i_vfs_inode->i_mapping;
> + jinode->i_flags |= JI_COMMIT_RUNNING;
> +
> + trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> + ret = journal_submit_inode_data_buffers(mapping, dirty_start,
> + dirty_end);
> +
> + jinode->i_flags &= ~JI_COMMIT_RUNNING;
> + /* Protect JI_COMMIT_RUNNING flag */
> + smp_mb();
> + wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(jbd2_submit_inode_data);
> +
> /*
> * Submit all the data buffers of inode associated with the transaction to
> * disk.
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index ab05e47ed2d4..1e15804b2c3c 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -811,6 +811,33 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> }
> EXPORT_SYMBOL(jbd2_complete_transaction);
>
> +int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid)
> +{
> + int need_to_wait = 1;
> +
> + read_lock(&journal->j_state_lock);
> + if (journal->j_running_transaction &&
> + journal->j_running_transaction->t_tid == tid) {
> + /* Check if fast commit was already done */
> + if (journal->j_subtid > subtid)
> + need_to_wait = 0;
> + if (journal->j_commit_request != tid) {
> + /* transaction not yet started, so request it */
> + read_unlock(&journal->j_state_lock);
> + jbd2_log_start_commit(journal, tid, false);
> + goto wait_commit;
> + }
> + } else if (!(journal->j_committing_transaction &&
> + journal->j_committing_transaction->t_tid == tid))
> + need_to_wait = 0;
> + read_unlock(&journal->j_state_lock);
> + if (!need_to_wait)
> + return 0;
> +wait_commit:
> + return __jbd2_log_wait_commit(journal, tid, subtid);
> +}
> +EXPORT_SYMBOL(jbd2_fc_complete_commit);
> +
> /*
> * Log buffer allocation routines:
> */
> @@ -831,6 +858,77 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
> return jbd2_journal_bmap(journal, blocknr, retp);
> }
>
> +int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
> +{
> + unsigned long long pblock;
> + unsigned long blocknr;
> + int ret = 0;
> + struct buffer_head *bh;
> + int fc_off;
> + journal_header_t *jhdr;
> +
> + write_lock(&journal->j_state_lock);
> +
> + if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
> + fc_off = journal->j_fc_off;
> + blocknr = journal->j_first_fc + fc_off;
> + journal->j_fc_off++;
> + } else {
> + ret = -EINVAL;
> + }
> + write_unlock(&journal->j_state_lock);
> +
> + if (ret)
> + return ret;
> +
> + ret = jbd2_journal_bmap(journal, blocknr, &pblock);
> + if (ret)
> + return ret;
> +
> + bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
> + if (!bh)
> + return -ENOMEM;
> +
> + lock_buffer(bh);
> + jhdr = (journal_header_t *)bh->b_data;
> + jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> + jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
> + jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
> +
> + set_buffer_uptodate(bh);
> + unlock_buffer(bh);
> + journal->j_fc_wbuf[fc_off] = bh;
> +
> + *bh_out = bh;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(jbd2_map_fc_buf);
> +
> +int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
> +{
> + struct buffer_head *bh;
> + int i, j_fc_off;
> +
> + read_lock(&journal->j_state_lock);
> + j_fc_off = journal->j_fc_off;
> + read_unlock(&journal->j_state_lock);
> +
> + /*
> + * Wait in reverse order to minimize chances of us being woken up before
> + * all IOs have completed
> + */
> + for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> + bh = journal->j_fc_wbuf[i];
> + wait_on_buffer(bh);
> + if (unlikely(!buffer_uptodate(bh)))
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(jbd2_wait_on_fc_bufs);
> +
> /*
> * Conversion of logical to physical block numbers for the journal
> *
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 535f88dff653..5362777d06f8 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -124,6 +124,7 @@ typedef struct journal_s journal_t; /* Journal control structure */
> #define JBD2_SUPERBLOCK_V1 3
> #define JBD2_SUPERBLOCK_V2 4
> #define JBD2_REVOKE_BLOCK 5
> +#define JBD2_FC_BLOCK 6
>
> /*
> * Standard header for all descriptor blocks:
> @@ -1582,6 +1583,7 @@ int jbd2_transaction_committed(journal_t *journal, tid_t tid);
> int jbd2_complete_transaction(journal_t *journal, tid_t tid);
> int jbd2_log_do_checkpoint(journal_t *journal);
> int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
> +int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid);
>
> void __jbd2_log_wait_for_space(journal_t *journal);
> extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
> @@ -1732,6 +1734,10 @@ static inline tid_t jbd2_get_latest_transaction(journal_t *journal)
> return tid;
> }
>
> +int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out);
> +int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks);
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
> +
> #ifdef __KERNEL__
>
> #define buffer_trace_init(bh) do {} while (0)
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 20:58:08

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v2 06/12] jbd2: fast-commit recovery path changes


> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> This patch adds fast-commit recovery path changes for JBD2. If we find
> a valid fast commit block during the recovery phase, we call a file
> system specific routine to handle that block.
>
> We also clear the fast commit flag in jbd2_mark_journal_empty() which
> is called after successful recovery as well successful

... as well as after successful ...

> checkpointing. This allows JBD2 journal to be compatible with older
> versions when there are no fast commit blocks.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Reviewed-by: Andreas Dilger <[email protected]>
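
As a reading aid, the scan that fc_do_one_pass() performs in the patch
below can be modelled in userspace roughly like this. It is a sketch
under assumptions: struct mock_fc_hdr, mock_fc_scan() and
mock_count_cb() are hypothetical names, header fields are kept in host
byte order instead of big-endian, and blocks come from an array rather
than jread(). The stop conditions (bad magic, wrong sequence, or a
callback error) follow the quoted code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_JBD2_MAGIC 0xc03b3998U  /* value of JBD2_MAGIC_NUMBER */

/* Hypothetical in-memory view of a fast commit block header. */
struct mock_fc_hdr {
	uint32_t magic;
	uint32_t blocktype;
	uint32_t sequence;
};

/* File system specific replay hook; returns 0 on success. */
typedef int (*mock_replay_cb)(const struct mock_fc_hdr *blk, void *ctx);

/* Example hook for the sketch: counts the blocks it is given. */
int mock_count_cb(const struct mock_fc_hdr *blk, void *ctx)
{
	(void)blk;
	(*(int *)ctx)++;
	return 0;
}

/*
 * Walk the fast commit area in order and replay each block that has
 * the right magic and belongs to expected_seq. The walk stops at the
 * first block that fails either check, or when the hook fails.
 * *replayed receives the number of blocks handled successfully.
 */
int mock_fc_scan(const struct mock_fc_hdr *blocks, size_t nblocks,
		 uint32_t expected_seq, mock_replay_cb cb, void *ctx,
		 size_t *replayed)
{
	size_t i;
	int err = 0;

	assert(blocks && replayed);

	for (i = 0; i < nblocks; i++) {
		if (blocks[i].magic != MOCK_JBD2_MAGIC ||
		    blocks[i].sequence != expected_seq)
			break;
		if (cb) {
			err = cb(&blocks[i], ctx);
			if (err)
				break;
		}
	}

	*replayed = i;
	return err;
}
```

A block with a stale sequence number ends the scan, so anything after
it in the area is ignored even if it carries the expected sequence.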

>
> ---
>
> Changelog:
>
> V2: Fixed checkpatch error.
> ---
> fs/jbd2/journal.c | 12 ++++++++++
> fs/jbd2/recovery.c | 59 +++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 68 insertions(+), 3 deletions(-)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 1e15804b2c3c..ae4584a60cc3 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1604,6 +1604,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
> static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
> {
> journal_superblock_t *sb = journal->j_superblock;
> + bool had_fast_commit = false;
>
> BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
> lock_buffer(journal->j_sb_buffer);
> @@ -1617,6 +1618,14 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
>
> sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> sb->s_start = cpu_to_be32(0);
> + if (jbd2_has_feature_fast_commit(journal)) {
> + /*
> + * When journal is clean, no need to commit fast commit flag and
> + * make file system incompatible with older kernels.
> + */
> + jbd2_clear_feature_fast_commit(journal);
> + had_fast_commit = true;
> + }
>
> jbd2_write_superblock(journal, write_op);
>
> @@ -1624,6 +1633,9 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
> write_lock(&journal->j_state_lock);
> journal->j_flags |= JBD2_FLUSHED;
> write_unlock(&journal->j_state_lock);
> +
> + if (had_fast_commit)
> + jbd2_set_feature_fast_commit(journal);
> }
>
>
> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index a4967b27ffb6..3a6cd1497504 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -225,8 +225,12 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
> /* Make sure we wrap around the log correctly! */
> #define wrap(journal, var) \
> do { \
> - if (var >= (journal)->j_last) \
> - var -= ((journal)->j_last - (journal)->j_first); \
> + unsigned long _wrap_last = \
> + jbd2_has_feature_fast_commit(journal) ? \
> + (journal)->j_last_fc : (journal)->j_last; \
> + \
> + if (var >= _wrap_last) \
> + var -= (_wrap_last - (journal)->j_first); \
> } while (0)
>
> /**
> @@ -413,6 +417,49 @@ static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
> return tag->t_checksum == cpu_to_be16(csum32);
> }
>
> +static int fc_do_one_pass(journal_t *journal,
> + struct recovery_info *info, enum passtype pass)
> +{
> + unsigned int expected_commit_id = info->end_transaction;
> + unsigned long next_fc_block;
> + struct buffer_head *bh;
> + unsigned int seq;
> + journal_header_t *jhdr;
> + int err = 0;
> +
> + next_fc_block = journal->j_first_fc;
> +
> + while (next_fc_block != journal->j_last_fc) {
> + jbd_debug(3, "Fast commit replay: next block %lu\n",
> + next_fc_block);
> + err = jread(&bh, journal, next_fc_block);
> + if (err)
> + break;
> +
> + jhdr = (journal_header_t *)bh->b_data;
> + seq = be32_to_cpu(jhdr->h_sequence);
> + if (be32_to_cpu(jhdr->h_magic) != JBD2_MAGIC_NUMBER ||
> + seq != expected_commit_id) {
> + break;
> + }
> + jbd_debug(3, "Processing fast commit blk with seq %d",
> + seq);
> + if (pass == PASS_REPLAY &&
> + journal->j_fc_replay_callback) {
> + err = journal->j_fc_replay_callback(journal,
> + bh);
> + if (err)
> + break;
> + }
> + next_fc_block++;
> + }
> +
> + if (err)
> + jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
> +
> + return err;
> +}
> +
> static int do_one_pass(journal_t *journal,
> struct recovery_info *info, enum passtype pass)
> {
> @@ -470,7 +517,7 @@ static int do_one_pass(journal_t *journal,
> break;
>
> jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
> - next_commit_ID, next_log_block, journal->j_last);
> + next_commit_ID, next_log_block, journal->j_last_fc);
>
> /* Skip over each chunk of the transaction looking
> * either the next descriptor block or the final commit
> @@ -768,6 +815,8 @@ static int do_one_pass(journal_t *journal,
> if (err)
> goto failed;
> continue;
> + case JBD2_FC_BLOCK:
> + continue;
>
> default:
> jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
> @@ -799,6 +848,10 @@ static int do_one_pass(journal_t *journal,
> success = -EIO;
> }
> }
> +
> + if (jbd2_has_feature_fast_commit(journal) && pass == PASS_REPLAY)
> + fc_do_one_pass(journal, info, pass);
> +
> if (block_error && success == 0)
> success = -EIO;
> return success;
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 21:13:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs


> On Aug 9, 2019, at 2:38 PM, Andreas Dilger <[email protected]> wrote:
>
> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>>
>> This patch adds new helper APIs that ext4 needs for fast
>> commits. These new fast commit APIs are used by subsequent fast commit
>> patches to implement fast commits. Following new APIs are added:
>>
>> /*
>> * Returns when either a full commit or a fast commit
>> * completes
>> */
>> int jbd2_fc_complete_commit(journal_t *journal, tid_t tid,
>> tid_t subtid)
>>
>> /* Send all the data buffers related to an inode */
>> int jbd2_submit_inode_data(journal_t *journal,
>> struct jbd2_inode *jinode)
>>
>> /* Map one fast commit buffer for use by the file system */
>> int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
>>
>> /* Wait on fast commit buffers to complete IO */
>> int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
>>
>> Signed-off-by: Harshad Shirwadkar <[email protected]>
>>
>> +int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
>> +{
>> + unsigned long long pblock;
>> + unsigned long blocknr;
>> + int ret = 0;
>> + struct buffer_head *bh;
>> + int fc_off;
>> + journal_header_t *jhdr;
>> +
>> + write_lock(&journal->j_state_lock);
>> +
>> + if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
>> + fc_off = journal->j_fc_off;
>> + blocknr = journal->j_first_fc + fc_off;
>> + journal->j_fc_off++;
>> + } else {
>> + ret = -EINVAL;
>> + }
>> + write_unlock(&journal->j_state_lock);
>> +
>> + if (ret)
>> + return ret;
>> +
>> + ret = jbd2_journal_bmap(journal, blocknr, &pblock);
>> + if (ret)
>> + return ret;
>> +
>> + bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
>> + if (!bh)
>> + return -ENOMEM;
>> +
>> + lock_buffer(bh);
>> + jhdr = (journal_header_t *)bh->b_data;
>> + jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
>> + jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
>> + jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
>> +
>> + set_buffer_uptodate(bh);
>> + unlock_buffer(bh);
>> + journal->j_fc_wbuf[fc_off] = bh;
>> +
>> + *bh_out = bh;
>> +
>> + return 0;
>> +}
>> +EXPORT_SYMBOL(jbd2_map_fc_buf);

One question about this function. It seems that it is called for every
commit by ext4_journal_fc_commit_cb(). Why does it need to map the fast
journal commit blocks on every call? It would make more sense to map the
blocks once at initialization time and then just re-use them on each call.

Cheers, Andreas







2019-08-09 21:20:46

by harshad shirwadkar

Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

On Fri, Aug 9, 2019 at 2:11 PM Andreas Dilger <[email protected]> wrote:
>
>
> > On Aug 9, 2019, at 2:38 PM, Andreas Dilger <[email protected]> wrote:
> >
> > On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
> >>
> >> This patch adds new helper APIs that ext4 needs for fast
> >> commits. These new fast commit APIs are used by subsequent fast commit
> >> patches to implement fast commits. Following new APIs are added:
> >>
> >> /*
> >> * Returns when either a full commit or a fast commit
> >> * completes
> >> */
> >> int jbd2_fc_complete_commit(journal_tc *journal, tid_t tid,
> >> tid_t tid, tid_t subtid)
> >>
> >> /* Send all the data buffers related to an inode */
> >> int journal_submit_inode_data(journal_t *journal,
> >> struct jbd2_inode *jinode)
> >>
> >> /* Map one fast commit buffer for use by the file system */
> >> int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
> >>
> >> /* Wait on fast commit buffers to complete IO */
> >> jbd2_wait_on_fc_bufs(journal_t *journal, int num_bufs)
> >>
> >> Signed-off-by: Harshad Shirwadkar <[email protected]>
> >>
> >> +int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
> >> +{
> >> + unsigned long long pblock;
> >> + unsigned long blocknr;
> >> + int ret = 0;
> >> + struct buffer_head *bh;
> >> + int fc_off;
> >> + journal_header_t *jhdr;
> >> +
> >> + write_lock(&journal->j_state_lock);
> >> +
> >> + if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
> >> + fc_off = journal->j_fc_off;
> >> + blocknr = journal->j_first_fc + fc_off;
> >> + journal->j_fc_off++;
> >> + } else {
> >> + ret = -EINVAL;
> >> + }
> >> + write_unlock(&journal->j_state_lock);
> >> +
> >> + if (ret)
> >> + return ret;
> >> +
> >> + ret = jbd2_journal_bmap(journal, blocknr, &pblock);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
> >> + if (!bh)
> >> + return -ENOMEM;
> >> +
> >> + lock_buffer(bh);
> >> + jhdr = (journal_header_t *)bh->b_data;
> >> + jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
> >> + jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
> >> + jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
> >> +
> >> + set_buffer_uptodate(bh);
> >> + unlock_buffer(bh);
> >> + journal->j_fc_wbuf[fc_off] = bh;
> >> +
> >> + *bh_out = bh;
> >> +
> >> + return 0;
> >> +}
> >> +EXPORT_SYMBOL(jbd2_map_fc_buf);
>
> One question about this function. It seems that it is called for every
> commit by ext4_journal_fc_commit_cb(). Why does it need to map the fast
> journal commit blocks on every call? It would make more sense to map the
> blocks once at initialization time and then just re-use them on each call.
>

The only reason I did it this way is that it gives JBD2 an
opportunity to set up the journal header, which contains the TID, at
the beginning of each block. But I guess we could have a separate
call for setting the journal header, and ext4 could call that routine
instead of mapping buffers on every commit. Thanks for pointing
this out. I'll fix this in V3.
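
For illustration, the "map once, stamp the header per commit" split
discussed above could look roughly like the following userspace model.
This is only a sketch under stated assumptions, not kernel code: the
fc_area type, its helpers, the buffer count, the block size and the
header values are all hypothetical, and endianness conversion is left
out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FC_NUM_BLOCKS 4   /* hypothetical size of the fast commit area */
#define FC_BLOCK_SIZE 64  /* hypothetical block size, kept tiny for the model */

/* Simplified stand-in for journal_header_t; no endianness conversion. */
struct fc_header {
    uint32_t magic;
    uint32_t blocktype;
    uint32_t sequence;
};

struct fc_area {
    unsigned char *bufs[FC_NUM_BLOCKS]; /* mapped once, then reused */
    int next;                           /* next unused buffer in this commit */
    int map_calls;                      /* how many mappings were performed */
};

/* Release every buffer that fc_area_init() mapped. */
void fc_area_destroy(struct fc_area *a)
{
    for (int i = 0; i < FC_NUM_BLOCKS; i++) {
        free(a->bufs[i]);
        a->bufs[i] = NULL;
    }
}

/* Map all fast commit buffers once, e.g. when the journal is loaded. */
int fc_area_init(struct fc_area *a)
{
    memset(a, 0, sizeof(*a));
    for (int i = 0; i < FC_NUM_BLOCKS; i++) {
        a->bufs[i] = calloc(1, FC_BLOCK_SIZE);
        if (!a->bufs[i]) {
            fc_area_destroy(a);
            return -1;
        }
        a->map_calls++;
    }
    return 0;
}

/* Per-commit step: only stamp the header of an already mapped buffer. */
unsigned char *fc_get_buf(struct fc_area *a, uint32_t tid)
{
    /* Placeholder magic and block type values for the model. */
    struct fc_header hdr = { 0xC03B3998u, 1u, tid };
    unsigned char *buf;

    assert(a != NULL);
    if (a->next >= FC_NUM_BLOCKS)
        return NULL; /* fast commit area exhausted */
    buf = a->bufs[a->next++];
    memcpy(buf, &hdr, sizeof(hdr));
    return buf;
}

/* Once a commit completes, the same buffers can be handed out again. */
void fc_area_reset(struct fc_area *a)
{
    a->next = 0;
}
```

In this model the expensive step (fc_area_init) runs once, while each
commit only calls fc_get_buf() and fc_area_reset().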

> Cheers, Andreas
>
>
>
>
>

2019-08-09 21:23:47

by Andreas Dilger

Subject: Re: [PATCH v2 07/12] ext4: add fields that are needed to track changed files

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> Ext4's fast commit feature tracks changed files and maintains them in
> a queue. We also remember for each file the logical block range that
> needs to be committed. This patch adds these fields to ext4_inode_info
> and ext4_sb_info and also adds initialization calls.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>
>
> ---
>
> Changelog:
>
> V2: Converted s_fc_lock from mutex to spinlock to improve parallelism
> performance.
> ---
> fs/ext4/ext4.h | 34 ++++++++++++++++++++++++++++++++++
> fs/ext4/ext4_jbd2.c | 13 +++++++++++++
> fs/ext4/ext4_jbd2.h | 2 ++
> fs/ext4/inode.c | 1 +
> fs/ext4/super.c | 7 +++++++
> 5 files changed, 57 insertions(+)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index becbda38b7db..0d15d4539dda 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -921,6 +921,27 @@ enum {
> I_DATA_SEM_QUOTA,
> };
>
> +/*
> + * Ext4 fast commit inode specific information
> + */
> +struct ext4_fast_commit_inode_info {
> + /* TID / SUB-TID when old_i_size and i_size were recorded */
> + tid_t fc_tid;
> + tid_t fc_subtid;
> +
> + /*
> + * Start of logical block range that needs to be committed in this fast
> + * commit
> + */
> + loff_t fc_lblk_start;
> +
> + /*
> + * End of logical block range that needs to be committed in this fast
> + * commit
> + */
> + loff_t fc_lblk_end;

Since these are logical block numbers within the file, they certainly
don't need to be 64-bit values. loff_t is for byte offsets; these fields
should use ext4_lblk_t, which will also reduce the size of the struct by
8 bytes.
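
The size difference can be checked with a small userspace model. This
is a sketch only: the model_ typedefs are stand-ins for the kernel
types (tid_t and ext4_lblk_t are 32-bit, loff_t is 64-bit), and the two
struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t model_tid_t;  /* stand-in for tid_t */
typedef int64_t  model_loff_t; /* stand-in for the 64-bit loff_t */
typedef uint32_t model_lblk_t; /* stand-in for the 32-bit ext4_lblk_t */

/* Layout as posted in the patch: block range stored as loff_t. */
struct fc_info_loff {
    model_tid_t  fc_tid;
    model_tid_t  fc_subtid;
    model_loff_t fc_lblk_start;
    model_loff_t fc_lblk_end;
};

/* Layout as suggested in review: block range stored as ext4_lblk_t. */
struct fc_info_lblk {
    model_tid_t  fc_tid;
    model_tid_t  fc_subtid;
    model_lblk_t fc_lblk_start;
    model_lblk_t fc_lblk_end;
};

/* Bytes saved per inode by switching the two range fields to 32 bits. */
size_t fc_info_bytes_saved(void)
{
    return sizeof(struct fc_info_loff) - sizeof(struct fc_info_lblk);
}
```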

> +};
> +
>
> /*
> * fourth extended file system inode data in memory
> @@ -955,6 +976,9 @@ struct ext4_inode_info {
>
> struct list_head i_orphan; /* unlinked but open inodes */
>
> + struct list_head i_fc_list; /* inodes that need fast commit */

This comment should document what lock is protecting this list, along
with the other fields.

> + struct ext4_fast_commit_inode_info i_fc;

Since this increases the size of the inode, does it affect the number of
inodes that can fit into one page of ext4_inode_cachep?

> /*
> * i_disksize keeps track of what the inode size is ON DISK, not
> * in memory. During truncate, i_size is set to the new size by
> @@ -1529,6 +1553,16 @@ struct ext4_sb_info {
> /* Barrier between changing inodes' journal flags and writepages ops. */
> struct percpu_rw_semaphore s_journal_flag_rwsem;
> struct dax_device *s_daxdev;
> +
> + /* Ext4 fast commit stuff */
> + bool fc_replay; /* Fast commit replay in progress */
> + struct list_head s_fc_q; /* Inodes that need fast commit. */

This comment should document what lock is protecting this list, along
with the other fields.

> + __u32 s_fc_q_cnt; /* Number of inodes in the fc queue */
> + bool s_fc_eligible; /*
> + * Are changes after the last commit
> + * eligible for fast commit?
> + */

It is slightly more space efficient to put the bool values together
rather than interleaving them between 64-bit values.
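
A rough userspace model of the padding effect follows. It is a sketch
under assumptions: model_list_head stands in for struct list_head, a
plain uint32_t stands in for spinlock_t (whose real size depends on the
kernel configuration), and both struct names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_list_head {
    void *next, *prev;
};

/* Field order as posted: bools sit between pointer-aligned members. */
struct sbi_interleaved {
    bool fc_replay;
    struct model_list_head s_fc_q;
    uint32_t s_fc_q_cnt;
    bool s_fc_eligible;
    uint32_t s_fc_lock; /* stand-in for spinlock_t */
};

/* Same fields with the two bools placed next to each other at the end. */
struct sbi_grouped {
    struct model_list_head s_fc_q;
    uint32_t s_fc_q_cnt;
    uint32_t s_fc_lock; /* stand-in for spinlock_t */
    bool fc_replay;
    bool s_fc_eligible;
};

/* Padding bytes avoided by grouping; the value depends on the ABI. */
size_t sbi_bytes_saved(void)
{
    return sizeof(struct sbi_interleaved) - sizeof(struct sbi_grouped);
}
```

Each isolated bool in the first layout forces padding up to the
alignment of the member that follows it, while the grouped bools share
a single padded slot.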

> + spinlock_t s_fc_lock;
> };
>
> static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 7c70b08d104c..75b6db808837 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -330,3 +330,16 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
> mark_buffer_dirty(bh);
> return err;
> }
> +
> +void ext4_init_inode_fc_info(struct inode *inode)
> +{
> + handle_t *handle = ext4_journal_current_handle();
> + struct ext4_inode_info *ei = EXT4_I(inode);
> +
> + memset(&ei->i_fc, 0, sizeof(ei->i_fc));
> + if (ext4_handle_valid(handle)) {
> + ei->i_fc.fc_tid = handle->h_transaction->t_tid;
> + ei->i_fc.fc_subtid = handle->h_transaction->t_journal->j_subtid;
> + }
> + INIT_LIST_HEAD(&ei->i_fc_list);
> +}
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index ef8fcf7d0d3b..2305c1acd415 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -459,4 +459,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> return 1;
> }
>
> +void ext4_init_inode_fc_info(struct inode *inode);
> +
> #endif /* _EXT4_JBD2_H */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 420fe3deed39..f230a888eddd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4996,6 +4996,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> for (block = 0; block < EXT4_N_BLOCKS; block++)
> ei->i_data[block] = raw_inode->i_block[block];
> INIT_LIST_HEAD(&ei->i_orphan);
> + ext4_init_inode_fc_info(&ei->vfs_inode);
>
> /*
> * Set transaction id's of transactions that have to be committed
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6bab59ae81f7..0b833e9b61c1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1100,6 +1100,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> ei->i_datasync_tid = 0;
> atomic_set(&ei->i_unwritten, 0);
> INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> + ext4_init_inode_fc_info(&ei->vfs_inode);
> return &ei->vfs_inode;
> }
>
> @@ -1139,6 +1140,7 @@ static void init_once(void *foo)
> init_rwsem(&ei->i_data_sem);
> init_rwsem(&ei->i_mmap_sem);
> inode_init_once(&ei->vfs_inode);
> + ext4_init_inode_fc_info(&ei->vfs_inode);
> }
>
> static int __init init_inodecache(void)
> @@ -4301,6 +4303,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> mutex_init(&sbi->s_orphan_lock);
>
> + INIT_LIST_HEAD(&sbi->s_fc_q);
> + sbi->s_fc_q_cnt = 0;
> + sbi->s_fc_eligible = true;
> + spin_lock_init(&sbi->s_fc_lock);
> +
> sb->s_root = NULL;
>
> needs_recovery = (es->s_last_orphan != 0 ||
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>


Cheers, Andreas







2019-08-09 21:47:58

by Andreas Dilger

Subject: Re: [PATCH v2 08/12] ext4: track changed files for fast commit

On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> For fast commit, we need to remember all the files that have changed
> since last fast commit / full commit. For changes that are fast commit
> incompatible, we mark the file system fast commit incompatible. This
> patch adds code to either remember files that have changed or to mark
> ext4 as fast commit ineligible. We inspect every ext4_mark_inode_dirty
> calls and decide whether that particular file change is fast
> compatible or not.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Some minor code style cleanups.

> @@ -759,6 +761,8 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
>
> ext4_write_unlock_xattr(inode, &no_expand);
> brelse(iloc.bh);
> + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> + inode);

(style) "inode" doesn't need to be split to a separate line

> mark_inode_dirty(inode);
> out:
> return copied;
> @@ -974,6 +978,8 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
> * ordering of page lock and transaction start for journaling
> * filesystems.
> */
> + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> + inode);

(style) "inode" doesn't need to be split to a separate line

> @@ -5697,6 +5719,8 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>
> if (!error) {
> setattr_copy(inode, attr);
> + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> + inode);

(style) "inode" doesn't need to be split to a separate line

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 0b833e9b61c1..c7bb52bdaf6e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1129,6 +1129,16 @@ static void ext4_destroy_inode(struct inode *inode)
> true);
> dump_stack();
> }
> + if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
> +#ifdef EXT4FS_DEBUG
> + if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
> + pr_warn("%s: INODE %ld in FC List with FC allowd",
> + __func__, inode->i_ino);

(style) this should use ext4_debug(), since pr_warn() is not really
used in the ext4 code

> + dump_stack();
> + }
> +#endif
> + ext4_fc_del(inode);
> + }
> }


Cheers, Andreas







2019-08-12 14:24:34

by Theodore Ts'o

Subject: Re: [PATCH v2 01/12] ext4: add handling for extended mount options

On Thu, Aug 08, 2019 at 08:45:41PM -0700, Harshad Shirwadkar wrote:
> We are running out of mount option bits. This patch adds handling for
> using s_mount_opt2 and also adds ability to turn on / off the fast
> commit feature. In order to use fast commits, new version e2fsprogs
> needs to set the fast feature commit flag. This also makes sure that
> we have fast commit compatible e2fsprogs before starting to use the
> feature. Mount flag "no_fastcommit", introduced in this patch, can be
> passed to disable the feature at mount time.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

Looks good:

Reviewed-by: Theodore Ts'o <[email protected]>

- Ted

2019-08-12 16:05:23

by Theodore Ts'o

Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

On Thu, Aug 08, 2019 at 08:45:45PM -0700, Harshad Shirwadkar wrote:
> This patch adds new helper APIs that ext4 needs for fast
> commits. These new fast commit APIs are used by subsequent fast commit
> patches to implement fast commits. Following new APIs are added:
>
> /*
> * Returns when either a full commit or a fast commit
> * completes
> */
> int jbd2_fc_complete_commit(journal_tc *journal, tid_t tid,
> tid_t tid, tid_t subtid)

I think there is an opportunity to do something more efficient.

Right now, the ext4_fsync() calls this function, and the file system
can only do a "fast commit" if all of the modifications made to the
file system to date are "fast commit eligible". Otherwise, we have to
fall back to a normal, slow commit.

We can make this decision on a much more granular level. Suppose that
so far during the life of the current transaction, inodes A, B, and C
have been modified. The modification to inode A is not fast commit
eligible (maybe the inode is deleted, or it is involved in a directory
rename, etc.). The modification to inode B is fast commit eligible,
but an fsync was not requested for it. And the modification to inode
C *is* fast commit eligible, *and* fsync() has been requested for it.

We only need to write the information for inode C to the fast commit
area. The fact that inode A is not fast commit eligible isn't a
problem. It will get committed when the normal transaction closes,
perhaps when the 5 second commit transaction timer expires. And inode
B, even though its changes might be fast commit eligible, might
require writing a large number of data blocks if it were included in
the fast commit. So excluding inodes A and B from the fast commit,
and only writing the logical changes corresponding to those made
to inode C, will allow a fast commit to take place.

In order to do that, though, ext4's fast commit machinery needs to
know which inode we actually need to do the fast commit for. And so
for that reason, it's actually probably better not to run the changes
through the commit thread. That makes it harder to plumb the file
system specific information through, and it also requires waking up
the commit thread and waiting for it to get scheduled.

Instead, ext4_fsync() could just call the fast commit machinery, and
the only thing we need to expose is a way for the fast commit
machinery to attempt to grab a mutex preventing the normal commit
thread from starting a normal commit. If it loses the race, and the
normal commit takes place before we manage to do the fast commit, then
we don't need to do anything more. Otherwise the fast commit
machinery can do its thing, writing inode changes to the journal, and
once it is done, it can release the mutex and ext4 fsync can return.

Does that make sense?

- Ted

2019-08-12 17:48:10

by harshad shirwadkar

Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

Thanks Andreas and Ted for the review.

Yeah, this makes sense.

On Mon, Aug 12, 2019 at 9:04 AM Theodore Y. Ts'o <[email protected]> wrote:
>
> On Thu, Aug 08, 2019 at 08:45:45PM -0700, Harshad Shirwadkar wrote:
> > This patch adds new helper APIs that ext4 needs for fast
> > commits. These new fast commit APIs are used by subsequent fast commit
> > patches to implement fast commits. Following new APIs are added:
> >
> > /*
> > * Returns when either a full commit or a fast commit
> > * completes
> > */
> > int jbd2_fc_complete_commit(journal_tc *journal, tid_t tid,
> > tid_t tid, tid_t subtid)
>
> I think there is an opportunity to do something more efficient.
>
> Right now, the ext4_fsync() calls this function, and the file system
> can only do a "fast commit" if all of the modifications made to the
> file system to date are "fast commit eligible". Otherwise, we have to
> fall back to a normal, slow commit.
>
> We can make this decision on a much more granular level. Suppose that
> so far during the life of the current transaction, inodes A, B, and C
> have been modified. The modification to inode A is not fast commit
> eligible (maybe the inode is deleted, or it is involved in a directory
> rename, etc.). The modification to inode B is fast commit eligible,
> but an fsync was not requested for it. And the modification to inode
> C *is* fast commit eligible, *and* fsync() has been requested for it.
>
> We only need to write the information for inode C to the fast commit
> area. The fact that inode A is not fast commit eligible isn't a
> problem. It will get committed when the normal transaction closes,
> perhaps when the 5 second commit transaction timer expires. And inode
> B, even though its changes might be fast commit eligible, might
> require writing a large number of data blocks if it were included in
> the fast commit. So excluding inodes A and B from the fast commit,
> and only writing the logical changes corresponding to those made
> to inode C, will allow a fast commit to take place.
>
> In order to do that, though, ext4's fast commit machinery needs to
> know which inode we actually need to do the fast commit for. And so
> for that reason, it's actually probably better not to run the changes
> through the commit thread. That makes it harder to plumb the file
> system specific information through, and it also requires waking up
> the commit thread and waiting for it to get scheduled.

I see, so you mean each fsync() call will result in exactly one inode
being committed (the inode on which fsync was called), right? I agree
this doesn't need to go through JBD2, but we need a mechanism to inform
JBD2 about this fast commit since JBD2 maintains the sub-transaction ID.
JBD2 will in turn need to make sure that a subtid was allocated for
such a fast commit and that it was incremented once the fast commit was
successful.

>
> Instead, ext4_fsync() could just call the fast commit machinery, and
> the only thing we need to expose is a way for the fast commit
> machinery to attempt to grab a mutex preventing the normal commit
> thread from starting a normal commit. If it loses the race, and the
> normal commit takes place before we manage to do the fast commit, then
> we don't need to do anything more. Otherwise the fast commit
> machinery can do its thing, writing inode changes to the journal, and
> once it is done, it can release the mutex and ext4 fsync can return.
>
> Does that make sense?

Thanks for the suggestion. I will implement this in V3.

>
> - Ted

2019-08-12 18:02:09

by Theodore Ts'o

Subject: Re: [PATCH v2 05/12] jbd2: fast-commit commit path new APIs

On Mon, Aug 12, 2019 at 10:41:48AM -0700, harshad shirwadkar wrote:
> I see, so you mean each fsync() call will result in exactly one inode
> to be committed (the inode on which fsync was called), right? I agree
> this doesn't need to go through JBD2 but we need a mechanism to inform
> JBD2 about this fast commit since JBD2 maintains sub-transaction ID.
> JBD2 will in turn need to make sure that a subtid was allocated for
> such a fast commit and it was incremented once the fast commit was
> successful as well.

Why does JBD2 need to maintain the sub-transaction ID? We can only
have a single fast commit happening at a time, and while a fast commit
is happening we can't allow a full commit to happen (or vice
versa). So we need a mutex which enforces this, and the transaction id
can just be a field in the transaction structure.

Cheers,

- Ted

2019-08-16 01:01:02

by Darrick J. Wong

Subject: Re: [PATCH v2 12/12] docs: Add fast commit documentation

On Thu, Aug 08, 2019 at 08:45:52PM -0700, Harshad Shirwadkar wrote:
> This patch adds necessary documentation to
> Documentation/filesystems/journalling.rst and
> Documentation/filesystems/ext4/journal.rst.
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>
> ---
> Documentation/filesystems/ext4/journal.rst | 96 ++++++++++++++++++++--
> Documentation/filesystems/journalling.rst | 15 ++++
> 2 files changed, 105 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..d6e4a698e208 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> disk before the metadata are written to disk through the journal.
>
> The journal inode is typically inode 8. The first 68 bytes of the
> -journal inode are replicated in the ext4 superblock. The journal itself
> -is normal (but hidden) file within the filesystem. The file usually
> -consumes an entire block group, though mke2fs tries to put it in the
> -middle of the disk.
> +journal inode are replicated in the ext4 superblock. The journal
> +itself is normal (but hidden) file within the filesystem. The file
> +usually consumes an entire block group, though mke2fs tries to put it
> +in the middle of the disk. Last 128 blocks in the journal are reserved
> +for fast commits. Fast commits store metadata changes to inodes in an
> +incremental fashion. A fast commit is valid only if there is no full
> +commit after that particular fast commit. That makes fast commit space
> +reusable after every full commit.
>
> All fields in jbd2 are written to disk in big-endian order. This is the
> opposite of ext4.
> @@ -48,16 +52,18 @@ Layout
> Generally speaking, the journal has this format:
>
> .. list-table::
> - :widths: 16 48 16
> + :widths: 16 48 16 18
> :header-rows: 1
>
> * - Superblock
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> - One transaction
> -
> + -
>
> Notice that a transaction begins with either a descriptor and some data,
> or a block revocation list. A finished transaction always ends with a
> @@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
> superblock.
>
> .. list-table::
> - :widths: 12 12 12 32 12
> + :widths: 12 12 12 32 12 12
> :header-rows: 1
>
> * - 1024 bytes of padding
> @@ -85,11 +91,13 @@ superblock.
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> -
> -
> - One transaction
> -
> + -
>
> Block Header
> ~~~~~~~~~~~~
> @@ -609,3 +617,79 @@ bytes long (but uses a full block):
> - h\_commit\_nsec
> - Nanoseconds component of the above timestamp.
>
> +Fast Commit Block
> +~~~~~~~~~~~~~~~~~
> +
> +The fast commit block indicates an append to the last commit block
> +that was written to the journal. One fast commit block records updates
> +to one inode. So, typically you would find as many fast commit blocks
> +as the number of inodes that got changed since the last commit. A fast
> +commit block is valid only if there is no commit block present with
> +transaction ID greater than that of the fast commit block. If such a
> +block a present, then there is no need to replay the fast commit
> +block.
> +
> +Multiple fast commit blocks are a part of one sub-transaction. To
> +indicate the last block in a fast commit transaction, fc_flags field
> +in the last block in every subtransaction is marked with "LAST" (0x1)
> +flag. A subtransaction is valid only if all the following conditions
> +are met:
> +
> +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> + the previous fast commit block.
> +2) For every sub-transaction, last block is marked with LAST flag.
> +3) There are no invalid blocks in between.
> +
> +.. list-table::
> + :widths: 8 8 24 40
> + :header-rows: 1
> +
> + * - Offset
> + - Type
> + - Name
> + - Descriptor
> + * - 0x0
> + - journal\_header\_s
> + - (open coded)
> + - Common block header.
> + * - 0xC
> + - \_\_le32
> + - fc\_magic
> + - Magic value which should be set to 0xE2540090. This identifies
> + that this block is a fast commit block.
> + * - 0x10
> + - \_\_le32
> + - fc\_subtid
> + - Sub-transaction ID for this commit block
> + * - 0x14
> + - \_\_u8
> + - fc\_features
> + - Features used by this fast commit block.
> + * - 0x15
> + - \_\_u8
> + - fc_flags
> + - Flags. (0x1(Last) - Indicates that this is the last block in sub-transaction)
> + * - 0x16
> + - \_\_le16
> + - fc_num_tlvs
> + - Number of TLVs contained in this fast commit block
> + * - 0x18
> + - \_\_le32
> + - \_\_fc\_len
> + - Length of the fast commit block in terms of number of blocks
> + * - 0x2c
> + - \_\_le32
> + - fc\_ino
> + - Inode number of the inode that will be recovered using this fast commit
> + * - 0x30
> + - struct ext4\_inode
> + - inode
> + - On-disk copy of the inode at the commit time
> + * - 0x34
> + - struct ext4\_fc\_tl
> + - Array of struct ext4\_fc\_tl
> + - The actual delta with the last commit. Starting at this offset,
> + there is an array of TLVs that indicates which all extents
> + should be present in the corresponding inode. Currently, the
> + only tag that is supported is EXT4\_FC\_TAG\_EXT. That tag
> + indicates that the corresponding value is an extent.

This is a good start, but what's the structure of struct ext4_fc_tl?
It's written to disk, it should be here too. Looks like it's mostly
just an array of ondisk extent structures?
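
For reference, the fixed-size fields at the start of the table above
can be written down as a C struct to check the documented offsets. This
is only a sketch: the model_ names are hypothetical, the struct stops
at __fc_len (offset 0x18), and struct ext4_fc_tl is not modeled because
its layout is not given here. Multi-byte fields are shown as host
integers, so endianness handling is left out. The helper at the end
encodes the validity rule stated in the documentation text, ignoring
TID wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the 12-byte common block header (journal_header_s). */
struct model_journal_header {
    uint32_t h_magic;
    uint32_t h_blocktype;
    uint32_t h_sequence;
};

/* Leading fields of a fast commit block, following the quoted table. */
struct model_fc_hdr {
    struct model_journal_header fc_base; /* 0x00 */
    uint32_t fc_magic;                   /* 0x0C */
    uint32_t fc_subtid;                  /* 0x10 */
    uint8_t  fc_features;                /* 0x14 */
    uint8_t  fc_flags;                   /* 0x15 */
    uint16_t fc_num_tlvs;                /* 0x16 */
    uint32_t fc_len;                     /* 0x18, "__fc_len" in the table */
};

#define MODEL_FC_FLAG_LAST 0x1 /* last block of a sub-transaction */

/*
 * A fast commit block is replayable only if no full commit block with
 * a greater transaction ID exists in the journal.
 */
int model_fc_block_is_valid(uint32_t fc_tid, uint32_t newest_full_commit_tid)
{
    return newest_full_commit_tid <= fc_tid;
}

/* True if this block closes its sub-transaction. */
int model_fc_is_last(const struct model_fc_hdr *hdr)
{
    assert(hdr != NULL);
    return (hdr->fc_flags & MODEL_FC_FLAG_LAST) != 0;
}
```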

So if I read this right, this first fastcommit tag type seems to be an
inode core and an array of extents which ... I guess are the extents
that were allocated and mapped into the file? So therefore journal
replay of this metadata update becomes a simple matter of logging the
new inode core, adding the associated fc extent records to the extent
map, and marking the corresponding parts of the block bitmap in use?

I'm wondering why these fast commits aren't written inline with the
regular jbd2 transaction block stream? i.e.

[descriptors][blocks][commit][fastcommit][fastcommit][descriptor...]

That way jbd2 replay just adds a case for a journal block with h_magic
== JBD2_FC_MAGIC where it checkpoints whatever it had staged at that
point, throws the fast commit block up to ext4 to do whatever, and then
continues on replaying regular transactions? I get this feeling like
fastcommit is a journal that runs inside of/alongside jbd2 and wonder
why not just integrate it better with jbd2?

> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..2e0d550b546c 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -115,6 +115,21 @@ called after each transaction commit. You can also use
> ``transaction->t_private_list`` for attaching entries to a transaction
> that need processing when the transaction commits.
>
> +JBD2 also allows client file systems to implement file system specific
> +commits which are called as ``fast commits``. File systems that wish
> +to use this feature should first set
> +``journal->j_fc_commit_callback``. That function is called before
> +performing a commit. File system can call :c:func:`jbd2_map_fc_buf()`
> +to get buffers reserved for fast commits. If file system returns 0,
> +JBD2 assumes that file system performed a fast commit and it backs off
> +from performing a commit. Otherwise, JBD2 falls back to normal full

Huh. Ok, so the caller I guess grabs fastcommit blocks, writes the
intent to the fc block, and pushes it to disk, after which we can return
to userspace. Some time later jbd2 gets around to committing things so
it calls back with ->j_fc_commit_callback at which point we say "Oh! I
already wrote that to disk as a fastcommit, so return 0" and jbd2
shrugs and moves on to the next transaction?
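
As a reading aid, the callback flow described in the quoted
documentation (try the fast commit callback first, fall back to a full
commit on a non-zero return, then run the cleanup callback either way)
can be modeled in userspace as below. This is a sketch, not the jbd2
implementation: cb_journal, its counters and the example_ callbacks are
hypothetical; only the two callback field names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

struct cb_journal;
typedef int  (*fc_commit_cb_t)(struct cb_journal *);
typedef void (*fc_cleanup_cb_t)(struct cb_journal *);

struct cb_journal {
    fc_commit_cb_t  j_fc_commit_callback; /* client's fast commit hook */
    fc_cleanup_cb_t j_fc_cleanup_cb;      /* client's cleanup hook */
    int fc_eligible;  /* model state: can the client fast commit? */
    int fast_commits;
    int full_commits;
    int cleanups;
};

/* Returns 1 if the commit was handled as a fast commit, 0 otherwise. */
int cb_journal_commit(struct cb_journal *j)
{
    int did_fast = 0;

    assert(j != NULL);
    if (j->j_fc_commit_callback && j->j_fc_commit_callback(j) == 0) {
        j->fast_commits++; /* client handled it, skip the full commit */
        did_fast = 1;
    } else {
        j->full_commits++; /* fall back to a normal full commit */
    }
    if (j->j_fc_cleanup_cb)
        j->j_fc_cleanup_cb(j); /* runs after either kind of commit */
    return did_fast;
}

/* Example client callbacks for the model. */
int example_fc_commit(struct cb_journal *j)
{
    return j->fc_eligible ? 0 : -1;
}

void example_fc_cleanup(struct cb_journal *j)
{
    j->cleanups++;
}
```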

--D

> +commit. After performing either a fast or a full commit, JBD2 calls
> +``journal->j_fc_cleanup_cb`` to allow file systems to perform cleanups
> +for their internal fast commit related data structures. At the replay
> +time, JBD2 passes each and every fast commit block to the file system
> +via ``journal->j_fc_replay_cb``. Ext4 effectively uses this fast
> +commit mechanism to improve journal commit performance.
> +
> JBD2 also provides a way to block all transaction updates via
> :c:func:`jbd2_journal_lock_updates()` /
> :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>

2019-08-20 06:39:26

by harshad shirwadkar

Subject: Re: [PATCH v2 12/12] docs: Add fast commit documentation

On Thu, Aug 15, 2019 at 6:00 PM Darrick J. Wong <[email protected]> wrote:
>
> On Thu, Aug 08, 2019 at 08:45:52PM -0700, Harshad Shirwadkar wrote:
> > This patch adds necessary documentation to
> > Documentation/filesystems/journalling.rst and
> > Documentation/filesystems/ext4/journal.rst.
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > ---
> > Documentation/filesystems/ext4/journal.rst | 96 ++++++++++++++++++++--
> > Documentation/filesystems/journalling.rst | 15 ++++
> > 2 files changed, 105 insertions(+), 6 deletions(-)
> >
> > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > index ea613ee701f5..d6e4a698e208 100644
> > --- a/Documentation/filesystems/ext4/journal.rst
> > +++ b/Documentation/filesystems/ext4/journal.rst
> > @@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > disk before the metadata are written to disk through the journal.
> >
> > The journal inode is typically inode 8. The first 68 bytes of the
> > -journal inode are replicated in the ext4 superblock. The journal itself
> > -is normal (but hidden) file within the filesystem. The file usually
> > -consumes an entire block group, though mke2fs tries to put it in the
> > -middle of the disk.
> > +journal inode are replicated in the ext4 superblock. The journal
> > +itself is normal (but hidden) file within the filesystem. The file
> > +usually consumes an entire block group, though mke2fs tries to put it
> > +in the middle of the disk. Last 128 blocks in the journal are reserved
> > +for fast commits. Fast commits store metadata changes to inodes in an
> > +incremental fashion. A fast commit is valid only if there is no full
> > +commit after that particular fast commit. That makes fast commit space
> > +reusable after every full commit.
> >
> > All fields in jbd2 are written to disk in big-endian order. This is the
> > opposite of ext4.
> > @@ -48,16 +52,18 @@ Layout
> > Generally speaking, the journal has this format:
> >
> > .. list-table::
> > - :widths: 16 48 16
> > + :widths: 16 48 16 18
> > :header-rows: 1
> >
> > * - Superblock
> > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > revocations] commmit\_block
> > - [more transactions...]
> > + - [Fast commits...]
> > * -
> > - One transaction
> > -
> > + -
> >
> > Notice that a transaction begins with either a descriptor and some data,
> > or a block revocation list. A finished transaction always ends with a
> > @@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
> > superblock.
> >
> > .. list-table::
> > - :widths: 12 12 12 32 12
> > + :widths: 12 12 12 32 12 12
> > :header-rows: 1
> >
> > * - 1024 bytes of padding
> > @@ -85,11 +91,13 @@ superblock.
> > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > revocations] commmit\_block
> > - [more transactions...]
> > + - [Fast commits...]
> > * -
> > -
> > -
> > - One transaction
> > -
> > + -
> >
> > Block Header
> > ~~~~~~~~~~~~
> > @@ -609,3 +617,79 @@ bytes long (but uses a full block):
> > - h\_commit\_nsec
> > - Nanoseconds component of the above timestamp.
> >
> > +Fast Commit Block
> > +~~~~~~~~~~~~~~~~~
> > +
> > +The fast commit block indicates an append to the last commit block
> > +that was written to the journal. One fast commit block records updates
> > +to one inode. So, typically you would find as many fast commit blocks
> > +as the number of inodes that got changed since the last commit. A fast
> > +commit block is valid only if there is no commit block present with
> > +transaction ID greater than that of the fast commit block. If such a
> > +block a present, then there is no need to replay the fast commit
> > +block.
> > +
> > +Multiple fast commit blocks are a part of one sub-transaction. To
> > +indicate the last block in a fast commit transaction, fc_flags field
> > +in the last block in every subtransaction is marked with "LAST" (0x1)
> > +flag. A subtransaction is valid only if all the following conditions
> > +are met:
> > +
> > +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> > + the previous fast commit block.
> > +2) For every sub-transaction, last block is marked with LAST flag.
> > +3) There are no invalid blocks in between.
> > +
> > +.. list-table::
> > + :widths: 8 8 24 40
> > + :header-rows: 1
> > +
> > + * - Offset
> > + - Type
> > + - Name
> > + - Descriptor
> > + * - 0x0
> > + - journal\_header\_s
> > + - (open coded)
> > + - Common block header.
> > + * - 0xC
> > + - \_\_le32
> > + - fc\_magic
> > + - Magic value which should be set to 0xE2540090. This identifies
> > + that this block is a fast commit block.
> > + * - 0x10
> > + - \_\_le32
> > + - fc\_subtid
> > + - Sub-transaction ID for this commit block
> > + * - 0x14
> > + - \_\_u8
> > + - fc\_features
> > + - Features used by this fast commit block.
> > + * - 0x15
> > + - \_\_u8
> > + - fc_flags
> > + - Flags. (0x1(Last) - Indicates that this is the last block in sub-transaction)
> > + * - 0x16
> > + - \_\_le16
> > + - fc_num_tlvs
> > + - Number of TLVs contained in this fast commit block
> > + * - 0x18
> > + - \_\_le32
> > + - \_\_fc\_len
> > + - Length of the fast commit block in terms of number of blocks
> > + * - 0x2c
> > + - \_\_le32
> > + - fc\_ino
> > + - Inode number of the inode that will be recovered using this fast commit
> > + * - 0x30
> > + - struct ext4\_inode
> > + - inode
> > + - On-disk copy of the inode at the commit time
> > + * - 0x34
> > + - struct ext4\_fc\_tl
> > + - Array of struct ext4\_fc\_tl
> > + - The actual delta with the last commit. Starting at this offset,
> > + there is an array of TLVs that indicates which all extents
> > + should be present in the corresponding inode. Currently, the
> > + only tag that is supported is EXT4\_FC\_TAG\_EXT. That tag
> > + indicates that the corresponding value is an extent.
>
> This is a good start, but what's the structure of struct ext4_fc_tl ?
> It's written to disk, it should be here too. Looks like it's mostly
> just an array of ondisk extent structures?
Thanks, I'll update this in the next version. struct ext4_fc_tl is a
generic tag-length-value container that currently holds only the extents
that were added to a file after the last commit.
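
To make the tag-length-value idea concrete, here is a minimal user-space
sketch of walking such an array. The patch under review does not spell out
the on-disk layout of struct ext4_fc_tl, so the 16-bit little-endian tag and
length fields, the numeric tag value and the helper names below are all
assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number for the extent tag; the patch only names EXT4_FC_TAG_EXT. */
#define FC_TAG_EXT_EXAMPLE 1

/* Read a 16-bit little-endian value from a byte buffer. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Walk a buffer of TLVs assumed to be laid out as
 * [tag:le16][len:le16][value:len bytes] and count the entries that
 * carry the extent tag. Returns -1 if a TLV runs past the buffer end.
 */
static int count_extent_tlvs(const uint8_t *buf, size_t size)
{
	size_t off = 0;
	int count = 0;

	while (off < size) {
		uint16_t tag, len;

		if (size - off < 4)
			return -1;	/* truncated header */
		tag = get_le16(buf + off);
		len = get_le16(buf + off + 2);
		off += 4;
		if (len > size - off)
			return -1;	/* truncated value */
		if (tag == FC_TAG_EXT_EXAMPLE)
			count++;
		off += len;		/* unknown tags are skipped */
	}
	return count;
}
```

Skipping unknown tags by their length is what would let new tag types be
added later without breaking older readers.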
>
> So if I read this right, this first fastcommit tag type seems to be an
> inode core and an array of extents which ... I guess are the extents
> that were allocated and mapped into the file? So therefore journal
> replay of this metadata update becomes a simple matter of logging the
> new inode core, adding the associated fc extent records to the extent
> map, and marking the corresponding parts of the block bitmap in use?
>
Yes, that's precisely what is done here.
> I'm wondering why these fast commits aren't written inline with the
> regular jbd2 transaction block stream? i.e.
>
> [descriptors][blocks][commit][fastcommit][fastcommit][descriptor...]
>
After a full commit, all previous fast commits are invalid. So, if we
inline fast commits with the corresponding transactions, we'll end up
wasting a lot of journal space. That is why the fast commit area is kept
separate from the normal journaling area: after every transaction
commit, the fast commit space is reused. But if we could overwrite fast
commit blocks with the final commit, then it would be possible to inline
fast commit blocks with the transaction stream without losing journal
space. A fast commit could just write a fast commit block after the
previous transaction, and when the next transaction commits, it could
simply overwrite the previous fast commit blocks.
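
The invalidation rule can be sketched as a small predicate. This only
encodes the rule as worded in the documentation patch quoted above (a fast
commit block is valid only if no commit block with a greater transaction ID
is present); the function names are hypothetical, and the wraparound-safe
comparison is modelled on jbd2's tid_gt() rather than taken from this series.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Wraparound-safe "x is newer than y", modelled on jbd2's tid_gt(). */
static bool tid_newer(tid_t x, tid_t y)
{
	return (int32_t)(x - y) > 0;
}

/*
 * A fast commit block tagged with fc_tid is replayed only if no full
 * commit with a greater transaction ID reached the journal.
 */
static bool fc_block_needs_replay(tid_t fc_tid, tid_t last_full_commit_tid)
{
	return !tid_newer(last_full_commit_tid, fc_tid);
}
```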
> That way jbd2 replay just adds a case for a journal block with h_magic
> == JBD2_FC_MAGIC where it checkpoints whatever it had staged at that
> point, throws the fast commit block up to ext4 to do whatever, and then
> continues on replaying regular transactions? I get this feeling like
> fastcommit is a journal that runs inside of/alongside jbd2 and wonder
> why not just integrate it better with jbd2?
Hmmm, I agree, we want fast commits to be as close to jbd2 as possible.
>
> > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > index 58ce6b395206..2e0d550b546c 100644
> > --- a/Documentation/filesystems/journalling.rst
> > +++ b/Documentation/filesystems/journalling.rst
> > @@ -115,6 +115,21 @@ called after each transaction commit. You can also use
> > ``transaction->t_private_list`` for attaching entries to a transaction
> > that need processing when the transaction commits.
> >
> > +JBD2 also allows client file systems to implement file system specific
> > +commits which are called as ``fast commits``. File systems that wish
> > +to use this feature should first set
> > +``journal->j_fc_commit_callback``. That function is called before
> > +performing a commit. File system can call :c:func:`jbd2_map_fc_buf()`
> > +to get buffers reserved for fast commits. If file system returns 0,
> > +JBD2 assumes that file system performed a fast commit and it backs off
> > +from performing a commit. Otherwise, JBD2 falls back to normal full
>
> Huh. Ok, so the caller I guess grabs fastcommit blocks, writes the
> intent to the fc block, and pushes it to disk, after which we can return
> to userspace. Some time later jbd2 gets around to committing things so
> it calls back with ->j_fc_commit_callback at which point we say "Oh! I
> already wrote that to disk as a fastcommit, so return 0" and jbd2
> shrugs and moves on to the next transaction?
I am sorry for the confusing wording here; let me fix it in the next
version. Whether fsync() is called or jbd2 wakes up on its own, in
both cases journal->j_fc_commit_callback() is invoked by jbd2. In
other words, journal->j_fc_commit_callback() is the main fastcommit
"commit" routine. If j_fc_commit_callback() returns 0, jbd2 knows that
the file system was able to perform a fast commit, and in that case a
full commit is not needed. But there are scenarios where the file system
would rather do a full commit, for a few reasons: the accumulated work
is too much to fit in the fast commit region, the accumulated work is
too much to have any performance benefit, or a complex operation (such
as punch hole) was performed for which there's no fast commit support
yet. In such cases, j_fc_commit_callback() can simply return a non-zero
value to tell jbd2 to perform a full traditional commit.
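
As a rough user-space model of that contract (not the kernel code; the
struct, its fields and the function names are invented for this sketch), the
commit path tries the callback first and falls back to a full commit on any
non-zero return or when a full commit was explicitly requested:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the journal; only what the sketch needs. */
struct toy_journal {
	int (*fc_commit_cb)(struct toy_journal *j);	/* 0 means fast commit done */
	bool do_full_commit;		/* someone explicitly asked for a full commit */
	unsigned int subtid;		/* fast commits since the last full commit */
	unsigned int full_commits;
	unsigned int pending_blocks;	/* work the file system has accumulated */
	unsigned int fc_area_blocks;	/* size of the fast commit area */
};

/* Example callback: decline when the work does not fit in the area. */
static int toy_fc_commit(struct toy_journal *j)
{
	if (j->pending_blocks > j->fc_area_blocks)
		return -1;
	j->pending_blocks = 0;
	return 0;
}

/* Returns true if a fast commit was performed, false on a full commit. */
static bool toy_commit(struct toy_journal *j)
{
	if (!j->do_full_commit && j->fc_commit_cb && j->fc_commit_cb(j) == 0) {
		j->subtid++;
		return true;
	}
	/* Full commit: fast commit state starts over. */
	j->full_commits++;
	j->subtid = 0;
	j->do_full_commit = false;
	j->pending_blocks = 0;
	return false;
}
```

With fc_area_blocks set to 128, a commit with 4 pending blocks takes the
fast path, while one with 500 pending blocks falls back to a full commit and
resets the sub-transaction counter.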

Thanks,
Harshad
>
> --D
>
> > +commit. After performing either a fast or a full commit, JBD2 calls
> > +``journal->j_fc_cleanup_cb`` to allow file systems to perform cleanups
> > +for their internal fast commit related data structures. At the replay
> > +time, JBD2 passes each and every fast commit block to the file system
> > +via ``journal->j_fc_replay_cb``. Ext4 effectively uses this fast
> > +commit mechanism to improve journal commit performance.
> > +
> > JBD2 also provides a way to block all transaction updates via
> > :c:func:`jbd2_journal_lock_updates()` /
> > :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
> >

2019-08-21 15:45:08

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 12/12] docs: Add fast commit documentation

On Mon, Aug 19, 2019 at 11:38:42PM -0700, harshad shirwadkar wrote:
> On Thu, Aug 15, 2019 at 6:00 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Thu, Aug 08, 2019 at 08:45:52PM -0700, Harshad Shirwadkar wrote:
> > > This patch adds necessary documentation to
> > > Documentation/filesystems/journalling.rst and
> > > Documentation/filesystems/ext4/journal.rst.
> > >
> > > Signed-off-by: Harshad Shirwadkar <[email protected]>
> > > ---
> > > Documentation/filesystems/ext4/journal.rst | 96 ++++++++++++++++++++--
> > > Documentation/filesystems/journalling.rst | 15 ++++
> > > 2 files changed, 105 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > > index ea613ee701f5..d6e4a698e208 100644
> > > --- a/Documentation/filesystems/ext4/journal.rst
> > > +++ b/Documentation/filesystems/ext4/journal.rst
> > > @@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> > > disk before the metadata are written to disk through the journal.
> > >
> > > The journal inode is typically inode 8. The first 68 bytes of the
> > > -journal inode are replicated in the ext4 superblock. The journal itself
> > > -is normal (but hidden) file within the filesystem. The file usually
> > > -consumes an entire block group, though mke2fs tries to put it in the
> > > -middle of the disk.
> > > +journal inode are replicated in the ext4 superblock. The journal
> > > +itself is normal (but hidden) file within the filesystem. The file
> > > +usually consumes an entire block group, though mke2fs tries to put it
> > > +in the middle of the disk. Last 128 blocks in the journal are reserved
> > > +for fast commits. Fast commits store metadata changes to inodes in an
> > > +incremental fashion. A fast commit is valid only if there is no full
> > > +commit after that particular fast commit. That makes fast commit space
> > > +reusable after every full commit.
> > >
> > > All fields in jbd2 are written to disk in big-endian order. This is the
> > > opposite of ext4.
> > > @@ -48,16 +52,18 @@ Layout
> > > Generally speaking, the journal has this format:
> > >
> > > .. list-table::
> > > - :widths: 16 48 16
> > > + :widths: 16 48 16 18
> > > :header-rows: 1
> > >
> > > * - Superblock
> > > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > > revocations] commmit\_block
> > > - [more transactions...]
> > > + - [Fast commits...]
> > > * -
> > > - One transaction
> > > -
> > > + -
> > >
> > > Notice that a transaction begins with either a descriptor and some data,
> > > or a block revocation list. A finished transaction always ends with a
> > > @@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
> > > superblock.
> > >
> > > .. list-table::
> > > - :widths: 12 12 12 32 12
> > > + :widths: 12 12 12 32 12 12
> > > :header-rows: 1
> > >
> > > * - 1024 bytes of padding
> > > @@ -85,11 +91,13 @@ superblock.
> > > - descriptor\_block (data\_blocks or revocation\_block) [more data or
> > > revocations] commmit\_block
> > > - [more transactions...]
> > > + - [Fast commits...]
> > > * -
> > > -
> > > -
> > > - One transaction
> > > -
> > > + -
> > >
> > > Block Header
> > > ~~~~~~~~~~~~
> > > @@ -609,3 +617,79 @@ bytes long (but uses a full block):
> > > - h\_commit\_nsec
> > > - Nanoseconds component of the above timestamp.
> > >
> > > +Fast Commit Block
> > > +~~~~~~~~~~~~~~~~~
> > > +
> > > +The fast commit block indicates an append to the last commit block
> > > +that was written to the journal. One fast commit block records updates
> > > +to one inode. So, typically you would find as many fast commit blocks
> > > +as the number of inodes that got changed since the last commit. A fast
> > > +commit block is valid only if there is no commit block present with
> > > +transaction ID greater than that of the fast commit block. If such a
> > > +block a present, then there is no need to replay the fast commit
> > > +block.
> > > +
> > > +Multiple fast commit blocks are a part of one sub-transaction. To
> > > +indicate the last block in a fast commit transaction, fc_flags field
> > > +in the last block in every subtransaction is marked with "LAST" (0x1)
> > > +flag. A subtransaction is valid only if all the following conditions
> > > +are met:
> > > +
> > > +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> > > + the previous fast commit block.
> > > +2) For every sub-transaction, last block is marked with LAST flag.
> > > +3) There are no invalid blocks in between.
> > > +
> > > +.. list-table::
> > > + :widths: 8 8 24 40
> > > + :header-rows: 1
> > > +
> > > + * - Offset
> > > + - Type
> > > + - Name
> > > + - Descriptor
> > > + * - 0x0
> > > + - journal\_header\_s
> > > + - (open coded)
> > > + - Common block header.
> > > + * - 0xC
> > > + - \_\_le32
> > > + - fc\_magic
> > > + - Magic value which should be set to 0xE2540090. This identifies
> > > + that this block is a fast commit block.
> > > + * - 0x10
> > > + - \_\_le32
> > > + - fc\_subtid
> > > + - Sub-transaction ID for this commit block
> > > + * - 0x14
> > > + - \_\_u8
> > > + - fc\_features
> > > + - Features used by this fast commit block.
> > > + * - 0x15
> > > + - \_\_u8
> > > + - fc_flags
> > > + - Flags. (0x1(Last) - Indicates that this is the last block in sub-transaction)
> > > + * - 0x16
> > > + - \_\_le16
> > > + - fc_num_tlvs
> > > + - Number of TLVs contained in this fast commit block
> > > + * - 0x18
> > > + - \_\_le32
> > > + - \_\_fc\_len
> > > + - Length of the fast commit block in terms of number of blocks
> > > + * - 0x2c
> > > + - \_\_le32
> > > + - fc\_ino
> > > + - Inode number of the inode that will be recovered using this fast commit
> > > + * - 0x30
> > > + - struct ext4\_inode
> > > + - inode
> > > + - On-disk copy of the inode at the commit time
> > > + * - 0x34
> > > + - struct ext4\_fc\_tl
> > > + - Array of struct ext4\_fc\_tl
> > > + - The actual delta with the last commit. Starting at this offset,
> > > + there is an array of TLVs that indicates which all extents
> > > + should be present in the corresponding inode. Currently, the
> > > + only tag that is supported is EXT4\_FC\_TAG\_EXT. That tag
> > > + indicates that the corresponding value is an extent.
> >
> > This is a good start, but what's the structure of struct ext4_fc_tl ?
> > It's written to disk, it should be here too. Looks like it's mostly
> > just an array of ondisk extent structures?
> Thanks, I'll update this in next version. struct ext4_fc_tl is a
> generic tag-length-value container that currently holds only extents
> that were added to a file after last commit.

<nod> If they're the same format as the extent map records then I think
you can just reference that part of the documentation.

> > So if I read this right, this first fastcommit tag type seems to be an
> > inode core and an array of extents which ... I guess are the extents
> > that were allocated and mapped into the file? So therefore journal
> > replay of this metadata update becomes a simple matter of logging the
> > new inode core, adding the associated fc extent records to the extent
> > map, and marking the corresponding parts of the block bitmap in use?
> >
> Yes, that's precisely what is done here.
> > I'm wondering why these fast commits aren't written inline with the
> > regular jbd2 transaction block stream? i.e.
> >
> > [descriptors][blocks][commit][fastcommit][fastcommit][descriptor...]
> >
> After a full commit all previous fast commits are invalid. So, if we

All of them, fs-wide? Or just the ones for that particular inode?

> inline fast commits with corresponding transactions, we'll end up
> wasting a whole lot of journal space. So, fast commit area is kept
> separate from the normal journaling area and after every transaction
> commit, fast commit space is reused. But, if we could overwrite fast
> commit blocks with the final commit then it's possible to inline fast
> commit blocks with the transaction stream without losing journal
> space. So, fast commit could just write a fast commit block after
> previous transaction and when next transaction commits, it could
> simply overwrite previous fast commit blocks.

<nod>

> > That way jbd2 replay just adds a case for a journal block with h_magic
> > == JBD2_FC_MAGIC where it checkpoints whatever it had staged at that
> > point, throws the fast commit block up to ext4 to do whatever, and then
> > continues on replaying regular transactions? I get this feeling like
> > fastcommit is a journal that runs inside of/alongside jbd2 and wonder
> > why not just integrate it better with jbd2?
> Hmmm, I agree, we want fast commits to be as close to jbd2 as possible.

:)

--D

> >
> > > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > > index 58ce6b395206..2e0d550b546c 100644
> > > --- a/Documentation/filesystems/journalling.rst
> > > +++ b/Documentation/filesystems/journalling.rst
> > > @@ -115,6 +115,21 @@ called after each transaction commit. You can also use
> > > ``transaction->t_private_list`` for attaching entries to a transaction
> > > that need processing when the transaction commits.
> > >
> > > +JBD2 also allows client file systems to implement file system specific
> > > +commits which are called as ``fast commits``. File systems that wish
> > > +to use this feature should first set
> > > +``journal->j_fc_commit_callback``. That function is called before
> > > +performing a commit. File system can call :c:func:`jbd2_map_fc_buf()`
> > > +to get buffers reserved for fast commits. If file system returns 0,
> > > +JBD2 assumes that file system performed a fast commit and it backs off
> > > +from performing a commit. Otherwise, JBD2 falls back to normal full
> >
> > Huh. Ok, so the caller I guess grabs fastcommit blocks, writes the
> > intent to the fc block, and pushes it to disk, after which we can return
> > to userspace. Some time later jbd2 gets around to committing things so
> > it calls back with ->j_fc_commit_callback at which point we say "Oh! I
> > already wrote that to disk as a fastcommit, so return 0" and jbd2
> > shrugs and moves on to the next transaction?
> I am sorry for the confusing wording here, let me fix it in the next
> version. So, either when fsync() is called or when jbd2 wakes up, in
> both case, journal->j_fc_commit_callback() is invoked by jbd2. In
> other words, journal->j_fc_commit_callback() is the main fastcommit
> "commit" routine. If j_fc_commit_callback() returns 0, jbd2 knows that
> file system was able to perform a fast commit and in that case a full
> commit is not needed. But, there are scenarios when file system thinks
> that it would rather do a full commit. File system can think that for
> a couple of reasons - accumulated work is too much to fit in fast
> commit region, accumulated work is too much to have any performance
> benefits, a complex operation (such as punch hole) was performed for
> which there's no fast commit support yet. In such cases,
> j_fc_commit_callback() can simply return a non-zero value to tell jbd2
> to perform a full traditional commit.
>
> Thanks,
> Harshad
> >
> > --D
> >
> > > +commit. After performing either a fast or a full commit, JBD2 calls
> > > +``journal->j_fc_cleanup_cb`` to allow file systems to perform cleanups
> > > +for their internal fast commit related data structures. At the replay
> > > +time, JBD2 passes each and every fast commit block to the file system
> > > +via ``journal->j_fc_replay_cb``. Ext4 effectively uses this fast
> > > +commit mechanism to improve journal commit performance.
> > > +
> > > JBD2 also provides a way to block all transaction updates via
> > > :c:func:`jbd2_journal_lock_updates()` /
> > > :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> > > --
> > > 2.23.0.rc1.153.gdeed80330f-goog
> > >

2019-10-01 07:43:55

by harshad shirwadkar

[permalink] [raw]
Subject: Re: [PATCH v2 04/12] jbd2: fast-commit commit path changes

Thanks, done in V3.


On Fri, Aug 9, 2019 at 1:22 PM Andreas Dilger <[email protected]> wrote:
>
> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > This patch adds core fast-commit commit path changes. This patch also
> > modifies existing JBD2 APIs to allow usage of fast commits. If fast
> > commits are enabled and journal->j_do_full_commit is not set, the
> > commit routine tries the file system specific fast commmit first. Only
> > if it fails, it falls back to the full commit. Commit start and wait
> > APIs now take an additional argument which indicates if fast commits
> > are allowed or not.
> >
> > In this patch we also add a new entry to journal->stats which counts
> > the number of fast commits performed.
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
>
> It would be better to rename the existing function to something like
> jbd2_log_start_commit_full() and add wrappers jbd2_log_start_commit()
> and jbd2_log_start_commit_fast(), to avoid changing all of the
> callsites to add the same parameter:
>
> int jbd2_log_start_commit_fast(journal_t *journal, tid_t tid)
> {
>         return jbd2_log_start_commit_full(journal, tid, false);
> }
> EXPORT_SYMBOL(jbd2_log_start_commit_fast);
>
> int jbd2_log_start_commit(journal_t *journal, tid_t tid)
> {
>         return jbd2_log_start_commit_full(journal, tid, true);
> }
> EXPORT_SYMBOL(jbd2_log_start_commit);
>
> That makes it much more clear for the few callsites that need the
> "fast" variant what is being done, unlike a "true" or "false"
> argument to a function that isn't very clear what meaning it has.
>
> Cheers, Andreas
>
> > ---
> >
> > Changelog:
> >
> > V2: JBD2 commit routine passes stats to the fast commit callbac. Also,
> > added a new entry to journal->stats and its tracking.
> > ---
> > fs/ext4/super.c | 2 +-
> > fs/jbd2/checkpoint.c | 2 +-
> > fs/jbd2/commit.c | 47 +++++++++++++++++++++++--
> > fs/jbd2/journal.c | 81 +++++++++++++++++++++++++++++++++++--------
> > fs/jbd2/transaction.c | 6 ++--
> > fs/ocfs2/alloc.c | 2 +-
> > fs/ocfs2/super.c | 2 +-
> > include/linux/jbd2.h | 9 +++--
> > 8 files changed, 124 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 81c3ec165822..6bab59ae81f7 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -5148,7 +5148,7 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
> > !jbd2_trans_will_send_data_barrier(sbi->s_journal, target))
> > needs_barrier = true;
> >
> > - if (jbd2_journal_start_commit(sbi->s_journal, &target)) {
> > + if (jbd2_journal_start_commit(sbi->s_journal, &target, true)) {
> > if (wait)
> > ret = jbd2_log_wait_commit(sbi->s_journal,
> > target);
> > diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> > index a1909066bde6..6297978ae3bc 100644
> > --- a/fs/jbd2/checkpoint.c
> > +++ b/fs/jbd2/checkpoint.c
> > @@ -277,7 +277,7 @@ int jbd2_log_do_checkpoint(journal_t *journal)
> >
> > if (batch_count)
> > __flush_batch(journal, &batch_count);
> > - jbd2_log_start_commit(journal, tid);
> > + jbd2_log_start_commit(journal, tid, true);
> > /*
> > * jbd2_journal_commit_transaction() may want
> > * to take the checkpoint_mutex if JBD2_FLUSHED
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 132fb92098c7..9281814606e7 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -351,8 +351,12 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
> > *
> > * The primary function for committing a transaction to the log. This
> > * function is called by the journal thread to begin a complete commit.
> > + *
> > + * fc is input / output parameter. If fc is non-null and is set to true, this
> > + * function tries to perform fast commit. If the fast commit is successfully
> > + * performed, *fc is set to true.
> > */
> > -void jbd2_journal_commit_transaction(journal_t *journal)
> > +void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
> > {
> > struct transaction_stats_s stats;
> > transaction_t *commit_transaction;
> > @@ -380,6 +384,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > tid_t first_tid;
> > int update_tail;
> > int csum_size = 0;
> > + bool full_commit;
> > LIST_HEAD(io_bufs);
> > LIST_HEAD(log_bufs);
> >
> > @@ -413,6 +418,40 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > J_ASSERT(journal->j_running_transaction != NULL);
> > J_ASSERT(journal->j_committing_transaction == NULL);
> >
> > + read_lock(&journal->j_state_lock);
> > + full_commit = journal->j_do_full_commit;
> > + read_unlock(&journal->j_state_lock);
> > +
> > + /* Let file-system try its own fast commit */
> > + if (jbd2_has_feature_fast_commit(journal)) {
> > + if (!full_commit && fc && *fc == true &&
> > + journal->j_fc_commit_callback &&
> > + !journal->j_fc_commit_callback(
> > + journal, journal->j_running_transaction->t_tid,
> > + journal->j_subtid, &stats.run)) {
> > + jbd_debug(3, "fast commit success.\n");
> > + if (journal->j_fc_cleanup_callback)
> > + journal->j_fc_cleanup_callback(journal);
> > + write_lock(&journal->j_state_lock);
> > + journal->j_subtid++;
> > + if (fc)
> > + *fc = true;
> > + write_unlock(&journal->j_state_lock);
> > + goto update_overall_stats;
> > + }
> > + if (journal->j_fc_cleanup_callback)
> > + journal->j_fc_cleanup_callback(journal);
> > + write_lock(&journal->j_state_lock);
> > + journal->j_fc_off = 0;
> > + journal->j_subtid = 0;
> > + journal->j_do_full_commit = false;
> > + write_unlock(&journal->j_state_lock);
> > + }
> > +
> > + jbd_debug(3, "fast commit not performed, trying full.\n");
> > + if (fc)
> > + *fc = false;
> > +
> > commit_transaction = journal->j_running_transaction;
> >
> > trace_jbd2_start_commit(journal, commit_transaction);
> > @@ -1129,8 +1168,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > /*
> > * Calculate overall stats
> > */
> > +update_overall_stats:
> > spin_lock(&journal->j_history_lock);
> > - journal->j_stats.ts_tid++;
> > + if (fc && *fc == true)
> > + journal->j_stats.ts_num_fast_commits++;
> > + else
> > + journal->j_stats.ts_tid++;
> > journal->j_stats.ts_requested += stats.ts_requested;
> > journal->j_stats.run.rs_wait += stats.run.rs_wait;
> > journal->j_stats.run.rs_request_delay += stats.run.rs_request_delay;
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 59ad709154a3..ab05e47ed2d4 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -160,7 +160,13 @@ static void commit_timeout(struct timer_list *t)
> > *
> > * 1) COMMIT: Every so often we need to commit the current state of the
> > * filesystem to disk. The journal thread is responsible for writing
> > - * all of the metadata buffers to disk.
> > + * all of the metadata buffers to disk. If fast commits are allowed,
> > + * journal thread passes the control to the file system and file system
> > + * is then responsible for writing metadata buffers to disk (in whichever
> > + * format it wants). If fast commit succeds, journal thread won't perform
> > + * a normal commit. In case the fast commit fails, journal thread performs
> > + * full commit as normal.
> > + *
> > *
> > * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
> > * of the data in that part of the log has been rewritten elsewhere on
> > @@ -172,6 +178,7 @@ static int kjournald2(void *arg)
> > {
> > journal_t *journal = arg;
> > transaction_t *transaction;
> > + bool fc_flag = true, fc_flag_save;
> >
> > /*
> > * Set up an interval timer which can be used to trigger a commit wakeup
> > @@ -209,9 +216,14 @@ static int kjournald2(void *arg)
> > jbd_debug(1, "OK, requests differ\n");
> > write_unlock(&journal->j_state_lock);
> > del_timer_sync(&journal->j_commit_timer);
> > - jbd2_journal_commit_transaction(journal);
> > + fc_flag_save = fc_flag;
> > + jbd2_journal_commit_transaction(journal, &fc_flag);
> > write_lock(&journal->j_state_lock);
> > - goto loop;
> > + if (!fc_flag) {
> > + /* fast commit not performed */
> > + fc_flag = fc_flag_save;
> > + goto loop;
> > + }
> > }
> >
> > wake_up(&journal->j_wait_done_commit);
> > @@ -235,16 +247,18 @@ static int kjournald2(void *arg)
> >
> > prepare_to_wait(&journal->j_wait_commit, &wait,
> > TASK_INTERRUPTIBLE);
> > - if (journal->j_commit_sequence != journal->j_commit_request)
> > + if (!fc_flag &&
> > + journal->j_commit_sequence != journal->j_commit_request)
> > should_sleep = 0;
> > transaction = journal->j_running_transaction;
> > if (transaction && time_after_eq(jiffies,
> > - transaction->t_expires))
> > + transaction->t_expires))
> > should_sleep = 0;
> > if (journal->j_flags & JBD2_UNMOUNT)
> > should_sleep = 0;
> > if (should_sleep) {
> > write_unlock(&journal->j_state_lock);
> > + jbd_debug(1, "%s sleeps\n", __func__);
> > schedule();
> > write_lock(&journal->j_state_lock);
> > }
> > @@ -259,7 +273,10 @@ static int kjournald2(void *arg)
> > transaction = journal->j_running_transaction;
> > if (transaction && time_after_eq(jiffies, transaction->t_expires)) {
> > journal->j_commit_request = transaction->t_tid;
> > + fc_flag = false;
> > jbd_debug(1, "woke because of timeout\n");
> > + } else {
> > + fc_flag = true;
> > }
> > goto loop;
> >
> > @@ -517,11 +534,17 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t target)
> > return 0;
> > }
> >
> > -int jbd2_log_start_commit(journal_t *journal, tid_t tid)
> > +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit)
> > {
> > int ret;
> >
> > write_lock(&journal->j_state_lock);
> > + /*
> > + * If someone has already requested a full commit,
> > + * we have to honor it.
> > + */
> > + if (!journal->j_do_full_commit)
> > + journal->j_do_full_commit = full_commit;
> > ret = __jbd2_log_start_commit(journal, tid);
> > write_unlock(&journal->j_state_lock);
> > return ret;
> > @@ -556,7 +579,7 @@ static int __jbd2_journal_force_commit(journal_t *journal)
> > tid = transaction->t_tid;
> > read_unlock(&journal->j_state_lock);
> > if (need_to_start)
> > - jbd2_log_start_commit(journal, tid);
> > + jbd2_log_start_commit(journal, tid, true);
> > ret = jbd2_log_wait_commit(journal, tid);
> > if (!ret)
> > ret = 1;
> > @@ -603,11 +626,14 @@ int jbd2_journal_force_commit(journal_t *journal)
> > * if a transaction is going to be committed (or is currently already
> > * committing), and fills its tid in at *ptid
> > */
> > -int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
> > +int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid, bool full_commit)
> > {
> > int ret = 0;
> >
> > write_lock(&journal->j_state_lock);
> > + if (!journal->j_do_full_commit)
> > + journal->j_do_full_commit = full_commit;
> > +
> > if (journal->j_running_transaction) {
> > tid_t tid = journal->j_running_transaction->t_tid;
> >
> > @@ -675,7 +701,7 @@ EXPORT_SYMBOL(jbd2_trans_will_send_data_barrier);
> > * Wait for a specified commit to complete.
> > * The caller may not hold the journal lock.
> > */
> > -int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> > +int __jbd2_log_wait_commit(journal_t *journal, tid_t tid, tid_t subtid)
> > {
> > int err = 0;
> >
> > @@ -702,12 +728,25 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> > }
> > #endif
> > while (tid_gt(tid, journal->j_commit_sequence)) {
> > - jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
> > - tid, journal->j_commit_sequence);
> > + if ((!journal->j_do_full_commit) &&
> > + !tid_geq(subtid, journal->j_subtid))
> > + break;
> > + jbd_debug(1, "JBD2: want full commit %u %s %u, ",
> > + tid, journal->j_do_full_commit ?
> > + "and ignoring fast commit request for " :
> > + "or want fast commit",
> > + journal->j_subtid);
> > + jbd_debug(1, "j_commit_sequence=%u, j_subtid=%u\n",
> > + journal->j_commit_sequence, journal->j_subtid);
> > read_unlock(&journal->j_state_lock);
> > wake_up(&journal->j_wait_commit);
> > - wait_event(journal->j_wait_done_commit,
> > - !tid_gt(tid, journal->j_commit_sequence));
> > + if (journal->j_do_full_commit)
> > + wait_event(journal->j_wait_done_commit,
> > + !tid_gt(tid, journal->j_commit_sequence));
> > + else
> > + wait_event(journal->j_wait_done_commit,
> > + !tid_gt(tid, journal->j_commit_sequence) ||
> > + !tid_geq(subtid, journal->j_subtid));
> > read_lock(&journal->j_state_lock);
> > }
> > read_unlock(&journal->j_state_lock);
> > @@ -717,6 +756,13 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> > return err;
> > }
> >
> > +int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> > +{
> > + journal->j_do_full_commit = true;
> > + return __jbd2_log_wait_commit(journal, tid, 0);
> > +}
> > +
> > +
> > /* Return 1 when transaction with given tid has already committed. */
> > int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> > {
> > @@ -751,7 +797,7 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
> > if (journal->j_commit_request != tid) {
> > /* transaction not yet started, so request it */
> > read_unlock(&journal->j_state_lock);
> > - jbd2_log_start_commit(journal, tid);
> > + jbd2_log_start_commit(journal, tid, true);
> > goto wait_commit;
> > }
> > } else if (!(journal->j_committing_transaction &&
> > @@ -996,6 +1042,8 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
> > "each up to %u blocks\n",
> > s->stats->ts_tid, s->stats->ts_requested,
> > s->journal->j_max_transaction_buffers);
> > + seq_printf(seq, "%lu fast commits performed\n",
> > + s->stats->ts_num_fast_commits);
> > if (s->stats->ts_tid == 0)
> > return 0;
> > seq_printf(seq, "average: \n %ums waiting for transaction\n",
> > @@ -1020,6 +1068,9 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
> > s->stats->run.rs_blocks / s->stats->ts_tid);
> > seq_printf(seq, " %lu logged blocks per transaction\n",
> > s->stats->run.rs_blocks_logged / s->stats->ts_tid);
> > + seq_printf(seq, " %lu logged blocks per commit\n",
> > + s->stats->run.rs_blocks_logged /
> > + (s->stats->ts_tid + s->stats->ts_num_fast_commits));
> > return 0;
> > }
> >
> > @@ -1741,7 +1792,7 @@ int jbd2_journal_destroy(journal_t *journal)
> >
> > /* Force a final log commit */
> > if (journal->j_running_transaction)
> > - jbd2_journal_commit_transaction(journal);
> > + jbd2_journal_commit_transaction(journal, NULL);
> >
> > /* Force any old transactions to disk */
> >
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index 990e7b5062e7..87f6627d78aa 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -154,7 +154,7 @@ static void wait_transaction_locked(journal_t *journal)
> > need_to_start = !tid_geq(journal->j_commit_request, tid);
> > read_unlock(&journal->j_state_lock);
> > if (need_to_start)
> > - jbd2_log_start_commit(journal, tid);
> > + jbd2_log_start_commit(journal, tid, true);
> > jbd2_might_wait_for_commit(journal);
> > schedule();
> > finish_wait(&journal->j_wait_transaction_locked, &wait);
> > @@ -708,7 +708,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
> > need_to_start = !tid_geq(journal->j_commit_request, tid);
> > read_unlock(&journal->j_state_lock);
> > if (need_to_start)
> > - jbd2_log_start_commit(journal, tid);
> > + jbd2_log_start_commit(journal, tid, true);
> >
> > rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
> > handle->h_buffer_credits = nblocks;
> > @@ -1822,7 +1822,7 @@ int jbd2_journal_stop(handle_t *handle)
> > jbd_debug(2, "transaction too old, requesting commit for "
> > "handle %p\n", handle);
> > /* This is non-blocking */
> > - jbd2_log_start_commit(journal, transaction->t_tid);
> > + jbd2_log_start_commit(journal, transaction->t_tid, true);
> >
> > /*
> > * Special case: JBD2_SYNC synchronous updates require us
> > diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
> > index 0c335b51043d..df41c43573b7 100644
> > --- a/fs/ocfs2/alloc.c
> > +++ b/fs/ocfs2/alloc.c
> > @@ -6117,7 +6117,7 @@ int ocfs2_try_to_free_truncate_log(struct ocfs2_super *osb,
> > goto out;
> > }
> >
> > - if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) {
> > + if (jbd2_journal_start_commit(osb->journal->j_journal, &target, true)) {
> > jbd2_log_wait_commit(osb->journal->j_journal, target);
> > ret = 1;
> > }
> > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> > index 8b2f39506648..60ecc51759ae 100644
> > --- a/fs/ocfs2/super.c
> > +++ b/fs/ocfs2/super.c
> > @@ -410,7 +410,7 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
> > }
> >
> > if (jbd2_journal_start_commit(osb->journal->j_journal,
> > - &target)) {
> > + &target, true)) {
> > if (wait)
> > jbd2_log_wait_commit(osb->journal->j_journal,
> > target);
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 153840b422cc..535f88dff653 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -742,6 +742,7 @@ struct transaction_run_stats_s {
> >
> > struct transaction_stats_s {
> > unsigned long ts_tid;
> > + unsigned long ts_num_fast_commits;
> > unsigned long ts_requested;
> > struct transaction_run_stats_s run;
> > };
> > @@ -1364,7 +1365,8 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> > void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
> >
> > /* Commit management */
> > -extern void jbd2_journal_commit_transaction(journal_t *);
> > +extern void jbd2_journal_commit_transaction(journal_t *journal,
> > + bool *full_commit);
> >
> > /* Checkpoint list management */
> > void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy);
> > @@ -1571,9 +1573,10 @@ extern void jbd2_clear_buffer_revoked_flags(journal_t *journal);
> > * transitions on demand.
> > */
> >
> > -int jbd2_log_start_commit(journal_t *journal, tid_t tid);
> > +int jbd2_log_start_commit(journal_t *journal, tid_t tid, bool full_commit);
> > int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
> > -int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
> > +int jbd2_journal_start_commit(journal_t *journal, tid_t *tid,
> > + bool full_commit);
> > int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
> > int jbd2_transaction_committed(journal_t *journal, tid_t tid);
> > int jbd2_complete_transaction(journal_t *journal, tid_t tid);
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
>
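
For readers following the j_do_full_commit handling in the quoted diff: both jbd2_log_start_commit() and jbd2_journal_start_commit() only ever upgrade the flag, so a later fast-commit request cannot cancel a pending full-commit request. A minimal userspace sketch of that rule follows. The names toy_journal and toy_request_commit are hypothetical, the real code runs under j_state_lock, and the point where the flag is cleared again is not shown in this part of the thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the journal state touched by the patch. */
struct toy_journal {
	bool do_full_commit;	/* mirrors j_do_full_commit in the quoted diff */
};

/*
 * Record a commit request. Once any caller has asked for a full commit,
 * a later fast-commit request (full_commit == false) must not downgrade it.
 * The kernel code does this while holding j_state_lock for writing.
 */
void toy_request_commit(struct toy_journal *j, bool full_commit)
{
	if (!j->do_full_commit)
		j->do_full_commit = full_commit;
}
```

Calling toy_request_commit() with false, then true, then false leaves do_full_commit set, which is the "we have to honor it" behaviour described in the patch comment.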

2019-10-01 07:59:11

by harshad shirwadkar

Subject: Re: [PATCH v2 07/12] ext4: add fields that are needed to track changed files

Thanks, handled in V3.

On Fri, Aug 9, 2019 at 2:23 PM Andreas Dilger <[email protected]> wrote:
>
> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > Ext4's fast commit feature tracks changed files and maintains them in
> > a queue. We also remember for each file the logical block range that
> > needs to be committed. This patch adds these fields to ext4_inode_info
> > and ext4_sb_info and also adds initialization calls.
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> >
> > ---
> >
> > Changelog:
> >
> > V2: Converted s_fc_lock from mutex to spinlock to improve parallelism
> > performance.
> > ---
> > fs/ext4/ext4.h | 34 ++++++++++++++++++++++++++++++++++
> > fs/ext4/ext4_jbd2.c | 13 +++++++++++++
> > fs/ext4/ext4_jbd2.h | 2 ++
> > fs/ext4/inode.c | 1 +
> > fs/ext4/super.c | 7 +++++++
> > 5 files changed, 57 insertions(+)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index becbda38b7db..0d15d4539dda 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -921,6 +921,27 @@ enum {
> > I_DATA_SEM_QUOTA,
> > };
> >
> > +/*
> > + * Ext4 fast commit inode specific information
> > + */
> > +struct ext4_fast_commit_inode_info {
> > + /* TID / SUB-TID when old_i_size and i_size were recorded */
> > + tid_t fc_tid;
> > + tid_t fc_subtid;
> > +
> > + /*
> > + * Start of logical block range that needs to be committed in this fast
> > + * commit
> > + */
> > + loff_t fc_lblk_start;
> > +
> > + /*
> > + * End of logical block range that needs to be committed in this fast
> > + * commit
> > + */
> > + loff_t fc_lblk_end;
>
> Since these are logical block numbers within the journal, they certainly
> don't need to be 64-bit values. loff_t is for byte offsets, this should
> use ext4_lblk_t, which will also reduce the size of the struct by 8 bytes.
>
> > +};
> > +
> >
> > /*
> > * fourth extended file system inode data in memory
> > @@ -955,6 +976,9 @@ struct ext4_inode_info {
> >
> > struct list_head i_orphan; /* unlinked but open inodes */
> >
> > + struct list_head i_fc_list; /* inodes that need fast commit */
>
> This comment should document what lock is protecting this list, along
> with the other fields.
>
> > + struct ext4_fast_commit_inode_info i_fc;
>
> Since this increases the size of the inode, does it affect the number of
> inodes that can fit into one page of ext4_inode_cachep?
>
> > /*
> > * i_disksize keeps track of what the inode size is ON DISK, not
> > * in memory. During truncate, i_size is set to the new size by
> > @@ -1529,6 +1553,16 @@ struct ext4_sb_info {
> > /* Barrier between changing inodes' journal flags and writepages ops. */
> > struct percpu_rw_semaphore s_journal_flag_rwsem;
> > struct dax_device *s_daxdev;
> > +
> > + /* Ext4 fast commit stuff */
> > + bool fc_replay; /* Fast commit replay in progress */
> > + struct list_head s_fc_q; /* Inodes that need fast commit. */
>
> This comment should document what lock is protecting this list, along
> with the other fields.
>
> > + __u32 s_fc_q_cnt; /* Number of inodes in the fc queue */
> > + bool s_fc_eligible; /*
> > + * Are changes after the last commit
> > + * eligible for fast commit?
> > + */
>
> It is slightly more space efficient to put the bool values together
> rather than interleaving them between 64-bit values.
>
> > + spinlock_t s_fc_lock;
> > };
> >
> > static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> > diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> > index 7c70b08d104c..75b6db808837 100644
> > --- a/fs/ext4/ext4_jbd2.c
> > +++ b/fs/ext4/ext4_jbd2.c
> > @@ -330,3 +330,16 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
> > mark_buffer_dirty(bh);
> > return err;
> > }
> > +
> > +void ext4_init_inode_fc_info(struct inode *inode)
> > +{
> > + handle_t *handle = ext4_journal_current_handle();
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > + memset(&ei->i_fc, 0, sizeof(ei->i_fc));
> > + if (ext4_handle_valid(handle)) {
> > + ei->i_fc.fc_tid = handle->h_transaction->t_tid;
> > + ei->i_fc.fc_subtid = handle->h_transaction->t_journal->j_subtid;
> > + }
> > + INIT_LIST_HEAD(&ei->i_fc_list);
> > +}
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index ef8fcf7d0d3b..2305c1acd415 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -459,4 +459,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
> > return 1;
> > }
> >
> > +void ext4_init_inode_fc_info(struct inode *inode);
> > +
> > #endif /* _EXT4_JBD2_H */
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 420fe3deed39..f230a888eddd 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4996,6 +4996,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
> > for (block = 0; block < EXT4_N_BLOCKS; block++)
> > ei->i_data[block] = raw_inode->i_block[block];
> > INIT_LIST_HEAD(&ei->i_orphan);
> > + ext4_init_inode_fc_info(&ei->vfs_inode);
> >
> > /*
> > * Set transaction id's of transactions that have to be committed
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 6bab59ae81f7..0b833e9b61c1 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1100,6 +1100,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
> > ei->i_datasync_tid = 0;
> > atomic_set(&ei->i_unwritten, 0);
> > INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> > + ext4_init_inode_fc_info(&ei->vfs_inode);
> > return &ei->vfs_inode;
> > }
> >
> > @@ -1139,6 +1140,7 @@ static void init_once(void *foo)
> > init_rwsem(&ei->i_data_sem);
> > init_rwsem(&ei->i_mmap_sem);
> > inode_init_once(&ei->vfs_inode);
> > + ext4_init_inode_fc_info(&ei->vfs_inode);
> > }
> >
> > static int __init init_inodecache(void)
> > @@ -4301,6 +4303,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
> > mutex_init(&sbi->s_orphan_lock);
> >
> > + INIT_LIST_HEAD(&sbi->s_fc_q);
> > + sbi->s_fc_q_cnt = 0;
> > + sbi->s_fc_eligible = true;
> > + spin_lock_init(&sbi->s_fc_lock);
> > +
> > sb->s_root = NULL;
> >
> > needs_recovery = (es->s_last_orphan != 0 ||
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
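
To illustrate the review comment above about loff_t versus ext4_lblk_t, here is a small userspace sketch comparing the two layouts of the fast commit inode info. The toy_* typedefs are stand-ins chosen to match the usual kernel widths (tid_t as a 32-bit unsigned int, loff_t as a 64-bit signed offset, ext4_lblk_t as a 32-bit block number); the struct names are hypothetical and the sizes assume a common ABI.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel typedefs (assumed widths). */
typedef uint32_t toy_tid_t;	/* tid_t is an unsigned int in jbd2 */
typedef int64_t  toy_loff_t;	/* loff_t is a 64-bit byte offset */
typedef uint32_t toy_lblk_t;	/* ext4_lblk_t is a 32-bit logical block number */

/* Layout as posted in V2: block range stored as loff_t. */
struct fc_info_loff {
	toy_tid_t  fc_tid;
	toy_tid_t  fc_subtid;
	toy_loff_t fc_lblk_start;
	toy_loff_t fc_lblk_end;
};

/* Layout suggested in review: block range stored as ext4_lblk_t. */
struct fc_info_lblk {
	toy_tid_t  fc_tid;
	toy_tid_t  fc_subtid;
	toy_lblk_t fc_lblk_start;
	toy_lblk_t fc_lblk_end;
};

/* Bytes saved per inode by switching the two range fields to 32 bits. */
size_t fc_info_savings(void)
{
	return sizeof(struct fc_info_loff) - sizeof(struct fc_info_lblk);
}
```

Under these assumptions the first struct is 24 bytes and the second 16, which matches the 8 bytes mentioned in the review. Whether that changes how many inodes fit in a slab page depends on the total size of ext4_inode_info, which this sketch does not model.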
>

2019-10-01 07:59:17

by harshad shirwadkar

Subject: Re: [PATCH v2 08/12] ext4: track changed files for fast commit

Thanks, done in V3.

On Fri, Aug 9, 2019 at 2:46 PM Andreas Dilger <[email protected]> wrote:
>
> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > For fast commit, we need to remember all the files that have changed
> > since last fast commit / full commit. For changes that are fast commit
> > incompatible, we mark the file system fast commit incompatible. This
> > patch adds code to either remember files that have changed or to mark
> > ext4 as fast commit ineligible. We inspect every ext4_mark_inode_dirty
> > call and decide whether that particular file change is fast commit
> > compatible or not.
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
>
> Some minor code style cleanups.
>
> > @@ -759,6 +761,8 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
> >
> > ext4_write_unlock_xattr(inode, &no_expand);
> > brelse(iloc.bh);
> > + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> > + inode);
>
> (style) "inode" doesn't need to be split to a separate line
>
> > mark_inode_dirty(inode);
> > out:
> > return copied;
> > @@ -974,6 +978,8 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
> > * ordering of page lock and transaction start for journaling
> > * filesystems.
> > */
> > + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> > + inode);
>
> (style) "inode" doesn't need to be split to a separate line
>
> > @@ -5697,6 +5719,8 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> >
> > if (!error) {
> > setattr_copy(inode, attr);
> > + ext4_fc_enqueue_inode(ext4_journal_current_handle(),
> > + inode);
>
> (style) "inode" doesn't need to be split to a separate line
>
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 0b833e9b61c1..c7bb52bdaf6e 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1129,6 +1129,16 @@ static void ext4_destroy_inode(struct inode *inode)
> > true);
> > dump_stack();
> > }
> > + if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
> > +#ifdef EXT4FS_DEBUG
> > + if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
> > + pr_warn("%s: INODE %ld in FC List with FC allowed",
> > + __func__, inode->i_ino);
>
> (style) this should use ext4fs_debug(), since pr_warn() is not really
> used in the ext4 code
>
> > + dump_stack();
> > + }
> > +#endif
> > + ext4_fc_del(inode);
> > + }
> > }
>
>
> Cheers, Andreas
>
>
>
>
>
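
The tracking idea in the commit message above (remember each changed file once, or mark the whole file system ineligible for fast commit) can be sketched in userspace as follows. Everything here is hypothetical: toy_sb, toy_inode and the toy_fc_* helpers are illustration only, the fixed-size array stands in for the s_fc_q list, and treating a full queue as "ineligible" is an assumption of this sketch rather than something taken from the patch. Locking (s_fc_lock in the series) is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_FC_MAX 8	/* arbitrary capacity for this sketch */

struct toy_inode {
	unsigned long ino;
	bool queued;		/* plays the role of !list_empty(&i_fc_list) */
};

struct toy_sb {
	struct toy_inode *fc_q[TOY_FC_MAX];	/* inodes that need fast commit */
	size_t fc_q_cnt;			/* number of inodes in the queue */
	bool fc_eligible;			/* can the next commit be a fast one? */
};

void toy_sb_init(struct toy_sb *sb)
{
	sb->fc_q_cnt = 0;
	sb->fc_eligible = true;
}

/* Remember a changed inode; each inode is queued at most once. */
void toy_fc_enqueue(struct toy_sb *sb, struct toy_inode *inode)
{
	if (inode->queued)
		return;
	if (sb->fc_q_cnt == TOY_FC_MAX) {
		/* Assumption of this sketch: no room means fall back to a full commit. */
		sb->fc_eligible = false;
		return;
	}
	sb->fc_q[sb->fc_q_cnt++] = inode;
	inode->queued = true;
}

/* Called for changes that a fast commit cannot describe. */
void toy_fc_mark_ineligible(struct toy_sb *sb)
{
	sb->fc_eligible = false;
}
```

A commit path built on this would check fc_eligible first and only walk the queue when it is still true.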

2019-10-01 07:59:18

by harshad shirwadkar

Subject: Re: [PATCH v2 03/12] jbd2: fast commit setup and enable

Oops, I missed this, I'll handle this in V4. Thanks!

On Fri, Aug 9, 2019 at 1:02 PM Andreas Dilger <[email protected]> wrote:
>
> On Aug 8, 2019, at 9:45 PM, Harshad Shirwadkar <[email protected]> wrote:
> >
> > This patch allows file systems to turn fast commits on and thereby
> > restrict the normal journalling space to total journal blocks minus
> > JBD2_FAST_COMMIT_BLOCKS. Fast commits are not actually performed, just
> > the interface to turn fast commits on is opened.
> >
> > Signed-off-by: Harshad Shirwadkar <[email protected]>
> >
> > ---
> >
> > Changelog:
> >
> > V2: No changes since V1
> > ---
> > fs/ext4/super.c | 3 ++-
> > fs/jbd2/journal.c | 39 ++++++++++++++++++++++++++++++++-------
> > fs/ocfs2/journal.c | 4 ++--
> > include/linux/jbd2.h | 2 +-
> > 4 files changed, 37 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index e376ac040cce..81c3ec165822 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -4933,7 +4933,8 @@ static int ext4_load_journal(struct super_block *sb,
> > if (save)
> > memcpy(save, ((char *) es) +
> > EXT4_S_ERR_START, EXT4_S_ERR_LEN);
> > - err = jbd2_journal_load(journal);
> > + err = jbd2_journal_load(journal,
> > + test_opt2(sb, JOURNAL_FAST_COMMIT));
> > if (save)
> > memcpy(((char *) es) + EXT4_S_ERR_START,
> > save, EXT4_S_ERR_LEN);
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 953990eb70a9..59ad709154a3 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -1159,12 +1159,15 @@ static journal_t *journal_init_common(struct block_device *bdev,
> > journal->j_blk_offset = start;
> > journal->j_maxlen = len;
> > n = journal->j_blocksize / sizeof(journal_block_tag_t);
> > - journal->j_wbufsize = n;
> > + journal->j_wbufsize = n - JBD2_FAST_COMMIT_BLOCKS;
>
> The reservation of the JBD2_FAST_COMMIT_BLOCKS should only be done in
> the case of the FAST_COMMIT feature being enabled. Otherwise it can
> hurt performance for filesystems where this feature is not enabled.
>
> Cheers, Andreas
>
> > journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
> > GFP_KERNEL);
> > if (!journal->j_wbuf)
> > goto err_cleanup;
> >
> > + journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
> > + journal->j_fc_wbufsize = JBD2_FAST_COMMIT_BLOCKS;
> > +
> > bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
> > if (!bh) {
> > pr_err("%s: Cannot get buffer for journal superblock\n",
> > @@ -1297,11 +1300,19 @@ static int journal_reset(journal_t *journal)
> > }
> >
> > journal->j_first = first;
> > - journal->j_last = last;
> >
> > - journal->j_head = first;
> > - journal->j_tail = first;
> > - journal->j_free = last - first;
> > + if (jbd2_has_feature_fast_commit(journal)) {
> > + journal->j_last_fc = last;
> > + journal->j_last = last - JBD2_FAST_COMMIT_BLOCKS;
> > + journal->j_first_fc = journal->j_last + 1;
> > + journal->j_fc_off = 0;
> > + } else {
> > + journal->j_last = last;
> > + }
> > +
> > + journal->j_head = journal->j_first;
> > + journal->j_tail = journal->j_first;
> > + journal->j_free = journal->j_last - journal->j_first;
> >
> > journal->j_tail_sequence = journal->j_transaction_sequence;
> > journal->j_commit_sequence = journal->j_transaction_sequence - 1;
> > @@ -1626,9 +1637,17 @@ static int load_superblock(journal_t *journal)
> > journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
> > journal->j_tail = be32_to_cpu(sb->s_start);
> > journal->j_first = be32_to_cpu(sb->s_first);
> > - journal->j_last = be32_to_cpu(sb->s_maxlen);
> > journal->j_errno = be32_to_cpu(sb->s_errno);
> >
> > + if (jbd2_has_feature_fast_commit(journal)) {
> > + journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
> > + journal->j_last = journal->j_last_fc - JBD2_FAST_COMMIT_BLOCKS;
> > + journal->j_first_fc = journal->j_last + 1;
> > + journal->j_fc_off = 0;
> > + } else {
> > + journal->j_last = be32_to_cpu(sb->s_maxlen);
> > + }
> > +
> > return 0;
> > }
> >
> > @@ -1641,7 +1660,7 @@ static int load_superblock(journal_t *journal)
> > * a journal, read the journal from disk to initialise the in-memory
> > * structures.
> > */
> > -int jbd2_journal_load(journal_t *journal)
> > +int jbd2_journal_load(journal_t *journal, bool enable_fc)
> > {
> > int err;
> > journal_superblock_t *sb;
> > @@ -1684,6 +1703,12 @@ int jbd2_journal_load(journal_t *journal)
> > return -EFSCORRUPTED;
> > }
> >
> > + if (enable_fc)
> > + jbd2_journal_set_features(journal, 0, 0,
> > + JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> > + else
> > + jbd2_journal_clear_features(journal, 0, 0,
> > + JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> > /* OK, we've finished with the dynamic journal bits:
> > * reinitialise the dynamic contents of the superblock in memory
> > * and reset them on disk. */
> > diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> > index 930e3d388579..3b4d91b16e8e 100644
> > --- a/fs/ocfs2/journal.c
> > +++ b/fs/ocfs2/journal.c
> > @@ -1057,7 +1057,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local, int replayed)
> >
> > osb = journal->j_osb;
> >
> > - status = jbd2_journal_load(journal->j_journal);
> > + status = jbd2_journal_load(journal->j_journal, false);
> > if (status < 0) {
> > mlog(ML_ERROR, "Failed to load journal!\n");
> > goto done;
> > @@ -1642,7 +1642,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
> > goto done;
> > }
> >
> > - status = jbd2_journal_load(journal);
> > + status = jbd2_journal_load(journal, false);
> > if (status < 0) {
> > mlog_errno(status);
> > if (!igrab(inode))
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 9a750b732241..153840b422cc 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -1476,7 +1476,7 @@ extern int jbd2_journal_set_features
> > (journal_t *, unsigned long, unsigned long, unsigned long);
> > extern void jbd2_journal_clear_features
> > (journal_t *, unsigned long, unsigned long, unsigned long);
> > -extern int jbd2_journal_load (journal_t *journal);
> > +extern int jbd2_journal_load(journal_t *journal, bool enable_fc);
> > extern int jbd2_journal_destroy (journal_t *);
> > extern int jbd2_journal_recover (journal_t *journal);
> > extern int jbd2_journal_wipe (journal_t *, int);
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
> >
>
>
> Cheers, Andreas
>
>
>
>
>
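
As a reading aid for the journal_reset() and load_superblock() hunks quoted above, the following userspace sketch reproduces the arithmetic that carves the fast commit area out of the end of the journal only when the feature is enabled. The struct and function names are hypothetical, and TOY_FAST_COMMIT_BLOCKS is an assumed value; the actual JBD2_FAST_COMMIT_BLOCKS constant is not shown in this part of the thread.

```c
#include <stdbool.h>

#define TOY_FAST_COMMIT_BLOCKS 128U	/* assumed value, for illustration only */

struct toy_layout {
	unsigned int first;	/* j_first */
	unsigned int last;	/* j_last: end of the normal journalling area */
	unsigned int first_fc;	/* j_first_fc, 0 when fast commit is off */
	unsigned int last_fc;	/* j_last_fc, 0 when fast commit is off */
	unsigned int free;	/* j_free for a freshly reset journal */
};

/* Mirrors the arithmetic in the quoted journal_reset() hunk. */
struct toy_layout toy_journal_layout(unsigned int first, unsigned int last,
				     bool fast_commit)
{
	struct toy_layout l = { .first = first };

	if (fast_commit) {
		l.last_fc = last;
		l.last = last - TOY_FAST_COMMIT_BLOCKS;
		l.first_fc = l.last + 1;
	} else {
		l.last = last;
	}
	l.free = l.last - l.first;
	return l;
}
```

With first = 1 and last = 1024, enabling fast commit gives last = 896 and free = 895, while leaving it disabled keeps last = 1024 and free = 1023. The review comment asks for the same conditional treatment of j_wbufsize in journal_init_common(), so that journals without the feature do not give up any buffers.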

2019-11-01 11:33:08

by xiaohui li

Subject: Re: [PATCH v2 03/12] jbd2: fast commit setup and enable

Great.

On Tue, Oct 1, 2019 at 3:53 PM harshad shirwadkar
<[email protected]> wrote:
>
> Oops, I missed this, I'll handle this in V4. Thanks!