2019-09-22 19:24:32

by Xiaoguang Wang

Subject: [PATCH 0/2] make jbd2 support checkpoint asynchronously

In the current jbd2 implementation, jbd2 won't reclaim journal space until
free journal space drops below the required amount; see the logic in
add_transaction_credits():

	write_lock(&journal->j_state_lock);
	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
		__jbd2_log_wait_for_space(journal);
	write_unlock(&journal->j_state_lock);

With this logic we can also end up with many transactions queued to be
checkpointed, which means those transactions still occupy jbd2 space.

Journal space behaves somewhat like a global lock. In the high-concurrency
case, if many tasks contend for journal credits, they are easily blocked
waiting for free journal space, so I wonder whether we can reclaim journal
space asynchronously when free space drops below a specified threshold, to
avoid stalling all applications at the same time. This is even more useful
on high-speed storage: journal space is reclaimed quickly in the background,
and applications are less likely to get stuck. To achieve this, we use a
workqueue to queue a work item that reclaims journal space in the background.
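
For clarity, here is a minimal sketch of the trigger added in patch 1/2's
add_transaction_credits() hunk (simplified; see that patch for the exact
code):

	if (journal->j_flags & JBD2_ASYNC_CHECKPOINT &&
	    READ_ONCE(journal->j_async_checkpoint_run) == 0) {
		/* threshold is a percentage of the journal size */
		unsigned int thresh = journal->j_async_checkpoint_thresh *
				      journal->j_maxlen / 100;

		if (jbd2_log_space_left(journal) < thresh) {
			journal->j_async_checkpoint_run = 1;
			/* reclaim in the background instead of blocking */
			queue_work(journal->j_checkpoint_wq,
				   &journal->j_checkpoint_work);
		}
	}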

I have used fs_mark for performance testing. In most cases we see a
performance improvement; in one specific case the improvement is about
14.4%. See patch "ext4: add async_checkpoint mount option" for detailed
test info.

Xiaoguang Wang (2):
jbd2: checkpoint asynchronously when free journal space is lower than
threshold
ext4: add async_checkpoint mount option

fs/ext4/ext4.h | 2 ++
fs/ext4/super.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++
fs/jbd2/checkpoint.c | 28 +++++++++++++++++---
fs/jbd2/journal.c | 15 +++++++++--
fs/jbd2/transaction.c | 16 ++++++++++++
include/linux/jbd2.h | 48 +++++++++++++++++++++++++++++++++-
6 files changed, 174 insertions(+), 6 deletions(-)

--
1.8.3.1


2019-09-22 19:24:32

by Xiaoguang Wang

Subject: [RFC 2/2] ext4: add async_checkpoint mount option

Now that jbd2 supports checkpointing transactions asynchronously and
proactively when free journal space drops below a user-specified
threshold, add a new mount option "async_checkpoint" for users to
enable or disable this jbd2 feature.

Usage:
# with default threshold 50%
sudo mount -o async_checkpoint /dev/nvme0n1 mntpoint

# user specifies a threshold 30%
sudo mount -o async_checkpoint=30 /dev/nvme0n1 mntpoint

# do a remount to enable this feature with default threshold 50%
sudo mount -o remount,async_checkpoint /dev/nvme0n1

# do a remount to enable this feature with threshold 30%
sudo mount -o remount,async_checkpoint=30 /dev/nvme0n1

# disable this feature
sudo mount -o remount,noasync_checkpoint /dev/nvme0n1
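
The threshold is interpreted as a percentage of the journal size. As
computed in patch 1/2, the async checkpoint work is kicked once free log
space drops below (sketching the relevant expression from that patch):

	async_ckpt_thresh = journal->j_async_checkpoint_thresh *
			    journal->j_maxlen / 100;

so for example async_checkpoint=30 starts background checkpointing once
roughly less than 30% of the journal is free.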

I have used fs_mark for performance tests:
fs_mark -d mntpoint/testdir/ -D 16 -t 32 -n 500000 -s 4096 -S $sync_mode -N 256 -k
where sync_mode is 0, 1, 2, 3, 4, 5 or 6, and the transaction commit info
comes from /proc/fs/jbd2/nvme0n1-8/info.

Please also refer to fs_mark's README for what each sync_mode value means.

Test 1: sync_mode = 0
without patch:
Average Files/sec: 96898.0
177 transactions (177 requested), each up to 65536 blocks
with patch:
Average Files/sec: 97727.0
177 transactions (177 requested), each up to 65536 blocks
About 0.8% improvement; not significant.

Test 2: sync_mode = 1
without patch:
Average Files/sec: 46780.0
1210422 transactions (1210422 requested), each up to 65536 blocks
with patch:
Average Files/sec: 49510.0
1053905 transactions (1053905 requested), each up to 65536 blocks
About 5.8% improvement, and the number of transactions is reduced.

Test 3: sync_mode = 2
without patch:
Average Files/sec: 71072.0
190 transactions (190 requested), each up to 65536 blocks
with patch:
Average Files/sec: 72464.0
189 transactions (189 requested), each up to 65536 blocks
About 1.9% improvement.

Test 4: sync_mode = 3
without patch:
Average Files/sec: 61977.0
282973 transactions (282973 requested), each up to 65536 blocks
with patch:
Average Files/sec: 70962.0
88148 transactions (88148 requested), each up to 65536 blocks
About 14.4% improvement, which is quite significant, and the number of
transactions is greatly reduced.

Test 5: sync_mode = 4
without patch:
Average Files/sec: 69796.0
190 transactions (190 requested), each up to 65536 blocks
with patch:
Average Files/sec: 70708.0
189 transactions (189 requested), each up to 65536 blocks
About 1.3% improvement; not significant.

Test 6: sync_mode = 5
without patch:
Average Files/sec: 61523.0
411394 transactions (411394 requested), each up to 65536 blocks
with patch:
Average Files/sec: 66785.0
280367 transactions (280367 requested), each up to 65536 blocks
About 8.5% improvement, which is significant, and the number of
transactions is greatly reduced.

Test 7: sync_mode = 6
without patch:
Average Files/sec: 70129.0
189 transactions (189 requested), each up to 65536 blocks
with patch:
Average Files/sec: 69194.0
190 transactions (190 requested), each up to 65536 blocks
About a 1.3% performance regression; not significant.

From the above tests, we can see that in most cases async checkpoint
gives some performance improvement.

Signed-off-by: Xiaoguang Wang <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/super.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1cb6785..f53a64d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1123,6 +1123,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT_JOURNAL_CHECKSUM 0x800000 /* Journal checksums */
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
#define EXT4_MOUNT_WARN_ON_ERROR 0x2000000 /* Trigger WARN_ON on error */
+#define EXT4_MOUNT_JOURNAL_ASYNC_CHECKPOINT 0x4000000 /* Journal Async Checkpoint */
#define EXT4_MOUNT_DELALLOC 0x8000000 /* Delalloc support */
#define EXT4_MOUNT_DATA_ERR_ABORT 0x10000000 /* Abort on file data write */
#define EXT4_MOUNT_BLOCK_VALIDITY 0x20000000 /* Block validity checking */
@@ -1411,6 +1412,7 @@ struct ext4_sb_info {
struct mutex s_orphan_lock;
unsigned long s_ext4_flags; /* Ext4 superblock flags */
unsigned long s_commit_interval;
+ unsigned int s_async_checkponit_thresh;
u32 s_max_batch_time;
u32 s_min_batch_time;
struct block_device *journal_bdev;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605..ae21338 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -54,6 +54,7 @@
#include "acl.h"
#include "mballoc.h"
#include "fsmap.h"
+#include <linux/jbd2.h>

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>
@@ -1455,6 +1456,7 @@ enum {
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
+ Opt_async_checkpoint, Opt_noasync_checkpoint,
};

static const match_table_t tokens = {
@@ -1546,6 +1548,9 @@ enum {
{Opt_removed, "reservation"}, /* mount option from ext2/3 */
{Opt_removed, "noreservation"}, /* mount option from ext2/3 */
{Opt_removed, "journal=%u"}, /* mount option from ext2/3 */
+ {Opt_async_checkpoint, "async_checkpoint=%u"},
+ {Opt_async_checkpoint, "async_checkpoint"},
+ {Opt_noasync_checkpoint, "noasync_checkpoint"},
{Opt_err, NULL},
};

@@ -1751,6 +1756,9 @@ static int clear_qf_name(struct super_block *sb, int qtype)
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+ {Opt_async_checkpoint, 0, MOPT_GTE0},
+ {Opt_noasync_checkpoint, EXT4_MOUNT_JOURNAL_ASYNC_CHECKPOINT,
+ MOPT_CLEAR},
{Opt_err, 0, 0}
};

@@ -2016,6 +2024,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
sbi->s_mount_opt |= m->mount_opt;
} else if (token == Opt_data_err_ignore) {
sbi->s_mount_opt &= ~m->mount_opt;
+ } else if (token == Opt_async_checkpoint) {
+ set_opt(sb, JOURNAL_ASYNC_CHECKPOINT);
+ if (!args->from)
+ arg = JBD2_DEFAULT_ASYCN_CHECKPOINT_THRESH;
+ sbi->s_async_checkponit_thresh = arg;
} else {
if (!args->from)
arg = 1;
@@ -2234,6 +2247,11 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
SEQ_OPTS_PUTS("data_err=abort");
if (DUMMY_ENCRYPTION_ENABLED(sbi))
SEQ_OPTS_PUTS("test_dummy_encryption");
+ if (test_opt(sb, JOURNAL_ASYNC_CHECKPOINT) && (nodefs ||
+ (sbi->s_async_checkponit_thresh !=
+ JBD2_DEFAULT_ASYCN_CHECKPOINT_THRESH)))
+ SEQ_OPTS_PRINT("async_checkpoint=%u",
+ sbi->s_async_checkponit_thresh);

ext4_show_quota_options(seq, sb);
return 0;
@@ -4700,6 +4718,38 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
write_unlock(&journal->j_state_lock);
}

+static int ext4_init_journal_async_checkpoint(struct super_block *sb,
+ journal_t *journal)
+{
+ struct workqueue_struct *wq;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ wq = alloc_workqueue("jbd2-checkpoint-wq",
+ WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+ if (!wq) {
+ pr_err("%s: failed to create workqueue\n", __func__);
+ return -ENOMEM;
+ }
+ INIT_WORK(&journal->j_checkpoint_work, jbd2_log_do_checkpoint_async);
+
+ write_lock(&journal->j_state_lock);
+ journal->j_flags |= JBD2_ASYNC_CHECKPOINT;
+ journal->j_checkpoint_wq = wq;
+ journal->j_async_checkpoint_thresh =
+ sbi->s_async_checkponit_thresh;
+ journal->j_async_checkpoint_run = 0;
+ write_unlock(&journal->j_state_lock);
+ return 0;
+}
+
+static void ext4_destroy_journal_async_checkpoint(journal_t *journal)
+{
+ write_lock(&journal->j_state_lock);
+ journal->j_flags &= ~JBD2_ASYNC_CHECKPOINT;
+ write_unlock(&journal->j_state_lock);
+ jbd2_journal_destroy_async_checkpoint_wq(journal);
+}
+
static struct inode *ext4_get_journal_inode(struct super_block *sb,
unsigned int journal_inum)
{
@@ -4737,6 +4787,7 @@ static journal_t *ext4_get_journal(struct super_block *sb,
{
struct inode *journal_inode;
journal_t *journal;
+ int ret;

BUG_ON(!ext4_has_feature_journal(sb));

@@ -4752,6 +4803,11 @@ static journal_t *ext4_get_journal(struct super_block *sb,
}
journal->j_private = sb;
ext4_init_journal_params(sb, journal);
+ ret = ext4_init_journal_async_checkpoint(sb, journal);
+ if (ret) {
+ jbd2_journal_destroy(journal);
+ return NULL;
+ }
return journal;
}

@@ -4767,6 +4823,7 @@ static journal_t *ext4_get_dev_journal(struct super_block *sb,
unsigned long offset;
struct ext4_super_block *es;
struct block_device *bdev;
+ int ret;

BUG_ON(!ext4_has_feature_journal(sb));

@@ -4841,6 +4898,10 @@ static journal_t *ext4_get_dev_journal(struct super_block *sb,
}
EXT4_SB(sb)->journal_bdev = bdev;
ext4_init_journal_params(sb, journal);
+ ret = ext4_init_journal_async_checkpoint(sb, journal);
+ if (ret)
+ goto out_journal;
+
return journal;

out_journal:
@@ -5471,6 +5532,16 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
}
#endif

+ if ((old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_ASYNC_CHECKPOINT) &&
+ !test_opt(sb, JOURNAL_ASYNC_CHECKPOINT))
+ ext4_destroy_journal_async_checkpoint(sbi->s_journal);
+ else if (!(old_opts.s_mount_opt & EXT4_MOUNT_JOURNAL_ASYNC_CHECKPOINT) &&
+ test_opt(sb, JOURNAL_ASYNC_CHECKPOINT)) {
+ err = ext4_init_journal_async_checkpoint(sb, sbi->s_journal);
+ if (err)
+ goto restore_opts;
+ }
+
*flags = (*flags & ~SB_LAZYTIME) | (sb->s_flags & SB_LAZYTIME);
ext4_msg(sb, KERN_INFO, "re-mounted. Opts: %s", orig_data);
kfree(orig_data);
--
1.8.3.1

2019-09-22 19:24:32

by Xiaoguang Wang

Subject: [RFC 1/2] jbd2: checkpoint asynchronously when free journal space is lower than threshold

In the current jbd2 implementation, jbd2 won't reclaim journal space until
free journal space drops below the required amount; see the logic in
add_transaction_credits():

	write_lock(&journal->j_state_lock);
	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
		__jbd2_log_wait_for_space(journal);
	write_unlock(&journal->j_state_lock);

With this logic we can also end up with many transactions queued to be
checkpointed, which means those transactions still occupy jbd2 space.

Recently I have seen some disadvantages caused by this logic:
some of our applications periodically get stuck in the stack below:
__jbd2_log_wait_for_space+0xd5/0x200 [jbd2]
start_this_handle+0x31b/0x8f0 [jbd2]
jbd2__journal_start+0xcd/0x1f0 [jbd2]
__ext4_journal_start_sb+0x69/0xe0 [ext4]
ext4_dirty_inode+0x32/0x70 [ext4]
__mark_inode_dirty+0x15f/0x3a0
generic_update_time+0x87/0xe0
file_update_time+0xbd/0x120
__generic_file_aio_write+0x198/0x3e0
generic_file_aio_write+0x5d/0xc0
ext4_file_write+0xb5/0x460 [ext4]
do_sync_write+0x8d/0xd0
vfs_write+0xbd/0x1e0
SyS_write+0x7f/0xe0

Meanwhile, I found that I/O usage on these applications' machines is
relatively low; journal space behaves somewhat like a global lock. In the
high-concurrency case, if many tasks contend for journal credits, they
easily hit the above stack and get stuck waiting for free journal space,
so I wonder whether we can reclaim journal space asynchronously when free
space drops below a specified threshold, to avoid stalling all
applications at the same time.

This is even more useful on high-speed storage: journal space is reclaimed
quickly in the background, and applications are less likely to get stuck
on the above issue. To improve this case, we use a workqueue to queue a
work item that reclaims journal space in the background.
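
In essence (see the fs/jbd2/checkpoint.c hunk below), the queued work just
takes j_checkpoint_mutex and runs the normal checkpoint path; the only
difference from the synchronous case is that the async caller does not
kick a transaction commit early:

	void jbd2_log_do_checkpoint_async(struct work_struct *work)
	{
		journal_t *journal = container_of(work, journal_t,
						  j_checkpoint_work);

		mutex_lock_io(&journal->j_checkpoint_mutex);
		/* async == 1: low priority, don't start a commit early */
		jbd2_log_do_checkpoint(journal, 1);
		mutex_unlock(&journal->j_checkpoint_mutex);
		WRITE_ONCE(journal->j_async_checkpoint_run, 0);
	}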

See the performance improvements in the following patch:
ext4: add async_checkpoint mount option

Signed-off-by: Xiaoguang Wang <[email protected]>
---
fs/jbd2/checkpoint.c | 28 +++++++++++++++++++++++++---
fs/jbd2/journal.c | 15 +++++++++++++--
fs/jbd2/transaction.c | 16 ++++++++++++++++
include/linux/jbd2.h | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 101 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index a190906..fdaf87c 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -100,6 +100,21 @@ static int __try_to_free_cp_buf(struct journal_head *jh)
}

/*
+ * Do transaction checkpoint asynchronously.
+ */
+void jbd2_log_do_checkpoint_async(struct work_struct *work)
+{
+ journal_t *journal = container_of(work, journal_t, j_checkpoint_work);
+
+ mutex_lock_io(&journal->j_checkpoint_mutex);
+ jbd2_log_do_checkpoint(journal, 1);
+ mutex_unlock(&journal->j_checkpoint_mutex);
+ WRITE_ONCE(journal->j_async_checkpoint_run, 0);
+}
+EXPORT_SYMBOL(jbd2_log_do_checkpoint_async);
+
+
+/*
* __jbd2_log_wait_for_space: wait until there is space in the journal.
*
* Called under j-state_lock *only*. It will be unlocked if we have to wait
@@ -142,7 +157,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
spin_unlock(&journal->j_list_lock);
write_unlock(&journal->j_state_lock);
if (chkpt) {
- jbd2_log_do_checkpoint(journal);
+ jbd2_log_do_checkpoint(journal, 0);
} else if (jbd2_cleanup_journal_tail(journal) == 0) {
/* We were able to recover space; yay! */
;
@@ -201,7 +216,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
* The journal should be locked before calling this function.
* Called with j_checkpoint_mutex held.
*/
-int jbd2_log_do_checkpoint(journal_t *journal)
+int jbd2_log_do_checkpoint(journal_t *journal, int async)
{
struct journal_head *jh;
struct buffer_head *bh;
@@ -277,7 +292,14 @@ int jbd2_log_do_checkpoint(journal_t *journal)

if (batch_count)
__flush_batch(journal, &batch_count);
- jbd2_log_start_commit(journal, tid);
+
+ /*
+ * It's from async checkpoint routine, which means it's
+ * low priority, so here don't kick transaction commit
+ * early.
+ */
+ if (!async)
+ jbd2_log_start_commit(journal, tid);
/*
* jbd2_journal_commit_transaction() may want
* to take the checkpoint_mutex if JBD2_FLUSHED
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 43df0c9..4fd198254e 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1184,6 +1184,16 @@ static journal_t *journal_init_common(struct block_device *bdev,
return NULL;
}

+void jbd2_journal_destroy_async_checkpoint_wq(journal_t *journal)
+{
+ if (!journal->j_checkpoint_wq)
+ return;
+ flush_workqueue(journal->j_checkpoint_wq);
+ destroy_workqueue(journal->j_checkpoint_wq);
+ journal->j_checkpoint_wq = NULL;
+}
+EXPORT_SYMBOL(jbd2_journal_destroy_async_checkpoint_wq);
+
/* jbd2_journal_init_dev and jbd2_journal_init_inode:
*
* Create a journal structure assigned some fixed set of disk blocks to
@@ -1719,6 +1729,7 @@ int jbd2_journal_destroy(journal_t *journal)
if (journal->j_running_transaction)
jbd2_journal_commit_transaction(journal);

+ jbd2_journal_destroy_async_checkpoint_wq(journal);
/* Force any old transactions to disk */

/* Totally anal locking here... */
@@ -1726,7 +1737,7 @@ int jbd2_journal_destroy(journal_t *journal)
while (journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
mutex_lock_io(&journal->j_checkpoint_mutex);
- err = jbd2_log_do_checkpoint(journal);
+ err = jbd2_log_do_checkpoint(journal, 0);
mutex_unlock(&journal->j_checkpoint_mutex);
/*
* If checkpointing failed, just free the buffers to avoid
@@ -1990,7 +2001,7 @@ int jbd2_journal_flush(journal_t *journal)
while (!err && journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
mutex_lock_io(&journal->j_checkpoint_mutex);
- err = jbd2_log_do_checkpoint(journal);
+ err = jbd2_log_do_checkpoint(journal, 0);
mutex_unlock(&journal->j_checkpoint_mutex);
spin_lock(&journal->j_list_lock);
}
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 8ca4fdd..c5a50a9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -204,6 +204,7 @@ static int add_transaction_credits(journal_t *journal, int blocks,
transaction_t *t = journal->j_running_transaction;
int needed;
int total = blocks + rsv_blocks;
+ unsigned int async_ckpt_thresh;

/*
* If the current transaction is locked down for commit, wait
@@ -248,6 +249,21 @@ static int add_transaction_credits(journal_t *journal, int blocks,
}

/*
+ * When the percentage of free jounal space is lower than user specified
+ * threshold, start to do transaction checkpoint asynchronously.
+ */
+ if (journal->j_flags & JBD2_ASYNC_CHECKPOINT &&
+ READ_ONCE(journal->j_async_checkpoint_run) == 0) {
+ async_ckpt_thresh = journal->j_async_checkpoint_thresh *
+ journal->j_maxlen / 100;
+ if (jbd2_log_space_left(journal) < async_ckpt_thresh) {
+ journal->j_async_checkpoint_run = 1;
+ queue_work(journal->j_checkpoint_wq,
+ &journal->j_checkpoint_work);
+ }
+ }
+
+ /*
* The commit code assumes that it can get enough log space
* without forcing a checkpoint. This is *critical* for
* correctness: a checkpoint of a buffer which is also
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5c04181..e9b80fb 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -46,6 +46,11 @@
*/
#define JBD2_DEFAULT_MAX_COMMIT_AGE 5

+/*
+ * The default percentage threshold for async checkpoint.
+ */
+#define JBD2_DEFAULT_ASYCN_CHECKPOINT_THRESH 50
+
#ifdef CONFIG_JBD2_DEBUG
/*
* Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
@@ -814,6 +819,21 @@ struct journal_s
transaction_t *j_checkpoint_transactions;

/**
+ * @j_async_checkpoint_thresh
+ *
+ * When the percentage of free jounal space is lower than this value,
+ * start to do transaction checkpoint asynchronously.
+ */
+ int j_async_checkpoint_thresh;
+
+ /**
+ * @j_num_checkpoint_transactions:
+ *
+ * Number of transactions to be checkpointed.
+ */
+ int j_num_checkpoint_transactions;
+
+ /**
* @j_wait_transaction_locked:
*
* Wait queue for waiting for a locked transaction to start committing,
@@ -1136,6 +1156,27 @@ struct journal_s
*/
__u32 j_csum_seed;

+ /**
+ * @j_async_checkpoint_run:
+ *
+ * Is there a work running asynchronously to do transaction checkpoint.
+ */
+ int j_async_checkpoint_run;
+
+ /**
+ * @j_checkpoint_work:
+ *
+ * Work_struct to do transaction checkpoint.
+ */
+ struct work_struct j_checkpoint_work;
+
+ /**
+ * @j_checkpoint_wq:
+ *
+ * Workqueue for doing transaction checkpoint asynchronously.
+ */
+ struct workqueue_struct *j_checkpoint_wq;
+
#ifdef CONFIG_DEBUG_LOCK_ALLOC
/**
* @j_trans_commit_map:
@@ -1234,6 +1275,9 @@ struct journal_s
* mode */
#define JBD2_REC_ERR 0x080 /* The errno in the sb has been recorded */

+/* Do transaction checkpoint initiatively and asynchronously.*/
+#define JBD2_ASYNC_CHECKPOINT 0x100
+
/*
* Function declarations for the journaling transaction and buffer
* management
@@ -1366,6 +1410,7 @@ extern int jbd2_journal_invalidatepage(journal_t *,
extern void jbd2_journal_lock_updates (journal_t *);
extern void jbd2_journal_unlock_updates (journal_t *);

+extern void jbd2_journal_destroy_async_checkpoint_wq(journal_t *journal);
extern journal_t * jbd2_journal_init_dev(struct block_device *bdev,
struct block_device *fs_dev,
unsigned long long start, int len, int bsize);
@@ -1474,7 +1519,8 @@ extern void jbd2_journal_write_revoke_records(transaction_t *transaction,
int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
-int jbd2_log_do_checkpoint(journal_t *journal);
+int jbd2_log_do_checkpoint(journal_t *journal, int async);
+void jbd2_log_do_checkpoint_async(struct work_struct *work);
int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);

void __jbd2_log_wait_for_space(journal_t *journal);
--
1.8.3.1

2019-10-25 19:24:51

by Xiaoguang Wang

Subject: Re: [RFC 2/2] ext4: add async_checkpoint mount option

hi,

Any ideas about this patchset?
From our test results, it also shows some performance improvement.

Regards,
Xiaoguang Wang
> Now that jbd2 supports checkpointing transactions asynchronously and
> proactively when free journal space drops below a user-specified
> threshold, add a new mount option "async_checkpoint" for users to
> enable or disable this jbd2 feature.
> [...]